Data Profiling Using SSIS
What is data Profiling ?
Data profiling is the
process of examining the data available
in an existing data source
(e.g. a database or a file) and collecting statistics and information about
that data. The purpose of these statistics may be to: Find out whether
existing data can easily be
used for other purposes.
Data Profiling Using SSIS to save the Column Statistics Profile data to database
Implementation
Step 1: Create a package and add ADO.NET connection
with a database for data profiling.
Step 2: Create variables as shown in the picture below
DataProfileIn & DataProfileOut are the variables for data profiling task to get the data and pass the data to return variable
Remaining variables are to mapping the output of the column
statistic profile to database table
Step 3:
Add Data Profiling Task to the package and open task editor.
Step 4: Edit the data profiling task like
Configure
destination variable in the Destination property.
Step 5:
Click Quick Profile button.
In place of Table or View use single table or select the all tables (Entire Database tables)
Now we are loading data for Column Statistic Profile so
select that checkbox only and click ok.
Configure the Profile Requests like above picture
Step 6: Use Script Task and edit it like
Configure the read only and read write variables with
created user variables like above picture and click edit Script
Add the reference from the below path
C:\Program Files\Microsoft SQL
Server\110\DTS\Binn\Microsoft.SqlServer.DataProfiler.dll
#region
Namespaces
using
System;
using
System.Data;
using
Microsoft.SqlServer.Dts.Runtime;
using
System.Windows.Forms;
using
System.Xml;
using
Microsoft.DataDebugger.DataProfiling;
#endregion
Write the below script
/// <summary>
/// This method is called when this script task executes in
the control flow.
/// To open Help, press F1.
/// </summary>
public void Main()
{
// TODO: Add your code here
//Dts.TaskResult = (int)ScriptResults.Success;
Dts.TaskResult = (int)ScriptResults.Success;
// Retrieve the profile from the package variable
string dataProfleXML = Dts.Variables["User::DataProfileIn"].Value.ToString();
// Deserialize
DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();
DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));
//CReate one Temp Table
DataTable dt = new
DataTable();
dt.Columns.Add("TableName", typeof(string));
dt.Columns.Add("ColumnName", typeof(string));
dt.Columns.Add("MinimumValue", typeof(string));
dt.Columns.Add("MaximumValue", typeof(string));
dt.Columns.Add("Mean", typeof(string));
dt.Columns.Add("StandardDeviation", typeof(string));
// Cycle through the profiles to find the one we're looking
for
foreach (Profile p
in profile.DataProfileOutput.Profiles)
{
if (p is ColumnStatisticsProfile)
{
// Read the ColumnStatisticsProfile class
ColumnStatisticsProfile csp = p as ColumnStatisticsProfile;
// Add that data to the Data Table
dt.Rows.Add(csp.Table.Table, csp.Column.Name, csp.MinValue,
csp.MaxValue, csp.Mean, csp.StdDev);
}
}
//After filling the data to datatable and pass the
datatable to variable
Dts.Variables["User::DataProfileOut"].Value = dt;
}
Save and close the script task and click ok
Step 7: Use Foreach loop container and configure the task as
below picture
Click the Collection
tab and configure the properties
Enumerator : Foreach
ADO Enumerator
ADO object source
variable : User::DataProfileOut(variable)
Select the Variable Mappings and configure the variables
like above picture
Click ok
Use Execute SQL Task and place inside the For Each loop
Step 8: Edit the Execute SQL Task and configure as below
Connection Type:
- OLEDB
Connection: - Use
your destination Data base
SQL Statement is
IF OBJECT_ID('dbo.DataProfilingLevel2Temp', 'U') IS NULL
CREATE TABLE dbo.DataProfilingLevel2Temp
(Sno int NOT NULL IDENTITY ,
DatabaseName varchar(100) NULL,
TableName varchar(100) NULL,
ColumnName varchar(100) NULL,
MinValue varchar(100) NULL,
MaxValue varchar(100) NULL,
Mean varchar(100) NULL,
StdDev varchar(100) NULL,
)
INSERT INTO dbo.DataProfilingLevel2Temp(DatabaseName,TableName,ColumnName,MinValue,MaxValue,Mean,StdDev)
VALUES ('DataHub',?,?,?,?,?,?)
-> For Parameter mapping use the below picture
Click ok and execute the package
The output of the data will see in the table
SELECT * FROM
dbo.DataProfilingLevel2Temp dplt WITH(NOLOCK)
Thanks for reading.