Friday, March 4, 2016

Data Profiling Using SSIS to load data to Database

Data Profiling Using SSIS

What is data Profiling ?
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to: Find out whether existing data can easily be used for other purposes.

Data Profiling Using SSIS to save the Column Statistics Profile data to database

Implementation
Step 1: Create a package and add ADO.NET connection with a database for data profiling.

Step 2: Create variables as shown in the picture below












DataProfileIn & DataProfileOut are the variables for data profiling task to get the data and pass the data to return variable

Remaining variables are to mapping the output of the column statistic profile to database table

Step 3: Add Data Profiling Task to the package and open task editor.
Step 4: Edit the data profiling task like

Configure destination variable in the Destination property.
Step 5: Click Quick Profile button.

In place of Table or View use single table or select the all tables (Entire Database tables)
Now we are loading data for Column Statistic Profile so select that checkbox only and click ok.

Configure the Profile Requests like above picture


Step 6: Use Script Task and edit it like


Configure the read only and read write variables with created user variables like above picture and click edit Script
Add the reference from the below path
C:\Program Files\Microsoft SQL Server\110\DTS\Binn\Microsoft.SqlServer.DataProfiler.dll
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.Xml;
using Microsoft.DataDebugger.DataProfiling;
#endregion

Write the below script
/// <summary>
        /// This method is called when this script task executes in the control flow.
        /// To open Help, press F1.
        /// </summary>
        public void Main()
        {
            // TODO: Add your code here

            //Dts.TaskResult = (int)ScriptResults.Success;

            Dts.TaskResult = (int)ScriptResults.Success;

            // Retrieve the profile from the package variable
            string dataProfleXML = Dts.Variables["User::DataProfileIn"].Value.ToString();

            // Deserialize
            DataProfileXmlSerializer serializer = new DataProfileXmlSerializer();
            DataProfile profile = serializer.Deserialize(new System.IO.StringReader(dataProfleXML));

            //CReate one Temp Table
            DataTable dt = new DataTable();
            dt.Columns.Add("TableName", typeof(string));
            dt.Columns.Add("ColumnName", typeof(string));
            dt.Columns.Add("MinimumValue", typeof(string));
            dt.Columns.Add("MaximumValue", typeof(string));
            dt.Columns.Add("Mean", typeof(string));
            dt.Columns.Add("StandardDeviation", typeof(string));

            // Cycle through the profiles to find the one we're looking for
            foreach (Profile p in profile.DataProfileOutput.Profiles)
            {
                if (p is ColumnStatisticsProfile)
                {
                    // Read the ColumnStatisticsProfile class
                    ColumnStatisticsProfile csp = p as ColumnStatisticsProfile;
                    // Add that data to the Data Table
                    dt.Rows.Add(csp.Table.Table, csp.Column.Name, csp.MinValue, csp.MaxValue, csp.Mean, csp.StdDev);

                }
            }
            //After filling the data to datatable and pass the datatable to variable
            Dts.Variables["User::DataProfileOut"].Value = dt;
        }

Save and close the script task and click ok
Step 7: Use Foreach loop container and configure the task as below picture



Click the Collection tab and configure the properties
Enumerator : Foreach ADO Enumerator
ADO object source variable : User::DataProfileOut(variable)

Select the Variable Mappings and configure the variables like above picture
Click ok
Use Execute SQL Task and place inside the For Each loop

Step 8: Edit the Execute SQL Task and configure as below

Connection Type: - OLEDB
Connection: - Use your destination Data base
SQL Statement is
IF OBJECT_ID('dbo.DataProfilingLevel2Temp', 'U') IS NULL
CREATE TABLE dbo.DataProfilingLevel2Temp
(Sno int NOT NULL IDENTITY ,
DatabaseName varchar(100) NULL,
TableName varchar(100) NULL,
ColumnName varchar(100) NULL,
MinValue varchar(100) NULL,
MaxValue varchar(100) NULL,
Mean varchar(100) NULL,
StdDev varchar(100) NULL,
)
INSERT INTO dbo.DataProfilingLevel2Temp(DatabaseName,TableName,ColumnName,MinValue,MaxValue,Mean,StdDev)
VALUES ('DataHub',?,?,?,?,?,?)
-> For Parameter mapping use the below picture

Click ok and execute the package

The output of the data will see in the table

SELECT * FROM dbo.DataProfilingLevel2Temp dplt WITH(NOLOCK)
Thanks for reading.


Read More »