
How To – Setup Azure Data Factory (ADF)


1. What is Azure Data Factory (ADF)?

Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data.


Source: azure.microsoft.com

2. ADF Architecture

Azure Data Factory has a few key entities that work together to define the input and output data, processing events, and the schedule and resources required to execute the desired data flow.

Figure 2. Relationships between Dataset, Activity, Pipeline, and Linked service (source: azure.microsoft.com)


2.1. Linked Service

A Linked Service defines the information ADF needs to connect to an external resource or data store; in short, it is the connection information. ADF accesses a data store via the information provided in the corresponding linked service.

For example, to access an Azure SQL Database, ADF requires a linked service of the AzureSqlDatabase type with a connection string, as below:

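A rough sketch of such a linked service definition in ADF’s JSON format (the server, database, user, and password values are placeholders to fill in):

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Encrypt=True;Connection Timeout=30"
        }
    }
}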

Data stores can be in the cloud, such as Azure Blob storage, Azure Table storage, Azure SQL Database, Azure DocumentDB, and Azure SQL Data Warehouse, or on-premises, such as a SQL Server database.

When creating a linked service for on-premises resources, a Data Gateway must be installed on the on-premises infrastructure. The gateway is an agent that enables hybrid data movement and processing.
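For illustration, a linked service to an on-premises SQL Server might look roughly like the sketch below, where the gateway is referenced by name; the server, database, credential, and gateway values are placeholders:

{
    "name": "OnPremSqlLinkedService",
    "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;",
            "gatewayName": "<gatewayname>"
        }
    }
}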

2.2. Dataset

A Dataset is a named reference/pointer to the data we want to use as an input or output of an Activity. Datasets can represent a variety of data structures depending on the data store, including tables, files, folders, and documents.

This is a SQL table dataset:
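A minimal sketch of such a dataset definition (the dataset, table, and linked service names here are illustrative):

{
    "name": "MySqlTableDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "MyTable"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}

A complete SQL table dataset (“GDPSQLTable”) is defined in section 3.3.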

2.3. Activity

An Activity defines the action to perform on the data. There are two kinds of activities: data movement (copy) and data transformation. For example, the Copy activity copies data from a source dataset to a sink dataset, while a Hive activity runs a Hive query on an Azure HDInsight cluster to transform or analyze the data.

Each activity takes zero or more datasets as inputs and produces one or more datasets as outputs; in other words, output datasets are mandatory but input datasets are not.

This is an example of an activity:
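As a rough sketch, a Copy activity inside a pipeline’s “activities” array might look like this (the activity and dataset names are illustrative):

{
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [ { "name": "InputBlobDataset" } ],
    "outputs": [ { "name": "OutputSqlDataset" } ],
    "typeProperties": {
        "source": { "type": "BlobSource" },
        "sink": { "type": "SqlSink" }
    }
}

The full pipeline in section 3.4 shows a complete Copy activity with its policy and scheduler settings.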

2.4. Pipeline

A Pipeline is a logical grouping of Activities; it groups activities into a unit that performs a task. Every activity must belong to a pipeline, and a pipeline must contain at least one activity. Logically related activities can be put together in one pipeline so that they can be activated or paused together.

An output dataset of an activity in a pipeline can be the input dataset of another activity in the same or a different pipeline; this is how dependencies among activities are defined.
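For example, a pipeline can chain two activities through a shared dataset, so that the second activity runs on the output of the first. In the sketch below all names, including the stored procedure activity, are illustrative:

{
    "name": "ExampleChainedPipeline",
    "properties": {
        "description": "StagingDataset is the output of the first activity and the input of the second",
        "activities": [
            {
                "name": "CopyRawToStaging",
                "type": "Copy",
                "inputs": [ { "name": "RawBlobDataset" } ],
                "outputs": [ { "name": "StagingDataset" } ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "SqlSink" }
                }
            },
            {
                "name": "ProcessStaging",
                "type": "SqlServerStoredProcedure",
                "inputs": [ { "name": "StagingDataset" } ],
                "outputs": [ { "name": "ProcessedDataset" } ],
                "typeProperties": {
                    "storedProcedureName": "dbo.ProcessStagingData"
                }
            }
        ]
    }
}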

3. Create a pipeline to load data from a CSV sample data file to an Azure SQL database

 

Typical steps to create a pipeline in ADF are:

  1. Create a data factory (if not created).
  2. Create a linked service for each data store or compute.
  3. Create input and output dataset(s).
  4. Create a pipeline with activities which operate on the datasets defined above.


3.1. Create a Data Factory

Follow https://azure.microsoft.com/en-us/pricing/free-trial/ to create a trial account with a subscription, then log in to https://portal.azure.com/.

  1. Create a resource group. I created the “TenPoint7” resource group.
  2. Create a new data factory. I used “TenPoint7-Data-Factory” for this example.

3.2. Create a linked service for each data store

  1. Create a storage account with the “Resource Manager” deployment model (instead of “Classic”) in the “TenPoint7” resource group to provide cloud storage for Azure blobs, Azure tables, Azure queues, and Azure files. After the storage account has been created successfully, browse to it and copy its access key for the next step.
  2. Create the linked service for Azure Storage, named “StorageLinkedService”, by clicking on “Author and Deploy”, then selecting “New data store”->“Azure storage”. Edit the name and fill in your storage account name and the key copied in the previous step (a sketch of the resulting JSON appears at the end of this section).
  3. Now create the Azure SQL Database server and database for the target/sink dataset, and note the Azure SQL Database credentials for later use. I created the “tenpoint7dbserver” server and the “TenPoint7DB” database.
    • After “tenpoint7dbserver” is created, set the firewall to allow the current client IP (the ADF IP) and your local IP to access this server, then click Save.
  4. Navigate back to the “Author and Deploy” function of ADF to create a new linked service for Azure SQL Database. I named it “AzureSqlLinkedService”.

Change the linked service name and the connection information to point to the Azure SQL server.
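A sketch of the resulting “StorageLinkedService” JSON (the account name and key values are placeholders):

{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
    }
}

The “AzureSqlLinkedService” created in step 4 follows the same pattern as the Azure SQL Database sketch in section 2.1.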

3.3. Create Input and Output Dataset(s)

  1. Prepare Source Data File
    • Upload the sample file Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv to the “gdpdata” container of the storage account created earlier.

  2. Create Input Dataset
    • Return to portal.azure.com, launch “Author and Deploy” for the “tenpoint7-data-factory” data factory, then click “New dataset” to create a new “Azure blob storage” dataset named “GDPBlobtable” referring to the gdpdata/Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv file:
{
    "name": "GDPBlobtable",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
            "fileName": "Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv",
            "folderPath": "gdpdata/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "nullValue": "",
                "quoteChar": "\""
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 12,
            "style": "StartOfInterval"
        },
        "external": true,
        "policy": {}
    }
}

    • The “external” option is set to “true” in this case because this dataset is not produced by any activity within the data factory.
  3. Create Output Dataset
    • Launch SQL Server Management Studio from your local machine and connect to the Azure SQL server with the credentials from step 3.2.
    • Run the SQL script (CreateTable.sql) to create the SQL tables and stored procedure on the “TenPoint7DB” database.
    • Back in portal.azure.com, launch “Author and Deploy” for the “tenpoint7-data-factory” data factory, then click “New dataset” to create a new “Azure SQL” dataset named “GDPSQLTable”:

{
    "name": "GDPSQLTable",
    "properties": {
        "published": false,
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": {
            "tableName": "sqlGDPTable"
        },
        "availability": {
            "frequency": "Hour",
            "interval": 12,
            "style": "StartOfInterval"
        }
    }
}

3.4. Create a pipeline with activities which operate on the datasets defined above

  • With this clean and simple data, the “Copy” activity is used; for data that requires processing, transformation activities such as Hive, Pig, or custom .NET activities should be used instead.
  • In the “Author and Deploy” function of the “tenpoint7-data-factory” data factory, click “More commands” and then “New pipeline” to create a new pipeline named “GDPPipeline” for the datasets from step 3.3:

{
    "name": "GDPPipeline",
    "properties": {
        "description": "Copy data from a blob to Azure SQL table",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource",
                        "treatEmptyAsNull": true,
                        "skipHeaderLineCount": 1
                    },
                    "sink": {
                        "type": "SqlSink",
                        "writeBatchSize": 10000,
                        "writeBatchTimeout": "60.00:00:00"
                    }
                },
                "inputs": [ { "name": "GDPBlobtable" } ],
                "outputs": [ { "name": "GDPSQLTable" } ],
                "policy": {
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst"
                },
                "scheduler": {
                    "frequency": "Hour",
                    "interval": 12,
                    "style": "StartOfInterval"
                },
                "name": "CopyFromBlobToSQL",
                "description": "Push GDP data from a blob file to Azure SQL database"
            }
        ],
        "start": "2015-12-28T00:00:00Z",
        "end": "2015-12-28T12:00:00Z",
        "isPaused": false,
        "hubName": "tenpoint7-data-factory_hub",
        "pipelineMode": "Scheduled"
    }
}

  • The input dataset should be the “GDPBlobtable” dataset referring to the source data file.
  • The output dataset should be the “GDPSQLTable” Azure SQL table dataset.
  • The start and end times in the pipeline definition set the active period of the pipeline.
  • The “scheduler” settings in the pipeline must match the “availability” settings of the related datasets.

 

4. Monitor the Pipeline

 

  • Launch the “Diagram” function of the data factory.
  • To monitor a dataset, double-click its name; a failed slice can be re-run from there.
  • To view the activity names and flow of a pipeline, right-click the pipeline and select “Open pipeline”.

 

5. Process Data into Dimension and Fact Tables for Visualization

 

Run “EXEC dbo.GDPDatawarehouseProcess” against the TenPoint7DB SQL database to process the data into dimension and fact tables. However, only the dbo.sqlGDPTable (denormalized table) is used for the report, for performance reasons.

6. Report Visualization

 

  • Create a new empty Excel worksheet and connect to Azure SQL Server


  • Select the “sqlGDPTable” table of the “TenPoint7DB” database, then click Finish.


  • Add a new sheet with a PivotTable and chart using Geography as the filter condition; changing the Geography filter value updates the data and the corresponding charts.


Linh Nguyen (Lead Data Consultant, linh@TenPoint7.com)


