01 Feb How To – Setup Azure Data Factory (ADF)

Posted at 23:43h in How To by TenPoint7

0 Likes

1. What is Azure Data Factory (ADF)

Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data.

Source: azure.microsoft.com

2. ADF Architecture

Azure Data Factory has a few key entities that work together to define the input and output data, processing events, and the schedule and resources required to execute the desired data flow.

Relationships between Dataset, Activity, Pipeline, and Linked service. Source: azure.microsoft.com

2.1. Linked Service

Linked Service defines the information needed for ADF to connect to external resources/data stores, it can simply be called connection information. ADF accesses data stores via information provided in the correspondent linked service.

For example, to access an Azure SQL Database, ADF requires a linked service with AzureSqlDatabase type and connection string as below:

fig3

Data stores can be on the cloud such as Azure Blob, Azure table, Azure SQL Database, Azure DocumentDB, Azure SQL Data Warehouse or on-premise such as an SQL database.

When creating a Linked Service for on-premise resources, Data Gateway is required to be implemented on the on-premise infrastructure. The gateway is an agent that enables hybrid data movement and processing.

2.2. Dataset

A Dataset is a named reference/pointer to the data we want to use as an input or output of an Activity, they have a variety of data structures depending on the data store including tables, files, folders, and documents.

This is a SQL table dataset.

2.3. Activity

An Activity defines the actions to perform on the data, there are 2 kinds of actions: copy and transformation, for example, the Copy activity copies data from one source dataset to a sink dataset, a Hive activity runs a Hive query on Azure HDInsight cluster to transform or analyze the data.

Each activity takes zero or more datasets as inputs and produces one or more datasets as outputs, which means that the output datasets are mandatory but the input datasets are not.

This is an example of activity.

2.4. Pipeline

Pipeline is a logical grouping of Activities. It is used to group activities into a unit that performs a task. Activities have to belong to a pipeline and a pipeline must have at least one activity. We can put logically related activities together as one pipeline such that they can be active or paused together

An output dataset from an activity in a pipeline can be the input dataset to another activity in the same/different pipeline by defining dependencies among activities.

3. Create a pipeline to load data from a csv sample data file to an Azure SQL database

Typical steps to create a pipeline in ADF are:

Create a data factory (if not created).
Create a linked service for each data store or compute.
Create input and output dataset(s).
Create a pipeline with activities which operate on the datasets defined above.

fig6

3.1. Create a Data Factory

Follow https://azure.microsoft.com/en-us/pricing/free-trial/ to create a trial account with subscription then login to the https://portal.azure.com/

Create a resource group, I created “TenPoint7” resource group
Create a new data factory, I used the “TenPoint7-Data-Factory” for this example

3.2. Create a linked service for each data store

Create a storage account with “Resource Manager” instead of “Classic” deployment mode and “TenPoint7” Resource Group to provide cloud storage for Azure blobs, Azure tables, Azure queues, and Azure files. After the storage account has been created successfully, browse it to copy its key for the next step.
Create the linked service for Azure Storage named “StorageLinkedService” by clicking on “Author and Deploy”, then select “New data store”->”Azure storage”

Edit the preferred name and correct your storage account name with its key from the previous step.
Now we will create Azure SQL database server and database for the target/sink storage/dataset, remember the Azure SQL Database credentials for accessing later. I created the “tenpoint7dbserver” Server and “TenPoint7DB” Database Name.
- After the “tenpoint7dbserver” is created, set the firewall to allow the current client IP (ADF IP) and the local IP to access this server, click Save when you are done.
Navigate back to the “Author and Deploy” function of ADF to create a new Linked Service for AzureSQLdatabase. I named it “AzuresqlLinkedService”

fig16

fig17

Change the Linked Service name and the connection information to the Azure SQL server

fig18

3.3. Create Input and Output Dataset(s)

Prepare Source Data File
- Download the ”Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv” file from https://datamarket.azure.com/dataset/explore/f5fbb235-9cf1-4720-82d1-833888d0bdea
- Download and install the Azure Storage Explorer tool from https://azurestorageexplorer.codeplex.com/
- Launch the Azure Storage Explorer and click “Add Account” button to add the Azure Storage Account (created at step 2.2.a), then create “gdpdata” folder and upload the csv file to this folder.
Create Input Dataset
- Return back to portal.azure.com, launch the “Author and Deploy” of the “tenpoint7-data-factory” Data Factory then click on “New dataset” to create a new “Azure blob storage” dataset named “GDPBlobtable” referring to gdpdata/Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv file.
  {
  
  “name”: “GDPBlobtable“,
  “properties”: {
  
  “published”: false,
  “type”: “AzureBlob”,
  “linkedServiceName”: “StorageLinkedService“,
  “typeProperties”: {
  
  “fileName”: “Current_Dollars_GDP_All_Industries_Per_State_1997_To_2011.csv“,
  “folderPath”: “gdpdata/“,
  “format”: {
  
  “type”: “TextFormat”,
  “columnDelimiter”: “,”,
  “nullValue”: “”,
  “quoteChar”: “\””
  
  }
  
  },
  
  “availability”: {
  
  “frequency”: “Hour”,
  “interval”: 12,
  “style”: “StartOfInterval”
  
  },
  
  “external”: true,
  “policy”: {}
  
  }
  
  }
  - The “external” option should be “true” in this case because this external table is not generated by any activity in the data factory
Create Output Dataset
- Launch the MS SQL
  viagra soft tabs canada
  
  Server Management Studio from your local machine to connect to Azure SQL Server with credentials of step 3.2.4
- Run the SQL script (CreateTable.sql file) to create SQL tables and stored procedure on “TenPoint7DB” database
- Back to the portal.azure.com, launch the “Author and Deploy” of the “tenpoint7-data-factory” Data Factory then click on “New dataset” to create a new “Azure SQL” dataset named “GDPSQLtable”

fig24 {

“name”: “GDPSQLTable“,

“properties”: {

“published”: false,

“type”: “AzureSqlTable“,

“linkedServiceName”: “AzureSqlLinkedService“,

“typeProperties”: {

“tableName”: “sqlGDPTable“

“availability”: {

“frequency”: “Hour”,

“interval”: 12,

“style”: “StartOfInterval”

}

3.4. Create a pipeline with activities which operate on the datasets defined above

With this clean and simple data, the “Copy” action is used, however, with data required to be process, the “Transformation” actions should be used such as Hive, Pig, .NETcustomized actions.
In the “Author and Deploy” function of “tenpoint7-data-factory” data factory, click “More commands” then click “New pipeline” to create a new pipleline name “GDPPipeline” for the datasets of step 3.3

fig25 {

“name”: “GDPPipeline”,

“properties”: {

“description”: “Copy data from a blob to Azure SQL table”,

“activities”: [

{

“type”: “Copy”,

“typeProperties”: {

“source”: {

“type”: “BlobSource”,

“treatEmptyAsNull”: true,

“skipHeaderLineCount”: 1

“sink”: {

“type”: “SqlSink”,

“writeBatchSize”: 10000,

“writeBatchTimeout”: “60.00:00:00”

}

“inputs”: [{“name”: “GDPBlobtable“} ],

“outputs”: [ {“name”: “GDPSQLTable” } ],

“policy”: {

“timeout”: “01:00:00”,

“concurrency”: 1,

“executionPriorityOrder”: “NewestFirst”

“scheduler”: {

“frequency”: “Hour”,

“interval”: 12,

“style”: “StartOfInterval”

“name”: “CopyFromBlobToSQL”,

“description”: “Push GDP data from HDFS file to Azure SQL database”

}

“start”: “2015-12-28T00:00:00Z“,

“end”: “2015-12-28T12:00:00Z“,

“isPaused”: false,

“hubName”: “tenpoint7-data-factory_hub”,

“pipelineMode”: “Scheduled”

}

The input dataset should be “GDPBlobtable” dataset referring to the source data file
The output dataset should be “GDPSQLtable” dataset of Azure SQL table
The start and end slice time in the pipeline is the active period of the pipeline
The setting value of “Schedule” in the pipeline must match with the “Availability” in relating datasets

4. Monitor the Pipeline

Launch the “Diagram” function of the ADF

fig26 fig27

To monitor a data set, double click on each dataset name, we can re-run it if the previous run failed
To view activities names and flows of a pipeline, right click on the pipeline then select “open pipeline”.

5. Process Data into Dimension and Fact Tables for Visualization

Run “EXEC dbo.GDPDatawarehouseProcess” against the Tenpoint7DB SQL Database to process data into dimension and fact tables. However, only the dbo.sqlGDPTable (denormalization table) is used for the report because of the performance purpose.

6. Report Visualization

Create a new empty Excel worksheet and connect to Azure SQL Server

fig28

fig29

Select “sqlGDPTable” of “TenPoint7DB” Database then click Finish.

fig30

Add a new sheet with Pivot table and chart for Geography filter condition, we can change the Geofilter value to observe the change of data and respective charts.

fig31

Linh Nguyen (Lead Data Consultant, linh@TenPoint7.com)

How To – Setup Azure Data Factory (ADF)