Polybase Query Service and Hadoop – Welcome SQL Server 2016

Introduction

One of the coolest features of SQL Server 2016 is Polybase. Already available for Parallel Data Warehouse, this functionality is now integrated in SQL Server 2016 and allows to combine relational and non-relational data, for example, query data in Hadoop and join it with relational data, import external data into SQL Server or export data from the server into Hadoop or Azure Blob Storage. This last case is especially interesting since is possible to transfer old transactions or historical data to a Hadoop file system and dramatically reduce the storage costs.

Setup Polybase

I installed the following components:

After install SQL Server enable TCP/IP connectivity:

enable tcp ip

Verify that the Polybase services are running:

polybase services

Create an external data source

Open a connection to the AdventureworksDW2016CTP3

Polybase connectivity configuration:

sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO

‘hadoop connectivity’ is the name of the configuration option. The @configvalue is the corresponding supported Hadoop data source. In my case I selected the 7 corresponding to Hortonworks 2.1, 2.2, and 2.3 on Windows Server. I am using my own Hadoop 2.7.1 and the Hortonworks version is HDP 2.3 and 2.4

More info here:

Hortonworks Products

Polybase Connectivity Configuration

Create external data source script:

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
)

HADOOP is the external data source type and the location is the NameNode URI. You will find this value in <your Hadoop directory>\etc\hadoop\core-site.xml

NameNode URI.jpg

Once the source is created you will find it under “External Data Sources” folder in Management Studio:

External data source.jpg

It is important to remark that the location is not validated when you create the external data source

Create a sample file for this example

Just for demo purposes, create a .csv file and populate it with a query from AdventureworksDW2016CTP3. This is just an example, you can create your own example and also change the file format in the next section accordingly.

Here my query:

SELECT TOP 1000
  [SalesOrderNumber]
 ,[SalesOrderLineNumber]
 ,p.EnglishProductName as ProductName
 ,st.SalesTerritoryCountry
 ,[OrderQuantity]
 ,[UnitPrice]
 ,[ExtendedAmount]
 ,[SalesAmount]
 ,convert(date,[OrderDate]) AS [OrderDate]
FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
inner join dbo.DimProduct p on a.ProductKey = p.ProductKey
inner join dbo.DimSalesTerritory st on st.SalesTerritoryKey = a.SalesTerritoryKey

I populated the csv file using Management Studio as follows:

Open the Export wizard: right click on the database name –> Tasks –> Export Data…

Export Data.jpg

Select a data source

select a data source.jpg

Choose a destination

Choose a destination.jpg

Specify a query to select the data to export

specify query

Source query

source query.jpg

Configure flat file destination

configure flat file destination.jpg

Save and run the package

save and run the package.jpg

Export done!

execution finished.jpg

Transfer the csv to HDFS

I created a directory called input in my Hadoop file system and store the csv file in c:\tmp

In case you haven’t done before, to create a directory in HDFS open a command prompt, go to your Hadoop directory and type:

<Your_hadoop-directoy>hadoop fs -mkdir /input

Here is my shell command to move the file from windows file system to HDFS:

<Your_hadoop-directoy>hadoop fs -copyFromLocal c:\tmp\AWExport.csv /input/

Set read and write permissions for other members of your group and others:

<Your_hadoop-directoy>hadoop fs -chmod 777 /input/AWExport.csv

List files in the input directory:

<Your_hadoop-directoy>hadoop fs -ls /input

hdfs commands.jpg

Create an external file format

To create a file format, in a query window in management studio copy and paste the following script:

CREATE EXTERNAL FILE FORMAT SalesExport WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
                FIELD_TERMINATOR =';',
                DATE_FORMAT = 'yyyy-MM-dd' ,
                USE_TYPE_DEFAULT = TRUE
                           )
)

SalesExport is just the name I gave.

The format type is delimited. There are some other types, more info here

The field terminator is the same I used when I exported the data to the flat file.

The date format corresponds also to the format in the flat file

Create an external table

This table references the file stored in HDFS (In my case AWExport.csv). The format corresponds to the structure of the file.

CREATE EXTERNAL TABLE SalesImportcsv
(
    SalesOrderNumber nvarchar(20)
   ,SalesOrderLineNumber tinyint
   ,ProductName nvarchar(50)
   ,SalesTerritoryCountry nvarchar(50)
   ,OrderQuantity smallint
   ,UnitPrice money
   ,ExtendedAmount money
   ,SalesAmount money
   ,OrderDate date
)
WITH
(
   LOCATION = '/input/AWExport.csv',
   DATA_SOURCE = HDP2,
   FILE_FORMAT = SalesExport,
   REJECT_TYPE = value,
   REJECT_VALUE=0
)

Location: location of the file in HDFS.

Data Source: the one created in a previous step.

File Format: also the one created in a previous step.

Reject type: the rejected value is a literal value and not a percentage (the other option is percentage).

Reject value: how many rows could fail. Fail means dirty records, in this context, when a value does not match the column definition.

MSDN Documentation

Query the external table

If everything works you should be able to see the external table in management studio. Then just right click and select the top 1000 records, for example:

select from external table.jpg

Further Topics

  • Insert records in an external table.
  • Configure an external source with credentials.
  • Build a SSIS package to import and export data from Hadoop.
  • View the execution plans of the Polybase queries

References

About Paul Hernandez

I'm an Electronic Engineer and Computer Science professional, specialized in Data Analysis and Business Intelligence Solutions. Also a father, swimmer and music lover.
This entry was posted in Big Data, Business Intelligence, Data Processing Engines, hadoop, SQL Server and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s