Introduction
One of the coolest features of SQL Server 2016 is Polybase. Already available for Parallel Data Warehouse, this functionality is now integrated into SQL Server 2016 and allows you to combine relational and non-relational data: for example, you can query data in Hadoop and join it with relational data, import external data into SQL Server, or export data from the server into Hadoop or Azure Blob Storage. This last case is especially interesting since it is possible to transfer old transactions or historical data to a Hadoop file system and dramatically reduce storage costs.
Set up Polybase
I installed the following components:
- Windows Server 2012 R2
- SQL Server 2016 RC3 download
- Hadoop 2.7.1 (I built it from source myself: here is how)
- SQL Server Management Studio 2016 download
- AdventureWorks sample databases download + install instructions
After installing SQL Server, enable TCP/IP connectivity:
Verify that the Polybase services are running:
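You can also confirm from a query window that the Polybase feature is installed; SERVERPROPERTY returns 1 when it is:
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;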
Create an external data source
Open a connection to the AdventureworksDW2016CTP3 database.
Polybase connectivity configuration:
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO
'hadoop connectivity' is the name of the configuration option, and @configvalue identifies the supported Hadoop data source. In my case I selected 7, which corresponds to Hortonworks 2.1, 2.2, and 2.3 on Windows Server. I am using my own Hadoop 2.7.1 build, which matches Hortonworks HDP 2.3 and 2.4. Note that the new value only takes effect after you restart SQL Server (which also restarts the Polybase services).
More info here:
Polybase Connectivity Configuration
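To verify that the option is active, run sp_configure with only the option name and check the run_value column after the restart:
EXEC sp_configure @configname = 'hadoop connectivity';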
Create external data source script:
CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
);
HADOOP is the external data source type, and the location is the NameNode URI. You will find this value in <your Hadoop directory>\etc\hadoop\core-site.xml (in Hadoop 2.x it is the fs.defaultFS property).
Once the source is created you will find it under the “External Data Sources” folder in Management Studio:
It is important to note that the location is not validated when you create the external data source.
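Since the location is not validated, it is worth double-checking what was stored. The sys.external_data_sources catalog view lists every external data source together with its location:
SELECT name, type_desc, location FROM sys.external_data_sources;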
Create a sample file for this example
Just for demo purposes, create a .csv file and populate it with a query against AdventureworksDW2016CTP3. This is just an example; you can create your own file and change the file format in the next section accordingly.
Here is my query:
SELECT TOP 1000
     [SalesOrderNumber]
    ,[SalesOrderLineNumber]
    ,p.EnglishProductName AS ProductName
    ,st.SalesTerritoryCountry
    ,[OrderQuantity]
    ,[UnitPrice]
    ,[ExtendedAmount]
    ,[SalesAmount]
    ,CONVERT(date, [OrderDate]) AS [OrderDate]
FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
INNER JOIN dbo.DimProduct p ON a.ProductKey = p.ProductKey
INNER JOIN dbo.DimSalesTerritory st ON st.SalesTerritoryKey = a.SalesTerritoryKey
I populated the csv file using Management Studio as follows:
Open the Export wizard: right-click on the database name -> Tasks -> Export Data…
Select a data source
Choose a destination
Specify a query to select the data to export
Source query
Configure flat file destination
Save and run the package
Export done!
Transfer the csv to HDFS
I created a directory called input in my Hadoop file system and stored the csv file in c:\tmp
In case you haven't done this before: to create a directory in HDFS, open a command prompt, go to your Hadoop directory, and type:
<Your_hadoop_directory>hadoop fs -mkdir /input
Here is my shell command to move the file from the Windows file system to HDFS:
<Your_hadoop_directory>hadoop fs -copyFromLocal c:\tmp\AWExport.csv /input/
Set read, write, and execute permissions for your group and others:
<Your_hadoop_directory>hadoop fs -chmod 777 /input/AWExport.csv
List the files in the input directory:
<Your_hadoop_directory>hadoop fs -ls /input
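Optionally, print the file contents to confirm the upload (same command convention as above):
<Your_hadoop_directory>hadoop fs -cat /input/AWExport.csv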
Create an external file format
To create a file format, copy and paste the following script into a query window in Management Studio:
CREATE EXTERNAL FILE FORMAT SalesExport WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ';',
        DATE_FORMAT = 'yyyy-MM-dd',
        USE_TYPE_DEFAULT = TRUE
    )
);
SalesExport is just the name I gave it.
The format type is delimited text. There are some other types; more info here.
The field terminator is the same one I used when I exported the data to the flat file.
The date format also corresponds to the format in the flat file.
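As with data sources, the new format shows up in the catalog and can be inspected through the sys.external_file_formats catalog view:
SELECT name, format_type, field_terminator, date_format FROM sys.external_file_formats;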
Create an external table
This table references the file stored in HDFS (in my case AWExport.csv). The column definitions correspond to the structure of the file.
CREATE EXTERNAL TABLE SalesImportcsv
(
     SalesOrderNumber nvarchar(20)
    ,SalesOrderLineNumber tinyint
    ,ProductName nvarchar(50)
    ,SalesTerritoryCountry nvarchar(50)
    ,OrderQuantity smallint
    ,UnitPrice money
    ,ExtendedAmount money
    ,SalesAmount money
    ,OrderDate date
)
WITH
(
    LOCATION = '/input/AWExport.csv',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SalesExport,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
Location: the path to the file in HDFS.
Data Source: the external data source created in a previous step.
File Format: the external file format created in a previous step.
Reject type: the reject value is interpreted as a literal number of rows, not as a percentage (the other option is PERCENTAGE).
Reject value: how many rows are allowed to fail. A row fails when it is dirty, meaning in this context that a value does not match the column definition. With REJECT_VALUE = 0, the query fails on the first dirty row.
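Once created, external tables are listed in the sys.external_tables catalog view, so you can double-check the options you set:
SELECT name, location, reject_type, reject_value FROM sys.external_tables;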
Query the external table
If everything works, you should be able to see the external table in Management Studio. Then just right-click it and select the top 1000 records, for example:
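You can of course also query it like any other table and, coming back to the scenario from the introduction, join the HDFS data with relational data. Here is a minimal sketch, assuming the DimDate table from the AdventureworksDW sample (FullDateAlternateKey holds the calendar date):

SELECT TOP 1000 * FROM dbo.SalesImportcsv;

-- Join the external (HDFS) data with a relational dimension table
SELECT d.CalendarYear, SUM(s.SalesAmount) AS TotalSales
FROM dbo.SalesImportcsv s
INNER JOIN dbo.DimDate d ON d.FullDateAlternateKey = s.OrderDate
GROUP BY d.CalendarYear
ORDER BY d.CalendarYear;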
Further Topics
- Insert records in an external table.
- Configure an external source with credentials.
- Build a SSIS package to import and export data from Hadoop.
- View the execution plans of the Polybase queries.