Introduction
This post is a continuation of Polybase Query Service and Hadoop – Welcome SQL Server 2016
One of the most interesting use cases of PolyBase is the ability to store historical data from relational databases in a Hadoop file system. Storage costs can be reduced while the data remains accessible and can still be joined with the regular relational tables. So let's take the first steps toward our new archiving solution.
Requirements
- SQL Server 2016 Developer download (I suppose it works with higher editions as well)
- Hadoop 2.7.1 (I built my own sources and also successfully tested with HDP 2.4)
- AdventureWorks Sample Databases and Scripts for SQL Server 2016 CTP3
- Install and configure Polybase: Get started with Polybase
Create a folder for the pdw_user in Hadoop
PolyBase uses the default user name pdw_user when connecting to the Hadoop cluster. For this example I will use an unsecured Hadoop cluster, that is, one without Kerberos authentication. For production environments a stronger security approach should be used.
Open a command line session with administrator rights and issue the following commands:
Create a directory for the pdw_user:
C:\>hadoop fs -mkdir /user/pdw_user
Change the ownership of the directory:
C:\>hadoop fs -chown -R pdw_user /user/pdw_user
Verify the results using the command line:
C:\>hadoop fs -ls /user
Verify the results using the web browser:
In my case: http://localhost:50070/explorer.html#/user
You can name the directory whatever you want; what matters is changing its ownership to the pdw_user user.
Create an external data source and file format
Open a query window in Management Studio pointing to the AdventureworksDW2016CTP3 database and run the following queries.
CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
);

CREATE EXTERNAL FILE FORMAT SalesExport WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ';',
        DATE_FORMAT = 'yyyy-MM-dd',
        USE_TYPE_DEFAULT = TRUE
    )
);
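To confirm both objects were created, you can query the catalog views (a quick sanity check; the object names match the statements above):

```sql
-- Check the external data source just created
SELECT name, type_desc, location
FROM sys.external_data_sources
WHERE name = 'HDP2';

-- Check the external file format just created
SELECT name, format_type, field_terminator
FROM sys.external_file_formats
WHERE name = 'SalesExport';
```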
Create an external table
CREATE EXTERNAL TABLE HistoricalSales
(
    SalesOrderNumber nvarchar(20)
    ,SalesOrderLineNumber tinyint
    ,ProductName nvarchar(50)
    ,SalesTerritoryCountry nvarchar(50)
    ,OrderQuantity smallint
    ,UnitPrice money
    ,ExtendedAmount money
    ,SalesAmount money
    ,OrderDate date
)
WITH
(
    LOCATION = '/user/pdw_user',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SalesExport,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
The key point here is the location: it must point to a directory, not to a specific file as in my previous post. If the location does not exist, it will be created.
Insert into external table
This example uses the Adventure Works DW database:
-- Enable INSERT into external tables
sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Export data: move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO [dbo].[HistoricalSales]
SELECT [SalesOrderNumber]
    ,[SalesOrderLineNumber]
    ,p.EnglishProductName AS ProductName
    ,st.SalesTerritoryCountry
    ,[OrderQuantity]
    ,[UnitPrice]
    ,[ExtendedAmount]
    ,[SalesAmount]
    ,CONVERT(date, [OrderDate]) AS [OrderDate]
FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
INNER JOIN dbo.DimProduct p ON a.ProductKey = p.ProductKey
INNER JOIN dbo.DimSalesTerritory st ON st.SalesTerritoryKey = a.SalesTerritoryKey
WHERE YEAR(OrderDate) < 2011;
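A quick way to validate the export is to compare the row count of the external table with the source query (a sketch; it assumes the INSERT above completed without rejected rows):

```sql
-- Rows archived to Hadoop, read back through the external table
SELECT COUNT(*) AS ArchivedRows
FROM [dbo].[HistoricalSales];

-- Rows matching the archive filter in the source fact table
SELECT COUNT(*) AS SourceRows
FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]
WHERE YEAR(OrderDate) < 2011;
```

Both counts should match.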
Examining the results
Using the web browser:
The PolyBase export operation creates multiple files under the specified location. The external files are named QueryID_date_time_ID.format, where ID is an incremental identifier and format is the exported data format.
Select the exported data from the external table:
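For example (a minimal query against the external table defined above; the second statement joins back to DimProduct by product name purely to illustrate that external and regular relational tables can be mixed):

```sql
-- Read archived rows straight from Hadoop via the external table
SELECT TOP (10) *
FROM [dbo].[HistoricalSales]
ORDER BY OrderDate;

-- External data joined with a regular relational table
SELECT h.SalesOrderNumber, h.OrderDate, h.SalesAmount, p.Color
FROM [dbo].[HistoricalSales] h
INNER JOIN dbo.DimProduct p ON p.EnglishProductName = h.ProductName;
```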
Conclusion
With this small tutorial I demonstrated how to use SQL Server 2016 and Hadoop to create a cost-effective and functional archiving solution. There are still several other aspects to explain and consider, but we can now start building our proofs of concept. Let's get started.
References
Apache Hadoop File System Shell Commands
Acknowledgments
Special thanks to Sumin Mohanan and Sonya Marshall from Microsoft for helping me troubleshoot my tests.