Export data to Hadoop using Polybase – Insert into external table

Introduction

This post is a continuation of Polybase Query Service and Hadoop – Welcome SQL Server 2016.

One of the most interesting use cases of Polybase is the ability to store historical data from relational databases in a Hadoop file system. Storage costs can be reduced while the data remains accessible and can still be joined with the regular relational tables. So let's take the first steps toward our new archiving solution.

Requirements

Create a folder for the pdw_user in Hadoop

Polybase uses the default user name pdw_user when connecting to the Hadoop cluster. For this example, I will use an unsecured Hadoop cluster, that is, one without Kerberos authentication. For production environments, a stronger security approach should be used.

Open a command line session with administrator rights and issue the following commands:

Create a directory for the pdw_user:

C:\>hadoop fs -mkdir /user/pdw_user

Change the ownership of the directory:

C:\>hadoop fs -chown -R pdw_user /user/pdw_user

Verify the results using the command line:

[Screenshot: command line showing the created pdw_user directory]
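For example, listing the /user directory prints an owner column for each entry, which should now read pdw_user for the new directory (assuming, as in the commands above, that the Hadoop binaries are on the PATH):

C:\>hadoop fs -ls /user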

Verify the results using the web browser:

In my case: http://localhost:50070/explorer.html#/user

[Screenshot: browsing the /user directory in the HDFS web interface]

You can name the directory whatever you want; the important part is to change its ownership to the pdw_user user.

Create an external data source and file format

Open a query window in Management Studio connected to the AdventureworksDW2016CTP3 database and run the following queries.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
);

CREATE EXTERNAL FILE FORMAT SalesExport WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
                    FIELD_TERMINATOR =';',
                    DATE_FORMAT = 'yyyy-MM-dd' ,
                    USE_TYPE_DEFAULT = TRUE
                           )
);
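Both objects can be confirmed through the Polybase catalog views, which is a quick sanity check before creating the external table:

SELECT name, location FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;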

Create an external table

CREATE EXTERNAL TABLE HistoricalSales
(
    SalesOrderNumber nvarchar(20)
       ,SalesOrderLineNumber tinyint
       ,ProductName nvarchar(50)
       ,SalesTerritoryCountry nvarchar(50)
       ,OrderQuantity smallint
       ,UnitPrice money
       ,ExtendedAmount money
       ,SalesAmount money
       ,OrderDate date
)
WITH
(
    LOCATION = '/user/pdw_user',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SalesExport,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

The key point here is the LOCATION. It must point to a directory, not to a specific file as in my previous post. If the location does not exist, it will be created.
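The new external table can be checked the same way; sys.external_tables lists Polybase tables together with their location, data source, and file format:

SELECT name, location, data_source_id, file_format_id
FROM sys.external_tables;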

Insert into external table

This example uses the Adventure Works DW database:

-- Enable INSERT into external table
sp_configure 'allow polybase export', 1;
RECONFIGURE;
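Calling sp_configure with only the option name echoes the setting back, so you can confirm that run_value is 1 before attempting the export:

sp_configure 'allow polybase export';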

-- Export data: Move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO [dbo].[HistoricalSales]
 SELECT 
       [SalesOrderNumber]
      ,[SalesOrderLineNumber]
      ,p.EnglishProductName as ProductName
      ,st.SalesTerritoryCountry
      ,[OrderQuantity]
      ,[UnitPrice]
      ,[ExtendedAmount]
      ,[SalesAmount]
      ,convert(date,[OrderDate]) AS [OrderDate]
  FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
  INNER JOIN dbo.DimProduct p ON a.ProductKey = p.ProductKey
  INNER JOIN dbo.DimSalesTerritory st ON st.SalesTerritoryKey = a.SalesTerritoryKey
  WHERE year(OrderDate) < 2011;

[Screenshot: INSERT statement executed in Management Studio]
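Since this is an archiving scenario, once the exported rows are verified in Hadoop they can be deleted from the source fact table. This is a minimal sketch, not part of the steps above; in production you would wrap it in a transaction and compare row counts first:

-- Remove the rows that were just exported to HDFS
DELETE FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales]
WHERE year(OrderDate) < 2011;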

Examining the results

Using the web browser:

[Screenshot: exported files under /user/pdw_user in the HDFS web interface]

The Polybase export operation creates multiple files under the specified location. The external files are named QueryID_date_time_ID.format, where ID is an incremental identifier and format is the exported data format.
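The same listing is available from the command line; each file in the output follows the naming pattern described above:

C:\>hadoop fs -ls /user/pdw_user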

Select the exported data from the external table:

[Screenshot: SELECT statement against the external table]
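Querying the external table requires no special syntax; Polybase reads the exported files transparently, so a plain SELECT is enough to verify the round trip:

SELECT TOP 10 *
FROM dbo.HistoricalSales
ORDER BY OrderDate;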

Conclusion

With this small tutorial I demonstrated how to use SQL Server 2016 and Hadoop to create a cost-effective and functional archiving solution. There are still several other aspects to explain and consider, but we can already start building our proofs of concept. Let's get started.

References

PolyBase Queries

Apache Hadoop File System Shell Commands

Acknowledgments

Special thanks to Sumin Mohanan and Sonya Marshall from Microsoft, who helped me troubleshoot my tests.


18 responses to “Export data to Hadoop using Polybase – Insert into external table”

  1. […] Paul Hernandez shows how to insert data into an external table using Polybase: […]

  2. When I tried to export data to Hadoop, I got the following error:
    Msg 596, Level 21, State 1, Line 52
    Cannot continue the execution because the session is in the kill state.
    Msg 0, Level 20, State 0, Line 52
    A severe error occurred on the current command. The results, if any, should be discarded.

    1. Hi Shady, could you please share your settings (Hadoop version, SQL version, OS, etc.)?

      1. Hadoop Hortonworks HDP-2.4.2.0 on Linux CentOS 6.5, SQL Server 2016 (13.0.1601.5) on Windows Server 2012 R2.

      2. Hi Shady, I happened to run into this error message myself today. I was not using Polybase; my database was simply offline due to a restore. In your case I would look in the Polybase logs (they are located in the SQL Server installation folder) and also in the Hadoop logs. Once you find the error message you can post it back. Good luck.

  3. Hi Paul,

    This and the previous post really helped me to setup SQL 2016 with HDP 2.4 on Azure.

    There were some Java settings in the PATH variable which should refer to jvm.dll, and the first line of the CSV file exported from SQL Server needed to be deleted.

    Apart from the above points everything else worked nicely following the post.

    Thanks again.

    Regards,
    Sendhil

  4. Can I use my own Hadoop (2.7.2) running on my local machine to connect with Polybase, instead of using HDP or CDH?

    1. Yes, even though it's not officially supported, I did it with my own built distribution. For the configuration, I would find out which HDP comes with Hadoop 2.7.2.

      1. For that, how can I map my Hadoop distribution in Polybase?

      2. Hi Karthik, when you configure the Polybase connectivity (as I explained here), you need to specify the Hadoop version: sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
        GO
        The parameter @configvalue determines the connectivity setting; please refer to this link for more info: PolyBase Connectivity Configuration.
        For my post I selected option 7, which corresponds to Hortonworks 2.1, 2.2, and 2.3 on Windows Server. In this link you can find which Hadoop version is included in HDPs 2.1 to 2.3: HDP-versions. You only need to know two things: your Hadoop version, 2.7.2, and your OS, Windows or Linux. I have not found an HDP with Hadoop 2.7.2; there is the 2.7.1 or the 2.7.3. But anyway, I would just use @configvalue = 7 and run a test; I think it should work.

      3. I have created the table, but while fetching data from it I am facing the below error:
        Msg 46916, Level 16, State 1, Line 2
        Queries that reference external tables are not supported by the legacy cardinality estimation framework. Ensure that trace flag 9481 is not enabled, the database compatibility level is at least 120 and the legacy cardinality estimator is not explicitly enabled through a database scoped configuration setting.

      4. How do I configure Polybase to use my own Hadoop distribution?

  5. Hi Paul,

    Can you please look at this issue.
    http://stackoverflow.com/questions/43624721/sql-server-2016-polybase-error-using-hadoop

    I seem to get this error:

    Msg 596, Level 21, State 1, Line 26
    Cannot continue the execution because the session is in the kill state.

    Msg 0, Level 20, State 0, Line 26
    A severe error occurred on the current command. The results, if any, should be discarded.

    Thanks,
    Bhavik

    1. Here is what I just posted there moments ago.

      I just spent 3 weeks with the SQL Product group to track down this issue.

      In my case, we had EVENT NOTIFICATIONS set up to notify of DDL changes. It is a bug in their code, and the fix will ship in a future CU. If you have event notifications, drop them and try again.

      Hope this helps.

  6. Hi Paul,

    I see the following post (Polybase Insert into External table failed – Path does not exist) created by you on the MSDN site:
    https://social.msdn.microsoft.com/Forums/en-US/4b807528-c181-4687-9009-ddead9b1f104/polybase-insert-into-external-table-failed-path-does-not-exist?forum=SQLServer2016Preview
    I went through the thread; did you find a solution for this issue?

    Thank you
    Vasanth Kumar

    1. Hi Vasanth; this was a helpful answer: “For export using polybase, please ensure that the specified path is a directory which exists or can be created”. But as far as I can remember, I just uninstalled the RC3 version, installed the RTM version of SQL Server 2016, and did not face this problem any longer. I hope that helps. Best, Paul

  7. Hi,
    I am using SQL Server 2017 and HDP 2.6 and have done the primary configuration setup to connect to Hadoop from SQL Server. But when I try to access the external table I get the below error.
    connection:
    CREATE EXTERNAL DATA SOURCE HDP2 WITH
    (
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox.hortonworks.com:8020'
    )

    CREATE EXTERNAL TABLE HistoricalSales
    (
    SID int,
    saledate datetime,
    DocumentType int,
    Category nvarchar,
    Description nvarchar,
    txroll_ParcelID int,
    FIPS int,
    Cad_countyname nvarchar,
    cad_State nvarchar,
    OCALUC int,
    psxr_CADAccountNumber int,
    txroll_StreetNumber int,
    txroll_StreetName nvarchar,
    txroll_PropCity nvarchar,
    txroll_PropZip int,
    GBA int,
    GrossSalePrice float,
    LandSize int,
    RecordingDate datetime,
    ReportNumber int
    )
    WITH
    (
    LOCATION = '/user/',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SALESPRICE,
    REJECT_TYPE = value,
    REJECT_VALUE=0

    )
    When I run the above code, I get the below error:
    EXTERNAL TABLE access failed due to internal error: 'Java exception raised on call to HdfsBridge_Connect. Java exception message:
    org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.viewfs.ViewFileSystem could not be instantiated: Error [org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.viewfs.ViewFileSystem could not be instantiated] occurred while accessing external file.'
    Can anyone help with this?
