Installing Apache Zeppelin 0.7.3 in HDP 2.5.3 with Spark and Spark2 Interpreters

Background

For a recent client requirement I needed to propose a solution for adding spark2 as an interpreter to Zeppelin in HDP (Hortonworks Data Platform) 2.5.3.

The first hurdle: HDP 2.5.3 ships with Zeppelin 0.6.0, which does not support spark2 (included in that HDP release only as a technical preview). Upgrading the HDP version was not an option due to the effort and platform availability requirements. In the end I found a solution in the HCC (Hortonworks Community Connection) that involves installing a standalone Zeppelin which does not affect the Ambari-managed Zeppelin delivered with HDP 2.5.3.
I want to share how I did it with you.

Preliminary steps

Stop the current Zeppelin (version 0.6.0, which ships with HDP 2.5.3):

su zeppelin
/usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh stop

Deactivate the script that starts this version on every system reboot.
Zeppelin is started as an Ambari dependency in the script:

/usr/lib/hue/tools/start_deps.mf

To avoid modifying this file, a custom init script could be created to stop the default HDP Zeppelin and start the newer version.
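A minimal sketch of what such an init script could look like, using the paths from this post (the script name and init mechanism are assumptions; register it with chkconfig, systemd, or your preferred method):

#!/bin/bash
# /etc/init.d/zeppelin-standalone - illustrative sketch only
# Stops the Ambari-managed Zeppelin 0.6.0 and controls the standalone 0.7.3 instance.
case "$1" in
  start)
    su - zeppelin -c "/usr/hdp/current/zeppelin-server/bin/zeppelin-daemon.sh stop"
    su - zeppelin -c "/opt/zeppelin/bin/zeppelin-daemon.sh start"
    ;;
  stop)
    su - zeppelin -c "/opt/zeppelin/bin/zeppelin-daemon.sh stop"
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac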

Apache Zeppelin Installation

Download Zeppelin: https://zeppelin.apache.org/download.html
Copy the .tar file to the /tmp directory using WinSCP
Extract the .tar file into the target directory, e.g. /opt

tar -xvf zeppelin-0.7.3-bin-all.tar -C /opt

Create a symlink to the latest version (optional)

cd /opt
sudo ln -s zeppelin-0.7.3-bin-all/ zeppelin

Change the ownership of the folder

chown -R zeppelin:zeppelin /opt/zeppelin

Zeppelin Configuration

First, copy the "conf" directory from the existing Zeppelin installation to the new version:

yes | sudo cp -rf /usr/hdp/current/zeppelin-server/conf/ /opt/zeppelin

To configure Zeppelin to work with both the spark and spark2 clients, SPARK_HOME needs to be bound by each interpreter and therefore has to be commented out in the zeppelin-env.sh configuration file:
/opt/zeppelin/conf/zeppelin-env.sh

Edit zeppelin-env.sh
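For illustration, the relevant part of zeppelin-env.sh ends up looking roughly like this (the HDP path is the platform default and may differ in your cluster):

# SPARK_HOME is bound per interpreter (see the interpreter settings below),
# so the global export stays commented out:
# export SPARK_HOME=/usr/hdp/current/spark-client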

According to the documentation, the variable ZEPPELIN_JAVA_OPTS changed to ZEPPELIN_INTP_JAVA_OPTS for spark2. Since both Spark versions are active, both variables are defined:

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"

export ZEPPELIN_INTP_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"

Start zeppelin 0.7.3

su zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh start

A pending issue here is to modify the startup scripts so that these changes persist across a system reboot.

Configuring the spark interpreters

Navigate to the interpreter settings page:

interpreter menu

Open Interpreter Menu

Scroll down to the spark interpreter and add the property:

SPARK_HOME = /usr/hdp/current/spark-client
add property spark interpreter

Add SPARK_HOME property to the spark interpreter

Create a new interpreter with interpreter group spark and name it spark2

Add new interpreter

create new interpreter

Create a new interpreter

Interpreter name and group (leave all other values as default)

create spark2 interpreter

Set interpreter name and group

Add the property:

SPARK_HOME = /usr/hdp/current/spark2-client
add property spark2 interpreter

Add SPARK_HOME property to the spark2 interpreter

Installation test

To test the installation, create a new notebook and verify the interpreter bindings.

interpreter binding

Interpreter binding for the test notebook

Execute the following code in two different paragraphs; each should return the Spark version used by the corresponding interpreter:

%spark
sc.version

%spark2
sc.version

Test notebook

Posted in Analytics, hadoop, Spark | 1 Comment

Talend job to lookup geographic coordinates into a shape file

Introduction

Recently, for an open data integration project, I had to select some tools to be able to process geospatial data. I had a couple of choices: use R and try to work out a solution with the packages available on the server, or use Talend. One of the biggest restrictions was that the development environment had no internet connection due to security policies, and I wanted to try some options interactively. I decided to give Talend a try and asked the system admins to install the spatial plugin. I had only tried Talend before to work through some exercises from the book Talend for Big Data, but never used it for a "real-world" project, which was challenging but also made me feel motivated.

Software requirements

Talend open studio for big data

https://www.talend.com/download/

Spatial extension for Talend

https://talend-spatial.github.io/

The experiment

Input data

Customer coordinates: a flat file containing x,y coordinates for every customer.

Municipalities in Austria: a shape file with multi-polygons defining the municipality areas in Austria: source

Goal

Use the x,y coordinates of the customers to "look up" the municipality code GKZ in the shape file, which in German stands for "Gemeindekennzahl". The idea is to determine in which municipality every point (customer location) lies.

This is an overview of the overall Talend job

jobOverview

Figure 1. Talend Job Overview

Create a generic schema


Figure 2. Create a generic schema

Use a sShapeFileInput component


Figure 3. Shape file input

The shapefile contains multipolygons and I want to have polygons. My solution was to use an sSimplify component with the default settings. You may need to analyze the source metadata to find out what kind of geometries are available within the shape file.

The projection of the shapefile was "MGI / Austria Lambert", which corresponds to EPSG 31287. I wanted to re-project it to EPSG 4326 (GCS_WGS_1984), which is the one used by my input coordinates.

sProj

Figure 4. Re-project the polygons

I read the x, y coordinates from a csv file.

With a s2DPointReplacer I converted the x,y coordinates into Point(x,y) geometries (WKT: well-known text).

PointReplacer

Figure 5. Point replacer

Finally, I created an expression in a tMap to get the polygon/point intersection. The "contains" function would also work:

tmap

Figure 6. Calculate the intersection between the input geometries
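For illustration only: tMap expressions are plain Java over the JTS geometry objects exposed by the spatial components. Assuming the polygon flow is named row_shape (with columns the_geom and GKZ) and the customer flow row_customer (with column point), names that will certainly differ in your job, the expression could look like:

// true when the customer point lies inside the municipality polygon
row_shape.the_geom.intersects(row_customer.point)
// equivalent check using contains
row_shape.the_geom.contains(row_customer.point)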

Conclusion

Talend did the job, and I recommend it as an alternative not only for classical ETL projects but also for creating analytical data sets to be consumed by data scientists. Sometimes data cleansing (or data munging/wrangling, or whatever you want to call it) can be cumbersome with scripting languages. With Talend the jobs are easy to understand, and they can be fully parameterized and reused.

Posted in Business Intelligence, Geospatial data, Open Data, Talend | 3 Comments

Connect to Hive using Teradata Studio 16

Introduction

Teradata Studio is the client used to perform database administration tasks on Aster and Teradata databases, as well as to move data from and to Hadoop. Recently I was asked to test a solution to integrate Hadoop with Teradata in order to build a modern data warehouse architecture. This was my first step, and I want to share it with you.

Teradata Studio Download

1. Download Teradata Studio 16 following this link:

Teradata Studio Download

2. Open Teradata studio

Teradata Studio has three different profiles

  • Administration
  • Query Development
  • Data Transfer

3. Change to the query development profile for this quick demo

change-profile

Change profile view

Create a new Hadoop connection profile

1. Click on the New Connection Profile button as shown in the figure

create-a-new-connection-profile

Create a new connection profile

Depending on the Hadoop distribution you use, you have to change the following parameters accordingly.

I tested it with Hortonworks HDP 2.5 with hive.server2.transport.mode = http.
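For reference, a HiveServer2 JDBC URL in HTTP transport mode typically has the following shape (the host is a placeholder; 10001 and cliservice are common defaults, so check your cluster configuration):

jdbc:hive2://<hive-server-host>:10001/foodmart;transportMode=http;httpPath=cliservice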

2. Select Hadoop as profile type and give it a name. Click on Next:

create-a-hadoop-profile

Create a Hadoop profile

3. Select Hortonworks and Hive connection service


Hive connection service

4. Set WebHCat Connection Properties and test the connection. Click on Next:

webhcat-connection-properties

WebHCat connection properties

Again, I used the host name and credentials from my environment.

5. Set the JDBC connection properties. In my case I used the foodmart sample database.

jdbc-connection-properties

JDBC connection properties

Test the connection

If everything was properly set in the previous steps you should be able to see your databases in the Data Source Explorer:

databases-available

Data source explorer

 

Open a SQL Editor and execute a query. I used this sample query:

select c.country,  s.store_Type, sum(unit_sales) as sum_unit_sales
from sales_fact_dec_1998 as f
inner join customer as c
on c.customer_id = f.customer_id
inner join store as s
on s.store_id = f.store_id
group by c.country,  s.store_Type
order by c.country,  s.store_Type

Result Set

resultset

Result set

And that’s it. I hope you find it useful.

Posted in Big Data, hadoop, Teradata | 9 Comments

Teradata Express 15.10 Installation using Oracle VirtualBox

Introduction

For professional reasons I needed to start learning Teradata after some years of intensive Microsoft BI projects. To break the ice and have a playground to test everything I want, I decided to download the newest Teradata Express virtual machine (TDE), which comes with the 15.10 engine plus some additional tools. In my current company I am not able to use VMware (for some dark reasons) and I am only allowed to use Oracle VirtualBox. I would like to share the steps I followed with you.

1.  Download Teradata Express 15

The latest virtual machine can be downloaded from: http://downloads.teradata.com/download/database/teradata-express-for-vmware-player

The image is only available for VMware and an account is required to download it.

2.  Create a new Virtual Box Machine

Open Oracle Virtual Box

  • Click on “New”
  • Enter a name for the new machine
  • Select Linux as the type
  • Select openSUSE (64-bit) as the version; it is the most similar Linux version.

 

create-virtual-machine

Create a new VM

  • Depending on your local resources, assign a memory size (more than 1 GB)
create-virtual-machine2

Set the RAM memory

  • Do not add a hard disk and click on “Create”

 

create-virtual-machine3

Do not add a virtual hard disk

  • On the Warning pop-up click on “Continue”

 

create-virtual-machine4

Ignore the warning

 

  • Select the created VM, click on “Settings” and go to the “Storage” section:

 

create-virtual-machine5

Storage settings

The VMware image comes with SATA hard disks, but Oracle VirtualBox needs a SCSI controller for the Teradata Express machine.
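The steps below make this change through the GUI. If you prefer to script it, a VBoxManage sketch could look like the following (the VM name and disk paths are placeholders; adjust them to your machine and to the actual names of the extracted .vmdk files):

REM add a SCSI controller and attach the three Teradata Express disks
VBoxManage storagectl "TDE15.10" --name "SCSI" --add scsi --controller LSILogic
VBoxManage storageattach "TDE15.10" --storagectl "SCSI" --port 0 --device 0 --type hdd --medium "C:\VMs\TDE\disk1.vmdk"
VBoxManage storageattach "TDE15.10" --storagectl "SCSI" --port 1 --device 0 --type hdd --medium "C:\VMs\TDE\disk2.vmdk"
VBoxManage storageattach "TDE15.10" --storagectl "SCSI" --port 2 --device 0 --type hdd --medium "C:\VMs\TDE\disk3.vmdk"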

  • Delete SATA Controller
create-virtual-machine6

Delete SATA controller

  • Add SCSI Controller
create-virtual-machine7

Add SCSI controller

  • Add a hard disk
create-virtual-machine8

Add hard disk

  • Choose existing disk
  • Go to the location where you extracted the TDExpress15.10…………disk1.vmdk file and select it

 

create-virtual-machine9

Select virtual hard disk file

  • Repeat the previous step for disks 2 and 3
  • Go to the “System” section and in the “Acceleration” Tab select “Hyper-V” as the paravirtualization interface

 

create-virtual-machine10

Adjust virtualization settings

  • Click “Ok” and close the VM settings.
  • Click on “Start” to run the VM

3.  Start and log into the VM

Start the VM. The first screen you should see is the following

Default login and password is root

logintothemachine1

Start virtual machine

  • Select the highlighted option

The first time you start the machine, the Gnome interface does not start. You should see a login screen similar to this:

logintothemachine2

Log into the virtual machine in console mode

  • To fix it, log in and issue the following commands:
# remove the VMware tools bundled with the image (they conflict with VirtualBox)
vmware-uninstall-tools.pl
# set the VMware-specific X11 configuration aside so a fresh one can be generated
mv /etc/X11/xorg.conf /etc/X11/xorg.conf.vmware
reboot

If everything was properly set in the previous step you should be able to see a similar login screen:

logintothemachine4

Log into the virtual machine – Gnome

4.  Add the Virtual Box Linux Guest Additions

  • Mount the ISO image of the Guest Additions by clicking on the Devices menu -> CD/DVD devices and pointing to the GuestAdditions ISO file. The Guest Additions ISO is available in the Program Files\Oracle\VirtualBox folder
  • Open a terminal and execute the following commands:
cd /media
mkdir vbox
# mount the Guest Additions CD (usually /dev/sr0) on the new mount point
sudo mount /dev/sr0 vbox/
cd vbox/
# build and install the guest drivers, then reboot
./VBoxLinuxAdditions.run
reboot
add-guest-additionals

Install VBox additions

5.  Test the Teradata Installation

  • Open Teradata Studio Express (the icon is available on the desktop)
  • Right click on “Database Connections” –> New…
test-installation-1

Create new connection

  • Select “Teradata Database” and give it a name
test-installation-2

Teradata Database connection profile

  • Connection Details:
    • Database Server Name: 127.0.0.1
    • User Name: dbc
    • Password: dbc
    • Use the default values for the other fields
  • Click on “Test Connection”
  • Click on “Finish”
test-installation-3

Test created connection

  • ENJOY!!!

References

Migrating from VMware to VirtualBox (Part 1): Oracle Enterprise Linux

Teradata Express 14.0 for VMware User Guide

Teradata Express Edition 14.10 converting from VMWare to VirtualBox

Posted in Business Intelligence, Teradata, VirtualBox | 5 Comments

Apache Zeppelin installation on Windows 10

Disclaimer: I am not a Windows or Microsoft fan, but I am a frequent Windows user and it’s the most common OS I find in the enterprise. Therefore, I decided to try Apache Zeppelin on my Windows 10 laptop and share my experience with you. The behavior should be similar in other operating systems.

Introduction

It is not a secret that Apache Spark has become a reference as a powerful cluster computing framework, especially useful for machine learning applications and big data processing. Applications can be written in several languages such as Java, Scala, Python or R. Apache Zeppelin is a web-based tool that, according to the official project website (Apache Zeppelin), tries to cover all of our needs:

  • Data ingestion
  • Data discovery
  • Data analytics
  • Data visualization and collaboration

The interpreter concept is what makes Zeppelin powerful, because you can theoretically plug in any language/data-processing backend. It provides built-in Spark integration, and that is what I tested first.

Apache Zeppelin Download

You can download the latest release from this link: download

I downloaded the version 0.6.2 binary package with all interpreters.

As of this version, the Spark interpreter is compatible with Spark 2.0 and Scala 2.11.

According to the documentation, it supports Oracle JDK 1.7 (I guess it should also work with 1.8) and Mac OS X, Ubuntu 14.04, CentOS 6.X and Windows 7 Pro SP1 (and, according to my tests, also Windows 10 Home).

Too much bla bla bla, let’s get started.

Zeppelin Installation

After downloading, open the file (I used 7-Zip) and extract it to a proper location (in my case just the C drive, to avoid possible problems).

Set the JAVA_HOME system variable to your JDK installation folder.

Set the HADOOP_HOME variable to your Hadoop folder location. If you don’t have the Hadoop binaries you can download mine from here: Hadoop-2.7.1
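For example, from a command prompt (both paths are placeholders for my layout; point them to your actual JDK and Hadoop folders and open a new console afterwards so the variables are picked up):

setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_111"
setx HADOOP_HOME "C:\hadoop-2.7.1"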

system-variables

My system variables

I am not really sure why Hadoop is needed if Zeppelin is supposed to be self-contained, but I guess Spark looks for winutils.exe when you are running on Windows. I wrote about it in my previous post: Apache Spark Installation on Windows 10

This is the error I found in the Zeppelin logs (ZEPPELIN_DIR\logs; there is a file for the server log and a separate file for each interpreter):


winutils.exe error

Zeppelin Configuration

There are several settings you can adjust. Basically, there are two main files in the ZEPPELIN_DIR\conf :

  • zeppelin-env
  • zeppelin-site.xml

In the first one you can configure some interpreter settings. In the second, more aspects related to the website, for instance the Zeppelin server port (I am using 8080, but most probably yours is already used by another application).

If you don’t touch the zeppelin-env file, Zeppelin uses the built-in Spark version, which is the one used for the results posted in this entry.
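As a side note, the Windows counterpart of this file is conf\zeppelin-env.cmd (created from the shipped template). If you ever want to point Zeppelin to an external Spark distribution instead of the embedded one, it would contain something like this (placeholder path):

:: use an external Spark instead of the one bundled with Zeppelin
set SPARK_HOME=C:\spark-2.0.1-bin-hadoop2.7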

Start Zeppelin

Open a command prompt and start Zeppelin by executing zeppelin.cmd in Drive:\ZEPPELIN_DIR\bin\zeppelin.cmd

start-zeppelin

Start Zeppelin

Then, open your favorite browser and navigate to localhost:8080 (or the port you set in zeppelin-site.xml).

You should see the start page. Verify that the indicator at the top-right side of the window is green; otherwise your server is down or not running properly.


Zeppelin home

If you have not configured Hive, before you start trying the tutorials included in the release you need to set the value of zeppelin.spark.useHiveContext to false. Apart from the config files, Zeppelin has an interpreter configuration page. You can find it by clicking on your user “anonymous” –> Interpreter

interpreter-config

Go to interpreter settings

Scroll down to the bottom, where you’ll find the Spark config values:


Spark interpreter settings

Press the edit button and change the value to false in order to use the SQL context instead of Hive.

Press the Save button to persist the change:

hive-content-set-to-false

Set zeppelin.spark.useHiveContext to false

Now let’s try the Zeppelin Tutorial

From the Notebook menu click on the Zeppelin Tutorial link:

zeppelin-tutorial

Navigate to the Zeppelin Tutorial

The first time you open it, Zeppelin asks you to set the interpreter bindings:


Interpreter binding

Just scroll down and save them:

interpreter-bindings-2

Save binding

Some notes are presented with different layouts. For more about the display system visit the documentation online.

Another possibly annoying error

I was getting the following error when I tried to run some notes in the Zeppelin Tutorial:


Spark warehouse URI error

I found a suggested solution in the following stack overflow question: link

It is a URI syntax exception thrown while looking for the spark-warehouse folder in the Zeppelin directory. I struggled a little bit with that. The folder was not created in my Zeppelin directory; I thought it was a permissions problem, so I created it manually and assigned 777 permissions.

spark-warehouse-folder

spark-warehouse folder permission settings

It still failed. In the link above a forum user suggested using triple slashes to define the proper path: file:///C:/zeppelin-0.6.2-bin-all/spark-warehouse

But I still didn’t know where to place this configuration. I couldn’t do it in the Spark shell, nor while creating a Spark session (Zeppelin does that for me), and conf/spark-defaults.conf didn’t seem to be a good option for Zeppelin because I was using the built-in Spark version.

Finally, I remembered that it is possible to add additional Spark settings on the interpreter configuration page, so I just navigated there and created it:

warehouse-dir

spark.sql.warehouse.dir

Just as additional info, you can verify the settings saved on this page in the file Drive:\ZEPPELIN_DIR\conf\interpreter.json


interpreter.json
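The relevant fragment of interpreter.json then looks roughly like this (abridged and illustrative; the surrounding structure depends on the Zeppelin version):

"properties": {
  "spark.sql.warehouse.dir": "file:///C:/zeppelin-0.6.2-bin-all/spark-warehouse"
}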

After these steps, I was able to run all of the notes from the Zeppelin tutorials.

running-notes-zeppelin-tutorial

Running the load data into table note

Note that the layout of the tutorial more or less tells you the order in which you have to execute the notes. The note “Load data into table” must be executed before you run the notes below it. I guess that is the reason it spans the whole width of the page: it must be executed before you visualize or analyze the data, while the notes below could be executed in parallel, or in any order. This layout is not a must, but it helps to keep an execution order.


Visualizing data with Zeppelin

I hope this helps you on your way to learn Zeppelin!

Posted in Analytics, data visualization, R, Spark | 17 Comments

Introduction to R Services and R client – SQL Server 2016

Introduction

After some time using R and SQL Server as two separate tools (not 100% true, because I had already imported data from SQL Server into RStudio), Microsoft now offers R Services as part of SQL Server 2016. That seems very promising, especially for Microsoft BI professionals. One of the advantages is keeping the analytics close to the data and using an integrated environment.

In this post I will show some basic operations and how to get started with these technologies. I took most of the R code from this Microsoft walkthrough, which I highly recommend:

Data Science End-to-End Walkthrough

Prerequisites

  • SQL Server 2016 – I installed the Enterprise edition, but it should also work with the other editions
  • R Services: this is part of SQL Server 2016, so you need to add this feature during the initial installation or add it later
  • R client: http://aka.ms/rclient/download
  • WideWorldImporters database (WWI): download and documentation. This is the new sample database for SQL Server 2016, replacing the famous AdventureWorks

Set up R Services

That is very well documented in the MSDN, nothing to add from my side:

Set up SQL Server R Services (In-Database)

Create a view to generate a dataset to analyze

Once you installed the WWI database, create this view:

USE [WideWorldImportersDW]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create view [Fact].[SalesByLocation] as
select City, Location.Lat, Location.Long, sum(profit) AS SUM_Profit, AVG([Unit Price]) AVG_UnitPrice
from fact.Sale s
inner join Dimension.City c on s.[City Key] = c.[City Key]
group by City, Location.Lat, Location.Long
GO

 

Add the required R packages

We will need the following R packages:

  • ggmap
  • mapproj
  • ROCR
  • RODBC

Using the R client, we have two options:

  • From the menu –> Package –> Install Packages

load package 1

  • Or running the following script
if (!('ggmap' %in% rownames(installed.packages()))){ 
  install.packages('ggmap') 
} 
if (!('mapproj' %in% rownames(installed.packages()))){ 
  install.packages('mapproj') 
} 
if (!('ROCR' %in% rownames(installed.packages()))){ 
  install.packages('ROCR') 
} 
if (!('RODBC' %in% rownames(installed.packages()))){ 
  install.packages('RODBC') 
}

 

Create a connection to the SQL Server instance from the R client

library(RevoScaleR)
# Define the connection string
connStr <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;Trusted_Connection=True"
# Set ComputeContext
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
sampleDataQuery <- "select * from [Fact].[SalesByLocation]"
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery,
 connectionString = connStr, rowsPerRead=500)

 

The first step is to load the RevoScaleR library. This is an amazing library that allows you to create scalable and performant applications with R.

Then a connection string is defined, in my case using Windows authentication. If you want to use SQL Server authentication, the user name and password are needed.

We define a local folder as the share directory used to exchange data between the local session and the SQL Server compute context.

RxInSqlServer: generates a SQL Server compute context using SQL Server R Services – documentation

Sample query: I already prepared the dataset in the view. This is a best practice to reduce the size of the query in the R code, and for me it is also easier to maintain.

RxSqlServerData generates the data source object.

Get some basic statistics and visualize the dataset

# Dataset summary

rxGetVarInfo(data = inDataSource)

rxSummary(~SUM_Profit, data = inDataSource)

summary

# Plot the distribution of the profit

rxHistogram(~SUM_Profit, data = inDataSource, title = "Sum of the profit")

sum profit histogram

#Plot the distribution of the average unit price

rxHistogram(~AVG_UnitPrice, data = inDataSource, title = "Average unit price")

average unit price histo

In both histograms you can easily identify outliers, and we get a better understanding of the distribution of the data. Here is where R plays an important role (as a tool). This kind of analysis is not performed by many BI professionals, or at least that is what I have seen in my professional life.

Summary

In this post I demonstrated how to get data from SQL Server into the R client and perform some basic analysis over a simple dataset. What would be the next steps?

  • Continue visualizing the data
  • Create a machine learning model
  • Integrate the R code in SQL Server using functions and stored procedures (see the sketch below)
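As a teaser for that last point, in-database R is invoked through the sp_execute_external_script procedure. A minimal sketch against the view created earlier (the computed column is just illustrative):

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(avg_profit = mean(InputDataSet$SUM_Profit));',
    @input_data_1 = N'SELECT SUM_Profit FROM Fact.SalesByLocation'
WITH RESULT SETS ((avg_profit FLOAT));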

References

Data Science End-to-End Walkthrough

Big Data Analysis with Revolution R Enterprise 

Posted in Analytics, Business Intelligence, R, SQL Server | 1 Comment

Export data to Hadoop using Polybase – Insert into external table

Introduction

This post is a continuation of Polybase Query Service and Hadoop – Welcome SQL Server 2016

One of the most interesting use cases of PolyBase is the ability to store historical data from relational databases in a Hadoop file system. Storage costs can be reduced while the data remains accessible and can still be joined with the regular relational tables. So let’s take the first steps towards our new archiving solution.

Requirements

Create a folder for the pdw_user in Hadoop

PolyBase uses the default user name pdw_user when connecting to the Hadoop cluster. For this example I will use an unsecured Hadoop cluster, that is, one without Kerberos authentication. For production environments a better security approach should be used.

Open a command line session with administrator rights and issue the following commands:

Create a directory for the pdw_user:

C:\>hadoop fs -mkdir /user/pdw_user

Change the ownership of the directory:

C:\>hadoop fs -chown -R pdw_user /user/pdw_user

Verify the results using the command line:

create pdw_user dir

Verify the results using the web browser:

In my case: http://localhost:50070/explorer.html#/user


You can name the directory whatever you want; what is important is to change its ownership to the pdw_user user.

Create an external data source and file format

Open a query window in Management Studio pointing to the AdventureworksDW2016CTP3 database and run the following queries.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
)

CREATE EXTERNAL FILE FORMAT SalesExport WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
                    FIELD_TERMINATOR =';',
                    DATE_FORMAT = 'yyyy-MM-dd' ,
                    USE_TYPE_DEFAULT = TRUE
                           )
)

Create an external table

CREATE EXTERNAL TABLE HistoricalSales
(
    SalesOrderNumber nvarchar(20)
       ,SalesOrderLineNumber tinyint
       ,ProductName nvarchar(50)
       ,SalesTerritoryCountry nvarchar(50)
       ,OrderQuantity smallint
       ,UnitPrice money
       ,ExtendedAmount money
       ,SalesAmount money
       ,OrderDate date
)
WITH
(
    LOCATION = '/user/pdw_user',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SalesExport,
    REJECT_TYPE = value,
    REJECT_VALUE=0
)

The key point here is the location. It must point to a directory and not to a specific file, unlike in my previous post. If the location does not exist, it will be created.

Insert into external table

This example uses the Adventure Works DW database:

-- Enable INSERT into external table
sp_configure 'allow polybase export', 1;
reconfigure

-- Export data: Move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO [dbo].[HistoricalSales]
 SELECT 
       [SalesOrderNumber]
      ,[SalesOrderLineNumber]
      ,p.EnglishProductName as ProductName
      ,st.SalesTerritoryCountry
      ,[OrderQuantity]
      ,[UnitPrice]
      ,[ExtendedAmount]
      ,[SalesAmount]
      ,convert(date,[OrderDate]) AS [OrderDate]
  FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
  inner join dbo.DimProduct p on a.ProductKey = p.ProductKey
  inner join dbo.DimSalesTerritory st 
  on st.SalesTerritoryKey = a.SalesTerritoryKey
  where year(OrderDate) < 2011


Examining the results

Using the web browser:

browse dir created files

The PolyBase export operation creates multiple files under the specified location. The external files are named QueryID_date_time_ID.format, where ID is an incremental identifier and format is the exported data format.

Select the exported data from the external table:

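For example, any query against the external table will read the files exported to HDFS (an illustrative statement):

SELECT TOP 10 SalesOrderNumber, ProductName, SalesAmount, OrderDate
FROM dbo.HistoricalSales
ORDER BY OrderDate;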

Conclusion

With this small tutorial I demonstrated how to use SQL Server 2016 and Hadoop to create a cost-effective and functional archiving solution. There are still several other aspects to explain and consider, but we can start building our proofs of concept. Let’s get started.

References

PolyBase Queries

Apache Hadoop File System Shell Commands

Acknowledgments

Special thanks to Sumin Mohanan and Sonya Marshall from Microsoft for helping me troubleshoot my tests.

Posted in Big Data, hadoop, SQL Server | 17 Comments