Apache Zeppelin installation on Windows 10

Disclaimer: I am not a Windows or Microsoft fan, but I am a frequent Windows user and it is the most common OS I find in enterprise environments. Therefore, I decided to try Apache Zeppelin on my Windows 10 laptop and share my experience with you. The behavior should be similar on other operating systems.

Introduction

It is no secret that Apache Spark has become a reference as a powerful cluster computing framework, especially useful for machine learning applications and big data processing. Applications can be written in several languages such as Java, Scala, Python or R. Apache Zeppelin is a web-based tool that, according to the official project website (Apache Zeppelin), tries to cover all of our needs:

  • Data ingestion
  • Data discovery
  • Data analytics
  • Data visualization and collaboration

The interpreter concept is what makes Zeppelin powerful, because you can theoretically plug in any language or data-processing backend. It provides built-in Spark integration, and that is what I tested first.

Apache Zeppelin Download

You can download the latest release from this link: download

I downloaded the version 0.6.2 binary package with all interpreters.

Starting with this version, the Spark interpreter is compatible with Spark 2.0 and Scala 2.11.

According to the documentation, it supports Oracle JDK 1.7 (I guess it should also work with 1.8) on Mac OS X, Ubuntu 14.04, CentOS 6.X and Windows 7 Pro SP1 (and, according to my tests, also Windows 10 Home).

Too much bla bla bla, let’s get started.

Zeppelin Installation

After the download, open the file (I used 7-Zip) and extract it to a proper location (in my case, directly to the C: drive to avoid possible path problems).

Set the JAVA_HOME system variable to your JDK bin folder.

Set the HADOOP_HOME variable to your Hadoop folder location. If you don't have the Hadoop binaries, you can download mine from here: Hadoop-2.7.1

system-variables

My system variables

I am not really sure why Hadoop is needed if Zeppelin is supposed to be self-contained, but I guess Spark looks for winutils.exe when running on Windows. I wrote about it in my previous post: Apache Spark Installation on Windows 10

This is the error I found in the Zeppelin logs (ZEPPELIN_DIR\logs –> there is a file for the server log and a separate file for each interpreter):

winutils error.JPG

winutils.exe error

Zeppelin Configuration

There are several settings you can adjust. Basically, there are two main files in ZEPPELIN_DIR\conf:

  • zeppelin-env
  • zeppelin-site.xml

In the first one you can configure some interpreter settings. The second one covers aspects of the web application, such as the Zeppelin server port (I am using the default 8080, but quite possibly that port is already taken by another application on your machine).
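For example, to change the port you edit (or add) a property like the following in zeppelin-site.xml (a sketch based on the default template; 8080 is the default value):

<property>
  <name>zeppelin.server.port</name>
  <value>8080</value>
  <description>Server port.</description>
</property>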

If you don't touch the zeppelin-env file, Zeppelin uses the built-in Spark version, which is what I used for the results posted in this entry.

Start Zeppelin

Open a command prompt and start Zeppelin by executing Drive:\ZEPPELIN_DIR\bin\zeppelin.cmd

start-zeppelin

Start Zeppelin

Then, open your favorite browser and navigate to localhost:8080 (or the port you set in zeppelin-site.xml).

You should see the start page. Verify that the status indicator in the top-right corner of the window is green; otherwise your server is down or not running properly.

zeppelin home.JPG

Zeppelin home

If you have not configured Hive, then before trying the tutorials included in the release you need to set zeppelin.spark.useHiveContext to false. Apart from the config files, Zeppelin has an interpreter configuration page. You can find it by clicking on your user ("anonymous") –> Interpreter

interpreter-config

Go to interpreter settings

Scroll down to the bottom, where you'll find the Spark config values:

spark interpreter properties.JPG

Spark interpreter settings

Press the edit button and change the value to false in order to use the SQL context instead of the Hive context.

Press the Save button to persist the change:

hive-content-set-to-false

Set zeppelin.spark.useHiveContext to false

Now let’s try the Zeppelin Tutorial

From the Notebook menu click on the Zeppelin Tutorial link:

zeppelin-tutorial

Navigate to the Zeppelin Tutorial

The first time you open it, Zeppelin asks you to set the interpreter bindings:

interpreter bindings 1.JPG

Interpreter binding

Just scroll down and save them:

interpreter-bindings-2

Save bindings

The notes are presented with different layouts. For more about the display system, visit the online documentation.

Another possible annoying error

I was getting the following error when I tried to run some of the notes in the Zeppelin Tutorial:

spark-warehouse folder 2.JPG

Spark warehouse URI error

I found a suggested solution in the following Stack Overflow question: link

It is a URI syntax exception thrown while looking for the spark-warehouse folder inside the Zeppelin directory. I struggled a little bit with that. The folder had not been created in my Zeppelin directory; I thought it was a permissions problem, so I created it manually and assigned 777 permissions.

spark-warehouse-folder

spark-warehouse folder permission settings

It still failed. In the link above, a user suggested using triple slashes to define the proper path: file:///C:/zeppelin-0.6.2-bin-all/spark-warehouse

But at first I didn't know where to place this configuration. I couldn't do it in the Spark shell, nor while creating a Spark session (Zeppelin does that for me), and conf/spark-defaults.conf didn't seem to be a good option for Zeppelin because I was using the built-in Spark version.

Finally, I remembered that it is possible to add additional Spark settings on the interpreter configuration page, so I navigated there and created the property:

warehouse-dir

spark.sql.warehouse.dir
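In other words, the name/value pair I added on the interpreter page was essentially the following (the path is the one suggested in the Stack Overflow answer; adjust it to your own Zeppelin folder):

spark.sql.warehouse.dir = file:///C:/zeppelin-0.6.2-bin-all/spark-warehouse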

As additional info, you can verify the settings saved on this page in the file Drive:\ZEPPELIN_DIR\conf\interpreter.json

spark-warehouse folder 3.JPG

interpreter.json

After these steps, I was able to run all of the notes from the Zeppelin tutorials.

running-notes-zeppelin-tutorial

Running the load data into table note

Note that the tutorial layout more or less tells you the order in which to execute the notes. The "Load data into table" note must be executed before you play the notes below it; I guess that is why it spans the whole width of the page, since it must run before you visualize or analyze the data, while the notes below it can be executed in parallel or in any order. The layout is not a must, but it helps to keep an execution order.

note reults.JPG

Visualizing data with Zeppelin

I hope this helps you on your way to learn Zeppelin!


Introduction to R Services and R client – SQL Server 2016

Introduction

After some time using R and SQL Server as two separate tools (not 100% true, because I had already imported data from SQL Server into RStudio), Microsoft is now offering R Services as part of SQL Server 2016. That seems very promising, especially for Microsoft BI professionals. One of the advantages is keeping the analytics close to the data and using an integrated environment.

In this post I will show some basic operations and how to get started with these technologies. I took most of the R code from this Microsoft walkthrough, which I highly recommend:

Data Science End-to-End Walkthrough

Prerequisites

  • SQL Server 2016 – I installed the Enterprise edition, but it should work with the other editions as well
  • R Services: this is part of SQL Server 2016, so you need to add this feature during the initial installation or add it later
  • R client: http://aka.ms/rclient/download
  • WideWorldImporters database (WWI): download and documentation. This is the new sample database for SQL Server 2016, replacing the famous AdventureWorks

Set up R Services

That is very well documented on MSDN; nothing to add from my side:

Set up SQL Server R Services (In-Database)

Create a view to generate a dataset to analyze

Once you have installed the WWI database, create this view:

USE [WideWorldImportersDW]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create view [Fact].[SalesByLocation] as
select City, Location.Lat, Location.Long, sum(profit) AS SUM_Profit, AVG([Unit Price]) AVG_UnitPrice
from fact.Sale s
inner join Dimension.City c on s.[City Key] = c.[City Key]
group by City, Location.Lat, Location.Long
GO

 

Add the required R packages

We will need the following R packages:

  • ggmap
  • mapproj
  • ROCR
  • RODBC

Using the R client, we have two options:

  • From the menu –> Package –> Install Packages

load package 1

  • Or run the following script:
if (!('ggmap' %in% rownames(installed.packages()))){ 
  install.packages('ggmap') 
} 
if (!('mapproj' %in% rownames(installed.packages()))){ 
  install.packages('mapproj') 
} 
if (!('ROCR' %in% rownames(installed.packages()))){ 
  install.packages('ROCR') 
} 
if (!('RODBC' %in% rownames(installed.packages()))){ 
  install.packages('RODBC') 
}

 

Create a connection to the SQL Server instance from the R client

library(RevoScaleR)
# Define the connection string
connStr <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;Trusted_Connection=True"
# Set ComputeContext
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
sampleDataQuery <- "select * from [Fact].[SalesByLocation]"
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery,
 connectionString = connStr, rowsPerRead=500)

 

The first step is to load the RevoScaleR library. This is an amazing library that allows you to create scalable and performant applications with R.

Then a connection string is defined, in my case using Windows authentication. If you want to use SQL Server authentication, the user name and password are needed.
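For reference, a SQL Server authentication variant of the connection string would look roughly like this (the user name and password below are placeholders, not values from my environment):

# SQL Server authentication instead of Windows authentication (placeholder credentials)
connStr <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;Uid=my_login;Pwd=my_password"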

We define a local folder as the share directory used by the compute context.

RxInSqlServer: generates a SQL Server compute context using SQL Server R Services – documentation

Sample query: I already prepared the dataset in the view; this is a good practice to keep the query embedded in the R code small, and for me it is also easier to maintain.

RxSqlServerData generates the data source object.

Get some basic statistics and visualize the dataset

# Dataset summary

rxGetVarInfo(data = inDataSource)

rxSummary(~SUM_Profit, data = inDataSource)

summary

# Plot the distribution of the profit

rxHistogram(~SUM_Profit, data = inDataSource, title = "Sum of the profit")

sum profit histogram

#Plot the distribution of the average unit price

rxHistogram(~AVG_UnitPrice, data = inDataSource, title = "Average unit price")

average unit price histo

In both histograms you can easily identify outliers and get a better understanding of the distribution of the data. This is where R plays an important role as a tool: this kind of analysis is not performed by many BI professionals, or at least that is what I have seen in my professional life.

Summary

In this post I demonstrated how to get data from SQL Server into the R client and perform some basic analysis on a simple dataset. What would be the next steps?

  • Continue visualizing the data
  • Create a machine learning model
  • Integrate the R code in SQL Server using functions and stored procedures (see the sketch below)
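As a teaser for that last point, here is a minimal sketch of what running R inside SQL Server looks like with sp_execute_external_script. It assumes R Services is installed and the 'external scripts enabled' option has been turned on; the R script simply echoes its input back as a result set:

-- One-time setup (requires a restart of the SQL Server service afterwards):
-- EXEC sp_configure 'external scripts enabled', 1; RECONFIGURE;
EXEC sp_execute_external_script
     @language = N'R',
     @script = N'OutputDataSet <- InputDataSet;',
     @input_data_1 = N'SELECT TOP (5) City, SUM_Profit FROM [Fact].[SalesByLocation]'
WITH RESULT SETS ((City nvarchar(100), SUM_Profit float));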

References

Data Science End-to-End Walkthrough

Big Data Analysis with Revolution R Enterprise 


Export data to Hadoop using Polybase – Insert into external table

Introduction

This post is a continuation of Polybase Query Service and Hadoop – Welcome SQL Server 2016

One of the most interesting use cases for Polybase is the ability to move historical data from relational databases into a Hadoop file system. The storage costs can be reduced while the data remains accessible and can still be joined with the regular relational tables. So let's take the first steps towards our new archiving solution.

Requirements

Create a folder for the pdw_user in Hadoop

Polybase uses the default user name pdw_user when connecting to the Hadoop cluster. For this example I will use an unsecured Hadoop cluster, that is, one without Kerberos authentication. For production environments a better security approach should be used.

Open a command line session with administrator rights and issue the following commands:

Create a directory for the pdw_user:

C:\>hadoop fs -mkdir /user/pdw_user

Change the ownership of the directory:

C:\>hadoop fs -chown -R pdw_user /user/pdw_user

Verify the results using the command line:

create pdw_user dir

Verify the results using the web browser:

In my case: http://localhost:50070/explorer.html#/user

browse user dir.JPG

You can name the directory whatever you want; the important part is to change its ownership to the pdw_user user.

Create an external data source and file format

Open a query window in Management Studio connected to AdventureworksDW2016CTP3 and run the following queries.

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
)

CREATE EXTERNAL FILE FORMAT SalesExport WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
                    FIELD_TERMINATOR =';',
                    DATE_FORMAT = 'yyyy-MM-dd' ,
                    USE_TYPE_DEFAULT = TRUE
                           )
)

Create an external table

CREATE EXTERNAL TABLE HistoricalSales
(
    SalesOrderNumber nvarchar(20)
       ,SalesOrderLineNumber tinyint
       ,ProductName nvarchar(50)
       ,SalesTerritoryCountry nvarchar(50)
       ,OrderQuantity smallint
       ,UnitPrice money
       ,ExtendedAmount money
       ,SalesAmount money
       ,OrderDate date
)
WITH
(
    LOCATION = '/user/pdw_user',
    DATA_SOURCE = HDP2,
    FILE_FORMAT = SalesExport,
    REJECT_TYPE = value,
    REJECT_VALUE=0
)

The key point here is the location: it must point to a directory, not to a specific file as in my previous post. If the location does not exist, it will be created.

Insert into external table

This example uses the Adventure Works DW database:

-- Enable INSERT into external table
sp_configure 'allow polybase export', 1;
reconfigure

-- Export data: Move old data to Hadoop while keeping it query-able via an external table.
INSERT INTO [dbo].[HistoricalSales]
 SELECT 
       [SalesOrderNumber]
      ,[SalesOrderLineNumber]
      ,p.EnglishProductName as ProductName
      ,st.SalesTerritoryCountry
      ,[OrderQuantity]
      ,[UnitPrice]
      ,[ExtendedAmount]
      ,[SalesAmount]
      ,convert(date,[OrderDate]) AS [OrderDate]
  FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
  inner join dbo.DimProduct p on a.ProductKey = p.ProductKey
  inner join dbo.DimSalesTerritory st 
  on st.SalesTerritoryKey = a.SalesTerritoryKey
  where year(OrderDate) < 2011

insert stmt.JPG

Examining the results

Using the web browser:

browse dir created files

The Polybase export operation creates multiple files under the specified location. The external files are named QueryID_date_time_ID.format, where ID is an incremental identifier and format is the exported data format.

Select the exported data from the external table:

select ext table stmt.JPG
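The query in the screenshot is a plain SELECT; something along these lines (a sketch, not the exact statement from the screenshot) confirms that the archived rows can be queried like any other table:

-- The archived rows stored in Hadoop are queried like a regular table
SELECT TOP (100) *
FROM dbo.HistoricalSales;

-- Standard T-SQL aggregations work against the external table as well
SELECT SalesTerritoryCountry, SUM(SalesAmount) AS TotalSales
FROM dbo.HistoricalSales
GROUP BY SalesTerritoryCountry
ORDER BY TotalSales DESC;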

Conclusion

With this small tutorial I demonstrated how to use SQL Server 2016 and Hadoop to create a cost-effective and functional archiving solution. There are still several other aspects to explain and consider, but we can already start building our proofs of concept. Let's get started.

References

PolyBase Queries

Apache Hadoop File System Shell Commands

Acknowledgments

Special thanks to Sumin Mohanan and Sonya Marshall from Microsoft, who helped me troubleshoot my tests.


Polybase Query Service and Hadoop – Welcome SQL Server 2016

Introduction

One of the coolest features of SQL Server 2016 is Polybase. Already available in Parallel Data Warehouse, this functionality is now integrated into SQL Server 2016 and allows you to combine relational and non-relational data: for example, to query data in Hadoop and join it with relational data, to import external data into SQL Server, or to export data from the server into Hadoop or Azure Blob Storage. The last case is especially interesting, since it is possible to transfer old transactions or historical data to a Hadoop file system and dramatically reduce storage costs.

Setup Polybase

I installed the following components:

After installing SQL Server, enable TCP/IP connectivity:

enable tcp ip

Verify that the Polybase services are running:

polybase services

Create an external data source

Open a connection to the AdventureworksDW2016CTP3 database.

Polybase connectivity configuration:

sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO

'hadoop connectivity' is the name of the configuration option, and @configvalue identifies the supported Hadoop data source. In my case I selected 7, which corresponds to Hortonworks 2.1, 2.2, and 2.3 on Windows Server. I am using my own Hadoop 2.7.1 build, which is the Hadoop version shipped with Hortonworks HDP 2.3 and 2.4.

More info here:

Hortonworks Products

Polybase Connectivity Configuration

Create external data source script:

CREATE EXTERNAL DATA SOURCE HDP2 WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://localhost:9000'
)

HADOOP is the external data source type, and the location is the NameNode URI. You will find this value in <your Hadoop directory>\etc\hadoop\core-site.xml:
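In a default single-node installation the relevant entry looks roughly like this (the host and port in your core-site.xml may differ):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>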

NameNode URI.jpg

Once the source is created you will find it under “External Data Sources” folder in Management Studio:

External data source.jpg

It is important to remark that the location is not validated when you create the external data source.

Create a sample file for this example

For demo purposes, create a .csv file and populate it with a query against AdventureworksDW2016CTP3. This is just an example; you can create your own file and change the file format in the next section accordingly.

Here is my query:

SELECT TOP 1000
  [SalesOrderNumber]
 ,[SalesOrderLineNumber]
 ,p.EnglishProductName as ProductName
 ,st.SalesTerritoryCountry
 ,[OrderQuantity]
 ,[UnitPrice]
 ,[ExtendedAmount]
 ,[SalesAmount]
 ,convert(date,[OrderDate]) AS [OrderDate]
FROM [AdventureworksDW2016CTP3].[dbo].[FactInternetSales] a
inner join dbo.DimProduct p on a.ProductKey = p.ProductKey
inner join dbo.DimSalesTerritory st on st.SalesTerritoryKey = a.SalesTerritoryKey

I populated the csv file using Management Studio as follows:

Open the Export wizard: right click on the database name –> Tasks –> Export Data…

Export Data.jpg

Select a data source

select a data source.jpg

Choose a destination

Choose a destination.jpg

Specify a query to select the data to export

specify query

Source query

source query.jpg

Configure flat file destination

configure flat file destination.jpg

Save and run the package

save and run the package.jpg

Export done!

execution finished.jpg

Transfer the csv to HDFS

I created a directory called input in my Hadoop file system and stored the csv file in c:\tmp.

In case you haven't done this before: to create a directory in HDFS, open a command prompt, go to your Hadoop directory and type:

<Your_hadoop-directory>hadoop fs -mkdir /input

Here is my shell command to move the file from the Windows file system to HDFS:

<Your_hadoop-directory>hadoop fs -copyFromLocal c:\tmp\AWExport.csv /input/

Set full permissions (read, write and execute) for the owner, the group and others:

<Your_hadoop-directory>hadoop fs -chmod 777 /input/AWExport.csv

List the files in the input directory:

<Your_hadoop-directory>hadoop fs -ls /input

hdfs commands.jpg

Create an external file format

To create a file format, copy and paste the following script into a query window in Management Studio:

CREATE EXTERNAL FILE FORMAT SalesExport WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (
                FIELD_TERMINATOR =';',
                DATE_FORMAT = 'yyyy-MM-dd' ,
                USE_TYPE_DEFAULT = TRUE
                           )
)

SalesExport is just the name I gave it.

The format type is delimited text. There are some other types; more info here.

The field terminator is the same one I used when I exported the data to the flat file.

The date format also corresponds to the format in the flat file.

Create an external table

This table references the file stored in HDFS (in my case AWExport.csv). The column definitions correspond to the structure of the file.

CREATE EXTERNAL TABLE SalesImportcsv
(
    SalesOrderNumber nvarchar(20)
   ,SalesOrderLineNumber tinyint
   ,ProductName nvarchar(50)
   ,SalesTerritoryCountry nvarchar(50)
   ,OrderQuantity smallint
   ,UnitPrice money
   ,ExtendedAmount money
   ,SalesAmount money
   ,OrderDate date
)
WITH
(
   LOCATION = '/input/AWExport.csv',
   DATA_SOURCE = HDP2,
   FILE_FORMAT = SalesExport,
   REJECT_TYPE = value,
   REJECT_VALUE=0
)

Location: location of the file in HDFS.

Data Source: the one created in a previous step.

File Format: also the one created in a previous step.

Reject type: the reject value is interpreted as a literal number of rows, not as a percentage (the other option is percentage).

Reject value: how many rows are allowed to fail. Failing means dirty records; in this context, a value that does not match the column definition.

MSDN Documentation

Query the external table

If everything works, you should be able to see the external table in Management Studio. Then just right-click it and select the top 1000 records, for example:

select from external table.jpg
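The generated statement is roughly the following (a sketch of what the "Select Top 1000 Rows" option produces for this table):

SELECT TOP (1000)
       SalesOrderNumber, SalesOrderLineNumber, ProductName, SalesTerritoryCountry,
       OrderQuantity, UnitPrice, ExtendedAmount, SalesAmount, OrderDate
FROM dbo.SalesImportcsv;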

Further Topics

  • Insert records in an external table.
  • Configure an external source with credentials.
  • Build an SSIS package to import and export data from Hadoop.
  • View the execution plans of Polybase queries.


My experience building Hadoop 2.7.1 on Windows Server 2012

Introduction

Building the Hadoop sources on Windows can be cumbersome, even though the official documentation states that "… building a Windows package from the sources is fairly straightforward". There are several good resources describing the steps needed to successfully build a distribution. The most useful one for me was this:

Hadoop 2.7.1 for Windows 10 binary build with Visual Studio 2015 (unofficial)

I solved most of the hurdles with the directions in this blog post (thanks, Kplitz Kahran), but I still had to suffer a little bit more. In this post I will show you the additional details I needed to fix.

You can try the information in the link above, and if you have no luck these solutions could help you.

Build winutils project – error C2065: ‘L’: undeclared identifier

In order to solve this problem, I just rewrote this line of code:

const WCHAR* wsceConfigRelativePath = WIDEN_STRING(STRINGIFY(WSCE_CONFIG_DIR)) L"\\" WIDEN_STRING(STRINGIFY(WSCE_CONFIG_FILE));

as:

const WCHAR* wsceConfigRelativePath = STRINGIFY(WSCE_CONFIG_DIR) "\\" STRINGIFY(WSCE_CONFIG_FILE);

Basically, the concatenated values are explicitly converted to wide characters (WCHAR) using a macro. I tested the concatenation without this explicit conversion and it worked. I am not sure why the original line fails; if someone can explain it, I would really appreciate it.

Build native project – LINK Error to libwinutils

The libwinutils library is an external reference of the native project. Verify in the project properties, in the Linker section, that "Additional Library Directories" includes the output directory of the libwinutils project.

Here is a screencast with the steps above:

 

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin

After solving the previous errors, I thought I was the master of the universe, until the next error damaged my enthusiasm again. Luckily, the fix proposed by Rushikesh Garadade in this Stack Overflow thread solved the issue:

http://stackoverflow.com/questions/21752279/failed-to-execute-goal-org-apache-maven-pluginsmaven-antrun-plugin1-6-run-pr

After this the build crashed again, but fortunately it was just a temporary network error. And finally, the happiest image of the day:

BuildHadoop

Hope that helps.

References

Build and Install Hadoop 2.x or newer on Windows

Hadoop 2.7.1 for Windows 10 binary build with Visual Studio 2015 (unofficial)

Working with Strings


Apache Kafka 0.8 on Windows

A very helpful step-by-step tutorial to help us learn and play with modern technologies on our Windows computers.

JanSchulte.com

Apache Kafka is a scalable, distributed messaging system, which is increasingly getting popular and used by such renowned companies like LinkedIn, Tumblr, Foursquare, Spotify and Netflix [1].

Setting up a Kafka development environment on a Windows machine requires some configuration, so I created this little step-by-step installation tutorial for all the people who want to save themselves from some hours work😉



Apache Spark installation on Windows 10

 

Introduction

This post is meant to help people install and run Apache Spark on a Windows 10 computer (it may also help with prior versions of Windows, or even Linux and Mac OS systems) and try out and learn how to interact with the engine without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in this post a couple of years ago (Hadoop self-learning with pre-configured Virtual Machines) or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.

A few words about Apache Spark

Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce thanks to its in-memory computing capabilities. It is possible to write Spark applications using Java, Python, Scala and R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).

Spark runs on Hadoop, on Mesos, in the cloud, or standalone. The last case is the subject of this post. We are going to install Spark 1.6.0 standalone on a computer with a 32-bit Windows 10 installation (my very old laptop). Let's get started.

Install or update Java

For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:

Start –> All apps –> Java –> Check For Updates

Check java updates

Update Java

 

In the same way you can verify your Java version. This is the version I used:

 

about java

Java Version

 

Download Scala

Download from here. Then execute the installer.

I just downloaded the binaries for my system:

download scala

Scala Download

 

 

Download Spark

Select any of the prebuilt versions from here.

As we are not going to use Hadoop, it makes no difference which version you choose. I downloaded the following one:

Download spark

Spark Download

 

Feel free also to download the source code and make your own build if you feel comfortable with it.

Extract the files to any location on your drive where your user has enough permissions.

Download winutils.exe

This was the critical point for me, because I downloaded one version and it did not work until I realized that there are 64-bit and 32-bit versions of this file. Here you can find them:

32-bit winutils.exe

64-bit winutils.exe

To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where this file can be found, please share it with us.

Git client download (I hope you don’t get stuck in this step)

Extract the folder containing the file winutils.exe to any location of your preference.

Environment Variables Configuration

This is also crucial in order to run some commands without problems from the command prompt.

  • _JAVA_OPTIONS: I set this variable to the value shown in the figure below. I was getting Java heap memory problems with the default values and this fixed the problem.
  • HADOOP_HOME: even though Spark can run without Hadoop, the version I downloaded is prebuilt for Hadoop 2.6 and its code looks for it. To fix this inconvenience, I set this variable to the folder containing the winutils.exe file.
  • JAVA_HOME: usually you already set this variable when you install Java, but it is better to verify that it exists and is correct.
  • SCALA_HOME: the bin folder of the Scala location. If you used the standard location from the installer, it should be the path in the figure below.
  • SPARK_HOME: the bin folder path of where you uncompressed Spark.

 

env variables 2

Environment Variables 1/2

env variables 1

Environment Variables 2/2
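If you prefer the command line to the Windows dialogs, the same variables can be set with setx from an elevated command prompt. The values below are only examples based on the descriptions above; adjust every path to your own installation (setx only affects command prompts opened afterwards):

REM Example values only - replace each path with your actual location
setx _JAVA_OPTIONS "-Xmx512M -Xms512M"
setx JAVA_HOME "C:\Program Files\Java\jre1.8.0"
setx HADOOP_HOME "C:\winutils"
setx SCALA_HOME "C:\Program Files (x86)\scala\bin"
setx SPARK_HOME "C:\spark-1.6.0-bin-hadoop2.6\bin"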

 

Permissions for the folder tmp/hive

I struggled a little bit with this issue. After I had set everything up, I tried to run spark-shell from the command line and got an error that was hard to debug. The shell tries to find the tmp/hive folder and was not able to set the SQL context.

I looked at my C drive and found that the C:\tmp\hive folder had been created. If it wasn't, you can create it yourself and set 777 permissions on it. In theory you can do this through the advanced sharing options of the Sharing tab in the folder properties, but I did it from the command line using winutils:

Open a command prompt as administrator and type:

chmod 777

Set 777 permissions for tmp/hive
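The command in the screenshot is essentially the following (the winutils.exe path is an example; use the location where you extracted it):

C:\winutils\winutils.exe chmod 777 C:\tmp\hive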

 

Please be aware that you need to adjust the path of the winutils.exe above if you saved it to another location.

We are finally done and can start the spark-shell, which is an interactive way to analyze data using Scala or Python. It is also a way to test our Spark installation.

Using the Scala Shell to run our first example

In the same command prompt go to the Spark folder and type the following command to run the Scala shell:

 

start the spark shell

Start the Spark Scala Shell
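In case the screenshot is hard to read: from the Spark folder the command is simply the shell launcher in the bin subfolder, i.e.:

bin\spark-shell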

 

After some execution lines you should see a similar screen:

scala shell.jpg

Shell started

 

You are going to see several warnings and informational messages in the shell because we have not set various configuration options. For now, just ignore them.

Let's run our first program in the shell. I took the example from the Spark programming guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark's root folder. After the RDD is created, the second command just counts the number of items in it:
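The two commands are the ones from the Spark quick start, roughly:

val textFile = sc.textFile("README.md")   // build an RDD from the README.md file in Spark's root folder
textFile.count()                          // count the number of items (lines) in the RDD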

second command.jpg

Running a Spark Example

 

And that's it. I hope you could follow my explanation and were able to run this simple example. I wish you a lot of fun with Apache Spark.

References

Why does starting spark-shell fail with NullPointerException on Windows?

Apache Spark checkpoint issue on windows

Configure Standalone Spark on Windows 10
