Apache Zeppelin installation on Windows 10

Disclaimer: I am not a Windows or Microsoft fan, but I am a frequent Windows user and it’s the most common OS I found in the Enterprise everywhere. Therefore, I decided to try Apache Zeppelin on my Windows 10 laptop and share my experience with you. The behavior should be similar in other operating systems.

Introduction

It is not a secret that Apache Spark became a reference as a powerful cluster computing framework, especially useful for machine learning applications and big data processing. Applications could be written in several languages as Java, Scala, Python or R. Apache Zeppelin is a Web-based tool that tries to cover according to the official project Website all of our needs (Apache Zeppelin):

  • Data ingestion
  • Data discovery
  • Data analytics
  • Data visualization and collaboration

The interpreter concept is what makes Zeppelin powerful, because you can theoretically plug in any language/data-processing-backend. It provides built-in Spark integration, and that is what I have tested first.

Apache Zeppelin Download

You can download the latest release from this link: download

I downloaded the version 0.6.2 binary package with all interpreters.

Since this version, the Spark interpreter is compatible with Spark 2.0 and Scala 2.11

According to the documentation, it supports Oracle JDK 1.7 (I guess it should work with 1.8) and Mac OSX, Ubuntu 14.4, CentOS 6.X and Windows 7 pro SP1 (And according to my tests also with Windows 10 Home).

Too much bla bla bla, let’s get started.

Zeppelin Installation

After download open the file (I used 7 Zip) and extract it to a proper location (in my case just the c drive to avoid possible problems)

Set the JAVA_HOME system variable to your JDK bin folder.

Set the variable HADOOP_HOME to your Hadoop folder location. If you don’t have the HADOOP binaries you can download my binaries from here: Hadoop-2.7.1

system-variables

My system variables

I am not really sure why Hadoop is needed if Zeppelin supposed to be autonomous but I guess Spark looks for the winutils.exe if you are using Windows. I posted about it in my previous post: Apache Spark Installation on Windows 10

This is the error I found in the Zeppelin logs (ZEPPELIN_DIR\logs –> there is a file for the server log and a separated file for each interpreter):

winutils error.JPG

winutils.exe error

Zeppelin Configuration

There are several settings you can adjust. Basically, there are two main files in the ZEPPELIN_DIR\conf :

  • zeppelin-env
  • zeppelin-site.xml

In the first one you can configure some interpreter settings. In the second more aspects related to the Website, like for instance, the Zeppelin server port (I am using the 8080 but most probably yours is already used by another application)

If you don’t touch the zeppelin-env file, Zeppelin use the built-in Spark version, which it has been used for the results posted in this entry.

Start Zeppelin

Open a command prompt and start Zeppelin executing the zeppelin.cmd in Drive:\ZEPELLIN_DIR\bin\zeppelin.cmd

start-zeppelin

Start Zeppelin

Then, open your favorite browser and navigate to localhost:8080 (or the one you set in the zeppelin-site.xml)

You should see the starting page. Verify that the indicator in the top-right-side of the windows is green, otherwise your server is down or is not running properly)

zeppelin home.JPG

Zeppelin home

If you have not configured Hive, before start trying the tutorials included in the release, you should need to set the value of the zeppelin.spark.useHiveContext to false. Apart from the config files, Zeppelin has an interpreter configuration page. You can find it by clicking on your user “anonymous” –> Interpreter

interpreter-config

Go to interpreter settings

Scroll-down to the bottom where you’ll find the Spark config values:

spark interpreter properties.JPG

Spark interpreter settings

Press on the edit button and change the value to false in order to use the SQL context instead of Hive.

Press the Save button to persist the change:

hive-content-set-to-false

Set zeppelin.spark.useHiveContext to false

Now let’s try the Zeppelin Tutorial

From the Notebook menu click on the Zeppelin Tutorial link:

zeppelin-tutorial

Navigate to the Zeppelin Tutorial

The first time you open it, Zeppelin ask you to set the Interpreter bindings:

interpreter bindings 1.JPG

Interpreter binding

Just scroll-down and save them:

interpreter-bindings-2

Save biding

Some notes are presented with different layouts. For more about the display system visit the documentation online.

Other possible annoying error

I was getting the following error when tried to run some notes in the Zeppelin Tutorial:

spark-warehouse folder 2.JPG

Spark warehouse URI error

I found a suggested solution in the following stack overflow question: link

An URI syntax exception trying to find the folder spark-warehouse in the Zeppelin folder. I struggled a little bit with that. The folder was not created in my Zeppelin directory, I thought it was a permissions problem, so I created it manually and assigned 777 permissions.

spark-warehouse-folder

spark-warehouse folder permission settings

It still failed. In the link above a forum user suggested to use triple slashes to define the proper path file:///C:/zeppelin-0.6.2-bin-all/spark-warehouse

But I still don’t know where to place this configuration. I couldn´t do it in the spark shell, also not while creating a spark session (zeppelin does it for me) and the conf/spark-defaults.conf doesn´t seem to be a good idea for Zeppelin because I was using the spark built-in version.

Finally, I remembered that is possible to add additional spark setting in the interpreter configuration page and I just navigated there and created it:

warehouse-dir

spark.sql.warehouse.dir

Just as additional info, you can verify the settings saved in this page in the file Drive:\ZEPELLIN_DIR\conf\interpreter.json

spark-warehouse folder 3.JPG

interpreter.json

After these steps, I was able to run all of the notes from the Zeppelin tutorials.

running-notes-zeppelin-tutorial

Running the load data into table note

Note that the layout from the tutorial is telling you more or less the order in which you have to execute the notes. The note “Load data into table” must be executed before you play the notes below. I guess that is the reason it spans over the whole width of the page, because it must be executed before you visualize or analyze the data, while the notes below could be executed in parallel, or in any order. I mean, this layout is not a must but it helps to keep an execution order.

note reults.JPG

Visualizing data with Zeppelin

I hope this helps you on your way to learn Zeppelin!

About Paul Hernandez

I'm an Electronic Engineer and Computer Science professional, specialized in Data Analysis and Business Intelligence Solutions. Also a father, swimmer and music lover.
This entry was posted in Analytics, data visualization, R, Spark and tagged , , , , . Bookmark the permalink.

2 Responses to Apache Zeppelin installation on Windows 10

  1. Pingback: Installing Zeppelin On Windows 10 – Curated SQL

  2. This is so useful! I was struggling to get this running on Windows 10 – thanks for the comprehensive breakdown

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s