Apache Spark installation on Windows 10

 

Introduction

This post is meant to help people who want to install and run Apache Spark on a computer with Windows 10 (it may also help with prior versions of Windows or even Linux and Mac OS systems), and who want to try out the engine and learn how to interact with it without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in this post a couple of years ago: Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.

A few words about Apache Spark

Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing. According to the project, it can run programs up to 100x faster than Hadoop MapReduce thanks to its in-memory computing capabilities. It is possible to write Spark applications using Java, Python, Scala and R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).

Spark runs on Hadoop, Mesos, in the cloud or standalone. The latter is the case in this post: we are going to install Spark 1.6.0 in standalone mode on a computer with a 32-bit Windows 10 installation (my very old laptop). Let’s get started.

Install or update Java

For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:

Start –> All apps –> Java –> Check For Updates

[Figure: Update Java]

In the same way you can verify your Java version. This is the version I used:

[Figure: Java Version]
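The version can also be checked from a script. The following is only a minimal sketch, assuming Python 3.7 or later is installed (Python is not part of the original setup, it is just used here as a convenient scripting language). It relies on the standard `java -version` command, which prints its report to stderr; the function names are made up for illustration.

```python
import re
import subprocess

def parse_java_version(output):
    """Extract the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None

def installed_java_version():
    """Run `java -version` and return the reported version, or None."""
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None  # java is not on the PATH
    return parse_java_version(result.stderr + result.stdout)

if __name__ == "__main__":
    print(installed_java_version())
```

If the script prints None, Java is either not installed or not on the PATH.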

 

Download Scala

Download from here. Then execute the installer.

I just downloaded the binaries for my system:

[Figure: Scala Download]


Download Spark

Select any of the prebuilt versions from here.

As we are not going to use Hadoop, it makes no difference which version you choose. I downloaded the following one:

[Figure: Spark Download]

Feel free to download the source code and make your own build if you feel comfortable with it.

Extract the files to any location on your drive where your user has sufficient permissions.
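The prebuilt Spark packages are gzipped tar archives (.tgz). If no archive tool is at hand, the extraction can be sketched in Python 3 as below. This is only an illustration: the function name and both paths in the usage comment are hypothetical and must be adapted to your own download and target folder.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, target_dir):
    """Unpack a .tgz archive into target_dir and return its top-level entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(entry.name for entry in target.iterdir())

# Hypothetical usage; adjust both paths to your system:
# extract_archive(r"C:\Users\me\Downloads\spark-1.6.0-bin-hadoop2.6.tgz", r"C:\spark")
```

The returned list should contain a single Spark folder, which is the one referred to later as the Spark folder.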

Download winutils.exe

This was the critical point for me, because I downloaded one version and it did not work until I realized that there are 64-bit and 32-bit versions of this file. Here you can find them accordingly:

32-bit winutils.exe

64-bit winutils.exe

To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where this file can be found, please share it with us.

Git client download (I hope you don’t get stuck in this step)

Extract the folder containing the file winutils.exe to any location of your preference.
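Since picking the wrong bitness was the main stumbling block, it can help to check which kind of winutils.exe you actually downloaded before going on. The sketch below (Python 3, my own helper names) reads the PE header of an executable: the 4-byte value at offset 0x3C points to the "PE" signature, and the 2-byte machine field right after it is 0x014C for 32-bit x86 and 0x8664 for 64-bit x64. The path in the usage comment is only an example.

```python
import struct

# Machine codes from the PE/COFF file header.
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x64)"}

def pe_architecture(data):
    """Return the architecture of a Windows executable given its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not a Windows executable (missing MZ header)")
    # Offset 0x3C holds the position of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, "unknown (0x%04X)" % machine)

def file_architecture(path):
    """Read the first 4 KB of the file, which normally contains the header."""
    with open(path, "rb") as handle:
        return pe_architecture(handle.read(4096))

# Example usage (adjust the path to where you saved the file):
# print(file_architecture(r"C:\Hadoop\bin\winutils.exe"))
```

A 64-bit winutils.exe will not run on a 32-bit Windows installation, so the result should match your system.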

Environment Variables Configuration

This is also crucial in order to run some commands without problems from the command prompt.

  • _JAVA_OPTIONS: I set this variable to the value shown in the figure below. I was getting Java heap memory problems with the default values and this fixed them.
  • HADOOP_HOME: even though Spark can run without Hadoop, the version I downloaded is prebuilt for Hadoop 2.6 and its code looks for it. To work around this, I set this variable to the folder containing the winutils.exe file (Hadoop expects to find the file at %HADOOP_HOME%\bin\winutils.exe).
  • JAVA_HOME: usually this variable is already set when you install Java, but it is better to verify that it exists and is correct.
  • SCALA_HOME: the bin folder of the Scala installation. If you used the standard location from the installer, it should be the path in the figure below.
  • SPARK_HOME: the path of the bin folder where you uncompressed Spark.

 

[Figure: Environment Variables 1/2]

[Figure: Environment Variables 2/2]
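To verify the variables listed above without clicking through the dialogs, a small check can be scripted. This is a sketch in Python 3 with names of my own choosing. It only reports whether each variable is set and, for the path variables, whether the folder exists; it does not know the values from the figures. Note that it checks `_JAVA_OPTIONS`, which is the name the JVM actually reads.

```python
import os

REQUIRED = ["_JAVA_OPTIONS", "HADOOP_HOME", "JAVA_HOME", "SCALA_HOME", "SPARK_HOME"]

def check_environment(env):
    """Return a dict mapping each required variable to a short status string."""
    report = {}
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            report[name] = "missing"
        elif name == "_JAVA_OPTIONS":
            report[name] = "set"  # holds JVM flags, not a path
        elif os.path.isdir(value):
            report[name] = "ok"
        else:
            report[name] = "set, but folder not found"
    return report

if __name__ == "__main__":
    for name, status in check_environment(os.environ).items():
        print("%-14s %s" % (name, status))
```

Open a new command prompt before running it, because a prompt that was already open does not see newly created variables.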

 

Permissions for the folder tmp/hive

I struggled a little with this issue. After setting everything up, I tried to run spark-shell from the command line and got an error that was hard to debug. The shell tries to find the folder tmp/hive and was not able to set up the SQL context.

I looked at my C drive and found that the C:\tmp\hive folder had been created. If it does not exist, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options on the Sharing tab of the folder's properties, but I did it from the command line using winutils:

Open a command prompt as administrator and type:

[Figure: Set 777 permissions for tmp/hive]

Please be aware that you need to adjust the path of winutils.exe above if you saved it to another location.
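The call has the form `winutils.exe chmod 777 \tmp\hive`, preceded by the folder where you saved winutils.exe. As a rough sketch, the same step can be scripted in Python 3. The helper names are mine, and the layout (winutils.exe inside a bin subfolder of the Hadoop folder) is an assumption that you should adapt to your own location. The script has to run from an administrator prompt, just like the manual command.

```python
import ntpath
import os
import subprocess

def build_chmod_command(hadoop_home, folder=r"\tmp\hive"):
    """Build the winutils call that sets 777 permissions on the folder."""
    # ntpath always joins with Windows-style backslashes.
    # Assumption: winutils.exe sits in the bin subfolder of hadoop_home.
    winutils = ntpath.join(hadoop_home, "bin", "winutils.exe")
    return [winutils, "chmod", "777", folder]

def fix_hive_permissions(hadoop_home, folder=r"\tmp\hive"):
    """Create the folder if needed, then run winutils on it (Windows only)."""
    os.makedirs(folder, exist_ok=True)
    subprocess.run(build_chmod_command(hadoop_home, folder), check=True)

# Example usage on Windows (adjust the folder to your own location):
# fix_hive_permissions(r"C:\Hadoop")
```

With `check=True` the script raises an error if winutils fails, for example when the executable does not match your Windows version.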

We are finally done and can start the spark-shell, which is an interactive way to analyze data using Scala or Python. This way we are also going to test our Spark installation.

Using the Scala Shell to run our first example

In the same command prompt, go to the Spark folder and run the spark-shell script from its bin folder to start the Scala shell:

 

[Figure: Start the Spark Scala Shell]

After some lines of output you should see a screen similar to this:

[Figure: Shell started]

You are going to receive several warnings and informational messages in the shell because we have not set various configuration options. For now, just ignore them.

Let’s run our first program with the shell. I took the example from the Spark Programming Guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark’s root folder. After the RDD is created, the second command just counts the number of items in it:

[Figure: Running a Spark Example]
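The two commands were run in the Scala shell. For readers who prefer Python, the same two steps can be sketched with PySpark as follows. This is an illustration under assumptions, not the code from the screenshot: it assumes PySpark is importable (for example when the script is launched through Spark's own scripts), and the function names and the application name are made up. Inside the interactive Python shell a context named `sc` already exists, so there you would call `sc.textFile(...)` and `count()` directly instead of creating a new context. A plain Python helper is included to cross-check the result, since an RDD built from a text file has one item per line.

```python
def count_lines_plain(path):
    """Count the lines of a text file without Spark, as a cross-check."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def count_lines_spark(path):
    """Create an RDD from a text file and count its items (one per line)."""
    # Imported here so the plain helper works even without PySpark installed.
    from pyspark import SparkContext
    sc = SparkContext("local", "first-example")  # the app name is arbitrary
    try:
        text_file = sc.textFile(path)  # first step: build the RDD
        return text_file.count()       # second step: count its items
    finally:
        sc.stop()

# Example usage from the Spark folder:
# print(count_lines_plain("README.md"))
# print(count_lines_spark("README.md"))
```

Both functions should normally report the same number for README.md.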

 

And that’s it. I hope you were able to follow my explanation and run this simple example. I wish you a lot of fun with Apache Spark.


About Paul Hernandez

I'm an Electronic Engineer and Computer Science professional, specialized in Data Analysis and Business Intelligence Solutions. Also a father, swimmer and music lover.
This entry was posted in Data Processing Engines.

12 Responses to Apache Spark installation on Windows 10

  1. Paul Barbadew says:

    Hi Paul
    This is a great help to me, but it seems I’m doing something wrong.
    I have Windows 10 Pro 64-bit. I downloaded the 64-bit winutils.exe, but when I tried to execute:
    C:\WINDOWS\system32>c:\Hadoop\bin\winutils.exe chmod 777 \tmp\hive
    I got an error saying that winutils.exe is not compatible with the Windows version:

    “This version of c:\Hadoop\bin\winutils.exe is not compatible with the version of Windows you are running. Check your computer's system information and then contact the software publisher.”

    Must I download the whole folder where winutils.exe is?

    Do you have any idea what I’m doing wrong?

    Best regards

  3. Hi Paul,
    The winutils issue was my headache. Please try the following:
    – Copy the content of the whole library and try again.
    – If this doesn’t help, try to build the Hadoop sources yourself. I wrote a post about it (https://wordpress.com/stats/day/hernandezpaul.wordpress.com). It was also a pain in the a…
    – If you don’t want to go this way, just let me know and I will share a link to download the winutils I built. I built it on 64-bit Windows Server, but it should also work on Windows 10.
    – The last thing I can offer is to download the Hadoop binaries that this blogger offers in this post: http://kplitzkahran.blogspot.de/2015/08/hadoop-271-for-windows-10-binary-build.html
    The download link is at the very end of the post.
    Kind Regards,
    Paul

    • Paul Barbadew says:

      Hi Paul,

      I downloaded the whole library and it seems fine!
      Only some warnings appear, but the program is now running.

      Thank you very much!!

  5. Hwee Xing says:

    Hi Paul,

    How do you configure your environment variables? I am facing some problems here.

  7. Evgenii Ermolov says:

    Thanks a lot!

  9. chandan says:

    Works perfectly. Thanks a ton
