This post is meant to help people install and run Apache Spark on a computer with Windows 10 (it may also help with earlier versions of Windows, or even with Linux and Mac OS), so that they can try out the engine and learn how to interact with it without spending too many resources. If you really want to build a serious prototype, I strongly recommend either installing one of the virtual machines I mentioned in this post a couple of years ago: Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.
A few words about Apache Spark
Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce thanks to its in-memory computing capabilities. It is possible to write Spark applications using Java, Python, Scala and R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).
Spark runs on Hadoop, on Mesos, in the cloud or standalone. The latter is the case in this post. We are going to install Spark 1.6.0 in standalone mode on a computer with a 32-bit Windows 10 installation (my very old laptop). Let’s get started.
Install or update Java
For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:
Start –> All apps –> Java –> Check For Updates
You can verify your Java version in the same way. This is the version I used:
Download Scala from here, then run the installer.
I just downloaded the binaries for my system:
Select any of the prebuilt Spark versions from here.
As we are not going to use Hadoop, it makes no difference which version you choose. I downloaded the following one:
Feel free also to download the source code and make your own build if you feel comfortable with it.
Extract the files to any location on your drive where your user has sufficient permissions.
This was the critical point for me, because I downloaded one version of winutils.exe and it did not work until I realized that there are 64-bit and 32-bit versions of this file. Here you can find them accordingly:
To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where this file can be found, please share it with us.
Git client download (I hope you don’t get stuck in this step)
Extract the folder containing the file winutils.exe to any location of your preference.
Environment Variables Configuration
This is also crucial in order to run some commands from the command prompt without problems.
- _JAVA_OPTIONS: I set this variable to the value shown in the figure below. I was getting Java heap memory problems with the default values, and this fixed them.
- HADOOP_HOME: even though Spark can run without Hadoop, the version I downloaded is prebuilt for Hadoop 2.6 and its code looks for it. To work around this inconvenience, I set this variable to the folder containing the winutils.exe file.
- JAVA_HOME: you usually set this variable when you install Java, but it is better to verify that it exists and is correct.
- SCALA_HOME: the bin folder of the Scala location. If you used the standard location from the installer, it should be the path in the figure below.
- SPARK_HOME: the bin folder path of where you uncompressed Spark.
Environment Variables 1/2
Environment Variables 2/2
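To double-check the variables listed above, a small Scala program can report which ones are visible to a new process and whether each points to an existing folder. This is only a sketch: the names `EnvCheck`, `required` and `check` are made up for illustration, and it assumes that every variable in the list except _JAVA_OPTIONS holds a folder path.

```scala
import java.io.File

object EnvCheck {
  // Variable names come from the list above; each is expected to hold a folder path.
  val required: Seq[String] = Seq("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "HADOOP_HOME")

  // Returns one (name, status) pair per variable, in the order of `required`.
  def check(env: Map[String, String]): Seq[(String, String)] =
    required.map { name =>
      val status = env.get(name) match {
        case None                                     => "not set"
        case Some(path) if new File(path).isDirectory => "ok: " + path
        case Some(path)                               => "set, but not an existing folder: " + path
      }
      (name, status)
    }

  def main(args: Array[String]): Unit =
    check(sys.env).foreach { case (name, status) => println(name + " -> " + status) }
}
```

You can paste the object into the Scala REPL and call `EnvCheck.main(Array())`. Open a new command prompt first, because a prompt that was already open does not see variables changed afterwards.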
Permissions for the folder tmp/hive
I struggled a little bit with this issue. After I had set everything up, I tried to run the spark-shell from the command line and got an error that was hard to debug. The shell tried to find the folder tmp/hive and was not able to set up the SQL context.
I looked at my C drive and found that the C:\tmp\hive folder had been created. If it is not there, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options on the Sharing tab of the folder properties, but I did it from the command line using winutils:
Open a command prompt as administrator and type:
Set 777 permissions for tmp/hive
Please be aware that you need to adjust the path of the winutils.exe above if you saved it to another location.
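In text form, this step comes down to calling winutils.exe with the arguments `chmod 777` followed by the folder, i.e. `<path to winutils.exe> chmod 777 C:\tmp\hive` in the command prompt. As a rough sketch, the same call can be built and run from Scala. The winutils location `C:\winutils\bin\winutils.exe` is a placeholder I made up, and `HivePermissions` and `chmodCommand` are hypothetical names.

```scala
import scala.sys.process._

object HivePermissions {
  // Builds the argument list for: winutils.exe chmod 777 <folder>
  def chmodCommand(winutils: String, folder: String): Seq[String] =
    Seq(winutils, "chmod", "777", folder)

  def main(args: Array[String]): Unit = {
    // Placeholder path (an assumption): adjust it to where you saved winutils.exe.
    val winutils = """C:\winutils\bin\winutils.exe"""
    // Runs the command and waits for it; fails with an exception if the file is not found.
    val exitCode = chmodCommand(winutils, """C:\tmp\hive""").!
    println("winutils exit code: " + exitCode)
  }
}
```

As with the manual command, run it from a prompt opened as administrator.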
We are finally done and can start the spark-shell, which is an interactive way to analyze data using Scala or Python. This will also test our Spark installation.
Using the Scala Shell to run our first example
In the same command prompt go to the Spark folder and type the following command to run the Scala shell:
Start the Spark Scala Shell
After some lines of output you should see a screen similar to this:
You are going to receive several warnings and informational messages in the shell because we have not set various configuration options. For now, just ignore them.
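Once the prompt appears, a quick way to confirm that the shell came up properly is to ask it for a few values. This is only a sketch to paste into the running spark-shell. It relies on `sc` and `sqlContext`, the two objects the Spark 1.6 shell creates on startup, so it will not compile in a plain Scala REPL.

```scala
// Paste into the running spark-shell, one line at a time.
sc.version                           // Spark version of the running shell
sc.master                            // master URL the shell is using
sqlContext                           // prints the SQL context if it was created successfully
scala.util.Properties.versionString  // Scala version used by the shell
System.getProperty("java.version")   // Java version of the underlying JVM
```

If `sqlContext` is reported as not found, the tmp/hive permissions described above are the first thing I would check.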
Let’s run our first program with the shell. I took the example from the Spark Programming Guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark’s root folder. After the RDD is created, the second command just counts the number of items in it:
Running a Spark Example
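In text form, the two commands look roughly like the sketch below. It assumes the text file is README.md, which ships in the Spark root folder, and that the shell was started from that folder so the relative path resolves. `sc` is the SparkContext the shell creates for you.

```scala
// Paste into spark-shell, started from the Spark root folder.
val textFile = sc.textFile("README.md") // creates an RDD with one item per line of the file
textFile.count()                        // action: returns the number of items in the RDD
```

The first line returns immediately because the file is only read when an action such as `count()` runs.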
And that’s it. I hope you were able to follow my explanation and run this simple example. I wish you a lot of fun with Apache Spark.