My experience building Hadoop 2.7.2 on Windows Server 2012

Introduction

Building the Hadoop sources on Windows can be cumbersome, even though the official documentation states: “… building a Windows package from the sources is fairly straightforward”. There are several good resources containing the steps needed to successfully build a distribution. The most useful for me was this one:

Hadoop 2.7.1 for Windows 10 binary build with Visual Studio 2015 (unofficial)

The directions in that blog post solved most of my hurdles (thanks, Kplitz Kahran), but I still had to suffer a little bit more. In this post I will show you the additional details I needed to fix.

You can try with the information in the link above, and if you have no luck, these solutions could help you.

Build winutils project – error C2065: ‘L’: undeclared identifier

In order to solve this problem, I just rewrote this line of code:

const WCHAR* wsceConfigRelativePath = WIDEN_STRING(STRINGIFY(WSCE_CONFIG_DIR)) L"\\" WIDEN_STRING(STRINGIFY(WSCE_CONFIG_FILE));

as:

const WCHAR* wsceConfigRelativePath = STRINGIFY(WSCE_CONFIG_DIR) "\\" STRINGIFY(WSCE_CONFIG_FILE);

Basically, the concatenation of these values is explicitly converted to wide characters (WCHAR) using a macro. I tested the concatenation without this explicit conversion and it worked. I am not sure why it fails with the macro; if someone can explain it, I will really appreciate it.
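I can only guess at the cause, but widening macros of this kind are usually defined along the following lines (an assumed sketch for illustration, not the actual Hadoop source):

/* Assumed definitions for illustration, not the actual Hadoop source */
#define STRINGIFY_INNER(x) #x    /* turns a token into a narrow string literal */
#define STRINGIFY(x) STRINGIFY_INNER(x)
#define WIDEN_INNER(x) L ## x    /* pastes the L prefix onto a narrow literal */
#define WIDEN_STRING(x) WIDEN_INNER(x)

If the token pasting ever leaves the L prefix detached from the string literal, the compiler sees a stray identifier named L, which is exactly what the C2065 message complains about. That would explain why the line builds once the widening macro is removed.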

Build native project – LINK Error to libwinutils

The libwinutils library is an external reference of the native project. Verify in the project properties, Linker section, that the “Additional Library Directories” entry includes the output folder of the libwinutils project.
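For illustration, the entry could look something like this (a hypothetical path; the real one depends on where your libwinutils build writes its .lib output):

$(SolutionDir)..\libwinutils\$(Platform)\$(Configuration)

$(Platform) and $(Configuration) are standard Visual Studio macros that expand to values such as x64 and Release.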

Here is a screencast with the steps above:

[Screencast: fixing the winutils build errors]

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin

After solving the previous errors, I thought I was the master of the universe, until the next error dampened my enthusiasm again. Luckily, the fix proposed by Rushikesh Garadade in this Stack Overflow thread solved the issue:

http://stackoverflow.com/questions/21752279/failed-to-execute-goal-org-apache-maven-pluginsmaven-antrun-plugin1-6-run-pr

After this, the build crashed again, but fortunately it was just a temporary network error. And finally, the happiest image of the day:

[Screenshot: successful Hadoop build]

Hope that helps.

References

Build and Install Hadoop 2.x or newer on Windows

Hadoop 2.7.1 for Windows 10 binary build with Visual Studio 2015 (unofficial)

Working with Strings


Apache Kafka 0.8 on Windows

A very helpful step-by-step tutorial to help us learn and play with modern technologies using our Windows computer.

JanSchulte.com

Apache Kafka is a scalable, distributed messaging system, which is getting increasingly popular and is used by renowned companies such as LinkedIn, Tumblr, Foursquare, Spotify and Netflix [1].

Setting up a Kafka development environment on a Windows machine requires some configuration, so I created this little step-by-step installation tutorial for all the people who want to save themselves some hours of work 😉



Apache Spark installation on Windows 10


Introduction

This post is to help people install and run Apache Spark on a computer with Windows 10 (it may also help with prior versions of Windows, or even with Linux and Mac OS systems) who want to try out and learn how to interact with the engine without spending too many resources. If you really want to build a serious prototype, I strongly recommend installing one of the virtual machines I mentioned in this post a couple of years ago, Hadoop self-learning with pre-configured Virtual Machines, or spending some money on a Hadoop distribution in the cloud. The new versions of these VMs come with Spark ready to use.

A few words about Apache Spark

Apache Spark is making a lot of noise in the IT world as a general engine for large-scale data processing, able to run programs up to 100x faster than Hadoop MapReduce, thanks to its in-memory computing capabilities. It is possible to write Spark applications using Java, Python, Scala and R, and it comes with built-in libraries to work with structured data (Spark SQL), graph computation (GraphX), machine learning (MLlib) and streaming (Spark Streaming).

Spark runs on Hadoop, on Mesos, in the cloud or standalone. The last is the case of this post. We are going to install Spark 1.6.0 as standalone on a computer with a 32-bit Windows 10 installation (my very old laptop). Let’s get started.

Install or update Java

For any application that uses the Java Virtual Machine, it is always recommended to install the appropriate Java version. In this case I just updated my Java version as follows:

Start –> All apps –> Java –> Check For Updates

[Screenshot: Update Java]

In the same way you can verify your Java version. This is the version I used:

[Screenshot: Java version]
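You can also check the version from a command prompt:

java -version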


Download Scala

Download from here. Then execute the installer.

I just downloaded the binaries for my system:

[Screenshot: Scala download]

Download Spark

Select any of the prebuilt versions from here.

As we are not going to use Hadoop, it makes no difference which version you choose. I downloaded the following one:

[Screenshot: Spark download]

Feel free also to download the source code and make your own build if you feel comfortable with it.

Extract the files to any location in your drive with enough permissions for your user.

Download winutils.exe

This was the critical point for me, because I downloaded one version and it did not work until I realized that there are 64-bit and 32-bit versions of this file. Here you can find them accordingly:

32-bit winutils.exe

64-bit winutils.exe

To make my trip even longer, I had to install Git to be able to download the 32-bit winutils.exe. If you know another link where this file can be found, please share it with us.

Git client download (I hope you don’t get stuck in this step)

Extract the folder containing the file winutils.exe to any location of your preference.

Environment Variables Configuration

This is also crucial in order to run some commands without problems using the command prompt.

  • _JAVA_OPTION: I set this variable to the value shown in the figure below. I was getting Java heap memory problems with the default values, and this setting fixed that problem.
  • HADOOP_HOME: even though Spark can run without Hadoop, the version I downloaded is prebuilt for Hadoop 2.6 and looks for it in the code. To fix this inconvenience, I set this variable to the folder containing the winutils.exe file.
  • JAVA_HOME: usually you already set this variable when you install Java, but it is better to verify that it exists and is correct.
  • SCALA_HOME: the bin folder of the Scala location. If you used the standard location from the installer, it should be the path in the figure below.
  • SPARK_HOME: the bin folder path of where you uncompressed Spark. See the sketched example values right after this list.
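As a rough sketch, my variables looked similar to the following. All values and paths here are assumed examples, so adjust them to your own locations:

_JAVA_OPTION=-Xmx512M -Xms512M
HADOOP_HOME=C:\Hadoop\winutils
JAVA_HOME=C:\Program Files\Java\jre1.8.0_73
SCALA_HOME=C:\Program Files (x86)\scala\bin
SPARK_HOME=C:\Spark\spark-1.6.0-bin-hadoop2.6\bin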

[Screenshot: Environment Variables 1/2]

[Screenshot: Environment Variables 2/2]

Permissions for the folder tmp/hive

I struggled a little bit with this issue. After I set everything, I tried to run spark-shell from the command line and I was getting an error that was hard to debug. The shell tries to find the folder tmp/hive and is not able to set the SQL context.

I looked at my C drive and found that the C:\tmp\hive folder had already been created. If it is not there, you can create it yourself and set 777 permissions on it. In theory you can do this with the advanced sharing options of the Sharing tab in the folder's properties, but I did it this way from the command line using winutils:

Open a command prompt as administrator and type:

[Screenshot: Set 777 permissions for tmp/hive]
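The command in the screenshot was along these lines; the winutils.exe location is an assumed example path, so adjust it to wherever you extracted the file:

C:\Hadoop\winutils\winutils.exe chmod 777 \tmp\hive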


Please be aware that you need to adjust the path of the winutils.exe above if you saved it to another location.

We are finally done and can start the spark-shell, which is an interactive way to analyze data using Scala or Python. This way we are also going to test our Spark installation.

Using the Scala Shell to run our first example

In the same command prompt go to the Spark folder and type the following command to run the Scala shell:
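Assuming your prompt is already in Spark's root folder (not inside bin), the command is simply:

bin\spark-shell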

[Screenshot: Start the Spark Scala shell]

After some execution lines you should be able to see a similar screen:

[Screenshot: Shell started]

You are going to receive several warnings and informational messages in the shell because we have not set different configuration options. For now, just ignore them.

Let’s run our first program with the shell. I took the example from the Spark Programming Guide. The first command creates a resilient distributed dataset (RDD) from a text file included in Spark’s root folder. After the RDD is created, the second command just counts the number of items inside:
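For reference, the two commands from the guide are the following; run them inside the shell, and note that README.md is the text file shipped in Spark's root folder:

scala> val textFile = sc.textFile("README.md")
scala> textFile.count()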

[Screenshot: Running a Spark example]

And that’s it. I hope you can follow my explanation and run this simple example. I wish you a lot of fun with Apache Spark.

References

Why does starting spark-shell fail with NullPointerException on Windows?

Apache Spark checkpoint issue on windows

Configure Standalone Spark on Windows 10


Business Intelligence without excuses part 1 – Business Analytics Platform Installation

Disclaimer

This first tutorial is part of a series that I’m planning in order to show how to use Pentaho to build BI applications. The expected audience is people without previous knowledge of Pentaho, and for this reason I decided to start from the very beginning. I think and hope that students or professionals who want to step into BI will find these tutorials useful.

For experienced Pentaho users, I recommend this article to catch up on what’s new in BA Server 5.0 CE: A first look to the new Pentaho BA Server 5.0 CE

Introduction

The renamed Pentaho Business Analytics Platform is the central component that hosts the content of our BI application. From the platform it is possible to run and display reports and dashboards, manage security, perform OLAP analysis and many other tasks.

All Pentaho software, except the Pentaho Mobile App, requires the Sun/Oracle version 1.7 distribution of the Java Runtime Environment (JRE) or Java Development Kit (JDK); therefore, it is essential that Java is installed and that at least the variable JRE_HOME or JAVA_HOME is configured. I show how to set the JAVA_HOME system variable in a Windows environment.
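As a sketch, you can set it machine-wide from a command prompt opened as administrator; the JDK path below is an assumed example, so point it to your actual installation:

setx JAVA_HOME "C:\Program Files\Java\jdk1.7.0_45" /M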

As I mentioned in the first post of this series, the first step is to download the BA Server from:

Pentaho Community 

Pentaho BA Server CE 5.0 installation 

Plugins Installation

Using the Marketplace plugin (which comes with the default installation) it is possible to install other useful plugins, which are going to be used to design dashboards, perform OLAP analysis, etc.

Users and Roles

The default installation comes with a set of users and their respective directories. Users with the admin role can see all of the directories. There is also a “Public” folder, where the examples shown in the screencast above are stored.


Summary

The Pentaho Business Analytics Platform hosts Pentaho-created and user-created content. It is open source and can be easily downloaded and installed. If you are a developer, especially a Java developer, I encourage you to dive in and study how the whole platform is built, understand the architecture behind it and, why not, collaborate with the community.

In future posts I will examine in detail some important features and characteristics of the server.


Business Intelligence without excuses part 0 – Pentaho CE 5.0

It is well known that an enormous amount of data is collected every day, every hour, every minute, not only in companies from different business sectors but also privately…
The first question that always arises is what can be learned from these data.
There are a variety of technologies on the market to create applications that aim to develop the sequence:

Data -> Information -> Knowledge

I don't want to discuss which one is better; I have experience with both open source and non-open source tools and I have nothing to complain about. I just want to present, in a series of posts, the Pentaho Community Edition products and how to build a complete Business Intelligence application. I’ll try to cover the basics and some advanced tasks, but keep in mind that the tutorials are intended for people with zero knowledge of Pentaho. If you are an experienced Pentaho user, you may not find these tutorials interesting.

The first step is to download and install the Business Analytics Platform:
You can find it here: Pentaho Community

And remember, it is free. I’ll try to show you the basics, and at the end you will have NO EXCUSES not to profit from Business Intelligence.


Hadoop self-learning with pre-configured Virtual Machines

The first obstacle I found when I tried to learn Hadoop is that I don’t have a cluster at home and I don’t want to pay for resources in the cloud. Even if you have access to a cluster, setting up Hadoop can be an arduous task. There are so many new things to learn that I didn’t want to spend my time fighting the Hadoop setup, because it could turn out frustrating.
The good news is there are pre-configured Hadoop virtual machines that will help you learn by yourself.
Here I list three options, each one from a different Hadoop vendor. This is not a survey of Hadoop virtual machines, which would be very nice, by the way.
The scope of this post is just to give some information about the possibility of learning Hadoop using your laptop or desktop computer.
Free pre-configured Hadoop VM downloads:

Hortonworks Sandbox
Cloudera’s CDH4
MapR M3, M5 and M7

Hope you enjoy learning!


Deploy SSIS Packages across servers

Background
Recently I had to deploy a set of SSIS packages stored in the msdb of a development server to a test server. I should mention that an SSIS package can be stored in the file system, in a package store or in the msdb database of a SQL Server instance.
I don’t want to discuss the reasons for choosing a specific option, but if you’re curious I leave here a couple of articles:
Deployment Storage Options for SSIS Package
What are the advantages/disadvantages of storing SSIS packages to MSDB vs File System?

To organize my packages I’ve created a root folder with the name of the project, and three subfolders according to my task categories:

[Screenshot: folder structure]

The Task
The task to perform is to move all of the SSIS packages stored in this folder to a test server with similar settings (a SQL Server database running on Windows Server 2008 R2 Enterprise).

Discussion
If I had used the file system or the package store storage option, I would just need to move the packages to the new location, but this is not the case. A simple option is to use SSMS (Management Studio) and individually export each package:

[Screenshot: individual export]

But I’m too lazy to do that, and I want to move all of the packages at the same time.

There is a deployment utility for SSIS projects in the Business Intelligence Development Studio (BIDS). Here is a great article which explains its usage (it also applies to SQL Server 2008):
Deploying SSIS Packages in SQL Server 2005

The disadvantage of this “just click Next up to the end” option is that you can choose one and only one target folder. At the moment I couldn’t figure out a way to have logical folders in the BIDS Solution Explorer under the SSIS Packages folder and map them to their corresponding target folders in the msdb.

Another option is to use the PowerShell extension for SSIS:
http://sev17.com/2011/02/02/importing-and-exporting-ssis-packages-using-powershell/
The main drawback of this option is that these extensions need to be installed and the execution of scripts must be enabled, but it seems to work nicely.
My last option in the discussion, and the one that I finally used, is the DTUTIL command-line utility. But, like I said before, I’m too lazy, so I’ve created a couple of stored procedures that I want to share with you.

createSSISpackagesFolder
This stored procedure creates a folder in the root folder of the msdb database. The input parameter is the name of the folder. The name of the folder must be enclosed in single quotes if it contains special characters. Please feel free to modify this stored procedure to create subfolders; you will need to add another input parameter and look up the parent folder id.

USE [msdb]
GO

/****** Object: StoredProcedure [dbo].[createSSISpackagesFolder] Script Date: 04/02/2013 15:49:08 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[createSSISpackagesFolder]
@destFolder sysname = '' -- Specify the name of the folder in the msdb
AS
BEGIN
SET NOCOUNT ON;

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'xp_cmdshell', 1 -- enable xp_cmdshell, used later by copySSISpackagesPlus to run DTUTIL
RECONFIGURE

DECLARE @folders table(folderid uniqueidentifier, parentfolderid uniqueidentifier, foldername sysname);
insert into @folders
EXEC msdb.dbo.sp_ssis_listfolders '00000000-0000-0000-0000-000000000000';

IF (SELECT COUNT(*) FROM @folders WHERE foldername = @destFolder) = 0
BEGIN
    /* Add the folder under the msdb root */
    EXEC msdb.dbo.sp_ssis_addfolder
        @parentfolderid = '00000000-0000-0000-0000-000000000000',
        @name = @destFolder;
END

EXEC msdb.dbo.sp_ssis_listfolders '00000000-0000-0000-0000-000000000000';

END

GO
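A quick usage example; the folder name here is made up, so use your own project name:

EXEC msdb.dbo.createSSISpackagesFolder @destFolder = 'MyETLProject';
GO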

copySSISpackagesPlus
This is the modified version of the stored procedure I’ve found in this article:
http://www.databasejournal.com/features/mssql/article.php/3734096/Using-dtutil-to-copy-SSIS-packages-stored-in-SQL-Server.htm
The procedure receives seven parameters: the IP address or server name of the source and destination servers; the SQL Server credentials, if applicable, that is, a user name and password for each server (if Windows Authentication is used, the credentials are not needed); and the name of the source folder. The stored procedure must be run from the source server, and the source and target folder names must match. If you want another behavior, it is easy to add another input parameter to allow different source and target folder names.

USE [msdb]
GO

/****** Object: StoredProcedure [dbo].[copySSISpackagesPlus] Script Date: 04/02/2013 15:51:03 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[copySSISpackagesPlus]
@srcServer sysname='', -- Source server name
@destServer sysname='', -- Destination server name
@srcUser sysname = '', -- SQL Server login used to connect to the source server
@srcPassword sysname = '', -- Password of the SQL Server login on the source server
@destUser sysname = '', -- SQL Server login used to connect to the destination server
@destPassword sysname = '', -- Password of the SQL Server login on the destination server
@srcFolder sysname = '' -- Specify the name of the source folder in the msdb
AS
BEGIN
SET NOCOUNT ON;

DECLARE @execStrings table(Idx int identity(1,1), execCmd varchar(1000))

Insert into @execStrings(execCmd)
select 'dtutil /Quiet /COPY SQL;' +
case foldername when '' then '"' + [name] + '"' else '"' + foldername + '\' + [name] + '"' end
+ ' /SQL ' + case foldername when '' then '"' + [name] + '"' else '"' + foldername + '\' + [name] + '"' end
+ ' /SOURCESERVER ' + @srcServer
+ case @srcUser when '' then '' else ' /SourceUser ' + @srcUser + ' /SourcePassword ' + @srcPassword end
+ ' /DESTSERVER ' + @destServer
+ case @destUser when '' then '' else ' /DestUser ' + @destUser + ' /DestPassword ' + @destPassword end
from dbo.sysssispackages pkg join dbo.sysssispackagefolders fld
on pkg.folderid = fld.folderid
where fld.foldername = @srcFolder

DECLARE @cnt int = (select COUNT(*) from @execStrings);
DECLARE @tmpCmd varchar(1000);
WHILE(@cnt>0)
BEGIN
set @tmpCmd = (select execCmd from @execStrings where Idx = @cnt);
print @tmpCmd;
exec [master].[sys].[xp_cmdshell] @tmpCmd;
set @cnt = @cnt -1;
END

END

GO
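And a sample invocation, copying all the packages in the folder above from a development server to a test server using Windows Authentication on both sides (the server and folder names are hypothetical):

EXEC msdb.dbo.copySSISpackagesPlus
    @srcServer = 'DEVSQL01',
    @destServer = 'TESTSQL01',
    @srcFolder = 'MyETLProject';
GO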

Further questions or comments? Please write me.
