Business Intelligence without excuses part 1 – Business Analytics Platform Installation

Disclaimer

This first tutorial is part of a series that I’m planning in order to show how to use Pentaho to build BI applications. The expected audience is people without previous knowledge of Pentaho, so I decided to start from the very beginning. I hope that students and professionals who want to step into BI will find these tutorials useful.

For experienced Pentaho users, I recommend this article to catch up on what’s new in BA Server 5.0 CE: A first look to the new Pentaho BA Server 5.0 CE

Introduction

The renamed Pentaho Business Analytics Platform is the central component that hosts the content of our BI application. From the platform it is possible to run and display reports and dashboards, manage security, perform OLAP analysis and carry out many other tasks.

All Pentaho software, except the Pentaho Mobile App, requires the Sun/Oracle version 1.7 distribution of the Java Runtime Environment (JRE) or Java Development Kit (JDK). It is therefore essential that Java is installed and that at least the JRE_HOME or JAVA_HOME variable is configured; on Windows this can be done through the Environment Variables dialog or with a command such as setx JAVA_HOME "C:\path\to\your\jdk1.7" (the path is just an example). I show how to set the JAVA_HOME system variable in a Windows environment.

As I mentioned in the first post of this series, the first step is to download the BA Server from:

Pentaho Community 

Pentaho BA Server CE 5.0 installation 

Plugins Installation

Using the Marketplace plugin (which comes with the default installation) it is possible to install other useful plugins, which will be used to design dashboards, perform OLAP analysis, etc.

Users and Roles

The default installation comes with a set of users and their respective directories. Users with the admin role can see all of the directories. There is also a “Public” folder, where the examples shown in the screencast above are stored.


Summary

The Pentaho Business Analytics Platform hosts Pentaho-created and user-created content. It is open source and can easily be downloaded and installed. If you are a developer, especially a Java developer, I encourage you to dive in and study how the whole platform is built, understand the architecture behind it and, why not, collaborate with the community.

In future posts I will examine in detail some important features and characteristics of the server.

Posted in Business Intelligence, Pentaho

Business Intelligence without excuses part 0 – Pentaho CE 5.0

It is well known that not only in companies from different business sectors but also privately, an enormous amount of data is collected every day, every hour, every minute…
The first question that always arises is: what can be learned from these data?
There is a variety of technologies on the market for creating applications that aim to develop the sequence:

Data -> Information -> Knowledge

I don’t want to discuss which one is better; I have experience with both open source and non-open source tools and I have nothing to complain about. I just want to present, in a series of posts, the Pentaho Community Edition products and how to build a complete Business Intelligence application. I’ll try to cover the basics and some advanced tasks, but keep in mind that the tutorials are intended for people with zero knowledge of Pentaho. If you are an experienced Pentaho user you may not find them interesting.

The first step is to download and install the Business Analytics Platform.
You can find it here: Pentaho Community

And remember, it is free. I’ll show you the basics, and at the end you will have NO EXCUSES not to profit from Business Intelligence.

Posted in Business Intelligence, Pentaho

Hadoop self-learning with pre-configured Virtual Machines

The first obstacle I found when I tried to learn Hadoop was that I don’t have a cluster at home and I didn’t want to pay for resources in the cloud. Even if you have access to a cluster, setting up Hadoop can be an arduous task. There are so many new things to learn that I didn’t want to spend time setting up Hadoop, which could prove frustrating.
The good news is that there are pre-configured Hadoop virtual machines that will help you learn on your own.
Here I list three options, each one from a different Hadoop vendor. This is not a survey of Hadoop virtual machines (which would be very nice, by the way).
The scope of this post is just to point out the possibility of learning Hadoop on your laptop or desktop computer.
Pre-configured Hadoop VMs, free to download:

Hortonworks Sandbox
Cloudera’s CDH4
MapR M3, M5 and M7

Hope you enjoy learning!

Posted in Big Data, Business Intelligence

Deploy SSIS Packages across servers

Background
Recently I had to deploy a set of SSIS packages stored in the msdb of a development server to a test server. I should mention that an SSIS package can be stored in the file system, in a package store or in the msdb database of a SQL Server instance.
I don’t want to discuss the reasons to choose a specific option, but if you’re curious I leave here a couple of articles:
Deployment Storage Options for SSIS Package
What are the advantages/disadvantages of storing SSIS packages to MSDB vs File System?

To organize my packages I created a root folder with the name of the project, and three subfolders according to my task categories:

folder structure

The Task
The task to perform is to move all of the SSIS packages stored in this folder to a test server with similar settings (a SQL Server database running on Windows Server 2008 R2 Enterprise).

Discussion
If I had used the file system or the package store storage option, I would just need to move the packages to the new location, but this is not the case. A simple option is to use SSMS (Management Studio) and export each package individually:

individual export

But I’m too lazy to do that, and I want to move all of the packages at the same time.

There is a deployment utility for SSIS projects in the Business Intelligence Development Studio (BIDS). Here is a great article that explains its usage (it also applies to SQL Server 2008):
Deploying SSIS Packages in SQL Server 2005

The disadvantage of this “just click Next until the end” option is that you can choose one and only one target folder. At the moment I couldn’t figure out a way to create logical folders in the BIDS Solution Explorer under the SSIS Packages folder and map them to their corresponding target folders in the msdb.

Another option is to use the PowerShell extensions for SSIS:

http://sev17.com/2011/02/02/importing-and-exporting-ssis-packages-using-powershell/

The main drawback of this option is that the extensions need to be installed and the execution of scripts must be enabled, but it seems to work nicely.
My last option in the discussion, and the one that I finally used, is the DTUTIL command-line utility. But, as I said before, I’m too lazy, so I created a couple of stored procedures that I want to share with you.

createSSISpackagesFolder
This stored procedure creates a folder in the root folder of the msdb database. The input parameter is the name of the folder, which must be enclosed in single quotes if it contains special characters. Please feel free to modify this stored procedure to create subfolders; you will need to add another input parameter and look up the parent folder id.

USE [msdb]
GO

/****** Object: StoredProcedure [dbo].[createSSISpackagesFolder] Script Date: 04/02/2013 15:49:08 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[createSSISpackagesFolder]
    @destFolder sysname = '' -- Name of the folder to create in the msdb root
AS
BEGIN
    SET NOCOUNT ON;

    -- Enable xp_cmdshell, which copySSISpackagesPlus (below) needs to run dtutil
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell', 1;
    RECONFIGURE;

    -- List the folders under the msdb root (the all-zeros GUID is the root folder id)
    DECLARE @folders table (folderid uniqueidentifier, parentfolderid uniqueidentifier, foldername sysname);
    INSERT INTO @folders
    EXEC msdb.dbo.sp_ssis_listfolders '00000000-0000-0000-0000-000000000000';

    -- Create the folder only if it does not already exist
    IF (SELECT COUNT(*) FROM @folders WHERE foldername = @destFolder) = 0
    BEGIN
        /* Add a folder */
        EXEC msdb.dbo.sp_ssis_addfolder
            @parentfolderid = '00000000-0000-0000-0000-000000000000',
            @name = @destFolder;
    END

    -- Return the updated folder list
    EXEC msdb.dbo.sp_ssis_listfolders '00000000-0000-0000-0000-000000000000';

END
GO
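A quick usage sketch (the folder name is hypothetical): run it on whichever server should hold the folder, for example the test server before copying packages there.

-- Hypothetical example: create a 'MyProject' folder in the msdb root
EXEC msdb.dbo.createSSISpackagesFolder @destFolder = 'MyProject';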

copySSISpackagesPlus
This is a modified version of the stored procedure I found in this article:

http://www.databasejournal.com/features/mssql/article.php/3734096/Using-dtutil-to-copy-SSIS-packages-stored-in-SQL-Server.htm

The procedure receives seven parameters: the IP address or server name of the source and the target server; the SQL Server credentials, if applicable, that is, a user name and password for each server (when Windows Authentication is used the credentials are not needed); and the name of the folder. The stored procedure must be run from the source server, and the source and target folder names must match. If you want a different behavior, it is easy to add another input parameter to allow different source and target folder names.

USE [msdb]
GO

/****** Object: StoredProcedure [dbo].[copySSISpackagesPlus] Script Date: 04/02/2013 15:51:03 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE PROCEDURE [dbo].[copySSISpackagesPlus]
    @srcServer sysname = '',    -- Source server name
    @destServer sysname = '',   -- Destination server name
    @srcUser sysname = '',      -- SQL Server login used to connect to the source server
    @srcPassword sysname = '',  -- Password of the SQL Server login on the source server
    @destUser sysname = '',     -- SQL Server login used to connect to the destination server
    @destPassword sysname = '', -- Password of the SQL Server login on the destination server
    @srcFolder sysname = ''     -- Name of the source folder in the msdb
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @execStrings table (Idx int identity(1,1), execCmd varchar(1000));

    -- Build one dtutil /COPY command per package stored in the source folder
    INSERT INTO @execStrings (execCmd)
    SELECT 'dtutil /Quiet /COPY SQL;'
        + CASE foldername WHEN '' THEN '"' + [name] + '"' ELSE '"' + foldername + '\' + [name] + '"' END
        + ' /SQL ' + CASE foldername WHEN '' THEN '"' + [name] + '"' ELSE '"' + foldername + '\' + [name] + '"' END
        + ' /SOURCESERVER ' + @srcServer
        + CASE @srcUser WHEN '' THEN '' ELSE ' /SourceUser ' + @srcUser + ' /SourcePassword ' + @srcPassword END
        + ' /DESTSERVER ' + @destServer
        + CASE @destUser WHEN '' THEN '' ELSE ' /DestUser ' + @destUser + ' /DestPassword ' + @destPassword END
    FROM dbo.sysssispackages pkg
    JOIN dbo.sysssispackagefolders fld ON pkg.folderid = fld.folderid
    WHERE fld.foldername = @srcFolder;

    -- Run each dtutil command through xp_cmdshell
    DECLARE @cnt int = (SELECT COUNT(*) FROM @execStrings);
    DECLARE @tmpCmd varchar(1000);
    WHILE (@cnt > 0)
    BEGIN
        SET @tmpCmd = (SELECT execCmd FROM @execStrings WHERE Idx = @cnt);
        PRINT @tmpCmd;
        EXEC [master].[sys].[xp_cmdshell] @tmpCmd;
        SET @cnt = @cnt - 1;
    END

END
GO
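To illustrate how the two procedures work together, here is a hedged sketch (the server and folder names are hypothetical): first create the target folder on the destination server with createSSISpackagesFolder, then run the copy from the source server.

-- Run on TESTSRV: make sure the target folder exists
EXEC msdb.dbo.createSSISpackagesFolder @destFolder = 'MyProject';

-- Run on DEVSRV: copy every package in 'MyProject' to TESTSRV
-- (empty credentials mean Windows Authentication is used)
EXEC msdb.dbo.copySSISpackagesPlus
    @srcServer = 'DEVSRV',
    @destServer = 'TESTSRV',
    @srcUser = '', @srcPassword = '',
    @destUser = '', @destPassword = '',
    @srcFolder = 'MyProject';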

Further questions or comments? Please write me.

Posted in Business Intelligence, SQL Server

Store file names and modified dates in a table without a Foreach Loop Task – SQL Server – SSIS

Last week I received a request from the project manager: I had to improve the performance of an SSIS package that reads a collection of files from a source folder and then stores their name, extension, location and modified date. The last step is to move each file to a specific folder depending on a predefined criterion.
The initial design used an SSIS Foreach Loop Task to iterate over all of the files, retrieve the full path and then extract the modified date with a C# script.

foreachloop_files

This approach works well for a few files, but in the current scenario more than 20,000 files must be processed every day. Even though the design is acceptable, there is a better solution to accomplish this task (and many others).
After a discussion in the SQLServerCentral forum I decided to use the extended stored procedure xp_cmdshell (thanks so much, Phil Parkin, for your contribution).

xp_cmdshell executes a shell command. Its main argument is the command string to be executed; it also has a no_output parameter that suppresses the output.
I know that SSIS is a pretty cool tool, with its boxes, dropdown lists and so on, but sometimes we can get our hands dirty and write a command-shell script, which executes in a lower layer on the server and gives us the opportunity to improve the performance of a process.
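As a minimal illustration of the call shape (the directory is just a placeholder, and xp_cmdshell must first be enabled as shown below):

-- Run a dir and discard the output (no_output suppresses the result set)
EXEC master..xp_cmdshell 'dir C:\Temp', no_output;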

The first thing to do before using xp_cmdshell is to enable it, because it is disabled by default. The usual sp_configure sequence takes care of that:
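-- xp_cmdshell is an advanced option, so advanced options must be shown first
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;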

In this link you can find an example of using the command to retrieve file names and insert them into a table: Get List Of Files From a Windows Directory to SQL Server

I had to adapt it because I needed not only the file name but also the modified date.

This is my version of the script:

-- Variable declaration
-- Source folder of the PDF Documents

DECLARE @link2Pdfs varchar(256) = ?
-- Complete path for the command shell
DECLARE @Path varchar(256) = 'dir '+@link2Pdfs+'*.pdf '
-- Command to be executed
DECLARE @Command varchar(1024) = @Path + ' /A-D /T:W'
-- Return code of the extended procedure
DECLARE @rc int;
-- The source type of the files
DECLARE @source nvarchar(20) = ?;

-- Create a table to temporarily store the raw command-line results
create table #output (id int identity(1,1), output nvarchar(255) null)

insert #output (output) exec @rc = master..xp_cmdshell @Command

INSERT INTO [STG_CardRequestsDigitals]
([FileName]
,[MatchInput]
,[LinkToPdf]
,[Source]
,[fileCreatedOn]
,[STG_loadedOn])
select Q1.fileName
,SUBSTRING(Q1.fileName,6,LEN(Q1.fileName)-12) MatchInput
,@link2Pdfs+Q1.fileName link2Pdf
,@source source
,CONVERT(DATETIME,SUBSTRING(Q1.modifiedDate,7,4)+'-'+SUBSTRING(Q1.modifiedDate,4,2)+'-'+SUBSTRING(Q1.modifiedDate,1,2)+' '+SUBSTRING(Q1.modifiedDate,12,6)+':00') modifiedDate
,GETDATE() STG_loadedOn
from
(select SUBSTRING(output,1,17) modifiedDate
,SUBSTRING(output,CHARINDEX(' ',output,31)+1,LEN(output)-CHARINDEX(' ',output,31)) fileName
from #output
where output is not null and id>4 and id < (select MAX(id) from #output)-2
group by id,output
)Q1

drop table #output
GO

At the beginning of the script, some variables are declared to parameterize the source and target folders according to my needs. I used “?” because the script is executed inside an SSIS Execute SQL Task, which maps its parameters to the question marks. The command to be executed is also parameterized. Then a temporary table is created to store the results of the command, with one column for the result line number and another for the line itself.
The next step is to query the temporary table to extract the desired data.
A dir command, when used without the “/B” (bare format) option, returns the list of files and folders with a header and a footer. Therefore I filter the internal query with “where output is not null and id>4 and id < (select MAX(id) from #output)-2”, which removes both the header and the footer from the result set. Then I perform several SUBSTRINGs to obtain the file name, and a substring of the file name that is meaningful in my case. The relevant part is the extraction of the modified date: I used the dir parameter “/T:W”, which returns the last-written time of each file, and manipulated the string to obtain the desired date format, in my case “YYYY-MM-DD hh:mm:ss”.
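To make those offsets concrete, this is roughly the raw line layout the SUBSTRING positions assume; the exact column positions depend on the server’s locale and regional settings, so treat the sample below as an assumption and verify it against your own dir output (the file name shown is hypothetical):

-- Assumed layout of one raw "dir /A-D /T:W" output line (1-based positions):
--   02.04.2013  15:49           245,760 CARD_12345678_ABC.pdf
--   characters 1-17 hold the modified date and time; the file name starts
--   after the size column, located with CHARINDEX(' ', output, 31) above.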

The results: the former process took more than an hour to complete, while the new one takes just a few seconds (about 15).

If you have questions please do not hesitate to write me.

Posted in Business Intelligence, SQL Server

Read and Write variables in a Script Component in SSIS (SQL Server Integration Services) using C#

The Script Component is an SSIS data flow component, and it differs from the Script Task found in the Control Flow. It has some limitations when reading and writing values of package-level or data-flow-task-level variables.

There is more than one way to read and write variables; the easiest is to add the desired variables to the ReadOnlyVariables and ReadWriteVariables lists in the custom properties of the component, as needed.

script component settings

The limitation of this method is that the ReadWrite variables can only be used in the PostExecute method (the ReadOnly variables can be read from anywhere in the script).

A better option, at least for me, is to use the variable dispenser. Its methods are a little bit tricky, but in the end they bring more flexibility.

When the variable dispenser is used, adding the variables to ReadOnlyVariables and ReadWriteVariables is not needed. I successfully tested the following methods for reading and writing:

//Write a value in the specify variable.
//Example: this.WriteVariable("User::receiptKey", Row.TransactionExtId.ToString())

private void WriteVariable(String varName, Object varValue)
{
  IDTSVariables100 vars = null;
  // Lock the variable for writing and fetch the locked collection
  VariableDispenser.LockForWrite(varName);
  VariableDispenser.GetVariables(out vars);
  try
  {
    vars[varName].Value = varValue;
  }
  catch (Exception ex)
  {
    MessageBox.Show(ex.Message);
  }
  finally
  {
    // Release the lock only if the collection was actually obtained
    if (vars != null)
      vars.Unlock();
  }
}

//Read a variable and return its value.
//Example: currentReceipt = this.ReadVariable("User::receiptKey").ToString();

private Object ReadVariable(String varName)
{
  Object varValue;
  IDTSVariables100 vars = null;
  // Lock the variable for reading and fetch the locked collection
  VariableDispenser.LockForRead(varName);
  VariableDispenser.GetVariables(out vars);
  try
  {
    varValue = vars[varName].Value;
  }
  catch (Exception ex)
  {
    varValue = null;
    MessageBox.Show(ex.Message);
  }
  finally
  {
    // Release the lock only if the collection was actually obtained
    if (vars != null)
      vars.Unlock();
  }
  return varValue;
}

Hope you find it useful.

Posted in Business Intelligence

Manage packages stored in the MSDB database with T-SQL

Last week we performed a typical migration from a development server to a production server.

Both servers run SQL Server 2008 R2. I had created a set of SSIS ETL packages and wanted to move them to the production environment too.

In our development environment we usually store the packages in the msdb database or in a package store. To administer packages stored with either of these options, I connect to the Integration Services service using Management Studio. That way I can create, edit or delete packages and folders with only a few clicks.

Everything was migrated to the production server. Then I tried to connect to Integration Services using Management Studio. Can you guess what our surprise was?

The production server is installed on a cluster node, and the Integration Services service is not a cluster service. It can be configured in a cluster, but Microsoft strongly recommends avoiding it because the disadvantages outweigh the advantages. Link to Microsoft Documentation.

The important thing to realize is that the Integration Services service is not needed to run or develop SSIS packages; it only supports the administrative interface in SQL Server Management Studio for listing, starting, stopping, monitoring, importing and exporting Integration Services packages.

I also found this post very interesting, and it helped me with this issue: Recommended Post

After losing this capability I started looking for another way to administer my packages, and in the end I found this set of built-in stored procedures that made my life easy:

  • msdb.dbo.sp_ssis_addfolder
  • msdb.dbo.sp_ssis_listfolders
  • msdb.dbo.sp_ssis_deletefolder
  • msdb.dbo.sp_ssis_deletepackage
  • msdb.dbo.sp_ssis_listpackages
  • msdb.dbo.sp_ssis_renamefolder

Although the names of the procedures are self-explanatory, it was hard for me to find useful documentation, so here are some examples:

/* Add a folder */
EXEC msdb.dbo.sp_ssis_addfolder
    @parentfolderid = '00000000-0000-0000-0000-000000000000',
    @name = 'My_Packages';

/* List folders */
EXEC msdb.dbo.sp_ssis_listfolders '00000000-0000-0000-0000-000000000000';
/* Returns, among other columns, folderid = '0EAB9DEE-2B77-4821-9C36-EE70A7C76A92', the ID of the folder */

/* Delete a folder */
EXEC msdb.dbo.sp_ssis_deletefolder @folderid = '0EAB9DEE-2B77-4821-9C36-EE70A7C76A92';
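For completeness, here are hedged examples of the remaining procedures; the folder GUID and package name are placeholders, and you should double-check the parameter names against the procedure definitions in your own msdb:

/* List the packages stored in a folder */
EXEC msdb.dbo.sp_ssis_listpackages @folderid = '0EAB9DEE-2B77-4821-9C36-EE70A7C76A92';

/* Delete a package by name from a folder */
EXEC msdb.dbo.sp_ssis_deletepackage
    @name = 'MyPackage',
    @folderid = '0EAB9DEE-2B77-4821-9C36-EE70A7C76A92';

/* Rename a folder */
EXEC msdb.dbo.sp_ssis_renamefolder
    @folderid = '0EAB9DEE-2B77-4821-9C36-EE70A7C76A92',
    @newname = 'My_Packages_Renamed';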

I hope you find my experience useful, or at least that it gives you some clues if you face a similar issue.

Posted in Business Intelligence