Introduction to R Services and R client – SQL Server 2016

Introduction

After some time using R and SQL server as two different tools (not 100% true because I already have imported data from SQL Server into R Studio), now Microsoft is offering as part of the SQL Server 2016 R services. That seems to be very promising, especially for Microsoft BI professionals. One of the advantages is to keep analytics close to data and use an integrated environment.

In this post I will show some basic operations and how to get started with these technologies. I took most of the R code from this Microsoft walkthrough that I highly recommend to you:

Data Science End-to-End Walkthrough

Prerequisites

  • SQL Server 2016 – I installed the enterprise edition but with the others should work
  • R services: this is part of the SQL Server 2016, so you need to add this feature during the first installation or add it later
  • R client: http://aka.ms/rclient/download
  • WideWorldImporters database (WWI): download and documentation . This is the new sample database for SQL Server 2016 replacing the famous AdventureWorks

Set up R Services

That is very well documented in the MSDN, nothing to add from my side:

Set up SQL Server R Services (In-Database)

Create a view to generate a dataset to analyze

Once you installed the WWI database, create this view:

USE [WideWorldImportersDW]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create view [Fact].[SalesByLocation] as
select City, Location.Lat, Location.Long, sum(profit) AS SUM_Profit, AVG([Unit Price]) AVG_UnitPrice
from fact.Sale s
inner join Dimension.City c on s.[City Key] = c.[City Key]
group by City, Location.Lat, Location.Long
GO

 

Add the require R packages

We will need the following R packages:

  • Ggmap
  • Mapproj
  • ROCR
  • RODBC

Using the R client, we have to options:

  • From the menu –> Package –> Install Packages

load package 1

  • Or running the following script
if (!('ggmap' %in% rownames(installed.packages()))){ 
  install.packages('ggmap') 
} 
if (!('mapproj' %in% rownames(installed.packages()))){ 
  install.packages('mapproj') 
} 
if (!('ROCR' %in% rownames(installed.packages()))){ 
  install.packages('ROCR') 
} 
if (!('RODBC' %in% rownames(installed.packages()))){ 
  install.packages('RODBC') 
}

 

Create a connection to the SQL Server instance from r client

library(RevoScaleR)
# Define the connection string
connStr <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;
Trusted_Connection = True"
# Set ComputeContext
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
                    wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
sampleDataQuery <- "select * from [Fact].[SalesByLocation]"
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery,
 connectionString = connStr, rowsPerRead=500)

 

The first step is to load the RevoScaleR library. This is an amazing library that allows to create scalable and performant applications with R.

Then a connection string is defined, in my case using Windows Authentication. If you want to use SQL Server authentication the user name and password are needed.

We define a local folder as the compute context.

RxInSQLServer: generates a SQL Server compute context using SQL Server R Services – documentation

Sample query: I already prepared the dataset in the view, this is a best practice in order to reduce the size of the query in the R code and for me is also easier to maintain.

rxSQLServerData generates the data source object

Get some basic statistics and visualize the dataset

# Dataset summary

rxGetVarInfo(data = inDataSource)

rxSummary(~SUM_Profit, data = inDataSource)

summary

# Plot the distribution of the profit

rxHistogram(~SUM_Profit, data = inDataSource, title = "Sum of the profit")

sum profit histogram

#Plot the distribution of the average unit price

rxHistogram(~AVG_UnitPrice, data = inDataSource, title = "Average unit price")

average unit price histo

In both histograms you can easily identify outliers and we have a better understanding about the distribution of the data. Here is where R plays an important role (as a tool). This kind of analysis is not performed by many BI professionals, or at least this is what I have seen in my professional life.

Summary

In this post I demonstrated how we can get data into R client from SQL Server and perform some basic analysis over a simple dataset. What would be the next steps?

  • Continue visualizing the data
  • Create a machine learning model
  • Integrate the R code in SQL Server using functions and stored procedure

References

Data Science End-to-End Walkthrough

Big Data Analysis with Revolution R Enterprise 

One response to “Introduction to R Services and R client – SQL Server 2016”

  1. […] Paul Hernandez has an introduction to using the R client and RODBC to connect to SQL Server: […]

Leave a comment