Introduction
After some time using R and SQL Server as two separate tools (not entirely true, since I had already imported data from SQL Server into RStudio), Microsoft is now offering R Services as part of SQL Server 2016. That looks very promising, especially for Microsoft BI professionals. One of the main advantages is keeping the analytics close to the data and working in an integrated environment.
In this post I will show some basic operations and how to get started with these technologies. I took most of the R code from this Microsoft walkthrough, which I highly recommend:
Data Science End-to-End Walkthrough
Prerequisites
- SQL Server 2016 – I installed the Enterprise edition, but it should also work with the other editions
- R Services: this is part of SQL Server 2016, so you need to add this feature during the initial installation or add it later
- R client: http://aka.ms/rclient/download
- WideWorldImporters database (WWI): download and documentation. This is the new sample database for SQL Server 2016, replacing the famous AdventureWorks
Set up R Services
That is very well documented on MSDN, so there is nothing to add from my side:
Set up SQL Server R Services (In-Database)
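Once the feature is configured, a quick way to verify that R Services actually works is to run a trivial external script from Management Studio. This is the standard smoke test from the Microsoft documentation; it simply echoes a one-row input back through R:

```sql
-- Enable external scripts if not done already (a restart of the SQL Server service is required)
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
GO

-- Smoke test: the R script returns the input data set unchanged
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT 1 AS hello'
WITH RESULT SETS (([hello] INT NOT NULL));
GO
```

If this returns a single row with the value 1, the R runtime is wired up correctly.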
Create a view to generate a dataset to analyze
Once you have installed the WWI database, create this view:
USE [WideWorldImportersDW]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW [Fact].[SalesByLocation]
AS
SELECT City,
       Location.Lat,
       Location.Long,
       SUM(Profit) AS SUM_Profit,
       AVG([Unit Price]) AS AVG_UnitPrice
FROM Fact.Sale s
INNER JOIN Dimension.City c
    ON s.[City Key] = c.[City Key]
GROUP BY City, Location.Lat, Location.Long
GO
Add the required R packages
We will need the following R packages:
- ggmap
- mapproj
- ROCR
- RODBC
Using the R client, we have two options:
- From the menu –> Package –> Install Packages
- Or running the following script
if (!('ggmap' %in% rownames(installed.packages()))) {
  install.packages('ggmap')
}
if (!('mapproj' %in% rownames(installed.packages()))) {
  install.packages('mapproj')
}
if (!('ROCR' %in% rownames(installed.packages()))) {
  install.packages('ROCR')
}
if (!('RODBC' %in% rownames(installed.packages()))) {
  install.packages('RODBC')
}
Create a connection to the SQL Server instance from the R client
library(RevoScaleR)

# Define the connection string
connStr <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;Trusted_Connection=True"

# Set ComputeContext
sqlShareDir <- paste("C:\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr,
                    shareDir = sqlShareDir,
                    wait = sqlWait,
                    consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)

sampleDataQuery <- "select * from [Fact].[SalesByLocation]"
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery,
                                connectionString = connStr,
                                rowsPerRead = 500)
The first step is to load the RevoScaleR library. This is an amazing library that allows you to write scalable, high-performance applications with R.
Then a connection string is defined, in my case using Windows authentication. If you want to use SQL Server authentication, the user name and password are needed in the connection string.
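As a sketch, the SQL Server authentication variant of the same connection string only swaps the trusted connection for explicit credentials (myLogin and myPassword are placeholders, not real values):

```r
# Hypothetical SQL Server authentication variant of connStr:
# Trusted_Connection is replaced by explicit Uid/Pwd placeholders
connStrSqlAuth <- "Driver=SQL Server;Server=HERZO01;Database=WideWorldImportersDW;Uid=myLogin;Pwd=myPassword"
```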
We define a local folder as the share directory used by the compute context.
RxInSqlServer: generates a SQL Server compute context using SQL Server R Services – documentation
Sample query: I already prepared the dataset in the view. This is a best practice to keep the query inside the R code small, and for me it is also easier to maintain.
RxSqlServerData: generates the data source object
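If you want to inspect the rows on the client, the data source object can be materialized into a local data frame. A minimal sketch, assuming the connection above works:

```r
# Run in the local compute context so the data is pulled to the client
rxSetComputeContext("local")

# Materialize the data source into a data frame and peek at it
salesByLocation <- rxImport(inDataSource)
head(salesByLocation)
nrow(salesByLocation)

# Switch back to the SQL Server compute context afterwards
rxSetComputeContext(cc)
```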
Get some basic statistics and visualize the dataset
# Dataset summary
rxGetVarInfo(data = inDataSource)
rxSummary(~SUM_Profit, data = inDataSource)
# Plot the distribution of the profit
rxHistogram(~SUM_Profit, data = inDataSource, title = "Sum of the profit")
# Plot the distribution of the average unit price
rxHistogram(~AVG_UnitPrice, data = inDataSource, title = "Average unit price")
In both histograms you can easily identify outliers, and we get a better understanding of the distribution of the data. This is where R plays an important role (as a tool): this kind of analysis is not performed by many BI professionals, or at least that is what I have seen in my professional life.
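The ggmap and mapproj packages installed earlier can also be used to put the view on a map, along the lines of the Microsoft walkthrough. This is only a sketch: it assumes the Lat/Long columns of the view come through with those names, and note that newer versions of ggmap require an API key for get_map:

```r
library(ggmap)
library(mapproj)

# Pull the data to the client for plotting
rxSetComputeContext("local")
salesByLocation <- rxImport(inDataSource)

# Base map plus one point per city, sized by the total profit
baseMap <- get_map(location = "united states", zoom = 4)
ggmap(baseMap) +
  geom_point(data = salesByLocation,
             aes(x = Long, y = Lat, size = SUM_Profit),
             color = "red", alpha = 0.5)
```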
Summary
In this post I demonstrated how we can get data into R client from SQL Server and perform some basic analysis over a simple dataset. What would be the next steps?
- Continue visualizing the data
- Create a machine learning model
- Integrate the R code in SQL Server using functions and stored procedures
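As a preview of the last point, the usual pattern is to wrap the R code in sp_execute_external_script inside a stored procedure. A minimal hypothetical sketch against the view created above (the procedure name and output columns are made up):

```sql
CREATE PROCEDURE dbo.GetProfitStats
AS
BEGIN
    EXEC sp_execute_external_script
        @language = N'R',
        @script = N'OutputDataSet <- data.frame(
                        MeanProfit = mean(InputDataSet$SUM_Profit),
                        MaxProfit  = max(InputDataSet$SUM_Profit))',
        @input_data_1 = N'SELECT SUM_Profit FROM Fact.SalesByLocation'
    WITH RESULT SETS ((MeanProfit FLOAT, MaxProfit FLOAT));
END
GO
```

With this in place, any SQL client can call the analysis with a plain EXEC, without knowing any R.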