Last week I received a request from the project manager. I had to improve the performance of an SSIS package that reads a collection of files from a source folder, and then stores their name, extension, location and modified date. The last step is to move each file to a specific folder depending on a predefined criterion.
The initial design uses a SSIS Foreach Loop Task in order to iterate all of the files, retrieve the full path, and then extract the modified date with a C# script.
This approach works well for a few files, but for the current scenario, more than 20.000 files must be processed every day. Even when the design is acceptable, there is a better solution to accomplish this task (or many more).
After a discussion in the SQLServerCentral forum I decided to use the extended procedure xp_cmdshell (thanks so much Phil Parkin for your contribution).
The xp_cmdshell executes a shell command. Its main argument is the command string to be executed. It has also a non_output parameter.
I know that SSIS is a pretty cool tool, with its boxes, dropdown list and so on, but sometimes we can get our hands dirty and write a command shell script, which is executed in a lower layer within the server and bring us the opportunity to improve the performance of a process.
The first thing to use the xp_cmdshell is to enable it because is disabled by default:
In this link an example of the command usage to retrieve and insert the file names in a table could be found: Get List Of Files From a Windows Directory to SQL Server
I had to adapt it, because I need not only the file name but also the modified date.
This is my version of the script:
-- Variable declaration
-- Source folder of the PDF Documents
DECLARE @link2Pdfs varchar(256) = ?
-- Complete path for the command shell
DECLARE @Path varchar(256) = 'dir '+@link2Pdfs+'*.pdf '
-- Command to be executed
DECLARE @Command varchar(1024) = @Path + ' /A-D /T:W'
-- Auxiliar counter for the sp
DECLARE @rc int;
-- The source type of the files
DECLARE @source nvarchar(20) = ?;
-- Create a table to temporary store the command line raw results
create table #output (id int identity(1,1), output nvarchar(255) null)
insert #output (output) exec @rc = master..xp_cmdshell @Command
INSERT INTO [STG_CardRequestsDigitals]
,CONVERT(DATETIME,SUBSTRING(Q1.modifiedDate,7,4)+'-'+SUBSTRING(Q1.modifiedDate,4,2)+'-'+SUBSTRING(Q1.modifiedDate,1,2)+' '+SUBSTRING(Q1.modifiedDate,12,6)+':00') modifiedDate
(select SUBSTRING(output,1,17) modifiedDate
,SUBSTRING(output,CHARINDEX(' ',output,31)+1,LEN(output)-CHARINDEX(' ',output,31)) fileName
where output is not null and id>4 and id < (select MAX(id) from #output)-2
group by id,output
drop table #output
At the beginning of the script, some variables are declared to parameterize the source and target folders according to my needs. I used “?” because my script is executed inside an SSIS Execute SQL Task. The command to be executed is also parameterized. Then, a temporary table is created to store the results of the command, which is a column for the result line number, and another column to store the result line.
The next step is to query the temporary table to extract the desired data.
A dir command, when it is used without the “/B” (bare format) option, returns the list of files and folder with a header and a footer. Therefore, I filter the internal query using “where output is not null and id>4 and id < (select MAX(id) from #output)-2”. This removes both, the header and footer from the result set. Then I performed several SUBSTINGS to obtain the file name and a substring of the file name because is meaningful in my case. The relevant part is the extraction of the modified date. I used the dir command parameter “/T:W” which return the last written time of the file. I manipulated the string to obtain the desired format of the date, in my case “YYYY-MM-DD hh:mm:ss”
The results: The former process takes more than an hour to complete and the new one just a few seconds (about 15 seconds).
If you have questions please do not hesitate to write me.