Store file names and modified dates in a table without a Foreach Loop Task – SQL Server – SSIS

Last week I received a request from the project manager. I had to improve the performance of an SSIS package that reads a collection of files from a source folder, and then stores their name, extension, location and modified date. The last step is to move each file to a specific folder depending on a predefined criterion.
The initial design uses a SSIS Foreach Loop Task in order to iterate all of the files, retrieve the full path, and then extract the modified date with a C# script.

foreachloop_files

This approach works well for a few files, but for the current scenario, more than 20.000 files must be processed every day. Even when the design is acceptable, there is a better solution to accomplish this task (or many more).
After a discussion in the SQLServerCentral forum I decided to use the extended procedure xp_cmdshell (thanks so much Phil Parkin for your contribution).

The xp_cmdshell executes a shell command. Its main argument is the command string to be executed. It has also a non_output parameter.
I know that SSIS is a pretty cool tool, with its boxes, dropdown list and so on, but sometimes we can get our hands dirty and write a command shell script, which is executed in a lower layer within the server and bring us the opportunity to improve the performance of a process.

The first thing to use the xp_cmdshell is to enable it because is disabled by default:

In this link an example of the command usage to retrieve and insert the file names in a table could be found: Get List Of Files From a Windows Directory to SQL Server

I had to adapt it, because I need not only the file name but also the modified date.

This is my version of the script:

-- Variable declaration
-- Source folder of the PDF Documents

DECLARE @link2Pdfs varchar(256) = ?
-- Complete path for the command shell
DECLARE @Path varchar(256) = 'dir '+@link2Pdfs+'*.pdf '
-- Command to be executed
DECLARE @Command varchar(1024) = @Path + ' /A-D /T:W'
-- Auxiliar counter for the sp
DECLARE @rc int;
-- The source type of the files
DECLARE @source nvarchar(20) = ?;

-- Create a table to temporary store the command line raw results
create table #output (id int identity(1,1), output nvarchar(255) null)

insert #output (output) exec @rc = master..xp_cmdshell @Command

INSERT INTO [STG_CardRequestsDigitals]
([FileName]
,[MatchInput]
,[LinkToPdf]
,[Source]
,[fileCreatedOn]
,[STG_loadedOn])
select Q1.fileName
,SUBSTRING(Q1.fileName,6,LEN(Q1.fileName)-12) MatchInput
,@link2Pdfs+Q1.fileName link2Pdf
,@source source
,CONVERT(DATETIME,SUBSTRING(Q1.modifiedDate,7,4)+'-'+SUBSTRING(Q1.modifiedDate,4,2)+'-'+SUBSTRING(Q1.modifiedDate,1,2)+' '+SUBSTRING(Q1.modifiedDate,12,6)+':00') modifiedDate
,GETDATE() STG_loadedOn
from
(select SUBSTRING(output,1,17) modifiedDate
,SUBSTRING(output,CHARINDEX(' ',output,31)+1,LEN(output)-CHARINDEX(' ',output,31)) fileName
from #output
where output is not null and id>4 and id < (select MAX(id) from #output)-2
group by id,output
)Q1

drop table #output
GO

At the beginning of the script, some variables are declared to parameterize the source and target folders according to my needs. I used “?” because my script is executed inside an SSIS Execute SQL Task. The command to be executed is also parameterized. Then, a temporary table is created to store the results of the command, which is a column for the result line number, and another column to store the result line.
The next step is to query the temporary table to extract the desired data.
A dir command, when it is used without the “/B” (bare format) option, returns the list of files and folder with a header and a footer. Therefore, I filter the internal query using “where output is not null and id>4 and id < (select MAX(id) from #output)-2”. This removes both, the header and footer from the result set. Then I performed several SUBSTINGS to obtain the file name and a substring of the file name because is meaningful in my case. The relevant part is the extraction of the modified date. I used the dir command parameter “/T:W” which return the last written time of the file. I manipulated the string to obtain the desired format of the date, in my case “YYYY-MM-DD hh:mm:ss”

The results: The former process takes more than an hour to complete and the new one just a few seconds (about 15 seconds).

If you have questions please do not hesitate to write me.

About Paul Hernandez

I'm an Electronic Engineer and Computer Science professional, specialized in Data Analysis and Business Intelligence Solutions. Also a father, swimmer and music lover.
This entry was posted in Business Intelligence, SQL Server and tagged , , , , . Bookmark the permalink.

4 Responses to Store file names and modified dates in a table without a Foreach Loop Task – SQL Server – SSIS

  1. Martijn says:

    Hello Paul.
    The script you wrote is more or less exactly what i needed.
    One question though: Could you provide the create statement for the table STG_CardRequestsDigitals?

    I keep running into a statement error.
    I thought the [fileCreatedOn] & [STG_loadedOn] were of the value “datetime” but i get a “Arithmetic overflow error converting expression to data type datetime.” error.

    Thanks for the help.
    martijn

    • Hi Martijn,

      the script to create the table is the following:

      CREATE TABLE [dbo].[STG_CardRequestsDigitals](
      [ID_STG_CardRequestsDigitals] [uniqueidentifier] NOT NULL,
      [MatchInput] [nvarchar](255) NULL,
      [FileName] [nvarchar](100) NULL,
      [LinkToPdf] [nvarchar](1024) NULL,
      [Source] [nvarchar](50) NULL,
      [fileCreatedOn] [datetime] NULL,
      [STG_loadedOn] [datetime] NULL,
      [CRM_contactId] [uniqueidentifier] NULL,
      [CRM_loadedOn] [datetime] NULL,
      [ERROR_code] [int] NULL,
      [ERROR_description] [nvarchar](1024) NULL,
      [ERROR_occuredOn] [datetime] NULL,
      CONSTRAINT [PK_STG_CardRequestsDigitals] PRIMARY KEY CLUSTERED
      (
      [ID_STG_CardRequestsDigitals] ASC
      )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
      ) ON [PRIMARY]

      GO

      ALTER TABLE [dbo].[STG_CardRequestsDigitals] ADD CONSTRAINT [DF_STG_CardRequestsDigitals_ID_STG_CardRequestsDigitals] DEFAULT (newid()) FOR [ID_STG_CardRequestsDigitals]
      GO

      Tell me if you have further questions.

      Kind Regards,

      Paul

  2. Paul,

    Excellent posting, thanks to Google it helped me to figure out how to retrieve the modified dates.

    One minor suggestion, there is an easier way to convert the dates without using a nested query. Instead of the following:
    select
    Q1.fileName
    ,CONVERT(DATETIME,SUBSTRING(Q1.modifiedDate,7,4)+’-‘+SUBSTRING(Q1.modifiedDate,4,2)+’-‘+SUBSTRING(Q1.modifiedDate,1,2)+’ ‘+SUBSTRING(Q1.modifiedDate,12,6)+’:00′) modifiedDate
    from
    (
    select
    SUBSTRING(output,1,17) modifiedDate
    ,SUBSTRING(output,CHARINDEX(‘ ‘,output,31)+1,LEN(output)-CHARINDEX(‘ ‘,output,31)) fileName
    from #output
    where
    output is not null and
    id>4 and
    id 4 and
    id <
    (
    select MAX(id)
    from #output
    ) -2
    group by id,output

    As you can see, by using CONVERT(DATETIME, [date string], [format string]), the need for the subquery is eliminated. The 105 string was to convert my UK-based MM/dd/yyyy, but other formats could be converted using different codes.

    Thanks again for the great post!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s