Loading a CSV file of more than 10 GB in 7 minutes and 51 seconds

Exchanging files between systems has been around for a while and, in my opinion, it will not disappear any time soon. This type of data exchange typically persists because:
  • Security rules prevent connecting directly to the source system
  • The source system is outside the organization’s network, so a direct connection is not possible
  • There is no compatible data connector for the source system
  • etc.
Most systems can generate and read text files, and the formats are rather simple. The most common is the comma-separated values format, better known as CSV. Everyone knows these files and how to handle them: tools like Excel can be used to examine them, and every database system, including SQL Server, provides a way to load them.
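As a point of reference, the simplest way to load such a file into SQL Server is a single BULK INSERT statement. This is only a minimal sketch: the table name, file path and delimiters below are placeholders and must match the actual file.

BULK INSERT dbo.SalesStaging                 -- hypothetical staging table
FROM 'D:\Import\sales_extract.csv'           -- hypothetical file path
WITH (
    FIELDTERMINATOR = ',',                   -- column delimiter used in the file
    ROWTERMINATOR   = '\n',                  -- row delimiter used in the file
    FIRSTROW        = 2,                     -- skip the header row
    TABLOCK                                  -- bulk update lock for a faster, minimally logged load
);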

CSV files are not exempt from our growing need for data, so they keep growing as well. As a result, processing these files takes more and more time, and IT departments increasingly run out of their available daily batch window for loading and processing data.

Once the files are loaded, the database knows how to handle large volumes of data using parallelism and compression. Query execution times of one second on hundreds of millions of rows are rather common on a Fast Track Data Warehouse solution, which is in sharp contrast with loading a large CSV file, which can take several hours.
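To illustrate where that contrast comes from: the Fast Track reference architecture typically stores fact data in compressed tables that can be scanned in parallel. A minimal sketch of such a table, with made-up names and columns:

CREATE TABLE dbo.FactSales                   -- hypothetical fact table
(
    SaleDate   date  NOT NULL,
    StoreId    int   NOT NULL,
    ProductId  int   NOT NULL,
    Quantity   int   NOT NULL,
    Amount     money NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);              -- page compression reduces I/O during large scans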

More and more of our customers contact us asking how to handle the problem of loading large files. This paper describes a general solution developed for a fairly typical production environment with one ETL server running Integration Services and one SQL Server Fast Track Data Warehouse server.
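The paper goes into the details. As a rough illustration of the general idea behind parallel loading, and not necessarily the exact technique the paper uses, a large file can be split into chunks that are bulk loaded concurrently, for example from parallel SSIS tasks. All names below are hypothetical.

-- Run each statement on its own connection so the chunks load in parallel;
-- bulk loading a heap with TABLOCK allows concurrent, minimally logged inserts.
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_01.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_02.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_03.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);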

I have put everything in an easy-to-read document that you can download here.
