Loading a CSV file of more than 10 GB in 7 minutes and 51 seconds

Exchanging files between systems has been around for a while and, in my opinion, it will not disappear any time soon. This type of data exchange typically persists because:
  • Security rules prevent connecting directly to the source system
  • The source system is outside the organization’s network, so a direct connection is not possible
  • There is no compatible data connector for the source system
  • etc.
Most systems can generate and read text files, and the formats are rather simple. The most common is the comma-separated values format, better known as CSV. Everyone knows these files and how to handle them: tools like Excel can be used to examine them, and every database system, including SQL Server, provides a way to load them.
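As a point of reference, the simplest way to load such a file into SQL Server is a single BULK INSERT statement. This is only a minimal sketch: the table name, file path and delimiters below are placeholders and must match the actual file.

BULK INSERT dbo.SalesStaging                 -- hypothetical staging table
FROM 'D:\Import\sales_extract.csv'           -- hypothetical file path
WITH (
    FIELDTERMINATOR = ',',                   -- column delimiter used in the file
    ROWTERMINATOR   = '\n',                  -- row delimiter used in the file
    FIRSTROW        = 2,                     -- skip the header row
    TABLOCK                                  -- bulk update lock for a faster, minimally logged load
);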

CSV files are not exempt from our growing need for data, so they keep growing as well. As a result, processing these files takes more and more time, and IT departments increasingly run out of their available daily batch window for loading and processing data.

Once the files are loaded, the database knows how to handle large volumes of data using parallelism and compression. Query execution times of one second on hundreds of millions of rows are rather common on a Fast Track Data Warehouse solution, which is in sharp contrast with loading a large CSV file, which can take several hours.
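To illustrate where that contrast comes from: the Fast Track reference architecture typically stores fact data in compressed tables that can be scanned in parallel. A minimal sketch of such a table, with made-up names and columns:

CREATE TABLE dbo.FactSales                   -- hypothetical fact table
(
    SaleDate   date  NOT NULL,
    StoreId    int   NOT NULL,
    ProductId  int   NOT NULL,
    Quantity   int   NOT NULL,
    Amount     money NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);              -- page compression reduces I/O during large scans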

More and more of our customers contact us asking how to handle the problem of loading large files. This paper describes a general solution developed for a fairly typical production environment with one ETL server running Integration Services and one SQL Server Fast Track Data Warehouse server.
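The paper goes into the details. As a rough illustration of the general idea behind parallel loading, and not necessarily the exact technique the paper uses, a large file can be split into chunks that are bulk loaded concurrently, for example from parallel SSIS tasks. All names below are hypothetical.

-- Run each statement on its own connection so the chunks load in parallel;
-- bulk loading a heap with TABLOCK allows concurrent, minimally logged inserts.
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_01.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_02.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
BULK INSERT dbo.SalesStaging FROM 'D:\Import\chunks\part_03.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);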

I have put everything in an easy-to-read document that you can download here.
