Skip to main content

Loading a CSV file of more than 10 GB in 7 minutes and 51 seconds

Exchanging files between systems has been around for a while and to my opinion it will not to disappear soon. The reasons that this type of data exchange is still around could be related to:
  • Security rules prevent to connect directly to the source system
  • The source system is outside the organization’s network so it is not possible to connect directly
  • There is no compatible data connector for the source system
  • etc.
Most systems can generate and read text files and the formats are rather trivial. The most common is the comma separated values format or better known as the CSV format. Everyone knows them and how to handle them. Tools like Excel can be used to examine them and every database system provides a connector to load them, including SQL Server.

CSV files do not escape on the fact that our data need is growing, thus so are CSV files. Due to that the processing of these files take more and more time and it happens nowadays more and more that IT departments ran out of their available daily batch time for loading and processing data.

Once the files are loaded the database knows how to handle the large volumes of data using parallelism and compression. Query execution times of 1 second on hundreds of millions of rows is rather common on a FastTrack Data Warehouse solution and this is in sharp contrast with the loading of a large CSV file, which can take several hours.

More and more our customers contact us with the question how to handle the problem of loading large files. This paper describe a general solution developed for a relative normal production environment with one ETL server using Integration Services and one SQL Server Fast Track Data Warehouse server.

I have put everything in a easy to read document that you can download here.

Comments

Popular posts from this blog

Privacy and the liberty to express yourself on LinkedIn

Unaware that LinkedIn has such a strong filtering policy that it does not allow me posting a completely innocent post on a Chinese extreme photography website I tried to post the following: "As an Mpx lover I was suprised to find out that the M from Million is now replaced by the B from Billion. This picture is 24 Bpx! Yes you read this well, 24 billion pixels.  Searching on the picture I stumbled on a fellow Nikon lover. If you want to search for him yourself you can find him here: http://www.bigpixel.cn/t/5834170785f26b37002af46a " In my eyes nothing is wrong with this post, but LinkedIn considers it as offending. I changed the lover words, but I could not post it.  Even taking a picture and post it will not let this pass:  Or my critical post on LinkedIn crazy posting policy: it will not pass and I cannot post it.  The technology LinkedIn shows here is an example what to expect in the near future.  Newspapers will have a unified re...

Windows Server 2016 with ... XBox extensions !!!

Microsoft must have been thinking that the live of a Windows administrator gets so boring that they need a distraction and they have integrated the XBox Live extensions standard in Windows Server 2016.  No kidding: take a look. I did not select it as a feature and it is there. As you can see the XBox Live service is started automatically even in mode manual (see error log).  To my opinion these 3 XBox Live services should never be available on a server. Unless the server is part of the XBox Live platform of cause but honestly I don't think that Microsoft will allow that.  Every IT manager with a serious Windows production environment would fire any administrator playing XBox Live on the production servers. So though it might be tempting don't do it Windows administrator. It is a trap! :-)

Windows Storage Spaces and SQL Server: a ride to super performance

This post is based on the tests I did to see if storage spaces in Windows 2012R2 can serve as a platform for our Fast Track environments. When Microsoft developed the Fast Track Data Warehouse architecture, which was at first very limited in hardware choice and for version SQL Server 2012 became a reference guide, Storage Spaces as a functionality in Windows did not exist.That has changed with the release of Windows 2012 and later on with version 2012R2 and soon 2016. Why is Storage Spaces as a storage technology so interesting for SQL Server?  Anyone who is a pro in SQL Server knows that parallelism - adding more disks - can greatly improve performance. Adding an hard disk for the tempdb and another one for the LOG files will do the job if the hard disks perform sufficiently (that might be another issue!). To my opinion (and also to others) SQL Server does not do a great job in using the available hardware. Before and after the installation it is mandatory to tune a SQL Se...