First and foremost this is not a guide but an outline of my experience.
I was recently tasked with writing an ETL process at work and it was something that I really enjoyed doing. The goal was to take 20 million or more documents from a Mark Logic database and move them into a Postgresdatabase. Easy enough. However the caveat was that all the data in Mark Logic is in an XML format and the desired data format in Postgres is Json. Still, thanks to Json.NET this isn't too difficult a problem, just a few lines of code. I'll walk through the process that I went with below.
What is ETL?
An ETL process is a process to E xtract T ransform and L oad data from one database to another. Depending on your datasets this could be an easy task or a very time consuming process. In my case, I was lucky enough that our principle engineer already had classes setup in our primary application that I could re-use to easily read the XML data from Mark Logic. A great example of why it's important to write clean, re-useable code.
Planning the Application
During the planning stages of the process I knew that I would be dealing with a lot of data. It might take N hours to load 5 million records, what if the application crashes before it reaches the end? We don't want to load the same 5 million records again. I would need to write a way to keep track of the records that were already inserted. In addition to that, I would need a configurable file that lets us manipulate it's contents after run-time. The file would need to be executable in any environment which presented challenges of it's own. These were my two biggest concerns, the actual extracting, transforming and loading of data was at the back of mind at this point.
So one of the first things you're going to want to do is create a Config file. There are lots of options when it comes to creating a config, I went with using a json file. Here is a more detailed guide by Gary Woodfine on setting up a Configuration file in .Net Core for console applications. By using a Json file for your configuration, you get to use Name-Value pairs which is a really easy and straightforward way to use your settings.
My appsettings.json file is designed to hold three primary bits of information. First the batch size for the amount of files I want to move at once. There will be more on this in the next section below. Secondly the connection details for the MarkLogic database that I will be connecting to and grabbing data from. Finally, the Postgres database connection information that I will be inserting data into. Your results may vary depending on how complex your console applications becomes. This would be a great place to introduce additional settings.
The Extraction process should be relatively straight forward. The goal is to grab the data and then break it into chunks that you can control better. For example, if you're writing an ETL process for 100,000,000 rows of data, you likely aren't going to do it all in one go. What if the process fails half way through? How much memory would that take to run the process? More than likely your program would crash for one reason or another. So instead you want to do this in increments.
In my situation, I am grabbing all of the IDs from the original database (roughly 20 million at the time of writing) and then storing those IDs in chunks of 10,000. Then for 10,000 documents at a time I am using their IDs and grabbing the actual documents from MarkLogic. This is a setting in my appsettings.json file and it can be adjusted based on performance or requirements.
After grabbing the 10,000 documents by their Ids, I needed to transform the data from XML to Json. This is where the Json.NET library comes into play. With just a few simple lines of code Json.NET will translate most things to Json, however based on your needs, your results may very. Note the transform section will be short here, because the Json.NET library handled most of the heavy lifting for me.
Loading should be pretty similar to extracting in the sense that you're connecting to a database using the information in your config file and then transferring data. In my case I am connecting to Postgres and inserting a single row for each document. This row holds the unique file ID, the Jsonb document, the created and updated dates and finally a transnational ID to help prevent duplicate data being added once the base loading was completed.
Initially I was using INSERT statements for getting the data into Postgres however after some research I found that using insert statements were not the most efficent way to handle this. Postgres supports COPY, which can transfer data between a file or text and a table in binary format. I found that using COPY opposed to INSERT saved me roughly 50% overall time in the loading process, which as you can imagine is more helpful as your dataset grows. I highly recommend researching if COPY is the right choice for you when writing an ETL process.
Testing is going to be an important step in your process. You don't want to just try to run your application in production, that's not a good place to find out you made an error in your transform logic or load logic. I highly recommend setting up a docker container with Postgres and connecting locally there first. You can set this to be an option in your appsettings.json file and you will be able to easily switch between environments. For setting up a Docker container for Postgres, here's a useful article. Once you've tested your application, you're ready to transfer real data.
At my company this console application is actively being used and it runs daily to continuously move data from one database to another. As always, there is room for optimization however for our needs this simple console application fits the bill pretty well. Thanks to using asynchronous tasks in the console application, I can create an order for which type of documents I want to pull and then run the application and have it do all the heavy lifting for me. The application does what it was created to do and gets us the results we need so I would call it a success.
TL;DR Tips Summarized
- Use Existing Code and Libraries when applicable.
- Asynchronous programming is likely something you'll want to use here.
- Create a Config File, this will help cut down on duplicate code.
- If you're dealing with a lot of data, break it into chunks.
- Transforming and Loading may be tightly knitted together thanks to libraries that can convert data from one format to another (like XML to Json). This isn't a bad thing.
- Explore all your options when trying to optimize your code. For example, when dealing with large chunks of data using Postgres, COPY is a lot faster than INSERT.
- Create a Recovery Process. You don't want to get 19,000,000/ 20,000,000 rows of data moved and then have your application crash, only to have to do the first 19,000,000 records again.
- Test Locally first, Docker is great for this sort of thing.