
Advanced techniques to process and load data efficiently

In this story, I would like to discuss the things I like about Pandas and often use in the ETL applications I write to process data. We’ll touch on exploratory data analysis, data cleansing and data frame transformations. I’ll demonstrate a few of my favourite techniques to optimize memory usage and process large amounts of data efficiently using this library. Working with relatively small datasets in Pandas is rarely a problem. It handles data in data frames with ease and provides a very convenient set of commands to process it. When it comes to data transformations on much bigger data frames (1 GB and more), I would normally use Spark and distributed compute clusters. Spark can handle terabytes and petabytes of data, but running all that hardware will probably cost a lot of money. That’s why Pandas might be a better choice when we have to deal with medium-sized datasets in environments with limited memory resources.
Pandas and Python generators
In one of my previous stories I wrote about how to process data efficiently using generators in Python [1].
It’s a simple trick to optimize memory usage. Imagine that we have a huge dataset somewhere in external storage. It can be a database or just a simple large CSV file. Imagine that we need to process this 2–3 TB file and apply some transformation to each row of data in it. Let’s assume that the service that performs this task has only 32 GB of memory. This limits how we can load the data: we can’t load the whole file into memory and split it line by line using a simple Python split('\n') call. The solution is to process it row by row and yield each row in turn, freeing the memory for the next one. This helps us create a constantly streaming flow of ETL data into the final destination of our data pipeline. It can be anything: a cloud storage bucket, another database, a data warehouse solution (DWH), a streaming topic or another…
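To make the row-by-row idea concrete, here is a minimal sketch using plain Python generators. The function names (read_rows, transform), the column names and the in-memory sample are all assumptions made for illustration. In a real pipeline the source would be a large file or database cursor, and the sink would be one of the destinations mentioned above.

```python
import csv
import io
from typing import Iterable, Iterator


def read_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed CSV row at a time instead of building a full list."""
    for row in csv.DictReader(lines):
        yield row


def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Apply a per-row transformation lazily.

    The transformation itself (upper-casing a hypothetical "name" column)
    is only a placeholder.
    """
    for row in rows:
        row["name"] = row["name"].upper()
        yield row


# Stand-in for a huge file; in practice this could be
# open("big_file.csv", newline="") with a hypothetical path.
sample = io.StringIO("id,name\n1,alice\n2,bob\n")

# Stand-in for the final destination (bucket, database, topic, ...).
sink = []
for processed_row in transform(read_rows(sample)):
    # Only one row is materialized by the generators at any moment.
    sink.append(processed_row)
```

Because both functions are generators, nothing is read until the loop asks for the next row, so peak memory depends on the size of a single row and not on the size of the file.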