
Pandas for Data Engineers


Advanced techniques to process and load data efficiently

Towards Data Science
AI-generated image using Kandinsky

In this story, I would like to talk about the things I like in Pandas and use often in the ETL applications I write to process data. We'll touch on exploratory data analysis, data cleansing and data frame transformations. I'll demonstrate some of my favourite techniques to optimize memory usage and process large amounts of data efficiently using this library. Working with relatively small datasets in Pandas is rarely a problem. It handles data in data frames with ease and provides a very convenient set of commands to process it. When it comes to data transformations on much bigger data frames (1 GB and more), I would normally use Spark and distributed compute clusters. Spark can handle terabytes and petabytes of data, but running all that hardware will probably cost a lot of money. That's why Pandas might be a better choice when we have to deal with medium-sized datasets in environments with limited memory resources.

Pandas and Python generators

In one of my previous stories I wrote about how to process data efficiently using generators in Python [1].

It's a simple trick to optimize memory usage. Imagine that we have a huge dataset somewhere in external storage. It could be a database or just a large CSV file. Say we need to process this 2–3 TB file and apply some transformation to each row of data in it. Let's assume that the service performing this task has only 32 GB of memory. That limits how we load the data: we can't read the whole file into memory and then split it into lines with a simple Python split('\n') call. The solution is to process the file row by row and yield each row as we go, freeing the memory for the next one. This helps us create a constantly streaming flow of ETL data into the final destination of our data pipeline. It can be anything: a cloud storage bucket, another database, a data warehouse solution (DWH), a streaming topic or another…
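The row-by-row idea can be sketched roughly like this. It is a minimal illustration using only the standard library, not the code from the original story: the function names `read_rows`, `transform` and `load`, the `name` column and the CSV destination are all assumptions made for the example.

```python
import csv
from typing import Dict, Iterable, Iterator


def read_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one CSV row at a time; the file is never loaded whole."""
    with open(path, newline="", encoding="utf-8") as source:
        # csv.DictReader pulls lines from the file object lazily.
        for row in csv.DictReader(source):
            yield row


def transform(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Apply a per-row transformation lazily."""
    for row in rows:
        # Hypothetical transformation: tidy up an assumed "name" column.
        row["name"] = row["name"].strip().title()
        yield row


def load(rows: Iterable[Dict[str, str]], destination_path: str) -> int:
    """Write rows to the destination as they arrive; return the row count."""
    count = 0
    writer = None
    with open(destination_path, "w", newline="", encoding="utf-8") as sink:
        for row in rows:
            if writer is None:
                # Take the header from the first row we see.
                writer = csv.DictWriter(sink, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)
            count += 1
    return count
```

Chaining the three steps, for example `load(transform(read_rows("input.csv")), "output.csv")`, keeps only one row in memory at a time, so the size of the input file no longer dictates how much memory the service needs. The CSV destination here is just a stand-in for whatever sink the pipeline actually writes to.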
