Learn the best way to use SQL to question your Polars DataFrames
In my last few articles on data analytics, I speak about two necessary up-and-coming libraries which can be currently gaining lots of tractions within the industry:
- DuckDB — where you possibly can query your dataset in-memory using SQL statements.
- Polars — a rather more efficient DataFrame library in comparison with the venerable Pandas library.
What about combining the ability of those two libraries?
The truth is, you possibly can directly query a Polars dataframe through DuckDB, using SQL statements.
So what are the advantages of querying your Polars dataframe using SQL? Despite the benefit of use, manipulating Polars dataframes still require a little bit of practise and a comparatively steep learning curve. But since most developers are already conversant in SQL, isn’t it more convenient to control the dataframes directly using SQL? Using this approach, developers have the most effective of each worlds:
- the flexibility to question Polars dataframes using all the varied functions, or
- use SQL for cases where it’s rather more natural and easier to extract the info that they need
In this text, I gives you some examples of how you possibly can make use of SQL through DuckDB to question your Polars dataframes.
For this text, I’m using Jupyter Notebook. Make sure that you could have installed Polars and DuckDB using the next commands:
!pip install polars
!pip install duckdb
To start, let’s create a Polars DataFrame by hand:
import polars as pldf = pl.DataFrame(
{
'Model': ['iPhone X','iPhone XS','iPhone 12',
'iPhone 13','Samsung S11',
'Samsung S12','Mi A1','Mi A2'],
'Sales': [80,170,130,205,400,30,14,8],
'Company': ['Apple','Apple','Apple','Apple',
'Samsung','Samsung','Xiao Mi',
'Xiao Mi'],
})
df
Here’s how the dataframe looks:
Say, you now want to seek out all phones from Apple which has sales of greater than 80. You need to use the filter()
function in Polars, like this:
df.filter(
(pl.col('Company') == 'Apple') &
(pl.col('Sales') > 80)
)
And the result looks like this:
Let’s now do the precise query that we did within the previous section, except that this time round we are going to use DuckDB with a SQL statement. But first, let’s select all of the rows within the dataframe:
import duckdbresult = duckdb.sql('SELECT * FROM df')
result
You’ll be able to directly reference the
df
dataframe out of your SQL statement.
Using DuckDB, you issue a SQL statement using the sql()
function. Alternatively, the query()
function also works:
result = duckdb.query('SELECT * FROM df')
The result
variable is a duckdb.DuckDBPyRelation
object. Using this object, you possibly can perform quite a variety of different tasks, akin to:
- Getting the mean of the Sales column:
result.mean('Sales')
- Describing the dataframe:
result.describe()
- Applying a scaler function to the columns within the dataframe:
result.apply("max", 'Sales,Company')
- Reordering the dataframe:
result.order('Sales DESC')
However the easiest method is to question the Polars DataFrame is to make use of SQL directly.
For instance, if you ought to get all of the rows with sales greater than 80, simply use the sql()
function with the SQL statement below:
duckdb.sql('SELECT * FROM df WHERE Sales >80').pl()
The
pl()
function converts theduckdb.DuckDBPyRelation
object to a Polars DataFrame. If you ought to convert it to a Pandas DataFrame as an alternative, use thedf()
function.
If you ought to get all of the rows whose model name starts with “iPhone”, then use the next SQL statement:
duckdb.sql("SELECT * FROM df WHERE Model LIKE 'iPhone%'").pl()
If you happen to want all devices from Apple and Xiao Mi, then use the next SQL statement:
duckdb.sql("SELECT * FROM df WHERE Company = 'Apple' OR Company ='Xiao Mi'").pl()
The actual power of using DuckDB with Polars DataFrame is when you ought to query from multiple dataframes. Consider the next three CSV files from the 2015 Flights Delay dataset:
2015 Flights Delay dataset — https://www.kaggle.com/datasets/usdot/flight-delays. Licensing — CC0: Public Domain
- flights.csv
- airlines.csv
- airports.csv
Let’s load them up using Polars:
import polars as pldf_flights = pl.scan_csv('flights.csv')
df_airlines = pl.scan_csv('airlines.csv')
df_airports = pl.scan_csv('airports.csv')
display(df_flights.collect().head())
display(df_airlines.collect().head())
display(df_airports.collect().head())
The above statements use lazy evaluation to load up the three CSV files. This ensures that any queries on the dataframes will not be performed until all of the queries are optimized. The
collect()
function forces Polars to load the CSV files into dataframes.
Here is how the df_flights
, df_airlines
, and df_airports
dataframes seem like:
Suppose you ought to count the variety of times an airline has a delay , and at the identical time display the name of every airline, here is the SQL statement which you could use using the df_airlines
and df_flights
dataframes:
duckdb.sql('''
SELECT
count(df_airlines.AIRLINE) as Count,
df_airlines.AIRLINE
FROM df_flights, df_airlines
WHERE df_airlines.IATA_CODE = df_flights.AIRLINE AND df_flights.ARRIVAL_DELAY > 0
GROUP BY df_airlines.AIRLINE
ORDER BY COUNT DESC
''')
And here is the result:
If you ought to count the variety of airports in each state and kind the count in descending order, you should utilize the next SQL statement:
duckdb.sql('''
SELECT STATE, Count(*) as AIRPORT_COUNT
FROM df_airports
GROUP BY STATE
ORDER BY AIRPORT_COUNT DESC
''')
Finally, suppose you ought to know which airline has the best average delay. You need to use the next SQL statement to calculate the varied statistics, akin to minimum arrival delay, maximum array delay, mean arrival delay, and standard deviation of arrival delay:
duckdb.sql('''
SELECT AIRLINE, MIN(ARRIVAL_DELAY), MAX(ARRIVAL_DELAY),
MEAN(ARRIVAL_DELAY), stddev(ARRIVAL_DELAY)
FROM df_flights
GROUP BY AIRLINE
ORDER BY MEAN(ARRIVAL_DELAY)
''')
Based on the mean arrival delay, we are able to see that the AS airline is the one with the shortest delay (as the worth is negative, this implies more often than not it arrives earlier!) and NK airline is the one with the longest delay. Need to know what’s the AS airline? Try it out using what you could have just learned! I’ll leave it as an exercise and the reply is at the tip of this text.
If you happen to like reading my articles and that it helped your profession/study, please consider signing up as a Medium member. It’s $5 a month, and it gives you unlimited access to all of the articles (including mine) on Medium. If you happen to enroll using the next link, I’ll earn a small commission (at no additional cost to you). Your support implies that I’ll have the option to devote more time on writing articles like this.
On this short article, I illustrated how DuckDB and Polars could be used together to question your dataframes. Utilizing each libraries gives you the most effective of each worlds — using a well-known querying language (which is SQL) to question an efficient dataframe. Go ahead and take a look at it out using your personal dataset and share with us the way it has helped your data analytics processes.
Answer to quiz:
duckdb.sql("SELECT AIRLINE from df_airlines WHERE IATA_CODE = 'AS'")