3. Friends with Pandas
If there is one thing that makes Pandas the king of data analysis libraries, it has got to be its integration with the rest of the data ecosystem.
For instance, by now you must have learned how to switch the plotting backend of Pandas from Matplotlib to Plotly, HVPlot, Holoviews, Bokeh, or Altair.
Yes, Matplotlib is best friends with Pandas, but from time to time you fancy something interactive like Plotly or Altair.
import pandas as pd
import plotly.express as px

# Set the default plotting backend to Plotly
pd.options.plotting.backend = 'plotly'
Speaking of backends, you may also have noticed that Pandas added a fully supported PyArrow implementation for its read_*
functions to load data files in the brand-new 2.0.0 version.
import pandas as pd

pd.read_csv(file_name, engine='pyarrow')
When the only backend was NumPy, there were many limitations: little support for non-numeric data types, near-total disregard for missing values, and no support for complex data structures such as dates, timestamps, and categoricals.
Before 2.0.0, Pandas had been cooking up in-house solutions to these problems, but they weren't nearly as good as some heavy users had hoped. With the PyArrow backend, loading data is considerably faster, and it brings a set of data types that Apache Arrow users are familiar with:
import pandas as pd

pd.read_csv(file_name, engine='pyarrow', dtype_backend='pyarrow')
Another cool feature of Pandas that I'm sure you use all the time in JupyterLab is DataFrame styling.
Since Project Jupyter is so awesome, Pandas developers added a bit of HTML/CSS magic under the .style
attribute so you can dress up plain old DataFrames in a way that reveals additional insights:
df.sample(20, axis=1).describe().T.style.bar(
subset=["mean"], color="#205ff2"
).background_gradient(
subset=["std"], cmap="Reds"
).background_gradient(
subset=["50%"], cmap="coolwarm"
)
4. The data sculptor
Since Pandas is a data analysis and manipulation library, the truest sign you're a pro is how flexibly you can shape and transform datasets to suit your purposes.
While most online courses provide ready-made data in a clean columnar format, datasets in the wild come in many shapes and forms. For instance, one of the most annoying formats is row-based data (quite common with financial data):
import pandas as pd

# create example DataFrame
df = pd.DataFrame(
{
"Date": [
"2022-01-01",
"2022-01-02",
"2022-01-01",
"2022-01-02",
],
"Country": ["USA", "USA", "Canada", "Canada"],
"Value": [10, 15, 5, 8],
}
)
df
It is essential to be able to convert a row-based format into something more useful, like the example below using the pivot
function:
pivot_df = df.pivot(
index="Date",
columns="Country",
values="Value",
)

pivot_df
You may also need to perform the opposite of this operation, called a melt.
Here is an example with the melt
function of Pandas, which turns columnar data into row-based format:
df = pd.DataFrame(
{
"Date": ["2022-01-01", "2022-01-02", "2022-01-03"],
"AAPL": [100.0, 101.0, 99.0],
"GOOG": [200.0, 205.0, 195.0],
"MSFT": [50.0, 52.0, 48.0],
}
)

df
melted_df = pd.melt(
df, id_vars=["Date"], var_name="Stock", value_name="Price"
)

melted_df
Such functions can be quite difficult to grasp and even harder to apply.
There are other similar ones like pivot_table
, which creates a pivot table that can compute several types of aggregations for each value in the table.
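For instance, unlike pivot, pivot_table happily handles duplicate index/column pairs by aggregating them. A minimal sketch with made-up sales data:

```python
import pandas as pd

# Made-up sales data with a duplicated (Date, Country) pair
df = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02"],
        "Country": ["USA", "USA", "Canada", "Canada"],
        "Value": [10, 20, 5, 8],
    }
)

# pivot would raise on the duplicated ("2022-01-01", "USA") pair;
# pivot_table aggregates it instead. aggfunc also accepts a list,
# e.g. ["sum", "mean"], to compute several aggregations at once.
table = pd.pivot_table(
    df, index="Date", columns="Country", values="Value", aggfunc="sum"
)
print(table)
```

The two USA rows on 2022-01-01 collapse into a single summed cell, which is exactly the case where plain pivot gives up.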
Another is stack/unstack
, which can collapse/explode DataFrame indices. crosstab
computes a cross-tabulation of two or more factors and, by default, builds a frequency table of the factors, but it can compute other summary statistics as well.
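These combine nicely; here is a minimal sketch with made-up survey data showing crosstab building a frequency table and stack/unstack reshaping it back and forth:

```python
import pandas as pd

# Made-up survey data
df = pd.DataFrame(
    {
        "Sex": ["M", "F", "M", "F", "M"],
        "Smoker": ["Yes", "No", "No", "No", "Yes"],
    }
)

# By default, crosstab counts how often each (Sex, Smoker) pair occurs
freq = pd.crosstab(df["Sex"], df["Smoker"])
print(freq)

# stack() collapses the columns into an extra index level, producing a
# long Series; unstack() is its inverse
long = freq.stack()
assert freq.equals(long.unstack())
```

Passing something like `values=` and `aggfunc=` to crosstab swaps the raw counts for other summary statistics.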
Then there's groupby
. Even though the basics of this function are simple, its more advanced use cases are very hard to master. If the contents of the Pandas groupby function were made into a separate library, it would be larger than most in the Python ecosystem.
# Group by a date column, using a monthly frequency,
# and find the total revenue for each `category`
grouped = df.groupby(['category', pd.Grouper(key='date', freq='M')])
monthly_revenue = grouped['revenue'].sum()
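The snippet above assumes a df with category, date, and revenue columns already exists; here is a self-contained sketch with made-up data (using the month-start alias "MS", which works across Pandas versions):

```python
import pandas as pd

# Made-up revenue data
df = pd.DataFrame(
    {
        "category": ["A", "A", "B", "B"],
        "date": pd.to_datetime(
            ["2022-01-05", "2022-02-10", "2022-01-20", "2022-02-25"]
        ),
        "revenue": [100, 200, 50, 75],
    }
)

# Bin rows into (category, month) groups and sum revenue within each;
# "MS" is the month-start frequency alias
grouped = df.groupby(["category", pd.Grouper(key="date", freq="MS")])
monthly_revenue = grouped["revenue"].sum()
print(monthly_revenue)
```

pd.Grouper is what lets you group by a resampled time frequency and a regular column at the same time, which plain `groupby('date')` cannot do.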
Skillfully choosing the right function for a particular situation is a sign that you're a true data sculptor.
Read parts two and three to learn the ins and outs of the functions mentioned in this section.