
Querying a complete table
We can dive right in by taking a look at the classic SELECT ALL from a table.
Here’s the SQL:
SELECT * FROM df
And here’s the Pandas:
df
All you need to do is call the DataFrame in Pandas to return the entire table and all its columns.
You might also want to look at just a small subset of your table as a quick check before writing a more complicated query. In SQL, you’d use LIMIT 10 or something similar to get only a select number of rows. In Pandas, similarly, you can call df.head(10) or df.tail(10) to get the first or last 10 rows of the table.
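If you want to run these snippets end to end, here’s a minimal, made-up version of the sales-orders table used throughout this piece; the column names match the article, but the values (and the choice to store Quantity_Ordered as strings) are assumptions for illustration:
import pandas as pd

df = pd.DataFrame({
    "Order_ID": [1001, 1002, None, 1004, 1005],
    "Product": ["USB-C Cable", "Monitor", "USB-C Cable", "Monitor", "Monitor"],
    "Quantity_Ordered": ["1", "2", "1", "1", "1"],
    "Purchase_Address": [
        "123 Main St, Los Angeles, CA",
        "456 Oak Ave, Boston, MA",
        "789 Pine Rd, Los Angeles, CA",
        "321 Elm St, Seattle, WA",
        "654 Maple Dr, Los Angeles, CA",
    ],
})

df.head(10)  # first 10 rows (here, all 5)
df.tail(10)  # last 10 rows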
Querying a table without null values
To add to our initial select query, in addition to just limiting the number of rows, you can put conditions to filter the table inside a WHERE clause in SQL. For example, if you wanted all rows in the table without any null values in the Order_ID column, the SQL would look like this:
SELECT * FROM df WHERE Order_ID IS NOT NULL
In Pandas, you have two options:
# Option 1
df.dropna(subset="Order_ID")

# Option 2
df.loc[df["Order_ID"].notna()]
Now, the table we get back doesn’t have any null values in the Order_ID column (which you can compare to the first output above). Both options will return a table without the null values, but they work slightly differently.
You can use the native dropna method in Pandas to return the DataFrame without any null rows, specifying in the subset parameter which columns you’d like to drop nulls from.
Alternatively, the loc method lets you pass a mask or boolean array to filter the DataFrame. Here, we pass df["Order_ID"].notna(), which, if you called it on its own, would return a Series of True and False values mapping to the original DataFrame rows according to whether the Order_ID is null. When we pass it to the loc method, it instead returns the DataFrame where df["Order_ID"].notna() evaluates to True (so all rows where the Order_ID column isn’t null).
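If you want to see the mask on its own before handing it to loc, a quick sketch like this (nothing new, just the two pieces separated out) makes the behavior concrete:
mask = df["Order_ID"].notna()  # boolean Series: True where Order_ID has a value
mask.head()                    # inspect the row-by-row True/False values
df.loc[mask]                   # same filtered table as Option 2 above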
Querying specific columns from a table
Next, instead of selecting all columns from the table, let’s select just a few specific columns. In SQL, you’d write the column names in the SELECT part of the query like this:
SELECT Order_ID, Product, Quantity_Ordered FROM df
In Pandas, we’d write the code like this:
df[["Order_ID", "Product", "Quantity_Ordered"]]
To select a specific subset of columns, you can pass a list of the column names into the DataFrame in Pandas. You can also define the list separately like this for clarity:
target_cols = ["Order_ID", "Product", "Quantity_Ordered"]
df[target_cols]
Assigning a list of target columns that you can then pass into a DataFrame can make a table easier to work with over time, when you need to make changes in your code. For example, you could have a function return the columns you need as a list (a sketch follows below), or append and remove columns from the list as needed depending on what kind of output the user needs.
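For instance, a hypothetical helper like this (the function name and flag are made up) keeps the column list in one place so callers can tweak the output without touching the query itself:
def get_target_cols(include_quantity=True):
    # build the SELECT-style column list in one place
    cols = ["Order_ID", "Product"]
    if include_quantity:
        cols.append("Quantity_Ordered")
    return cols

df[get_target_cols()]                        # Order_ID, Product, Quantity_Ordered
df[get_target_cols(include_quantity=False)]  # Order_ID and Product only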
The GROUP BY in SQL and Pandas
We can now move on to aggregating data. In SQL, we do this by passing the column we want to group on to the SELECT and GROUP BY clauses, and then adding the column to an aggregate measure like COUNT in the SELECT clause as well. As an example, doing so lets us group all the individual Order_ID rows in the original table for each Product and count how many there are. The query can look like this:
SELECT
Product,
COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
GROUP BY Product
In Pandas, it would look like this:
df[df["Order_ID"].notna()].groupby(["Product"])["Order_ID"].count()
The output is a Pandas Series where the table is grouped by product and there’s a count of all the Order_ID values for each product. In addition to our previous query in Pandas where we included a filter, we now do three things:
- Add groupby and pass the column (or list of columns) that you want to group the DataFrame on;
- Pass the name of the column to aggregate in square brackets on the grouped DataFrame;
- Call the count (or any other aggregate) method to perform the aggregation on the target column.
For better readability, we can assign the condition to a variable (this will come in handy later) and format the query so it’s easier to read.
condition = df["Order_ID"].notna()
grouped_df = (
    df.loc[condition]
    .groupby("Product")
    ["Order_ID"]  # select the column to count
    .count()
)
grouped_df
Now that we have most of the components of a complete SQL query, let’s take a look at a more complicated one and see what it would look like in Pandas.
SELECT
Product,
COUNT(Order_ID)
FROM df
WHERE Order_ID IS NOT NULL
AND Purchase_Address LIKE '%Los Angeles%'
AND Quantity_Ordered = 1
GROUP BY Product
ORDER BY COUNT(Order_ID) DESC
Here, we add a little to our previous query by including multiple filter conditions as well as an ORDER BY, so that the table returned by our query is sorted by the measure we’re aggregating on. Since there are a few more components to this query, let’s look step by step at how we’d implement it in Pandas.
First, instead of passing multiple conditions when we call the loc method, let’s define a list of conditions and assign them to a variable FILTER_CONDITIONS.
FILTER_CONDITIONS = [
    df["Order_ID"].notna(),
    df["Purchase_Address"].str.contains("Los Angeles"),
    df["Quantity_Ordered"] == "1",
]
As before, a condition passed into loc should be a Pandas mask that evaluates to either True or False. It’s possible to pass multiple conditions to loc, but the syntax needs to look like this:
df.loc[condition_1 & condition_2 & condition_3]
However, just passing a list of conditions like this won’t work:
df.loc[FILTER_CONDITIONS]
# doesn't work -> you can't just pass a list into loc
You’ll get an error if you try the above, because each condition needs to be separated by the & operator for “and” conditions (or the | operator if you need “or” conditions). Instead, we can write some quick code to put the conditions into the correct format. We’ll make use of the functools.reduce function to combine the conditions.
If you want to see in a notebook what combining some strings with the reduce function looks like, try this:
from functools import reduce

reduce(lambda x, y: f"{x} & {y}", ["condition_1", "condition_2", "condition_3"])
This outputs a string like this:
>>> 'condition_1 & condition_2 & condition_3'
Going back to our actual Pandas conditions, we can write this instead (without the string formatting, just using our defined list of conditions in the FILTER_CONDITIONS variable):
reduce(lambda x, y: x & y, FILTER_CONDITIONS)
What reduce does is apply a function cumulatively to the elements of an iterable; in our case, it runs the lambda function over the items in our FILTER_CONDITIONS list, combining each of them with the & operator. This runs until there are no conditions left, and in this case, for all three conditions it would effectively return the following (the parentheses around the final comparison are needed if you write it out by hand, since & binds more tightly than ==):
df["Order_ID"].notna() & df["Purchase_Address"].str.contains("Los Angeles") & (df["Quantity_Ordered"] == "1")
Finally, let’s add the list of conditions to create a final GROUP BY query in Pandas:
final_df = (
    df
    .loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    .size()
    .sort_values(ascending=False)
)
You’ll notice two additional differences from the previous query:
- Instead of specifying the exact column to count on, we can simply call the size method, which returns the number of rows in the DataFrame (as before, every Order_ID value was unique and meant to represent one row when we counted on it);
- There are a few other ways to do the ORDER BY in Pandas; one way is to simply call sort_values and pass ascending=False to sort in descending order (another is sketched just below).
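One such alternative (not what the query above uses, but equivalent for this particular count-and-sort) is value_counts, which groups, counts, and sorts in descending order in a single call:
# value_counts() on the filtered Product column gives the same ordered counts
df.loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS), "Product"].value_counts()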
If you wanted to use the previous syntax for aggregating the data, it would look like this:
final_df = (
    df
    .loc[reduce(lambda x, y: x & y, FILTER_CONDITIONS)]
    .groupby("Product")
    ["Order_ID"].count()
    .sort_values(ascending=False)
)
The output of both methods will be the same as before: a Series with the column you’re grouping on and the counts for each product.
If instead you wanted to output a DataFrame, you can call the reset_index method on the Series to get back the original column names for the column you grouped on and the column you’re aggregating on (in this case we grouped on “Product” and are counting the “Order_ID”).
final_df.reset_index()
And there we have it! All the components of a full SQL query, finally written in Pandas. Some of the things we can do to further optimize this process for working with data over time include:
- Putting the different lists of columns to SELECT or GROUP BY into their own variables or functions (so that you or a user can modify them over time), as in the sketch after this list;
- Moving the logic that combines the list of columns for a filter condition into its own function, so the end user doesn’t need to puzzle over what the reduce logic is doing;
- After calling reset_index, renaming the output column (or columns, if we’re aggregating on more than one) for clarity, for example to “Count_Order_ID”.
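A minimal sketch pulling these three ideas together might look like the following; the helper names are hypothetical and just show where each piece could live:
from functools import reduce

def get_group_cols():
    # keep the GROUP BY column list in one place
    return ["Product"]

def combine_conditions(conditions):
    # hide the reduce logic from the end user
    return reduce(lambda x, y: x & y, conditions)

final_df = (
    df.loc[combine_conditions(FILTER_CONDITIONS)]
    .groupby(get_group_cols())["Order_ID"]
    .count()
    .sort_values(ascending=False)
    .reset_index()
    .rename(columns={"Order_ID": "Count_Order_ID"})  # clearer output column name
)
final_df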