How to Build Popularity-Based Recommenders with Polars
Initial Thoughts
Most Popular Across All Customers
Most Popular Per Customer
Conclusion

Created by me on dreamstudio.ai.

Recommender systems are algorithms designed to offer users recommendations based on their past behavior, preferences, and interactions. Having become integral to numerous industries, including e-commerce, entertainment, and advertising, recommender systems improve the user experience, increase customer retention, and drive sales.

While various advanced recommender systems exist, today I want to show you one of the simplest, yet often hard to beat, recommenders: the popularity-based recommender. It is an excellent baseline recommender that you should always try out alongside a more advanced model, such as matrix factorization.

We will create two different flavors of popularity-based recommenders using polars in this article. Don't worry if you have not used the fast pandas alternative polars before; this article is a great place to learn it along the way. Let's start!

Initial Thoughts

Popularity-based recommenders work by suggesting the most frequently purchased products to customers. This vague idea can be turned into at least two concrete implementations:

  1. Check which articles are bought most often across all customers. Recommend these articles to each customer.
  2. Check which articles are bought most often per customer. Recommend these per-customer articles to their corresponding customer.

We will now show how to implement both concretely using our own custom-created dataset.

If you want to follow along with a real-life dataset, the H&M Personalized Fashion Recommendations challenge on Kaggle provides an excellent example. For copyright reasons, I won't use this lovely dataset in this article.

The Data

First, we will create our own dataset. Make sure to install polars if you haven't done so already; the generation script below also uses numpy and tqdm:

pip install polars numpy tqdm

Then, let us create random data consisting of (customer_id, article_id) pairs that you should interpret as "the customer with this ID bought the article with that ID". We will use 1,000,000 customers who can buy any of 50,000 products.

import numpy as np
from tqdm import tqdm  # progress bar for the long-running loop

np.random.seed(0)

N_CUSTOMERS = 1_000_000
N_PRODUCTS = 50_000
N_PURCHASES_MEAN = 100  # customers buy 100 articles on average

with open("transactions.csv", "w") as file:
    file.write("customer_id,article_id\n")  # header

    for customer_id in tqdm(range(N_CUSTOMERS)):
        n_purchases = np.random.poisson(lam=N_PURCHASES_MEAN)
        articles = np.random.randint(low=0, high=N_PRODUCTS, size=n_purchases)
        for article_id in articles:
            file.write(f"{customer_id},{article_id}\n")  # one transaction per row

Image by the author.

This medium-sized dataset has over 100,000,000 rows (transactions), an amount you might well encounter in a business context.
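If you want to double-check that number without loading everything into memory, a lazy polars scan can count the rows. This is my own sketch, not part of the original article, and assumes a reasonably recent polars version:

import polars as pl

# Build a lazy query plan and count the rows; nothing is materialized
# beyond the final 1x1 result.
n_rows = pl.scan_csv("transactions.csv").select(pl.count()).collect().item()
print(n_rows)  # roughly N_CUSTOMERS * N_PURCHASES_MEAN = 100,000,000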

The Task

We now want to build recommender systems that scan this dataset in order to recommend popular items in some sense. We will clarify two variants of how to interpret this:

  • most popular across all customers
  • most popular per customer

Our recommenders should recommend ten articles for each customer.

Note: We will not assess the quality of the recommenders here. Drop me a message if you are interested in this topic, though, since it is worth a separate article.

Most Popular Across All Customers

For this recommender, we don't even care who bought the articles; all of the information we need is in the article_id column alone.

At a high level, it works like this:

  1. Load the data.
  2. Count how often each article appears in the article_id column.
  3. Return the ten most frequent articles as the recommendation for each customer.

Familiar Pandas Version

As a gentle start, let us take a look at how you could do this in pandas.

import pandas as pd

data = pd.read_csv("transactions.csv", usecols=["article_id"])  # load only the column we need
purchase_counts = data["article_id"].value_counts()  # count how often each article was bought
most_popular_articles = purchase_counts.head(10).index.tolist()  # top ten article ids

On my machine, this takes about 31 seconds. That might seem like little, but the dataset still has only a moderate size; things get really ugly with larger datasets. To be fair, about 10 of those seconds are spent loading the CSV file. Using a better file format, such as parquet, would decrease the loading time.
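As a rough sketch of that idea (my own addition, assuming pyarrow is installed for parquet support): convert the CSV once, then read only the column you need from the parquet file.

# One-time conversion; this loads the full CSV into memory once.
pd.read_csv("transactions.csv").to_parquet("transactions.parquet")

# Parquet is columnar, so reading a single column is cheap.
data = pd.read_parquet("transactions.parquet", columns=["article_id"])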

Note: I used pandas 2.0.1, the latest and most optimized version at the time of writing.

Still, to prepare a little more for the polars version, let us rewrite the pandas code using method chaining, a technique I have grown to love.

most_popular_articles = (
    pd.read_csv("transactions.csv", usecols=["article_id"])
    .squeeze()  # turn the one-column dataframe into a series
    .value_counts()
    .head(10)
    .index
    .tolist()
)

This is beautiful because you can read from top to bottom what is happening, without the need for many intermediate variables that people usually struggle to name (df_raw → df_filtered → df_filtered_copy → … → df_final, anyone?). The run time stays the same, however.

Faster Polars Version

Let us implement the same logic in polars, using method chaining as well.

import polars as pl

most_popular_articles = (
    pl.read_csv("transactions.csv", columns=["article_id"])
    .get_column("article_id")
    .value_counts()
    .sort("counts", descending=True)  # value_counts does not sort automatically
    .head(10)
    .get_column("article_id")  # there are no indices in polars
    .to_list()
)

Things look pretty similar, except for the running time: 3 seconds instead of 31, which is impressive!

Polars is just SO much faster than pandas.

This is inarguably one of the main benefits of polars over pandas. Apart from that, polars also has a convenient syntax for creating complex operations that pandas doesn't have. We will see more of that when creating the other popularity-based recommender.
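As a small taste of that syntax, here is a lazy variant of the same query. This is my own sketch, not from the original article; it assumes your polars version provides scan_csv and lazy group-bys, which it has for a long time:

most_popular_articles = (
    pl.scan_csv("transactions.csv")  # lazy: builds a query plan instead of reading immediately
    .groupby("article_id")
    .agg(pl.count())  # number of purchases per article
    .sort("count", descending=True)
    .head(10)
    .collect()  # only now does polars touch the file
    .get_column("article_id")
    .to_list()
)

Since only article_id is ever used, the query optimizer can skip the customer_id column entirely when reading the file.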

It is also important to note that pandas and polars produce the same output, as expected.

Most Popular Per Customer

In contrast to our first recommender, we now want to slice the dataframe per customer and get the most popular products for each customer. This means that we need the customer_id in addition to the article_id.

We illustrate the logic using a small dataframe consisting of only ten transactions from three customers A, B, and C buying four articles 1, 2, 3, and 4. We want to get the top two articles per customer. We can achieve this using the following steps:

Image by the author.
  1. We start with the original dataframe.
  2. We then group by customer_id and article_id and aggregate via a count.
  3. We then aggregate again over the customer_id and write the article_ids into a list, just as in our last recommender. The twist is that we sort this list by the count column.

That way, we end up with precisely what we want.

  • A bought products 1 and 2 most frequently.
  • B bought products 4 and 2 most frequently. Products 4 and 1 would have been a correct solution as well, but internal orderings just happened to flush product 2 into the recommendation.
  • C only bought product 3, so that's all there is.
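To make this concrete, here is a minimal sketch that reproduces the toy example from the picture. The toy dataframe is my own construction; the pipeline mirrors the full version below:

import polars as pl

# Ten transactions: customers A, B, C buying articles 1 to 4.
toy = pl.DataFrame({
    "customer_id": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C"],
    "article_id": [1, 1, 2, 2, 3, 4, 4, 2, 1, 3],
})

top_two_per_customer = (
    toy
    .groupby(["customer_id", "article_id"])
    .agg(pl.count())  # how often each customer bought each article
    .groupby("customer_id")
    .agg(pl.col("article_id").sort_by("count", descending=True).head(2))
    .sort("customer_id")
)
print(top_two_per_customer)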

Step 3 of this procedure sounds especially difficult, but polars lets us handle this conveniently.

most_popular_articles_per_user = (
    pl.read_csv("transactions.csv")
    .groupby(["customer_id", "article_id"])  # first arrow from the image
    .agg(pl.count())  # first arrow from the image
    .groupby("customer_id")  # second arrow
    .agg(pl.col("article_id").sort_by("count", descending=True).head(10))  # second arrow
)

By the way: this version already runs for about a minute on my machine. I didn't create a pandas version for this, and I'm definitely scared to do so and let it run. If you are brave, give it a try!

A Small Improvement

So far, some users might have fewer than ten recommendations, and some might even have none. An easy fix is to pad each customer's recommendations to ten articles, for example

  • using random articles, or
  • using the most popular articles across all customers from our first popularity-based recommender.

We can implement the second version like this:

improved_recommendations = (
    most_popular_articles_per_user
    .with_columns([
        pl.col("article_id").fill_null([]).alias("personal_top_<=10"),
        pl.lit([most_popular_articles]).alias("global_top_10"),
    ])
    .with_columns(
        pl.col("personal_top_<=10")
        .arr.concat(pl.col("global_top_10"))  # append the global top ten
        .arr.head(10)  # cut the padded list back to ten articles
        .alias("padded_recommendations")
    )
    .select(["customer_id", "padded_recommendations"])
)

Conclusion

Popularity-based recommenders hold a prominent position in the realm of recommender systems due to their simplicity, ease of implementation, and effectiveness, both as an initial approach and as a hard-to-beat baseline.

In this article, we have learned how to transform the simple idea of popularity-based recommendations into code using the fabulous polars library.

The main drawback, especially of the personalized popularity-based recommender, is that the recommendations are not inspiring in any way. People have seen all of the recommended things before, meaning they are stuck in an extreme echo chamber.

One way to mitigate this problem to some extent is by using other approaches, such as collaborative filtering or hybrid approaches.

I hope that you learned something new, interesting, and valuable today. Thanks for reading!

As a final point, if you

  1. want to support me in writing more about machine learning and
  2. plan to get a Medium subscription anyway,

why not do it via this link? This would help me a lot! 😊

To be transparent, the price for you does not change, but about half of the subscription fee goes directly to me.

Thanks a lot if you consider supporting me!

If you have any questions, write me on LinkedIn!
