What Is Sampling Bias in Recommendations, and How Can We Correct It?
Recommendations are ubiquitous in our digital lives, from e-commerce giants to streaming services. However, hidden beneath every large recommender system lies a challenge that can significantly impact its effectiveness: sampling bias.
In this article, I’ll explain how sampling bias arises when training recommendation models and how we can address it in practice.
Let’s dive in!
In general, we can formulate the recommendation problem as follows: given a query x (which might contain user information, context, previously clicked items, etc.), find the set of items {y1, ..., yk} that the user will likely be interested in.
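At its simplest, this amounts to scoring every candidate item against the query and keeping the top-k. The sketch below illustrates that brute-force baseline with random embeddings; the dimensions and sizes are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes, purely for illustration.
DIM, NUM_ITEMS, K = 64, 10_000, 5

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(NUM_ITEMS, DIM))  # one row per candidate item y
query_embedding = rng.normal(size=(DIM,))            # encoded query x

# Score every item against the query and keep the top-k set {y1, ..., yk}.
scores = item_embeddings @ query_embedding
top_k = np.argsort(-scores)[:K]
```

With millions or billions of items, this exhaustive scoring is exactly what becomes too slow, which motivates the multi-stage design discussed next.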
One of the main challenges for large-scale recommender systems is the low-latency requirement. The user and item pools are vast and dynamic, so exhaustively scoring every candidate to find the best ones is infeasible. Therefore, to meet the latency requirement, recommender systems are generally broken down into two main stages: retrieval and ranking.
Retrieval is a cheap and efficient way to quickly capture the top item candidates (a few hundred) from the vast candidate pool (millions or billions). Retrieval optimization is mainly about two objectives:
- During the training phase, we want to encode users and items into embeddings that capture the user’s behaviour and preferences.
- During inference, we want to quickly retrieve relevant items via Approximate Nearest Neighbors (ANN) search.
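To give a feel for the second objective, here is a toy random-hyperplane LSH index, a simple stand-in for production ANN libraries such as Faiss or ScaNN (which the article does not name; they are mentioned here only as examples). All sizes and the bucketing scheme are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, NUM_ITEMS, NUM_BITS = 32, 5_000, 8

items = rng.normal(size=(NUM_ITEMS, DIM))
planes = rng.normal(size=(NUM_BITS, DIM))  # random hyperplanes

def hash_code(vectors):
    """Sign pattern against the hyperplanes -> integer bucket id."""
    bits = (vectors @ planes.T) > 0
    return bits @ (1 << np.arange(NUM_BITS))

buckets = hash_code(items)

def ann_search(query, k=10):
    # Only score items that hash to the query's bucket;
    # fall back to the full pool if the bucket is empty.
    candidates = np.flatnonzero(buckets == hash_code(query[None, :])[0])
    if candidates.size == 0:
        candidates = np.arange(NUM_ITEMS)
    scores = items[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

result = ann_search(rng.normal(size=(DIM,)))
```

Instead of scoring all 5,000 items, the search only touches the (on average) ~20 items sharing the query’s hash bucket, which is the essential trade-off ANN methods make: a small loss in recall for a large gain in speed.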
For the first objective, one of the most common approaches is the two-tower neural network. The model gained popularity for tackling the cold-start problem by incorporating item content features.
In detail, queries and items are encoded by corresponding DNN towers so that the relevant (query, item) embeddings stay…
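The two-tower idea can be sketched as a pair of encoders mapping query features and item features into a shared embedding space, with relevance measured by an inner product. This is a minimal forward-pass sketch only; the layer sizes, single-layer towers, and L2 normalization are assumptions, and real towers would be trained, not random:

```python
import numpy as np

rng = np.random.default_rng(7)
QUERY_DIM, ITEM_DIM, EMB_DIM = 20, 30, 16

# One linear layer per tower (weights would normally be learned).
W_q = rng.normal(size=(QUERY_DIM, EMB_DIM)) * 0.1
W_i = rng.normal(size=(ITEM_DIM, EMB_DIM)) * 0.1

def query_tower(x):
    """Encode query features into the shared embedding space."""
    e = np.maximum(x @ W_q, 0.0)  # ReLU
    return e / (np.linalg.norm(e, axis=-1, keepdims=True) + 1e-12)

def item_tower(y):
    """Encode item (content) features into the same space."""
    e = np.maximum(y @ W_i, 0.0)
    return e / (np.linalg.norm(e, axis=-1, keepdims=True) + 1e-12)

# Relevance of each item to the query is the inner product
# of the two embeddings.
q = query_tower(rng.normal(size=(1, QUERY_DIM)))
item_embs = item_tower(rng.normal(size=(100, ITEM_DIM)))
scores = (item_embs @ q.T).ravel()
```

Because the item tower consumes content features rather than just an item ID, it can produce a usable embedding for a brand-new item, which is why this architecture helps with cold start.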