An ongoing Recommendation System series

I caught myself today, once more for the 100…01-th day in a row, holding my late, unopened dinner box as I browse through Netflix for a show to watch while munching on my food. My feed is stuffed with a few too many Asian romance and American coming-of-age suggestions, probably based on a series or two from these categories that I watched, like, a month or two ago. "There's nothing to watch here…" I sighed as I finished reading all of the synopses, feeling confident that I could predict how each plot would unfold. I whipped out my alternative go-to entertainment option, TikTok, while subconsciously reminding myself that I'll probably have to mark some videos as Not Interested and Like or Save others to…nudge the algorithm into sending me a fresher stream of content today.
Recommendation Systems (RecSys) may be considered a longtime algorithm, one that has been so deeply embedded into our daily lives that, on a scale of 1 to ChatGPT, it feels almost like an 80s trend to both the academic and non-academic world. Nonetheless, it is by no means a near-perfect algorithm. The ethical, social, technical, and legal challenges that come with operating a recommendation application have never been at the forefront of research (as is the case with most other technology products…). Unfairness toward certain user groups and privacy violations are examples of the popular concerns revolving around RecSys that are still not fully addressed by the businesses that implement it. Beyond these, there exist many more subtle issues that are rarely given enough deliberation, one of which is the loss of autonomy in a person's decision-making process. A "powerful" RecSys can undoubtedly nudge users in a specific direction [2], making them purchase, watch, think, or believe in something that they would not have, had they not been subject to such manipulation.
Hence, I want to write up a series along my grad school journey as I start learning and diving deeper into RecSys, their strengths and shortcomings…all from scratch! And I figure it can start with thinking about movies and…Thompson Sampling!
Thompson Sampling (TS) is one of the foundational algorithms not only in the recommendation system literature, but also in reinforcement learning. It is arguably a better alternative to A/B testing in online learning settings, as clearly explained by Samuele Mazzanti in this amazing article. In simple terms, in a movie recommendation context, TS tries to identify the best movie to suggest to me, the one that maximizes the chance that I'll click to watch. It can do so effectively with relatively little data because it allows its parameters to be updated every time it observes me click, or not click, on a movie. Roughly speaking, this dynamic characteristic enables TS to take into account, on top of my watch history and bookmarked series, real-time aspects such as the browsing or the searches I'm making within the app at that moment, in order to give me the most suitable suggestion. Nevertheless, in this beginner-friendly tutorial, let's just look into the simplified analysis below.
Let’s break it down even further!
Consider these 3 movies, which, amazing as they all are, I, controversially enough, do have my own personal ranking for. Out of these 3 movies, say there is one that I will 100% rewatch if it comes up on my feed, one that I am highly unlikely to rewatch (5%), and one that there is a 70% probability I will click watch whenever I see it. TS obviously does not have this information about me beforehand, and its goal is to learn my behavior so that, as common intuition goes, it can recommend me the movie it knows I will obviously click watch.
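To make this toy setup concrete, here is a minimal sketch of a simulated "me" that TS would interact with. The movie names and the exact click probabilities are just the made-up numbers from above, and nothing in this dictionary is ever shown to TS; it only gets to observe the resulting clicks and skips.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "ground truth" click probabilities for the three movies above.
# TS never sees these numbers; it only observes my clicks and skips.
TRUE_CLICK_PROBS = {"movie_1": 1.00, "movie_2": 0.05, "movie_3": 0.70}

def watch_or_skip(movie: str) -> int:
    """Simulate my reaction to a suggestion: 1 = click to watch, 0 = skip."""
    return int(rng.random() < TRUE_CLICK_PROBS[movie])
```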
In the TS algorithm, the main workflow goes as follows:
- Action: TS suggests a particular movie to me, among hundreds of others
- Outcome: I decide that the movie sounds interesting enough and click to watch it, or I find it boring and click out of the page after reading the synopsis
- Reward: Can be thought of as the number of "points" TS scores if I click to watch a certain movie, or misses if I do not click. In basic movie or ad recommendation settings, we can treat the reward as equivalent to the outcome, so 1 click on the movie = 1 point!
- Update knowledge: TS registers my choice and updates its belief as to which movie is my favorite.
- Repeat from step 1 (it could be within my current browsing session, or at dinner time the next day), but now with some additional knowledge about my preferences. A rough code sketch of this whole loop follows below.
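Here is that sketch: a minimal action/outcome/reward/update loop under the simplifying Beta-Bernoulli assumptions explained later in this post. It reuses the hypothetical `watch_or_skip` simulator from the earlier snippet, so none of these names come from a real library.

```python
import numpy as np

rng = np.random.default_rng(0)
movies = list(TRUE_CLICK_PROBS)            # the toy simulator sketched above

# Start every movie at Beta(1, 1): no clicks and no skips observed yet.
alphas = {m: 1.0 for m in movies}          # 1 + number of observed clicks
betas = {m: 1.0 for m in movies}           # 1 + number of observed skips

for step in range(500):
    # Action: sample a plausible click rate for each movie, suggest the best one
    sampled = {m: rng.beta(alphas[m], betas[m]) for m in movies}
    suggestion = max(sampled, key=sampled.get)

    # Outcome / reward: 1 point if I click to watch, 0 if I skip
    reward = watch_or_skip(suggestion)

    # Update knowledge: shift the belief about the suggested movie
    alphas[suggestion] += reward
    betas[suggestion] += 1 - reward

# Rough posterior mean click rate per movie after 500 suggestions
print({m: round(alphas[m] / (alphas[m] + betas[m]), 2) for m in movies})
```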
Exploration/Exploitation
This is the most used term in this literature, and it is also what sets TS and other related algorithms apart. Step 5 above is where this logic kicks in. In the TS world, everything has a degree of uncertainty to it. Me drinking latte three times and matcha five times in a week does not necessarily mean I like matcha more than latte; what if it was just that one week (and I actually drink more latte than matcha on average)? For this reason, everything in TS is represented by some form of distribution, rather than just single numbers.
Starting out, TS obviously has a lot of uncertainty around my preference for the movies, so its priority is to explore: it gives me many different movie suggestions in order to observe my responses. After a few clicks and skips, TS can sort of figure out which movies I tend to click and which movies yield no benefit, and hence it gains more confidence in which movie to serve me next time. This is when TS starts to exploit the highly rewarding options, giving me the movies I often click watch, while still leaving some room for more exploration. The confidence builds up as more observations come in, and in simple cases it will reach the point where exploration becomes very minimal, since TS is already confident enough to exploit the recommendation that yields the most reward.
Exploration vs. exploitation is thus often called a tradeoff, or a dilemma: too much exploration (i.e., too little elimination of low-value choices despite already having enough evidence that they are not optimal) and you incur a lot of loss; too much exploitation (i.e., eliminating too many options too quickly) and you are likely to falsely eliminate the truly optimal action.
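A quick, made-up illustration of how that shift plays out when we sample from the belief distributions: with only a handful of observations the sampled click rates swing widely (so different movies get their turn, i.e. exploration), while after many observations they cluster tightly around the estimated click rate (so the best-looking movie wins almost every time, i.e. exploitation). The counts below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Early on: 2 clicks, 2 skips -> a wide Beta(1+2, 1+2), samples swing a lot
print(np.round(rng.beta(3, 3, size=5), 2))

# Much later: 70 clicks, 30 skips -> a narrow Beta(1+70, 1+30), samples hug ~0.7
print(np.round(rng.beta(71, 31, size=5), 2))
```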
As in the matcha-latte graph above, TS works with different kinds of distributions to learn our preference for different options. In the most basic cases of movies (and ads as well), we often use the Beta-Bernoulli combo.
The Bernoulli distribution is a discrete distribution in which there are only two possible outcomes: 1 and 0. It has just one parameter, which indicates the probability of some variable, say Y, being 1. So, if we say Y ~ Bern(p) and, for instance, p = 0.7, that means Y has a 0.7 probability of taking the value 1, and a 1 - p = 1 - 0.7 = 0.3 probability of being 0. Thus, the Bernoulli distribution is suitable for modeling the reward (also the outcome, in our case) because our reward has only a binary outcome: Clicked or Not Clicked.
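If you want to see this in action, here is a tiny simulation of 10 clicks drawn from Bernoulli(0.7); the values shown in the comments are only the kind of output you might get, not fixed results.

```python
import numpy as np

rng = np.random.default_rng(1)

p = 0.7                                    # probability that I click watch (Y = 1)
clicks = rng.binomial(n=1, p=p, size=10)   # 10 Bernoulli(0.7) draws: 1 = click, 0 = skip
print(clicks)                              # something like [1 1 0 1 1 1 0 1 1 0]
print(clicks.mean())                       # approaches 0.7 as the sample grows
```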
The Beta distribution, on the other hand, is used to model TS's belief regarding my movie interests. The Beta distribution takes two parameters, alpha and beta, which are often thought of as the number of successes and failures, respectively, and both have to be ≥ 1. Thus, it is suitable to use the Beta distribution to model the number of times that I click watch and the number of times I skip a movie. Let's take a look at an example. Here are 3 different Beta distributions representing the 3 movies, over 10 observations each, so the total number of clicks and skips is the same for all 3 movies (10), but the click and skip rates are different. For movie 1, I click watch 2 times (alpha = 2) and skip 8 times (beta = 8); for movie 2, I click watch 5 times and skip 5 times; for movie 3, I click watch 8 times and skip 2.
According to the graph, we can see that the probability of me watching movie 2 again peaks around 50%, whereas this probability for movie 1 is much lower, for instance. We can think of the curves here as the probability of a probability (of me watching a movie again), which is why the Beta distribution is ideal for representing TS's belief about my movie preferences.
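If you would like to reproduce curves like the ones in the graph yourself, a minimal matplotlib/scipy sketch using the three (click, skip) counts above could look something like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0, 1, 200)
# (alpha, beta) = (clicks, skips) out of the 10 observations per movie
params = {"movie 1": (2, 8), "movie 2": (5, 5), "movie 3": (8, 2)}

for name, (a, b) in params.items():
    plt.plot(x, beta.pdf(x, a, b), label=f"{name}: Beta({a}, {b})")

plt.xlabel("probability that I click watch again")
plt.ylabel("density")
plt.legend()
plt.show()
```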
In this section, I will help you gain a clear understanding of the algorithm, both implementation-wise and methodology-wise. First, here is a snippet of the Thompson Sampling algorithm, in pseudocode and in Python. The pseudocode is taken from an amazing book on TS, called A Tutorial on Thompson Sampling [Russo, 2017].