Explore a rich selection of metrics and find the most effective one for your problem
Ranking is a problem in machine learning where the objective is to sort a list of documents for an end user in the most suitable way, so that the most relevant documents appear on top. Ranking appears in several domains of data science, ranging from recommender systems, where an algorithm suggests a set of items for purchase, to NLP search engines, where for a given query the system tries to return the most relevant results.
The question which naturally arises is how to estimate the quality of a ranking algorithm. As in classical machine learning, there is no single universal metric that is suitable for every kind of task. Why? Simply because every metric has its own application scope which depends on the nature of a given problem and the characteristics of the data.
That is why it is crucial to be aware of all the main metrics in order to successfully tackle any machine learning problem. This is exactly what we are going to do in this article.
Nevertheless, before going ahead, let us understand why certain popular metrics are not normally used for ranking evaluation. Taking this information into account makes it easier to understand why other, more sophisticated metrics are needed.
Note. The article and the formulas used in it are based on the presentation on offline evaluation by Ilya Markov.
There are several types of information retrieval metrics that we are going to discuss in this article:
Imagine a recommender system predicting ratings of movies and showing the most relevant movies to users. A rating is usually represented by a positive real number. At first sight, a regression metric like MSE (RMSE, MAE, etc.) seems a reasonable choice to evaluate the quality of the system on a hold-out dataset.
MSE takes all of the predicted movies into consideration and measures the average squared error between true and predicted labels. Nonetheless, end users are usually interested only in the top results which appear on the first page of a website. This means that they are not really interested in movies with lower ratings towards the end of the search results, which are nevertheless weighted equally by standard regression metrics.
A simple example below shows a pair of search results and measures the MSE value for each of them.
Though the second search result has a lower MSE, the user will not be satisfied with such a recommendation. By first seeing only non-relevant items, the user will have to scroll all the way down to find the first relevant item. That is why, from the user experience perspective, the first search result is much better: the user is simply happy with the top item and proceeds to it without caring about the others.
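To make this concrete, below is a minimal sketch with made-up relevance labels and predicted scores (not the ones from the figure), showing how MSE can favour a ranking that places the only relevant item last:
```python
import numpy as np

# Five displayed positions; labels are binary relevance (1 = relevant, 0 = not).
true_1 = np.array([1, 0, 0, 0, 0])            # relevant item shown first
pred_1 = np.array([0.9, 0.4, 0.4, 0.4, 0.4])

true_2 = np.array([0, 0, 0, 0, 1])            # relevant item shown last
pred_2 = np.array([0.2, 0.2, 0.2, 0.2, 0.7])

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

print(mse(true_1, pred_1))  # 0.13 -> "worse" by MSE, but better for the user
print(mse(true_2, pred_2))  # 0.05 -> "better" by MSE, but the relevant item is last
```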
The same logic goes for classification metrics (precision, recall), which also consider all items.
What do all of the described metrics have in common? They all treat every item equally and do not differentiate between highly and less relevant results. That is why they are called unranked.
Having gone through these problematic examples, the aspect we should focus on while designing a ranking metric becomes clearer:
A ranking metric should put more weight on more relevant results while down-weighting or ignoring the less relevant ones.
Kendall Tau distance
Kendall Tau distance is based on the number of rank inversions.
An inversion is a pair of documents (i, j) such that document i, having a greater relevance than document j, appears after j in the search results.
Kendall Tau distance counts the total number of inversions in the ranking. The lower the number of inversions, the better the search result is. Though the metric might look logical, it still has a downside which is demonstrated in the example below.
It seems that the second search result is better with only 8 inversions versus 9 in the first one. Similarly to the MSE example above, the user is only interested in the first relevant result. By having to go through several non-relevant results in the second case, the user experience will be worse than in the first case.
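For reference, a brute-force sketch that counts inversions for a list of binary relevance labels given in the order shown to the user (the toy rankings are made up):
```python
from itertools import combinations

def kendall_tau_distance(relevance):
    """Count pairs (i, j) where a less relevant document is shown above a more relevant one."""
    return sum(
        1
        for i, j in combinations(range(len(relevance)), 2)
        if relevance[i] < relevance[j]  # lower-relevance document appears first
    )

# Toy rankings with binary relevance (1 = relevant, 0 = not)
print(kendall_tau_distance([1, 0, 0, 0]))  # relevant doc on top    -> 0 inversions
print(kendall_tau_distance([0, 0, 0, 1]))  # relevant doc at the end -> 3 inversions
```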
Precision@k & Recall@k
Instead of the usual precision and recall, it is possible to consider only a certain number of top recommendations k. This way, the metric does not care about low-ranked results. Depending on the chosen value of k, the corresponding metrics are denoted as precision@k (“precision at k”) and recall@k (“recall at k”) respectively. Their formulas are shown below.
Imagine the top k results are shown to the user, where each result is either relevant or not. precision@k measures the proportion of relevant results among the top k results. At the same time, recall@k evaluates the ratio of relevant results among the top k to the total number of relevant items in the whole dataset.
To better understand how these metrics are calculated, let us refer to the example below.
There are 7 documents in the system (named from A to G). Based on its predictions, the algorithm chooses k = 5 of them for the user. As we can notice, there are 3 relevant documents (A, C, G) among the top k = 5, which results in precision@5 being equal to 3 / 5. At the same time, recall@5 takes into account all relevant items in the whole dataset: there are 4 of them (A, C, F and G), making recall@5 = 3 / 4.
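The same calculation as a short sketch; the retrieval order of the non-relevant documents is an assumption made purely for illustration:
```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant)

# Hypothetical retrieval order for the 7-document example
retrieved = ["A", "B", "C", "D", "G"]          # top k = 5 shown to the user
relevant = {"A", "C", "F", "G"}                # all relevant documents in the dataset

print(precision_at_k(retrieved, relevant, 5))  # 3 / 5 = 0.6
print(recall_at_k(retrieved, relevant, 5))     # 3 / 4 = 0.75
```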
recall@k always increases with the growth of k, making this metric not really objective in some scenarios. In the edge case where all the items in the system are shown to the user, the value of recall@k equals 100%. precision@k does not have the same monotonic property, since it measures the ranking quality in relation to the top k results, not in relation to the number of relevant items in the whole system. Objectivity is one of the reasons precision@k is usually preferred over recall@k in practice.
AP@k (Average Precision) & MAP@k (Mean Average Precision)
The issue with vanilla precision@k is that it does not take into account the order in which relevant items appear among the retrieved documents. For example, if there are 10 retrieved documents with 2 of them being relevant, precision@10 will always be the same regardless of the location of those 2 documents among the 10. For instance, whether the relevant items are located at positions (1, 2) or (9, 10), the metric does not differentiate between these two cases, resulting in precision@10 being equal to 0.2 either way.
Nonetheless, in real life, the system should give a higher weight to relevant documents ranked at the top rather than at the bottom. This issue is solved by another metric called average precision (AP). Like normal precision, AP takes values between 0 and 1.
AP@k calculates the average value of precision@i over all values of i from 1 to k for which the i-th document is relevant.
In the figure above, we can see the same 7 documents. The response to the query Q₁ resulted in k = 5 retrieved documents where the 3 relevant documents are located at indexes (1, 3, 4). For each of those positions i, precision@i is calculated:
- precision@1 = 1 / 1
- precision@3 = 2 / 3
- precision@4 = 3 / 4
All other indexes i, corresponding to non-relevant documents, are ignored. The final value of AP@5 is computed as the mean of the precisions above:
- AP@5 = (precision@1 + precision@3 + precision@4) / 3 = 0.81
For comparison, let us look at the response to another query Q₂ which also contains 3 relevant documents among the top k. Nevertheless, this time, 2 irrelevant documents are located higher in the ranking (at positions (1, 3)) than in the previous case, which results in a lower AP@5 equal to 0.53.
Sometimes there is a need to evaluate the quality of the algorithm not on a single query but on multiple queries. For that purpose, the mean average precision (MAP) is used. It simply takes the mean of AP over several queries Q:
The example below shows how MAP is calculated for 3 different queries:
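As a complement to the figure, here is a small sketch computing AP@k and MAP. The binary relevance lists for Q₁ and Q₂ follow the positions described above, while the third query is made up:
```python
def average_precision_at_k(relevance, k):
    """AP@k for a list of binary relevance labels in ranked order."""
    hits, precisions = 0, []
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)          # precision@i at each relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries, k):
    """MAP@k: mean of AP@k over several queries."""
    return sum(average_precision_at_k(q, k) for q in queries) / len(queries)

q1 = [1, 0, 1, 1, 0]   # relevant documents at positions (1, 3, 4) -> AP@5 ≈ 0.81
q2 = [0, 1, 0, 1, 1]   # relevant documents at positions (2, 4, 5) -> AP@5 ≈ 0.53
q3 = [1, 1, 0, 0, 0]   # hypothetical third query                  -> AP@5 = 1.0

print(average_precision_at_k(q1, 5), average_precision_at_k(q2, 5))
print(mean_average_precision([q1, q2, q3], 5))   # ≈ 0.78
```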
RR (Reciprocal Rank) & MRR (Mean Reciprocal Rank)
Sometimes users are interested only in the first relevant result. Reciprocal rank is a metric which returns a number between 0 and 1 indicating how far from the top the first relevant result is located: if that document is located at position k, then the value of RR is 1 / k.
Similarly to AP and MAP, mean reciprocal rank (MRR) measures the average RR over several queries.
The example below shows how RR and MRR are computed for 3 queries:
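A short sketch of the same computation; the binary relevance lists are made up and do not correspond to the figure:
```python
def reciprocal_rank(relevance):
    """1 / (position of the first relevant document), or 0 if there is none."""
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(queries):
    """Average RR over several queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

queries = [
    [0, 1, 0, 0],  # first relevant document at position 2 -> RR = 1/2
    [1, 0, 0, 0],  # first relevant document at position 1 -> RR = 1
    [0, 0, 0, 1],  # first relevant document at position 4 -> RR = 1/4
]
print([reciprocal_rank(q) for q in queries])   # [0.5, 1.0, 0.25]
print(mean_reciprocal_rank(queries))           # ≈ 0.58
```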
Though ranked metrics take the ranking positions of items into account, and are thus a preferable choice over the unranked ones, they still have a significant downside: information about user behaviour is not taken into account.
User-oriented approaches make certain assumptions about user behaviour and, based on them, produce metrics that suit ranking problems better.
DCG (Discounted Cumulative Gain) & nDCG (Normalized Discounted Cumulative Gain)
The usage of the DCG metric is based on the following assumption:
Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks) — Wikipedia
This assumption naturally reflects how users evaluate higher search results compared to those presented lower.
In DCG, each document is assigned a gain which indicates how relevant a particular document is. Given a true relevance Rᵢ (real value) for every item, there exist several ways to define the gain. One of the most popular is:
Basically, the exponent puts a strong emphasis on relevant items. For example, if the rating of a movie is an integer between 0 and 5, then every movie will have approximately double the importance of a movie whose rating is lower by 1:
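As a concrete check, assuming the common exponential gain gainᵢ = 2^Rᵢ − 1: a movie rated 5 gets a gain of 2⁵ − 1 = 31, which is roughly double the gain 2⁴ − 1 = 15 of a movie rated 4.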
Apart from that, each item receives a discount value based on its ranking position: the further down the ranking an item appears, the larger the corresponding discount. The discount acts as a penalty by proportionally reducing the item’s gain. In practice, the discount is usually chosen as a logarithmic function of the ranking index:
Finally, DCG@k is defined as the sum of gain over discount for the first k retrieved items:
Replacing gainᵢ and discountᵢ with the formulas above, the expression takes the following form:
To make the DCG metric more interpretable, it is usually normalised by the maximum possible value DCGₘₐₓ, obtained in the case of a perfect ranking where all items are correctly sorted by their relevance. The resulting metric is called nDCG and takes values between 0 and 1.
In the figure below, an example of DCG and nDCG calculation for 5 documents is shown.
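A minimal sketch of the computation, assuming the exponential gain 2^Rᵢ − 1 and the logarithmic discount log₂(i + 1); the relevance scores are made up and do not correspond to the figure:
```python
import math

def dcg_at_k(relevance, k):
    """DCG@k with gain 2^R - 1 and discount log2(i + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevance[:k], start=1)
    )

def ndcg_at_k(relevance, k):
    """nDCG@k: DCG@k divided by the DCG@k of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal_dcg if ideal_dcg > 0 else 0.0

relevance = [3, 2, 3, 0, 1]            # hypothetical relevance of the 5 shown documents
print(dcg_at_k(relevance, 5))          # ≈ 12.78
print(ndcg_at_k(relevance, 5))         # ≈ 0.96
```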
RBP (Rank-Biased Precision)
In the RBP workflow, the user does not intend to examine every possible item. Instead, he or she sequentially progresses from one document to the next with probability p and, with the complementary probability 1 − p, terminates the search procedure at the current document. Each termination decision is taken independently and does not depend on the depth of the search. According to the conducted research, such user behaviour has been observed in many experiments. Based on the information from Rank-Biased Precision for Measurement of Retrieval Effectiveness, the workflow can be illustrated with the diagram below.
The parameter p is called persistence.
In this paradigm, the user always looks at the 1st document, then looks at the 2nd document with probability p, at the 3rd document with probability p², and so on. Ultimately, the probability of examining document i equals:
The user ends the search at document i only when document i has just been examined and the search procedure is then immediately terminated, which happens with probability 1 − p.
After that, it is possible to estimate the expected number of examined documents. Since 0 ≤ p < 1, the series below is convergent and the expression can be transformed into the following form:
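For reference, the resulting closed form is the sum of a geometric series: 1 + p + p² + … = 1 / (1 − p), so the user examines 1 / (1 − p) documents on average.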
Similarly, given each document’s relevance Rᵢ, let us find the expected document relevance. Higher values of expected relevance indicate that the user will be more satisfied with the documents he or she decides to examine.
Finally, RBP is computed as the ratio of the expected document relevance (utility) to the expected number of examined documents:
The RBP formulation ensures that it takes values between 0 and 1. Normally, relevance scores are binary (1 if a document is relevant, 0 otherwise) but can also take real values between 0 and 1.
An appropriate value of p should be chosen based on how persistent users are in the system. Small values of p (less than 0.5) place more emphasis on top-ranked documents. With greater values of p, the weight on the first positions is reduced and is distributed across lower positions. Sometimes it might be difficult to figure out a good value of the persistence p, so it is better to run several experiments and choose the p which works best.
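A small sketch, assuming the closed form RBP = (1 − p) · Σᵢ Rᵢ · pⁱ⁻¹ that follows from the derivation above; the relevance scores are made up:
```python
def rbp(relevance, p):
    """Rank-biased precision: (1 - p) * sum_i R_i * p^(i - 1).

    `relevance` holds R_i in ranked order (binary or in [0, 1]),
    `p` is the persistence parameter.
    """
    return (1 - p) * sum(rel * p ** i for i, rel in enumerate(relevance))

ranking = [1, 0, 1, 0, 0, 1]         # hypothetical binary relevance in ranked order
print(rbp(ranking, p=0.5))           # top-heavy evaluation            ≈ 0.64
print(rbp(ranking, p=0.9))           # weight spread to lower positions ≈ 0.24
```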
ERR (Expected Reciprocal Rank)
As the name suggests, this metric measures the expected value of the reciprocal rank, where the expectation is taken over a model of user behaviour.
This model is similar to the one used in RBP but with a slight difference: if the current item is relevant to the user (with probability Rᵢ), then the search procedure ends. Otherwise, if the item is not relevant (with probability 1 − Rᵢ), the user decides with probability p whether he or she wants to continue the search. If so, the search proceeds to the next item; otherwise, the user ends the search procedure.
Following the presentation on offline evaluation by Ilya Markov, let us derive the formula for the ERR calculation.
First of all, let us calculate the probability that the user examines document i. Basically, this means that none of the i − 1 previous documents made the user stop: at each of those steps the document was not relevant and the user proceeded to the next item with probability p:
If the user stops at document i, it means that this document has been examined and, with probability Rᵢ, the user has decided to terminate the search procedure there. The reciprocal rank corresponding to this event equals 1 / i.
From here, by simply using the formula for the expected value, it is possible to compute the expected reciprocal rank:
The parameter p is usually chosen close to 1.
As in the case of RBP, the values of Rᵢ can be either binary or real numbers in the range from 0 to 1. An example of an ERR calculation for a set of 6 documents is shown in the figure below.
On the left, all of the retrieved documents are sorted in descending order of their relevance, resulting in the best possible ERR. In contrast, on the right, the documents are presented in ascending order of their relevance, resulting in the worst possible ERR.
The ERR formula assumes that all relevance scores are in the range from 0 to 1. If the initial relevance scores are given outside of that range, they need to be normalised. One of the most popular ways to do this is to normalise them exponentially:
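Putting the pieces together, here is a sketch following the derivation above: the probability of reaching document i accumulates the factors (1 − Rⱼ) · p of all previous documents, the user stops there with probability Rᵢ, and the reciprocal rank of that event is 1 / i. The exponential mapping (2^g − 1) / 2^gmax used to turn integer grades into probabilities is a common choice, and the grades themselves are made up:
```python
def err(relevances, p=1.0):
    """Expected reciprocal rank for relevance probabilities R_i in [0, 1]."""
    err_value, prob_reaching_i = 0.0, 1.0
    for i, rel in enumerate(relevances, start=1):
        err_value += prob_reaching_i * rel / i   # user stops here with probability R_i
        prob_reaching_i *= (1 - rel) * p         # user continues past document i
    return err_value

def exponential_normalisation(grades, max_grade):
    """Map integer grades (e.g. 0..4) to probabilities via (2^g - 1) / 2^g_max."""
    return [(2 ** g - 1) / 2 ** max_grade for g in grades]

grades = [4, 2, 0, 3, 1, 0]                  # hypothetical graded relevance of 6 documents
rel = exponential_normalisation(grades, max_grade=4)
print(err(rel))                              # p = 1 gives the classic ERR formulation
print(err(sorted(rel, reverse=True)))        # the best ordering yields a higher ERR
```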
We have discussed all of the main metrics used for quality evaluation in information retrieval. User-oriented metrics are used more often because they reflect real user behaviour. Moreover, the nDCG, RBP and ERR metrics have an advantage over the other metrics we have looked at so far: they work with multiple relevance levels, making them more versatile than metrics like AP, MAP or MRR, which are designed only for binary levels of relevance.
Unfortunately, all of the described metrics are either discontinuous or flat, making the gradient at problematic points equal to 0 or even undefined. As a consequence, it is difficult for most ranking algorithms to optimise these metrics directly. Nonetheless, a lot of research has been carried out in this area, and many advanced heuristics have appeared under the hood of the most popular ranking algorithms to resolve this issue.
All images unless otherwise noted are by the author.