Metrics to measure the gap between neural text and human text
Recently, large language models have shown a tremendous ability to generate human-like text. Many metrics exist to measure how close a text generated by an LLM is to a reference human text, and closing this gap is an active area of research.
In this post, we look at two well-known metrics for automatically evaluating machine-generated text.
Suppose you are given a human-written reference text and a candidate text generated by an LLM. To compute the semantic similarity between the two, BERTScore computes pairwise cosine similarities between their token embeddings. See the image below:
Here the reference text is “the weather is cold today” and the machine-generated candidate text is “it’s freezing today”. An n-gram overlap metric would give this pair a low score, even though the two sentences are clearly semantically similar. BERTScore instead computes a contextual embedding for every token in both the reference and the candidate, and from these embedding vectors it computes the pairwise cosine similarities.
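To make this concrete, here is a minimal sketch of the pairwise-similarity step using Hugging Face `transformers` and `torch`. The model name and the `token_embeddings` helper are illustrative assumptions; the official BERTScore implementation selects its own default model and layer per language.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # assumption: any BERT-like encoder works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_embeddings(text: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings for each token in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return hidden / hidden.norm(dim=-1, keepdim=True)

ref_emb = token_embeddings("the weather is cold today")
cand_emb = token_embeddings("it's freezing today")

# Pairwise cosine similarities: entry (i, j) compares reference token i with
# candidate token j. With normalized vectors this is just a dot product.
sim_matrix = ref_emb @ cand_emb.T
print(sim_matrix.shape)
```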
From the pairwise cosine similarities, we can then compute precision, recall, and an F1 score as follows (a short code sketch follows the list):
- Recall: take the maximum cosine similarity for each token in the reference text and average them
- Precision: take the maximum cosine similarity for each token in the candidate text and average them
- F1 score: the harmonic mean of precision and recall
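Continuing the sketch above, the greedy matching over the similarity matrix looks like this. Note that the real implementation also drops special tokens ([CLS], [SEP]) and can apply baseline rescaling, which are omitted here for brevity.

```python
# Best match for each reference token (row-wise max) gives recall;
# best match for each candidate token (column-wise max) gives precision.
recall = sim_matrix.max(dim=1).values.mean()
precision = sim_matrix.max(dim=0).values.mean()
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```

In practice you would typically not hand-roll this: the reference implementation ships as the pip-installable `bert-score` package, whose `score(candidates, references, lang="en")` function returns per-sentence precision, recall, and F1 tensors.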
BERTScore [1] also proposes a modification to the above scores called “importance weighting”. Importance weighting accounts for the fact that rare words that are common between two sentences are more…
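The importance weighting in the paper is based on inverse document frequency (IDF): each reference token’s best-match similarity is weighted by how rare that token is across the reference corpus, so matching a rare word counts for more than matching a frequent one. Below is a minimal sketch of that idea; the smoothing and the handling of unseen tokens here are simplified illustrations, not the paper’s exact formulation.

```python
import math
from collections import Counter

def idf_weights(tokenized_refs):
    """Plus-one-smoothed inverse document frequency over the reference sentences."""
    num_refs = len(tokenized_refs)
    doc_freq = Counter(tok for ref in tokenized_refs for tok in set(ref))
    return {tok: math.log((num_refs + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_recall(ref_tokens, max_sims, idf):
    """Recall where each reference token's best-match similarity is weighted by its IDF."""
    weights = [idf.get(tok, 0.0) for tok in ref_tokens]
    total = sum(weights)
    if total == 0.0:  # fall back to the unweighted average if all weights vanish
        return sum(max_sims) / len(max_sims)
    return sum(w * s for w, s in zip(weights, max_sims)) / total
```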