Why Probabilistic Linkage is More Accurate than Fuzzy Matching or Term Frequency based approaches

How effectively do different approaches to record linkage use information within the records to make predictions?

Towards Data Science
Wringing information out of data. Image created by the author using DALL·E 3

A pervasive data quality problem is to have multiple different records that refer to the same entity, but no unique identifier that ties these entities together.

In the absence of a unique identifier such as a Social Security number, we can use a combination of individually non-unique variables such as name, gender and date of birth to identify individuals.

To get the best accuracy in record linkage, we need a model that wrings as much information out of this input data as possible.

This article describes the three types of information that are most important in making an accurate prediction, and how all three are leveraged by the Fellegi-Sunter model used in Splink.

It also describes how some alternative record linkage approaches throw away some of this information, leaving accuracy on the table.

The three types of information

Broadly, there are three categories of information that are relevant when trying to predict whether a pair of records match:

  1. Similarity of the pair of records
  2. Frequency of values in the overall dataset, and more broadly, how common different scenarios are
  3. Data quality of the overall dataset

Let’s look at each in turn.

1. Similarity of the pairwise record comparison: Fuzzy matching

The most obvious way to predict whether two records represent the same entity is to measure whether the columns contain the same or similar information.

The similarity of each column can be measured quantitatively using fuzzy matching functions like Levenshtein or Jaro-Winkler for text, or numeric differences such as absolute or percentage difference.

For example, Hammond vs Hamond has a Jaro-Winkler similarity of 0.97 (1.0 is a perfect score). It’s probably a typo.
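To make this concrete, here is a minimal, unoptimised implementation of the Jaro-Winkler similarity (a library such as jellyfish would normally be used instead) that reproduces the score above:

```python
def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro-Winkler similarity: 1.0 means identical strings, 0.0 no similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters "match" if equal and within half the longer length of each other
    window = max(len1, len2) // 2 - 1
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    seq1 = [c for i, c in enumerate(s1) if matched1[i]]
    seq2 = [c for j, c in enumerate(s2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    jaro = (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3
    # Winkler boost: reward a shared prefix of up to 4 characters
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)

print(round(jaro_winkler("Hammond", "Hamond"), 2))  # 0.97
```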

These measures can be assigned weights and summed to compute an overall similarity score.

This approach is sometimes known as fuzzy matching, and it is an important part of an accurate linkage model.

However, using this approach alone has a major drawback: the weights are arbitrary:

  • The importance of different fields has to be guessed at by the user. For example, what weight should be assigned to a match on age? How does this compare to a match on first name? And how should we choose the size of the punitive weights when information doesn’t match?
  • The relationship between each fuzzy matching metric and the strength of the prediction has to be guessed by the user, as opposed to being estimated. For example, how much should our prediction change if the first name is a Jaro-Winkler 0.9 fuzzy match as opposed to an exact match? Should it change by the same amount if the Jaro-Winkler score reduces to 0.8?
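A sketch of such a hand-weighted fuzzy score makes the drawback concrete. The column weights and the 0.5 threshold below are arbitrary guesses (and difflib's ratio stands in for Jaro-Winkler), which is precisely the problem:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # difflib's ratio as a stand-in for Jaro-Winkler / Levenshtein
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Why these weights? Pure guesswork - nothing in the data justifies them
WEIGHTS = {"first_name": 0.4, "surname": 0.4, "age": 0.2}

def fuzzy_score(rec1: dict, rec2: dict) -> float:
    return sum(w * similarity(str(rec1[col]), str(rec2[col]))
               for col, w in WEIGHTS.items())

r1 = {"first_name": "John", "surname": "Hammond", "age": "41"}
r2 = {"first_name": "John", "surname": "Hamond", "age": "41"}
score = fuzzy_score(r1, r2)
print(score > 0.5)  # classified as a match - but the 0.5 cutoff is also a guess
```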

2. Frequency of values in the overall dataset, or more broadly, how common different scenarios are

We can improve on fuzzy matching by accounting for the frequency of values in the overall dataset (sometimes known as ‘term frequencies’).

For example, John vs John and Joss vs Joss are both exact matches, so they have the same similarity score, but the latter is stronger evidence of a match than the former, because Joss is an unusual name.

The relative term frequencies of John vs Joss provide a data-driven estimate of the relative importance of these different names, which can be used to inform the weights.

This idea can be extended to encompass similar records that are not an exact match. Weights can be derived from an estimate of how common it is to observe fuzzy matches across the dataset. For example, if it is very common to see fuzzy matches on first name at a Jaro-Winkler score of 0.7, even amongst non-matching records, then observing such a match does not offer much evidence in favour of the records being a match. In probabilistic linkage, this information is captured in parameters known as the u probabilities, which are described in more detail here.
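For an exact match on a name, the u probability is roughly the name's term frequency: the chance that two records that do not truly match happen to agree on that name. Expressing the evidence as a Fellegi-Sunter style match weight, log2(m/u), shows how rarity translates into strength of evidence. All the names, frequencies and the m value below are made up for illustration:

```python
import math

# Illustrative (made-up) term frequencies of first names in a dataset
term_frequency = {"John": 0.032, "Joss": 0.0004}

# Illustrative m probability: P(names agree | records truly match)
m_exact = 0.95

def match_weight(name: str) -> float:
    # u: P(names agree | records do not match) ~= term frequency of the name
    u = term_frequency[name]
    return math.log2(m_exact / u)  # Fellegi-Sunter partial match weight

print(f"John: {match_weight('John'):.1f} bits of evidence")
print(f"Joss: {match_weight('Joss'):.1f} bits of evidence")
# Joss carries several more bits of evidence than John,
# even though both are exact matches
```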

3. Data quality of the overall dataset: measuring the importance of non-matching information

We’ve seen that fuzzy matching and term frequency based approaches can allow us to score the similarity between records, and even, to some extent, weight the importance of matches on different columns.

However, none of these techniques help quantify the relative importance of non-matches to the predicted match probability.

Probabilistic methods explicitly estimate the relative importance of these scenarios by estimating data quality. In probabilistic linkage, this information is captured in the m probabilities, which are defined more precisely here.

For example, if the data quality in the gender variable is extremely high, then a non-match on gender would be strong evidence against the two records being a true match.

Conversely, if records have been observed over a number of years, a non-match on age would not be strong evidence against the two records being a match.
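The two examples above can be sketched numerically. The m probability for a column is the chance that truly-matching records agree on it, so high data quality means m close to 1, and a disagreement then attracts a heavily negative weight, log2((1-m)/(1-u)). All the parameter values here are illustrative, not estimated:

```python
import math

# Illustrative parameters per column:
#   m: P(column agrees | records truly match) - reflects data quality
#   u: P(column agrees | records do not match)
params = {
    "gender": {"m": 0.999, "u": 0.5},   # very clean column
    "age":    {"m": 0.70,  "u": 0.05},  # records observed over several years
}

def non_match_weight(col: str) -> float:
    m, u = params[col]["m"], params[col]["u"]
    # Weight of evidence when the column does NOT agree
    return math.log2((1 - m) / (1 - u))

print(f"gender non-match: {non_match_weight('gender'):.1f} bits")
print(f"age non-match:    {non_match_weight('age'):.1f} bits")
# A gender mismatch is strongly negative evidence;
# an age mismatch is only mildly negative
```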

Probabilistic linkage

Much of the power of probabilistic models comes from combining all three sources of information in a way that is not possible in other models.

Not only is all of this information incorporated in the prediction, the partial match weights in the Fellegi-Sunter model enable the relative importance of the different types of information to be estimated from the data itself, and hence weighted together appropriately to optimise accuracy.
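The combination can be sketched as follows: starting from a prior, each column contributes a partial match weight (positive for agreement, negative for disagreement) and the total is converted back into a probability. The m/u parameters below are hard-coded for illustration; a tool like Splink would estimate them from the data itself:

```python
import math

# Illustrative m/u parameters per column (hard-coded here;
# in practice these are estimated from the data)
params = {
    "first_name": {"m": 0.90,  "u": 0.01},
    "surname":    {"m": 0.95,  "u": 0.005},
    "gender":     {"m": 0.999, "u": 0.5},
}

def predict(agreements: dict, prior: float = 1e-4) -> float:
    """Combine partial match weights into a match probability."""
    log_odds = math.log2(prior / (1 - prior))  # start from the prior
    for col, agrees in agreements.items():
        m, u = params[col]["m"], params[col]["u"]
        if agrees:
            log_odds += math.log2(m / u)              # positive evidence
        else:
            log_odds += math.log2((1 - m) / (1 - u))  # negative evidence
    odds = 2 ** log_odds
    return odds / (1 + odds)

p_all = predict({"first_name": True, "surname": True, "gender": True})
p_mix = predict({"first_name": True, "surname": True, "gender": False})
print(f"all columns agree:  {p_all:.3f}")
print(f"gender disagrees:   {p_mix:.3f}")
```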

Conversely, fuzzy matching techniques often use arbitrary weights and cannot fully incorporate information from all three sources. Term frequency approaches lack the ability to use information about data quality to negatively weight non-matching information, and lack a mechanism to appropriately weight fuzzy matches.

The author is the developer of Splink, a free and open source Python package for probabilistic linkage at scale.
