An Interactive Introduction to Record Linkage (Data Deduplication) in the Fellegi-Sunter framework

Originally posted: 2021-05-20. Last updated: 2023-09-12.

This is part 1 of the tutorial

#Aims

This is part one of a series of interactive articles that aim to provide an introduction to the theory of probabilistic record linkage and deduplication.

In this article I provide a high-level introduction to the Fellegi-Sunter framework and an interactive example of a linkage model.

Subsequent articles explore the theory in more depth.

These materials align closely to the probabilistic model used by Splink, a free software package for record linkage at scale.

These articles cover the theory only. For practical model building using Splink, see the tutorial in the Splink docs.

#What is probabilistic record linkage?

Probabilistic record linkage is a technique used to link together records that lack unique identifiers.

In the absence of a unique identifier such as a National Insurance number, we can use a combination of individually non-unique variables such as name, gender and date of birth to identify individuals.

Record linkage can be done within datasets (deduplication), between datasets (linkage), or both¹.

Linkage is 'probabilistic' in the sense that it is subject to uncertainty and relies on the balance of evidence. For instance, in a large dataset, observing that two records match on the full name John Smith provides some evidence that these two records may refer to the same person, but this evidence is inconclusive because it's possible there are two different John Smiths.

More broadly, it is often impossible to classify pairs of records as matches or non-matches beyond any doubt. Instead, the aim of probabilistic record linkage is to quantify the probability that a pair of records refers to the same entity by considering evidence in favour of and against a match and weighting it appropriately.

The most common type of probabilistic record linkage model is called the Fellegi-Sunter model.

We start with a prior, which represents the probability that two records drawn at random are a match. We then compare the two records, increasing the match probability where information in the two records agrees, and decreasing it where information differs.

The amount by which we increase or decrease the match probability is determined by the 'partial_match_weights' of the model.

For example, a match on postcode gives us more evidence in favour of a match than a match on gender, since the latter is much more likely to occur by chance.
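To make the postcode versus gender comparison concrete, here is a minimal sketch. It assumes the usual Fellegi-Sunter definition of a partial match weight as log2(m / u), where m is the probability of observing agreement among true matches and u is the probability of observing agreement among non-matches. The m and u values below are invented for illustration and are not estimated from any dataset.

```python
import math


def partial_match_weight(m: float, u: float) -> float:
    """Return log2(m / u), the weight of evidence for one comparison outcome.

    m: assumed probability of agreement given the records truly match.
    u: assumed probability of agreement given the records do not match.
    """
    return math.log2(m / u)


# Hypothetical values: gender agrees by chance about half the time,
# whereas two unrelated people rarely share a postcode.
gender_weight = partial_match_weight(m=0.98, u=0.5)
postcode_weight = partial_match_weight(m=0.90, u=0.0001)

print(f"gender:   {gender_weight:.2f}")    # about 0.97
print(f"postcode: {postcode_weight:.2f}")  # about 13.14
```

Under these assumed values, agreement on postcode shifts the match weight far more than agreement on gender, because chance agreement on postcode (u) is so much rarer.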

The final prediction is a simple calculation: we sum up partial_match_weights to compute a final match weight, which is then converted into a probability.
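The calculation can be sketched in a few lines. This is an illustration rather than Splink's implementation: it assumes match weights are expressed in log base 2, that the prior enters the sum as log2(prior / (1 - prior)), and that a match weight w converts to a probability as 2^w / (1 + 2^w). The function name and the example values are hypothetical.

```python
import math
from typing import Iterable


def match_probability(prior: float, partial_match_weights: Iterable[float]) -> float:
    """Combine a prior with partial match weights and return a probability.

    Assumes all weights are log2 Bayes factors.
    """
    # Express the prior as a weight (log2 of the prior odds).
    prior_weight = math.log2(prior / (1 - prior))

    # The final match weight is the simple sum described above.
    match_weight = prior_weight + sum(partial_match_weights)

    # Convert the match weight back from log2 odds to a probability.
    odds = 2 ** match_weight
    return odds / (1 + odds)


# Hypothetical example: a prior of 0.2 corresponds to a weight of -2.
# Partial match weights of +3 and -1 bring the total to 0, which is
# even odds, so the result is approximately 0.5.
print(match_probability(prior=0.2, partial_match_weights=[3.0, -1.0]))
```

With no partial match weights, the function simply returns the prior, which is a useful sanity check on the conversion.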

#Example

Let's take a look at an example of a simple Fellegi-Sunter model to calculate match probability interactively. This model will compare the two records in the table, and assess whether they refer to the same person, or different people.

You may edit the values in the table to see how the match probability changes.

We can decompose this calculation into the sum of the partial_match_weights using a waterfall chart, which is read from left to right. We start with the prior, and take each column into account in turn. The size of the bar corresponds to the partial_match_weight.
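The numbers behind such a waterfall chart can be sketched as a running total. The column names, the prior and the weights below are hypothetical, and the sketch assumes weights are in log base 2, so that a cumulative weight w corresponds to a probability of 2^w / (1 + 2^w).

```python
import math
from itertools import accumulate


def weight_to_probability(weight: float) -> float:
    """Convert a log2 match weight into a probability."""
    odds = 2 ** weight
    return odds / (1 + odds)


# Hypothetical prior: 1 in 1,000 randomly chosen pairs is a match.
prior = 0.001

# One bar per step, read left to right: the prior first, then each column.
bars = [
    ("prior", math.log2(prior / (1 - prior))),
    ("first_name", 5.0),
    ("surname", 6.5),
    ("date_of_birth", 8.0),
    ("gender", -3.0),  # a disagreement lowers the running total
]

# Running total of match weight after each bar is taken into account.
running_weights = list(accumulate(weight for _, weight in bars))

for (label, weight), total in zip(bars, running_weights):
    print(
        f"{label:<14} bar={weight:+7.2f}  "
        f"cumulative={total:+7.2f}  "
        f"probability={weight_to_probability(total):.4f}"
    )

# The last running total plays the role of the rightmost bar.
final_probability = weight_to_probability(running_weights[-1])
```

Each printed row corresponds to one bar: its height is the partial match weight, and the cumulative column is where the bar ends on the match weight axis.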

You can hover over the bars to see how the probability changes as each subsequent field is taken into account.

The final estimated match probability is shown in the rightmost bar. Note that the y axis on the right converts match weight into probability.

In the next article, we will look at partial match weights in great depth.

#Footnotes

¹ Record linkage and deduplication are equivalent problems. The only difference is that linkage involves finding matching entities across datasets and deduplication involves finding matches within datasets.

Probabilistic Linkage Tutorial Navigation:

An Interactive Introduction to Record Linkage (Data Deduplication) in the Fellegi-Sunter framework
Partial match weights
m and u values in the Fellegi-Sunter model
The mathematics of the Fellegi Sunter model
Computing the Fellegi Sunter model
Why Probabilistic Linkage is More Accurate than Fuzzy Matching For Data Deduplication
The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models
An alternative way to think about predicted probabilities in the Fellegi Sunter model
