Computing the Fellegi Sunter model

The previous article showed how we could derive a mathematical representation of the Fellegi-Sunter model.

This article contains a step-by-step demonstration of how we use this mathematical model to compute predictions from data.

Objective of the Model

The aim of the model is to compute a prediction of which records match. This prediction is a probability that quantifies the likelihood the two records represent the same entity.

Computing the prediction step by step

We begin with some data. In this example we will attempt to link but not deduplicate these records.

Input table 1

Input table 2:

In the first step of the calculation we compare each record in the first table with all records in the second table.1

We wish to predict which of these pairwise record comparisons represent the same entity (i.e. which records match).

Table scrolls right →

At first glance, it may seem challenging to turn this table of text into a numeric prediction. But this can be achieved using the concepts of scenarioA scenario defines a category of similarity for a subset of information in the input records, often a column.s and partial_match_weightA partial match weight quantifies the importance of a piece of information, or scenario, in the Fellegi-Sunter model.s introduced earlier in this tutorial.

Specifically, we define scenarios for each column and then use a lookup table to find the partial match weights.

For example, we may choose to define our scenarios as follows:

Note the ‘comparison vector value’ column in this table. This is an integer value that identifies the scenario within the column. We will use it to help us look up the correct partial match weights.

We then proceed by computing which scenario is activateA scenario is activated when its conditions are met for the pairwise record comparison at hand.d for each pairwise record comparison, and representing this using the comparison vector value. We use the gamma symbol (γ) to refer to the comparison vector value.

Table scrolls right →

Now that we’ve used the text values to compute the comparison vector value, we no longer need it. What remains are known as the comparison vectors for each pairwise comparison:

Each row in this table is known as the ‘comparison vector’ for the record comparison

We now look up the partial match weights from the comparison vector. I’ve also added the priorThe prior is the probability that two records picked at random from the input data refer to the same entity., expressed as a partial match weight, and the sum of the match weights:

Table scrolls right →

Finally, we can then sum the values in the partial match weights to produce a final match weight. This can then be converted into a probability using:

p=2ω1+2ω\Large{p = \frac{2^{\omega}}{1 + 2^{\omega}}}

We can now see how the original comparison pairs were scored by this simple model:

Table scrolls right →

Footnotes

  1. In real applications, we would typically use blocking to reduce the number of record comparisons.