Originally posted: 2021-11-01. View source code for this page here.

Part 1: Supervised learning

Probabilistic linkage models generate pairwise comparisons of records and make predictions of whether the two records refer to the same entity. They are often used for deduplicating and linking datasets where unique person identifiers are not available.

The linkage score is computed by comparing different columns of the dataset to assess their similarity, and then weighting the importance of these comparisons to produce a final score. For those unfamiliar with these models, you can learn more in my introductory training material.

In the simplest models, the comparison may be a simple equality check. For example, is the first name equal or different? In this model, a match on the first name will increase the match score by a constant weighting. A non-match on first name will decrease the match score by some other constant.

This is clearly a crude model of the real world, and there are a variety of ways in which it can be made more realistic. For example:

  • We may allow more than two match weights to be computed for a given column. For example, we may want a separate 'intermediate' match weight for the case that name is similar but not identical e.g. "Rachel" vs "Rachael".
  • We may allow different match weights for different values of a column. For example, the match weight for a match on a common name like "John" may be lower than for a match on an uncommon name like "Robin".
  • We may want to group together certain columns to avoid double counting. For instance, if postcodes match, we would probably want to avoid further increasing the final match score for a match on a 'city' field.

This kind of feature engineering is dataset-specific and can be time consuming. This post considers whether it's worth it. Do more complex models have higher accuracy, and if so, how much higher? What kinds of feature engineering have the greatest impact on accuracy?

Questions of accuracy are most easily answered if we have access to a realistic test dataset. This dataset must be fully labelled, so that the true match status of each pairwise record comparison is known. The dataset used for this post is a synthetic dataset derived from records in WikiData. The code used to create this dataset if available here. Please contact me if you would like access to the full data.

In this post, I use Splink (software for probabilistic record linkage at scale) to compute match scores, and then compare these scores to the ground truth. The headline finding is that models with more complex feature engineering are dramatically more accurate.

Models are trained using supervised learning: that is, I use the labels to compute model parameters parameters are derived from the labelled data. In part 2, I present the results of the same experiment using unsupervised learning.

We compare the following models, further details of which are provided below.

  1. A rules-based (deterministic) matching approach
  2. Basic two-level equality checks for all columns
  3. Simple fuzzy matching using default Splink settings
  4. More complex fuzzy matches, using customised Splink settings
  5. More complex fuzzy matches and term frequency adjustments
  6. More complex fuzzy matches and term frequency adjustments, with term frequency weights for fuzzy matches

We can summarise the results using a ROC curve.

You can zoom into this chart using mousewheel/touchpad, and move around by dragging.

This chart tells us, for example, that for a false positive rate of 1%:

  • Model 0 achieves recall (a true positive rate) of around 20%
  • Model 1 achieves recall (a true positive rate) of around 38%
  • Model 2 achieves recall (a true positive rate) of around 54%
  • Model 3 achieves recall (a true positive rate) of around 60%
  • Model 4 achieves recall (a true positive rate) of around 68%
  • Model 5 achieves recall (a true positive rate) of around 73%

In this section, I give a brief description of the differences between the models, and present their match weights. I then discuss how they were trained, and a few caveats.

Full reproducibility materials for all results can be found here.

The input data is a synthetic dataset obtained by from the WikiData query service, and then creating duplicating of these records that contain various errors corruptions.

There are a total of 1,105,780 records in the dataset, with 113,232 distinct persons. Cluster sizes range from 1 to 21, with a fairly equal distribution.

An example of some records is as follows, derived from this master record.

1865-12-30B95 5DGjournalistkiplingrudyard
1865-12-60B95 5DGkiplingrudyard
1865-12-30B05 5DGkiplingjoseph
CV37 9TWauthorkiplingrudyardjoe

Each model uses information from all of these columns.

As can be seen from the table, the data quality is poor - a large number of corruptions and nulls have been introduced. This should help to expose the differences in the accuracy of different models: if the data was very high quality, then most records would be trivially matchable and even the simplest model would perform well.

Model 0 is a rules-based (deterministic) matching algorithm. The rules used to generate matches can be found in the code here and are as follows.

deterministic_matching_rules = [ "l.postcode = r.postcode and l.dob = r.dob and l.forename1_dm = r.forename1_dm ", "l.dob = r.dob and l.forename1_dm = r.forename1_dm and l.forename2_dm = r.forename2_dm and l.surname_dm = r.surname_dm", "l.postcode = r.postcode and l.forename1_dm = r.forename1_dm and l.forename2_dm = r.forename2_dm and l.surname_dm = r.surname_dm", "l.forename1_dm = r.forename1_dm and l.forename2_dm = r.forename2_dm and l.surname_dm = r.surname_dm and l.occupation = r.occupation", "l.outward_postcode_std = r.outward_postcode_std and l.forename1_std = r.forename1_std and l.surname_std = r.surname_std", ]

In these rules 'dm' stands for the dmetaphone of a variable.

Since this is a rules based model, there are no match scores and thus only a single point is reachable in ROC space.

The full settings for this model are here.

In this model, a match weights are computed each of the seven columns using a simple two-level check of equality.

The match weights are as follows:

The full settings for this model are here.

The model is the same as model 1, except:

  • forename1_std and surname_std compute a separate match weight for a fuzzy match, where the name does not exactly match, but the jaro winker score is >0.88
  • dob and postcode compute a separate match weight for a fuzzy match, whereby there's a single letter typo (a levenshtein distance of 1)

The match weights are as follows:

The full settings for this model are here.

In this model, a larger number of match weights are computed for each column:

  • forename1_std, forename2_std and surname_std have four match weights, corresponding to an exact match, jaro winker score of >0.88, jaro winker score of > 0.7 and no match.
  • dob has an additional match weight relative to model 2, covering 'suspicious' dates of birth on 1st January. In the dataset, these occur disproportionately, presumably because when the precise dob is unknown, the 1st of January is entered as a default
  • postcode has 5 match weights, which are computed based on various geographic distances between postcodes, and possible typos
  • occupation is a simple two level equality check

The match weights are as follows:

Model 4 is the same as model 3, but in addition term frequency adjustments are made for each column.

The average match weights for model 4 are the same as those for model 3, above.

Model 5 is the same as model 4, but we allow weighted term frequency adjustments for partial matches. The specific weights can be found here.

For example, this allows a term-frequency adjustment to be made for a fuzzy match of Robin vs Robim (note the typo). The weight moderates the strength of this adjustment downwards by a fixed multiple.

The average match weights for model 5 are the same as those for model 3, above.

Since our data is fully labelled, it's possible to compute match weights directly from the labels rather than estimate them. This means that any differences in accuracy can be attributed solely to the differences in how the records are compared (the feature engineering), as opposed to any problems that could occur in training.

The more complex models involve more computation, and so they take longer to run. In practice, how much does this matter?

The following chart shows that increase in runtime for the more complex models is relatively modest.

The term frequency adjustment used in Model 4 have the greatest impact on runtime, but even then, the runtime is only 8% longer than the simplest model.

Further experiments would be needed to determine the extent to which these results generalisable to other datasets. One particular caveat is that the synthetic dataset deliberately contains many errors: there are many records which are of such poor quality that they are difficult to match. This makes it suitable for testing different models, and finding which perform best against these hard-to-match records.

However, for use cases which much higher quality datasets, most of the records will be easy to match, so the gains from using more complex models are likely to be less dramatic.

The experiments outlined in this post suggest that more complex probabilistic linkage models dramatically outperform simpler models. For many use cases, the effort of configuring these models, and the modest addition runtimes, are likely to be worth it.

In part 2 I consider whether these results change when we use unsupervised learning to estimate model parameters.