Probabilistic record linkage

These pages present introductory training material on probabilistic record linkage using the Fellegi-Sunter model. Many of the articles are interactive.

This material presents a simplified version of the model used by Splink, a piece of probabilistic linkage software for which I'm the lead developer.

Many of the graphics reuse Splink's graphical output, and model parameters are represented in the same way as in Splink's settings object.

Training materials on probabilistic linkage

Introductory Interactive Tutorial

  1. An Interactive Introduction to Record Linkage (Data Deduplication) in the Fellegi-Sunter framework
  2. Partial match weights
  3. m and u values in the Fellegi-Sunter model
  4. The mathematics of the Fellegi Sunter model
  5. Computing the Fellegi Sunter model
  6. Why Probabilistic Linkage is More Accurate than Fuzzy Matching For Data Deduplication
  7. The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models
  8. An alternative way to think about predicted probabilities in the Fellegi Sunter model

Other articles

Useful tools

Splink Benchmarking and Performance

Archived Material

Further reading (external links)

  1. Splink: MoJ's open source library for probabilistic record linkage at scale
  2. Splink homepage
  3. Try Splink live in your browser
  4. Interactive settings editor