Originally posted: 2021-05-21. Last updated: 2023-10-02. Live edit this notebook here.
The previous article showed how m
and u
probabilities can be converted into Bayes Factors, which in turn could be combined with the prior probability to compute an updated prediction (the 'posterior' probability).
We now have all the groundwork in place to present a full mathematical definition of a model in the Fellegi-Sunter framework.
A more technical treatment is given in AP Enamorado, T., Fifield, B., & Imai, K. (2019), which contains a slightly more generalised form of the model presented here.
In the previous article we saw what Bayes Factors act as a relative multiplier that increases or decreases the overall prediction of whether the records match.
We saw that, for a Bayes Factor for a specific scenario such as a match on month of birth we could write:
$\text{posterior odds of match} = \text{prior odds of match} \times \text{Bayes Factor}$
It turns out that this formula can be extended to account for the information in multiple scenarios such as: a match on first name and a match on surname and a fuzzy match on date of birth.
Let us denote by $K_i$ the Bayes Factors for each activated scenario^{1}, then the general formula is:
$\text{posterior odds of match} = \text{prior odds} \times K_1 \times K_2 \times \ldots \times K_n \quad (\text{Eq 4.1})$
This formula is quite intuitive. For example:
If both match, it becomes 24x more likely the records match.
Multiplying Bayes Factors to compute a predicted probability in this way sometimes known as a Naive Bayes classifier.^{2}
Since Bayes Factors are not estimated directly, and instead we estimte the m
and u
probabilities of the model, we can use $K_i = \frac{m_i}{u_i}$ to write the full specification of our model:
$\text{posterior odds of match} = \text{prior odds} \times \frac{m_1 m_2 \ldots m_n}{u_1 u_2 \ldots u_n}$
This final result can be converted into a probability using
$\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$
In the first article in this tutorial we claimed that the Fellegi-Sunter model makes its predictions by summing of the partial match weights.
This follows from equation 4.1. Specifically, by taking the logarithm, we can write:
Applying $\log_2$ to both sides, you get:
$\text{final match weight} = \text{prior match weight} + \omega_1 + \omega_2 + \ldots + \omega_n$
where $\omega_i = \log_2(K_i)$ are the partial match weights.
The remainder of this article delves a bit deeper into equation (4.1) and can be safely skipped if you're not interested in the maths. In particular, it shows how the match probability can be expressed in terms of the m
and u
probabilities using a formulation equivalent to equation 4 in the Fastlink paper.
The next article will look at the mechanics of computing the final prediction.
In the previous article we saw that by applying Bayes Theorem to a specific scenario we could write:
$\operatorname{Pr}(\text{match}|\text{first name matches}) = \frac{\operatorname{Pr}(\text{first name matches}|\text{match})\operatorname{Pr}(\text{match})}{\operatorname{Pr}(\text{first name matches}|\text{match})\operatorname{Pr}(\text{match}) + \operatorname{Pr}(\text{first name matches}|\text{non match})\operatorname{Pr}(\text{non match})}$
We can make this easier to read by recalling that
$m = \text{Pr}(\text{scenario}|\text{records match})$
and
$u = \text{Pr}(\text{scenario}|\text{records do not match})$
and denoting $\lambda$ as the prior probability
this can be written:
$\text{posterior probability of match} = \frac{\lambda m}{\lambda m + (1 - \lambda)u}$
By repeated application of Bayes Theorem (see annex here), we can generalise this formula to:
$\text{posterior probability of match} = \frac{\lambda m_1 m_2 \ldots m_n}{\lambda m_1 m_2 \ldots m_n + (1 - \lambda)u_1 u_2 \ldots u_n} \quad (\text{Eq 4.2})$
We can write a similar formula for a non-match:
$\text{posterior probability of non-match} = \frac{(1- \lambda) u_1 u_2 \ldots u_n}{\lambda m_1 m_2 \ldots m_n + (1 - \lambda)u_1 u_2 \ldots u_n} \quad (\text{Eq 4.3})$
Now, recall that:
$\text{odds} = \frac{\text{probability}}{1-\text{probability}}$
This means we can divide (4.2) by (4.3 to obtain the odds). Since they share a denominator this becomes:
$\text{posterior odds of match} = \frac{\lambda m_1 m_2 \ldots m_n}{(1- \lambda) u_1 u_2 \ldots u_n}$
Since $K_i = \frac{m_i}{u_i}$ this is equivalent to Equation 4.1 in this article, and also is equivalent to Equation 4 in the FastLink paper.
The 'activated scenario' is the similarity category for the record comparison under evaluation. For example, the 'activated scenario', three scenarios (similarity categories) may be defined for the first name column - 'exact match', 'fuzzy match' and 'all other'. Each will have a Bayes Factor associated with it. The 'activated scenario' is which of these scenarios describes the record comparison under evaluation. ↩
The result is only valid if we assume that the columns are independent conditional on match status. This is 'naive' in the sense that it's rarely true. However, it turns out for record linkage purposes that Fellegi Sunter models are quite robust to the violation of this assumption. ↩