Originally posted: 2023-09-20. Live edit this notebook here.
In the previous article we saw that the Fellegi-Sunter model makes its predictions with a simple calculation: the sum of the partial_match_weights.
But what are partial match weights and where do they come from?^{1}
Some columns are more important than others to the calculation of overall match probability.^{2} This is quantified by the partial match weights.
For example:
Positive values indicate evidence in favour of a match whilst negative values mean evidence against a match.
The concept applies more generally than just to matches and non-matches: partial match weights can be estimated for more subtle scenarios like 'first name not an exact match, but similar'.
It is up to the modeller to define these scenarios and the resultant partial match weights they wish to estimate.
For first name, we could define the following three scenarios:
The estimated partial match weights may look like this:^{3}
Observe that:
These scenarios are mutually exclusive, and are implemented like a sequence of if
statements:
if first_name_l == first_name_r:first_name_partial_match_weight = 7.2elif jaro_winkler_sim(first_name_l, first_name_r) >= 0.9:first_name_partial_match_weight = 5.9else:first_name_partial_match_weight = -3.3
We can also estimate partial match weights for the remaining columns.
The number of scenarios modelled and the definition of these scenarios can vary depending on the type of information.
For example, since gender is a categorical variable we'd likely have just two partial weights: match, and non-match.
For date of birth, we may define three scenarios, but Levenshtein would be more suitable than Jaro Winkler for assessing similarity.
We can plot all the partial match weights in a single chart like this:
This provides a succinct summary of the whole model.
The partial match weights in the chart have intuitive interpretations.
For example, the partial match weight associated with an exact match on postcode is stronger than the partial match weight associated with gender.
This makes sense: a match on gender doesn't tell us much about whether two records are a match, whereas a match on postcode is much more informative.
The pattern of negative partial match weights is also intuitive: If gender is recorded as categorical variable, then it may be very accurate. A mismatch would thus be strong evidence against a match.
Conversely, a mismatch on postcode may offer little evidence against a match since people move house, or typos may have been made.
Let's take a look at an example record comparison, and see how the partial match weights are used.
You can edit the data in the below table to see how the calculation changes
The first stage is to calculate which scenarios apply for each column. For example, are the first names an exact match, a fuzzy match, or neither?
In the below chart, I reproduce the partial match weights chart, but with the activated partial match weights highlighted.
The activated (selected) partial match weights can then be plotted in a waterfall chart, which represents how they are summed up to compute the final match weight:
Alternatively we can represent the calculation of overall match weight as a sum:
The final match weight is a measure of the similarity of the two records. In subsequent articles we will see how it can be converted into a probability using the formula:
We now have a good qualitative understanding of the meaning of partial match weights.
In the next article we explore how they are calculated, and how to interpret their values.
In this context, the word 'partial' means 'using only a subset of the information in the record' - often a single column. There is then a partial match weight for each column. ↩
In this article, for simplicity, there's a one to one correspondence between columns and partial match weights. Each column is treated as a separate piece of information. In reality, the model allows for more flexibility. We could split a column (e.g. date of birth) into several pieces of information (e.g. day, month, year) and estimate partial match weights for each. Or we could combine several columns (e.g. first, middle and surname) into a single piece of information and estimate a partial match weights for this information. ↩
A later article will explain how to estimate partial match weights. Here I just assume they have already been estimated. ↩