Originally posted: 2025-06-29. Last updated: 2025-06-30.
Address matching is a notoriously difficult problem due to the unpredictable structure of addresses[^1] and the many different ways an address can be written.
Substantial variation can exist between two matching addresses - for instance:
```
Flat 165 Block 3 Philpot Square, Hammersmith And Fulham
165, Philpot Square, London
```
whereas the following pair has far less variation, yet the addresses do not match:
```
Flat A 24 Jubilee Street, London, LO1 23D
Flat A 25 Jubilee Street, London, LO1 23D
```
The challenge is therefore to develop an algorithm that can somehow 'see' that the first pair of addresses are more similar than the second pair.
Unfortunately one of the most effective general approaches to record linkage – known as the Fellegi-Sunter model and implemented in Splink – is not well suited to address matching, in large part because it assumes records have been parsed into a consistent set of columns, which (as we'll see below) is difficult to achieve for addresses.
In this post, I share some tricks and feature engineering techniques we can use to exploit the information in addresses as much as possible, to maximise geocoding accuracy.[^2]
Concrete implementations of some of these tricks can be found in two open source address matching libraries:
- `uk_address_matcher`, a Splink based address matching library focussed on UK addresses
- `whereabouts`, a DuckDB-powered address matching library using signature based geocoding methods. Initially developed for Australian addresses, it also supports the creation of geocoding databases for other countries.

I will assume that we have:

- a list of messy addresses that we wish to match
- a canonical list of addresses (e.g. an official address register) to match against
To be as general as possible, we'll assume the address in these files is just a single string.
The address matching problem can be structured into two main stages:

1. Blocking (candidate search): for each messy address, retrieve a shortlist of plausible candidate addresses from the canonical list.
2. Scoring: rank those candidates to identify the best match.
These steps are independent, so we can mix and match different blocking and scoring techniques.
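As a minimal sketch of this two-stage structure (the `block` and `score` functions here are hypothetical placeholders for the techniques described in the rest of this post):

```python
from typing import Callable, Optional

def match_address(
    messy_address: str,
    canonical_addresses: list[str],
    block: Callable[[str, list[str]], list[str]],
    score: Callable[[str, str], float],
) -> Optional[str]:
    # Stage 1: blocking - retrieve a shortlist of plausible candidates.
    candidates = block(messy_address, canonical_addresses)
    if not candidates:
        return None
    # Stage 2: scoring - rank the candidates and return the best.
    return max(candidates, key=lambda c: score(messy_address, c))
```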
However, prior to either of these steps, we need to decide how best to represent the address data.
A tempting first step is to attempt to parse the messy addresses semantically - e.g. splitting out flat number, house number, street name and so on.[^3]
This is appealing because it appears to enable powerful scoring rules, such as 'if the building number is different, the addresses do not match'.
In practice, this approach suffers from a paradox: the hardest addresses to match often contain ambiguities which make them the hardest to parse, and the problem of parsing the address correctly collapses into needing to find the true address.[^4]
As a result, a data-driven approach is more powerful. We can treat the address as a single string from which we can extract certain features such as 'the first number' or 'most unusual bigram' without attaching too much semantic meaning. This enables us to draw on aggregate characteristics of the data itself to detect the most important features of the address.
It gives us greater flexibility in feature engineering, because we can find features that have no particular structural or semantic meaning but are nonetheless useful for matching.
It also prevents us being too rigid later in our blocking and scoring approaches - for example, it avoids the trap of logic like 'if the building number is different then it's not a match', which contains an implicit assumption that the information is complete and correctly parsed.
The purpose of blocking is to recover a list of plausible candidate addresses. The aim is high recall (i.e. not to miss the true match)[^5]; precision is less important, because we will score the candidates to find the true match at the next stage.
For example, a simple blocking strategy could be to find all addresses in the postcode of the query address.[^6] This is a good start, but in practice suffers two flaws:

- the postcode in the messy address may be incorrect, so the true match is never retrieved
- the postcode may be missing from the messy address altogether
As a result we need to supplement a postcode-based blocking strategy with some additional techniques. The idea is that any of these techniques in isolation may miss the true match, but the combined results should almost always contain it.
We can pre-process addresses to extract n-grams and their associated term frequencies (frequency of occurrence across the corpus of all addresses). This allows us to identify which n-grams are uncommon enough to be useful for blocking (i.e. will restrict the number of candidates to a manageable number).
Many strategies are possible here, but a simple approach could be to identify the least common bigram or trigram in each address, and to block on[^7] that. For performance reasons, we'd want to filter out any excessively common ones.
For example, each individual token in `Flat A 24 Jubilee Street London` is very common, but the trigram `A 24 Jubilee` will appear in at most a handful of addresses in the country.
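A minimal sketch of this idea (function names are illustrative; a real system would precompute an inverted index from bigram to addresses rather than scanning the corpus):

```python
from collections import Counter

def bigrams(address: str) -> list[str]:
    tokens = address.upper().replace(",", " ").split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def bigram_frequencies(corpus: list[str]) -> Counter:
    # Document frequency of each bigram across the corpus of all addresses.
    counts = Counter()
    for address in corpus:
        counts.update(set(bigrams(address)))
    return counts

def block_on_rarest_bigram(messy: str, corpus: list[str], freqs: Counter,
                           max_candidates: int = 100) -> list[str]:
    grams = bigrams(messy)
    if not grams:
        return []
    # Pick the least common bigram in the messy address...
    rarest = min(grams, key=lambda g: freqs.get(g, 0))
    # ...and retrieve every address containing it, discarding the result if
    # the bigram is so common that the candidate list is unmanageable.
    matches = [a for a in corpus if rarest in bigrams(a)]
    return matches if len(matches) <= max_candidates else []
```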
A variation on the n-gram approach is to extract the most unusual single tokens (words) in the address, and block on (say) the two least common tokens in the address.
This is useful to avoid the problem of missing or misordered tokens. For example, in the Philpot Square example, `Philpot` and `165` are unusual tokens and there may only be a single address in the country with both, but they don't predictably appear next to each other.
By working back to front and eliminating common tokens, we can identify the following discriminating tokens (highlighted):
- **1** Rainbow Lane Taunton TA1 1AB
- **2** Rainbow Lane Taunton TA1 1AB
- **Highfield** Rainbow Lane Taunton TA1 1AB
- **Old Station House** Rainbow Lane Taunton TA1 1AB
- **5** Rainbow Lane Taunton TA1 1AB
These highlighted n-grams can be used for blocking, and are especially useful for non-numeric house names, combined with (say) the first half of the postcode.
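One way to implement the back-to-front idea is to strip the longest common suffix of tokens from a group of neighbouring addresses (a simplified sketch; real data would need normalisation first):

```python
def discriminating_tokens(group: list[str]) -> list[list[str]]:
    token_lists = [address.split() for address in group]
    # Work back to front: grow the common suffix while every address still
    # has spare tokens and the trailing tokens all agree.
    suffix_len = 0
    while (all(len(t) > suffix_len for t in token_lists)
           and len({t[-(suffix_len + 1)] for t in token_lists}) == 1):
        suffix_len += 1
    return [t[:-suffix_len] if suffix_len else t for t in token_lists]

group = [
    "1 Rainbow Lane Taunton TA1 1AB",
    "2 Rainbow Lane Taunton TA1 1AB",
    "Highfield Rainbow Lane Taunton TA1 1AB",
    "Old Station House Rainbow Lane Taunton TA1 1AB",
    "5 Rainbow Lane Taunton TA1 1AB",
]
print(discriminating_tokens(group))
# [['1'], ['2'], ['Highfield'], ['Old', 'Station', 'House'], ['5']]
```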
Signature based candidate search, described in this paper and implemented in `whereabouts`, is effectively a generalisation of these techniques.
The idea is that there is significant redundancy in most addresses: we can devise 'signatures', subsets (often n-grams) of the full address, which narrow the search down to only a handful of candidates. Developing signatures is not completely automated and requires domain-specific knowledge. In this view, the techniques above are examples of types of signatures that may be useful in the context of addresses.
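The general shape of such a system might look like the following (a sketch only, not `whereabouts`' actual implementation; the signature functions themselves are where the domain-specific knowledge lives):

```python
from collections import defaultdict
from typing import Callable, Optional

SignatureFn = Callable[[str], Optional[str]]

def build_signature_index(
    corpus: list[str], signature_fns: list[SignatureFn]
) -> dict[str, set[str]]:
    # Map each signature value to the set of addresses producing it.
    index: dict[str, set[str]] = defaultdict(set)
    for address in corpus:
        for fn in signature_fns:
            sig = fn(address)
            if sig is not None:
                index[sig].add(address)
    return index

def candidates_from_signatures(
    messy: str, index: dict[str, set[str]], signature_fns: list[SignatureFn]
) -> set[str]:
    # Union the results: any single signature may miss the true match,
    # but the combined results should almost always contain it.
    found: set[str] = set()
    for fn in signature_fns:
        sig = fn(messy)
        if sig is not None:
            found |= index.get(sig, set())
    return found
```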
In many cases, the list of canonical addresses will be split robustly and semantically across multiple columns (e.g. the Built Address spec here).
If so, we can derive multiple representations of the canonical address by combining these columns in different ways. This gives a greater chance for blocking to recover the true match, and also means scoring will be more reliable because it increases the similarity between the messy address and the closest representation of the canonical address.
For example, the standard representation may not contain 'floor descriptors' like `top floor`, but if these are sometimes observed in the messy addresses, we could add them as a variation on the canonical address. Or we may want to ensure that we have the address both with and without the business name.
This adds accuracy without significantly increasing complexity: we can treat the variations as additional candidates, score as normal, but then the final score for each candidate becomes the highest score across all variations of the canonical address.
This technique also increases the chance of finding an exact match, so the run time performance tradeoff is not unambiguously negative.
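A sketch of this variation-generation step, assuming hypothetical column names in the canonical data (real schemas, such as the Built Address spec, will differ):

```python
from typing import Callable

def canonical_variations(record: dict) -> list[str]:
    # Column names here ('number', 'street', 'town', ...) are hypothetical.
    base = f"{record['number']} {record['street']} {record['town']} {record['postcode']}"
    variations = [base]
    if record.get("flat"):
        variations.append(f"Flat {record['flat']} {base}")
    if record.get("business_name"):
        variations.append(f"{record['business_name']} {base}")
    return variations

def best_variation_score(messy: str, record: dict,
                         score: Callable[[str, str], float]) -> float:
    # The final score for the candidate is the highest score across all
    # variations of the canonical address (assuming higher is better).
    return max(score(messy, v) for v in canonical_variations(record))
```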
The purpose of the scoring stage is to rank the candidates found in the blocking stage, so that the highest scoring candidate can be identified as the likely match.
There are many different scoring approaches, and a weighted combination[^8] of approaches will often yield the best results.[^9]
A simple 'first cut' of a scoring approach is to look at the number of common tokens between the messy address and the candidate address, weighted by token frequency in the corpus of all addresses. For example, consider scoring the following pair:
```
24 Jubilee St, London
24 Jubilee Street, London LO1 23D
```
| Overlapping token | Share of all addresses containing token |
|---|---|
| 24 | 0.15% |
| Jubilee | 0.003% |
| London | 5.0% |

final score = 0.0015 × 0.00003 × 0.05 ≈ 2.25 × 10⁻⁹
The address with the lowest score is ranked the highest: a lower product means the shared tokens are rarer, and hence more discriminating.
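In code, a naive version of this scoring approach might look like this (punctuation handling is deliberately crude, and the token shares are taken from the table above; tokens without a known share are given a neutral factor of 1):

```python
def tokens(address: str) -> set[str]:
    return set(address.upper().replace(",", " ").split())

def tf_product_score(messy: str, candidate: str,
                     token_share: dict[str, float]) -> float:
    # Multiply together the corpus-wide share of each overlapping token.
    # Rarer shared tokens drive the product lower, so lower is better.
    score = 1.0
    for token in tokens(messy) & tokens(candidate):
        score *= token_share.get(token, 1.0)
    return score

shares = {"24": 0.0015, "JUBILEE": 0.00003, "LONDON": 0.05}
print(tf_product_score("24 Jubilee St, London",
                       "24 Jubilee Street, London LO1 23D", shares))
# ≈ 2.25e-09, reproducing the worked example above
```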
This is surprisingly effective, but there are several drawbacks that mean it is possible to improve on this approach:

- A token like `London` is very common in the overall corpus, but for an address in Manchester, it may be highly discriminative.
- Conversely, a token that is rare in the corpus may be common amongst the candidates, and hence useless for distinguishing between them - like `Rainbow` in the above list of addresses.
- Shared rare tokens can reward the wrong candidate: in the `Philpot Square` example, a different and incorrect candidate may include the tokens `Block 3`.
An improvement on using whole-corpus token frequencies is to instead use the frequencies of tokens within the set of candidates. This is a better measure of the power of the token to distinguish between the candidates.
Similarly, we could use the frequency of n-grams within the set of candidates.
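A sketch of candidate-set frequencies, reusing the 'lower is better' product convention from above:

```python
from collections import Counter

def tokens(address: str) -> set[str]:
    return set(address.upper().replace(",", " ").split())

def candidate_set_frequencies(candidates: list[str]) -> Counter:
    # How many of the retrieved candidates contain each token.
    counts = Counter()
    for candidate in candidates:
        counts.update(tokens(candidate))
    return counts

def local_tf_score(messy: str, candidate: str, candidates: list[str]) -> float:
    # As before, multiply token shares (lower is better), but the shares are
    # now computed within the candidate set rather than the whole corpus.
    counts = candidate_set_frequencies(candidates)
    score = 1.0
    for token in tokens(messy) & tokens(candidate):
        score *= counts[token] / len(candidates)
    return score
```

In the `front` example below, only the true match shares `front`, contributing a factor of 1/5, whereas tokens common to every candidate contribute a factor of 1 and have no effect.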
For example, consider how the token `front` appears just once amongst the candidates below, making it uniquely powerful: it is the only token that fully discriminates between the candidates.
Messy address:

```
Front Ground Floor Flat 57 Fulham Park Gardens
```

Candidates:

```
Flat Ground Floor Front, 57, Fulham Park Gardens ✅
Flat First Floor, 57, Fulham Park Gardens
Flat Ground Floor, 66, Fulham Park Gardens
Flat Ground Floor Rear, 57, Fulham Park Gardens
```
Where tokens do not overlap, we can apply a penalty to the candidate's score. But how do we treat tokens that appear only in the messy address, or only in the canonical candidate?
This turns out to be trickier than it first appears because we can't treat all absent tokens as equal:

1. Some tokens are routinely dropped without changing the address - for example `London`, or the names of counties (e.g. `Somerset`).
2. Tokens like `Flat` are often omitted for the sake of brevity.
3. Other absent tokens (such as `Rear` or `Basement`) may genuinely indicate a different property.

Tokens in category (1) are relatively easy to identify simply by deriving a list of 'common end tokens' from a large dataset of addresses. Issues (2) and (3) are harder to deal with and I am yet to find a robust methodology.
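Deriving the 'common end tokens' list is straightforward (a rough sketch; a fuller version would skip postcodes and look at the last few tokens rather than just the final one):

```python
from collections import Counter

def common_end_tokens(corpus: list[str], top_n: int = 20) -> set[str]:
    # Count how often each token appears as the final token of an address;
    # the most frequent ones (towns, counties) can safely be ignored when
    # they are absent from one side of a comparison.
    counts = Counter()
    for address in corpus:
        toks = address.upper().replace(",", " ").split()
        if toks:
            counts[toks[-1]] += 1
    return {token for token, _ in counts.most_common(top_n)}
```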
Sometimes a token (or n-gram) from the messy address will be present in other candidates, but not in the one being scored. This is a strong indicator against the candidate being the true match.
Consider:
Messy address:

```
Flat A, 1 Primrose Apartments, London
```

Candidates:

```
Basement A, Block 1 Primrose Apartments, London
Flat A, Block 1 Primrose Apartments, London ✅
Flat A, Block 2 Primrose Apartments, London
Flat A, 1 Primrose Street, London
Flat A, 2 Primrose Street, London
```
Note that the word `flat` is present in the messy address and most candidates. As a result, `flat` is a common token within the candidate set, and does not help discriminate between the `flat` candidates: the score increment due to `flat` on the basis of token frequencies will be modest, and so will not strongly differentiate the basement candidate from the others.
However, the absence of `flat` in candidate 1 strongly suggests this cannot be the match, particularly because we know the word `flat` exists in other addresses (if it existed in no candidates, it would be irrelevant for the purpose of scoring). To ensure a more highly differentiated score for address 1, we need to apply some sort of score penalty.
In summary, this penalty enables the differential between (say) address 1 and addresses 2-5 to be greater than the 'reward' addresses 2-5 gain from matching on the word `flat`.
Note that in this example, the token `apartments` also plays a similar role to `flat` in ruling out candidates 4 and 5. Also note that in this example, there is no bigram or trigram that uniquely identifies the match.
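A sketch of such a penalty (multiplicative, using the 'lower is better' convention from earlier, so a penalised score is multiplied up; the penalty factor is illustrative and would need tuning):

```python
def tokens(address: str) -> set[str]:
    return set(address.upper().replace(",", " ").split())

def missing_token_penalty(messy: str, candidate: str,
                          all_candidates: list[str],
                          penalty: float = 10.0) -> float:
    # Tokens from the messy address that this candidate lacks.
    absent = tokens(messy) - tokens(candidate)
    # Absence only counts against the candidate if the token appears
    # elsewhere in the candidate set: if it existed in no candidates,
    # it would be irrelevant for scoring.
    other_tokens = set().union(
        *(tokens(c) for c in all_candidates if c != candidate)
    )
    factor = 1.0
    for token in absent:
        if token in other_tokens:
            factor *= penalty
    return factor
```

In the Primrose example, candidate 1 is missing `flat`, which appears in the other candidates, so its score is multiplied up and it is strongly disfavoured relative to candidates 2-5.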
In the `Rainbow Lane` example above, we illustrated a technique to find highly discriminating tokens amongst neighbouring addresses. On the face of it, these tokens are essential information to uniquely pin down the address.
In practice, addresses are not this simple: there's no guarantee that the discriminating tokens will appear in the messy address at all, or that they will point to the correct candidate.
As a result, instead of requiring a match, we can simply place greater weight on these tokens when scoring.
The following example illustrates how the simple rule of requiring a match on the discriminating tokens can fail:
Messy address:

```
Flat A 10 Station Street
```

Candidates:

```
9 Station Street
Flat A 9 Station Street
10 Station Street
Basement Flat 10 Station Street
10A Station Street
10B Station Street
```
Note how the simple rule 'require a match on the discriminating token(s)' would fail in this example; we'd incorrectly match to the `9 Station Street` candidate.
Many record linkage approaches simply treat the score as an ordinal variable: the higher, the better. A score threshold is set, and any candidate above the threshold is considered a match.

In the context of address matching, this approach is not very useful because absolute scores are not comparable across addresses. Consider a simple messy address such as `1 London Road, Birmingham, B1 1AA`: whilst the top-scoring match may be correct, its score may be relatively low due to the simplicity of the address and the commonness of its tokens.

As a result, we need a more nuanced approach to scoring: what we want is a measure of match confidence.
A powerful technique is to measure the score differential between the top candidate and the second best candidate. We could call this the 'distinguishability' score of the match.
The final match confidence score could then be defined as a combination of the raw score and the distinguishability score - often with more weight placed on the distinguishability score, and perhaps a minimum threshold on the raw score. For instance, we could define an 'almost certain match' as a candidate with high distinguishability and a raw score that is above some medium threshold.
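A sketch of such a confidence measure (assuming 'higher is better' scores; the weights and threshold are illustrative and would need tuning):

```python
from typing import Optional

def match_confidence(candidate_scores: list[float],
                     min_raw_score: float,
                     w_distinguishability: float = 0.7,
                     w_raw: float = 0.3) -> Optional[float]:
    if not candidate_scores:
        return None
    ranked = sorted(candidate_scores, reverse=True)
    best = ranked[0]
    second_best = ranked[1] if len(ranked) > 1 else 0.0
    # Apply a minimum threshold on the raw score.
    if best < min_raw_score:
        return None
    # Distinguishability: the gap between the best and second best candidate.
    distinguishability = best - second_best
    # Blend the two, with more weight on distinguishability.
    return w_distinguishability * distinguishability + w_raw * best
```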
Many of the techniques described above are initially appealing, and can even seem like 'knock out' solutions. However, I have found through bitter experience that there is always an edge case that undermines their effectiveness, and often there is tension between the information the different techniques seem to provide.
One important reason for this is that candidate search, by its nature, retrieves a set of candidates that are similar to the messy address - favouring candidates that contain the very discriminative n-grams and other features the scoring rules rely on, and thereby 'working against' those rules.
I have come to believe that the most powerful address matching solutions are those that combine many weak, data-driven signals in a weighted score - with the weights tuned against labelled data - rather than relying on any single 'knock out' heuristic.
I will end with some examples that highlight the challenges of address matching - and are interesting to consider in the context of the techniques described above.
```
23A Marchant House, Jubilee Street, London, LO1 23D
Flat A, Top Floor, Marchant House, 23 Jubilee Street, Fulham, LO1 23D
```

```
6, PADDOCK CLOSE CASTLETHORPE MK19 7AY
PIPIT HOUSE, 6, PADDOCK CLOSE, CASTLETHORPE MK19 7AY
```

```
Maisonette First And Second Floors, 14, Hadyn Park Road
Top Floor Flat 14 Hadyn Park Road, London
```

```
Flat 39 Evans House White City Estate
39, EVANS HOUSE, AUSTRALIA ROAD, LONDON
```

```
FLAT 3 ST LEGER HOUSE GREAT LINFORD MK14 5HA
GREAT LINFORD HOUSE 1 ST LEGER COURT GREAT LINFORD MK14 5HA
3 ST LEGER HOUSE 4A ST LEGER COURT GREAT LINFORD MK14 5HA ✅
3 ST LEGER COURT GREAT LINFORD MK14 5HA
```

```
Rear Studio Flat 191a Uxbridge Road W12 9RA
FLAT GROUND FLOOR REAR, 191A, UXBRIDGE ROAD, LONDON W12 9RA ✅
NEALE-ROBINSON, 191A, UXBRIDGE ROAD, LONDON W12 9RA
FIRST FLOOR FLAT, 191A, UXBRIDGE ROAD, LONDON W12 9RA
REAR OF, 191E, UXBRIDGE ROAD, LONDON, W12 9RA
```
Thanks to Alex Lee, the author of `whereabouts`, for insightful comments and suggestions. All errors remain my own.
[^1]: A variety of interesting edge cases can be found in Falsehoods Programmers Believe About Addresses.

[^2]: To ensure high performance, I've included only techniques that can be implemented in SQL, meaning we can use high performance engines like DuckDB to implement the address matching system.

[^3]: A Conditional Random Field or hidden Markov model could be used for this, or existing libraries such as libpostal.

[^4]: The `Philpot Square` address at the top of this article is a good practical example: how should we interpret the `165` in `165, Philpot Square, London`? In difficult examples like this, solving this problem amounts to figuring out what the true address is - so the problem is circular.

[^5]: High recall is particularly important because some scoring techniques implicitly assume that the true address is among the candidates.

[^6]: Blocking on postcode only may result in a list of candidates that is too long for performance reasons. In practice, a simple extension could be to limit to addresses that contain the postcode and any of the numbers in the address.

[^7]: The terminology 'block on' means 'candidate addresses are found using this condition'. So in this example, we would find any address whose least common bigram matches the least common bigram in the messy address. This is conceptually equivalent to an inverted index.

[^8]: In `uk_address_matcher`, gradient descent has been used to optimise weights; see here.

[^9]: In my experience, weighted scoring is better than trying to apply deterministic logic, because there are too many edge cases and the 'tree' of if statements quickly becomes unmanageable.