Originally posted: 2025-06-29. Last updated: 2025-06-30. View source code for this page here.

#Building Accurate Address Matching Systems

A bag of tricks to improve the accuracy of geocoding

#The challenge

Address matching is a notoriously difficult problem due to unpredictable structure of addresses¹ and the many different ways an address can be written.

Substantial variation can exist between two matching addresses - for instance:

Flat 165 Block 3 Philpot Square, Hammersmith And Fulham

165, Philpot Square, London

✅match

whereas the following example has far less variation, yet does not match:

Flat A 24 Jubilee Street, London, LO1 23D

Flat A 25 Jubilee Street, London, LO1 23D

❌non-match

The challenge is therefore to develop an algorithm that can somehow 'see' that the first pair of addresses are more similar than the second pair.

Unfortunately one of the most effective general approaches to record linkage – known as the Fellegi-Sunter model and implemented in Splink – is not well suited to address matching because:

there is a strong correlation between different parts of an address, violating the model's statistical assumptions
the model assumes input data is split across multiple columns (e.g. first name, surname, DoB etc.), but address data is often received as a single string, and reliably parsing this into a set of standardised columns is difficult

In this post, I share some tricks and feature engineering techniques we can use to exploit the information in addresses as much as possible, to maximise geocoding accuracy.²

Concrete implementations of some of these tricks can be found in two open source address matching libraries:

uk_address_matcher a Splink based address matching library focussed on UK addresses
whereabouts, a DuckDB-powered address matching library, using signature based geocoding methods. Initially developed for Australian addresses, it also supports the creation of geocoding databases for other countries

#Structuring the problem

I will assume that we have:

A list of messy addresses (e.g. addresses that have been entered by hand by users)
A canonical address file (e.g. in the UK, an Ordnance Survey product such as Built Address)

To be as general as possible, we'll assume the address in these files is just a single string.

The address matching problem can be structured into two main stages:

Blocking (finding a small number of plausible candidates from the canonical address file)
Scoring (to rank the candidates and thereby identify the best match).

These steps are independent, so we can mix and match different blocking and scoring techniques.

However, prior to either of these steps, we need to decide how best to represent the address data.

#How to represent the address data - a data-driven approach

A tempting first step is to attempt to parse the messy addresses semantically - e.g. splitting out flat number, house number, street name and so on.³

This is appealing because it appears to enable powerful scoring rules, such as 'if the building number is different, the addresses do not match'.

In practice, this approach suffers from a paradox: the hardest addresses to match often contain ambiguities which make them the hardest to parse, and the problem of parsing the address correctly collapses into needing to find the true address.⁴

As a result, a data-driven approach is more powerful. We can treat the address as a single string from which we can extract certain features such as 'the first number' or 'most unusual bigram' without attaching too much semantic meaning. This enables us to draw on aggregate characteristics of the data itself to detect the most important features of the address.

It gives us greater flexibility in feature engineering, because we can find features that have no particular structural or semantic meaning but are nonetheless useful for matching.

It also prevents us being too rigid later in our blocking and scoring approaches - for example, it avoids the trap of logic like 'if the building number is different then it's not a match' - which contains an implicit assumption of complete information that's correctly parsed.

#Blocking stage

The purpose of blocking is to recover a list of plausible candidate addresses. The aim of the blocking stage is to have high recall (i.e. not to miss the true match)⁵. Precision is less important, because we will be scoring the candidates to find the true match at the next stage.

For example, a simple blocking strategy could be to find all addresses in the postcode of the query address⁶. This is a good start, but in practice suffers two flaws:

The list of candidates can be excessively long, resulting in slower scoring performance
The query address may not have a postcode, or the postcode may be incorrect

As a result we need to supplement a postcode-based blocking strategy with some additional techniques. The idea is that any of these techniques in isolation may miss the true match, but the combined results should almost always contain it.

#1. Surmounting problems of common tokens: n-grams combined with term frequencies

We can pre-process addresses to extract n-grams and their associated term frequencies (frequency of occurrence across the corpus of all addresses). This allows us to identify which tokens are uncommon enough to be useful for blocking (i.e. will restrict the number of candidates to a manageable number).

Many strategies are possible here, but a simple approach could be to identify the least common bigram or trigram in each address, and to block on⁷ that. For performance reasons, we'd want to filter out any excessively common ones.

For example each individual token in Flat A 24 Jubilee Street London is very common, but the tri-gram A 24 Jubilee will appear in at most a handful of addresses in the country.

#2. Word order: Single words sorted by term frequency

A variation on the n-gram approach is to extract the most unusual single tokens (words) in the address, and block on (say) the two least common tokens in the address.

This is useful to avoid the problem of missing or misordered tokens. For example, in the Philpot Square example, Philpot and 165 are unusual tokens and there may only be a single address in the country with both tokens, but they don't predictably appear next to each other.

#3. Discriminating tokens amongst neighbouring addresses

By working back to front and eliminating common tokens, we can identify the following discriminating tokens (highlighted):

1 Rainbow Lane Taunton TA1 1AB
2 Rainbow Lane Taunton TA1 1AB
Highfield Rainbow Lane Taunton TA1 1AB
Old Station House Rainbow Lane Taunton TA1 1AB
5 Rainbow Lane Taunton TA1 1AB

These highlighted n-grams can be used for blocking and can be especially used for non-numeric house names, combined with (say) the first half of the postcode.

#4. Signature based blocking

Signature based candidate search, described in this paper and implemented in whereabouts is effectively a generalisation of these techniques.

The idea is that there is significant redundancy in most addresses, we can devise 'signatures' which are subsets (often n-grams) of the full address which narrow down possible candidates to only a handful of candidates. Developing signatures is not completely automated and requires domain-specific knowledge. In this view, the techniques above are examples of types of signatures that may be useful in the context of addresses.

#5. Deriving multiple variations of the canonical address

In many cases, the list of canonical addresses will be split robustly and semantically across multiple columns (e.g. the Built Address spec here).

If so, we can derive multiple representations of the canonical address by combining these columns in different ways. This gives a greater chance for blocking to recover the true match, and also means scoring will be more reliable because it increases the similarity between the messy address and the closest representation of the canonical address.

For example, the standard representation may not contain 'floor descriptors' like top floor, but if this is sometimes observed in the messy addresses, we could add this as a variation onto the canonical address. Or we may want to ensure that we have both the address with business name and without.

This adds accuracy without significantly increasing complexity: we can treat the variations as additional candidates, score as normal, but then the final score for each candidate becomes the highest score across all variations of the canonical address.

This technique also increases the chance of finding an exact match, so the run time performance tradeoff is not unambiguously negative.

#Scoring stage

The purpose of the scoring stage is to identify the highest scoring candidate found in the blocking stage, to try to identify the correct match.

There are many different scoring approaches, and a weighted combination⁸ of approaches will often yield the best results⁹.

#1. Token frequencies within the corpus of all addresses

A simple 'first cut' of a scoring approach can be simply to look at the number of common tokens between the messy address and candidate address, weighted by token frequency in the corpus of all addresses.

Example - click to expand

Messy address

24 Jubilee St, London

Candidate address

24 Jubilee Street, London LO1 23D

Overlapping token	Share of all addresses containing token
24	0.15%
Jubilee	0.003%
London	5.0%

final score = 0.0015 × 0.00003 × 0.05 ≈ 2.25 × 10⁻⁹

The address with the lowest score is ranked the highest.

This is surprisingly effective, but there are several drawbacks that mean it is possible to improve on this approach:

Token frequency in the corpus of all addresses is not a very good metric of the strength of match within a small group of candidates. For instance, the token London is very common in the overall corpus, but for an address in Manchester, it may be highly discriminative.
Similarly, a token may be very uncommon in the overall corpus, but useless to distinguish between candidates. For example, consider the token Rainbow in the above list of addresses.
Sometimes the true match may have fewer overlapping tokens than a false match. For example, in the Philpot Square example, a different and incorrect candidate may include the tokens Block 3

#2. Token (or n-gram) frequencies within candidates

An improvement on using whole-corpus token frequencies is to instead use the frequencies of tokens within the set of candidates. This is a better measure of the power of the token to distinguish between the candidates.

Similarly, we could use the frequency of n-grams within the set of candidates.

Example - click to expand

For example, consider how the token 'front' is the only token that appears just once amongst the candidates, making it uniquely powerful because it is the only token that discriminates fully between the candidates.

Messy address

Front Ground Floor Flat 57 Fulham Park Gardens

Candidate addresses

Flat Ground Floor Front, 57, Fulham Park Gardens

✅

Flat First Floor, 57, Fulham Park Gardens

Flat Ground Floor, 66, Fulham Park Gardens

Flat Ground Floor Rear, 57, Fulham Park Gardens

#3. Absence of tokens: Punishing missing tokens

Where tokens do not overlap, we can apply a penalty to the candidate's score:

How do we treat tokens in areas M and C?

This turns out to be trickier than it first appears because we can't treat all absent tokens as equal:

There are a variety of optional tokens that tend to occur towards the end of an address whose absence is close to irrelevant. Examples may include London, the names of counties (e.g. Somerset).
Some tokens such as Flat are often omitted for the sake of brevity
Business names are optional and appear at the beginning of an address

Tokens in category (1) are relatively easy to identify simply by deriving a list of 'common end tokens' from a large dataset of addresses. Issues (2) and (3) are harder to deal with and I am yet to find a robust methodology.

#4. Absence of tokens: Matching tokens that exist within other candidates, but not the present candidate

Sometimes a token (or n-gram) from the messy address will be present in other candidates, but not in the one being scored. This is a strong indicator against the candidate being the true match.

Example - click to expand

Consider:

Messy address

Flat A, 1 Primrose Apartments, London

Candidate addresses

Basement A, Block 1 Primrose Apartments, London

Flat A, Block 1 Primrose Apartments, London

✅

Flat A, Block 2 Primrose Apartments, London

Flat A, 1 Primrose Street, London

Flat A, 2 Primrose Street, London

Note that the word flat is present in the messy address, and most candidates.

As a result flat is a common token within the candidate set, and does not help discriminate between the flat candidates. The score increment due to 'flat' on the basis of token frequencies will be modest, and so will not strongly differentiate the basement candidate from the others.

However, the absence of flat in candidate 1 strongly suggests this cannot be the match, particularly because we know the word flat exists in other addresses (if it existed in no candidates, it would be irrelavant for the purpose of scoring.) To ensure a more highly differentiated score for address 1, we need to apply some sort of score penalty.

In summary, this enables the differential between (say) address 1 and addresses 2-5 to be greater than the 'reward' 2-5 will gain from matching on the word 'flat'.

Note that in this example, the token apartments also plays a similar role to flat in ruling out candidates 4 and 5.

Also note in this example, there is no bigram or trigram that uniquely identifies the match.

#5. Discriminating tokens amongst neighbouring addresses

In the Rainbow Lane example above, we illustrated a technique to find highly discriminating tokens amongst neighbouring addresses.

On the face of it, these tokens are essential information to uniquely pin down the address.

In practice, addresses are not this simple, and there's no guarantee that:

the messy address will contain these tokens
the tokens are uniquely discriminating amongst the candidates

As a result, instead of requiring a match, we can simply place greater weight on these tokens when scoring.

Example - click to expand

The following example illustrates how this simple technique can fail:

Messy address:

Flat A 10 Station Street

Candidates:

9 Station Street
Flat A 9 Station Street
10 Station Street
Basement Flat 10 Station Street
10A Station Street
10B Station Street

Note how the simple rule 'require a match on the discriminating (highlighted) token(s)' would fail in this example; we'd incorrectly match to the 9 Station Street candidate.

#Scoring stage: Interpreting the score

Many record linkage approaches simply treat the score as an ordinal variable, the higher the better. A score threshold is set, and any candidate above the threshold is considered a match.

In the context of address matching, this approach is not very useful because:

There should be a single match - we do not want to match a single messy address to multiple candidate addresses
There isn't a strong relationship between 'raw' score and the true match status. For example, the matcher may correctly 'find' the true match for a simple address like 1 London Road, Birmingham, B1 1AA. But whilst the top-scoring match may be correct, its score may be relatively low due to the simplicity of the address, and the commonness of the tokens.

As a result, we need a more nuanced approach to scoring. The goal is that we want a measure of match confidence.

#Using 'distinguishability' to determine match confidence

A powerful technique is to measure the score differential between the top candidate and the second best candidate. We could call this the 'distinguishability' score of the match.

The final match confidence score could then be defined as a combination of the raw score and the distinguishability score - often with more weight placed on the distinguishability score, and perhaps a minimum threshold on the raw score. For instance, we could define an 'almost certain match' as a candidate with high distinguishability and a raw score that is above some medium threshold.

#Final thoughts

Many of the techniques described above are initially appealing, and can even seem like 'knock out' solutions. However, I have found through bitter experience that there is always an edge case that undermines their effectiveness, and often there is tension between the information the different techniques seem to provide.

One important reason for this is that the process of candidate search by its nature retrieves a set of candidates that are similar to the messy address - favouring candidates that contain highly discriminative n-grams or other features that 'work against' the scoring rules.

I have come to believe that the most powerful address matching solutions are those that:

Apply clever feature engineering tricks to uncover hidden information in the structure of the addresses
Use contextual information from the corpus of all addresses, and the list of candidates, rather than relying only on pairwise information
Combine this information using match weights to decide how important different features are, as opposed to crafting deterministic logic
Learn the match weights from the data itself using e.g. gradient descent.

I will end with some examples that highlight the challenges of address matching - and are interesting to consider in the context of the techniques described above.

Some examples that highlight the challenges of address matching - click to expand

23A Marchant House, Jubilee Street, London, LO1 23D

Flat A, Top Floor, Marchant House, 23 Jubilee Street, Fulham, LO1 23D

✅match

6, PADDOCK CLOSE CASTLETHORPE MK19 7AY

PIPIT HOUSE, 6, PADDOCK CLOSE, CASTLETHORPE MK19 7AY

✅match

Maisonette First And Second Floors, 14, Hadyn Park Road

Top Floor Flat 14 Hadyn Park Road, London

✅match

Flat 39 Evans House White City Estate

39, EVANS HOUSE, AUSTRALIA ROAD, LONDON

✅match

Messy address

FLAT 3 ST LEGER HOUSE GREAT LINFORD MK14 5HA

Candidate addresses

GREAT LINFORD HOUSE 1 ST LEGER COURT GREAT LINFORD MK14 5HA

3 ST LEGER HOUSE 4A ST LEGER COURT GREAT LINFORD MK14 5HA

✅

3 ST LEGER COURT GREAT LINFORD MK14 5HA

Messy address

Rear Studio Flat 191a Uxbridge Road W12 9RA

Candidate addresses

FLAT GROUND FLOOR REAR, 191A, UXBRIDGE ROAD, LONDON W12 9RA

✅

NEALE-ROBINSON, 191A, UXBRIDGE ROAD, LONDON W12 9RA

FIRST FLOOR FLAT, 191A, UXBRIDGE ROAD, LONDON W12 9RA

REAR OF, 191E, UXBRIDGE ROAD, LONDON, W12 9RA

#Acknowledgements

Thanks to Alex Lee the author of whereabouts for insightful comments and suggestions - all errors remain my own

#Footnotes

A variety of interesting edge cases can be found in Falsehoods Programmers Believe About Addresses. ↩
To ensure high performance, I've included only techniques that can be implemented in SQL, meaning we can use high performance engines like DuckDB to implement the address matching system. ↩
A Conditional Random Fields or hidden Markov model could be used for this, or existing libraries such as libpostal. ↩
The Philpot Square address at the top of this article is a good practical example: how should we interpret the 165 in 165, Philpot Square, London? In difficult examples like this, solving this problem amounts to figuring out what the true address is - so the problem is circular. ↩
High recall is particularly important because some scoring techniques implicitly assume that the true address is among the candidates. ↩
Blocking on postcode only may result in a list of candidates that is too long for performance reasons. In practice, a simple extension could be to limit to addresses that contain the postcode and any of the numbers in the address. ↩
The terminology 'block on' means 'candidate addresses are found using this condition'. So in this example, we would find any address whose least common bigram matches the least common bigram in the messy address. This is conceptually equivalent to an inverted index. ↩
In uk_address_matcher, gradient descent has been used to optimise weights, see here. ↩
In my experience, weighted scoring is better than trying to apply deterministic logic because there are too many edge cases so the 'tree' of if statements quickly becomes unmanageable. ↩