Originally posted: 2025-09-23. View source code for this page here.

A data structure known as a trie offers a powerful tool for address matching. It exploits the hierarchical nature of addresses, whereby tokens (words) in the address are typically represented in a specific to general order geographically.

For example, consider the following addresses:

1 HIGH STREET WESTMINSTER
2 HIGH STREET WESTMINSTER
3 HIGH STREET WESTMINSTER
FLAT A 3 HIGH STREET WESTMINSTER
FLAT B 3 HIGH STREET WESTMINSTER
4 HIGH STREET WESTMINSTER
5 HIGH STREET WESTMINSTER
ANNEX 5 HIGH STREET WESTMINSTER

These can be represented as a trie as follows, where the blue nodes represent terminal nodes (the end of an address):

Observe how the trie structure encodes a lot of useful information about the tokens in the addresses. For example:

  • Does the token represent a 'terminal node' i.e. the end of an address
  • How many addresses can be reached from the node (the count property)
  • Whether the token represents a leaf (does count=1). Note that some terminal nodes are not leaves (the ANNEX example)

By walking the trie from root to branch we can attempt to match an input address to the addresses in the trie.

On the face of it, this is little different from exact matching. However, we can make this process much more powerful by introducing fault tolerance into the trie.

Consider for example these two addresses:

Messy address: CORNER SHOP 8 RAINBOW ROAD EXTRA ABBOTS LANGLEY HERTS UK
Clean address: 1 RAINBOW ROAD EXTRA ABBOTS LANGLEY

We could define the following rules for fault tolerance:

  • You may ignore token(s) at the end of the messy address if by doing so, you can 'join' the trie of clean addresses at a node with a count above some threshold
  • You may skip a token in the middle of the messy address if this token does not exist in the trie at the current node, and by skipping it you can 'rejoin' the trie.
  • You may skip superfluous tokens at the start of the messy address if you've already hit a terminal node, and that terminal node does not have children.

Let's see how this works in practice. In the following I use a simple set of fault tolerant rules for illustrative purposes.

Consider the following canonical addresses (note: this is ✨interactive✨ and you can edit the addresses):

We can look up a messy address in the trie as follows. Again, this is ✨interactive✨.

I have written a DuckDB Community Extension called splink_udfs that implements this function.

You can try it with in shell.duckdb.org using the script you can unfold below:

Click to expand
-- Load the extension in shell.duckdb.org
INSTALL splink_udfs FROM community;
LOAD splink_udfs;
-- Test data
CREATE OR REPLACE TEMP TABLE os_addresses(
uprn BIGINT,
addr VARCHAR,
postcode_group INTEGER
);
INSERT INTO os_addresses VALUES
(1,'1 RAINBOW ROAD ABBOTS LANGLEY',2),
(2,'2 RAINBOW ROAD ABBOTS LANGLEY',2),
(3,'3 RAINBOW ROAD ABBOTS LANGLEY',2),
(4,'4 RAINBOW ROAD ABBOTS LANGLEY',2),
(5,'5 RAINBOW ROAD ABBOTS LANGLEY',2),
(6,'6 RAINBOW ROAD ABBOTS LANGLEY',2),
(7,'7 RAINBOW ROAD ABBOTS LANGLEY',2),
(8,'ANNEX 7 RAINBOW ROAD ABBOTS LANGLEY',2),
(9,'9 RAINBOW ROAD ABBOTS LANGLEY',2),
(10,'10 RAINBOW ROAD ABBOTS LANGLEY',2),
(11,'11 RAINBOW ROAD ABBOTS LANGLEY',2),
(12,'12 RAINBOW ROAD ABBOTS LANGLEY',2),
(13,'13 RAINBOW ROAD ABBOTS LANGLEY',2),
(14,'14 RAINBOW ROAD ABBOTS LANGLEY',2),
(15,'15 RAINBOW ROAD ABBOTS LANGLEY',2),
(16,'16 RAINBOW ROAD ABBOTS LANGLEY',2),
(17,'17 RAINBOW ROAD ABBOTS LANGLEY',2),
(18,'18 RAINBOW ROAD ABBOTS LANGLEY',2),
(19,'19 RAINBOW ROAD ABBOTS LANGLEY',2),
(20,'20 RAINBOW ROAD ABBOTS LANGLEY',2),
(21,'21 RAINBOW ROAD ABBOTS LANGLEY',2),
(22,'22 RAINBOW ROAD ABBOTS LANGLEY',2),
(23,'MY LONG BUSINESS NAME 45 RAINBOW ROAD ABBOTS LANGLEY',2),
(34,'SOME OTHER ADDRESS 1 HIGH STREET',1);
-- Build tries per postcode_group
CREATE OR REPLACE TEMP TABLE trie AS
WITH toks AS (
SELECT
postcode_group,
uprn,
string_split(regexp_replace(addr, '\\s+', ' '), ' ') AS t
FROM os_addresses
)
SELECT
postcode_group,
build_suffix_trie(uprn, t) AS trie
FROM toks
GROUP BY postcode_group;
-- Example lookup joined on postcode_group
WITH messy_addresses AS (
SELECT
2 AS postcode_group,
string_split(regexp_replace('MY BUSINESS 22 RAINBOW ROAD ABBOTS LANGLEY', '\\s+', ' '), ' ') AS tokens
)
SELECT
find_address(string_split('MY BUSINESS 22 RAINBOW ROAD ABBOTS LANGLEY HERTS', ' '), r.trie) AS uprn,
FROM messy_addresses m
JOIN trie r USING (postcode_group);

We are currently working on adding this approach as part of uk_address_matcher a package for fast, accurate address matching.