Originally posted: 2025-09-23. View source code for this page here.
A data structure known as a trie
offers a powerful tool for address matching. It exploits the hierarchical nature of addresses, whereby tokens (words) in the address are typically represented in a specific to general order geographically.
For example, consider the following addresses:
1 HIGH STREET WESTMINSTER2 HIGH STREET WESTMINSTER3 HIGH STREET WESTMINSTERFLAT A 3 HIGH STREET WESTMINSTERFLAT B 3 HIGH STREET WESTMINSTER4 HIGH STREET WESTMINSTER5 HIGH STREET WESTMINSTERANNEX 5 HIGH STREET WESTMINSTER
These can be represented as a trie as follows, where the blue nodes represent terminal nodes (the end of an address):
Observe how the trie structure encodes a lot of useful information about the tokens in the addresses. For example:
count
property)count=1
). Note that some terminal nodes are not leaves (the ANNEX
example)By walking the trie from root to branch we can attempt to match an input address to the addresses in the trie.
On the face of it, this is little different from exact matching. However, we can make this process much more powerful by introducing fault tolerance into the trie.
Consider for example these two addresses:
Messy address: CORNER SHOP 8 RAINBOW ROAD EXTRA ABBOTS LANGLEY HERTS UKClean address: 1 RAINBOW ROAD EXTRA ABBOTS LANGLEY
We could define the following rules for fault tolerance:
Let's see how this works in practice. In the following I use a simple set of fault tolerant rules for illustrative purposes.
Consider the following canonical addresses (note: this is ✨interactive✨ and you can edit the addresses):
We can look up a messy address in the trie as follows. Again, this is ✨interactive✨.
I have written a DuckDB Community Extension called splink_udfs that implements this function.
You can try it with in shell.duckdb.org using the script you can unfold below:
-- Load the extension in shell.duckdb.orgINSTALL splink_udfs FROM community;LOAD splink_udfs;-- Test dataCREATE OR REPLACE TEMP TABLE os_addresses(uprn BIGINT,addr VARCHAR,postcode_group INTEGER);INSERT INTO os_addresses VALUES(1,'1 RAINBOW ROAD ABBOTS LANGLEY',2),(2,'2 RAINBOW ROAD ABBOTS LANGLEY',2),(3,'3 RAINBOW ROAD ABBOTS LANGLEY',2),(4,'4 RAINBOW ROAD ABBOTS LANGLEY',2),(5,'5 RAINBOW ROAD ABBOTS LANGLEY',2),(6,'6 RAINBOW ROAD ABBOTS LANGLEY',2),(7,'7 RAINBOW ROAD ABBOTS LANGLEY',2),(8,'ANNEX 7 RAINBOW ROAD ABBOTS LANGLEY',2),(9,'9 RAINBOW ROAD ABBOTS LANGLEY',2),(10,'10 RAINBOW ROAD ABBOTS LANGLEY',2),(11,'11 RAINBOW ROAD ABBOTS LANGLEY',2),(12,'12 RAINBOW ROAD ABBOTS LANGLEY',2),(13,'13 RAINBOW ROAD ABBOTS LANGLEY',2),(14,'14 RAINBOW ROAD ABBOTS LANGLEY',2),(15,'15 RAINBOW ROAD ABBOTS LANGLEY',2),(16,'16 RAINBOW ROAD ABBOTS LANGLEY',2),(17,'17 RAINBOW ROAD ABBOTS LANGLEY',2),(18,'18 RAINBOW ROAD ABBOTS LANGLEY',2),(19,'19 RAINBOW ROAD ABBOTS LANGLEY',2),(20,'20 RAINBOW ROAD ABBOTS LANGLEY',2),(21,'21 RAINBOW ROAD ABBOTS LANGLEY',2),(22,'22 RAINBOW ROAD ABBOTS LANGLEY',2),(23,'MY LONG BUSINESS NAME 45 RAINBOW ROAD ABBOTS LANGLEY',2),(34,'SOME OTHER ADDRESS 1 HIGH STREET',1);-- Build tries per postcode_groupCREATE OR REPLACE TEMP TABLE trie ASWITH toks AS (SELECTpostcode_group,uprn,string_split(regexp_replace(addr, '\\s+', ' '), ' ') AS tFROM os_addresses)SELECTpostcode_group,build_suffix_trie(uprn, t) AS trieFROM toksGROUP BY postcode_group;-- Example lookup joined on postcode_groupWITH messy_addresses AS (SELECT2 AS postcode_group,string_split(regexp_replace('MY BUSINESS 22 RAINBOW ROAD ABBOTS LANGLEY', '\\s+', ' '), ' ') AS tokens)SELECTfind_address(string_split('MY BUSINESS 22 RAINBOW ROAD ABBOTS LANGLEY HERTS', ' '), r.trie) AS uprn,FROM messy_addresses mJOIN trie r USING (postcode_group);
We are currently working on adding this approach as part of uk_address_matcher a package for fast, accurate address matching.