Originally posted: 2023-01-09. Last updated: 2023-09-21. View source code for this page here.
Statically hosted parquet files provide one of the easiest to use and most performant APIs for accessing bulk¹ data, and are far simpler and cheaper to provide than custom APIs. Furthermore, features of more sophisticated APIs are often best left to the end user to perform in their preferred data analysis tools.
With DuckDB, the parquet files are directly queryable using SQL, providing a universal API to interrogate such datasets. This can be used from languages like Python and R, and even works in the browser.
This can be extremely performant. The National Statistics Postcode Lookup is a 1.1GB csv file with 2.6 million rows. When hosted as a parquet file, this can be queried in the browser to return a single postcode in 13 seconds from a cold start on a home internet connection, and 100ms for subsequent queries.
As a data scientist, I find the complexity of many open data services frustrating. I don't want to have to learn how to query an endpoint, think about data types, or read through API documentation. Just give me the data!
In this post, I argue that the best way of serving bulk data is usually one or more parquet files, served at a predictable URL with CORS enabled². These files should represent the canonical version of the data, and any other data products (such as Excel files, reports, or CSVs) should be derived from these files.
SQL provides a familiar, universal API on top of these files, enabling filtering, joining, calculation or aggregation services that may otherwise be custom features of data APIs. Software libraries such as DuckDB enable SQL to be run directly and efficiently on these parquet files, including from the browser (meaning no additional software is required).
To be concrete about this proposal, if I wanted to serve open data on the number of widgets produced, the URL structure could be something like:
The dated URLs provide fixed snapshots for reproducible results, and the ‘latest’ URL always provides the most up-to-date data. I would recommend serving a .csv copy at equivalent URLs. A directory listing service should also be provided to enable data discovery and scraping.
My preferred way of getting acquainted with a new dataset is usually by looking at a sample of rows, and then exploring the data further with queries and automatic profiling tools. Only later am I likely to look at documentation or metadata.
Loading a sample of rows from an online parquet file is a simple one-liner in most analytical programming languages - usually the same familiar code as loading in data from disk. Viewing metadata is also a one-liner.
There are a wide range of free tools to dig deeper, and each user will have their favourite (e.g. Rill, skimpy, pandas profiling, datasette or R equivalents). Nowadays, this can even be done in the web browser, without the need for any special software to be downloaded.
In contrast, in the case of many custom APIs, the user would need to write non-trivial custom code to obtain the data prior to being able to use these tools.
Users can then use their preferred analytical tools to manipulate and query the data. One particularly interesting recent development is DuckDB in the browser, which enables parquet files to be queried using SQL directly in the web browser, without the need for installing additional software. If open data was consistently served as parquet files, this would effectively provide a universal API to allow all open data to be interrogated quickly and easily, with no servers required.
Importing from parquet also makes it exceptionally easy to write a Reproducible Analytical Pipeline on top of the data. It also encourages internal use of open data products (‘dogfooding’), which is an effective way of detecting problems early and continuous improvement.
Cross-Origin Resource Sharing (CORS) enables any website to load your data directly from source parquet files without the need for a server.
This enables self-updating data dashboards to be built on the open data in just a few lines of code - whether those be official products built by the data owner, or other internet users. This leads to a very simple architecture that reduces copies of the canonical data and prevents dashboards from going stale.
Serving data as a static file is probably the simplest and cheapest possible architecture for open data services. Modern cloud services mean it’s as simple as uploading the file to blob storage (e.g. AWS S3). Data access will remain performant even if traffic is high, and the service will have very high reliability and availability. All of this is taken care of by the cloud provider.
For the end user, a parquet file served from blob storage will also be one of the fastest possible ways of getting data into their data analysis software from a remote provider - often far faster than obtaining data from a custom API.
Overall, serving static parquet files is likely to be orders of magnitude cheaper than developing a custom solution.
It’s difficult to predict what services may wish to be downstream users of the open data. However, serving performant data in a typed format with wide support enables others to easily build services on top of the data.
This provides a clean separation of concerns between data engineers, who are responsible for the canonical datasets, and digital services that may have bespoke needs to consume this data in bespoke way - e.g. a lookup service. The simple availability of the data is also likely to make it simpler to build and maintain these downstream services³.
Parquet is particularly well suited to sharing a wide variety of data online for a variety of reasons:
Alongside Apache Arrow, parquet is rapidly becoming a de facto standard for saving, transferring and sharing data.
There are some cases where static parquet files may be inappropriate. Examples include:
The French National Institute of Statistics and Economic Studies (Insee) have started publishing open data in parquet format that can be immediately analysed in the DuckDB shell. See here. It can be immediately manipulated in SQL using DuckDB in Observable - see here
Ah, I remember this specific file from my time at the ECB 15-ish years ago. IIRC it was by far the most downloaded file on the ECB website. Tons of people, including many financial institutions, downloaded it daily, and used it to update their own systems.
IIRC #2 in the minutes immediately after the daily scheduled time for publishing this file, there was a massive traffic spike.
It was a conscious decision to make it a simple CSV file (once unzipped): it made it possible to serve the file reliably, fast, and with little resources needed.
The small team responsible at the time for the ECB’S public website was inordinately proud of the technical decisions made to serve this data in a single static file. And rightly so.
¹ By 'bulk' data, I mean APIs designed to serve whole datasets to the end user, as opposed to 'atomic' APIs which serve individual data points (e.g. a postcode lookup) for use in downstream services
² Cross-Origin Resource Sharing (CORS) enables any website to load your data directly from source parquet files without the need for a server.
³ For example, it may be possible to produce a widget lookup service by running DuckDB as a AWS lambda function, with the data cached by lambda. Some recent, relevant tweets: https://mobile.twitter.com/BoilingData/status/1611082851184791553, https://twitter.com/gunnarmorling/status/1609958952040599552
⁴ But I would investigate whether serving several smaller files may be appropriate first (e.g. one for each year). The problem with developing a custom service is that if the data is large enough to merit one, the processing requirements of such a service are relatively high.
⁵ That said, the NSPL - a 2m row dataset which is over 1GB in csv format is 69Mb as a parquet file. Using this for occasional/ad hoc lookups is easy and takes less than 15 seconds on my laptop. Obviously not suitable for a production service, but not bad.