Originally posted: 2020-10-22.

Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds

In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly-unrelated circumstances. However, I’ve always struggled to describe exactly what it is, and what it does.

The official description of Arrow is:

a cross-language development platform for in-memory analytics

which is quite abstract — and for good reason. The project is extremely ambitious, and aims to provide the backbone for a wide range of data processing tasks. This means it sits at a low level, providing building blocks for higher-level, user-facing analytics tools like pandas or dplyr.

As a result, the importance of the project can be hard to understand for users who run into it occasionally in their day-to-day work, because much of what it does is behind the scenes.

In this post I describe some of the user-facing features of Apache Arrow which I have run into in my work, and explain why they are all facets of more fundamental problems which Apache Arrow aims to solve.

By connecting these dots it becomes clear why Arrow is not just a useful tool to solve some practical problems today, but one of the most exciting emerging tools, with the potential to be the engine behind large parts of future data science workflows.

A striking feature of Arrow is that it can read CSVs into pandas more than 10x faster than pandas.read_csv.

This is actually a two-step process: Arrow reads the data into memory as an Arrow table, which is really just a collection of record batches, and then converts the Arrow table into a pandas dataframe.
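As a rough sketch (assuming pyarrow is installed, and using a hypothetical file path data.csv), the two steps look like this:

```python
# A minimal sketch of the two-step process: read the CSV into an Arrow
# table, then convert it into a pandas dataframe. 'data.csv' is a
# hypothetical file path.
import pyarrow.csv as pv

table = pv.read_csv("data.csv")  # multi-threaded read into an Arrow Table
df = table.to_pandas()           # convert the Arrow Table into a pandas dataframe
```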

The speedup is thus a consequence of the underlying design of Arrow:

  • Arrow has its own in-memory storage format. When we use Arrow to load data into pandas, we are really loading data into the Arrow format (an in-memory format for data frames), and then translating this into the pandas in-memory format. Part of the speedup in reading CSVs therefore comes from the careful design of the Arrow columnar format itself.

  • Data in Arrow is stored in memory in record batches, a 2D data structure containing contiguous columns of data of equal length. A ‘table’ can be created from these batches without additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each representing a contiguous chunk of memory), as sketched below. This design means that data can be read in parallel, rather than with the single-threaded approach of pandas.
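A minimal sketch of how record batches combine into a table with chunked columns (the column names are illustrative):

```python
import pyarrow as pa

# A small record batch: columns of equal length, each contiguous in memory
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"],
)

# Building a table from batches does not copy the data: each column of the
# table is 'chunked', with one chunk per underlying batch
table = pa.Table.from_batches([batch, batch])
print(table.column("id").num_chunks)  # 2
```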

Running Python user-defined functions (UDFs) in Apache Spark has historically been very slow — so slow that the usual recommendation was not to use them on datasets of any significant size.

More recently, Apache Arrow has made it possible to efficiently transfer data between JVM and Python processes. Combined with vectorised UDFs, this has resulted in huge speedups.

What’s going on here?

Prior to the introduction of Arrow, the process of translating data from the Java representation to the Python representation and back was slow, involving serialisation and deserialisation. It was also carried out one row at a time, which is slower because many optimised data processing operations work most efficiently on data held in a columnar format, with the data in each column held in contiguous sections of memory.

The use of Arrow almost completely eliminates the serialisation and deserialisation step, and also allows data to be processed in columnar batches, meaning more efficient vectorised algorithms can be used.
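For illustration, here is a sketch of a vectorised (pandas) UDF in PySpark. It assumes Spark 3.x with pyarrow installed, and the column and function names are made up:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# A vectorised UDF receives and returns whole pandas Series, with Arrow
# handling the transfer of columnar batches between the JVM and Python
@pandas_udf("double")
def add_one(v: pd.Series) -> pd.Series:
    return v + 1

df = spark.createDataFrame([(1.0,), (2.0,)], ["value"])
df.select(add_one("value")).show()
```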

The ability of Arrow to transfer data between languages without a large serialisation/deserialisation overhead is a key feature, making it much more feasible for algorithms implemented in one language to be used by data scientists working in others.

Arrow is able to read a folder containing many data files into a single dataframe. Arrow also supports reading folders where data is partitioned into subfolders.

This functionality is illustrative of the higher-level design of Arrow around record batches. If everything is a record batch, then it matters little whether the source data is stored in a single file or multiple files.
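As a sketch (the folder path, partitioning scheme and column names here are hypothetical), the Datasets API lets a partitioned folder of Parquet files be treated as a single, filterable table:

```python
import pyarrow.dataset as ds

# Treat a folder of (possibly partitioned) Parquet files as one dataset
dataset = ds.dataset("data/sales/", format="parquet", partitioning="hive")

# Only the matching partitions and row groups are read into the Arrow table
table = dataset.to_table(filter=ds.field("year") == 2020)
df = table.to_pandas()
```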

Arrow can be used to read Parquet files into data science tools like Python and R, correctly representing the data types in the target tool. Since Parquet is a self-describing format, with the data types of the columns specified in the schema, getting data types right may not seem difficult. But a process of translation is still needed to load data into different data science tools, and this is surprisingly complex.

For example, historically the pandas integer type has not allowed integers to be null, despite nulls being permitted in Parquet files. Similar quirks apply to other languages. Arrow handles this process of translation for us. Data is always loaded into the Arrow format first, but Arrow provides translators that can then convert this into language-specific in-memory formats like the pandas dataframe. This means that only a single translator is needed between Arrow and each target format, a huge improvement over the current situation of ‘n choose 2’ translators being needed (one for each pair of tools).
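The nullable-integer quirk can be seen directly. In this sketch, an Arrow int64 array holding a null is converted to pandas, and the column is promoted to float64 because pandas' default integer dtype cannot represent missing values:

```python
import pyarrow as pa

# An Arrow int64 array can contain nulls...
arr = pa.array([1, 2, None], type=pa.int64())

# ...but the default conversion to pandas promotes the column to float64,
# with the null represented as NaN
print(arr.to_pandas().dtype)  # float64
```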

This problem of translation is one that Arrow aims eventually to eliminate altogether: ideally there would be a single in-memory format for dataframes rather than each tool having its own representation.

Arrow makes it easy to write data held in-memory in tools like pandas and R to disk in Parquet format.
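A minimal sketch in Python (the dataframe contents and file name are illustrative):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Convert to an Arrow table, then write it to disk as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, "data.parquet")

# Equivalently, pandas can delegate to pyarrow directly:
# df.to_parquet("data.parquet", engine="pyarrow")
```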

Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on disk and in memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is desirable if you are IO-bound. This is especially the case if you are loading data from cloud storage such as AWS S3.

Julien Le Dem explains this further in a blog post discussing the two formats:

The trade-offs for columnar data are different for in-memory. For data on disk, usually IO dominates latency, which can be addressed with aggressive compression, at the cost of CPU. In memory, access is much faster and we want to optimise for CPU throughput by paying attention to cache locality, pipelining, and SIMD instructions.

This makes it desirable to have close cooperation in the development of the in-memory and on-disk formats, with predictable and speedy translation between the two representations — and this is what is offered by the Arrow and Parquet formats.

Whilst very useful, the user-facing features discussed so far are not revolutionary. The real power is in the potential of the underlying building blocks to enable new approaches to data science tooling. At the moment (autumn 2020), some of this is still work in progress by the team behind Arrow.

In Apache Arrow, dataframes are composed of record batches. If stored on disk in Parquet files, these batches can be read into the in-memory Arrow format very quickly.

We’ve seen how this can be used to read a CSV file into memory. But more broadly, this concept opens the door to entire data science workflows that are parallelised and operate on record batches, eliminating the need for the full table to be stored in memory.
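As a sketch of what batch-at-a-time processing looks like (the folder path and column name are hypothetical), a dataset larger than memory can be aggregated by streaming over its record batches:

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("data/large_table/", format="parquet")

# Stream over record batches so the full table never has to fit in memory
total = 0
for batch in dataset.to_batches(columns=["value"]):
    total += pc.sum(batch.column(0)).as_py()
print(total)
```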

Currently, the close coupling of data science tools and their in-memory representations of dataframes means that algorithms written in one language are not easily portable to other languages. This means the same standard operations, such as filters and aggregations, get endlessly re-written and optimisations do not easily translate between tools.

Arrow provides a standard for representing dataframes in memory, and a mechanism allowing multiple languages to refer to the same in-memory data.

This creates an opportunity for the development of:

an Arrow-native columnar query execution engine in C++, with intended use not only from C++ but also from user-space languages like Python, R, and Ruby.

This would provide a single, highly optimised codebase, available for use from multiple languages. The pool of developers who have an interest in maintaining and contributing to this library would be much larger because it would be available for use across a wide range of tools.

These ideas come together in the description of Arrow as an ‘API for data’. The idea is that Arrow provides a cross-language in-memory data format and an associated query execution engine, which together provide the building blocks for analytics libraries. Like any good API, Arrow provides a performant solution to common problems without the user needing to fully understand the implementation.

This paves the way for a step change in data processing capabilities, with analytics libraries:

  • being parallelised by default

  • applying highly optimised calculations on dataframes

  • no longer needing to work with the constraint that the full dataset must fit into memory

whilst breaking down barriers to sharing data between tools, via faster ways to transmit data either on a single machine or over the network.
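One concrete mechanism for this is the Arrow IPC file format (written here via the Feather functions in pyarrow), which lets another process, possibly written in a different language, memory-map the same data without deserialising it. A sketch with an illustrative file name:

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Write the table in the Arrow IPC (Feather V2) format
feather.write_feather(table, "shared.arrow")

# Another process, or another language's Arrow implementation, can
# memory-map the file and read it without a deserialisation step
reopened = feather.read_table("shared.arrow", memory_map=True)
```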

We can get a glimpse of the potential of this approach in the vignette in the Arrow package in R, in which a 37GB dataset with 2 billion rows is processed and aggregated on a laptop in under 2 seconds.

Where does this leave existing high-level data science tools and languages like SQL, pandas and R?

As the vignette illustrates, the intention is not for Arrow to compete directly with these tools. Instead, higher level tools can use Arrow under the hood, imposing little constraint on their user-facing API. It therefore allows for continued innovation in the expressiveness and features of different tools whilst improving performance and saving authors the task of reinventing the common components.

https://arrow.apache.org/overview/

https://arrow.apache.org/use_cases/

https://arrow.apache.org/powered_by/

https://arrow.apache.org/blog/

https://ursalabs.org/blog/

https://wesmckinney.com/archives.html

https://www.youtube.com/watch?v=fyj4FyH3XdU

Arrow design documents:

Filesystems API

Datasets API

DataFrame API

C++ Query Engine