Over the past three years, Apache Arrow has exploded in popularity across a range of open source communities. In the Python community alone, Arrow is downloaded more than 500,000 times a month. The Arrow project is both a specification for representing data efficiently for in-memory analytics and a set of libraries, in a dozen languages, for operating on the Arrow columnar format.
Just as most automobile manufacturers source their transmissions from a supplier instead of designing and building their own, projects can rely on Arrow as an efficient, shared way to manage and operate on in-memory data for diverse analytical workloads, including machine learning, artificial intelligence, data frames, and SQL engines.
The Gandiva initiative for Apache Arrow is a new LLVM-based execution kernel for Arrow. Gandiva provides significant performance improvements for low-level operations on Arrow buffers. We first built this work into Dremio to improve the efficiency and performance of analytical workloads on our platform, and it will become available to users with Dremio 3.0. In this post I describe the motivation for the initiative, implementation details, some performance results, and plans for the future.
A note on the name: Gandiva is a mythical bow, from the Indian epic The Mahabharata, used by the hero Arjuna. According to the story, Gandiva is indestructible, and it makes the arrows it fires a thousand times more powerful.