We and others have worked with multidimensional arrays in both local and distributed environments across multiple languages. The Python scientific computing stack is great for working with arrays that fit in memory. New distributed computing platforms, especially Spark, have made it easy to flexibly implement distributed workflows that scale well to potentially massive data sets. But it is not straightforward to use these tools to work with arrays in Python, or to move seamlessly between small, medium, and very large data sets.
We want to be able to build projects on an interface like NumPy’s ndarray and know that we can leverage either local or distributed operations. And we want to target Python because of its rich libraries for scientific computing and machine learning. Specifically, we want an object that:
- implements most of the ndarray interface
- implements a subset of functional operators (e.g. map, filter, reduce)
- supports a variety of backends (e.g. local, multi-core, distributed)
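To make the goal concrete, here is a minimal sketch of what such an object might look like. The class and method names are illustrative only, not Bolt’s actual API: the point is that ndarray-style properties (`shape`, `sum`) and functional operators (`map`, `filter`) live side by side on one object, so the same code could later be pointed at a distributed backend.

```python
import numpy as np

class LocalBoltArray:
    """Hypothetical local backend: wraps an ndarray and exposes parts
    of its interface plus a few functional operators over axis 0."""

    def __init__(self, values):
        self.values = np.asarray(values)

    # ndarray-style interface
    @property
    def shape(self):
        return self.values.shape

    def sum(self, axis=None):
        return self.values.sum(axis=axis)

    # functional operators, applied record-by-record along the first axis
    def map(self, func):
        return LocalBoltArray(np.stack([func(v) for v in self.values]))

    def filter(self, func):
        return LocalBoltArray(np.stack([v for v in self.values if func(v)]))

a = LocalBoltArray(np.arange(12).reshape(3, 4))
b = a.map(lambda row: row * 2)   # element-wise over axis 0
print(b.shape)                   # (3, 4)
print(b.sum())                   # 132
```

A distributed backend would keep the same surface but implement `map` and `filter` against an RDD of array records instead of a local ndarray.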
For distributed computation, we currently target Spark’s RDD (resilient distributed dataset), which provides an elegant API for functional operations (map, reduce, join, filter, etc.) but is not easy to work with as a multidimensional array. Other projects have developed some of the necessary abstractions (e.g. Thunder, spylearn, sparkit-learn). We hope to solve this problem well, once, so others can use and build on it.
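To see the mismatch, consider computing a mean along an axis of an array stored as keyed records, the natural RDD representation. With only functional primitives you manage the keys and chunk bookkeeping by hand; an ndarray-style wrapper hides all of it. The sketch below emulates the keyed-record layout with a plain Python list in place of a real RDD, so it runs without Spark:

```python
import numpy as np
from functools import reduce

# Emulate an RDD of (key, value) records with a plain list: a 4 x 3
# array stored one row per record, keyed by row index.
rdd = [(i, np.array(row)) for i, row in
       enumerate([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])]

# Mean over axis 0 via functional primitives: sum the row vectors
# with a reduce, then divide by the record count.
total = reduce(lambda x, y: x + y, (v for _, v in rdd))
mean = total / len(rdd)
print(mean)  # [5.5 6.5 7.5]

# With array semantics the same computation is a single method call.
local = np.stack([v for _, v in rdd])
assert np.allclose(mean, local.mean(axis=0))
```

On a real RDD the reduce would run in parallel across partitions, but the bookkeeping burden on the user is the same, which is exactly what an ndarray-style layer should absorb.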
Bolt aims to implement an object with the properties listed above. It currently supports both NumPy (for local computation) and Spark (for distributed computation), but we envision adding other backends in the future.
The project is in its early stages, so we welcome feedback, ideas, and use cases; just join us in the chatroom.