Four Python frameworks for parallel processing

Nik Vaklev
Jan 25, 2020

Ray, Numba, Dask and Vaex are a few of the Python libraries that can save you precious time when working with big data.

Python is a standard tool for analysing and visualising data. The process generally looks as follows:

  1. loading the data from different sources, e.g. SQL databases, files, etc.;
  2. reshaping the data, which may involve simple data cleansing, some calculations and checks on the raw data, etc.;
  3. (optionally) building a predictive model;
  4. plotting the results.

It all works until the dataset gets bigger or the calculations require more resources, e.g. RAM and CPUs. A constant source of frustration for analysts is waiting for code to execute. If it takes too long, iterating on the code becomes a protracted and torturous process. Hence, we all try to write somewhat efficient code. Here I want to review a few libraries that can save you time when dealing with bigger datasets.

1. Dask

If data frames are your preferred tool, Dask can be an easy way to speed up things.

“Big Data” collections like parallel arrays, dataframes, and lists extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.

The idea is simple: the library offers APIs which mimic NumPy arrays or Pandas dataframes, but the underlying implementation carries out the calculations in parallel. It also offers an interface that mirrors the standard library's concurrent.futures module. For example:

# Pandas: reads a single file into memory and computes eagerly
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask: same API, but reads many files lazily and computes in parallel
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()
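
To try the futures-style interface mentioned above, you can spin up a local dask.distributed cluster. Here is a minimal sketch (the square function is made up for illustration, and Client() with defaults simply starts local workers):

from dask.distributed import Client

client = Client()  # starts a local cluster of worker processes
                   # (a script may need an `if __name__ == '__main__':` guard)

def square(x):
    return x * x

# map() submits the work and returns futures immediately;
# gather() blocks until all the results are available
futures = client.map(square, range(4))
print(client.gather(futures))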

2. Vaex.io

Vaex is a python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation, etc., on an N-dimensional grid up to a billion objects/rows per second.

Vaex is similar to Dask in that it offers another implementation of data frames. Vaex evaluates matrix operations lazily and also minimises data copying. The latter becomes a real problem when you work with large datasets (> 1 GB). The library additionally has built-in support for a few visualisations, mostly stats-related, which are great for exploring large datasets.
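
A minimal sketch of how that looks in practice (the file name and the vx/vy columns are made up for illustration):

import vaex

# Open a (hypothetical) HDF5 file; the data is memory-mapped, not read into RAM
df = vaex.open('big_dataset.hdf5')

# A virtual column: stored as an expression and evaluated lazily, with no copy of the data
df['speed'] = (df.vx**2 + df.vy**2)**0.5

# Statistics are computed out-of-core over the full dataset
print(df.mean(df.speed), df.count())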

A sub-module of the library offers APIs for Apache Arrow, which is an in-memory columnar format. Support for Arrow is a rather cool idea because it allows access to the same data from multiple languages. For instance, a Python process can load and crunch the data and store it in Arrow, and a Java process can then pick up the same data from memory without any serialisation/deserialisation.

3. Numba

The big idea behind this library is speeding up your code at run time with JIT (just-in-time) compilation.

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.

The magic happens using decorators:

from numba import jit, int32

@jit(int32(int32, int32))
def f(x, y):
    # A somewhat trivial example
    return x + y

Through the decorator @jit(int32(int32, int32)), the LLVM compiler knows that this function takes two 32-bit integers as inputs and returns a 32-bit integer. The compiler then optimises the memory allocation and function calls accordingly. The aim is to reduce the usual overhead in Python, where all variables, including numbers, strings, etc., are actually complex objects in memory. The decorators are unobtrusive and easy to add or remove. JIT compilation is limited, though, to optimising fairly simple numerical calculations.
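
Since the topic here is parallelism, it is worth adding that Numba can also spread loops across CPU cores with parallel=True and prange. A minimal sketch (the function and the data are made up for illustration):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(a):
    total = 0.0
    # prange splits the iterations across threads; the += is handled as a reduction
    for i in prange(a.shape[0]):
        total += a[i] * a[i]
    return total

x = np.random.rand(10_000_000)
print(sum_of_squares(x))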

4. Ray

Ray is a fast and simple framework for building and running distributed applications.

Ray is a real treat! It is a framework for distributed computing: it can span multiple nodes or simply help you make the most of your CPUs. The reason I like it is that it gives you direct control over the execution:

import ray
ray.init()

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

The call to remote triggers the asynchronous computation, and the call to get forces the main thread to wait for all the “futures” to finish. I like this approach to parallel computing because it is close to Scala futures and Akka actors. The beauty of distributed computing is that you can package up any long-running process and wait for it to report back (or not!) without blocking the main thread.
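
For that non-blocking style, ray.wait lets you pick up results as they finish instead of blocking on all of them at once. A minimal sketch (slow_task is made up for illustration):

import time
import ray

ray.init()

@ray.remote
def slow_task(x):
    time.sleep(x)
    return x

futures = [slow_task.remote(i) for i in range(4)]

# Handle each result as soon as it is ready rather than waiting for the whole batch
while futures:
    ready, futures = ray.wait(futures, num_returns=1)
    print(ray.get(ready[0]))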

Another cool feature of Ray is that it, too, uses Apache Arrow to share data between processes, which opens the door to integration with other languages.

I used Ray to optimise writing to disk in this example. The code changes required to include Ray were minimal, but the speed-up was significant: from 270s down to 170s.

5. If you are really desperate, nothing beats writing plain C

Python is C under the hood: the reference interpreter, CPython, is written largely in C. So, if you have a really well-defined problem, translating the routines into C and adding a thin wrapper to interface with Python is still the best answer.
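
One low-friction way to wire such a routine in is ctypes from the standard library. A minimal sketch, assuming you have compiled a hypothetical sum_squares routine into libfast.so (both the library name and the function are my own illustration):

import ctypes
import numpy as np

# Assumes a shared library built from C code exposing:
#   double sum_squares(const double *a, long n);
lib = ctypes.CDLL('./libfast.so')
lib.sum_squares.restype = ctypes.c_double
lib.sum_squares.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_long]

x = np.random.rand(1_000_000)
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print(lib.sum_squares(ptr, x.size))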

Conclusion

As you can see, there is no single best solution; there are multiple options depending on the problem at hand.

I often find myself in a situation where I want to execute long-running computations in a non-blocking way, hence Ray is now my preferred choice in Python for parallel computing. Otherwise, CPU-bound jobs end up running sequentially on a single core, since Python is effectively single-threaded by default due to the Global Interpreter Lock. N.B. There are modules in the standard library for running processes in parallel, but they are fairly low-level, which means extra work is needed every time to parallelise the tasks (with mixed results in my case).

Once you are using all your CPUs, basic numerical calculations can be further optimised with Numba's JIT. The best approach, but the least safe one, is writing C modules, though. If you are brave, of course.

I personally do not think optimised data frames (Dask, Vaex, etc.) are a great option. If you are processing such large data volumes on your own computer, you should probably think about using PySpark data frames running on a remote cluster. Loading 8 GB into memory on an ordinary machine can never be fast.

My name is Nik Vaklev and I am the founder of Techccino Ltd | Bespoke Business Apps.
