I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but when this led to abysmal performance I eventually rewrote it in vanilla Python, with for loops instead of vectorized operations and lists and dicts instead of DataFrames and Series. To my surprise, the performance of the vanilla Python code ended up being far higher than that of the code written using Pandas.
As my hand-coded data processing code is substantially bigger and messier than the original Pandas code, I haven't quite given up on using Pandas, and I'm currently trying to optimize the Pandas code, without much success.
The core of the data processing step consists of the following: I first divide the rows into several groups, as the data consists of several thousand time series (one for each "individual"), and I then do the same data processing on each group: a lot of summarization, combining different columns into new ones, etc.
I profiled my code using %lprun (the line_profiler magic) in a Jupyter Notebook, and the bulk of the time is spent on the following and other similar lines:
grouped_data = data.groupby('pk')
data[[v + 'Diff' for v in val_cols]] = grouped_data[val_cols].transform(lambda x: x - x.shift(1)).fillna(0)
data[[v + 'Mean' for v in val_cols]] = grouped_data[val_cols].rolling(4).mean().shift(1).reset_index()[val_cols]
(...)
...a mix of vectorized and non-vectorized processing. I understand that the non-vectorized operations won't be faster than my handwritten for loops, since that's basically what they are under the hood, but how can they be so much slower? We're talking about a 10-20x performance degradation between my handwritten code and the Pandas code.
Am I doing something very, very wrong?
1 Answer
No, I don't think you should give up on pandas. There are definitely better ways to do what you're trying to do. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. They're basically implemented as for loops, so you might as well use Python for loops directly.
The real speed gain comes from getting rid of those loops and using pandas functions that implicitly vectorise their operations, because those run at C speed. For example, your first line of code can be simplified greatly, as I show below.
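To get an intuition for why, here is a rough sketch (not pandas' actual implementation, just an illustration using the names from your code) of what your transform call effectively amounts to:

# Roughly what transform(lambda x: ...) does: call the lambda once per
# group, in a Python-level loop, then stitch the results back together.
pieces = []
for _, group in data.groupby('pk'):
    pieces.append(group[val_cols] - group[val_cols].shift(1))
out = pd.concat(pieces).fillna(0)

The per-group function call and the final concatenation all happen at Python speed, which is where the time goes.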
In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.
Setup
import numpy as np
import pandas as pd

data = {'pk': np.random.choice(10, 1000)}
data.update({'Val{}'.format(i): np.random.randn(1000) for i in range(100)})
df = pd.DataFrame(data)
g = df.groupby('pk')
c = ['Val{}'.format(i) for i in range(100)]
transform + sub + shift → diff
Your first line of code can be replaced with a simple diff statement:
v1 = df.groupby('pk')[c].diff().fillna(0)
Sanity Check
v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
np.allclose(v1, v2)
True
Performance
%timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
10 loops, best of 3: 44.3 ms per loop

%timeit df.groupby('pk')[c].diff().fillna(0)
100 loops, best of 3: 9.63 ms per loop
Removing redundant indexing operations
As far as your second line of code is concerned, I don't see much room for improvement, although you can get rid of the reset_index() + [val_cols] call if your groupby statement does not set pk as the index:
g = df.groupby('pk', as_index=False)
Your second line of code then reduces to:
v3 = g[c].rolling(4).mean().shift(1)
Sanity Check
g2 = df.groupby('pk')
v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]
np.allclose(v3.fillna(0), v4.fillna(0))
True
Performance
%timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
10 loops, best of 3: 46.5 ms per loop

%timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)
10 loops, best of 3: 41.7 ms per loop
Note that timings vary across machines, so test your code thoroughly to confirm there is indeed an improvement on your data.
While the difference this time isn't as large, it shows there are still improvements to be made, and the impact could well be bigger on larger data. One way to check is sketched below.
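As a hypothetical check (big_data, big, and n are names I'm introducing here), you could rebuild the setup at 100x the rows and rerun the same timings:

# Regenerate the setup frame with many more rows to see how the gap scales
n = 100000
big_data = {'pk': np.random.choice(10, n)}
big_data.update({'Val{}'.format(i): np.random.randn(n) for i in range(100)})
big = pd.DataFrame(big_data)

%timeit big.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
%timeit big.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)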
Afterword
In conclusion, these operations are slow precisely where vectorization is missing. The key is to get rid of any approach that does not use vectorized operations.
To this end, it is sometimes beneficial to step out of pandas space and into numpy space. Operations on numpy arrays tend to be much faster than their pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, np.where is faster than pd.DataFrame.where, and so on).
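As a hedged sketch of what that looks like here (assuming we first sort by pk so that each group is a contiguous block; sorted_df, keys, vals, and out are names introduced for illustration), the grouped diff from earlier could be pushed down to raw numpy:

# Sketch only: compute the grouped diff on raw numpy arrays.
# Sorting by 'pk' makes each group a contiguous block of rows.
sorted_df = df.sort_values('pk', kind='mergesort')
keys = sorted_df['pk'].values
vals = sorted_df[c].values            # shape: (n_rows, n_val_cols)

out = np.zeros_like(vals)
out[1:] = vals[1:] - vals[:-1]        # row-to-row differences
out[1:][keys[1:] != keys[:-1]] = 0    # zero the first row of each group
# Note: rows of `out` follow the sorted order, not the original one.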
Sometimes, loops cannot be avoided. In that case, you can write a basic looping function and then vectorise it with numba or cython. Examples of that are in Enhancing Performance in the pandas docs, straight from the horse's mouth.
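For example, a minimal numba sketch (grouped_diff is a name introduced here; it again assumes rows sorted by pk, and operates on a single value column as a 1-D array):

import numpy as np
from numba import njit

@njit
def grouped_diff(keys, vals):
    # Plain Python loop, compiled to machine code by numba.
    # A diff is taken only between consecutive rows of the same group.
    out = np.zeros_like(vals)
    for i in range(1, len(vals)):
        if keys[i] == keys[i - 1]:
            out[i] = vals[i] - vals[i - 1]
    return out

s = df.sort_values('pk', kind='mergesort')
diffs = grouped_diff(s['pk'].values, s['Val0'].values)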
In still other cases, your data is simply too big to fit reasonably into numpy arrays. At that point, it is time to give up and switch to dask or spark, both of which offer high-performance distributed computing frameworks for working with big data.