Consider a small MWE, taken from another question:
    DateTime               Data
    2017-11-21 18:54:31       1
    2017-11-22 02:26:48       2
    2017-11-22 10:19:44       3
    2017-11-22 15:11:28       6
    2017-11-22 23:21:58       7
    2017-11-28 14:28:28      28
    2017-11-28 14:36:40       0
    2017-11-28 14:59:48       1
The goal is to clip all values with an upper bound of 1. My answer uses np.clip, which works fine.

    np.clip(df.Data, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])
Or,
    np.clip(df.Data.values, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])
Both of which return the same answer. My question is about the relative performance of these two methods. Consider -
    df = pd.concat([df]*1000).reset_index(drop=True)

    %timeit np.clip(df.Data, a_min=None, a_max=1)
    1000 loops, best of 3: 270 µs per loop

    %timeit np.clip(df.Data.values, a_min=None, a_max=1)
    10000 loops, best of 3: 23.4 µs per loop

Why is there such a massive difference between the two, just by calling values on the latter? In other words...
Why are numpy functions so slow on pandas objects?
4 Answers
Answers 1
Yes, it seems like np.clip is a lot slower on pandas.Series than on numpy.ndarrays. That's correct, but it's actually (at least asymptotically) not that bad. 8000 elements is still in the regime where constant factors are major contributors to the runtime. I think this is a very important aspect of the question, so I'm visualizing this (borrowing from another answer):
    # Setup
    import pandas as pd
    import numpy as np

    def on_series(s):
        return np.clip(s, a_min=None, a_max=1)

    def on_values_of_series(s):
        return np.clip(s.values, a_min=None, a_max=1)

    # Timing setup
    timings = {on_series: [], on_values_of_series: []}
    sizes = [2**i for i in range(1, 26, 2)]

    # Timing
    for size in sizes:
        func_input = pd.Series(np.random.randint(0, 30, size=size))
        for func in timings:
            res = %timeit -o func(func_input)
            timings[func].append(res)

    %matplotlib notebook

    import matplotlib.pyplot as plt
    import numpy as np

    fig, (ax1, ax2) = plt.subplots(1, 2)

    for func in timings:
        ax1.plot(sizes,
                 [time.best for time in timings[func]],
                 label=str(func.__name__))
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.set_xlabel('size')
    ax1.set_ylabel('time [seconds]')
    ax1.grid(which='both')
    ax1.legend()

    baseline = on_values_of_series  # choose one function as baseline
    for func in timings:
        ax2.plot(sizes,
                 [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
                 label=str(func.__name__))
    ax2.set_yscale('log')
    ax2.set_xscale('log')
    ax2.set_xlabel('size')
    ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
    ax2.grid(which='both')
    ax2.legend()

    plt.tight_layout()
It's a log-log plot because I think this shows the important features more clearly. For example, it shows that np.clip on a numpy.ndarray is faster, but also that it has a much smaller constant factor in that case. The difference for large arrays is only a factor of ~3! That's still a big difference, but far less than the difference on small arrays.
However, that's still not an answer to the question where the time difference comes from.
The solution is actually quite simple: np.clip delegates to the clip method of the first argument:

    >>> np.clip??
    Source:
    def clip(a, a_min, a_max, out=None):
        """
        ...
        """
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    >>> np.core.fromnumeric._wrapfunc??
    Source:
    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # ...
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)
The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods, yes, completely different methods:

    >>> np.ndarray.clip
    <method 'clip' of 'numpy.ndarray' objects>
    >>> pd.Series.clip
    <function pandas.core.generic.NDFrame.clip>
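To make the delegation concrete, here is a minimal sketch. Recorder is a hypothetical class, not part of NumPy or pandas. The sketch assumes only that np.clip forwards to the clip method of its first argument, as the _wrapfunc source above suggests; the exact keyword arguments NumPy forwards may differ between versions, hence the permissive signature.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for a pandas object: it only records calls to clip."""

    def __init__(self):
        self.calls = []

    def clip(self, a_min, a_max, out=None, **kwargs):
        # If np.clip delegates, control ends up here and never reaches
        # NumPy's own C implementation.
        self.calls.append((a_min, a_max))
        return "handled by Recorder.clip"


rec = Recorder()
result = np.clip(rec, None, 1)

print(result)     # whatever Recorder.clip returned
print(rec.calls)  # the arguments np.clip forwarded
```

If the assumption holds, the return value comes straight from Recorder.clip, which is the same mechanism that routes np.clip on a Series to pd.Series.clip.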
Unfortunately, np.ndarray.clip is a C function, so it's hard to profile; pd.Series.clip, however, is a regular Python function, so it's easy to profile. Let's use a Series of 5000 integers here:
s = pd.Series(np.random.randint(0, 100, 5000))
For the np.clip on the values I get the following line-profiling:

    %load_ext line_profiler
    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 2.25641e-05 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 1.51795e-05 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      5.4      try:
        57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
        58
        59                                                   # An AttributeError occurs if the object does not have
        60                                                   # such a method in its class.
        61
        62                                                   # A TypeError occurs if the object does have such a method
        63                                                   # in its class, but its signature is not identical to that
        64                                                   # of NumPy's. This situation has occurred in the case of
        65                                                   # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)
But for np.clip on the Series I get a totally different profiling result:

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.000823794 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00081846 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      0.1      try:
        57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
        58
        59                                                   # An AttributeError occurs if the object does not have
        60                                                   # such a method in its class.
        61
        62                                                   # A TypeError occurs if the object does have such a method
        63                                                   # in its class, but its signature is not identical to that
        64                                                   # of NumPy's. This situation has occurred in the case of
        65                                                   # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.000804922 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                           def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                    *args, **kwargs):
      4971                                               """
       ...
      5021                                               """
      5022         1           12     12.0      0.6      if isinstance(self, ABCPanel):
      5023                                                   raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           10     10.0      0.5      inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           69     69.0      3.5      axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                               # GH 17276
      5030                                               # numpy doesn't like NaN as a clip value
      5031                                               # so ignore
      5032         1          158    158.0      8.1      if np.any(pd.isnull(lower)):
      5033         1            3      3.0      0.2          lower = None
      5034         1           26     26.0      1.3      if np.any(pd.isnull(upper)):
      5035                                                   upper = None
      5036
      5037                                               # GH 2747 (arguments were reversed)
      5038         1            1      1.0      0.1      if lower is not None and upper is not None:
      5039                                                   if is_scalar(lower) and is_scalar(upper):
      5040                                                       lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                               # fast-path for scalars
      5043         1            1      1.0      0.1      if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           28     28.0      1.4              (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1         1654   1654.0     84.3          return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                               result = self
      5048                                               if lower is not None:
      5049                                                   result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                               if upper is not None:
      5051                                                   if inplace:
      5052                                                       result = self
      5053                                                   result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                               return result

    Total time: 0.000662153 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                           def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.1      if ((lower is not None and np.any(isna(lower))) or
      4922         1           25     25.0      1.5              (upper is not None and np.any(isna(upper)))):
      4923                                                   raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1           22     22.0      1.4      result = self.values
      4926         1          571    571.0     35.4      mask = isna(result)
      4927
      4928         1           95     95.0      5.9      with np.errstate(all='ignore'):
      4929         1            1      1.0      0.1          if upper is not None:
      4930         1          141    141.0      8.7              result = np.where(result >= upper, upper, result)
      4931         1           33     33.0      2.0          if lower is not None:
      4932                                                       result = np.where(result <= lower, lower, result)
      4933         1           73     73.0      4.5      if np.any(mask):
      4934                                                   result[mask] = np.nan
      4935
      4936         1           90     90.0      5.6      axes_dict = self._construct_axes_dict()
      4937         1          558    558.0     34.6      result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.1      if inplace:
      4940                                                   self._update_inplace(result)
      4941                                               else:
      4942         1            1      1.0      0.1          return result
I stopped going into the subroutines at that point because it already highlights where pd.Series.clip does much more work than np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) line (158 timer units). At that point the pandas method hasn't even started clipping, and that single check already takes about 3 times longer than the entire NumPy call.
However several of these "overheads" become insignificant when the array is big:
    s = pd.Series(np.random.randint(0, 100, 1000000))

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.00593476 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00592779 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            1      1.0      0.0      try:
        57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
        58
        59                                                   # An AttributeError occurs if the object does not have
        60                                                   # such a method in its class.
        61
        62                                                   # A TypeError occurs if the object does have such a method
        63                                                   # in its class, but its signature is not identical to that
        64                                                   # of NumPy's. This situation has occurred in the case of
        65                                                   # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.00591302 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                           def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                    *args, **kwargs):
      4971                                               """
       ...
      5021                                               """
      5022         1           17     17.0      0.1      if isinstance(self, ABCPanel):
      5023                                                   raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           14     14.0      0.1      inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           97     97.0      0.7      axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                               # GH 17276
      5030                                               # numpy doesn't like NaN as a clip value
      5031                                               # so ignore
      5032         1          125    125.0      0.9      if np.any(pd.isnull(lower)):
      5033         1            2      2.0      0.0          lower = None
      5034         1           30     30.0      0.2      if np.any(pd.isnull(upper)):
      5035                                                   upper = None
      5036
      5037                                               # GH 2747 (arguments were reversed)
      5038         1            2      2.0      0.0      if lower is not None and upper is not None:
      5039                                                   if is_scalar(lower) and is_scalar(upper):
      5040                                                       lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                               # fast-path for scalars
      5043         1            2      2.0      0.0      if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           32     32.0      0.2              (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1        14092  14092.0     97.8          return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                               result = self
      5048                                               if lower is not None:
      5049                                                   result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                               if upper is not None:
      5051                                                   if inplace:
      5052                                                       result = self
      5053                                                   result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                               return result

    Total time: 0.00575753 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                           def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.0      if ((lower is not None and np.any(isna(lower))) or
      4922         1           28     28.0      0.2              (upper is not None and np.any(isna(upper)))):
      4923                                                   raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1          120    120.0      0.9      result = self.values
      4926         1         3525   3525.0     25.1      mask = isna(result)
      4927
      4928         1           86     86.0      0.6      with np.errstate(all='ignore'):
      4929         1            2      2.0      0.0          if upper is not None:
      4930         1         9314   9314.0     66.4              result = np.where(result >= upper, upper, result)
      4931         1           61     61.0      0.4          if lower is not None:
      4932                                                       result = np.where(result <= lower, lower, result)
      4933         1          283    283.0      2.0      if np.any(mask):
      4934                                                   result[mask] = np.nan
      4935
      4936         1           78     78.0      0.6      axes_dict = self._construct_axes_dict()
      4937         1          532    532.0      3.8      result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.0      if inplace:
      4940                                                   self._update_inplace(result)
      4941                                               else:
      4942         1            1      1.0      0.0          return result
There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (that's in the regime where the timing difference is a factor of ~3 on my computer).
The takeaway should probably be:
- Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
- Profiling, especially line-profiling, can be a great tool to find the places where the performance difference comes from.
- Always make sure to test differently sized objects in such cases. You could be comparing constant factors that probably don't matter except if you process lots of small arrays.
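The last point can be checked outside a notebook as well. Below is a rough sketch using the standard-library timeit module instead of %timeit; the sizes, repeat counts, and the use of np.random.default_rng are my own choices for illustration, and the ratios it prints will depend on the machine and on the installed NumPy and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def on_series(s):
    return np.clip(s, a_min=None, a_max=1)


def on_values(s):
    return np.clip(s.values, a_min=None, a_max=1)


rng = np.random.default_rng(0)
ratios = {}
for size in (10, 1_000, 100_000):
    s = pd.Series(rng.integers(0, 30, size=size))
    # Both paths must agree on the result before their timings are compared.
    assert np.array_equal(np.asarray(on_series(s)), on_values(s))
    # Take the best of a few repeats to reduce noise from other processes.
    t_series = min(timeit.repeat(lambda: on_series(s), number=20, repeat=3))
    t_values = min(timeit.repeat(lambda: on_values(s), number=20, repeat=3))
    ratios[size] = t_series / t_values
    print(f"size={size:>7}: Series/ndarray time ratio ~ {ratios[size]:.1f}")
```

Comparing the ratio at the smallest and largest size shows how much of the gap is constant overhead rather than per-element cost.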
Used versions:
- Python 3.6.3 64-bit on Windows 10
- Numpy 1.13.3
- Pandas 0.21.1
Answers 2
Just read the source code, it's clear.
    def clip(a, a_min, a_max, out=None):
        """a : array_like Array containing elements to clip."""
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # This situation has occurred in the case of
        # a downstream library like 'pandas'.
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)

    def _wrapit(obj, method, *args, **kwds):
        try:
            wrap = obj.__array_wrap__
        except AttributeError:
            wrap = None
        result = getattr(asarray(obj), method)(*args, **kwds)
        if wrap:
            if not isinstance(result, mu.ndarray):
                result = asarray(result)
            result = wrap(result)
        return result
Correction: since pandas v0.13.0_ahl1, pandas has its own implementation of clip.
Answers 3
There are two parts to the performance difference to be aware of here:
- Python overhead in each library (pandas being extra helpful)
- Difference in numeric algorithm implementation (pd.clip actually calls np.where)
Running this on a very small array should demonstrate the difference in Python overhead. For numpy, this is understandably very small; pandas, however, does a lot of checking (null values, more flexible argument processing, etc.) before getting to the heavy number crunching. I've tried to show a rough breakdown of the stages which the two codes go through before hitting C code bedrock.
data = pd.Series(np.random.random(100))
When using np.clip on an ndarray, the overhead is simply the numpy wrapper function calling the object's method:

    >>> %timeit np.clip(data.values, 0.2, 0.8)        # numpy wrapper, calls .clip() on the ndarray
    >>> %timeit data.values.clip(0.2, 0.8)            # C function call
    2.22 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    1.32 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Pandas spends more time checking for edge cases before getting to the algorithm:
    >>> %timeit np.clip(data, a_min=0.2, a_max=0.8)   # numpy wrapper, calls .clip() on the Series
    >>> %timeit data.clip(lower=0.2, upper=0.8)       # pandas API method
    >>> %timeit data._clip_with_scalar(0.2, 0.8)      # lowest level python function
    102 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    90.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    73.7 µs ± 805 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Relative to overall time, the overhead of both libraries before hitting C code is pretty significant. For numpy, the single wrapping instruction takes as much time to execute as the numeric processing. Pandas has ~30x more overhead just in the first two layers of function calls.
To isolate what is happening at the algorithm level, we should check this on a larger array and benchmark the same functions:
    >>> data = pd.Series(np.random.random(1000000))

    >>> %timeit np.clip(data.values, 0.2, 0.8)
    >>> %timeit data.values.clip(0.2, 0.8)
    2.85 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.85 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    >>> %timeit np.clip(data, a_min=0.2, a_max=0.8)
    >>> %timeit data.clip(lower=0.2, upper=0.8)
    >>> %timeit data._clip_with_scalar(0.2, 0.8)
    12.3 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    12.3 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    12.2 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The Python overhead in both cases is now negligible; time for wrapper functions and argument checking is small relative to the calculation time on 1 million values. However, there is a 3-4x speed difference which can be attributed to numeric implementation. By investigating a bit in the source code, we see that the pandas implementation of clip actually uses np.where, not np.clip:

    def clip_where(data, lower, upper):
        ''' Actual implementation in pd.Series._clip_with_scalar (minus NaN handling). '''
        result = data.values
        result = np.where(result >= upper, upper, result)
        result = np.where(result <= lower, lower, result)
        return pd.Series(result)

    def clip_clip(data, lower, upper):
        ''' What would happen if we used ndarray.clip instead. '''
        return pd.Series(data.values.clip(lower, upper))
The additional effort required to check each boolean condition separately before doing a conditional replace would seem to account for the speed difference. Specifying both upper and lower would result in 4 passes through the numpy array (two inequality checks and two calls to np.where). Benchmarking these two functions shows that 3-4x speed ratio:

    >>> %timeit clip_clip(data, lower=0.2, upper=0.8)
    >>> %timeit clip_where(data, lower=0.2, upper=0.8)
    11.1 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.97 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm not sure why the pandas devs went with this implementation. np.clip may be a newer API function that previously required a workaround. There is also a little more to it than I've gone into here, since pandas checks for various cases before running the final algorithm, and this is only one of the implementations that may be called.
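As a small sanity check on the comparison above, the following sketch confirms that the np.where formulation and ndarray.clip produce the same values on NaN-free input. The helper name clip_with_where and the sample array are made up for illustration; this is a stripped-down version of the idea, not the pandas code itself.

```python
import numpy as np


def clip_with_where(values, lower, upper):
    # Two comparisons plus two np.where calls: four passes over the data.
    result = np.where(values >= upper, upper, values)
    result = np.where(result <= lower, lower, result)
    return result


values = np.array([0.05, 0.2, 0.5, 0.8, 0.95])

via_where = clip_with_where(values, 0.2, 0.8)
via_clip = values.clip(0.2, 0.8)  # a single call into NumPy's clip

print(via_where)
print(np.array_equal(via_where, via_clip))
```

Since the two approaches agree on the output, any timing gap between them comes from how the work is done, not from what is computed.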
Answers 4
The performance differs because, when a pandas object is passed, numpy first looks up the pandas implementation of the function using getattr rather than using its own builtin implementation.
It's not numpy operating on the pandas object that is slow; it's the pandas version.
When you do

    np.clip(pd.Series([1,2,3,4,5]), a_min=None, a_max=1)

_wrapfunc is being called:

    # Code from source
    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)

Due to the getattr call in _wrapfunc:

    getattr(pd.Series([1,2,3,4,5]), 'clip')(None, 1)
    # Equivalent to `pd.Series([1,2,3,4,5]).clip(lower=None, upper=1)`
    # 0    1
    # 1    1
    # 2    1
    # 3    1
    # 4    1
    # dtype: int64
If you go through the pandas implementation, there is a lot of pre-checking work that is done. That's why functions that numpy hands off to a pandas implementation show such a difference in speed.
Not only clip: functions like cumsum, cumprod, reshape, searchsorted, transpose and many more use the pandas version rather than the numpy one when you pass them a pandas object.
It might appear that numpy is doing the work on those objects, but under the hood it's the pandas function.
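One way to see this from the outside is to look at the return type. The sketch below uses cumsum; it assumes NumPy and pandas versions in which np.cumsum still hands a Series to its own cumsum method, so the result types may differ on other versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

from_series = np.cumsum(s)         # expected to end up in Series.cumsum
from_values = np.cumsum(s.values)  # stays inside NumPy

print(type(from_series).__name__, type(from_values).__name__)
```

A Series coming back from a NumPy function is a strong hint that the pandas method did the work.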