While profiling the memory consumption of my algorithm, I was surprised that smaller inputs sometimes needed more memory.
It all boils down to the following usage of pandas.unique():
    import numpy as np
    import pandas as pd
    import sys

    N = int(sys.argv[1])
    a = np.arange(N, dtype=np.int64)
    b = pd.unique(a)
With N=6*10^7 it needs 3.7 GB of peak memory, but with N=8*10^7 "only" 3 GB.
Scanning different input sizes yields the following graph:
Out of curiosity and for self-education: how can the counterintuitive behavior (i.e. more memory for a smaller input size) around N=5*10^7 and N=1.3*10^7 be explained?
Here are the scripts for producing the memory consumption graph on Linux:
pandas_unique_test.py:
    import numpy as np
    import pandas as pd
    import sys

    N = int(sys.argv[1])
    a = np.arange(N, dtype=np.int64)
    b = pd.unique(a)
show_memory.py:
    import sys
    import matplotlib.pyplot as plt

    ns = []
    mems = []
    for line in sys.stdin.readlines():
        n, mem = map(int, line.strip().split(" "))
        ns.append(n)
        mems.append(mem)

    plt.plot(ns, mems, label='peak-memory')
    plt.xlabel('n')
    plt.ylabel('peak memory in KB')
    ymin, ymax = plt.ylim()
    plt.ylim(0, ymax)
    plt.legend()
    plt.show()
run_perf_test.sh:
    WRAPPER="/usr/bin/time -f%M"   # peak memory in KB
    N=1000000
    while [ $N -lt 100000000 ]
    do
        printf "$N "
        $WRAPPER python pandas_unique_test.py $N
        N=`expr $N + 1000000`
    done
And now:

    sh run_perf_test.sh 2>&1 | python show_memory.py
1 Answer
Let's see...
pandas.unique says it's a "hash-table based unique".
It calls this function to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.
The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.
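For illustration, here is a minimal sketch of driving that hash table directly, assuming the private pandas._libs.hashtable module (the class and method names below are internal API and may change between pandas versions):

    import numpy as np
    from pandas._libs import hashtable as htable

    N = 6 * 10**7
    a = np.arange(N, dtype=np.int64)

    # Pre-size the table with the length of the input, roughly what
    # pd.unique() does via size_hint (private API, version-dependent).
    table = htable.Int64HashTable(N)
    uniques = table.unique(a)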
Those functions are defined (templated) here in khash.h.
It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less, I might be off here).
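Plugging the two input sizes from the question into that rough formula (this only evaluates the estimate above, it's not a measurement):

    def rough_bucket_bytes(size_hint):
        # Estimate from above: flag buckets plus key/value buckets.
        return (size_hint >> 5) * 4 + size_hint * 8 * 2

    for n in (6 * 10**7, 8 * 10**7):
        # Prints roughly 0.9 GiB for 6*10^7 and 1.2 GiB for 8*10^7.
        print(n, rough_bucket_bytes(n) / 2**30, "GiB")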
Then, HashTable.unique() is called.
It allocates an empty Int64Vector, which seems to quadruple in size whenever it gets filled, starting from 128.
It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)
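To get a feel for how that growth behaves, here is a tiny simulation of the quadruple-when-full policy described above (an assumption based on this reading of the code, not pandas' actual implementation):

    def int64vector_capacity(n, start=128, factor=4):
        # Assumed policy: start at 128 slots and quadruple the capacity
        # every time the vector fills up.
        cap = start
        while cap < n:
            cap *= factor
        return cap

    # Just below vs. just above one of the growth thresholds (8388608):
    for n in (8_000_000, 13_000_000):
        cap = int64vector_capacity(n)
        print(n, cap, cap * 8 // 2**20, "MiB of int64 storage")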
Finally, a NumPy ndarray is made to point at the vector.
So uh, I think you're seeing the vector size quadrupling at certain thresholds (which should be, if my late-night math stands, the following):
    >>> [2 ** (2 * i - 1) for i in range(4, 20)]
    [128, 512, 2048, 8192, 32768, 131072, 524288, 2097152,
     8388608, 33554432, 134217728, 536870912, 2147483648,
     8589934592, 34359738368, 137438953472, ...]
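Each of those slots holds an 8-byte int64, so if this reading is right, crossing one of those capacities (say, from a 33,554,432-slot vector at 256 MiB to a 134,217,728-slot one at 1 GiB) is where the jumps in the peak-memory graph should show up.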
Hope this sheds some light on things :)