Monday, August 13, 2018

Curious memory consumption of pandas.unique()

While profiling the memory consumption of my algorithm, I was surprised that smaller inputs sometimes needed more memory.

It all boils down to the following usage of pandas.unique():

    import numpy as np
    import pandas as pd
    import sys

    N = int(sys.argv[1])

    a = np.arange(N, dtype=np.int64)
    b = pd.unique(a)

With N=6*10^7 it needs 3.7 GB of peak memory, but with N=8*10^7 "only" 3 GB.
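
(For reference, besides /usr/bin/time as used in the scripts below, the peak can also be read from inside Python on Linux, where ru_maxrss is reported in KB:)

    import resource

    import numpy as np
    import pandas as pd

    N = 6 * 10**7
    b = pd.unique(np.arange(N, dtype=np.int64))
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss, "KB peak")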

Scanning different input sizes yields the following graph:

[Figure: peak memory in KB as a function of input size n]

Out of curiosity, and for self-education: how can the counterintuitive behavior (i.e., more memory for a smaller input size) around N=5*10^7 and N=1.3*10^7 be explained?


Here are the scripts for producing the memory consumption graph on Linux:

pandas_unique_test.py:

    import numpy as np
    import pandas as pd
    import sys

    N = int(sys.argv[1])

    a = np.arange(N, dtype=np.int64)
    b = pd.unique(a)

show_memory.py:

    import sys
    import matplotlib.pyplot as plt

    ns = []
    mems = []
    for line in sys.stdin.readlines():
        n, mem = map(int, line.strip().split(" "))
        ns.append(n)
        mems.append(mem)
    plt.plot(ns, mems, label='peak-memory')
    plt.xlabel('n')
    plt.ylabel('peak memory in KB')
    ymin, ymax = plt.ylim()
    plt.ylim(0, ymax)
    plt.legend()
    plt.show()

run_perf_test.sh:

    WRAPPER="/usr/bin/time -f%M"  # peak memory in KB
    N=1000000
    while [ $N -lt 100000000 ]
    do
        printf "$N "
        $WRAPPER python pandas_unique_test.py $N
        N=`expr $N + 1000000`
    done

And now:

    sh run_perf_test.sh 2>&1 | python show_memory.py

1 Answer

Let's see...

The documentation for pandas.unique says it's a "hash table-based unique".
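
The hash-based approach is visible in the output order: unlike np.unique, pd.unique does not sort, it returns values in order of first appearance:

    import numpy as np
    import pandas as pd

    a = np.array([3, 1, 3, 2], dtype=np.int64)
    print(pd.unique(a))  # [3 1 2] -- order of first appearance (hash-based)
    print(np.unique(a))  # [1 2 3] -- sorted (sort-based)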

It calls an internal helper function to acquire the correct hash table implementation for your data, namely htable.Int64HashTable.
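
For int64 data, that boils down to roughly the following (this uses a pandas-internal module, so the import path may differ between versions; a sketch, not a stable API):

    import numpy as np
    from pandas._libs import hashtable as htable  # internal module

    values = np.arange(10, dtype=np.int64)
    table = htable.Int64HashTable(len(values))  # size_hint = len(values)
    print(table.unique(values))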

The hash table is initialized with size_hint = the length of your value vector. That means kh_resize_DTYPE(table, size_hint) gets called.

Those functions are defined (as macro templates) in khash.h.

It seems to allocate (size_hint >> 5) * 4 + (size_hint) * 8 * 2 bytes of memory for the buckets (maybe more, maybe less, I might be off here).
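
In numbers, that back-of-the-envelope formula works out as follows (khash may round the bucket count up, I believe to the next power of two, so treat this as a rough lower bound):

    # Rough estimate of the bucket allocation from the formula above.
    def khash_bytes_estimate(size_hint):
        flags = (size_hint >> 5) * 4        # packed per-bucket flag bits
        keys_and_vals = size_hint * 8 * 2   # int64 keys + int64 values
        return flags + keys_and_vals

    for n in (6 * 10**7, 8 * 10**7):
        print(n, khash_bytes_estimate(n) / 2**30, "GiB")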

Then, HashTable.unique() is called.

It allocates an empty Int64Vector, which seems to quadruple in size whenever it gets filled, starting from 128.
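
To make that growth rule concrete, here is a tiny helper (my own illustration, not pandas code) that computes the vector's capacity for a given number of unique values:

    # Capacity of a vector that starts at 128 slots and quadruples
    # whenever it fills up (a model of the Int64Vector growth above).
    def vector_capacity(n_uniques, cap=128):
        while cap < n_uniques:
            cap *= 4
        return cap

    print(vector_capacity(10**6))      # 2097152
    print(vector_capacity(6 * 10**7))  # 134217728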

It then iterates over your values, figuring out whether they're in the hash table; if not, they get added to both the hash table and the vector. (This is where the vector may grow; the hash table shouldn't need to grow due to the size hint.)
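
In pure Python, that loop is morally equivalent to the following model (the real implementation is Cython on top of khash, so this is only a sketch):

    import numpy as np

    def unique_model(values):
        seen = set()  # stands in for the khash table
        out = []      # stands in for the growing Int64Vector
        for v in values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return np.asarray(out)

    print(unique_model(np.array([3, 1, 3, 2])))  # [3 1 2]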

Finally, a NumPy ndarray is made to point at the vector.

So uh, I think you're seeing the vector size quadrupling at certain thresholds, which should be (if my late-night math stands):

    >>> [2 ** (2 * i - 1) for i in range(4, 20)]
    [
        128,
        512,
        2048,
        8192,
        32768,
        131072,
        524288,
        2097152,
        8388608,
        33554432,
        134217728,
        536870912,
        2147483648,
        8589934592,
        34359738368,
        137438953472,
        ...,
    ]

Hope this sheds some light on things :)
