I am using HDFStore in pandas to persist data frames produced by an ongoing iterative process. At each iteration, I append to a table in the store. Here is a toy example:
import pandas as pd
from pandas import HDFStore
import numpy as np
from string import ascii_letters

# Pool of characters for random strings: letters plus the digits 0-9.
alphanum = np.array(list(ascii_letters) + [str(d) for d in range(10)])

def hdfstore_append(storefile, key, df, format="t", data_columns=None):
    """Append df to the table at key, rebuilding the table if append fails."""
    if df is None:
        return
    if key[0] != '/':
        key = '/' + key
    with HDFStore(storefile) as store:
        if key not in store.keys():
            store.put(key, df, format=format, data_columns=data_columns)
        else:
            try:
                store.append(key, df)
            except Exception:
                # Append failed, so read the whole table back, concatenate
                # the new rows, and rewrite the table from scratch.
                df = pd.concat([store.get(key), df])
                store.put(key, df, format=format, data_columns=data_columns)

storefile = "db.h5"
for i in range(100):
    # One row per iteration: a random float and a random string whose
    # maximum length grows with i, so appends eventually fail and force
    # a rebuild.
    s = ''.join(alphanum[np.random.randint(0, len(alphanum),
                                           np.random.randint(1, 2 * (i + 1)))])
    df = pd.DataFrame([dict(n=np.random.randn(), s=s)], index=[i])
    hdfstore_append(storefile, '/SO/df', df, data_columns=True)

The hdfstore_append function guards against the exceptions HDFStore.append can raise, and rebuilds the table when an append fails. The problem is that each rebuild reads back and rewrites the entire table, so its cost grows with the table's current size, and the loop slows down badly once the table in the store becomes very large.
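For context, the exception that triggers the rebuild path is typically a ValueError: a table-format store fixes each string column's width when the table is first written, and a subsequent append with a longer string no longer fits. A minimal reproduction (the file name repro.h5 is just for illustration):

import pandas as pd

with pd.HDFStore("repro.h5", mode="w") as store:
    # The width of column 's' is fixed at 2 characters by this first write.
    store.put("t", pd.DataFrame({"s": ["ab"]}), format="t")
    # Appending a wider string raises ValueError, the failure that the
    # except branch above handles by rewriting the whole table.
    store.append("t", pd.DataFrame({"s": ["abcdef"]}))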
Is there a more efficient way to do this?
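For reference, one pattern that may avoid the rebuilds entirely is to reserve the string column's width up front with min_itemsize, so that later, longer strings still fit and append never has to fall back to a full rewrite. This is only a sketch under an assumption: the db_fast.h5 file name and the 256-byte bound are illustrative, and 256 must genuinely be an upper bound on the strings the process generates.

import pandas as pd
import numpy as np

# Sketch: keep the store open for the whole run and pre-reserve the
# string column's width so append never fails.
with pd.HDFStore("db_fast.h5", mode="w") as store:
    for i in range(100):
        df = pd.DataFrame([dict(n=np.random.randn(), s="x" * (i + 1))],
                          index=[i])
        # min_itemsize takes effect on the first append, which creates
        # the table; later appends reuse the reserved width.
        store.append('/SO/df', df, format="t", data_columns=True,
                     min_itemsize={'s': 256})

Keeping a single open HDFStore for the run also avoids reopening the file on every iteration, which is another per-iteration cost in the original loop.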