Tuesday, July 4, 2017

How to concatenate multiple pandas.DataFrames without running into MemoryError


I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3]) 

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem. I have already read my files into DataFrames; I just can't concatenate that data.

8 Answers

Answer 1

I advise you to write your DataFrames into a single CSV file by appending, then read that CSV file back.

Execute this:

# write df1 content to file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)

# free memory
del df1, df2, df3

# read the combined df1, df2, df3 contents back
df = pd.read_csv('file.csv')

If this solution isn't performant enough, or you need to concatenate files larger than usual, write each DataFrame to its own CSV file instead:

df1.to_csv('file.csv', index=False)
df2.to_csv('file1.csv', index=False, header=False)
df3.to_csv('file2.csv', index=False, header=False)

del df1, df2, df3

Then run these bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the CSV files in Python:

def concat(target, source):
    # read the source file and append its contents to the target file
    with open(source, 'r') as source_file:
        data = source_file.read()
    with open(target, 'a') as target_file:
        target_file.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Then read the result back:

df = pd.read_csv('file.csv') 

Answer 2

Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

df1.to_csv(filename)
df2.to_csv(filename, mode='a', header=False)
df3.to_csv(filename, mode='a', header=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)

Answer 3

Kinda taking a guess here, but maybe:

df1 = pd.concat([df1, df2])
del df2
df1 = pd.concat([df1, df3])
del df3

Obviously, you could do that in a loop, but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old DataFrames, so you are using about twice as much memory as you need to.
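For example, a minimal sketch of the same idea as a loop (assuming pandas is imported as pd and the list holds the only references to the frames):

frames = [df1, df2, df3]
del df1, df2, df3           # the list now holds the only references

concat_df = frames.pop(0)
while frames:
    # concatenate one frame at a time; popping it from the list releases it for garbage collection
    concat_df = pd.concat([concat_df, frames.pop(0)])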

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo' + str(i) + '.csv')
    concat_df = pd.concat([concat_df, temp_df])

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.

Answer 4

Dask might be a good option to try for handling large DataFrames - go through the Dask docs.
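A minimal sketch of what that could look like, assuming the frames have already been written to CSV files named foo0.csv, foo1.csv, foo2.csv (placeholder names):

import dask.dataframe as dd

# lazily read all matching files as one partitioned Dask DataFrame
ddf = dd.read_csv('foo*.csv')

# keep working with ddf lazily, or materialize it as a single pandas DataFrame
# if the combined result fits in memory
concat_df = ddf.compute()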

Answer 5

You can store your individual DataFrames in an HDF store, and then query the store just like one big DataFrame.

import os

# name of the store
fname = 'my_store'

with pd.HDFStore(fname) as store:

    # save the individual dfs to the store
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO', 'BAR', 'ETC'])  # data_columns = the columns in the dfs you want to query on

    # access the store as a single df
    df = store.select('df', where=['A>2'])  # change the where condition as required (see the documentation for examples)
    # do other stuff with df

# remove the store file when you're done
os.remove(fname)

Answer 6

Another option:

1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

2) Open the .csv file, then append df2:

with open('Big File.csv', 'a') as f:
    df2.to_csv(f, header=False)

3) Repeat Step 2 with df3:

with open('Big File.csv', 'a') as f:
    df3.to_csv(f, header=False)

Answer 7

The problem, as shown in the other answers, is one of memory. And the solution is to store the data on disk, then build a single DataFrame.

With such huge data, performance is an issue.

CSV solutions are very slow, since conversion to and from text is involved. HDF5 solutions are better. I propose a third way in binary mode, with pickle, which seems to be better still.

Here is the code:

import numpy as np
import pandas as pd

# build some test data: keep appending frames until memory runs out
if 'dfs' not in locals():
    dfs = []
try:
    while len(dfs) < 10:
        dfs.append(pd.DataFrame(np.empty(10**5).reshape(-1, 10)))
except MemoryError:
    del dfs[-1]  # some room to survive
    print(len(dfs))

def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd)
        md, hd = 'a', False
    # del dfs
    df_all = pd.read_csv('df_all.csv', index_col=0)
    return df_all

Better solutions:

import pickle

def byHDF(dfs):
    store = pd.HDFStore('store.h5')
    for df in dfs:
        store.append('df', df, data_columns=list('0123456789'))
    # del dfs
    df = store.select('df')
    store.close()
    return df

def bypickle(dfs):
    c = []
    with open('save.pkl', 'ab') as f:
        for df in dfs:
            pickle.dump(df, f)
            c.append(len(df))
    # del dfs
    with open('save.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        df_all = df_all.append(pd.DataFrame(np.empty(sum(c[1:]) * 10).reshape(-1, 10)))
        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df
            offset += size
    return df_all

And some tests on (small) data to compare performance. You have to multiply by about 1000 for 4 GB.

In [63]: %time v=bycsv(dfs)
Wall time: 2.31 s

In [64]: %time w=byHDF(dfs)
Wall time: 219 ms

In [65]: %time x=bypickle(dfs)
Wall time: 46.7 ms

Of course all of that must be tuned to fit your problem, but I have no time to do it before the gong ;)

I can edit this if you give more information on your data structure and size.

Answer 8

I've had similar performance issues while trying to concatenate a large number of DataFrames to a 'growing' DataFrame.

My workaround was to append all the sub DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub DataFrames was complete. This brought the runtime down to almost half.
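A minimal sketch of that pattern (paths and process_file are placeholder names for whatever produces your sub DataFrames):

import pandas as pd

frames = []
for path in paths:                      # paths: your own list of inputs
    frames.append(process_file(path))   # process_file: your own per-file logic returning a DataFrame

# a single concat at the end instead of growing a DataFrame inside the loop
concat_df = pd.concat(frames, ignore_index=True)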
