Friday, April 15, 2016

Setting DataFrame values with enlargement

Leave a Comment

I have two DataFrames (with DatetimeIndex) and want to update the first frame (the older one) with data from the second frame (the newer one).

The new frame may contain more recent data for rows already contained in the the old frame. In this case, data in the old frame should be overwritten with data from the new frame. Also the newer frame may have more columns / rows, than the first one. In this case the old frame should be enlarged by the data in the new frame.

Pandas docs state, that

"The .loc/.ix/[] operations can perform enlargement when setting a non-existant key for that axis"

and

"a DataFrame can be enlarged on either axis via .loc"

However this doesn't seem to work and throws a KeyError. Example:

In [195]: df1 Out[195]:                       A  B  C 2015-07-09 12:00:00  1  1  1 2015-07-09 13:00:00  1  1  1 2015-07-09 14:00:00  1  1  1 2015-07-09 15:00:00  1  1  1  In [196]: df2 Out[196]:                       A  B  C  D 2015-07-09 14:00:00  2  2  2  2 2015-07-09 15:00:00  2  2  2  2 2015-07-09 16:00:00  2  2  2  2 2015-07-09 17:00:00  2  2  2  2  In [197]: df1.loc[df2.index] = df2 --------------------------------------------------------------------------- KeyError                                  Traceback (most recent call last) <ipython-input-197-74e630e87cf8> in <module>() ----> 1 df1.loc[df2.index] = df2  /.../pandas/core/indexing.pyc in __setitem__(self, key, value)     112      113     def __setitem__(self, key, value): --> 114         indexer = self._get_setitem_indexer(key)     115         self._setitem_with_indexer(indexer, value)     116   /.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)     107      108         try: --> 109             return self._convert_to_indexer(key, is_setter=True)     110         except TypeError:     111             raise IndexingError(key)  /.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)    1110                 mask = check == -1    1111                 if mask.any(): -> 1112                     raise KeyError('%s not in index' % objarr[mask])    1113     1114                 return _values_from_object(indexer)  KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index" 

What is the best way (with respect to performance, as my real data is much larger) two achieve the desired updated and enlarged DataFrame. This is the result I would like to see:

                     A  B  C    D 2015-07-09 12:00:00  1  1  1  NaN 2015-07-09 13:00:00  1  1  1  NaN 2015-07-09 14:00:00  2  2  2    2 2015-07-09 15:00:00  2  2  2    2 2015-07-09 16:00:00  2  2  2    2 2015-07-09 17:00:00  2  2  2    2 

3 Answers

Answers 1

df2.combine_first(df1) (documentation) seems to serve your requirement; PFB code snippet & output

import pandas as pd  print 'pandas-version: ', pd.__version__  df1 = pd.DataFrame.from_records([('2015-07-09 12:00:00',1,1,1),                                  ('2015-07-09 13:00:00',1,1,1),                                  ('2015-07-09 14:00:00',1,1,1),                                  ('2015-07-09 15:00:00',1,1,1)],                                 columns=['Dt', 'A', 'B', 'C']).set_index('Dt') # print df1  df2 = pd.DataFrame.from_records([('2015-07-09 14:00:00',2,2,2,2),                                  ('2015-07-09 15:00:00',2,2,2,2),                                  ('2015-07-09 16:00:00',2,2,2,2),                                  ('2015-07-09 17:00:00',2,2,2,2),],                                columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt') res_combine1st = df2.combine_first(df1) print res_combine1st 

output

pandas-version:  0.15.2                      A  B  C   D Dt                               2015-07-09 12:00:00  1  1  1 NaN 2015-07-09 13:00:00  1  1  1 NaN 2015-07-09 14:00:00  2  2  2   2 2015-07-09 15:00:00  2  2  2   2 2015-07-09 16:00:00  2  2  2   2 2015-07-09 17:00:00  2  2  2   2 

Answers 2

You can use the combine function.

import pandas as pd  # your data # =========================================================== df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))  df2 = pd.DataFrame(np.ones(16).reshape(4,4)*2, columns='A B C D'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))  # processing # ===================================================== # reindex to populate NaN result = df2.reindex(np.union1d(df1.index, df2.index))  Out[248]:                        A   B   C   D 2015-07-09 12:00:00 NaN NaN NaN NaN 2015-07-09 13:00:00 NaN NaN NaN NaN 2015-07-09 14:00:00   2   2   2   2 2015-07-09 15:00:00   2   2   2   2 2015-07-09 16:00:00   2   2   2   2 2015-07-09 17:00:00   2   2   2   2  combiner = lambda x, y: np.where(x.isnull(), y, x)  # use df1 to update result result.combine(df1, combiner)  Out[249]:                       A  B  C   D 2015-07-09 12:00:00  1  1  1 NaN 2015-07-09 13:00:00  1  1  1 NaN 2015-07-09 14:00:00  2  2  2   2 2015-07-09 15:00:00  2  2  2   2 2015-07-09 16:00:00  2  2  2   2 2015-07-09 17:00:00  2  2  2   2  # maybe fillna(method='ffill') if you like 

Answers 3

In addition to previous answer, after reindexing you can use

result.fillna(df1, inplace=True) 

so based on Jianxun Li's code (extended with one more column) you can try this

# your data # =========================================================== df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H')) df2 = pd.DataFrame(np.ones(20).reshape(4,5)*2, columns='A B C D E'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))  # processing # ===================================================== # reindex to populate NaN result = df2.reindex(np.union1d(df1.index, df2.index)) # fill NaN from df1 result.fillna(df1, inplace=True)  Out[3]:                                   A  B  C   D   E 2015-07-09 12:00:00  1  1  1 NaN NaN 2015-07-09 13:00:00  1  1  1 NaN NaN 2015-07-09 14:00:00  2  2  2   2   2 2015-07-09 15:00:00  2  2  2   2   2 2015-07-09 16:00:00  2  2  2   2   2 2015-07-09 17:00:00  2  2  2   2   2 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment