I have two DataFrames
(with DatetimeIndex
) and want to update the first frame (the older one) with data from the second frame (the newer one).
The new frame may contain more recent data for rows already contained in the the old frame. In this case, data in the old frame should be overwritten with data from the new frame. Also the newer frame may have more columns / rows, than the first one. In this case the old frame should be enlarged by the data in the new frame.
Pandas docs state, that
"The .loc/.ix/[]
operations can perform enlargement when setting a non-existant key for that axis"
and
"a DataFrame can be enlarged on either axis via .loc
"
However this doesn't seem to work and throws a KeyError
. Example:
In [195]: df1 Out[195]: A B C 2015-07-09 12:00:00 1 1 1 2015-07-09 13:00:00 1 1 1 2015-07-09 14:00:00 1 1 1 2015-07-09 15:00:00 1 1 1 In [196]: df2 Out[196]: A B C D 2015-07-09 14:00:00 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2015-07-09 17:00:00 2 2 2 2 In [197]: df1.loc[df2.index] = df2 --------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-197-74e630e87cf8> in <module>() ----> 1 df1.loc[df2.index] = df2 /.../pandas/core/indexing.pyc in __setitem__(self, key, value) 112 113 def __setitem__(self, key, value): --> 114 indexer = self._get_setitem_indexer(key) 115 self._setitem_with_indexer(indexer, value) 116 /.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key) 107 108 try: --> 109 return self._convert_to_indexer(key, is_setter=True) 110 except TypeError: 111 raise IndexingError(key) /.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter) 1110 mask = check == -1 1111 if mask.any(): -> 1112 raise KeyError('%s not in index' % objarr[mask]) 1113 1114 return _values_from_object(indexer) KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"
What is the best way (with respect to performance, as my real data is much larger) two achieve the desired updated and enlarged DataFrame. This is the result I would like to see:
A B C D 2015-07-09 12:00:00 1 1 1 NaN 2015-07-09 13:00:00 1 1 1 NaN 2015-07-09 14:00:00 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2015-07-09 17:00:00 2 2 2 2
3 Answers
Answers 1
df2.combine_first(df1)
(documentation) seems to serve your requirement; PFB code snippet & output
import pandas as pd print 'pandas-version: ', pd.__version__ df1 = pd.DataFrame.from_records([('2015-07-09 12:00:00',1,1,1), ('2015-07-09 13:00:00',1,1,1), ('2015-07-09 14:00:00',1,1,1), ('2015-07-09 15:00:00',1,1,1)], columns=['Dt', 'A', 'B', 'C']).set_index('Dt') # print df1 df2 = pd.DataFrame.from_records([('2015-07-09 14:00:00',2,2,2,2), ('2015-07-09 15:00:00',2,2,2,2), ('2015-07-09 16:00:00',2,2,2,2), ('2015-07-09 17:00:00',2,2,2,2),], columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt') res_combine1st = df2.combine_first(df1) print res_combine1st
output
pandas-version: 0.15.2 A B C D Dt 2015-07-09 12:00:00 1 1 1 NaN 2015-07-09 13:00:00 1 1 1 NaN 2015-07-09 14:00:00 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2015-07-09 17:00:00 2 2 2 2
Answers 2
You can use the combine
function.
import pandas as pd # your data # =========================================================== df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H')) df2 = pd.DataFrame(np.ones(16).reshape(4,4)*2, columns='A B C D'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H')) # processing # ===================================================== # reindex to populate NaN result = df2.reindex(np.union1d(df1.index, df2.index)) Out[248]: A B C D 2015-07-09 12:00:00 NaN NaN NaN NaN 2015-07-09 13:00:00 NaN NaN NaN NaN 2015-07-09 14:00:00 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2015-07-09 17:00:00 2 2 2 2 combiner = lambda x, y: np.where(x.isnull(), y, x) # use df1 to update result result.combine(df1, combiner) Out[249]: A B C D 2015-07-09 12:00:00 1 1 1 NaN 2015-07-09 13:00:00 1 1 1 NaN 2015-07-09 14:00:00 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2015-07-09 17:00:00 2 2 2 2 # maybe fillna(method='ffill') if you like
Answers 3
In addition to previous answer, after reindexing you can use
result.fillna(df1, inplace=True)
so based on Jianxun Li's code (extended with one more column) you can try this
# your data # =========================================================== df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H')) df2 = pd.DataFrame(np.ones(20).reshape(4,5)*2, columns='A B C D E'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H')) # processing # ===================================================== # reindex to populate NaN result = df2.reindex(np.union1d(df1.index, df2.index)) # fill NaN from df1 result.fillna(df1, inplace=True) Out[3]: A B C D E 2015-07-09 12:00:00 1 1 1 NaN NaN 2015-07-09 13:00:00 1 1 1 NaN NaN 2015-07-09 14:00:00 2 2 2 2 2 2015-07-09 15:00:00 2 2 2 2 2 2015-07-09 16:00:00 2 2 2 2 2 2015-07-09 17:00:00 2 2 2 2 2
0 comments:
Post a Comment