Having this DataFrame:
import pandas dates = pandas.date_range('2016-01-01', periods=5, freq='H') s = pandas.Series([0, 1, 2, 3, 4], index=dates) df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar']) df.set_index(['a', 'b'], inplace=True) df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()
).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well:
However, I get a warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo'] df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo'] df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
- The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
- The real DataFrame has, of course, many more rows in it (thousands).
- There are not necessarily only two columns in the DataFrame and there may be more than 1 columns containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
- The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.
3 Answers
Answers 1
This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas is complaining about chained indexing but when you don't do it that way it's facing problems assigning whole series to a cell. With iat
you can force something like that. I don't think it would be a preferable thing to do, but seems like a working solution.
Answers 2
Simply set df.is_copy = False
before asignment of new value.
Answers 3
Hierarchical data in pandas
It really seems like you should consider restructure your data to take advantage of pandas features such as MultiIndexing
and DateTimeIndex
. This will allow you to still operate on a index in the typical way while being able to select on multiple columns across the hierarchical data (a
,b
, andbar
).
Restructured Data
import pandas as pd # Define Index dates = pd.date_range('2016-01-01', periods=5, freq='H') # Define Series s = pd.Series([0, 1, 2, 3, 4], index=dates) # Place Series in Hierarchical DataFrame heirIndex = pd.MultiIndex.from_arrays([1,2,8], names=['a','b', 'bar']) df = pd.DataFrame(s, columns=heirIndex) print df
a 1 b 2 bar 8 2016-01-01 00:00:00 0 2016-01-01 01:00:00 1 2016-01-01 02:00:00 2 2016-01-01 03:00:00 3 2016-01-01 04:00:00 4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling df_resampled = df.resample('D').sum().dropna() print df_resampled
a 1 b 2 bar 8 2016-01-01 10
Update (from data description)
If the data has variable length Series
each with a different index
and non-numeric categories that is ok. Let's make an example:
# Define Series dates = pandas.date_range('2016-01-01', periods=5, freq='H') s = pandas.Series([0, 1, 2, 3, 4], index=dates) # Define Series dates2 = pandas.date_range('2016-01-14', periods=6, freq='H') s2 = pandas.Series([-200, 10, 24, 30, 40,100], index=dates2) # Define DataFrames df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([1,2,8,'cat1'], names=['a','b', 'bar','c'])) df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([2,5,5,'cat3'], names=['a','b', 'bar','c'])) df = pd.concat([df1, df2]) print df
a 1 2 b 2 5 bar 8 5 c cat1 cat3 2016-01-01 00:00:00 0.0 NaN 2016-01-01 01:00:00 1.0 NaN 2016-01-01 02:00:00 2.0 NaN 2016-01-01 03:00:00 3.0 NaN 2016-01-01 04:00:00 4.0 NaN 2016-01-14 00:00:00 NaN -200.0 2016-01-14 01:00:00 NaN 10.0 2016-01-14 02:00:00 NaN 24.0 2016-01-14 03:00:00 NaN 30.0 2016-01-14 04:00:00 NaN 40.0 2016-01-14 05:00:00 NaN 100.0
The only issues is that after resampling. You will want to use how='all'
while dropping na
rows like this:
# Simple Direct Resampling df_resampled = df.resample('D').sum().dropna(how='all') print df_resampled
a 1 2 b 2 5 bar 8 5 c cat1 cat3 2016-01-01 10.0 NaN 2016-01-14 NaN 4.0
0 comments:
Post a Comment