Friday, June 10, 2016

Setting values with pandas.DataFrame

Leave a Comment

Having this DataFrame:

import pandas  dates = pandas.date_range('2016-01-01', periods=5, freq='H') s = pandas.Series([0, 1, 2, 3, 4], index=dates) df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar']) df.set_index(['a', 'b'], inplace=True)  df 

enter image description here

I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).

When I try:

df['foo'][0] = df['foo'][0].resample('D').sum().dropna() 

That seems to work well:

enter image description here

However, I get a warning:

SettingWithCopyWarning:  A value is trying to be set on a copy of a slice from a DataFrame  See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy 

The question is, how should I do this instead?

Notes

Things I have tried but do not work (resampling or not, the assignment raises an exception):

df.iloc[0].loc['foo'] = df.iloc[0].loc['foo'] df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo'] df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo'] 

A bit more information about the data (in case it is relevant):

  • The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
  • The real DataFrame has, of course, many more rows in it (thousands).
  • There are not necessarily only two columns in the DataFrame and there may be more than 1 columns containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
  • The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).

Using Python 3.5.1 and Pandas 0.18.1.

3 Answers

Answers 1

This should work:

df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna() 

Pandas is complaining about chained indexing but when you don't do it that way it's facing problems assigning whole series to a cell. With iat you can force something like that. I don't think it would be a preferable thing to do, but seems like a working solution.

Answers 2

Simply set df.is_copy = False before asignment of new value.

Answers 3

Hierarchical data in pandas

It really seems like you should consider restructure your data to take advantage of pandas features such as MultiIndexing and DateTimeIndex. This will allow you to still operate on a index in the typical way while being able to select on multiple columns across the hierarchical data (a,b, andbar).

Restructured Data

import pandas as pd  # Define Index dates = pd.date_range('2016-01-01', periods=5, freq='H') # Define Series s = pd.Series([0, 1, 2, 3, 4], index=dates)  # Place Series in Hierarchical DataFrame heirIndex = pd.MultiIndex.from_arrays([1,2,8], names=['a','b', 'bar']) df = pd.DataFrame(s, columns=heirIndex)  print df 

a                    1 b                    2 bar                  8 2016-01-01 00:00:00  0 2016-01-01 01:00:00  1 2016-01-01 02:00:00  2 2016-01-01 03:00:00  3 2016-01-01 04:00:00  4 

Resampling

With the data in this format, resampling becomes very simple.

# Simple Direct Resampling df_resampled = df.resample('D').sum().dropna()  print df_resampled 

a            1 b            2 bar          8 2016-01-01  10 

Update (from data description)

If the data has variable length Series each with a different index and non-numeric categories that is ok. Let's make an example:

# Define Series dates = pandas.date_range('2016-01-01', periods=5, freq='H') s = pandas.Series([0, 1, 2, 3, 4], index=dates)  # Define Series dates2 = pandas.date_range('2016-01-14', periods=6, freq='H') s2 = pandas.Series([-200, 10, 24, 30, 40,100], index=dates2) # Define DataFrames df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([1,2,8,'cat1'], names=['a','b', 'bar','c'])) df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([2,5,5,'cat3'], names=['a','b', 'bar','c']))  df = pd.concat([df1, df2]) print df 

a                      1      2 b                      2      5 bar                    8      5 c                   cat1   cat3 2016-01-01 00:00:00  0.0    NaN 2016-01-01 01:00:00  1.0    NaN 2016-01-01 02:00:00  2.0    NaN 2016-01-01 03:00:00  3.0    NaN 2016-01-01 04:00:00  4.0    NaN 2016-01-14 00:00:00  NaN -200.0 2016-01-14 01:00:00  NaN   10.0 2016-01-14 02:00:00  NaN   24.0 2016-01-14 03:00:00  NaN   30.0 2016-01-14 04:00:00  NaN   40.0 2016-01-14 05:00:00  NaN  100.0 

The only issues is that after resampling. You will want to use how='all' while dropping na rows like this:

# Simple Direct Resampling df_resampled = df.resample('D').sum().dropna(how='all')  print df_resampled 

a              1    2 b              2    5 bar            8    5 c           cat1 cat3 2016-01-01  10.0  NaN 2016-01-14   NaN  4.0 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment