I have a pandas dataframe with a business-day-based DatetimeIndex. For each month in the index, I also have a single 'marker' day specified.
Here's a toy version of that dataframe:
    import pandas as pd

    # a dataframe with business dates as the index
    df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()

    # each month has a single, arbitrary marker day specified
    marker_dates = [df.index[12], df.index[33], df.index[57]]

For each month in the index, I need to calculate the average of the foo column over a specific slice of rows in that month.
There are two different ways I need to be able to specify those slices:
1) m'th day to n'th day.
An example might be the 2nd to 4th business day of each month. April would be the average of 1 (Apr 2), 4 (Apr 3), and 5 (Apr 6) = 3.33. May would be the average of 33 (May 4), 34 (May 5), and 35 (May 6) = 34. Weekends and holidays that don't appear in the index don't count as days.
2) m'th day before/after the marker date to the n'th day before/after the marker date.
An example might be "the average of the slice from 1 day before the marker date to 1 day after the marker date in each month". E.g. in April, the marker date is Apr 17. Looking at the index, we want the average of Apr 16, Apr 17, and Apr 20.
For Example 1, I had an ugly solution: for each month, I would slice out that month's rows and then apply df_slice.iloc[m:n].mean().
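Roughly, that loop looks like the sketch below. The names m, n, and monthly_means are only for illustration, and m and n are assumed to be 1-based business-day positions, so the slice becomes iloc[m - 1:n].

```python
import pandas as pd

# Same toy dataframe as above
df = pd.DataFrame(
    list(range(91)),
    pd.date_range('2015-04-01', '2015-06-30'),
    columns=['foo'],
).resample('B').last()

m, n = 2, 4  # assumption: 1-based, i.e. 2nd through 4th business day

monthly_means = {}
# Group rows by (year, month), then slice each group by position
for (year, month), df_slice in df.groupby([df.index.year, df.index.month]):
    key = (int(year), int(month))
    monthly_means[key] = df_slice['foo'].iloc[m - 1:n].mean()

print(pd.Series(monthly_means))
```

It works, but it handles one month at a time in Python rather than letting pandas do the grouping and slicing in one go.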
Whenever I start doing iterative things with pandas, I always suspect I'm doing it wrong. So I imagine there is a cleaner, pythonic/vectorized way to produce this result for all the months.
For Example 2, I don't know a good way to do this slice-averaging based on arbitrary dates across many months.
4 Answers
Answer 1
Use BDay() from pandas.tseries.offsets
    import pandas as pd
    from pandas.tseries.offsets import BDay

    M = 2
    N = 4
    # pd.Timestamp is used here because pd.datetime is no longer
    # available in recent pandas versions
    start_date = pd.Timestamp(2015, 4, 1)
    end_date = pd.Timestamp(2015, 6, 30)

    df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').last()

    # for month starts
    marker_dates = pd.date_range(start=start_date, end=end_date, freq='BMS')

    # create IntervalIndex
    bins = pd.IntervalIndex.from_tuples(
        [(d + (M-1)*BDay(), d + (N-1)*BDay()) for d in marker_dates],
        closed='both')

    df.groupby(pd.cut(df.index, bins)).mean()
    # [2015-04-02, 2015-04-06]     3.333333
    # [2015-05-04, 2015-05-06]    34.000000
    # [2015-06-02, 2015-06-04]    63.000000

    # any markers
    marker_dates = [df.index[12], df.index[33], df.index[57]]

    # M Bday before, and N Bday after
    bins = pd.IntervalIndex.from_tuples(
        [(d - M*BDay(), d + N*BDay()) for d in marker_dates],
        closed='both')

    df.groupby(pd.cut(df.index, bins)).mean()
    # [2015-04-15, 2015-04-23]    18.428571
    # [2015-05-14, 2015-05-22]    48.000000
    # [2015-06-17, 2015-06-25]    81.428571
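In case the offset arithmetic isn't obvious, here is a small sketch of how BDay moves a date across a weekend. The variable names are just for illustration; the date is the April marker from the question.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

friday = pd.Timestamp('2015-04-17')   # the April marker date, a Friday

next_bday = friday + BDay()           # skips the weekend: Monday 2015-04-20
two_back = friday - 2 * BDay()        # Wednesday 2015-04-15
four_ahead = friday + 4 * BDay()      # Thursday 2015-04-23

print(next_bday.date(), two_back.date(), four_ahead.date())
```

Note that BDay only skips weekends. If holidays are missing from your index, the interval endpoints computed this way may not line up with "n rows away" in the dataframe.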
Answer 2
The most pythonic/vectorized (pandonic?) way to do this might be to use df.rolling and df.shift to generate the window over which you'll take the average, then df.reindex to select the value at the dates you've marked.
For your example (2), this could look like:
    df['foo'].rolling(3).mean().shift(-1).reindex(marker_dates)
    Out[8]:
    2015-04-17    17.333333
    2015-05-18    47.000000
    2015-06-19    80.333333
    Name: foo, dtype: float64
This could be wrapped in a small function:
    def window_mean_at_indices(df, indices, begin=-1, end=1):
        # window length is 1+end-begin rows; shifting by -end aligns
        # each window so that it spans [begin, end] around the row
        return df.rolling(1+end-begin).mean().shift(-end).reindex(indices)

The function also makes it clearer how to apply this to situation (1):

    month_starts = pd.date_range(df.index.min(), df.index.max(), freq='BMS')
    month_starts
    Out[11]: DatetimeIndex(['2015-04-01', '2015-05-01', '2015-06-01'], dtype='datetime64[ns]', freq='BMS')

    window_mean_at_indices(df['foo'], month_starts, begin=1, end=3)
    Out[12]:
    2015-04-01     3.333333
    2015-05-01    34.000000
    2015-06-01    63.000000
    Freq: BMS, Name: foo, dtype: float64
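One thing to watch with the rolling/shift approach: a window that runs off either end of the data comes back as NaN, because rolling() by default needs a full window of observations. A quick sketch, reusing the function above (the names inside and edge are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    list(range(91)),
    pd.date_range('2015-04-01', '2015-06-30'),
    columns=['foo'],
).resample('B').last()
marker_dates = [df.index[12], df.index[33], df.index[57]]

def window_mean_at_indices(s, indices, begin=-1, end=1):
    # mean over rows [begin, end] relative to each requested index label
    return s.rolling(1 + end - begin).mean().shift(-end).reindex(indices)

# Markers well inside the data: every window is complete
inside = window_mean_at_indices(df['foo'], marker_dates)

# First and last rows: the 3-row window is incomplete, so both are NaN
edge = window_mean_at_indices(df['foo'], [df.index[0], df.index[-1]])

print(inside)
print(edge)
```

If partial windows should still produce a value, passing min_periods to rolling() is one option, though the average then covers fewer rows than requested.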
Answer 3
For your first problem, you can use pd.Grouper and iloc:

    low = 2
    high = 4
    # 'M' means month-end frequency; pandas 2.2+ spells it 'ME'
    slice_mean = df.groupby(pd.Grouper(level=0, freq='M')).apply(lambda x: x.iloc[low-1:high].mean())
    # or
    df.resample('M').apply(lambda x: x.iloc[low-1:high].mean())

                      foo
    2015-04-30   3.333333
    2015-05-31  34.000000
    2015-06-30  63.000000

For your second problem, you can concat the rows around each marker date and take the groupby mean per month:

    import numpy as np  # pd.np is no longer available in recent pandas

    idx = np.where(df.index.isin(pd.Series(marker_dates)))[0]
    # array([12, 33, 57])

    temp = pd.concat([df.iloc[idx + i] for i in [-1, 0, 1]])

                foo
    2015-04-16   15
    2015-05-15   46
    2015-06-18   78
    2015-04-17   18
    2015-05-18   47
    2015-06-19   81
    2015-04-20   19
    2015-05-19   48
    2015-06-22   82

    # Groupby mean
    temp.groupby(pd.Grouper(level=0, freq='M')).mean()
    # or
    temp.resample('M').mean()

                      foo
    2015-04-30  17.333333
    2015-05-31  47.000000
    2015-06-30  80.333333
    dtype: float64
Since the question doesn't specify what the index of the output should be, do let us know what you need it to be.
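If you'd rather have the result indexed by marker date than by month end, one possible variation is to look up the marker positions with Index.get_indexer and average the neighbouring rows with NumPy. This is only a sketch; offsets, window_positions, and window_means are illustrative names, and it assumes no marker sits within one row of either end of the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    list(range(91)),
    pd.date_range('2015-04-01', '2015-06-30'),
    columns=['foo'],
).resample('B').last()
marker_dates = [df.index[12], df.index[33], df.index[57]]

# Integer position of each marker date (-1 would mean "not in the index")
idx = df.index.get_indexer(marker_dates)

# One row of positions per marker: previous row, marker row, next row
offsets = np.arange(-1, 2)
window_positions = idx[:, None] + offsets

# Average each row of the (markers x window) block of values
window_means = pd.Series(
    df['foo'].to_numpy()[window_positions].mean(axis=1),
    index=marker_dates,
)
print(window_means)
```

Changing offsets to, say, np.arange(-2, 5) would give the window from 2 rows before to 4 rows after each marker.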
Answer 4
Here's what I managed to come up with:
Import pandas and set up the dataframe:

    import pandas as pd

    # .mean() is spelled out because the averages shown below correspond to it;
    # note the question itself uses .last()
    df = pd.DataFrame(list(range(91)), pd.date_range('2015-04-01', '2015-6-30'), columns=['foo']).resample('B').mean()

Start with a pure list of marker dates, since I'm guessing that's what you're really starting with:

    marker_dates = [
        pd.to_datetime('2015-04-17', format='%Y-%m-%d'),
        pd.to_datetime('2015-05-18', format='%Y-%m-%d'),
        pd.to_datetime('2015-06-19', format='%Y-%m-%d')
    ]
    marker_df = pd.DataFrame([], columns=['marker', 'start', 'end', 'avg'])
    marker_df['marker'] = marker_dates
For the case where you just want to test fixed ranges, input the start and end manually here instead of calculating them. If you want to change the range, you can change the arguments to shift():

    marker_df['start'] = df.index.shift(-1)[df.index.isin(marker_df['marker'])]
    marker_df['end'] = df.index.shift(1)[df.index.isin(marker_df['marker'])]
Finally, use DataFrame.apply() to do a row-by-row calculation of the averages and store them in the avg column:

    marker_df['avg'] = marker_df.apply(
        lambda x: df[(x['start'] <= df.index) & (df.index <= x['end'])]['foo'].mean(),
        axis=1
    )
Which gives us this result:
          marker      start        end        avg
    0 2015-04-17 2015-04-16 2015-04-20  17.000000
    1 2015-05-18 2015-05-15 2015-05-19  46.666667
    2 2015-06-19 2015-06-18 2015-06-22  80.000000