I have a pandas dataframe with the following structure:
ID date e_1 1 2016-02-01 False 2016-02-02 False 2016-02-03 True 2016-02-04 False 2016-02-05 False 2016-02-06 False 2016-02-07 False 2016-02-08 False 2016-02-09 False 2016-02-10 False 2 2016-02-01 False 2016-02-02 True 2016-02-03 True 2016-02-04 False ... ...
I want to add several columns that encode the following: Does the e_1
is True
in the following 1d
, 2d
, 3d
, 4d
, 5d
, 1 month
... etc?
I will like to specify the delta in time in a list. The name of the columns will be the e1_XX
where XX
is the delta (i.e. 1d
, etc)
I tried with shift
but that just moves the values. Also tried with rolling
(it seems more suitable for this task):
df.groupby('ID').rolling(3).agg(???)
But I don' t know how to pass the condition (I though in something like np.any
), but I am stuck
2 Answers
Answers 1
You can using groupby
with rolling
apply
df.groupby('ID').e_1.apply(lambda x : x.iloc[::-1].rolling(window=3,min_periods=1).apply(any).iloc[::-1].astype(bool)) Out[51]: ID date 1 2016-02-01 True 2016-02-02 True 2016-02-03 True 2016-02-04 False 2016-02-05 False 2016-02-06 False 2016-02-07 False 2016-02-08 False 2016-02-09 False 2016-02-10 False 2 2016-02-01 True 2016-02-02 True 2016-02-03 True 2016-02-04 False Name: e_1, dtype: bool
EDIT : groupby
the index ID , then we have series of e_1 for each ID , and check the link for rolling , it can accept offset which means when your index is datetime ,it can using offset
(3d means 3 days) to determine the window size
df.groupby('ID').e_1.apply(lambda x : x.reset_index(level=0,drop=True).rolling('3d').apply(any))
Answers 2
check the following code and see if it works:
# make sure date is in valid Pandas datetime format mydf['date'] = pd.to_datetime(mydf['date'], format='%Y-%m-%d') # use date as index to make it easier in date manipulations mydf.set_index('date', inplace=True) def flag_visits(grps, d, d_name): """Loop through each group and extend the index to 'd' more days from df_grp.index.max(). fill the NaN values with *False* this is needed to retrieve the forward rolling stats when running shift(1-d) """ for id, df_grp in grps: # create the new index to cover all days required in calculation idx = pd.date_range( start = df_grp.index.min() , end = df_grp.index.max() + pd.DateOffset(days=d) , freq = 'D' ) # set up the new column 'd_name' for the current group mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.reindex(idx, fill_value=False) .e_1.rolling(str(d)+'d', min_periods=0) .sum().gt(0) .shift(1-d) ) # if you know the dates are continue without gap, then you might also reverse the dates, do the regular backward rolling(), and then flip it back. However, you can not do the rolling() by the number of day, only by the number of records. def flag_visits_1(grps, d, d_name): for id, df_grp in grps: mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.sort_index(ascending=False) .e_1.rolling(d, min_periods=0) .sum().gt(0).sort_index() ) # d is the actual number of days used in Series.rolling(), d_name used in the column name""" for d, d_name in [ (2, '1d') , (3, '2d'), (7, '6d'), (30, '1m') ]: mydf.groupby('ID').pipe(flag_visits, d, d_name) # reset to the previous mydf.reset_index(inplace=True) print(mydf)
Note:
- If the
next 1 day
does not include today, thus when d_name == 'id', d == 1, then you can adjust theshift(1-d)
toshift(-d)
- date field must be unique for each ID, or you will not be able to set_index()
0 comments:
Post a Comment