Tuesday, May 1, 2018

Creating new columns given if an event happen in a window of time (future or past)

Leave a Comment

I have a pandas dataframe with the following structure:

ID    date           e_1     1    2016-02-01     False        2016-02-02     False        2016-02-03     True         2016-02-04     False       2016-02-05     False       2016-02-06     False       2016-02-07     False       2016-02-08     False       2016-02-09     False       2016-02-10     False    2    2016-02-01     False         2016-02-02     True           2016-02-03     True           2016-02-04     False             ...         ... 

I want to add several columns that encode the following: Does the e_1 is True in the following 1d, 2d, 3d, 4d, 5d, 1 month... etc?

I will like to specify the delta in time in a list. The name of the columns will be the e1_XX where XX is the delta (i.e. 1d, etc)

I tried with shift but that just moves the values. Also tried with rolling (it seems more suitable for this task):

df.groupby('ID').rolling(3).agg(???) 

But I don' t know how to pass the condition (I though in something like np.any), but I am stuck

2 Answers

Answers 1

You can using groupby with rolling apply

df.groupby('ID').e_1.apply(lambda x : x.iloc[::-1].rolling(window=3,min_periods=1).apply(any).iloc[::-1].astype(bool)) Out[51]:  ID  date       1   2016-02-01     True     2016-02-02     True     2016-02-03     True     2016-02-04    False     2016-02-05    False     2016-02-06    False     2016-02-07    False     2016-02-08    False     2016-02-09    False     2016-02-10    False 2   2016-02-01     True     2016-02-02     True     2016-02-03     True     2016-02-04    False Name: e_1, dtype: bool 

EDIT : groupby the index ID , then we have series of e_1 for each ID , and check the link for rolling , it can accept offset which means when your index is datetime ,it can using offset(3d means 3 days) to determine the window size

df.groupby('ID').e_1.apply(lambda x : x.reset_index(level=0,drop=True).rolling('3d').apply(any)) 

Answers 2

check the following code and see if it works:

# make sure date is in valid Pandas datetime format mydf['date'] = pd.to_datetime(mydf['date'], format='%Y-%m-%d')  # use date as index to make it easier in date manipulations mydf.set_index('date', inplace=True)  def flag_visits(grps, d, d_name):     """Loop through each group and extend the index to 'd' more days from        df_grp.index.max(). fill the NaN values with *False*        this is needed to retrieve the forward rolling stats when running shift(1-d)     """     for id, df_grp in grps:         # create the new index to cover all days required in calculation         idx = pd.date_range(               start = df_grp.index.min()             , end   = df_grp.index.max() + pd.DateOffset(days=d)             , freq  = 'D'         )          # set up the new column 'd_name' for the current group         mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.reindex(idx, fill_value=False)                                                        .e_1.rolling(str(d)+'d', min_periods=0)                                                        .sum().gt(0)                                                        .shift(1-d)         )  # if you know the dates are continue without gap, then you might also reverse the dates, do the regular backward rolling(), and then flip it back. However, you can not do the rolling() by the number of day, only by the number of records.  def flag_visits_1(grps, d, d_name):     for id, df_grp in grps:         mydf.loc[mydf.ID == id, 'e1_'+d_name] = (df_grp.sort_index(ascending=False)                                                        .e_1.rolling(d, min_periods=0)                                                        .sum().gt(0).sort_index()         )    # d is the actual number of days used in Series.rolling(), d_name used in the column name""" for d, d_name in [ (2, '1d') , (3, '2d'), (7, '6d'), (30, '1m') ]:     mydf.groupby('ID').pipe(flag_visits, d, d_name)  # reset to the previous  mydf.reset_index(inplace=True)  print(mydf) 

Note:

  • If the next 1 day does not include today, thus when d_name == 'id', d == 1, then you can adjust the shift(1-d) to shift(-d)
  • date field must be unique for each ID, or you will not be able to set_index()
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment