For example, how do you do the following R data.table operation in pandas:
PATHS[, .(completed = sum(exists),
          missing = sum(!exists),
          total = .N,
          'size (G)' = sum(sizeMB) / 1024),
      by = .(projectPath, pipelineId)]
I.e. group by projectPath and pipelineId, aggregate some of the columns (possibly with custom functions), and then rename the resulting columns.
Output should be a DataFrame with no hierarchical indexes, for example:
projectPath                        pipelineId  completed  missing  size (G)
/data/pnl/projects/TRACTS/pnlpipe           0       2568        0  45.30824
/data/pnl/projects/TRACTS/pnlpipe           1       1299        0  62.69934
2 Answers
Answer 1
You can use groupby.agg:
df.groupby(['projectPath', 'pipelineId']).agg({
    'exists': {'completed': 'sum',
               'missing': lambda x: (~x).sum(),
               'total': 'size'},
    'sizeMB': {'size (G)': lambda x: x.sum() / 1024}
})
Sample run:
import pandas as pd

df = pd.DataFrame({
    'projectPath': [1, 1, 1, 1, 2, 2, 2, 2],
    'pipelineId': [1, 1, 2, 2, 1, 1, 2, 2],
    'exists': [True, False, True, True, False, False, True, False],
    'sizeMB': [120032, 12234, 223311, 3223, 11223, 33445, 3444, 23321]
})

df1 = df.groupby(['projectPath', 'pipelineId']).agg({
    'exists': {'completed': 'sum',
               'missing': lambda x: (~x).sum(),
               'total': 'size'},
    'sizeMB': {'size (G)': lambda x: x.sum() / 1024}
})

# flatten the (original column, new name) MultiIndex down to the new names
df1.columns = df1.columns.droplevel(0)
df1.reset_index()
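For reference, on a pandas version that still accepts the nested-dict syntax, the result should look roughly like this (values computed by hand from the sample data; exact dtypes and column order may vary by version):

   projectPath  pipelineId  completed  missing  total    size (G)
0            1           1          1        1      2  129.166016
1            1           2          2        0      2  221.224609
2            2           1          0        2      2   43.621094
3            2           2          1        1      2   26.137695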
Update: if you really want to customize the aggregation without using the deprecated nested dictionary syntax, you can always use groupby.apply and return a Series object from each group:
df.groupby(['projectPath', 'pipelineId']).apply(
    lambda g: pd.Series({
        'completed': g.exists.sum(),
        'missing': (~g.exists).sum(),
        'total': g.exists.size,
        'size (G)': g.sizeMB.sum() / 1024
    })
).reset_index()
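On pandas 0.25 and later there is also named aggregation, which expresses the renaming directly in agg and avoids the deprecated nested dict entirely. A sketch of the same computation (the **{...} unpacking is only needed because 'size (G)' is not a valid keyword name):

df.groupby(['projectPath', 'pipelineId']).agg(
    completed=('exists', 'sum'),
    missing=('exists', lambda x: (~x).sum()),
    total=('exists', 'size'),
    **{'size (G)': ('sizeMB', lambda s: s.sum() / 1024)}
).reset_index()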
Answer 2
I believe the new (0.20+), more "idiomatic" way is like this, where the second layer of the nested dictionary is basically replaced by an appended .rename method:
...(completed = sum(exists),
    missing = sum(!exists),
    total = .N,
    'size (G)' = sum(sizeMB) / 1024),
   by = .(projectPath, pipelineId)]...
in R, becomes
df.groupby(['projectPath', 'pipelineId']).agg({
    'exists': 'sum',
    'pipelineId': 'count',
    'sizeMB': lambda s: s.sum() / 1024
}).rename(columns={'exists': 'completed',
                   'pipelineId': 'total',
                   'sizeMB': 'size (G)'})
And then I might just add another line for the inverse of 'exists' -> 'missing':
df['missing'] = df.total - df.completed
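Putting the two steps together, here is a sketch of the full pipeline (the name grouped is mine, not from the answer; note that recent pandas versions may refuse to aggregate a grouping column such as pipelineId, so the count is taken with .size() instead):

grouped = df.groupby(['projectPath', 'pipelineId']).agg({
    'exists': 'sum',
    'sizeMB': lambda s: s.sum() / 1024
}).rename(columns={'exists': 'completed', 'sizeMB': 'size (G)'})

# group sizes align on the same (projectPath, pipelineId) index
grouped['total'] = df.groupby(['projectPath', 'pipelineId']).size()
grouped['missing'] = grouped['total'] - grouped['completed']
grouped = grouped.reset_index()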
e.g.,
df_paths.groupby(['TRACT', 'pipelineId']).agg({
    'mean_len(project)': 'sum',
    'len(seq)': lambda agg_s: np.mean(agg_s.values) / 1e9
}).rename(columns={'len(seq)': 'Gb',
                   'mean_len(project)': 'TRACT_sum'})
where "TRACT" was a category one-level higher to "pipelineId" in the dir tree, such that in this example you can see there's 46 total unique pipelines — 2 "TRACT" layers AB/AC x 6 "pipelineId"/"project"'s x 4 binary combinations 00, 01, 10, 11 (minus 2 projects which GNU parallel made into a third topdir; see below). So in the new agg the stats transformed the mean of project-level into the sums of all respective projects agg'd per-TRACT.
import random
import numpy as np
import pandas as pd

df_paths = pd.read_csv('./data/paths.txt', header=None, names=['projectPath'])
# df_paths['projectPath'] =
df_paths['pipelineId'] = df_paths.projectPath.apply(
    lambda s: ''.join(s.split('/')[1:5])[:-3])
df_paths['TRACT'] = df_paths.pipelineId.apply(lambda s: s[:2])

# random DNA strings of 1k-100k bases (randint needs ints, not 1e3/1e5 floats)
df_paths['rand_DNA'] = [
    ''.join(random.choices(['A', 'C', 'T', 'G'],
                           k=random.randint(int(1e3), int(1e5))))
    for _ in range(df_paths.shape[0])
]
df_paths['len(seq)'] = df_paths.rand_DNA.apply(len)

# per-project mean sequence length, broadcast back onto each row
df_paths['mean_len(project)'] = df_paths.pipelineId.apply(
    lambda pjct: df_paths.groupby('pipelineId')['len(seq)'].mean()[pjct])
df_paths
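As an aside, that last broadcast recomputes the groupby once per row; groupby.transform computes the same per-project mean in a single pass (a sketch, using the same column names as above):

df_paths['mean_len(project)'] = (
    df_paths.groupby('pipelineId')['len(seq)'].transform('mean')
)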