Monday, May 2, 2016

Applying rules in sequential order to implement hierarchy

Leave a Comment

I'm trying to apply a series of rules and change values if the conditions are met. Ideally, I'd like to avoid loops, and have a large data set (12.5 million observations), so efficiency would be nice (but this only needs be done once, so efficiency not critical).

The sample data looks like this. Each id can be present on multiple dates, on each of those dates can have multiple cc

set.seed(1) id <- 1:200 date <- sample(seq(as.Date('2007/01/01'), as.Date('2012/12/31'), by="day"), 1000, replace=T) cc <- sample(1:150, 1000, replace=T) df <- data.frame(id, date, cc) head(df)  id      date  cc 1 2011-04-28  69 2 2007-05-31 107 3 2012-08-02 100 4 2011-07-04  98 5 2010-08-20 147 6 2009-02-28  51 

The hierarchy rules have been set up like this:

year <- rep(2007:2010, each=8) ifcc <- c(5,7,8,9,15,16,17,18) r1 <- c(112,NA,NA,NA,NA,NA) r2 <- c(8,9,10,NA,NA,NA) r3 <- c(9,10,NA,NA,NA,NA) r4 <- c(10,NA,NA,NA,NA,NA) r5 <- c(16,17,18,19,NA,NA) r6 <- c(17,18,19,NA,NA,NA) r7 <- c(18,19,NA,NA,NA,NA) r8 <- c(19,NA,NA,NA,NA,NA)  h <- rbind(r1, r2, r3, r4, r5, r6, r7, r8) h <- cbind(ifcc, h) h <- data.frame(year, h)  head(h) year ifcc  V2 V3 V4 V5 V6 V7 2007    5 112 NA NA NA NA NA 2007    7   8  9 10 NA NA NA 2007    8   9 10 NA NA NA NA 2007    9  10 NA NA NA NA NA 2007   15  16 17 18 19 NA NA 2007   16  17 18 19 NA NA NA 

For every id/date combination, I need to check the hierarchy table for rules (for that year, as they change each year). If the condition category cc in df matches the hierarchy rule ifcc in h, then if any of the values in h$V2-h$V7 are present for that id/date combination in df, the rows that match h$V2 to h$V7 need to be dropped from df.

I'm having a hard time wrapping my head around the lookups and application of the hierarchies. there is SAS code which is not only for one year at a time and a series of manually types out if/then statements. Someone has given it a go (http://healthydatascience.com/cms_hcc.html) but not quite for this scale/multiple years and so many repeated measures...

I'm able to iterate through with loops and create a logical matrix of matches, but takes a while for only 30k lookups, not feasible to scale to 12.5m.

t <- matrix(nrow=nrow(df), ncol=nrow(h)) for (j in 1:nrow(df)) {   for (i in 1:nrow(h)){   t[j,i] <- df[j,"cc"] == h[i,"ifcc"]   } } 

I also can't figure out how to use this matrix as a basis for applying the hierarchy rules.

This question directly relates to discussions here: https://github.com/jackwasey/icd/issues/31 https://github.com/anobel/icdtohcc/issues/1

Update

I was able to come up with a functioning solution using a loop. h$cc <- h$ifcc

# Merge hierarchy rules with patient data df <- merge(df, h, all.x = TRUE)  ########### # create empty list todrop <- list()  # create a list of dataframes that contain the CCs that will be zero'd out for (i in 1:6) {   todrop[[i]] <- df[!is.na(df$ifcc),c(1,2,3+i)] }  # rename all dfs in list to same column names, rbind into one df todrop <- lapply(1:length(todrop), function(x) {   names(todrop[[x]]) <- c("id", "admtdate", "cc")   return(todrop[[x]])   } ) todrop <- do.call(rbind, todrop)  # set flag, all of these CCs will be dropped todrop$todrop <- T  # merge drop flags with pt data df <- merge(df, todrop, all.x=T) df <- df[is.na(pt$todrop), ] 

1 Answers

Answers 1

An Alternative Solution using data.table:

As you asked for in the question, the code is very much efficient.

.I is the inbuilt functionality provided by data.table which represents the row number of the data table.

library(data.table)  ## convert data.frame to data.table setDT(df)  setDT(h)  ## find year from date df[,year := year(date)]  ## merge the two datasets with all values of x present  ## if order of tuples doesn't matter, please eliminate the sort=F argument df2 <- merge(df,h[,.(year,ifcc,.I),],by.x = c('year','cc'),by.y = c('year','ifcc'),all.x = T,sort=F)  ## obtaining df having NA values df <- df2[is.na(I),.(id,date,cc)]  ## converting back to data.frame (do it only if required) setDF(df) 
If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment