I'm trying to apply a series of rules and change values if the conditions are met. Ideally, I'd like to avoid loops, and have a large data set (12.5 million observations), so efficiency would be nice (but this only needs be done once, so efficiency not critical).
The sample data looks like this. Each id
can be present on multiple date
s, on each of those dates can have multiple cc
set.seed(1) id <- 1:200 date <- sample(seq(as.Date('2007/01/01'), as.Date('2012/12/31'), by="day"), 1000, replace=T) cc <- sample(1:150, 1000, replace=T) df <- data.frame(id, date, cc) head(df) id date cc 1 2011-04-28 69 2 2007-05-31 107 3 2012-08-02 100 4 2011-07-04 98 5 2010-08-20 147 6 2009-02-28 51
The hierarchy rules have been set up like this:
year <- rep(2007:2010, each=8) ifcc <- c(5,7,8,9,15,16,17,18) r1 <- c(112,NA,NA,NA,NA,NA) r2 <- c(8,9,10,NA,NA,NA) r3 <- c(9,10,NA,NA,NA,NA) r4 <- c(10,NA,NA,NA,NA,NA) r5 <- c(16,17,18,19,NA,NA) r6 <- c(17,18,19,NA,NA,NA) r7 <- c(18,19,NA,NA,NA,NA) r8 <- c(19,NA,NA,NA,NA,NA) h <- rbind(r1, r2, r3, r4, r5, r6, r7, r8) h <- cbind(ifcc, h) h <- data.frame(year, h) head(h) year ifcc V2 V3 V4 V5 V6 V7 2007 5 112 NA NA NA NA NA 2007 7 8 9 10 NA NA NA 2007 8 9 10 NA NA NA NA 2007 9 10 NA NA NA NA NA 2007 15 16 17 18 19 NA NA 2007 16 17 18 19 NA NA NA
For every id
/date
combination, I need to check the hierarchy table for rules (for that year, as they change each year). If the condition category cc
in df
matches the hierarchy rule ifcc
in h
, then if any of the values in h$V2
-h$V7
are present for that id
/date
combination in df
, the rows that match h$V2
to h$V7
need to be dropped from df
.
I'm having a hard time wrapping my head around the lookups and application of the hierarchies. there is SAS code which is not only for one year at a time and a series of manually types out if/then statements. Someone has given it a go (http://healthydatascience.com/cms_hcc.html) but not quite for this scale/multiple years and so many repeated measures...
I'm able to iterate through with loops and create a logical matrix of matches, but takes a while for only 30k lookups, not feasible to scale to 12.5m.
t <- matrix(nrow=nrow(df), ncol=nrow(h)) for (j in 1:nrow(df)) { for (i in 1:nrow(h)){ t[j,i] <- df[j,"cc"] == h[i,"ifcc"] } }
I also can't figure out how to use this matrix as a basis for applying the hierarchy rules.
This question directly relates to discussions here: https://github.com/jackwasey/icd/issues/31 https://github.com/anobel/icdtohcc/issues/1
Update
I was able to come up with a functioning solution using a loop. h$cc <- h$ifcc
# Merge hierarchy rules with patient data df <- merge(df, h, all.x = TRUE) ########### # create empty list todrop <- list() # create a list of dataframes that contain the CCs that will be zero'd out for (i in 1:6) { todrop[[i]] <- df[!is.na(df$ifcc),c(1,2,3+i)] } # rename all dfs in list to same column names, rbind into one df todrop <- lapply(1:length(todrop), function(x) { names(todrop[[x]]) <- c("id", "admtdate", "cc") return(todrop[[x]]) } ) todrop <- do.call(rbind, todrop) # set flag, all of these CCs will be dropped todrop$todrop <- T # merge drop flags with pt data df <- merge(df, todrop, all.x=T) df <- df[is.na(pt$todrop), ]
1 Answers
Answers 1
An Alternative Solution using data.table
:
As you asked for in the question, the code is very much efficient.
.I is the inbuilt functionality provided by data.table which represents the row number of the data table.
library(data.table) ## convert data.frame to data.table setDT(df) setDT(h) ## find year from date df[,year := year(date)] ## merge the two datasets with all values of x present ## if order of tuples doesn't matter, please eliminate the sort=F argument df2 <- merge(df,h[,.(year,ifcc,.I),],by.x = c('year','cc'),by.y = c('year','ifcc'),all.x = T,sort=F) ## obtaining df having NA values df <- df2[is.na(I),.(id,date,cc)] ## converting back to data.frame (do it only if required) setDF(df)
0 comments:
Post a Comment