I am working with a large data set that contains more than 2^31 observations; the actual number is close to 3.5 billion.
I am using the R package "biglm" to run a regression with approximately 70 predictors. I read in the data one million rows at a time and update the regression results. The data have been saved in the ffdf format using the R packages "ff" and "ffbase" so that they load quickly without using up all my RAM.
Here is the basic outline of the code I am using:
    library(ff)
    library(ffbase)
    library(biglm)

    load.ffdf(dir='home')

    dim(data)  # the ffdf contains about 70 predictors and 3.5 billion rows

    chunk_1      <- data[1:1000000, ]
    rest_of_data <- data[1000001:nrow(data), ]

    # Running biglm for the first chunk
    b <- biglm(y ~ x1 + x2 + ... + x70, chunk_1)

    chunks <- ceiling(nrow(rest_of_data) / 1000000)

    # Updating the biglm results by iterating through the rest of the data in chunks
    for (i in seq(1, chunks)) {
      start   <- 1 + (i - 1) * 1000000
      end     <- min(i * 1000000, nrow(rest_of_data))
      d_chunk <- rest_of_data[start:end, ]
      b       <- update(b, d_chunk)
    }

The results look great and everything runs smoothly until the cumulative number of observations from updating the model with each chunk exceeds 2^31. Then I get an error that reads
    In object$n + NROW(mm) : NAs produced by integer overflow

How do I get around this overflow issue? Thanks in advance for your help!
Answer 1
I believe that I have found the source of the issue in the biglm code.
The number of observations (n) is stored as an integer, which has a max value of 2^31 - 1.
The numeric type is not subject to this limit, and, as far as I can tell, can be used instead of integers to store n.
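The difference is easy to demonstrate at the R console:

    # Integers in R are 32-bit signed, so the largest representable count is 2^31 - 1.
    .Machine$integer.max
    #> [1] 2147483647

    # Adding past that limit gives NA with the warning "NAs produced by integer overflow".
    .Machine$integer.max + 1L
    #> [1] NA

    # A numeric (double) can count far past 2^31 without loss of exactness
    # (up to about 2^53), which is plenty for 3.5 billion observations.
    as.numeric(.Machine$integer.max) + 1
    #> [1] 2147483648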
Here is a commit on GitHub showing how to fix this problem with one additional line of code that converts the integer n to a numeric. As the model is updated, the number of rows in each new batch is added to the old n, so n stays numeric.
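The same conversion can also be applied from user code without patching the package, assuming the biglm object keeps its observation count in the list element n (which the warning "object$n + NROW(mm)" suggests). A minimal sketch:

    library(biglm)

    # Small simulated data, just to show the mechanics.
    set.seed(1)
    df <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

    b <- biglm(y ~ x1 + x2, df)
    b$n <- as.numeric(b$n)  # coerce once; later updates then accumulate n in double precision

    b <- update(b, df)      # n stays numeric, so repeated updates cannot overflow at 2^31 - 1
    print(b$n)              # 2000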
I was able to reproduce the error described in this question and verify that my fix works with this code:
(WARNING: This consumes a large amount of memory; if you have tight memory constraints, consider using a smaller data frame and running more iterations.)
    library(biglm)

    df = as.data.frame(replicate(3, rnorm(10000000)))
    a = biglm(V1 ~ V2 + V3, df)
    for (i in 1:300) {
      a = update(a, df)
    }
    print(summary(a))

In the original biglm library, this code outputs:
    Large data regression model: biglm(ff, df)
    Sample size =  NA
                   Coef (95% CI) SE  p
    (Intercept) -1e-04   NA  NA  NA NA
    V2          -1e-04   NA  NA  NA NA
    V3          -2e-04   NA  NA  NA NA

My patched version outputs:
    Large data regression model: biglm(V1 ~ V2 + V3, df)
    Sample size =  3.01e+09
                   Coef   (95%    CI) SE p
    (Intercept) -3e-04 -3e-04 -3e-04  0 0
    V2          -2e-04 -2e-04 -1e-04  0 0
    V3           3e-04  3e-04  3e-04  0 0

The SE and p values are non-zero, just rounded in the output above.
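To see the unrounded values, you can inspect the coefficient table stored in the summary object instead of relying on the printed output. The $mat element used below is my recollection of where summary.biglm keeps that table, so treat it as an assumption:

    # Assumption: summary.biglm stores its coefficient table in the list element "mat"
    # (columns: Coef, CI bounds, SE, p). Printing it directly avoids the rounding
    # applied by the default print method.
    s <- summary(a)
    print(s$mat, digits = 12)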
I am fairly new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that it can be reviewed by the original author and eventually included in the upstream package.
 