Thursday, July 6, 2017

How to use biglm with more than 2^31 observations


I am working with a large data set that contains more than 2^31 observations; the actual count is close to 3.5 billion.

I am using the R package "biglm" to run a regression with approximately 70 predictors. I read the data in one million rows at a time and update the regression results with each chunk. The data have been saved in ffdf format (via the R packages "ff" and "ffbase") so that they load quickly and do not use up all my RAM.

Here is the basic outline of the code I am using:

library(ff)
library(ffbase)
library(biglm)

load.ffdf(dir = 'home')
dim(data)  # the ffdf contains about 70 predictors and 3.5 billion rows

chunk_1      <- data[1:1000000, ]
rest_of_data <- data[1000001:nrow(data), ]

# Run biglm on the first chunk
b <- biglm(y ~ x1 + x2 + ... + x70, chunk_1)

chunks <- ceiling(nrow(rest_of_data) / 1000000)

# Update the biglm results by iterating through the rest of the data in chunks
for (i in seq(1, chunks)) {
    start   <- 1 + (i - 1) * 1000000
    end     <- min(i * 1000000, nrow(rest_of_data))
    d_chunk <- rest_of_data[start:end, ]
    b       <- update(b, d_chunk)
}

The results look great and everything runs smoothly until the cumulative number of observations accumulated through the chunk-by-chunk updates exceeds 2^31. Then I get an error that reads:

In object$n + NROW(mm) : NAs produced by integer overflow 
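This appears to be standard R integer behaviour rather than anything specific to my data: R integers are 32-bit, and arithmetic past 2^31 - 1 returns NA with exactly this warning.

.Machine$integer.max        # 2147483647, i.e. 2^31 - 1
.Machine$integer.max + 1L   # NA, with "NAs produced by integer overflow"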

How do I get around this overflow issue? Thanks in advance for your help!

1 Answer

Answer 1

I believe that I have found the source of the issue in the biglm code.

The number of observations (n) is stored as an integer, which has a max value of 2^31 - 1.

The numeric type is not subject to this limit, and, as far as I can tell, can be used instead of integers to store n.
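A quick console check illustrates the point; doubles represent whole numbers exactly up to 2^53, which is far more than the 3.5 billion observations involved here.

n <- as.numeric(.Machine$integer.max)   # store the count as a double instead of an integer
n + 1                                   # 2147483648, no overflow
3.5e9 < 2^53                            # TRUE: doubles hold counts of this size exactly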

Here is a commit on GitHub showing how to fix this problem with one additional line of code that converts the integer n to a numeric. As the model is updated, the number of rows in each new batch is added to the old n, so n stays numeric from then on.
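The linked commit has the exact change; as a rough sketch of the idea (using the names that appear in the warning message, not necessarily the actual upstream source), the one-line fix amounts to coercing the stored count to numeric before the batch size is added to it:

# Sketch only, not the real biglm source: coerce the stored observation
# count to numeric so the subsequent addition is done in doubles.
object$n <- as.numeric(object$n)   # the one added line
object$n <- object$n + NROW(mm)    # existing accumulation, now overflow-free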

I was able to reproduce the error described in this question and verify that my fix works with this code:

(WARNING: this consumes a large amount of memory; if you have tight memory constraints, consider doing more iterations with a smaller data frame. A lower-memory variant is sketched after the output below.)

library(biglm)

df = as.data.frame(replicate(3, rnorm(10000000)))
a = biglm(V1 ~ V2 + V3, df)

# 300 more passes over 10 million rows pushes the cumulative sample size past 2^31
for (i in 1:300) {
    a = update(a, df)
}
print(summary(a))

In the original biglm library, this code outputs:

Large data regression model: biglm(ff, df)
Sample size =  NA
              Coef (95% CI) SE  p
(Intercept) -1e-04   NA  NA NA NA
V2          -1e-04   NA  NA NA NA
V3          -2e-04   NA  NA NA NA

My patched version outputs:

Large data regression model: biglm(V1 ~ V2 + V3, df)
Sample size =  3.01e+09
              Coef   (95%    CI) SE p
(Intercept) -3e-04 -3e-04 -3e-04  0 0
V2          -2e-04 -2e-04 -1e-04  0 0
V3           3e-04  3e-04  3e-04  0 0

The SE and p values are non-zero, just rounded in the output above.
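As an aside, if the 10-million-row data frame in the test above is too large for your machine, the same cumulative sample size can be reached with a smaller frame and more update passes, along these lines (a lower-memory variant of the same test, not code from the patch itself):

library(biglm)

# Smaller data frame: 1 million rows instead of 10 million
df = as.data.frame(replicate(3, rnorm(1000000)))
a = biglm(V1 ~ V2 + V3, df)

# 3000 passes of 1e6 rows, plus the initial fit, is roughly 3.0e9 observations
for (i in 1:3000) {
    a = update(a, df)
}
print(summary(a))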

I am fairly new to the R ecosystem, so I would appreciate it if someone could tell me how to submit this patch so that it can be reviewed by the original author and eventually included in the upstream package.
