Monday, October 23, 2017

R memory not released in Windows

I am using RStudio on Windows 7 and I have a problem releasing memory back to the OS. My code is below. In a for loop:

  • I read data through an API call to the Census.gov website, and I use the package acs to save it to a .csv file via the temporary object table.
  • I remove the table (usual size: a few MB), and I use the package pryr to check memory usage.

According to the function mem_used(), R always returns to a constant memory usage after table is removed. According to Windows Task Manager, however, the memory allocated to rsession.exe (not RStudio) increases at every iteration until it eventually crashes the R session. Calling gc() does not help. I have read many similar questions, but it seems that the only way to free the memory is to restart the R session, which seems silly. Any suggestions?

    library(acs)
    library(pryr)

    # for loop to extract tables from the API and save them as .csv files
    for (i in 128:length(tablecodes)) {
      tryCatch({
        table <- acs.fetch(table.number = tablecodes[i], endyear = 2014, span = 5,
                           geography = geo.make(state = "NY", county = "*", tract = "*"),
                           key = "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851",
                           col.names = "pretty")
      }, error = function(e) { print("Table skipped") })

      # if the table is actually fetched then we save it
      if (exists("table", mode = "S4")) {
        print(paste("Table", i, "fetched"))
        if (!is.na(table)) {
          write.csv(estimate(table),
                    paste("./CENSUS_tables/NY/", tablecodes[i], ".csv", sep = ""))
        }
        print(mem_used())
        print(mem_change(rm(table)))
        gc()
      }
    }

1 Answer

I was able to confirm that the memory problem exists on Windows 7 (running via VMware Fusion on Mac OS X). It also appears to exist on Mac OS X, though the memory growth there is quite gradual [unconfirmed, but indicative of a memory leak]. Measuring it on Mac OS X is slightly tricky because the OS compresses memory when it sees high usage.
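
To frame why mem_used() and Task Manager can disagree, here is a minimal sketch in base R. It assumes nothing beyond gc(); the helper name r_heap_mb is made up for illustration. As far as I understand it, mem_used() reports the same kind of figure: memory occupied by R objects, not what the process has reserved from the OS.

```r
# Memory currently occupied by R objects, in MB, as reported by gc().
# Column 2 of the gc() matrix is the "used (Mb)" figure for Ncells and Vcells.
# (r_heap_mb is a hypothetical helper name, not part of any package.)
r_heap_mb <- function() sum(gc()[, 2])

before <- r_heap_mb()
x <- numeric(5e6)        # roughly 40 MB of doubles
during <- r_heap_mb()
rm(x)
after <- r_heap_mb()     # gc() runs inside the helper, so x is collected

cat(sprintf("before: %.1f MB, with x: %.1f MB, after rm: %.1f MB\n",
            before, during, after))
```

The "after" figure dropping back towards "before" only shows that R considers the memory free. Whether the process hands it back to the OS is a separate matter, and that is the number Task Manager shows.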

Proposed Workaround:

In light of the above, my proposal is to split the set of tables into smaller groups when you download from the US Census Bureau. Why? Looking at the code, you are downloading the data only to store it in .csv files. Hence, the short-term workaround is to break up the list of tables you are downloading. Your program should then be able to complete successfully across a set of runs.

One option is to create a wrapper script and have it perform N runs, each of which invokes a separate R session, i.e. the wrapper invokes N R sessions in series and each session downloads its own subset of the files.

N.B. Based on your code and the observed memory usage, my sense is that you are downloading a lot of tables, so splitting the work across R sessions may be the best option.

N.B. The following should work under Cygwin on Windows 7.

Invoke Script

Example: Download the primary tables 01 to 27 - if they do not exist skip...

    #!/bin/bash

    # Ref: https://censusreporter.org/topics/table-codes/
    # Params: Primary Table, Year, Span

    for CensusTableCode in $(seq -w 1 27)
    do
      R --no-save -q --slave --args B"$CensusTableCode"001 2014 5 < ./PullCensus.R
    done

PullCensus.R

    # Install the packages if they are missing, then load them
    if (!require(acs)) { install.packages("acs"); library(acs) }
    if (!require(pryr)) { install.packages("pryr"); library(pryr) }

    # You can obtain a US Census key from the developer site
    # "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
    api.key.install(key = "** Secret**")

    setwd("~/dev/stackoverflow/37264919")

    # Extract Table Structure
    #
    # B = Detailed Column Breakdown
    # 19 = Income (Households and Families)
    # 001 =
    # A - I = Race
    #
    args <- commandArgs(trailingOnly = TRUE)
    # trailingOnly = TRUE means that only your arguments are returned

    if (length(args) != 0) {
      tableCodes <- args[1]
      # command-line arguments arrive as strings, so convert the numeric ones
      defEndYear <- as.numeric(args[2])
      defSpan <- as.numeric(args[3])
    } else {
      tableCodes <- c("B02001")
      defEndYear <- 2014
      defSpan <- 5
    }

    # for loop to extract tables from the API and save them as .csv files
    for (i in seq_along(tableCodes)) {
      tryCatch(
        table <- acs.fetch(table.number = tableCodes[i],
                           endyear = defEndYear,
                           span = defSpan,
                           geography = geo.make(state = "NY",
                                                county = "*",
                                                tract = "*"),
                           col.names = "pretty"),
        error = function(e) { print("Table skipped") })

      # if the table is actually fetched then we save it
      if (exists("table", mode = "S4")) {
        print(paste("Table", i, "fetched"))
        if (!is.na(table)) {
          write.csv(estimate(table),
                    paste(defEndYear, "_", tableCodes[i], ".csv", sep = ""))
        }
        print(mem_used())
        print(mem_change(rm(table)))
        gc(reset = TRUE)
        print(mem_used())
      }
    }
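
If a bash shell is not available, the same wrapper idea can be sketched in R itself. This is only a sketch under stated assumptions: the function names (make_table_codes, child_args, run_all) are made up for illustration, and it assumes PullCensus.R sits in the working directory and takes the table code, end year and span as its three arguments, as in the script above. Each table is fetched in its own Rscript process, so whatever memory that process accumulated goes back to the OS when it exits.

```r
# Hypothetical R wrapper: one fresh R process per table code.

# Build zero-padded primary table codes, e.g. 1 -> "B01001"
# (the same codes the bash loop produces with seq -w).
make_table_codes <- function(subjects) {
  sprintf("B%02d001", as.integer(subjects))
}

# Command-line arguments for one child run of PullCensus.R.
# The script path is an assumption; adjust it to your layout.
child_args <- function(code, end_year = 2014, span = 5,
                       script = "PullCensus.R") {
  c(script, code, as.character(end_year), as.character(span))
}

# Run the child sessions in series and return their exit statuses,
# named by table code (0 means the child session exited normally).
run_all <- function(codes, ...) {
  rscript <- file.path(R.home("bin"), "Rscript")
  vapply(codes, function(code) {
    as.integer(system2(rscript, child_args(code, ...)))
  }, integer(1))
}

# Example (commented out because it calls the Census API):
# statuses <- run_all(make_table_codes(1:27))
# names(statuses)[statuses != 0]   # codes whose session did not exit cleanly
```

Running one table per session is the simplest split. Passing several codes to each session would need PullCensus.R to read more than args[1] as table codes.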

I hope the above helps by way of example. It is an approach. ;-)

T.

Next Steps:

I'll take a look at the package source to see if I can see what is actually wrong. Alternatively, you yourself may be able to narrow it down and file a bug against the package.


Background / Working Example:

My sense is that it might help to provide a working code example to frame the above workaround. The intent is to give people something they can run to test and consider what is happening; it also makes your question and intent easier to understand.

Essentially (as I understand it), you are bulk downloading US Census data from the US Census website, and the table codes specify which data you wish to download. So I created a set of table codes and tested the memory usage to see whether memory is consumed as you described.

    library(acs)
    library(pryr)
    library(tigris)
    library(stringr)  # to pad fips codes
    library(maptools)

    # You can obtain a US Census key from the developer site
    # "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
    api.key.install(key = "<INSERT KEY HERE>")

    # Table Codes
    #
    # While Census Reporter hopes to save you from the details, you may be
    # interested to understand some of the rationale behind American Community
    # Survey table identifiers.
    #
    # Detailed Tables
    #
    # The bulk of the American Community Survey is the over 1400 detailed data
    # tables. These tables have reference codes, and knowing how the codes are
    # structured can be helpful in knowing which table to use.
    #
    # Codes start with either the letter B or C, followed by two digits for the
    # table subject, then 3 digits that uniquely identify the table. (For a small
    # number of technical tables the unique identifier is 4 digits.) In some cases
    # additional letters for racial iterations and Puerto Rico-specific tables.
    #
    # Full and Collapsed Tables
    #
    # Tables beginning with B have the most detailed column breakdown, while a
    # C table for the same numbers will have fewer columns. For example, the
    # B02003 table ("Detailed Race") has 71 columns, while the "collapsed
    # version," C02003 has only 19 columns. While your instinct may be to want
    # as much data as possible, sometimes choosing the C table can simplify
    # your analysis.
    #
    # Table subjects
    #
    # The first two digits after B/C indicate the broad subject of a table.
    # Note that many tables have more than one subject, but this reflects the
    # main subject.
    #
    # 01 Age and Sex
    # 02 Race
    # 03 Hispanic Origin
    # 04 Ancestry
    # 05 Foreign Born; Citizenship; Year or Entry; Nativity
    # 06 Place of Birth
    # 07 Residence 1 Year Ago; Migration
    # 08 Journey to Work; Workers' Characteristics; Commuting
    # 09 Children; Household Relationship
    # 10 Grandparents; Grandchildren
    # 11 Household Type; Family Type; Subfamilies
    # 12 Marital Status and History
    # 13 Fertility
    # 14 School Enrollment
    # 15 Educational Attainment
    # 16 Language Spoken at Home and Ability to Speak English
    # 17 Poverty
    # 18 Disability
    # 19 Income (Households and Families)
    # 20 Earnings (Individuals)
    # 21 Veteran Status
    # 22 Transfer Programs (Public Assistance)
    # 23 Employment Status; Work Experience; Labor Force
    # 24 Industry; Occupation; Class of Worker
    # 25 Housing Characteristics
    # 26 Group Quarters
    # 27 Health Insurance
    #
    # Three groups of tables reflect technical details about how the Census is
    # administered. In general, you probably don't need to look at these too
    # closely, but if you need to check for possible weaknesses in your data
    # analysis, they may come into play.
    #
    # 00 Unweighted Count
    # 98 Quality Measures
    # 99 Imputations
    #
    # Race and Latino Origin
    #
    # Many tables are provided in multiple racial tabulations. If a table code
    # ends in a letter from A-I, that code indicates that the table universe is
    # restricted to a subset based on responses to the race or
    # Hispanic/Latino-origin questions.
    #
    # Here is a guide to those codes:
    #
    #   A White alone
    #   B Black or African American Alone
    #   C American Indian and Alaska Native Alone
    #   D Asian Alone
    #   E Native Hawaiian and Other Pacific Islander Alone
    #   F Some Other Race Alone
    #   G Two or More Races
    #   H White Alone, Not Hispanic or Latino
    #   I Hispanic or Latino

    setwd("~/dev/stackoverflow/37264919")

    # Extract Table Structure
    #
    # B = Detailed Column Breakdown
    # 19 = Income (Households and Families)
    # 001 =
    # A - I = Race
    #
    tablecodes <- c("B19001", "B19001A", "B19001B", "B19001C", "B19001D",
                    "B19001E", "B19001F", "B19001G", "B19001H", "B19001I" )

    # for loop to extract tables from API and save them on API
    for (i in 1:length(tablecodes))
    {
      print(tablecodes[i])
      tryCatch(
        table <- acs.fetch(table.number = tablecodes[i],
                           endyear = 2014,
                           span = 5,
                           geography = geo.make(state = "NY",
                                                county = "*",
                                                tract = "*"),
                           col.names = "pretty"),
        error = function(e) { print("Table skipped")} )

      # if the table is actually fetched then we save it
      if (exists("table", mode="S4"))
      {
        print(paste("Table", i, "fetched"))
        if (!is.na(table))
        {
          write.csv(estimate(table), paste("T",tablecodes[i], ".csv", sep = ""))
        }
        print(mem_used())
        print(mem_change(rm(table)))
        gc()
        print(mem_used())
      }
    }

Runtime Output

    > library(acs)
    > library(pryr)
    > library(tigris)
    > library(stringr)  # to pad fips codes
    > library(maptools)
    > # You can obtain a US Census key from the developer site
    > # "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
    > api.key.install(key = "...secret...")
    >
    ...
    > setwd("~/dev/stackoverflow/37264919")
    >
    > # Extract Table Structure
    > #
    > # B = Detailed Column Breakdown
    > # 19 = Income (Households and Families)
    > # 001 =
    > # A - I = Race
    > #
    > tablecodes <- c("B19001", "B19001A", "B19001B", "B19001C", "B19001D",
    +                 "B19001E", "B19001F", "B19001G", "B19001H", "B19001I" )
    >
    > # for loop to extract tables from API and save them on API
    > for (i in 1:length(tablecodes))
    + {
    +   print(tablecodes[i])
    +   tryCatch(
    +     table <- acs.fetch(table.number = tablecodes[i],
    +                        endyear = 2014,
    +                        span = 5,
    +                        geography = geo.make(state = "NY",
    +                                             county = "*",
    +                                             tract = "*"),
    +                        col.names = "pretty"),
    +     error = function(e) { print("Table skipped")} )
    +
    +   # if the table is actually fetched then we save it
    +   if (exists("table", mode="S4"))
    +   {
    +     print(paste("Table", i, "fetched"))
    +     if (!is.na(table))
    +     {
    +       write.csv(estimate(table), paste("T",tablecodes[i], ".csv", sep = ""))
    +     }
    +     print(mem_used())
    +     print(mem_change(rm(table)))
    +     gc()
    +     print(mem_used())
    +   }
    + }
    [1] "B19001"
    [1] "Table 1 fetched"
    95.4 MB
    -1.88 MB
    93.6 MB
    [1] "B19001A"
    [1] "Table 2 fetched"
    95.4 MB
    -1.88 MB
    93.6 MB
    [1] "B19001B"
    [1] "Table 3 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001C"
    [1] "Table 4 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001D"
    [1] "Table 5 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001E"
    [1] "Table 6 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001F"
    [1] "Table 7 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001G"
    [1] "Table 8 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001H"
    [1] "Table 9 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB
    [1] "B19001I"
    [1] "Table 10 fetched"
    95.5 MB
    -1.88 MB
    93.6 MB

Output files

    >ll
    total 8520
    drwxr-xr-x@ 13 hidden  staff   442B Oct 17 20:41 .
    drwxr-xr-x@ 40 hidden  staff   1.3K Oct 17 23:17 ..
    -rw-r--r--@  1 hidden  staff   4.4K Oct 17 23:43 37264919.R
    -rw-r--r--@  1 hidden  staff   492K Oct 17 23:50 TB19001.csv
    -rw-r--r--@  1 hidden  staff   472K Oct 17 23:51 TB19001A.csv
    -rw-r--r--@  1 hidden  staff   414K Oct 17 23:51 TB19001B.csv
    -rw-r--r--@  1 hidden  staff   387K Oct 17 23:51 TB19001C.csv
    -rw-r--r--@  1 hidden  staff   403K Oct 17 23:51 TB19001D.csv
    -rw-r--r--@  1 hidden  staff   386K Oct 17 23:51 TB19001E.csv
    -rw-r--r--@  1 hidden  staff   402K Oct 17 23:51 TB19001F.csv
    -rw-r--r--@  1 hidden  staff   393K Oct 17 23:52 TB19001G.csv
    -rw-r--r--@  1 hidden  staff   465K Oct 17 23:44 TB19001H.csv
    -rw-r--r--@  1 hidden  staff   417K Oct 17 23:44 TB19001I.csv