Thursday, March 2, 2017

Reading a template file and writing it to disk after some modifications


I need to read a template file test.txt, modify its contents, and then write a modified copy named foo`i`.in to disk (i is an iteration number). Since I need to perform this operation a large number of times (a million times wouldn't be uncommon), efficient solutions would be preferred. The template file is like this:

1
bar.out
        70.000000000000000
         2.000000000000000
        14.850000000000000
      8000.000000000000000
       120.000000000000000
        60.000000000000000
         0.197500000000000
         0.197500000000000
         2.310000000000000
         0.200000000000000
         0.000000000000000
         1.000000000000000
         0.001187700000000
        22.000000000000000
         1.400000000000000
         1.000000000000000
         0.010000000000000
100
         0.058600000000000
        -0.217000000000000
         0.078500000000000
        -0.110100000000000
30
       500.000000000000000
T

I don't need to modify all lines, just some of them. Specifically, I need to modify bar.out to bar`i`.out where i is an iteration index. I also need to modify some numeric lines with the following values:

parameters <- data.frame(
  index = c(1:10, 13:16, 21:22),
  variable = c("P1", "P2", "T1", "s", "D", "L", "C1", "C2",
               "VA", "pw", "m", "mw", "Cp", "Z", "ff_N", "ff_M"),
  value = c(65, 4, 16.85, 7900, 110, 60, 0.1975, 0.1875, 2.31,
            0.2, 0.0011877, 22.0, 1.4, 1.0, 0.0785, -0.1101)
)

All the other lines must remain the same, including the last line, T. Thus, assuming I'm at the first iteration, the expected output is a text file named foo1.in with the following content (the exact number format is not important, as long as all the significant digits in parameters$value are included in foo1.in):

1
bar1.out
        65.000000000000000
         4.000000000000000
        16.850000000000000
      7900.000000000000000
       110.000000000000000
        60.000000000000000
         0.197500000000000
         0.187500000000000
         2.310000000000000
         0.200000000000000
         0.000000000000000
         1.000000000000000
         0.001187700000000
        22.000000000000000
         1.400000000000000
         1.000000000000000
         0.010000000000000
100
         0.058600000000000
        -0.217000000000000
         0.078500000000000
        -0.110100000000000
30
       500.000000000000000
T

Modifying foo.in and bar.out is easy:

template  <- "test.txt"
infile    <- "foo.in"
string1   <- "bar.out"
iteration <- 1

# build string1
elements <- strsplit(string1, "\\.")[[1]]
elements[1] <- paste0(elements[1], iteration)
string1 <- paste(elements, collapse = ".")

# build infile name
elements <- strsplit(infile, "\\.")[[1]]
elements[1] <- paste0(elements[1], iteration)
infile <- paste(elements, collapse = ".")
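As an aside, the same names can be built in one step with sprintf (an equivalent alternative, not part of the original code):

# one-step alternative for building the file names
iteration <- 1
string1 <- sprintf("bar%d.out", iteration)  # "bar1.out"
infile  <- sprintf("foo%d.in",  iteration)  # "foo1.in"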

Now, I would like to read the template file and modify only the intended lines. The first problem I face is that read.table only outputs a data frame. Since my template file contains numbers and strings in the same column, reading the whole file with read.table would give me a character column (I guess). I circumvent the problem by reading only the numeric values I'm interested in:

# read template file
temp <- read.table(template, stringsAsFactors = FALSE, skip = 2, nrows = 23)$V1
lines_to_read <- temp[length(temp)]

# modify numerical parameter values
temp[parameters$index] <- parameters$value
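(As a quick aside, the coercion mentioned above is easy to demonstrate with read.table's text argument; this is just a self-contained sketch:)

# mixed strings and numbers in one column are read as character
mixed <- read.table(text = "1\nbar.out\n70.5", stringsAsFactors = FALSE)
str(mixed$V1)
# chr [1:3] "1" "bar.out" "70.5"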

However, now I don't know how to write foo1.in. write.table can only write matrices or data frames to disk, so I can't use it to write a file that contains numbers and strings in the same column. How can I solve this?

EDIT I provide a bit of background on this problem, to explain why I need to write this file so many times. So, the idea is to perform Bayesian inference for the calibration parameters of a computer code (an executable). The basic idea is simple: you have a black box (commercial) computer code, which simulates a physical problem, for example a FEM code. Let's call this code Joe. Given an input file, Joe outputs a prediction for the response of a physical system. Now, I also have actual experimental measurements for the response of this system. I would like to find values of Joe's inputs such that the difference between Joe's outputs and the real measurements is minimized (actually things are quite different, but this is just to give an idea). In practice, this means that I need to run Joe many times with different input files, and iteratively find the input values which reduce the "discrepancy" between Joe's prediction and experimental results. In short:

  1. I need to generate many input (text) files
  2. I don't know in advance the contents of the input files. The numerical parameters are modified during the optimization in an iterative way.
  3. I also need to read Joe's output for each input. This is actually another problem and I'll probably write a specific question on this point.

So, while Joe is a commercial code for which I only have the executable (no source), the Bayesian inference is performed in R, because R (and, for what it matters, Python) have excellent tools to perform this kind of study.
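To make the loop concrete, here is a minimal sketch of one iteration, assuming Joe is invoked as a hypothetical executable ./joe that takes the input file name as an argument; write_input(), read_output() and measurements are placeholders, not real code:

# hypothetical sketch: "./joe", write_input(), read_output() and
# `measurements` are placeholders, not actual code from this question
discrepancy <- function(theta, i) {
  infile  <- sprintf("foo%d.in", i)
  outfile <- sprintf("bar%d.out", i)
  write_input(theta, infile)       # step 1: generate the input file
  system2("./joe", args = infile)  # run the black-box executable
  pred <- read_output(outfile)     # step 3: read Joe's predictions
  sum((pred - measurements)^2)     # misfit against the experimental data
}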

3 Answers

Answer 1

This is probably easiest solved using a template language, such as Mustache, which is implemented in R in the whisker package.

Below is an example showing how this can be done in your case. To keep things short, I only implemented the first three variables and bar1.out; implementing the remaining variables should be straightforward.

library(whisker)

# You could also read the template in using readLines:
# template <- readLines("template.txt")
# but to keep the example self-sufficient, I included it in the code
template <- "1
bar{{run}}.out
       {{P1}}
       {{P2}}
       {{T1}}
      8000.000000000000000
       120.000000000000000
        60.000000000000000
         0.197500000000000
         0.197500000000000
         2.310000000000000
         0.200000000000000
         0.000000000000000
         1.000000000000000
         0.001187700000000
        22.000000000000000
         1.400000000000000
         1.000000000000000
         0.010000000000000
100
         0.058600000000000
        -0.217000000000000
         0.078500000000000
        -0.110100000000000
30
       500.000000000000000
T"

# Store parameters in a list
parameters <- list(
  run = 1,
  P1  = 65,
  P2  = 4,
  T1  = 16.85)

for (i in seq_len(10)) {
  # New set of parameters
  parameters$run <- i
  parameters$P1  <- sample(1:100, 1)

  # Generate new script by rendering the template using parameters
  current_script <- whisker.render(template, parameters)
  writeLines(current_script, paste0("foo", i, ".in"))

  # Run script
  # system(...)
}

What Mustache does (in this case; more complex templating is possible, e.g. conditional elements) is replace every {{<variable>}} with the corresponding value in the parameters list.
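As a small aside (a sketch, not needed for this problem), a conditional section looks like this:

library(whisker)

# a {{#flag}}...{{/flag}} section renders only when `flag` is truthy
tpl <- "{{#debug}}debug run {{run}}{{/debug}}"
whisker.render(tpl, list(debug = TRUE,  run = 1))  # "debug run 1"
whisker.render(tpl, list(debug = FALSE, run = 1))  # ""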

Answer 2

Sounds like you need custom read/write functions; not ideal, but when you have a hybrid column-like structure, you have already diverged from "neat" data (whether or not it is tidy).

Three functions that simplify what I believe you need:

read_myfile <- function(x) {
  # accept either a file name or a raw string (mostly during dev)
  if (file.exists(x)) x <- readLines(x)
  if (length(x) == 1) x <- strsplit(x, "\n")[[1]]
  # find all left-aligned NAMED rows
  hdrs <- grep("[A-Za-z]", x)
  hdrs <- c(1, hdrs) # ensure the first "1" is preserved
  dat <- mapply(function(a, b, x) if (b >= a) as.numeric(x[seq(a, b)]),
                hdrs + 1, c(hdrs[-1] - 1, length(x)), list(x),
                SIMPLIFY = FALSE)
  names(dat) <- trimws(x[hdrs])
  dat
}

mod_myfile <- function(x, i, params) {
  # sanity checks
  stopifnot(
    is.list(x),
    is.numeric(i),
    is.data.frame(params),
    all(c("index", "value") %in% colnames(params))
  )
  isbarout <- which(names(x) == "bar.out")
  stopifnot(
    length(isbarout) == 1
  )
  # overwrite the requested values and rename the output-file entry
  x$bar.out[ params$index ] <- params$value
  names(x)[isbarout] <- sprintf("bar%i.out", i)
  x
}

write_myfile <- function(x, ...) {
  # interleave each header with its fixed-width formatted values
  newdat <- unlist(unname(
    mapply(function(hdr, dat) c(hdr, sprintf("%25.15f ", dat)),
           names(x), x, SIMPLIFY = TRUE)
  ))
  writeLines(newdat, ...)
}

The use is straightforward. I'll start with a single character string to emulate the input template (the read function works equally well with a character string as with a file name):

rawfile <- "1
bar.out
        70.000000000000000
         2.000000000000000
        14.850000000000000
      8000.000000000000000
       120.000000000000000
        60.000000000000000
         0.197500000000000
         0.197500000000000
         2.310000000000000
         0.200000000000000
         0.000000000000000
         1.000000000000000
         0.001187700000000
        22.000000000000000
         1.400000000000000
         1.000000000000000
         0.010000000000000
100
         0.058600000000000
        -0.217000000000000
         0.078500000000000
        -0.110100000000000
30
       500.000000000000000
T"

To start, just read the data:

dat <- read_myfile(rawfile)
# dat <- read_myfile("file.in")
str(dat)
# List of 3
#  $ 1      : NULL
#  $ bar.out: num [1:24] 70 2 14.8 8000 120 ...
#  $ T      : NULL

You will somehow determine how the parameters should be changed. I'll use your previous data:

parameters <- data.frame(
  index = c(1:10, 13:16, 21:22),
  variable = c("P1", "P2", "T1", "s", "D", "L", "C1", "C2",
               "VA", "pw", "m", "mw", "Cp", "Z", "ff_N", "ff_M"),
  value = c(65, 4, 16.85, 7900, 110, 60, 0.1975, 0.1875, 2.31,
            0.2, 0.0011877, 22.0, 1.4, 1.0, 0.0785, -0.1101)
)

The first argument is the output from read_myfile; the second is the iteration index used to rename bar.out; the third is the parameters data.frame:

newdat <- mod_myfile(dat, 32, parameters)
str(newdat)
# List of 3
#  $ 1        : NULL
#  $ bar32.out: num [1:24] 65 4 16.9 7900 110 ...
#  $ T        : NULL

And now write it out.

write_myfile(newdat, sprintf("foo%d.in", 32)) 

I don't know how @GiovanniRighi's performance will compare when run in a single R session, but 1000 of these files takes less than 7 seconds on my computer.
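For reference, a rough way to reproduce that timing in a single session (assuming dat, parameters, and the three functions above are already defined):

# rough single-session benchmark of 1000 modify/write cycles
system.time(
  for (i in seq_len(1000)) {
    newdat <- mod_myfile(dat, i, parameters)
    write_myfile(newdat, sprintf("foo%d.in", i))
  }
)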

Answer 3

A couple of tricks should help. Let's follow a minimal working example that, I think, has all the features of your problem. Here are the contents of the file I'm modifying, tmp.txt:

1
bar.out
21
31
T

Typically we work with lists rather than vectors when R has heterogeneous data, but here it seems easier to work with a character vector. Read the file into a character vector:

a <- readLines("tmp.txt") 

Since it looks like you have the string replacement under control, let's change the numbers. You already have the replacement values, so you can assign them just like you would the strings; we only need to convert the numeric vector to a character vector first.

value <- c(21, 31)
value <- as.character(value)
a[3:4] <- value
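If the template's fixed-width number format matters, one option (an alternative sketch, not what the code above does) is to format explicitly instead of relying on as.character:

# render with 15 decimal places to match the template's layout
a[3:4] <- sprintf("%.15f", c(21, 31))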

Now write to replace the old file:

writeLines(a, "tmp.txt") 

Now Frank's comment is relevant because file I/O will be a serious bottleneck here. It would be much faster to do all of this in RAM.
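For example, a sketch that keeps the template in memory (assuming the same five-line tmp.txt), so each iteration only pays for one writeLines() call:

# read the template once, then reuse the in-memory copy every iteration
template_lines <- readLines("tmp.txt")
for (i in seq_len(1000)) {
  a <- template_lines
  a[2]   <- sprintf("bar%d.out", i)      # rename the output file
  a[3:4] <- sprintf("%.15f", c(21, 31))  # illustrative replacement values
  writeLines(a, sprintf("foo%d.in", i))
}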

time for i in {1..1000}; do ./run.R; done

real    0m44.988s
user    0m33.270s
sys     0m5.170s

Time seemed to increase linearly, so I'd expect a million iterations to take around 16 hours. Most of that time is file reading and writing. You can attempt to speed that up, but I don't think you'll be able to improve it much unless you can get your MCMC binary to spit out Rdata binaries (or feather files).
