I need to read a template file test.txt, modify its contents, and then write a modified copy to disk under the name foo`i`.in (where i is an iteration number). Since I need to perform this operation a large number of times (a million would not be uncommon), efficient solutions are preferred. The template file looks like this:
1
bar.out
70.000000000000000
2.000000000000000
14.850000000000000
8000.000000000000000
120.000000000000000
60.000000000000000
0.197500000000000
0.197500000000000
2.310000000000000
0.200000000000000
0.000000000000000
1.000000000000000
0.001187700000000
22.000000000000000
1.400000000000000
1.000000000000000
0.010000000000000
100
0.058600000000000
-0.217000000000000
0.078500000000000
-0.110100000000000
30
500.000000000000000
T
I don't need to modify all lines, just some of them. Specifically, I need to change bar.out to bar`i`.out, where i is the iteration index. I also need to modify some of the numeric lines with the following values:
parameters <- data.frame(
  index = c(1:10, 13:16, 21:22),
  variable = c("P1", "P2", "T1", "s", "D", "L", "C1", "C2", "VA", "pw",
               "m", "mw", "Cp", "Z", "ff_N", "ff_M"),
  value = c(65, 4, 16.85, 7900, 110, 60, 0.1975, .1875, 2.31, 0.2,
            0.0011877, 22.0, 1.4, 1.0, 0.0785, -0.1101)
)
All the other lines must remain the same, including the last line T. Thus, assuming I'm at the first iteration, the expected output is a text file named foo1.in with the following content (the exact number format is not important, as long as all the significant digits in parameters$value are included in foo1.in):
1
bar1.out
65.000000000000000
4.000000000000000
16.850000000000000
7900.000000000000000
110.000000000000000
60.000000000000000
0.197500000000000
0.187500000000000
2.310000000000000
0.200000000000000
0.000000000000000
1.000000000000000
0.001187700000000
22.000000000000000
1.400000000000000
1.000000000000000
0.010000000000000
100
0.058600000000000
-0.217000000000000
0.078500000000000
-0.110100000000000
30
500.000000000000000
T
Modifying foo.in and bar.out is easy:
template <- "test.txt"
infile <- "foo.in"
string1 <- "bar.out"
iteration <- 1

# build string1
elements <- strsplit(string1, "\\.")[[1]]
elements[1] <- paste0(elements[1], iteration)
string1 <- paste(elements, collapse = ".")

# build infile name
elements <- strsplit(infile, "\\.")[[1]]
elements[1] <- paste0(elements[1], iteration)
infile <- paste(elements, collapse = ".")
Now I would like to read the template file and modify only the intended lines. The first problem I face is that read.table only returns a data frame. Since my template file contains numbers and strings in the same column, if I read the whole file with read.table I would get a character column (I guess). I work around the problem by reading only the numeric values I'm interested in:
# read template file
temp <- read.table(template, stringsAsFactors = FALSE, skip = 2, nrows = 23)$V1
lines_to_read <- temp[length(temp)]

# modify numerical parameter values
temp[parameters$index] <- parameters$value
However, now I don't know how to write foo1.in. With write.table I can only write matrices or data frames to disk, so I can't write a file that contains numbers and strings in the same column. How can I solve this?
EDIT I provide a bit of background on this problem, to explain why I need to write this file so many times. So, the idea is to perform Bayesian inference for the calibration parameters of a computer code (an executable). The basic idea is simple: you have a black box (commercial) computer code, which simulates a physical problem, for example a FEM code. Let's call this code Joe. Given an input file, Joe outputs a prediction for the response of a physical system. Now, I also have actual experimental measurements for the response of this system. I would like to find values of Joe's inputs such that the difference between Joe's outputs and the real measurements is minimized (actually things are quite different, but this is just to give an idea). In practice, this means that I need to run Joe many times with different input files, and iteratively find the input values which reduce the "discrepancy" between Joe's prediction and experimental results. In short:
- I need to generate many input (text) files
- I don't know in advance the contents of the input files. The numerical parameters are modified during the optimization in an iterative way.
- I also need to read Joe's output for each input. This is actually another problem and I'll probably write a specific question on this point.
So, while Joe is a commercial code for which I only have the executable (no source), the Bayesian inference is performed in R, because R (and, for that matter, Python) has excellent tools for this kind of study.
3 Answers
Answer 1
This is probably most easily solved using a template language such as Mustache, which is implemented in R by the whisker package.
Below is an example showing how this can be done in your case. As an example, I only implemented the first three variables and bar1.out. Implementing the remaining variables should be straightforward.
library(whisker)

# You could also read the template in using readLines
# template <- readLines("template.txt")
# but to keep the example self-contained, I included it in the code
template <- "1
bar{{run}}.out
{{P1}}
{{P2}}
{{T1}}
8000.000000000000000
120.000000000000000
60.000000000000000
0.197500000000000
0.197500000000000
2.310000000000000
0.200000000000000
0.000000000000000
1.000000000000000
0.001187700000000
22.000000000000000
1.400000000000000
1.000000000000000
0.010000000000000
100
0.058600000000000
-0.217000000000000
0.078500000000000
-0.110100000000000
30
500.000000000000000
T"

# Store parameters in a list
parameters <- list(
  run = 1,
  P1 = 65,
  P2 = 4,
  T1 = 16.85)

for (i in seq_len(10)) {
  # New set of parameters
  parameters$run <- i
  parameters$P1 <- sample(1:100, 1)

  # Generate new script by rendering the template using the parameters
  current_script <- whisker.render(template, parameters)
  writeLines(current_script, paste0("foo", i, ".in"))

  # Run script
  # system(...)
}
What Mustache does in this case is replace every {{<variable>}} with the corresponding value in the parameters list. More complex templating, such as conditional elements, is also possible.
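To make the substitution step concrete, here is a minimal base-R sketch of the same idea. It is not how whisker is implemented; `render_simple` is a hypothetical helper that only handles plain `{{name}}` placeholders, and the short template is made up for illustration.

```r
# Hypothetical helper (not part of whisker): replace each {{name}} in
# `template` with the matching element of the named list `values`.
render_simple <- function(template, values) {
  for (name in names(values)) {
    placeholder <- paste0("{{", name, "}}")
    # fixed = TRUE matches the braces literally instead of as a regex
    template <- gsub(placeholder, format(values[[name]], digits = 15),
                     template, fixed = TRUE)
  }
  template
}

# Shortened, made-up template: one element per line of the file
template <- c("1", "bar{{run}}.out", "{{P1}}", "{{P2}}", "{{T1}}", "T")
rendered <- render_simple(template, list(run = 1, P1 = 65, P2 = 4, T1 = 16.85))
print(rendered)
```

For anything beyond flat placeholders (sections, conditionals, escaping), a real templating package is likely the better choice.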
Answer 2
Sounds like you need custom read/write functions; not ideal, but when you have a hybrid column-like-thing, you already diverge from "neat data" (whether or not it is tidy).
Three functions that simplify what I believe you need:
read_myfile <- function(x) {
  # accepting a file name OR the raw contents is mostly useful during dev
  if (file.exists(x)) x <- readLines(x)
  # a single string holds the whole file: split it into lines
  if (length(x) == 1) x <- strsplit(x, "\n")[[1]]
  # find all left-aligned NAMED rows
  hdrs <- grep("[A-Za-z]", x)
  hdrs <- c(1, hdrs) # ensure the first "1" is preserved
  # numeric rows between consecutive headers (NULL if there are none)
  dat <- mapply(function(a, b, x) if (b >= a) as.numeric(x[seq(a, b)]),
                hdrs + 1, c(hdrs[-1] - 1, length(x)), list(x),
                SIMPLIFY = FALSE)
  names(dat) <- trimws(x[hdrs])
  dat
}

mod_myfile <- function(x, i, params) {
  # sanity checks
  stopifnot(
    is.list(x),
    is.numeric(i),
    is.data.frame(params),
    all(c("index", "value") %in% colnames(params))
  )
  isbarout <- which(names(x) == "bar.out")
  stopifnot(
    length(isbarout) == 1
  )
  x$bar.out[ params$index ] <- params$value
  names(x)[isbarout] <- sprintf("bar%i.out", i)
  x
}

write_myfile <- function(x, ...) {
  # each header followed by its numeric rows, formatted as fixed-width text
  newdat <- unlist(unname(
    mapply(function(hdr, dat) c(hdr, sprintf("%25.15f ", dat)),
           names(x), x, SIMPLIFY = TRUE)
  ))
  writeLines(newdat, ...)
}
Usage is straightforward. I'll start with a single character string to emulate the input template (the read function works equally well with a character string as it does with a file name):
rawfile <- "1
bar.out
70.000000000000000
2.000000000000000
14.850000000000000
8000.000000000000000
120.000000000000000
60.000000000000000
0.197500000000000
0.197500000000000
2.310000000000000
0.200000000000000
0.000000000000000
1.000000000000000
0.001187700000000
22.000000000000000
1.400000000000000
1.000000000000000
0.010000000000000
100
0.058600000000000
-0.217000000000000
0.078500000000000
-0.110100000000000
30
500.000000000000000
T
"
To start, just read the data:
dat <- read_myfile(rawfile)
# dat <- read_myfile("file.in")
str(dat)
# List of 3
#  $ 1      : NULL
#  $ bar.out: num [1:24] 70 2 14.8 8000 120 ...
#  $ T      : NULL
You will somehow determine how the parameters should be changed. I'll use your previous data:
parameters <- data.frame(
  index = c(1:10, 13:16, 21:22),
  variable = c("P1", "P2", "T1", "s", "D", "L", "C1", "C2", "VA", "pw",
               "m", "mw", "Cp", "Z", "ff_N", "ff_M"),
  value = c(65, 4, 16.85, 7900, 110, 60, 0.1975, .1875, 2.31, 0.2,
            0.0011877, 22.0, 1.4, 1.0, 0.0785, -0.1101)
)
The first argument of mod_myfile is the output from read_myfile; the second is the iteration number to append to bar.out; the third is this parameters data.frame:
newdat <- mod_myfile(dat, 32, parameters)
str(newdat)
# List of 3
#  $ 1        : NULL
#  $ bar32.out: num [1:24] 65 4 16.9 7900 110 ...
#  $ T        : NULL
And now write it out.
write_myfile(newdat, sprintf("foo%d.in", 32))
I don't know how this compares with @GiovanniRighi's approach when run in a single R session, but writing 1000 of these files takes less than 7 seconds on my computer.
Answer 3
A couple of tricks should help. Let's follow a minimum working example that - I think - has all the features of your problem. Here are the contents of the file I'm modifying, tmp.txt:
1
bar.out
21
31
T
Typically we work with lists rather than vectors when R has heterogeneous data. But here it seems to me easier to work with a character vector. Read the file from a text connection into a character vector:
a <- readLines("tmp.txt")
It looks like you already have the string replacements under control, so let's change the numbers in the same way. First convert the numeric vector of replacement values to a character vector, then assign it to the matching lines.
value <- c(21, 31)
value <- as.character(value)
a[3:4] <- value
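as.character() picks its own number format, which is fine when the program reading the file is flexible. If a specific format is needed, sprintf() gives explicit control. The sketch below uses a few of the values from the question; whether the target program accepts either format is an assumption to check.

```r
value <- c(65, 16.85, 0.0011877, -0.1101)

# Fixed notation with 15 decimals, the same style as the template file
fixed <- sprintf("%.15f", value)

# Up to 15 significant digits, without trailing zeros; %g may switch to
# scientific notation for very small or very large magnitudes
general <- sprintf("%.15g", value)

print(fixed)
print(general)

# Round-trip check: parsing the text gives back the original numbers
stopifnot(isTRUE(all.equal(as.numeric(fixed), value)))
stopifnot(isTRUE(all.equal(as.numeric(general), value)))
```

Either vector can be assigned into the character vector of lines in the same way as the as.character() result.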
Now write to replace the old file:
writeLines(a, "tmp.txt")
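Applied to the layout in the question, the same readLines/writeLines idea might look like the sketch below. `make_input` and its `offset` argument are hypothetical names; the sketch assumes that line 1 is "1", line 2 is "bar.out", and that parameters$index counts the numeric lines that follow (as implied by skip = 2 in the question). A shortened template and a temporary directory are used to keep it self-contained.

```r
# Hypothetical helper: build the lines of one input file.
#   template_lines: the template, one element per line (e.g. from readLines())
#   i:              the iteration number
#   parameters:     data frame with columns `index` and `value`
#   offset:         number of lines before the first numeric line (assumed 2)
make_input <- function(template_lines, i, parameters, offset = 2) {
  out <- template_lines
  out[2] <- sprintf("bar%d.out", i)
  out[parameters$index + offset] <- sprintf("%.15g", parameters$value)
  out
}

# Shortened stand-in for readLines("test.txt")
template_lines <- c("1", "bar.out", "70.000000000000000",
                    "2.000000000000000", "14.850000000000000", "T")
parameters <- data.frame(index = c(1, 3), value = c(65, 16.85))

new_lines <- make_input(template_lines, 1, parameters)
outfile <- file.path(tempdir(), "foo1.in")
writeLines(new_lines, outfile)
print(readLines(outfile))
```

Lines that are not listed in parameters$index are copied from the template unchanged, including the final T.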
Now Frank's comment is relevant because file I/O will be a serious bottleneck here. It would be much faster to do all of this in RAM.
time for i in {1..1000}; do ./run.R; done
real 0m44.988s
user 0m33.270s
sys 0m5.170s
Time seemed to increase linearly, so I'd expect a million iterations to take around 12.5 hours (about 45 seconds per 1000 iterations). Most of that time is file reading and writing. You can attempt to speed that up, but I don't think you'll be able to improve it much unless you can get your MCMC binary to spit out Rdata binaries (or feather files).
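Note that the shell loop above starts a new R process for every file, so part of the measured time is probably interpreter start-up rather than file I/O. A rough sketch for measuring the cost inside a single session, with the template held in memory and only the writes repeated, could look like this. The output directory, the number of files, and the random replacement values are all placeholders.

```r
# Hypothetical output directory inside the session's temp folder
out_dir <- file.path(tempdir(), "inputs")
dir.create(out_dir, showWarnings = FALSE)

# Read the template once, outside the loop; here a stand-in for
# readLines("tmp.txt")
template_lines <- c("1", "bar.out", "21", "31", "T")

n <- 100
timing <- system.time({
  for (i in seq_len(n)) {
    lines <- template_lines
    lines[2] <- sprintf("bar%d.out", i)
    # placeholder values; in practice these come from the optimizer
    lines[3:4] <- sprintf("%.15g", runif(2))
    writeLines(lines, file.path(out_dir, sprintf("foo%d.in", i)))
  }
})
print(timing[["elapsed"]])
```

The elapsed time depends on the machine and the disk, so it is worth running this with a realistic n before extrapolating to a million files.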