Saturday, July 12, 2014

odfweave setup and counting logicals

Two short items in this blog post. First, since it was not obvious how to run odfWeave() in my particular setup, here is the call I am using. Second, several people have been cross-tabulating logical vectors, so I wanted to play along; my version runs about 80 times faster than table().

odfWeave

My particular setup consists of R, 7-Zip and LibreOffice. Somehow they don't play along 100% when using odfWeave. I had that problem this spring and decided to put my solution in a post at some point. As for versions: I got it working with my previous versions, and I have tested that it still runs with my new setup (R 3.1.1, LibreOffice 4.2.5.2). The only loose end is that odfWeave complains that I am re-using a directory, so I need to empty said directory manually.
library(odfWeave)
# the standard example call that works for me
demoFile <- system.file("examples", "simple.odt", package = "odfWeave")
outputFile <- gsub("simple.odt", "output.odt", demoFile)
odfWeave(demoFile, outputFile,
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))
# removing files
file.remove(dir('C:\\Users\\Kees\\Documents\\tmp',
        recursive=TRUE,
        full.names=TRUE))

# using a different directory
odfWeave('C:\\Users\\Kees\\Documents\\test\\testcases.odt',
    'C:\\Users\\Kees\\Documents\\test\\testout.odt',
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))
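The file.remove() call above deletes files only; dir(recursive=TRUE) does not list directories, so any subdirectories in the work directory stay behind as empty folders. Whether odfWeave minds those leftovers I have not checked, but a minimal sketch for emptying a directory completely, using only base R, could look like this. The helper name empty_dir and the demo directory are made up for illustration and are not part of odfWeave.

```r
# Hypothetical helper (not part of odfWeave): remove everything inside
# a directory, subdirectories included, but keep the directory itself.
empty_dir <- function(path) {
  # everything directly inside `path`, hidden files too, without "." and ".."
  entries <- list.files(path, all.files = TRUE, full.names = TRUE, no.. = TRUE)
  # recursive = TRUE lets unlink() delete whole subdirectories
  unlink(entries, recursive = TRUE, force = TRUE)
  # TRUE if nothing is left inside `path`
  length(list.files(path, all.files = TRUE, no.. = TRUE)) == 0
}

# demonstration on a throw-away directory below tempdir()
work <- file.path(tempdir(), "odfweave_tmp_demo")
dir.create(file.path(work, "Pictures"), recursive = TRUE)
writeLines("dummy", file.path(work, "content.xml"))
writeLines("dummy", file.path(work, "Pictures", "plot1.png"))
is_empty <- empty_dir(work)
```

In my setup the call would be empty_dir('C:\\Users\\Kees\\Documents\\tmp') before each odfWeave() run.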

Cross table of logical vectors

This topic was started in "Sometimes Table is not the Answer – a Faster 2×2 Table" and carried on in "Sometimes I feel (some) need for speed", so I wanted to add my own attempts. The aim is to make a cross table of two logical vectors in as little time as possible, which becomes important when the vectors are long. First, the solutions from the previous posts:
set.seed(2014)

manual = sample(c(TRUE, FALSE), 10e6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10e6, replace = TRUE)

logical.tab = function(x, y) {
  tt = sum(x & y)
  tf = sum(x & !y)
  ft = sum(!x & y)
  ff = sum(!x & !y)
  return(matrix(c(ff, tf, ft, tt), 2, 2))
}

basic.tab2 = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
  ft = sum(dif < 0)
  tt = sum(x*y)
  ff = length(dif) - tt - tf - ft
  return(c(tf, ft, tt, ff))
}
tabulate(manual + auto *2+1, 4)
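The tabulate() one-liner is terse, so here is a small sketch of why it works, on two toy vectors of my own choosing (x and y are not from the earlier posts). Logicals are coerced to 0/1, so x + y * 2 + 1 maps each pair to a code from 1 to 4, and tabulate() counts the codes.

```r
# toy vectors: two TRUE/TRUE pairs and one of each other combination
x <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
y <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# FALSE/FALSE -> 1, TRUE/FALSE -> 2, FALSE/TRUE -> 3, TRUE/TRUE -> 4
code <- x + y * 2 + 1

# count how often each of the four codes occurs
counts <- tabulate(code, nbins = 4)

# filled column by column this has the layout of table(x, y):
# rows are x (FALSE, TRUE), columns are y (FALSE, TRUE)
tab <- matrix(counts, 2, 2,
              dimnames = list(x = c("FALSE", "TRUE"), y = c("FALSE", "TRUE")))
print(tab)
```

The order of the counts (ff, tf, ft, tt) is the same as in logical.tab() above.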

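My idea was to use the margins and work back from there: count the TRUEs in each vector and the TRUE/TRUE pairs, then derive the other three cells by subtraction.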
my.tab = function(x, y) {
  tt = sum(x * y)
  t1=sum(x)
  t2=sum(y)
  return(matrix(c(length(x)-t1-t2+tt,  t1-tt, t2-tt, tt), 2, 2))
}

my.tab2 <- function(x, y) {
  phase1 <- colSums(cbind(x,y,x*y))
  return(matrix(c(length(x)-sum(phase1[-3])+phase1[3],
     phase1[-3]-phase1[3],
     phase1[3]),2,2))
}
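As a quick sanity check, the margin approach can be compared against table() on shorter vectors. This is only a sketch: my.tab() is repeated so the snippet runs on its own, and the vectors x and y with their length and seed are arbitrary choices for the check.

```r
# my.tab() as defined above, repeated to keep the snippet self-contained
my.tab = function(x, y) {
  tt = sum(x * y)   # TRUE/TRUE pairs
  t1 = sum(x)       # margin: TRUEs in x
  t2 = sum(y)       # margin: TRUEs in y
  return(matrix(c(length(x) - t1 - t2 + tt, t1 - tt, t2 - tt, tt), 2, 2))
}

set.seed(1)
x <- sample(c(TRUE, FALSE), 1000, replace = TRUE)
y <- sample(c(TRUE, FALSE), 1000, replace = TRUE)

fast <- my.tab(x, y)
# fix the factor levels so table() always returns a full 2 x 2 table
slow <- table(factor(x, levels = c(FALSE, TRUE)),
              factor(y, levels = c(FALSE, TRUE)))

# both are filled in the order ff, tf, ft, tt
same <- all(as.vector(fast) == as.vector(slow))
print(same)
```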
On my hardware table() is too slow to microbenchmark with many repetitions, so I used only 20. Still, my.tab() comes out more than 80 times faster than table(), which is not bad.
library(microbenchmark)
microbenchmark(
    logical.tab(manual, auto), 
    basic.tab2(manual, auto),
    my.tab(manual,auto),
    my.tab2(manual,auto),
    tabulate(manual + auto *2+1, 4),
    table(manual,auto),
    times = 20)
Unit: milliseconds
                               expr        min         lq     median         uq        max neval
          logical.tab(manual, auto)  2852.5587  2888.8590  2906.4571  2972.3916  3227.0821    20
           basic.tab2(manual, auto)   705.8153   722.5800   746.1683   765.9400   957.5435    20
               my.tab(manual, auto)   185.8359   186.6829   188.0988   224.2308   413.5623    20
              my.tab2(manual, auto)   463.2731   481.8843   487.7825   512.2563   694.1729    20
 tabulate(manual + auto * 2 + 1, 4)   276.1837   300.8009   315.9451   379.7302   534.7997    20
                table(manual, auto) 15703.0576 16132.0100 16231.3342 16466.7445 19012.0273    20
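To put the timings in proportion, the medians can be divided by the fastest one. The numbers below are copied by hand from the output above; the variable names med and speed_ratio are only for this sketch.

```r
# median timings in milliseconds, copied from the microbenchmark output
med <- c(logical.tab = 2906.4571,
         basic.tab2  = 746.1683,
         my.tab      = 188.0988,
         my.tab2     = 487.7825,
         tabulate    = 315.9451,
         table       = 16231.3342)

# how many times slower each approach is than the fastest one
speed_ratio <- round(med / min(med), 1)
print(sort(speed_ratio))
```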
