Sunday, June 28, 2015

Deaths in the Netherlands by cause and age

I downloaded counts of deaths by age, year and major cause from the Dutch statistics site. In this post I make some plots to look at causes and at changes over the years.

Data 

Data from CBS. I downloaded the data in Dutch, hence the first thing to do was to provide some kind of translation. The coding used seems slightly different from the ICD-10 main categories (and, compared to ICD-10, the alphabetical ordering of the Dutch labels scrambles the order). I used Google Translate and ICD-10 to obtain the translations.

Plots

Preparation

In the following I use both percentage of population and percentage of deaths within each age cohort. The percentage of deaths is needed because in some cohorts the death rates are much higher, which would hide anything happening in the other cohorts. For visual clarity only the eight most important causes are shown in each plot; a sketch of how these are selected is given below.
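As a rough sketch of that selection (the plotting code in the Code section applies the same idea per subset of age cohorts), causes are ranked by the maximum percentage they ever reach and the top eight are kept; perc.of.pop refers to the data frame built in the Code section below:

library(dplyr)
top8 <- perc.of.pop %>%
    group_by(Causes) %>%
    summarize(mp = max(Percentage, na.rm=TRUE)) %>%   # highest percentage ever reached per cause
    mutate(rk = rank(-mp)) %>%                        # rank causes, largest first
    filter(rk <= 8)                                   # keep the eight most important causes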

Young

It seems that most risks are associated with birth, and these risks have been decreasing steadily.
Looking at the age cohorts above 0 years, it seems accidents are the most important cause. Most remarkable is a spike at 1953, which occurs for all four ages. After some consideration, I link this to the North Sea flood of 1953. It is remarkable that this is visible in the plot; that it is says a lot about how safe we otherwise are from accidents. In the age category 15 to 20 there is also a relatively large bump during 1970 to 1975. This is more difficult to explain, but I suspect traffic, especially the moped: a light motorcycle which was preferably boosted to run much faster than legally allowed. In 1975 wearing a helmet became compulsory. It was much hated at the time, but in hindsight I can see that the government felt compelled to do something and that it did have an effect.
Looking at the plots, it seems the next big cause is neoplasms. This is not because they have become more deadly, but because accidents are being brought under control.

Elderly

For the elderly, diseases of the circulatory system are the main cause, and they are decreasing quite a bit. The category Symptoms and Abnormal Clinical Observations seems to decrease too; since this seems to be a polite name for the 'other' category, this may reflect better diagnostics.
What is less visible is the increase in mental and behavioral disorders, especially after 1980 and at the oldest ages. It also seems that neoplasms are decreasing very slowly.

Code

data reading

library(dplyr)
library(ggplot2)
txtlines <- readLines('Overledenen__doodsoo_170615161506.csv')
# drop the CBS footer lines containing 'Centraal'
txtlines <- grep('Centraal',txtlines,value=TRUE,invert=TRUE) 
#txtlines[1:5]
#cat(txtlines[4])
# read the semicolon separated data, skipping the first two (header) lines
r1 <- read.csv(sep=';',header=FALSE,
        col.names=c('Causes','Causes2','Age','year','aantal','count'),
        na.strings='-',text=txtlines[3:length(txtlines)]) %>%
    select(.,-aantal,-Causes2)
transcauses <- c(
    "Infectious and parasitic diseases",
    "Diseases of skin and subcutaneous",
    "Diseases musculoskeletal system and connective ",
    "Diseases of the genitourinary system",
    "Pregnancy, childbirth",
    "Conditions of perinatal period",
    "Congenital abnormalities",
    "Sympt., Abnormal clinical Observations",
    "External causes of death",
    "Neoplasms",
    "Illness of blood, blood-forming organs",
    "Endocrine, nutritional, metabolic illness",
    "Mental and behavioral disorders",
    "Diseases of the nervous system and sense organs",
    "Diseases of the circulatory system",
    "Diseases of the respiratory organs",
    "Diseases of the digestive organs",
    "Population",
    "Total all causes of death")
#cc <- cbind(transcauses,levels(r1$Causes))
#options(width=100)
levels(r1$Causes) <- transcauses
levels(r1$Age) <- 
    gsub('jaar','year',levels(r1$Age)) %>%
    gsub('tot','to',.) %>%
    gsub('of ouder','+',.) 

Preparation for plots

# percentage of deaths: the denominator is the total number of deaths in the cohort
perc.of.death <- filter(r1,Causes=='Total all causes of death') %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[1:17]) %>%
    mutate(.,Percentage=100*count/Population,
        Causes = factor(Causes),
        year = as.numeric(gsub('*','',year,fixed=TRUE))
    )
# percentage of population: the denominator is the cohort population
perc.of.pop <- filter(r1,Causes=='Population') %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[1:17]) %>%
    mutate(.,Percentage=100*count/Population,
        Causes = factor(Causes),
        year = as.numeric(gsub('*','',year,fixed=TRUE))
    )

young

png('youngpop1.png')
tmp1 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[c(1,2,11,3)],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age,levels=levels(perc.of.pop$Age)[c(1,2,11,3)]),
        Causes =factor(Causes)) 
# select the eight 'important' causes (largest maximum percentage in these cohorts)
group_by(tmp1,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp1) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
dev.off()
###
png('youngpop2.png')
tmp1 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[c(2,11,3,4)],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age,levels=levels(perc.of.pop$Age)[c(2,11,3,4)]),
        Causes =factor(Causes)) 
# select the eight 'important' causes (largest maximum percentage in these cohorts)
group_by(tmp1,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp1) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
# https://en.wikipedia.org/wiki/North_Sea_flood_of_1953
dev.off()

old

png('oldpop.png')
tmp2 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[18:21],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age),
        Causes =factor(Causes)) 
group_by(tmp2,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp2) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
dev.off()

png('oldpop2.png')
tmp2 <- perc.of.pop %>% 
    filter(.,
        Age %in% levels(perc.of.pop$Age)[18:21],
        year>=1980,
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age),
        Causes =factor(Causes)) 
group_by(tmp2,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp2) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')

dev.off()
Sunday, June 21, 2015

SAS PROC MCMC example 12 in R: Change point model

I have restarted working my way through the PROC MCMC examples. The SAS manual describes this example: consider the data set from Bacon and Watts (1971), where $y_i$ is the logarithm of the height of the stagnant surface layer and the covariate $x_i$ is the logarithm of the flow rate of water. 
It is a simple example. It posed no problems at all for STAN and JAGS. LaplacesDemon, on the other hand, gave me some trouble: it took quite some effort to obtain samples that seemed well behaved. I did not try MCMCpack, but noted that its function MCMCregressChange() uses a slightly different model. The sections below first show the results; the code is given at the bottom.
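For reference, all three implementations below fit the same two-segment regression with a common value $\alpha$ at the change point $c_p$ and residual variance $s^2$; this matches the yhat lines in the code (whether the boundary point $x_i = c_p$ falls in the first or second segment differs trivially between the implementations):

$$y_i \sim \mathrm{N}(\mu_i, s^2), \qquad \mu_i = \begin{cases} \alpha + \beta_1 (x_i - c_p) & \text{if } x_i < c_p \\ \alpha + \beta_2 (x_i - c_p) & \text{if } x_i \ge c_p \end{cases}$$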

Previous posts in the series PROC MCMC examples programmed in R were: example 61.1: sampling from a known density; example 61.2: Box-Cox transformation; example 61.5: Poisson regression; example 61.6: Non-Linear Poisson Regression; example 61.7: Logistic Regression Random-Effects Model; and example 61.8: Nonlinear Poisson Regression Multilevel Random-Effects Model.

Data

Data are read as below.
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)

LaplacesDemon

I have been playing around with LaplacesDemon. There is actually a function LaplacesDemon.hpc which can use multiple cores. However, on my computer it seems more efficient to just use mclapply() from the parallel package and give the result the class demonoid.hpc (the class of LaplacesDemon.hpc output). Having said that, I again had quite some trouble getting LaplacesDemon to work well. In the end I used a combination of two calls to LaplacesDemon. The plot below shows selected samples after the first run: not good enough yet, but I do like this way of displaying the results of the chains. The plots for the other parameters told the same story, but showing all of them would give too much output for this blog. After the second call the results looked acceptable.

Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = apply(cc1$Posterior1,
    2, median), Covar = var(cc1$Posterior1), Iterations = 1e+05,
    Algorithm = "RWM")

Acceptance Rate: 0.2408
Algorithm: Random-Walk Metropolis
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
       alpha        beta1        beta2           cp           s2
4.920676e-04 2.199525e-04 3.753738e-04 8.680339e-04 6.122862e-08

Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
          All Stationary
Dbar -144.660   -144.660
pD      7.174      7.174
DIC  -137.486   -137.486
Initial Values:
        alpha         beta1         beta2            cp            s2
 0.5467048515 -0.4100040451 -1.0194586232  0.0166405998  0.0004800931

Iterations: 4e+05
Log(Marginal Likelihood): 56.92606
Minutes of run-time: 1.32
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 5
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 0
Recommended Burn-In of Un-thinned Samples: 0
Recommended Thinning: 5
Specs: (NOT SHOWN HERE)
Status is displayed every 100 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 40000
Thinning: 1


Summary of All Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01


Summary of Stationary Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01

STAN

Stan did not give much trouble for this analysis.
Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Beta[1] -0.42    0.00 0.01 -0.45 -0.43 -0.42 -0.41 -0.39  1017 1.00
Beta[2] -1.01    0.00 0.02 -1.05 -1.02 -1.01 -1.00 -0.98  1032 1.00
Alpha    0.54    0.00 0.03  0.49  0.52  0.53  0.55  0.59   680 1.00
s2       0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  1361 1.00
cp       0.03    0.00 0.03 -0.04  0.00  0.03  0.05  0.09   636 1.00
lp__    90.63    0.06 1.78 86.17 89.71 91.00 91.91 93.03   935 1.01

Samples were drawn using NUTS(diag_e) at Fri Jun 19 21:17:54 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

JAGS

Again no problems for Jags.
Inference for Bugs model at "/tmp/Rtmpy4a6C5/modeld4f6e9c9055.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 4000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
alpha       0.534   0.027    0.479    0.518    0.533    0.552    0.586 1.040
beta[1]    -0.420   0.015   -0.450   -0.429   -0.420   -0.410   -0.389 1.013
beta[2]    -1.014   0.017   -1.049   -1.024   -1.014   -1.003   -0.980 1.023
cp          0.029   0.035   -0.038    0.006    0.032    0.051    0.100 1.037
s2          0.000   0.000    0.000    0.000    0.000    0.001    0.001 1.004
deviance -144.501   3.986 -149.634 -147.378 -145.432 -142.584 -134.378 1.021
         n.eff
alpha      160
beta[1]    380
beta[2]    290
cp         160
s2         710
deviance   290

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 7.9 and DIC = -136.6
DIC is an estimate of expected predictive error (lower deviance is better).

CODE used

# http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_mcmc_examples18.htm
# Example 61.12 Change Point Models
##############
#Data                                                   
##############
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)
#plot(y~x,data=stagnant)
##############
#LaplacesDemon                                                   
##############
library('LaplacesDemon')
library(parallel)

mon.names <- "LP"
parm.names <- c('alpha',paste('beta',1:2,sep=''),'cp','s2')

PGF <- function(Data) {
  x <-c(rnorm(5,0,1))
  x[4] <- runif(1,-1.3,1.1)
  x[5] <- runif(1,0,2)
  x
}
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    PGF=PGF,
    x=stagnant$x,
    y=stagnant$y)
#N<-1
Model <- function(parm, Data)
{
  alpha=parm[1]
  beta=parm[2:3]
  cp=parm[4]
  s2=parm[5]
  yhat <- alpha+(Data$x-cp)*beta[1+(Data$x>=cp)]
  LL <- sum(dnorm(Data$y,yhat,sd=sqrt(s2),log=TRUE))
  prior <- sum(dnorm(parm[1:3],0,1e3,log=TRUE))+
      dunif(cp,-1.3,1.1,log=TRUE)+
      dunif(s2,0,5,log=TRUE)
  LP=LL+prior
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
      yhat=yhat,
      parm=parm)
  return(Modelout)
}
Fit1 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
    Data=MyData,
    Iterations=100000,
    Algorithm='RWM',
    Covar=c(rep(.01,4),.00001),
    Initial.Values = c(.5,-.4,-1,.05,.001)) #Initial.Values 
)
class(Fit1) <- 'demonoid.hpc'
plot(Fit1,Data=MyData,Parms=c('alpha'))
cc1 <- Combine(Fit1,MyData)
#
Fit2 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
          Data=MyData,
          Iterations=100000,
          Algorithm='RWM',
          Covar=var(cc1$Posterior1),
          Initial.Values = apply(cc1$Posterior1,2,median)) #Initial.Values 
)
class(Fit2) <- 'demonoid.hpc'
#plot(Fit2,Data=MyData,Parms=c('alpha'))
cc2 <- Combine(Fit2,MyData)
cc2
##############
#STAN                                                   
##############
stanData <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

library(rstan)
smodel <- '
    data {
    int <lower=1> N;
    vector[N] x;
    vector[N] y;
    }
    parameters {
    real Beta[2];
    real Alpha;
    real <lower=0> s2;
    real <lower=-1.3,upper=1.1> cp;
    }
    transformed parameters {
    vector[N] yhat;
    for (i in 1:N)
       yhat[i] <- Alpha+(x[i]-cp)*Beta[1+(x[i]>cp)];
    }
    model {
    y ~ normal(yhat,sqrt(s2));
    s2 ~ uniform(0,1e3);
    cp ~ uniform(-1.3,1.1);
    Alpha ~ normal(0,1000);
    Beta ~ normal(0,1000);
    }
    '
fstan <- stan(model_code = smodel,
    data = stanData,
    pars=c('Beta','Alpha','s2','cp'))
fstan
##############
#Jags                                                  
##############
library(R2jags)
jagsdata <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

jagsm <- function() {
  for(i in 1:N) {
    yhat[i] <- alpha+(x[i]-cp)*beta[1+(x[i]>cp)]
    y[i] ~ dnorm(yhat[i],tau)
  }
  tau <- 1/s2
  s2 ~  dunif(0,1e3)
  cp ~ dunif(-1.3,1.1)
  alpha ~ dnorm(0,0.001)
  beta[1] ~ dnorm(0,0.001)
  beta[2] ~ dnorm(0,0.001)
}
params <- c('alpha','beta','s2','cp')
myj <- jags(model=jagsm,
    data = jagsdata,
    n.chains=4,
    n.iter=10000,
    parameters=params)
myj

Sunday, June 14, 2015

Parallel and a new laptop

I am thinking about a new laptop. For one thing, a 1366x768 resolution just seems to be getting impractically small. Secondly, I want faster computations and more memory.
Regarding CPU speed, my current laptop has a lowly Celeron 877. From what I see of my computer's activity, under R it is mostly one core that does the work. This means that even though there are two cores, the single core CPU mark of 715 (from cpubenchmark.net) is what I have available. A bit of checking shows that the current batch of processors mainly offers more cores. For instance, the highest rated common CPU, an Intel Core i7-4710HQ, has a CPU mark of 7935 and a single core mark of 1870. That is roughly 2.6 times faster for one core, but most of its advantage comes from having four cores. The same is true down the line: four cores is common, but single core speed has not improved that much. Unless I can actually use those extra cores, what is the gain? Hence I am wondering: can I do something with the extra cores for real world R computations? That is something I can investigate.

Easy approach, Parallel

A bit of browsing shows that the parallel package is the easy way to use multiple cores: think of using mclapply() rather than lapply(). In many situations this is easy; cross validation, for instance, only has the small upfront cost of partitioning the data into chunks. Trying different settings for a machine learning problem is similar. A minimal sketch of such a parallel cross validation is given below.
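To make that concrete, below is a minimal sketch of a parallel five-fold cross validation with mclapply(); the data frame mydata, its response column y, the number of folds and mc.cores = 2 are assumptions for illustration only.

library(parallel)
library(randomForest)
# hypothetical data frame 'mydata' with numeric response 'y'
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out=nrow(mydata)))
cv.mse <- mclapply(1:k, function(i) {
      # fit on all folds except i, then predict the held-out fold
      fit <- randomForest(y~., data=mydata[folds!=i,], ntree=50)
      pred <- predict(fit, mydata[folds==i,])
      mean((pred - mydata$y[folds==i])^2)
    },
    mc.cores=2)
mean(unlist(cv.mse))

Each fold runs on its own core, so with enough cores the elapsed time is roughly that of a single fit plus the overhead of copying the data to the workers.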
To give this a real world setting, data were taken from the UCI machine learning repository: the Physicochemical Properties of Protein Tertiary Structure data set, which has 45730 rows and 9 variables. A bit of plotting (the figure shows 2000 randomly selected rows) suggests that the problem is not so much which variables to use, but rather their interactions. This was also suggested by the poor performance of linear regression.

Random forest in parallel

Even though nine variables is a bit low for a random forest, I elected to use it as the first technique. The main parameters to tune are nodesize and the number of variables to try at each split (mtry). Hence I wrapped this in mclapply(), not even using cross validation and taking care not to nest the mclapply() calls. The result was heavy memory usage, which in hindsight may be obvious: each of the instances gets a complete copy of the data set. The net effect is that I ran out of RAM and data was swapped, which cannot be good for performance. It may also explain comments I have read that the caret package uses too much memory. A decent set of hardware for machine learning, including a four core processor, would create four instances of the same data. Perhaps adding another 4 GB of memory and an SSD rather than an HDD would serve me just as well as a new laptop...
library(parallel)
library(randomForest)
# 'train' is the protein data set; the response is in the first column
# grid of tuning settings, one randomForest fit per row, run in parallel
tol <- expand.grid(mtry=1:3,
    nodesize=c(3,5,10))
bomen <- mclapply(seq_len(nrow(tol)),function(i)
          randomForest(
              y=train[,1],
              x=train[,-1],
              ntree=50,
              mtry=tol$mtry[i],
              nodesize=tol$nodesize[i])
)

Final thoughts

New hardware could also bring GPU computing into the picture, but that seems less easy. It is unclear to me whether CUDA or OpenCL is preferable, and neither seems particularly easy to use. Then again, I could minimize hardware usage, buy a Chromebook with a decent screen and do my stuff in the cloud. For now, though, I will continue to investigate how extra cores can help me.

Sunday, June 7, 2015

European debt and interest

I was told the eurostat package would be interesting for me. This is indeed true, and now I want to use it to plot some data related to the core of some European policies: debt.
In these plots I only show individual countries, not aggregates over the EU or Euro zone. In addition Norway is dropped, because it has less data, is not an EU country and has a fairly different financial position from the rest of Europe. This leaves 28 countries, which can be displayed in a grid of 4 by 7.

Lending and borrowing

In lending and borrowing we see the troubles of the crisis. Especially in Ireland there is a spike of increased borrowing; going from net lender to borrowing 30% of GDP in a few years is massive. The same can be seen in Spain and the UK. In fact, only a few countries did not borrow more in that period.
The plot has an additional line: in red, 3% borrowing is depicted, 3% being the limit in the Stability and Growth Pact. It can be seen that quite a few countries are under or near that 3%, or edging closer to it.

Debt

The consequence of all that borrowing is debt. All countries have debt. The red line is placed at 60%, the upper limit in the Euro zone Stability and Growth Pact. Many countries had debt under 60% before the crisis, for example Spain, the Netherlands and Ireland. Italy and Belgium were well above 60% and saw some increase. Germany, Europe's economic miracle, had debt over 60%. There seems to be no obvious link between debt before the crisis and debt after it.

Interest

The consequence of debt is interest. This is probably the most remarkable set of data. There are no targets here. What is visible is the decrease in interest payments for many countries at the end of the last century; here the positive influence of the Euro shows. But the strange thing is that many countries hardly suffered increased interest payments during the crisis at all. Italy has paid approximately 5% of GDP for more than 10 years now. The Netherlands has a decreasing curve, which was flattened by the crisis and is now decreasing again. In summary, many countries currently pay historically low interest.
Part of this is obviously the work of various policies to contain the crisis. But it can also mean that debt may not be the biggest problem Europe faces at this moment.

Code

The code is fairly simple. What is needed is to retrieve what the codes mean. For countries these are extracted from the database; for the other codes it is easiest to open the table on the Eurostat website and see which codes are of interest.
The dplyr package allows chaining selection and plotting in one call, thereby eliminating the chance of not updating intermediate data frames.
library(eurostat)
library(dplyr)
library(ggplot2)
library(scales)

r1 <- get_eurostat('gov_10dd_edpt1')

# add country names
r2 <- get_eurostat_dic('geo') %>%
    mutate(.,
        geo=V1,
        country=V2,
        country=gsub('\\(.*$','',country)) %>%
    select(.,geo,country) %>%
    merge(.,r1) %>%
# filter countries
    filter(.,
        !grepl('EA.*',geo),
        !grepl('EU.*',geo),
        geo!='NO')

filter(r2,
        sector=='S13', # general government
        na_item=='B9', # Net lending (+) /net borrowing (-)
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) + 
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Net lending (+) /net borrowing (-)') +
    xlab('Year') +
    geom_hline(yintercept=-3,colour='red') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))

#########
filter(r2,
        sector=='S13', # general government
        na_item=='GD', # Gross debt
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) + 
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Gross Debt') +
    xlab('Year') +
    geom_hline(yintercept=60,colour='red') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))

#########
filter(r2,
        sector=='S13', # general government
        na_item=='D41PAY', # Interest, payable
        unit=='PC_GDP' # % GDP
    ) %>%
    ggplot(.,aes(x=time,y=values)) +
    geom_line()+
    ylab('% GDP')+
    facet_wrap(~country,nrow=4) +
    ggtitle('Interest, payable') +
    xlab('Year') +
    scale_x_date(
        breaks=c(as.Date("2000-01-01"),as.Date("2010-01-01") )
        ,labels = date_format("%Y"))

Sunday, May 31, 2015

Paper Helicopter Experiment, part III

As the final part of my paper helicopter experiment analysis (part I, part II) I do a reanalysis of one more data set. In 2002 Erik Erhardt and Hantao Mai did an extensive experiment, see The Search for the Optimal Paper Helicopter. They performed a number of steps, including variable screening, steepest ascent and a confirmatory experiment. For my part, I have combined all those data into one data set and checked what kind of model would be needed.

Data

The extracted data contain 45 observations. These include a number of replications; for instance, a central composite design has a replicated center point, and the optimum found was tested repeatedly.
After creating a factor combining all variables it is pretty easy to examine the replications, which are shown below. The first eight variables are the experimental settings, allvl is the factor combining all levels, Time is the response and Freq the frequency of occurrence of allvl:
   RotorLength RotorWidth BodyLength FootLength FoldLength FoldWidth
1         8.50       4.00        3.5       1.25          8       2.0
2         8.50       4.00        3.5       1.25          8       2.0
3         8.50       4.00        3.5       1.25          8       2.0
4         8.50       4.00        3.5       1.25          8       2.0
5         8.50       4.00        3.5       1.25          8       2.0
6         8.50       4.00        3.5       1.25          8       2.0
7        11.18       2.94        2.0       2.00          6       1.5
8        11.18       2.94        2.0       2.00          6       1.5
9        11.18       2.94        2.0       2.00          6       1.5
10       11.18       2.94        2.0       2.00          6       1.5
11       11.18       2.94        2.0       2.00          6       1.5
12       11.18       2.94        2.0       2.00          6       1.5
13       11.50       2.83        2.0       1.50          6       1.5
14       11.50       2.83        2.0       1.50          6       1.5
15       11.50       2.83        2.0       1.50          6       1.5
   PaperWeight DirectionOfFold                                 allvl  Time Freq
1        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 13.88    3
2        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 15.91    3
3        heavy         against  8.5.4.0.3.5.1.2. 8.2.0.heavy.against 16.08    3
4        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.52    3
5        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.81    3
6        light         against  8.5.4.0.3.5.1.2. 8.2.0.light.against 10.89    3
7        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 17.29    6
8        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 19.41    6
9        light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 18.55    6
10       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 15.54    6
11       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 16.40    6
12       light         against 11.2.2.9.2.0.2.0. 6.1.5.light.against 19.67    6
13       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 16.35    3
14       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 16.41    3
15       light         against 11.5.2.8.2.0.1.5. 6.1.5.light.against 17.38    3

Transformation

It is also possible to do a regression of Time against allvl and examine the residuals. Since it is not difficult to imagine that the error is proportional to the elapsed time, this is done for both the original data and log10 transformed data.
As can be seen, larger values seem to have larger errors, but this is not corrected very much by a log transformation. To examine this a bit more, the Box-Cox transformation is used. From that it seems a square root is almost optimal, while log and no transformation should also work. It was decided to use a square root transformation.
Given the square root transformation, the residual variance should not be lower than 0.02, since that is what the replications give. On the other hand, a residual variance much higher than 0.02 is a clear sign of underfitting.
Analysis of Variance Table

Response: sqrt(Time)
          Df  Sum Sq Mean Sq F value    Pr(>F)    
allvl      3 1.84481 0.61494      26 2.707e-05 ***
Residuals 11 0.26016 0.02365                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model selection

Given the desired residual variance, the model linear in the variables is not sufficient.
Analysis of Variance Table

Response: sTime
                Df Sum Sq Mean Sq F value    Pr(>F)    
RotorLength      1 3.6578  3.6578 18.4625 0.0001257 ***
RotorWidth       1 1.0120  1.0120  5.1078 0.0299644 *  
BodyLength       1 0.1352  0.1352  0.6823 0.4142439    
FootLength       1 0.2719  0.2719  1.3725 0.2490708    
FoldLength       1 0.0060  0.0060  0.0302 0.8629331    
FoldWidth        1 0.0189  0.0189  0.0953 0.7592922    
PaperWeight      1 0.6528  0.6528  3.2951 0.0778251 .  
DirectionOfFold  1 0.4952  0.4952  2.4994 0.1226372    
Residuals       36 7.1324  0.1981                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding interactions and quadratic effects via stepwise regression did not improve much.
Analysis of Variance Table

Response: sTime
                       Df Sum Sq Mean Sq F value    Pr(>F)    
RotorLength             1 3.6578  3.6578 29.5262 3.971e-06 ***
RotorWidth              1 1.0120  1.0120  8.1687  0.007042 ** 
FootLength              1 0.3079  0.3079  2.4851  0.123676    
PaperWeight             1 0.6909  0.6909  5.5769  0.023730 *  
I(RotorLength^2)        1 2.2035  2.2035 17.7872  0.000159 ***
I(RotorWidth^2)         1 0.3347  0.3347  2.7018  0.108941    
FootLength:PaperWeight  1 0.4291  0.4291  3.4634  0.070922 .  
RotorWidth:FootLength   1 0.2865  0.2865  2.3126  0.137064    
Residuals              36 4.4598  0.1239                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just adding the quadratic effects did not help either. However, using both linear and quadratic as a starting point did give a more extensive model.
Analysis of Variance Table

Response: sTime
                            Df Sum Sq Mean Sq  F value    Pr(>F)    
RotorLength                  1 3.6578  3.6578 103.8434 5.350e-10 ***
RotorWidth                   1 1.0120  1.0120  28.7293 1.918e-05 ***
FootLength                   1 0.3079  0.3079   8.7401 0.0070780 ** 
FoldLength                   1 0.0145  0.0145   0.4113 0.5276737    
FoldWidth                    1 0.0099  0.0099   0.2816 0.6007138    
PaperWeight                  1 0.7122  0.7122  20.2180 0.0001633 ***
DirectionOfFold              1 0.5175  0.5175  14.6902 0.0008514 ***
I(RotorLength^2)             1 1.7405  1.7405  49.4119 3.661e-07 ***
I(RotorWidth^2)              1 0.3160  0.3160   8.9709 0.0064635 ** 
I(FootLength^2)              1 0.1216  0.1216   3.4525 0.0760048 .  
I(FoldLength^2)              1 0.0045  0.0045   0.1272 0.7245574    
RotorLength:RotorWidth       1 0.4181  0.4181  11.8693 0.0022032 ** 
RotorLength:PaperWeight      1 0.3778  0.3778  10.7247 0.0033254 ** 
RotorWidth:FootLength        1 0.6021  0.6021  17.0947 0.0004026 ***
PaperWeight:DirectionOfFold  1 0.3358  0.3358   9.5339 0.0051968 ** 
RotorWidth:FoldLength        1 1.5984  1.5984  45.3778 7.167e-07 ***
RotorWidth:FoldWidth         1 0.3937  0.3937  11.1769 0.0028207 ** 
RotorWidth:PaperWeight       1 0.2029  0.2029   5.7593 0.0248924 *  
RotorWidth:DirectionOfFold   1 0.0870  0.0870   2.4695 0.1297310    
RotorLength:FootLength       1 0.0687  0.0687   1.9517 0.1757410    
FootLength:PaperWeight       1 0.0732  0.0732   2.0781 0.1629080    
Residuals                   23 0.8102  0.0352                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This model is quite extensive. For prediction purposes I would probably drop a few terms; for instance, FootLength:PaperWeight could be removed, which lessens the fit yet may improve predictions, since its p-value is close to 0.15. A sketch of dropping that term is given below. As it stands the model does have some issues; for instance, quite a few points have high leverage.
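As a sketch of that simplification, the term can be dropped from the stepwise model with update() and the two fits compared; s2 refers to the model object created in the code section below, and the leverage remark can be checked with the standard regression diagnostics.

s3 <- update(s2, . ~ . - FootLength:PaperWeight)
anova(s2, s3)        # how much fit is lost by dropping the term
plot(s3, which=5)    # residuals versus leverage, to inspect the high leverage points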

Conclusion

The paper helicopter needs quite a complex model to fit all effects on flying time. This somewhat validates the complex models found in part 1. 

Code used

library(dplyr)
library(car)
h3 <- read.table(sep='\t',header=TRUE,text='
        RotorLength RotorWidth BodyLength FootLength FoldLength FoldWidth PaperWeight DirectionOfFold Time
        5.5 3 1.5 0 5 1.5 light against 11.8
        5.5 3 1.5 2.5 11 2.5 heavy against 8.29
        5.5 3 5.5 0 11 2.5 light with 9
        5.5 3 5.5 2.5 5 1.5 heavy with 7.21
        5.5 5 1.5 0 11 1.5 heavy with 6.65
        5.5 5 1.5 2.5 5 2.5 light with 10.26
        5.5 5 5.5 0 5 2.5 heavy against 7.98
        5.5 5 5.5 2.5 11 1.5 light against 8.06
        11.5 3 1.5 0 5 2.5 heavy with 9.2
        11.5 3 1.5 2.5 11 1.5 light with 19.35
        11.5 3 5.5 0 11 1.5 heavy against 12.08
        11.5 3 5.5 2.5 5 2.5 light against 20.5
        11.5 5 1.5 0 11 2.5 light against 13.58
        11.5 5 1.5 2.5 5 1.5 heavy against 7.47
        11.5 5 5.5 0 5 1.5 light with 9.79
        11.5 5 5.5 2.5 11 2.5 heavy with 9.2
        8.5 4 3.5 1.25 8 2 light against 10.52
        8.5 4 3.5 1.25 8 2 light against 10.81
        8.5 4 3.5 1.25 8 2 light against 10.89
        8.5 4 3.5 1.25 8 2 heavy against 15.91
        8.5 4 3.5 1.25 8 2 heavy against 16.08
        8.5 4 3.5 1.25 8 2 heavy against 13.88
        8.5 4 2 2 6 2 light against 12.99
        9.5 3.61 2 2 6 2 light against 15.22
        10.5 3.22 2 2 6 2 light against 16.34
        11.5 2.83 2 2 6 1.5 light against 18.78
        12.5 2.44 2 2 6 1.5 light against 17.39
        13.5 2.05 2 2 6 1.5 light against 7.24
        10.5 2.44 2 1.5 6 1.5 light against 13.65
        12.5 2.44 2 1.5 6 1.5 light against 13.74
        10.5 3.22 2 1.5 6 1.5 light against 15.48
        12.5 3.22 2 1.5 6 1.5 light against 13.53
        11.5 2.83 2 1.5 6 1.5 light against 17.38
        11.5 2.83 2 1.5 6 1.5 light against 16.35
        11.5 2.83 2 1.5 6 1.5 light against 16.41
        10.08 2.83 2 1.5 6 1.5 light against 12.51
        12.91 2.83 2 1.5 6 1.5 light against 15.17
        11.5 2.28 2 1.5 6 1.5 light against 14.86
        11.5 3.38 2 1.5 6 1.5 light against 11.85
        11.18 2.94 2 2 6 1.5 light against 15.54
        11.18 2.94 2 2 6 1.5 light against 16.4
        11.18 2.94 2 2 6 1.5 light against 19.67
        11.18 2.94 2 2 6 1.5 light against 19.41
        11.18 2.94 2 2 6 1.5 light against 18.55
        11.18 2.94 2 2 6 1.5 light against 17.29
        ')
names(h3)

h3 <- h3 %>% 
    mutate(.,
        FRL=factor(format(RotorLength,digits=2)),
        FRW=factor(format(RotorWidth,digits=2)),
        FBL=factor(format(BodyLength,digits=2)),
        FFt=factor(format(FootLength,digits=2)),
        FFd=factor(format(FoldLength,digits=2)),
        FFW=factor(format(FoldWidth,digits=2)),
        allvl=interaction(FRL,FRW,FBL,FFt,FFd,FFW,PaperWeight,DirectionOfFold,drop=TRUE)
    )

h4 <- xtabs(~allvl,data=h3) %>% 
    as.data.frame %>%
    filter(.,Freq>1) %>%
    merge(.,h3) %>%
    select(.,RotorLength,
        RotorWidth,BodyLength,FootLength,
        FoldLength,FoldWidth,PaperWeight,
        DirectionOfFold,allvl,Time,Freq) %>%
    print
lm(Time~allvl,data=h4) %>% anova

par(mfrow=c(1,2))
aov(Time~allvl,data=h3) %>% residualPlot(.,main='Untransformed')
aov(log10(Time)~allvl,data=h3) %>% residualPlot(.,main='Log10 Transform')

lm(Time ~   RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold,
        data=h3) %>%
    boxCox(.)
dev.off()

lm(sqrt(Time)~allvl,data=h4) %>% anova

h3 <- mutate(h3,sTime=sqrt(Time))

lm(sTime ~  RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold,
        data=h3) %>%
    anova

s1 <- lm(sTime ~ 
            RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold ,
        data=h3) %>%
    step(.,scope=~(RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth + 
              PaperWeight + DirectionOfFold)*
            (RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth)+
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
            PaperWeight*DirectionOfFold)
anova(s1)

s2 <- lm(sTime ~ 
            RotorLength + RotorWidth + BodyLength +
            FootLength + FoldLength + FoldWidth + 
            PaperWeight + DirectionOfFold +
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) ,
        data=h3) %>%
    step(.,scope=~(RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth + 
              PaperWeight + DirectionOfFold)*
            (RotorLength + RotorWidth + BodyLength +
              FootLength + FoldLength + FoldWidth)+
            I(RotorLength^2) + I(RotorWidth^2) + I(BodyLength^2) +
            I(FootLength^2) + I(FoldLength^2) + I(FoldWidth^2) +
            PaperWeight*DirectionOfFold)

anova(s2)
par(mfrow=c(2,2))
plot(s2)