Notes to myself: submitting R scripts to a cluster

In working with Bayesian spatial models in R, I’ve rapidly reached the point where I run out of RAM on my laptop. So I’ve been passing my code to an HPC cluster at Southampton to run the darned things in parallel. Here are some key things for making the process work.

1) Rscript: working with R as an interactive stats package, I sometimes forget that you can run R scripts straight from your operating system’s command line, such as a terminal window:

$ Rscript Your_program.R

The Rscript command tells the machine that “Your_program.R” is a script to be run non-interactively in R. Useful!

2) commandArgs(): What if you are running a bunch of versions of the same script, but want to change a parameter each time? The function commandArgs() picks up anything typed after the script name on the Rscript call and makes it available inside the R session that Your_program.R runs in. For example:

$ Rscript Your_program.R FirstArg 2 3000

combined with the following line in R,

Args=commandArgs(trailingOnly=TRUE)

makes a vector “Args” available within Your_program.R

> Args
[1] "FirstArg" "2" "3000"

You could then use Args for whatever devious intent you might have in mind:

> Dev.intent <- rnorm(as.numeric(Args[2]), mean = as.numeric(Args[3]))

Note that I’ve used as.numeric(): everything read in from the command line arrives as a character string by default (which is also why Args[1], “FirstArg”, is left as text). Neato!
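
In practice I also want each run to save its own output, so that parallel jobs don’t overwrite each other. Here’s a minimal sketch of what the top of Your_program.R might look like (the rnorm() call and file names are just placeholders):

# Your_program.R -- read the three command-line arguments
Args  <- commandArgs(trailingOnly = TRUE)
label <- Args[1]              # e.g. "FirstArg"; used to tag the output
n     <- as.numeric(Args[2])  # e.g. 2
mu    <- as.numeric(Args[3])  # e.g. 3000

# do the real work here; a stand-in computation:
result <- rnorm(n, mean = mu)

# write a separate output file for each job, named after its arguments
saveRDS(result, file = paste0("result_", label, ".rds"))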

3) Bash scripting: So you want to run Your_program.R 2000 times in parallel? You could go with a package like {doParallel} or {snow}, but that means telling the HPC job manager up front that you want XXX GB of RAM and YYY nodes, with a total of ZZZ processors, for 89:00:00 hours. An alternative is to break the job into bite-sized chunks and submit them all at once, so that each chunk can run whenever a small amount of space becomes available. To do this (in OS X/Linux) we need a wrapper script:

#!/bin/sh
# wrapper_script.sh: load the R module and run the program with three arguments
module load R
Rscript Your_program.R $1 $2 $3

which is saved as “wrapper_script.sh”. The $1 $2 $3 are placeholders for the arguments needed by the program to be run. We then need a job script to call the wrapper script as many times as needed, with the right arguments in place:

#!/bin/sh
# job_script.sh: submit one job per line of the arguments file
while read -r par1 par2 par3
do
  echo "./wrapper_script.sh $par1 $par2 $par3" | qsub
done < "$1"

Save this as job_script.sh. It then needs to be called as follows:

./job_script.sh Args_list.txt

where “Args_list.txt” is a text file with the arguments to be passed (eventually) to Your_program.R, one job per line:

FirstArg 2 3000
SecondArg 3 4000
ThirdArg 4 5000

So here, we will have three jobs submitted independently.
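
If there are many parameter combinations, “Args_list.txt” can be generated from R rather than typed by hand. A quick sketch that reproduces the three-line file above (the labels and values are just the example ones):

# build a table of argument sets (one job per row) and write it out
# in the whitespace-separated format that job_script.sh expects
args <- data.frame(label = c("FirstArg", "SecondArg", "ThirdArg"),
                   n     = c(2, 3, 4),
                   mu    = c(3000, 4000, 5000))
write.table(args, file = "Args_list.txt", quote = FALSE,
            row.names = FALSE, col.names = FALSE)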


Introducing INLA: Multinomial data with covariates

I’ve recently started using the R-INLA package developed by Havard Rue, Janine Illian, Finn Lindgren et al. It’s quite easy to start using, has a lot of depth, and is extremely useful for spatial data. For another project, I was trying to figure out how to fit a multinomial model in INLA to predict age structures as a function of covariates. While a good example exists on the website, it wasn’t immediately clear to me how to add covariates to the model. After a discussion with Havard Rue, I came up with this example, which shows how to add covariates and compares an analysis of simple data in INLA with the same analysis in another existing R package.

The analysis is based on the UCLA statistics example describing the use of the multinom() function from the nnet package; the same data are read in below.

require(foreign)
require(nnet)
require(xtable)
# read the UCLA hsbdemo example data and set "academic" as the reference category
ml <- read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
ml$prog2 <- relevel(ml$prog, ref = "academic")

This is what the data look like:

     id cid write
1    45   1    35
2   108   1    33
3    15   1    39
4    67   1    37
5   153   1    31
6    51   1    36

Here, “id” is the student-level identifier and “prog” (releveled as “prog2”) holds the possible choices or outcomes whose probabilities we want to estimate. We will use both of these to index the random effects appropriately. The continuous covariate “write” will be used in the analysis.

Because it makes the assumption that everything is based on a Gaussian random field, INLA likes it when everything is centered and normalized. To meet those assumptions, we will centre the data. As well, we need to state explicitly the number of individuals in each category for each observation. For these data, we have one choice per person, so we need to fill in some zeros:

ml$number = 1                   # the observed choice gets a count of 1
m1 = ml
m1$number = 0                   # copies for the unobserved categories get 0
m2 = m1
# rotate the category labels so that m1 and m2 carry the two
# categories that were *not* chosen for each student
m1$prog2 = ifelse(ml$prog2 == "academic", "general", ifelse(m1$prog2 == "general", 
    "vocation", "academic"))
m2$prog2 = ifelse(m1$prog2 == "academic", "general", ifelse(m1$prog2 == "general", 
    "vocation", "academic"))
mI = rbind(ml, m1, m2)          # one row per student x category

The centering function and transformation:

my.centre = function(x) x - mean(x)
mI$C.write = my.centre(mI$write)  # write repeats across the stacked copies, so the mean is unchanged
ml$C.write = my.centre(ml$write)

With the data massaging out of the way, we can run some models. First, with multinom(),

test2 <- multinom(prog2 ~ 1 + C.write, data = ml)

the results (from coef(test2), rounded) look like this:

         (Intercept) C.write
general        -0.77   -0.07
vocation       -0.86   -0.12

Now we can compare results from INLA, which also gives us a worked example of INLA code. The basic multinomial model is provided by the example in the FAQ section of the INLA site, which allows us to estimate an intercepts-only model. In order to add covariates, we need to recognize that in the Poisson approximation to the Multinomial distribution, the observed counts \( Y_{ij} \) in each of the \( J \) categories are treated as Poisson random variables with means satisfying the equation:

\[ \log(\mu_{ij}) = \nu + \theta_i + \alpha_j + x_i \beta_j, \]

where \( \nu \) is a common intercept (assumed to be 0), \( \theta_i \) accounts for the number of samples drawn for each Multinomial observation, \( \alpha_j \) specifies the relative strength of each category, and \( \beta_j \) is the coefficient of the covariate \( x_i \) for category \( j \). Only \( J-1 \) of the \( \beta_j \) are free: the \( J \)th category is treated as the reference, with its coefficient fixed at 0. So to estimate the influence of a covariate, we need to supply \( J-1 \) category-specific copies of it.
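
To see why this gives back the multinomial model, note that conditioning on the total count for individual \( i \) turns the Poisson means into multinomial probabilities, with \( \nu \) and \( \theta_i \) cancelling out:

\[ \pi_{ij} = \frac{\mu_{ij}}{\sum_{k=1}^{J}\mu_{ik}} = \frac{\exp(\alpha_j + x_i \beta_j)}{\sum_{k=1}^{J}\exp(\alpha_k + x_i \beta_k)}, \]

which is exactly the multinomial-logit form that multinom() fits.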

We therefore need some more data manipulation to get the appropriate dummy variables. Here, our reference is “academic”, and we assume the coefficient for the reference group is 0.

mI$C.write_pg = with(mI, C.write * (prog2 == "general"))
mI$C.write_pv = with(mI, C.write * (prog2 == "vocation"))

Next, a formula. INLA extends the standard formula() structure used by linear models in R, using the f() notation to define random effects. In this model we define random effects at both the observation (id) and category (prog2) levels, which is what lets us approximate the Multinomial as a set of Poisson distributions.

formula = number ~ -1 +
    f(id, model = "iid",
      hyper = list(prec = list(initial = log(1e-06), fixed = TRUE))) +
    f(prog2, model = "iid", constr = FALSE,
      hyper = list(prec = list(initial = log(0.001), fixed = TRUE))) +
    C.write_pg + C.write_pv

Note that we have added two covariates C.write_pg and C.write_pv to estimate coefficients specifically for each category level (except the reference). Now we can pass the formula and data to INLA:

require(INLA)
r <- inla(formula, family = "poisson", data = mI, control.compute = list(config = TRUE))

Some brief notes on notation: the formula, family and data arguments are self-explanatory, while control.compute = list(config = TRUE) stores the model configurations so that we can draw posterior samples (and hence model predictions) later on, via inla.posterior.sample(). Now we can compare the intercepts and coefficients from the multinom() and INLA models:

## Compare intercepts: both sets of probabilities need to sum to 1
INLA_Mn = r$summary.random$prog2$mean
INLA_res2 = exp(INLA_Mn)/sum(exp(INLA_Mn))

Multi_Mn = coef(test2)[, 1]
test2_res = exp(Multi_Mn)/(1 + sum(exp(Multi_Mn)))
MN_res2 = c(1 - sum(test2_res), test2_res)

print(xtable(data.frame(INLA_res2, MN_res2), align = "ccc"), type = "html", 
    include.rownames = T)

         INLA_res2 MN_res2
academic      0.53    0.53
general       0.24    0.25
vocation      0.22    0.22

And the coefficients too:

         Multinomial_fn INLA_result
general           -0.07       -0.07
vocation          -0.12       -0.12
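
That coefficient table was assembled off-screen; here is a minimal sketch of how it can be pulled together from the two fits, reusing the objects defined above (test2, r and xtable):

# multinom(): the C.write column of the coefficient matrix
Multinomial_fn = coef(test2)[, "C.write"]

# INLA: posterior means of the fixed effects for the category-specific covariates
INLA_result = r$summary.fixed[c("C.write_pg", "C.write_pv"), "mean"]

print(xtable(data.frame(Multinomial_fn, INLA_result), align = "ccc"),
    type = "html", include.rownames = T)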

As my daughter says: “Same:Same”. This may seem like a complicated exercise to achieve something that multinom() does in a few lines of code. But this simple model is now extensible to a spatio-temporal framework using the SPDE and random-walk models described in the INLA tutorials. Stay tuned…


The Pool of Tears: Installing PyMC

Statistical advice often comes out sounding dreadfully easy.  “Just model it the way you wish it was, then model the missing data as well” quoth one advisor, with the hindsight of many years of Bayesian modeling. Such nodules of wisdom usually prompt a look of dawning awareness on the recipient, an awareness that only scratches the surface of what was really meant.  So off I scurry, to do as I have been bidden, only to get entangled in the details waiting behind the curtain of that pithy statement.  One short sentence can result in months of learning, a process that is in equal parts frustrating and rewarding.

Another statement, “Just vectorize the model and run it in pymc”, was issued by another advisor. And here I am, months later, still learning. This post is a repository for me to document the trial-and-error process of getting Python and all its various dependencies installed on my computer.

My Computer:

I have a MacBook Pro (2.66 GHz, 4 GB of RAM) with various other bells and whistles. Many of the details that follow are moot for Windows users, but the general ideas are the same.

Installing Python:

PyMC is a Python module. Of course, Macs these days come with Python pre-installed, but mine had version 2.5, and PyMC’s latest build was made for Python 2.6 (though at one point I had it running on Python 2.7). I haven’t got much to say about installing Python: download the installer and install it. Make sure to leave your pre-installed version in place.

Installing PyMC with superpack

The easiest way to install PyMC is using the superpack for Mac OS X that Chris Fonnesbeck has put together at stronginference. This shell script uses easy_install to download and install the various modules that PyMC depends upon, namely NumPy, nose, SciPy and Matplotlib. I’ll get to those in a minute. It also installs a Fortran compiler if you haven’t already got one. The superpack should be your first stop, even if it’s just to make sure you’ve got the compiler. If the superpack works for you, great. If, like me, you ran into some issues using it, read on.

Installing piecewise with easy_install

My issues, for the record, centered on the fact that I had conflicting versions of Python installed (2.7 and 2.6). Long story short: I deleted all the Python 2.7 files and did a fresh install of 2.6. Then I used setuptools to easy_install NumPy, Matplotlib and nose. I decided to do things the hard way and install Matplotlib and all its dependencies from binaries, including libpng, freetype and Tcl/Tk. Of course, the order you install these in matters, and it’s all done from within the terminal, so the “Installing from Source” instructions on the Matplotlib page were useful. After installing each one, I tried importing the module to check that it had installed properly. With a few false starts, I installed PyMC with all its dependencies, ran a test using nose, and everything came up roses.

Then, when I tried running the sample models in PyMC, I got errors related to Matplotlib 1.0. It turned out that the installer for Matplotlib 1.0 sets the read/write permissions to admin only. For various reasons associated with a university admin having purchased this laptop for me, I am not listed as the admin on this computer. However, I managed to reset the file permissions using chown and chmod.

So now what? It’s all installed, and I try out the examples in the PyMC tutorial. And it turns out that I haven’t the foggiest clue how to make all these example commands do anything meaningful. Enter IPython, TextWrangler and IDLE. The first is an interactive command-line shell that lets you work in Python with a few extra features. TextWrangler is a powerful text editor that recognizes Python scripts and lets you run them from the editor; it will also talk to R. IDLE is similar and comes with Python. To be honest, I have mostly been working from IPython, which is easy to install with easy_install ipython.

Now What?

Now I’m figuring out how to make PyMC, and Python in general, work for me as an integrated language. I can already see the benefits over JAGS/R: the debugging is light years ahead, and the greater flexibility afforded by NumPy/Matplotlib will be a great asset. However, the learning curve is steep (for me). Stay posted as I continue to bash my head against it.


A Neophyte’s Diary Down the Rabbit Hole

If you’re already a statistical modeler like Chris Fonnesbeck (http://stronginference.com/), or a Bayesian ninja like Andrew Royle (http://www.pwrc.usgs.gov/staff/profiles/documents/royle.htm) or Richard Barker (http://www.maths.otago.ac.nz/~rbarker/), then modeling statistical problems in the Bayesian paradigm is already in your blood. However, if you’re a middle-of-the-road ecologist with a limited statistical background and Windows-based computer experience, then the journey towards meaningful Bayesian statistical inference using MCMC in arenas like R, WinBUGS and Python may not be all that appealing at first. But even for an outsider, the Bayesian approach to reading data and getting results from them can be tempting.

Books like Royle and Dorazio’s Hierarchical Modeling and Inference in Ecology, Marc Kery’s Introduction to WinBUGS for Ecologists, and Link and Barker’s Bayesian Inference with Ecological Applications all contain within their front matter bewitchingly clear and logical chutes down which, I think, many ecologists should take a slide. These introductions are like a gateway drug that beckons toward the harder chapters inside. Which is where I currently find myself: fully embroiled, committed to a problem and a means of solving it that, a year ago, I would have thought beyond my meager processing capacity.

At present, I’m (sort of) doing a PhD in which I am supposed to construct a hierarchical mark-recapture model for a fish population in Australia. The problem is convoluted enough that traditional mark-recapture approaches (via Program MARK or similar) are likely to be either intractable, or at least hard enough to set up that it’s worth considering more flexible modeling languages: multiple data sources at different time scales, spatially explicit radio-telemetry and electrofishing, prior distributions for capture probabilities, population growth, survival and migration. Richard Barker and Chris Fonnesbeck pointed me towards WinBUGS, then Python, and now I’m neck-deep in it.

And as I go further down the rabbit hole, I find myself relying more and more on the community of modelers, programmers and novices who have encountered (or constructed) the same problems I have been wading into. That’s why I’m writing this blog: because when I manage to grunt through a problem with installing Python, vectorizing a model in PyMC (not there yet) or understanding an output from WinBUGS, it’s the community of forums (PhiDot), wikis (wiki.python.org, PyMC/w/list) and mailing lists (R-help) that gets me there.

So what we have here is sort of a lab notebook in which I will mutter to myself, just loud enough for some to overhear.  It’s a way for me to record and compile the mistakes,  problems and solutions that I find.

TTFN
