This is a quick tutorial on functional programming in R with respect to phylogenetics. I am using the functional programming definitions available here, which is a good resource for further information. We will discuss some tricks to make your phylogenetics applications more efficient and flexible in R.

Motivating Example

You may have noticed that some packages fit models in different ways. Let’s compare two ways of fitting a birth-death model, one in the R package diversitree and one in ape.

library(diversitree)
## Loading required package: ape
library(ape)
## Make the tree
set.seed(2)
phy <- tree.bd(c(.1, .03), max.taxa=100)

## Fit with diversitree
lik <- make.bd(phy)
fit_diversitree <- find.mle(lik, c(.1, .03), method="subplex")
fit_diversitree$par
##     lambda         mu 
## 0.08716469 0.01231621
## Fit with ape
fit_ape <- birthdeath(phy)
fit_ape$para
##       d/b       b-d 
## 0.1419101 0.0748218

They give the answer in different parameterizations, but you can prove to yourself that they’re roughly the same. Which one is better? Well, ape has only one line of code, while diversitree took two! Why bother with the two steps? The answer is the flexibility afforded by functional programming.

## Take the same likelihood from diversitree and fit it with MCMC
mcmc_diversitree <- mcmc(lik, c(0.1,0.03), 100, w=0.1, print.every=10)
## 10: {0.0930, 0.0045} -> 14.11458
## 20: {0.0892, 0.0214} -> 14.70931
## 30: {0.0830, 0.0057} -> 14.77845
## 40: {0.1305, 0.0915} -> 11.64522
## 50: {0.1030, 0.0389} -> 14.41441
## 60: {0.0848, 0.0202} -> 14.47579
## 70: {0.0786, 0.0162} -> 13.96003
## 80: {0.0896, 0.0079} -> 14.64217
## 90: {0.0845, 0.0058} -> 14.78028
## 100: {0.1064, 0.0364} -> 14.29043
summary(mcmc_diversitree)
##        i              lambda              mu                  p         
##  Min.   :  1.00   Min.   :0.06632   Min.   :0.0001294   Min.   : 8.638  
##  1st Qu.: 25.75   1st Qu.:0.08525   1st Qu.:0.0120494   1st Qu.:13.957  
##  Median : 50.50   Median :0.09319   Median :0.0226492   Median :14.424  
##  Mean   : 50.50   Mean   :0.09527   Mean   :0.0262987   Mean   :14.058  
##  3rd Qu.: 75.25   3rd Qu.:0.10253   3rd Qu.:0.0371644   3rd Qu.:14.647  
##  Max.   :100.00   Max.   :0.13373   Max.   :0.1026696   Max.   :14.800
## Use a different function than find.mle in diversitree
fit_optim <- optim(c(0.1, 0.03), lik, control=list(fnscale = -1))
fit_optim
## $par
## [1] 0.08719640 0.01237997
## 
## $value
## [1] 14.80924
## 
## $counts
## function gradient 
##       55       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## diversitree also sends likelihood functions to ancestral state reconstructions for trait models

Diversitree uses the same mcmc and find.mle function, regardless of what type of model is made and fit. This drastically cuts down on the number of lines of code needed to run various models, and makes the program much more robust than recoding a new function that does it all in one step every time. Plus, you can add new make.my_new_model functions later that don’t break your code.

Anonymous functions

Functions in R are just like other objects, and can be assigned to a variable for later use. When you don’t think you’ll ever use the function again, you can skip that step and simply use a function without assigning it! Check it out:

(function(x) paste0("My anonymous function says: ", x))(x="What?")
## [1] "My anonymous function says: What?"

This doesn’t look pretty, and I don’t recommend you do this. But it simply shows that you don’t have to name a function to use it. This comes in handy for the apply family of functions, as we’re often faced with trivial tasks that we don’t care to make a function for.

For example, here’s something that happens to me a lot. I want to fit 3 models using fitContinuous, but I want to do it in a streamlined way. So I use lapply:

library(treeplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:diversitree':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Registered S3 method overwritten by 'geiger':
##   method            from
##   unique.multiPhylo ape
## 
## Attaching package: 'treeplyr'
## The following object is masked from 'package:stats':
## 
##     reorder
library(geiger)
data(anolis)
anolis <- make.treedata(anolis$phy, anolis$dat)
models <- c("BM", "OU", "lambda")
fits <- lapply(models, fitContinuous, phy=anolis$phy, dat=anolis[['SVL']], SE=0)
## Warning in FUN(X[[i]], ...): Parameter estimates appear at bounds:
##  lambda
fits
## [[1]]
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.136160
##  z0 = 4.065918
## 
##  model summary:
##  log-likelihood = -4.700404
##  AIC = 13.400807
##  AICc = 13.524519
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 1.00
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates
## 
## [[2]]
## GEIGER-fitted comparative model of continuous data
##  fitted 'OU' model parameters:
##  alpha = 0.000000
##  sigsq = 0.136160
##  z0 = 4.065918
## 
##  model summary:
##  log-likelihood = -4.700404
##  AIC = 15.400807
##  AICc = 15.650807
##  free parameters = 3
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 0.81
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates
## 
## [[3]]
## GEIGER-fitted comparative model of continuous data
##  fitted 'lambda' model parameters:
##  lambda = 1.000000
##  sigsq = 0.136160
##  z0 = 4.065918
## 
##  model summary:
##  log-likelihood = -4.700404
##  AIC = 15.400807
##  AICc = 15.650807
##  free parameters = 3
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 0.09
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

Now all my results are locked away in a list. I could go get what I want, one by one. For example, if I want the AIC values for each fit to make a pretty table.

setNames(c(fits[[1]]$opt$aic, fits[[2]]$opt$aic, fits[[3]]$opt$aic), models)
##       BM       OU   lambda 
## 13.40081 15.40081 15.40081

Not great though right? If we had fit 20 models, the script above would be really tedious. We could use sapply, and define a new function, aic_getter:

aic_getter <- function(x) x$opt$aic
aics <- sapply(fits, aic_getter)
setNames(aics, models)
##       BM       OU   lambda 
## 13.40081 15.40081 15.40081

But who am I kidding, am I going to ever use aic_getter again? Is it worth it to make this function? Sometimes it’s just easier to use it as an anonymous function in a single step, since we will likely never need aic_getter again.

names(fits) <- models
sapply(fits, function(x) x$opt$aic)
##       BM       OU   lambda 
## 13.40081 15.40081 15.40081

This is really useful if you just need an on-the-fly function that isn’t worth making into an object. Note that anonymous functions are really hard to read if they get long and complicated, so try not to make them longer than 1 line. You can, but probably shouldn’t, use {} and semicolons to make multi-line anonymous functions.

Closures

Functions can also make functions, which are called “closures”. These are anonymous functions that are returned by other functions.

power <- function(exponent) {
  function(x) {
    x ^ exponent
  }
}
square <- power(2)
square(2)
## [1] 4
square(3)
## [1] 9
cube <- power(3)
cube(2)
## [1] 8
cube(3)
## [1] 27

One problem with this is that we don’t see what the function does very clearly from printing the function. Both functions used the same x ^ exponent. Instead, the difference is that in the environment of the function when the closure was made, the exponent was either 2 or 3. We can see this here:

square
## function(x) {
##     x ^ exponent
##   }
## <environment: 0x7f8f0feb7f90>
cube ## Both are the same function
## function(x) {
##     x ^ exponent
##   }
## <bytecode: 0x7f8f150278f8>
## <environment: 0x7f8f13708fe0>
as.list(environment(square))
## $exponent
## [1] 2
as.list(environment(cube))
## $exponent
## [1] 3

In other words, the function is made with its environment coinciding with the function environment it was created in. This is really useful for making flexible functions that respond to user input.

Functions can also be made into lists of functions. For example:

powerFns <- lapply(1:10, power)
names(powerFns) <- paste0("powerExp", 1:10)
powerFns$powerExp10(2)
## [1] 1024
sapply(powerFns, function(x) x(2))
##  powerExp1  powerExp2  powerExp3  powerExp4  powerExp5  powerExp6  powerExp7 
##          2          4          8         16         32         64        128 
##  powerExp8  powerExp9 powerExp10 
##        256        512       1024

Exercise 1: Make an S3 generic function to print a more useful definition

Let’s make use of classes. We are going to define a new class for our power function, let’s call it powerFn, and write a print method that gives us useful information.

power <- function(exponent) {
  fn <- function(x) {
    x ^ exponent
  }
  class(fn) <- "powerFn"
  return(fn)
}
cube <- power(3)
class(cube)
## [1] "powerFn"

Now currently we have no methods for a powerFn, and it inherits the class “function”. Let’s define our print function:

print.powerFn <- function(x, ...){
 
}
cube

Exercise 2: Making a function lets users specify priors

Suppose you want to make a prior function for 3 parameters (a, b, c) with different distributions, how will you do it? Well you could do the following:

prior <- function(x){
  lnP <- dnorm(x[1], 0, 1, log=TRUE) + dnorm(x[2], 0, 10, log=TRUE) + dnorm(x[3], 0, 100, log=TRUE)
  return(lnP)
}
prior(c(0, 3, 10))
## [1] -9.714571

But what if a user wants to modify the distributions? Or use a different distribution than a normal distribution? What can you do to make it more flexible?

make.prior <- function(fns, pars){
  prior <- function(x) fns[[1]](x[1], pars[[1]][1], pars[[1]][2], log=TRUE) + fns[[2]](x[2], pars[[2]][1], pars[[2]][2], log=TRUE) + fns[[3]](x[3], pars[[3]][1], pars[[3]][2], log=TRUE)
  return(prior)
}

Now let’s test it out.

fns <- list(A=dnorm, B=dnorm, C=dnorm)
pars <- list(A=c(mean=0, sd=1), B=c(mean=0, sd=10), C=c(mean=0, sd=100))
prior <- make.prior(fns, pars)
prior(c(0,3,10))
## [1] -9.714571

Yay! We get the same answer. But now try changing one of the functions to an exponential distribution, dexp:

fns <- list(A=dnorm, B=dnorm, C=dexp)
pars <- list(A=c(mean=0, sd=1), B=c(mean=0, sd=10), C=c(rate=1))
prior <- make.prior(fns, pars)
try(prior(c(0,3,10)))
## Error in fns[[3]](x[3], pars[[3]][1], pars[[3]][2], log = TRUE) : 
##   unused argument (pars[[3]][2])

It breaks because the exponential distribution has only 1 parameter, and our code asks for 2. Also, what if we want to add a fourth parameter? Do we need to rewrite the function every time?

Here’s one way we could maybe make an more elegant solution. We can make use of what are called formals.

formals(dnorm)
## $x
## 
## 
## $mean
## [1] 0
## 
## $sd
## [1] 1
## 
## $log
## [1] FALSE

Notice that we see the default values of the formal arguments $mean=0 and $sd=1. Guess what, we can change these (PS, this is kind of scary):

formals(dnorm)$sd <- 10
formals(dnorm)
## $x
## 
## 
## $mean
## [1] 0
## 
## $sd
## [1] 10
## 
## $log
## [1] FALSE

While we wouldn’t want to do this in the global environment, messing with formals inside a function ensures that it only modifies the functions within the scope of the function, and not in the global environment.

make.prior <- function(fns, pars){
  ## Fill this in
  return(prior)
}

fns <- list(A=dnorm, B=dnorm, C=dexp)
pars <- list(A=list(mean=0, sd=1), B=list(mean=0, sd=10), C=list(rate=1))
prior <- make.prior(fns, pars)
prior(c(0,3,10)) #-14.18546

Challenge, now make it work for an arbitrary number of parameters (e.g. 4 parameters as shown below).

make.prior <- function(fns, pars){
  ## Fill this in
  return(prior)
}

fns <- list(A=dnorm, B=dnorm, C=dlnorm, D=dunif)
pars <- list(A=list(mean=0, sd=1), B=list(mean=0, sd=10), C=list(meanlog=0, sdlog=1), D=list(min=0, max=10))
prior <- make.prior(fns, pars)
prior(c(0,3,10,5))  #-12.36052
prior(c(0,3,10,11)) #-Inf since 11 > than max=10 for D.

Notice that we can see all the details of the prior function using the environment function again:

as.list(environment(prior))
## $prior
## function(x) fns[[1]](x[1], pars[[1]][1], pars[[1]][2], log=TRUE) + fns[[2]](x[2], pars[[2]][1], pars[[2]][2], log=TRUE) + fns[[3]](x[3], pars[[3]][1], pars[[3]][2], log=TRUE)
## <bytecode: 0x7f8f12c59260>
## <environment: 0x7f8f12278708>
## 
## $fns
## $fns$A
## function (x, mean = 0, sd = 1, log = FALSE) 
## .Call(C_dnorm, x, mean, sd, log)
## <bytecode: 0x7f8f17012190>
## <environment: namespace:stats>
## 
## $fns$B
## function (x, mean = 0, sd = 1, log = FALSE) 
## .Call(C_dnorm, x, mean, sd, log)
## <bytecode: 0x7f8f17012190>
## <environment: namespace:stats>
## 
## $fns$C
## function (x, rate = 1, log = FALSE) 
## .Call(C_dexp, x, 1/rate, log)
## <bytecode: 0x7f8f1302d8c0>
## <environment: namespace:stats>
## 
## 
## $pars
## $pars$A
## mean   sd 
##    0    1 
## 
## $pars$B
## mean   sd 
##    0   10 
## 
## $pars$C
## rate 
##    1

Exercise 3: Make an S3 Generic functiont to print an informative prior function definition.