sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.2 magrittr_1.5 tools_3.6.2 htmltools_0.4.0 yaml_2.2.1 Rcpp_1.0.3 stringi_1.4.6
## [8] rmarkdown_2.1 knitr_1.28 stringr_1.4.0 xfun_0.12 digest_0.6.24 rlang_0.4.4 evaluate_0.14
Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?
Code planning: What does the code need to do? How will the code fit together? What functions will be used? What are their inputs/outputs etc.
Implementation:
Prototype functions, classes, etc., partial documentation
Write unit tests
Implement code, run unit tests, debug
Broader testing, more debugging
Profile code, identify bottlenecks
Optimize code
Conduct experiments.
Full documentation.
After profiling, what to do to improve performance?
Ask are there obvious speedups? Are things being computed unnecessarily? Are you using a data.frame
where you should be using a matrix etc.
Look up your problem on internet (e.g., search for “lapply slow” or “speeding up lapply
” etc.)
Try the just-in-time (JIT) compiler.
Consider re-writing some or all of the code in a compiled language (e.g., C/C++).
Try parallelization.
R typical execution:
compiler
package by Luke Tierney is distributed with base R. compiler
package compiles an R function into bytecode.Brute-force for
loop for summing a vector:
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
Run the code on 1e6 elements:
library(microbenchmark)
library(ggplot2)
x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 35.52734 40.13336 42.8163 42.45542 44.63928 71.40086 100
Let’s compile the function into bytecode sum_rc
and benchmark again:
library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7fb16a117d20>
Benchmark again:
microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 36.64929 38.34437 41.29133 40.29301 43.11206 61.43303 100
## sum_rc(x) 36.56571 37.90031 40.65300 39.47313 42.70774 48.41150 100
Surprise! Surprise! Compiling into bytecode does not help at all. Following code shows that the function sum_r
is already compiled into bytecode before execution.
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7fb16a274548>
Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r
function, and benchmark again:
enableJIT(0) # set JIT leval to 0
## [1] 3
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 334.8417 402.3486 421.8041 412.1567 428.6354 561.157 100
Now we witness the slowness of the un-compiled sum_r
.
Documentation of enableJIT
:
enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.
Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.
If you create a package, then you automatically compile the package on installation by adding
ByteCompile: true
to the DESCRIPTION
file.
Note Matlab has employed JIT technology since 2002 and Julia is designed totally based on JIT. R finally is on the same boat.
Learning sources:
compiler
package compiles R code into bytecode, which is translated to machine code by interpreter during execution. A low-level language such as C, C++, and Fortran is compiled into machine code directly, yielding the maximum efficiency.
cppFunction
Rcpp
package provides a convenient way to embed C++ code in R code.
library(Rcpp)
cppFunction('double sum_c(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
sum_c
## function (x)
## .Call(<pointer: 0x10d80fc50>, x)
Benchmark (1) compiled C++ function sum_c
together with (2) R function sum_r
, (3) compiled R function sum_rc
, and (4) the sum
function in base R:
mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 348.397398 390.520400 411.278585 403.803101 419.898282 582.862497 100
## sum_rc(x) 35.757540 40.229704 43.876828 42.678854 46.459574 67.863500 100
## sum_c(x) 1.317483 1.479515 1.608074 1.576969 1.710736 2.237991 100
## sum(x) 1.095215 1.301181 1.445542 1.409306 1.536520 2.091459 100
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Remember we turned off JIT by enableGIT(0)
earlier.
sourceCpp
In realistic projects, we write standalone C++ files and then source them into R using sourceCpp()
. For example
cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
##
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## // http://www.rcpp.org/
## // http://adv-r.had.co.nz/Rcpp.html
## // http://gallery.rcpp.org/
## //
##
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
## int n = x.size();
## double total = 0;
## for(int i = 0; i < n; ++i) {
## total += x[i];
## }
## return total;
## }
##
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically
## // run after the compilation.
## //
##
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
##
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x)
## .Call(<pointer: 0x10e16cc50>, x)
As a Julia fan, I want to see how Julia works on this example. sum_j
is Julia equivalent of the sum_r
function:
# Julia code in this chunk
using BenchmarkTools
function sum_j(x)
total = zero(eltype(x))
for xi in x
total += xi
end
total
end
## sum_j (generic function with 1 method)
sum_j(1:10)
## 55
x = collect(0:0.0001:100);
@benchmark sum_j(x) samples=100
## BenchmarkTools.Trial:
## memory estimate: 16 bytes
## allocs estimate: 1
## --------------
## minimum time: 1.319 ms (0.00% GC)
## median time: 1.668 ms (0.00% GC)
## mean time: 1.648 ms (0.00% GC)
## maximum time: 2.039 ms (0.00% GC)
## --------------
## samples: 100
## evals/sample: 1
Julia retains the claritiy of a high-level language, while achieving efficiency of a low-level language!
My suggestion: avoid all these hassles of two language problem, just use Julia 😄
Fact: base R is single-threaded. Even you request a fancy instance with 96 vCPUs, running R code is just using 1/96th of its power.
To perform multi-core computation in R:
Option 1: Manually run multiple R sessions.
Option 2: Make multiple system("Rscript")
calls. Typically automated by a scripting language (Python, Perl, shell script) or within R (HW1 Q4)
Option 3: Use package parallel
.
parallel
package in R.
Authors: Brian Ripley, Luke Tieney, Simon Urbanek.
Included in base R since 2.14.0 (2011).
Based on the snow
(Luke Tierney) and multicore
(Simon Urbanek) packages.
To find the number of cores:
library(parallel)
detectCores()
## [1] 8
Let’s re-visit the simulation example considered in earlier lecture and HW1 Q4.
We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function(x) {
n <- length(x)
ind <- sapply(1:n, isPrime)
return (mean(x[ind]))
}
We want to compare our method to the traditional sample average estimator by simulation studies.
## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
# set seed according to command argument `seed`
set.seed(seed)
# preallocate space to store estimators
msePrimeAvg <- 0.0
mseSamplAvg <- 0.0
# loop over simulation replicates
for (r in 1:reps) {
# simulate data according to command arguments `n` and `distr`
if (dist == "gaussian") {
x = rnorm(n)
} else if (dist == "t1") {
x = rcauchy(n)
} else if (dist == "t5") {
x = rt(n, 5)
} else {
stop(paste("unrecognized dist: ", dist))
}
# prime indexed mean estimator and classical sample average estimator
msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2
mseSamplAvg <- mseSamplAvg + mean(x)^2
}
mseSamplAvg <- mseSamplAvg / reps
msePrimeAvg <- msePrimeAvg / reps
return(c(mseSamplAvg, msePrimeAvg))
}
We need to loop over 3 generative models (distTypes
) and 20 samples sizes (nVals
). That are 60 embarssingly parallel tasks.
seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")
This is the serial code that double-loop over combinations of distTypes
and nVals
:
## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
for (dist in distTypes) {
for (n in nVals) {
#print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
i <- i + 2
}
}
)
## user system elapsed
## 40.496 0.219 40.799
simres1
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
mcmapply
Run the same task using mcmapply
function (parallel analog of mapply
) in the parallel
package:
## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
simres2 <- mcmapply(compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
mc.cores = 4)
})
## user system elapsed
## 38.910 0.701 14.111
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
We see roughly 3x-4x speedup with mc.cores=4
.
mcmapply
, mclapply
and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, MacOS) and is not available in Windows.
parLapply
, parApply
, parCapply
, parRapply
, clusterApply
, clusterMap
, and related functions create a cluster of workers based on either socket (default) or forking. Socket is available on all platforms: Linux, MacOS, and Windows.
clusterMap
The same simulation example using clusterMap
function:
cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
simres3 <- clusterMap(cl, compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
.scheduling = "dynamic")
})
## user system elapsed
## 0.031 0.006 10.220
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
Again we see roughly 3x-4x speedup by using 4 cores.
clusterExport
copies environment of master to slaves.
Learning resources:
Book _R Packages_ by Hadley Wickham
RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio