December 7, 2017

Speed Up Your R Code with RStudio on AWS

What if AWS could save you days without changing your usual workflow?

Data analysis with RStudio is great, apart from R's famously poor performance. What if AWS could save you days without changing your usual workflow?

I have been using R for almost ten years now; I like R, I love it. I have been pushed to switch to Python, but to me RStudio remains an unbeatable, state-of-the-art IDE for data analysis and research overall.

While getting working R code is quite straightforward, getting high-performance R code can become a headache. All the usual tricks (matrix calculations, *apply functions, the compiler package, Rcpp) may not bring a sufficient speedup: you may still have to wait minutes (or hours, for heavy statistical simulations) for your computation to finish, and performance becomes a bottleneck. But aren't statistical simulations just different trials of the same thing? What if you used parallel computing to actually run them in parallel and achieve a scalable speedup? What if you did it on AWS?
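(As a quick aside on those usual tricks, here is a purely illustrative sketch of mine, not a benchmark from this post, replacing an explicit loop with a vectorized built-in:)

# naive element-by-element loop
slow_cumsum <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

x <- rnorm(1e6)
system.time(slow_cumsum(x))   # interpreted loop
system.time(cumsum(x))        # vectorized built-in, typically much faster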

In this post I want to share my experience of getting a working RStudio on AWS with your own files and as many CPUs as you need. Without any painful or complicated devops tasks. And how to distribute loops. In 5 to 10 minutes.

Let us define a toy example as an illustration. Consider for instance that one has a prior over n data samples, so that the likelihood of the data could be something like this:

n <- 1e7                                                    # number of data samples
X <- rnorm(n)                                               # observed data
model <- function(x) prod(dnorm(x, mean = X, sd = abs(X))) # toy likelihood of a draw x

A single computation would take:

system.time(model(rnorm(n)))

## user system elapsed 
## 1.648 0.054 1.708

so that any statistical operation (optimization, evidence estimation, etc.) with this model would probably take minutes or hours, requiring hundreds or thousands of calls to model. For instance:

system.time(replicate(10, model(rnorm(n))))

## user system elapsed 
## 16.501 0.527 17.151

This bad performance is directly proportional to the number of calls to your model. Distributing these computations over several independent computers directly reduces the wall-clock time, i.e. the time you have to wait for your computation to be done. Hence you would directly increase the performance of your code. If you don’t want to buy an expensive new high-performance multicore laptop, cloud computing is your best bet.
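To fix ideas, here is a back-of-the-envelope sketch (my own illustrative numbers, assuming near-perfect scaling that real overheads will erode) of what distributing 1,000 calls of roughly 1.7 seconds each would buy you:

calls  <- 1000   # say, 1,000 evaluations of model()
t_call <- 1.7    # seconds per call, from the timing above
serial_minutes   <- calls * t_call / 60   # about 28 minutes
parallel_minutes <- serial_minutes / 4    # about 7 minutes on 4 cores, ignoring overhead
c(serial = serial_minutes, parallel = parallel_minutes)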

RStudio on an Amazon AWS machine

Thanks to Louis Aslett's Amazon AWS Machine Images (AMIs), this is as easy as a click. For the technical details on how to do the setup yourself, you can follow this tutorial.

Let us stick to the basics:

  • go to Amazon AWS page and create a free account

  • click on Services/EC2

  • select Launch instance

  • on the left panel, select Community AMI

  • check out the AMI key on this website (several possibilities depending on your region) and paste it. You can also directly click on the link therein

  • search for the key and select the corresponding image

  • click Review and launch

  • click Edit security group

  • select type: HTTP

  • click Review and launch

  • proceed without a key pair and launch

  • click on View instance to go back to the dashboard

From this point we can see that Amazon is starting our machine on the AWS cloud. If you proceed exactly as above with the free tier option selected, this machine is a single-core one and won’t give you any performance improvement. Usually, you would go for a better machine on a spot request. But for now, it is still worth having a look at the settings before paying for better (up to 32-core) machines.

When the instance state goes to Running, copy/paste the Public DNS from the Description tab to your browser. It should open a page similar to this one. The default credentials are:

  • Username: rstudio

  • Password: ID of the launched instance

Validate and see the magic happen. A Welcome.R file is already open and gives you some useful info about how to change the credentials and how to link your Dropbox to get your Dropbox files right in the file panel. It also mentions some interesting though lesser-known RStudio integrations: Python, Julia or TensorFlow.

From that moment on, you may be driving a high-performance Ferrari from your modest single-core computer. One last thing you need to get introduced to: distributed loops in R.

Using foreach for running code in parallel

Now that you have a multicore machine, you want to use all its cores to run as many processes as possible in parallel. For an embarrassingly parallel problem, i.e. for example when you loop over a variable to repeat the same task with different randomness:

# with a basic for loop 
for(i in 1:100) rnorm(1)

# with R standard *apply functions 
sapply(1:100, function(i) rnorm(1)) 
replicate(100, rnorm(1))

you can expect to divide the computing time by the number of cores of your machine. Let’s do it.

For that purpose I will use the foreach package together with doMC. These packages have great vignettes, so here I go straight to a single setup well suited to your Amazon AWS cloud machine.

install.packages(c('doMC', 'foreach')) 
library(foreach) 
library(doMC)                               # also attaches the parallel package
doMC::registerDoMC(cores = detectCores())   # one worker per available core

We are almost done. A few words about what happened:

  • the foreach package defines a new foreach loop construct which is able to run in parallel. Parallel execution is not required, and you will probably find foreach very useful even for sequential computing (see the short %do% sketch after this list)

  • the doMC package actually does the job of running tasks in parallel, by registering a parallel backend for foreach.
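As a minimal sketch of that sequential use (my own example, not from the foreach vignette), swapping %dopar% for %do% runs the same kind of loop serially, with no backend registered:

library(foreach)
# %do% evaluates the loop sequentially in the current R session
res_seq <- foreach(i = 1:5, .combine = 'c') %do% rnorm(1)
res_seq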

Here I override the default cores parameter to get as many workers as there are cores on your Amazon AWS cloud machine:

detectCores()

## [1] 4

foreach::getDoParWorkers()

## [1] 4

You are now in a position to run loops in parallel with a script like this one:

foreach(i=1:100) %dopar% {
   rnorm(1) 
}

The foreach documentation is great. Don't forget to have a look if you want to go deeper into distributed computing. At least notice that:

  • the foreach function is like an improved lapply: it does not perform any global assignment with %dopar% (but would that make sense anyway?)

  • like the lapply function, you have to save the output with a variable:

res <- foreach(...) ...
  • unlike lapply or sapply, you can specify the shape of the output with the .combine parameter:

(res_default <- foreach(i=1:5) %dopar% rnorm(1))

## [[1]] 
## [1] -0.7804437 
## 
## [[2]] 
## [1] 1.389149 
## 
## [[3]] 
## [1] 0.660726 
## 
## [[4]] 
## [1] 0.6330952 
## 
## [[5]] 
## [1] 1.087294

(res_vector <- foreach(i=1:5, .combine = 'c') %dopar% rnorm(1))

## [1] 0.3198138 3.0455799 -0.3342982 -1.0266328 -1.3111736

(res_rbind <- foreach(i=1:5, .combine = 'rbind') %dopar% rnorm(1))

##                   [,1] 
## result.1  0.1310123392 
## result.2  0.3611695875 
## result.3 -0.0006377836 
## result.4  0.7869791290 
## result.5 -1.5703346777

etc.
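As a small extra sketch of mine (not in the original examples above), .combine also accepts reduction functions such as '+', which sums the results on the fly:

(res_sum <- foreach(i=1:5, .combine = '+') %dopar% rnorm(1))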

Illustration

So now, what about a benchmark on our previous model:

(time_rep <- system.time(
   replicate(10, model(rnorm(n)))
)[3])

## elapsed 
##  18.049

(time_foreach <- system.time(
   foreach(i=1:10) %dopar% model(rnorm(n))
)[3])

## elapsed 
##  10.741

Here, with 4 cores we are able to divide the computing time by approximately 1.7. This is not exactly the speedup we expected: because of the overhead of distributing the computation, you don’t simply divide your computation time by the number of cores. But this overhead eventually becomes negligible as your tasks get longer; from my experience, for heavy statistical simulations you barely notice it. The documentation on parallel computing may also help you optimize this, for example using the iterators package.
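One simple way to shrink that overhead is to hand each worker a batch of iterations rather than a single one. Here is a minimal sketch using base split (the iterators package provides dedicated helpers for this, not shown here); the 4 batches are an assumption matching the 4 cores detected above:

# 100 short tasks grouped into 4 batches, one batch per worker
batches <- split(1:100, cut(1:100, 4, labels = FALSE))
res <- foreach(b = batches, .combine = 'c') %dopar% {
  sapply(b, function(i) rnorm(1))
}
length(res)   # still 100 results, but only 4 parallel tasks were scheduled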

In any case, you are now set up to save heaps of time with High Performance Computing on AWS. Not mentioned here is also the real advantage of accessing and running your code from anywhere. So what if you switched to RStudio’s R Notebooks for your analyses, to generate documents like this one? They let you mix plain markdown with R, Python or Julia code, all in one file. You can even share data between code chunks of different languages using feather. Amazon AWS also lets you request GPUs. What about starting big data and machine learning projects online with TensorFlow, Keras and Spark from this remote AWS machine? Stay posted! (and click follow-me just below)

Acknowledgement

If you ever read this blog post, once again a huge thanks to Mr. Louis Aslett for providing these AMIs. It saved my life once and got me started with distributed computing.

Do you need data science services for your business? Do you want to apply for a job at Sicara? Feel free to contact me.

Thanks to Tristan Roussel and Adil Baaj. 
