What is cloud computing

Image source: https://towardsdatascience.com/how-to-start-a-data-science-project-using-google-cloud-platform-6618b7c6edd2.

It’s not in cloud

Why we care about cloud computing

Vendors

There are many vendors out there, which is good for customers (us). They all work similarly.

We will demonstrate how to start using GCP.

Using GCP Compute Engine: basic workflow

  0. Set up GCP account.

  1. Configure and launch VM instance(s).

  2. Set up connection (SSH key).

  3. Install software you need.

  4. Run your jobs.

  5. Transfer results.

  6. Terminate instance(s).
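The launch, transfer, and terminate stages of this workflow can also be driven from a terminal with the gcloud command line tool instead of the web console. The following is only a sketch, assuming gcloud is installed and authenticated on your own machine. The function names, the instance name biostat-vm, the zone, the machine type, and the image family/project are all illustrative assumptions, not values from these notes; check `gcloud compute images list` for images that are currently available.

```shell
# Hypothetical defaults; override them with environment variables.
INSTANCE="${INSTANCE:-biostat-vm}"
ZONE="${ZONE:-us-west2-a}"

# Launch a VM instance. The image values below are assumptions and
# may need updating for your project.
launch_vm() {
  gcloud compute instances create "$INSTANCE" \
    --zone="$ZONE" \
    --machine-type="${MACHINE_TYPE:-e2-medium}" \
    --image-family="${IMAGE_FAMILY:-centos-7}" \
    --image-project="${IMAGE_PROJECT:-centos-cloud}"
}

# Copy a directory of results from the VM into the current directory.
# Usage: fetch_results '~/results'
fetch_results() {
  gcloud compute scp --recurse "${INSTANCE}:$1" . --zone="$ZONE"
}

# Delete the VM so that it stops accruing charges.
# gcloud asks for confirmation before deleting.
terminate_vm() {
  gcloud compute instances delete "$INSTANCE" --zone="$ZONE"
}
```

Wrapping each stage in a function keeps the instance name and zone in one place, so the same values are used when launching and when cleaning up.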

Step 0: set up GCP account

GCP free trial:

Step 1: configure and launch a VM instance

Step 2: set up SSH keys

There are several ways to connect to the VM instance you just created. Most often we want to be able to SSH into the VM instance from other machines, e.g., your own laptop. By default, a VM instance accepts only key authentication, so it’s necessary to set up the SSH key first.

Option 1: SSH in browser

Option 2: Manually set up SSH

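# Run these on the VM instance (e.g., from the SSH-in-browser window)
cd
mkdir -p .ssh
# sshd requires that only the owner can access ~/.ssh
chmod 700 .ssh/
cd .ssh
vi authorized_keys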

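Paste your public key into authorized_keys, save the file, and restrict its permissions:

chmod go-rwx authorized_keys

Then, from your own machine, connect using the external IP address of the VM instance:

ssh username@XX.XXX.XXX.XXX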
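The manual steps above can be scripted so that repeating them does not duplicate keys or loosen permissions. This is a minimal sketch; install_pubkey is a hypothetical helper name, and the optional second argument (the target directory) exists only so the function can be tried out somewhere other than your real ~/.ssh.

```shell
# install_pubkey PUBLIC_KEY_LINE [SSH_DIR]
# Appends the key to authorized_keys unless it is already present,
# and sets the permissions that sshd expects.
install_pubkey() {
  pubkey="$1"
  ssh_dir="${2:-$HOME/.ssh}"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  # -x matches whole lines, -F treats the key as a literal string
  grep -qxF -- "$pubkey" "$ssh_dir/authorized_keys" ||
    printf '%s\n' "$pubkey" >> "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
}

# Usage on the VM instance (the key text is a placeholder):
#   install_pubkey "ssh-ed25519 AAAA... username"
```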

Option 3: Set up instance specific key in GCP

Option 4: Set up project-wide key in GCP

Step 3: install software

yum is the default package management tool on CentOS. Most software can be installed via sudo yum. sudo executes a command as a superuser (or root).

Install R/R Studio Server

  • Install the EPEL repository (if not already installed)
sudo yum install epel-release -y
  • Install R (it takes a couple minutes)
sudo yum install R -y
  • Install wget, which is a command line tool for downloading files from the internet.
sudo yum install wget -y
  • Download and install R Studio Server, then remove the installer file.
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.463-x86_64.rpm
sudo yum install rstudio-server-rhel-1.1.463-x86_64.rpm -y
rm rstudio-server-rhel-1.1.463-x86_64.rpm
  • The R Studio service starts immediately. Let’s check that it is running properly with the following command.
sudo systemctl status rstudio-server.service
  • By default, port 8787, which R Studio Server uses, is blocked by the VM firewall. On the GCP console, go to VPC network and then Firewall rules, create a rule for R Studio Server (tcp: 8787), and apply that rule to your VM instance.

  • Now you should be able to access R Studio Server on the VM instance by pointing your browser to address http://XX.XXX.XXX.XXX:8787.
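The firewall rule described above can also be created from the command line. The sketch below assumes the gcloud CLI is installed and authenticated; the function name open_rstudio_port, the rule name allow-rstudio, and the network tag rstudio are illustrative choices, not names from these notes. Restricting the source range to your own IP address is safer than opening the port to everyone.

```shell
# open_rstudio_port INSTANCE ZONE SOURCE_RANGE
# Creates an ingress rule for tcp:8787 that applies to VMs carrying the
# "rstudio" network tag, then attaches that tag to the given instance.
open_rstudio_port() {
  instance="$1"
  zone="$2"
  source_range="$3"
  gcloud compute firewall-rules create allow-rstudio \
    --direction=INGRESS \
    --allow=tcp:8787 \
    --target-tags=rstudio \
    --source-ranges="$source_range"
  gcloud compute instances add-tags "$instance" \
    --zone="$zone" \
    --tags=rstudio
}

# Usage (instance name, zone, and address are placeholders):
#   open_rstudio_port my-vm us-west2-a 203.0.113.5/32
```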

Set up a regular user

  • Key authentication suffices for most applications.

  • Unfortunately, R Studio Server (open source edition) does not support key authentication. This means that if you want to use R Studio Server on the VM instance, you need to enable username/password authentication.

  • As a superuser (e.g., huazhou), you can create a regular user, say huazhou:

sudo useradd -m huazhou

The -m option creates the home folder /home/huazhou.

  • You can set the password for a user with
sudo passwd huazhou
  • Now you should be able to log in to R Studio Server from a browser at http://XX.XXX.XXX.XXX:8787 using the username huazhou and the corresponding password.

  • To SSH into the VM instance as the regular user huazhou, you need to set up the key for that user (similar to setting up the key for the superuser).

  • If you want to make the regular user a sudoer, add that user to the wheel group. Run usermod as the superuser (here huazhou), then switch to the regular user:

su - huazhou
sudo usermod -aG wheel username
su - username
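After changing group membership it is worth confirming that the change took effect before relying on sudo. The sketch below uses only the standard id command; in_group is a hypothetical helper name.

```shell
# in_group USER GROUP
# Succeeds (exit status 0) if USER belongs to GROUP.
in_group() {
  # id -nG prints the user's group names separated by spaces
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# Usage:
#   in_group username wheel && echo "username can use sudo"
```

Note that a user who is already logged in has to log out and back in (or run su - username) before the new group membership applies to their session.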

Install R packages

  • Install R packages using the install.packages() function in R. Installing as superuser makes packages available to all users on this instance. For example,
sudo R -e 'install.packages("tidyverse")'
  • To set the CRAN mirror globally, write the following line into the /usr/lib64/R/etc/Rprofile.site file: options(repos = c(CRAN = "https://cran.rstudio.com")).

  • Installing R packages often fails because certain Linux libraries are absent.

  • Pay attention to the error messages, and install those libraries using yum.

  • E.g., trying to install tidyverse may yield the following errors:

ERROR: dependencies ‘httr’, ‘rvest’, ‘xml2’ are not available for package ‘tidyverse’
* removing ‘/usr/lib64/R/library/tidyverse’

You can install the Linux dependencies curl, openssl, and libxml2 with:

sudo yum install curl curl-devel -y
sudo yum install openssl openssl-devel -y
sudo yum install libxml2 libxml2-devel -y
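Returning to the global CRAN mirror setting mentioned in the list above: the options(...) line can be added to Rprofile.site from the shell rather than by hand. This is a sketch only; set_cran_mirror is a hypothetical helper name, and it takes the file path as an argument so it can be tried on a scratch file first. Editing the real system file requires root privileges.

```shell
# set_cran_mirror PROFILE_FILE [MIRROR_URL]
# Appends the CRAN mirror option to the given R profile file,
# unless exactly that line is already there.
set_cran_mirror() {
  profile="$1"
  mirror="${2:-https://cran.rstudio.com}"
  line="options(repos = c(CRAN = \"$mirror\"))"
  touch "$profile"
  grep -qxF -- "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

# Usage, in a script run as root on the VM instance:
#   set_cran_mirror /usr/lib64/R/etc/Rprofile.site
```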

Install Git

  • Install Git on VM instance:
sudo yum install git -y
  • To use Git over SSH smoothly, put the private key that matches the public key registered in your GitHub account into the ~/.ssh folder on the VM instance.

  • Now you can git clone any repo to the VM instance to start working on a project. E.g.,

git clone https://github.com/ucla-biostat203b-2020winter/ucla-biostat203b-2020winter.github.io.git
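Note that the clone command above uses an HTTPS address, which does not use your SSH key at all. The key in ~/.ssh only comes into play with the SSH form of the address (git@github.com:owner/repo.git). The sketch below converts one form to the other; https_to_ssh is a hypothetical helper name.

```shell
# https_to_ssh URL
# Rewrites https://HOST/OWNER/REPO.git as git@HOST:OWNER/REPO.git
https_to_ssh() {
  printf '%s\n' "$1" | sed -E 's#^https://([^/]+)/#git@\1:#'
}

# Usage:
#   git clone "$(https_to_ssh https://github.com/ucla-biostat203b-2020winter/ucla-biostat203b-2020winter.github.io.git)"
```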

(Optional) Install Julia

sudo yum install yum-utils -y
sudo yum-config-manager --add-repo https://copr.fedorainfracloud.org/coprs/nalimilan/julia/repo/epel-7/nalimilan-julia-epel-7.repo
sudo yum install julia -y

Step 4: run your jobs

git clone https://github.com/ucla-biostat203b-2020winter/ucla-biostat203b-2020winter.github.io.git
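Long-running jobs should keep going after you close the SSH session. One common approach, which is an addition here rather than something from these notes, is to start the job with nohup in the background and send its output to a log file. The helper name run_detached and the file names in the usage line are hypothetical.

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Starts COMMAND in the background, immune to hangups, with stdout and
# stderr written to LOGFILE. Prints the process ID of the job.
run_detached() {
  log="$1"
  shift
  nohup "$@" > "$log" 2>&1 < /dev/null &
  echo "$!"
}

# Usage on the VM instance (script and log names are placeholders):
#   run_detached analysis.log Rscript analysis.R
#   tail -f analysis.log    # follow progress; Ctrl-C stops tail, not the job
```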

Step 5: transfer results

Step 6: terminate instance(s)

Go forth and use the cloud

Final word

Before requesting massive computing resources, always examine your code and algorithm. Most likely you can gain orders of magnitude in efficiency (say a 100-fold speedup) through an educated choice of algorithms and careful coding. You’ll see a dozen examples in Spring (Biostat 257).