In my previous post, Databases in the Cloud: Amazon Relational Database, I reviewed some of the benefits Amazon Web Services has to offer. Amazon EC2 is another great service offered by Amazon. Simply put, EC2 provides virtual servers with a choice of software stacks and computing power.

EC2 allows users to build apps, automate scaling according to changing needs and peak periods, deploy computationally intensive models, streamline development processes, and create virtual servers, lessening the need to invest in infrastructure.


Advantages

A major advantage of using EC2 is that you’re not locked into one operating system or one configuration setup. You can go from a standard instance with 1 GB of RAM to a GPU-equipped instance with 60+ GB of RAM that can handle a wide range of tasks. You can change your instance type in minutes: stop your instance, modify the instance type, and start the instance again with more or less power than before. This goes hand in hand with R and RStudio because R holds its data in memory.
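The stop, modify, start sequence can also be scripted with the AWS CLI. The sketch below is only an illustration: it assumes the AWS CLI is installed and configured, and the instance ID and target type are placeholders rather than values from this post. By default it only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/usr/bin/env bash
# Sketch: resize an EC2 instance by stopping it, changing its type, and starting it again.

# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# usage: resize_instance <instance-id> <new-instance-type>
resize_instance() {
  local instance_id="$1" new_type="$2"
  run aws ec2 stop-instances --instance-ids "$instance_id"
  # Block until the instance has fully stopped before changing its type
  run aws ec2 wait instance-stopped --instance-ids "$instance_id"
  run aws ec2 modify-instance-attribute --instance-id "$instance_id" --instance-type "Value=$new_type"
  run aws ec2 start-instances --instance-ids "$instance_id"
}

# Placeholder instance ID and instance type
resize_instance "i-0123456789abcdef0" "t2.medium"
```

Keep in mind that the Public DNS of an instance usually changes after a stop and start unless an Elastic IP is attached.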


Benefits to R

Memory, memory, and more memory. If you ever come across a dataset that eats up your computer’s RAM, you can pay less than $0.80 per hour to use a $10,000 GPU that will cut right through any compute-intensive operations. A “spot instance” can be 60-90% cheaper than an “on-demand instance”. See the post On-Demand vs Reserved vs Spot AWS EC2 Pricing Comparison to learn the differences between spot, on-demand, and reserved instances.
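To make the discount concrete, here is a small arithmetic sketch. It treats the $0.80 figure as an on-demand hourly rate, which is an assumption for illustration; actual spot prices vary by region, instance type, and time.

```shell
# Hourly price after a percentage discount, rounded to cents.
# usage: discounted_price <hourly_price> <discount_percent>
discounted_price() {
  awk -v price="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", price * (1 - pct / 100) }'
}

ON_DEMAND=0.80   # assumed on-demand rate in USD per hour
echo "60% off: \$$(discounted_price "$ON_DEMAND" 60)/hour"   # prints 60% off: $0.32/hour
echo "90% off: \$$(discounted_price "$ON_DEMAND" 90)/hour"   # prints 90% off: $0.08/hour
```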


Two Ways to Install / Run RStudio on an EC2

There are two ways to create an EC2 instance and install R and RStudio. The first way is the “Short and Easy Way”, which will get RStudio up and running quickly and give you easy access to Dropbox. The second way is the “Detailed and Longer Way”, which has a more flexible configuration setup.

The first way is “Short and Easy”, but only because of the hard work that Louis Aslett put into creating and maintaining this RStudio AMI. An AMI (Amazon Machine Image) is a template for your instance’s software configuration. It defines things like which operating system is installed on your EC2 instance and which software is included. Louis was kind enough to do all of the dirty work needed (plus more) to create an image that allows us to run RStudio on AWS. His AMI gets you up and running with RStudio and a seamless Dropbox integration in under 10 minutes, so I appreciate the time he spends maintaining it!

To truly appreciate what he has done, you must go through the next option. The second way is starting from scratch and configuring everything yourself. If you want to get your hands dirty and start from scratch, jump to the detailed, longer way. You will have a lot more flexibility (and possibly learn more along the way), which is great if you plan on using applications other than RStudio. If you’re looking to create an instance and have RStudio running within minutes, then please continue along!


Short and Easy

You have to go through the “long way” to really appreciate how much time you’re saving by doing it this way. If you just need to quickly create an AWS instance to run RStudio on, then this is your best bet.

Step 1: Choosing AMI

Click the link for Louis Aslett’s RStudio AMI and view the most recent AMI specs on the right side of the screen. Click the region nearest to you, and a new instance will automatically be created using this AMI, taking you to step 2 of the process.


Step 2: Selecting Instance

To stay within the free tier, we will select the “General Purpose t2.micro” instance type. Click “Next: Configure Instance Details”


Step 3: Configuration Settings, Adding Storage, and Tags

Configuring Instance Details

Leave all settings at their defaults unless you know which advanced settings you’d like to change. You could insert a script such as sudo apt update under “Advanced Details” to run when the instance first boots. Click “Next: Add Storage”.
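As a minimal sketch of what such a script could look like, the snippet below writes a user data script to a local file so its contents can be pasted into the “User data” box under “Advanced Details”. The file name user-data.sh is an arbitrary choice. By default, EC2 runs user data scripts as root and only on the first launch of the instance.

```shell
# Write a minimal user data script to a local file (the file name is arbitrary).
cat > user-data.sh <<'EOF'
#!/bin/bash
# Runs as root on first boot, so sudo is not needed here.
apt-get update -y
EOF

# Show the script so it can be copied into the "User data" box
cat user-data.sh
```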

Add Storage

Leave these settings at their defaults unless you believe you need more storage, but make sure not to go above the free tier limit (8GB should be enough). Click “Next: Add Tags”.

Add Tags

To help you manage your instances, images, and other Amazon EC2 resources, you can optionally assign your own metadata to each resource in the form of tags. Each tag consists of a key and an optional value, both of which you define. Click “Next: Configure Security Group”.


Step 4: Configuring the Security Group

A Security Group acts as a virtual firewall and controls traffic for one or more instances. Set your security group name; in this case I will name it “RStudio”. RStudio has a default port of 8787, but this AMI has it configured to use port 80 (HTTP).

To give access to port 80 (HTTP) for our RStudio instance, we need to select HTTP under the “Type” drop-down and set the source to “Anywhere”. Click “Review and Launch”.

Warning: Setting your source to “Anywhere” opens your instance up to the world. Even though your instance is protected with SSH keys, it is susceptible to dictionary password attacks during boot. You can restrict IP addresses within the Security Group, but that can become a handful if you’re continuously accessing your instance from different IPs. Setting up a reverse proxy so that RStudio is served over HTTPS on port 443 is a better solution. Stay tuned for the next post on how to do this.
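If you mostly connect from one address, one way to narrow the “Anywhere” source is a rule that admits only that address. The sketch below assumes the AWS CLI is installed and configured; the security group ID and IP address are placeholders. It prints the command by default; set DRY_RUN=0 to run it.

```shell
# Print each command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Turn a single IPv4 address into a /32 CIDR block (exactly one address).
to_cidr() {
  printf '%s/32\n' "$1"
}

# Allow inbound TCP traffic on one port from one address only.
# usage: allow_from_ip <security-group-id> <port> <ip>
allow_from_ip() {
  local group_id="$1" port="$2" ip="$3"
  run aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" --protocol tcp --port "$port" --cidr "$(to_cidr "$ip")"
}

# Placeholder group ID and address. Your current public IP can be looked up
# with: curl -s https://checkip.amazonaws.com
allow_from_ip "sg-0123456789abcdef0" 80 "203.0.113.7"
```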


Step 5: Launching & Key-Pair

Click Launch and wait for the “Select an existing key pair or create a new key pair” tab to open. We will select “Proceed without a key pair” under the first drop-down and click the checkbox that states “I acknowledge that I will not be able to connect to this instance unless I already know the password built into this AMI”. Finally, click Launch Instances!


Step 6: Getting Public URL (DNS)

Once your instance State reads “running”, select the instance, copy the “Public DNS”, and paste it into your browser’s address bar.



Logging in to RStudio

Once your browser connects to the instance, you’ll see an RStudio login screen. The AMI’s username is “rstudio” and the password is your instance’s ID (to the left of Public DNS).


It is highly recommended to change your password as soon as you log in. See next step.

Detailed, Longer Way

Step 1: Choosing an AMI

You can choose the Amazon Linux AMI, which is provided at no additional cost and has a stable version of R in the repository. This AMI is maintained by AWS and includes packages and configurations that provide native integration with AWS and other software.

To go through the entire process, we will use the Ubuntu Server (18.04) AMI and download all the necessary software.



Step 2: Choosing an Instance Type

Choose an EC2 instance type that matches the data size and processing that your analysis requires. By default, R runs on only one core and, in many cases, requires a lot of memory. For programming and development, the general-purpose T2 instance types are sufficient and cheap, and t2.micro is available through the AWS Free Tier.

For our purposes, we will select the “General Purpose t2.micro” instance. See the following link for more details on AWS EC2 Instance Types.


Step 3: Configuring Instance Details

We will leave these settings to default, but you can expand “Advanced Details” to add scripts like sudo apt update to run on instance start up. Click “Next: Add Storage” to continue.

Step 4: Add Storage

Here we can set the EBS volume size which defaults to 8GB. Usually, this will be enough, but let’s change it to 16GB to play it safe (free tier gets up to 30GB). Click “Next: Add Tags” to optionally add tags, then click “Next: Configure Security Group”.

Step 5: Configuring the Security Group

In the EC2 launch wizard, you define a security group, which acts as a virtual firewall that controls the traffic for one or more instances. For this R-based analysis environment, we want to access RStudio from the browser so select HTTP from the Type dropdown, and set the Source settings to “Anywhere”.

We will add an additional rule to allow SSH logins to our instance directly. Select SSH under Type and set Source to “Anywhere”. Next, we want to open the ports for Shiny Server and RStudio Server, so let’s add two custom TCP rules and set the ports to 3838 and 8787 respectively. Set both sources to “Anywhere”. The full set of rules is listed below.

  • SSH: port 22
  • HTTP: port 80
  • Shiny: port 3838
  • RStudio: port 8787

As a quick reminder about security group source settings:

Warning: Setting your source to “Anywhere” opens your instance up to the world. Even though your instance is protected with SSH keys, it is susceptible to dictionary password attacks during boot. You can restrict IP addresses within the Security Group, but that can become a handful if you’re continuously accessing your instance from different IPs. Setting up a reverse proxy so that RStudio is served over HTTPS on port 443 is a better solution. Stay tuned for the next post on how to do this.


After you finish your security group configuration, choose “Review and Launch”.


Step 6: Launch & Key Pair Management

Review your instance information and click “Launch”. When prompted to select or create a key pair, choose “Create a new key pair”, give it a name of your choosing, and download it. Launch your instance, and while it launches let’s tighten the permissions on the key pair. Open your terminal and copy/paste the commands below.

mv <path-to-pem-file> ~/.ssh

chmod 600 <new-path-to-pem-file>
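SSH refuses to use a private key that other users can read, so it can be worth confirming the permissions before connecting. A minimal sketch follows; the key path is a placeholder, and the check relies on find matching an exact octal mode.

```shell
# Succeeds only if the file exists and its mode is exactly 600 (owner read/write).
key_is_private() {
  [ -f "$1" ] && [ -n "$(find "$1" -maxdepth 0 -perm 600)" ]
}

# Placeholder path; point KEY at your own .pem file.
KEY="${KEY:-$HOME/.ssh/my-key.pem}"
if key_is_private "$KEY"; then
  echo "OK: $KEY has mode 600"
else
  echo "Check $KEY: file is missing or its permissions are not 600"
fi
```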

Once the instance status is “running”, we can now login via SSH.


Step 7: SSH login to our EC2 to begin downloading software

We will connect to the EC2 instance via SSH on the terminal. Please open your terminal and copy/paste the below, and replace the content inside < > with your instance info.

ssh -i <path-to-pem-file> ubuntu@<PublicDNS>

When you’re prompted to confirm the connection, type yes.

Success! We are finally logged into our AWS EC2 instance. Before I get into the script of downloading the necessary software, I want to show how to create swap memory to stay within free tier instances.


Step 8: Creating Swap Memory and Why It’s Necessary

Due to the limited RAM in the free tier t2.micro instance, we need to create and enable swap memory to install certain packages without exhausting the RAM (and to avoid having to select a larger instance outside of the free tier). If you plan on using an instance larger than t2.medium, you do not need to do this and can skip to Step 9.

Some R packages like tidyverse need more than 1GB of memory to install successfully, and swap memory gives us the extra memory needed. In simple terms, swap “borrows” some space from the disk when the RAM is fully utilized. Copy and paste the below to set 2GB of swap memory (do not set swap to more than 2x the size of RAM).

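# Create a swap file to avoid needing a bigger instance
sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=2048
# Restrict the swap file to root; mkswap warns about looser permissions
sudo chmod 600 /var/swap.1
sudo /sbin/mkswap /var/swap.1
sudo /sbin/swapon /var/swap.1
# Re-enable the swap file on every boot
sudo sh -c 'echo "/var/swap.1 swap swap defaults 0 0" >> /etc/fstab'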
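The 2x rule of thumb can be checked against the RAM the instance actually reports. This is a rough sketch with hypothetical helper names: it reads MemTotal from /proc/meminfo, which on a nominal 1 GiB instance is slightly below 1024 MB because the kernel reserves some memory.

```shell
# Total RAM in MB, parsed from a meminfo-style file (defaults to /proc/meminfo).
ram_mb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Upper bound on swap size in MB under the 2x-RAM rule of thumb.
max_swap_mb() {
  echo $(( $1 * 2 ))
}

if [ -r /proc/meminfo ]; then
  echo "RAM: $(ram_mb) MB, swap ceiling: $(max_swap_mb "$(ram_mb)") MB"
fi

# Once the swap file is enabled, these list active swap areas and memory totals:
#   swapon --show
#   free -h
```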


Step 9: Downloading R, RStudio, and System Dependencies

Now it is time to install all of the necessary software, libraries, packages, etc. It is important to use the most recent version of RStudio Server, which you can find here. Modify the script to install the most recent version if the one used below is outdated. Copy/paste the below into your terminal.

# The Ubuntu repos contain an outdated version of R, so let's add the CRAN repo to avoid errors
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'

# Update ubuntu package repo, to get latest R
sudo apt update

# Install R
sudo apt -y install r-base r-base-dev

# Install shiny before shiny-server
sudo R -e "install.packages('shiny', repos='http://cran.rstudio.com')"

# Install debian package manager, gdebi (-y skips the confirmation prompt)
sudo apt -y install gdebi-core


# Install Shiny Server 
wget https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-1.5.9.923-amd64.deb
sudo gdebi -n shiny-server-1.5.9.923-amd64.deb
sudo rm shiny-server-1.5.9.923-amd64.deb

# Dependencies for R packages like RMariaDB, devtools, tidyverse, sparklyr. Please run these separately.
sudo apt -y install libcurl4-openssl-dev 

sudo apt -y install libssl-dev libxml2-dev libmariadbclient-dev build-essential libcurl4-gnutls-dev


# Install RStudio
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.2.1335-amd64.deb
sudo gdebi -n rstudio-server-1.2.1335-amd64.deb
sudo rm rstudio-server-1.2.1335-amd64.deb

Running R & Installing Packages

Now let’s put that swap memory to use! Let’s install some useful packages, including some of the larger packages that would give us issues if the swap memory isn’t set. It may take a while for some packages like “tidyverse”.

# Install some useful R Packages (be a little patient here)
sudo R -e "install.packages('RCurl', repos='http://cran.rstudio.com')"
sudo R -e "install.packages('devtools', repos='http://cran.rstudio.com')"
sudo R -e "install.packages('tidyverse', repos='http://cran.rstudio.com')"
sudo R -e "install.packages('RMariaDB', repos='http://cran.rstudio.com')"

Setting RStudio login credentials

Now let’s quickly add the RStudio user credentials and install Java to ensure RStudio has all the necessary libraries to work properly.

# Add user info to login RStudio
sudo adduser rstudio

#Add rstudio to sudo group
sudo usermod -aG sudo rstudio

Install Java

Install Java for RStudio and reconfigure the library paths:

# Install Java and reconfigure in R for RStudio use
sudo apt -y install default-jdk
sudo R CMD javareconf


# Change permissions for R library
sudo chmod 777 -R /usr/local/lib/R/site-library

Ports for RStudio + Shiny

Finally, let’s decide whether to change the port for RStudio. If you plan on running a Shiny app on this instance, you will want to leave RStudio on port 8787 and run Shiny on port 80.

As a reminder, port 80 is the default HTTP port, which lets users visit http://domain.com instead of http://domain.com:8787.

I will show how to install and use nginx in order to run both Shiny & RStudio on port 80 in a future post.

For this post, I am going to leave RStudio on port 8787 but the command to switch it to port 80 is below (or simply just open and edit the .conf file). I am going to leave it on port 8787 because in a future post, nginx will be able to route ports like 8787 and 3838 to port 80.

# ****VERY IMPORTANT***
#If you plan on running a shiny app on port 80, then you need to set a reverse proxy to use one port for both. 
# sudo applies to tee, which appends the line to the root-owned config file
echo "www-port=80" | sudo tee -a /etc/rstudio/rserver.conf

#Restart rstudio-server to incorporate changes made in rserver.conf
sudo rstudio-server restart
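Appending to the config file adds a new line every time the command is run, so switching ports back and forth can leave conflicting entries. One possible way to avoid that is a small helper that replaces an existing key instead of duplicating it. The helper name is hypothetical and the demo runs on a temporary stand-in file; editing /etc/rstudio/rserver.conf itself would need root.

```shell
# Set key=value in a config file, replacing any existing line for that key
# instead of appending a duplicate.
# usage: set_conf_key <file> <key> <value>
set_conf_key() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  # Keep every line except an existing "key=..." line, then add the new one
  grep -v "^${key}=" "$file" > "$tmp" || true
  echo "${key}=${value}" >> "$tmp"
  # Overwrite in place so the original file keeps its owner and permissions
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

conf="$(mktemp)"                       # stand-in for /etc/rstudio/rserver.conf
set_conf_key "$conf" www-port 80
set_conf_key "$conf" www-port 8787     # replaces the earlier line
cat "$conf"                            # prints www-port=8787
rm -f "$conf"
```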

Great, everything required to log in to RStudio from our browser is now installed on our EC2 instance. If you did the short way first, I am sure you now understand why I called this the “Detailed, Longer Way”; if you haven’t, go ahead and try the short and easy way.
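Before moving to the browser, a quick sanity check over SSH can confirm that the main tools ended up on the PATH. This is only a sketch; the command names are assumed from the packages installed in Step 9.

```shell
# Report whether each named command is on the PATH; return non-zero if any is missing.
check_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "found:   $cmd"
    else
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Command names assumed from the install steps above
check_commands R Rscript rstudio-server gdebi java \
  || echo "Some tools are missing; re-run the matching install step."
```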


Step 10: Logging into RStudio + Viewing Shiny

Open a web browser and enter your Public DNS (IPv4) as part of the URL to log in to RStudio Server. If you kept RStudio on port 8787 (which I did in this post), then add :8787 to the end of your URL like below. The Shiny Server we also installed is living on its default port, 3838. We will visit it as well.
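As a small illustration of how the port ends up in the address, the sketch below builds the two URLs from a hostname. The hostname is a placeholder for your instance’s Public DNS, and the helper name is hypothetical.

```shell
# Build the browser URL for a service; port 80 is the HTTP default and can be omitted.
# usage: service_url <host> <port>
service_url() {
  local host="$1" port="$2"
  if [ "$port" = "80" ]; then
    echo "http://${host}"
  else
    echo "http://${host}:${port}"
  fi
}

# Placeholder hostname; substitute your instance's Public DNS
HOST="ec2-203-0-113-7.compute-1.amazonaws.com"
echo "RStudio: $(service_url "$HOST" 8787)"
echo "Shiny:   $(service_url "$HOST" 3838)"

# To check reachability from a terminal instead of a browser:
#   curl -sI "$(service_url "$HOST" 8787)"
```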


RStudio

For login credentials, use the newly created credentials for the user rstudio.

RStudio seems to be happy at home! How about Shiny?

Shiny

Both RStudio and Shiny are deployed to their default ports! In an upcoming post, I will show how we can host both RStudio & Shiny on port 80 using nginx. Now we can do our data analysis on any browser, anywhere, at any scale. AWS brings us a new beneficial way to use RStudio!


Optional Steps

Step 11: Installing and Using MySQL in RStudio’s Terminal

One of my favorite things in RStudio is the built-in system terminal, which allows us to install and use MySQL without ever leaving RStudio. We can access and create databases, tables, etc. all within our RStudio session from anywhere. Click the Terminal tab and copy/paste the below to install and use MySQL in RStudio:

# Install MySQL
wget https://dev.mysql.com/get/mysql-apt-config_0.8.12-1_all.deb
sudo gdebi -n mysql-apt-config_0.8.12-1_all.deb

#To be sure everything is up to date
sudo apt update
sudo apt upgrade

I included the update/upgrade commands above to ensure you’re installing the most up-to-date version of MySQL, but this is optional.

Now install mysql-server and when prompted for a password, please create one and remember it for when you login to your mysql-server.

# Install MySQL server and client (create your password when prompted)
sudo apt install mysql-server mysql-client
sudo rm mysql-apt-config_0.8.12-1_all.deb

Now login with the password we created to see that everything went well!

# Login to MySQL inside the RStudio Terminal
sudo mysql -p

#It works!
exit;


Step 12: Checking if an Installed Package Works

RStudio and MySQL are working well, so let’s see if the packages we installed earlier are also working. We should see a data table if everything went as planned.

library(RCurl) 
data <- read.table(textConnection(getURL("https://cgiardata.s3-us-west-2.amazonaws.com/ccafs/amzn.csv")), sep=",", header=FALSE)
head(data)
##          V1     V2      V3     V4      V5      V6
## 1 1/15/2014 395.87 2677150 398.94 399.310 392.534
## 2 1/14/2014 397.54 2339458 392.13 398.630 391.290
## 3 1/13/2014 390.98 2843810 397.98 399.780 388.450
## 4 1/10/2014 397.66 2678085 402.53 403.764 393.800
## 5  1/9/2014 401.01 2103029 403.71 406.890 398.440
## 6  1/8/2014 401.92 2316220 398.47 403.000 396.040

Perfect! The “RCurl” package installed correctly or else we would have gotten an error here. You can continue to play around here to see that everything else installed correctly.


Wrapping it up

As I mentioned earlier, we can now do our data analysis (and access MySQL in RStudio if you did the optional steps) on any browser, anywhere, and at any scale. This post was a lot to take in, especially if you decided to try both the short and easy way and the detailed, longer way. I’ve decided to save some content for a separate post, such as setting up SSL (HTTPS) on port 443 for RStudio and Shiny instead of unencrypted HTTP, in addition to deploying friendly Shiny apps on AWS!

Thanks for reading if you made it this far, and I hope you can now deploy AWS EC2 instances with R, RStudio, and optionally MySQL for free!