Big Data Step-by-Step
                              Boston Predictive Analytics
                                 Big Data Workshop
                                Microsoft New England Research &
                               Development Center, Cambridge, MA
                                    Saturday, March 10, 2012



                                                            by Jeffrey Breen

                                                        President and Co-Founder
         http://atms.gr/bigdata0310                              Atmosphere Research Group
                                                      email: jeffrey@atmosgrp.com
                                                             Twitter: @JeffreyBreen

Saturday, March 10, 2012
             just need a little more RAM
                   Big Data Infrastructure
                           Part 2: Running R + RStudio on Amazon EC2




    Code & more on github:
    https://github.com/jeffreybreen/tutorial-201203-big-data

Overview

                    • Sometimes you just need a little more
                           RAM, CPU, or disk space than you have
                    • Let’s try launching an instance on Amazon
                           EC2 and configuring it to do some work
                    • We’ll install R and RStudio and call it a day


Some details we’ll skip
                    • Signing up (it’s not that hard)
                           http://aws.amazon.com/ec2/

                    • Pricing (it keeps dropping)
                           http://aws.amazon.com/ec2/pricing/

                    • The alphabet soup of services (we care
                           about EC2 computing and S3 storage)



Just look for the biggest button on the page...




Select an Amazon Machine Image
                ami-7385461a is a good, recent 64-bit CentOS
                image published by RightScale




Only use EBS images
                • Instance-storage machines lose their data
                      upon shutdown (termination)
                • EBS instances can be stopped and restarted,
                      or terminated when you’re done forever




Pick a size
                See http://aws.amazon.com/ec2/instance-types/




                                          Already out of date! Amazon introduced a
                                          new “m1.medium” instance type this week.




Avoid Premature Termination
                Set Termination Protection + Shutdown Behavior




Name your instance




Create a key pair
                Don’t forget to download it (and keep it safe!)




Create a Security Group
                All TCP, UDP and ICMP from your IP address




Don’t know your IP address?
                Don’t ask me. Ask Google!




                (simply append “/32” when entering into firewall rules)
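A minimal sketch of that last step in shell, assuming you already know your public IPv4 address. The function name to_cidr and the sample address (taken from the 203.0.113.0/24 documentation range) are illustrative and not part of the workshop materials.

```shell
#!/bin/sh
# to_cidr: hypothetical helper that turns a single IPv4 address into the
# /32 CIDR form used when entering firewall (security group) rules.
to_cidr() {
    ip="$1"
    # Loose sanity check: four dot-separated groups of 1-3 digits.
    # It does not verify that each octet is <= 255.
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    printf '%s/32\n' "$ip"
}

to_cidr 203.0.113.7   # prints 203.0.113.7/32
```

The /32 suffix means "this one address only", so the rule admits traffic from your machine and nothing else.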



3... 2... 1...




State = running
                Up and running at the specified domain name




Time to get all command line
                    • You’ll need an ssh client and the key pair we
                           generated in order to connect with your
                           instance
                    • We’ll use the Cloudera VM to control versions,
                           options, etc.
                    • ssh won’t use your key pair if its file permissions
                           are too lax
                           $ chmod og-rwx rstudio-ec2.pem


                    • Log in as root at your instance’s domain name
                           $ ssh -i rstudio-ec2.pem root@YOURDOMAINHERE.amazonaws.com
                                                        (from previous slide)
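One way to confirm the permissions before connecting, sketched in shell. The helper name key_is_private is hypothetical, and the demonstration runs against a throwaway temporary file rather than a real key.

```shell
#!/bin/sh
# key_is_private: hypothetical helper; succeeds only if the file grants no
# permissions to group or others (the state `chmod og-rwx` leaves behind).
key_is_private() {
    # In `ls -l` output the mode string looks like -rw-------;
    # characters 5-10 are the group and other permission bits.
    mode=$(ls -ld "$1" | cut -c5-10)
    [ "$mode" = "------" ]
}

# Demonstrate on a temporary file standing in for rstudio-ec2.pem.
demo_key=$(mktemp)
chmod 644 "$demo_key"
key_is_private "$demo_key" || echo "too lax: fix with chmod og-rwx"
chmod og-rwx "$demo_key"
key_is_private "$demo_key" && echo "ok: group and others have no access"
rm -f "$demo_key"
```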



Install R and RStudio
                • Create a user login for yourself (RStudio needs this)
                      # useradd jbreen
                      # passwd jbreen


                • EPEL is already installed, so R is easy
                      # yum -y install R


                • Follow RStudio’s download instructions
                      http://www.rstudio.org/download/server
                      # wget http://download2.rstudio.org/rstudio-server-0.95.262-x86_64.rpm

                      # rpm -Uvh rstudio-server-0.95.262-x86_64.rpm


                • Browse to port 8787 and log in with the username and password you just created
                      e.g., http://ec2-107-22-109-130.compute-1.amazonaws.com:8787/
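A small sketch for building that address from your instance's public DNS name. The helper name rstudio_url is hypothetical; the default port 8787 and the sample hostname come from the slide.

```shell
#!/bin/sh
# rstudio_url: hypothetical helper that builds the RStudio Server address
# from a hostname and an optional port (defaults to 8787).
rstudio_url() {
    host="$1"
    port="${2:-8787}"
    printf 'http://%s:%s/\n' "$host" "$port"
}

rstudio_url ec2-107-22-109-130.compute-1.amazonaws.com
```

Substitute your own instance's domain name; the hostname shown belongs to the instance used in the workshop.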



Success!




The meter’s running
                    • Amazon charges by the hour (or fraction
                           thereof), so when you’re done, you should
                           probably shut down
                    • via command line
                           $ sudo shutdown -h now


                    • or with the “Stop” Instance Action in the
                           AWS Management Console
                    • (use “Terminate” if you never want to use it
                           again)
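To see what "by the hour (or fraction thereof)" means for a bill, here is a rough sketch in shell. The function names are hypothetical, and the rate of 10 cents per hour is a made-up placeholder rather than a real EC2 price.

```shell
#!/bin/sh
# billed_hours: hypothetical helper; rounds a running time in minutes up
# to whole hours, since any fraction of an hour is charged as a full hour.
billed_hours() {
    minutes="$1"
    echo $(( (minutes + 59) / 60 ))
}

# estimate_cost_cents: billed hours times an hourly rate given in cents.
# The rate passed in is an assumption; check the pricing page for real numbers.
estimate_cost_cents() {
    echo $(( $(billed_hours "$1") * $2 ))
}

billed_hours 61             # prints 2: 61 minutes is billed as 2 hours
estimate_cost_cents 61 10   # prints 20: 2 hours at a placeholder 10 cents/hour
```

Running one minute past the hour costs the same as running the full second hour, which is why stopping promptly matters.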


Next up:
                           How to launch Hadoop
                            clusters in the cloud
                            without really trying




Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
