The Ultimate R Guide For Data Science
Making your Data Science stronger with R.
If you have a strong will, you can certainly become a successful Data Scientist and achieve your goal through self-education. This path, however, is not the easiest one. The first problem is that you will probably run into procrastination, and the second is that you have to deal with informational chaos by yourself.
While solving the first problem depends only on you, today I can take care of the second one. In this post, I will showcase the basic steps for learning R from scratch without going into the rules, syntax, and specific areas of application. Let’s be honest, one guide would not be enough for all of that! Here I will just highlight the most effective and structured learning path as I see it. You can use it to organize your learning routine.
Without further ado, let's jump right in!
Why Learn R?
Motivation is the key to success. If you do not know why you are learning something, you will have no motivation to keep going.
So, what is R?
Developed by Ross Ihaka and Robert Gentleman in 1993, R is widely used for applications related to data science. It handles statistics and supports an extensive suite of inference techniques, machine learning algorithms, time series analysis, data analytics, and graphical plots, to list a few. In other words, it is used for a whole spectrum of tasks, but it is especially good for data exploration and investigation.
Why do you need to learn R?
Do you need to learn R? Perhaps not; need is a strong word. But is R a valuable tool for data analytics? Certainly. The language was expressly designed to reflect the way that statisticians think and work. R reinforces good habits and sound analysis.
For me personally, it's the right tool for the data science job. Why so?
Firstly, R has a rather narrow and specific purpose. It was developed directly for working with data. Python has a broader purpose, and although you can work with data using it too, doing so is not as convenient.
For example, pandas, the most popular data manipulation library for Python, took much of its inspiration from R's data frames. Python is widely used in web programming, as well as for solving a huge range of other tasks. It is more universal, but when you start studying, it is worth deciding whether you need the entire arsenal.
Secondly, in my view R is the most powerful data visualization tool: neither Python nor any BI platform can compare with it in this field. The most popular package for data visualization in R is ggplot2, which Hadley Wickham began developing back in 2005.
LeaRning Path: Step by Step
So, here we get to the main path. These guidelines will help you gain knowledge quickly, efficiently, and without stress. The steps are detailed below; just follow the sequence and do not rush ahead.
Think twice, code once.
Step 1. Gentle Introduction to R and Setting up Your Machine
This step can be completed in less than a day, especially if you have a good introductory textbook.
For me, the best starting point in learning any programming language is to gain a superficial understanding of the community culture and the software environment.
To learn R effectively, start with sites like R-project, CRAN (the R package repository), and Crantastic.
It's also a good idea to read some beginner's guides:
An Introduction to R - Notes on R: A Programming Environment for Data Analysis and Graphics
Your goal here is not to cling to every detail, but to sketch a general picture of the path ahead.
The next task is installation.
The easiest way to set up R is to download it from CRAN ( https://cloud.r-project.org ), the Comprehensive R Archive Network. You can choose between binaries for Linux, Mac, and Windows.
CRAN comprises a set of mirror servers distributed around the world and is used to distribute R and R packages. You need to install R, RStudio, and packages like Rcmdr, rattle, and Deducer. Install the packages with the install.packages() command, load each one with the library() command, and open these GUIs one by one.
For convenient work, it is worth installing an integrated development environment (IDE) for R with a graphical user interface, such as RStudio. It can be downloaded from http://www.rstudio.com/download.
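To make the difference between installing and loading a package concrete, here is a minimal sketch. The helper names `missing_packages` and `ensure_packages` are my own, invented for illustration; `requireNamespace()`, `install.packages()`, and `library()` are standard R functions, and the mirror URL is the CRAN cloud mirror.

```r
# Hypothetical helper (not part of R): which of these packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing, from the given CRAN mirror.
ensure_packages <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)  # downloads and installs
  }
  invisible(to_install)
}

# Typical use (commented out so that nothing is downloaded by accident):
# ensure_packages(c("Rcmdr", "rattle"))
# library(rattle)   # library() only loads a package that is already installed
```

Installing is done once per machine, while `library()` has to be called in every new R session where you want to use the package.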
Step 2. Learn the Basic Syntax
And into the woods we go, on to the more exciting part.
It’s worth lingering here for a couple of weeks. This stage is essentially dry theory, but it provides the basis for successful experience later on.
After installing R, you need to understand the basics of the syntax and how libraries and data structures work. This rote learning is a bit boring, so I would recommend spending as little time on it as possible and instead learning as much of the syntax as you can while working on real-world problems.
Remember, you can always refer to a variety of resources for learning and double-checking syntax if you get stuck later.
The easiest way to get started is with online courses. Look for courses that tackle problems from real projects and walk through the solution at every stage.
Here are some great online courses:
- Introduction to R by DataCamp - a great course for beginners that covers the basics of data analysis and manipulating common data structures such as vectors, matrices, and data frames.
- Intermediate R by DataCamp - the follow-up to the previous course. Here you will master conditional statements, loops, and vector functions.
- Codecademy - also does a good job of teaching basic syntax.
- Dataquest: Introduction to R Programming - a free course that gets you working with real-world data and real data science problems right off the bat.
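To give a taste of what these introductory courses cover, here is a small sketch in base R that touches vectors, matrices, data frames, a conditional, a loop, and a vectorized operation. The data and variable names are made up purely for illustration.

```r
# Vector: an ordered collection of values of one type
scores <- c(72, 85, 91, 64)

# Matrix: a two-dimensional arrangement of values of one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: a table whose columns may have different types
students <- data.frame(
  name  = c("Ann", "Ben", "Cara", "Dan"),
  score = scores
)

# Conditional statement
if (mean(scores) > 75) {
  verdict <- "above target"
} else {
  verdict <- "below target"
}

# Loop: build a text label for each student, one row at a time
labels <- character(nrow(students))
for (i in seq_len(nrow(students))) {
  labels[i] <- paste(students$name[i], students$score[i])
}

# Vectorized alternative: operate on the whole column at once, no loop needed
students$passed <- students$score >= 70
```

The last line is the more idiomatic R style: most functions and operators work on entire vectors, so explicit loops are needed less often than in many other languages.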
Some books are also worth your attention. Don't be put off by the length of this list. You can start by reading just a few of them and, over time, fill in the gaps with the others:
- R for Data Science - Textbook that’s available in print from O’Reilly or for free online.
- Introductory Statistics with R by P. Dalgaard - the book is especially good for those who begin to study not only the R language but also statistics in general
- An R Companion to Applied Regression by John Fox - a book on regression models, written at the same level as the previous one
- Data Analysis and Graphics Using R: An Example-based Approach by J Maindonald, JW Braun - a somewhat more complicated book, which nevertheless provides a broad overview of statistical methods implemented using R and has many examples
- Data Analysis Using Regression and Multilevel / Hierarchical Models by Gelman A, Hill J - book on regression analysis, including mixed effect models
- Modern Applied Statistics with S (Statistics and Computing) by Venables WN, Ripley BD - a must-have for every data scientist
- Data Manipulation with R by P. Spector - a brief but very useful introduction to R data structures and the basic commands that are used to manage data
- R in a Nutshell by J. Adler - another introductory tutorial on R
- R Cookbook by P. Teetor - as the name suggests, this is a collection of "R-recipes"; useful and very practical book
Blogs are also an excellent source of interesting and useful examples of R code. I recommend paying special attention to David Smith's blog, Quick R, and R-Bloggers.
At this step, create a GitHub account at http://github.com and learn to troubleshoot package installation.
Step 3. Data Manipulation, Data Visualization, Machine Learning
Remember, you should focus more on the process and methodology than on the syntax of the language. You need to learn to think about how to solve problems. You need to learn how to extract useful insights from data.
To do this, you will have to master three areas well: data manipulation, data visualization, and machine learning. Learning these skills in the R environment will be much easier than working with (almost) any other language.
Data manipulation
It is often said that data scientists spend 70% to 80% or more of their time just preparing the data. More often than you would like, you have to spend a considerable amount of time on all kinds of transformations of your data.
R has some of the best tools for this. The dplyr package in R makes data manipulation simple. By building a few simple dplyr commands in a chain, you can greatly simplify your data transformation work.
So, you need to read about and practice working with packages like dplyr, tidyr, and data.table. You can learn how to get data into R in the first place with this “Importing Data Into R” course. Also, see this Data Wrangling with R video by RStudio.
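Here is a minimal sketch of what such a chain of dplyr commands can look like. It assumes the dplyr package is installed and uses mtcars, a small dataset that ships with R; the column name `kpl`, the 100 hp threshold, and the rounded conversion factor are my own choices for illustration.

```r
library(dplyr)

# Each step takes the result of the previous one via the pipe operator %>%
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                      # keep the more powerful cars
  mutate(kpl = mpg * 0.4251) %>%            # miles per gallon -> km per litre (approx.)
  group_by(cyl) %>%                         # one group per cylinder count
  summarise(n = n(), mean_kpl = mean(kpl)) %>%
  arrange(desc(mean_kpl))                   # most economical group first
```

Read top to bottom, the chain says what happens to the data in plain order: filter, add a column, group, summarise, sort. That readability is the main reason the chained style is popular.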
Data visualization
Visualizing data is as much an art as it is a skill. Great reads on this are Edward Tufte's principles for visualizing quantitative data and Stephen Few's pitfalls of dashboard design.
The data visualization process has a deep inner structure. This well-structured system, the grammar of graphics, shapes how we think when creating statistical graphs, and ggplot2 is based on it. Take this ggplot2 tutorial to understand it better.
What makes ggplot2 so special is that as you study its syntax, you also learn to think about the process of data visualization itself.
In addition, when you combine ggplot2 and dplyr (by chaining commands one after another), detecting patterns in the data becomes far less difficult.
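As a rough illustration of that combination, the sketch below pipes a dplyr step straight into a ggplot2 plot. It assumes both packages are installed and again uses the built-in mtcars dataset; the choice of variables and labels is mine.

```r
library(dplyr)
library(ggplot2)

p <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%                    # treat cylinder count as a category
  ggplot(aes(x = wt, y = mpg, colour = cyl)) +     # map columns to visual properties
  geom_point() +                                   # layer 1: one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # layer 2: a linear trend per group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
```

Printing `p` (or just typing its name in the console) draws the plot. Notice how the code mirrors the way you would describe the chart in words: which data, which variables go on which axis, and which layers sit on top.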
Machine learning
Finally, machine learning. Although I think that most beginners should not rush into the study of machine learning methods (it is much more important to learn how to perform exploratory data analysis first), knowledge of these methods is very important. When exploratory data analysis stops yielding new useful information, you will need more powerful tools.
When you are ready to learn machine learning techniques, R will provide you with some of the best tools and training materials.
One of the best and most widely cited introductory books on machine learning, An Introduction to Statistical Learning, presents its methods with examples in R. In addition, a course from Stanford University is based on this book and is therefore also taught using R.
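To show how little code a first model needs, here is a minimal sketch in base R: a linear regression fitted on one part of the data and checked on the rest. This is my own toy example on the built-in mtcars dataset, not an excerpt from the book or the course; the 30% hold-out share and the choice of predictors are arbitrary assumptions.

```r
set.seed(42)  # make the random split reproducible

# Hold out roughly 30% of the rows for testing
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.3 * n))
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Fit a linear regression on the training rows only
fit <- lm(mpg ~ wt + hp, data = train)

# Evaluate on the unseen rows with root mean squared error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
```

Calling `summary(fit)` shows the estimated coefficients, and `rmse` tells you how far off the predictions are, on average, for cars the model has never seen.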
Step 4. Build Projects on Your Own
No doubt, you will not succeed without practice. Take project-oriented courses and try to recreate the programs and sites dissected in them. Look for lectures on YouTube that discuss the kind of projects you would like to develop. First, copy other people's work and analyze it. Then try to move away from the original: experiment and change individual elements until you can create something unique.
Be sure to set a goal to create your project and constantly work on it.
This will help you consolidate the knowledge you have gained and understand what you are still missing. Your skills will develop along with the project. When you finish it, move on to new, more complex problems.
If you run into difficulties while learning or developing, you can always turn to a programming community like Stack Overflow with any question. It can help you solve problems, choose a good course, point out errors in your code, and more.
Final Word: Learn R and Don't Get Hung up on Other Things
R is becoming a universal language in data science. This does not mean that it is the only language, or that it is equally well suited to every task. However, it is certainly one of the most widely used tools, and its popularity is only growing with time.
If you are at the very beginning of your leaRning journey, R will almost certainly be the best choice for you. So, all you need to do is focus on mastering the key skills, and I would recommend avoiding the “shiny new subject” syndrome (i.e., do not jump to every newly hyped tool).
Very likely, you will come across examples of work performed using new techniques and tools. Remember at least some of those dizzying examples of data visualization that you must have seen! Seeing the great work of other people (and finding out that they used some other tools), you might want to try something new. But believe me: you need to focus. It will take several months (or longer) to master one tool well.
And, as I noted above, you seriously need to master the basic skills of data science. At the very beginning, you should have significant skills in manipulating and visualizing data. Before you move on, you must be able to perform serious exploratory data analysis with R.
Ultimately, the result will be much better.
……………………………..
As always, if you do anything cool with this information, leave a response in the comments below or reach out at any time on my Instagram and Medium blog. Thanks for reading!