Data Analytics Lesson 10 Notes
Data wrangling
Contents
Lesson outcomes
Introduction
Data cleaning
Described in R
Merging datasets
Conclusion
Additional resources
References
DATA ANALYTICS
Lesson outcomes
By the end of this lesson, you should be able to:
Introduction
According to Carl Howe, director of education at RStudio, one of the key skills for data analysts is being able to produce
reproducible code so that others can build on what you have done. Howe notes that, in reality, many businesses run on
Excel data, and that a good data analyst can build a dialogue with end users and find ways to work with them, which in
many businesses means working with Excel data. The problem with Excel, he says, is that it guarantees that you are doing
the analysis manually, and that is not a recipe for creating reproducible results. Howe also states that most analytics is
not a one-off but something done repeatedly, and that you always want to do it in the same way. The trick, usually, is to
get the data out of Excel as quickly as possible and put it into a form more amenable to reproducible analysis.
Data cleaning
Data cleaning is one of the most important steps in the data analysis process. Much of the data analyst’s journey focuses
on extracting useful insights from data, but to yield valuable insights the data has to be treasure, not trash. Multiple
sources state that the data cleaning step can take up to 80% of the data analyst’s time. It is therefore important to find
ways of cleaning the data as efficiently and effectively as possible.
Package description
• The basic information about a package is provided in its description file. Here you can find out what the package
can be used for, who the author is, which version the documentation belongs to, when the package was created and
updated, whether the package depends on other packages, and so forth.
• packageDescription("package")
• help(package = "package")
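The two calls above can be tried on any installed package. The following is a minimal sketch that uses the stats package purely as a stand-in, on the assumption that it is available because it ships with R. Any other installed package name could be substituted.

```r
# Read the DESCRIPTION metadata of an installed package.
# "stats" is only an illustrative choice; it is bundled with R.
desc <- packageDescription("stats")

# Individual fields are accessed like list elements.
desc$Package   # the package name
desc$Version   # the version the documentation belongs to
desc$Title     # a one-line summary of what the package is for

# Request only selected fields instead of the whole description.
packageDescription("stats", fields = c("Title", "Version"))

# Open the package help index (description plus documented topics).
# Guarded so that it only runs in an interactive session.
if (interactive()) help(package = "stats")
```

If a package is not installed, packageDescription() returns NA with a warning, which makes it a quick way to check whether a package is available.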
Tidyverse
Tidyverse is a collection of packages that transform, organise, and visualize data. The packages are designed to simplify
data analysis processes such as tidying a dataset, and are developed by Hadley Wickham, chief data scientist at RStudio.
The packages work together, unlike tools that require translation between packages. The idea is to make tidying data as
interesting and effective as possible with the help of the collective packages.
(Akshat, 2019)
o dplyr functions include select, filter, group_by, and the join functions, to name a few.
o The goal of tidyr is to tidy data into a consistent form where every column is a variable, every row is
an observation, and every cell is a single value.
o readr is used to import data from rectangular structures such as CSV files.
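The three packages listed above are typically used together. The sketch below is one possible illustration, not a prescribed workflow. The student names and scores are invented for the example, and the use of I() and show_col_types assumes readr version 2.0 or later.

```r
library(readr)
library(tidyr)
library(dplyr)

# readr: import rectangular data. I() marks the string as literal CSV
# text rather than a file path (the values are made up for illustration).
csv_text <- "name,test1,test2\nAmy,70,80\nBen,55,65\n"
scores <- read_csv(I(csv_text), show_col_types = FALSE)

# tidyr: reshape so that every row is one observation (a student on a
# test) and every column is one variable.
long <- pivot_longer(scores,
                     cols = c(test1, test2),
                     names_to = "test",
                     values_to = "score")

# dplyr: keep scores of at least 60, then average them per student.
result <- long %>%
  filter(score >= 60) %>%
  group_by(name) %>%
  summarise(mean_score = mean(score))

print(long)
print(result)
```

Because every function takes a data frame and returns a data frame, the steps can be chained with the pipe without converting between formats.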
Data frames
• Most packages in R read data into a data frame.
• The function data.frame() creates data frames.
• A data frame can be likened to a table with rows and columns.
• A data frame has a format similar to that of an SQL table.
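To make the points above concrete, here is a minimal sketch that builds a small data frame with data.frame() and inspects it. The column names and values are invented for illustration and do not come from a real dataset.

```r
# Each argument becomes a column; all columns must have the same length.
passengers <- data.frame(
  id       = c(1, 2, 3),
  age      = c(29, 22, 38),
  survived = c(TRUE, FALSE, TRUE)
)

# Inspect the structure: one line per column, with its type.
str(passengers)

# Rows and columns, as in a table.
nrow(passengers)
ncol(passengers)

# Select rows by a condition and columns by name,
# comparable to a WHERE clause and a column list in SQL.
older <- passengers[passengers$age > 25, c("id", "age")]
print(older)
```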
Described in R
Recap Titanic
In module 1 we used the Data Analysis ToolPak to explore and describe the Titanic dataset. Using this add-on feature, we
found the minimum, maximum, mean, standard deviation, and skewness, to name but a few, of the various variables in the
dataset. We also created graphs in Excel to better understand the distribution of the variables in the dataset. Let us now
explore the tidyverse library further to see how it measures up against Excel’s data analysis process.
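The descriptive statistics named above can also be computed in R. The sketch below uses a short vector of invented ages rather than the actual Titanic data. Base R has no built-in skewness function, so the helper sample_skewness is a hypothetical name defined here, using the adjusted sample formula that Excel's SKEW function is documented to use.

```r
# Invented ages, for illustration only (not taken from the Titanic data).
ages <- c(22, 38, 26, 35, 35, 54, 2, 27)

min(ages)
max(ages)
mean(ages)
sd(ages)        # sample standard deviation (divides by n - 1)
summary(ages)   # minimum, quartiles, median, mean, and maximum in one call

# Hypothetical helper: adjusted sample skewness,
# n / ((n - 1)(n - 2)) * sum(((x - mean) / sd)^3).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]          # ignore missing values
  n <- length(x)
  n / ((n - 1) * (n - 2)) * sum(((x - mean(x)) / sd(x))^3)
}

skew <- sample_skewness(ages)
print(skew)
```

Because the calculation is written as code, it can be rerun unchanged whenever the data is updated, which is the reproducibility argument made in the introduction.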
ggplot2
This package is over 10 years old and is used by thousands of people. ggplot2 is a visualization package that is used to
create plots, charts, and other useful visualizations.
It contains many useful functions, from building box plots and density plots to time series plots. ggplot2 is part of the
tidyverse library and is loaded when we run library(tidyverse) to access the tidyverse packages.
A useful resource that showcases many interesting plots built with this package can be found at:
https://exts.ggplot2.tidyverse.org/gallery/
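As a minimal sketch of the box plots mentioned above, the code below uses the mpg dataset that is bundled with ggplot2. That dataset is an assumption made for convenience and is not the dataset used in this lesson.

```r
library(ggplot2)

# Start from a data frame and map columns to the x and y axes,
# then add a layer that draws one box per vehicle class.
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway fuel economy by vehicle class",
    x = "Vehicle class",
    y = "Highway miles per gallon"
  )

# Printing the plot object draws it on the active graphics device.
print(p)
```

Swapping geom_boxplot() for another layer, for example geom_density() with a single continuous variable, changes the chart type while the rest of the code stays the same.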
Note: A package only needs to be installed once, but we must run the library() command in every new R session so that R
knows which package we want to use.
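The difference between installing and loading can be sketched as follows. The guard with requireNamespace() and the choice of CRAN mirror are assumptions for illustration. Running install.packages("tidyverse") once by hand achieves the same result.

```r
# Install only if the package is not yet present on this machine.
# This normally happens once; the mirror URL is one of several options.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load the package. This line is needed in every new R session.
library(tidyverse)

# The core tidyverse packages are now attached to the session.
"package:dplyr" %in% search()
```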
• RStudio Community is a discussion board where you can ask R-related questions should you get stuck in the
learning process.
• Alternatively, the Stack Overflow community, which some of you might already be familiar with, is also a resource
for answers to questions particular to R. Just remember to tag your question as R on this platform.
Merging datasets
Often the data we want to extract information from lies in multiple sources, and we need to integrate the various datasets.
We use the term merging as an overarching term to describe combining data. Merging means that we combine different
datasets into one data frame. Luckily for us, R is an expert in merging datasets.
dplyr
Datasets can be joined on common columns using R’s merge() function or the dplyr join functions from tidyverse.
Remember: to combine data, there has to be an identical variable in each of the datasets to be combined, one that
uniquely identifies the records to be matched.
We will use the dplyr package for the data manipulation involved in merging datasets. The dplyr package can help us
solve data manipulation challenges through functions like select, filter, group_by, summarise, arrange, mutate, and the
join functions, to name a few.
dplyr uses SQL database syntax for its join functions. In a call such as right_join(x, y), the first data frame, x, is the left
one and the second data frame, y, is the right one. A right join therefore includes all the rows of the right (second) data
frame, but only those rows of the left (first) data frame that match.
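The join behaviour described above can be checked on two tiny data frames. The sketch below uses invented ids, names, and fares, with id as the common key column. It compares dplyr's right_join() with the equivalent call to base R's merge().

```r
library(dplyr)

# Two made-up data frames that share the key column "id".
passengers <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
tickets    <- data.frame(id = c(2, 3, 4), fare = c(7.25, 71.28, 8.05))

# Right join: every row of tickets (the second data frame) is kept.
# id 4 has no match in passengers, so its name is NA.
# id 1 exists only in passengers, so it is dropped.
right <- right_join(passengers, tickets, by = "id")

# For comparison: a left join keeps every row of passengers,
# and an inner join keeps only the ids found in both.
left  <- left_join(passengers, tickets, by = "id")
inner <- inner_join(passengers, tickets, by = "id")

# Base R equivalent of the right join: all.y = TRUE keeps all rows of y.
base_right <- merge(passengers, tickets, by = "id", all.y = TRUE)

print(right)
print(base_right)
```

Checking the number of rows before and after a join is a simple safeguard against accidentally dropping or duplicating records.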
Conclusion
“Education never ends, Watson, it is a series of lessons.” Sherlock Holmes
Additional resources
Grolemund, G. & Wickham, H., 2017, R for Data Science, available online at: https://r4ds.had.co.nz/
References
https://www.tidyverse.org/packages/
https://datascienceplus.com/merging-datasets-with-tidyverse/
https://www.dummies.com/programming/r/how-to-combine-and-merge-data-sets-in-r/
https://medium.com/analytics-vidhya/a-beginners-guide-to-learning-r-with-the-titanic-dataset-a630bc5495a8
https://datascienceplus.com/getting-started-with-dplyr-in-r-using-titanic-dataset/
https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf
https://www.analyticsvidhya.com/blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/
https://www.datacamp.com/community/tutorials/r-packages-guide#what
http://hadley.nz/
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
https://mran.microsoft.com/web/packages/tidyverse/vignettes/manifesto.html
https://tidyr.tidyverse.org/
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame
https://www.kaggle.com/varimp/a-mostly-tidyverse-tour-of-the-titanic
https://support.rstudio.com/hc/en-us/articles/200552336-Getting-Help-with-R
https://www.infoworld.com/article/3454356/how-to-merge-data-in-r-using-r-merge-dplyr-or-datatable.html#:~:text=dplyr%20uses%20SQL%20database%20syntax,left_join(x%2C%20y)%20.