Data Analytics Lesson 10 Notes

The document outlines a lesson on data analytics focusing on data cleaning, merging datasets, and utilizing the tidyverse library in R. It emphasizes the importance of reproducible code, efficient data cleaning, and the advantages of tidy data for analysis. Key packages discussed include ggplot2 for visualization, dplyr for data manipulation, and tidyr for data tidying.

Diploma in Data Analytics

Data wrangling

Contents

Lesson outcomes
Introduction
Data cleaning
Described in R
Merging datasets
Conclusion
Additional resources
References

DATA ANALYTICS

Lesson outcomes
By the end of this lesson, you should be able to:

• Clean data with the tidyverse library
• Reproduce in R the descriptive results achieved in Module 1 with Excel
• Merge datasets

Introduction
According to Carl Howe, director of education at RStudio, being able to produce reproducible code that others can build
on is one of the key skills for data analysts. Howe says that the reality is that many businesses run on Excel data, and
that a good data analyst can build a dialogue with end users and find ways to work with them, which in many businesses
may mean working with Excel data. The problem with Excel, Howe says, is that it guarantees that you are doing the
analysis manually, and that is not a recipe for creating reproducible results. Howe also states that most analytics is
not about doing one-offs but about doing things repeatedly, and always in the same way. The trick is usually to get the
data out of Excel as quickly as possible and put it into a form more amenable to reproducible analysis.

Data cleaning
Data cleaning is one of the most important steps in the data analysis process. Much of the data analyst's journey is
spent extracting useful insights from data, but valuable insights can only come from data that is treasure, not trash.
Multiple sources state that the data cleaning step can take up to 80% of the data analyst's time. Therefore, it is
important to find ways of cleaning the data as efficiently and effectively as possible.

Recap: definition of packages in R

• In lesson 1 we learnt that R uses packages much as Excel uses add-ins like the Data Analysis ToolPak.
• Packages are collections of R functions, compiled code and sample data.
• A package is a way to share code: it bundles the functions, their documentation and any data sets together as
one unit.

Package description
• The basic information about a package is provided in its DESCRIPTION file. Here you can find out what the package
can be used for, who the author is, which version the documentation belongs to, the date the package was created
and updated, whether the package depends on other packages, and so forth.
• packageDescription("package")
• help(package = "package")
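As a minimal sketch of the two calls above, the example below reads the description of the stats package. The choice of stats is an assumption made only because it ships with every R installation; any installed package name can be substituted.

```r
# Read the DESCRIPTION metadata of an installed package.
# "stats" is used purely for illustration because it is always installed.
desc <- packageDescription("stats")

# Individual fields can be accessed by name.
desc$Package   # name of the package
desc$Version   # version of the installed package
desc$Title     # one-line summary of what the package does

# Open the help index for the package (uncomment in an interactive session).
# help(package = "stats")
```

Printing desc on its own shows every field of the description file at once.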

DATA ANALYTICS
4

Why tidy data


• Tidying data is necessary for structuring datasets to facilitate data analysis.
• Tidy data is easier to manipulate, model and visualize.
• Tidy data organizes the data values in the dataset to create a standard way of working that makes the data
analysis process simpler to handle.

Tidyverse
The tidyverse is a collection of R packages for transforming, organising, and visualising data. The packages are
designed to help you tidy a dataset and then work with it.

Because the packages are bundled together, all tidyverse packages can be installed with a single command.
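A minimal sketch of that workflow, assuming an internet connection and a CRAN mirror are available for the one-time installation step (left commented out here so that re-running the script does not reinstall anything):

```r
# Install once per machine; uncomment the line below the first time.
# install.packages("tidyverse")

# Load the tidyverse in every new R session.
library(tidyverse)

# List the packages that belong to the tidyverse.
tidyverse_packages()
```

Loading the tidyverse attaches its core packages, such as ggplot2, dplyr and tidyr, in one step.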

The packages have been designed to simplify data analysis processes such as tidying data, and are developed by Hadley
Wickham, chief data scientist at RStudio.

The packages work together, unlike tools whose output has to be translated before another package can use it. The idea
is to make tidying data as interesting and effective as possible with the help of the collective packages.

Packages that are part of the tidyverse library

(Akshat, 2019)

The core tidyverse packages include:

• ggplot2, for data visualisation.

• dplyr, for data manipulation.

o dplyr functions include select(), filter(), group_by(), and the join functions, to name a few.

• tidyr, for data tidying.

o The goal of tidyr is to put data into a consistent form in which every column is a variable, every row is
an observation, and every cell is a single value.

• readr, for data import.

o readr is used to import data from rectangular file formats such as CSV files.

o It is a quick and easy way to read rectangular data.

• purrr, for functional programming.

• tibble, for tibbles, a modern re-imagining of data frames.

• stringr, for strings.

• forcats, for factors.
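The sketch below shows how readr, tidyr and dplyr fit together on a small made-up table. The column names and values are hypothetical and serve only to illustrate the idea of tidy data; the CSV is supplied as literal text so that no file is needed (this assumes readr 2.0 or later).

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical wide-format data: one column per year, so it is not tidy.
csv_text <- "country,2019,2020
A,10,12
B,20,18"

# readr: import rectangular data (I() marks the string as literal data).
wide <- read_csv(I(csv_text), show_col_types = FALSE)

# tidyr: reshape so that every column is a variable and every row an observation.
long <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

# dplyr: filter rows, group them, and summarise each group.
summary_tbl <- long %>%
  filter(cases > 10) %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))

print(long)
print(summary_tbl)
```

After reshaping, year is an ordinary variable, which is what allows filter() and group_by() to treat every observation in the same way.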

Data frames
• Most packages in R read data into a data frame.
• The function data.frame() generates data frames.
• A data frame can be likened to a table with rows and columns.
• A data frame has a format similar to that of a SQL table.
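As a small illustration of the points above, the sketch below builds a data frame with data.frame(). The passenger names and values are invented for the example.

```r
# Build a data frame: each argument becomes a column, each position a row.
passengers <- data.frame(
  name     = c("Ann", "Ben", "Cara"),   # hypothetical values
  age      = c(29, 41, 35),
  survived = c(TRUE, FALSE, TRUE)
)

str(passengers)    # column names and types
nrow(passengers)   # number of rows (observations)
ncol(passengers)   # number of columns (variables)

# Select rows and columns much as you would query a table.
passengers[passengers$age > 30, c("name", "age")]
```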

Described in R
Recap: Titanic
In Module 1 we used the Data Analysis ToolPak to explore and describe the Titanic dataset. With this add-in we found the
minimum, maximum, mean, standard deviation and skewness, to name but a few, of the variables in the dataset. We also
created graphs in Excel to better understand the distribution of those variables. Let us now explore the tidyverse
library further to see how it measures up against Excel's data analysis process.
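The same descriptive statistics can be sketched in R as follows. The ages below are a hypothetical sample, not the actual Titanic data, and sample_skewness() is a helper written for this example: base R has no skewness function, so the helper uses the adjusted sample formula documented for Excel's SKEW() function, on the assumption that this is the figure to reproduce.

```r
library(dplyr)

# Hypothetical sample of ages; replace with the age column of your own dataset.
passengers <- data.frame(age = c(22, 38, 26, 35, 35, 54, 2, 27))

# Adjusted sample skewness (illustrative helper, not part of base R or dplyr).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  (n / ((n - 1) * (n - 2))) * sum(((x - mean(x)) / sd(x))^3)
}

# One summarise() call returns all descriptive statistics in a single row.
age_summary <- passengers %>%
  summarise(
    minimum  = min(age),
    maximum  = max(age),
    mean_age = mean(age),
    std_dev  = sd(age),
    skewness = sample_skewness(age)
  )

print(age_summary)
```

Because the calculation is written as code, it can be re-run unchanged whenever the data is updated, which is the reproducibility benefit described in the introduction.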

ggplot2
This package is over 10 years old and is used by thousands of people. ggplot2 is a visualisation package that is used
to create plots, charts and other useful visualisations.

The package contains many useful functions, from building box plots and density plots to time series plots. It is part
of the tidyverse library, so it is loaded (not installed) when we run library(tidyverse) to attach the tidyverse
packages.

Some of the basic functions of this package:

• Create a new ggplot
o ggplot()
• Add components to a plot
o + (applied to a gg object) and %+%
• Save a ggplot
o ggsave()
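A minimal sketch of these three functions working together, using a hypothetical sample of ages and a temporary file as the save location (both are assumptions made for the example):

```r
library(ggplot2)

# Hypothetical sample of ages.
passengers <- data.frame(age = c(22, 38, 26, 35, 35, 54, 2, 27))

# ggplot() creates the plot; + adds a histogram layer and axis labels.
p <- ggplot(passengers, aes(x = age)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Distribution of age", x = "Age", y = "Count")

# ggsave() writes the plot to a file; the format follows the file extension.
out_file <- file.path(tempdir(), "age_histogram.png")
ggsave(filename = out_file, plot = p, width = 6, height = 4)
```

Typing p at the console in an interactive session draws the plot on screen instead of saving it.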

A useful gallery of ggplot2 extension packages, and the plots they can produce, can be found at:
https://exts.ggplot2.tidyverse.org/gallery/


Note: A package only needs to be installed once, but it must be loaded with library() in every new R session so that R
knows which package we want to use.

How to get further help with R


Sometimes whilst we are learning how to use a new tool, it can be helpful to talk to other users that have gone through the
same process. Often, as in life, you will find that a person that has gone through the same struggles as you have, will be
able to offer you great advice on how to deal with the issues you are facing.

• RStudio Community is a discussion board where you can ask R-related questions should you get stuck in the learning
process.
• Alternatively, the Stack Overflow community, which some of you might already be familiar with, is also a resource
for answers to questions specific to R. Just remember to tag your question with the r tag on this platform.

Merging datasets
Often the data we want to extract information from lies in multiple sources and we need to integrate the various datasets.
We use merging as an overarching term for combining data. Merging means that we combine data from different datasets
into one data frame. Luckily for us, R is an expert in merging datasets.

dplyr
Data sets can be joined on common columns using R's merge() function or the dplyr join functions from the
tidyverse.

Remember: to combine data, the datasets must share a key variable that identifies the same observations in each of them, and that key should ideally be unique within each dataset.
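A minimal base R sketch of this check followed by a merge. The two data frames and the passenger_id key are hypothetical and exist only for illustration.

```r
# Two hypothetical datasets that share the key column passenger_id.
passengers <- data.frame(
  passenger_id = c(1, 2, 3),
  name         = c("Ann", "Ben", "Cara")
)
tickets <- data.frame(
  passenger_id = c(1, 2, 4),
  fare         = c(7.25, 71.28, 8.05)
)

# Check that the key is unique in each dataset before merging;
# anyDuplicated() returns 0 when there are no repeated values.
stopifnot(anyDuplicated(passengers$passenger_id) == 0)
stopifnot(anyDuplicated(tickets$passenger_id) == 0)

# merge() keeps, by default, only the rows whose key appears in both datasets.
merged <- merge(passengers, tickets, by = "passenger_id")
print(merged)
```

Passengers 3 and 4 each appear in only one dataset, so they are absent from the merged result.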

We will use the dplyr package to merge datasets. The dplyr package can help us solve data manipulation challenges
through functions like select(), filter(), group_by(), summarise(), arrange(), mutate() and the join functions, to name
a few.

dplyr uses SQL database syntax for its join functions. For instance, right_join(x, y) keeps every row of the second
(right) data frame, y, and adds the matching rows from the first (left) data frame, x. Rows of y without a match in x
are kept with missing values (NA) in the columns that come from x, while rows of x without a match in y are dropped.
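The behaviour of a right join can be sketched as follows. The data frames x and y and their columns are hypothetical and chosen so that each contains one key the other lacks.

```r
library(dplyr)

# First (left) data frame: passengers 1, 2 and 3.
x <- data.frame(
  passenger_id = c(1, 2, 3),
  name         = c("Ann", "Ben", "Cara")
)

# Second (right) data frame: passengers 2, 3 and 4.
y <- data.frame(
  passenger_id = c(2, 3, 4),
  fare         = c(71.28, 8.05, 53.10)
)

# Keep every row of y; add name from x where passenger_id matches.
joined <- right_join(x, y, by = "passenger_id")
print(joined)
```

Passenger 4 is kept with a missing name because it exists only in y, and passenger 1 is dropped because it exists only in x. Swapping in left_join(x, y, by = "passenger_id") would instead keep every row of x.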

Conclusion
“Education never ends, Watson, it is a series of lessons.” Sherlock Holmes

Additional resources
Grolemund, G. & Wickham, H., 2017, R for Data Science, available online at: https://r4ds.had.co.nz/

More on tidyverse packages: https://www.tidyverse.org/packages/


References

Still to Harvard ref


https://campus.datacamp.com/courses/introduction-to-the-tidyverse/data-wrangling

https://www.tidyverse.org/packages/

https://datascienceplus.com/merging-datasets-with-tidyverse/

https://www.dummies.com/programming/r/how-to-combine-and-merge-data-sets-in-r/

https://medium.com/analytics-vidhya/a-beginners-guide-to-learning-r-with-the-titanic-dataset-a630bc5495a8

https://datascienceplus.com/getting-started-with-dplyr-in-r-using-titanic-dataset/

https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf

https://www.analyticsvidhya.com/blog/2019/05/beginner-guide-tidyverse-most-powerful-collection-r-packages-data-science/

https://www.datacamp.com/community/tutorials/r-packages-guide#what

http://hadley.nz/

https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

https://mran.microsoft.com/web/packages/tidyverse/vignettes/manifesto.html

https://tidyr.tidyverse.org/

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame

https://www.kaggle.com/varimp/a-mostly-tidyverse-tour-of-the-titanic

https://support.rstudio.com/hc/en-us/articles/200552336-Getting-Help-with-R

https://www.infoworld.com/article/3454356/how-to-merge-data-in-r-using-r-merge-dplyr-or-datatable.html#:~:text=dplyr%20uses%20SQL%20database%20syntax,left_join(x%2C%20y)%20.
