0% found this document useful (0 votes)
15 views56 pages

Lecture 3 - Data Manipulation

Uploaded by

kotoole8r
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views56 pages

Lecture 3 - Data Manipulation

Uploaded by

kotoole8r
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 56

The picture can't be displayed.

Principles of Data Visualization

Data Manipulation
Data Manipulation
• Why do we need to manipulate data?
• Data in the real world is rarely formatted
• Different visualization tools use different data formats
• Reformatting helps us re-purpose data

6
Jupyter Notebooks
— Python based data
manipulation suite
— Uses a web browser for
managing file and
programming functions
— Very easy loading and
manipulation of large data
files

3
Importing Libraries

— import – Used to import a python library


— as – Used to define alias for library
— pandas – Used for data manipulation and analysis
— numpy – Used for handling large, multi-dimensional arrays
and matrices

4
Importing Libraries
— matplotlib – Used for 2D plotting in Python
— pyplot – Provides a MATLAB-like plotting framework
— seaborn – Used to create attractive and informative statistical
graphics
— os – Allows to use operating system dependent functionality
— %matplotlib inline – Allows to produce inline graphs

5
Data Loading

— path variable is used here to assign the dataset location

— data_dir variable is used to concatenate the path with the


name of the excel file

6
Data Loading

— df acts as a pandas object which reads and loads the excel file
— sheet_names is an attribute of ExcelFile object
— sheet_names displays the name of all the sheets present within
an excel workbook

7
Data Frame Definition
— data.frame() - List of variables(vectors) with same number of rows
— A data frame is used for storing data tables

Vectors

R Help, r-tutor.com 8
DataFrame
— DataFrame is a 2-dimensional labeled data structure with
columns of potentially different types.

— df_observation is used as a pandas dataframe to load excel data


— Observations was one of the sheets from the Grocery Prices
excel workbook

9
Viewing Data

— head() allows viewing the top 5 rows in the dataframe


— head() can take parameters to view ‘n’ number of rows
— head(10) will display the top 10 rows

10
Exploring Data

— info() provides information like total number of rows and


columns, the name of each column, and their datatypes
— info() can also be analyzed to determine columns with nulls or
missing values

11
Nulls in DataFrame

— Nulls are any blank or missing value in the dataset


— isnull() finds any NaN (not a number) or None values
— isnull.sum() searches for nulls in each column

12
Nulls in DataFrame

— isnull().sum().sum() provides total nulls across all rows and


columns in the df_observation dataframe

13
Dropping NA values

— dropna() is used to drop na values (null, Nan, None)


— Axis parameter defines the axis - 0 means rows, 1 means
columns
— Running the above code will remove all rows where na values
exist

14
Plotting Histogram

— figsize allows to set the dimension of the graphic


— In figsize - 20 is width, 8 is height
— hist() creates an histogram for Price using 27 bins (columns) as
a user specification
— show() makes the histogram plot visible

15
Skewness

— If the mean is greater than the median, the distribution is positively


skewed
— If the mean is less than the median, the distribution is negatively
skewed
— If the mean and the median are same, then it is symmetrical

16
Imputing NA (null) values

— If imputation technique is selected as median then above code


will replace all na values in Price with the median Price value

— If imputation technique is selected as mean then above code


will replace all na values in Price with the median Price value

17
Unique Values

— unique() determines the unique or distinct values


— list() gives out the values as a list

18
Mapping Key Values

Greece

India

China

Ireland

19
Mapping Key Values
— The Country column contains mis-spelled names

— To map correct values map() is used


— Within map(), each existing value is assigned a new value

20
Handling Duplicates

— Row 3 and 8 – India

— Row 5 and 6 – China

— Total Duplicate Values – 2 (India,


China)

21
Handling Duplicates
— duplicated() returns a series of duplicate values

• Row 6 is 2nd Instance of China

• Row 8th is 2nd Instance of India

22
Handling Duplicates
— duplicated().sum() returns total duplicate values

— drop_duplicates() drops the duplicate values in the dataframe

23
Merging Datasets
DataFrame1 DataFrame 2

24
Merging Datasets
— merge() allows to join two dataframes

DataFrame.merge(self, right, how='inner', on=None)

— self : First Dataframe


— right : Dataframe to merge with
— how : determines the type of join
{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
— on : Column/Index levels to join
on = None, It merges on intersection of column in both
dataframes

25
Merging Datasets With Parameters
— Merging DataFrame1 with DataFrame2 on parameter ID

26
Merging Datasets Without Parameters
— Merging DataFrame1 with DataFrame2 without passing any on
parameter

27
Sorting Values

— sort_values sorts a column of values in a dataframe


— sort_values has default parameters like (by, axis=0,
ascending=True, inplace=False, kind='quicksort',
na_position='last’)
— by refers to the column by which the data has to be sorted

28
Workspace Area
— Analysis worksheets are created first
— Dashboards are constructed from worksheets
— Stories are constructed from worksheets and dashboards

Worksheet Dashboard Story

29
Visualization Worksheet
— View constructed of visually encoded data elements
— Data elements are dragged to cards or shelves to forms rows
and columns
— One or multiple data elements can be dragged to create single
or multiple axes

30
Visual Data Encoding
• Automatic adjusts to data selections
• Bar chart provides comparison
• Line chart provides trend
• Area provides trend and comparison
• Shape provides complex comparison
• Maps provide proximity in space
• Pie provides % contribution
• Gantt provides relationship of measures in
time
• Polygon creates data areas

31
Visualization Standard View

Image Source: New York Times


32
Best Practices Recommender
— Data elements can be highlighted and the show me function
will recommend good visualizations

33
Ask Data Functionality
— Automatic view building from asking questions
— Clarifications like “as a bar chart” “by country”

34
Worksheet – Customer Scatter
— Customer data from 2011 to 2017 across all US region is used
— Sales data is plotted against the corresponding profit.
— The scatter plot uses color to represent the profit ratio

Source: Tableau – Sample Store 35


Worksheet – Customer Rank
— Customer data is sorted per the sum of sales to give customer rank
— Horizontal bar chart uses color encoding to represent profit ratio
— Tooltip provides additional info such as profit and sales rank

36
Worksheet – Customer Overview
— For each of the measures, data is plotted against each region
— The tooltip gives an overview of all the measure values

37
What is a Dashboard?

“A visual display of the most important information needed to achieve one or more
objectives; consolidated and arranged on a single screen so the information can be
monitored at a glance”
(Stephen Few)

38
Create a Dashboard
— Fixed size (default): The dashboard remains
the same size, regardless of the size of the
window used to display it

— Range: The dashboard scales between minimum


and maximum sizes. Scroll bars are displayed when
the dashboard is smaller than the minimum size

— Automatic: The dashboard automatically resizes


to fill the window used to display it

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_organize_floatingandtiled.htm 39
Critical Aspects of the Dashboard
• Essential Strategic Metrics

• Single Visual Frame

• Monitored at a Glance

• Designed with Audience Context

40
Hospitality Design Metaphor

Image Source: hermentorcenter.com, www.sap.com.


41
Customer Acquisition Design Metaphor

Cold
Warm
Hot

Opportunity

42
Dashboard Layouts

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_organize_floatingandtiled.htm 43
Dashboard Layouts

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_organize_floatingandtiled.htm 44
Dashboard – Customer Analysis
— It combines all the previous 3 worksheets to give an analytical
view for comparison using tiled layout

Source: Tableau – Sample Store 45


Dashboard – Website Analytics
— The dashboard combines overview of Pageviews, topic areas,
map and weekdays using floating layout

Source: The Big Book of Dashboards (BigBookofDashboards.com) 46


Dashboard – Ease of Business
— The dashboard combines 4 business tax rates worksheets
using horizontal layout

Source: Tableau – World Indicators 47


Dashboard – Economic Indicators
— The dashboard combines 3 different returns worksheets using
vertical layout

Source: Tableau – Regional 48


Add Interactivity to Dashboard
— Show filters –

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_create.htm 49
Add Interactivity to Dashboard
— Enable highlighting –
A highlighter allows to highlight parts of a view based on what
one enters or selects

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_best_practices.htm 50
Dashboard Actions
— Use a single view to filter other views in a dashboard

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/actions_dashboards.htm 51
Dashboard Actions
— Use multiple views to filter other views in a dashboard

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/actions_dashboards.htm 52
Dashboard Actions
— Navigate from one view to another view, dashboard, or story

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/actions_dashboards.htm 53
Dashboard Actions
— Interactively display a web page in a dashboard

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/actions_dashboards.htm 54
Modify Layout for Mobile Devices
— “Use automatic layout” option allows to automatically synchronize any
changes to the Default dashboard
— “Edit layout myself ” makes the Phone layout fully independent, so
manually add and arrange items to reflect changes to the Default
dashboard

Source: https://onlinehelp.tableau.com/current/pro/desktop/en-us/dashboards_dsd_create.htm 55
Summary
• Define your titles and axis for what you want to say
• Reformat data files for the visualization tools you wish to use
• Clean and format data to a solid executable file
• Create knowledge of the data domain
• Define the user experience
• Visually encode the data

You might also like