Pres Dataviz
Laurent Rouvière
2022-12-05
Overview
• Materials: available at
https://lrouviere.github.io/page_perso/visualisationR.html
• Prerequisites: basics on R, probability, statistics and computer
programming
• Objectives:
• understand the importance of visualization in data science
• visualize data, models and results of a data science project
• discover (and master) some R visualization packages
• Teacher: Laurent Rouvière, [email protected]
• Research interests: nonparametric statistics, statistical learning
• Teaching: statistics and probability (university and engineering school)
• Responsibilities: head of the Master Mathématiques Appliquées,
Statistique of Rennes
• Consulting: energy (ERDF), banks, marketing, sport
Resources
Why data visualization in your Master?
Consequence
Visualization proves crucial throughout a statistical study.
How to make visualization?
Boxplot for the iris dataset
> data(iris)
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Classical tool
> boxplot(Sepal.Length~Species,data=iris)
[Figure: boxplots of Sepal.Length for each Species]
Ggplot tools
[Figure: ggplot2 graph of Sepal.Length]
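The slide titled Ggplot tools shows no code. A minimal sketch of what a ggplot2 counterpart of the classical boxplot could look like, assuming the ggplot2 package is installed (the object name p is only illustrative):

```r
library(ggplot2)

# Same data as the base-R boxplot: one box of Sepal.Length per Species.
p <- ggplot(iris) +
  aes(x = Species, y = Sepal.Length) +
  geom_boxplot()
print(p)
```

The data, the variables and the type of representation are given as separate pieces joined with +, which is the pattern used throughout the ggplot2 part of the lecture.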
A temperature map
[Figure: map of French departments colored by temp]
Several pieces of information
• Background map with the boundaries of the departments;
• Temperatures in each department (meteofrance website).
Mapping with sf
> library(sf)
> library(dplyr) # provides select() used below
> dpt <- read_sf("./DATA/dpt")
> dpt |> select(NOM_DEPT,geometry) |> head()
Simple feature collection with 6 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 644570 ymin: 6272482 xmax: 1077507 ymax: 6997000
Projected CRS: RGF93 v1 / Lambert-93
# A tibble: 6 x 2
NOM_DEPT geometry
<chr> <MULTIPOLYGON [m]>
1 AIN (((919195 6541470, 918932 6541203, 918628 6~
2 AISNE (((735603 6861428, 735234 6861392, 734504 6~
3 ALLIER (((753769 6537043, 753554 6537318, 752879 6~
4 ALPES-DE-HAUTE-PROVENCE (((992638 6305621, 992263 6305688, 991610 6~
5 HAUTES-ALPES (((1012913 6402904, 1012577 6402759, 101085~
6 ALPES-MARITIMES (((1018256 6272482, 1017888 6272559, 101677~
Background map
> ggplot(dpt)+geom_sf()
[Figure: background map of the departments, with longitude and latitude axes]
Temperature map
[Figure: map of the departments colored by temp]
Interactive charts with rAmCharts
> library(rAmCharts)
> amBoxplot(Sepal.Length~Species,data=iris)
Dashboard
Interactive web apps with shiny
• Examples:
• understand overfitting in machine learning:
https://lrouviere.shinyapps.io/overfitting_app/
• bike stations in Rennes: https://lrouviere.shinyapps.io/velib/
To summarize
Outline
1. Data visualization with ggplot2
ggplot2 grammar
2. Mapping
ggmap
• Graphs are often the starting point for statistical analysis.
• One of the main advantages of R is how easy it is for the user to
create many different kinds of graphs.
• We begin with a (short) review of conventional graphs,
• followed by an examination of some more complex representations,
especially with the ggplot2 package.
Data visualization with ggplot2
The plot function
• For a scatter plot, we have to specify a vector for the 𝑥-axis and a
vector for the 𝑦-axis.
> x <- seq(-2*pi,2*pi,by=0.1)
> plot(x,sin(x),type="l",xlab="x",ylab="sin(x)")
> abline(h=c(-1,1))
[Figure: curve of sin(x) against x, with horizontal lines at −1 and 1]
Graphs for datasets
Scatterplot with dataset
> plot(Sepal.Length~Sepal.Width,data=iris)
[Figure: scatter plot of Sepal.Length against Sepal.Width]
> plot(iris$Sepal.Width,iris$Sepal.Length) #similar
Histogram for a continuous variable
> hist(iris$Sepal.Length,probability=TRUE,
+ col="red",xlab="Sepal.Length",main="Histogram")
[Figure: red density histogram of Sepal.Length, titled Histogram]
Barplot for categorical variables
> barplot(table(iris$Species))
[Figure: bar chart of the number of observations per species]
Boxplot
> boxplot(Sepal.Length~Species,data=iris)
[Figure: boxplots of Sepal.Length for each Species]
Data visualization with ggplot2
ggplot2 grammar
• ggplot2 is a plotting system for R based on the grammar of graphics
(just as dplyr is a grammar for data manipulation).
• ggplot2 provides
• “nice” graphs (not always the case for conventional R graphs).
• “complex” graphs with few command lines.
For a given dataset, a graph is built from several layers. We have to
specify:
• the data
• the variables we want to plot
• the type of representation (scatterplot, boxplot…).
The grammar
An example
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width)+geom_point()
[Figure: scatter plot of Sepal.Width against Sepal.Length]
Color and size
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width)+
+ geom_point(color="blue",size=2)
[Figure: scatter plot of Sepal.Width against Sepal.Length, with larger blue points]
Color by (categorical) variable
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width,
+ color=Species)+geom_point()
[Figure: scatter plot of Sepal.Width against Sepal.Length, colored by Species (setosa, versicolor, virginica)]
Changing the color
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width,
+ color=Species)+geom_point()+
+ scale_color_manual(values=c("setosa"="blue","virginica"="green",
+ "versicolor"="red"))
[Figure: scatter plot of Sepal.Width against Sepal.Length, with the manually chosen color for each Species]
Color by (continuous) variable
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width,
+ color=Petal.Width)+geom_point()
[Figure: scatter plot of Sepal.Width against Sepal.Length, colored by Petal.Width on a continuous scale]
Color by (continuous) variable
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width,
+ color=Petal.Width)+geom_point()+
+ scale_color_continuous(low="yellow",high="red")
[Figure: scatter plot of Sepal.Width against Sepal.Length, colored by Petal.Width from yellow (low) to red (high)]
Histogram
> ggplot(iris)+aes(x=Sepal.Length)+geom_histogram(fill="red")
[Figure: red histogram of Sepal.Length (y-axis: count)]
Barplot
> ggplot(iris)+aes(x=Species)+geom_bar(fill="blue")
[Figure: blue bar chart of count for each Species]
Faceting (more “complex”)
> ggplot(iris)+aes(x=Sepal.Length,y=Sepal.Width)+geom_point()+
+ geom_smooth(method="lm")+facet_wrap(~Species)
[Figure: scatter plots of Sepal.Width against Sepal.Length with a fitted line, one panel per Species]
Combining ggplot with dplyr
• For instance
> head(df)
# A tibble: 6 x 3
size weight.20 weight.50
<dbl> <dbl> <dbl>
1 153 61.2 81.4
2 169 67.5 81.4
3 168 69.4 80.3
4 169 66.1 81.9
5 176 70.4 79.2
6 169 67.6 88.9
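The slides do not show where df comes from. To run the steps of this example, a stand-in dataset with the same three columns can be simulated; the sketch below is only an assumption about the structure (the sample size, seed and coefficients are invented and do not reproduce the values printed above):

```r
library(tibble)

set.seed(42)   # arbitrary seed, for reproducibility only
n <- 100       # hypothetical sample size
size <- round(rnorm(n, mean = 170, sd = 8))

df <- tibble(
  size      = size,
  # weights loosely increasing with size; all coefficients are invented
  weight.20 = round(0.4 * size + rnorm(n, sd = 2), 1),
  weight.50 = round(0.2 * size + 47 + rnorm(n, sd = 3), 1)
)
head(df)
```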
Goal
[Figure: target graph of weight, colored by age (20 or 50)]
dplyr step
Gather columns weight.20 and weight.50 into one column weight with
pivot_longer (from the tidyr package):
> df1 <- df |> pivot_longer(-size,names_to="age",values_to="weight")
> df1 |> head()
# A tibble: 6 x 3
size age weight
<dbl> <chr> <dbl>
1 153 weight.20 61.2
2 153 weight.50 81.4
3 169 weight.20 67.5
4 169 weight.50 81.4
5 168 weight.20 69.4
6 168 weight.50 80.3
> df1 <- df1 |>
+ mutate(age=recode(age,"weight.20"="20","weight.50"="50"))
ggplot step
> ggplot(df1)+aes(x=size,y=weight,color=age)+
+ geom_point()+geom_smooth(method="lm")+theme_classic()
[Figure: weight against size, with points and fitted lines colored by age (20 or 50)]
Statistics
help(stat_bin)
Computed variables
count
number of points in bin
density
density of points in bin, scaled to integrate to 1
...
Visualizing another statistic
[Figure: two histograms of carat, one with count and one with density on the y-axis]
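The code behind these two histograms is not on the slide. One possible way to obtain them is sketched here, assuming that carat comes from the diamonds dataset shipped with ggplot2 and that ggplot2 is recent enough to provide after_stat() (the number of bins is an arbitrary choice):

```r
library(ggplot2)

# Default: the computed variable "count" of stat_bin is mapped to y.
p_count <- ggplot(diamonds) + aes(x = carat) +
  geom_histogram(bins = 30)

# Map the computed variable "density" to y instead.
p_density <- ggplot(diamonds) + aes(x = carat, y = after_stat(density)) +
  geom_histogram(bins = 30)

print(p_count)
print(p_density)
```

Both graphs use the same geometry; only the computed variable sent to the y-axis changes.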
stat_summary
[Figure: graph with density on the y-axis]
Complement: some demos
> demo(image)
> example(contour)
> demo(persp)
> library("lattice");demo(lattice)
> example(wireframe)
> library("rgl");demo(rgl)
> example(persp3d)
> demo(plotmath);demo(Hershey)
Mapping
Introduction
ggmap
Syntax
• Similar to ggplot…
• Instead of
> ggplot(data)+...
• use
> ggmap(backgroundmap)+...
Background map
> library(ggmap)
> us <- c(left = -125, bottom = 25.75, right = -67, top = 49)
> map <- get_stamenmap(us, zoom = 5, maptype = "toner-lite")
> ggmap(map)
[Figure: background map of the United States (axis: lat)]
Adding information with ggplot
> fr <- c(left = -6, bottom = 41, right = 10, top = 52)
> fond <- get_stamenmap(fr, zoom = 5,"toner-lite")
> Paris <- data.frame(lon=2.351499,lat=48.85661)
> ggmap(fond)+geom_point(data=Paris,aes(x=lon,y=lat),color="red")
[Figure: map of France with Paris marked by a red point (axes: lon, lat)]
sf package
> library(sf)
> dpt <- read_sf("./DATA/dpt")
> dpt[1:5,3]
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 644570 ymin: 6290136 xmax: 1022851 ymax: 6997000
Projected CRS: RGF93 v1 / Lambert-93
# A tibble: 5 x 2
NOM_DEPT geometry
<chr> <MULTIPOLYGON [m]>
1 AIN (((919195 6541470, 918932 6541203, 918628 6~
2 AISNE (((735603 6861428, 735234 6861392, 734504 6~
3 ALLIER (((753769 6537043, 753554 6537318, 752879 6~
4 ALPES-DE-HAUTE-PROVENCE (((992638 6305621, 992263 6305688, 991610 6~
5 HAUTES-ALPES (((1012913 6402904, 1012577 6402759, 101085~
Visualize with plot
> plot(st_geometry(dpt))
Visualize with ggplot
> ggplot(dpt)+geom_sf()
[Figure: map of the departments, with longitude and latitude axes]
Adding points on the map
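No code survives for this slide. A minimal sketch of one way to add points to an sf map is given here; since the dpt shapefile is not distributed with these notes, the North Carolina counties shipped with the sf package are used as a stand-in, and the city coordinates are approximate, illustrative values:

```r
library(sf)
library(ggplot2)

# Polygons: the North Carolina counties shipped with sf stand in for dpt.
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

# Points: an ordinary data frame of longitude/latitude coordinates,
# converted to an sf object in WGS84 (EPSG:4326).
cities <- data.frame(name = c("Raleigh", "Charlotte"),
                     lon = c(-78.64, -80.84),
                     lat = c(35.78, 35.23))
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

# One geom_sf() per layer: polygons first, then the points on top.
p <- ggplot(nc) + geom_sf() +
  geom_sf(data = cities_sf, color = "red", size = 2)
print(p)
```

Declaring the coordinate reference system in st_as_sf() matters: it lets ggplot2 draw both layers in a common system even when the polygons use a different projection.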
Coloring polygons
> set.seed(1234)
> dpt1 <- dpt |> mutate(temp=runif(96,10,20))
> ggplot(dpt1) + geom_sf(aes(fill=temp)) +
+ scale_fill_continuous(low="yellow",high="red")+
+ theme_void()
[Figure: map of the departments filled by temp, from yellow (low) to red (high)]
Supplement: geometry class
• Creation of a sf object
> b1 <- st_point(c(3,4))
> b1
POINT (3 4)
> class(b1)
[1] "XY" "POINT" "sfg"
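A single geometry such as b1 is only the first level of the sf class hierarchy. The sketch below shows how several geometries might be assembled into a geometry column (sfc) and then into a full sf object; the second point, the attribute name and the choice of CRS are assumptions made for illustration:

```r
library(sf)

# sfg: single geometries
b1 <- st_point(c(3, 4))
b2 <- st_point(c(10, 11))

# sfc: a geometry column, here with an assumed CRS (WGS84)
geom <- st_sfc(b1, b2, crs = 4326)

# sf: a data frame of attributes plus the geometry column
pts <- st_sf(name = c("A", "B"), geometry = geom)
class(pts)
```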
Background map
• Documentation: here
> library(leaflet)
> leaflet() |> addTiles()
[Figure: interactive world map (Leaflet, © OpenStreetMap contributors)]
Many background styles
> Paris <- c(2.35222,48.856614)
> leaflet() |> addTiles() |>
+ setView(lng = Paris[1], lat = Paris[2],zoom=12)
[Figure: interactive maps centered on Paris]
Leaflet with data
Visualizing earthquakes with magnitude greater than 5.5
> quakes1 <- quakes |> filter(mag>5.5)
> leaflet(data = quakes1) |> addTiles() |>
+ addMarkers(~long, ~lat, popup = ~as.character(mag))
[Figure: interactive map with one marker per earthquake]
Remark
When you click on a marker, the magnitude appears.
addCircleMarkers
> leaflet(data = quakes1) |> addTiles() |>
+ addCircleMarkers(~long, ~lat, popup=~as.character(mag),
+ radius=3,fillOpacity = 0.8,color="red")
[Figure: interactive map with one red circle per earthquake]
Color polygon (combining leaflet and sf)
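The code for this slide is missing. A possible sketch of combining leaflet and sf to color polygons is shown here, again using the North Carolina counties shipped with sf as a stand-in for the departments and simulated values for temp (the palette, opacity and popup are illustrative choices):

```r
library(sf)
library(leaflet)

# Stand-in polygons, reprojected to WGS84 (longitude/latitude),
# which is what leaflet expects.
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
nc <- st_transform(nc, 4326)

# Simulated values, one per polygon
set.seed(1234)
nc$temp <- runif(nrow(nc), 10, 20)

# Palette mapping temp to a yellow-to-red gradient
pal <- colorNumeric(palette = c("yellow", "red"), domain = nc$temp)

m <- leaflet(nc) |>
  addTiles() |>
  addPolygons(fillColor = ~pal(temp), fillOpacity = 0.7,
              color = "black", weight = 1,
              popup = ~paste0(NAME, ": ", round(temp, 1))) |>
  addLegend(pal = pal, values = ~temp, title = "temp")
m
```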
Some R tools for dynamic visualization
Some Dynamic visualization tools
rAmCharts
• References:
https://datastorm-open.github.io/introduction_ramcharts/
rAmCharts Histogram
> library(rAmCharts)
> amHist(iris$Petal.Length)
[Figure: interactive histogram of iris$Petal.Length (y-axis: Frequency)]
rAmCharts Boxplot
> amBoxplot(iris)
[Figure: interactive boxplots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]
Plotly
• References: https://plot.ly/r/reference/
Scatter plot
> library(plotly)
> iris |> plot_ly(x=~Sepal.Width,y=~Sepal.Length,color=~Species) |>
+ add_markers(type="scatter")
Plotly boxplot
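The slide gives no code for the boxplot. A minimal sketch with plotly could look like this; the choice of Petal.Length as the plotted variable is an assumption:

```r
library(plotly)

# One box of Petal.Length per Species; type = "box" selects the boxplot trace.
p <- iris |> plot_ly(x = ~Species, y = ~Petal.Length, type = "box")
p
```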
Connections between individuals
• Many datasets can be visualized with graphs, especially when one has
to study connections between individuals (genomics, social networks…).
• References: http://igraph.org/r/,
http://kateto.net/networks-r-igraph
> library(igraph)
> net <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
> plot(net,vertex.color="green",vertex.size=25)
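The objects nodes and edges used above are not defined on the slides. A hypothetical pair of data frames that makes the commands runnable is sketched here; the 15 individuals match the 8 + 7 grouping used in the Nodes color example, while the labels, the number of links and the seed are invented:

```r
library(igraph)

set.seed(1234)   # arbitrary seed

# Hypothetical network of 15 individuals
nodes <- data.frame(id = 1:15, label = paste("Ind", 1:15))

# 20 distinct links drawn at random among the 105 possible pairs
pairs <- t(combn(15, 2))
edges <- as.data.frame(pairs[sample(nrow(pairs), 20), ])
names(edges) <- c("from", "to")

net <- graph_from_data_frame(d = edges, vertices = nodes, directed = FALSE)
plot(net, vertex.color = "green", vertex.size = 25)
```

The column names id, from and to are also the ones visNetwork looks for, so the same two data frames can be passed to visNetwork(nodes, edges).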
Dynamic graph: visNetwork Package
• Reference:
https://datastorm-open.github.io/visNetwork/interaction.html
> library(visNetwork)
> visNetwork(nodes,edges)
Nodes color
> nodes$group <- c(rep("A",8),rep("B",7))
> visNetwork(nodes,edges) |>
+ visGroups(groupname = "A", color = "darkblue",
+ shape = "square") |>
+ visGroups(groupname = "B", color = "red",
+ shape = "triangle")
Edges width
> edges$width <- round(runif(nrow(edges),1,10))
> visNetwork(nodes,edges) |>
+ visEdges(shadow = TRUE,
+ arrows =list(to = list(enabled = TRUE)),
+ color = list(color = "black", highlight = "red"))
• Just a tool… but an important visualization tool in data science
• Package: flexdashboard
• Reference: https://pkgs.rstudio.com/flexdashboard/index.html
Header
---
title: "My title"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: default
---
Flexdashboard | code
Descriptive statistics
=====================================
Column {data-width=650}
-----------------------------------------------------------------------
### Dataset
```{r}
DT::datatable(df, options = list(pageLength = 25))
```
Column {data-width=350}
-----------------------------------------------------------------------
### Correlation matrix
```{r}
cc <- cor(df[,1:11])
mat.cor <- corrplot::corrplot(cc)
```
### Histogram
```{r}
amHist(df$maxO3)
```
Flexdashboard | dashboard
Visualization project
• Group (2 members)
• Don’t hesitate to use tools presented in the lecture (you can also use
other tools)
before ???.
Some examples (Smart Data, 2021)
• https://jmlascar.shinyapps.io/Fraisse_Lascar_App/
• https://mssdprojectriemerleroy.shinyapps.io/MSSD-Project-Riemer-Le-Roy/
• https://abdessamadmarc.shinyapps.io/R_viz_project/
• https://razvanvisoiu.shinyapps.io/USA_Election_Analysis/