0% found this document useful (0 votes)

10 views24 pages

DVT Unit-V

The document discusses the significance of parallel coordinates in visualizing high-dimensional data, emphasizing their role in exploratory data analysis and classification. It outlines the historical context, practical applications, and advantages of using parallel coordinates over traditional methods like scatterplot matrices. The text also highlights the importance of effective visualization techniques in uncovering complex relationships among multiple variables across various datasets.

Uploaded by

cartoongamingworld.0930

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views24 pages

DVT Unit-V

Uploaded by

cartoongamingworld.0930

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 24

UNIT-V

Parallel Coordinates: Visualization, Exploration and

Classification of High-Dimensional Data

14.1 Introduction
14.1.1. Origins

Over half of our sensory neurons focus on vision, which enhances our ability to
recognize patterns. The purpose of visualization is to connect this ability to problem-
solving, helping us gain insights from images. To address multivariate and
multidimensional problems, various mappings have been created to represent complex
information visually in a two-dimensional format. However, many successful
examples in this area, like Minard’s “Napoleon’s March to Moscow” and Snow’s “dot
map,” are unique and do not generally reveal multivariate relationships effectively. To
improve this, we must utilize the interactivity made possible by recent technological
advancements, leading to the need for effective graphical user interfaces (GUIs),
exploration strategies, and information-preserving displays.

Visualization has historical significance, as illustrated by Archimedes' famous death

while focused on a diagram, indicating the importance of visualization in geometry.
The use of diagrams is integral to testing conjectures and forming proofs, evolving
into a broader problem-solving process. The process involves creating a mental image
of the problem at hand, leading to the saying that we “see” when we mean we
understand.

The author’s interest in visualization began with geometry, and later frustrations over
the lack of visual aids in multidimensional geometry sparked questions about the
creation of accurate pictures for complex problems. They wondered if parallel
coordinates could provide a new way to view multidimensional spaces. This
exploration was encouraged by professors at the University of Illinois, leading to the
discovery of basic correspondence between points and lines in multidimensional
spaces.

In 1977, while teaching a class, the author revisited parallel coordinates to visualize
multidimensional objects, which led to the structured development of 'f-coordinates,'
documented in early reports. Contributions in the field continued over the years, with
many individuals assisting in advancing the knowledge of multidimensional objects
and their representations. The development of 'f-coordinates' gained more visibility,
with numerous applications researched in statistics and data visualization since that
time, reflecting an increasing interest and utility in parallel coordinates. Today, a
simple online search reveals a wealth of information, indicating the significance and
growth of this area.

14.1.2 The Case for Visualization

Searching a dataset with M items for "interesting" properties is challenging due to the
vast number of possible subsets. Effective visual cues can help quickly navigate this
complexity. To explore a dataset with N variables, a good display should:
1. Preserve information, allowing for complete dataset reconstruction,
2. Have low representational complexity for efficient display construction,
3. Work for any number of variables,
4. Treat each variable equally,
5. Remain recognizable under transformations like rotations and scalings,
6. Reveal multivariate relations,
7. Be based on rigorous mathematical methods to minimize ambiguity.

This list is open to critique and improvement. Further details and examples are
provided below.

1. The numerical values of each N-tuple should be retrievable from the scatterplot
matrix and the f-coords display, although this may not be necessary for presentation
graphics.
2. In the pairwise scatterplot matrix, N variables are displayed N(N - 1)/2 times,
while f-coords have N axes with a different preprocessing cost. For N = 10, display
space and visual issues limit scatterplot matrix use, unlike f-coords, which are more
flexible.
3. Orthogonal coordinates are limited to N = 3, but can create the illusion of more
dimensions. In “Chernoff faces,” each variable is linked to a specific facial feature,
but different assignments can change the display dramatically. This characteristic is
true for both scatterplot matrix and f-coords visualizations.
4. In the "Chernoff faces" display, each variable is linked to a facial feature in an
arbitrary way. Different choices create different displays, making it hard to prove they
show the same data. This applies to all "glyph" displays as well.
5. This is true for SM and for f -coords, although it has not been implemented in
general. Incidentally, this is a wonderful M.Sc. thesis topic!
6. The true value of visualization is in recognizing relationships among objects, not
just seeing many of them. Starting with a complete display helps uncover information,
with visual cues guiding exploration and interaction.
7. The value of rigor is self-evident.

The text discusses discovering data using three real datasets. It introduces basic
queries from GIS, combines them with boolean operators for financial data, briefly
covers an example with 400 variables, and emphasizes visualization and classification
in data analysis.

The text discusses planes created by small rotations and translations of π around three
axes, forming a "twisted slab" hard to visualize in 3-D. It raises questions about how
to show points in 3-D lie on this twisted slab and how to visualize the form for any
number N. You are encouraged to explore various methods without looking at the
answer provided at the end of the chapter.

14.2 Exploratory Data Analysis with ||-coords

14.2.1 Multidimensional Detective

Parallel coordinates help visualize complex data relationships in a two-dimensional

format, making them useful for analysis in various software tools. A Google search
for "Parallel Coordinates + Software" yielded around 31,000 results, while
"Scatterplot matrix + Software" returned about 15,000. Despite being seen as difficult,
parallel coordinates have gained attention in data exploration, known as exploratory
data analysis (EDA), which compares to modern visual data mining. This process
involves examining data for clues, testing theories, and ultimately identifying key
insights.

In parallel coordinates, data appear as wavy lines, requiring a multidimensional

approach to discovery. It's important to consider how similar analyses are done with
other methods, such as spreadsheets, and what visual indicators prompt queries. In
using parallel coordinates, selecting the right type of queries can enhance the analysis.
Initial designs included many complex queries but later simplified to three intuitive
ones called atomic queries, which can be combined for deeper analysis. Successful
exploratory data analysis depends on an informative display, good query selection,
and skillful interaction with the data.

14.2.2 An Easy Case Study: GIS Data

The first admonition is:

- do not let the picture intimidate you,

The dataset being explored includes over 9,000 measurements with nine variables.
The first two variables (X, Y) represent the location on a map of Slovenia where
seven types of ground emissions have been measured by satellite. The location of one
data item is indicated on a provided map. The query used to select data items is called
Pinch, activated by the P button on the toolbar. It allows for the selection of multiple
data items, and the position of the selected item can be changed with the cursor.

- understand the objectives.

The task is to find and identify ground features on the map, including a prominently
shaped lake in the corner.

- carefully scrutinize the data display for clues and patterns.

Follow up on anything notable, such as gaps or patterns. Use the Interval query,
activated by the I button, starting at the minimum low range of B and stopping at
the dense part. The result reveals a lake and two small streams, confirmed by the map.

- a good thing may be worth repeating

The text explores density variations in the lower part of variable B4, indicating that
this area is denser than the rest. By focusing on the sparse area that marks the water's
edge, it shows how the lake fills up from shallow water. This initial exploration of a
single variable provides a useful introduction to multivariate analysis. Observers may
notice vertical bands between variables B1, B2, and B3, leading into a discussion of
an angle query activated by the A button. This angle query selects groups of lines
based on a user-specified angle range and can help identify regions of high vegetation
between B2 and B3.

The text details the control of the angle range through cursor movements on the right
axis and suggests that this method can be refined by adjusting additional parameters.
It raises the topic of classification and rule-finding, which are discussed further in a
later section. The importance of understanding geometry to effectively utilize these
queries is highlighted, especially in recognizing how positively or negatively
correlated data behaves visually.

Next, the text poses a question regarding the implications if the B2 and B3 axes were
not adjacent, suggesting that their pairwise relations could be overlooked. It
references research indicating the minimum number of permutations needed to
maintain adjacency across all axes, illustrating this with a graph example that
represents adjacent variables. This graph is connected to Hamilton paths, which are
crucial in various modern problems like the traveling salesman issue.

The exploration data analysis (EDA) process is discussed, mentioning an icon on a

toolbar that activates a permutation editor for generating Hamiltonian permutations.
The text suggests that after analyzing the dataset, the next step involves going through
the O(N^2) Hamiltonian permutations to discover useful adjacencies. Users are
encouraged to create custom permutations and experiment with different
arrangements during their exploration, emphasizing the flexibility in analyzing the
dataset.

14.2.3 Compound Queries: Financial Data

Analyze the financial dataset to find useful relationships for investments and trading.
The data focuses on the years 1986 and 1992. In 1986, the Yen showed the most
volatility, while in 1992, Sterling became very volatile.

- look for the gold” by checking out patterns that catch our attention.
In 1986, gold prices were low until mid-August when they increased significantly.
Exploration was done by four financial experts who noted the connection between a
low Yen, high 3MTB rates, and low gold prices, indicating that a low exchange rate
for the Yen signifies its high value compared to the dollar. Data indicated a strong
negative correlation between the Yen and 3MTB rates. Additionally, a smaller cluster
showed that lowering one variable's range led the other variable’s range to rise,
highlighting this negative correlation. There was also a noted positive correlation
during the 1990s when gold prices were low to mid-range, providing investment
insights for contrarians. Testing various angle ranges and inversion features may
reveal special data clusters.

- vary one of the variables watching for interesting variations in the other vari
ables.
The text discusses the relationship between currency exchange rates and the price of
gold. It explains how changes in exchange rates can influence the price of gold,
showing that movements in currency rates and gold prices are connected. The analysis
indicates that when gold is at a high price range, the exchange rate between Sterling
and Dmark (or Yen) forms a straight line, suggesting a stable relationship. While
there could be hints of market manipulation in the gold market, the text avoids
speculation on that topic.

The text then introduces an exploratory method called "multidimensional contouring,"

which helps visualize the relationships between various financial variables. It
highlights a specific analysis that shows low rates of a metric (3MTB) and gold prices
are good predictors of a high S&P 500 index.

A comparison with other visualization tools is mentioned but notes that they are not
available. The text also emphasizes the challenge of analyzing multiple variables
using traditional scatterplot matrices, which can be complex and difficult to interpret.
In contrast, a specific plot using f-coordinates simplifies the analysis of various
interrelated financial factors.

14.2.4 Hundreds of Variables

One frequently asked question is “how many variables can be handled with f-coords?
” The largest dataset effectively worked with had about 800 variables and 10,000 data
entries.

- be sceptical about the quality of datasets with large numbers of variables

When analyzing large datasets with many variables, it is often challenging for
individuals to grasp what is happening. For example, a dataset revealed that many
instruments recorded zero values and had repetitive patterns. After cleaning the data,
it was reduced from about 90 to 30 relevant variables, as many were duplicates. This
redundancy is common, with at least 10% of variables being near-duplicates due to
measurement inconsistencies. It may be helpful to add an automated feature in
software to detect these suspect variables. The exploration of large datasets can still
be effective, and compound queries can enhance analysis in process control datasets.

14.3 Classification

Exploring data analysis can be enjoyable, but it often requires significant skill and
patience, which may deter some users. Many requests have been made for tools that
help automate this knowledge discovery process. Classification is a key task in data
analysis, where a classifier algorithm distinguishes between a designated subset and
the rest of the dataset. This involves creating rules based on the dataset's
characteristics. Using parallel coordinates, data can be transformed into an N-
dimensional space to visualize and approach classification algorithms effectively.
It can and does happen that the process does not converge when P does not contain
sufficient information to characterize S. It may also happen that S is so “porous”
(sponge-like) that an inordinate number of iterations are required. On convergence,
say at step n, the description of S is provided as:

14.4 Visual and Computational Models

The methodology discussed helps to model complex relationships among multiple

variables, similar to how two-variable relationships are depicted as flat regions.
Using an interior point algorithm, it facilitates trade-off analysis, sensitivity discovery,
constraint impact understanding, and, at times, optimization. The analysis is based
on a dataset containing outputs from several economic sectors and expenses from a
specific country, covering eight variables over multiple years, including Agricultural,
Fishing, Mining, Manufacturing, Construction, Government spending, and GNP.

The goal is to show how f-coordinates serve as a modeling tool rather than fully
exploring data model construction. Using least squares, a function is fitted to the
dataset, resulting in a region in R⁸, represented visually by upper and lower curves.
These curves illustrate the country's economic characteristics and interconnections
among sectors. Points within this region represent feasible economic policies, which
can be constructed using the interior point algorithm by selecting variable values
sequentially.

When the first variable’s value is chosen, the dimension is reduced, affecting the
ranges of other variables due to their interrelationships. For instance, a high fishing
output results in low mining sector values and vice versa, highlighting competition
for resources. This interactive exploration helps uncover the dynamics between
sectors, such as how a flourishing fishing industry draws workers away from mining.
The method also scales to more complex models, including a 20-dimensional surface
shown in one of the figures.
14.5 Parallel Coordinates: Quick Overview

14.5.1 Lines

In 3-D space, three points on a line are collinear, forming line ℓ, denoted by ¯L. A
polygonal line on N-1 points represents a point on ℓ. Two points can also define a
line. Intersections of polygonal lines yield additional points on ℓ. Proper indexing of
points is important.

14.5.2 Planes and Hyperplanes

A line can be identified from its projections, but a plane cannot in three dimensions.
Eickemeyer (1992) suggests a new method that describes a p-dimensional object
using its (p−1)-dimensional subsets from points. For example, a set of coplanar
points in 3-D does not show clear relationships. However, constructing a line from
collinear points shows that these lines intersect at a point, indicating coplanarity,
though this alone does not define the plane.
The text discusses a mathematical concept involving a plane defined by the sum of
three variables, represented with subscripts. It describes how additional points can
be generated through translations of the second and third axes. The distance
between points is shown to be proportional to a variable. The representation of
hyperlanes in higher dimensions using points with indices is explained, along with a
recursive algorithm for constructing these points. The text also mentions that various
transformations like rotation and scaling can be applied to multidimensional objects
while maintaining their recognition in different coordinates.

14.5.3 Nonlinear Multivariate Relations: Hypersurfaces

A relation between two real variables is shown as a unique area in 2-D, while a
relation with N variables appears as a hypersurface in N-D. Smooth surfaces in 3-D
can be described as the envelope of all their tangent planes, which helps to visualize
their representation. Each point on the surface corresponds to its tangent plane,
leading to regions in N-D that can reconstruct the hypersurface. Different surface
classes can be identified by their characteristics in f-coords, with examples including
developable surfaces, ruled surfaces, and quadric surfaces.

A simpler representation of surfaces uses polygonal lines along boundary points,

creating an envelope that "represents" the surface, although this is not unique. By
keeping the exact equation of the surface, interior points can also be displayed.
Intermediate curves provide insights into local curvature and areas where critical
variables interact with boundaries. A theorem states that a polygonal line between
intermediate curves indicates an interior point, while touching or crossing these
curves denotes boundary or exterior points, respectively.

By interactively varying variable values, sensitive regions can be identified, showing

how small changes impact other variables. This approach supports decision-making,
process control, and can adapt to new data, ensuring that models remain up-to-date for
informed decisions.
Matrix Visualization
15.1 Introduction
The graphical exploration of data is a crucial step in modern statistical analysis.
Matrix visualization is a technique that allows the examination of relations among
many subjects and variables without reducing data dimensions first. This method
involves rearranging the rows and columns of data using specific algorithms and then
displaying the data and proximity matrices as colorful matrix maps to visually identify
clusters and interactions.

While traditional tools like boxplots and scatterplots are helpful for analyzing data
with fewer variables, they become less effective with large datasets that may contain
tens of thousands of variables. Matrix visualization, supported by advanced
computing technologies, can effectively explore complex datasets.

The chapter reviews previous studies on matrix visualization, discusses its

foundations and provides examples, including comparisons with other graphical
tools and applications for binary data. It concludes with future perspectives on the
topic.

15.2 Related Works

The idea of matrix visualization was first introduced by Bertin in 1967 as a way to
present data structures and relationships. In 1969, Carmichael and Sneath created
taxometric maps for classifying operational taxonomy units. Later, Hartigan
developed block clustering in 1972 for data matrices. Wegman reported the first
color matrix visualization in 1990. Other techniques focused on proximity matrices,
such as shaded correlation matrices by Ling in 1973 and elliptical glyphs by Murdoch
and Chow in 1996. Friendly proposed corrgrams in 2002 for analyzing correlations.
Chen combined raw data matrices with proximity matrices in generalized association
plots. The Cluster and TreeView packages became popular for gene expression
profiling. The reordering of matrix rows and columns is vital in visualization, with
various terms used to refer to these related techniques, collectively termed matrix
visualization (MV).

15.3 The Basic Principles of Matrix Visualization

The GAP approach is used to explain matrix visualization for continuous data,
utilizing a dataset of 6,400 genes from yeast expression experiments. Detailed data
preprocessing is provided. For illustration, 15 samples and 30 genes are selected,
where rows represent genes and columns represent experiments, allowing for
interchangeable analysis.

15.3.1 Presentation of the Raw Data Matrix

The first step in visualizing continuous data is to create a raw data matrix, X₃₀₁₅, and
two proximity matrices for the rows, R₃₀₃₀, and columns, C₁₅₁₅, using user-defined
similarity measures. These matrices are then represented using colors in matrix maps.
The left panel shows a raw data matrix of log₂-transformed ratios with colors
indicating gene expression levels and correlations. A red (green) dot indicates up
(down)-regulation, while black indicates no change. Color points in the proximity
matrices indicate the relationship strength between arrays and genes, with darker
colors showing stronger correlations or smaller distances.

Data Transformation
Data may need transformations like log, standardization, or normalization before
creating visual maps or proximity matrices for clear representation. This process
might require repetition for complete analysis.

Selection of Proximity Measures

Proximity matrices have two main uses: they visually show relationships among
variables and samples, and they help reorder these for better visualization. The choice
of proximity measures is crucial for visualization. Pearson correlation is often used
for variables, while Euclidean distance is used for samples. For non-linear
relationships, alternatives like Spearman’s rank correlation and Kendall’s tau can be
used, along with nonlinear methods like Isomap. More advanced kernel methods may
also be applied as needed.

Color Spectrum
The choice of color spectrum is important for visualizing data and understanding
proximity matrices. It should effectively show numerical information both
individually and as a whole. Different color options may be suited for different
situations. A correlation matrix map using four bidirectional color spectra for
psychosis disorder variables is presented, highlighting the differences in visual
effectiveness.

Display Conditions
The display conditions for data are similar to how colors are transformed. The full
color spectrum represents all values, while conditions can focus on specific
distributions. Center matrix conditions balance colors around baseline values.
Sometimes, extreme values can be downweighted using ranks instead of numbers,
known as the rank matrix condition.

Resolution of a Statistical Graph

Extreme values in data matrices can skew the visual perception of the data. This can
be addressed by using rank conditions, compressing color scales, or applying
transformations like logarithms to reduce outlier effects.

15.3.2 Seriation of Proximity Matrices and the Raw Data Matrix

Matrix visualization is ineffective without proper orderings of variables and samples.
To extract information, meaningful proximity measures must be computed and
suitable permutations applied before visualization is done. We discuss criteria for
evaluating seriation algorithms.

Relativity of a Statistical Graph

Chen (2002) introduced the concept of relativity in statistical graphs, which involves
placing similar objects closer together in graphs. This is naturally seen in continuous
displays like histograms and scatterplots. Examples include the histogram of Petal
Width and a scatterplot of Petal Width and Length for 150 Iris flowers. Friendly and
Kwan (2003) suggested a similar idea called effect-ordered data display, while Hurley
(2004) examined related topics using scatterplot matrices and parallel coordinates
plots.
The concept of relativity often fails in matrix visualizations, as random permutations
can disrupt it. Sorting algorithms are commonly used to group similar samples and
variables.

Global Criterion: Robinson Matrix

To make a matrix look like a Robinson matrix, it is often helpful to permute its
elements. A symmetric matrix is a Robinson matrix if it meets certain criteria related
to its elements. If its rows and columns can be rearranged to resemble a Robinson
matrix, it’s called pre-Robinson. Three anti-Robinson loss functions are then
calculated to measure how much a permuted matrix deviates from this form.

Elliptical Seriation
A new algorithm called rank-two elliptical seriation was introduced for extracting the
structure of sequences formed from correlation matrices using eigenvalue
decomposition. Starting with a proximity matrix D, a sequence of correlation
matrices is created iteratively. When the process reaches rank two, two eigenvectors
with nonzero values define an ellipse that represents the data. This method helps in
identifying global patterns and smooth gene expression profiles by optimizing the
Robinson criterion.

Local Criterion: Minimal Span Loss Function

The minimal span loss function MS focuses on optimizing local structures in a
permuted matrix D by finding the shortest path through data elements, similar to
the traveling salesman problem. Local seriation produces tighter blocks around the
main diagonal of the proximity matrix compared to the global method. Additionally,
the anti-Robinson measure can be combined with minimal span loss to assess a
selected band along the diagonal of the proximity matrix.

Tree Seriation
The hierarchical clustering tree with a dendrogram is widely used for sorting gene
expression data. Agglomerative clustering maintains local grouping well, while
divisive clustering captures global patterns but is less common due to its complexity.

Flipping of Intermediate Nodes

A key problem in using dendrograms to arrange rows and columns in an expression
profile matrix is the independent flipping of intermediate nodes. This can create
multiple layouts for the same objects, despite having the same proximity matrices
and tree linkage methods. The flipping can be influenced by either external or
internal reference lists.
For instance, the software from Eisen's lab uses average expression levels for
guidance, while results from a self-organizing map can also help. Additionally, some
researchers suggest ordering nodes by their similarity to sibling nodes or using
methods that maximize similarities between adjacent leaves.

15.4 Generalization and Flexibility

15.4.1 Summarizing Matrix Visualization

Sorted matrix maps show gene expression patterns and relationships between genes
and arrays. Clusters can be identified in these maps using dendrogram structures or
methods like Pearson’s correlation and block searching. After obtaining partitioned
matrix maps, a “sufficient matrix visualization” can be created, which summarizes
data points and proximity measures using statistics like means and medians. This
visualization helps to understand correlation structures among array groups and gene
clusters. To ensure effective visualization, three requirements must be met: suitable
permuted variables and samples, well-defined partitions, and representative summary
statistics.

15.4.2 Sediment Display

The sediment display arranges data matrices by sorting column and row profiles
based on their values. This method shows the distribution of all rows and columns
together. The middle panel displays gene expression profiles, while the right shows
selected arrays, similar to boxplots.

15.4.3 Sectional Display

A sectional display shows only numerical values that meet specific conditions in data
or maps. Users can ignore values below a threshold or emphasize coherent
neighboring structures. Figure 15.8 illustrates such displays for gene distance
matrices.

15.4.4 Restricted Display

Outlying data points can obscure color details in data displays. This issue can be fixed
by showing only rank conditions or by using a limited color spectrum for the main
data values. An example shows how displaying a restricted distance range reveals a
three-group structure. Nonlinear color mapping techniques can also help make the
data clearer.

15.5 An Example
Construction of an MV Display

Dataset 0 had many missing values due to different experiments studying various sets
of genes. From this dataset, 2,000 genes and 400 arrays with fewer missing values
were selected to create Dataset 2. Pearson’s correlation coefficient measures
relationships between genes and arrays, which is common in analyzing gene
expression profiles. Average linkage clustering trees are then used to organize the
correlation data for genes and arrays.

The data matrix for gene expression is shown with color-coded dots: red for high
expression, green for low expression, and black for little change. White dots indicate
missing values, and many arrays still have missing data. The arranged data helps
identify patterns in gene clusters and experimental groups. Analyzing these color
maps can provide important insights into the information structure within the data.

Examination of an MV Display
Proper training and experience are needed to effectively use complex matrix
visualization tools, following general steps for examination.

1. For the column (array) proximity matrix:

a) Search for coherent clusters of arrays along the main diagonal of the correlation
matrix with dark red points. Two major groups of arrays, A₁ and A₂, show similar
expression patterns across all 2000 genes.
b) Examine interactions between off-diagonal array clusters and correlations.
c) The arrays show various biological tests for yeast functions, including cell-cycle
control and stress responses. Integrating this knowledge with numerical data helps
validate known information and discover new patterns.
d) Hierarchical clustering trees provide a partial view of data and proximity
structure, but are less comprehensive than proximity matrix maps.

2. For the row (gene) proximity matrix:

a) The text discusses the proximity matrix of 2000 genes, highlighting two groups:
up-regulated genes (in red) and down-regulated genes (in green), referred to as G1
and G2, with smaller subclusters identified.
b) Consult various annotation databases for detailed explanations of gene clusters.
Some genes remain unannotated, yet their functions can be inferred from correlated
patterns.
3. For the raw data (gene expression profile) matrix:
a) In steps one and two, various major and minor gene clusters and array groups
were identified. Step three involves analyzing the raw gene expression data to
observe interaction patterns. Vertical and horizontal strips will help compare group
structures and gene cluster distributions.
b) There are about 10,000 missing observations in a data matrix of 400 arrays with
2,000 genes. The pattern of missing data is not random and varies by array group
and gene cluster. Visualizing this missing structure helps users choose better
methods for estimating or imputing missing values.
c) Visual exploration offers insights into studying metabolite pathways and
discovering new ones.

15.6 Comparison with Other Graphical Techniques

We compare visualization efficiencies of scatterplot, parallel coordinates plot, and

matrix visualization.

Low-Dimensional Data
For one-dimensional data, scatterplots and parallel coordinate plots (PCP) act like
dotplots, while a one-dimensional multivariate (MV) results in a colored bar chart.
Histograms are still the most popular for one-dimensional data. Scatterplots are best
for two-dimensional data, and their effectiveness drops with more dimensions. For
three-dimensional data, rotational scatterplots help visualize geometry, while
optimal variable arrangements are crucial for PCP and MV displays.

High-Dimensional Data
A scatterplot matrix (SM) helps visualize relationships between pairs of variables in
high-dimensional data. Grand tours and dimension reduction methods, like principal
component analysis (PCA), are used to reveal data structure. The text includes
figures showing a scatterplot matrix and a corresponding PCA chart for a dataset.
Although a PCA can display all samples, it often requires interaction to analyze
relationships among variables. A scatterplot matrix uses a lot of space, making it
ineffective for datasets with many variables (greater than 15). PCA is suitable for
hundreds of variables but struggles with more. Multi-variable (MV) plots use display
space efficiently and offer better resolution than PCA.
Overall Eiciency
The diagram shows that scatterplots are best for low-dimensional data visualization,
while matrix visualization and parallel coordinates plots are better for datasets with
fifteen or more variables.

Missing Values
Displaying missing values in scatterplots is challenging, but in a Parallel Coordinates
Plot (PCP), they can be shown outside the data range. The MANET system allows for
interactive display of missing information. In an MV plot, missing values are
highlighted in a distinct color, like white in specific gene expression profiles. This
visual representation helps users understand the nature of the missing data before
further statistical analysis is done.

15.7 Matrix Visualization of Binary Data

Scatterplots, PCP, and MV displays each have strengths and weaknesses for showing
continuous data, but only MV displays can effectively show binary data across all
dimensions. This study uses KEGG metabolism pathways for the yeast
Saccharomyces cerevisiae to demonstrate how MV displays can visualize important
information from multivariate binary data.

The KEGG website lists 1,177 genes linked to 100 metabolism pathways, which we
simplified into a two-way binary data matrix called Dataset 3. In this matrix, a one
indicates that a gene is involved in a pathway, while a zero means it is not.

15.7.1 Similarity Measure for Binary Data

Measures like Euclidean distance and correlation cannot be directly used for binary
data; specific similarity measures for binary data are needed.

Symmetric and Asymmetric Binary Variables

A binary variable is symmetric if both outcomes are equally valued, while it is
asymmetric if one outcome is more important. Asymmetric variables are often treated
as "monary," focusing on the more significant state.

Sparseness and Dimensionality

Asymmetric binary variables are often sparse, making it hard to find suitable
measures for assessing relationships. Dimension reduction techniques struggle with
high-dimensional data. The Jaccard coefficient is preferred for sparse data over the
simple matching coefficient.
15.7.2 Matrix Visualization of the KEGG Metabolism Pathway Data

The text discusses the use of the 1−Jaccard distance coefficient to create proximity
matrices for genes and pathways. It explains that elliptical seriations are used to
permute these distance matrices along with the binary pathway data matrix. Many
genes that are only engaged with a single pathway are excluded from analysis, which
reduces the initial set of 1,177 genes to 432 genes associated with 88 pathways. Users
can adjust their view of the data using scroll bars or by zooming out. Average linkage
clustering trees help organize the distance matrices and reveal the complex
associations between genes and pathways. The analysis may focus on more active
genes and pathways for deeper insights.

15.8 Other Modules and Extensions of MV

This text discusses the basics of matrix visualization using the GAP approach for
visualizing continuous and binary data. It covers derived proximity matrices and some
generalizations like sufficient MV and different display methods. However, the
complexity of real data can make basic visualization inadequate. The section also
highlights ongoing projects aimed at improving matrix visualization's effectiveness.
Key features of the GAP approach include four main procedures: color projection of
raw data, computation of proximity matrices, color projection of these matrices, and
variable/sample permutations. Extensions are mostly related to the first two
procedures.

15.8.1 MV for Nominal Data

Performing multivariate analysis (MV) for nominal data is harder than for binary data
because there is no simple way to color-code nominal data while preserving statistical
relationships. Challenges also exist in finding meaningful proximity measures for
nominal data. Some researchers developed solutions using the Homals algorithm to
address these issues.
15.8.2 MV for Covariate Adjustment

In studies, data like gender and age are often collected alongside main variables.
When considering these covariates, adjustment is necessary, similar to a statistical
modeling process. Wu and Chen (2006) proposed a method that splits the data into
model and residual matrices to apply ordinary matrix visualization. Covariate
adjustment uses conditional correlations, with different approaches for discrete and
continuous covariates.

15.8.3 Data with Missing Values

The relativity of a statistical graph is used in seriation algorithms to create clustered

matrices. A weighted pattern method can impute missing values through iterative
proximity calculations.

15.8.4 Modeling Proximity Matrices

Many statistical modeling techniques visually explore high-dimensional data in a

proximity matrix, showing similarities and differences between objects. Key
techniques include multidimensional scaling, hierarchical clustering, and factor
analysis. The process involves transforming an input proximity matrix into a disparity
matrix and calculating a stress matrix to evaluate the model's fit.
Visualization in Bayesian Data Analysis

16.1 Introduction

Modern Bayesian statistics often neglects the use of statistical graphics, leading to a
separation between the two fields, despite both involving computation. Traditionally,
Bayesians conduct exploratory data analysis (EDA) initially, but focus on fitting
models thereafter, using graphs mainly for checking simulation convergence or for
teaching and presentation. Once a model is fitted, EDA seems to have no formal role
in Bayesian analysis. Conversely, users of statistical graphics believe all models are
flawed and emphasize staying close to the data without reliance on models. This
approach avoids bringing in subjective elements from models. To reconcile these
differing perspectives, a synthesis is proposed that views all statistical graphs as
comparisons to a reference distribution or model. This idea, introduced by Gelman,
aims to unify EDA with formal Bayesian methods, connecting it to goodness-of-fit
testing and building on previous graphical model-checking concepts.

16.1.1 The Role of EDA in Model Comprehension and Model-Checking

Exploratory data analysis (EDA) uses graphs to find patterns in data. The idea is
based on earlier thoughts by Tukey, who emphasized that graphs help to reveal what
is happening beyond existing descriptions. EDA relies on implicit reference models,
like assuming a comparison to zero in time series plots or independence in
scatterplots. Before examining data, we have certain expectations in mind regarding
its distribution. In Bayesian analysis, inferring whether results are sensible involves
comparing estimates to prior knowledge.

Using EDA with advanced models enhances its effectiveness, even if one prefers
model-free approaches. EDA applies to both inferences and data. In Bayesian
probability, the reference distribution can be determined through the predictive
distribution of observable data. Comparing observed values with predictive draws
helps assess model fit, identifying any discrepancies. Graphs are the preferred method
for these comparisons, although complex models may need tailored graphical checks.
The article reviews these concepts, offers examples, and discusses possible
extensions.

16.1.2 Comparable Non-Bayesian Approaches

Bayesian data visualization tools use posterior uncertainty from simulated parameter
draws and replicated data. Non-Bayesian analysis, which involves point estimates and
parametric bootstrap, can be similar if estimates are precise. Confidence intervals
provide an overview of posterior uncertainty. The described visualization tools apply
to non-Bayesian settings as well.

16.2 Using Visualization to Understand and Check Models

Bayesian inference helps manage uncertainty and variability in data visualization and
analysis.
16.2.1 Using Statistical Graphics in Model-Based Data Analysis

EDA focuses on discovering unexpected areas of model misfit, while CDA measures
how often these discrepancies might happen by chance. The goal is to apply these
concepts to more complex models using Bayesian inference and nonparametric
statistics. Complex models help EDA identify subtle data patterns, making graphical
checks more necessary to spot misfits. Statisticians engage in iterative modeling,
starting with simple models and gradually increasing complexity, identifying
deficiencies at each stage, and refining the models until satisfactory. Simulation-based
checks compare observed data to model replications, while EDA techniques are also
applied to parameter inferences and latent data. Theoretical analysis involves
exploring graphical displays to enhance model interpretation and guide effective
model checking.

16.2.2 Bayesian Exploratory Data Analysis

Bayesian inference allows for a clear understanding of data through posterior

simulations, which help create predictive distributions for unobserved data. These
simulations can be used to generate graphs that compare predictions with actual data
and can be customized for better interpretation. By incorporating missing or hidden
data, the resulting exploratory plots become more meaningful. An explicit model
enables the creation of graphs that assess how well the data fits.

The approach emphasizes the use of statistical graphics throughout the data analysis
stages, including model-building and model-checking. Graphs serve as important
tools to identify issues in models, as traditional statistics like p-values are often
insufficient. Exploratory data analysis is not limited to the beginning of the process; it
continues after model fitting to uncover potential problems. The method avoids model
averaging and instead focuses on progressively building more complex models. The
central idea of Bayesian statistics is to deal with uncertainty in inferences, often using
simulations to reflect draws from the posterior distribution, particularly seen in
hierarchical models.

16.2.3 Hierarchical Models and Parameter Naming Conventions

In structured models, we focus on comparing values in groups of parameters instead

of individual coefficients. For example, we may look at group-level averages and their
uncertainty intervals. Posterior intervals can be obtained from posterior simulations.
Traditional methods for summarizing models, like examining coefficients, do not
effectively capture the various levels of variation and uncertainty in these models,
which is important for understanding sources of variation. Naming the parameters
clearly helps in comparison, as similar names indicate related groups. Instead of using
detailed plots, we generally summarize parameters by their posterior intervals.

16.2.4 Model-Checking

Statistical graphics help check models by comparing actual data with data generated
by the model. This involves both exploring graphics and p-values, but the aim isn't
simply to see if the model is right or wrong. Instead, the focus is on understanding
how the data differs from the model. Key parts of this exploration are graphical
displays and reference distributions. The best type of graph depends on what part of
the model is being checked, such as comparing residual plots to replicated data.

16.3 Example: A Hierarchical Model of Structure in Social Networks

As an example, we consider the problem of estimating the sizes of social networks

Zheng et al. (2006). he model uses a negative-binomial model with an overdisper
sion parameter:

The study involves 1,370 respondents who identify how many people they know
within 32 subpopulations, named and defined by specific characteristics. Each
respondent has a “gregariousness” parameter, indicating their likelihood of knowing
people in different groups. This parameter is modeled with a mathematical formula.
A group-level size parameter is also used, along with an overdispersion vector. The
model is analyzed using Gibbs and Metropolis algorithms, resulting in simulated
draws for the parameters and hyperparameters.

Model-Informed Exploratory Data Analysis

Figure 16.3 shows part of an exploratory data analysis with histograms for two survey
questions and simulations from three models. The last model used shows a less than
perfect fit.

A First Look at the Estimates

We can summarize the estimated posterior distributions of scalar parameters using
histograms, but intervals are often clearer for comparisons. The parameter estimates,
along with their 50% and 95% posterior intervals, are displayed in a graph that also
shows convergence statistics. For larger models, not all quantities can be displayed,
but this provides a good summary. Convergence is typically acceptable if the Rˆ
convergence statistic is below 1.1 for all scalar parameters.
A value of ˆR close to 1.0 indicates good chain mixing. If ˆR is 1.1 for some
parameters, more iterations are allowed.

Distribution of Social Network Sizes ai

The text discusses how to summarize the estimates of 1370 parameters related to a
study. Instead of using a table, it suggests using histograms to visualize the posterior
distribution of these parameters. The histograms separate the estimates for men and
women, particularly in terms of "gregariousness. " To show uncertainty in the
estimates, the text recommends overlaying multiple histograms from sampled vectors
on the average histograms. It also emphasizes the importance of plotting the posterior
draws instead of just point estimates to better represent the randomness of the data.

16.3.1 Posterior Predictive Checks

To assess how well a Bayesian model fits, we compare its predictions to observed
data. This is done by using posterior predictive simulations from negative binomial
distributions based on parameter vectors drawn from previous simulations. We
generate multiple predictive simulations for the data and create a replicated
observation matrix. We can find numerical summaries like standard deviation or mean
for these data features and compare them. However, we prefer graphical
representations because they better show the complexity of the dataset. We compare
these graphical test statistics to evaluate the model.

Plots of Data Compared with Replicated Data

The social networks model compares observed and expected response proportions by
plotting data from 1,370 respondents. Proportions of different response values are
calculated and compared to predictions from the model, including uncertainty
intervals. Overall, the model fits the data well but tends to underpredict the proportion
of respondents knowing exactly one person. The plots also show a tendency for
respondents to use round numbers. Additionally, it emphasizes the importance of
displaying related graphs together for easier visual comparison.
16.4 Challenges Associated with the Graphical Display of Bayesian
Inferences

Using graphs more regularly in data analysis can improve the quality of statistical
studies. Exploratory data analysis should be included in software for complex
modeling. Four main challenges are noted: integrating automatic replication
distributions, selecting these distributions, choosing test variables, and displaying test
results graphically. Future tools may help simulate replication distributions and
perform model checks automatically.

16.4.1 Integrating Graphics and Bayesian Modeling

We regularly use software like BUGS to fit Bayesian models and then simulate the
data in R with R2WinBUGS. We can also summarize simulations in R more naturally
using random variable objects. While BUGS is helpful, it has limitations, so we use
the Universal Markov chain sampler in R for complex models. We are working on
creating an integrated Bayesian computing environment for modeling and model-
checking, which includes standardized graphical displays and the ability to handle
multiple models and their comparisons.

Chapter 1 Psychology in Your Life 4E Summarized Notes
No ratings yet
Chapter 1 Psychology in Your Life 4E Summarized Notes
61 pages
Statistics For Business and Economics: Metric Version - Ebook PDF Download
100% (3)
Statistics For Business and Economics: Metric Version - Ebook PDF Download
53 pages
Marketing Analytics Unit 2
No ratings yet
Marketing Analytics Unit 2
38 pages
IB Math Exam Practise Questions
100% (2)
IB Math Exam Practise Questions
109 pages
Fatima Khan 17834 ResearchReport
No ratings yet
Fatima Khan 17834 ResearchReport
34 pages
统计大纲
No ratings yet
统计大纲
14 pages
The Relationship Between Social Media Usage and Grammar Skills of Grade 8 Students of Sacred Heart Academy of Santa Maria Bulacan Inc.
No ratings yet
The Relationship Between Social Media Usage and Grammar Skills of Grade 8 Students of Sacred Heart Academy of Santa Maria Bulacan Inc.
43 pages
7 Regression With Stationary Time-Series Data-Revised
No ratings yet
7 Regression With Stationary Time-Series Data-Revised
75 pages
Simple Regression and Correlation
No ratings yet
Simple Regression and Correlation
26 pages
An Assessment of Selected Family Based Businesses
No ratings yet
An Assessment of Selected Family Based Businesses
16 pages
MTP 21 56 Questions 1714475753
No ratings yet
MTP 21 56 Questions 1714475753
18 pages
Herbicides Properties Synthesis and Control of Weeds Hasaneen Mohammed Naguib Abd Elghany Ed Instant Download
No ratings yet
Herbicides Properties Synthesis and Control of Weeds Hasaneen Mohammed Naguib Abd Elghany Ed Instant Download
78 pages
BES220 - Theme 5 Linear Regression - Lecture 2 Line Fitting and Correlation - Slides
No ratings yet
BES220 - Theme 5 Linear Regression - Lecture 2 Line Fitting and Correlation - Slides
20 pages
Statistical Measures Every Analyst Must Know - Part1 - by Prof. Frenzel - Feb, 2024 - Medium
No ratings yet
Statistical Measures Every Analyst Must Know - Part1 - by Prof. Frenzel - Feb, 2024 - Medium
21 pages
The Analysis of Biological Data Michael C Whitlock Dolph Schluter Download
No ratings yet
The Analysis of Biological Data Michael C Whitlock Dolph Schluter Download
76 pages
Detecting Lead-Lag Relationships in Stock Returns and Portfolio Strategies
No ratings yet
Detecting Lead-Lag Relationships in Stock Returns and Portfolio Strategies
45 pages
HNS B301 BIOSTATISTICS FOR HEALTH SCIENCES - Marking Scheme
No ratings yet
HNS B301 BIOSTATISTICS FOR HEALTH SCIENCES - Marking Scheme
9 pages
5.6 MMW
No ratings yet
5.6 MMW
27 pages
4 211 A&E Mills & Gay. 2016. Glossary MERAH 11th Eds. Educational - Research - Competencies - For - Analysis - and - Applications
No ratings yet
4 211 A&E Mills & Gay. 2016. Glossary MERAH 11th Eds. Educational - Research - Competencies - For - Analysis - and - Applications
12 pages
HumanEval Pro and MBPPPro Evaluating Large Language Models
No ratings yet
HumanEval Pro and MBPPPro Evaluating Large Language Models
27 pages
Ifm - Term Paper
No ratings yet
Ifm - Term Paper
19 pages
Chapter 2 Notes - Psychological Research Methods and Statistics
No ratings yet
Chapter 2 Notes - Psychological Research Methods and Statistics
11 pages
SYLLABUS Statistics For Business and Economics
No ratings yet
SYLLABUS Statistics For Business and Economics
17 pages
Effectof Marketing Innovation On Performanceof Smalland Medium Size Enterprisesin Nigeria
No ratings yet
Effectof Marketing Innovation On Performanceof Smalland Medium Size Enterprisesin Nigeria
19 pages
Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines
No ratings yet
Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines
16 pages
5 Da
No ratings yet
5 Da
6 pages
09 Plotting and Visualization
No ratings yet
09 Plotting and Visualization
97 pages
Monthly Expenditure
No ratings yet
Monthly Expenditure
11 pages
Cda U2 Visualization
No ratings yet
Cda U2 Visualization
38 pages
DA UNIT V Notes
No ratings yet
DA UNIT V Notes
17 pages
Unit V
No ratings yet
Unit V
102 pages
DVA_UNIT2
No ratings yet
DVA_UNIT2
21 pages
Does Quran Memorization Influence IQ
No ratings yet
Does Quran Memorization Influence IQ
10 pages
Cda U2 Visualization
No ratings yet
Cda U2 Visualization
39 pages
Da Unit - V
No ratings yet
Da Unit - V
14 pages
Data Visulization Techniques
No ratings yet
Data Visulization Techniques
10 pages
Article 7
No ratings yet
Article 7
5 pages
Unit 1 Data Objects Attributes Visualization
No ratings yet
Unit 1 Data Objects Attributes Visualization
34 pages
Data Preprocessing
No ratings yet
Data Preprocessing
76 pages
DA Unit-5
No ratings yet
DA Unit-5
6 pages
DVT Unit 2
No ratings yet
DVT Unit 2
29 pages
Data Analytics Data Visualization Unit V
No ratings yet
Data Analytics Data Visualization Unit V
12 pages
L5 Data Visualization
No ratings yet
L5 Data Visualization
33 pages
Module 6 - Data Visualization Tools
No ratings yet
Module 6 - Data Visualization Tools
37 pages
IDS Unit 5 Visualization
No ratings yet
IDS Unit 5 Visualization
24 pages
DVT UNIT 3 - PART-1 Notes
No ratings yet
DVT UNIT 3 - PART-1 Notes
8 pages
Data Visualization
No ratings yet
Data Visualization
23 pages
Data Analytics - Unit-V
0% (1)
Data Analytics - Unit-V
9 pages
3rd UNIT DVT
No ratings yet
3rd UNIT DVT
30 pages
WINSEM2022-23 CSI3005 ETH VL2022230503218 ReferenceMaterialI WedMar0100 00 00IST2023 MultivariateDataVisualization PDF
No ratings yet
WINSEM2022-23 CSI3005 ETH VL2022230503218 ReferenceMaterialI WedMar0100 00 00IST2023 MultivariateDataVisualization PDF
56 pages
Da Unit 5
No ratings yet
Da Unit 5
11 pages
DVT r18 Notes
No ratings yet
DVT r18 Notes
17 pages
02 Data
No ratings yet
02 Data
42 pages
DAS732 Lecture - 2024 08 08
No ratings yet
DAS732 Lecture - 2024 08 08
47 pages
Data Visualization Shorts
No ratings yet
Data Visualization Shorts
68 pages
DVT Unit2 1
No ratings yet
DVT Unit2 1
17 pages
Group - 4
No ratings yet
Group - 4
27 pages
Unit 5 Notes
No ratings yet
Unit 5 Notes
12 pages
Data Analytics-Data Visualization UNIT-V
No ratings yet
Data Analytics-Data Visualization UNIT-V
11 pages
03 Temporal, Geospatial Multivariate Data
No ratings yet
03 Temporal, Geospatial Multivariate Data
69 pages
Information Visualization: Dr. Parvathi.R VIT University, Chennai
No ratings yet
Information Visualization: Dr. Parvathi.R VIT University, Chennai
73 pages
Pixel-Oriented Visualization Techniques
No ratings yet
Pixel-Oriented Visualization Techniques
21 pages
Unit V DVT
No ratings yet
Unit V DVT
20 pages
Chapter 2-Getting To Know Your Data
No ratings yet
Chapter 2-Getting To Know Your Data
23 pages
Visualization
No ratings yet
Visualization
15 pages
Handout6 - Visualization
No ratings yet
Handout6 - Visualization
75 pages
Data Visualization
No ratings yet
Data Visualization
14 pages
Chapter1 Introduction Data Visualization
No ratings yet
Chapter1 Introduction Data Visualization
73 pages
Quiz Problems Chapters 10 11
No ratings yet
Quiz Problems Chapters 10 11
2 pages
M2 - Visualization Across Time, Space, Relationships
No ratings yet
M2 - Visualization Across Time, Space, Relationships
14 pages
Sci Vis 2005
No ratings yet
Sci Vis 2005
54 pages
3 MultiD
No ratings yet
3 MultiD
46 pages
Week 02.1 Chaptr002
No ratings yet
Week 02.1 Chaptr002
29 pages
Data Visualization Techniques: Dr. D. Koteswara Rao
No ratings yet
Data Visualization Techniques: Dr. D. Koteswara Rao
41 pages
Flashcards
No ratings yet
Flashcards
29 pages
Unit III
No ratings yet
Unit III
105 pages
Python June2025
No ratings yet
Python June2025
2 pages
多维数据可视化技术
No ratings yet
多维数据可视化技术
11 pages
Data Visualization - 1: Course Leader
No ratings yet
Data Visualization - 1: Course Leader
26 pages
Constructing Parallel Coordinates Plot For Problem
No ratings yet
Constructing Parallel Coordinates Plot For Problem
7 pages
Unit 4 Part A
No ratings yet
Unit 4 Part A
51 pages
00 Course
No ratings yet
00 Course
15 pages
Ach Tert 2013
No ratings yet
Ach Tert 2013
4 pages
03 Multivariate
No ratings yet
03 Multivariate
10 pages
William S. Cleveland-Visualizing Data-Hobart Press (1993)
100% (5)
William S. Cleveland-Visualizing Data-Hobart Press (1993)
367 pages
A Tour Through The Visualization Zoo PDF
No ratings yet
A Tour Through The Visualization Zoo PDF
18 pages
IoT Based Detection of Microbial Activity in Raw Milk by Using Intel Galileo Gen II.
No ratings yet
IoT Based Detection of Microbial Activity in Raw Milk by Using Intel Galileo Gen II.
4 pages
Nwu CS 05 12
No ratings yet
Nwu CS 05 12
10 pages
DM14 Visualisation
100% (1)
DM14 Visualisation
67 pages

DVT Unit-V

Uploaded by

DVT Unit-V

Uploaded by

UNIT-V

Parallel Coordinates: Visualization, Exploration and

Visualization has historical significance, as illustrated by Archimedes' famous death

14.1.2 The Case for Visualization

14.2 Exploratory Data Analysis with ||-coords

14.2.1 Multidimensional Detective

Parallel coordinates help visualize complex data relationships in a two-dimensional

In parallel coordinates, data appear as wavy lines, requiring a multidimensional

14.2.2 An Easy Case Study: GIS Data

The first admonition is:

- do not let the picture intimidate you,

- understand the objectives.

- carefully scrutinize the data display for clues and patterns.

- a good thing may be worth repeating

The exploration data analysis (EDA) process is discussed, mentioning an icon on a

14.2.3 Compound Queries: Financial Data

The text then introduces an exploratory method called "multidimensional contouring,"

14.2.4 Hundreds of Variables

- be sceptical about the quality of datasets with large numbers of variables

14.4 Visual and Computational Models

The methodology discussed helps to model complex relationships among multiple

14.5.2 Planes and Hyperplanes

14.5.3 Nonlinear Multivariate Relations: Hypersurfaces

A simpler representation of surfaces uses polygonal lines along boundary points,

By interactively varying variable values, sensitive regions can be identified, showing

The chapter reviews previous studies on matrix visualization, discusses its

15.2 Related Works

15.3 The Basic Principles of Matrix Visualization

15.3.1 Presentation of the Raw Data Matrix

Selection of Proximity Measures

Resolution of a Statistical Graph

15.3.2 Seriation of Proximity Matrices and the Raw Data Matrix

Relativity of a Statistical Graph

Global Criterion: Robinson Matrix

Local Criterion: Minimal Span Loss Function

Flipping of Intermediate Nodes

15.4 Generalization and Flexibility

15.4.1 Summarizing Matrix Visualization

15.4.2 Sediment Display

15.4.3 Sectional Display

15.4.4 Restricted Display

1. For the column (array) proximity matrix:

2. For the row (gene) proximity matrix:

15.6 Comparison with Other Graphical Techniques

We compare visualization efficiencies of scatterplot, parallel coordinates plot, and

15.7 Matrix Visualization of Binary Data

15.7.1 Similarity Measure for Binary Data

Symmetric and Asymmetric Binary Variables

Sparseness and Dimensionality

15.8 Other Modules and Extensions of MV

15.8.1 MV for Nominal Data

15.8.3 Data with Missing Values

The relativity of a statistical graph is used in seriation algorithms to create clustered

15.8.4 Modeling Proximity Matrices

Many statistical modeling techniques visually explore high-dimensional data in a

16.1.1 The Role of EDA in Model Comprehension and Model-Checking

16.1.2 Comparable Non-Bayesian Approaches

16.2 Using Visualization to Understand and Check Models

16.2.2 Bayesian Exploratory Data Analysis

Bayesian inference allows for a clear understanding of data through posterior

16.2.3 Hierarchical Models and Parameter Naming Conventions

In structured models, we focus on comparing values in groups of parameters instead

16.3 Example: A Hierarchical Model of Structure in Social Networks

As an example, we consider the problem of estimating the sizes of social networks

Model-Informed Exploratory Data Analysis

A First Look at the Estimates

Distribution of Social Network Sizes ai

16.3.1 Posterior Predictive Checks

Plots of Data Compared with Replicated Data

16.4.1 Integrating Graphics and Bayesian Modeling

You might also like