DVT Unit-V
14.1 Introduction
14.1.1 Origins
Over half of our sensory neurons focus on vision, which enhances our ability to
recognize patterns. The purpose of visualization is to connect this ability to problem-
solving, helping us gain insights from images. To address multivariate and
multidimensional problems, various mappings have been created to represent complex
information visually in a two-dimensional format. However, many successful
examples in this area, like Minard’s “Napoleon’s March to Moscow” and Snow’s “dot
map,” are unique and do not generally reveal multivariate relationships effectively. To
improve this, we must utilize the interactivity made possible by recent technological
advancements, leading to the need for effective graphical user interfaces (GUIs),
exploration strategies, and information-preserving displays.
The author’s interest in visualization began with geometry, and later frustrations over
the lack of visual aids in multidimensional geometry sparked questions about the
creation of accurate pictures for complex problems. They wondered if parallel
coordinates could provide a new way to view multidimensional spaces. This
exploration was encouraged by professors at the University of Illinois, leading to the
discovery of basic correspondence between points and lines in multidimensional
spaces.
In 1977, while teaching a class, the author revisited parallel coordinates to visualize
multidimensional objects, which led to the structured development of '∥-coords' (the
shorthand used here for parallel coordinates), documented in early reports.
Contributions in the field continued over the years, with many individuals assisting in
advancing the knowledge of multidimensional objects and their representations. The
development of ∥-coords gained more visibility, with numerous applications researched
in statistics and data visualization since that time, reflecting an increasing interest in,
and the utility of, parallel coordinates. Today, a simple online search reveals a wealth
of information, indicating the significance and growth of this area.
Searching a dataset with M items for "interesting" properties is challenging due to the
vast number of possible subsets. Effective visual cues can help quickly navigate this
complexity. To explore a dataset with N variables, a good display should:
1. Preserve information, allowing for complete dataset reconstruction,
2. Have low representational complexity for efficient display construction,
3. Work for any number of variables,
4. Treat each variable equally,
5. Remain recognizable under transformations like rotations and scalings,
6. Reveal multivariate relations,
7. Be based on rigorous mathematical methods to minimize ambiguity.
This list is open to critique and improvement. Further details and examples are
provided below.
1. The numerical values of each N-tuple should be retrievable from the scatterplot
matrix and the ∥-coords display, although this may not be necessary for presentation
graphics.
2. In the pairwise scatterplot matrix, the N variables appear in N(N - 1)/2 pairwise
panels, while ∥-coords use N axes, at a different preprocessing cost. For N = 10,
display space and visual issues already limit the scatterplot matrix, unlike ∥-coords,
which are more flexible.
3. Orthogonal coordinates are limited to N = 3, although they can create the illusion
of a few more dimensions; the scatterplot matrix and ∥-coords, by contrast, work for
any number of variables.
4. In the "Chernoff faces" display, each variable is linked to a facial feature in an
arbitrary way. Different assignments produce dramatically different displays, making it
hard to show that they represent the same data; the same objection applies to other
"glyph" displays. The scatterplot matrix and ∥-coords, in contrast, treat every variable
in the same way.
5. This is true for SM and for ∥-coords, although it has not been implemented in
general. Incidentally, this is a wonderful M.Sc. thesis topic!
6. The true value of visualization is in recognizing relationships among objects, not
just seeing many of them. Starting with a complete display helps uncover information,
with visual cues guiding exploration and interaction.
7. The value of rigor is self-evident.
The text discusses discovering data using three real datasets. It introduces basic
queries from GIS, combines them with boolean operators for financial data, briefly
covers an example with 400 variables, and emphasizes visualization and classification
in data analysis.
The text discusses planes obtained by small rotations of a plane π about three axes,
together with small translations, forming a "twisted slab" that is hard to visualize in
3-D. It raises the questions of how to tell that points in 3-D lie on this twisted slab
and how to visualize the analogous form for any number of dimensions N. You are
encouraged to explore various methods without looking at the answer provided at the
end of the chapter.
The text details the control of the angle range through cursor movements on the right
axis and suggests that this method can be refined by adjusting additional parameters.
It raises the topic of classification and rule-finding, which are discussed further in a
later section. The importance of understanding geometry to effectively utilize these
queries is highlighted, especially in recognizing how positively or negatively
correlated data behaves visually.
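To make that concrete, here is a toy sketch (synthetic data, not the chapter's dataset) using pandas' parallel_coordinates helper: between a negatively correlated pair of axes the line segments cross in a narrow region, while between a positively correlated pair they run roughly parallel.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "A": x,
    "B": -x + 0.1 * rng.normal(size=200),   # A-B negatively correlated: segments cross
    "C": -x + 0.2 * rng.normal(size=200),   # B-C positively correlated: segments roughly parallel
})
df["class"] = "all"                          # parallel_coordinates requires a class column

# note: pandas does not rescale each axis; here all variables share a comparable scale
parallel_coordinates(df, "class", color=["steelblue"], alpha=0.3)
plt.show()
```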
Next, the text poses a question regarding the implications if the B2 and B3 axes were
not adjacent, suggesting that their pairwise relations could be overlooked. It
references research indicating the minimum number of permutations needed to
maintain adjacency across all axes, illustrating this with a graph example that
represents adjacent variables. This graph is connected to Hamiltonian paths, which are
central to many modern problems, such as the traveling salesman problem.
Analyze the financial dataset to find useful relationships for investments and trading.
The data focuses on the years 1986 and 1992. In 1986, the Yen showed the most
volatility, while in 1992, Sterling became very volatile.
- “look for the gold” by checking out patterns that catch our attention.
In 1986, gold prices were low until mid-August when they increased significantly.
Exploration was done by four financial experts who noted the connection between a
low Yen, high 3MTB rates, and low gold prices, indicating that a low exchange rate
for the Yen signifies its high value compared to the dollar. Data indicated a strong
negative correlation between the Yen and 3MTB rates. Additionally, a smaller cluster
showed that lowering one variable's range led the other variable’s range to rise,
highlighting this negative correlation. There was also a noted positive correlation
during the 1990s when gold prices were low to mid-range, providing investment
insights for contrarians. Testing various angle ranges and inversion features may
reveal special data clusters.
- vary one of the variables, watching for interesting variations in the other variables.
The text discusses the relationship between currency exchange rates and the price of
gold. It explains how changes in exchange rates can influence the price of gold,
showing that movements in currency rates and gold prices are connected. The analysis
indicates that when gold is at a high price range, the exchange rate between Sterling
and Dmark (or Yen) forms a straight line, suggesting a stable relationship. While
there could be hints of market manipulation in the gold market, the text avoids
speculation on that topic.
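The interval queries described above can be sketched on hypothetical data as follows; the column names (Yen, 3MTB, Gold) and the generated relationships are illustrative only, not the actual financial dataset. Rows whose Yen value falls in a selected low range are highlighted so that the corresponding ranges on the other axes can be read off.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(1)
n = 300
yen = rng.normal(130, 10, n)
df = pd.DataFrame({
    "Yen": yen,
    "3MTB": 12 - 0.05 * yen + rng.normal(0, 0.3, n),   # illustrative negative relation
    "Gold": 350 + 0.8 * yen + rng.normal(0, 5, n),     # illustrative positive relation
})

# interval query on one axis: select rows with a "low" Yen value
low, high = df["Yen"].quantile([0.0, 0.2])
df["selected"] = np.where(df["Yen"].between(low, high), "in range", "rest")

# standardize the plotted columns so the shared y-scale is comparable across axes
cols = ["Yen", "3MTB", "Gold"]
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()

parallel_coordinates(df, "selected", cols=cols,
                     color=["crimson", "lightgray"], alpha=0.4)
plt.show()
```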
A comparison with other visualization tools is mentioned, but those tools were not
available for this analysis. The text also emphasizes the difficulty of analyzing many
variables with traditional scatterplot matrices, which become complex and hard to
interpret. In contrast, a single ∥-coords plot simplifies the analysis of the various
interrelated financial factors.
One frequently asked question is “how many variables can be handled with ∥-coords?”
The largest dataset effectively worked with had about 800 variables and 10,000 data
entries.
When analyzing large datasets with many variables, it is often challenging for
individuals to grasp what is happening. For example, a dataset revealed that many
instruments recorded zero values and had repetitive patterns. After cleaning the data,
it was reduced from about 90 to 30 relevant variables, as many were duplicates. This
redundancy is common, with at least 10% of variables being near-duplicates due to
measurement inconsistencies. It may be helpful to add an automated feature in
software to detect these suspect variables. The exploration of large datasets can still
be effective, and compound queries can enhance analysis in process control datasets.
14.3 Classification
Exploring data analysis can be enjoyable, but it often requires significant skill and
patience, which may deter some users. Many requests have been made for tools that
help automate this knowledge discovery process. Classification is a key task in data
analysis, where a classifier algorithm distinguishes between a designated subset and
the rest of the dataset. This involves creating rules based on the dataset's
characteristics. Using parallel coordinates, the data can be viewed as a point set in
N-dimensional space, which suggests approaching the classification algorithm geometrically.
It can and does happen that the process does not converge when P does not contain
sufficient information to characterize S. It may also happen that S is so “porous”
(sponge-like) that an inordinate number of iterations are required. On convergence,
say at step n, the process provides a description of S in terms of the regions
constructed at the successive steps.
The goal is to show how ∥-coords serve as a modeling tool rather than to explore
data-model construction in full. Using least squares, a function is fitted to the
dataset, resulting in a region in R⁸, represented visually by upper and lower curves.
These curves illustrate the country's economic characteristics and interconnections
among sectors. Points within this region represent feasible economic policies, which
can be constructed using the interior point algorithm by selecting variable values
sequentially.
When the first variable’s value is chosen, the dimension is reduced, affecting the
ranges of other variables due to their interrelationships. For instance, a high fishing
output results in low mining sector values and vice versa, highlighting competition
for resources. This interactive exploration helps uncover the dynamics between
sectors, such as how a flourishing fishing industry draws workers away from mining.
The method also scales to more complex models, including a 20-dimensional surface
shown in one of the figures.
14.5 Parallel Coordinates: Quick Overview
14.5.1 Lines
In ∥-coords, a line ℓ in N-D is represented by N − 1 indexed points, collectively
denoted ℓ̄. Each point on ℓ maps to a polygonal line that passes through these N − 1
points, and two such polygonal lines suffice to determine the line; the pairwise
intersections of the polygonal lines yield the indexed points representing ℓ. Proper
indexing of the points is important.
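For the 2-D case, the point–line duality can be written down explicitly; the formula below assumes the X̄1 axis is placed at x = 0 and the X̄2 axis at x = d, and is given as a reminder rather than a derivation.

```latex
% Line  \ell : x_2 = m x_1 + b  (m \neq 1)  is represented in parallel coordinates
% (axes at x = 0 and x = d) by the single point
\bar{\ell} \;=\; \left( \frac{d}{1-m}, \; \frac{b}{1-m} \right)
% Every polygonal line representing a point of \ell passes through \bar{\ell};
% for m = 1 the polygonal lines are parallel, corresponding to an ideal point.
```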
A line can be identified from its projections, but in three dimensions a plane cannot.
Eickemeyer (1992) proposed a method that describes a p-dimensional object recursively
in terms of its (p − 1)-dimensional subsets, starting from points. For example, a set of
coplanar points in 3-D does not by itself show a clear relationship. However,
constructing lines from collinear points shows that these lines intersect at a point,
indicating coplanarity, though this alone does not define the plane.
The text discusses a mathematical concept involving a plane defined by the sum of
three variables, represented with subscripts. It describes how additional points can
be generated through translations of the second and third axes. The distance
between points is shown to be proportional to a variable. The representation of
hyperplanes in higher dimensions using points with indices is explained, along with a
recursive algorithm for constructing these points. The text also mentions that various
transformations like rotation and scaling can be applied to multidimensional objects
while maintaining their recognition in different coordinates.
A relation between two real variables is shown as a unique area in 2-D, while a
relation with N variables appears as a hypersurface in N-D. Smooth surfaces in 3-D
can be described as the envelope of all their tangent planes, which helps to visualize
their representation. Each point on the surface corresponds to its tangent plane,
leading to regions in N-D that can reconstruct the hypersurface. Different surface
classes can be identified by their characteristics in ∥-coords, with examples including
developable surfaces, ruled surfaces, and quadric surfaces.
While traditional tools like boxplots and scatterplots are helpful for analyzing data
with fewer variables, they become less effective with large datasets that may contain
tens of thousands of variables. Matrix visualization, supported by advanced
computing technologies, can effectively explore complex datasets.
The idea of matrix visualization was first introduced by Bertin in 1967 as a way to
present data structures and relationships. In 1969, Carmichael and Sneath created
taxometric maps for classifying operational taxonomic units. Later, Hartigan
developed block clustering in 1972 for data matrices. Wegman reported the first
color matrix visualization in 1990. Other techniques focused on proximity matrices,
such as shaded correlation matrices by Ling in 1973 and elliptical glyphs by Murdoch
and Chow in 1996. Friendly proposed corrgrams in 2002 for analyzing correlations.
Chen combined raw data matrices with proximity matrices in generalized association
plots. The Cluster and TreeView packages became popular for gene expression
profiling. The reordering of matrix rows and columns is vital in visualization, with
various terms used to refer to these related techniques, collectively termed matrix
visualization (MV).
The GAP approach is used to explain matrix visualization for continuous data,
utilizing a dataset of 6,400 genes from yeast expression experiments. Detailed data
preprocessing is provided. For illustration, 15 samples and 30 genes are selected,
where rows represent genes and columns represent experiments, allowing for
interchangeable analysis.
The first step in visualizing continuous data is to create a raw data matrix, X (30 × 15),
and two proximity matrices, R (30 × 30) for the rows and C (15 × 15) for the columns,
using user-defined similarity measures. These matrices are then represented using
colors in matrix maps.
The left panel shows a raw data matrix of log₂-transformed ratios with colors
indicating gene expression levels and correlations. A red (green) dot indicates up
(down)-regulation, while black indicates no change. Color points in the proximity
matrices indicate the relationship strength between arrays and genes, with darker
colors showing stronger correlations or smaller distances.
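A minimal sketch of this first step on synthetic numbers standing in for the 30 × 15 log₂-ratio matrix (the red–green colormap choice and variable names are assumptions for illustration, not the chapter's code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 15))        # stand-in for the 30 genes x 15 arrays log2-ratio matrix

R = np.corrcoef(X)                   # 30 x 30 proximity (correlation) matrix for rows (genes)
C = np.corrcoef(X.T)                 # 15 x 15 proximity matrix for columns (arrays)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, M, title in zip(axes, [X, R, C], ["raw data map", "gene proximity", "array proximity"]):
    im = ax.imshow(M, cmap="RdYlGn_r", aspect="auto")   # red = high, green = low
    ax.set_title(title)
    fig.colorbar(im, ax=ax, shrink=0.8)
plt.tight_layout()
plt.show()
```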
Data Transformation
Data may need transformations like log, standardization, or normalization before
creating visual maps or proximity matrices for clear representation. This process
might require repetition for complete analysis.
Color Spectrum
The choice of color spectrum is important for visualizing data and understanding
proximity matrices. It should effectively show numerical information both
individually and as a whole. Different color options may be suited for different
situations. A correlation matrix map using four bidirectional color spectra for
psychosis disorder variables is presented, highlighting the differences in visual
effectiveness.
Display Conditions
Display conditions determine how the data values are mapped onto the color spectrum.
The full spectrum can be assigned to the complete range of values, or conditions can
focus it on specific parts of the distribution. The center matrix condition balances the
colors around a baseline value. Sometimes extreme values should be downweighted by
using ranks instead of the raw numbers; this is known as the rank matrix condition.
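A quick sketch of the rank matrix condition, as one might implement it (this is a minimal version, not the GAP software's): the raw values are replaced by their overall ranks before coloring, which limits the influence of extreme values.

```python
import numpy as np
from scipy.stats import rankdata
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 15))
X[0, 0] = 25.0                              # an extreme value that would dominate the color scale

ranks = rankdata(X).reshape(X.shape)        # rank every entry of the flattened matrix

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.imshow(X, cmap="RdYlGn_r", aspect="auto");     ax1.set_title("raw values")
ax2.imshow(ranks, cmap="RdYlGn_r", aspect="auto"); ax2.set_title("rank condition")
plt.tight_layout()
plt.show()
```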
Elliptical Seriation
A new algorithm, rank-two elliptical seriation, was introduced for extracting an
ordering from a proximity matrix by way of an iterated sequence of correlation
matrices and eigenvalue decomposition. Starting with a proximity matrix D, a sequence
of correlation matrices is computed iteratively. When the sequence reaches rank two,
the two eigenvectors with nonzero eigenvalues define an ellipse on which the items lie,
and their angular positions on this ellipse give the ordering. This method helps identify
global patterns and smooth gene expression profiles by approximately optimizing the
Robinson criterion.
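The following sketch paraphrases the idea just described (it is not the GAP implementation): iterate the correlation-of-correlation sequence until it stabilizes, then order the items by their angular position on the ellipse spanned by the two leading eigenvectors.

```python
import numpy as np

def rank_two_elliptical_seriation(D, max_iter=100, tol=1e-10):
    """Return an ordering of the items of proximity matrix D (a sketch, not the GAP code)."""
    R = np.corrcoef(D)                      # first correlation matrix of the sequence
    for _ in range(max_iter):
        R_next = np.corrcoef(R)             # correlation matrix of the previous correlation matrix
        if np.allclose(R, R_next, atol=tol):
            R = R_next                      # sequence has stabilized (numerically rank two)
            break
        R = R_next
    vals, vecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    e1, e2 = vecs[:, -1], vecs[:, -2]       # the two leading eigenvectors span the ellipse
    return np.argsort(np.arctan2(e2, e1))   # sort items by angle around the ellipse
```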
Tree Seriation
The hierarchical clustering tree with a dendrogram is widely used for sorting gene
expression data. Agglomerative clustering maintains local grouping well, while
divisive clustering captures global patterns but is less common due to its complexity.
Sorted matrix maps show gene expression patterns and relationships between genes
and arrays. Clusters can be identified in these maps using dendrogram structures or
methods like Pearson’s correlation and block searching. After obtaining partitioned
matrix maps, a “sufficient matrix visualization” can be created, which summarizes
data points and proximity measures using statistics like means and medians. This
visualization helps to understand correlation structures among array groups and gene
clusters. To ensure effective visualization, three requirements must be met: suitable
permuted variables and samples, well-defined partitions, and representative summary
statistics.
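A minimal sketch of tree seriation with SciPy (the average linkage and the correlation-based distance follow the text; everything else here is an assumption):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, leaves_list

def seriate_by_tree(X):
    """Permute rows and columns of X by the leaf order of average-linkage clustering trees."""
    row_order = leaves_list(linkage(pdist(X,   metric="correlation"), method="average"))
    col_order = leaves_list(linkage(pdist(X.T, metric="correlation"), method="average"))
    return X[np.ix_(row_order, col_order)], row_order, col_order

# usage: X_sorted, rows, cols = seriate_by_tree(X)   # X is the (genes x arrays) data matrix
```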
The sediment display arranges data matrices by sorting column and row profiles
based on their values. This method shows the distribution of all rows and columns
together. The middle panel displays gene expression profiles, while the right shows
selected arrays, similar to boxplots.
A sectional display shows only numerical values that meet specific conditions in data
or maps. Users can ignore values below a threshold or emphasize coherent
neighboring structures. Figure 15.8 illustrates such displays for gene distance
matrices.
Outlying data points can obscure color details in data displays. This issue can be fixed
by showing only rank conditions or by using a limited color spectrum for the main
data values. An example shows how displaying a restricted distance range reveals a
three-group structure. Nonlinear color mapping techniques can also help make the
data clearer.
15.5 An Example
Construction of an MV Display
Dataset 0 had many missing values due to different experiments studying various sets
of genes. From this dataset, 2,000 genes and 400 arrays with fewer missing values
were selected to create Dataset 2. Pearson’s correlation coefficient measures
relationships between genes and arrays, which is common in analyzing gene
expression profiles. Average linkage clustering trees are then used to organize the
correlation data for genes and arrays.
The data matrix for gene expression is shown with color-coded dots: red for high
expression, green for low expression, and black for little change. White dots indicate
missing values, and many arrays still have missing data. The arranged data helps
identify patterns in gene clusters and experimental groups. Analyzing these color
maps can provide important insights into the information structure within the data.
Examination of an MV Display
Proper training and experience are needed to effectively use complex matrix
visualization tools, following general steps for examination.
Low-Dimensional Data
For one-dimensional data, scatterplots and parallel coordinate plots (PCP) act like
dotplots, while a one-dimensional matrix visualization (MV) reduces to a colored bar.
Histograms are still the most popular for one-dimensional data. Scatterplots are best
for two-dimensional data, and their effectiveness drops with more dimensions. For
three-dimensional data, rotational scatterplots help visualize geometry, while
optimal variable arrangements are crucial for PCP and MV displays.
High-Dimensional Data
A scatterplot matrix (SM) helps visualize relationships between pairs of variables in
high-dimensional data. Grand tours and dimension reduction methods, like principal
component analysis (PCA), are used to reveal data structure. The text includes
figures showing a scatterplot matrix and a corresponding PCA chart for a dataset.
Although a PCA can display all samples, it often requires interaction to analyze
relationships among variables. A scatterplot matrix uses a lot of space, making it
ineffective for datasets with many variables (greater than 15). PCA is suitable for
hundreds of variables but struggles with more. Matrix visualization (MV) uses display
space efficiently and offers better resolution than PCA.
Overall Efficiency
The diagram shows that scatterplots are best for low-dimensional data visualization,
while matrix visualization and parallel coordinates plots are better for datasets with
fifteen or more variables.
Missing Values
Displaying missing values in scatterplots is challenging, but in a Parallel Coordinates
Plot (PCP), they can be shown outside the data range. The MANET system allows for
interactive display of missing information. In an MV plot, missing values are
highlighted in a distinct color, like white in specific gene expression profiles. This
visual representation helps users understand the nature of the missing data before
further statistical analysis is done.
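A small sketch of that convention (painting masked cells white via matplotlib's set_bad is my choice; the text only specifies that missing values get a distinct color such as white):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
X[rng.random(X.shape) < 0.1] = np.nan        # sprinkle ~10% missing values

cmap = plt.cm.RdYlGn_r.copy()
cmap.set_bad(color="white")                   # missing (masked) cells are drawn in white

plt.imshow(np.ma.masked_invalid(X), cmap=cmap, aspect="auto")
plt.title("MV map with missing values shown in white")
plt.show()
```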
Scatterplots, PCP, and MV displays each have strengths and weaknesses for showing
continuous data, but only MV displays can effectively show binary data across all
dimensions. This study uses KEGG metabolism pathways for the yeast
Saccharomyces cerevisiae to demonstrate how MV displays can visualize important
information from multivariate binary data.
The KEGG website lists 1,177 genes linked to 100 metabolism pathways, which we
simplified into a two-way binary data matrix called Dataset 3. In this matrix, a one
indicates that a gene is involved in a pathway, while a zero means it is not.
Measures like Euclidean distance and correlation cannot be directly used for binary
data; specific similarity measures for binary data are needed.
The text discusses the use of the 1−Jaccard distance coefficient to create proximity
matrices for genes and pathways. It explains that elliptical seriations are used to
permute these distance matrices along with the binary pathway data matrix. Many
genes that are involved in only a single pathway are excluded from the analysis, which
reduces the initial set of 1,177 genes to 432 genes associated with 88 pathways. Users
can adjust their view of the data using scroll bars or by zooming out. Average linkage
clustering trees help organize the distance matrices and reveal the complex
associations between genes and pathways. The analysis may focus on more active
genes and pathways for deeper insights.
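A short sketch of the 1 − Jaccard computation on a toy binary gene × pathway matrix (SciPy's 'jaccard' metric already returns the dissimilarity 1 − Jaccard; the matrix here is synthetic):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
B = (rng.random((50, 12)) < 0.25).astype(bool)   # toy binary gene (rows) x pathway (cols) matrix

# as in the text, drop genes involved in at most one pathway before the analysis
B = B[B.sum(axis=1) > 1]

gene_dist    = squareform(pdist(B,   metric="jaccard"))   # 1 - Jaccard between gene profiles
pathway_dist = squareform(pdist(B.T, metric="jaccard"))   # 1 - Jaccard between pathway profiles
```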
This text discusses the basics of matrix visualization using the GAP approach for
visualizing continuous and binary data. It covers derived proximity matrices and some
generalizations like sufficient MV and different display methods. However, the
complexity of real data can make basic visualization inadequate. The section also
highlights ongoing projects aimed at improving matrix visualization's effectiveness.
Key features of the GAP approach include four main procedures: color projection of
raw data, computation of proximity matrices, color projection of these matrices, and
variable/sample permutations. Extensions are mostly related to the first two
procedures.
Performing matrix visualization (MV) for nominal data is harder than for binary data
because there is no simple way to color-code nominal data while preserving statistical
relationships. Challenges also exist in finding meaningful proximity measures for
nominal data. Some researchers developed solutions using the Homals algorithm to
address these issues.
15.8.2 MV for Covariate Adjustment
In studies, data like gender and age are often collected alongside main variables.
When considering these covariates, adjustment is necessary, similar to a statistical
modeling process. Wu and Chen (2006) proposed a method that splits the data into
model and residual matrices to apply ordinary matrix visualization. Covariate
adjustment uses conditional correlations, with different approaches for discrete and
continuous covariates.
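A hedged sketch of the model/residual split, using ordinary least squares on the covariates (a simplification for illustration, not Wu and Chen's implementation):

```python
import numpy as np

def split_model_residual(X, Z):
    """X: n x p data matrix; Z: n x q covariate matrix (e.g. gender, age).
    Returns the fitted ('model') part and the residual part of X, each of which
    can then be shown with ordinary matrix visualization."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])        # add an intercept column
    beta, *_ = np.linalg.lstsq(Z1, X, rcond=None)     # one least-squares regression per column of X
    model_part = Z1 @ beta
    residual_part = X - model_part
    return model_part, residual_part
```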
16.1 Introduction
Modern Bayesian statistics often neglects the use of statistical graphics, leading to a
separation between the two fields, despite both involving computation. Traditionally,
Bayesians conduct exploratory data analysis (EDA) initially, but focus on fitting
models thereafter, using graphs mainly for checking simulation convergence or for
teaching and presentation. Once a model is fitted, EDA seems to have no formal role
in Bayesian analysis. Conversely, users of statistical graphics believe all models are
flawed and emphasize staying close to the data without reliance on models. This
approach avoids bringing in subjective elements from models. To reconcile these
differing perspectives, a synthesis is proposed that views all statistical graphs as
comparisons to a reference distribution or model. This idea, introduced by Gelman,
aims to unify EDA with formal Bayesian methods, connecting it to goodness-of-fit
testing and building on previous graphical model-checking concepts.
Exploratory data analysis (EDA) uses graphs to find patterns in data. The idea is
based on earlier thoughts by Tukey, who emphasized that graphs help to reveal what
is happening beyond existing descriptions. EDA relies on implicit reference models,
like assuming a comparison to zero in time series plots or independence in
scatterplots. Before examining data, we have certain expectations in mind regarding
its distribution. In Bayesian analysis, inferring whether results are sensible involves
comparing estimates to prior knowledge.
Using EDA with advanced models enhances its effectiveness, even if one prefers
model-free approaches. EDA applies to both inferences and data. In Bayesian
probability, the reference distribution can be determined through the predictive
distribution of observable data. Comparing observed values with predictive draws
helps assess model fit, identifying any discrepancies. Graphs are the preferred method
for these comparisons, although complex models may need tailored graphical checks.
The article reviews these concepts, offers examples, and discusses possible
extensions.
Bayesian data visualization tools use posterior uncertainty from simulated parameter
draws and replicated data. Non-Bayesian analysis, which involves point estimates and
parametric bootstrap, can be similar if estimates are precise. Confidence intervals
provide an overview of posterior uncertainty. The described visualization tools apply
to non-Bayesian settings as well.
Bayesian inference helps manage uncertainty and variability in data visualization and
analysis.
16.2.1 Using Statistical Graphics in Model-Based Data Analysis
EDA focuses on discovering unexpected areas of model misfit, while CDA measures
how often these discrepancies might happen by chance. The goal is to apply these
concepts to more complex models using Bayesian inference and nonparametric
statistics. Complex models help EDA identify subtle data patterns, making graphical
checks more necessary to spot misfits. Statisticians engage in iterative modeling,
starting with simple models and gradually increasing complexity, identifying
deficiencies at each stage, and refining the models until satisfactory. Simulation-based
checks compare observed data to model replications, while EDA techniques are also
applied to parameter inferences and latent data. Theoretical analysis involves
exploring graphical displays to enhance model interpretation and guide effective
model checking.
The approach emphasizes the use of statistical graphics throughout the data analysis
stages, including model-building and model-checking. Graphs serve as important
tools to identify issues in models, as traditional statistics like p-values are often
insufficient. Exploratory data analysis is not limited to the beginning of the process; it
continues after model fitting to uncover potential problems. The method avoids model
averaging and instead focuses on progressively building more complex models. The
central idea of Bayesian statistics is to deal with uncertainty in inferences, often using
simulations to reflect draws from the posterior distribution, particularly seen in
hierarchical models.
16.2.4 Model-Checking
Statistical graphics help check models by comparing actual data with data generated
by the model. This involves both exploring graphics and p-values, but the aim isn't
simply to see if the model is right or wrong. Instead, the focus is on understanding
how the data differs from the model. Key parts of this exploration are graphical
displays and reference distributions. The best type of graph depends on what part of
the model is being checked, such as comparing residual plots to replicated data.
The study involves 1,370 respondents who identify how many people they know
within 32 subpopulations, named and defined by specific characteristics. Each
respondent has a “gregariousness” parameter, indicating their likelihood of knowing
people in different groups. This parameter is modeled with a mathematical formula.
A group-level size parameter is also used, along with an overdispersion vector. The
model is analyzed using Gibbs and Metropolis algorithms, resulting in simulated
draws for the parameters and hyperparameters.
To assess how well a Bayesian model fits, we compare its predictions to observed
data. This is done by using posterior predictive simulations from negative binomial
distributions based on parameter vectors drawn from previous simulations. We
generate multiple predictive simulations for the data and create a replicated
observation matrix. We can find numerical summaries like standard deviation or mean
for these data features and compare them. However, we prefer graphical
representations because they better show the complexity of the dataset. We compare
these graphical test statistics to evaluate the model.
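A sketch of the replication step under an assumed negative-binomial parameterization (mean a_i·b_k, variance omega_k times the mean); the array names and shapes are illustrative, not the study's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def replicate(a_draws, b_draws, omega_draws):
    """Posterior predictive replications for an overdispersed count model.
    a_draws: (S, I) gregariousness draws; b_draws: (S, K) group-size draws;
    omega_draws: (S, K) overdispersion draws (variance/mean ratio, assumed > 1).
    Returns an (S, I, K) array of replicated 'how many do you know' counts."""
    mu = a_draws[:, :, None] * b_draws[:, None, :]   # (S, I, K) expected counts
    omega = omega_draws[:, None, :]                  # broadcast over respondents
    r = mu / (omega - 1.0)                           # numpy negative_binomial 'n' parameter
    p = 1.0 / omega                                  # numpy negative_binomial 'p' parameter
    return rng.negative_binomial(r, p)               # shapes broadcast to (S, I, K)

# each replicated matrix can then be plotted with the same graph used for the observed
# data, and the observed display compared against the spread of the replications
```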
Using graphs more regularly in data analysis can improve the quality of statistical
studies. Exploratory data analysis should be included in software for complex
modeling. Four main challenges are noted: integrating automatic replication
distributions, selecting these distributions, choosing test variables, and displaying test
results graphically. Future tools may help simulate replication distributions and
perform model checks automatically.
We regularly use software such as BUGS to fit Bayesian models and then work with the
simulated draws in R via R2WinBUGS. We can also summarize simulations in R more
naturally using random variable objects. While BUGS is helpful, it has limitations, so
we use the Universal Markov chain sampler (Umacs) in R for more complex models. We
are working on
creating an integrated Bayesian computing environment for modeling and model-
checking, which includes standardized graphical displays and the ability to handle
multiple models and their comparisons.