Unit 5 Notes
Attribute                Description
(a) Income               Pixels get darker as income increases.
(b) Credit Limit         Pixels get darker as credit limit increases.
(c) Transaction Volume   Pixels get darker as transaction volume increases.
(d) Age                  Pixels get darker as age increases.
By sorting customers by income first, you can easily compare how other attributes (credit limit,
transaction volume, age) change as income changes.
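The sorting idea above can be sketched in a few lines of Python. This is only an illustration: the customer records, the attribute names, and the helper to_gray_levels are hypothetical, and each "window" is kept as a flat list of darkness levels rather than a rendered image.

```python
# Hypothetical customer records: (income, credit_limit, transaction_volume, age).
customers = [
    (52000, 8000, 120, 41),
    (31000, 3000, 45, 29),
    (87000, 15000, 210, 55),
    (45000, 5000, 90, 35),
]
ATTRIBUTES = ["income", "credit_limit", "transaction_volume", "age"]


def to_gray_levels(values):
    """Scale values to 0..255, where 255 is the darkest pixel (largest value)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    return [round(255 * (v - lo) / (hi - lo)) for v in values]


# Sort once by income so that pixel i refers to the same customer in every window.
ordered = sorted(customers, key=lambda record: record[0])

# One window (here: a flat list of darkness levels) per attribute.
windows = {
    name: to_gray_levels([record[i] for record in ordered])
    for i, name in enumerate(ATTRIBUTES)
}

for name in ATTRIBUTES:
    print(name, windows[name])
```

Because every window uses the same customer order, position i can be compared across windows to see how the other attributes behave as income grows.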
The following two techniques are powerful tools in data visualization, helping to make complex,
high-dimensional data more understandable and accessible.
1. Space-Filling Curves
A space-filling curve is a path that visits every cell of a 2D space. In pixel-oriented
visualization, it determines the order in which data values are placed inside a window.
Examples include the Hilbert curve, Gray code curve, and Z-curve. These curves ensure that all
parts of the space are covered without gaps. Each curve fills the 2D space in its own
specific pattern.
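As a minimal sketch of one such pattern, the Z-curve can be generated by de-interleaving the bits of the step number: the even bits give the column and the odd bits give the row. The function name z_curve_position and the choice of a 4 x 4 grid are assumptions made for illustration.

```python
def z_curve_position(index, bits):
    """Return the (x, y) cell visited at step `index` on a 2**bits x 2**bits grid."""
    x = y = 0
    for b in range(bits):
        x |= ((index >> (2 * b)) & 1) << b        # even bits of index -> x
        y |= ((index >> (2 * b + 1)) & 1) << b    # odd bits of index -> y
    return x, y


# Visiting order for a 4 x 4 grid (bits=2): 16 steps, one per cell.
order = [z_curve_position(i, bits=2) for i in range(16)]
print(order)
```

Each cell appears exactly once in the order, which is the "no gaps" property described above.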
2. Circle Segments
Figure 2.12(a): Shows how a single data record is represented in circle segments. Each
dimension (Dim 1, Dim 2, ..., Dim 6) has its own segment, and the data record is plotted
as a point in each segment.
Figure 2.12(b): Shows how multiple data records are laid out in circle segments. The
segments are arranged in a circular pattern, and the data records are plotted along these
segments, forming a circular layout.
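A rough sketch of how the circle could be divided among dimensions is shown below. It assumes that all segments have equal angular width, which the notes do not state explicitly; the function name segment_angles is hypothetical.

```python
def segment_angles(num_dimensions):
    """Return the (start, end) angle in degrees of each dimension's segment."""
    width = 360.0 / num_dimensions
    return [(k * width, (k + 1) * width) for k in range(num_dimensions)]


# Six dimensions, as in Figure 2.12: each one gets a 60-degree segment.
angles = segment_angles(6)
for dim, (start, end) in enumerate(angles, start=1):
    print(f"Dim {dim}: {start} to {end} degrees")
```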
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be
added by using different colors or shapes for the data points. In the example figure, X and Y
are two spatial attributes and the third dimension is represented by different shapes. Through
this visualization, we can see that points of types “+” and “*” tend to be colocated.
A 3D scatter plot lets you visualize 3 variables (X, Y, Z), and adding color allows you to
include a 4th variable.
When you have a dataset with more than 4 dimensions, it becomes hard to visualize everything
using a single scatter plot. A scatter-plot matrix solves this problem by showing all possible
pairs of dimensions in a grid format.
Example: The Iris dataset has 5 dimensions (Sepal Length, Sepal Width, Petal Length, Petal
Width, and Species).
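To get a feel for how quickly the grid grows, the pairs can be enumerated directly. The sketch below assumes that Species, being categorical, is shown through color or marker shape rather than as its own axis, so only the four numeric dimensions are paired.

```python
from itertools import combinations

numeric_dimensions = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]

# Every unordered pair of dimensions gets its own 2-D scatter plot.
pairs = list(combinations(numeric_dimensions, 2))

# A full n x n matrix shows each pair twice (mirrored) plus n diagonal cells.
grid_cells = len(numeric_dimensions) ** 2

print(len(pairs), "distinct pairs;", grid_cells, "cells in the full grid")
for x_dim, y_dim in pairs:
    print(x_dim, "vs", y_dim)
```

The number of distinct pairs is n(n-1)/2, so it grows quadratically with the number of dimensions n.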
When the number of dimensions in a dataset becomes very large, scatter-plot matrices become
cluttered and hard to interpret.
A better alternative is parallel coordinates, which can effectively visualize high-dimensional
data in a compact and intuitive way. Instead of plotting dimensions on perpendicular axes (like in
a scatter plot), parallel coordinates use n parallel vertical axes, one for each dimension. Each
data record is represented by a polygonal line that connects its values across all dimensions. The
limitations of parallel coordinates are:
1. They cannot effectively show a dataset with many records.
2. With too many data points, the visualization becomes cluttered because many lines
overlap.
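The construction of one polygonal line can be sketched as follows. The function name polyline, the min-max scaling of each axis to the range 0..1, and the three sample records are assumptions for illustration; drawing the lines is left to any plotting tool.

```python
def polyline(record, minimums, maximums):
    """Return the vertices (axis_index, scaled_height) of one record's line."""
    points = []
    for axis, (value, lo, hi) in enumerate(zip(record, minimums, maximums)):
        height = 0.0 if hi == lo else (value - lo) / (hi - lo)
        points.append((axis, height))
    return points


# Illustrative records with four numeric dimensions each.
records = [
    (5.1, 3.5, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.3, 3.3, 6.0, 2.5),
]
minimums = [min(column) for column in zip(*records)]
maximums = [max(column) for column in zip(*records)]

# One polygonal line per record; axis k sits at horizontal position k.
lines = [polyline(r, minimums, maximums) for r in records]
for line in lines:
    print(line)
```

With thousands of records the list of lines grows accordingly, which is where the overlap problem noted above comes from.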
Icon-Based Visualization Techniques: Chernoff Faces
1. What Are Icon-Based Visualization Techniques?
Icon-based visualization techniques use small icons or symbols to represent multidimensional
data. Two popular icon-based techniques are:
1. Chernoff faces
2. Stick figures
1. Chernoff Faces
Chernoff faces are a unique and creative way to visualize multidimensional data by
representing each data record as a cartoon-like human face. This technique uses different facial
features (like eyes, nose, mouth, etc.) to encode up to 18 dimensions of data. By looking at the
faces, you can quickly spot trends, similarities, or differences in the data. Different dimensions of
the data are mapped to specific facial features. For example:
Dimension 1: Eye size
Dimension 2: Nose length
Dimension 3: Mouth width
Dimension 4: Pupil size
Dimension 5: Eyebrow slant
Dimension 6: Eye eccentricity and so on...
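The mapping listed above can be sketched as a function that turns one data record into a set of facial-feature parameters. The feature names, the scaling of each dimension to the range 0..1, and the sample values are hypothetical; actually drawing the face is outside the scope of this sketch.

```python
# Facial features in the same order as Dimensions 1 to 6 above.
FEATURES = [
    "eye_size",
    "nose_length",
    "mouth_width",
    "pupil_size",
    "eyebrow_slant",
    "eye_eccentricity",
]


def face_parameters(record, minimums, maximums):
    """Scale each dimension to 0..1 and assign it to its facial feature."""
    params = {}
    for feature, value, lo, hi in zip(FEATURES, record, minimums, maximums):
        params[feature] = 0.5 if hi == lo else (value - lo) / (hi - lo)
    return params


# Hypothetical record whose six dimensions all range from 0 to 10.
face = face_parameters((5, 10, 0, 5, 5, 5), (0,) * 6, (10,) * 6)
print(face)
```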
Advantages of Chernoff Faces
Compact Representation: Multiple dimensions can be visualized in a single icon (a
face), making it easy to compare many data points at once.
Human Intuition: Humans are naturally good at recognizing faces and subtle
differences, which helps in quickly identifying patterns or anomalies.
Visual Appeal: Faces are engaging and memorable, making the visualization more
intuitive and accessible.
Limitations of Chernoff Faces
Cognitive Load: While humans are good at recognizing faces, interpreting the meaning
of each facial feature can be challenging, especially when there are many dimensions.
Overcrowding: With too many dimensions, the faces can become cluttered and difficult
to interpret.
Subjectivity: The interpretation of facial features can be subjective, and not everyone
may perceive the same patterns.
Asymmetrical Chernoff faces remove the requirement for symmetry, allowing the left and
right sides of the face to be different. This doubles the number of facial characteristics
(36 dimensions instead of just 18) that can be used to encode data.
2. Stick Figure
The stick figure visualization technique is a creative way to represent multidimensional data
using simple stick figures. Each stick figure has five parts: a body and four limbs (two arms and
two legs). The technique maps dimensions of the data to the position, angle, or length of these
parts, allowing you to visualize complex datasets in an intuitive way.
Example: Census Data
Let’s say we’re analyzing census data with the following dimensions:
1. Age → X-axis
2. Income → Y-axis
3. Gender → Left arm angle (horizontal = male, vertical = female).
4. Education Level → Right arm length (longer = higher education).
5. Employment Status → Leg angles (straight = employed, bent = unemployed).
If the data is dense (many stick figures close together), the stick figures form a texture pattern
that highlights trends, such as:
Highly educated people (long right arms) tend to cluster in high-income areas.
Unemployed people (bent legs) are more common in lower-income regions.
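The census mapping above could be encoded along the following lines. The StickFigure class, the specific angles, and the arm-length scale are hypothetical choices made for illustration, not values given in the notes.

```python
from dataclasses import dataclass


@dataclass
class StickFigure:
    x: float                 # age on the X-axis
    y: float                 # income on the Y-axis
    left_arm_angle: float    # degrees: 0 = horizontal, 90 = vertical
    right_arm_length: float  # longer = higher education
    leg_angle: float         # degrees: 0 = straight, 45 = bent


def to_stick_figure(age, income, gender, education_years, employed):
    """Map one census record to stick-figure parameters (assumed scales)."""
    return StickFigure(
        x=age,
        y=income,
        left_arm_angle=0.0 if gender == "male" else 90.0,
        right_arm_length=0.5 + education_years / 20,
        leg_angle=0.0 if employed else 45.0,
    )


# Two hypothetical census records.
graduate = to_stick_figure(34, 58000, "female", 16, True)
school_leaver = to_stick_figure(22, 18000, "male", 10, False)
print(graduate)
print(school_leaver)
```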
Thus, given more dimensions, more levels of worlds can be used, which is why the method is
called “worlds-within-worlds.”
As another example of hierarchical visualization methods, tree-maps display hierarchical data as
a set of nested rectangles.
Example of a tree-map visualizing Google news stories:
Top-Level Categories
The entire dataset is divided into seven main categories, each shown as a large
rectangle:
o Politics
o Sports
o Technology
o Business
o Health
o Entertainment
o Science
Each category is assigned a unique color (e.g., blue for Politics, green for Sports).
Subcategories
Within each category, the news stories are further divided into subcategories or
individual stories.
o Example:
Under "Sports":
Football: A medium-sized rectangle.
Basketball: A smaller rectangle.
Tennis: An even smaller rectangle.
o The size of each rectangle reflects the number of news stories in that subcategory.
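The size rule can be illustrated with a simple "slice" layout that cuts a parent rectangle into side-by-side strips. This is a sketch under stated assumptions: the story counts are made up, the function name slice_layout is hypothetical, and real tree-map tools often use other layout algorithms that produce squarer rectangles.

```python
def slice_layout(counts, x, y, width, height):
    """Split a rectangle into vertical strips with areas proportional to counts.

    Returns a dict mapping each name to (x, y, width, height).
    """
    total = sum(counts.values())
    rectangles = {}
    offset = x
    for name, count in counts.items():
        strip_width = width * count / total
        rectangles[name] = (offset, y, strip_width, height)
        offset += strip_width
    return rectangles


# Hypothetical story counts for the "Sports" category.
sports = {"Football": 50, "Basketball": 30, "Tennis": 20}
layout = slice_layout(sports, 0, 0, 100, 40)
for name, rect in layout.items():
    print(name, rect)
```

Applying the same function again inside each strip yields the nested rectangles that give the tree-map its hierarchy.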
Visualizing Complex Data and Relations
In the early days, data visualization was primarily focused on numeric data. However, with the
advent of modern technologies, we now have access to a wide variety of complex data types,
including textual data, network data, and multimedia data. Visualizing and analyzing such data
has attracted considerable attention.
One common way to visualize non-numeric data, such as text and social media content, is
through tag clouds. A tag cloud is a visualization technique used to display statistics of
user-generated tags (e.g., in blogs, social media, or product reviews). Often, in a tag cloud,
tags are listed alphabetically or in a user-preferred order. The importance of a tag is
indicated by font size or color.
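A minimal sketch of the font-size rule is given below. It assumes a linear scale between a smallest and largest font size; the function name font_sizes, the size limits, and the tag counts are hypothetical.

```python
def font_sizes(tag_counts, min_size=10, max_size=40):
    """Map each tag's count to a font size, with tags in alphabetical order."""
    lo, hi = min(tag_counts.values()), max(tag_counts.values())
    sizes = {}
    for tag in sorted(tag_counts):  # alphabetical listing
        count = tag_counts[tag]
        scale = 0.0 if hi == lo else (count - lo) / (hi - lo)
        sizes[tag] = round(min_size + scale * (max_size - min_size))
    return sizes


# Hypothetical usage counts for three tags.
sizes = font_sizes({"visualization": 10, "data": 40, "mining": 25})
for tag, size in sizes.items():
    print(f"{tag}: {size}pt")
```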
Another challenge in visualizing complex data arises when there are relationships between
entities. For example, in a disease influence graph, nodes represent diseases, and edges
represent correlations between them. This type of visualization is particularly useful for
understanding how different diseases might influence or co-occur with one another.
Example:
Suppose you are visualizing a disease influence graph for common illnesses:
Nodes:
o Diseases like "Flu" and "Common Cold" would have large nodes because they are
prevalent.
o Diseases like "Ebola" would have small nodes because they are rare.
Edges:
o Diseases like "Flu" and "Pneumonia" might have thick edges, indicating a strong
correlation (since flu can lead to pneumonia).
o Diseases like "Flu" and "Diabetes" might have thin edges, indicating a weaker
correlation.
This visualization helps researchers and healthcare professionals understand how diseases are
related and how they might influence each other.
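The example can be expressed as a small data structure in which node size follows prevalence and edge thickness follows correlation. All numbers below are placeholder values chosen only to reproduce the ordering described above; they are not real epidemiological data, and the helper names are hypothetical.

```python
# Placeholder prevalence scores in 0..1 (not real statistics).
prevalence = {
    "Flu": 0.9,
    "Common Cold": 1.0,
    "Pneumonia": 0.4,
    "Diabetes": 0.5,
    "Ebola": 0.01,
}

# Placeholder correlation strengths in 0..1 between pairs of diseases.
correlation = {
    ("Flu", "Pneumonia"): 0.8,
    ("Flu", "Diabetes"): 0.2,
}


def node_radius(score, min_radius=5, max_radius=30):
    """Larger prevalence gives a larger node."""
    return min_radius + score * (max_radius - min_radius)


def edge_width(strength, max_width=8):
    """Stronger correlation gives a thicker edge (at least 1 unit wide)."""
    return max(1, strength * max_width)


node_radii = {disease: node_radius(p) for disease, p in prevalence.items()}
edge_widths = {pair: edge_width(c) for pair, c in correlation.items()}
print(node_radii)
print(edge_widths)
```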