Data Organization and Visualization Techniques
VISUALISING AND
DESCRIBING DATA
CHAPTER 2
CFA Institute (2020): Quantitative Investment Analysis, 4th Edition,
John Wiley & Sons
LEARNING OBJECTIVES
• Identify and compare data types;
• Describe how data are organized for quantitative analysis;
• Interpret frequency and related distributions;
• Describe ways that data may be visualized and evaluate uses of specific visualizations;
• Describe how to select among visualization types;
• Calculate and interpret measures of central tendency;
• Select among alternative definitions of mean to address an investment problem;
• Calculate and interpret measures of dispersion;
• Interpret skewness, kurtosis, and correlation between two variables.
DATA TYPES
• Data can be defined as a collection of numbers, characters, words, and text, as well as images, audio, and video, in a raw or organized format to
represent facts or information. Data serve as a foundation for analysis, interpretation, and decision-making in various fields.
• To choose the appropriate statistical methods for summarizing and analyzing data and to select suitable charts for visualizing data, we need to
distinguish among different data types.
1. Discrete data
2. Continuous data
3. Ordinal data
4. Nominal data
Cross-Sectional versus Time-Series versus Panel Data
CROSS-SECTIONAL DATA
Cross-sectional data is collected at a single point in time and focuses on different entities or individuals at that specific moment.
E.g. If you analyse the financial performance of different companies by looking at their annual reports for the year 2022, you are using cross-sectional data.
TIME SERIES DATA
Time series data is collected over multiple time periods for a single entity or individual. It involves observing and recording data points at sequential points in time, such as daily, weekly, monthly, quarterly, or annually.
E.g. Tracking the monthly stock prices of a specific company from January 2020 to December 2022 is an example of time series data.
PANEL DATA
Panel data, also known as longitudinal or cross-sectional time series data, combines elements of both cross-sectional and time series data. It involves collecting information on multiple entities over multiple time periods.
E.g. Quarterly earnings per share for three companies in a given year, by quarter.
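The three layouts can be sketched with a small Python example. The company names and earnings-per-share figures below are made up purely for illustration; the point is only that a cross-section and a time series are both slices of a panel.

```python
# Hypothetical quarterly EPS for three made-up companies (illustrative values only).
panel = {
    ("Alpha", "Q1"): 1.10, ("Alpha", "Q2"): 1.25, ("Alpha", "Q3"): 1.05, ("Alpha", "Q4"): 1.40,
    ("Beta", "Q1"): 0.80, ("Beta", "Q2"): 0.85, ("Beta", "Q3"): 0.90, ("Beta", "Q4"): 0.95,
    ("Gamma", "Q1"): 2.00, ("Gamma", "Q2"): 1.70, ("Gamma", "Q3"): 2.30, ("Gamma", "Q4"): 1.90,
}

# Cross-sectional data: many entities observed at one point in time (here Q1).
cross_section = {
    company: eps for (company, quarter), eps in panel.items() if quarter == "Q1"
}

# Time series data: one entity observed over sequential periods.
# Dictionaries preserve insertion order, so the quarters stay in sequence.
time_series = [eps for (company, quarter), eps in panel.items() if company == "Alpha"]

print(cross_section)
print(time_series)
```

The panel holds 3 companies × 4 quarters = 12 observations; fixing the quarter gives a cross-section, and fixing the company gives a time series.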
Structured vs Unstructured Data
Structured data
Highly organized in a pre-defined manner, usually with repeating patterns. E.g. daily closing stock prices, earnings per share, dividend yield, return on equity, or forecasted earnings growth.
Unstructured data
Does not follow any conventionally organized form. Some common types of unstructured data are text (such as financial news, posts on social media, and company filings with regulators) and audio/video (such as management's earnings calls and presentations to analysts).
DATA SUMMARIZATION
Summarizing Data Using Frequency Distributions: A frequency distribution (also called a one-way table) is a tabular
display of data constructed either by counting the observations of a variable by distinct values or groups or by tallying the
values of a numerical variable into a set of numerically ordered bins.
The procedure for constructing a frequency distribution for numerical data can be stated as follows:
•Sort the data in ascending order.
•Calculate the range of the data, defined as Range = Maximum value – Minimum value.
•Decide on the number of bins (k) in the frequency distribution.
•Determine bin width as Range/k.
•Determine the first bin by adding the bin width to the minimum value. Then, determine the remaining bins by
successively adding the bin width to the prior bin’s end point and stopping after reaching a bin that includes the
maximum value.
•Determine the number of observations falling into each bin by counting the number of observations whose values are
equal to or exceed the bin minimum value yet are less than the bin’s maximum value. The exception is in the last bin,
where the maximum value is equal to the last bin’s maximum, and therefore, the observation with the maximum value is
included in this bin’s count.
•Construct a table of the bins listed from smallest to largest that shows the number of observations falling into each bin.
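The steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `frequency_distribution` and the return figures are hypothetical, and the sketch assumes the data are not all identical (so the bin width is non-zero).

```python
def frequency_distribution(values, k):
    """Tally numerical values into k equal-width bins, following the steps above."""
    data = sorted(values)                 # sort in ascending order
    lo, hi = data[0], data[-1]
    width = (hi - lo) / k                 # bin width = range / k
    counts = [0] * k
    for x in data:
        # Each bin includes its lower bound and excludes its upper bound,
        # except the last bin, which also includes the maximum value.
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]
    return list(zip(bins, counts))


# Hypothetical returns in percent (illustrative values only).
returns = [-4.0, -1.5, 0.5, 1.0, 2.0, 2.5, 3.0, 4.5, 6.0]
table = frequency_distribution(returns, k=5)
for (start, end), count in table:
    print(f"{start:5.1f} to {end:5.1f}: {count}")
```

With a range of 10 and five bins, each bin is 2 percentage points wide; the observation equal to the maximum (6.0) is counted in the last bin.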
DATA SUMMARIZATION
As Exhibit 13 shows, there is substantial variation in these equity index returns. One-third of the observations fall in the 7.0 to
8.0% bin, making it the bin with the most observations. The 6.0 to 7.0% bin and the 8.0 to 9.0% bin hold four
observations each, with each accounting for 22.22 percent of the total number of observations. The two remaining
bins have fewer observations: one and three, respectively.
DATA VISUALIZATION
Visualization is the presentation of data in a pictorial or graphical format for the purpose of increasing understanding and for
gaining insights into the data.
•Histogram
•Bar chart
•Tree Map
•Word Cloud
•Line Chart
•Scatter Plot
•Heat Map
When it comes to selecting a chart for visualizing data, the intended purpose is the key consideration: Is it for exploring
and/or presenting distributions or relationships, or is it for making comparisons? Given your intended purpose, the best
selection is typically the simplest visual that conveys the message or achieves the specific goal.
DATA VISUALIZATION: Histogram and Frequency Polygon
Bar chart
• Bar Chart: a tool to express the frequency distribution of categorical data.
• In a bar chart, each bar represents a distinct category, with the bar's height proportional to the frequency of the corresponding category.
• [Figure: Frequency by Sector for Stocks in a Portfolio]
• Purpose: Making comparisons between different categories.
• Example: Comparing sales performance of different products in a store over a month. Each bar represents a product, and the height of the bar corresponds to the total sales for that product.
Tree-Map
• Tree-Map: consists of a set of colored rectangles to represent distinct groups; the area of each rectangle is proportional to the value of the corresponding group.
• [Figure: Tree-Map for Frequency Distribution by Sector in a Portfolio]
Selecting Visualization Types
(1) A portfolio manager plans to buy several stocks traded on a small emerging market exchange but is concerned whether the
market can provide sufficient liquidity to support her purchase order size. As the first step, she wants to analyze the daily
trading volumes of one of these stocks over the past five years. Explain which type of chart can best provide a quick view of
trading volume for the given period.
Solution to 1: The five-year history of daily trading volumes contains a large amount of numerical data. Therefore, a histogram
is the best chart for grouping these data into frequency distribution bins and for showing a quick snapshot of the shape,
center, and spread of the data’s distribution.
(2) An analyst is building a model to predict stock market downturns. According to the academic literature and his practitioner
knowledge and expertise, he has selected 10 variables as potential predictors. Before continuing to construct the model, the
analyst would like to get a sense of how closely these variables are associated with the broad stock market index and whether
any pair of variables are associated with each other. Describe the most appropriate visual to select for this purpose.
Solution to 2: To inspect for a potential relationship between two variables, a scatter plot is a good choice. But with 10
variables, plotting individual scatter plots is not an efficient approach. Instead, utilizing a scatter plot matrix would give the
analyst a good overview in one comprehensive visual of all the pairwise associations between the variables.
Selecting Visualization Types
(3) Central Bank members meet regularly to assess the economy and decide on any interest rate changes. Minutes of their
meetings are published on the Central Bank’s website. A quantitative researcher wants to analyze the meeting minutes for
use in building a model to predict future economic growth.
Explain which type of chart is most appropriate for creating an overview of the meeting minutes.
Solution to 3: Since the meeting minutes consist of textual data, a word cloud would be the most suitable tool to visualize the
textual data and facilitate the researcher’s understanding of the topic of the text as well as the sentiment, positive or
negative, it may convey.
(4) A private investor wants to add a stock to her portfolio, so she asks her financial adviser to compare the three-year
financial performances (by quarter) of two companies. One company experienced consistent revenue and earnings growth,
while the other experienced volatile revenue and earnings growth, including quarterly losses.
Describe the chart the adviser should use to best show these performance differences.
Solution to 4: The best chart for making this comparison would be a bubble line chart using two different color lines to
represent the quarterly revenues for each company. The bubble sizes would then indicate the magnitude of each company’s
quarterly earnings, with green bubbles signifying profits and red bubbles indicating losses.
MEASURES OF CENTRAL TENDENCY
In this section, we discuss the use of quantitative measures that explain characteristics of data. Our focus is on measures of central tendency and other
measures of location. A measure of central tendency specifies where the data are centered. A measure such as the mean or median provides a single
point that represents the center or average position of the data and helps us understand where most of the data points cluster.
Example: Imagine you have a set of financial data, like stock prices over a month. The average (mean) stock price is a measure of central tendency:
it gives you an idea of where the bulk of the stock prices are centered in that period.
MEASURES OF CENTRAL TENDENCY
Arithmetic mean (Mean)
• The arithmetic mean is the sum of the values of the observations divided by the number of
observations.
Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
For example, if a sample of market capitalizations for six publicly traded Australian companies
contains the values (in AUD billions) 35, 30, 22, 18, 15, and 12, the sample mean market cap is
132/6 = A$22 billion. As previously noted, the sample mean is a statistic (that is, a descriptive
measure of a sample).
• Outliers are very common in financial data and can skew the mean of a data series. However,
outliers may contain useful information, so they should not be removed automatically. Techniques such as the
trimmed mean and the winsorized mean can be used to adjust for outliers.
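As a rough sketch of these ideas, the snippet below computes the arithmetic mean of the six market capitalizations from the example, together with a trimmed and a winsorized mean. The function names and the count-based `cut` parameter are assumptions made for illustration; in practice the cut-off is often stated as a percentage of observations.

```python
def arithmetic_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)


def trimmed_mean(xs, cut):
    """Drop the `cut` smallest and `cut` largest observations, then average."""
    data = sorted(xs)
    return arithmetic_mean(data[cut:len(data) - cut])


def winsorized_mean(xs, cut):
    """Replace the `cut` most extreme values at each end with the nearest kept value."""
    data = sorted(xs)
    low, high = data[cut], data[len(data) - cut - 1]
    clipped = [min(max(x, low), high) for x in data]
    return arithmetic_mean(clipped)


# Market capitalizations from the example above, in AUD billions.
caps = [35, 30, 22, 18, 15, 12]
print(arithmetic_mean(caps))      # 132 / 6 = 22.0
print(trimmed_mean(caps, cut=1))  # mean of 15, 18, 22, 30
print(winsorized_mean(caps, cut=1))
```

The trimmed mean discards the extreme observations, whereas the winsorized mean keeps the sample size unchanged by pulling the extremes in to the nearest retained values.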
MEASURES OF CENTRAL TENDENCY
Geometric mean: Geometric mean is often used to analyse investment returns or growth rates over multiple periods.
When dealing with rates of change over time, the geometric mean is useful because it accounts for compounding. In finance,
assets or portfolios often experience compounding effects due to the reinvestment of returns.
• The geometric mean is most frequently used to average rates of change over time or to compute the growth rate of a
variable. Imagine you have an investment that earns a certain percentage return each year. The geometric mean gives
an average return that takes into account the compounding of these yearly returns: it is the constant
growth rate that would produce the same ending value as the actual sequence of returns.
•In investments, we frequently use the geometric mean to either average a time series of rates of return on an asset or a
portfolio or to compute the growth rate of a financial variable, such as earnings or sales.
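A minimal sketch of the geometric mean return, assuming the standard definition $R_G = \left[\prod_{t=1}^{T}(1 + R_t)\right]^{1/T} - 1$ and returns greater than -100%. The function name and the two-period return series are hypothetical, chosen to show why compounding matters.

```python
import math


def geometric_mean_return(returns):
    """Compound the (1 + return) factors, take the T-th root, subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical returns: +100% in year 1, then -50% in year 2.
returns = [1.00, -0.50]
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)
print(arithmetic)  # simple average of the two returns
print(geometric)   # compound growth rate per year
```

An investment that doubles and then halves ends where it started, so its compound growth rate is zero even though the arithmetic average of the two returns is 25%.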
MEASURES OF CENTRAL TENDENCY
The Weighted Mean
The weighted arithmetic mean is similar to an ordinary arithmetic mean, except that instead of each of the data
points contributing equally to the final average, some data points contribute more than others.
Example: Think of the weighted arithmetic mean as a way of finding an average (like you would with regular grades),
but with a twist. In a regular average, every piece of information gets the same importance. However, with a
weighted average, some pieces of information count more than others.
Consider a portfolio made up of 35% stocks, 60% bonds, and 5% crypto. If these asset classes
earned returns of 20, 11, and 5.5 percent, respectively, what is the return of the portfolio?
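The portfolio question can be worked through as a weighted mean, $\bar{X}_w = \sum_i w_i X_i$ with weights summing to 1. The sketch below is illustrative only; the function name `weighted_mean` is an assumption.

```python
def weighted_mean(values, weights):
    """Sum of weight * value; the weights are assumed to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


weights = [0.35, 0.60, 0.05]   # stocks, bonds, crypto
returns = [20.0, 11.0, 5.5]    # returns in percent
portfolio_return = weighted_mean(returns, weights)
print(round(portfolio_return, 3))  # 0.35*20 + 0.60*11 + 0.05*5.5
```

Term by term, the contributions are 7.0, 6.6, and 0.275 percentage points, for a portfolio return of 13.875%.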
The Harmonic Mean
The harmonic mean is the value obtained by summing the reciprocals of the
observations (terms of the form 1/Xi), averaging that sum by dividing it by the
number of observations n, and, finally, taking the reciprocal of the average:
$\bar{X}_H = \frac{n}{\sum_{i=1}^{n} (1/X_i)}$, with $X_i > 0$.
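A short sketch of the harmonic mean, following the verbal definition above. The cost-averaging scenario (investing the same amount at two different share prices) is a hypothetical illustration, and the function name is an assumption; Python's standard library offers `statistics.harmonic_mean` for the same calculation.

```python
import statistics


def harmonic_mean(xs):
    """Reciprocal of the average of the reciprocals (all values must be positive)."""
    return len(xs) / sum(1 / x for x in xs)


# Hypothetical: the same amount of money is invested at a price of 10, then at 15.
prices = [10.0, 15.0]
average_cost = harmonic_mean(prices)
print(average_cost)
print(statistics.harmonic_mean(prices))  # library equivalent
```

Because more shares are bought at the lower price, the average cost per share (about 12) is below the arithmetic mean price of 12.5.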
MEASURES OF DISPERSION
• Dispersion is the variability around the central tendency: it describes how spread out the observations are around
the average. In the context of investments, it tells us how much returns deviate from the average.
• To understand an investment we need to know how returns are dispersed around the mean.
MEASURES OF DISPERSION
Why it Matters for Investments: If you're considering an investment, just knowing the average return (central
tendency) isn't enough. You also want to understand how much the actual returns tend to deviate from that
average. High dispersion means higher risk because returns can vary widely. Low dispersion means more
predictable returns.
In simpler terms, measures of dispersion help you see not just the average performance of an investment but
also how uncertain or risky that performance might be. This is important for investors who want to have a clearer
picture of what to expect from their investment over time.
MEASURES OF DISPERSION
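Mean absolute deviation: $\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
where $\bar{X}$ is the sample mean, $|\cdot|$ denotes the absolute value, and $n$ is the number of observations.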
The mean absolute deviation uses all of the observations in the sample and is thus superior
to the range as a measure of dispersion.
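To make the comparison concrete, the sketch below computes the range and the mean absolute deviation (the average absolute distance of each observation from the sample mean) for the market-capitalization sample used earlier in the chapter. The function names are assumptions for illustration.

```python
def value_range(xs):
    """Range = maximum value - minimum value; uses only two observations."""
    return max(xs) - min(xs)


def mean_absolute_deviation(xs):
    """Average absolute deviation from the sample mean; uses every observation."""
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)


# Market capitalizations in AUD billions (sample mean = 22).
caps = [35, 30, 22, 18, 15, 12]
print(value_range(caps))              # 35 - 12
print(mean_absolute_deviation(caps))  # (13 + 8 + 0 + 4 + 7 + 10) / 6
```

The range (23) depends only on the two extreme values, while the MAD (7) reflects the deviation of every observation from the mean.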
MEASURES OF DISPERSION
The variance and standard deviation, which are based on squared deviations, are the two most
widely used measures of dispersion.
• Variance:
Defined as the average of the squared deviations around the mean.
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• Standard Deviation:
The positive square root of the variance. The standard deviation is used as a measure of risk.
Sample std. dev: $s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
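A brief sketch of the sample variance and sample standard deviation, assuming the usual sample convention of dividing the sum of squared deviations by n - 1. The function names are assumptions; `statistics.variance` and `statistics.stdev` from Python's standard library use the same convention and are shown for comparison.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def sample_std_dev(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Market capitalizations in AUD billions (sample mean = 22).
caps = [35, 30, 22, 18, 15, 12]
print(sample_variance(caps))   # 398 / 5
print(sample_std_dev(caps))
print(statistics.variance(caps), statistics.stdev(caps))  # library equivalents
```

Unlike the variance, the standard deviation is expressed in the same units as the data (here AUD billions), which makes it easier to interpret.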
MEASURES OF DISPERSION
• Coefficient of Variation:
A measure of the relative dispersion of observations around the mean. It is the ratio of the
standard deviation of a data series to its mean value:
$CV = \frac{s}{\bar{X}}$
where $s$ is the sample standard deviation and $\bar{X}$ is the sample mean.
• Useful for comparing datasets that have different mean values or units of measurement.
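The unit-free property can be illustrated with a small sketch. The function name is an assumption, and the second series is simply the first one restated in different units (AUD millions instead of AUD billions).

```python
import statistics


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the sample mean (mean must be non-zero)."""
    return statistics.stdev(xs) / statistics.fmean(xs)


caps_billions = [35, 30, 22, 18, 15, 12]
caps_millions = [x * 1000 for x in caps_billions]

print(statistics.stdev(caps_billions), statistics.stdev(caps_millions))
print(coefficient_of_variation(caps_billions))
print(coefficient_of_variation(caps_millions))
```

Changing the unit multiplies the standard deviation by 1,000 but leaves the coefficient of variation unchanged, which is why it suits comparisons across series with different scales.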
THE SHAPE OF THE DISTRIBUTIONS
Mean and variance may not adequately describe an investment’s distribution of returns. In calculations
of variance, for example, the deviations around the mean are squared, so we do not know whether large
deviations are likely to be positive or negative.
The normal distribution is bell shaped and symmetrical. It plays a central role in the mean-variance model of
portfolio selection; it is also used extensively in financial risk management.
THE SHAPE OF THE DISTRIBUTIONS
Skewness
Like variance, skewness is computed using each observation's deviation from its mean. Skewness (sometimes referred to as
relative skewness) is computed as the average cubed deviation from the mean, standardized by dividing by the standard
deviation cubed to make the measure free of scale:
$\text{Skewness} \approx \frac{1}{n} \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{s^3}$
✔ Positively skewed unimodal distribution: mode is less than the median, which is less than the mean.
✔ Negatively skewed unimodal distribution: mean is less than the median, which is less than the mode.
For a given expected return and standard deviation, investors should be attracted by a positive skew because the mean
return lies above the median. Relative to the mean return, positive skew amounts to limited, though frequent,
downside returns compared with somewhat unlimited, but less frequent, upside returns.
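A minimal sketch of the skewness calculation described above: the average cubed deviation from the mean divided by the cubed standard deviation. Using the sample standard deviation (n - 1 divisor) as the scale is an assumption of this sketch, and both data sets are hypothetical.

```python
import statistics


def sample_skewness(xs):
    """Average cubed deviation from the mean, divided by the standard deviation cubed."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # assumption: sample standard deviation as the scale
    return sum((x - mean) ** 3 for x in xs) / (n * s ** 3)


symmetric = [1, 2, 3, 4, 5]       # deviations cancel out when cubed
right_skewed = [1, 1, 1, 2, 10]   # one large upside observation

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
print(statistics.fmean(right_skewed), statistics.median(right_skewed))
```

The symmetric series has zero skewness, while the series with a single large upside value is positively skewed and its mean (3) lies above its median (1).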
THE SHAPE OF THE DISTRIBUTIONS
Kurtosis
A measure of the combined weight of the tails of a distribution relative to the rest of the distribution.
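One common way to quantify this, sketched below under stated assumptions, is the average fourth-power deviation from the mean divided by the standard deviation to the fourth power, minus 3 (the kurtosis of a normal distribution), which gives excess kurtosis. The formula choice, the function name, and both data sets are assumptions for illustration; small-sample corrections are ignored.

```python
import statistics


def excess_kurtosis(xs):
    """Average fourth-power deviation over s^4, minus 3 (the normal benchmark)."""
    n = len(xs)
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # assumption: sample standard deviation as the scale
    return sum((x - mean) ** 4 for x in xs) / (n * s ** 4) - 3


thin_tails = [1, 2, 3, 4, 5]                    # evenly spread, no extreme values
fat_tails = [-10, -1, 0, 0, 0, 0, 0, 1, 10]     # mostly near zero, two extremes

print(excess_kurtosis(thin_tails))
print(excess_kurtosis(fat_tails))
```

The series with occasional extreme observations has the higher kurtosis, reflecting the greater weight in its tails.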
CORRELATION BETWEEN TWO VARIABLES
COVARIANCE
• The first step is to consider how two variables vary together, their covariance.
• The sample covariance (sXY) is a measure of how two variables in a sample move together:
$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
• The formula indicates that the sample covariance is the average value of the product of the deviations of
observations on two random variables (Xi and Yi) from their sample means.
• By itself, the size of the covariance is difficult to interpret because it is not normalized
and so depends on the magnitude of the variables.
CORRELATION BETWEEN TWO VARIABLES
• The sample correlation coefficient is a standardized measure of how two variables in a sample
move together.
• The sample correlation coefficient (rXY) is the ratio of the sample covariance to the product of
the two variables' standard deviations:
$r_{XY} = \frac{s_{XY}}{s_X s_Y}$
1. Correlation ranges from –1 to +1 for two random variables, X and Y.
2. A correlation of 0 (uncorrelated variables) indicates an absence of any linear (that is, straight-
line) relationship between the variables.
3. A positive correlation close to +1 indicates a strong positive linear relationship. A correlation of +1
indicates a perfect positive linear relationship.
4. A negative correlation close to –1 indicates a strong negative (that is, inverse) linear
relationship. A correlation of –1 indicates a perfect inverse linear relationship.
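The relationship between covariance and correlation can be sketched as follows, assuming the sample (n - 1) convention for both the covariance and the standard deviations. The function names and data are hypothetical.

```python
import statistics


def sample_covariance(xs, ys):
    """Sum of the products of deviations from the sample means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]     # perfect positive linear relationship
y_down = [-v for v in x]          # perfect inverse linear relationship
x_scaled = [100 * v for v in x]   # same variable in different units

print(sample_covariance(x, y_up), sample_covariance(x_scaled, y_up))
print(sample_correlation(x, y_up), sample_correlation(x_scaled, y_up))
print(sample_correlation(x, y_down))
```

Rescaling X multiplies the covariance by 100 but leaves the correlation unchanged, which illustrates why the standardized measure is easier to interpret.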