Publication Excerpt Dazong Richard Hosea
Correlation Coefficients
By Dazong Richard Hosea
Many practitioners use Pearson and Spearman interchangeably without fully understanding
the consequences. This project aims to explore the mathematical relationship and
performance differences between these two measures under simulated conditions,
addressing when each should be preferred.
This research will provide deeper insights into the appropriate use cases for each
correlation coefficient, guiding researchers and practitioners in selecting the most robust
statistical tool for their data type and research design.
This study focuses on simulated datasets of 30 to 1000 observations drawn from different
distributions (normal, uniform, exponential). The study does not involve real-world
datasets and limits itself to bivariate correlation analysis.
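Pearson’s Correlation Coefficient: A measure of the strength and direction of the linear
relationship between two variables (Benesty et al., 2009). The coefficient can be computed
for any pair of numeric variables; it is the standard significance tests on it that assume
both variables are approximately normally distributed.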
Spearman’s Rank Correlation Coefficient: A non-parametric measure of correlation based
on the rank values of the variables instead of raw data, useful when the relationship is
monotonic but not necessarily linear (Sheskin, 2004).
Linear Relationship: A type of relationship that can be described by a straight line equation,
where changes in one variable predict changes in another with a constant ratio (Rodgers &
Nicewander, 1988).
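The three definitions above can be illustrated with a minimal sketch. The data values are hypothetical and chosen only so that the relationship is monotonic but not linear; SciPy's `pearsonr`, `spearmanr`, and `rankdata` are assumed to be available.

```python
import numpy as np
from scipy import stats

# Hypothetical illustration data: y is a monotonic but non-linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

pearson_r, _ = stats.pearsonr(x, y)      # linear association on the raw values
spearman_rho, _ = stats.spearmanr(x, y)  # association on the ranks

# Spearman's rho is Pearson's r applied to the ranks of the data.
rank_r, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(f"Pearson r    = {pearson_r:.4f}")
print(f"Spearman rho = {spearman_rho:.4f}")
print(f"r on ranks   = {rank_r:.4f}")
```

Because the ordering of y matches the ordering of x exactly, Spearman's coefficient equals 1, while Pearson's coefficient falls below 1 because the points do not lie on a straight line.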
This project is structured into five chapters. Chapter One introduces the study and outlines
the problem, objectives, and significance. Chapter Two reviews relevant literature on
correlation theory and past empirical findings. Chapter Three details the methodology,
including the mathematical derivation of Pearson and Spearman correlations and
simulation techniques. Chapter Four presents and analyzes the results. Chapter Five
concludes with findings and offers recommendations for future research (Creswell, 2014).
CHAPTER TWO
LITERATURE REVIEW
Pearson’s correlation coefficient (r) is derived from the covariance of the variables divided
by the product of their standard deviations. It is suitable when data is normally distributed
and assumes homoscedasticity (Rodgers & Nicewander, 1988). Pearson's r ranges between
-1 and +1, where +1 indicates a perfect positive linear relationship and -1 indicates a perfect
negative one (Benesty et al., 2009).
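The definition above (covariance divided by the product of the standard deviations) can be checked numerically. The following is a minimal sketch; the function name `pearson_from_covariance` and the data values are illustrative assumptions, not part of the study.

```python
import numpy as np

def pearson_from_covariance(x, y):
    """Pearson's r as cov(x, y) / (sd(x) * sd(y)), using sample (n - 1) estimates."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]  # off-diagonal entry is cov(x, y)
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Hypothetical data for illustration only.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

r_manual = pearson_from_covariance(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's built-in correlation matrix

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```

The two values agree because the (n - 1) factors in the covariance and the standard deviations cancel, so the choice of `ddof` does not affect r as long as it is used consistently.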
Several empirical studies have compared the performance of Pearson’s and Spearman’s
coefficients under different data conditions. Myers and Well (2003) found that in the
presence of outliers, Spearman’s rho maintains higher accuracy than Pearson’s r. Corder
and Foreman (2014) also demonstrated that Spearman’s correlation provides more reliable
results when analyzing non-linear but monotonic trends. However, when data satisfies the
assumptions of linearity and normality, Pearson’s r tends to offer greater statistical power
(Hauke & Kossowski, 2011).
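The sensitivity to outliers described above can be sketched with a small constructed example. The data are hypothetical (a perfect linear relationship plus one extreme point) and are not drawn from any of the cited studies.

```python
import numpy as np
from scipy import stats

# Hypothetical clean data: a perfect linear relationship.
x = np.arange(20, dtype=float)
y = x.copy()
r_clean, _ = stats.pearsonr(x, y)
rho_clean, _ = stats.spearmanr(x, y)

# Append a single extreme outlier (large x, very small y).
x_out = np.append(x, 20.0)
y_out = np.append(y, -100.0)
r_out, _ = stats.pearsonr(x_out, y_out)
rho_out, _ = stats.spearmanr(x_out, y_out)

print(f"clean:        r = {r_clean:.3f}, rho = {rho_clean:.3f}")
print(f"with outlier: r = {r_out:.3f}, rho = {rho_out:.3f}")
```

In this constructed case the single outlier is enough to turn Pearson's r negative, whereas Spearman's rho stays clearly positive because the outlier can displace the ranks only by a bounded amount.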
Although much research has compared Pearson and Spearman correlations, fewer studies
have examined their behavior under simulated conditions with systematically varied data
properties. There is also limited work integrating both theoretical derivations and empirical
simulations in a unified framework, which this study aims to address (Mukaka, 2012).
CHAPTER THREE
METHODOLOGY
The population in this simulation study consists of theoretical data structures designed to
represent various correlation patterns. Samples of varying sizes (n = 30, 100, 300, 1000)
will be drawn randomly from synthetic populations using NumPy-based random number
generators. Each synthetic dataset will undergo correlation analysis using both Pearson’s
and Spearman’s techniques.
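A minimal sketch of the sampling procedure described above follows. The seed, the population correlation of 0.6, and the use of a bivariate normal population are assumptions made for illustration; the study does not state these values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only
sample_sizes = (30, 100, 300, 1000)
target_rho = 0.6                      # assumed population correlation
cov = [[1.0, target_rho], [target_rho, 1.0]]

results = {}
for n in sample_sizes:
    # Draw n observations from a synthetic bivariate normal population.
    sample = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    x, y = sample[:, 0], sample[:, 1]
    r, _ = stats.pearsonr(x, y)
    rho, _ = stats.spearmanr(x, y)
    results[n] = (r, rho)
    print(f"n = {n:4d}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Repeating the loop over many seeds would give the sampling distribution of each coefficient at each sample size, which is one way to compare their variability.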
nₕ = (Nₕ / N) × n
Where:
nₕ = sample size for stratum h
Nₕ = population size of stratum h
N = total population size
n = total sample size
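The allocation formula above can be worked through with a short sketch. The stratum sizes and the total sample size are hypothetical, and the function name `proportional_allocation` is introduced only for this illustration.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Return n_h = (N_h / N) * n for each stratum h."""
    total_population = sum(stratum_sizes)  # N
    return [total_sample * size / total_population for size in stratum_sizes]

# Hypothetical strata: N_h = 500, 300, 200 (so N = 1000), with n = 100.
allocation = proportional_allocation([500, 300, 200], 100)
print(allocation)
```

When the results are not whole numbers, they must be rounded, and the rounded values may need adjusting so that they still sum to n.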
Data were generated from known distributions: normal (μ = 0, σ = 1) and uniform.
Non-linear monotonic transformations (e.g., exponential and logarithmic) were applied to
assess non-linear relationships. Data simulation and analysis were conducted in Python
using NumPy, pandas, matplotlib, and SciPy. These tools enabled efficient generation of
random data and accurate computation of correlation coefficients.
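The data-generation step can be sketched as follows. The seed, the sample size of 300, and the uniform bounds (1 to 10, chosen so that the logarithm is defined) are assumptions for illustration rather than values reported by the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)          # arbitrary seed
n = 300                                      # assumed sample size
x = rng.normal(loc=0.0, scale=1.0, size=n)   # normal, mu = 0, sigma = 1
u = rng.uniform(low=1.0, high=10.0, size=n)  # uniform; bounds are assumed

# Non-linear but strictly monotonic transformations.
pairs = {"exponential": (x, np.exp(x)), "logarithmic": (u, np.log(u))}

coefficients = {}
for label, (a, b) in pairs.items():
    r, _ = stats.pearsonr(a, b)
    rho, _ = stats.spearmanr(a, b)
    coefficients[label] = (r, rho)
    print(f"{label:12s}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A strictly increasing transformation leaves the ranks unchanged, so Spearman's rho equals 1 in both cases, while Pearson's r drops below 1 to a degree that depends on how strongly the curve departs from a straight line.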
CHAPTER FOUR
4.1 Introduction
This chapter presents the results of the simulation and statistical analysis conducted to
explore the relationship between Pearson’s and Spearman’s correlation coefficients. The
synthetic datasets were analyzed under conditions of linear, monotonic, and nonlinear
associations. Both correlation coefficients were computed and compared across multiple
sample sizes and distributions.
Three types of data scenarios were generated to examine how Pearson and Spearman
respond to various relationships:
1. Linear relationship with normally distributed data
2. Non-linear monotonic relationship (exponential)
3. Non-monotonic relationship (sinusoidal)
Each scenario was repeated at sample sizes of 30, 100, and 300.
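The three scenarios might be generated along the following lines. This is a sketch under stated assumptions: the slope, the noise level, the range of x in the sinusoidal case, and the helper name `make_scenario` are all illustrative and are not taken from the study.

```python
import numpy as np
from scipy import stats

def make_scenario(kind, n, rng, noise_sd=0.5):
    """Generate (x, y) for one scenario; all parameter values are assumed."""
    noise = rng.normal(0.0, noise_sd, size=n)
    if kind == "linear":
        x = rng.normal(0.0, 1.0, size=n)
        return x, 2.0 * x + noise
    if kind == "exponential":
        x = rng.normal(0.0, 1.0, size=n)
        return x, np.exp(x) + noise
    if kind == "sinusoidal":
        x = rng.uniform(0.0, 4.0 * np.pi, size=n)  # two full periods
        return x, np.sin(x) + noise
    raise ValueError(f"unknown scenario: {kind}")

rng = np.random.default_rng(seed=1)  # arbitrary seed
results = {}
for kind in ("linear", "exponential", "sinusoidal"):
    for n in (30, 100, 300):
        x, y = make_scenario(kind, n, rng)
        r, _ = stats.pearsonr(x, y)
        rho, _ = stats.spearmanr(x, y)
        results[(kind, n)] = (r, rho)
        print(f"{kind:11s} n = {n:3d}: r = {r:6.3f}, rho = {rho:6.3f}")
```

The added noise keeps the coefficients away from their extreme values, so the two measures can be compared across sample sizes rather than only in the noise-free limit.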
From the table above, Pearson’s and Spearman’s coefficients yield closely similar values in
the linear-normal datasets, indicating both metrics perform well under ideal assumptions.
However, for exponential and sinusoidal datasets, Spearman’s coefficient often remains
higher, showing better robustness to non-linearities.
Graphs for each data scenario illustrate the observed relationships. Scatter plots for linear
data show a tightly clustered linear pattern, whereas exponential and sinusoidal data show
patterns where Pearson's measure underestimates the strength of association compared to
Spearman’s.
CHAPTER FIVE
This study investigated the relationship between Pearson’s and Spearman’s correlation
coefficients both mathematically and through simulation. Key findings are:
- Pearson’s coefficient is highly effective when the data follows a linear and normally
distributed pattern.
- Spearman’s coefficient is more robust to outliers and better captures non-linear
monotonic relationships.
- In simulated datasets with exponential and sinusoidal structures, Spearman's coefficient
demonstrated greater consistency than Pearson's.
- Both coefficients showed similar values under ideal conditions, but diverged significantly
in complex data distributions.
5.2 Conclusion
Pearson and Spearman correlation coefficients, while related, serve different analytical
purposes depending on the nature of the data. Pearson’s method is optimal for linear
associations under strict assumptions, whereas Spearman’s rank-based method provides a
more flexible tool for analyzing ordinal or non-normally distributed data. The study
confirms that the choice of correlation metric should be informed by the underlying data
structure and research objective.
5.3 Recommendations