Publication Excerpt Dazong Richard Hosea

This study investigates the mathematical relationship and performance differences between Pearson's and Spearman's correlation coefficients through simulations of various data conditions. Key findings indicate that Pearson's is effective for linear, normally distributed data, while Spearman's is more robust to outliers and better for non-linear monotonic relationships. The research emphasizes the importance of selecting the appropriate correlation metric based on the data structure and research objectives.


Investigating the Relationship Between Pearson's and Spearman's Correlation Coefficients
By Dazong Richard Hosea

Department of Statistics, Federal University Kashere, Gombe State, Nigeria

1.2 Statement of the Problem

Many practitioners use Pearson and Spearman interchangeably without fully understanding
the consequences. This project aims to explore the mathematical relationship and
performance differences between these two measures under simulated conditions,
addressing when each should be preferred.

1.3 Objectives of the Study

- To mathematically derive Pearson’s and Spearman’s correlation coefficients.
- To simulate data under various conditions (normal, skewed, nonlinear) and compute both
coefficients.
- To compare their behaviors in terms of sensitivity to outliers, linearity, and monotonicity.

1.4 Research Questions

- How do Pearson’s and Spearman’s correlation coefficients relate mathematically?
- What are the differences in behavior between the two under simulated data?
- Under what conditions do their values significantly diverge?

1.5 Significance of the Study

This research will provide deeper insights into the appropriate use cases for each
correlation coefficient, guiding researchers and practitioners in selecting the most robust
statistical tool for their data type and research design.

1.6 Scope and Limitations

This study focuses on simulated datasets of 30 to 1000 observations drawn from different
distributions (normal, uniform, exponential). The study does not involve real-world
datasets and limits itself to bivariate correlation analysis.

1.7 Operational Definition of Terms

Pearson’s Correlation Coefficient: A measure of the strength and direction of the linear
relationship between two variables. Inference based on it conventionally assumes that both
variables are normally distributed (Benesty et al., 2009).

Spearman’s Rank Correlation Coefficient: A non-parametric measure of correlation based
on the rank values of the variables instead of raw data, useful when the relationship is
monotonic but not necessarily linear (Sheskin, 2004).

Linear Relationship: A type of relationship that can be described by a straight-line equation,
where a change in one variable corresponds to a change in the other at a constant rate
(Rodgers & Nicewander, 1988).

Monotonic Relationship: A relationship that is consistently increasing or decreasing but not
necessarily at a constant rate (Lehmann, 2006).

1.8 Structure of the Study

This project is structured into five chapters. Chapter One introduces the study and outlines
the problem, objectives, and significance. Chapter Two reviews relevant literature on
correlation theory and past empirical findings. Chapter Three details the methodology,
including the mathematical derivation of Pearson and Spearman correlations and
simulation techniques. Chapter Four presents and analyzes the results. Chapter Five
concludes with findings and offers recommendations for future research (Creswell, 2014).

CHAPTER TWO

LITERATURE REVIEW

2.1 Theoretical Foundations of Correlation

Correlation is a statistical measure used to describe the strength and direction of a
relationship between two variables. The origins of correlation theory trace back to Francis
Galton, who introduced the concept of regression toward the mean (Galton, 1886). Karl
Pearson formalized this concept mathematically, resulting in the Pearson correlation
coefficient, which measures linear dependence between variables (Pearson, 1896). Later,
Spearman introduced a rank-based method that measures the degree of monotonic
association (Spearman, 1904).

2.2 Pearson’s Correlation Coefficient

Pearson’s correlation coefficient (r) is derived from the covariance of the variables divided
by the product of their standard deviations. It is suitable when data is normally distributed
and assumes homoscedasticity (Rodgers & Nicewander, 1988). Pearson's r ranges between
-1 and +1, where +1 indicates a perfect positive linear relationship and -1 indicates a perfect
negative one (Benesty et al., 2009).

2.3 Spearman’s Rank Correlation Coefficient

Spearman’s rho is a non-parametric measure of correlation based on the ranked values of
the data. It does not assume normality or linearity and is more robust to outliers and
skewed distributions (Sheskin, 2004). This makes Spearman’s coefficient suitable for
ordinal data or data that fails to meet the assumptions required for Pearson’s r (Lehmann,
2006).
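
The rank-based definition can be illustrated with a minimal sketch: Spearman’s rho is what one obtains by applying Pearson’s r to the ranks of the data. The seed, sample size, and noise level below are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility
x = rng.normal(size=50)
y = x + rng.normal(scale=0.5, size=50)  # illustrative noisy linear relation

# Replace each observation by its rank (ties would receive average ranks).
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

# Pearson's r computed on the ranks ...
rho_via_ranks, _ = stats.pearsonr(rank_x, rank_y)
# ... agrees with SciPy's Spearman implementation on the raw data.
rho_scipy, _ = stats.spearmanr(x, y)

print(rho_via_ranks, rho_scipy)
```

Because only the ranks enter the computation, the magnitudes of the raw values have no influence on rho beyond their ordering.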

2.4 Comparative Empirical Studies

Several empirical studies have compared the performance of Pearson’s and Spearman’s
coefficients under different data conditions. Myers and Well (2003) found that in the
presence of outliers, Spearman’s rho maintains higher accuracy than Pearson’s r. Corder
and Foreman (2014) also demonstrated that Spearman’s correlation provides more reliable
results when analyzing non-linear but monotonic trends. However, when data satisfies the
assumptions of linearity and normality, Pearson’s r tends to offer greater statistical power
(Hauke & Kossowski, 2011).
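
The sensitivity to outliers described in these studies can be sketched with a small experiment. The data-generating constants and the position of the added outlier are hypothetical and are not taken from the cited works.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
n = 50
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)  # illustrative positive relation

r_clean, _ = stats.pearsonr(x, y)
rho_clean, _ = stats.spearmanr(x, y)

# Append one hypothetical gross outlier that contradicts the positive trend.
x_out = np.append(x, 10.0)
y_out = np.append(y, -10.0)

r_out, _ = stats.pearsonr(x_out, y_out)
rho_out, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson : {r_clean:+.3f} -> {r_out:+.3f}")
print(f"Spearman: {rho_clean:+.3f} -> {rho_out:+.3f}")
```

In this setup a single extreme point moves Pearson’s r far more than Spearman’s rho, because the outlier contributes only one extreme rank rather than a large squared deviation.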

2.5 Gaps in the Literature

Although much research has compared Pearson and Spearman correlations, fewer studies
have examined their behavior under simulated conditions with systematically varied data
properties. There is also limited work integrating both theoretical derivations and empirical
simulations in a unified framework, which this study aims to address (Mukaka, 2012).

CHAPTER THREE

METHODOLOGY

3.1 Research Design

This study adopts a quantitative simulation-based research design to explore the
mathematical and empirical relationships between Pearson’s and Spearman’s correlation
coefficients. Synthetic datasets will be generated under controlled conditions (e.g., normal,
skewed, monotonic, and nonlinear) to assess the statistical behavior of each coefficient.

3.2 Population and Sample

The population in this simulation study consists of theoretical data structures designed to
represent various correlation patterns. Samples of varying sizes (n = 30, 100, 300, 1000)
will be drawn randomly from synthetic populations using NumPy-based random number
generators. Each synthetic dataset will undergo correlation analysis using both Pearson’s
and Spearman’s techniques.
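
As a rough sketch of how samples of these sizes might be drawn, the following code uses NumPy’s `default_rng` generator. The population correlation of 0.6, the number of replications, and the seed are assumptions made for illustration; the study does not state them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)       # arbitrary seed
true_r = 0.6                          # assumed population correlation
cov = [[1.0, true_r], [true_r, 1.0]]  # covariance of a standard bivariate normal
sample_sizes = [30, 100, 300, 1000]
n_reps = 200                          # assumed number of replications per size

spread = {}
for n in sample_sizes:
    r_values = []
    for _ in range(n_reps):
        xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        r, _ = stats.pearsonr(xy[:, 0], xy[:, 1])
        r_values.append(r)
    # Standard deviation of r across replications: the sampling variability.
    spread[n] = float(np.std(r_values))
    print(n, round(float(np.mean(r_values)), 3), round(spread[n], 3))
```

Repeating the draw at each sample size shows how the sampling variability of the estimated coefficient shrinks as n grows.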

3.3 Sampling Techniques

Stratified random sampling is used to ensure balanced representation of correlation
patterns. The number of observations in each stratum, nₕ, is computed using the
proportional allocation formula:

nₕ = (Nₕ / N) × n

Where:
nₕ = sample size for stratum h
Nₕ = population size of stratum h
N = total population size
n = total sample size
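
A short worked example of the proportional allocation formula follows. The stratum names and population sizes are hypothetical and serve only to show the arithmetic.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Return n_h = (N_h / N) * n for every stratum h."""
    total_population = sum(stratum_sizes.values())  # N
    return {
        name: total_sample * size / total_population
        for name, size in stratum_sizes.items()
    }

# Hypothetical strata (N_h), one per correlation pattern.
strata = {"linear": 500, "monotonic": 300, "non_monotonic": 200}
allocation = proportional_allocation(strata, total_sample=100)
print(allocation)
# {'linear': 50.0, 'monotonic': 30.0, 'non_monotonic': 20.0}
```

When the formula yields non-integer values, the allocations must be rounded so that they still sum to n.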

3.4 Method of Data Collection

Data were generated from known distributions: standard normal (μ = 0, σ = 1) and uniform.
Non-linear monotonic transformations (e.g., exponential and logarithmic) were applied to
produce non-linear relationships. The simulation was implemented in Python using the
NumPy and SciPy libraries.
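
The effect of a monotonic transformation can be sketched as follows. The noise scale, the factor 2 inside the exponential, and the seed are illustrative assumptions, not the study’s actual settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # arbitrary seed
x = rng.normal(loc=0.0, scale=1.0, size=300)
y = x + rng.normal(scale=0.3, size=300)  # approximately linear in x

# A strictly increasing transformation preserves the ordering of y.
y_exp = np.exp(2.0 * y)

r_lin, _ = stats.pearsonr(x, y)
rho_lin, _ = stats.spearmanr(x, y)
r_exp, _ = stats.pearsonr(x, y_exp)
rho_exp, _ = stats.spearmanr(x, y_exp)

print(f"linear     : r={r_lin:.3f}  rho={rho_lin:.3f}")
print(f"exponential: r={r_exp:.3f}  rho={rho_exp:.3f}")
```

Since the transformation leaves the ranks unchanged, Spearman’s rho is identical before and after it, while Pearson’s r drops because the relationship is no longer linear.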

3.5 Mathematical Formulation

Pearson’s correlation coefficient (r) is given by:

r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² × Σ(yᵢ - ȳ)²]

Where x̄ and ȳ are the means of x and y respectively. It measures linear dependence.
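
The formula for r can be translated term by term into NumPy, as in the minimal sketch below. The function name `pearson_r` and the five data points are illustrative and do not come from the study.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed directly from the deviation-from-mean formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # x_i - x̄
    dy = y - y.mean()  # y_i - ȳ
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))                        # 0.7746, i.e. 6 / sqrt(10 * 6)
print(round(np.corrcoef(x, y)[0, 1], 4))  # same value from NumPy
```

Here the numerator is 6 and the two sums of squares are 10 and 6, so r = 6 / √60 ≈ 0.7746.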

Spearman’s rank correlation coefficient (ρ) is given by:

ρ = 1 - [ (6 × Σdᵢ²) / (n(n² - 1)) ]

Where dᵢ is the difference between the ranks of the ith observation on x and y, and n is the
number of observations. This formula is exact when there are no tied ranks; with ties, ρ is
obtained by applying Pearson’s formula to the average ranks.
This coefficient assesses monotonic relationships between ranked variables.
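
A minimal sketch of the rank-difference formula for ρ, valid only when the data contain no ties. The function name `spearman_rho_no_ties` and the example values are illustrative.

```python
import numpy as np
from scipy import stats

def spearman_rho_no_ties(x, y):
    """Spearman's rho from the rank-difference formula (assumes no tied values)."""
    d = stats.rankdata(x) - stats.rankdata(y)  # d_i: difference in ranks
    n = len(d)
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

x = [10, 20, 30, 40, 50]  # ranks 1, 2, 3, 4, 5
y = [1, 3, 2, 5, 4]       # ranks 1, 3, 2, 5, 4
rho = spearman_rho_no_ties(x, y)
rho_scipy, _ = stats.spearmanr(x, y)
print(rho, rho_scipy)
```

The rank differences are 0, -1, 1, -1, 1, so Σdᵢ² = 4 and ρ = 1 - 24/120 = 0.8, which matches the value returned by `scipy.stats.spearmanr`.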

3.6 Tools for Data Analysis

Data simulation and analysis were conducted using Python libraries such as NumPy,
pandas, matplotlib, and SciPy. These tools enabled efficient generation of random data and
accurate computation of correlation coefficients.

CHAPTER FOUR

DATA ANALYSIS AND RESULTS

4.1 Introduction

This chapter presents the results of the simulation and statistical analysis conducted to
explore the relationship between Pearson’s and Spearman’s correlation coefficients. The
synthetic datasets were analyzed under conditions of linear, monotonic, and nonlinear
associations. Both correlation coefficients were computed and compared across multiple
sample sizes and distributions.

4.2 Simulated Data Scenarios

Three types of data scenarios were generated to examine how Pearson and Spearman
respond to various relationships:
1. Linear relationship with normally distributed data
2. Non-linear monotonic relationship (exponential)
3. Non-monotonic relationship (sinusoidal)
Each scenario was repeated at sample sizes of n = 30, 100, and 300.
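
The three scenarios could be simulated along the following lines. This is a sketch under stated assumptions: the slopes, noise levels, the range of x in the sinusoidal case, and the seed are all illustrative choices, not the settings used to produce the study’s results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)  # arbitrary seed

def simulate(kind, n):
    """Return (x, y) for one scenario; all constants are illustrative."""
    if kind == "linear":
        x = rng.normal(size=n)
        y = 2.0 * x + rng.normal(size=n)
    elif kind == "exponential":
        x = rng.normal(size=n)
        y = np.exp(2.0 * x + rng.normal(scale=0.3, size=n))
    elif kind == "sinusoidal":
        x = rng.uniform(0.0, 4.0 * np.pi, size=n)  # two full periods
        y = np.sin(x) + rng.normal(scale=0.1, size=n)
    else:
        raise ValueError(f"unknown scenario: {kind}")
    return x, y

results = {}
for kind in ("linear", "exponential", "sinusoidal"):
    for n in (30, 100, 300):
        x, y = simulate(kind, n)
        r, _ = stats.pearsonr(x, y)
        rho, _ = stats.spearmanr(x, y)
        results[(kind, n)] = (r, rho)
        print(f"{kind:12s} n={n:4d}  r={r:+.3f}  rho={rho:+.3f}")
```

Under these assumptions the two coefficients are close in the linear scenario and Spearman’s rho exceeds Pearson’s r in the exponential one. In the sinusoidal scenario the relationship is not monotonic, so neither coefficient summarizes it well.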

4.3 Comparative Results of Pearson vs. Spearman

From the table above, Pearson’s and Spearman’s coefficients yield very similar values in
the linear-normal datasets, indicating that both metrics perform well under ideal assumptions.
However, for exponential and sinusoidal datasets, Spearman’s coefficient often remains
higher, showing better robustness to non-linearities.

4.4 Graphical Representations

Graphs for each data scenario illustrate the observed relationships. Scatter plots for linear
data show a tightly clustered linear pattern, whereas exponential and sinusoidal data show
patterns where Pearson's measure underestimates the strength of association compared to
Spearman’s.

CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATIONS

5.1 Summary of Findings

This study investigated the relationship between Pearson’s and Spearman’s correlation
coefficients both mathematically and through simulation. Key findings are:
- Pearson’s coefficient is highly effective when the data follows a linear and normally
distributed pattern.
- Spearman’s coefficient is more robust to outliers and better captures non-linear
monotonic relationships.
- In simulated datasets with exponential and sinusoidal structures, Spearman's coefficient
demonstrated greater consistency than Pearson's.
- Both coefficients showed similar values under ideal conditions, but diverged significantly
in complex data distributions.

5.2 Conclusion

Pearson and Spearman correlation coefficients, while related, serve different analytical
purposes depending on the nature of the data. Pearson’s method is optimal for linear
associations under strict assumptions, whereas Spearman’s rank-based method provides a
more flexible tool for analyzing ordinal or non-normally distributed data. The study
confirms that the choice of correlation metric should be informed by the underlying data
structure and research objective.

5.3 Recommendations
