*********************************************************************************
SHEPHERD'S PI CORRELATION (Version 2012-10-31)
*********************************************************************************
DISCLAIMER: We take no responsibility for any damage or data loss this software
may cause (not that it should). You may reuse and modify these scripts for your
own purposes but please cite or acknowledge us if you do so. While this software
is not officially supported by us, if you have any questions or suggestions,
please contact Sam Schwarzkopf ([email protected]).
These functions require the Statistics Toolbox from MATLAB.
*********************************************************************************
31 Oct 2012: Modified the corrci.m function so that it can also estimate the
confidence intervals for Spearman's rho.
19 Oct 2012: Corrected a typo in the help section of corrci.m. This does not
affect the calculation or the output of the function.
11 Oct 2012: Corrected a minor bug in ScatterOutliers.m that prevented the title
             of the graph from being displayed (thanks to Carl Gaspar for pointing
             out this error). Added an explanation of why we use bsmahal.m instead
             of the MATLAB bootstrp function. This does not affect the results.
09 Aug 2012: Added the corrci.m function. You can use this for calculating
             the nominal 95% confidence interval and for estimating the actual
             confidence interval based on your data using a bootstrapping approach.
08 Aug 2012: Added SetupRand.m script to initialise the random number generator
             for bsmahal.m. This is necessary to prevent the bootstrapping from
             producing the same results every time you start MATLAB. You should
             run this initialisation before running bootstrapping, simulations,
             or anything else relying on randomization.
*********************************************************************************
Shepherd's pi is a robust test of statistical association between two
variables. It can be used in lieu of Pearson's r or other tests (such as
Spearman's rho or Kendall's tau) as these tests can be susceptible to the
presence of influential outliers. It detects outliers by first bootstrapping
the Mahalanobis distance of each data point from the bivariate mean and then
excluding all observations whose distance is >=6. Shepherd's pi is simply
Spearman's rho after outlier removal. The p-value is doubled because the
removal of outliers can inflate false positive rates. See also:
Schwarzkopf DS, de Haas B & Rees G (2012). Front. Hum. Neurosci.
(http://www.frontiersin.org/Human_Neuroscience/10.3389/fnhum.2012.00200/full)
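The following minimal sketch illustrates these steps conceptually. It is NOT the
toolbox implementation (Shepherd.m uses bootstrapped distances via bsmahal.m);
here the raw squared Mahalanobis distance from MATLAB's mahal function stands in,
just to make the procedure concrete. Given two column vectors x and y:

    X = [x(:) y(:)];                 % bivariate sample
    d = mahal(X, X);                 % squared Mahalanobis distance of each point
    keep = d < 6;                    % exclude observations with distance >= 6
    [pi_val, p] = corr(x(keep), y(keep), 'type', 'Spearman');
    p = min(2*p, 1);                 % double the p-value (capped at 1)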
*********************************************************************************
This archive contains the following MATLAB functions and scripts. All of them
contain usage information for the MATLAB help feature:
SetupRand.m: Initializes the state of the random number generator.
YOU SHOULD RUN THIS BEFORE ANYTHING ELSE!!!
Shepherd.m: The actual Shepherd's pi correlation test.
ScatterOutliers.m: A function to make scatter plots denoting the points
that were removed as outliers and including a contour
plot of Mahalanobis distances.
bsmahal.m:       The function for bootstrapping Mahalanobis distances.
                 (You could also use the MATLAB bootstrp function here,
                 but that is buggy for large matrices so it is no good
                 for calculating the contour plot in ScatterOutliers.m.)
round_decs.m:    Simple function for rounding a value to a given number of
                 decimal places.
corrci.m: Function for calculating the nominal 95% confidence interval
and for estimating a bootstrapped interval for a correlation.
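A typical session might string these together as in the sketch below. This is
only a hedged illustration: the ScatterOutliers call follows the form used later
in this file, but the output arguments of Shepherd.m and the inputs of corrci.m
are assumptions, so check 'help Shepherd' and 'help corrci' for their actual
interfaces. Given two data vectors x and y:

    SetupRand;                      % initialise the random number generator first
    [pi_val, p] = Shepherd(x, y);   % Shepherd's pi (assumed output order)
    ScatterOutliers(x, y, 1000);    % scatter plot marking the removed outliers
    ci = corrci(x, y);              % confidence interval (assumed input arguments)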
*********************************************************************************
There is also an example data set containing the following variables:
x: Data drawn from a normal distribution
y0: Data uncorrelated with x
y1: Data weakly correlated with x
y2: Data correlated with x
y3: Data strongly correlated with x
xo1,yo1: Two uncorrelated variables, which appear correlated under
Pearson's r due to the presence of a single outlier
xo3,yo3: Two uncorrelated variables, which appear correlated under
Pearson's r due to the presence of three bivariate outliers
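For instance, a hedged sketch for comparing Pearson's r with Shepherd's pi on the
single-outlier pair (the .mat file name and the Shepherd.m output order below are
assumptions; load whichever data file actually ships with the archive):

    load ExampleData                     % assumed file name for the example data
    r = corr(xo1(:), yo1(:));            % Pearson's r, inflated by the outlier
    [pi_val, p] = Shepherd(xo1, yo1);    % Shepherd's pi (assumed outputs)
    fprintf('Pearson r = %.2f, Shepherd pi = %.2f (p = %.3f)\n', r, pi_val, p);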
*********************************************************************************
Bootstrapping the Mahalanobis distance is more robust than simply using
the raw Mahalanobis distance for outlier detection. This can be illustrated
by comparing ScatterOutliers(x,y0,1000) with ScatterOutliers(x,y0,0).
The latter will use the raw Mahalanobis distance (because the number of
bootstraps is 0). As you can see, the bootstrapped distances are larger,
because they are less skewed by the outliers in the sample.
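In code, the comparison from the paragraph above is simply:

    ScatterOutliers(x, y0, 1000);   % bootstrapped Mahalanobis distances
    ScatterOutliers(x, y0, 0);      % raw Mahalanobis distances (0 bootstraps)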
*********************************************************************************