Statistical classification

This article introduces the classification problem in machine learning and statistics: how to use a training set of data whose category membership is known to predict the category of new observations. It covers the basic concepts of classification, such as supervised learning, unsupervised learning and their relationship to clustering, and explains several techniques for carrying out classification tasks.


Link: https://en.wikipedia.org/wiki/Statistical_classification

>>For the unsupervised learning approach, see Cluster analysis<<



In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs (for example, a sub-population we might call "sheep"), on the basis of a training set of data containing observations (or instances) whose category membership is known.


>>Classification is an example of pattern recognition<<.


In the terminology of machine learning,[1] >>classification is considered an instance of supervised learning<<, i.e. learning where a training set of correctly identified observations is available. The corresponding >>unsupervised procedure is known as clustering<<, and >>involves grouping data into categories based on some measure of inherent similarity or distance<<.


Often, >>the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features<<. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure) (these properties should be quantifiable). >>Other classifiers work by comparing observations to previous observations by means of a similarity or distance function<<.
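
As a minimal sketch of how such mixed property types become quantifiable inputs (Python with NumPy; the observation and its encodings below are invented for illustration):

    import numpy as np

    # Hypothetical observation: blood type (categorical), size (ordinal),
    # word count (integer-valued), blood pressure (real-valued).
    blood_type = "AB"
    size = "medium"
    word_count = 7
    blood_pressure = 118.5

    blood_types = ["A", "B", "AB", "O"]
    one_hot = [1.0 if blood_type == t else 0.0 for t in blood_types]  # categorical -> one-hot
    size_rank = {"small": 0, "medium": 1, "large": 2}[size]           # ordinal -> ordered code

    # The final quantifiable feature vector a classifier would consume.
    x = np.array(one_hot + [size_rank, word_count, blood_pressure])
    print(x)  # 7 numeric features: [0. 0. 1. 0. 1. 7. 118.5]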


An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, >>implemented by a classification algorithm, that maps input data to a category (i.e. something like y = f(x))<<.
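
In code terms, a classifier is exactly such a mapping; a hypothetical Python signature with a made-up decision rule, just to make y = f(x) concrete:

    from typing import Sequence

    def classify(features: Sequence[float]) -> str:
        """A classifier as a function y = f(x): feature vector in, category out."""
        # Hypothetical decision rule, purely for illustration.
        return "positive" if sum(features) > 0 else "negative"

    print(classify([0.5, -0.2, 1.1]))  # positive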


Terminology across fields is quite varied. In statistics, where classification is often done with logistic regression or a similar procedure, the properties of observations are termed explanatory variables (or >>independent variables<<, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the >>dependent variable<< (i.e. a relation is constructed between input and output).


Classification and clustering are examples of the more general problem of pattern recognition, which is the >>assignment of some sort of output value to a given input value<< (again, a mapping from input to output).
Another example is >>regression, which assigns a real-valued output to each input<<.


A common subclass of classification is >>probabilistic classification<<. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, >>probabilistic algorithms output a probability of the instance being a member of each of the possible classes<<. The best class is normally then selected as the one with the highest probability. However, such an algorithm has numerous advantages over non-probabilistic classifiers:
1. It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a confidence-weighted classifier).
2. Correspondingly, it can abstain when its confidence of choosing any particular output is too low (see the sketch after this list).
3. Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
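
A minimal sketch of points 1 and 2 (plain NumPy; the class names, probabilities and threshold are invented): the classifier reports a confidence alongside its choice and abstains when no class is probable enough.

    import numpy as np

    def predict_with_abstention(class_probs, classes, threshold=0.7):
        """Return (best class, confidence), or (None, confidence) to abstain."""
        best = int(np.argmax(class_probs))
        confidence = float(class_probs[best])
        if confidence < threshold:
            return None, confidence  # no class is probable enough: abstain
        return classes[best], confidence

    # Illustrative per-class probabilities for one instance (they sum to 1).
    probs = np.array([0.15, 0.55, 0.30])
    print(predict_with_abstention(probs, ["spam", "ham", "promo"]))  # (None, 0.55)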


Mahalanobis distance:
The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936.[1] It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero if P is at the mean of D, and grows as P moves away from the mean: along each principal component axis, it measures the number of standard deviations from P to the mean of D. If each of these axes is rescaled to have unit variance, then Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
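
A minimal NumPy sketch under these definitions (the sample data is randomly generated for illustration): the distance is computed through the inverse covariance of the data set, so it is zero at the mean and grows as the point moves away.

    import numpy as np

    def mahalanobis(x, data):
        """Distance from point x to the distribution of the rows of `data`."""
        mu = data.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
        diff = x - mu
        return float(np.sqrt(diff @ cov_inv @ diff))

    rng = np.random.default_rng(0)
    # Correlated 2-D sample: mixing independent normals introduces correlation.
    D = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
    print(mahalanobis(D.mean(axis=0), D))                         # 0.0 at the mean
    print(mahalanobis(D.mean(axis=0) + np.array([3.0, 1.5]), D))  # grows away from it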


Bayesian procedures:
Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the sub-populations associated with the different groups within the overall population.[7] Bayesian procedures tend to be computationally expensive and, in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian clustering rules were devised.[8]
Some Bayesian procedures involve the calculation of group membership probabilities: these can be viewed as providing a more informative outcome of a data analysis than a simple attribution of a single group-label to each new observation.
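
A minimal sketch of this idea for two groups with one-dimensional Gaussian likelihoods (all numbers invented): the priors encode the relative sub-population sizes, and the result is a membership probability for each group rather than a single label.

    import numpy as np

    def gaussian_pdf(x, mean, std):
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

    # Two groups; the priors encode their relative sizes in the population.
    priors = np.array([0.9, 0.1])                # group A is nine times larger
    means = np.array([0.0, 3.0])
    stds = np.array([1.0, 1.0])

    x_new = 2.0                                  # a new observation
    likelihoods = gaussian_pdf(x_new, means, stds)
    posterior = priors * likelihoods
    posterior /= posterior.sum()                 # group membership probabilities
    print(posterior)  # ~[0.67, 0.33]: the prior pulls the answer toward group A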


Binary and multiclass classification:
Classification can be thought of as two separate problems – binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.[9] Since many classification methods have been developed specifically for binary classification, >>multiclass classification often requires the combined use of multiple binary classifiers<<.
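
One common combination scheme is one-vs-rest: fit one binary classifier per class and choose the class whose classifier is most confident. A minimal sketch, assuming scikit-learn is available (the iris data set is just a convenient stand-in):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X, y = load_iris(return_X_y=True)  # 3 classes, 4 real-valued features

    # One binary classifier is fit per class ("this class vs. the rest");
    # prediction picks the class whose binary classifier is most confident.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
    print(clf.predict(X[:3]))          # predicted class labels
    print(len(clf.estimators_))        # 3 underlying binary classifiers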


Application domains:
Computer vision:
1.Medical imaging and medical image analysis
2.Optical character recognition
3.Video tracking
Speech recognition
Handwriting recognition
Document classification
Internet search engines
Biometric identification
Biological classification
Statistical natural language processing
Credit scoring


See also:
Artificial intelligence
Binary classification
Class membership probabilities
Classification rule
Compound term processing
Data mining
Data warehouse
Fuzzy logic
Information retrieval
List of datasets for machine learning research
Machine learning
Recommender system

