
Data Mining in Medicine:
Selected Techniques and Applications

Author
Adrian Giurca
[email protected]
Copyright, 2002 © webAI Group, www.datamining.ro
Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
The nature of Medical Data
The rapid, global growth of data requires standards in terminology, vocabularies and formats to support data sharing, standards for interfaces between different sources of data and for the integration of heterogeneous data (including images), and standards in the design of electronic patient records.
The nature of Medical Data
Many environments still lack such
standards, which hinders the use of data
analysis tools on large global databases,
limiting their applications to datasets
collected for specific diagnostic,
screening, prognostic, monitoring, therapy
support or other patient management
purposes.
The nature of Medical Data
Patient records collected for diagnosis and
prognosis typically encompass values of
anamnestic, clinical and laboratory
parameters, as well as results of particular
investigations, specific to the given task.
The nature of Medical Data
Such datasets are characterized by:
 incompleteness (missing parameter values),
 incorrectness (systematic or random noise in the data),
 sparseness (few and/or non-representative patient records available),
 inexactness (inappropriate selection of parameters for the given task).
The nature of Medical Data
Datasets collected in monitoring (either acute monitoring of a particular patient in an intensive care unit, or discrete monitoring over long periods of time in the case of patients with chronic diseases) have additional characteristics: they involve measurements of a set of parameters at different times, requiring the temporal component to be taken into account in data analysis.
Selected Medical Data Mining
Techniques
Current trends in medical decision making
show awareness of the need to introduce
formal reasoning, as well as intelligent
data analysis techniques in the extraction
of knowledge, regularities, trends and
representative cases from patient data
stored in medical records.
Selected Medical Data Mining
Techniques
Formal techniques include:
 decision theory
 symbolic reasoning technology
 methods at their intersection, such as
probabilistic belief networks
Selected Medical Data Mining
Techniques
Intelligent data analysis techniques include:
 machine learning
 clustering
 data visualization
 interpretation of time-ordered data (derivation and revision of temporal trends and other forms of temporal data abstraction)
Selected Medical Data Mining
Techniques
Machine learning methods can be classified into three major groups:
 inductive learning of symbolic rules (such as induction of rules, decision trees and logic programs)
 statistical or pattern-recognition methods (such as k-nearest neighbors or instance-based learning, discriminant analysis and Bayesian classifiers)
 artificial neural networks (such as networks with backpropagation learning, Kohonen's self-organizing map and Hopfield's associative memory)
Selected Medical Data Mining
Techniques
Machine learning methods have been applied to
a variety of medical domains in order to improve
medical decision making.
These include diagnostic and prognostic
problems in: oncology, liver pathology,
neuropsychology, gynaecology.
Improved medical diagnosis and prognosis may
be achieved through automatic analysis of
patient data stored in medical records i.e. by
learning from past experiences.
Selected Medical Data Mining
Techniques
Given patient records with corresponding diagnoses, machine learning methods are able to diagnose new cases. More specifically, suppose E is a set of examples with known classifications.
An example is described by the values of a fixed collection of features (attributes) A_i, i = 1, ..., N_at.
Each attribute can either have a finite set of values (discrete) or take real numbers as values (continuous).
An individual example e_j, j = 1, ..., N_ex, is an n-tuple of values v_ik of the attributes A_i. Each example is assigned one of the N_cl possible values of the class variable C (classifications): c_i, i = 1, ..., N_cl.
Selected Medical Data Mining
Techniques
For example, in the domain of early diagnosis of rheumatic diseases, the patient record comprises 16 anamnestic attributes. Some of these are continuous (age, duration of morning stiffness) and some are discrete (e.g. joint pain, which can be arthrotic, arthritic, or not present at all). There are eight possible diagnoses, including:
– degenerative spine diseases
– inflammatory spine diseases
– other inflammatory diseases
– extraarticular rheumatism
– crystal-induced synovitis
– non-specific rheumatic manifestations
– non-rheumatic diseases
Selected Medical Data Mining
Techniques
To classify (diagnose) new cases, machine learning methods can take different approaches; a minimal sketch of each follows this list.
– They can construct explicit symbolic rules that generalize the training cases (rule induction and decision tree induction). The induced rules or decision trees can then be used to classify new cases.
– They can store (some of) the training cases for reference (instance-based learning). New cases can then be classified by comparing them to the reference cases.
– They can compute, for a given case to be classified, the conditional probability of each class according to the Bayes formula and assign the most probable class to the case.
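A minimal sketch of the three approaches using scikit-learn; the toy dataset, feature encoding and diagnostic labels below are invented for illustration only, not taken from a real patient record system:

```python
# One stand-in per approach: symbolic rules (decision tree), instance-based
# learning (k-nearest neighbors), and Bayesian classification.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Each row: (age, morning_stiffness_minutes, joint_pain_code); invented.
X = [[25, 10, 0], [60, 90, 2], [45, 30, 1], [70, 120, 2], [30, 5, 0]]
y = ["non-rheumatic", "inflammatory", "degenerative", "inflammatory",
     "non-rheumatic"]

# 1. Explicit symbolic rules: induce a decision tree and print its rules.
tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["age", "stiffness", "joint_pain"]))

# 2. Instance-based learning: store the training cases, classify by proximity.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# 3. Bayesian classification: assign the most probable class.
nb = GaussianNB().fit(X, y)

new_case = [[55, 60, 2]]
for model in (tree, knn, nb):
    print(type(model).__name__, "->", model.predict(new_case)[0])
```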
How does data mining work?
 While large-scale information technology has been evolving separate transaction
and analytical systems, data mining provides the link between the two. Data mining
software analyzes relationships and patterns in stored transaction data based on
open-ended user queries. Several types of analytical software are available:
statistical, machine learning, and neural networks. Generally, any of four types of
relationships are sought:
 Classes: Stored data is used to locate data in predetermined groups. For example, a
restaurant chain could mine customer purchase data to determine when customers visit
and what they typically order. This information could be used to increase traffic by
having daily specials.
 Clusters: Data items are grouped according to logical relationships or consumer
preferences. For example, data can be mined to identify market segments or consumer
affinities.
 Associations: Data can be mined to identify associations. The beer-diaper example is a classic example of associative mining (a small sketch of the measures involved follows this list).
 Sequential patterns: Data is mined to anticipate behavior patterns and trends. For
example, an outdoor equipment retailer could predict the likelihood of a backpack being
purchased based on a consumer's purchase of sleeping bags and hiking shoes.
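To make the Associations item concrete, the two standard association-mining measures, support and confidence, can be computed directly; the transactions below are invented:

```python
# Support and confidence for the rule
# {sleeping bag, hiking shoes} -> {backpack} over invented transactions.
transactions = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes"},
    {"sleeping bag", "hiking shoes", "backpack"},
    {"tent", "backpack"},
]

antecedent = {"sleeping bag", "hiking shoes"}
consequent = {"backpack"}

n = len(transactions)
support_a = sum(antecedent <= t for t in transactions) / n
support_both = sum(antecedent | consequent <= t for t in transactions) / n
confidence = support_both / support_a

print(f"support = {support_both:.2f}")      # 0.50
print(f"confidence = {confidence:.2f}")     # 0.67: 2/3 of such buyers also buy a backpack
```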
Five major elements:
 Extract, transform, and load transaction data onto the data warehouse system.
 Store and manage the data in a multidimensional database system.
 Provide data access to business analysts and information technology professionals.
 Analyze the data by application software.
 Present the data in a useful format, such as a graph or table.
Different levels of analysis
 Artificial neural networks: Non-linear predictive models that learn through training and resemble
biological neural networks in structure.
 Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation,
and natural selection in a design based on the concepts of natural evolution.
 Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits (a sketch of this contrast follows the list). CART typically requires less data preparation than CHAID.
 Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
 Rule induction: The extraction of useful if-then rules from data based on statistical significance.
 Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
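A brief sketch of the CART behavior named above, using scikit-learn's DecisionTreeClassifier (which implements a CART-style algorithm); the data is invented, and CHAID has no scikit-learn implementation, so only the 2-way-split side is shown:

```python
# CART-style 2-way splits on invented data: every internal node tests a
# single threshold, so the printed tree is a cascade of binary splits.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[18], [22], [35], [41], [52], [63]]      # e.g. customer age
y = ["no", "no", "yes", "yes", "yes", "no"]   # responded to the offer?

cart = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(cart, feature_names=["age"]))
# Each "age <= t" test is one 2-way split; CHAID would instead use
# chi-square tests to create multi-way splits at each node.
```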
What technological
infrastructure is required?
 Today, data mining applications are available on systems of all sizes, for mainframe, client/server, and PC platforms. System prices range from several thousand dollars for the smallest applications up to $1 million a terabyte for the largest. Enterprise-wide applications generally range in size from 10 gigabytes to over 11 terabytes. There are two critical technological drivers:
 Size of the database: the more data being processed and maintained, the more powerful the
system required.
 Query complexity: the more complex the queries and the greater the number of queries
being processed, the more powerful the system required.
 Relational database storage and management technology is adequate for many
data mining applications less than 50 gigabytes. However, this infrastructure
needs to be significantly enhanced to support larger applications. Some vendors
have added extensive indexing capabilities to improve query performance.
Others use new hardware architectures such as Massively Parallel Processors
(MPP) to achieve order-of-magnitude improvements in query time. For example,
MPP systems from NCR link hundreds of high-speed Pentium processors to
achieve performance levels exceeding those of the largest supercomputers.
Software Design: Algorithms
Decision Tree (I)
The Decision Tree exploration engine helps solve the task of classifying cases into multiple categories. Decision Tree is the fastest algorithm when dealing with large numbers of attributes. The Decision Tree report provides an easily interpreted decision tree diagram and a predicted versus real table.
Problems to Solve:
– Classification of cases into multiple categories
Target Attributes:
– Categorical or Boolean (Yes/No) attribute
Output Format:
– Classification statistics
– Predicted versus Real table (confusion matrix; sketched after this slide)
– Decision Tree diagram
Optimal Number of Records:
– Minimum of 100 records
– Maximum of 5,000,000 records
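A hedged sketch of these outputs using scikit-learn stand-ins (the engine itself is not publicly available); a bundled medical dataset plays the role of the cases:

```python
# Classification statistics and the predicted versus real table (confusion
# matrix) for a decision tree, approximated with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

print(classification_report(y_test, y_pred))  # classification statistics
print(confusion_matrix(y_test, y_pred))       # predicted versus real table
```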
Decision Tree (II)
Preprocessing Suggested: Summary Statistics - to deselect attributes that contain too many values to provide any useful insight to the exploration engine.

Underlying Algorithms: Information Gain splitting criteria, Shannon information theory and statistical significance tests.

The Data Used: Decision Tree works on data of any type. The DT algorithm is well-poised for analyzing very large databases because it does not require loading all the data into machine main memory simultaneously. The software takes full advantage of this feature by implementing incremental DT learning with the help of the OLE DB for Data Mining mechanism. The DT algorithm's calculation time scales very well (grows only linearly) with an increasing number of data columns. At the same time, it grows more than linearly with the growing number of data records - as N*log(N), where N is the number of records.

Problems to Solve: The Decision Tree algorithm helps solve the task of classifying cases into multiple categories. In many cases, this is the fastest, as well as the most easily interpreted, machine learning algorithm. The DT algorithm provides intuitive rules for solving a great variety of classification tasks, ranging from predicting buyers/non-buyers in database marketing, to automatically diagnosing patients in medicine, to determining customer attrition causes in banking and insurance.

Target Attribute: The target attribute of a Decision Tree exploration must be of a Boolean (yes/no) or categorical data type.

When to Use This Algorithm: The Decision Tree exploration engine is used for tasks such as classifying records or predicting outcomes. You should use decision trees when your goal is to assign your records to a few broad categories. Decision Trees provide easily understood rules that can help you identify the best fields for further exploration.

The Output: The Decision Tree report starts off by giving measures resulting from the decision tree. These measures are the number of non-terminal nodes, the number of leaves, and the depth of the constructed tree. Next, the report provides classification statistics on the decision tree. After these measures, the predicted versus real table is shown.
Cluster Analysis
The Cluster engine is used for the automated detection of clusters of records that lie close to each other, in a certain sense, in the space of all variables. Such clusters may represent different situations or target groups, which one might find beneficial to study separately. The Cluster engine places records corresponding to different clusters in separate datasets for further analysis. Cluster analysis proves to be useful for applications ranging from database marketing to quality control.
The use of all attributes makes the Cluster algorithm very useful for beginning data mining - it is an undirected method, and does not require the selection of a target attribute.
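A minimal sketch of this undirected use, with k-means as a stand-in clustering algorithm; the measurements below are invented:

```python
# Clustering is undirected: only the attributes are used (no target), and
# each record is assigned to a cluster for separate follow-up analysis.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

records = [[63, 140, 1.2], [25, 118, 0.7], [61, 145, 1.3],
           [30, 121, 0.8], [58, 150, 1.1]]   # invented patient measurements

X = StandardScaler().fit_transform(records)  # put attributes on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Place records from different clusters in separate datasets, as the
# Cluster engine does.
clusters = {}
for record, label in zip(records, labels):
    clusters.setdefault(label, []).append(record)
print(clusters)
```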
Fuzzy Logic Classification
The algorithm is used for assigning cases to different classes. As output, this exploration engine not only produces a prediction of which class the case belongs to, but also provides the symbolic classification rule generalized automatically from the training examples. The classifier engine furnishes simpler and more reliable results than systems based on a pure decision-tree approach. The prediction accuracy obtained for the testing cases is comparable to the accuracy obtained for the training cases. And again, the statistical significance of the generalized rule is determined rigorously by the classifier engine. Note that the classifier engine can utilize either the SKAT, MLR, or neural network prediction method as its driving mechanism.
Linear Regression
The Stepwise Linear Regression algorithm is, to our knowledge, the only system capable of including categorical variables, in addition to numerical and logical variables, in the regression analysis.
MLR discovers linear relations in data, automatically selecting only those independent variables which influence the target variable most. It also pinpoints redundant, mutually correlated independent variables, and includes only their minimal subset in the results.
Linear Regression is based on a very quick and robust calculation algorithm. As with all other exploration engines, a rigorous determination of the significance of the obtained results is performed for each model considered. MLR is the fastest exploration engine and thus can be used as a complementary preprocessing module for the SKAT exploration engine.
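Whatever one makes of the exclusivity claim, the usual way to include a categorical variable in a linear regression is to encode it as indicator columns; a minimal sketch with invented variable names and data:

```python
# One-hot encode a categorical variable, then fit an ordinary linear
# regression over the expanded columns.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [40, 55, 30, 62, 48],
                   "smoker": ["yes", "no", "no", "yes", "no"]})
y = [150, 130, 115, 160, 128]   # invented target, e.g. systolic pressure

X = pd.get_dummies(df, columns=["smoker"])  # categorical -> indicator columns
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))    # one coefficient per column
```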
Symbolic Knowledge Acquisition
Technology (SKAT)
Data mining is one of the most promising modern information technologies. The corporate world has learned to derive new value from data by utilizing various intelligent tools and algorithms designed for the automated discovery of non-trivial, useful, and previously unknown knowledge in raw data.
Which factors influence the future variation of the price of some security shares?
What characteristics of a potential customer of some service make him/her the most probable buyer?
These and numerous other business questions can be successfully addressed by data mining.
The majority of available data mining tools are based on a few well-established technologies for data analysis. Different knowledge discovery methods are best suited for different applications. Among the useful knowledge discovery tasks one can name dependency detection, numerical prediction, explicit relation modeling, and classification rules.
Despite the usefulness of traditional data mining methods in various situations, we choose to concentrate here first on the problems that plague these methods. Then we discuss the solutions to these problems, which become available with the advent of SKAT - a next-generation data mining technology. We outline the reasons, foundations, and commercial implementations of this emerging approach.
Symbolic Knowledge Acquisition
Technology (SKAT)
 Among the various tasks a data mining system is asked to perform, two questions are encountered most frequently:
– Which database fields influence the selected target field?
– Precisely how does the target field depend on the other fields in the database?
 While there are many successful methods designed to answer the first question, it is far more difficult to answer the second. Why is this? Simply put, an observation that, across a number of cases with close values of all parameters except some parameter X, the target parameter Y varies considerably, implies that Y depends on X. For multi-dimensional dependencies the issue becomes less straightforward, but the basic idea for solving the problem is the same. At the same time, the task of automated determination of an explicit form of the dependence between several variables is significantly more difficult. The solution to this problem cannot be based on similarly simple-minded considerations.
Symbolic Knowledge
Acquisition Technology (SKAT)
Traditional methods for finding the precise form of a sought
relation implement the search for an expression representing the
dependence among possible expressions from some fixed class.
This idea is exploited in many existing data mining applications.
For example, one of the most straightforward and popular
methods of search for simple numerical dependencies - linear
regression - selects a solution out of a set of linear formulae
approximating the sought dependence. Systems from another
popular class of data mining algorithms - decision trees - search
for classification rules represented as trees involving simple
equalities and inequalities in the nodes connected by Boolean
AND and OR operations.
Symbolic Knowledge
Acquisition Technology (SKAT)
However, beyond the limits of the narrow classes of dependencies that can be found by these systems, there is an endless sea of dependencies which cannot even be represented in the language used by these systems. For example, assume you are using a decision tree system to analyze data embodying the following simple rule: "the most frequent buyers of Post cereal are homemakers of age smaller than the inverse square of their family income multiplied by a certain constant". A traditional system has no means to discover such a rule. Only if one explicitly furnishes the system with the parameter "inverse square of the family income" can the stated rule be found by traditional systems (the sketch below illustrates this). In other words, one has to guess an approximate form of the solution first - and then the machine does the rest of the job efficiently. While guessing a general form of the solution prior to automated modeling might be a challenging brain twister, it certainly does not make the life of a corporate data analyst much easier.
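A sketch of this point under invented data and an invented constant C: a shallow tree on the raw fields cannot express age < C / income^2, but one split on the explicitly furnished derived parameter suffices:

```python
from sklearn.tree import DecisionTreeClassifier
import random

random.seed(0)
C = 200_000   # the "certain constant" in the rule above; value invented
ages = [random.uniform(18, 80) for _ in range(400)]
incomes = [random.uniform(20, 120) for _ in range(400)]  # family income, $1000s
labels = ["buyer" if a < C / inc ** 2 else "non-buyer"
          for a, inc in zip(ages, incomes)]

# Raw fields only: the curved boundary age = C / income^2 cannot be
# expressed by axis-parallel splits, so a shallow tree misclassifies.
raw = [[a, inc] for a, inc in zip(ages, incomes)]
tree_raw = DecisionTreeClassifier(max_depth=2).fit(raw, labels)

# Explicitly furnishing the derived parameter makes one split sufficient.
aided = [[a - C / inc ** 2] for a, inc in zip(ages, incomes)]
tree_aided = DecisionTreeClassifier(max_depth=1).fit(aided, labels)

print("raw fields, depth 2:     ", tree_raw.score(raw, labels))
print("derived feature, depth 1:", tree_aided.score(aided, labels))  # 1.0
```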
Symbolic Knowledge
Acquisition Technology (SKAT)
Case Study: Bayesian Classification
Bayesian Classification: Why?

 Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
 Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
 Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown.
Let H be a hypothesis that X belongs to class C.
For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; reflects the background knowledge).
P(X): probability that the sample data is observed.
P(X|H): probability of observing the sample X, given that the hypothesis holds.
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally, this can be written as:

    posterior = likelihood x prior / evidence

The MAP (maximum a posteriori) hypothesis is the one maximizing the posterior:

    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)

Practical difficulty: this requires initial knowledge of many probabilities, at significant computational cost.
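A short numeric check of the informal form, with invented probabilities:

```python
# posterior = likelihood x prior / evidence, with invented numbers.
p_h = 0.01          # P(H): prior probability of the hypothesis
p_x_given_h = 0.9   # P(X|H): likelihood of the observed sample under H
p_x = 0.05          # P(X): evidence, probability of observing the sample

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes theorem
print(p_h_given_x)                      # 0.18
```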
Naïve Bayesian Classifier
Each data sample X is represented as a vector (x1, x2, ..., xn).
There are m classes: C1, C2, ..., Cm.
Given an unknown data sample X, the classifier predicts that X belongs to class Ci iff

    P(Ci|X) > P(Cj|X) for all 1 <= j <= m, j != i

By the Bayes theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X).
Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class:

    P(X|Ci) = prod_{k=1}^{n} P(xk|Ci)

The probability of observing, say, two attribute values x1 and x2 together, given that the class is C, is the product of the probabilities of each value taken separately, given the same class:

    P([x1, x2] | C) = P(x1 | C) * P(x2 | C)

No dependence relation between attributes is assumed. This greatly reduces the computation cost: only the class distribution needs to be counted.
Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci).
Training dataset

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no
Naïve Bayesian Classifier:
Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|Ci):
P(X | buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X | buys_computer="yes") * P(buys_computer="yes") = 0.044 x 9/14 = 0.028
P(X | buys_computer="no") * P(buys_computer="no") = 0.019 x 5/14 = 0.007

X belongs to class "buys_computer = yes".
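The computation above can be reproduced in a few lines of plain Python over the 14-record training table shown earlier:

```python
# Naive Bayes by direct counting over the buys_computer training table.
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")   # the sample X to classify

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)            # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):            # P(X|Ci) = prod P(xk|Ci)
        likelihood *= sum(r[k] == value for r in rows) / len(rows)
    print(c, round(likelihood * prior, 3))   # yes: 0.028, no: 0.007
```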


Naïve Bayesian Classifier:
Comments
 Advantages:
– Easy to implement
– Good results obtained in most of the cases
 Disadvantages:
– Assumption: class conditional independence, therefore loss of accuracy
– Practically, dependencies exist among variables
– E.g., in hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the Naïve Bayesian Classifier.
 How to deal with these dependencies?
– Bayesian Belief Networks
Naive Bayesian Classifier:
Example II
 Given a training set, we can compute the probabilities of each attribute value for the two classes P and N:

Outlook      P    N        Humidity     P    N
sunny        2/9  3/5      high         3/9  4/5
overcast     4/9  0        normal       6/9  1/5
rain         3/9  2/5

Temperature  P    N        Windy        P    N
hot          2/9  2/5      true         3/9  3/5
mild         4/9  2/5      false        6/9  2/5
cool         3/9  1/5
Bayesian Networks
 A Bayesian belief network allows a subset of the variables to be conditionally independent
 A graphical model of causal relationships:
– Represents dependencies among the variables
– Gives a specification of the joint probability distribution
 Nodes: random variables
 Links: dependencies
 Example: if X and Y are the parents of Z, and Y is the parent of P, then there is no direct dependency between Z and P. The graph has no loops or cycles.
Bayesian Belief Network: An
Example
The network relates the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC.

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of values of its parents:

      (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC    0.8       0.5        0.7        0.1
~LC   0.2       0.5        0.3        0.9

Bayesian Belief Networks: the joint probability factorizes over the network as

    P(z1, ..., zn) = prod_{i=1}^{n} P(zi | Parents(Zi))
