
Modeling Techniques, Regression
Mathematical Modeling
● To explore how mathematical models represent real-world phenomena and
contribute to solving problems in data science.
● Mathematical modeling is a fundamental approach to representing real-world
phenomena through mathematical expressions, enabling the prediction,
analysis, and optimization of complex systems.
● In data science and machine learning, it serves as the backbone for
designing algorithms and generating insights from data.
● It transforms complex phenomena into a simplified, understandable, and
computable form.
● Models allow for prediction, optimization, and understanding of systems across various disciplines. They help identify relationships and patterns in data, enabling informed decision-making and problem-solving.
Cont.

1. Problem Formulation and Data Collection


○ Problem Formulation: Clearly define the problem or phenomenon to model.
i. Example: Predicting sales based on seasonal demand.
○ Data Collection: Gather relevant, accurate, and sufficient data to inform the model.
i. Example: Historical sales data, market trends, and economic indicators.
2. Model Selection and Assumptions
○ Model Selection: Choose the most appropriate model type for the problem.
i. Example: Linear regression for relationships, differential equations for dynamic systems.
○ Assumptions: Identify and document assumptions to simplify the model while maintaining its
relevance.
i. Example: Assuming a constant rate of growth in a population model.
Cont.

3. Model Implementation, Validation, and Evaluation


○ Implementation: Translate the mathematical structure into a computational framework or
algorithm.
i. Tools: Python, MATLAB, or R.
○ Validation: Compare model predictions with real-world data to assess its accuracy.
i. Techniques: Use test datasets or cross-validation methods.
○ Evaluation: Analyze the model's performance and refine it if necessary.
i. Metrics: Mean Squared Error (MSE), R-squared, or likelihood functions.
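A minimal sketch of the validation and evaluation step in Python, assuming scikit-learn is available; the dataset and model below are illustrative, not taken from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical dataset: 200 samples, 3 features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Validation: hold out a test set and compare predictions with unseen observations.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluation: quantify performance with MSE and R-squared.
print("MSE:", mean_squared_error(y_test, pred))
print("R-squared:", r2_score(y_test, pred))

# Cross-validation: average score over 5 folds of the data.
print("5-fold R-squared:", cross_val_score(LinearRegression(), X, y, cv=5).mean())
```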
Types of Models
Deterministic Models
● Deterministic models are systems where the outcomes are completely determined
by the input data and parameters, without any randomness or probabilistic
elements.
● These models operate under strict rules or equations, ensuring that the same
inputs always produce the same outputs. This characteristic makes deterministic
models highly predictable and reliable.
● In machine learning preprocessing, deterministic transformations such as scaling, encoding, or normalizing data ensure consistent input for algorithms.
● Deterministic models also form the foundation for simple AI systems, such as rule-based chatbots that provide predefined responses.
Cont.
● Deterministic models rely on fixed relationships between variables, often
derived from established rules or mathematical equations.
● They assume that all relevant factors can be fully captured in the model,
leaving no room for uncertainty.
Example: Weather Prediction
● Estimating temperature trends based on historical seasonal data.
● Inputs: Time of year, geographical location, historical averages.
● Rule: Temperature follows a sinusoidal pattern based on the Earth's orbit.
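A minimal sketch of this deterministic rule in Python; the mean temperature, amplitude, and phase are illustrative assumptions, not values from the text.

```python
import math

def predict_temperature(day_of_year, mean_temp=15.0, amplitude=10.0):
    """Deterministic rule: temperature follows a sinusoidal pattern over the year.
    The inputs fully determine the output; there is no random component."""
    # Phase chosen so the peak falls around mid-July (illustrative, Northern Hemisphere).
    return mean_temp + amplitude * math.sin(2 * math.pi * (day_of_year - 105) / 365)

# The same inputs always produce the same output -- the defining property of a deterministic model.
print(predict_temperature(196))
print(predict_temperature(196))  # identical result
```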
Cont.
The strengths of this modeling approach are:
● Predictability: Outcomes are consistent and repeatable for the same inputs.
● Simplicity: Easy to design and interpret due to their rule-based nature.
● Transparency: Clearly defined rules make it easy to audit and justify
decisions.
Limitations of Deterministic Models
● Lack of Flexibility: Cannot handle randomness or uncertainty effectively.
● Dependence on Complete Data: Assumes all influencing factors are known
and measurable.
Stochastic Models

● Stochastic models are systems that incorporate elements of randomness or probability to represent uncertainty and variability in data or outcomes.
● Unlike deterministic models, stochastic models recognize that real-world
processes often involve inherent randomness, which is reflected in their
predictions.
● Outputs are not fixed and may vary even with identical inputs due to
probabilistic components.
○ Example: Predicting customer churn based on historical behavior and external factors
introduces uncertainty.
● Stochastic models use probability distributions to describe outcomes.
○ Example: Using a Gaussian distribution to model the variability in product demand.
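A minimal sketch of the Gaussian demand example; the mean and standard deviation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_weekly_demand(n_weeks, mean=500.0, std=50.0):
    """Stochastic model: weekly demand drawn from a Gaussian distribution."""
    return rng.normal(loc=mean, scale=std, size=n_weeks)

# Identical inputs, different outputs: the probabilistic component drives the variability.
print(simulate_weekly_demand(5))
print(simulate_weekly_demand(5))
```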
Cont.
● One of the best-known stochastic models is the Markov chain. Models of this type capture uncertain real-world conditions that can affect a prediction, making the output random even for the same inputs.
Strengths of Stochastic Models
● Handling Uncertainty: Ideal for systems with inherent randomness, like weather or
customer behavior.
● Realism: Reflects the variability and unpredictability of real-world processes.
● Flexibility: Can adapt to dynamic systems where inputs and conditions change
frequently.
Cont.

Limitations of Stochastic Models

● Complexity: Requires advanced mathematical understanding and computation.
● Data Dependency: Heavily reliant on high-quality data to estimate
probabilities accurately.
● Uncertainty in Predictions: Provides ranges or probabilities instead of fixed
outputs, which might be less actionable in some contexts.
Empirical Models
● Empirical models are data-driven approaches that rely on observed data to derive
patterns, relationships, and predictions.
● The model is constructed from historical or real-time data. Patterns and relationships are inferred directly from the data rather than from predefined equations or rules.
● Empirical models can generalize to new, unseen data when trained effectively.
● Often lack interpretability, as the focus is on accurate predictions rather than
understanding the process.
○ Example: Neural networks can predict outcomes but may not explain why or how a decision was
made.
● Algorithms that follow the empirical modeling approach include Support Vector Machines (SVMs) and Random Forests.
Cont.
Strengths of Empirical Models

● Versatility: Can be applied to various domains, such as finance, healthcare, and e-commerce.
● Data Adaptability: Models improve as more data becomes available, leading to better accuracy.
● Automation: Empirical models can automate complex tasks like anomaly detection or
recommendation systems.

Challenges of Empirical Models

● Data Dependence: Requires extensive and high-quality data for training. Poor data quality can lead
to inaccurate predictions.
● Overfitting: Empirical models may perform well on training data but fail to generalize to unseen data
if not properly regularized.
● Interpretability: Complex models like neural networks are often "black boxes," making it difficult to
explain decisions.
Linear regression

● Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables using a linear equation.
● It is one of the simplest and most widely used techniques in predictive
modeling. The goal is to fit a linear equation to the observed data and use this
equation to predict values of the dependent variable based on new
observations of the independent variables.
● Predicting continuous outcomes like house prices, stock prices, etc.
Cont.
Equation: y = β0 + β1x + ϵ

y: Dependent variable

x: Independent variable

β0: Intercept

β1: Slope

ϵ: Error term

Explanation: The equation represents a straight line that best fits the data points. The intercept (β0) is the value of y when x is 0, and the slope (β1) represents the change in y for a one-unit change in x.
Cont.
Slope (β1): Change in the dependent variable for a one-unit change in the
independent variable.
Intercept (β0): Value of the dependent variable when the independent variable is
zero.
Understanding the coefficients helps interpret the model and the relationship
between variables. The slope indicates the strength and direction of the
relationship.
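A minimal sketch of fitting the equation above in Python, assuming scikit-learn; the data are synthetic, generated from y = 3 + 2x plus noise, so the recovered coefficients should sit near those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3 + 2x + noise (illustrative values).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(x, y)
print("Intercept (beta0):", model.intercept_)  # value of y when x = 0
print("Slope (beta1):", model.coef_[0])        # change in y per one-unit change in x
```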
Regularization
● Regularization is a technique used to address overfitting by penalizing overly
complex models. It helps in enhancing the generalization ability of models,
ensuring they perform well on unseen data.
● A method to constrain or shrink model coefficients to prevent overfitting.
Introduces a penalty for large coefficients, encouraging simpler, more
interpretable models.
● Prevents Overfitting: Reduces the model's ability to capture noise in the
training data.
● Improves Generalization: Ensures the model performs well on test or real-
world data.
● Stabilizes Models: Helps avoid extreme fluctuations in predictions for small
changes in input.
Types of Regularization Techniques

Lasso Regression (Least Absolute Shrinkage and Selection Operator)

● Penalty Added: L1-norm penalty, represented as λ∑|wi|, where the wi are the model coefficients.
● Forces some coefficients to become exactly zero, effectively performing
feature selection.
● Simplifies models by eliminating irrelevant features.
● This regularization is used when there are many irrelevant or redundant
features in the dataset.
Cont.

Ridge Regression

● Penalty Added: L2-norm penalty, represented as λ∑wi², where the wi are the model coefficients.
● Shrinks all coefficients toward zero but does not eliminate them entirely.
● Useful for managing multicollinearity and stabilizing predictions.
Cont.

Elastic Net

● Penalty Added: Combines the L1-norm and L2-norm penalties:

λ1∑|wi| + λ2∑wi²

● Balances the benefits of Lasso (feature selection) and Ridge (shrinkage).


● Useful when features are highly correlated or when there are many irrelevant
features.
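A minimal sketch contrasting the three penalties with scikit-learn; the data and the alpha values (the λ penalty strengths) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data: only the first two of ten features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1: drives irrelevant coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: shrinks coefficients but keeps them non-zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties

print("Lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Elastic Net non-zero coefficients:", int(np.sum(enet.coef_ != 0)))
```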
Logistic regression

● Logistic regression is a statistical method used for binary classification problems, where the outcome is a binary variable (e.g., yes/no, true/false).
● Unlike linear regression, logistic regression predicts the probability that a
given input point belongs to a certain class.
● Predicting outcomes like spam vs. not spam, disease presence, etc.
Cont.

● Logistic function (sigmoid function)


● The logistic function maps any real-valued number into the range (0, 1),
making it suitable for probability estimation. The model predicts the probability
that the dependent variable Y equals 1 (i.e., belongs to the positive class).
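● In symbols, the sigmoid is σ(z) = 1 / (1 + e^(−z)); with a single predictor the model estimates P(Y=1 | x) = 1 / (1 + e^(−(β0 + β1x))).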
Cont.

● Binary dependent variable: The dependent variable has only two possible
outcomes, typically coded as 0 and 1.
● Independent observations: Observations in the dataset must be
independent, meaning the outcome of one observation does not influence
another.
● No multicollinearity: Independent variables should not be highly correlated
with each other, as this can make the model's coefficients unstable.
● Large sample size: Logistic regression requires a relatively large sample
size to provide reliable and stable estimates of the model parameters.
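A minimal sketch of binary classification with scikit-learn; the data are synthetic, generated so that the probability of the positive class rises with x.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary data: P(Y=1) increases with x (illustrative).
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=(300, 1))
p_true = 1 / (1 + np.exp(-(x[:, 0] - 5)))
y = rng.binomial(1, p_true)

model = LogisticRegression().fit(x, y)
# predict_proba returns [P(Y=0), P(Y=1)] per sample; column 1 is the positive class.
print("P(Y=1 | x=7):", model.predict_proba([[7.0]])[0, 1])
print("Predicted class at x=7:", model.predict([[7.0]])[0])
```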
Building Decision Models with Multi-Criteria Decision Making (MCDM)

● Multi-Criteria Decision Making (MCDM) refers to a set of techniques or frameworks used to evaluate, prioritize, and choose between multiple alternatives, especially when decisions involve trade-offs among conflicting criteria.
● It helps decision-makers arrive at optimal or near-optimal solutions in complex
scenarios by structuring the decision process and incorporating multiple
viewpoints.
● In real-world scenarios, decisions are rarely based on a single factor. For
example, selecting a supplier may involve considering cost, quality, delivery
speed, and sustainability, which often conflict. MCDM provides systematic
methods to evaluate such trade-offs.
Cont.
Decision Space
● The decision space refers to the collection of all possible solutions or alternatives
available for evaluation.
● Helps define the boundaries and scope of the decision-making process.
Criteria Weighting
● Criteria weighting involves assigning relative importance to each criterion based
on its significance to the overall decision.
● Ensures that more critical factors have a greater influence on the final decision.
○ Expert Judgment: Decision-makers rank or rate criteria based on experience or preferences.
○ Pairwise Comparisons: Methods like the Analytic Hierarchy Process (AHP) systematically compare
criteria to assign weights.
Cont.
Scoring and Ranking
● Each alternative is scored based on its performance for each criterion, and
the scores are aggregated to rank the alternatives.
Steps:
1. Evaluate each alternative against the criteria.
2. Multiply the criterion score by its weight.
3. Sum up the weighted scores for each alternative.
● Provides a clear ranking of alternatives, allowing decision-makers to identify
the most suitable choice.
Popular MCDM Methods

1. Analytic Hierarchy Process (AHP):

◦ Breaks a decision problem into a hierarchy of sub-problems.

◦ Steps:

▪ Define the criteria and alternatives.

▪ Compare criteria pairwise to establish weights.

▪ Score alternatives and calculate an overall ranking.


Cont.
2. Weighted Sum Model (WSM):
◦ Calculates a weighted score for each alternative:

Si = ∑j wj · xij

◦ Where wj is the weight of criterion j, and xij is the performance of alternative i on criterion j.
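A minimal sketch of the Weighted Sum Model; the supplier-selection criteria, weights, and scores below are illustrative assumptions.

```python
import numpy as np

# Criteria: cost, quality, delivery speed; weights w_j sum to 1 (illustrative).
weights = np.array([0.5, 0.3, 0.2])

# Performance x_ij of each alternative i on each criterion j (illustrative 1-10 scores).
scores = np.array([[7, 9, 6],   # Supplier A
                   [8, 6, 9],   # Supplier B
                   [6, 8, 8]])  # Supplier C

# S_i = sum_j w_j * x_ij
wsm_scores = scores @ weights
best = int(np.argmax(wsm_scores))
print("WSM scores:", wsm_scores)
print("Best alternative:", ["Supplier A", "Supplier B", "Supplier C"][best])
```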
Taylor Polynomials

● A Taylor polynomial is an approximation of a function 𝑓(𝑥) using a finite number of terms from its Taylor series expansion.
● It is a way to approximate a function near a specific point, typically around
𝑥=a, by using the values of the function and its derivatives at that point.
● Given a function 𝑓(𝑥) that is sufficiently differentiable at a point 𝑎, the 𝑛-th
degree Taylor polynomial 𝑃𝑛(𝑥) of 𝑓(𝑥) around 𝑥=𝑎 is:
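Pn(x) = f(a) + f′(a)(x − a) + (f′′(a)/2!)(x − a)² + ⋯ + (f⁽ⁿ⁾(a)/n!)(x − a)ⁿ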
Cont.

● f(a): Function value at 𝑎.


● 𝑓′(𝑎): First derivative of the function evaluated at 𝑎
● 𝑓′′(𝑎): Second derivative of the function evaluated at 𝑎
● 𝑓(𝑛)(𝑎):𝑛-th derivative of the function evaluated at 𝑎
● (𝑥−𝑎): The difference between the point of approximation 𝑥 and the base point
𝑎.
● 𝑛!: Factorial of 𝑛.
Cont.
Properties and Significance:

Convergence:
The Taylor polynomial provides a good approximation of the function near the point 𝑎. The more terms
you include in the expansion, the more accurate the approximation, especially for smooth functions.
Degree of Approximation:

The degree 𝑛 of the Taylor polynomial determines how many derivatives of the function are taken into
account, with higher-degree polynomials offering better approximations for a larger range of 𝑥.

Error Estimate (Lagrange Remainder):

The error between the Taylor polynomial and the actual function can be bounded by the Lagrange
remainder term, which provides an estimate of how much the Taylor polynomial deviates from the actual
function for a given 𝑛.
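A minimal sketch approximating e^x with Taylor polynomials of increasing degree around a = 0; the choice of function is illustrative.

```python
import math

def taylor_exp(x, n, a=0.0):
    """n-th degree Taylor polynomial of e^x around a.
    Every derivative of e^x equals e^a at the base point, so term k is e^a * (x - a)^k / k!."""
    return sum(math.exp(a) * (x - a) ** k / math.factorial(k) for k in range(n + 1))

# Higher degree n -> smaller error near the expansion point, as the properties above describe.
for n in (1, 2, 4, 8):
    approx = taylor_exp(1.0, n)
    print(f"n={n}: approximation={approx:.6f}, error={abs(math.e - approx):.2e}")
```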
Dividing and Conquering with Bisection Methods

● The bisection method is a numerical technique used to find the roots of a continuous function. It works by repeatedly halving an interval and selecting the subinterval that contains the root. This method is based on the intermediate value theorem, which ensures that a root exists between two points if the function values at those points have opposite signs.
● Also called the interval halving method, the binary search method, or the dichotomy method, it is based on Bolzano's theorem for continuous functions.
Cont.

● The bisection method looks for the value c at which the plot of the function f crosses the x-axis. This value of c is an approximation of the root of the function f(x). How close c gets to the real root depends on the tolerance we set for the algorithm.
Cont.

Algorithm Steps:

1. Identify Interval:
a. Choose an interval [𝑎,𝑏] such that 𝑓(𝑎)⋅𝑓(𝑏)<0. This condition ensures that the function has a
root in the interval, as the function changes sign between 𝑎 and 𝑏.
2. Compute Midpoint:
a. Calculate the midpoint 𝑚 of the interval [𝑎,𝑏]
Cont.

3. Evaluate:
a. Evaluate the function at the midpoint 𝑓(𝑚).
b. If 𝑓(𝑚)=0, then 𝑚 is the root.
c. If 𝑓(𝑎)⋅𝑓(𝑚)<0, the root lies in the left subinterval [𝑎,𝑚].
d. If 𝑓(𝑚)⋅𝑓(𝑏)<0, the root lies in the right subinterval [𝑚,𝑏].
4. Repeat:
a. Narrow the interval by selecting the subinterval where the root lies. Repeat the process until
the desired level of precision is achieved, i.e., until the difference between 𝑎 and 𝑏 is
sufficiently small.
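A minimal sketch of the algorithm just described; the tolerance, iteration cap, and example function are illustrative choices.

```python
def bisection(f, a, b, tol=1e-6, max_iter=100):
    """Approximate a root of the continuous function f on [a, b], assuming f(a)*f(b) < 0."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        m = (a + b) / 2.0                     # step 2: midpoint of the current interval
        if f(m) == 0 or (b - a) / 2.0 < tol:  # step 3b / stopping criterion
            return m
        if f(a) * f(m) < 0:
            b = m                             # root lies in the left subinterval [a, m]
        else:
            a = m                             # root lies in the right subinterval [m, b]
    return (a + b) / 2.0

# Example: the root of x^2 - 2 on [1, 2] is sqrt(2) ≈ 1.414214
print(bisection(lambda x: x * x - 2, 1.0, 2.0))
```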
Predicting the future with Markov chains

● A Markov Chain is a mathematical model that describes a system where the future state depends only on the current state, and not on the sequence of events that preceded it. This property is known as the Markov Property, which can be summarized as:
● Markov Assumption: current unobservable state depends on a finite number
of past states
● The Markov property states that the transition probability depends only on the
current state and not on the sequence of events that led to it.
Cont.

● i.e.: Xt depends on some previous Xis


● First-order Markov process: current state depends only on the previous state,
● i.e.: P(Xt|X0:t-1) = P(Xt|Xt-1)
● kth order: depends on previous k time steps
● Sensor Markov assumption:
○ Observable variables depend only on the current state (by definition, essentially), these are
the “sensors”.
○ The current state causes the sensor values.

P(Et|X0:t, E0:t-1) = P(Et|Xt)


Cont.

● Assume stationary process: transition model P(Xt|Xt-1) and sensor model P(Et | Xt) are the same for all t
● In a stationary process, the changes in the world state are governed by laws
that do not themselves change over time
● The laws of probability don’t change over time
Cont.
Steps for Using Markov Chains to Predict the Future:

● Define the States:


○ Identify all possible states in the system. For example, in a weather prediction model, the states could be "Sunny,"
"Cloudy," and "Rainy."
● Construct the Transition Matrix:
○ Create a transition matrix that defines the probabilities of transitioning from one state to another. Each row represents
the current state, and each column represents the possible next states.
● Initial State Distribution:
○ Define the initial distribution of the states, i.e., the probability of starting in each state. This can be represented as a
vector 𝜋0.
● Predict Future States:
○ To predict the future state, multiply the initial state distribution by the transition matrix. This gives the probability
distribution of the next state.

πt+1 = πt · P
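A minimal sketch of these steps for the weather example; the transition probabilities and initial distribution are illustrative assumptions.

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]

# Transition matrix P: row = current state, column = next state (each row sums to 1).
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])

# Initial distribution pi_0: start from a sunny day with certainty.
pi = np.array([1.0, 0.0, 0.0])

# Apply pi_{t+1} = pi_t . P repeatedly to look several steps ahead.
for t in range(1, 4):
    pi = pi @ P
    print(f"Day {t}:", dict(zip(states, np.round(pi, 3))))
```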
Dimensionality Reduction: PCA and SVD

● Dimensionality reduction is the process of reducing the number of variables (features) in a dataset while retaining as much relevant information as possible.
● It is crucial for:
○ Simplifying data visualization.
○ Enhancing computational efficiency.
○ Reducing noise and redundancy in data.
● Two popular techniques for dimensionality reduction are Principal Component
Analysis (PCA) and Singular Value Decomposition (SVD). Both are grounded
in linear algebra and are widely used in machine learning, data science, and
statistics.
Principal Component Analysis (PCA):

● PCA transforms a dataset into a new coordinate system by finding directions (principal components) that capture the maximum variance in the data.
● The principal components are linear combinations of the original features. The first principal component captures the largest variance, the second captures the next largest variance orthogonal to the first, and so on.
● PCA seeks to maximize the variance captured in lower dimensions to
preserve as much information as possible.
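A minimal sketch of PCA with scikit-learn; the synthetic dataset below (five correlated features built from two underlying factors) is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 features that are noisy mixtures of 2 underlying factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 2))
X = factors @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Project onto the 2 directions of maximum variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)
print("Variance explained per component:", np.round(pca.explained_variance_ratio_, 3))
```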
Singular Value Decomposition (SVD):
● SVD splits 𝐴 into its constituent parts, revealing its structure.
● The singular values provide the magnitude of variance captured by each component.
● By retaining only the top 𝑘 singular values and corresponding vectors, we approximate 𝐴 in a lower-dimensional space.
● SVD decomposes a matrix 𝐴 into three matrices, 𝐴 = U Σ VT, where:

● U: Orthogonal matrix containing left singular vectors (column space of 𝐴).


● Σ: Diagonal matrix of singular values (capturing importance of components).
● VT : Orthogonal matrix containing right singular vectors (row space of 𝐴).
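A minimal sketch of SVD and a rank-k approximation with NumPy; the matrix and the choice k = 2 are illustrative.

```python
import numpy as np

# Illustrative 6x4 data matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# A = U @ diag(s) @ VT
U, s, VT = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the top k singular values and vectors.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]
print("Singular values:", np.round(s, 3))
print("Rank-2 approximation error (Frobenius):", round(float(np.linalg.norm(A - A_k)), 4))
```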
