Data Analysis

Uploaded by

Nermine Limeme
Copyright
© All Rights Reserved

Wejden zormati

Supervised Learning
In supervised learning, the algorithm is provided with labeled training data, and the goal is to
learn a mapping from inputs (features) to outputs (labels or target values).
Rule of thumb: if the task is to predict a known target, you are in supervised learning.
Characteristics:
● Labeled Data: The training data includes both the input data and the correct output. This
allows the algorithm to learn from the provided examples.
● Prediction Goal: The primary objective is to make accurate predictions for new, unseen
data based on what the model has learned from the training data.
● Feedback Loop: The algorithm can adjust itself based on the error of its predictions on
the training data.
Example: you can predict whether a telecom customer is planning to cancel and change
his line, based on his information (chart analysis). You can also prepare a menu based on
customers' previous receipts.
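To make the telecom example concrete, here is a minimal sketch of supervised learning. The data, the two features (monthly calls, complaints) and the function name `predict_churn` are all made up for illustration, and the nearest-neighbour rule is just one simple choice of model, not the method used in the course.

```python
# Toy labeled data (invented): (monthly_calls, complaints) -> did the customer leave?
training = [
    ((120, 0), False),
    ((100, 1), False),
    ((30, 4), True),
    ((20, 5), True),
]

def predict_churn(features, labeled_examples):
    """Return the label of the closest training example (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(labeled_examples, key=lambda ex: squared_distance(ex[0], features))
    return nearest[1]

# A new, unseen customer: few calls and several complaints.
print(predict_churn((25, 3), training))  # True
```

The labels in `training` are what makes this supervised: the prediction for the new customer is copied from an example whose correct answer is already known.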
Unsupervised Learning
In unsupervised learning, the algorithm is given data without explicit instructions on what to do
with it. The system tries to learn the patterns and the structure from the data without any labeled
responses to guide the learning process.
Characteristics:
● Unlabeled Data: The training data doesn't have any labels or target values.
● Discovery Goal: The primary objective is to identify patterns, relationships, or structures
in the data.
● No Feedback: There's no feedback loop based on prediction error, as there are no
"correct" answers.
Sampling:
- Random + growing size
Probability sampling technique (there is a frame)
Probability sampling is a technique wherein every member of the population has a known,
non-zero chance of being selected.
● Simple Random Sampling (SRS): Every individual has an equal probability of being
selected.
Non-probability sampling (there is no frame, big enough)
The Questionnaire:

● Definition: A questionnaire is a research instrument that consists of a series of
questions. Its main objective is to gather information from respondents.
● Advantages: Common advantages include cost-effectiveness, ease of distribution,
and the ability to collect data from a large number of respondents in a relatively
short time.

Generalities about Sampling:

This section covers the foundational concepts of sampling: the distinction between
sampling and a full census, the scenarios where each is most applicable, and the
justification for sampling, that is, why researchers often prefer sampling over other
methods of data collection.

Probability Sampling Techniques:

This section covers the standard probability sampling methods. Each method has its own
advantages, disadvantages, use cases, and potential pitfalls.

The methods include:

● Simple Random Sampling (SRS)
● Systematic Sampling: every k-th individual is selected, with k = N/n (N = population
size, n = sample size).
● Sampling with Probability Proportional to the Unit's Size: Units with a larger size
have a greater chance of being included in the sample.
● Stratified Random Sampling: The population is subdivided into strata
(relatively homogeneous groups) which are mutually exclusive. From each stratum
the same proportion of individuals is drawn, so the sampling rate is the same in all
strata.
● Disproportionate Stratified Sampling: The only difference between
proportionate and disproportionate stratified random sampling is their sampling
fractions. With disproportionate sampling, the different strata have different
sampling fractions.
● Cluster Sampling
● Multistage Sampling: Similar to cluster sampling, except that in this case a
sample is taken from each cluster. There are at least two stages (degrees). The first
stage identifies large clusters (primary units). In the second stage, within each cluster,
the units (secondary units) that will be part of the sample are selected.
● is a sampling method that consists of ensuring the representativity of a sample
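The first methods in the list above can be sketched in a few lines. This is only an illustration: the function names, the `seed` parameter and the toy population are assumptions, and the systematic version uses integer division for k = N/n.

```python
import random

def simple_random_sample(population, n, seed=0):
    """SRS: every unit has the same chance of being drawn (without replacement)."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, n, seed=0):
    """Pick every k-th unit after a random start, with k = N // n."""
    k = len(population) // n
    start = random.Random(seed).randrange(k)
    return [population[start + i * k] for i in range(n)]

def proportionate_stratified_sample(strata, rate, seed=0):
    """Draw the same fraction `rate` from every stratum (dict: name -> list of units)."""
    rng = random.Random(seed)
    return {name: rng.sample(units, round(len(units) * rate))
            for name, units in strata.items()}

population = list(range(100))                      # toy frame of 100 units
strata = {"urban": population[:60], "rural": population[60:]}

print(simple_random_sample(population, 10))
print(systematic_sample(population, 10))           # units spaced k = 10 apart
print(proportionate_stratified_sample(strata, 0.1))
```

With a 10% rate, the stratified draw takes 6 units from the 60 urban units and 4 from the 40 rural ones, which is what "same sampling rate in all strata" means.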

Non-Probability Methods:

Unlike probability sampling techniques, where each member of the population has a
known, non-zero chance of being selected, non-probability sampling does not give every
member a known chance of selection. The methods include Judgmental Sampling, the Quota
Method (units selected by an investigator, with importance), and Route Sampling
(directions and locations).
Sample Adjustment:

A sample should accurately represent the broader population. When it does not, the
sample can be adjusted, for example according to quantitative variables, by weighting,
or with other methods that restore representativeness.

● $Y_q = \bar{y} \, \mu / \bar{x}$
- $\bar{y}$ = the average turnover
- $\mu$ = the average number
- $\bar{x}$ = sample mean
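A small worked example of the formula above. The numbers are invented, and reading µ as the known population mean of an auxiliary variable x (with x̄ its mean in the sample) is an assumption, since the notes do not spell it out.

```python
def ratio_adjusted_mean(y_sample, x_sample, mu):
    """Adjusted estimate Y_q = y_bar * mu / x_bar."""
    y_bar = sum(y_sample) / len(y_sample)   # sample mean of the variable of interest
    x_bar = sum(x_sample) / len(x_sample)   # sample mean of the auxiliary variable
    return y_bar * mu / x_bar

turnover = [100, 200, 300]   # y: hypothetical turnover of 3 sampled firms
employees = [10, 20, 30]     # x: hypothetical auxiliary variable for the same firms
print(ratio_adjusted_mean(turnover, employees, mu=25))  # 200 * 25 / 20 = 250.0
```

Here the sample under-represents large units (x̄ = 20 against µ = 25), so the raw mean of 200 is scaled up to 250.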

Adjustment
By deleting: To give the sample the same characteristics as the reference population, we
can randomly select some observations from the sample and delete them.
By bootstrapping: To give the sample the same characteristics as the reference
population, we can draw a new bootstrap sample (with replacement) from the available sample.
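Both adjustments can be sketched as follows, assuming the characteristic to match is the share of each group (for example gender). The function names, the `target_share` dictionary and the toy sample are hypothetical, and this is one possible way to do it rather than the course's exact procedure.

```python
import random

def split_by_group(sample, group_of):
    """Group the sampled units by the characteristic we want to match."""
    groups = {}
    for unit in sample:
        groups.setdefault(group_of(unit), []).append(unit)
    return groups

def adjust_by_deleting(sample, group_of, target_share, seed=0):
    """Randomly delete units until each group has its population share."""
    rng = random.Random(seed)
    groups = split_by_group(sample, group_of)
    # Largest total size for which every group can still fill its share.
    total = min(len(groups[g]) / share for g, share in target_share.items())
    kept = []
    for g, share in target_share.items():
        kept.extend(rng.sample(groups[g], int(share * total)))
    return kept

def adjust_by_bootstrapping(sample, group_of, target_share, size, seed=0):
    """Draw a new sample WITH replacement, group by group, at the target shares."""
    rng = random.Random(seed)
    groups = split_by_group(sample, group_of)
    resampled = []
    for g, share in target_share.items():
        resampled.extend(rng.choices(groups[g], k=round(share * size)))
    return resampled

# Toy sample: 70 women and 30 men, while the population is assumed to be 50/50.
sample = [("F", i) for i in range(70)] + [("M", i) for i in range(30)]
target = {"F": 0.5, "M": 0.5}
gender = lambda unit: unit[0]

print(len(adjust_by_deleting(sample, gender, target)))            # 60 (30 F + 30 M)
print(len(adjust_by_bootstrapping(sample, gender, target, 100)))  # 100 (50 F + 50 M)
```

Deleting shrinks the sample, while bootstrapping keeps the size you ask for but repeats some observations.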
Chapter 2 : Bivariate analysis
Imagine you're at a dance party. You're not just dancing alone; you have a partner. How
you move, whether you're in sync, and the space between you two can tell a lot about
your partnership. That's bivariate analysis - the dance of two variables!

1️⃣ Statistical Characters: Dependence and Interdependence

● Dependency Relationships: Think of it as a one-way street. How one thing (like
age) affects another (like brand preference).
● Interdependence Relationships: Now, it's a roundabout where both variables
affect each other, like a back-and-forth dialogue.

2️⃣ Pearson Correlation Analysis

● This is about the dance rhythm! If both you and your partner (two variables)
move together in perfect sync, that's a strong correlation.
● Pearson's r: Tells you about the dance style:
○ Close to +1: Perfect sync in the same direction (Positive Correlation)
○ Close to -1: Perfect sync but in opposite directions (Negative Correlation)
○ Around 0: You two are doing your own thing (No Correlation)
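To see the three cases above in numbers, here is a minimal sketch that computes Pearson's r by hand. The function name and the toy data are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

steps = [1, 2, 3, 4]
print(pearson_r(steps, [2, 4, 6, 8]))    # same direction: close to +1
print(pearson_r(steps, [8, 6, 4, 2]))    # opposite direction: close to -1
print(pearson_r(steps, [1, -1, -1, 1]))  # no linear link: 0
```

Note that r only measures linear association, so a value near 0 means "no straight-line pattern", not necessarily "no relationship at all".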

3️⃣ Hypothesis testing for a risk level α


3️⃣ Cross Table Analysis and Chi-Square Test

● Imagine you're trying to find out if wearing sunglasses at the party (one variable)
is related to dance style (another variable). The Chi-Square test helps you see if
the variables are busting moves independently or in a pattern.
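A minimal sketch of the Chi-Square statistic on a cross table, using invented counts for the sunglasses example. The function name is hypothetical. The statistic compares each observed count with the count expected under independence (row total × column total / grand total).

```python
def chi_square_statistic(table):
    """Return (chi2, degrees of freedom) for a cross table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Rows: sunglasses / no sunglasses. Columns: dance style A / style B (made-up counts).
party = [[30, 10],
         [10, 30]]
print(chi_square_statistic(party))  # (20.0, 1)
```

With 1 degree of freedom and α = 0.05 the critical value is 3.841, so a statistic of 20 leads to rejecting independence for this toy table.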
4️⃣ Comparison Test of Two Averages

● This is like having a dance-off between two groups to see if they're really
different. For example, do groups with different DJs (categories) really have
different energy levels (means)?
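Here is a small sketch of the statistic behind a comparison of two means. It uses Welch's version of the t statistic (unequal variances allowed), which is an assumption since the notes do not say which variant the course uses. The data and names are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t_statistic(group_a, group_b):
    """t = difference of means divided by its estimated standard error."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / standard_error

energy_dj_1 = [5, 6, 7, 8, 9]   # hypothetical energy scores with DJ 1
energy_dj_2 = [1, 2, 3, 4, 5]   # hypothetical energy scores with DJ 2
print(welch_t_statistic(energy_dj_1, energy_dj_2))
```

The larger the absolute value of t, the less plausible it is that the two groups share the same mean. The decision is made by comparing t with the critical value for the chosen risk level α.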

5️⃣ Analysis of Variance (ANOVA)

● Now, it's a multi-group dance-off! ANOVA checks if there are real differences in
the performance (means) of more than two dance crews (groups/categories).
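A minimal sketch of the one-way ANOVA F statistic, with made-up scores for three crews. The function name is hypothetical. F compares the variability between group means with the variability inside the groups.

```python
def anova_f_statistic(groups):
    """One-way ANOVA: F = (SSB / (k - 1)) / (SSW / (N - k))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of the values around their own group mean.
    ssw = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

crews = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # hypothetical scores of three dance crews
print(anova_f_statistic(crews))             # about 13
```

For this toy data SSB = 26 and SSW = 6, so F = (26 / 2) / (6 / 6) = 13. A large F suggests that at least one crew's mean differs from the others.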

🎯 In Summary
Bivariate analysis is like being a dance judge. You're watching the interaction of two
dancers (variables) and deciding:
● Are they in sync or not? (Correlation)
● Is their performance related to other factors like dance style or the song playing?
(Chi-Square Test)
● How do different pairs or groups compare? (t-Tests and ANOVA)

🔍 We use different 'judging criteria' (statistical tests) based on what we're trying to find
out. And that, dear students, is the art of understanding the dance floor of data through
Bivariate Analysis! 🕺💃
Chapter 3: Normalized Principal Components Analysis

Initial Data
Imagine you're an artist with a vast canvas made up of tiny squares. Each square can be
filled with colors representing different data points (like happiness, wealth, population, etc.),
and each row of squares is a detailed portrait of a country.

In technical terms, you have a matrix where columns represent quantitative variables, and
rows are statistical units.
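In a normalized PCA, the usual first step is to center and scale each column of this matrix so that every variable has mean 0 and variance 1. A minimal sketch, assuming the variance is computed with 1/n (the notes do not say whether 1/n or 1/(n-1) is used) and using an invented 3 × 2 matrix:

```python
from math import sqrt

def standardize(matrix):
    """Center and scale each column (variable) of a rows-by-columns data matrix."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [sqrt(sum((v - m) ** 2 for v in col) / n)   # 1/n variance: an assumption
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in matrix]

# Rows = statistical units (countries), columns = quantitative variables (made up).
data = [[1, 10],
        [2, 20],
        [3, 30]]
for row in standardize(data):
    print(row)
```

After this step the variables no longer depend on their units of measurement, which is why the analysis is called "normalized".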

Technical Objectives

As a treasure hunter, you have a detailed map and your goal is to:

● Simplify the map by finding the most straightforward path.
● Group similar landmarks together and identify unique ones that stand out.
● Understand the relationships between different paths and landmarks.

In data analysis, this means reducing data dimensions, clustering similar data points,
and analyzing variable relationships.
Case Studies and Methodologies

Imagine you're now an architect studying various building designs. You use special lenses
(mathematical tools and projections) to view the buildings from different angles,
understanding their structure more clearly.

These slides are likely discussing specific case studies (like analyzing consumer
perceptions) and the technical methods (like orthogonal projections) used for analysis.
Individuals Scatter Plot Analysis

Back to being an artist, but this time, you're critiquing art (data points). You realize that
certain colors (variables) are more dominant and help in understanding the overall painting
better.

This slide talks about how specific variables provide a better understanding of data
dispersion and the importance of visual representation in data analysis.

Variable Scatter Plot Analysis


Now, you're a photographer, adjusting your camera lens to get the perfect shot of different
birds (data points) in a forest (data set). The angle and focus (mathematical calculations
and projections) are crucial to understanding each bird's importance and role in the forest's
ecosystem.

These slides discuss how each data point is analyzed in relation to others through scatter
plots, highlighting the technical process behind it.
Understanding Variability

Imagine you're an explorer looking at different paths in a forest. Each path is unique, leading
to various destinations (outcomes). Your goal is to understand which paths (variables) have
the most influence on your journey (the system).

In technical terms, this is about understanding which variables contribute most to the
variability in your data. By identifying these, you can simplify complex multi-dimensional
data into more manageable forms, focusing on the most critical factors.
Dimension Reduction Techniques

Now, think of yourself as a gardener. Your garden is vast, filled with numerous types of
plants (variables). However, you need to make it simpler for visitors to tour. So, you decide
to organize the plants by certain similarities, grouping them (dimension reduction) and
creating specific paths (principal components) highlighting the most distinctive and
interesting plants.
In data analysis, this is what dimension reduction techniques do. They reduce the number of
variables to consider, making data interpretation more straightforward without significantly
losing important information.

Identifying Outliers

As a teacher in a classroom, you notice that while most students perform similarly, a few
perform exceptionally well or need extra help. These students stand out (outliers), and
understanding them can offer insights into what makes exceptional cases different and
perhaps what influences overall performance.

This rate defines the explanatory power of the first k axes (or factors): it represents the
share of total variance taken into account by these k axes. However, its interpretation must
take into account the number of variables and the number of individuals.
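The rate described above can be computed from the eigenvalues of the correlation matrix: it is the sum of the first k eigenvalues divided by the sum of all of them (in a normalized PCA that total equals the number of variables). A minimal sketch, where the function name and the four eigenvalues are hypothetical:

```python
def cumulative_explained_rate(eigenvalues):
    """Share of total variance carried by the first 1, 2, ..., p axes."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates, running = [], 0.0
    for value in ordered:
        running += value
        rates.append(running / total)
    return rates

# Hypothetical eigenvalues for 4 standardized variables (they sum to 4).
eigenvalues = [2.5, 1.0, 0.4, 0.1]
for k, rate in enumerate(cumulative_explained_rate(eigenvalues), start=1):
    print(f"first {k} axes: {rate:.1%}")
```

In this made-up case the first two axes carry 87.5% of the total variance, so a 2-D plot would summarize the 4 variables well.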

In your data, outliers can provide valuable insights. They might indicate errors, or they might
reveal areas for innovation or improvement.

Correlation Between Variables


Imagine being at a dance where pairs of dancers move in sync. When two dancers
(variables) move closely together, they are highly correlated. When one moves
independently of the other, they are less correlated or uncorrelated.

Understanding the relationships between variables helps in predicting how changes in one
variable might affect another, which is crucial in planning and decision-making processes.

Slide 18: Visualization Techniques

You're a tourist using a map of a complex subway system. The map simplifies information,
helping you understand how to navigate the various routes. Similarly, data visualization
techniques allow complex data to become more understandable and accessible, aiding in
identifying patterns, trends, and insights that might be less obvious in a tabular format.

Slide 19: Interpretation and Decision Making

Back in the village, after your thorough investigation and research, you gather the villagers
(stakeholders) to discuss your findings. You interpret the data, highlight important insights,
and guide decision-making for the village’s future. This stage is crucial because the way
data is interpreted directly influences the decisions made and the actions taken.

Slide 20: Conclusion and Next Steps

As the sun sets, you conclude your adventure. You recap the journey, what you've learned,
and suggest the next steps. Maybe more paths need exploring, or perhaps the village will
start a new tradition based on your findings.

Similarly, you conclude your presentation by summarizing key points, reflecting on the
project’s significance, and possibly suggesting future research areas or strategies based on
your findings.
REVISE PCA IN MIDTERM ? CHAPTER 3
