Analytics Notes
Business Analytics for Managers
Q. 1. - Introduction to Data Science.
(https://youtu.be/X3paOmcrTjQ?si=n_5a9fs1nZVrq6Qk)
Ans. - Data science is like being a detective, but instead of solving crimes, you're solving puzzles
using data. It's all about collecting, analyzing, and interpreting information to uncover insights and
make better decisions.
Imagine you have a big pile of puzzle pieces (data) scattered around. Each piece holds a tiny bit
of information. Data science helps you organize and understand these pieces so you can see the
bigger picture.
Here's how it works:
1. Collecting Data: First, you gather data from various sources like sensors, websites, or
surveys. This could be anything from sales numbers to social media posts.
2. Cleaning and Preparing Data: Data is often messy and incomplete, like puzzle pieces
with smudges or missing edges. So, you clean and organize the data, making sure it's
accurate and ready for analysis.
3. Exploring Data: Now comes the fun part! You start piecing together the puzzle by
exploring the data. You look for patterns, trends, or anomalies that could reveal interesting
insights.
4. Analyzing Data: Once you've got a good grasp of the data, you use statistical techniques
or machine learning algorithms to dig deeper. This helps you uncover hidden relationships
or predict future outcomes.
5. Visualizing Results: To make your findings easy to understand, you create visualizations
like charts or graphs. These help you communicate your insights to others effectively.
6. Making Decisions: Finally, armed with your insights, you can make informed decisions.
Whether it's optimizing business processes, improving healthcare outcomes, or predicting
customer behavior, data science empowers you to make smarter choices.
In essence, data science is about turning raw data into actionable insights. It's a powerful tool that's
revolutionizing industries and driving innovation across the board. And as more and more data
becomes available, the opportunities for discovery and impact continue to grow.
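For illustration, the whole cycle above can be sketched in a few lines of Python with pandas. This is only a toy example: the sales.csv file and its date, region, and revenue columns are assumptions, not part of the notes.

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])    # 1. collect the raw data (hypothetical file)
df = df.dropna().drop_duplicates()                     # 2. clean: drop missing rows and duplicates
print(df.describe())                                   # 3. explore: basic summary statistics
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()  # 4. analyze: monthly revenue
monthly.plot(kind="line", title="Monthly revenue")     # 5. visualize (needs matplotlib installed)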
1. Exploring Data: Once you've identified a problem, you gather relevant data. This might
include things like sales records, patient information, or social media activity.
2. Cleaning and Preparing Data: Data often needs cleaning and organizing before it can be
used effectively. You make sure it's accurate and in a format that data science tools can
understand.
3. Analyzing Data: With the data ready, you use techniques like statistics or machine
learning to find patterns or make predictions. For example, you might analyze customer
buying habits to identify trends or detect fraudulent activity.
4. Applying Insights: Once you've uncovered insights from the data, you apply them to solve
the original problem. This could involve making business decisions, improving processes,
or creating new products or services.
5. Evaluating Results: Finally, you assess the impact of your data science solution. Did it
solve the problem effectively? Did it provide valuable insights? If not, you may need to
refine your approach or gather more data.
Throughout this process, it's important to collaborate with experts in the field to ensure that your
data science solution addresses the real needs and challenges of the problem you're trying to solve.
By leveraging the power of data, you can uncover valuable insights and drive positive change in a
wide range of domains.
Q. 3 - Data Summaries.
(https://youtu.be/f-7fmwZ81H4?si=_3Nuof-6s4SDA3ve)
Ans. - Data summaries are like short stories that capture the key points and insights hidden within
a dataset. Here's a simplified explanation:
1. What are Data Summaries?: Data summaries are brief descriptions or presentations of
the main characteristics or trends found in a dataset. They help to condense large amounts
of data into manageable and understandable information.
2. Types of Data Summaries:
Descriptive Statistics: These provide basic information about the dataset, such as
measures of central tendency (like mean or median) and variability (like standard
deviation or range).
Visualizations: Charts, graphs, or diagrams that visually represent the data, making
it easier to identify patterns or trends.
Summarized Reports: Written summaries that highlight the most important
findings and insights from the data analysis.
3. Why Data Summaries Matter:
Clarity and Understanding: Data summaries make complex data easier to
comprehend by presenting it in a clear and concise manner.
Decision-Making: They help decision-makers quickly grasp the main points of the
data, enabling them to make informed decisions.
Communication: Summaries facilitate communication among stakeholders by
providing a common understanding of the data and its implications.
4. Examples of Data Summaries:
A bar chart showing the distribution of customer ages in a sales dataset.
A written summary highlighting the average monthly sales revenue and the top-
selling products.
A pie chart illustrating the market share of different smartphone brands based on
sales data.
In essence, data summaries distill the essence of a dataset into bite-sized pieces of information,
making it easier for stakeholders to understand and act upon. They play a crucial role in data
analysis and decision-making processes across various fields and industries.
OR
(https://youtu.be/f-7fmwZ81H4?si=DjcFUKUFiX2QzsPO)
Ans. - Data summaries involve condensing large amounts of data into concise and informative
descriptions. They can take various forms depending on the context and the type of data being
summarized. Here are some common types of data summaries:
1. Descriptive Statistics: These include measures such as mean, median, mode, range,
variance, and standard deviation, which provide a basic overview of the data's central
tendency, dispersion, and distribution.
2. Frequency Distributions: These summarize how often each value occurs within a dataset,
often displayed in tables or charts.
3. Histograms and Bar Charts: These graphical representations show the frequency
distribution of numerical data or categorical data, respectively.
4. Box Plots: These graphical summaries display the distribution of a dataset along with key
summary statistics such as quartiles, median, and outliers.
5. Summary Tables: These tables can provide an overview of multiple variables or aspects
of the data, often organized in a tabular format for easy comparison.
6. Pivot Tables: Particularly useful for summarizing large datasets, pivot tables allow users
to dynamically reorganize and summarize data according to different variables and metrics.
7. Regression Analysis Results: Summarizing regression analysis involves presenting key
statistics such as coefficients, p-values, R-squared values, and other diagnostic measures
that assess the relationship between variables.
8. Time Series Summaries: For time-series data, summaries may include trends, seasonal
patterns, cyclicality, and irregular fluctuations over time.
9. Text Summaries: Summarizing textual data may involve techniques such as keyword
extraction, sentiment analysis, or topic modeling to distill key themes or insights from large
bodies of text.
10. Cluster Analysis Summaries: In the case of clustering algorithms, summaries may
involve describing the characteristics of each cluster and the overall patterns present in the
data.
These summaries can be tailored to suit specific analytical goals and the needs of the audience,
whether they are stakeholders, decision-makers, or other data analysts.
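As a small illustration, several of these summary types can be produced directly with pandas; the tiny transactions table below is made up purely for the example.

import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South", "East"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "revenue": [120, 90, 150, 80, 110, 95],
})

print(df["revenue"].describe())              # descriptive statistics: mean, std, quartiles, etc.
print(df["region"].value_counts())           # frequency distribution of a categorical column
print(pd.pivot_table(df, values="revenue",   # pivot / summary table: revenue by region and product
                     index="region", columns="product", aggfunc="sum"))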
5. Determine the Critical Region:
Based on the chosen significance level (α) and the distribution of the test statistic
under the null hypothesis, determine the critical region(s) - the range of values that,
if observed, would lead to rejecting the null hypothesis.
6. Make a Decision:
Compare the calculated test statistic to the critical region:
If the test statistic falls within the critical region, reject the null hypothesis.
If the test statistic does not fall within the critical region, fail to reject the
null hypothesis.
7. Draw Conclusions:
If the null hypothesis is rejected, conclude that there is sufficient evidence to
support the alternative hypothesis.
If the null hypothesis is not rejected, conclude that there is not enough evidence to
support the alternative hypothesis.
8. Interpret Results:
Consider the practical significance of the findings and the implications for the
research question or problem being investigated.
It's important to note that hypothesis testing does not prove anything definitively; rather, it provides
evidence for or against a particular hypothesis based on the sample data. Additionally, hypothesis
testing relies on assumptions about the data and the sampling process, which should be carefully
considered when interpreting the results.
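To make steps 5-7 concrete, here is a minimal sketch using a two-sample t-test from SciPy. The two groups of numbers are invented for illustration, and the decision is made through the p-value, which is equivalent to checking whether the test statistic falls in the critical region.

from scipy import stats

group_a = [23, 25, 28, 22, 26, 27, 24]
group_b = [30, 29, 33, 31, 28, 32, 30]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # test statistic and p-value
alpha = 0.05                                          # chosen significance level

if p_value < alpha:    # the test statistic falls inside the critical region
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")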
Features: Features are individual measurable properties or characteristics of the data. In a
dataset of houses, for example, features might include the number of bedrooms, square
footage, and location. The selection and engineering of relevant features play a significant
role in the effectiveness of a machine learning model.
Labels: In supervised learning, a subset of machine learning, each data point is associated
with a label or outcome that the model aims to predict. For instance, in a dataset of emails,
the labels might indicate whether each email is spam or not spam.
Models: Machine learning models are mathematical representations of patterns and
relationships within the data. These models are trained using algorithms that adjust their
internal parameters based on the input data to minimize errors or discrepancies between
predicted outcomes and actual outcomes.
Training: During the training phase, the model is exposed to a labeled dataset, and its
parameters are adjusted iteratively to reduce prediction errors. The goal is to generalize
well to new, unseen data.
Testing and Evaluation: After training, the model is evaluated using a separate dataset,
called the test set, to assess its performance on unseen data. Metrics such as accuracy,
precision, recall, and F1 score are commonly used to evaluate the model's performance.
Types of Machine Learning:
Supervised Learning: The model learns from labeled data, making predictions or
decisions based on input-output pairs.
Unsupervised Learning: The model learns patterns and structures from unlabeled
data, identifying hidden relationships or clusters.
Reinforcement Learning: The model learns to make decisions by interacting with
an environment, receiving feedback in the form of rewards or penalties.
Semi-Supervised Learning: Combines elements of supervised and unsupervised
learning, typically using a small amount of labeled data and a larger amount of
unlabeled data.
Machine learning has diverse applications across various domains, including image and speech
recognition, natural language processing, recommendation systems, healthcare, finance, and
autonomous vehicles. As the amount of available data continues to grow and computational
capabilities improve, machine learning is poised to play an increasingly central role in technology
and society.
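The ideas of features, labels, training, and testing can be seen in a short scikit-learn sketch; the built-in breast cancer dataset and the logistic regression model are just convenient choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)           # X = features, y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)            # the model
model.fit(X_train, y_train)                          # training on labeled data
y_pred = model.predict(X_test)                       # testing on unseen data

print("accuracy:", accuracy_score(y_test, y_pred))   # evaluation metrics
print("F1 score:", f1_score(y_test, y_pred))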
Q. 6 - Data Prep for ML.
(https://youtu.be/P8ERBy91Y90?si=eiGIk5qLSjIJj_t4)
Ans. - Data preparation is a crucial step in any machine learning (ML) project. Here's a general
outline of the process:
1. Data Collection: Gather the data relevant to your problem from various sources, such as
databases, APIs, files, or scraping websites.
2. Data Cleaning:
Handle missing values: Decide whether to impute missing values, delete
rows/columns, or use algorithms that can handle missing data.
Handle outliers: Identify and decide how to handle outliers, whether it's removing
them, transforming them, or leaving them as is.
Remove duplicates: Eliminate duplicate records to avoid skewing the analysis.
Data format consistency: Ensure consistency in data formats (e.g., date formats,
units of measurement).
3. Feature Selection/Engineering:
Select relevant features: Choose features that are most likely to have predictive
power for your ML model.
Create new features: Generate new features that might better represent patterns in
the data.
Transform features: Convert categorical variables into numerical representations
(e.g., one-hot encoding), scale numerical features, or apply other transformations
as needed.
4. Data Splitting:
Split the data into training, validation, and test sets. The training set is used to train
the model, the validation set is used to tune hyperparameters and evaluate model
performance during training, and the test set is used to evaluate the final model
performance.
5. Normalization/Standardization: Scale the features so that they have similar ranges. This
step is important for many machine learning algorithms to converge faster and perform
better.
6. Handling Imbalanced Data: If your data is imbalanced (i.e., one class is much more
prevalent than others), consider techniques such as resampling (oversampling minority
class, undersampling majority class), or using algorithms that are robust to imbalanced
datasets.
7. Data Transformation:
If needed, apply transformations like log transformation, Box-Cox transformation,
or other techniques to make the data more normally distributed.
Consider applying dimensionality reduction techniques like Principal Component
Analysis (PCA) if dealing with high-dimensional data.
8. Data Augmentation (if applicable): For tasks like image classification or natural language
processing, you can generate additional training data by applying transformations such as
rotation, flipping, cropping (for images), or by adding noise (for text).
9. Data Pipeline Creation: Construct a pipeline that automates the entire data preparation
process, from loading the raw data to producing the preprocessed data ready for model
training.
10. Documentation: Document all the steps taken during data preparation, including any
assumptions made and any transformations applied. This documentation is crucial for
reproducibility and for understanding the decisions made during the data preparation
process.
By following these steps, you can ensure that your data is in the best possible shape for training
robust and accurate machine learning models.
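A minimal sketch of steps 2-5 with scikit-learn is shown below; the toy DataFrame, its age and city columns, and the churn target are assumptions made only for the example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age":   [25, 32, None, 45, 38, 29],
    "city":  ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Mumbai"],
    "churn": [0, 1, 0, 1, 0, 1],
})
X, y = df[["age", "city"]], df["churn"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),   # handle missing values
                      ("scale", StandardScaler())]), ["age"]),        # normalization/standardization
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),        # categorical -> numeric encoding
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)  # data splitting
X_train_prepared = preprocess.fit_transform(X_train)   # fit the transformations on training data only
X_test_prepared = preprocess.transform(X_test)         # reuse the fitted transformations on test data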
Q. 7 - ML Algorithms.
(https://youtu.be/olFxW7kdtP8?si=7iP1djKypLHA9Lu5
https://youtu.be/yN7ypxC7838?si=zd9a8bbIv2ZwGl6d)
Ans. - Machine learning (ML) algorithms are at the heart of many artificial intelligence
applications. They are essentially mathematical models that learn patterns and relationships from
data, enabling computers to make predictions or decisions without being explicitly programmed
for every task. Here's an overview of some common ML algorithms:
1. Linear Regression: A simple algorithm used for predicting a continuous value based on
one or more input features. It assumes a linear relationship between the input features and
the target variable.
2. Logistic Regression: This algorithm is used for binary classification tasks, where the target
variable has two possible outcomes. It models the probability that an instance belongs to a
particular class.
3. Decision Trees: These are tree-like structures where each internal node represents a
feature, each branch represents a decision based on that feature, and each leaf node
represents the outcome. Decision trees are versatile and can be used for classification or
regression tasks.
4. Random Forest: An ensemble learning method that builds multiple decision trees during
training and outputs the mode of the classes (classification) or the mean prediction
(regression) of the individual trees.
5. Support Vector Machines (SVM): A supervised learning algorithm that can be used for
classification or regression tasks. SVMs find the hyperplane that best separates classes in
the feature space.
6. K-Nearest Neighbors (KNN): A simple algorithm that stores all available cases and
classifies new cases based on a similarity measure (e.g., distance functions).
7. Naive Bayes: A probabilistic classifier based on Bayes' theorem with the "naive"
assumption of independence between features. It is commonly used for text classification
tasks like spam detection or sentiment analysis.
8. Neural Networks: Inspired by the structure of the human brain, neural networks consist
of interconnected layers of artificial neurons. Deep neural networks, in particular, have
achieved remarkable success in various domains such as image recognition, natural
language processing, and reinforcement learning.
9. Clustering Algorithms (e.g., K-means, Hierarchical Clustering): These algorithms
group similar data points together based on certain criteria, such as distance or similarity.
10. Dimensionality Reduction Techniques (e.g., PCA, t-SNE): These methods are used to
reduce the number of features in a dataset while preserving important information. They
are often used for visualization or to improve the efficiency of other machine learning
algorithms.
These are just a few examples, and there are many other ML algorithms and techniques available,
each with its own strengths and weaknesses, suitable for different types of data and tasks.
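As a rough illustration, several of these algorithms can be compared on the same dataset with a few lines of scikit-learn; the iris dataset and the cross-validation setup are arbitrary choices for the sketch, and the scores are only meaningful for this toy example.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree":       DecisionTreeClassifier(),
    "Random Forest":       RandomForestClassifier(n_estimators=100),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes":         GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f}")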
Q. 8 - Unsupervised Learning.
(https://youtu.be/yN7ypxC7838?si=zd9a8bbIv2ZwGl6d)
Ans. - Unsupervised learning is a type of machine learning where the model learns patterns and
structures from input data without explicit supervision or labeled responses. In unsupervised
learning, the algorithm tries to find hidden structures or intrinsic patterns in the data.
Here are some common techniques and algorithms used in unsupervised learning:
1. Clustering: Clustering algorithms group similar data points together into clusters based on
some similarity measure. Common clustering algorithms include:
K-means clustering: Partitioning the data into K clusters based on the mean of
data points.
Hierarchical clustering: Building a hierarchy of clusters by recursively merging
or splitting them.
2. Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the
number of features (dimensions) in a dataset while preserving important information. This
can help in visualizing high-dimensional data or reducing computational complexity.
Common dimensionality reduction techniques include:
Principal Component Analysis (PCA): Identifying the orthogonal directions
(principal components) that capture the most variance in the data.
t-Distributed Stochastic Neighbor Embedding (t-SNE): Mapping high-
dimensional data into a lower-dimensional space while preserving local structure.
3. Anomaly Detection: Anomaly detection algorithms identify data points that deviate from
normal behavior. This can be useful for detecting fraud, errors, or outliers in datasets.
Common anomaly detection techniques include:
Density-based methods like Local Outlier Factor (LOF) or Isolation Forest.
Statistical methods like Gaussian Mixture Models (GMM) or One-Class SVM.
4. Association Rule Learning: Association rule learning discovers interesting relationships
or associations between variables in large datasets. It is commonly used in market basket
analysis and recommendation systems. The Apriori algorithm is a popular technique for mining
association rules.
5. Generative Models: Generative models learn the underlying probability distribution of
the data and can generate new samples similar to the training data. Common generative
models include:
Variational Autoencoders (VAEs): Learning a low-dimensional latent
representation of data and generating new samples by sampling from this latent
space.
Generative Adversarial Networks (GANs): Training two neural networks, a
generator and a discriminator, in a competitive setting to generate realistic data
samples.
Unsupervised learning is particularly useful when dealing with unlabeled or unstructured data,
where it can reveal insights or patterns that may not be immediately apparent. However, evaluation
and interpretation of results in unsupervised learning can be more challenging compared to
supervised learning.
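A short sketch of two of these techniques (K-means clustering and PCA) on synthetic, unlabeled data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)  # unlabeled data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # K-means clustering
cluster_labels = kmeans.fit_predict(X)                      # cluster assignment for each point

pca = PCA(n_components=2)                                   # dimensionality reduction to 2D
X_2d = pca.fit_transform(X)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())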
1. Example - Exploring without a Map: Imagine you're in a big forest with no map. You're just
wandering around, trying to make sense of everything you see without anyone telling you
what's what. That's a bit like unsupervised learning. You have a bunch of data, but nobody
has told the computer what it means or what to look for.
2. Grouping Similar Things Together: One thing you might try to do is group similar things
together. For example, if you find a bunch of rocks, you might put them in one pile, and if
you find some sticks, you might put them in another pile. That's like clustering in
unsupervised learning. You're finding patterns in the data without being told what those
patterns are.
3. Simplifying Things: Sometimes, you might have a lot of stuff to deal with, like a huge
pile of toys. You might want to organize them so it's easier to understand. For instance,
you could put all the cars in one box, all the dolls in another, and so on. That's similar to
dimensionality reduction in unsupervised learning. You're making things simpler without
losing too much important information.
4. Spotting Weird Stuff: Imagine you're sorting through your toys and you find something
really strange, like a toy that doesn't look like anything else you have. That could be an
anomaly. In unsupervised learning, you're trying to find these strange things in your data
without anyone telling you what to look for.
5. Discovering Patterns: Sometimes, you might just be curious and want to find out if there
are any interesting connections between different things. For example, you might notice
that whenever you see dark clouds, it usually rains. That's like finding association rules in
unsupervised learning. You're discovering connections between different parts of your
data.
(So, unsupervised learning is all about exploring and finding patterns in data without any
instructions or examples provided beforehand. It's like being a detective trying to solve a mystery
using clues you find along the way!)
Q. 9 - Model Inference.
(https://youtu.be/ftd9XUykiaw?si=9ce9wGAJ6pv_sTiO)
Ans. - "Model inference" generally refers to the process of using a trained machine learning model
to make predictions or decisions based on new data. It's the phase where the model is deployed in
a real-world scenario to perform its intended task, such as classifying images, generating text,
making recommendations, or any other task it was trained for. During inference, the model takes
input data, processes it according to its learned parameters, and produces an output or prediction.
This phase typically occurs after the model has been trained on a dataset and validated for its
accuracy and performance.
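A minimal sketch of the inference phase is shown below; it assumes a model was trained earlier and saved to a hypothetical model.joblib file, and that the model expects four numeric features.

import joblib
import numpy as np

model = joblib.load("model.joblib")            # load the previously trained model (hypothetical file)
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])    # one new, unseen observation
prediction = model.predict(new_data)           # inference: apply learned parameters to new input
print("predicted class:", prediction[0])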
2. Clarity and Simplicity: Keep your communication clear, concise, and simple. Avoid using
technical jargon that might confuse non-technical stakeholders. Use plain language and
explain complex concepts in an accessible way.
3. Storytelling: Data storytelling is a powerful technique for conveying insights. Structure
your communication as a narrative, with a clear beginning, middle, and end. Use real-world
examples and anecdotes to illustrate your points and make them more engaging.
4. Visualizations: Visualizations are a powerful tool for conveying complex data in a
digestible format. Choose the right type of visualization for your data and use it to highlight
key insights. Make sure your visualizations are clear, easy to understand, and aesthetically
pleasing.
5. Contextualization: Provide context for your analysis by explaining why it matters and
how it relates to broader business goals or challenges. Help stakeholders understand the
implications of your insights and how they can use them to make informed decisions.
6. Feedback and Iteration: Be open to feedback from stakeholders and be willing to iterate
on your communication based on their input. Ask for clarification if needed and make sure
you address any questions or concerns they may have.
7. Documentation: Document your analysis and communication process to ensure
transparency and reproducibility. This can include written reports, slide decks, or
documentation of code and methodology.
8. Active Listening: Finally, communication is a two-way street. Practice active listening by
paying attention to stakeholders' questions, concerns, and feedback. This will help you
tailor your communication to their needs and ensure that your insights are effectively
understood and acted upon.
Q. 11 - Model in Production.
Ans. - Transitioning a model from development to production is a crucial step in the data science
lifecycle. Here's a breakdown of key considerations when deploying a model into production:
1. Model Evaluation and Testing: Before deploying a model into production, it's essential
to thoroughly evaluate its performance and test it under various conditions. This includes
assessing metrics such as accuracy, precision, recall, and F1-score, as well as conducting
robustness testing to ensure the model performs well in real-world scenarios.
2. Scalability and Performance: Consider the scalability and performance requirements of
your model in a production environment. Will it be able to handle large volumes of data
and requests? Does it meet latency and throughput requirements? Optimize your model
and infrastructure accordingly to ensure it can scale effectively.
3. Infrastructure and Deployment Environment: Choose an appropriate infrastructure and
deployment environment for hosting your model in production. This may include cloud
platforms such as AWS, Azure, or Google Cloud, containerization technologies like
Docker and Kubernetes, or serverless architectures. Consider factors such as cost,
scalability, reliability, and ease of maintenance when selecting your deployment
environment.
4. Model Monitoring and Maintenance: Implement monitoring and logging mechanisms to
track the performance of your model in production and detect any issues or anomalies. This
includes monitoring metrics such as prediction accuracy, latency, and resource utilization,
as well as logging inputs, outputs, and errors for debugging purposes. Establish procedures
for model retraining and maintenance to ensure it remains accurate and up-to-date over
time.
5. Data Privacy and Security: Ensure that your model deployment adheres to data privacy
and security requirements, especially if it involves handling sensitive or personally
identifiable information (PII). Implement encryption, access controls, and other security
measures to protect data both in transit and at rest. Comply with relevant regulations such
as GDPR, HIPAA, or CCPA to avoid legal and regulatory issues.
6. Versioning and Reproducibility: Establish versioning and reproducibility practices to
track changes to your model and ensure that deployments are consistent and reproducible.
Use version control systems such as Git to manage code and configuration changes, and
document dependencies, hyperparameters, and experimental results to facilitate
reproducibility.
7. Deployment Strategy: Develop a deployment strategy that minimizes downtime and
disruption to existing systems and services. Consider techniques such as blue-green
deployment, canary deployment, or rolling updates to gradually transition to the new model
while minimizing risk. Implement automated testing and rollback procedures to handle
failures and mitigate risks during deployment.
8. Collaboration and Communication: Foster collaboration and communication between
data scientists, engineers, and stakeholders throughout the deployment process. Clearly
define roles and responsibilities, establish communication channels and feedback loops,
and ensure that everyone is aligned on the goals and expectations for the model in
production.
By considering these factors and following best practices, you can effectively deploy and manage
models in production, enabling your organization to derive value from data science insights in a
reliable and scalable manner.
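One common way to put a trained model behind a production interface (point 3 above) is a small web service. The sketch below uses Flask; the framework choice, the model.joblib file, and the /predict route are illustrative assumptions rather than recommendations from the notes.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")            # load the trained model once at startup (hypothetical file)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)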
Q. 12 – Data Science Project Management
Ans. - Data science project management is essential for ensuring the success of data-driven
initiatives. Here's an overview of key principles and practices in managing data science projects:
1. Define Clear Objectives: Start by defining clear project objectives and success criteria.
What problem are you trying to solve? What are the business goals and metrics for
measuring success? Establishing clear objectives helps align stakeholders and guide project
prioritization and decision-making.
2. Structured Approach: Adopt a structured approach to project management, such as Agile
or Scrum, tailored to the unique requirements of data science projects. Break down the
project into smaller, manageable tasks or user stories, and prioritize them based on value
and feasibility. Use iterative development and regular feedback cycles to adapt to changing
requirements and insights.
3. Cross-Functional Teams: Build cross-functional teams with diverse skills and expertise,
including data scientists, engineers, domain experts, and business stakeholders. Encourage
collaboration and communication between team members from different disciplines to
leverage their respective strengths and perspectives.
4. Data Governance and Quality: Establish data governance policies and procedures to
ensure the quality, integrity, and security of data throughout the project lifecycle. Define
data standards, documentation practices, and access controls to maintain data quality and
compliance with regulatory requirements. Implement data validation and cleaning
processes to address errors, outliers, and missing values.
5. Resource Allocation and Management: Allocate resources effectively, including people,
time, and budget, to support the project's goals and objectives. Identify and mitigate
resource constraints or bottlenecks that may impact project progress or outcomes. Track
resource utilization and performance metrics to monitor project health and identify areas
for improvement.
6. Risk Management: Identify potential risks and uncertainties that may impact project
success, such as technical challenges, resource constraints, or changes in project
requirements. Develop risk mitigation strategies to address these risks proactively and
minimize their impact on project outcomes. Regularly review and update the risk register
to ensure that emerging risks are addressed promptly.
7. Communication and Stakeholder Engagement: Maintain open and transparent
communication with stakeholders throughout the project lifecycle. Provide regular updates
on project progress, milestones, and deliverables, and solicit feedback and input from
stakeholders to ensure alignment with their needs and expectations. Establish clear
channels for communication and escalation of issues or concerns.
8. Continuous Improvement: Foster a culture of continuous improvement by reflecting on
lessons learned and implementing process improvements based on feedback and
retrospective reviews. Encourage experimentation and innovation to explore new
approaches and technologies that can enhance project outcomes and deliver business value.
By applying these principles and practices, you can effectively manage data science projects and
drive successful outcomes that deliver value to your organization.
Quiz
Class 1 PPT
This class gives a comprehensive overview of data science, covering its definition, its role in business
decision-making, its applications across various industries, key statistical concepts, and common
methodologies. Here's a summarized breakdown:
1. Introduction to Data Science: Data science is the application of computer technology, statistics,
and domain knowledge to solve problems and convert raw data into knowledge for decision-
making.
2. Role in Business Decision Making: Data science plays a crucial role in extracting insights from
large datasets to support informed decision-making in areas such as customer segmentation,
manufacturing optimization, and strategic planning.
3. Applications across Industries: Data science finds applications in search engines, transportation
(e.g., driverless cars), finance (e.g., fraud detection), e-commerce (e.g., personalized
recommendations), healthcare (e.g., tumor detection), among others.
4. Statistical Concepts: The text covers various statistical concepts such as measures of central
tendency and variability, hypothesis testing, probability distributions, sampling methods, and
inferential statistics.
5. Methodologies in Data Science: Methodologies like the methodology-agnostic approach
emphasize flexibility and alignment with business needs, while breaking down business problems
involves understanding, defining, decomposing, analyzing, synthesizing, reviewing, and refining
the problem-solving process.
6. Dimensions of Data Quality: Data quality is evaluated based on dimensions like accuracy,
completeness, consistency, timeliness, validity, and uniqueness, ensuring that data is reliable and
relevant for decision-making.
Overall, the text provides insights into the fundamental concepts, applications, and methodologies
of data science, highlighting its importance in modern business and industry.
2. Which of the following is NOT an application of data science?
A) Search Engines
B) Transportation
C) Archaeology
D) Healthcare
Answer: C) Archaeology
4. Which industry uses data science to optimize logistics routes and schedules?
A) Healthcare
B) Finance
C) Manufacturing
D) Transportation
Answer: D) Transportation
6. Which measure of central tendency is best for nominal data?
A) Mode
B) Median
C) Mean
D) Standard Deviation
Answer: A) Mode
9. What type of error occurs when a guilty person is found to be not guilty?
A) Type I error
B) Type II error
C) Null error
D) Alternative error
Answer: B) Type II error
10. What is the purpose of inferential statistics?
A) To summarize data
B) To draw conclusions about a population by examining random samples
C) To create probability distributions
D) To calculate measures of central tendency
Answer: B) To draw conclusions about a population by examining random samples
11. In data science, what does the term "agile" refer to?
A) A methodology for software development
B) A breed of dogs
C) A statistical measure
D) A type of data visualization
Answer: A) A methodology for software development
12. Which industry benefits from using data science to predict customer lifetime value and stock
market moves?
A) Transportation
B) Healthcare
C) Finance
D) E-commerce
Answer: C) Finance
13. What measure of variability is the average squared difference of the values from the mean?
A) Range
B) Variance
C) Standard Deviation
D) Coefficient of Variation
Answer: B) Variance
14. Which sampling method allows you to make strong statistical inferences about the whole
group?
A) Probability sampling
B) Non-probability sampling
C) Convenience sampling
D) Judgment sampling
Answer: A) Probability sampling
16. Which dimension of data quality refers to the availability of information when needed?
A) Accuracy
B) Completeness
C) Consistency
D) Timeliness
Answer: D) Timeliness
17. What does the term "agnostic approach" mean in data science?
A) Being indifferent to technology, models, methodologies, or data
B) Being overly attached to specific technologies
C) Being unaware of statistical methods
D) Being skeptical about data analysis
Answer: A) Being indifferent to technology, models, methodologies, or data
18. What does a measure of central tendency indicate?
A) The spread of data values
B) The middle or typical value in a dataset
C) The likelihood of future events
D) The average difference from the mean
Answer: B) The middle or typical value in a dataset
19. Which statistical method compares events and explanations to validate them?
A) Descriptive statistics
B) Inferential statistics
C) Hypothesis testing
D) Probability distributions
Answer: C) Hypothesis testing
20. What type of error occurs when a researcher finds an effect that doesn't actually exist?
A) Type I error
B) Type II error
C) Null error
D) Alternative error
Answer: A) Type I error
Class 2 PPT
The data covers various aspects related to data analysis, statistics, hypothesis testing, and machine
learning. It begins by discussing the types of data, including numeric, qualitative (nominal and
ordinal), date-time data, and unstructured data.
It then delves into data summary components, such as measures of central tendency (mode, mean,
median) and variability (range, quartiles, inter-quartile range, MAD, standard deviation, and
variance), along with the concept of outliers and distributions (discrete and continuous).
The data further explores specific distributions like uniform, Bernoulli, binomial, and Poisson
distributions, providing insights into their characteristics and applications. It also touches upon the
pitfalls of data summaries, emphasizing the importance of understanding the nuances of data
representation.
Additionally, the data covers visualization methods for summarizing data, including histograms,
kernel density curves, box plots, and correlation coefficients. It discusses techniques for
summarizing categorical data, visualization of various types of data, and the introduction to
machine learning, distinguishing between supervised (regression and classification) and
unsupervised learning problems.
Furthermore, it outlines regression theory, classification theory, hypothesis testing frameworks,
types of errors, and performance metrics for evaluating models. The summary concludes with an
overview of predictive modeling and model performance evaluation in both regression and
classification scenarios.
Overall, the data provides a comprehensive overview of fundamental concepts and techniques
essential for understanding and analyzing data, hypothesis testing, and building predictive models
in various domains.
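For a quick feel of two of the distributions mentioned above, SciPy's probability mass functions can be evaluated directly; the parameter values below are arbitrary examples.

from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
print("Binomial P(X = 3):", stats.binom.pmf(3, n=10, p=0.5))

# Poisson: probability of exactly 2 events when the average rate is 4 per interval
print("Poisson  P(X = 2):", stats.poisson.pmf(2, mu=4))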
2. Which measure of central tendency is more suitable when extreme values are present in the
data?
A) Mode
B) Mean
C) Median
D) Range
Answer: C) Median
Answer: C) Uniform Distribution
9. Which branch of artificial intelligence focuses on the use of data algorithms to imitate human
learning?
A) Deep Learning
B) Natural Language Processing
C) Machine Learning
D) Robotics
Answer: C) Machine Learning
10. What are the two main types of supervised learning problems?
A) Regression and Classification
B) Regression and Unsupervised
C) Classification and Clustering
D) Clustering and Dimensionality Reduction
Answer: A) Regression and Classification
12. What is the range of values for probability outcomes in classification problems?
A) -1 to 1
B) 0 to 1
C) 1 to ∞
D) -∞ to +∞
Answer: B) 0 to 1
13. Which function is used to transform generic combinations of inputs in classification problems?
A) Sigmoid function
B) Linear function
C) Exponential function
D) Logarithmic function
Answer: A) Sigmoid function
16. Which type of distribution is observed when the probability of success in an individual trial is
small and unknown?
A) Binomial Distribution
B) Poisson Distribution
C) Uniform Distribution
D) Normal Distribution
Answer: B) Poisson Distribution
Answer: C) Frequency bar chart
20. In machine learning, what problem type involves finding general patterns hidden in measured
features?
A) Regression
B) Classification
C) Unsupervised
D) Reinforcement Learning
Answer: C) Unsupervised
Class 3 PPT
The provided data offers an overview of key concepts in machine learning, particularly focusing
on supervised and unsupervised learning, as well as regression and classification problems. It
outlines the fundamental principles behind these concepts, including:
1. Machine Learning Overview: Machine learning is a branch of artificial intelligence (AI)
and computer science focused on using data algorithms to mimic human learning, with a
gradual improvement in accuracy over time.
2. Supervised vs. Unsupervised Learning: Supervised learning involves training a model
using labeled data with explicit outcomes, while unsupervised learning deals with
unlabeled data and aims to find hidden patterns.
3. Regression Problems: Regression problems entail predicting continuous numeric
outcomes, such as sales or temperature, using input features. The performance of regression
models is evaluated using metrics like Mean Squared Error (MSE) and R-squared.
4. Classification Problems: Classification problems involve predicting categorical
outcomes, such as yes/no or different classes, based on input features. Evaluation metrics
for classification models include accuracy, precision, recall, F1-score, and AUC-ROC.
5. Model Evaluation: Model evaluation is essential in both regression and classification
problems to assess the performance of trained models on unseen data. Common evaluation
metrics help understand the model's strengths and weaknesses.
6. Overfitting: Overfitting occurs when a model learns the training data too well, capturing
noise and irrelevant patterns. Remedies for overfitting include simplifying the model,
regularization, cross-validation, and feature selection.
7. Trade-offs in Classification Models: Sensitivity and specificity are crucial metrics in
binary classification models, with a trade-off between them. The choice between
maximizing sensitivity or specificity depends on the specific requirements of the problem
domain.
Overall, the data provides a comprehensive understanding of machine learning fundamentals,
including problem categorization, model evaluation, and strategies to mitigate common challenges
like overfitting.
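The evaluation metrics named in points 3 and 4 can be computed with scikit-learn; the actual and predicted values below are made up solely to show the calls.

from sklearn.metrics import (mean_squared_error, r2_score,
                             accuracy_score, precision_score,
                             recall_score, f1_score)

# Regression: compare actual vs. predicted continuous values
y_true_reg = [3.0, 5.0, 7.5, 10.0]
y_pred_reg = [2.8, 5.4, 7.0, 9.5]
print("MSE:      ", mean_squared_error(y_true_reg, y_pred_reg))
print("R-squared:", r2_score(y_true_reg, y_pred_reg))

# Classification: compare actual vs. predicted class labels
y_true_cls = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_cls = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy: ", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:   ", recall_score(y_true_cls, y_pred_cls))
print("F1 score: ", f1_score(y_true_cls, y_pred_cls))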
Answer: b) Imitating the way humans learn
4. Which function is used to represent the relationship between input factors and outcome in
regression?
a) f(X)
b) P(y)
c) Sigmoid
d) Logit
Answer: a) f(X)
6. How do regression and classification problems differ in terms of their outcome types?
a) Regression problems have categorical outcomes, while classification problems have
continuous numeric outcomes.
b) Regression problems have continuous numeric outcomes, while classification problems have
categorical outcomes.
c) Regression problems have only two possible outcomes, while classification problems have
multiple possible outcomes.
d) Regression problems have multiple possible outcomes, while classification problems have
only two possible outcomes.
Answer: b) Regression problems have continuous numeric outcomes, while classification
problems have categorical outcomes.
a) Sensitivity measures the ability to correctly identify negative cases, while specificity measures
the ability to correctly identify positive cases.
b) Sensitivity measures the ability to correctly identify positive cases, while specificity measures
the ability to correctly identify negative cases.
c) Sensitivity measures the overall accuracy of the model, while specificity measures the
precision of the model.
d) Sensitivity measures the precision of the model, while specificity measures the recall of the
model.
Answer: b) Sensitivity measures the ability to correctly identify positive cases, while specificity
measures the ability to correctly identify negative cases.
Class 4 PPT
The provided data covers various aspects of machine learning algorithms, including Gradient
Boosting Machines (GBMs), XGBoost, CatBoost, unsupervised learning, similarity measures,
clustering methods, and dimensionality reduction techniques like PCA, t-SNE, and UMAP. Here's
a summary of the key points:
1. Gradient Boosting Machines (GBMs):
- GBMs are ensemble learning algorithms used for regression and classification tasks.
- Parameters include `n_estimators`, `learning_rate`, `max_depth`, `subsample`, and
`max_features`.
- They build decision trees sequentially and combine them to improve prediction accuracy.
2. XGBoost:
- XGBoost is a variant of GBMs with altered cost functions, making it slower but more targeted.
- It penalizes model complexity and individual tree contributions to ensure better overall
objective alignment.
3. CatBoost:
- CatBoost is a boosting algorithm designed for categorical features, published in 2017.
- It replaces categorical values with out-of-sample response rates to avoid data explosion and
runs fast due to forced balanced tree structures.
4. Unsupervised Learning:
- Unsupervised learning deals with unlabeled data, allowing algorithms to discover patterns and
insights without explicit guidance.
- Similarity measures like Euclidean distance, cosine similarity, Jaccard distance, and Gower’s
distance are used to quantify relationships between data points.
5. Clustering Methods:
- Clustering algorithms group similar data points together.
- Methods include K-means, DBSCAN, agglomerative clustering, and PAMk.
- Linkage methods like single, complete, average, centroid, and Ward's method determine
proximity between clusters.
6. Dimensionality Reduction:
- Techniques like PCA, t-SNE, and UMAP reduce high-dimensional data to lower dimensions
for visualization and analysis.
- PCA preserves variance while reducing dimensions, while t-SNE and UMAP focus on
maintaining local relationships in the data.
Overall, the data provides insights into various machine learning techniques used for modeling,
clustering, and dimensionality reduction, each with its specific strengths and applications. These
methods play crucial roles in analyzing and extracting valuable information from complex
datasets.
1. Which algorithm uses a slightly altered version of the original cost functions used by
traditional GBMs?
- A) XGBoost
- B) CatBoost
- C) Unsupervised Learning
- D) K-means
- Answer: A) XGBoost
2. What parameter determines the maximum depth of individual tree models in Gradient
Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_depth
- D) subsample
- Answer: C) max_depth
3. Which distance metric is commonly used for measuring similarity between two points in
unsupervised learning?
- A) Euclidean distance
- B) Manhattan distance
- C) Cosine similarity
- D) Jaccard distance
- Answer: A) Euclidean distance
4. What is the primary purpose of Principal Component Analysis (PCA)?
- A) Reducing the number of features while preserving variance
- B) Clustering data points into groups
- C) Visualizing high-dimensional data in 2D or 3D
- D) Identifying outliers in the dataset
- Answer: A) Reducing the number of features while preserving variance
5. Which clustering algorithm is susceptible to finding nonsensical clusters if the shape of the
data groups is not spherical?
- A) K-means
- B) PAMk
- C) DBSCAN
- D) Agglomerative clustering
- Answer: A) K-means
7. Which linkage method calculates the proximity between two clusters based on the average
distance of all objects in one cluster to all objects in the other cluster?
- A) Single linkage
- B) Complete linkage
- C) Average linkage
- D) Centroid linkage
- Answer: C) Average linkage
8. In t-SNE, what does the "t" stand for?
- A) Triangulated
- B) Transformed
- C) T-distribution
- D) Tapered
- Answer: C) T-distribution
10. Which algorithm is specifically designed to work with categorical features and is known
for its reliability and speed?
- A) XGBoost
- B) CatBoost
- C) K-means
- D) DBSCAN
- Answer: B) CatBoost
11. What parameter determines the number of features to consider when looking for the best
split in Gradient Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_features
- D) max_depth
- Answer: C) max_features
12. Which similarity measure is commonly used for binary data and calculates the similarity
based on the number of common attributes?
- A) Euclidean distance
- B) Cosine similarity
- C) Jaccard distance
- D) Gower’s distance
- Answer: C) Jaccard distance
14. Which clustering method starts by assigning each object to its own cluster and then
iteratively merges clusters based on their similarity?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: C) Agglomerative clustering
- D) It is insensitive to the scale of the data
- Answer: C) It is computationally expensive for large datasets
18. What parameter in t-SNE controls the number of iterations during optimization?
- A) Learning rate
- B) Perplexity
- C) Number of neighbors
- D) Number of iterations
- Answer: D) Number of iterations
19. Which algorithm replaces categorical values with out-of-sample response rates to avoid
data explosion?
- A) K-means
- B) DBSCAN
- C) CatBoost
- D) PAMk
- Answer: C) CatBoost
21. Which method calculates the proximity between two clusters by considering the centroids
of the clusters?
- A) Single linkage
- B) Complete linkage
- C) Centroid linkage
- D) Ward's method
- Answer: C) Centroid linkage
22. Which algorithm starts by randomly selecting K points as initial cluster centers and
iteratively assigns each point to the nearest cluster center?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: A) K-means
23. Which similarity measure is appropriate for measuring the similarity between two binary
vectors?
- A) Euclidean distance
- B) Cosine similarity
- C) Jaccard distance
- D) Gower’s distance
- Answer: C) Jaccard distance
25. Which algorithm is prone to the problem of finding nonsensical clusters when the number
of clusters (K) is not chosen carefully?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: A) K-means
29. Which algorithm is specifically designed to work with categorical features and is known
for its reliability and speed?
- A) XGBoost
- B) CatBoost
- C) K-means
- D) DBSCAN
- Answer: B) CatBoost
30. What parameter determines the maximum depth of individual tree models in Gradient
Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_depth
- D) subsample
- Answer: C) max_depth