Analytics Notes
Business Analytics for Managers
Q. 1. - Introduction to Data Science.
(https://youtu.be/X3paOmcrTjQ?si=n_5a9fs1nZVrq6Qk)
Ans. - Data science is like being a detective, but instead of solving crimes, you're solving puzzles
using data. It's all about collecting, analyzing, and interpreting information to uncover insights and
make better decisions.
Imagine you have a big pile of puzzle pieces (data) scattered around. Each piece holds a tiny bit
of information. Data science helps you organize and understand these pieces so you can see the
bigger picture.
Here's how it works:
1. Collecting Data: First, you gather data from various sources like sensors, websites, or
surveys. This could be anything from sales numbers to social media posts.
2. Cleaning and Preparing Data: Data is often messy and incomplete, like puzzle pieces
with smudges or missing edges. So, you clean and organize the data, making sure it's
accurate and ready for analysis.
3. Exploring Data: Now comes the fun part! You start piecing together the puzzle by
exploring the data. You look for patterns, trends, or anomalies that could reveal interesting
insights.
4. Analyzing Data: Once you've got a good grasp of the data, you use statistical techniques
or machine learning algorithms to dig deeper. This helps you uncover hidden relationships
or predict future outcomes.
5. Visualizing Results: To make your findings easy to understand, you create visualizations
like charts or graphs. These help you communicate your insights to others effectively.
6. Making Decisions: Finally, armed with your insights, you can make informed decisions.
Whether it's optimizing business processes, improving healthcare outcomes, or predicting
customer behavior, data science empowers you to make smarter choices.
In essence, data science is about turning raw data into actionable insights. It's a powerful tool that's
revolutionizing industries and driving innovation across the board. And as more and more data
becomes available, the opportunities for discovery and impact continue to grow.
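For illustration, the whole cycle above can be sketched in a few lines of Python with pandas. This is only a toy example: the sales.csv file and its date, region, and revenue columns are assumptions, not part of the notes.

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])    # 1. collect the raw data (hypothetical file)
df = df.dropna().drop_duplicates()                     # 2. clean: drop missing rows and duplicates
print(df.describe())                                   # 3. explore: basic summary statistics
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()  # 4. analyze: monthly revenue
monthly.plot(kind="line", title="Monthly revenue")     # 5. visualize (needs matplotlib installed)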
1. Exploring Data: Once you've identified a problem, you gather relevant data. This might
include things like sales records, patient information, or social media activity.
2. Cleaning and Preparing Data: Data often needs cleaning and organizing before it can be
used effectively. You make sure it's accurate and in a format that data science tools can
understand.
3. Analyzing Data: With the data ready, you use techniques like statistics or machine
learning to find patterns or make predictions. For example, you might analyze customer
buying habits to identify trends or detect fraudulent activity.
4. Applying Insights: Once you've uncovered insights from the data, you apply them to solve
the original problem. This could involve making business decisions, improving processes,
or creating new products or services.
5. Evaluating Results: Finally, you assess the impact of your data science solution. Did it
solve the problem effectively? Did it provide valuable insights? If not, you may need to
refine your approach or gather more data.
Throughout this process, it's important to collaborate with experts in the field to ensure that your
data science solution addresses the real needs and challenges of the problem you're trying to solve.
By leveraging the power of data, you can uncover valuable insights and drive positive change in a
wide range of domains.
Q. 3 - Data Summaries.
(https://youtu.be/f-7fmwZ81H4?si=_3Nuof-6s4SDA3ve)
Ans. - Data summaries are like short stories that capture the key points and insights hidden within
a dataset. Here's a simplified explanation:
1. What are Data Summaries?: Data summaries are brief descriptions or presentations of
the main characteristics or trends found in a dataset. They help to condense large amounts
of data into manageable and understandable information.
2. Types of Data Summaries:
Descriptive Statistics: These provide basic information about the dataset, such as
measures of central tendency (like mean or median) and variability (like standard
deviation or range).
Visualizations: Charts, graphs, or diagrams that visually represent the data, making
it easier to identify patterns or trends.
Summarized Reports: Written summaries that highlight the most important
findings and insights from the data analysis.
3. Why Data Summaries Matter:
Clarity and Understanding: Data summaries make complex data easier to
comprehend by presenting it in a clear and concise manner.
Decision-Making: They help decision-makers quickly grasp the main points of the
data, enabling them to make informed decisions.
Communication: Summaries facilitate communication among stakeholders by
providing a common understanding of the data and its implications.
4. Examples of Data Summaries:
A bar chart showing the distribution of customer ages in a sales dataset.
A written summary highlighting the average monthly sales revenue and the top-
selling products.
A pie chart illustrating the market share of different smartphone brands based on
sales data.
In essence, data summaries distill the essence of a dataset into bite-sized pieces of information,
making it easier for stakeholders to understand and act upon. They play a crucial role in data
analysis and decision-making processes across various fields and industries.
OR
(https://youtu.be/f-7fmwZ81H4?si=DjcFUKUFiX2QzsPO)
Ans. - Data summaries involve condensing large amounts of data into concise and informative
descriptions. They can take various forms depending on the context and the type of data being
summarized. Here are some common types of data summaries:
1. Descriptive Statistics: These include measures such as mean, median, mode, range,
variance, and standard deviation, which provide a basic overview of the data's central
tendency, dispersion, and distribution.
2. Frequency Distributions: These summarize how often each value occurs within a dataset,
often displayed in tables or charts.
3. Histograms and Bar Charts: These graphical representations show the frequency
distribution of numerical data or categorical data, respectively.
4. Box Plots: These graphical summaries display the distribution of a dataset along with key
summary statistics such as quartiles, median, and outliers.
5. Summary Tables: These tables can provide an overview of multiple variables or aspects
of the data, often organized in a tabular format for easy comparison.
6. Pivot Tables: Particularly useful for summarizing large datasets, pivot tables allow users
to dynamically reorganize and summarize data according to different variables and metrics.
7. Regression Analysis Results: Summarizing regression analysis involves presenting key
statistics such as coefficients, p-values, R-squared values, and other diagnostic measures
that assess the relationship between variables.
8. Time Series Summaries: For time-series data, summaries may include trends, seasonal
patterns, cyclicality, and irregular fluctuations over time.
9. Text Summaries: Summarizing textual data may involve techniques such as keyword
extraction, sentiment analysis, or topic modeling to distill key themes or insights from large
bodies of text.
10. Cluster Analysis Summaries: In the case of clustering algorithms, summaries may
involve describing the characteristics of each cluster and the overall patterns present in the
data.
These summaries can be tailored to suit specific analytical goals and the needs of the audience,
whether they are stakeholders, decision-makers, or other data analysts.
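As a small illustration, several of these summary types can be produced directly with pandas; the tiny transactions table below is made up purely for the example.

import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South", "East"],
    "product": ["A", "A", "B", "B", "A", "B"],
    "revenue": [120, 90, 150, 80, 110, 95],
})

print(df["revenue"].describe())              # descriptive statistics: mean, std, quartiles, etc.
print(df["region"].value_counts())           # frequency distribution of a categorical column
print(pd.pivot_table(df, values="revenue",   # pivot / summary table: revenue by region and product
                     index="region", columns="product", aggfunc="sum"))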
5. Determine the Critical Region:
Based on the chosen significance level (α) and the distribution of the test statistic
under the null hypothesis, determine the critical region(s) - the range of values that,
if observed, would lead to rejecting the null hypothesis.
6. Make a Decision:
Compare the calculated test statistic to the critical region:
If the test statistic falls within the critical region, reject the null hypothesis.
If the test statistic does not fall within the critical region, fail to reject the
null hypothesis.
7. Draw Conclusions:
If the null hypothesis is rejected, conclude that there is sufficient evidence to
support the alternative hypothesis.
If the null hypothesis is not rejected, conclude that there is not enough evidence to
support the alternative hypothesis.
8. Interpret Results:
Consider the practical significance of the findings and the implications for the
research question or problem being investigated.
It's important to note that hypothesis testing does not prove anything definitively; rather, it provides
evidence for or against a particular hypothesis based on the sample data. Additionally, hypothesis
testing relies on assumptions about the data and the sampling process, which should be carefully
considered when interpreting the results.
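To make steps 5-7 concrete, here is a minimal sketch using a two-sample t-test from SciPy. The two groups of numbers are invented for illustration, and the decision is made through the p-value, which is equivalent to checking whether the test statistic falls in the critical region.

from scipy import stats

group_a = [23, 25, 28, 22, 26, 27, 24]
group_b = [30, 29, 33, 31, 28, 32, 30]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # test statistic and p-value
alpha = 0.05                                          # chosen significance level

if p_value < alpha:    # the test statistic falls inside the critical region
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")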
Features: Features are individual measurable properties or characteristics of the data. In a
dataset of houses, for example, features might include the number of bedrooms, square
footage, and location. The selection and engineering of relevant features play a significant
role in the effectiveness of a machine learning model.
Labels: In supervised learning, a subset of machine learning, each data point is associated
with a label or outcome that the model aims to predict. For instance, in a dataset of emails,
the labels might indicate whether each email is spam or not spam.
Models: Machine learning models are mathematical representations of patterns and
relationships within the data. These models are trained using algorithms that adjust their
internal parameters based on the input data to minimize errors or discrepancies between
predicted outcomes and actual outcomes.
Training: During the training phase, the model is exposed to a labeled dataset, and its
parameters are adjusted iteratively to reduce prediction errors. The goal is to generalize
well to new, unseen data.
Testing and Evaluation: After training, the model is evaluated using a separate dataset,
called the test set, to assess its performance on unseen data. Metrics such as accuracy,
precision, recall, and F1 score are commonly used to evaluate the model's performance.
Types of Machine Learning:
Supervised Learning: The model learns from labeled data, making predictions or
decisions based on input-output pairs.
Unsupervised Learning: The model learns patterns and structures from unlabeled
data, identifying hidden relationships or clusters.
Reinforcement Learning: The model learns to make decisions by interacting with
an environment, receiving feedback in the form of rewards or penalties.
Semi-Supervised Learning: Combines elements of supervised and unsupervised
learning, typically using a small amount of labeled data and a larger amount of
unlabeled data.
Machine learning has diverse applications across various domains, including image and speech
recognition, natural language processing, recommendation systems, healthcare, finance, and
autonomous vehicles. As the amount of available data continues to grow and computational
capabilities improve, machine learning is poised to play an increasingly central role in technology
and society.
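The ideas of features, labels, training, and testing can be seen in a short scikit-learn sketch; the built-in breast cancer dataset and the logistic regression model are just convenient choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)           # X = features, y = labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)            # the model
model.fit(X_train, y_train)                          # training on labeled data
y_pred = model.predict(X_test)                       # testing on unseen data

print("accuracy:", accuracy_score(y_test, y_pred))   # evaluation metrics
print("F1 score:", f1_score(y_test, y_pred))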
Q. 6 - Data Prep for ML.
(https://youtu.be/P8ERBy91Y90?si=eiGIk5qLSjIJj_t4)
Ans. - Data preparation is a crucial step in any machine learning (ML) project. Here's a general
outline of the process:
1. Data Collection: Gather the data relevant to your problem from various sources, such as
databases, APIs, files, or scraping websites.
2. Data Cleaning:
Handle missing values: Decide whether to impute missing values, delete
rows/columns, or use algorithms that can handle missing data.
Handle outliers: Identify and decide how to handle outliers, whether it's removing
them, transforming them, or leaving them as is.
Remove duplicates: Eliminate duplicate records to avoid skewing the analysis.
Data format consistency: Ensure consistency in data formats (e.g., date formats,
units of measurement).
3. Feature Selection/Engineering:
Select relevant features: Choose features that are most likely to have predictive
power for your ML model.
Create new features: Generate new features that might better represent patterns in
the data.
Transform features: Convert categorical variables into numerical representations
(e.g., one-hot encoding), scale numerical features, or apply other transformations
as needed.
4. Data Splitting:
Split the data into training, validation, and test sets. The training set is used to train
the model, the validation set is used to tune hyperparameters and evaluate model
performance during training, and the test set is used to evaluate the final model
performance.
5. Normalization/Standardization: Scale the features so that they have similar ranges. This
step is important for many machine learning algorithms to converge faster and perform
better.
6. Handling Imbalanced Data: If your data is imbalanced (i.e., one class is much more
prevalent than others), consider techniques such as resampling (oversampling minority
class, undersampling majority class), or using algorithms that are robust to imbalanced
datasets.
7. Data Transformation:
If needed, apply transformations like log transformation, Box-Cox transformation,
or other techniques to make the data more normally distributed.
Consider applying dimensionality reduction techniques like Principal Component
Analysis (PCA) if dealing with high-dimensional data.
8. Data Augmentation (if applicable): For tasks like image classification or natural language
processing, you can generate additional training data by applying transformations such as
rotation, flipping, cropping (for images), or by adding noise (for text).
9. Data Pipeline Creation: Construct a pipeline that automates the entire data preparation
process, from loading the raw data to producing the preprocessed data ready for model
training.
10. Documentation: Document all the steps taken during data preparation, including any
assumptions made and any transformations applied. This documentation is crucial for
reproducibility and for understanding the decisions made during the data preparation
process.
By following these steps, you can ensure that your data is in the best possible shape for training
robust and accurate machine learning models.
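A minimal sketch of steps 2-5 with scikit-learn is shown below; the toy DataFrame, its age and city columns, and the churn target are assumptions made only for the example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age":   [25, 32, None, 45, 38, 29],
    "city":  ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Mumbai"],
    "churn": [0, 1, 0, 1, 0, 1],
})
X, y = df[["age", "city"]], df["churn"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),   # handle missing values
                      ("scale", StandardScaler())]), ["age"]),        # normalization/standardization
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),        # categorical -> numeric encoding
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)  # data splitting
X_train_prepared = preprocess.fit_transform(X_train)   # fit the transformations on training data only
X_test_prepared = preprocess.transform(X_test)         # reuse the fitted transformations on test data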
Q. 7 - ML Algorithms.
(https://youtu.be/olFxW7kdtP8?si=7iP1djKypLHA9Lu5
https://youtu.be/yN7ypxC7838?si=zd9a8bbIv2ZwGl6d)
Ans. - Machine learning (ML) algorithms are at the heart of many artificial intelligence
applications. They are essentially mathematical models that learn patterns and relationships from
data, enabling computers to make predictions or decisions without being explicitly programmed
for every task. Here's an overview of some common ML algorithms:
1. Linear Regression: A simple algorithm used for predicting a continuous value based on
one or more input features. It assumes a linear relationship between the input features and
the target variable.
2. Logistic Regression: This algorithm is used for binary classification tasks, where the target
variable has two possible outcomes. It models the probability that an instance belongs to a
particular class.
3. Decision Trees: These are tree-like structures where each internal node represents a
feature, each branch represents a decision based on that feature, and each leaf node
represents the outcome. Decision trees are versatile and can be used for classification or
regression tasks.
4. Random Forest: An ensemble learning method that builds multiple decision trees during
training and outputs the mode of the classes (classification) or the mean prediction
(regression) of the individual trees.
5. Support Vector Machines (SVM): A supervised learning algorithm that can be used for
classification or regression tasks. SVMs find the hyperplane that best separates classes in
the feature space.
6. K-Nearest Neighbors (KNN): A simple algorithm that stores all available cases and
classifies new cases based on a similarity measure (e.g., distance functions).
7. Naive Bayes: A probabilistic classifier based on Bayes' theorem with the "naive"
assumption of independence between features. It is commonly used for text classification
tasks like spam detection or sentiment analysis.
8. Neural Networks: Inspired by the structure of the human brain, neural networks consist
of interconnected layers of artificial neurons. Deep neural networks, in particular, have
achieved remarkable success in various domains such as image recognition, natural
language processing, and reinforcement learning.
9. Clustering Algorithms (e.g., K-means, Hierarchical Clustering): These algorithms
group similar data points together based on certain criteria, such as distance or similarity.
10. Dimensionality Reduction Techniques (e.g., PCA, t-SNE): These methods are used to
reduce the number of features in a dataset while preserving important information. They
are often used for visualization or to improve the efficiency of other machine learning
algorithms.
These are just a few examples, and there are many other ML algorithms and techniques available,
each with its own strengths and weaknesses, suitable for different types of data and tasks.
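As a rough illustration, several of these algorithms can be compared on the same dataset with a few lines of scikit-learn; the iris dataset and the cross-validation setup are arbitrary choices for the sketch, and the scores are only meaningful for this toy example.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree":       DecisionTreeClassifier(),
    "Random Forest":       RandomForestClassifier(n_estimators=100),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes":         GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: {scores.mean():.3f}")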
Q. 8 - Unsupervised Learning.
(https://youtu.be/yN7ypxC7838?si=zd9a8bbIv2ZwGl6d)
Ans. - Unsupervised learning is a type of machine learning where the model learns patterns and
structures from input data without explicit supervision or labeled responses. In unsupervised
learning, the algorithm tries to find hidden structures or intrinsic patterns in the data.
Here are some common techniques and algorithms used in unsupervised learning:
1. Clustering: Clustering algorithms group similar data points together into clusters based on
some similarity measure. Common clustering algorithms include:
K-means clustering: Partitioning the data into K clusters based on the mean of
data points.
Hierarchical clustering: Building a hierarchy of clusters by recursively merging
or splitting them.
2. Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the
number of features (dimensions) in a dataset while preserving important information. This
can help in visualizing high-dimensional data or reducing computational complexity.
Common dimensionality reduction techniques include:
Principal Component Analysis (PCA): Identifying the orthogonal directions
(principal components) that capture the most variance in the data.
t-Distributed Stochastic Neighbor Embedding (t-SNE): Mapping high-
dimensional data into a lower-dimensional space while preserving local structure.
3. Anomaly Detection: Anomaly detection algorithms identify data points that deviate from
normal behavior. This can be useful for detecting fraud, errors, or outliers in datasets.
Common anomaly detection techniques include:
Density-based methods like Local Outlier Factor (LOF) or Isolation Forest.
Statistical methods like Gaussian Mixture Models (GMM) or One-Class SVM.
4. Association Rule Learning: Association rule learning discovers interesting relationships
or associations between variables in large datasets. It is commonly used in market basket
analysis and recommendation systems. The Apriori algorithm is a popular technique for mining
association rules.
5. Generative Models: Generative models learn the underlying probability distribution of
the data and can generate new samples similar to the training data. Common generative
models include:
Variational Autoencoders (VAEs): Learning a low-dimensional latent
representation of data and generating new samples by sampling from this latent
space.
Generative Adversarial Networks (GANs): Training two neural networks, a
generator and a discriminator, in a competitive setting to generate realistic data
samples.
Unsupervised learning is particularly useful when dealing with unlabeled or unstructured data,
where it can reveal insights or patterns that may not be immediately apparent. However, evaluation
and interpretation of results in unsupervised learning can be more challenging compared to
supervised learning.
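A short sketch of two of these techniques (K-means clustering and PCA) on synthetic, unlabeled data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=42)  # unlabeled data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # K-means clustering
cluster_labels = kmeans.fit_predict(X)                      # cluster assignment for each point

pca = PCA(n_components=2)                                   # dimensionality reduction to 2D
X_2d = pca.fit_transform(X)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())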
1. Example - Exploring without a Map: Imagine you're in a big forest with no map. You're just
wandering around, trying to make sense of everything you see without anyone telling you
what's what. That's a bit like unsupervised learning. You have a bunch of data, but nobody
has told the computer what it means or what to look for.
2. Grouping Similar Things Together: One thing you might try to do is group similar things
together. For example, if you find a bunch of rocks, you might put them in one pile, and if
you find some sticks, you might put them in another pile. That's like clustering in
unsupervised learning. You're finding patterns in the data without being told what those
patterns are.
3. Simplifying Things: Sometimes, you might have a lot of stuff to deal with, like a huge
pile of toys. You might want to organize them so it's easier to understand. For instance,
you could put all the cars in one box, all the dolls in another, and so on. That's similar to
dimensionality reduction in unsupervised learning. You're making things simpler without
losing too much important information.
4. Spotting Weird Stuff: Imagine you're sorting through your toys and you find something
really strange, like a toy that doesn't look like anything else you have. That could be an
anomaly. In unsupervised learning, you're trying to find these strange things in your data
without anyone telling you what to look for.
5. Discovering Patterns: Sometimes, you might just be curious and want to find out if there
are any interesting connections between different things. For example, you might notice
that whenever you see dark clouds, it usually rains. That's like finding association rules in
unsupervised learning. You're discovering connections between different parts of your
data.
(So, unsupervised learning is all about exploring and finding patterns in data without any
instructions or examples provided beforehand. It's like being a detective trying to solve a mystery
using clues you find along the way!)
Q. 9 - Model Inference.
(https://youtu.be/ftd9XUykiaw?si=9ce9wGAJ6pv_sTiO)
Ans. - "Model inference" generally refers to the process of using a trained machine learning model
to make predictions or decisions based on new data. It's the phase where the model is deployed in
a real-world scenario to perform its intended task, such as classifying images, generating text,
making recommendations, or any other task it was trained for. During inference, the model takes
input data, processes it according to its learned parameters, and produces an output or prediction.
This phase typically occurs after the model has been trained on a dataset and validated for its
accuracy and performance.
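A minimal sketch of the inference phase is shown below; it assumes a model was trained earlier and saved to a hypothetical model.joblib file, and that the model expects four numeric features.

import joblib
import numpy as np

model = joblib.load("model.joblib")            # load the previously trained model (hypothetical file)
new_data = np.array([[5.1, 3.5, 1.4, 0.2]])    # one new, unseen observation
prediction = model.predict(new_data)           # inference: apply learned parameters to new input
print("predicted class:", prediction[0])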
2. Clarity and Simplicity: Keep your communication clear, concise, and simple. Avoid using
technical jargon that might confuse non-technical stakeholders. Use plain language and
explain complex concepts in an accessible way.
3. Storytelling: Data storytelling is a powerful technique for conveying insights. Structure
your communication as a narrative, with a clear beginning, middle, and end. Use real-world
examples and anecdotes to illustrate your points and make them more engaging.
4. Visualizations: Visualizations are a powerful tool for conveying complex data in a
digestible format. Choose the right type of visualization for your data and use it to highlight
key insights. Make sure your visualizations are clear, easy to understand, and aesthetically
pleasing.
5. Contextualization: Provide context for your analysis by explaining why it matters and
how it relates to broader business goals or challenges. Help stakeholders understand the
implications of your insights and how they can use them to make informed decisions.
6. Feedback and Iteration: Be open to feedback from stakeholders and be willing to iterate
on your communication based on their input. Ask for clarification if needed and make sure
you address any questions or concerns they may have.
7. Documentation: Document your analysis and communication process to ensure
transparency and reproducibility. This can include written reports, slide decks, or
documentation of code and methodology.
8. Active Listening: Finally, communication is a two-way street. Practice active listening by
paying attention to stakeholders' questions, concerns, and feedback. This will help you
tailor your communication to their needs and ensure that your insights are effectively
understood and acted upon.
Q. 11 - Model in Production.
Ans. - Transitioning a model from development to production is a crucial step in the data science
lifecycle. Here's a breakdown of key considerations when deploying a model into production:
1. Model Evaluation and Testing: Before deploying a model into production, it's essential
to thoroughly evaluate its performance and test it under various conditions. This includes
assessing metrics such as accuracy, precision, recall, and F1-score, as well as conducting
robustness testing to ensure the model performs well in real-world scenarios.
2. Scalability and Performance: Consider the scalability and performance requirements of
your model in a production environment. Will it be able to handle large volumes of data
and requests? Does it meet latency and throughput requirements? Optimize your model
and infrastructure accordingly to ensure it can scale effectively.
3. Infrastructure and Deployment Environment: Choose an appropriate infrastructure and
deployment environment for hosting your model in production. This may include cloud
platforms such as AWS, Azure, or Google Cloud, containerization technologies like
Docker and Kubernetes, or serverless architectures. Consider factors such as cost,
scalability, reliability, and ease of maintenance when selecting your deployment
environment.
4. Model Monitoring and Maintenance: Implement monitoring and logging mechanisms to
track the performance of your model in production and detect any issues or anomalies. This
includes monitoring metrics such as prediction accuracy, latency, and resource utilization,
as well as logging inputs, outputs, and errors for debugging purposes. Establish procedures
for model retraining and maintenance to ensure it remains accurate and up-to-date over
time.
5. Data Privacy and Security: Ensure that your model deployment adheres to data privacy
and security requirements, especially if it involves handling sensitive or personally
identifiable information (PII). Implement encryption, access controls, and other security
measures to protect data both in transit and at rest. Comply with relevant regulations such
as GDPR, HIPAA, or CCPA to avoid legal and regulatory issues.
6. Versioning and Reproducibility: Establish versioning and reproducibility practices to
track changes to your model and ensure that deployments are consistent and reproducible.
Use version control systems such as Git to manage code and configuration changes, and
document dependencies, hyperparameters, and experimental results to facilitate
reproducibility.
7. Deployment Strategy: Develop a deployment strategy that minimizes downtime and
disruption to existing systems and services. Consider techniques such as blue-green
deployment, canary deployment, or rolling updates to gradually transition to the new model
while minimizing risk. Implement automated testing and rollback procedures to handle
failures and mitigate risks during deployment.
8. Collaboration and Communication: Foster collaboration and communication between
data scientists, engineers, and stakeholders throughout the deployment process. Clearly
define roles and responsibilities, establish communication channels and feedback loops,
and ensure that everyone is aligned on the goals and expectations for the model in
production.
By considering these factors and following best practices, you can effectively deploy and manage
models in production, enabling your organization to derive value from data science insights in a
reliable and scalable manner.
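One common way to put a trained model behind a production interface (point 3 above) is a small web service. The sketch below uses Flask; the framework choice, the model.joblib file, and the /predict route are illustrative assumptions rather than recommendations from the notes.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")            # load the trained model once at startup (hypothetical file)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)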
Q. 12 – Data Science Project Management
Ans. - Data science project management is essential for ensuring the success of data-driven
initiatives. Here's an overview of key principles and practices in managing data science projects:
1. Define Clear Objectives: Start by defining clear project objectives and success criteria.
What problem are you trying to solve? What are the business goals and metrics for
measuring success? Establishing clear objectives helps align stakeholders and guide project
prioritization and decision-making.
2. Structured Approach: Adopt a structured approach to project management, such as Agile
or Scrum, tailored to the unique requirements of data science projects. Break down the
project into smaller, manageable tasks or user stories, and prioritize them based on value
and feasibility. Use iterative development and regular feedback cycles to adapt to changing
requirements and insights.
3. Cross-Functional Teams: Build cross-functional teams with diverse skills and expertise,
including data scientists, engineers, domain experts, and business stakeholders. Encourage
collaboration and communication between team members from different disciplines to
leverage their respective strengths and perspectives.
4. Data Governance and Quality: Establish data governance policies and procedures to
ensure the quality, integrity, and security of data throughout the project lifecycle. Define
data standards, documentation practices, and access controls to maintain data quality and
compliance with regulatory requirements. Implement data validation and cleaning
processes to address errors, outliers, and missing values.
5. Resource Allocation and Management: Allocate resources effectively, including people,
time, and budget, to support the project's goals and objectives. Identify and mitigate
resource constraints or bottlenecks that may impact project progress or outcomes. Track
resource utilization and performance metrics to monitor project health and identify areas
for improvement.
6. Risk Management: Identify potential risks and uncertainties that may impact project
success, such as technical challenges, resource constraints, or changes in project
requirements. Develop risk mitigation strategies to address these risks proactively and
minimize their impact on project outcomes. Regularly review and update the risk register
to ensure that emerging risks are addressed promptly.
7. Communication and Stakeholder Engagement: Maintain open and transparent
communication with stakeholders throughout the project lifecycle. Provide regular updates
on project progress, milestones, and deliverables, and solicit feedback and input from
stakeholders to ensure alignment with their needs and expectations. Establish clear
channels for communication and escalation of issues or concerns.
8. Continuous Improvement: Foster a culture of continuous improvement by reflecting on
lessons learned and implementing process improvements based on feedback and
retrospective reviews. Encourage experimentation and innovation to explore new
approaches and technologies that can enhance project outcomes and deliver business value.
By applying these principles and practices, you can effectively manage data science projects and
drive successful outcomes that deliver value to your organization.
Quiz
Class 1 PPT
This class gives a comprehensive overview of data science, covering its definition, its role in business
decision-making, its applications across various industries, key statistical concepts, and common
methodologies. Here's a summarized breakdown:
1. Introduction to Data Science: Data science is the application of computer technology, statistics,
and domain knowledge to solve problems and convert raw data into knowledge for decision-
making.
2. Role in Business Decision Making: Data science plays a crucial role in extracting insights from
large datasets to support informed decision-making in areas such as customer segmentation,
manufacturing optimization, and strategic planning.
3. Applications across Industries: Data science finds applications in search engines, transportation
(e.g., driverless cars), finance (e.g., fraud detection), e-commerce (e.g., personalized
recommendations), healthcare (e.g., tumor detection), among others.
4. Statistical Concepts: The text covers various statistical concepts such as measures of central
tendency and variability, hypothesis testing, probability distributions, sampling methods, and
inferential statistics.
5. Methodologies in Data Science: Methodologies like the methodology-agnostic approach
emphasize flexibility and alignment with business needs, while breaking down business problems
involves understanding, defining, decomposing, analyzing, synthesizing, reviewing, and refining
the problem-solving process.
6. Dimensions of Data Quality: Data quality is evaluated based on dimensions like accuracy,
completeness, consistency, timeliness, validity, and uniqueness, ensuring that data is reliable and
relevant for decision-making.
Overall, the text provides insights into the fundamental concepts, applications, and methodologies
of data science, highlighting its importance in modern business and industry.
2. Which of the following is NOT an application of data science?
A) Search Engines
B) Transportation
C) Archaeology
D) Healthcare
Answer: C) Archaeology
4. Which industry uses data science to optimize logistics routes and schedules?
A) Healthcare
B) Finance
C) Manufacturing
D) Transportation
Answer: D) Transportation
6. Which measure of central tendency is best for nominal data?
A) Mode
B) Median
C) Mean
D) Standard Deviation
Answer: A) Mode
9. What type of error occurs when a guilty person is found to be not guilty?
A) Type I error
B) Type II error
C) Null error
D) Alternative error
Answer: B) Type II error
10. What is the purpose of inferential statistics?
A) To summarize data
B) To draw conclusions about a population by examining random samples
C) To create probability distributions
D) To calculate measures of central tendency
Answer: B) To draw conclusions about a population by examining random samples
11. In data science, what does the term "agile" refer to?
A) A methodology for software development
B) A breed of dogs
C) A statistical measure
D) A type of data visualization
Answer: A) A methodology for software development
12. Which industry benefits from using data science to predict customer lifetime value and stock
market moves?
A) Transportation
B) Healthcare
C) Finance
D) E-commerce
Answer: C) Finance
13. What measure of variability is the average squared difference of the values from the mean?
A) Range
B) Variance
C) Standard Deviation
D) Coefficient of Variation
Answer: B) Variance
14. Which sampling method allows you to make strong statistical inferences about the whole
group?
A) Probability sampling
B) Non-probability sampling
C) Convenience sampling
D) Judgment sampling
Answer: A) Probability sampling
16. Which dimension of data quality refers to the availability of information when needed?
A) Accuracy
B) Completeness
C) Consistency
D) Timeliness
Answer: D) Timeliness
17. What does the term "agnostic approach" mean in data science?
A) Being indifferent to technology, models, methodologies, or data
B) Being overly attached to specific technologies
C) Being unaware of statistical methods
D) Being skeptical about data analysis
Answer: A) Being indifferent to technology, models, methodologies, or data
18. What does a measure of central tendency indicate?
A) The spread of data values
B) The middle or typical value in a dataset
C) The likelihood of future events
D) The average difference from the mean
Answer: B) The middle or typical value in a dataset
19. Which statistical method compares events and explanations to validate them?
A) Descriptive statistics
B) Inferential statistics
C) Hypothesis testing
D) Probability distributions
Answer: C) Hypothesis testing
20. What type of error occurs when a researcher finds an effect that doesn't actually exist?
A) Type I error
B) Type II error
C) Null error
D) Alternative error
Answer: A) Type I error
Class 2 PPT
The data covers various aspects related to data analysis, statistics, hypothesis testing, and machine
learning. It begins by discussing the types of data, including numeric, qualitative (nominal and
ordinal), date-time data, and unstructured data.
It then delves into data summary components, such as measures of central tendency (mode, mean,
median) and variability (range, quartiles, inter-quartile range, MAD, standard deviation, and
variance), along with the concept of outliers and distributions (discrete and continuous).
The data further explores specific distributions like uniform, Bernoulli, binomial, and Poisson
distributions, providing insights into their characteristics and applications. It also touches upon the
pitfalls of data summaries, emphasizing the importance of understanding the nuances of data
representation.
Additionally, the data covers visualization methods for summarizing data, including histograms,
kernel density curves, box plots, and correlation coefficients. It discusses techniques for
summarizing categorical data, visualization of various types of data, and the introduction to
machine learning, distinguishing between supervised (regression and classification) and
unsupervised learning problems.
Furthermore, it outlines regression theory, classification theory, hypothesis testing frameworks,
types of errors, and performance metrics for evaluating models. The summary concludes with an
overview of predictive modeling and model performance evaluation in both regression and
classification scenarios.
Overall, the data provides a comprehensive overview of fundamental concepts and techniques
essential for understanding and analyzing data, hypothesis testing, and building predictive models
in various domains.
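For a quick feel of two of the distributions mentioned above, SciPy's probability mass functions can be evaluated directly; the parameter values below are arbitrary examples.

from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
print("Binomial P(X = 3):", stats.binom.pmf(3, n=10, p=0.5))

# Poisson: probability of exactly 2 events when the average rate is 4 per interval
print("Poisson  P(X = 2):", stats.poisson.pmf(2, mu=4))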
2. Which measure of central tendency is more suitable when extreme values are present in the
data?
A) Mode
B) Mean
C) Median
D) Range
Answer: C) Median
Answer: C) Uniform Distribution
9. Which branch of artificial intelligence focuses on the use of data algorithms to imitate human
learning?
A) Deep Learning
B) Natural Language Processing
C) Machine Learning
D) Robotics
Answer: C) Machine Learning
10. What are the two main types of supervised learning problems?
A) Regression and Classification
B) Regression and Unsupervised
C) Classification and Clustering
D) Clustering and Dimensionality Reduction
Answer: A) Regression and Classification
12. What is the range of values for probability outcomes in classification problems?
A) -1 to 1
B) 0 to 1
C) 1 to ∞
D) -∞ to +∞
Answer: B) 0 to 1
13. Which function is used to transform generic combinations of inputs in classification problems?
A) Sigmoid function
B) Linear function
C) Exponential function
D) Logarithmic function
Answer: A) Sigmoid function
16. Which type of distribution is observed when the probability of success in an individual trial is
small and unknown?
A) Binomial Distribution
B) Poisson Distribution
C) Uniform Distribution
D) Normal Distribution
Answer: B) Poisson Distribution
Answer: C) Frequency bar chart
20. In machine learning, what problem type involves finding general patterns hidden in measured
features?
A) Regression
B) Classification
C) Unsupervised
D) Reinforcement Learning
Answer: C) Unsupervised
Class 3 PPT
The provided data offers an overview of key concepts in machine learning, particularly focusing
on supervised and unsupervised learning, as well as regression and classification problems. It
outlines the fundamental principles behind these concepts, including:
1. Machine Learning Overview: Machine learning is a branch of artificial intelligence (AI)
and computer science focused on using data algorithms to mimic human learning, with a
gradual improvement in accuracy over time.
2. Supervised vs. Unsupervised Learning: Supervised learning involves training a model
using labeled data with explicit outcomes, while unsupervised learning deals with
unlabeled data and aims to find hidden patterns.
3. Regression Problems: Regression problems entail predicting continuous numeric
outcomes, such as sales or temperature, using input features. The performance of regression
models is evaluated using metrics like Mean Squared Error (MSE) and R-squared.
4. Classification Problems: Classification problems involve predicting categorical
outcomes, such as yes/no or different classes, based on input features. Evaluation metrics
for classification models include accuracy, precision, recall, F1-score, and AUC-ROC.
5. Model Evaluation: Model evaluation is essential in both regression and classification
problems to assess the performance of trained models on unseen data. Common evaluation
metrics help understand the model's strengths and weaknesses.
6. Overfitting: Overfitting occurs when a model learns the training data too well, capturing
noise and irrelevant patterns. Remedies for overfitting include simplifying the model,
regularization, cross-validation, and feature selection.
7. Trade-offs in Classification Models: Sensitivity and specificity are crucial metrics in
binary classification models, with a trade-off between them. The choice between
maximizing sensitivity or specificity depends on the specific requirements of the problem
domain.
Overall, the data provides a comprehensive understanding of machine learning fundamentals,
including problem categorization, model evaluation, and strategies to mitigate common challenges
like overfitting.
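The evaluation metrics named in points 3 and 4 can be computed with scikit-learn; the actual and predicted values below are made up solely to show the calls.

from sklearn.metrics import (mean_squared_error, r2_score,
                             accuracy_score, precision_score,
                             recall_score, f1_score)

# Regression: compare actual vs. predicted continuous values
y_true_reg = [3.0, 5.0, 7.5, 10.0]
y_pred_reg = [2.8, 5.4, 7.0, 9.5]
print("MSE:      ", mean_squared_error(y_true_reg, y_pred_reg))
print("R-squared:", r2_score(y_true_reg, y_pred_reg))

# Classification: compare actual vs. predicted class labels
y_true_cls = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_cls = [1, 0, 1, 0, 0, 1, 1, 0]
print("Accuracy: ", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:   ", recall_score(y_true_cls, y_pred_cls))
print("F1 score: ", f1_score(y_true_cls, y_pred_cls))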
Answer: b) Imitating the way humans learn
4. Which function is used to represent the relationship between input factors and outcome in
regression?
a) f(X)
b) P(y)
c) Sigmoid
d) Logit
Answer: a) f(X)
6. How do regression and classification problems differ in terms of their outcome types?
a) Regression problems have categorical outcomes, while classification problems have
continuous numeric outcomes.
b) Regression problems have continuous numeric outcomes, while classification problems have
categorical outcomes.
c) Regression problems have only two possible outcomes, while classification problems have
multiple possible outcomes.
d) Regression problems have multiple possible outcomes, while classification problems have
only two possible outcomes.
Answer: b) Regression problems have continuous numeric outcomes, while classification
problems have categorical outcomes.
a) Sensitivity measures the ability to correctly identify negative cases, while specificity measures
the ability to correctly identify positive cases.
b) Sensitivity measures the ability to correctly identify positive cases, while specificity measures
the ability to correctly identify negative cases.
c) Sensitivity measures the overall accuracy of the model, while specificity measures the
precision of the model.
d) Sensitivity measures the precision of the model, while specificity measures the recall of the
model.
Answer: b) Sensitivity measures the ability to correctly identify positive cases, while specificity
measures the ability to correctly identify negative cases.
Class 4 PPT
The provided data covers various aspects of machine learning algorithms, including Gradient
Boosting Machines (GBMs), XGBoost, CatBoost, unsupervised learning, similarity measures,
clustering methods, and dimensionality reduction techniques like PCA, t-SNE, and UMAP. Here's
a summary of the key points:
1. Gradient Boosting Machines (GBMs):
- GBMs are ensemble learning algorithms used for regression and classification tasks.
- Parameters include `n_estimators`, `learning_rate`, `max_depth`, `subsample`, and
`max_features`.
- They build decision trees sequentially and combine them to improve prediction accuracy.
2. XGBoost:
- XGBoost is a variant of GBMs with altered cost functions, making it slower but more targeted.
- It penalizes model complexity and individual tree contributions to ensure better overall
objective alignment.
3. CatBoost:
- CatBoost is a boosting algorithm designed for categorical features, published in 2017.
- It replaces categorical values with out-of-sample response rates to avoid data explosion and
runs fast due to forced balanced tree structures.
4. Unsupervised Learning:
- Unsupervised learning deals with unlabeled data, allowing algorithms to discover patterns and
insights without explicit guidance.
- Similarity measures like Euclidean distance, cosine similarity, Jaccard distance, and Gower’s
distance are used to quantify relationships between data points.
5. Clustering Methods:
- Clustering algorithms group similar data points together.
- Methods include K-means, DBSCAN, agglomerative clustering, and PAMk.
- Linkage methods like single, complete, average, centroid, and Ward's method determine
proximity between clusters.
6. Dimensionality Reduction:
- Techniques like PCA, t-SNE, and UMAP reduce high-dimensional data to lower dimensions
for visualization and analysis.
- PCA preserves variance while reducing dimensions, while t-SNE and UMAP focus on
maintaining local relationships in the data.
Overall, the data provides insights into various machine learning techniques used for modeling,
clustering, and dimensionality reduction, each with its specific strengths and applications. These
methods play crucial roles in analyzing and extracting valuable information from complex
datasets.
1. Which algorithm uses a slightly altered version of the original cost functions used by
traditional GBMs?
- A) XGBoost
- B) CatBoost
- C) Unsupervised Learning
- D) K-means
- Answer: A) XGBoost
2. What parameter determines the maximum depth of individual tree models in Gradient
Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_depth
- D) subsample
- Answer: C) max_depth
3. Which distance metric is commonly used for measuring similarity between two points in
unsupervised learning?
- A) Euclidean distance
- B) Manhattan distance
- C) Cosine similarity
- D) Jaccard distance
- Answer: A) Euclidean distance
4. What is the primary purpose of Principal Component Analysis (PCA)?
- A) Reducing the number of features while preserving variance
- B) Clustering data points into groups
- C) Visualizing high-dimensional data in 2D or 3D
- D) Identifying outliers in the dataset
- Answer: A) Reducing the number of features while preserving variance
5. Which clustering algorithm is susceptible to finding nonsensical clusters if the shape of the
data groups is not spherical?
- A) K-means
- B) PAMk
- C) DBSCAN
- D) Agglomerative clustering
- Answer: A) K-means
7. Which linkage method calculates the proximity between two clusters based on the average
distance of all objects in one cluster to all objects in the other cluster?
- A) Single linkage
- B) Complete linkage
- C) Average linkage
- D) Centroid linkage
- Answer: C) Average linkage
8. In t-SNE, what does the "t" stand for?
- A) Triangulated
- B) Transformed
- C) T-distribution
- D) Tapered
- Answer: C) T-distribution
10. Which algorithm is specifically designed to work with categorical features and is known
for its reliability and speed?
- A) XGBoost
- B) CatBoost
- C) K-means
- D) DBSCAN
- Answer: B) CatBoost
11. What parameter determines the number of features to consider when looking for the best
split in Gradient Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_features
- D) max_depth
- Answer: C) max_features
12. Which similarity measure is commonly used for binary data and calculates the similarity
based on the number of common attributes?
- A) Euclidean distance
- B) Cosine similarity
- C) Jaccard distance
- D) Gower’s distance
- Answer: C) Jaccard distance
14. Which clustering method starts by assigning each object to its own cluster and then
iteratively merges clusters based on their similarity?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: C) Agglomerative clustering
- D) It is insensitive to the scale of the data
- Answer: C) It is computationally expensive for large datasets
18. What parameter in t-SNE controls the number of iterations during optimization?
- A) Learning rate
- B) Perplexity
- C) Number of neighbors
- D) Number of iterations
- Answer: D) Number of iterations
19. Which algorithm replaces categorical values with out-of-sample response rates to avoid
data explosion?
- A) K-means
- B) DBSCAN
- C) CatBoost
- D) PAMk
- Answer: C) CatBoost
21. Which method calculates the proximity between two clusters by considering the centroids
of the clusters?
- A) Single linkage
- B) Complete linkage
- C) Centroid linkage
- D) Ward's method
- Answer: C) Centroid linkage
22. Which algorithm starts by randomly selecting K points as initial cluster centers and
iteratively assigns each point to the nearest cluster center?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: A) K-means
23. Which similarity measure is appropriate for measuring the similarity between two binary
vectors?
- A) Euclidean distance
- B) Cosine similarity
- C) Jaccard distance
- D) Gower’s distance
- Answer: C) Jaccard distance
25. Which algorithm is prone to the problem of finding nonsensical clusters when the number
of clusters (K) is not chosen carefully?
- A) K-means
- B) DBSCAN
- C) Agglomerative clustering
- D) PAMk
- Answer: A) K-means
29. Which algorithm is specifically designed to work with categorical features and is known
for its reliability and speed?
- A) XGBoost
- B) CatBoost
- C) K-means
- D) DBSCAN
- Answer: B) CatBoost
30. What parameter determines the maximum depth of individual tree models in Gradient
Boosting Machines (GBMs)?
- A) n_estimators
- B) learning_rate
- C) max_depth
- D) subsample
- Answer: C) max_depth