Unit 1
Analytics and data science are closely related fields that involve the extraction of
insights, knowledge, and value from data. While there is some overlap between the two,
they have distinct focuses and methodologies. Here's an overview of analytics and data
science:
Analytics: Analytics focuses on examining data, most often structured data, to answer specific business questions, identify trends, and support decision-making. It typically relies on statistical analysis, reporting, and data visualization applied to historical data.
Data Science: Data science is a broader and interdisciplinary field that encompasses
analytics but extends beyond it. Data science focuses on the extraction of knowledge
and insights from large and complex datasets, including unstructured data such as text,
images, or social media content. It combines elements of statistics, mathematics,
computer science, and domain expertise to handle data at scale and extract valuable
information. Data scientists are skilled in data collection, data pre-processing,
exploratory data analysis, statistical modelling, machine learning, and data visualization.
They apply advanced algorithms and techniques to solve complex problems, uncover
hidden patterns, build predictive models, develop recommendation systems, and create
innovative solutions. Data science is used in diverse domains, including finance,
healthcare, e-commerce, transportation, and artificial intelligence.
While analytics tends to focus on extracting insights from structured data and answering
specific business questions, data science has a broader scope and often involves working
with large-scale datasets, applying advanced algorithms, and developing new
methodologies. Data scientists are typically responsible for end-to-end data analysis,
including data collection, data cleaning, model development, deployment, and ongoing
iteration.
Both analytics and data science are crucial for leveraging the power of data to drive
informed decision-making, gain competitive advantages, and create value for
organizations. They complement each other and are often used in tandem to solve
complex problems and extract meaningful insights from data.
Analytics life cycle
The analytics life cycle refers to the process of performing data analytics from start to
finish, encompassing various stages and activities. It involves transforming raw data into
meaningful insights and actionable recommendations. Here is a typical analytics life
cycle:
1. Problem Definition: The first step is to clearly define the business problem or question
that you want to address through analytics. This could involve identifying areas for
improvement, finding patterns, predicting outcomes, or understanding customer
behaviour.
2. Data Acquisition: In this stage, you gather the relevant data needed to address the
defined problem. Data can come from various sources, such as databases, files, APIs, or
external data providers. It's important to ensure data quality, integrity, and appropriate
permissions during acquisition.
3. Data Preparation: Raw data often requires cleaning, transformation, and formatting
before it can be analysed effectively. This step involves tasks such as data cleansing,
data integration, handling missing values, removing outliers, and structuring the data in
a suitable format for analysis.
4. Data Exploration and Visualization: Here, you explore the data to gain initial insights and
a deeper understanding of its characteristics. Techniques like descriptive statistics, data
visualization, and exploratory data analysis help identify patterns, trends, correlations,
and potential relationships.
5. Data Modelling: This stage involves building statistical or machine learning models to
analyse the data and extract meaningful insights. Depending on the problem at hand,
you may use techniques like regression, classification, clustering, time series analysis, or
predictive modelling.
6. Model Evaluation: Once the models are developed, they need to be evaluated to assess
their performance and accuracy. This involves using appropriate evaluation metrics,
comparing alternative models, validating against a holdout dataset, and fine-tuning the
models if necessary.
7. Insight Generation: After validating the models, you interpret the results and derive
insights from the analysis. These insights provide answers to the initial problem or
question and help make informed decisions or take appropriate actions.
8. Communication and Reporting: This step involves presenting the findings, insights, and
recommendations to stakeholders in a clear and concise manner. Visualizations,
dashboards, reports, or presentations are often used to effectively communicate the
results and their implications.
9. Implementation: Once the insights are communicated, they need to be implemented in
real-world scenarios to drive actual business impact. This may involve making
operational changes, implementing new strategies, or optimizing existing processes
based on the analytics results.
10. Monitoring and Iteration: Analytics is an ongoing process, and it's crucial to monitor the
implemented changes and measure their impact over time. Continuous monitoring helps
identify any deviations, track performance, and refine the models or strategies as
needed.
It's important to note that the analytics life cycle is not a strictly linear process, and
iterations between different stages are common. The process is often iterative, where
new insights and feedback lead to refining the problem definition, acquiring additional
data, or re-evaluating the models.
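As a small illustration of stage 4 above (Data Exploration and Visualization), here is a minimal Python sketch using pandas and matplotlib. The file name sales.csv and its columns (revenue, marketing_spend) are assumptions made purely for illustration.

```python
# Minimal sketch of exploratory data analysis (stage 4 of the life cycle).
# "sales.csv" and its column names are assumed for illustration only.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")             # data acquired in stage 2

# Descriptive statistics and basic data-quality checks
print(df.describe(include="all"))         # summary statistics per column
print(df.isna().sum())                    # count of missing values per column

# Visual exploration: a distribution and a pairwise relationship
df["revenue"].hist(bins=30)
plt.title("Revenue distribution")
plt.xlabel("Revenue")
plt.ylabel("Frequency")
plt.show()

df.plot.scatter(x="marketing_spend", y="revenue")
plt.title("Marketing spend vs. revenue")
plt.show()

# Correlation matrix to spot potential relationships between numeric variables
print(df.corr(numeric_only=True))
```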
Types of Analytics
Analytics can be broadly categorized into several types based on the objectives and
techniques used. Here are some common types of analytics:
1. Descriptive Analytics: Summarizes historical data to describe what has happened, typically through reports, dashboards, and summary statistics.
2. Diagnostic Analytics: Examines data to understand why something happened, for example through drill-down analysis and correlation studies.
3. Predictive Analytics: Uses statistical models and machine learning techniques to forecast what is likely to happen in the future.
4. Prescriptive Analytics: Recommends what should be done by combining predictions with optimization and simulation techniques.
These are just a few examples of the types of analytics. In practice, multiple types of analytics may be combined to gain a comprehensive understanding of the data and address specific business objectives.
Problem definition
Defining the problem clearly is the first stage of the analytics life cycle. Here are key steps a business can follow to define a problem:
1. Identify the Objective: Start by understanding the overall objective or goal that the
business wants to achieve. This could be increasing revenue, reducing costs, improving
customer satisfaction, optimizing operations, or entering a new market. The objective
provides a high-level direction for problem definition.
2. Gather Stakeholder Input: Engage with relevant stakeholders, such as executives,
managers, employees, and customers, to gain their insights and perspectives on the
challenges and opportunities faced by the business. This helps ensure a comprehensive
understanding of the problem and incorporate diverse viewpoints.
3. Analyse Current State: Assess the current state of the business by examining relevant
data, performance metrics, processes, and existing strategies. Identify any pain points,
bottlenecks, inefficiencies, or gaps that hinder the achievement of the desired objective.
This analysis provides a baseline for problem definition.
4. Define the Problem Statement: Based on the gathered information and analysis, clearly
define the problem in a concise and specific manner. The problem statement should be
focused, measurable, and aligned with the overall business objective. It should address
the "what" and "why" of the problem.
5. Consider Root Causes: Dig deeper to identify the underlying root causes of the problem.
Look for factors or variables that contribute to the issue and try to understand their
relationships. This analysis helps in targeting the right areas for improvement and
designing effective solutions.
6. Formulate Hypotheses: Develop initial hypotheses or assumptions about potential causes
and solutions for the problem. These hypotheses will guide the subsequent data analysis
and validation process. It's important to clearly state the assumptions and expectations
that need to be tested.
7. Determine Data Needs: Identify the data required to analyse and address the defined
problem. Determine what data sources are available, what data is missing, and if any
additional data needs to be collected. Consider both internal data (e.g., sales records,
customer data) and external data (e.g., market trends, industry benchmarks).
8. Validate and Refine: Share the problem statement, hypotheses, and data requirements
with stakeholders for feedback and validation. Refine the problem definition based on
their input and ensure that everyone is aligned on the problem statement and the
desired outcome.
By following these steps, a business can effectively define a problem, setting the stage
for data analysis, decision-making, and ultimately finding appropriate solutions.
Data collection
Data collection is the process of gathering relevant and accurate data from various
sources to support analysis, decision-making, and problem-solving. It involves identifying
the data needed, determining the sources, collecting the data, and ensuring its quality
and integrity. Here are some key steps involved in the data collection process:
1. Identify Data Requirements: Clearly define the data requirements based on the problem
statement and the objectives of the analysis. Determine the types of data needed, such
as numerical data, text data, categorical data, or spatial data. Identify the specific
variables or attributes that are relevant to the analysis.
2. Determine Data Sources: Identify the potential sources of data that can fulfil the
requirements. This could include internal sources within the organization, such as
databases, files, transactional systems, or customer relationship management (CRM)
systems. External sources such as public databases, research reports, government data,
or third-party data providers may also be considered.
3. Plan Data Collection Methods: Determine the most appropriate methods for data
collection based on the nature of the data and the available sources. Common methods
include surveys, interviews, observations, experiments, web scraping, data extraction
from APIs, or purchasing data from external vendors. Consider factors such as cost, time,
feasibility, and data privacy regulations when selecting the methods.
4. Prepare Data Collection Instruments: If surveys or interviews are used for data collection,
develop questionnaires or interview protocols that align with the data requirements.
Design questions that are clear, unbiased, and relevant to gather the desired
information. Pre-testing the instruments with a small sample can help identify any issues
or improvements.
5. Data Collection Execution: Implement the data collection methods according to the
planned approach. This may involve distributing surveys, conducting interviews,
performing observations, or collecting data through automated processes. Ensure that
the data collection is conducted consistently and in a standardized manner to maintain
data integrity.
6. Data Validation and Cleaning: Review and validate the collected data to ensure its
accuracy, completeness, and consistency. Check for any errors, missing values, outliers,
or inconsistencies that may impact the analysis. Clean the data by correcting errors,
addressing missing values, and resolving inconsistencies.
7. Data Storage and Organization: Establish a proper data storage and organization system
to store the collected data securely. This could involve using databases, data
warehouses, or cloud storage solutions. Ensure that the data is appropriately labelled,
structured, and indexed for easy retrieval and analysis.
8. Data Documentation: Document the data collection process, including details such as the
data sources, collection methods, instrument design, and any relevant information about
the data collection process. This documentation helps in ensuring data reproducibility
and transparency.
9. Data Privacy and Ethical Considerations: Adhere to data privacy regulations and ethical
guidelines throughout the data collection process. Obtain necessary permissions and
consents when dealing with personal or sensitive data. Anonymise or aggregate data
when required to protect privacy.
10. Data Security: Implement appropriate security measures to protect the collected data
from unauthorized access, loss, or breaches. This may involve encryption, access
controls, regular backups, and compliance with data security standards.
Data collection is a critical step in the analytics process, as the quality and relevance of
the data collected greatly impact the accuracy and effectiveness of the subsequent
analysis and decision-making.
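As one concrete illustration of steps 3, 5, and 6 above, the sketch below pulls records from a hypothetical JSON API with the requests library and performs basic validation before storing them. The URL, parameters, and field names are assumptions, not a real service.

```python
# Sketch: collecting data from a (hypothetical) JSON API and validating it.
# The endpoint, parameters, and field names are placeholders for illustration.
import requests
import pandas as pd

API_URL = "https://api.example.com/v1/orders"    # hypothetical endpoint
params = {"start_date": "2024-01-01", "end_date": "2024-01-31"}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()                      # fail loudly on HTTP errors
records = response.json()                        # assumed to be a list of dicts

df = pd.DataFrame(records)

# Basic validation: required fields present, no duplicate IDs, sane values
required = {"order_id", "customer_id", "amount", "order_date"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Missing expected fields: {missing}")

df = df.drop_duplicates(subset="order_id")
df = df[df["amount"] >= 0]                       # drop obviously invalid rows

# Store the validated data (step 7) - here simply to a local CSV file
df.to_csv("orders_2024_01.csv", index=False)
```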
Data preparation
Data preparation, also known as data pre-processing or data wrangling, is the process of
transforming raw data into a clean, structured, and suitable format for analysis. It
involves cleaning, integrating, transforming, and formatting the data to ensure its
quality, consistency, and compatibility with the analysis techniques and algorithms. Here
are some key steps involved in data preparation:
1. Data Cleaning: Clean the data by handling missing values, outliers, duplicates, and
inconsistencies. This may involve imputing missing values, removing or correcting
outliers, merging or removing duplicate records, and resolving inconsistencies in
formatting or coding.
2. Data Integration: If you have data from multiple sources, integrate the data to create a
unified dataset. This involves resolving differences in variables, units of measurement,
and data formats across sources. Techniques like data matching, record linkage, and
data merging are used to combine datasets appropriately.
3. Data Transformation: Transform the data to make it suitable for analysis. This includes
converting variables to the correct data types (e.g., numeric, categorical), normalizing or
standardizing numerical variables, and creating derived variables or features that
capture relevant information. Transformations may also involve handling skewed
distributions, scaling variables, or applying mathematical functions.
4. Feature Selection: Identify the most relevant features or variables that contribute to the
analysis or prediction task. This step involves assessing the importance or relevance of
each feature and selecting a subset of features that are most informative. Feature
selection techniques may include statistical tests, correlation analysis, or machine
learning algorithms.
5. Data Reduction: If the dataset is large or contains redundant information, apply
techniques to reduce the dimensionality of the data. This can involve techniques like
principal component analysis (PCA), feature extraction, or feature engineering to reduce
the number of variables while retaining the most important information.
6. Data Formatting: Ensure that the data is formatted properly for analysis. This includes
standardizing units of measurement, converting dates and times to a consistent format,
and encoding categorical variables into numerical representations (e.g., one-hot
encoding). Formatting the data makes it compatible with the analysis techniques and
algorithms to be applied.
7. Data Splitting: Split the prepared data into training, validation, and test sets. The training
set is used to build models, the validation set is used for model selection and parameter
tuning, and the test set is used to evaluate the final model's performance. Proper data
splitting helps assess the model's generalization ability.
8. Data Documentation: Document the data preparation steps taken, including the cleaning,
transformation, and formatting applied to the data. This documentation helps ensure
reproducibility, transparency, and traceability in the analysis process.
Data preparation is a crucial step in the analytics life cycle, as the quality and suitability
of the prepared data greatly impact the accuracy and reliability of the subsequent
analysis and modelling. It requires careful attention to detail and domain knowledge to
ensure that the data is prepared appropriately for the specific analysis objectives.
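To make several of the steps above concrete (cleaning, formatting, splitting, and transformation), here is a minimal pandas/scikit-learn sketch. The file customers.csv, its columns, and the 80/10/10 split are assumptions chosen purely for illustration.

```python
# Sketch of common data preparation steps with pandas and scikit-learn.
# "customers.csv" and its columns ("age", "income", "segment", "churned")
# are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")

# Cleaning: remove duplicates and impute missing numeric values
df = df.drop_duplicates()
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Formatting: one-hot encode the categorical "segment" column
df = pd.get_dummies(df, columns=["segment"], drop_first=True)

# Splitting: 80% train, 10% validation, 10% test
X = df.drop(columns="churned")
y = df["churned"]
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# Transformation: standardize numeric variables, fitting the scaler on the
# training set only so that no information leaks from validation/test data
num_cols = ["age", "income"]
scaler = StandardScaler().fit(X_train[num_cols])
for part in (X_train, X_val, X_test):
    part.loc[:, num_cols] = scaler.transform(part[num_cols])
```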
Hypothesis generation
Hypothesis generation is the process of formulating testable assumptions about potential causes, relationships, or solutions, based on the problem definition and an initial exploration of the data.
It's important to note that hypothesis generation is an iterative process, and hypotheses
can be refined, expanded, or revised as the analysis progresses and new insights are
gained. The generated hypotheses guide the subsequent data analysis, modelling, and
validation steps in the analytics life cycle.
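As a sketch of how a formulated hypothesis might later be tested against data, the example below compares conversion rates between two customer groups with a chi-square test of independence. The scenario, file name, column names, and 5% significance threshold are assumptions for illustration only.

```python
# Sketch: testing a hypothesis such as "customers who received the new
# onboarding flow convert at a different rate than those who did not".
# The data file and column names are assumed for illustration.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("onboarding_experiment.csv")   # columns: group, converted

# Build a 2x2 contingency table of group vs. conversion outcome
table = pd.crosstab(df["group"], df["converted"])
print(table)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")

# A small p-value (e.g. below 0.05) is evidence against the null hypothesis
# that conversion is independent of the onboarding group.
if p_value < 0.05:
    print("Reject the null hypothesis: conversion rates differ between groups.")
else:
    print("No significant difference detected at the 5% level.")
```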
Modelling
Modelling, in the context of data analytics, refers to the process of creating mathematical
or statistical representations of real-world phenomena or systems using data. Models are
constructed to understand, explain, predict, or optimize certain aspects of the data and
provide insights for decision-making. Here are key steps involved in the modelling
process:
1. Define the Objective: Clearly define the objective of the modelling exercise. Determine
what you want to achieve with the model, such as predicting outcomes, understanding
relationships, optimizing performance, or simulating scenarios. The objective guides the
selection of the appropriate modelling technique and the variables to be considered.
2. Select the Modelling Technique: Choose the modelling technique that best suits the
problem at hand and aligns with the available data. Common modelling techniques
include regression analysis, classification algorithms, clustering algorithms, time series
analysis, optimization models, simulation models, and machine learning algorithms.
Consider the assumptions, limitations, and requirements of each technique.
3. Data Preparation: Prepare the data for modelling by cleaning, transforming, and
formatting it as discussed in the data preparation stage. Ensure that the data is suitable
for the chosen modelling technique. Split the data into training, validation, and test sets
for model development, evaluation, and validation.
4. Model Development: Develop the model using the chosen technique. This involves fitting
the model to the training data by estimating the parameters or finding the best-fitting
pattern. The specific steps and algorithms used depend on the chosen technique, such as
fitting regression coefficients, training a neural network, or building decision trees.
5. Model Evaluation: Assess the performance and validity of the developed model. Use
appropriate evaluation metrics, such as accuracy, precision, recall, F1 score, mean
squared error, or R-squared, depending on the modelling technique and the specific
problem. Validate the model using the validation data set to ensure that it generalizes
well to unseen data.
6. Model Refinement: Refine the model based on the evaluation results. This may involve
tweaking model parameters, selecting a subset of variables, applying feature
engineering techniques, or addressing any issues identified during evaluation. Iteratively
refine the model to improve its performance and reliability.
7. Model Interpretation: Interpret the model to gain insights into the relationships, patterns,
or factors that contribute to the analysed phenomenon. Depending on the modelling
technique, you may examine coefficients, feature importance, decision rules, or
visualization techniques to understand the model's internal workings and its implications.
8. Model Deployment: Once the model is developed and validated, it can be deployed for
practical use. This involves integrating the model into existing systems, creating APIs for
real-time predictions, or incorporating it into decision support tools. Ensure proper
documentation, version control, and monitoring of the deployed model.
9. Model Maintenance and Monitoring: Models may require periodic updates and
maintenance to stay relevant and accurate. Monitor the model's performance over time,
retrain it with new data if necessary, and assess its ongoing impact on the business or
problem being addressed. Keep track of changing circumstances and update the model
as needed.
10. Communication of Results: Communicate the modelling results and insights to relevant
stakeholders in a clear and understandable manner. Present the findings, visualizations,
and recommendations in reports, dashboards, or presentations. Ensure that the results
are effectively communicated to support decision-making and actions.
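To illustrate steps 3 to 7 above, here is a minimal scikit-learn sketch that trains a classifier, evaluates it on a validation set, and inspects feature importance. The dataset, target column, and the choice of a random forest are assumptions made for illustration.

```python
# Sketch: model development, evaluation, and a simple interpretation step.
# The dataset and target column ("churned") are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("customers_prepared.csv")
X = df.drop(columns="churned")
y = df["churned"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model development: fit the chosen technique to the training data
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Model evaluation on the held-out validation set
pred = model.predict(X_val)
print("accuracy :", accuracy_score(y_val, pred))
print("precision:", precision_score(y_val, pred))
print("recall   :", recall_score(y_val, pred))
print("F1 score :", f1_score(y_val, pred))

# Model interpretation: which features drive the predictions?
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```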
Validation and evaluation
Validation and evaluation are crucial steps in the data analytics process to assess the
performance, accuracy, and reliability of the developed models or analysis. These steps
involve measuring how well the models or analysis techniques perform, ensuring their
effectiveness, and determining their suitability for the intended purpose. Here are key
aspects of validation and evaluation:
1. Validation Data: Use a separate validation dataset that is not used during model
development to assess the performance. The validation data should be representative of
the real-world scenarios and have similar characteristics to the data the model will
encounter in practice.
2. Evaluation Metrics: Select appropriate evaluation metrics based on the problem and the
type of model or analysis being performed. Common evaluation metrics include
accuracy, precision, recall, F1 score, mean squared error, R-squared, area under the
curve (AUC), or lift, depending on the specific context.
3. Model Performance: Evaluate the performance of the model using the validation dataset
and the selected evaluation metrics. Compare the model's predictions or results against
the ground truth or known values. Assess how well the model performs in terms of
accuracy, reliability, robustness, and generalization to unseen data.
4. Overfitting and Underfitting: Check for overfitting or underfitting of the model. Overfitting occurs when the model learns the training data too well but fails to generalize to new data. Underfitting occurs when the model is too simple and fails to capture the
underlying patterns or relationships in the data. Ensure that the model strikes the right
balance between complexity and generalization.
5. Cross-Validation: Consider applying cross-validation techniques, such as k-fold cross-validation or stratified cross-validation, to get a more robust estimate of the model's
performance. Cross-validation helps mitigate issues related to the specific random split
of data into training and validation sets and provides a more representative evaluation.
6. Sensitivity Analysis: Conduct sensitivity analysis to understand how the model's
performance changes with variations in the input variables or parameters. This analysis
helps identify critical factors that affect the model's predictions or outcomes and assess
the robustness of the model.
7. Business Impact Evaluation: Evaluate the business impact or value of the models or
analysis results. Assess how well the models address the initial problem statement,
whether they provide actionable insights, and if they align with the desired business
outcomes. Consider the cost-benefit analysis and the practicality of implementing the
model's recommendations.
8. Iterative Refinement: Based on the evaluation results, refine the models or analysis
techniques as necessary. This may involve adjusting model parameters, feature
selection, data pre-processing steps, or exploring alternative modelling techniques.
Iterate the evaluation-refinement cycle until the desired performance or suitability is
achieved.
9. Documentation and Reporting: Document the validation and evaluation process,
including the data used, evaluation metrics, results, and any insights gained. Provide
clear and transparent reporting of the model's performance, strengths, limitations, and
recommendations. Effective communication of the validation and evaluation results
ensures transparency and supports decision-making.
Validation and evaluation provide critical feedback on the performance and effectiveness
of models or analysis techniques. They help ensure the reliability, accuracy, and
practicality of the analytics results and guide the decision-making process. It's important
to emphasize that validation and evaluation are ongoing processes, especially as new
data becomes available or business requirements evolve.
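As an illustration of the overfitting check and cross-validation points above, the sketch below runs k-fold cross-validation and compares training and validation scores. The dataset and model are carried over from the earlier modelling sketch and remain assumptions.

```python
# Sketch: k-fold cross-validation with a simple overfitting check.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

df = pd.read_csv("customers_prepared.csv")       # illustrative dataset
X = df.drop(columns="churned")
y = df["churned"]

model = RandomForestClassifier(n_estimators=200, random_state=42)

# 5-fold cross-validation, recording both training and validation scores
scores = cross_validate(model, X, y, cv=5, scoring="f1", return_train_score=True)

train_f1 = scores["train_score"].mean()
val_f1 = scores["test_score"].mean()
print(f"mean training F1  : {train_f1:.3f}")
print(f"mean validation F1: {val_f1:.3f}")

# A large gap between training and validation performance suggests
# overfitting; similar but low scores on both suggest underfitting.
```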
Interpretation
Interpretation in data analytics refers to the process of making sense of the results and
findings obtained from data analysis. It involves extracting meaningful insights,
understanding relationships or patterns in the data, and deriving actionable
recommendations or conclusions. Effective interpretation helps stakeholders understand
the implications of the analysis and make informed decisions.
Effective interpretation of data analysis results is critical to derive actionable insights and
support decision-making. It requires a combination of analytical skills, domain
knowledge, critical thinking, and effective communication to ensure that the analysis is
translated into meaningful and useful information for stakeholders.
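Where a model's built-in importance scores are hard to read, permutation importance is one widely used, model-agnostic way to support interpretation. The sketch below applies it to the illustrative model from the earlier examples; the dataset and columns remain assumptions.

```python
# Sketch: model-agnostic interpretation with permutation importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers_prepared.csv")       # illustrative dataset
X = df.drop(columns="churned")
y = df["churned"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# How much does validation performance drop when each feature is shuffled?
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))   # features whose shuffling hurts performance most
```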
Deployment and iteration
Once insights and models have been validated, they are deployed into real-world use and improved over time. Here are key steps involved in deployment and iteration:
1. Deployment Planning: Develop a deployment plan that outlines how the insights, models,
or recommendations will be implemented in real-world scenarios. Consider factors such
as the required infrastructure, integration with existing systems, user training, and any
necessary process changes.
2. Implementation: Put the analytical insights or models into action by integrating them into
operational processes or systems. This may involve creating APIs for real-time
predictions, integrating models into decision support tools or business intelligence
platforms, or implementing process changes based on the recommendations. Ensure that
the deployment aligns with the overall business objectives and operational requirements.
3. Monitoring and Performance Measurement: Continuously monitor the performance and
impact of the deployed models or analytical solutions. Define appropriate metrics to
measure the effectiveness, accuracy, efficiency, or other relevant performance
indicators. Regularly evaluate how well the deployed solutions are meeting the intended
goals and identify areas for improvement.
4. Feedback Collection: Collect feedback from stakeholders, users, or customers who
interact with the deployed solutions. Gather their insights, suggestions, and observations
about the performance, usability, or practicality of the deployed models or solutions.
Incorporate this feedback into the iteration process.
5. Model Maintenance and Retraining: Models may require periodic maintenance and
retraining to ensure their accuracy and relevance. Keep track of changes in the data
environment, business context, or external factors that may impact the model's
performance. Regularly retrain the models with new data to keep them up to date and
aligned with changing patterns or relationships.
6. Iterative Improvement: Use the feedback and performance monitoring results to drive
iterative improvements. Analyse the areas where the deployed models or solutions fall
short or can be enhanced. Refine the models, update the analytical approaches, or
modify the recommendations based on the iterative learning process. Continuously seek
ways to optimize and improve the performance of the deployed solutions.
7. Documentation and Version Control: Maintain proper documentation of the deployed
models or analytical solutions. Document the updates, changes, and improvements
made during the iteration process. Keep track of different versions of the models or
solutions to ensure traceability and reproducibility.
8. Stakeholder Communication: Communicate the results, improvements, and changes to
stakeholders and relevant teams. Share the impact and value delivered by the deployed
models or solutions. Provide regular updates on the performance, lessons learned, and
future plans for further enhancements.
9. Ethical Considerations: Ensure that the deployed models or solutions comply with ethical
guidelines, privacy regulations, and any legal or regulatory requirements. Regularly
assess the ethical implications of the deployed solutions and make adjustments as
necessary.
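As a minimal sketch of the deployment steps above, in particular exposing a model through an API for real-time predictions, the example below serves a saved scikit-learn model with Flask. The model file, feature names, and route are assumptions for illustration, not a production-ready deployment.

```python
# Sketch: serving a saved model behind a simple prediction API with Flask.
# The model file and expected feature names are illustrative assumptions.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("churn_model.joblib")        # model trained and saved earlier
FEATURES = ["age", "income", "tenure_months"]    # assumed input features

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                 # e.g. {"age": 42, "income": 55000, ...}
    row = pd.DataFrame([payload], columns=FEATURES)
    probability = float(model.predict_proba(row)[0, 1])
    return jsonify({"churn_probability": probability})

if __name__ == "__main__":
    app.run(port=8000)
```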