DATA SCIENCE FOR MANAGERS
Data Science for Managers DATA SOCIETY © 2023
Who we are
Data Society’s mission is to integrate Big Data
and machine learning best practices across
entire teams and empower professionals to
identify new insights.
We provide:
• High-quality data science training programs
• Customized executive workshops
• Custom software solutions and consulting
services
Since 2014, we’ve worked with thousands of
professionals to make their data work for them.
About the course
● Instructor introduction
● Schedule:
○ 4 sessions
○ 11 am – 2 pm
○ 1 or 2 short breaks each session
Best practices for virtual learning
1. Find a quiet place, free of as many distractions as possible.
Headphones are recommended.
2. Remove or silence alerts from cell phones, e-mail pop-ups,
etc.
3. Participate in activities and ask questions.
4. Give your honest feedback so we can troubleshoot problems
and improve the course.
Class materials
You should have received the following
materials:
• Slides
• Participant guide
- Needed during class
- Contains activities, a data science
glossary, information about popular
data science tools, and more!
Polling question
How would you rate your current data literacy level on a scale of 0-10?
0: No knowledge or awareness; may read articles or documents that contain percentages
5: Has combined data from multiple sources; has made basic charts/graphs; understands data limitations
10: Works with Big Data; has deployed machine learning
Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
Day 2
• Building a data-driven culture
• Data tools
• Data teams
• The data science process
• Putting together a project
Day 3
• Foundational data science methods
• Advanced data science methods
Day 4
• Data visualization
• Misleading statistics & visual distortions
• Data storytelling
Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
Questions for this section:
• What is data and why should we use it?
• How can data be used in ways that bring value?
What is data?
Merriam Webster
Data in our daily life
● Using data to make informed decisions isn’t just for business; we also do it in
our day-to-day personal lives.
● Let’s look at some examples of how data is collected and used routinely:
Checking reviews before buying a new product
Using a fitness tracker to measure your heart rate and calories burned and to track
your progress
Tracking productivity during the day using an application on your phone or
laptop
Comparing two car rental companies to find the best deal
Types of data
Structured Semi-structured Quasi-structured Unstructured
Sep 17 02:33:08.536 [debug] connection_edge_process_relay_cell(): Now seen 1802 relay cells here (command 2, stream 5845).
Sep 17 02:33:08.536 [debug] connection_edge_process_relay_cell(): circ deliver_window now 933.
Sources of data
• HR (performance data, salary/compensation, hiring, 360 view, etc.)
• Network data (application logs, webserver logs, firewall alert logs, e-mails, etc.)
• Contracts/proposals/procurement
• ERPs (Enterprise Resource Planning systems) - Oracle, SAP, etc.
• CRMs (Customer Relationship Management) - Salesforce, HubSpot, etc.
• Webserver
• Clickstream
External sources of data
● Publicly-accessible APIs
○ e.g., api.data.gov
● Other open data sources
○ e.g., data.worldbank.org
○ Large businesses (e.g., Walmart, Best Buy, Trip Advisor, Expedia, Google,
and Spotify) are increasingly giving people access to their data
○ Data is sometimes available for purchase (e.g., weather data)
What is big data?
● “Big data” refers to a large volume of data that can be mined for
information and used in machine learning projects and other analytics
applications.
● Characteristics of big data include:
o High volume. Typically, the size of big data is described in
terabytes, petabytes, even exabytes!
o High velocity. Big data flows from sources at a rapid and
continuous pace.
o High variety. Big data comes in different formats from
heterogeneous sources.
Why use data?
Data may be collected, retained,
and used for several reasons:
● Compliance: avoiding penalties
● Automation: economic
efficiencies
● Analytics: insights
What can using data do?
1. Find a needle in a haystack
2. Prioritize work for high impact
3. Provide early warning / detection
4. Speed up decisions
5. Optimize resources
6. Enable experiments
Find a needle in a haystack
• Stanford is using satellite imagery and
predictive analytics to estimate
consumption expenditures and asset
wealth.
• This could transform efforts to track
and target poverty in developing
countries with existing, public data.
http://sustain.stanford.edu/predicting-poverty/
Prioritize work for high impact
http://urbanspatialanalysis.com/portfolio/proof-of-concept-using-predictive-modeling-to-prioritize-building-inspections/
● Consultants in Philadelphia developed a
model for prioritizing building inspections
based on a location’s:
o Distance to nearby vacant properties
o Distance to certain crimes
o Distance to infestation reports
● Benefits could include generating better
daily inspection routes or providing more
information to inspectors on existing
routes.
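One way to picture such a model is as a score that rises as a property gets closer to each risk factor. The sketch below is only an illustration under stated assumptions: the weights, the 100-meter scaling, and the sample properties are hypothetical and are not taken from the Philadelphia proof of concept.

```python
from dataclasses import dataclass


@dataclass
class Property:
    address: str
    dist_vacant_m: float        # distance to nearest vacant property (meters)
    dist_crime_m: float         # distance to nearest relevant crime report
    dist_infestation_m: float   # distance to nearest infestation report


# Hypothetical weights; a real model would estimate these from past inspections.
WEIGHTS = {"vacant": 0.5, "crime": 0.3, "infestation": 0.2}


def risk_score(p: Property) -> float:
    """Return a score between 0 and 1; shorter distances give higher scores."""
    return (
        WEIGHTS["vacant"] / (1 + p.dist_vacant_m / 100)
        + WEIGHTS["crime"] / (1 + p.dist_crime_m / 100)
        + WEIGHTS["infestation"] / (1 + p.dist_infestation_m / 100)
    )


def prioritize(properties):
    """Order properties so the highest-scoring ones are inspected first."""
    return sorted(properties, key=risk_score, reverse=True)


# Hypothetical sample data.
sample = [
    Property("A St", 500, 800, 900),
    Property("B St", 20, 50, 100),
    Property("C St", 200, 300, 150),
]
ranked = [p.address for p in prioritize(sample)]
print(ranked)
```

The ranked list could then feed a daily inspection route or be shown to inspectors alongside their existing route.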
Provide early warning / detection
• When individuals and groups are planning criminal activity, they often signal their
intentions online via open-source social media.
• Tactical Institute uses cognitive analytics to monitor social channels 24x7, analyze
billions of comments and posts, home in on threats, and identify perpetrators before
they can act.
• They then provide real-time notification of threats issued so that clients can take pre-
emptive action before the threat is executed.
https://www.ibm.com/case-studies/tactical-institute
Speed up decisions
• Recruiting chatbots are used to automate the communication
between recruiters and candidates. They are useful:
○ when there is a high number of applicants
○ to ensure that similar questions are asked of all candidates
○ for answering frequently asked questions effectively
• JobAI is a German recruiting chatbot. Their platform offers
jobseekers the opportunity to contact companies, inform
themselves, and apply via familiar messenger apps such as
WhatsApp and Telegram.
https://jobai.de/
Optimize resources
• BNY Mellon developed and deployed more than 220 automated computer
programs in 2016 and 2017.
• These “bots” carry out repetitive tasks such as formatting requests for dollar
funds transfers and responding to data requests from external auditors.
• The bank estimates that its funds transfer bots alone are saving it $300,000
annually.
• Bots that reply to information requests on financial statements from auditors
enabled the bank to cut down its response time to 24 hours from 6 to 10
business days
https://www.reuters.com/article/us-bony-mellon-technology-ai-idUSKBN186253
Enable experiments
• The NYC government reduced the
number of people who fail to appear
(FTA) in court using data to evaluate
options.
• A one-time redesign of the court summons
corresponded to a 13% drop
in FTAs.
• When paired with a text message
costing $0.0075 per message, there
was a 36% decrease.
https://www.sciencemag.org/news/2020/10/new-york-city-uses-nudges-reduce-missed-court-dates
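To get a feel for the scale of these numbers, the arithmetic can be worked through for a made-up caseload. The 13% and 36% reductions and the $0.0075 message cost come from the slide; the caseload, the baseline failure-to-appear rate, and the assumption of one message per summons are hypothetical, and the percentages are treated here as relative reductions.

```python
# Hypothetical inputs (not from the NYC study).
SUMMONSES = 10_000
BASELINE_FTA_RATE = 0.40

# Figures quoted on the slide.
REDESIGN_DROP = 0.13             # redesigned summons only
REDESIGN_PLUS_TEXT_DROP = 0.36   # redesigned summons plus text message
TEXT_COST_USD = 0.0075           # cost per text message


def missed_dates(relative_drop: float) -> float:
    """Expected missed court dates after applying a relative reduction."""
    return SUMMONSES * BASELINE_FTA_RATE * (1 - relative_drop)


baseline = missed_dates(0.0)
with_redesign = missed_dates(REDESIGN_DROP)
with_both = missed_dates(REDESIGN_PLUS_TEXT_DROP)

# Assumes one text message per summons.
total_text_cost = SUMMONSES * TEXT_COST_USD
avoided = baseline - with_both
cost_per_avoided_fta = total_text_cost / avoided

print(f"Baseline missed dates:        {baseline:,.0f}")
print(f"With redesign:                {with_redesign:,.0f}")
print(f"With redesign + text message: {with_both:,.0f}")
print(f"Total text message cost:      ${total_text_cost:,.2f}")
print(f"Text cost per avoided FTA:    ${cost_per_avoided_fta:.3f}")
```

Under these assumptions, the text messages cost a few cents per avoided missed court date, which illustrates why low-cost experiments of this kind can be attractive.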
Polling question
The most relevant use of data for my
organization is:
● Finding a needle in a haystack
● Prioritizing work for high impact
● Speeding up decisions
● Optimizing resources
● Enabling experiments
● Providing early warning / detection
Chat question
What hurdles might you face trying to
implement a data analytics project in
your organization?
Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
Questions for this section:
• What is data analytics and how can it be used?
• What are the principles of data science?
What is data analytics?
• Data analytics focuses on
processing and performing
statistical analysis on existing
datasets.
• Analysts capture, process, and
organize data to uncover
actionable insights for current
problems and establish the
best way to present this data.
Chat question
How do you use data analytics within
your organization currently?
Data analytics maturity model
[Figure: Gartner analytics maturity model. Axes: Value (vertical) and Difficulty (horizontal), running from Information to Optimization and from Hindsight through Insight to Foresight. Stages, in increasing order: Descriptive Analytics (What happened?), Diagnostic Analytics (Why did it happen?), Predictive Analytics (What will happen?), Prescriptive Analytics (How can we make it happen?), Cognitive/Artificial Intelligence (How can we learn to do this at scale, continuously?). Label: Realm of data science.]
Model revisited
Realm of data science
Descriptive Analytics: What happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What will happen?
Prescriptive Analytics: How can we make it happen?
Cognitive/Artificial Intelligence: How can we learn to do this at scale, continuously?
Stage 1: descriptive analytics
What questions does it answer? What has happened in the past?
How valuable is it? Provides some value, but doesn’t establish causation or support prediction.
How labor intensive is it? Easy to deploy, provided you have the right data.
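As a small illustration of descriptive analytics, the sketch below summarizes what has already happened using only Python's standard library. The monthly request counts are hypothetical sample data.

```python
from statistics import mean, median

# Hypothetical sample data: service requests received per month.
monthly_requests = {
    "Jan": 120, "Feb": 135, "Mar": 128,
    "Apr": 160, "May": 152, "Jun": 171,
}

values = list(monthly_requests.values())

# Each entry describes the past; none explains why or predicts what comes next.
summary = {
    "total": sum(values),
    "mean": mean(values),
    "median": median(values),
    "busiest_month": max(monthly_requests, key=monthly_requests.get),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```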
Stage 2: diagnostic analytics
What questions does it answer? Why did something happen in the past?
How valuable is it? Provides insights into a particular problem and can help you identify some root causes of past trends and behaviors.
How labor intensive is it? Requires detailed data, but doesn’t have to be overly intensive.
Stage 3: predictive analytics
What questions does it answer? What is likely to happen?
How valuable is it? Identifies trends / behaviors that are likely to happen.
How labor intensive is it? Requires detailed data and may require a moderate to high level of computing power, depending on the method and the amount of data.
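One of the simplest predictive methods is to fit a straight-line trend to past values and extend it one step forward. The sketch below does this with ordinary least squares written out by hand; the monthly figures are hypothetical, and real predictive work would normally use more data, more variables, and a check of how well the model fits.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample data: requests received in months 1 through 6.
months = [1, 2, 3, 4, 5, 6]
requests = [120, 135, 128, 160, 152, 171]

slope, intercept = fit_line(months, requests)

# Extend the fitted trend to month 7.
forecast_month_7 = slope * 7 + intercept
print(f"Trend: about {slope:.1f} more requests per month")
print(f"Forecast for month 7: about {forecast_month_7:.0f} requests")
```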
Stages 4, 5: prescriptive analytics, AI
What questions does it answer? What action should I take next?
How valuable is it? Provides recommendations for future actions.
How labor intensive is it? Requires a lot of detailed data, as well as data from other external sources that will impact the model; very labor intensive.
Example: fighting human trafficking
• Polaris has made a connection
between massage parlors and
human trafficking.
• Once they find one owner of an
illicit massage business by tracing
business records, they often find
that he owns several more
businesses in the area.
• They are now able to use data to
identify illicit activities and alert
law enforcement.
https://www.datanami.com/2016/10/07/data-analytics-fight-human-trafficking/
Polling question
What type of analytics is demonstrated when
Polaris uses data to identify possible illicit
activities and alert law enforcement?
• Descriptive
• Diagnostic
• Predictive
• Prescriptive
How do we move forward?
• To reach the realm of data science,
organizations require:
o quality data
o an innovative environment
o resources, with the requisite
knowledge and technical skillsets
to use them
Break
Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
Questions for this section:
• What is data governance? Why is it important?
• What do Federal managers need to know about data governance?
Quality data is “clean”
Clean data is:
• Valid
• Accurate
• Consistent
• Complete
• Uniform
Clean data is not:
• Corrupt
• Incorrect
• Duplicate
• Incomplete
• Wrongly formatted
Acquiring quality data is hard
2019 O’Reilly survey of more than 1,900 leaders and data professionals
https://www.oreilly.com/radar/the-state-of-data-quality-in-2020/
Controlling data
● Data is arguably the most important asset that organizations have.
● Controlling it through data governance practices and processes
helps to ensure that data is usable, accessible, and protected.
● Effective data governance helps to avoid data inconsistencies
and errors in data, plays a role in the organization’s ability to
comply with laws and regulations, and increases access to data.
What is data governance?
● Data governance is a
collection of practices and
processes that help to ensure
the formal management of
data assets within an
organization.
● It encompasses the complete
life cycle of IT investment, from
strategic planning to the day-
to-day operations of the IT
function.
What is data governance?
Within each process, data governance is concerned with:
• Awareness and communication
• Policies, standards and procedures
• Tools and automation
• Skills and expertise
• Responsibility and accountability
• Goal setting and measurement
Why is data governance important?
● Regulatory compliance – with increased regulation comes
compliance that needs to be implemented and followed
● Reduce risk – effective data governance enhances data security
and privacy
● Improve processes – when everyone follows the same standards,
projects and management become more efficient
Data governance principles
A data governance program should be:
● Sustainable – it survives beyond the initial implementation
● Embedded – data governance should be present in all processes
related to data
● Measured – there should be some defined metrics to help demonstrate
value to the organization
Data governance strategy
A data governance program might be documented using:
• Charters
• Implementation roadmaps
• Operating frameworks / accountabilities
• Plans for operational success
Data governance models
Centralized
One overarching data governance
organization applies to all sectors.
Replicated
Each data governance section is
repeated across departments but
may have multiple governing bodies.
Federated
An overarching data governance
organization works with multiple
departments to maintain
consistency.
Poll question: data governance
Which governance model do you think is suitable for your
organization?
● Centralized
● Replicated
● Federated
● None of the above
Poll question: data governance
After purchasing three companies, an organization is interested
in ensuring high-quality data across the enterprise. Which
data governance model will probably best support that
goal?
● Centralized
● Replicated
● Federated
● None of the above
Data governance: process maturity
Level: Description
0, Non-existent: Complete lack of any recognizable processes; have not even recognized that there is an issue to be addressed
1, Initial / ad hoc: Enterprise has recognized that the issues exist and need to be addressed but there are no standardized processes (only ad hoc approaches); overall approach to management is disorganized
2, Repeatable but intuitive: Processes have developed to the stage where similar procedures are followed by different people undertaking the same task; no formal training or communication; high degree of reliance on the knowledge of individuals and, therefore, errors are likely
3, Defined: Procedures have been standardized, documented, and communicated; it’s mandated that these processes should be followed; however, it is unlikely that deviations will be detected
4, Managed and measurable: Management monitors and measures compliance with procedures and takes action where processes appear not to be working effectively; processes are under constant improvement and provide good practice; automation and tools are used in a limited or fragmented way
5, Optimized: Processes have been refined to a level of best practice, based on the results of continuous improvement and maturity modelling with other enterprises; IT is used in an integrated way to automate the workflow, providing tools to improve quality and effectiveness, making the enterprise quick to adapt
What do leaders need to know?
● Target maturity levels would be expected to vary for individual IT
processes, IT infrastructure, and industry characteristics
● Differences in maturity come from factors such as the risks facing the
enterprise and the contribution of processes to value generation and
service delivery
● It does not make sense to be at level 5 for every IT process because the
benefits could not justify the costs of achieving and maintaining that
level
Activity: evaluate yourself!
● Turn to your participant guide to the Data governance
assessment, which begins on page 4, to see how far along you
and your team are in the data governance cycle.
● You’ll measure the foundational components, such as
awareness, formalization, and metadata, as well as the project
components of stewardship, data quality, and master data
policies.
● Then, assess your progress and set goals for where you want your
team to be.
Promoting good governance
● Catalog the data “owned” by your team or office. Which elements are critical?
● Define roles and responsibilities:
○ Data owners are accountable for the state of the data.
○ Data stewards make sure that data policies and standards are adhered to and
stay abreast of changes.
● Develop standardized data definitions and educate stakeholders on them.
● Implement preventative and detective controls to improve data quality.
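A detective control can be as simple as an automated check that flags records breaking agreed data standards. The sketch below is a minimal illustration; the field names, the date format rule, and the sample records are hypothetical, and a real control would follow your organization's own data definitions.

```python
import re
from collections import Counter

# Hypothetical data standard: required fields and an ISO date format.
REQUIRED_FIELDS = ("employee_id", "email", "hire_date")
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def find_issues(records):
    """Return (record index, description) pairs for every rule violation."""
    issues = []
    id_counts = Counter(r.get("employee_id") for r in records)
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                issues.append((index, f"missing {field}"))
        hire_date = record.get("hire_date")
        if hire_date and not DATE_PATTERN.match(hire_date):
            issues.append((index, "hire_date not in YYYY-MM-DD format"))
        employee_id = record.get("employee_id")
        if employee_id and id_counts[employee_id] > 1:
            issues.append((index, "duplicate employee_id"))
    return issues


# Hypothetical sample records.
records = [
    {"employee_id": "E1", "email": "a@example.gov", "hire_date": "2020-01-15"},
    {"employee_id": "E2", "email": "", "hire_date": "03/02/2021"},
    {"employee_id": "E1", "email": "c@example.gov", "hire_date": "2019-07-01"},
]

issues = find_issues(records)
for index, description in issues:
    print(f"record {index}: {description}")
```

A preventative control would apply the same rules at the point of entry, rejecting a record before it is stored instead of reporting it afterwards.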
Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
Questions for this section:
• What are data ethics?
• What does a Federal manager need to know about data ethics?
What is data ethics?
Data ethics is a newer branch of ethics that studies and evaluates moral problems related
to:
● Data (including generation, recording, curation, processing, dissemination, sharing, and use)
● Algorithms (including artificial intelligence, artificial agents, machine learning, and robots)
● Corresponding practices (including responsible innovation, programming, hacking, and
professional codes)
Source: University of Oxford
Why data ethics?
● Data science offers huge opportunities, but those opportunities are accompanied by complex
ethical challenges. Data ethics aims:
○ To formulate and support morally good solutions (e.g., right conduct or right
values)
○ To maximize the value of data science for our societies, for all of us, and for our
environments
The best single thing you can do to further data ethics is to talk about data
ethics!
Source: University of Oxford
FDS: Data Ethics Framework
● In December 2020, GSA published a Data Ethics Framework.
● The Framework’s purpose is to guide federal leaders and data users as
they make ethical decisions when acquiring, managing, and using data
to support their agency’s mission.
https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf
Federal Data Ethics Tenets
Uphold Applicable Statutes, Regulations, Professional Practices, and Ethical Standards
Respect the Public, Individuals, and Communities
Respect Privacy and Confidentiality
Act with Honesty, Integrity, and Humility
Hold Oneself and Others Accountable
Promote Transparency
Stay Informed of Developments in the Fields of Data Management and Data Science
Existing frameworks
● O’Reilly’s 5 Cs: consent, clarity, consistency,
control, consequences
● UK Government Data Ethics Framework
1. Start with clear user need and public benefit.
2. Be aware of relevant legislation and codes of practice.
3. Use data that is proportionate to the user need.
4. Understand the limitations of the data.
5. Ensure robust practices and work within your skillset.
6. Make your work transparent and be accountable.
7. Embed data use responsibly.
● GDPR regulations developed in Europe to help
individuals control their data
Data Society guidelines
1. Ownership: Who owns the data? Do you have the right to collect the data?
2. History: How long can you store the data?
3. Privacy: Who controls access to the data?
4. Uses: What kinds of inferences can you make?
5. Math: How do you prevent machine learning algorithms from learning the biases of the past?
Understanding how the math works is imperative for ethical data science!
Activity: data ethics
• Turn to page 9 of your participant guide to the Data
ethics activity.
• Read the scenario excerpted from the Data Ethics
Framework and answer the questions that follow.
End of Day 1
Data and its uses
Data analytics overview
Data governance
Data ethics
DATA SCIENCE FOR MANAGERS
Day 2
Recap
• To reach the realm of data science,
organizations require:
o quality data
o an innovative environment
o resources, with the requisite
knowledge and technical skillsets
to use them
Agenda
Day 2
● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project
Questions for this section:
• What is a data-driven culture?
• How can a Federal manager encourage data-driven practices?
Chat question
What does it mean to be
data driven?
What is a data-driven culture?
● A data-driven culture incorporates data and analysis into its business
decisions, systems, and processes.
● It can be separated into two main categories:
○ Data infrastructure
○ Data literacy
[Figure: 2x2 matrix with axes "Data infrastructure" and "Data literacy" and quadrants Data Nascent, Data Prepared, Data Literate, and Data Driven]
Data infrastructure
DATA ACCESS: Can staff access data easily and in a timely manner?
DATA STORAGE: Is the data stored securely with a backup?
DATA COLLECTION: Is data collected in a timely and clean way?
Data literacy
DATA LEADERSHIP: Do executives champion data usage?
DATA GOVERNANCE: Are staff aware of data standards and practices?
DATA KNOWLEDGE: Do staff understand how to ask questions of data?
Why is it important to be data driven?
• Identify trends. Trends can inform effective practices, help you become aware of issues,
and illuminate possible innovations or solutions.
• Reduce bias. Decisions based on data are far more reliable than those based on
instinct, assumptions, or perceptions.
• Benchmark performance. Benchmarking allows staff to connect their actions to
business results, which will reveal new opportunities for improvement.
A study from the MIT Center for Digital Business found that organizations driven most by
data-based decision making had 4% higher productivity rates and 6% higher profits.
Example: Walmart
• Walmart executives wanted to know
what items to stock before Hurricane
Frances in 2004.
• Analysts mined a terabyte of purchase
history from other Walmart stores under
similar conditions.
• Turns out, in times of natural disasters,
Americans want strawberry Pop-Tarts
and beer! Stores were stocked
accordingly.
Image: Walmart Corporate, via Flickr
Example: IRM
● Milan needed to replace its slow
computers.
● By pulling and analyzing data on
computer read/write speeds and
hard drive usage, IRM was able to
change the purchase order specs.
● Over $50,000 was saved by
eliminating unnecessary
requirements.
Activity: Are you data driven?
Turn to page 10 of your
participant guide to the Data-
driven culture assessment to
evaluate your team.
How to encourage data-
driven thinking & innovation
Step 1: Create data-driven guidelines
● All the data in the world may be irrelevant if it is
not used correctly.
● Hence, organizations need to know
how to extract information and
knowledge from their data.
● They must incorporate data by
developing objectives and laying out
a broad roadmap for the data.
Step 2: Invest in data infrastructure and strategy
● Determine the space required to
manage data for your organization
and develop systems to support data
collection, storage, and analysis.
● Collaborate with the IT department to
establish databases and install
software for data reporting,
modeling, and analysis.
● Use the right tools to perform data
analysis.
Step 3: Encourage careful and comprehensive methods
of data collection
● Create policies for gathering data
● Establish practices to measure
success
● Discuss the importance of collecting
data records in the future
● Meet with other managers or
department leaders to communicate
individual data collection methods
Step 4: Streamline the data collection process
● Every department gathers relevant and
valuable data, so a central repository
for all the collected data is
recommended.
● Data analysts evaluate the data and
provide understandable analytical
reports and insights back to the head of
each department.
● Each team can turn the outcomes from
these insights into actions, execute
them in their domain, and share results
with other departments/teams.
Step 5: Improve and maintain the data quality
● Data quality is just as crucial as data
quantity.
● If you do not have new data in your
repository, you might be looking at
outdated data and a distorted picture of reality.
● Keep collecting more relevant, new data.
● Use data mining and software tools to
clean and maintain the quality of the
data automatically.
Step 6: Train your team
● Educate employees with the right
data-related skills and knowledge.
● Plan or suggest official training
sessions (like this one!) on data
literacy and information analysis.
● Have your team complete a
tutorial/training when the
organization adopts new
software or a new database system.
Step 7: Share insights and knowledge
● Encourage your team by sharing data
directly relevant to them
● Show the value and impact of data by
preparing information about the team's
performance and sharing it with them
during quarterly and yearly reviews.
● It is crucial to ask the team how they
encountered a conflict, analyzed it, and
decided on the resolution. Doing so gives your
data team a deeper understanding of
the data.
Step 8: Applaud your team
● Identify a successful analytics
project / team and highlight
their success through a
newsletter, event, or lunch and
learn.
● Recognize the right things—
including when mistakes move
you to another level.
Recap: 8 actionable steps to establish a data-driven culture
Here is the checklist of the steps:
✔ Create data-driven guidelines
✔ Invest in data infrastructure and strategy
✔ Encourage careful and comprehensive methods of data collection
✔ Streamline the data collection process
✔ Improve and maintain the data quality
✔ Train your team
✔ Share insights and knowledge
✔ Applaud your team
Chat question
● Which idea(s) that we’ve discussed could you implement in
the near term? What specifically would you do?
● What challenges do you expect to face when
implementing those ideas? How will you overcome them?
Data solutions
Break
Agenda
Day 2
● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project
Questions for this section:
• What types of tools do data scientists use to do their work?
Data tools
● There’s no shortage of tools in
the data analytics space.
● There are different tools for
different functions, but most
overlap in their offerings.
Storage tools
• Databases
o Relational
o Non-relational (NoSQL)
• Data warehouses & data lakes
[Figure: example table with columns y1, x1, x2, x3 and rows (A, F, X, P), (B, G, Y, Q), (C, H, Z, W)]
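To make the relational versus non-relational distinction concrete, the sketch below stores the small example table from this slide (columns y1, x1, x2, x3) in SQLite, a relational database bundled with Python, and then shows one record in a document style of the kind many NoSQL stores use. The table name and the document layout are assumptions made for illustration only.

```python
import json
import sqlite3

# Relational: fixed columns, every row has the same shape, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (y1 TEXT, x1 TEXT, x2 TEXT, x3 TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?)",
    [("A", "F", "X", "P"), ("B", "G", "Y", "Q"), ("C", "H", "Z", "W")],
)
row = conn.execute(
    "SELECT x1, x2, x3 FROM observations WHERE y1 = ?", ("B",)
).fetchone()
conn.close()
print("Relational lookup:", row)

# Non-relational (document style): each record carries its own structure,
# so fields can be nested or differ from one record to the next.
document = json.dumps({"y1": "B", "details": {"x1": "G", "tags": ["Y", "Q"]}})
loaded = json.loads(document)
print("Document lookup:", loaded["details"]["tags"])
```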
Cleaning tools
• Data cleaning is the process of
preparing data for analysis by
removing or modifying data
that is incorrect, incomplete,
irrelevant, duplicated, or
improperly formatted.
• Example tools: Drake,
OpenRefine, DataWrangler,
Data Cleaner, Winpure Data
Cleaning Tool
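The cleaning steps named above can be sketched in a few lines of plain Python. This is an illustration only, not how any of the listed tools works internally; the field names, the rule that an email must contain "@", and the sample rows are all hypothetical.

```python
def clean(rows):
    """Standardize formatting, then drop incomplete, invalid, and duplicate rows."""
    cleaned = []
    seen_emails = set()
    for row in rows:
        # Fix improper formatting: trim whitespace and normalize letter case.
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        # Remove incomplete or incorrect rows.
        if not name or "@" not in email:
            continue
        # Remove duplicates, using the normalized email as the key.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical sample data.
raw = [
    {"name": "  ada lovelace ", "email": "ADA@Example.org"},
    {"name": "Ada Lovelace", "email": "ada@example.org"},       # duplicate
    {"name": "", "email": "nobody@example.org"},                # incomplete
    {"name": "Grace Hopper", "email": "grace-at-example.org"},  # incorrect
    {"name": "alan turing", "email": "alan@example.org "},
]

result = clean(raw)
for record in result:
    print(record)
```

Note that formatting is standardized before duplicates are checked; otherwise the first two rows would not be recognized as the same person.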
Analysis tools
• Analysis tools make it easier to sort
through data in order to identify
patterns, trends, relationships,
correlations, and anomalies that
would otherwise be difficult to detect.
• Tools known to be in use at State:
o Excel
o R
o Python
o SAS
o WordSmith
Visualization tools
• Visualization gives a visual or graphical
representation of data/concepts.
• Tools known to be in use at State:
o Excel
o Power BI
o Tableau
o R and RStudio
o Python
o MicroStrategy
Collaboration tools
• Collaboration tools offer version
control, workflow, bug tracking, task
management, etc.
• Example tools: Git, GitHub
Other technologies
● Several other technologies enable and support data analytics, including:
○ Application programming interfaces (APIs). An API is a computing interface that
allows two applications to talk to each other. Using APIs can speed up data
acquisition.
○ Graphics processing units (GPUs). A GPU is an electronic circuit specially
designed to process graphics such as images and video. Because GPUs perform many
calculations in parallel, they can also help speed up other computation.
○ Cloud computing. Cloud computing offers fast and flexible servers, storage,
databases, networking, software, analytics, and intelligence over the Internet.
Questions to guide tool selection
1. What types of technologies are needed for working with data at various stages of the
data pipeline?
2. How do the different tools and technologies compare in their functionality, strengths,
and weaknesses?
3. Do you have staff who know how to use particular tools or who can be trained to use them?
4. Do you have budget constraints you need to be mindful of?
5. Is the tool on the approved software list?
Chat question
Let’s imagine you want to create
data visualizations for an upcoming
report.
What tool will you use? Why?
Agenda
Day 2
● Building a data-driven culture
● Data tools
● Data teams
○ Who is on the data team?
○ How do data teams fit within an organization?
● The data science process
● Putting together a project
Data Analyst
● Ensures that collected data is
relevant and exhaustive while
also interpreting the analytics
results
● Main role and responsibilities
include:
○ Wrangling the data
○ Managing the data
○ Creating basic analyses
and visualizations
● Core skills include: SQL, R /
Python, Tableau / Power BI
Data Scientist
● Builds upon the analysts’ data work
to develop predictive models and
complex algorithms
● Main role and responsibilities include:
○ Asking the right questions of
the data
○ Building more complex
predictive models
○ Interpreting the results critically
and communicating them well
● Core skills include: R, Python,
Spark, Hadoop
Data Engineer
● Develops the infrastructure to house
the data and maintains the structural
components
● Main role and responsibilities:
○ Ensuring data integrity across
different data sources
○ Building out additional data
warehouses as needed
○ Maintaining data pipelines and
access
● Core skills include: AWS,
MongoDB, MySQL, Hadoop, C++,
Azure
MLOps Engineer
● Aims to deploy and maintain
machine learning systems in
production reliably and efficiently
● Main role and responsibilities:
○ Requirements engineering
○ System design
○ Implementation and testing
○ Maintenance, support,
troubleshooting, etc.
● Core skills include: distributed
computing principles, networking,
database architecture
Data Science Manager
● Oversees and directs data science
teams and projects and is the bridge
between data and non-data people
● Main role and key responsibilities
include:
○ Planning out people and
resources for projects
○ Communicating results to
executives and stakeholders
○ Running the data science teams
● Core skills include: management
experience, programming skills (R /
Python / SQL), strong communication
Team structures
[Diagram: three ways to organize data teams, with the labels shown in each]
● Centralized structure: Data Team, Business Units
● Decentralized structure: Business Units
● Hybrid structure: Data analysts, Business analysts, Data Team, Business Units
Pros and cons
Centralized structure
+ easier to standardize team processes
- harder to coordinate projects to meet strategic goals
Decentralized structure
+ easier to coordinate projects to meet strategic goals
- leads to inconsistent & redundant data usage across organization
Hybrid structure
+ easier to standardize team processes
+ easier to coordinate projects to meet strategic goals
Polling question
Which best describes the structure of the data teams in your organization?
○ Centralized
○ Decentralized
○ Hybrid
Another option…
Contracting a team
Strengths
● Flexible cost structure can adapt to changing budgets
● Easy to change staff if people don't work out
● Quickly add staff with new skills
Weaknesses
● Internal know-how is not built up
● Data science does not become an endemic capability
● The organization becomes dependent on forces outside of its control
Hiring a team
Strengths
● Data science becomes an endemic capability: better decision making becomes part of the DNA
● Internal know-how is developed and sustained: the analytics capability has a strong foundation
Weaknesses
● State-of-the-art capabilities may still need to be brought in from the outside ("rented")
● Organizational challenge: data science must remain impartial to internal dynamics
Polling question
What would be the best option for your organization?
○ Contracting a team
○ Hiring a team
What are the key factor(s) in making that decision?
○ Recruitment / training time
○ Cost
○ Internal know-how
○ Flexibility
○ Not depending on outside forces
Break
Agenda
Day 2
● Building a data-driven culture
● Data tools
● Data teams
● The data science process
○ What are the six stages of the typical data science process?
● Putting together a project
Typical data science process
1. Ask: What problem(s) do we need to solve?
2. Research: What data do we need and how do we get it?
3. Model: Which method(s) are appropriate to use?
4. Validate: Do the model and assumptions work as expected?
5. Test: How does the model generalize to real-world data?
6. Interpret: How can we use the conclusions in the real world?
Ask: What problem(s) do we need to solve?
● The business and data teams should work together to develop a question that is
specific, measurable, and objective.
● Domain knowledge comes into play.
Examples
Weaker: How can I make my policies more effective?
Stronger: Which 3 policies have demonstrated the best results, and did they have anything in common?
Weaker: We’ll use an indicator that shows the most improvement.
Stronger: We’ll use the calculated ROI and the percent difference in desired behaviors from before and after.
Research: What data do we need and how do we get it?
● The data team, with input from the business, gathers information about the data
needed to get a relevant answer.
● Is it already collected, or is time needed to get it? What format is it in?
Examples
Weaker: I’m sure we have the data somewhere.
Stronger: We’ll use the datasets from the policy report that can be found in X repository.
Weaker: I’m sure the data is good enough as is.
Stronger: Where can I read about how the data was collected and how the metrics are defined?
Model: Which method(s) are appropriate to use?
Validate: Do the model and assumptions work as expected?
Test: How does the model generalize to real-world data?
● Models take questions and
provide answers and outputs.
● The methods chosen by the
data team are based on the
questions asked and the type(s)
of data that you have.
● Multiple iterations are required
to ensure the model works well.
Interpret: How can we use the conclusions in the real world?
● The data team looks at what the results are telling them—not what they were
expecting the results to be.
● They present the data and make recommendations based on the data, their domain
knowledge, and stakeholder needs.
Example
Weaker: I’ll put the results in the same format as I usually do.
Stronger: How can I best convey the results that matter most to my end users?
Chat question
Which part of the data science
process do you think data teams
spend the most time on? Why?
Chat question
As a manager, which part of the
data science process do you
think you should spend the most
time on? Why?
Agenda
Day 2
● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project
○ How do I identify feasible and impactful data projects?
Planning a data project
• A successful, comprehensive data project involves far more than
programming.
• It requires careful planning and a great deal of
communication.
• In this section, we will practice planning an impactful and feasible
project.
Planning a data project
In previous classes, participants have tackled projects such as:
• Using data to prioritize the distribution of COVID vaccines to Department
employees, family members, and members of the diplomatic community.
• Improving equity in the post bidding process using data.
• Determining the impact of bilateral engagement (e.g., meetings and trips)
on a country’s child abduction indicators, according to the data.
• Proving, with data, that counternarcotics funding in Colombia has reduced
cocaine consumption.
Activity: brainstorm ideas
• Turn to page 13 of your participant guide to the Project
brainstorm activity.
• Identify 3-5 ideas for leveraging data in your
workplace. Then, assess their feasibility and impact.
End of Day 2
Building a data-driven culture
Data tools
Data teams
The data science process
Putting together a project
DATA SCIENCE FOR MANAGERS
Day 3
World’s Smartest Home
Agenda
Day 3
• Foundational data science methods
o What are the basics of machine learning?
o What is clustering and how is it used?
o What is classification and how is it used?
o What is regression and how is it used?
• Advanced data science methods
Why learn about these methods?
1. To develop a common vocabulary with the data science
team
2. To direct data science projects and make recommendations
3. To understand what options are available for finding new
insights and becoming more efficient
What’s an algorithm?
What is machine learning?
● Machine learning uses algorithms to find patterns in massive amounts of data and
predict future results with minimal human intervention.
● It powers many of the services we use today:
o recommendation systems like those on Netflix
o search engines like Google
o social-media feeds like Facebook and Twitter
o voice assistants like Siri and Alexa
● Most machine learning is categorized as either supervised or unsupervised.
Supervised learning
● You have input variables (x) and an output variable (Y) and you use an
algorithm to learn the mapping function from the input to the output.
● The goal is to approximate the mapping function so well that when you have
new input data (x), you can predict the output variables (Y) for that data.
● Requires labeled data (i.e., data tagged with one or more labels identifying
certain properties, characteristics, or classifications)
● Example: emails are classified as spam/not spam based on how their features
compare to the features of emails that a human “Marked as Spam.”
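A minimal sketch of the spam example, assuming a very simple word-count rule rather than any particular production algorithm. The emails, labels, and the `train` and `predict` helpers are hypothetical; the point is only that the mapping from input (x) to output (Y) is learned from labeled examples.

```python
from collections import Counter

# Hypothetical labeled training data: (email text, label a human assigned).
labeled_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the budget report", "not spam"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label new text by the class in which its words were seen more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["not spam"][w] for w in words)
    return "spam" if spam_score > ham_score else "not spam"

model = train(labeled_emails)
print(predict(model, "claim your free prize"))
print(predict(model, "budget meeting on monday"))
```

Without the human-assigned labels there would be nothing to count against, which is why supervised learning requires labeled data.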
Unsupervised learning
● You only have input data (x) and no corresponding output variables.
● The goal is to model the underlying structure or distribution in the data in
order to learn more about the data.
● In other words, the machine looks for whatever patterns it can find.
● Example: for marketing purposes, finding groups of customers with similar
behavior given a large database of customer data containing their
properties and past buying record
Polling question
The goal of this type of machine learning is to model
the underlying structure or distribution in the data in
order to learn more about the data.
Do you think this statement describes supervised
machine learning or unsupervised machine learning?
Polling question
The Stanford Dogs Dataset contains 20,580 images. Each image
is categorized into 1 of 120 different dog breed categories.
Based on the information provided, is this dataset suitable for
use with supervised machine learning techniques?
Before we go further…
● Remember that most data
science projects combine a
few methods to extract the full
picture.
● The two big components that
drive the decision for which
method to use are: the
question you’re asking, and the
data you have.
Clustering
Clustering
● Clustering is a type of unsupervised
machine learning.
● You find similarities between data
points and create groups (clusters)
based on those similarities.
● It tries to find whether there is a
relationship between the data points
when the classes are unknown.
How can you use clustering?
● Clustering answers the questions:
1. Who/what is this person/object similar to?
2. Is there a hidden pattern in the data that we can't see?
3. Are there groups of data with similar attributes?
● Domain knowledge is key!
o If we know that certain policies are more effective, we can model
new policies on similar metrics.
o If we have projects with similar objectives and outcomes, we can
consolidate ones that overlap to streamline progress.
Example: credit line optimization
● GE Capital created a model to predict customer behavior and
offer tailored products.
● The clusters were defined using existing GE Capital data, based
on days delinquent, monthly spend, and percent of credit line
used.
● Led to more targeted marketing and specific offers to those groups.
[Figure: customers grouped into clusters numbered 1 to 5]
Types of clustering
● Centroid-based - iterative algorithms (e.g., k-means) that assign each data point to the
nearest cluster center, treating proximity as similarity
● Hierarchical - builds a tree of nested clusters by repeatedly merging (or splitting) the
closest groups, on the assumption that closer data points are more similar
● Density-based - searches the dataset for regions where data points are densely packed
and forms clusters from those regions
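To make the centroid approach concrete, here is a minimal k-means sketch. The customer data is hypothetical, and the way the starting centroids are picked (evenly spaced points from the list) is a simplifying assumption; real libraries use more careful initialization.

```python
import math

def k_means(points, k, iterations=10):
    """Very small k-means: returns (centroids, cluster label for each point)."""
    # Simplistic start (an assumption): pick k evenly spaced points as centroids.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical customers: (monthly spend, credit line used), both on a 0-10 scale.
customers = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 9.0), (8.5, 9.5), (9.0, 8.8)]
centroids, labels = k_means(customers, k=2)
print(labels)
```

Note that nobody told the algorithm which customers belong together; the two groups emerge from proximity alone.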
Evaluating the accuracy of the model
● Goal of clustering is to maximize the
separation between clusters and
minimize the distance within clusters
● The ratio of inter-cluster variance to
total variance can help you assess
the performance of algorithms,
although this is dependent on the
model you use
Variation explained by clusters = inter-cluster variance / total variance
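A sketch of this ratio for one-dimensional data, assuming "inter-cluster variance" is measured as the between-cluster sum of squares and "total variance" as the total sum of squares. The spend values and cluster names are hypothetical.

```python
from statistics import mean

def variation_explained(values, labels):
    """Between-cluster sum of squares divided by total sum of squares (1-D data)."""
    overall = mean(values)
    total = sum((v - overall) ** 2 for v in values)
    between = 0.0
    for cluster in set(labels):
        members = [v for v, label in zip(values, labels) if label == cluster]
        between += len(members) * (mean(members) - overall) ** 2
    return between / total

# Hypothetical monthly spend values and the cluster each one was assigned to.
spend = [10, 12, 11, 50, 52, 51]
clusters = ["low", "low", "low", "high", "high", "high"]
ratio = variation_explained(spend, clusters)
print(round(ratio, 3))
```

A ratio close to 1 means the clusters are well separated and tight; a ratio close to 0 means the grouping explains little of the spread in the data.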
Evaluating the accuracy of the model
● A scree plot shows how much each
component contributes to the explained
variance of the model.
● Good for identifying important
components of a model
Questions managers should ask
1. How was the distance measure chosen?
2. Did you scale the data appropriately?
3. How many clusters do you expect or want? Why?
4. Does your algorithm scale to the size of the data?
5. What can we learn from the groups that the algorithm identified?
Common pitfalls with clustering
● Many clustering algorithms don’t scale well to
large or high-dimensional datasets
○ “Curse of dimensionality” – as the number of
dimensions increases, the data points
become sparse and the distances between points
become nearly equal, so similarity is harder to measure
● Different data types need to be
formatted correctly (e.g., mixing
categorical data with numerical data
may not be the best way to find similar
points).
● Make sure you use the right clustering
model for the data!
Recap: when should you use clustering?
● Use clustering when:
1. You have an unlabeled dataset
2. The dataset has multiple attributes
3. You need to identify patterns in your data
4. You need to find groups in your data
Classification
Classification
● Classification is a type of supervised
machine learning.
● It is the process of assigning new
data points to known classes.
● The assignment is done based on the
similarity of new data points to
existing data points with known class
assignment (category or behavior
pattern).
How can you use classification?
Classification answers the questions:
1. What is the probability of an object / person being in a particular group?
2. What category is this person / object in?
3. What is this person / object most similar to?
Domain knowledge is key!
○ If we know that certain policies are most likely to be successful, we can predict if
new policies will also be successful
○ If we see behavioral outcomes based on certain decisions, we can predict similar
behaviors
Example: predicting pregnancy
● In 2002, Target implemented data
analytics to analyze buying patterns
in customers.
● New parents often get bombarded
with advertising offers, so Target
wanted a way to anticipate who is
expecting in order to get ahead of
the competition.
● They were able to predict pregnancy
of their customers based upon their
purchases and sent out targeted
coupons.
Source: http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0
Chat question
Time out! What ethical implications might
Target’s pregnancy predictions have
raised?
Common classifiers
● k-Nearest Neighbors (KNN) – assumes that similar things exist in close
proximity; classifies a data point based on how its neighbors are
classified
● Decision trees – use a tree-like graph or model of decisions and their
possible consequences to classify data
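The KNN idea can be sketched in a few lines. The training data, the labels, and the choice of k = 3 are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (hours of training, years of experience) -> outcome.
training = [
    ((1, 1), "needs support"), ((2, 1), "needs support"), ((1, 2), "needs support"),
    ((8, 9), "independent"), ((9, 8), "independent"), ((9, 9), "independent"),
]
print(knn_predict(training, (2, 2)))
print(knn_predict(training, (8, 8)))
```

Because the result depends entirely on distance, the questions about the distance measure and about scaling the data matter a great deal for this classifier.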
Common classifiers
● Support vector machines – separate data points by class using an optimal
hyperplane
● Logistic regression – estimates the probability that a data point belongs
to a certain class
Evaluating accuracy of a model
● In order to determine the accuracy of
the model, you need to split your
data into a training set and a test set.
● Then, compare the outcomes that
the model predicts for the test set to the actual
outcomes to determine how
accurate your model is, and how well
it generalizes to new data.
● The table that tallies predicted outcomes against
actual outcomes is called a confusion matrix.
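A minimal sketch of building a confusion matrix from test-set results. The labels and predictions below are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair on the test set."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set labels and the model's predictions for the same emails.
actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
predicted = ["spam", "spam", "ham",  "ham", "ham", "spam", "ham", "spam"]

matrix = confusion_matrix(actual, predicted)
true_pos = matrix[("spam", "spam")]    # spam correctly flagged
false_neg = matrix[("spam", "ham")]    # spam the model missed
false_pos = matrix[("ham", "spam")]    # good email wrongly flagged
true_neg = matrix[("ham", "ham")]      # good email correctly passed
accuracy = (true_pos + true_neg) / len(actual)
print(true_pos, false_neg, false_pos, true_neg, accuracy)
```

The four counts often matter more than the single accuracy number, because the two kinds of error rarely cost the same.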
Accuracy, cont’d.
● Next, you can plot the ROC (receiver operating characteristic) curve, which is the
true positive rate against the false positive rate at different thresholds.
● Another metric is the AUC (area under the ROC curve), which summarizes the curve
in a single number and can be used to compare the predictive accuracy of
classification models. The AUC should be above .5 to say the model is better than
a random guess.
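One way to see what the AUC measures: it equals the share of (positive, negative) pairs in which the model gives the positive case the higher score. The sketch below computes it that way; the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs the model ranks correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels (1 = positive) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc(labels, scores))
```

A model that scores every case the same gets exactly 0.5 under this definition, which is why 0.5 is the "random guess" baseline.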
Questions managers should ask
1. How was the distance measure chosen?
2. Did you scale the data appropriately?
3. How did you split the test and training data?
4. What thresholds did you use for AUC and ROC?
Recap: when should you use classification?
● Use classification when:
1. You have a labeled dataset
2. You want to predict group assignments
3. You want to predict behaviors / events
4. You want to identify important attributes
Regression
Regression
● Regression is a type of supervised
machine learning.
● It predicts the value of a variable
based on the value of another
variable or several variables.
● It’s used to examine and calculate
the relationship between a variable
of interest (dependent variable) and
one or more explanatory variables
(predictors or independent
variables).
How can you use regression?
● Regression answers the questions:
1. Which factors matter most?
2. Which can we ignore?
3. How do those factors interact with each other?
4. How certain are we about all of these factors?
● Domain knowledge is key!
○ We can predict political instability in countries
○ We can predict how tourism season affects a country’s economy
Types of regression techniques
Regression:
● # of independent variables
● Shape of the regression line
● Type of dependent variable
Use case: predicting city movements
● There are over 500 bike-sharing
programs around the world with over
500,000 bikes.
● Automated systems track numerous
data points providing a treasure trove
of data about the mobility of
residents.
● Data can be used to forecast the
number of bikes required and adjust
pricing based on demand.
Chat question
What factors do you think might drive
demand for bike-share use?
Simple linear regression
1. Gather data on variables in question
2. Plot the data
3. Draw the line to best fit the data
4. Evaluate model performance
• Measure error
• Deal with outliers
• Determine accuracy
y = mx + b
Number of bike users = 2.6 * (Temperature) + 37.6
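"Drawing the line to best fit the data" usually means ordinary least squares. A minimal sketch follows; the observations are hypothetical and are not the data behind the equation above, so the fitted numbers differ slightly.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope m, intercept b)."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical observations: daily temperature and number of bike users.
temperature = [10, 15, 20, 25, 30]
bike_users = [64, 76, 90, 103, 115]

m, b = fit_line(temperature, bike_users)
print(f"Number of bike users = {m:.2f} * (Temperature) + {b:.1f}")
```

The slope reads as "each additional degree is associated with about m more riders," and the intercept is the predicted ridership at a temperature of zero.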
Measure error
● Variance. How widely dispersed is actual data from the expected data?
● Randomness. Are the errors random or is there bias in the model?
● Standard deviation/Certainty. What proportion of data points fall within a given
range? How likely is a value to be in that range?
Deal with outliers
● Just one outlier can have a very negative impact on a linear regression if it is not
identified and handled properly.
● Methods such as scatterplots, box-and-whisker plots, and Cook’s distance can be
used to identify outliers.
Determine accuracy
● Look at:
○ Covariance: measures how changes in one variable are associated with changes in another variable
○ Correlation: identifies the strength and direction of the relationship between the variables
○ p-values: the probability of seeing a pattern at least this strong through random chance
alone, if there were no real relationship between the variables
● R² measures how well a regression model fits the data. It’s the proportion of variance in the
outcome variable that's accounted for by the regression
○ e.g., “about 40% of the variance in the number of bike users is explained by the
temperature”
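A sketch of how correlation and R² can be computed for a simple linear regression. The data is hypothetical, and the line used (slope 2.58, intercept 38.0) is assumed to be the least-squares fit to that data.

```python
import math

# Hypothetical observations and an assumed fitted line for them.
temperature = [10, 15, 20, 25, 30]
bike_users = [64, 76, 90, 103, 115]
slope, intercept = 2.58, 38.0

x_mean = sum(temperature) / len(temperature)
y_mean = sum(bike_users) / len(bike_users)

# Correlation: covariance scaled by the spread of each variable.
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(temperature, bike_users))
sxx = sum((x - x_mean) ** 2 for x in temperature)
syy = sum((y - y_mean) ** 2 for y in bike_users)
correlation = sxy / math.sqrt(sxx * syy)

# R²: 1 minus (unexplained variation / total variation).
predicted = [slope * x + intercept for x in temperature]
ss_residual = sum((y - p) ** 2 for y, p in zip(bike_users, predicted))
r_squared = 1 - ss_residual / syy

print(round(correlation, 4), round(r_squared, 4))
```

With a single predictor and a least-squares line, R² is simply the correlation squared; with several predictors the two are no longer interchangeable.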
Multiple linear regression
● Has more than one independent variable
○ e.g., How do several variables (temperature, humidity, day of the week, time of
day) affect demand for bikes?
● Added concerns:
○ Multicollinearity: when 2 or more independent variables are strongly correlated with
one another, you may be effectively double counting an effect
○ Autocorrelation: when values of the same variable are correlated with one another
across related observations (e.g., consecutive time periods)
○ Heteroskedasticity: when the variability of a variable is unequal across the range of
values of a second variable that predicts it
Other types of regression
● Nonlinear Regression
● Binary Logistic Regression
● Ordinal Logistic Regression
● Nominal Logistic Regression
● Ridge Regression
● Lasso Regression
● Partial Least Squares Regression
● Polynomial Regression
● Logistic Regression
● Quantile Regression
● Elastic Net Regression
● Principal Components Regression
● Support Vector Regression
● Ordinal Regression
● Poisson Regression
● Negative Binomial Regression
● Ecological Regression
● Bayesian Regression
● Jackknife Regression
Questions managers should ask
1. How well do we understand the underlying data distribution?
2. Did you identify any outliers? Were they significant? Did you remove them?
3. Did you test the variables for multicollinearity so as not to double-count their effects?
4. What was the R² value?
Recap: when should you use regression?
● Use regression when:
1. You have a labeled dataset
2. You want to predict trends
3. You want to anticipate needs or shortages
Polling question
Do you think the decision tree shown below depicts a
classification method?
Polling question
Would you use clustering, classification,
or regression to anticipate which
candidate a person would vote for?
How Machines Learn
Break
Agenda
Day 3
• Foundational data science methods
• Advanced data science methods
o What is text mining and how is it used?
o What is graph analysis and how is it used?
o What are neural networks and how are they used?
Text Mining
Text mining
● Text mining employs methods from
various fields including mathematics,
statistics, computational linguistics,
and programming.
● It’s the process of getting insightful
and valuable information out of text
data.
● Includes entity extraction, document
classification, and sentiment analysis.
Text mining branches
Text mining:
● Entity extraction
● Document classification
● Sentiment analysis
Entity extraction
● Use entity extraction when you want to get an overview of the themes and topics in
documents.
● Measure word frequency and word co-occurrences.
Document classification
● Use document classification
when you want to sort
through documents and
identify groups of similar
articles.
● Based on similarity of topics /
other metrics
Sentiment analysis
● Use sentiment analysis when you want
to understand the emotions and
overtones of documents.
● Use reference dictionaries to identify
positive / negative words.
● Natural language processing (a
similar branch) doesn’t focus
specifically on sentiment, but rather
on the meaning of the document.
What events might have driven the trends in
emotion depicted above?
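A minimal sketch of dictionary-based sentiment scoring. The two word lists are hypothetical stand-ins for a real reference dictionary, which would be far larger.

```python
# Hypothetical reference dictionaries; real lexicons contain thousands of words.
POSITIVE = {"good", "great", "helpful", "love"}
NEGATIVE = {"bad", "slow", "confusing", "hate"}

def sentiment_score(text):
    """Positive word count minus negative word count."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("The new dashboard is great and helpful."))
print(sentiment_score("The portal is slow, confusing, and bad."))
print(sentiment_score("Oh great, another outage."))
```

The last sentence scores as positive even though a human would read it as sarcastic, which illustrates why dictionaries alone miss the nuances of language.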
Text mining process
Scrape / collect → Clean & organize → Analyze → Visualize
Index  Word   Freq  %
A      Apple  5     20
B      Book   7     28
C      Cat    13    52
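The word-frequency table above can be reproduced with a simple count. The word list here is hypothetical, built so that the counts match the table.

```python
from collections import Counter

# Hypothetical cleaned text, constructed so the counts match the table above.
words = ["apple"] * 5 + ["book"] * 7 + ["cat"] * 13

counts = Counter(words)
total = sum(counts.values())
# Each row: (word, frequency, percent of all words).
table = [(word, n, round(100 * n / total)) for word, n in sorted(counts.items())]
for word, n, pct in table:
    print(f"{word.title():<6} {n:>3} {pct:>3}")
```

In a real project most of the effort goes into the earlier steps (collecting and cleaning the text) so that counts like these are meaningful.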
Evaluating accuracy of our model
● This is a tricky subject!
● Text analysis and text mining rely on other methods that we’ve
introduced in this class, such as clustering and classification. You’ll need
to use the evaluation methods for those particular models.
● In terms of sanity-checking the text mining process, look for unhelpful
stop words (frequent words that don’t provide additional information)
and see if the topics generally make sense.
Common pitfalls with text mining
● Cleaning text is extremely messy and
time consuming – this is a key problem
in text mining projects.
● Existing dictionaries are not a
panacea for catching the nuances of
language – typically, there need to be
manual additions of other words.
● Choosing the right methods and metrics to
classify and cluster documents
correctly can be difficult.
Questions managers should ask
1. How does the model take sarcasm / irony / colloquialisms into account?
2. Is there an existing library of reference words that can assist you in text
mining?
3. Does that reference library include misspellings, alternate versions of
words, symbols, different parts of speech, or compound terms?
4. How do the topics change over time?
Graph analysis
Graph analysis
● Graph analysis (also known as network
analysis) seeks to find patterns within a
network, a set of points connected by lines
that represent connections.
● Networks can represent organizational
relationships; communications patterns;
economic relationships; environmental
relationships; connections based on interests,
preferences and similarities; as well as
geographic relationships.
Example: IBM & a volcano
● In April 2010 a volcano in Iceland halted flights throughout Europe.
● IBM's internal analytics software alerted the team that IBM's supply chain link most
relevant to the eruption was in Hong Kong – not Europe!
● The software showed that when flights resumed after the eruption was over, IBM would
need to quickly move a backlog of components from Asian manufacturers to
European customers. A bottleneck in Hong Kong would result.
● IBM booked additional space on commercial flights to help transport the backlog.
Source: Big Data Driven Supply Chain Management by Nada R. Sanders
Types of graph analysis
Graph analysis:
● Community detection
● Centrality metrics
● Social media
Community detection
● Use community detection
when you want to dive into
your network to find new
communities and groups.
● Identifies groups of individuals /
nodes that belong together;
can detect latent connections
and communities.
Centrality metrics
● Use centrality metrics when you want to look at an overview of a network and identify
key nodes.
● Identifies the most important nodes, most central nodes, shortest paths, etc.
[Figure: This email network shows how a company communicates. Labeled nodes: Marketing department, CFO, Supply chain department, Finance department]
Social media
● Use social media analysis
when you are working with
data from social
media platforms.
● Identifies how an idea
travels across social
media platforms and
how individuals are
connected.
Ways to measure networks
Metric Purpose
# of nodes How many participants are included in the network?
# of edges How many connections exist in a network?
Distance How long does it take for information to travel through a network?
Degree (in-, out-) Direction of connections, is someone a follower or an opinion leader?
Degree centrality How many other people/objects can someone/something reach?
Closeness centrality On average, how quickly can someone/something reach every other point in the network?
Betweenness centrality How important is someone/something as a connector to the structure of the network?
Eigenvector centrality How important is someone/something based on who/what else they are connected to?
Tie strength How strong or significant is a connection between two people/objects?
Density How sparse and fragile or inter-connected and resilient is a network?
Jaccard Index How similar or redundant are 2 people/elements of a network?
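As a rough illustration of a few metrics from the table, the sketch below computes node and edge counts, degree centrality, density, and the Jaccard index for a tiny made-up email network. The node names and connections are hypothetical, and the formulas are the common textbook definitions for an undirected network, not a method prescribed by this course.

```python
# Hypothetical undirected email network: who exchanges email with whom.
edges = [
    ("CFO", "Finance"), ("CFO", "Marketing"), ("CFO", "Supply chain"),
    ("Finance", "Supply chain"),
]

# Build an adjacency map: node -> set of direct neighbors.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

n_nodes = len(neighbors)   # number of participants
n_edges = len(edges)       # number of connections

# Degree centrality: share of the other nodes a node is directly connected to.
degree_centrality = {
    node: len(nbrs) / (n_nodes - 1) for node, nbrs in neighbors.items()
}

# Density: actual edges divided by the maximum possible number of edges.
density = n_edges / (n_nodes * (n_nodes - 1) / 2)


def jaccard(u, v):
    """Overlap between the neighbor sets of u and v (0 = none, 1 = identical)."""
    union = neighbors[u] | neighbors[v]
    return len(neighbors[u] & neighbors[v]) / len(union) if union else 0.0


most_central = max(degree_centrality, key=degree_centrality.get)
print("Nodes:", n_nodes, "Edges:", n_edges)
print("Most central node:", most_central)
print("Density:", round(density, 2))
print("Jaccard (Finance, Supply chain):", round(jaccard("Finance", "Supply chain"), 2))
```

In practice, a data team would more likely use a dedicated graph library than hand-written code; the point here is only to show what the metrics measure.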
Data Science for Managers DATA SOCIETY © 2023 190
Evaluating accuracy of our model
● This is a tricky subject!
● Graph analysis relies on other methods that we’ve introduced in this
class, such as clustering and classification. You’ll need to use the
evaluation methods for those particular models.
● In terms of sanity-checking the process, look at how the nodes are
accounted for in each community and determine what threshold
makes the most sense for your analysis.
Data Science for Managers DATA SOCIETY © 2023 191
Questions managers should ask
● What aspect of the relationship are you most interested in (i.e., who is
the most connected, who has the strongest connections, who is most
important)?
● Does the data you’re using capture most of the relationship?
How much is reflected in the numbers versus never collected?
● What metrics did you use to evaluate the proximity between nodes /
communities?
Data Science for Managers DATA SOCIETY © 2023 192
Neural networks
Data Science for Managers DATA SOCIETY © 2023 193
Activity: field trip
● Visit https://quickdraw.withgoogle.com/
● Click the “Let’s Draw!” button and play a round (6 drawings).
● At the end of the round, visit the data to see why guesses were
made. Also, make a note of how many of your drawings were
guessed correctly.
Note: A clickable link is available on page
16 of the participant guide.
Data Science for Managers DATA SOCIETY © 2023 194
Neural networks
● A neural network is born ignorant and
builds on itself to get smarter and
smarter.
● It starts out with a guess, and then
tries to make better guesses as it
learns from its mistakes.
● Neural networks can tackle the same
kinds of problems as the methods
we’ve reviewed previously. In theory,
you can apply them to almost any
of those problems!
Data Science for Managers DATA SOCIETY © 2023 195
Intuition: neural networks
● Neural networks are made up of perceptrons.
● A simple perceptron has 3 layers:
○ Input: observations that enter the model
○ Hidden layer: composed of an activation function that
derives the output based on inputs and other factors
○ Output: target variable you want to predict
● Once the output is produced, the model measures the
error, then walks it back over the model to adjust its
performance and reduce errors.
A perceptron acts like a neuron.
[Figure: diagram with Input, Hidden, and Output layers]
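The "guess, measure the error, adjust" loop can be sketched with a single perceptron. This is a minimal illustration under stated assumptions: the training data (a logical AND), the learning rate, and the number of passes are all made up for the example, and real neural networks combine many units and use a more general error-correction procedure.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, learning_rate=0.1, passes=20):
    """Start from zero weights and correct them after every wrong guess."""
    weights = [0.0] * len(samples[0][0])   # the model starts out "ignorant"
    bias = 0.0
    for _ in range(passes):                # go over the data several times
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # measure the mistake
            # Nudge each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training data: two inputs and the target output (logical AND).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in samples]
print("Learned weights:", weights, "bias:", bias)
print("Predictions:", predictions)
```

After a few passes the guesses stop changing because every training example is predicted correctly, which is the "learning from mistakes" behavior described above.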
Data Science for Managers DATA SOCIETY © 2023 196
What data do you need?
1. Relevant: data must resemble the real-world data you hope to process as much as
possible.
2. Properly classified: in order for a deep-learning solution to correctly classify, a labeled
dataset is needed. If a labeled dataset is not available, someone needs to actually
apply the labels to the raw data.
3. Formatted: all data needs to be vectorized, and the vectors should be the same
length when they enter the neural net.
4. Minimum data requirement: this may vary with the complexity of the problem, but
100,000 instances in total across all categories is a good place to start.
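The "formatted" requirement can be made concrete with a small sketch. Here, hypothetical text samples are turned into lists of numbers (one integer per word) and padded with zeros so that every vector has the same length. The sample phrases, the word-to-number scheme, and the padding value are illustrative assumptions; real projects typically rely on a library's own preprocessing tools.

```python
# Hypothetical raw text samples of different lengths.
samples = ["late delivery", "great service fast delivery", "late"]

# Assign each distinct word an integer ID, reserving 0 for padding.
vocab = {}
for text in samples:
    for word in text.split():
        vocab.setdefault(word, len(vocab) + 1)

# Vectorize: replace each word with its ID.
vectors = [[vocab[word] for word in text.split()] for text in samples]

# Pad every vector with zeros up to the length of the longest one.
max_len = max(len(v) for v in vectors)
padded = [v + [0] * (max_len - len(v)) for v in vectors]

for text, row in zip(samples, padded):
    print(row, "<-", text)
```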
Data Science for Managers DATA SOCIETY © 2023 197
Neural networks: pros and cons
● Pros
○ Neural networks are highly versatile.
○ They are fairly insensitive to noise in your data.
○ They are well-equipped to handle fuzzy and convoluted
relationships.
● Cons
○ It’s a black box – those hidden layers are difficult to explain
and evaluate.
○ They are in danger of overfitting the training data, so they
might not generalize as well to new information.
○ An experienced data scientist should develop the
parameters of hidden layers and nodes.
Data Science for Managers DATA SOCIETY © 2023 198
Chat question
We started our discussion on neural networks with a drawing activity...
How many of your drawings did the neural network guess correctly?
Does that mean you are a good (or bad) artist?
Data Science for Managers DATA SOCIETY © 2023 199
Key points
● Don’t accept an analysis at face value – you need
to ask the right questions!
● Most data analyses incorporate multiple methods
in order to determine which one is the most
accurate.
● Remember! The two big components that drive the
decision for which method to use are: the question
you’re asking, and the data you have.
Data Science for Managers DATA SOCIETY © 2023 200
End of Day 3
● Foundational data science methods
● Advanced data science methods
Questions?
Data Science for Managers DATA SOCIETY © 2023 201
DATA SCIENCE FOR MANAGERS
DAY 4
Data Science for Managers DATA SOCIETY © 2023 202
Agenda
Day 4
• Data visualization
– What is data visualization?
– How do I pick and design visuals?
• Misleading statistics & visual distortions
• Data storytelling
Data Science for Managers DATA SOCIETY © 2023 203
What is data visualization?
• Data visualization is any attempt
to make data more easily
digestible by rendering it in a
visual context.
• Common data visualizations
include tables, charts, graphs,
and dashboards.
Data Science for Managers DATA SOCIETY © 2023 204
Explore or explain
● We can use data visualization to review new data to discover patterns, to spot
anomalies, to test hypotheses, and to check assumptions.
● We can also use data visualization to transform raw data into a compelling story or
takeaway for an external audience.
Data Science for Managers DATA SOCIETY © 2023 205
Chat questions
● What types of data visualization does your
organization produce?
● What improvements would you like to see in
the visualizations created or used by your
organization? Why?
Data Science for Managers DATA SOCIETY © 2023 206
Getting it right
Using visualizations
incorrectly can cause you
to lose your audience, lose
the value in your data,
and ultimately lead to
poor decision making.
Data Science for Managers DATA SOCIETY © 2023 207
Example: The Challenger
• On January 27, 1986, concerned engineers presented data and the following charts to
try to illustrate the damage that cold temperatures could do to the O-rings of the
Challenger space shuttle.
Source: Presidential Commission on the Space Shuttle Challenger Accident, vol. 5 (Washington, DC: US Government Printing Office, 1986), pp. 895-896.
Data Science for Managers DATA SOCIETY © 2023 208
Example: The Challenger
• On January 28, 1986, the Challenger space shuttle exploded 73 seconds after
takeoff.
• Data visualization legend Edward Tufte argues that the shuttle’s engineers
failed to communicate dangers because their data wasn't presented in
an easily digestible form.
The chart below shows O-ring damage on the y-axis and temperature on the x-axis.
Is it easier to see the issue?
https://vizdatar.wordpress.com/2015/05/06/space-shuttle-challenger-explosion-2/
Data Science for Managers DATA SOCIETY © 2023 209
Using appropriate visuals
• We’ll start by talking about how to
pick and design the right visual for
your purpose.
• Then, we’ll discuss common mistakes.
• Later, we’ll talk about how to avoid
being misled by visualizations.
Data Science for Managers DATA SOCIETY © 2023 210
To get started with data viz
1. Know your audience and understand
how it processes visual information.
(Who)
2. Determine what you’re trying to
visualize and what kind of information
you want to communicate. (What)
3. Choose a type of visual that conveys
the information in the best and
simplest form for your audience.
(How)
Data Science for Managers DATA SOCIETY © 2023 211
Who
• Know your audience and understand how it processes visual information.
• Consider audience familiarity. For example:
– High-level executives are generally well-versed in visual data, so use a variety of
methods to stand out
– Less-experienced audiences will want it kept simple (e.g., pie charts, bar graphs, and
word maps)
• Consider how the visualization will be used by the audience:
– Is it for executives to use to make decisions?
– Is it to inform the public?
Data Science for Managers DATA SOCIETY © 2023 212
Chat: who is the audience?
https://www.oreilly.com/library/view/learning-highcharts-4/9781783287451/ch08s04.html
http://www.mensfitness.com/nutrition/what-to-eat/mens-fitness-food-pyramid
Data Science for Managers DATA SOCIETY © 2023 213
What
• Determine what you’re trying to visualize and what kind of information
you want to communicate.
• Remember, the audience only knows as much as you tell them:
– Do you want them to explore the data on their own? (exploratory analysis)
– Do you want to tell a specific story about the data? (explanatory purposes)
• If the message is explanatory, consider:
– What type of data do you have on which to base the analysis?
– What are the audience’s topmost concerns or requirements?
– What decisions can be made based on the results you provide?
Data Science for Managers DATA SOCIETY © 2023 214
How
• Choose a type of visual that conveys the information in the best and
simplest form for your audience.
• The type of visual you use depends primarily on two things:
1. the data you want to communicate
2. what you want to convey about that data
• Then, choose the visual that will be easiest for your audience to read.
– Aim for them to “get it” in 30 seconds or less.
Data Science for Managers DATA SOCIETY © 2023 215
Just a few numbers
• Don’t overcomplicate!
• Simple text works well when there is just a number or two to share.
Examples:
…we spent only $75,000 of our $125,000 budget…
…therefore, it is not surprising that only 29 percent of the applications were accepted…
…product A ($12.99) was much more affordable than product B ($59.99)…
Data Science for Managers DATA SOCIETY © 2023 216
Unique data
• Don’t overcomplicate!
• Tables are great when communicating to a mixed audience who will look to a
particular row of interest or when you need to show different units of measure.
Product                 Weight     Price
Toaster (UK)            1.05 kg    £17.49
Toaster (US)            3.13 lbs.  $29.99
Toaster (South Africa)  1.07 kg    R239,00
Data Science for Managers DATA SOCIETY © 2023 217
Common messages
• Comparison: Evaluate and compare values between two or more data points
• Composition: Understand how individual parts make up a whole
• Distribution: Combine comparison and composition
• Relationships: Represent the correlation or connection between 2+ variables
Data Science for Managers DATA SOCIETY © 2023 218
What if it’s more complicated?
• A dashboard is a collection of visual reports that display important
metrics and KPIs, usually in real-time.
Data visualization
• a visual representation of your data, such as a chart
• can be static or dynamic
• typically shows data for a single metric, such as electricity usage
Data dashboard
• a collection of data visualizations assembled into a single, unified view
• might display data visualizations for electricity usage, energy costs, CO2 emissions,
and peak/off peak use
Data Science for Managers DATA SOCIETY © 2023 219
Sample dashboard
Data Science for Managers DATA SOCIETY © 2023 220
Designing compelling visuals
• Picking the right chart type isn't
enough.
• There are choices to be made
about the elements you include
and how they are formatted.
• Data visualization is an art,
informed by science.
Data Science for Managers DATA SOCIETY © 2023 221
Visual design theory
• Our eyes “load” information while the brain
“processes” it.
• We give the most attention to what looks good
and struggle when our working memory is
overwhelmed.
• For information to be effective, it should not
provide more data than what the human brain
can process.
Data Science for Managers DATA SOCIETY © 2023 222
Example: buying oranges
• You want to buy oranges at a new supermarket.
• Our eyes scan the layout of the supermarket, while the brain processes the various
sections.
• The brain then instructs the eyes to zone in on the fruit section by sending signals about
how fruits look from memory.
• The eyes then break the entire scanned area into parts and scan each part to spot the
fruit section.
• The process is repeated until oranges are located.
Data Science for Managers DATA SOCIETY © 2023 223
Designing compelling visuals
• Our eyes and brains work the same way with data visualizations as they
did in the oranges example.
– Use visual cues to make data visualizations easier for the audience.
• However, every piece of information in a visualization also creates
cognitive load on the viewer, asking them to use their brain power to
process it.
– Reduce visual clutter to lower the cognitive load and help transmit the message.
Data Science for Managers DATA SOCIETY © 2023 224
Theory
• The visual design tips we’ll review today draw on theory such as:
– the building blocks of visual design described by the Interaction Design Foundation
– the four categories of preattentive visual attributes described in Colin Ware’s book,
Information Visualization: Perception for Design
– the Gestalt Principles of visual perception, which describe how people group similar
elements, recognize patterns, and simplify complex images when they perceive objects
Data Science for Managers DATA SOCIETY © 2023 225
Tip 1: Make position meaningful
Data should be sorted and placed in the visual in a meaningful way.
The left chart is sorted alphabetically; the right by value.
When would you use one over the other?
Data Science for Managers DATA SOCIETY © 2023 226
Tip 2: Group related items
• Things that are closer appear to be more related than those that are spaced farther
apart.
• In fact, proximity overrules the similarity of other factors (e.g., shape, color).
Data Science for Managers DATA SOCIETY © 2023 227
Tip 3: Distinguish different items
• The mind groups together things that look similar and assumes they have the
same function.
• We can use this principle for:
– distinguishing different sections
– differentiating links from regular text
– showing that elements with certain characteristics serve one purpose and
others serve a different one
[Figure: two rows of the letters A through P]
Data Science for Managers DATA SOCIETY © 2023 228
Tip 4: Use natural positioning
• People tend to start at the top left of the visual and scan in zigzag
motions across the page, forming a Z-pattern.
• Aim to position elements in a way that
will feel natural for users to consume.
• Also, remember that the top of the
page is the most valuable space.
Data Science for Managers DATA SOCIETY © 2023 229
Tip 5: Use labels and legends
● Labels can be used to show the value
of a data point.
● Legends can be used to identify the
size, color, or any other distinguishing
feature in the visual.
The labels and legends used
in the bottom chart make it
easier to understand.
Data Science for Managers DATA SOCIETY © 2023 230
Tip 6: Use size to show importance
• Relative size represents relative importance.
• Visuals of almost equal importance should be sized similarly.
• If there’s one really important thing, it must be BIG.
Resizing the “Next
Page” button
deemphasizes its
importance.
Data Science for Managers DATA SOCIETY © 2023 231
Tip 7: Use color to grab attention
• Color is another powerful tool used to draw the audience's attention.
• However, the following must be kept in mind:
– Use it sparingly: too much variety prevents anything from standing out
– Use it consistently: a color change can be used to visually reinforce change in topic or
tone
Too many colors are used in the image on the left, making it
difficult to identify which are the busiest months.
[Figure: two versions of the same chart, shown side by side]
Data Science for Managers DATA SOCIETY © 2023 232
Tip 8: Use color to evoke emotion
• Color evokes emotion, so choose the one that helps reinforce the emotion you want to
arouse in your audience.
– Warm colors represent energy
– Cool colors represent calmness
Data Science for Managers DATA SOCIETY © 2023 233
Tip 9: Encode data with color
• Use color schemes to encode data as sequential, diverging, or categorical.
– Sequential: when the order matters
– Diverging: to highlight minimums, maximums, and midpoints
– Categorical: for discrete data values representing distinct categories
Data Science for Managers DATA SOCIETY © 2023 234
Sequential color schemes
• Use a sequential color scheme when
the order matters.
• These schemes range between two
colors—usually a lighter shade to a
darker one—by varying one or more
parameters such as saturation.
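To show what "varying one parameter" can look like, the sketch below builds a light-to-dark ramp of a single hue by changing only the lightness. The hue, saturation, and lightness end points are arbitrary choices for illustration, and the `sequential_palette` helper is a hypothetical name, not part of any visualization tool mentioned in this course.

```python
import colorsys


def sequential_palette(hue, steps, lightest=0.9, darkest=0.3, saturation=0.6):
    """Return `steps` hex colors of one hue, running from light to dark."""
    colors = []
    for i in range(steps):
        # Interpolate lightness evenly between the two end points.
        frac = i / (steps - 1) if steps > 1 else 0.0
        lightness = lightest + (darkest - lightest) * frac
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


palette = sequential_palette(hue=0.6, steps=5)  # 0.6 is a blue hue
print(palette)
```

Lower values would be drawn with the first (lightest) color and higher values with the last (darkest), so the order of the data is visible at a glance.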
Data Science for Managers DATA SOCIETY © 2023 235
Diverging color schemes
• Use a diverging color scheme to
highlight minimums, maximums, and
midpoints.
• These schemes range between three or
more colors with the different colors
being quite distinct—usually having
different hues.
Data Science for Managers DATA SOCIETY © 2023 236
Categorical color schemes
• Use a categorical color
scheme for discrete data
values representing distinct
categories.
• These schemes use different
hues with consistent steps in
lightness and saturation.
Data Science for Managers DATA SOCIETY © 2023 237
Tip 10: Reduce chart clutter
Small changes can have a big effect on
a visualization’s impact.
1. Remove special effects
2. Lighten the background
3. Remove chart borders
4. Remove gridlines
5. Label data directly
6. Clean up axis titles and labels
7. Use consistent colors
Data Science for Managers DATA SOCIETY © 2023 238
Activity: analyze visualizations
● Turn to page 17 of your participant guide to find the Analyzing
visualizations activity.
● You will be asked to assess 4 visualizations. Write down your
notes.
Data Science for Managers DATA SOCIETY © 2023 239
Polling question
For each of the charts, select the best way to improve the
visual:
• Change colors
• Remove extra information
• Add more information
Which chart is the best?
Data Science for Managers DATA SOCIETY © 2023 240
Polling question
What tools have you used to visualize data?
● Google Charts
● Excel
● Tableau
● Python
● RStudio
● Power BI
Data Science for Managers DATA SOCIETY © 2023 241
Excel
● Create basic chart
types such as pie, line,
bar, scatter, and
more.
● Charts created in
Excel can easily be
ported to PowerPoint
and Word.
Data Science for Managers DATA SOCIETY © 2023 242
Google Charts
● Free to use; offers a rich
chart gallery, full
customization, controls
and dashboards, and
HTML5-based rendering
● Has more options than
Excel; create interactive,
animated and
geospatial graphics
Data Science for Managers DATA SOCIETY © 2023 243
Tableau
● Tool for creating
powerful and insightful
visuals
● No programming
required; drag and drop
● Share and collaborate
on premises or in the
cloud
● Platform can be used
department- or
organization-wide
Data Science for Managers DATA SOCIETY © 2023 244
R and RStudio
● Programming tool
● Mainly used for statistical
analysis
● Offers functions and
libraries to build
visualizations and
present data
● Open source and free
Data Science for Managers DATA SOCIETY © 2023 245
Python
● Programming tool
● You'll find libraries for
practically every data
visualization need
● Free and open source
Data Science for Managers DATA SOCIETY © 2023 246
Power BI
● Interactive
visualizations and
business intelligence
capabilities
● Simple interface
● Create dashboards
Data Science for Managers DATA SOCIETY © 2023 247
Break
Data Science for Managers DATA SOCIETY © 2023 248
Agenda
Day 4
• Data visualization
• Misleading statistics & visual distortions
– What do I look out for when reviewing statistics or visualizations?
• Data storytelling
Data Science for Managers DATA SOCIETY © 2023 249
Misleading stats & visual distortions
• Sometimes charts and statistics look
presentable but could be misleading.
• Unreliable data comparisons erode
credibility and eventually dissuade
viewers from using the analysis.
Data Science for Managers DATA SOCIETY © 2023 250
Misleading statistics
Data Science for Managers DATA SOCIETY © 2023 251
Misleading statistics
• “Bill Gates walks into a bar and everyone inside becomes a
millionaire…on average.”
• In 2011, the average income of the 7,878 households in Steubenville, Ohio,
was $46,341. But if just two people, Warren Buffett and Oprah Winfrey,
relocated to that city, the average household income in Steubenville
would rise 62 percent overnight, to $75,263 per household.
What’s wrong with these statements?
https://www.nytimes.com/2013/05/26/opinion/sunday/when-numbers-mislead.html
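The effect in both statements comes from using the average (mean) on data with extreme values. A small sketch, using made-up incomes for ten bar patrons and an assumed income for one very wealthy newcomer, shows how the mean and the median react differently. None of these figures come from the article cited above.

```python
from statistics import mean, median

# Hypothetical annual incomes (USD) of ten people in a bar.
patrons = [40_000, 45_000, 50_000, 52_000, 55_000,
           58_000, 60_000, 65_000, 70_000, 75_000]

# Assumed income for one extremely wealthy newcomer (illustrative only).
newcomer = 1_000_000_000

before_mean, before_median = mean(patrons), median(patrons)
after = patrons + [newcomer]
after_mean, after_median = mean(after), median(after)

print(f"Mean:   {before_mean:,.0f} -> {after_mean:,.0f}")
print(f"Median: {before_median:,.0f} -> {after_median:,.0f}")
```

The mean is pulled far above what anyone but the newcomer earns, while the median barely moves. For skewed data such as income, it is worth asking which of the two a reported "average" actually is.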
Data Science for Managers DATA SOCIETY © 2023 252
Misleading statistics
• Numbers don’t have to be
fabricated to be
misleading.
• Misleading statistics are the
misuse, purposeful or
not, of numerical data.
Data Science for Managers DATA SOCIETY © 2023 253
Misleading statistics
• Misleading statistics can be created through issues with:
– data collection: small sample sizes, biased sampling, loaded questions
– data processing: no/poor data normalization, ignoring important features
– data presentation: hiding context, omitting certain findings, visual distortions
Data Science for Managers DATA SOCIETY © 2023 254
How to avoid being misled?
• Do some math. Are there any obvious mistakes?
• Check the source. Is it credible and current?
• Question the methodology. Is there bias? Is the result statistically
significant?
• Conduct research. What does Google tell you?
Data Science for Managers DATA SOCIETY © 2023 255
Visual distortions
Data Science for Managers DATA SOCIETY © 2023 256
Visual distortions
• Look at the top graph.
• At first, Jamaica seems to have
half the average workdays per
employee that Suriname does.
• In reality, the difference is much
less.
What’s the difference between
the two charts?
Data Science for Managers DATA SOCIETY © 2023 257
Truncated graphs
• One of the most common
manipulations is omitting baselines or
beginning the y-axis of a graph at an
arbitrary number instead of 0.
• This creates the impression that there is
a significant difference between data
points, when in fact, there is relatively
little disparity.
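The distortion can be quantified by comparing how tall two bars are drawn when the axis starts at zero versus at an arbitrary number. The values and the baseline below are hypothetical, chosen only to illustrate the arithmetic, and `apparent_ratio` is an illustrative helper name.

```python
def apparent_ratio(larger, smaller, baseline=0.0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    return (larger - baseline) / (smaller - baseline)


# Hypothetical average workdays per employee for two countries.
country_a, country_b = 220, 200

honest = apparent_ratio(country_a, country_b, baseline=0)
truncated = apparent_ratio(country_a, country_b, baseline=180)

print(f"Axis starts at 0:   bar A looks {honest:.2f}x as tall as bar B")
print(f"Axis starts at 180: bar A looks {truncated:.2f}x as tall as bar B")
```

With these assumed numbers, a real difference of 10 percent is drawn as if one value were double the other.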
Data Science for Managers DATA SOCIETY © 2023 258
Visual distortions
What distortion has been used in these charts to change how the data appears?
Data Science for Managers DATA SOCIETY © 2023 259
Exaggerated scaling
• Exaggerating the scale of a line graph can easily minimize or maximize the change
shown.
Data Science for Managers DATA SOCIETY © 2023 260
Visual distortion
How might this chart be misleading?
https://financesonline.com/number-of-gamers-worldwide/
Data Science for Managers DATA SOCIETY © 2023 261
Ignoring convention
● Deviating from convention
(such as green is positive and
red is negative) can create
confusion and misinterpretation
of the facts.
● In this example, the axis also
moves downward, making an
increase in gamers look like a
decrease at a quick glance.
https://financesonline.com/number-of-gamers-worldwide/
Data Science for Managers DATA SOCIETY © 2023 262
Visual distortion
What do you notice about these pie charts?
Data Science for Managers DATA SOCIETY © 2023 263
Numbers don’t add up
• With pie charts, the slices must add up to the whole (100%).
When the numbers don't add up, you know there's an
issue.
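This check is easy to automate. The sketch below tests whether a set of pie-chart percentages totals 100, allowing a small tolerance for rounding. The example values and the tolerance are assumptions for illustration, and `slices_add_up` is a hypothetical helper name.

```python
def slices_add_up(percentages, tolerance=0.5):
    """Return True if the percentages total 100 within a rounding tolerance."""
    return abs(sum(percentages) - 100) <= tolerance


# Hypothetical pie-chart slices (percent of respondents).
sound_chart = [45, 30, 25]
suspect_chart = [60, 35, 30]   # totals 125, so something is off

print(slices_add_up(sound_chart), sum(sound_chart))
print(slices_add_up(suspect_chart), sum(suspect_chart))
```

A total above 100 often means respondents could choose more than one answer, in which case a pie chart is the wrong visual.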
Data Science for Managers DATA SOCIETY © 2023 264
Visual distortion
Does the S&D or the
EPP party have more
representation in
parliament?
https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Data Science for Managers DATA SOCIETY © 2023 265
3D distortion
• 3D pie charts can be used to distort and cause a misinterpretation of the data.
• The same data is represented in both charts below.
https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Data Science for Managers DATA SOCIETY © 2023 266
Visual distortion
Which company has a better sales trajectory?
Data Science for Managers DATA SOCIETY © 2023 267
Improper extraction
• Surprise! It’s the same company. One
graph showed only odd years and the
other only even.
• To align to a particular narrative, some
may choose to visualize only a portion
of the data.
• This is more common in graphs that
have time as one of their axes.
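The odd-versus-even-year trick can be reproduced with a few lines of code. The sales figures below are invented so that the effect is easy to see; they are not the data behind the charts on the slide.

```python
# Hypothetical yearly sales (in $ millions) that zigzag from year to year.
sales = {2013: 10, 2014: 20, 2015: 12, 2016: 18, 2017: 14,
         2018: 16, 2019: 16, 2020: 14, 2021: 18, 2022: 12}

# Keep only part of the data, as a selective chart would.
odd_years = {y: v for y, v in sales.items() if y % 2 == 1}
even_years = {y: v for y, v in sales.items() if y % 2 == 0}


def change(series):
    """Difference between the last and first value of a year -> value mapping."""
    years = sorted(series)
    return series[years[-1]] - series[years[0]]


print("Odd years only: ", change(odd_years))    # looks like steady growth
print("Even years only:", change(even_years))   # looks like steady decline
print("All years:      ", change(sales))
```

Asking "what does the chart look like with every data point included?" is usually enough to expose this kind of selective extraction.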
Data Science for Managers DATA SOCIETY © 2023 268
Visual distortion
What story does this visualization tell?
Data Science for Managers DATA SOCIETY © 2023 269
Correlating causation
• Data visualizations can imply causal links through the way that data is presented to the viewer.
• However, correlation does not equal causation.
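A short sketch shows how two series can be strongly correlated without one causing the other. The monthly figures are invented, and the shared driver (hot weather) is a hypothetical explanation used only for illustration. The function implements the standard Pearson correlation coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly figures: both rise in summer, neither causes the other.
ice_cream_sales = [20, 25, 40, 60, 85, 110, 120, 115, 80, 50, 30, 22]
sunburn_cases = [2, 3, 6, 10, 16, 22, 25, 23, 15, 8, 4, 2]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"Correlation: {r:.2f}")
```

A value close to 1 only says the two series move together. Plotting them on one chart would suggest a link, but here both simply follow the season.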
Data Science for Managers DATA SOCIETY © 2023 270
Recap
To avoid being misled, look for:
– misleading statistics
– truncated graphs
– exaggerated scaling
– ignored conventions
– numbers that don’t add up
– 3D distortion
– improper extraction
– correlating causation
Data Science for Managers DATA SOCIETY © 2023 271
Break
Data Science for Managers DATA SOCIETY © 2023 272
Agenda
Day 4
• Data visualization
• Misleading statistics & visual distortions
• Data storytelling
– Why are data stories useful?
– How do I craft a data story?
Data Science for Managers DATA SOCIETY © 2023 273
What is data storytelling?
● You focus on an insight and
● persuade an audience
● that the outcome of your
analysis
● demands a course of action
● through narrative and visual
communication.
Data Science for Managers DATA SOCIETY © 2023 274
Data stories and data visualizations
● A single data story may make use of multiple data visualizations.
● Data stories arrange visualizations into the linear sequence of storytelling: a beginning, a
middle, and an end.
● Data story formats will likely incorporate other elements to explain and contextualize the
visualizations:
○ prose text, either written or spoken
○ annotations, callouts, and labels
○ icons or graphics
○ images or photographs
Data Science for Managers DATA SOCIETY © 2023 275
Can’t I just use a chart?
● Narratives are super effective, “sticky” content delivery mechanisms.
● Not everyone is a statistician, but they still want to make evidence-
based decisions.
● Stories let you summarize key findings quickly.
● Stories tap into both the logical and the emotional aspects of
persuasion.
Data Science for Managers DATA SOCIETY © 2023 276
Why choose story?
If your insight is...   A story can...
Unpleasant              Help convince your audience that even unwanted results are actionable.
Disruptive              Encourage your audience to break with tradition, if the upshot is valuable enough.
Unexpected              Explain why a prediction or intuition failed, and offer some analysis and a solution.
Data Science for Managers DATA SOCIETY © 2023 277
Why choose story?, cont’d
If your insight is...   A story can...
Complex                 Guide your audience to a more complete understanding in manageable chunks.
Risky                   Embolden your audience to take responsibility for making a tough choice.
Costly                  Compel your audience to consider a high-cost solution by underscoring the high value.
Data Science for Managers DATA SOCIETY © 2023 278
How do I craft a data story?
REFINE
TAILOR
OUTLINE
PLOT
FORMAT
Data Science for Managers DATA SOCIETY © 2023 279
REFINE
Refining your insight
● In a data story, your insight is the most important piece.
● What will make your audience perceive your insight as maximally:
○ Valuable: an observation that seems to be rewarding
○ Relevant: an observation that seems timely
○ Practical: an observation that suggests a realistic and feasible course of action
○ Specific: an observation that clearly and completely accounts for a problem
● Make your insight as concrete and contextualized as possible
Data Science for Managers DATA SOCIETY © 2023 280
TAILOR
Tailoring to your audience
AUTHORITY
GOALS TIMING
EMOTION REASON
Data Science for Managers DATA SOCIETY © 2023 281
OUTLINE
Outlining
INSIGHT / OUTCOME
ACTIONS
PROBLEM
MEASURES
Data Science for Managers DATA SOCIETY © 2023 282
PLOT
Plotting with a storyboard
• It’s okay for your data story to remain flexible at
this early stage.
• There are no right answers, only consideration
and iteration.
• Focus on building the elements of the story first,
on paper.
• Try out different versions quickly and don’t get
too attached.
Data Science for Managers DATA SOCIETY © 2023 283
FORMAT
Formatting for delivery
● You may find yourself needing to alter the way you tell your data story based on
the affordances of the format.
● Sometimes the format is a given, but other times, it will depend upon your input
and the use case.
● As with visualizations, the simplest storytelling format is often the best.
● Slide Deck: Sequence of slides intended for real-time presentation
● Document: Illustrated text (report, infographic) to be read anytime
● Interactive: Digital object intended to align function with user experience
● Hybrid: Blend / compromise of at least two formats
Data Science for Managers DATA SOCIETY © 2023 284
The Joy of Stats
Data Science for Managers DATA SOCIETY © 2023 285
Chat questions
● What data storytelling
elements did you notice?
● Was this a good example of
data visualization and
storytelling? Why or why not?
Data Science for Managers DATA SOCIETY © 2023 286
End of Day 4
● Data visualization
● Misleading statistics & visual distortions
● Data storytelling
Questions?
Data Science for Managers DATA SOCIETY © 2023 287
Thank you and congratulations!
Data Science for Managers DATA SOCIETY © 2023 288