
Data Science For Managers 03.2023

The document provides information about a data science training course offered by Data Society. It includes details such as Data Society's mission to integrate big data and machine learning across teams, the course schedule and structure, best practices for virtual learning, materials that will be provided, an example polling question, and the daily agenda. The agenda covers topics like data and its uses, data analytics overview, data governance, data ethics, data tools, data teams, the data science process, data visualization, misleading statistics, and data storytelling.

Uploaded by

Prabu Narayanan

DATA SCIENCE FOR MANAGERS

Data Science for Managers DATA SOCIETY © 2023


Who we are

Data Society’s mission is to integrate Big Data and machine learning best practices across entire teams and empower professionals to identify new insights.

We provide:

• High-quality data science training programs
• Customized executive workshops
• Custom software solutions and consulting services

Since 2014, we’ve worked with thousands of professionals to make their data work for them.

About the course

● Instructor introduction

● Schedule:
○ 4 sessions
○ 11 am – 2 pm
○ 1 or 2 short breaks each session



Best practices for virtual learning

1. Find a quiet place, free of as many distractions as possible. Headphones are recommended.

2. Remove or silence alerts from cell phones, e-mail pop-ups, etc.

3. Participate in activities and ask questions.

4. Give your honest feedback so we can troubleshoot problems and improve the course.


Class materials
You should have received the following materials:

• Slides

• Participant guide
- Needed during class
- Contains activities, a data science glossary, information about popular data science tools, and more!


Polling question

How would you rate your current data literacy level on a scale of 0 to 10?

0: No knowledge or awareness; may read articles or documents that contain percentages
5: Has combined data from multiple sources; has made basic charts/graphs; understands data limitations
10: Works with Big Data; has deployed machine learning


Agenda
Day 1
• Data and its uses
• Data analytics overview
• Data governance
• Data ethics

Day 2
• Building a data-driven culture
• Data tools
• Data teams
• The data science process
• Putting together a project

Day 3
• Foundational data science methods
• Advanced data science methods

Day 4
• Data visualization
• Misleading statistics & visual distortions
• Data storytelling


Agenda
Day 1

• Data and its uses
○ What is data and why should we use it?
○ How can data be used in ways that bring value?
• Data analytics overview
• Data governance
• Data ethics


What is data?

Merriam Webster



Data in our daily life
● Using data to make informed decisions isn’t just for business; we also do it for personal reasons in our day-to-day lives.

● Let’s look at some examples of how data is collected and used routinely:

Checking reviews before buying a new product

Using a fitness tracker to measure your heart rate and calories burned and to track your progress

Tracking productivity during the day using an application on your phone or laptop

Comparing two car rental companies to find the best deal


Types of data

Structured Semi-structured Quasi-structured Unstructured


Sep 17 02:33:08.536 [debug] connection_edge_process_relay_cell(): Now seen 1802 relay cells here (command 2, stream 5845).
Sep 17 02:33:08.536 [debug] connection_edge_process_relay_cell(): circ deliver_window now 933.
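
A log excerpt like the one on this slide follows a regular pattern even though it is not stored in rows and columns. As a minimal sketch (the regular expression and the field names `timestamp`, `level`, `function`, and `message` are illustrative assumptions, not part of the course materials), a few lines of Python can turn one such line into structured fields:

```python
import re

# One line taken from the log excerpt on the slide.
LOG_LINE = (
    "Sep 17 02:33:08.536 [debug] "
    "connection_edge_process_relay_cell(): circ deliver_window now 933."
)

# Assumed layout: timestamp, [level], function(): free-text message.
PATTERN = re.compile(
    r"^(?P<timestamp>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] "
    r"(?P<function>\w+)\(\): "
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return the line's fields as a dict, or None if it does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


record = parse_log_line(LOG_LINE)
```

Once each line is a dictionary of named fields, the records can be loaded into a table and filtered or counted like any other structured data.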



Sources of data

• HR (performance data, salary/compensation, hiring, 360 view, etc.)

• Network data (application logs, webserver logs, firewall alert logs, e-mails, etc.)

• Contracts/proposals/procurement

• ERPs (Enterprise Resource Planning) - Oracle, SAP, etc.

• CRMs (Customer Relationship Management) - Salesforce, HubSpot, etc.

• Webserver

• Clickstream

External sources of data

● Publicly-accessible APIs
○ e.g., api.data.gov

● Other open data sources


○ e.g., data.worldbank.org

○ Large businesses (e.g., Wal-Mart, Best Buy, Trip Advisor, Expedia, Google,
and Spotify) are increasingly giving people access to their data

○ Data is sometimes available for purchase (e.g. weather data)



What is big data?
● “Big data” refers to a large volume of data that can be mined for information and used in machine learning projects and other analytics applications.

● Characteristics of big data include:
o High volume. Typically, the size of big data is described in terabytes, petabytes, even exabytes!
o High velocity. Big data flows from sources at a rapid and continuous pace.
o High variety. Big data comes in different formats from heterogeneous sources.

Why use data?
Data may be collected, retained,
and used for several reasons:

● Compliance: avoiding penalties

● Automation: economic
efficiencies

● Analytics: insights

What can using data do?

1. Find a needle in a haystack
2. Prioritize work for high impact
3. Provide early warning / detection
4. Speed up decisions
5. Optimize resources
6. Enable experiments

Find a needle in a haystack

• Stanford is using satellite imagery and predictive analytics to estimate consumption expenditures and asset wealth.

• This could transform efforts to track and target poverty in developing countries with existing, public data.

http://sustain.stanford.edu/predicting-poverty/


Prioritize work for high impact

● Consultants in Philadelphia developed a model for prioritizing building inspections based on a location’s:
o Distance to nearby vacant properties
o Distance to certain crimes
o Distance to infestation reports

● Benefits could include generating better daily inspection routes or providing more information to inspectors on existing routes.

http://urbanspatialanalysis.com/portfolio/proof-of-concept-using-predictive-modeling-to-prioritize-building-inspections/


Provide early warning / detection

• When individuals and groups are planning criminal activity, they often signal their
intentions online via open-source social media.

• Tactical Institute uses cognitive analytics to monitor social channels 24x7, analyze
billions of comments and posts, home in on threats, and identify perpetrators before
they can act.

• They then provide real-time notification of threats issued so that clients can take pre-emptive action before the threat is executed.

https://www.ibm.com/case-studies/tactical-institute

Speed up decisions

• Recruiting chatbots are used to automate the communication between recruiters and candidates. They are useful:
○ when there is a high number of applicants
○ to ensure that similar questions are asked of all candidates
○ for answering frequently asked questions effectively

• JobAI is a German recruiting chatbot. Their platform offers jobseekers the opportunity to contact companies, inform themselves, and apply via familiar messenger apps such as WhatsApp and Telegram.

https://jobai.de/


Optimize resources

• BNY Mellon developed and deployed more than 220 automated computer
programs in 2016 and 2017.

• These “bots” carry out repetitive tasks such as formatting requests for dollar
funds transfers and responding to data requests from external auditors.

• The bank estimates that its funds transfer bots alone are saving it $300,000
annually.

• Bots that reply to information requests on financial statements from auditors enabled the bank to cut down its response time to 24 hours from 6 to 10 business days.

https://www.reuters.com/article/us-bony-mellon-technology-ai-idUSKBN186253


Enable experiments

• The NYC government reduced the number of people who fail to appear (FTA) in court by using data to evaluate options.

• A one-time redesign of the court summons corresponded to a 13% drop in FTAs.

• When the redesign was paired with a text message costing $0.0075 per message, there was a 36% decrease.

https://www.sciencemag.org/news/2020/10/new-york-city-uses-nudges-reduce-missed-court-dates
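
The percentages on this slide can be turned into rough counts. The sketch below is only an illustration: the 13%, 36%, and $0.0075 figures come from the slide, while the number of summonses, the baseline number of FTAs, the one-text-per-summons rule, and the reading of each percentage as a relative drop are hypothetical assumptions.

```python
# Figures quoted on the slide.
REDESIGN_DROP = 0.13            # drop in FTAs after the summons redesign
REDESIGN_PLUS_TEXT_DROP = 0.36  # drop when a text message is added
COST_PER_TEXT = 0.0075          # dollars per message

# Hypothetical inputs, chosen only to make the arithmetic concrete.
SUMMONSES = 10_000
BASELINE_FTAS = 4_000


def avoided_ftas(baseline_ftas, relative_drop):
    """Number of failures to appear avoided, assuming a relative drop."""
    return round(baseline_ftas * relative_drop)


def texting_cost(messages, cost_per_message=COST_PER_TEXT):
    """Total cost in dollars of sending the given number of messages."""
    return round(messages * cost_per_message, 2)


avoided_by_redesign = avoided_ftas(BASELINE_FTAS, REDESIGN_DROP)
avoided_with_texts = avoided_ftas(BASELINE_FTAS, REDESIGN_PLUS_TEXT_DROP)
text_budget = texting_cost(SUMMONSES)  # assumes one text per summons
```

Under these made-up inputs, the calculation shows why cheap interventions are attractive to test: the cost side of the comparison is tiny next to the change in outcomes.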
Polling question

The most relevant use of data for my organization is:

● Finding a needle in a haystack
● Prioritizing work for high impact
● Speeding up decisions
● Optimizing resources
● Enabling experiments
● Providing early warning / detection


Chat question

What hurdles might you face trying to implement a data analytics project in your organization?


Agenda
Day 1

• Data and its uses
• Data analytics overview
○ What is data analytics and how can it be used?
○ What are the principles of data science?
• Data governance
• Data ethics


What is data analytics?
• Data analytics focuses on processing and performing statistical analysis on existing datasets.

• Analysts capture, process, and organize data to uncover actionable insights for current problems and establish the best way to present this data.

Chat question

How do you use data analytics within your organization currently?


Data analytics maturity model
[Figure: Gartner analytics maturity model. Value (vertical axis) is plotted against difficulty (horizontal axis), moving from information to optimization and from hindsight through insight to foresight. Chart label: “Realm of data science.”]

• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
• Cognitive/Artificial Intelligence: How can we learn to do this at scale, continuously?

Model revisited
Realm of data science

• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
• Cognitive/Artificial Intelligence: How can we learn to do this at scale, continuously?


Stage 1: descriptive analytics

What questions does it answer? What has happened in the past?

How valuable is it? Provides some value, but doesn’t provide causation or prediction

How labor intensive is it? Easy to deploy provided you have the right data


Stage 2: diagnostic analytics

What questions does it answer? Why did something happen in the past?

How valuable is it? Provides insights into a particular problem, and can help you identify some root causes for past trends and behaviors

How labor intensive is it? Requires detailed data, but doesn’t have to be overly intensive


Stage 3: predictive analytics

What questions does it answer? What is likely to happen?

How valuable is it? Provides trends / behaviors that are likely to happen

How labor intensive is it? Requires detailed data, and may require a moderate to high level of computer power, depending on the method and the amount of data
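
To make the difference between the first and third stages concrete, here is a minimal sketch using made-up monthly ticket counts (the data, the variable names, and the choice of a straight-line trend are all illustrative assumptions). The average summarizes what has already happened, which is descriptive; the fitted trend is used to estimate the next month, which is predictive.

```python
from statistics import mean

# Hypothetical data: support tickets received in four consecutive months.
monthly_tickets = [100, 110, 120, 130]


def fit_trend(values):
    """Least-squares straight line through (0, v0), (1, v1), ...

    Returns (slope, intercept).
    """
    xs = range(len(values))
    x_bar = mean(xs)
    y_bar = mean(values)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Descriptive: what happened on average?
average = mean(monthly_tickets)

# Predictive: what is likely to happen next month if the trend continues?
slope, intercept = fit_trend(monthly_tickets)
forecast_next = intercept + slope * len(monthly_tickets)
```

Real predictive work uses richer methods and far more data, but the shape is the same: describe the past first, then project forward under stated assumptions.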



Stages 4, 5: prescriptive analytics, AI

What questions does it answer? What action should I take next?

How valuable is it? Provides recommendations for future actions

How labor intensive is it? Requires a lot of detailed data, as well as data from other external sources that will impact the model; very labor intensive


Example: fighting human trafficking
• Polaris has made a connection between massage parlors and human trafficking.

• Once they find one owner of an illicit massage business by tracing business records, they often find that he owns several more businesses in the area.

• They are now able to use data to identify illicit activities and alert law enforcement.

https://www.datanami.com/2016/10/07/data-analytics-fight-human-trafficking/

Polling question

What type of analytics is demonstrated when Polaris uses data to identify possible illicit activities and alert law enforcement?

• Descriptive
• Diagnostic
• Predictive
• Prescriptive


How do we move forward?
• To reach the realm of data science, organizations require:

o quality data

o an innovative environment

o resources, with the requisite knowledge and technical skillsets to use them


Break

Agenda
Day 1

• Data and its uses
• Data analytics overview
• Data governance
○ What is data governance? Why is it important?
○ What do Federal managers need to know about data governance?
• Data ethics


Quality data is “clean”

Clean data is: Clean data is not:

• Valid • Corrupt
• Accurate • Incorrect
• Consistent • Duplicate
• Complete • Incomplete
• Uniform • Wrongly formatted
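
The “clean data is not” column can be read as a checklist that software can apply automatically. The sketch below is a hedged illustration only: the record layout, the field names (`id`, `name`, `hire_date`), and the date format are hypothetical, and real data quality tools check much more than this.

```python
from datetime import datetime

# Hypothetical schema: every record should have these fields filled in.
REQUIRED_FIELDS = ("id", "name", "hire_date")


def find_quality_issues(records):
    """Return (row_index, problem) pairs for incomplete, duplicate,
    or wrongly formatted records."""
    issues = []
    seen_ids = set()
    for index, record in enumerate(records):
        # Incomplete: a required field is missing or empty.
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            issues.append((index, "incomplete: missing " + ", ".join(missing)))

        # Duplicate: the same id has already appeared.
        record_id = record.get("id")
        if record_id is not None:
            if record_id in seen_ids:
                issues.append((index, "duplicate id"))
            seen_ids.add(record_id)

        # Wrongly formatted: date is not in the assumed YYYY-MM-DD form.
        hire_date = record.get("hire_date")
        if hire_date:
            try:
                datetime.strptime(hire_date, "%Y-%m-%d")
            except ValueError:
                issues.append((index, "wrongly formatted hire_date"))
    return issues


# Made-up records for illustration.
records = [
    {"id": 1, "name": "Ana", "hire_date": "2021-03-01"},
    {"id": 1, "name": "Ana", "hire_date": "2021-03-01"},
    {"id": 2, "name": "", "hire_date": "2020-07-15"},
    {"id": 3, "name": "Raj", "hire_date": "15/07/2020"},
]
issues = find_quality_issues(records)
```

A report like `issues` gives a data steward a concrete list of rows to fix, rather than a general sense that the data “seems off.”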

Acquiring quality data is hard
2019 O’Reilly survey of more than 1,900 leaders and data professionals
https://www.oreilly.com/radar/the-state-of-data-quality-in-2020/


Controlling data
● Data is arguably the most important asset that organizations have.

● Controlling it through data governance practices and processes helps to ensure that data is usable, accessible, and protected.

● Effective data governance helps to avoid data inconsistencies and errors in data, plays a role in the organization’s ability to comply with laws and regulations, and increases access to data.


What is data governance?
● Data governance is a collection of practices and processes that help to ensure the formal management of data assets within an organization.

● It encompasses the complete life cycle of IT investment, from strategic planning to the day-to-day operations of the IT function.


What is data governance?

Within each process, data governance is concerned with:

• Awareness and communication
• Policies, standards and procedures
• Tools and automation
• Skills and expertise
• Responsibility and accountability
• Goal setting and measurement


Why is data governance important?
● Regulatory compliance – with increased regulation comes compliance that needs to be implemented and followed

● Reduce risk – effective data governance enhances data security and privacy

● Improve processes – when everyone follows the same standards, projects and management become more efficient


Data governance principles

A data governance program should be:

● Sustainable – it survives beyond the initial implementation

● Embedded – data governance should be present in all processes related to data

● Measured – there should be some defined metrics to help demonstrate value to the organization


Data governance strategy
A data governance program might be documented using:

• Charters
• Implementation roadmaps
• Operating frameworks / accountabilities
• Plans for operational success


Data governance models
Centralized

One overarching data governance organization applies to all sectors.

Replicated

Each data governance section is repeated across departments but may have multiple governing bodies.

Federated

An overarching data governance organization works with multiple departments to maintain consistency.

Poll question: data governance

Which governance model do you think is suitable for your organization?

● Centralized
● Replicated
● Federated
● None of the above


Poll question: data governance
After purchasing three companies, an organization is interested in ensuring high-quality data across the enterprise. Which governance model will probably best support that goal?

● Centralized
● Replicated
● Federated
● None of the above

Data governance: process maturity
Level: Description

0, Non-existent: Complete lack of any recognizable processes; have not even recognized that there is an issue to be addressed

1, Initial / ad hoc: Enterprise has recognized that the issues exist and need to be addressed but there are no standardized processes (only ad hoc approaches); overall approach to management is disorganized

2, Repeatable but intuitive: Processes have developed to the stage where similar procedures are followed by different people undertaking the same task; no formal training or communication; high degree of reliance on the knowledge of individuals and, therefore, errors are likely

3, Defined: Procedures have been standardized, documented, and communicated; it’s mandated that these processes should be followed; however, it is unlikely that deviations will be detected

4, Managed and measurable: Management monitors and measures compliance with procedures and takes action where processes appear not to be working effectively; processes are under constant improvement and provide good practice; automation and tools are used in a limited or fragmented way

5, Optimized: Processes have been refined to a level of best practice, based on the results of continuous improvement and maturity modelling with other enterprises; IT is used in an integrated way to automate the workflow, providing tools to improve quality and effectiveness, making the enterprise quick to adapt

What do leaders need to know?
● Target maturity levels would be expected to vary for individual IT
processes, IT infrastructure, and industry characteristics

● Differences in maturity come from factors such as the risks facing the
enterprise and the contribution of processes to value generation and
service delivery

● It does not make sense to be at level 5 for every IT process because the
benefits could not justify the costs of achieving and maintaining that
level



Activity: evaluate yourself!
● Turn to the Data governance assessment in your participant guide, which begins on page 4, to see how far along you and your team are in the data governance cycle.

● You’ll measure the foundational components, such as awareness, formalization, and metadata, as well as the project components of stewardship, data quality, and master data policies.

● Then, assess your progress and set goals for where you want your team to be.

Promoting good governance
● Catalog the data “owned” by your team or office. Which elements are critical?

● Define roles and responsibilities:


○ Data owners are accountable for the state of the data.
○ Data stewards make sure that data policies and standards are adhered to and
stay abreast of changes.

● Develop standardized data definitions and educate stakeholders on them.

● Implement preventative and detective controls to improve data quality.



Agenda
Day 1

• Data and its uses
• Data analytics overview
• Data governance
• Data ethics
○ What are data ethics?
○ What does a Federal manager need to know about data ethics?


What is data ethics?

Data ethics is a newer branch of ethics that studies and evaluates moral problems related
to:

● Data (including generation, recording, curation, processing, dissemination, sharing, and use)

● Algorithms (including artificial intelligence, artificial agents, machine learning, and robots)

● Corresponding practices (including responsible innovation, programming, hacking, and professional codes)

Source: University of Oxford


Why data ethics?

● Data science has huge opportunities, but those opportunities are accompanied by complex data ethical challenges.

○ To formulate and support morally good solutions (e.g., right conduct or right values)

○ To maximize the value of data science for our societies, for all of us, and for our environments

The best single thing you can do to further data ethics is to talk about data ethics!
Source: University of Oxford


FDS: Data Ethics Framework
● In December 2020, GSA published a Data Ethics Framework.

● The Framework’s purpose is to guide federal leaders and data users as they make ethical decisions when acquiring, managing, and using data to support their agency’s mission.

https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf


Federal Data Ethics Tenets

Uphold Applicable Statutes, Regulations, Professional Practices, and Ethical Standards

Respect the Public, Individuals, and Communities

Respect Privacy and Confidentiality

Act with Honesty, Integrity, and Humility

Hold Oneself and Others Accountable

Promote Transparency

Stay Informed of Developments in the Fields of Data Management and Data Science



Existing frameworks
● O’Reilly’s 5 Cs: consent, clarity, consistency, control, consequences

● UK Government Data Ethics Framework
1. Start with clear user need and public benefit.
2. Be aware of relevant legislation and codes of practice.
3. Use data that is proportionate to the user need.
4. Understand the limitations of the data.
5. Ensure robust practices and work within your skillset.
6. Make your work transparent and be accountable.
7. Embed data use responsibly.

● GDPR: regulations developed in Europe to help individuals control their data


Data Society guidelines
1. Ownership: Who owns the data? Do you have the right to collect the data?

2. History: How long can you store the data?

3. Privacy: Who controls access to the data?

4. Uses: What kinds of inferences can you make?

5. Math: How do you prevent machine learning algorithms from learning the biases of the past? Understanding how the math works is imperative for ethical data science!

Activity: data ethics
• Turn to page 9 of your participant guide to the Data ethics activity.

• Read the scenario excerpted from the Data Ethics Framework and answer the questions that follow.

End of Day 1

Data and its uses


Data analytics overview
Data governance
Data ethics



DATA LITERACY FOR MANAGERS
Day 2



Recap
• To reach the realm of data science, organizations require:

o quality data

o an innovative environment

o resources, with the requisite knowledge and technical skillsets to use them


Agenda
Day 2

● Building a data-driven culture
○ What is a data-driven culture?
○ How can a Federal manager encourage data-driven practices?
● Data tools
● Data teams
● The data science process
● Putting together a project


Chat question

What does it mean to be data driven?


What is a data-driven culture?

● A data-driven culture incorporates data and analysis into its business decisions, systems, and processes.

● It can be separated into two main categories:

○ Data infrastructure

○ Data literacy

[Figure: two-by-two grid with the axes “Data infrastructure” and “Data literacy.” Quadrants: Data Nascent, Data Literate, Data Prepared, Data Driven.]


Data infrastructure

DATA ACCESS: Can staff access data easily and in a timely manner?

DATA STORAGE: Is the data stored securely with a backup?

DATA COLLECTION: Is data collected in a timely and clean way?


Data literacy

DATA LEADERSHIP: Do executives champion data usage?

DATA GOVERNANCE: Are staff aware of data standards and practices?

DATA KNOWLEDGE: Do staff understand how to ask questions of data?


Why is it important to be data driven?
• Identify trends. Trends can inform effective practices, help you become aware of issues, and illuminate possible innovations or solutions.

• Reduce bias. Decisions based on data are far more reliable than those based on instinct, assumptions, or perceptions.

• Benchmark performance. Benchmarking allows staff to connect their actions to business results, which will reveal new opportunities for improvement.

A study from the MIT Center for Digital Business found that organizations driven most by data-based decision making had 4% higher productivity rates and 6% higher profits.


Example: Walmart
• Walmart executives wanted to know what items to stock before Hurricane Frances in 2004.

• Analysts mined a terabyte of purchase history from other Walmart stores under similar conditions.

• Turns out, in times of natural disasters, Americans want strawberry Pop-Tarts and beer! Stores were stocked accordingly.

Image: Walmart Corporate, via Flickr

Example: IRM
● Milan needed to replace its slow computers.

● By pulling and analyzing data on computer read/write speeds and hard drive usage, IRM was able to change the purchase order specs.

● Over $50,000 was saved by eliminating unnecessary requirements.

Activity: Are you data driven?
Turn to page 10 of your participant guide to the Data-driven culture assessment to evaluate your team.

How to encourage data-driven thinking & innovation

Step 1: Create data-driven guidelines

● All the data may be irrelevant if it is not used correctly.

● Hence, organizations need to know how to extract information and knowledge from their data.

● They must incorporate data by developing objectives and laying out a broad roadmap for the data.


Step 2: Invest in data infrastructure and strategy

● Determine the space required to manage data for your organization and develop systems to support data collection, storage, and analysis.

● Collaborate with the IT department to establish databases and install software for data reporting, modeling, and analysis.

● Use the right tool to perform data analysis.


Step 3: Encourage careful and comprehensive methods of data collection

● Create policies for gathering data

● Establish practices to measure success

● Discuss the importance of collecting data records in the future

● Meet with other managers or department leaders to communicate individual data collection methods


Step 4: Streamline the data collection process

● Every department gathers relevant and valuable data, so a central repository for all the collected data is recommended.

● Data analysts evaluate the data and provide understandable analytical reports and insights back to the head of each department.

● Each team can turn the outcomes from these insights into actions, execute them in their domain, and share results with other departments/teams.


Step 5: Improve and maintain the data quality

● Data quality is just as crucial as data quantity.

● If you do not have new data in your repository, you might be looking at outdated data and a false picture of reality.

● Keep collecting more relevant, new data.

● Use data mining and software tools to clean and maintain the quality of the data automatically.


Step 6: Train your team

● Educate employees with the right data-related skills and knowledge.

● Plan or suggest official training sessions (like this one!) on data literacy and information analysis.

● Have your team complete a tutorial/training when the organization adopts new software or a new database system.


Step 7: Share insights and knowledge

● Encourage your team by sharing data directly relevant to them.

● Show the value and impact of data by preparing information about the team's performance and sharing it with them during quarterly and yearly reviews.

● It is crucial to ask the team how they came across a conflict, analyzed it, and decided on the resolution. Doing so gives your data team a deeper understanding of the data.


Step 8: Applaud your team

● Identify a successful analytics project / team and highlight their success through a newsletter, event, or lunch and learn.

● Recognize the right things, including when mistakes move you to another level.


Recap: 8 Actionable steps to establish data-driven culture
Here is the checklist of the steps:

✔ Create Data-Driven guideline

✔ Invest in data infrastructure and strategy

✔ Encourage careful and comprehensive methods of data collection

✔ Streamline the data collection process

✔ Improve and maintain the data quality

✔ Train your team

✔ Share insights and knowledge

✔ Applaud your team



Chat question

● Which idea(s) that we’ve discussed could you implement in
the near term? What specifically would you do?

● What challenges do you expect to face when
implementing those ideas? How will you overcome them?



Data solutions



Break

Agenda
Day 2

● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project

• What types of tools do data scientists use to do their work?



Data tools
● There’s no shortage of tools in
the data analytics space.

● There are different tools for
different functions, but most
overlap in their offerings.



Storage tools
• Databases
o Relational
o Non-relational (NoSQL)
• Data warehouses & data lakes

[Figure: example table with columns y1, x1, x2, x3 and rows (A, F, X, P), (B, G, Y, Q), (C, H, Z, W)]

Cleaning tools
• Data cleaning is the process of
preparing data for analysis by
removing or modifying data
that is incorrect, incomplete,
irrelevant, duplicated, or
improperly formatted.

• Example tools: Drake,
OpenRefine, DataWrangler,
Data Cleaner, Winpure Data
Cleaning Tool
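
A minimal sketch of what a cleaning step might look like in Python, using only the standard library. The records, field names, and rules below are made up for illustration; a real project would more likely rely on one of the tools listed above.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"name": " Alice ", "country": "us", "spend": "120.50"},
    {"name": "Bob", "country": "US", "spend": None},        # incomplete
    {"name": "alice", "country": "US", "spend": "120.50"},  # duplicate of the first row
    {"name": "Chen", "country": "cn", "spend": "abc"},      # incorrect value
    {"name": "Dana", "country": "GB", "spend": "75"},
]


def clean(records):
    """Return records with consistent formatting, no gaps, and no duplicates."""
    cleaned, seen = [], set()
    for row in records:
        if any(value is None for value in row.values()):
            continue  # drop incomplete rows
        try:
            spend = float(row["spend"])
        except ValueError:
            continue  # drop rows whose spend is not a number
        # Fix improper formatting so that equal values compare as equal.
        name = row["name"].strip().title()
        country = row["country"].strip().upper()
        key = (name, country, spend)
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        cleaned.append({"name": name, "country": country, "spend": spend})
    return cleaned


cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

Only the Alice and Dana rows survive here. Whether to drop a bad row or repair it is a judgment call that depends on the data and the question being asked.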

Analysis tools
• Analysis tools make it easier to sort
through data in order to identify
patterns, trends, relationships,
correlations, and anomalies that
would otherwise be difficult to detect.

• Tools known to be in use at State:
o Excel
o R
o Python
o SAS
o WordSmith

Visualization tools
• Visualization tools present data and
concepts in graphical form.

• Tools known to be in use at State:
o Excel
o Power BI
o Tableau
o R and RStudio
o Python
o MicroStrategy

Collaboration tools
• Collaboration tools offer version
control, workflow, bug tracking, task
management, etc.

• Example tools: Git, GitHub

Other technologies
● Several other technologies enable and support data analytics, including:

○ Application programming interfaces (APIs). An API is a computing interface that
allows two applications to talk to each other. Using APIs can speed up data
acquisition.

○ Graphics processing units (GPUs). A GPU is an electronic circuit originally
designed to process graphics such as images and video. Because GPUs run many
calculations in parallel, they can help speed up computation.

○ Cloud computing. Cloud computing offers fast and flexible servers, storage,
databases, networking, software, analytics, and intelligence over the Internet.



Questions to guide tool selection

1. What types of technologies are needed for working with data at various stages of the
data pipeline?

2. How do the different tools and technologies compare in their functionality, strengths,
and weaknesses?

3. Do you have staff who know how to use particular tools or who can be trained to use them?

4. Do you have budget constraints you need to be mindful of?

5. Is the tool on the approved software list?



Chat question

Let’s imagine you want to create
data visualizations for an upcoming
report.

What tool will you use? Why?



Agenda
Day 2

● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project

• Who is on the data team?
• How do data teams fit within an organization?



Data Analyst
● Ensures that collected data is
relevant and exhaustive while
also interpreting the analytics
results
● Main role and responsibilities
include:
○ Wrangling the data
○ Managing the data
○ Creating basic analyses
and visualizations
● Core skills to include: SQL, R /
Python, Tableau / Power BI

Data Scientist
● Builds upon the analysts’ data work
to develop predictive models and
complex algorithms

● Main role and responsibilities include:
○ Asking the right questions of
the data
○ Building more complex
predictive models
○ Interpreting the results critically
and communicating them well

● Core skills to include: R, Python,
Spark, Hadoop

Data Engineer
● Develops the infrastructure to house
the data and maintains the structural
components

● Main role and responsibilities:
○ Ensuring data integrity across
different data sources
○ Building out additional data
warehouses as needed
○ Maintaining data pipelines and
access

● Core skills to include: AWS,
MongoDB, MySQL, Hadoop, C++,
Azure

MLOps Engineer
● Aims to deploy and maintain
machine learning systems in
production reliably and efficiently

● Main role and responsibilities:
○ Requirements engineering
○ System design
○ Implementation and testing
○ Maintenance, support,
troubleshooting, etc.

● Core skills to include: distributed
computing principles, networking,
database architecture

Data Science Manager
● Oversees and directs data science
teams and projects and is the bridge
between data and non-data people

● Main role and key responsibilities
include:
○ Planning out people and
resources for projects
○ Communicating results to
executives and stakeholders
○ Running the data science teams

● Core skills to include: management
experience, programming skills (R /
Python / SQL), strong communication

Team structures
Centralized structure
[Diagram labels: Data Team; Business Units]

Decentralized structure
[Diagram labels: Business Units]

Hybrid structure
[Diagram labels: Data analysts; Business analysts; Data Team; Business Units]
Pros and cons
Centralized structure
+ easier to standardize team processes
- harder to coordinate projects to meet strategic goals

Decentralized structure
+ easier to coordinate projects to meet strategic goals
- leads to inconsistent & redundant data usage across organization

Hybrid structure
+ easier to standardize team processes
+ easier to coordinate projects to meet strategic goals
Polling question
Which best describes the structure of
the data teams in your organization?
○ Centralized
○ Decentralized
○ Hybrid

[Diagram: centralized, decentralized, and hybrid team structures]
Another option…
Contracting a team
Strengths
● Flexible cost structure can adapt to changing budgets
● Easy to change staff if people don't work out
● Quickly add staff with new skills
Weaknesses
● Internal know-how is not built up
● Data science does not become an endemic capability
● The organization becomes dependent on forces outside of its control

Hiring a team
Strengths
● Data science becomes an endemic capability: better decision making
becomes part of the DNA
● Internal know-how is developed and sustained: the analytics capability has a
strong foundation
Weaknesses
● State-of-the-art capabilities may still need to be brought in from the outside ("rented")
● Organizational challenge: data science must remain impartial to internal dynamics

Polling question
What would be the best option for your organization?

○ Contracting a team
○ Hiring a team

What are the key factor(s) in making that decision?

○ Recruitment / training time
○ Cost
○ Internal know-how
○ Flexibility
○ Not depending on outside forces



Break

Agenda
Day 2

● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project

• What are the six stages of the typical data science process?



Typical data science process
1. Ask: What problem(s) do we need to solve?
2. Research: What data do we need and how do we get it?
3. Model: Which method(s) are appropriate to use?
4. Validate: Do the model and assumptions work as expected?
5. Test: How does the model generalize to real-world data?
6. Interpret: How can we use the conclusions in the real world?
Ask
What problem(s) do we need to solve?

● The business and data teams should work together to develop a question that is
specific, measurable, and objective.
● Domain knowledge comes into play.

Examples
Less effective: How can I make my policies more effective?
More effective: Which 3 policies have demonstrated the best results, and did they have
anything in common?

Less effective: We’ll use an indicator that shows the most improvement.
More effective: We’ll use the calculated ROI and the percent difference in desired
behaviors from before and after.
Research
What data do we need and how do we get it?

● The data team, with input from the business, gathers information about the data
needed to get a relevant answer.
● Is it already collected, or is time needed to get it? What format is it in?

Examples
Less effective: I’m sure we have the data somewhere.
More effective: We’ll use the datasets from the policy report that can be found in X
repository.

Less effective: I’m sure the data is good enough as is.
More effective: Where can I read about how the data was collected and how the metrics
are defined?
Model: Which method(s) are appropriate to use?
Validate: Do the model and assumptions work as expected?
Test: How does the model generalize to real-world data?

● Models take questions and
provide answers and outputs.

● The methods chosen by the
data team are based on the
questions asked and the type(s)
of data that you have.

● Multiple iterations are required
to ensure the model works well.



Interpret
How can we use the conclusions in the real world?

● The data team looks at what the results are telling them, not what they were
expecting the results to be.

● They present the data and make recommendations based on the data, their domain
knowledge, and stakeholder needs.

Example
Less effective: I’ll put the results in the same format as I usually do.
More effective: How can I best convey the results that matter most to my end users?



Chat question
Which part of the data science
process do you think data teams
spend the most time on? Why?



Chat question
As a manager, which part of the
data science process do you
think you should spend the most
time on? Why?



Agenda
Day 2

● Building a data-driven culture
● Data tools
● Data teams
● The data science process
● Putting together a project

• How do I identify feasible and impactful data projects?



Planning a data project
• A successful and comprehensive data project goes well beyond
programming.

• It involves careful planning and a large amount of
communication.

• In this section, we will practice planning an impactful and feasible
project.



Planning a data project
In previous classes, participants have tackled projects such as:

• Using data to prioritize the distribution of COVID vaccines to Department
employees, family members, and members of the diplomatic community.

• Improving equity in the post bidding process using data.

• Determining the impact of bilateral engagement (e.g., meetings and trips)
on a country’s child abduction indicators, according to the data.

• Proving, with data, that counternarcotics funding in Colombia has reduced
cocaine consumption.



Activity: brainstorm ideas
• Turn to page 13 of your participant guide to the Project
brainstorm activity.

• Identify 3-5 ideas for leveraging data in your
workplace. Then, assess their feasibility and impact.

End of Day 2

Building a data-driven culture


Data tools
Data teams
The data science process
Putting together a project



DATA LITERACY FOR MANAGERS
Day 3



World’s Smartest Home



Agenda
Day 3

• Foundational data science methods
• Advanced data science methods

• What are the basics of machine learning?
• What is clustering and how is it used?
• What is classification and how is it used?
• What is regression and how is it used?



Why learn about these methods?

1. To develop a common vocabulary with the data science
team

2. To direct data science projects and make recommendations

3. To understand what options are available for finding new
insights and becoming more efficient



What’s an algorithm?



What is machine learning?
● Machine learning uses algorithms to find patterns in massive amounts of data and
predict future results with minimal human intervention.

● It powers many of the services we use today:
o recommendation systems like those on Netflix
o search engines like Google
o social-media feeds like Facebook and Twitter
o voice assistants like Siri and Alexa

● Most machine learning is categorized as either supervised or unsupervised.



Supervised learning
● You have input variables (x) and an output variable (Y), and you use an
algorithm to learn the mapping function from the input to the output.

● The goal is to approximate the mapping function so well that when you have
new input data (x), you can predict the output variable (Y) for that data.

● Requires labeled data (i.e., data tagged with one or more labels identifying
certain properties, characteristics, or classifications)

● Example: emails are classified as spam/not spam based on how their features
compare to the features of emails that a human “Marked as Spam.”



Unsupervised learning
● You only have input data (x) and no corresponding output variables.
The goal is to model the underlying structure or distribution in the data in
order to learn more about the data.

● In other words, the machine looks for whatever patterns it can find.

● Example: for marketing purposes, finding groups of customers with similar
behavior given a large database of customer data containing their
properties and past purchase records



Polling question

The goal of this type of machine learning is to model
the underlying structure or distribution in the data in
order to learn more about the data.

Do you think this statement describes supervised
machine learning or unsupervised machine learning?



Polling question
The Stanford Dogs Dataset contains 20,580 images. Each image
is categorized into 1 of 120 different dog breed categories.

Based on the information provided, is this dataset suitable for
use with supervised machine learning techniques?
Before we go further…
● Remember that most data
science projects combine a
few methods to extract the full
picture.

● The two big components that
drive the decision for which
method to use are: the
question you’re asking, and the
data you have.

Clustering

Clustering
● Clustering is a type of unsupervised
machine learning.

● You find similarities between data
points and create groups (clusters)
based on those similarities.

● It tries to find whether there is a
relationship between the data points
when the classes are unknown.



How can you use clustering?
● Clustering answers the questions:
1. Who/what is this person/object similar to?
2. Is there a hidden pattern in the data that we can't see?
3. Are there groups of data with similar attributes?

● Domain knowledge is key!
o If we know that certain policies are more effective, we can model
new policies on similar metrics.
o If we have projects with similar objectives and outcomes, we can
consolidate the ones that overlap to streamline progress.



Example: credit line optimization
● GE Capital created a model to
predict customer behavior and
offer tailored products.
● The clusters were defined using
existing GE Capital data, based
on days delinquent, monthly
spend, and percent of credit line
used.
● Led to more targeted marketing
and specific offers to those groups.

[Figure: five clusters, labeled 1 to 5]



Types of clustering
● Centroid-based - iteratively assigns each data point to the nearest cluster center
(centroid) and updates the centers, so proximity is translated into similarity
● Hierarchical - builds a tree of nested clusters by repeatedly merging the closest
data points or groups, assuming that closer points are more similar
● Density-based - searches for areas of varied density of data points in the dataset and
clusters based on the density
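
To make the centroid-based idea concrete, here is a minimal sketch of k-means in plain Python. The function name `k_means`, the starting rule (use the first k points as initial centers), and the sample points are assumptions made for illustration; production libraries use more careful initialization and stopping rules.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving the cluster centers."""
    # Naive starting rule (an assumption of this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Step 1: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 2: move each centroid to the average of its assigned points.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical, unlabeled two-attribute data with two visible groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the number of clusters requested and on the distance measure, which is why both show up in the questions managers should ask.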



Evaluating the accuracy of the model
● Goal of clustering is to maximize the
separation between clusters and
minimize the distance within clusters
● The ratio of inter-cluster variance to
total variance can help you assess
the performance of algorithms,
although this is dependent on the
model you use

\[ \text{variation explained by clusters} = \frac{\text{inter-cluster variance}}{\text{total variance}} \]
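
The ratio above can be computed by hand for a small example. This sketch assumes one numeric attribute per data point and uses sums of squared deviations, which give the same ratio as variances because both share the same divisor. The function name and the numbers are illustrative.

```python
def variation_explained(clusters):
    """Share of total variation that is due to differences between clusters."""
    all_values = [value for cluster in clusters for value in cluster]
    grand_mean = sum(all_values) / len(all_values)
    # Total variation: how far every point is from the overall mean.
    total = sum((value - grand_mean) ** 2 for value in all_values)
    # Inter-cluster variation: how far each cluster mean is from the overall mean,
    # weighted by the number of points in the cluster.
    between = sum(
        len(cluster) * (sum(cluster) / len(cluster) - grand_mean) ** 2
        for cluster in clusters
    )
    return between / total


# Two hypothetical clusters that are tight and far apart.
ratio = variation_explained([[1, 2, 3], [11, 12, 13]])
print(round(ratio, 3))
```

A value close to 1 means most of the variation in the data is explained by which cluster a point belongs to.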
Evaluating the accuracy of the model
● A scree plot shows how much each
component contributes to the explained
variance of the model.

● Good for identifying important
components of a model



Questions managers should ask
1. How was the distance measure identified?

2. Did you scale the data appropriately?

3. How many clusters do you expect or want? Why?

4. Does your algorithm scale to the size of the data?

5. What can we learn from the groups that the algorithm identified?



Common pitfalls with clustering
● Many clustering algorithms don’t scale well to
large or high-dimensional datasets
○ “Curse of dimensionality” – as the number of
dimensions increases, the data points
become sparse and the distances between
points become nearly equal, so similarity
is harder to measure

● Different data types need to be
formatted correctly (e.g., mixing
categorical data with numerical data
may not be the best way to find similar
points).

● Make sure you use the right clustering
model for the data!



Recap: when should you use clustering?
● Use clustering when:

1. You have an unlabeled dataset
2. The dataset has multiple attributes
3. You need to identify patterns in your data
4. You need to find groups in your data



Classification

Classification
● Classification is a type of supervised
machine learning.

● It is the process of assigning new
data points to known classes.

● The assignment is done based on the
similarity of new data points to
existing data points with known class
assignment (category or behavior
pattern).



How can you use classification?
Classification answers the questions:
1. What is the probability of an object / person being in a particular group?
2. What category is this person / object in?
3. What is this person / object most similar to?

Domain knowledge is key!
○ If we know that certain policies are most likely to be successful, we can predict if
new policies will also be successful
○ If we see behavioral outcomes based on certain decisions, we can predict similar
behaviors



Example: predicting pregnancy
● In 2002, Target implemented data
analytics to analyze buying patterns
in customers.
● New parents often get bombarded
with advertising offers, so Target
wanted a way to anticipate who is
expecting in order to get ahead of
the competition.
● They were able to predict pregnancy
of their customers based upon their
purchases and sent out targeted
coupons.

http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0
Chat question

Time out! What ethical implications might
Target’s pregnancy predictions have
raised?



Common classifiers
● k-Nearest Neighbors (KNN) – assumes
that similar things exist in close
proximity; classifies a data point
based on how its neighbors are
classified

● Decision trees – uses a tree-like graph
or model of decisions and their
possible consequences to classify
data
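
As an illustration of the k-Nearest Neighbors idea, the sketch below classifies a new point by a majority vote among its k closest labeled neighbors. The function name `knn_predict`, the labels, and the data points are made up for this example.

```python
import math
from collections import Counter


def knn_predict(training, new_point, k=3):
    """Return the most common label among the k training points nearest to new_point."""
    # Sort the labeled examples by their distance to the new point, keep the k closest.
    neighbors = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: (features, known class).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]

print(knn_predict(training, (2, 2)))  # closest neighbors are all class "A"
print(knn_predict(training, (7, 9)))  # closest neighbors are all class "B"
```

Because the vote depends on distances, the choice of distance measure and the scaling of the data directly affect the prediction.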



Common classifiers
● Support vector machines – separates
data points by class using an optimal
hyperplane

● Logistic regression – estimates the
probability that a data point belongs
to a certain class



Evaluating accuracy of a model
● In order to determine the accuracy of
the model, you need to split your
data into a training set and a test set.

● Then, compare the outcomes that
the model produced on the test set to
the actual outcomes to determine how
accurate your model is, and how well
it generalizes to new data.

● The table that summarizes this
comparison is called a confusion matrix.
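
A minimal sketch of how a confusion matrix is tallied, assuming a two-class problem where 1 means "positive" and 0 means "negative". The actual and predicted values below are invented test-set results, not output from a real model.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif a == 0 and p == 1:
            counts["FP"] += 1  # predicted positive, actually negative
        elif a == 0 and p == 0:
            counts["TN"] += 1  # predicted negative, actually negative
        else:
            counts["FN"] += 1  # predicted negative, actually positive
    return counts


# Hypothetical test-set outcomes and the model's predictions for them.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix)
print(accuracy)
```

Accuracy alone can hide which kind of mistake the model makes, so it is worth looking at the false positives and false negatives separately.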



Accuracy, cont’d.
● Next, you can plot the ROC (receiver
operating characteristic) curve, which shows the
true positive rate against the false
positive rate at different thresholds.

● Another metric is the
AUC (area under the curve), which
summarizes the ROC curve in a single
number and lets you compare the predictive
accuracy of classification models.
The AUC should be above .5 to say
the model is better than a random
guess.
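
The sketch below shows one ROC point and the AUC for a handful of invented model scores. It computes the AUC through an equivalent definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case. Function names and data are assumptions for illustration.

```python
def roc_point(actual, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= threshold)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= threshold)
    return fp / negatives, tp / positives


def auc(actual, scores):
    """Share of positive/negative pairs where the positive case scores higher."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical actual classes and the model's predicted probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

fpr, tpr = roc_point(actual, scores, threshold=0.5)
model_auc = auc(actual, scores)
print(fpr, tpr)
print(round(model_auc, 3))
```

Moving the threshold trades false positives against missed positives; the AUC summarizes performance across all thresholds.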



Questions managers should ask
1. How was the distance measure identified?

2. Did you scale the data appropriately?

3. How did you split the test and training data?

4. What thresholds did you use for AUC and ROC?



Recap: when should you use classification?

● Use classification when:

1. You have a labeled dataset
2. You want to predict group assignments
3. You want to predict behaviors / events
4. You want to identify important attributes



Regression

Regression
● Regression is a type of supervised
machine learning.

● It predicts the value of a variable
based on the value of another
variable or several variables.

● It’s used to examine and calculate
the relationship between a variable
of interest (dependent variable) and
one or more explanatory variables
(predictors or independent
variables).
How can you use regression?
● Regression answers the questions:
1. Which factors matter most?
2. Which can we ignore?
3. How do those factors interact with each other?
4. How certain are we about all of these factors?

● Domain knowledge is key!
○ We can predict political instability in countries
○ We can predict how tourism season affects a country’s economy



Types of regression techniques

Regression techniques are distinguished by:
• # of independent variables
• Shape of the regression line
• Type of dependent variable



Use case: predicting city movements
● There are over 500 bike-sharing
programs around the world with over
500,000 bikes.

● Automated systems track numerous
data points providing a treasure trove
of data about the mobility of
residents.

● Data can be used to forecast the
number of bikes required and adjust
pricing based on demand.



Chat question

What factors do you think might drive
demand for bike-share use?



Simple linear regression
1. Gather data on variables in question
2. Plot the data
3. Draw the line to best fit the data
4. Evaluate model performance
• Measure error
• Deal with outliers
• Determine accuracy

y = mx + b

Number of bike users = 2.6 * (Temperature) + 37.6
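
A minimal sketch of how the slope m and intercept b of a best-fit line can be found with the least-squares formulas. The five data points are invented so that they fall exactly on the bike-share line above; real data would scatter around the line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical observations: temperature and number of bike users.
temperature = [10, 15, 20, 25, 30]
bike_users = [63.6, 76.6, 89.6, 102.6, 115.6]

m, b = fit_line(temperature, bike_users)
predicted_users = m * 22 + b  # forecast for a temperature of 22
print(round(m, 2), round(b, 2), round(predicted_users, 1))
```

The fitted line is only as trustworthy as the data behind it, which is why the next steps measure error and check for outliers.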



Measure error

Variance. How widely dispersed is actual data from the expected data?

Randomness. Are the errors random or is there bias in the model?

Standard deviation/Certainty. What proportion of data points fall within a given
range? How likely is a value to be in that range?
Deal with outliers
● Just one outlier can have a very
negative impact on a linear
regression if it is not identified and
handled properly.

● Methods such as scatterplots,
box-and-whisker plots, and Cook’s
distance can be used to identify
outliers.
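
One common way to flag outliers follows the same rule a box-and-whisker plot uses: mark anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule with Python's standard `statistics` module; the rider counts are invented for illustration.

```python
import statistics


def find_outliers(values):
    """Flag values beyond 1.5 * IQR from the quartiles (the box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily rider counts with one suspicious entry.
daily_riders = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers = find_outliers(daily_riders)
print(outliers)
```

Flagging a value is only the first step; whether to remove, correct, or keep it depends on why it is unusual.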



Determine accuracy
● Look at:
○ Covariance: measures how two variables change together
○ Correlation: identifies the strength and direction of the relationship between the variables
○ p-values: probability of seeing a pattern at least this strong through random chance
alone, without a real relationship between the variables

● R² measures how well a regression model fits. It’s the proportion of variance in the
outcome variable that's accounted for by the regression
○ e.g., “about 40% of the variance in the number of bike users is explained by the
temperature”
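
For a simple linear regression with one predictor, R² equals the square of the correlation between the two variables. The sketch below computes both from scratch; the function name and the five data points are invented for illustration.

```python
import math


def correlation(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictor and outcome values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_squared = r ** 2  # valid shortcut for one-predictor linear regression
print(round(r, 3), round(r_squared, 2))
```

Here about 60% of the variance in y is explained by x; the remaining 40% is due to factors the line does not capture.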



Multiple linear regression
● Has more than one independent variable
○ e.g., How do several variables (temperature, humidity, day of the week, time of
day) affect demand for bikes?

● Added concerns:
○ Multicollinearity: when 2 or more independent variables are strongly correlated to
one another you may be effectively double counting an effect
○ Autocorrelation: when values of the same variable are correlated with one another
across related observations (e.g., consecutive time periods)
○ Heteroskedasticity: when the variability of a variable is unequal across the range of
values of a second variable that predicts it



Other types of regression
● Nonlinear Regression
● Binary Logistic Regression
● Ordinal Logistic Regression
● Nominal Logistic Regression
● Ridge Regression
● Lasso Regression
● Partial Least Squares Regression
● Polynomial Regression
● Logistic Regression
● Quantile Regression
● Elastic Net Regression
● Principal Components Regression
● Support Vector Regression
● Ordinal Regression
● Poisson Regression
● Negative Binomial Regression
● Ecological Regression
● Bayesian Regression
● Jackknife Regression



Questions managers should ask
1. How well do we understand the underlying data distribution?

2. Did you identify any outliers? Were they significant? Did you remove them?

3. Did you test the variables for multicollinearity so as not to double-count their effects?

4. What was the R² metric?



Recap: when should you use regression?

● Use regression when:

1. You have a labeled dataset
2. You want to predict trends
3. You want to anticipate needs or shortages



Polling question

Do you think the decision tree shown below depicts a
classification method?



Polling question

Would you use clustering, classification,
or regression to anticipate what
candidate a person would vote for?



How Machines Learn



Break

Agenda
Day 3

• Foundational data science methods
• Advanced data science methods

• What is text mining and how is it used?
• What is graph analysis and how is it used?
• What are neural networks and how are they used?



Text Mining

Text mining
● Text mining employs methods from
various fields including mathematics,
statistics, computational linguistics,
and programming.

● It’s the process of getting insightful
and valuable information out of text
data.

● Includes entity extraction, document
classification, and sentiment analysis.

Text mining branches
Text mining
• Entity extraction
• Document classification
• Sentiment analysis



Entity extraction
● Use entity extraction when you want to get an overview of the themes and topics in
documents.
● Measure word frequency and word co-occurrences.



Document classification
● Use document classification
when you want to sort
through documents and
identify groups of similar
articles.

● Based on similarity of topics /
other metrics



Sentiment analysis
● Use sentiment analysis when you want
to understand the emotions and
overtones of documents.

● Use reference dictionaries to identify
positive / negative words.

● Natural language processing (a
similar branch) doesn’t focus
specifically on sentiment, but rather
on the meaning of the document.

[Figure: trends in emotion]
What events might have driven the trends in
emotion depicted above?
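
A minimal sketch of dictionary-based sentiment scoring: count the words that appear in a positive list and subtract the words that appear in a negative list. The two tiny word lists and the sample sentences are invented for illustration; real reference dictionaries contain thousands of entries.

```python
import re

# Hypothetical reference dictionaries.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "slow", "confusing"}


def sentiment_score(text):
    """Positive-word count minus negative-word count for one document."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


print(sentiment_score("The training was great and helpful"))  # positive score
print(sentiment_score("The portal is slow and confusing"))    # negative score
print(sentiment_score("Oh great, another outage"))            # sarcasm is scored as positive
```

The last sentence shows a limit of simple word matching: sarcasm and context are missed unless the method accounts for them.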



Text mining process

Stages: Scrape / collect; Clean & organize; Visualize; Analyze

Index  Word   Freq  %
A      Apple  5     20
B      Book   7     28
C      Cat    13    52
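
A word-frequency table like the one above can be produced in a few lines. This sketch lowercases the text, drops a small set of stop words, and reports each remaining word's count and share. The sample text and the stop-word list are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, stop_words):
    """Return (word, count, percent of counted words), most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    total = sum(counts.values())
    return [(word, n, round(100 * n / total)) for word, n in counts.most_common()]


# Hypothetical document and stop-word list.
text = "The cat and the book. The cat and the apple."
table = word_frequencies(text, stop_words={"the", "and"})

for word, freq, percent in table:
    print(word, freq, percent)
```

Without the stop-word filter, "the" would dominate the table while adding no information about the topic.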



Evaluating accuracy of our model
● This is a tricky subject!

● Text analysis and text mining rely on other methods that we’ve
introduced in this class, such as clustering and classification. You’ll need
to use the evaluation methods for those particular models.

● In terms of sanity-checking the text mining process, look for unhelpful
stop words (frequent words that don’t provide additional information)
and see if the topics generally make sense.



Common pitfalls with text mining
● Cleaning text is extremely messy and
time consuming – this is a key problem
in text mining projects.

● Existing dictionaries are not a
panacea for catching the nuances of
language – typically, other words need
to be added manually.

● Choosing the right methods and metrics to
classify and cluster documents
correctly can be difficult.

Questions managers should ask
1. How does the model take sarcasm / irony / colloquialisms into account?

2. Is there an existing library of reference words that can assist you in text
mining?

3. Does that reference library include misspellings, alternate versions of
words, symbols, different parts of speech or compound terms?

4. How do the topics change over time?

Data Science for Managers DATA SOCIETY © 2023 182


Graph analysis

Data Science for Managers DATA SOCIETY © 2023 183

Graph analysis
● Graph analysis (also known as network
analysis) seeks to find patterns within a
network, a set of points connected by lines
that represent connections.

● Networks can represent organizational
relationships; communications patterns;
economic relationships; environmental
relationships; connections based on interests,
preferences and similarities; as well as
geographic relationships.

Data Science for Managers DATA SOCIETY © 2023 184


Example: IBM & a volcano
● In April 2010 a volcano in Iceland halted flights throughout Europe.

● IBM's internal analytics software alerted the team that IBM's supply chain link most
relevant to the eruption was in Hong Kong – not Europe!

● The software showed that when flights resumed after the eruption was over, IBM would
need to quickly move a backlog of components from Asian manufacturers to
European customers. A bottleneck in Hong Kong would result.

● IBM booked additional space on commercial flights to help transport the backlog.

Source: Big Data Driven Supply Chain Management by Nada R. Sanders

Data Science for Managers DATA SOCIETY © 2023 185


Types of graph analysis
Graph analysis: Community detection | Centrality metrics | Social media

Data Science for Managers DATA SOCIETY © 2023 186


Community detection
● Use community detection
when you want to dive into
your network to find new
communities and groups.

● Identifies groups of individuals /
nodes that belong together;
can detect latent connections
and communities.

Data Science for Managers DATA SOCIETY © 2023 187


Centrality metrics
● Use centrality metrics when you want to look at an overview of a network and identify
key nodes.
● Identifies the most important nodes, most central nodes, shortest paths, etc.

This email network shows how a company communicates.
[Figure: email network with nodes labeled Marketing department, CFO, Supply chain department, and Finance department]

Data Science for Managers DATA SOCIETY © 2023 188
Social media
● Use social media analysis when you are working with
data from social media platforms.

● Identifies how an idea travels across social
media platforms and how individuals are
connected.

Data Science for Managers DATA SOCIETY © 2023 189


Ways to measure networks
Metric                  Purpose
# of nodes              How many participants are included in the network?
# of edges              How many connections exist in a network?
Distance                How long does it take for information to travel through a network?
Degree (in-, out-)      Direction of connections: is someone a follower or an opinion leader?
Degree centrality       How many other people/objects can someone/something reach?
Closeness centrality    On average, how quickly can someone/something reach every other point in the network?
Betweenness centrality  How important is someone/something as a connector to the structure of the network?
Eigenvector centrality  How important is someone/something based on who/what else they are connected to?
Tie strength            How strong or significant is a connection between two people/objects?
Density                 How sparse and fragile or inter-connected and resilient is a network?
Jaccard Index           How similar or redundant are 2 people/elements of a network?
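
A few of these metrics can be computed by hand for a small network. The sketch below is an illustration under stated assumptions: the edge list is hypothetical, the network is treated as undirected, and degree centrality is normalized by the number of other nodes (one common convention).

```python
# Hypothetical undirected email network: each pair exchanged email.
edges = [("CFO", "Finance"), ("CFO", "Marketing"), ("CFO", "Supply chain"),
         ("Finance", "Marketing")]

def neighbors(edges):
    """Map each node to the set of nodes it is directly connected to."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

adj = neighbors(edges)
n_nodes = len(adj)    # number of nodes
n_edges = len(edges)  # number of edges

# Degree centrality: share of the other nodes each node touches directly.
degree_centrality = {node: len(nbrs) / (n_nodes - 1) for node, nbrs in adj.items()}

# Density: actual edges divided by the maximum possible number of edges.
density = n_edges / (n_nodes * (n_nodes - 1) / 2)

def jaccard(a, b):
    """Jaccard index: shared neighbors divided by all neighbors of either node."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0

print(degree_centrality)
print(density)
print(jaccard("Finance", "Marketing"))
```

In this toy network the CFO is connected to every other node, so it has the highest possible degree centrality.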

Data Science for Managers DATA SOCIETY © 2023 190


Evaluating accuracy of our model
● This is a tricky subject!

● Graph analysis relies on other methods that we’ve introduced in this
class, such as clustering and classification. You’ll need to use the
evaluation methods for those particular models.

● In terms of sanity-checking the process, look at how the nodes are
accounted for in each community and determine what threshold
makes the most sense for your analysis.

Data Science for Managers DATA SOCIETY © 2023 191


Questions managers should ask
● What aspect of the relationship are you most interested in (e.g., who is
the most connected, who has the strongest connections, who is most
important)?

● Does the data you’re using account for a large amount of a
relationship? How much of the relationship is captured in the data versus not collected?

● What metrics did you use to evaluate the proximity between nodes /
communities?

Data Science for Managers DATA SOCIETY © 2023 192


Neural networks

Data Science for Managers DATA SOCIETY © 2023 193

Activity: field trip
● Visit https://quickdraw.withgoogle.com/

● Click the “Let’s Draw!” button and play a round (6 drawings).

● At the end of the round, visit the data to see why guesses were
made. Also, make a note of how many of your drawings were
guessed correctly.

Note: A clickable link is available on page 16 of the participant guide.

Data Science for Managers DATA SOCIETY © 2023 194

Neural networks
● A neural network is born ignorant and
builds on itself to get smarter and
smarter.

● It starts out with a guess, and then
tries to make better guesses as it
learns from its mistakes.

● Neural networks can address the same
kinds of problems that we’ve reviewed
previously. In theory, you can apply
them to almost any task!

Data Science for Managers DATA SOCIETY © 2023 195

Intuition: neural networks
● Neural networks are made up of perceptrons.
● A simple neural network has 3 layers:
○ Input: observations that enter the model
○ Hidden layer: composed of an activation function that
derives the output based on inputs and other factors
○ Output: target variable you want to predict

● Once the output is produced, the model measures the
error, then walks it back over the model to adjust its
performance and reduce errors.

A perceptron acts like a neuron.
[Figure: network diagram with Input, Hidden, and Output layers]
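
The guess, measure, adjust loop can be shown with the simplest possible case: a single perceptron with no hidden layer, trained with the classic perceptron learning rule. This is a teaching sketch, not how production networks are built; the task (learning the logical AND function), the function names, and the settings are all assumptions chosen for illustration.

```python
# Training data for the logical AND function: (inputs, target output).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, inputs):
    """Step activation: output 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(data, epochs=20, learning_rate=1):
    """Perceptron learning rule: shift weights by the error on each example."""
    weights, bias = [0, 0], 0  # the initial, "ignorant" guess
    for _ in range(epochs):
        for inputs, target in data:
            error = target - predict(weights, bias, inputs)  # measure the error
            # Walk the error back: adjust each weight by error * its input.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

weights, bias = train(data)
print([predict(weights, bias, inputs) for inputs, _ in data])
```

Larger networks stack many such units and use a smoother version of the same idea, but the cycle of predicting, measuring the error, and adjusting is the same.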
Data Science for Managers DATA SOCIETY © 2023 196
What data do you need?
1. Relevant: data must resemble the real-world data you hope to process as much as
possible

2. Properly classified: in order for a deep-learning solution to correctly classify, a labeled
dataset is needed. If a labeled dataset is not available, someone needs to actually
apply the labels to the raw data

3. Formatted: all data needs to be vectorized, and the vectors should be the same
length when they enter the neural net

4. Minimum data requirement: this may vary with the complexity of the problem, but
100,000 instances in total across all categories is a good place to start
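
The "Formatted" requirement can be illustrated with text data. The sketch below shows one simple way to vectorize: map each word to an integer id, then truncate or pad every document to a fixed length. The documents, the helper names, and the chosen length of 4 are hypothetical.

```python
def build_vocabulary(documents):
    """Assign each distinct word an integer id, reserving 0 for padding."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary

def vectorize(document, vocabulary, length):
    """Turn a document into a fixed-length list of word ids.

    Longer documents are truncated; shorter ones are padded with 0.
    Unknown words also map to 0 in this simplified sketch.
    """
    ids = [vocabulary.get(word, 0) for word in document.lower().split()]
    return (ids + [0] * length)[:length]

documents = ["the cat sat", "the dog sat down quietly"]  # hypothetical raw data
vocabulary = build_vocabulary(documents)
vectors = [vectorize(d, vocabulary, length=4) for d in documents]
print(vectors)
```

Both documents end up as vectors of the same length, which is what the network's input layer expects.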

Data Science for Managers DATA SOCIETY © 2023 197


Neural networks: pros and cons
● Pros
○ Neural networks are highly versatile.
○ They are fairly insensitive to noise in your data.
○ They are well-equipped to handle fuzzy and convoluted
relationships.

● Cons
○ It’s a black box – those hidden layers are difficult to explain
and evaluate.
○ They are in danger of overfitting the training data, so they
might not generalize as well to new information.
○ They require an experienced data scientist to set the
parameters of hidden layers and nodes.

Data Science for Managers DATA SOCIETY © 2023 198


Chat question
We started our discussion on neural networks with a drawing activity...

How many of your drawings did the
neural network guess correctly?
Does that mean you are a good (or bad) artist?

Data Science for Managers DATA SOCIETY © 2023 199


Key points
● Don’t accept an analysis at face value – you need
to ask the right questions!

● Most data analyses incorporate multiple methods
in order to determine which one is the most
accurate.

● Remember! The two big components that drive the
decision for which method to use are: the question
you’re asking, and the data you have.

Data Science for Managers DATA SOCIETY © 2023 200


End of Day 3

Foundational data science methods
Advanced data science methods

Questions?
Data Science for Managers DATA SOCIETY © 2023 201
DATA SCIENCE FOR MANAGERS
DAY 4

Data Science for Managers DATA SOCIETY © 2023 202


Agenda
Day 4

• Data visualization
– What is data visualization?
– How do I pick and design visuals?
• Misleading statistics & visual distortions
• Data storytelling

Data Science for Managers DATA SOCIETY © 2023 203


What is data visualization?
• Data visualization is any attempt
to make data more easily
digestible by rendering it in a
visual context.

• Common data visualizations
include tables, charts, graphs,
and dashboards.

Data Science for Managers DATA SOCIETY © 2023 204


Explore or explain
● We can use data visualization to review new data to discover patterns, to spot
anomalies, to test hypotheses, and to check assumptions.

● We can also use data visualization to transform raw data into a compelling story or
takeaway for an external audience.

Data Science for Managers DATA SOCIETY © 2023 205


Chat questions

● What types of data visualization does your
organization produce?

● What improvements would you like to see in
the visualizations created or used by your
organization? Why?

Data Science for Managers DATA SOCIETY © 2023 206


Getting it right
Using visualizations
incorrectly can cause you
to lose your audience, lose
the value in your data,
and ultimately lead to
poor decision making.

Data Science for Managers DATA SOCIETY © 2023 207


Example: The Challenger
• On January 27, 1986, concerned engineers presented data and the following charts to
try to illustrate the damage cold temperatures would do to the O-rings of the
Challenger space shuttle.

Source: Presidential Commission on the Space Shuttle Challenger Accident, vol. 5 (Washington, DC: US Government Printing Office, 1986), pp. 895-896.

Data Science for Managers DATA SOCIETY © 2023 208

Example: The Challenger
• On January 28, 1986, the Challenger space
shuttle exploded 73 seconds after
takeoff.

• Data visualization legend Edward
Tufte argues that the shuttle’s engineers
failed to communicate dangers
because their data wasn't presented in
an easily digestible form.

The chart below shows O-ring damage on the y-axis and
temperature on the x-axis.
Is it easier to see the issue?

https://vizdatar.wordpress.com/2015/05/06/space-shuttle-challenger-explosion-2/

Data Science for Managers DATA SOCIETY © 2023 209


Using appropriate visuals
• We’ll start by talking about how to
pick and design the right visual for
your purpose.

• Then, we’ll discuss common mistakes.

• Later, we’ll talk about how to avoid
being misled by visualizations.

Data Science for Managers DATA SOCIETY © 2023 210


To get started with data viz
1. Know your audience and understand
how it processes visual information.
(Who)
2. Determine what you’re trying to
visualize and what kind of information
you want to communicate. (What)
3. Choose a type of visual that conveys
the information in the best and
simplest form for your audience.
(How)

Data Science for Managers DATA SOCIETY © 2023 211


Who
• Know your audience and understand how it processes visual information.

• Consider audience familiarity. For example:
– High-level executives are generally well-versed in visual data, so use a variety of
methods to stand out
– Less-experienced audiences will want it kept simple (e.g., pie charts, bar graphs, and
word maps)

• Consider how the visualization will be used by the audience:
– Is it for executives to use to make decisions?
– Is it to inform the public?

Data Science for Managers DATA SOCIETY © 2023 212


Chat: who is the audience?

https://www.oreilly.com/library/view/learning-highcharts-4/9781783287451/ch08s04.html
http://www.mensfitness.com/nutrition/what-to-eat/mens-fitness-food-pyramid

Data Science for Managers DATA SOCIETY © 2023 213

What
• Determine what you’re trying to visualize and what kind of information
you want to communicate.

• Remember, the audience only knows as much as you tell them:
– Do you want them to explore the data on their own? (exploratory analysis)
– Do you want to tell a specific story about the data? (explanatory purposes)

• If the message is explanatory, consider:
– What type of data do you have on which to base the analysis?
– What are the audience’s topmost concerns or requirements?
– What decisions can be made based on the results you provide?

Data Science for Managers DATA SOCIETY © 2023 214


How
• Choose a type of visual that conveys the information in the best and
simplest form for your audience.

• The type of visual you use depends primarily on two things:
1. the data you want to communicate
2. what you want to convey about that data

• Then, choose the visual that will be easiest for your audience to read.
– Aim for them to “get it” in 30 seconds or less.

Data Science for Managers DATA SOCIETY © 2023 215


Just a few numbers
• Don’t overcomplicate!

• Simple text works well when
there is just a number or two
to share.

…we spent only $75,000 of our $125,000 budget…

…therefore, it is not surprising that only 29 percent of the applications were accepted…

…product A ($12.99) was much more affordable than product B ($59.99)…

Data Science for Managers DATA SOCIETY © 2023 216


Unique data
• Don’t overcomplicate!

• Tables are great when
communicating to a mixed
audience who will look to a
particular row of interest or when
you need to show different units
of measure.

Product                 Weight     Price
Toaster (UK)            1.05 kg    £17.49
Toaster (US)            3.13 lbs.  $29.99
Toaster (South Africa)  1.07 kg    R239,00

Data Science for Managers DATA SOCIETY © 2023 217


Common messages
Comparison: Evaluate and compare values between two or more data points

Composition: Understand how individual parts make up a whole

Distribution: Combine comparison and composition

Relationships: Represent the correlation or connection between 2+ variables

Data Science for Managers DATA SOCIETY © 2023 218


What if it’s more complicated?
• A dashboard is a collection of visual reports that display important
metrics and KPIs, usually in real time.

Data visualization
• a visual representation of your data, such as a chart
• can be static or dynamic
• typically shows data for a single metric, such as electricity usage

Data dashboard
• a collection of data visualizations assembled into a single, unified view
• might display data visualizations for electricity usage, energy
costs, CO2 emissions, and peak/off-peak use

Data Science for Managers DATA SOCIETY © 2023 219


Sample dashboard

Data Science for Managers DATA SOCIETY © 2023 220


Designing compelling visuals
• Picking the right chart type isn't
enough.

• There are choices to be made
about the elements you include
and how they are formatted.

• Data visualization is an art,
informed by science.

Data Science for Managers DATA SOCIETY © 2023 221


Visual design theory
• Our eyes “load” information while the brain
“processes” it.

• We give the most attention to what looks good
and struggle when our working memory is
overwhelmed.

• For information to be effective, it should not
provide more data than what the human brain
can process.
Data Science for Managers DATA SOCIETY © 2023 222
Example: buying oranges
• You want to buy oranges at a new supermarket.
• Our eyes scan the layout of the supermarket, while the brain processes the various
sections.
• The brain then instructs the eyes to zone in on the fruit section by sending signals about
how fruits look from memory.
• The eyes then break the entire scanned area into parts and scan each part to spot the
fruit section.
• The process is repeated until oranges are located.

Data Science for Managers DATA SOCIETY © 2023 223


Designing compelling visuals
• Our eyes and brains work the same way with data visualizations as they
did in the oranges example.

– Use visual cues to make data visualizations easier for the audience.

• However, every piece of information in a visualization also creates
cognitive load on the viewer, asking them to use their brain power to
process it.

– Reduce visual clutter to lower the cognitive load and help transmission of the message.

Data Science for Managers DATA SOCIETY © 2023 224


Theory
• The visual design tips we’ll review today draw on theory such as:

– the building blocks of visual design described by the Interaction Design Foundation

– the four categories of preattentive visual attributes described in Colin Ware’s book,
Information Visualization: Perception for Design

– the Gestalt Principles of visual perception, which describe how people group similar
elements, recognize patterns, and simplify complex images when we perceive objects

Data Science for Managers DATA SOCIETY © 2023 225


Tip 1: Make position meaningful
Data should be sorted and placed in the visual in a meaningful way.

The left chart is sorted alphabetically; the right by value.
When would you use one over the other?

Data Science for Managers DATA SOCIETY © 2023 226
Tip 2: Group related items
• Things that are closer appear to be more related than those that are spaced farther
apart.
• In fact, proximity overrules the similarity of other factors (e.g., shape, color).

Data Science for Managers DATA SOCIETY © 2023 227


Tip 3: Distinguish different items

• The mind groups together things that look
to be similar and assumes they have the
same function.

• We can use this principle for:
– distinguishing different sections
– differentiating links from regular text
– showing that elements with certain
characteristics serve one purpose and
others a different one

[Figure: the letters A through P shown twice, with F singled out]

Data Science for Managers DATA SOCIETY © 2023 228


Tip 4: Use natural positioning

• People tend to start at the top
left of the visual and scan in zig-zag
motions across the page, forming
a Z-pattern.

• Aim to position elements in a way that
will feel natural for users to consume.

• Also, remember that the top of the
page is the most precious space.

Data Science for Managers DATA SOCIETY © 2023 229
Tip 5: Use labels and legends

● Labels can be used to show the value
of a data point.

● Legends can be used to identify the
size, color or any other distinguishing
feature in the visual.

The labels and legends used
in the bottom chart make it
easier to understand.

Data Science for Managers DATA SOCIETY © 2023 230


Tip 6: Use size to show importance

• Relative size represents relative importance.
• Visuals of almost equal importance should be sized similarly.
• If there’s one really important thing, it must be BIG.

Resizing the “Next Page” button
deemphasizes its importance.

Data Science for Managers DATA SOCIETY © 2023 231


Tip 7: Use color to grab attention

• Color is another powerful tool used to draw the audience's attention.
• However, the following must be kept in mind:
– Use it sparingly: too much variety prevents anything from standing out
– Use it consistently: a color change can be used to visually reinforce change in topic or
tone

Too many colors are used in the image on the left, making it
difficult to identify which are the busiest months.

[Figure: two versions of the same chart, each with rows labeled A through N]

Data Science for Managers DATA SOCIETY © 2023 232


Tip 8: Use color to evoke emotion

• Color evokes emotion, so choose the one that helps reinforce the emotion you want to
arouse in your audience.

Warm colors represent energy
Cool colors represent calmness

Data Science for Managers DATA SOCIETY © 2023 233


Tip 9: Encode data with color

• Use color schemes to encode data as sequential, diverging, or categorical.

Sequential: when the order matters
Diverging: to highlight minimums, maximums, and midpoints
Categorical: for discrete data values representing distinct categories

Data Science for Managers DATA SOCIETY © 2023 234


Sequential color schemes
• Use a sequential color scheme when
the order matters.

• These schemes range between two
colors (usually a lighter shade to a
darker one) by varying one or more
parameters such as saturation.
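
One simple way to build such a scheme is to blend evenly between a light and a dark color. The sketch below interpolates in RGB, which is an assumption made for brevity; design tools often interpolate in other color spaces. The endpoint colors and the function name sequential_palette are hypothetical.

```python
def sequential_palette(light, dark, steps):
    """Linearly interpolate from a light RGB color to a dark one.

    Returns `steps` hex color strings; `steps` must be at least 2.
    """
    palette = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the light end, 1.0 at the dark end
        r, g, b = (round(l + t * (d - l)) for l, d in zip(light, dark))
        palette.append(f"#{r:02x}{g:02x}{b:02x}")
    return palette

# Hypothetical endpoints: a pale blue and a navy blue.
print(sequential_palette((222, 235, 247), (8, 48, 107), steps=5))
```

Assigning the lightest color to the lowest value and the darkest to the highest lets viewers read order directly from shade.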

Data Science for Managers DATA SOCIETY © 2023 235


Diverging color schemes
• Use a diverging color scheme to
highlight minimums, maximums, and
midpoints.

• These schemes range between three or
more colors, with the different colors
being quite distinct, usually having
different hues.

Data Science for Managers DATA SOCIETY © 2023 236


Categorical color schemes
• Use a categorical color
scheme for discrete data
values representing distinct
categories.

• These schemes use different
hues with consistent steps in
lightness and saturation.

Data Science for Managers DATA SOCIETY © 2023 237


Tip 10: Reduce chart clutter

Small changes can have a big effect on
a visualization’s impact.

1. Remove special effects
2. Lighten the background
3. Remove chart borders
4. Remove gridlines
5. Label data directly
6. Clean up axis titles and labels
7. Use consistent colors

Data Science for Managers DATA SOCIETY © 2023 238


Activity: analyze visualizations
● Turn to page 17 of your participant guide to find the Analyzing
visualizations activity.

● You will be asked to assess 4 visualizations. Write down your
notes.

Data Science for Managers DATA SOCIETY © 2023 239

Polling question

For each of the charts, select the best way to improve
the visual:

• Change colors
• Remove extra information
• Add more information

Which chart is the best?

Data Science for Managers DATA SOCIETY © 2023 240


Polling question

What tools have you used to visualize data?


● Google Charts
● Excel
● Tableau
● Python
● RStudio
● Power BI

Data Science for Managers DATA SOCIETY © 2023 241


Excel
● Create basic chart
types such as pie, line,
bar, scatter, and
more.

● Charts created in
Excel can easily be
ported to PowerPoint
and Word.

Data Science for Managers DATA SOCIETY © 2023 242


Google Charts
● Free to use; includes a rich
chart gallery, full
customization, controls
and dashboards, and
HTML5 rendering

● Has more options than
Excel; creates interactive,
animated, and
geospatial graphics

Data Science for Managers DATA SOCIETY © 2023 243


Tableau
● Tool for creating
powerful and insightful
visuals
● No programming
required; drag and drop
● Share and collaborate
on premises or in the
cloud
● Platform can be used
department- or
organization-wide

Data Science for Managers DATA SOCIETY © 2023 244


R and RStudio
● Programming tool
● Mainly used for statistical
analysis
● Offers functions and
libraries to build
visualizations and
present data
● Open source and free

Data Science for Managers DATA SOCIETY © 2023 245


Python
● Programming tool

● You'll find libraries for
practically every data
visualization need

● Free and open source

Data Science for Managers DATA SOCIETY © 2023 246


Power BI
● Interactive
visualizations and
business intelligence
capabilities

● Simple interface

● Create dashboards

Data Science for Managers DATA SOCIETY © 2023 247


Break

Data Science for Managers DATA SOCIETY © 2023 248

Agenda
Day 4

• Data visualization
• Misleading statistics & visual distortions
– What do I look out for when reviewing statistics or visualizations?
• Data storytelling

Data Science for Managers DATA SOCIETY © 2023 249


Misleading stats & visual distortions
• Sometimes charts and statistics look
presentable but could be misleading.

• Unreliable data comparisons erode
credibility and eventually dissuade
viewers from using the analysis.

Data Science for Managers DATA SOCIETY © 2023 250


Misleading statistics

Data Science for Managers DATA SOCIETY © 2023 251

Misleading statistics
• “Bill Gates walks into a bar and everyone inside becomes a
millionaire…on average.”

• In 2011, the average income of the 7,878 households in Steubenville, Ohio,
was $46,341. But if just two people, Warren Buffett and Oprah Winfrey,
relocated to that city, the average household income in Steubenville
would rise 62 percent overnight, to $75,263 per household.

What’s wrong with these statements?

https://www.nytimes.com/2013/05/26/opinion/sunday/when-numbers-mislead.html
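
Both statements lean on the mean, which a single extreme value can drag far from what is typical. The sketch below checks the arithmetic. The Steubenville figures are the ones quoted on this slide, and the calculation assumes the two newcomers form two new households. The bar patrons and their incomes are hypothetical.

```python
from statistics import mean, median

# Figures quoted on the slide.
households = 7_878
average_before = 46_341
average_after = 75_263

# Combined income the two newcomers would need for the quoted averages to hold,
# assuming they form two new households.
implied_newcomer_income = average_after * (households + 2) - average_before * households
percent_rise = 100 * (average_after - average_before) / average_before

# Hypothetical bar: nine patrons earning $50,000 each, then one billionaire walks in.
patrons = [50_000] * 9
with_billionaire = patrons + [1_000_000_000]

print(f"Implied newcomer income: ${implied_newcomer_income:,}")
print(f"Rise in average: {percent_rise:.0f}%")
print(f"Mean: {mean(with_billionaire):,.0f}  Median: {median(with_billionaire):,.0f}")
```

The mean in the bar jumps past one million, while the median stays at what a typical patron earns. When a distribution is skewed, the median is usually the more honest summary.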

Data Science for Managers DATA SOCIETY © 2023 252


Misleading statistics

• Numbers don’t have to be
fabricated to be
misleading.

• Misleading statistics are the
misuse (purposeful or
not) of numerical data.

Data Science for Managers DATA SOCIETY © 2023 253


Misleading statistics
• Misleading statistics can be created through issues with:

– data collection: small sample sizes, biased sampling, loaded questions
– data processing: no/poor data normalization, ignoring important features
– data presentation: hiding context, omitting certain findings, visual distortions

Data Science for Managers DATA SOCIETY © 2023 254

How to avoid being misled?

• Do some math. Are there any obvious mistakes?

• Check the source. Is it credible and current?

• Question the methodology. Is there bias? Is the result statistically
significant?

• Conduct research. What does Google tell you?

Data Science for Managers DATA SOCIETY © 2023 255


Visual distortions

Data Science for Managers DATA SOCIETY © 2023 256

Visual distortions
• Look at the top graph.

• At first, Jamaica seems to have
half the average workdays per
employee that Suriname does.

• In reality, the difference is much
less.

What’s the difference between
the two charts?

Data Science for Managers DATA SOCIETY © 2023 257


Truncated graphs
• One of the most common
manipulations is omitting baselines or
beginning the y-axis of a graph at an
arbitrary number instead of 0.

• This creates the impression that there is
a significant difference between data
points, when in fact, there is relatively
little disparity.
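
The size of the distortion is easy to quantify: compare the ratio of the drawn bar heights with the ratio of the actual values. The numbers and the helper apparent_ratio below are hypothetical.

```python
def apparent_ratio(value_a, value_b, axis_start=0):
    """Ratio of drawn bar heights when the y-axis starts at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)

# Hypothetical values: two data points that differ by only 5 percent.
a, b = 100, 105

print(apparent_ratio(b, a))                 # axis starts at 0
print(apparent_ratio(b, a, axis_start=95))  # truncated axis
```

With the axis starting at 0, the taller bar is drawn only slightly taller. With the axis starting at 95, the same bar is drawn twice as tall, even though the underlying values have not changed.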

Data Science for Managers DATA SOCIETY © 2023 258


Visual distortions
What distortion has been used in these charts to change how the data appears?

Data Science for Managers DATA SOCIETY © 2023 259


Exaggerated scaling
• Exaggerating the scale of a line graph can easily minimize or maximize the change
shown.

Data Science for Managers DATA SOCIETY © 2023 260


Visual distortion

How might this chart be misleading?

https://financesonline.com/number-of-gamers-worldwide/

Data Science for Managers DATA SOCIETY © 2023 261

Ignoring convention
● Deviating from convention
(such as green is positive and
red is negative) can create
confusion and misinterpretation
of the facts.

● In this example, the axis also
runs downward, making an
increase in gamers look like a
decrease at a quick glance.
https://financesonline.com/number-of-gamers-worldwide/

Data Science for Managers DATA SOCIETY © 2023 262

Visual distortion

What do you notice about these pie charts?

Data Science for Managers DATA SOCIETY © 2023 263


Numbers don’t add up
• With pie charts, the slices must add up to the
whole (100%). When the numbers don't add up, you know there's an
issue.
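
This check takes one line of arithmetic. The sketch below allows a small tolerance because published percentages are often rounded; the tolerance value and the example figures are hypothetical.

```python
def slices_add_up(percentages, tolerance=0.5):
    """Return True when pie-chart percentages total 100, within rounding tolerance."""
    return abs(sum(percentages) - 100) <= tolerance

# Hypothetical poll results as they might appear on two pie charts.
print(slices_add_up([45, 30, 25]))
print(slices_add_up([70, 63, 60]))
```

A total well above 100 often means respondents could choose more than one option, in which case a pie chart is the wrong chart type.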

Data Science for Managers DATA SOCIETY © 2023 264


Visual distortion
Does the S&D or the
EPP party have more
representation in
parliament?

https://www.businessinsider.com/pie-charts-are-the-worst-2013-6

Data Science for Managers DATA SOCIETY © 2023 265


3D distortion
• 3D pie charts can be used to distort and cause a misinterpretation of the data.
• The same data is represented in both charts below.

https://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Data Science for Managers DATA SOCIETY © 2023 266
Visual distortion
Which company has a better sales trajectory?

Data Science for Managers DATA SOCIETY © 2023 267


Improper extraction
• Surprise! It’s the same company. One
graph showed only odd years and the
other only even.

• To align to a particular narrative, some
may choose to visualize only a portion
of the data.

• This is more common in graphs that
have time as one of their axes.
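
The odd-year and even-year trick can be reproduced with a single series. The sales figures below are hypothetical, chosen so that the two subsets tell opposite stories.

```python
# Hypothetical annual sales for a single company (year: sales in $ millions).
sales = {2015: 10, 2016: 20, 2017: 12, 2018: 18,
         2019: 14, 2020: 16, 2021: 16, 2022: 14}

odd_years = {year: value for year, value in sales.items() if year % 2 == 1}
even_years = {year: value for year, value in sales.items() if year % 2 == 0}

def net_change(series):
    """Last value minus first value, in chronological order."""
    years = sorted(series)
    return series[years[-1]] - series[years[0]]

print("Odd years only: ", net_change(odd_years))
print("Even years only:", net_change(even_years))
print("All years:      ", net_change(sales))
```

The odd-year subset shows steady growth and the even-year subset shows steady decline, although both come from the same company. Asking to see the full series is the simplest defense.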

Data Science for Managers DATA SOCIETY © 2023 268


Visual distortion

What story does this visualization tell?


Data Science for Managers DATA SOCIETY © 2023 269
Correlating causation
• Data visualizations can imply causal links through the way that data is presented to the viewer.
• However, correlation does not equal causation.
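
Two series can be almost perfectly correlated without either causing the other, for example when both follow the seasons. The sketch below computes the Pearson correlation coefficient by hand; the monthly figures are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical monthly figures that both rise in summer.
ice_cream_sales = [20, 25, 40, 60, 80, 95]
sunburn_cases = [2, 3, 5, 8, 11, 13]

print(round(pearson(ice_cream_sales, sunburn_cases), 2))
```

A coefficient near 1 here says only that the two series move together. Warm weather drives both, so neither one causes the other.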

Data Science for Managers DATA SOCIETY © 2023 270


Recap
To avoid being misled, look for:
– misleading statistics
– truncated graphs
– exaggerated scaling
– ignored conventions
– numbers that don’t add up
– 3D distortion
– improper extraction
– correlating causation

Data Science for Managers DATA SOCIETY © 2023 271

Break

Data Science for Managers DATA SOCIETY © 2023 272

Agenda
Day 4

• Data visualization
• Misleading statistics & visual distortions
• Data storytelling
– Why are data stories useful?
– How do I craft a data story?

Data Science for Managers DATA SOCIETY © 2023 273


What is data storytelling?
● You focus on an insight and

● persuade an audience

● that the outcome of your analysis

● demands a course of action

● through narrative and visual communication.

Data Science for Managers DATA SOCIETY © 2023 274


Data stories and data visualizations
● A single data story may make use of multiple data visualizations.

● Data stories arrange visualizations into the linear sequence of storytelling: a beginning, a
middle, and an end.

● Data story formats will likely incorporate other elements to explain and contextualize the
visualizations:
○ prose text, either written or spoken
○ annotations, callouts, and labels
○ icons or graphics
○ images or photographs

Data Science for Managers DATA SOCIETY © 2023 275


Can’t I just use a chart?
● Narratives are super effective, “sticky” content delivery mechanisms.

● Not everyone is a statistician, but they still want to make evidence-based
decisions.

● Stories let you overview key findings quickly.

● Stories tap into both the logical and the emotional aspects of
persuasion.

Data Science for Managers DATA SOCIETY © 2023 276


Why choose story?
If your insight is...   A story can...

Unpleasant              Help convince your audience that even unwanted results are actionable.

Disruptive              Encourage your audience to break with tradition, if the upshot is valuable enough.

Unexpected              Explain why a prediction or intuition failed, and offer some analysis and a solution.
Data Science for Managers DATA SOCIETY © 2023 277
Why choose story?, cont’d
If your insight is...   A story can...

Complex                 Guide your audience to a more complete understanding in manageable chunks.

Risky                   Embolden your audience to take responsibility for making a tough choice.

Costly                  Compel your audience to consider a high-cost solution by underscoring the high value.

Data Science for Managers DATA SOCIETY © 2023 278


How do I craft a data story?

REFINE

TAILOR

OUTLINE

PLOT

FORMAT

Data Science for Managers DATA SOCIETY © 2023 279


REFINE
Refining your insight

● In a data story, your insight is the most important piece.

● What will make your audience perceive your insight as maximally:

○ Valuable: an observation that seems to be rewarding

○ Relevant: an observation that seems timely

○ Practical: an observation that suggests a realistic and feasible course of action

○ Specific: an observation that clearly and completely accounts for a problem

● Make your insight as concrete and contextualized as possible

Data Science for Managers DATA SOCIETY © 2023 280


TAILOR
Tailoring to your audience

[Figure: diagram with the labels AUTHORITY, GOALS, TIMING, EMOTION, and REASON]
Data Science for Managers DATA SOCIETY © 2023 281
OUTLINE
Outlining

INSIGHT / OUTCOME

ACTIONS

PROBLEM

MEASURES

Data Science for Managers DATA SOCIETY © 2023 282


PLOT
Plotting with a storyboard

• It’s okay for your data story to remain flexible at
this early stage.

• There are no right answers, only consideration
and iteration.

• Focus on building the elements of the story first,
on paper.

• Try out different versions quickly and don’t get
too attached.

Data Science for Managers DATA SOCIETY © 2023 283


FORMAT
Formatting for delivery

● You may find yourself needing to alter the way you tell your data story based on
the affordances of the format.

● Sometimes the format is a given, but other times, it will depend upon your input
and the use case.

● As with visualizations, the simplest storytelling format is often the best.

Slide Deck: Sequence of slides intended for real-time presentation
Document: Illustrated text (report, infographic) to be read anytime
Interactive: Digital object intended to align function with user experience
Hybrid: Blend / compromise of at least two formats

Data Science for Managers DATA SOCIETY © 2023 284


The Joy of Stats

Data Science for Managers DATA SOCIETY © 2023 285


Chat questions

● What data storytelling
elements did you notice?

● Was this a good example of
data visualization and
storytelling? Why or why not?

Data Science for Managers DATA SOCIETY © 2023 286


End of Day 4

Data visualization
Misleading statistics & visual distortions
Data storytelling

Questions?
Data Science for Managers DATA SOCIETY © 2023 287
Thank you and congratulations!

Data Science for Managers DATA SOCIETY © 2023 288
