Department of Statistics
The Maharaja Sayajirao University of Baroda
Agenda
 What is Data Science?
 What does Data Science promise for your business?
 Investment in Data Science and ROI
 Data Science Process
 Data Science Roles
 Infrastructure Requirements
 Data Science Tools and Techniques
 Where do I begin?
 Developing Data Science Culture
 Questions
What is Data Science?
Everything concerning Data
is in the purview of Data Science
What is Data Science?
Data science is a young inter-disciplinary field that uses
scientific principles, methods, processes, algorithms and
systems to extract knowledge and insights from data.
 Data science involves Statistics at its core.
 Data Science extends the field of statistics to
incorporate advances in computing with data
 Apart from Statistics, Computer Science is another
major discipline that plays a major role in capturing,
managing and sharing data.
 It is a driving force behind innovations in almost all
disciplines of Science.
 This new approach is termed data-driven science.
Data Science Discipline
Data Science Profession
The Data Science promise
Top Objectives of Successful Businesses
 Increase profitability
 Ensure customer satisfaction
 Optimize productivity
 Make your employees happy
 Social and public responsibility
Businesses traditionally rely on intuition, creativity and
experience to fulfill these objectives.
This has been reflected in the HiPPO (highest paid person's
opinion) phenomenon for decades.
The Data Science promise
Without Data, you are just another person with an opinion
– W. Edwards Deming
Although intuition, experience, etc. are important, they work
much better when supported by data.
Data Science helps you to
 Understand your customers better by learning about
 Their needs
 Their struggles, their motivations, their habits and their
relationships to your product or service.
 Use this understanding to create a better product and/or
service and turn that into profit.
The Data Science promise
Data science helps you to
 See clearly how your business performs.
 Understand dynamics of your business
 Improve business processes
 Discover new opportunities / products / services that
your customers need.
 Discover new audiences for your current products /
services.
and much more...
The Data Science promise
If you manage to collect the right data and use it well,
 You will be able to make better decisions more quickly
and more easily.
 That will lead to a better product, happier
customers and eventually more revenue.
That’s what business data science is all about.
If you are among the first in your domain to embrace
data science, you can outsmart your competition.
Signs that You Should Invest in Data
Science
 Your marketing budgets are growing, but your sales
numbers are not.
 Your company is struggling with personalization
 It’s taking too long for the sales team to score leads
 You are unable to analyze your marketing ROI
 You want the competitive edge without significantly
increasing your budget
 Your competitors are already investing in Data Science
Data Science Investments
Human Resource
According to an estimate, good teams spend about 5% of
their total working hours with data and quantitative
research.
 So, if you are working alone, that's around 2-3 hours a
week.
 If you are a team of 50, then ideally you should have
one or two full-time dedicated people for Data Science
projects.
 As your business grows, you may set up a Data Science
division.
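The arithmetic behind the 5% estimate can be sketched as follows. This is a minimal illustration, assuming a 40-hour working week; the function name and parameters are hypothetical and not part of the original estimate.

```python
def data_time_budget(team_size, hours_per_week=40.0, data_share=0.05):
    """Return (weekly data hours, full-time equivalents) for a team.

    Assumes every person works `hours_per_week` and that `data_share`
    of all working hours goes to data and quantitative research.
    """
    total_hours = team_size * hours_per_week
    data_hours = total_hours * data_share
    fte = data_hours / hours_per_week  # full-time people this equals
    return data_hours, fte


solo_hours, solo_fte = data_time_budget(1)
team_hours, team_fte = data_time_budget(50)

print(f"Working alone: {solo_hours:.1f} hours per week")
print(f"Team of 50: {team_hours:.0f} hours per week, "
      f"about {team_fte:.1f} full-time people")
```

With a 40-hour week the solo figure comes to 2 hours, and a team of 50 comes to about 2.5 full-time equivalents, which is in the same range as the slide's rounded numbers.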
Data Science Investments
Data Infrastructure
A data infrastructure is a digital infrastructure for
promoting data sharing and consumption.
 It includes data assets, hardware, software and
processes.
 It includes data ingestion and storage infrastructure
 It includes data management, data security and data
privacy.
Data Science Investments
Analytics Infrastructure
Much of data science work involves computationally
intensive experiments.
 Thus, Data scientists should be able to access large
machines/ specialized hardware for running
experiments or doing exploratory analysis.
 They should also be able to easily use burst/elastic
compute on demand.
 Data Scientists need software support for
communicating their findings to business
stakeholders.
Cloud Analytics
On-premises analytics solutions have challenges
 Cost of infrastructure
 Need for specialized skills
 Time required to configure and maintain these
systems
 Limited scalability
Cloud Analytics provides a solution. Some major players:
 IBM Cognos Analytics
 Microsoft Azure Stream Analytics
 AWS Analytics
Success Stories
 Southwest Airlines saved $100 million by reducing the
time its planes stood idle on the airstrip.
 UPS, a logistics company, saved 38 million gallons of
fuel by optimizing its fleet.
 The Internal Revenue Service saved $2 billion in tax
dollars by improving its ability to detect identity fraud
and improper payments.
 Croma, a subsidiary of Tata Sons, used data science to
build a 360° view of its users and give its online
customers a personalized shopping experience. Its
conversions improved significantly.
And many more…
With Data in your possession,
You are sitting on a gold mine…
However, if you don't know this fact OR don’t know how
to extract it, you won't be able to benefit from it.
Data Science Process
The diagram shows the major phases of the data science
process, following the CRISP-DM methodology.
Data Science Process
The six steps of a data science project
 Data Collection
 Data Storage
 Data Preparation
 Data Utilization: Business Analytics, Predictive Analytics,
Developing Data Product
 Communication, data visualization
 Data-driven Decision
Data Collection
This is where many businesses fail. Too many companies collect
incomplete, unreliable data, and everything they do after that is
compromised.
Proper tracking and collection of data, and ensuring its quality, are
crucial for every business doing data science.
What to collect?
 It is important to decide the details of the data that must be
collected/ captured.
 The general idea is to collect everything you can, because the
value of data may be realized at any time in the future.
 However, the more data you capture, the more engineering time
you need to allocate to implement it, the slower your business
processes will be, the more complex your data infrastructure
becomes, and so on…
Also consider legal and ethical aspects!
Data Wrangling
Data wrangling is all about getting the data into the right
form that is suitable for feeding into the modeling and
visualization stages.
This activity involves a variety of tasks, from discovering
data to acquiring it and transforming it into a form
that is ready to be processed.
The tasks following the data acquisition are also referred
to by different terms such as Data Munging or Data
Preprocessing.
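A few typical wrangling steps (trimming whitespace, normalizing text, converting types, dropping incomplete rows and removing duplicates) can be sketched with the Python standard library. The sample records, column names and the `wrangle` function are hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical raw export with stray spaces, mixed case,
# a missing value, a non-numeric value and a duplicate row.
RAW = """customer_id,city,amount
101, Vadodara ,250
102,ahmedabad,
101, Vadodara ,250
103,SURAT,abc
104,Surat,120.5
"""


def wrangle(raw_text):
    """Clean raw CSV text into a list of typed, de-duplicated records."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        city = row["city"].strip().title()  # normalize spacing and case
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # drop rows with a missing or non-numeric amount
        record = (int(row["customer_id"]), city, amount)
        if record in seen:
            continue  # drop exact duplicates
        seen.add(record)
        cleaned.append(
            {"customer_id": record[0], "city": city, "amount": amount}
        )
    return cleaned


records = wrangle(RAW)
for rec in records:
    print(rec)
```

In practice a data-frame library would usually do this work; the standard library is used here only to keep the sketch self-contained.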
Big Data
Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks
everyone else is doing it, so everyone claims they are
doing it.
- Dan Ariely
What is Big data?
 Big data is a data set whose volume is beyond the ability of
commonly used hardware and software tools to capture, manage,
and process the data within a tolerable execution time.
 Such data sets are gathered by information-sensing mobile devices,
remote sensing technologies, software logs, cameras,
microphones, RFID readers, and many such devices.
 As a result, such datasets are continuously growing in size.
 By 2020, there will be around 40 trillion gigabytes of data
 90% of the data in the world today was created within just the
past two years.
 Internet users generate about 2.5 quintillion bytes (2.5 million
terabytes) of data each day
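The unit conversions behind these figures can be checked with a few lines of integer arithmetic. This sketch assumes decimal (SI) units, where 1 terabyte is 10^12 bytes; the variable names are illustrative only.

```python
# Decimal (SI) unit sizes in bytes; an assumption of this sketch.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_ZB = 10**21

# 2.5 quintillion bytes per day, written as an exact integer.
daily_bytes = 25 * 10**17
daily_tb = daily_bytes // BYTES_PER_TB

# 40 trillion gigabytes, expressed in zettabytes.
total_2020_bytes = 40 * 10**12 * BYTES_PER_GB
total_2020_zb = total_2020_bytes // BYTES_PER_ZB

print(f"Daily volume: {daily_tb:,} TB")
print(f"40 trillion GB = {total_2020_zb} ZB")
```

So 2.5 quintillion bytes is indeed 2.5 million terabytes, and 40 trillion gigabytes corresponds to 40 zettabytes.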
Twitter
 500 million tweets per day
Facebook
 Facebook generates 4 petabytes of data per day.
 Users generate 4 million likes every minute.
 350 million photos are uploaded per day.
Instagram
 The Like button is hit an average of 4.2 billion times/ day.
WhatsApp
 In 2018, WhatsApp users sent 65 billion messages per
day
Almost every field
Some Examples
Characteristics of big data (3 Vs)
In a 2001 research report, analyst Doug Laney (then at META
Group, which Gartner later acquired) defined data growth
challenges (and opportunities) as being three-dimensional:
increasing volume, velocity, and variety.
Data volume
 This is the primary attribute of big data. Most people
define big data in multiple terabytes, sometimes petabytes.
Data variety
 Big data is coming from a greater variety of sources than
ever before. Many of the newer ones are Web sources,
including logs, click-streams, and social media.
Data velocity
 Big data can be described by its velocity or speed: the rate
at which new data is generated.
Data Analysis
Data Analysis is the process of extracting value from Data.
This is where data science gets exciting. It’s a creative process.
 Ask the right questions
It is important to ask the right questions. They usually come
from management or other colleagues, who may
already have suspicions based on their experience.
 Do qualitative research
It’s important to understand the business and its
customers in detail. This can be achieved
through qualitative research, which in turn gives direction
to useful investigations through data.
Three Major Business Applications
 Business Analytics
It answers the questions of “what has happened in the
past?” and “where are we now?”
E.g. reporting, measuring retention, finding the right user
segments, funnel analysis, etc.
 Predictive Analytics
It answers the question, “what will happen in the future?”
E.g. early warning, predicting the marketing budget you will
need in the next quarter, etc.
 Data (Based) Product
A product that is built on, and works using, your data.
E.g. recommendation systems, image recognition, voice
recognition, etc.
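As an illustration of the business analytics category, the sketch below computes a simple funnel analysis: the conversion rate from each stage to the next and from the top of the funnel. The stage names and user counts are hypothetical.

```python
def funnel_report(stages):
    """Compute step and overall conversion for an ordered funnel.

    `stages` is an ordered list of (stage_name, user_count) pairs,
    starting with the top of the funnel.
    """
    report = []
    top = stages[0][1]
    previous = top
    for name, count in stages:
        report.append({
            "stage": name,
            "users": count,
            # share of users who made it from the previous stage
            "step_rate": count / previous if previous else 0.0,
            # share of users who made it from the top of the funnel
            "overall_rate": count / top if top else 0.0,
        })
        previous = count
    return report


# Hypothetical counts for an online shop.
FUNNEL = [
    ("visited", 10000),
    ("signed_up", 2000),
    ("added_to_cart", 500),
    ("purchased", 100),
]

report = funnel_report(FUNNEL)
for row in report:
    print(f"{row['stage']:<14} {row['users']:>6} "
          f"step {row['step_rate']:.0%}  overall {row['overall_rate']:.0%}")
```

The stage with the lowest step rate is usually the first place to investigate.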
 SafetiPin is a map-based mobile phone application, which
leverages the power of big data to make our communities
and cities safer for women.
 It provides safety-related information collected through
crowdsourcing.
 The app captures data on 9 parameters (lighting,
openness, visibility, people density, security in the area,
walk path, transportation, gender diversity, feeling in the
area), and uses them to compute a safety score, which
indicates personal vulnerability to crime, for every
pocket of the city.
 The app uses this score and integrates with big data sources
such as Google Maps to recommend the Safest Route, that is,
the best possible route in terms of safety.
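To make the idea of a data product concrete, here is a purely illustrative sketch of how nine ratings could be combined into one score and used to compare routes. It is not SafetiPin's actual method: the 0 to 5 rating scale, the equal weighting, the route rule and all names below are assumptions.

```python
# The nine parameters named on the slide, as hypothetical field names.
PARAMETERS = (
    "lighting", "openness", "visibility", "people_density", "security",
    "walk_path", "transportation", "gender_diversity", "feeling",
)


def safety_score(ratings):
    """Equal-weight average of the nine ratings (assumed 0 to 5 scale)."""
    missing = [p for p in PARAMETERS if p not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    return sum(ratings[p] for p in PARAMETERS) / len(PARAMETERS)


def safest_route(routes):
    """Pick the route whose weakest segment has the highest score.

    `routes` maps a route name to a list of per-segment safety scores.
    Using the minimum is an assumed rule: one unsafe stretch makes the
    whole route unattractive.
    """
    return max(routes, key=lambda name: min(routes[name]))


# Hypothetical audit of one location: good overall, poor lighting.
audit = {p: 4 for p in PARAMETERS}
audit["lighting"] = 1
audit["feeling"] = 1
score = safety_score(audit)

# Hypothetical per-segment scores for two candidate routes.
routes = {
    "main_road": [4.0, 3.5, 3.8],
    "shortcut": [4.5, 1.5, 4.2],
}
best = safest_route(routes)

print(f"Location score: {score:.2f}")
print(f"Recommended route: {best}")
```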
Data Communication
This is the step where most data science projects fail.
To reap the benefits of Data Science, effective
communication of the findings is crucial.
 It is necessary to build a culture where people can
communicate and use data. For this, everyone at your
company needs to be involved.
 Business people should also educate data scientists by
helping them to create and deliver better presentations.
 Communication should be as simple as it can be.
 No fancy scientific words
 No complicated charts
Which people do you need in your team?
Your data science team should feature
 Best Data Engineers,
 Best software developers, and
 Best statisticians
They need to have domain knowledge to know the actual
business application of their data projects.
Data Science Roles: Data Engineer
The data engineer is someone who develops, constructs,
tests and maintains data architectures, such as
databases, data warehouses, data lakes and large-scale
processing systems.
Data engineers manage data of all sizes and types. They
develop, deploy, manage, and optimize data pipelines
and infrastructure to transform and transfer data to data
scientists for querying.
Skills needed: SQL, databases, data warehousing,
ETL, big data tools, building APIs
Data Science Roles: Data Analyst
Data analysts perform the following tasks
 Data wrangling
 Create data visualizations and dashboards
 Analyze data to discover interesting trends
 Present the results of analysis to business clients or
internal teams
 Help other stakeholders to optimize their data utilization
Skills needed: Programming skills (SAS, R, Python),
statistical and mathematical skills, data wrangling, data
visualization tools like Tableau/ Power BI
Data Science Roles: Data Scientist
A data scientist is a specialist having expertise in
Statistics and developing models, including predictive
models and machine learning models.
 Data scientists can tackle more open-ended questions
by leveraging their knowledge of advanced statistics.
 Data scientists bring an entirely new approach and
perspective to understanding data
Skills needed: Programming skills (SAS, R, Python),
statistical and mathematical skills, storytelling and data
visualization, Hadoop, SQL, machine learning, Big data
analytics.
Data Science projects can fail
Yes, that’s true!
Here are some of the reasons.
 Not every manager is ready for this change.
Even a very well-executed data project can fail, just
because someone’s feelings or ego is hurt.
 Answering the wrong question
 Failure to integrate into business operations
 Stakeholders disengaged
 Benefits don’t justify the costs
Developing Data Science culture
Failures can be prevented by establishing a data-driven
company culture early on. As the company size
increases, it becomes harder to make the organization
data-driven.
 It’s important that the managers develop the right
mindset.
 It is important that everyone in the organization
understands the importance of data science.
Data professionals should hold frequent presentations
about their recent findings.
Data Strategy
Why Data Strategy?
If you don't have a data strategy, you won't have enough
information to make the right decisions. Having a data
strategy is crucial to becoming a data-driven organization.
Without it
 you will waste money on the wrong marketing
campaigns
 you will have the wrong product development plans
Where do I begin?
It is recommended to start by developing a Data Strategy. For
this, the following questions need to be answered
 What are the right metrics to focus on? And how to figure them out?
 How to collect and store the data? Which tools should you use?
 Can you trust your data? And how can you make it trustworthy?
 How to communicate the data in your organization efficiently?
Start with a simple data project that answers the basic questions
about your business.
Subsequently, as you recognize your customers’ needs, you may
initiate other projects such as predictive modelling and machine
learning.
Pick your first data project
Develop and use a prioritization matrix.
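A prioritization matrix can be as simple as scoring each candidate project on business value and feasibility. The sketch below is one possible version: the 1 to 5 scale, the threshold, the quadrant labels and the example projects are all assumptions, not taken from the slide.

```python
def classify(value, feasibility, threshold=3):
    """Place a project in one of four quadrants of the matrix."""
    high_value = value >= threshold
    feasible = feasibility >= threshold
    if high_value and feasible:
        return "do first"
    if high_value:
        return "plan for later"
    if feasible:
        return "nice to have"
    return "avoid"


def prioritize(projects):
    """Rank projects by value x feasibility and attach their quadrant."""
    ranked = sorted(
        projects,
        key=lambda p: p["value"] * p["feasibility"],
        reverse=True,
    )
    return [(p["name"], classify(p["value"], p["feasibility"]))
            for p in ranked]


# Hypothetical candidates, scored 1 (low) to 5 (high).
CANDIDATES = [
    {"name": "churn prediction", "value": 5, "feasibility": 2},
    {"name": "basic sales reports", "value": 5, "feasibility": 5},
    {"name": "real-time recommender", "value": 2, "feasibility": 1},
    {"name": "office dashboard redesign", "value": 2, "feasibility": 4},
]

ranking = prioritize(CANDIDATES)
for name, quadrant in ranking:
    print(f"{name:<28} {quadrant}")
```

Projects that land in the high-value, high-feasibility quadrant are natural candidates for a first data project.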
Your first data project
Your first data project should be a simple project (feasible)
with the aim of understanding your own business and your
customers better (high business value).
In other words, start by investing in business analytics and
simple reports.
This project answers the basic questions about your business,
such as
 Who prefers what and why?
 How to win customer loyalty?
 Why did a particular product fail?
And so on …
Questions?
You can write to me
kalamkar.vipul-stat@msubaroda.ac.in
Thanks!

Embracing data science
