IDS 1 IntroDataScience
MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their course
materials freely available online.
Assignment Online 20 %
Fundamentals of Data Science
What is Data Science
Data science is a ‘concept to unify statistics, data analysis, machine learning
and their related methods’ in order to ‘understand and analyze actual phenomena’
with data. - Wikipedia
https://www.zeolearn.com/magazine/data‐science‐vs‐machine‐learning‐artificial‐intelligence
https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
Data Science, AI and ML
• Artificial Intelligence
• AI involves making machines capable of mimicking human behavior,
particularly cognitive functions like facial recognition, automated driving…
• Machine Learning
• Considered a sub-field of or one of the tools of AI.
• Involves providing machines with the capability of learning from experience.
• Data Science
• Data science is the application of machine learning, artificial intelligence, and
other quantitative fields like statistics, visualization, and mathematics to
uncover insights from data to enable better decision making.
Why Data Science ?
Case study - Moneyball: The Art of Winning an Unfair Game
• The 2003 publication of Moneyball: The Art of Winning an Unfair Game, by
Michael Lewis
• Moneyball told the story of how the Oakland Athletics, under general manager
Billy Beane, employed data and analytics to field a competitive baseball team
on a low budget
• It is the story of how existing data can be examined for meaning in ways that
were never intended or imagined when they were originally collected
• http://dataanalyticsedge.com/2019/11/14/moneyball-the-must-watch-movie-key-learning-for-every-aspiring-data-analyst-and-data-scientist/
Why Data Science ? - Data-Driven Decisions
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
DataFlair
INTRODUCTION TO DATA SCIENCE
DATA SCIENCE IN FACEBOOK
Social Analytics
Quantitative research
Makes use of deep learning
Deeptext
Targeted Advertising
Predictive Analysis
Anticipatory shipping model
Price discounts
Fraud Detection
Improving Packaging Efficiency
Spotify uses data science to gain insights about which universities had the
highest percentage of party playlists and which ones spent the most time
on it.
"Spotify Insights" publishes information about ongoing trends in music.
Spotify’s Niland, an API based product, uses machine learning to
provide better searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award
Winners.
DATA SCIENCE CHALLENGES
Cognitive biases are distortions of reality caused by the lens through
which we view the world.
Each of us sees things differently based on our preconceptions, past
experiences, cultural, environmental, and social factors. This doesn’t
necessarily mean that the way we think or feel about something is truly
representative of reality.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST
• PROGRAMMING
• QUANTITATIVE ANALYSIS
• PRODUCT INTUITION
• COMMUNICATION
• TEAMWORK
[Figure: traits of a data scientist: Communicative, Qualitative, Curious, Technical, Creative, Skeptical]
[Figure: tools: R, SQL, Python, Scala, SAS, Hadoop, Julia, Tableau, Weka]
ALGORITHMS FOR A DATA SCIENTIST
[Figure: Logistic, SVM, Decision Tree, ANN]
Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a
competitor.
– Approach:
• Use detailed records of transactions with each past and present
customer to find attributes.
• Examples: how often the customer calls, where they call, what time of day
they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
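The approach above can be sketched as a small classifier. This is only an illustration, assuming two made-up attributes already scaled to the range 0..1 (call frequency and length of relationship) and using logistic regression trained by batch gradient descent; the slides do not name a specific model, and the data, attribute choice, and function names here are hypothetical.

```python
import math

# Hypothetical labeled customers: (call_frequency, tenure), both scaled to 0..1.
# Label 1 = loyal, 0 = lost to a competitor. The values are invented.
rows = [(0.1, 0.0), (0.0, 0.1), (0.25, 0.05),
        (0.9, 0.8), (1.0, 1.0), (0.75, 0.65)]
labels = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(model, features):
    """Linear score: bias plus weighted sum of the attributes."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, features))

def train_logistic(rows, labels, lr=0.5, steps=5000):
    """Fit logistic regression by batch gradient descent on the mean log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(score((weights, bias), x)) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_loyal(model, features):
    """True if the model estimates a loyalty probability of at least 0.5."""
    return sigmoid(score(model, features)) >= 0.5

model = train_logistic(rows, labels)
```

Calling `predict_loyal(model, (0.2, 0.1))` then gives the model's guess for a new customer; in practice the attributes would come from the transaction records rather than from a hand-written list.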
Euclidean Distance Based Clustering in 3-D space
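A minimal sketch of clustering by Euclidean distance in 3-D is given here, assuming k-means as the clustering method (the slide names only the distance measure). The points and the starting centroids are invented for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 3-D points forming two visibly separate groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2),
          (8, 8, 9), (9, 8, 8), (8, 9, 9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Starting centroids are passed in explicitly so the run is repeatable; real implementations usually pick them at random or with a seeding heuristic.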
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Predict the value of a continuous-valued variable based on the values of other
variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and in the neural network field.
Examples:
• Predicting sales amounts of new product based on advertising expenditure.
• Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
• Time series prediction of stock market indices.
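As a small illustration of the first example, the sketch below fits a straight line by ordinary least squares. The advertising and sales figures are invented and chosen to lie exactly on a line, and the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept  # forecast for a new expenditure level
```

With real data the points would not fall exactly on a line, and the fitted slope would then describe the average change in sales per unit of advertising.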
• Team and its processes and tools are organized around the goal
of publishing releases to the users every few weeks
Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DevOps
• merging of development and IT/Operations
Source: The DataOps Cookbook by Christopher Bergh, Gil Benghiat, and Eran Strod
DATAOPS
[Figure: MLOps shown alongside Machine Learning, Data Engineering, and DevOps]
https://ml-ops.org/content/mlops-principles
Data science steps for ML
Steps can be completed manually or by an automated pipeline.
• Data extraction: select and integrate the relevant data from various data sources for
the ML task.
• Data analysis: perform exploratory data analysis.
• Data preparation: prepare data for the ML task; involves data cleaning, data
annotation, data transformation, splitting, and feature engineering.
• Model training: implement different algorithms with the prepared data to train
various ML models.
• Model evaluation: evaluate on a holdout test set to assess the model quality.
• Model validation: confirm whether the model is adequate for deployment, i.e. its
predictive performance is better than a certain baseline.
• Model serving: deploy to a target environment to serve predictions.
• Model monitoring: monitor the model's predictive performance to potentially invoke a
new iteration in the ML process.
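The steps above can be sketched as one small automated pipeline. Everything in it is a toy assumption made for illustration: the two in-memory "sources", the single numeric feature, the threshold model, and the baseline and monitoring limits are hypothetical and do not come from the referenced article.

```python
def extract():
    """Data extraction: integrate rows (feature, label) from two hypothetical sources."""
    source_a = [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1)]
    source_b = [(3.0, 0), (7.0, 1), (9.0, 1), (2.5, 0)]
    return source_a + source_b

def prepare(rows):
    """Data preparation: drop rows with a missing feature, then split train/holdout."""
    clean = [(x, y) for x, y in rows if x is not None]
    train = [row for i, row in enumerate(clean) if i % 2 == 0]
    holdout = [row for i, row in enumerate(clean) if i % 2 == 1]
    return train, holdout

def train_model(train):
    """Model training: threshold halfway between the two class means."""
    class0 = [x for x, y in train if y == 0]
    class1 = [x for x, y in train if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """Model serving: predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0

def evaluate(threshold, holdout):
    """Model evaluation: accuracy on the holdout set."""
    hits = sum(1 for x, y in holdout if predict(threshold, x) == y)
    return hits / len(holdout)

def validate(accuracy, baseline=0.5):
    """Model validation: deploy only if the model beats the chosen baseline."""
    return accuracy > baseline

def needs_retraining(live_accuracy, minimum=0.8):
    """Model monitoring: signal a new iteration when live accuracy drops too low."""
    return live_accuracy < minimum

def run_pipeline():
    train, holdout = prepare(extract())
    threshold = train_model(train)
    accuracy = evaluate(threshold, holdout)
    return {"threshold": threshold, "accuracy": accuracy,
            "deployed": validate(accuracy)}

result = run_pipeline()
```

Keeping each step as its own function is what makes it possible to run the same sequence by hand at first and hand it to a scheduler later.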
https://ml-ops.org/content/mlops-principles
MLOPS
https://builtin.com/machine-learning/mlops
Three phases
• To first automate the pipeline for model training and deployment (level 1)
• To automate the process to change and experiment with this pipeline (level 2)
DATAOPS AND MLOPS
CRISP-DM PHASES
Business Understanding
• Understand project objectives and requirements.
• Data mining problem definition.
Data Understanding
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
Data Preparation
• Record and attribute selection.
• Data cleansing.
Modeling
• Run the data mining tools.
Evaluation
• Determine if results meet business objectives.
• Identify business issues that should have been addressed earlier.
Deployment
• Put the resulting models into practice.
• Set up for continuous mining of the data.
CRISP-DM PHASES AND TASKS
A Case Study of Evaluating Job Readiness with Data
Mining Tools and CRISP-DM Methodology
Objectives
• Use data mining techniques to evaluate the job readiness of the unemployed population in
Ireland
• Decide whether to automate the classification system with regard to job readiness
Step 1 – Business Understanding
• Job readiness is one of the basic characteristics of a customer identified in the process of
registration.
• Existing system:
• evaluated by a case officer
• hundreds of officers making independent judgments
Ref: Wowczko, I. (2015). A case study of evaluating job readiness with data mining tools and CRISP-DM methodology. International Journal for Infonomics, 8(3), 1066-1070.
Step 1 – Business Understanding
a. Business Objectives :
• Job readiness will decide the type of support offered to a client – job
matching or further training opportunities
• The average customer is about 35.5 years old, the youngest is 16 and the oldest is 101
Step 2 – Data Understanding
• Attributes with wrong types, e.g. integer attributes were regarded as nominal, and dates
were stored as integers
Step 3 – Data Preparation
Data transformation:
• missing and noisy values stand for ‘no’, ‘zero’, ‘did not happen’, etc.
• some attributes were not relevant in their raw form but contain information that might be
useful, so they were converted into binomial attributes (Y, N). 7 attributes were cleaned and
transformed with this method (GENERAL_COMMENTS, EMAIL_ADDRESS, MOBILE_NUMBER,
PHONE_NUMBER, WORK_SKILLS, COMPUTER_PACKAGES and SPECIAL_NEEDS_REQS).
• a missing INERVIEW_DATE simply means that a client has not yet been interviewed. Filtering
out those examples resulted in a loss of too much data, so the attribute was instead converted
into a binomial attribute (Y, N).
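The conversion to binomial attributes described above could look roughly like this. It is a sketch under assumptions: the attribute names are the seven listed on the slide, but the rule for what counts as "missing" (None or a blank string), the record layout, and the function name are hypothetical and not taken from the paper.

```python
# The seven attributes named on the slide as converted to Y/N flags.
FLAG_ATTRIBUTES = [
    "GENERAL_COMMENTS", "EMAIL_ADDRESS", "MOBILE_NUMBER", "PHONE_NUMBER",
    "WORK_SKILLS", "COMPUTER_PACKAGES", "SPECIAL_NEEDS_REQS",
]

def to_binomial(record, attributes=FLAG_ATTRIBUTES):
    """Return a copy of the record where each listed attribute becomes 'Y' or 'N'.

    Assumption: a value counts as present when it is not None and not blank.
    """
    converted = dict(record)
    for name in attributes:
        value = record.get(name)
        has_value = value is not None and str(value).strip() != ""
        converted[name] = "Y" if has_value else "N"
    return converted

# Hypothetical client record; AGE is left untouched because it is not a flag attribute.
client = {"EMAIL_ADDRESS": "client@example.com", "MOBILE_NUMBER": "", "AGE": 35}
flags = to_binomial(client)
```

The raw text of a comment or phone number carries little signal for a classifier, while the fact that it was provided at all may; that is the information the Y/N flag keeps.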
Step 4 - Modelling
• The model produced one split, on attribute EXP1 (experience declared by a client with
relation to their main profession, MANCO1)
• The EXP1 attribute was then excluded from the dataset to let the algorithm find a new split
point.
• Another two variables were identified with this method: EXP2 (experience declared by a
client with relation to their secondary profession, MANCO2) and FULL_TIME (willingness
to work full-time)
• There is a very straightforward relationship among the features of the customers: a person
is job ready if they have some work experience and are willing to be employed full-time
• The observed pattern can be easily recognized, and no previously unknown
information was discovered in the process of mining the dataset
• It was concluded that automating the system would not offer any advantage over the
existing classification based on simple heuristics
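The procedure of finding a split, excluding that attribute, and looking for the next split can be sketched as follows. This assumes information gain as the split criterion, which the slides do not state; the eight records, the HAS_EMAIL and JOB_READY column names, and the function names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def successive_splits(rows, attributes, target, rounds):
    """Pick the best split attribute, exclude it, and repeat."""
    remaining = list(attributes)
    found = []
    for _ in range(rounds):
        if not remaining:
            break
        best = max(remaining, key=lambda a: information_gain(rows, a, target))
        found.append(best)
        remaining.remove(best)
    return found

# Hypothetical records: (EXP1, EXP2, FULL_TIME, HAS_EMAIL, JOB_READY).
columns = ["EXP1", "EXP2", "FULL_TIME", "HAS_EMAIL", "JOB_READY"]
raw = ["YYYYY", "YYYNY", "YYYYY", "YYNNY", "NNNYN", "NNYNN", "NYNYN", "NNNNN"]
rows = [dict(zip(columns, record)) for record in raw]

found = successive_splits(rows, columns[:-1], "JOB_READY", rounds=3)
```

On this toy data the attributes come out in the order EXP1, EXP2, FULL_TIME, mirroring the order reported on the slide; that outcome is built into the invented records and says nothing about the real dataset.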
The data mining process must be reliable and repeatable by people with
few data mining skills.
CRISP-DM provides a uniform framework for
• guidelines.
• experience documentation.
CRISP-DM is flexible to account for differences.
• Different business/agency problems.
• Different data.
DATA SCIENCE TEAM BUILDING
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
[2] Functional
[3] Consulting
[4] Centralized
[5] Center of Excellence
[6] Federated
https://data-flair.training/blogs/data-science-use-cases
https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/230532/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
https://atlan.com/what-is-dataops/
THANK YOU