2 Data Science - Managing Data

Managing large scale data involves acquiring data through web crawling, processing and parsing the unstructured data into a structured format, manipulating the data through actions like selecting, updating, inserting, and deleting, and cleaning the data through techniques such as removing outliers, handling missing values, and transforming the data through normalization. Web crawlers are commonly used to index websites and collect data for purposes such as sentiment analysis and stock price forecasting. Tools exist that allow collecting data through visual interfaces without coding.


2. Managing Large Scale Data


Contents
• Types of Data and Data Representations
• Acquire Data (e.g., Crawling)
• Process and Parse Data
• Data Manipulation
• Data Wrangling
• Data Cleaning
Types of Data
• Data is a set of qualitative and quantitative values.
• Quantitative (numerical) variables: discrete and continuous; measured on interval or ratio scales.
• Categorical variables: binary, nominal, ordinal.
• Forms of data: structured, semi-structured, unstructured.
Data Representations

Bar Chart
• A bar chart helps us to represent the collected data visually.
• The collected data, such as amounts and frequencies, can be visualized with
horizontal or vertical bars, and the bars can be single or grouped.
Histogram
• A histogram is a graphical representation of data. It looks similar to a bar
graph, but the two differ: a bar graph measures the frequency of categorical
data, whereas a histogram is used for quantitative data.
• Categorical data is based on two or more categories, like gender, months, etc.
(a short plotting sketch contrasting the two follows below).
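
As a hedged illustration of this distinction (not part of the original slides), the
matplotlib sketch below draws a bar chart for categorical counts and a histogram
for a quantitative variable; the sample data is invented.

import numpy as np
import matplotlib.pyplot as plt

# Categorical data -> bar chart (one bar per category)
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]            # hypothetical counts per category

# Quantitative data -> histogram (values grouped into bins)
ages = np.random.default_rng(0).normal(loc=35, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(months, sales)
ax1.set_title("Bar chart: categorical data")
ax2.hist(ages, bins=20)
ax2.set_title("Histogram: quantitative data")
plt.tight_layout()
plt.show()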
Line Graph
• A graph which uses lines and points to present change over time is known as
a line graph. Line graphs can show, for example, the number of animals left
on earth, the growth of the world's population day by day, or the rise and
fall in the number of bitcoins day by day.
• Line graphs tell us about the changes occurring across the world over time,
and a single line graph can show two or more types of change at once.
Pie Chart
• A pie chart is a type of graph that gives a circular graphical representation
of numerical proportion. In most cases it can be replaced by other plots such
as a bar chart or a dot plot.
• Research has shown that it is difficult to compare the different sections of a
given pie chart, or to compare data across different pie charts.
Scatter Plot
• In science, the scatter plot is widely used to present measurements of two or
more related variables.
• It is particularly useful when the values of the variable on the y-axis are
thought to be dependent upon the values of the variable on the x-axis.
• Example: Car ownership increases as household income increases, showing that
there is a positive relationship between these two variables (a small plotting
sketch follows below).
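
As a hedged sketch of the car ownership / income example above (not part of the
original slides), the snippet below draws a scatter plot with matplotlib; the data
values are invented for illustration.

import matplotlib.pyplot as plt

# Hypothetical data: household income (x-axis) vs. cars owned (y-axis)
income = [20, 35, 50, 65, 80, 95, 110, 130]   # thousands per year
cars = [0, 1, 1, 2, 2, 3, 3, 4]

plt.scatter(income, cars)
plt.xlabel("Household income (thousands)")
plt.ylabel("Cars owned")
plt.title("Positive relationship: car ownership vs. income")
plt.show()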
How Do Data Scientists Collect Data?
• Use Existing Datasets
• Use public datasets: There are numerous datasets on the internet to be used as a
benchmark for general computer science problems
• Purchase datasets: There are various online platforms and marketplaces where you
can buy datasets such as environmental data, political data, customer data, etc.
• Company’s datasets: Companies can easily access their own data stack.

• Create a new dataset:
• Create data manually: Data scientists can manually create online surveys to gather
results. Or, they can use old surveys and their results or pay employees to perform
manual tasks of data classification and data labeling.
• Convert existing data into a dataset: Another great way to gather data from the
internet is by crawling websites and downloading public data. This can be done via
dedicated web crawlers or manually through RPA bots that are programmed for web
crawling.
Web Crawler

• Components: crawler, indexer, and page-ranking algorithm.


Acquire Data: Web Crawling
• Web crawling is the technique used to collect a huge amount of data
from different websites and learn what every webpage on the
website is all about. The collected data can help you to retrieve
specific information that you need.
• A web crawler is typically operated by search engines such as Google,
Bing, and Yahoo. The goal is to index the content of different websites
all over the internet so that they can appear on the search engine
result whenever a person tries to find something on the web.
• Once content is indexed, the search engine can receive a search query, apply a
search algorithm over the index, and return relevant information in response to
that query.
Web Crawling
• Ever wondered how a giant search engine like Google collects data to
display in the search engine results pages? Does it use a web crawler
to retrieve data faster?
• A Web crawler, also known as a web robot, a web spider or a spider
bot, is an automated script or program that logically browses the
internet. This automated process of indexing data on web pages is
known as web crawling or spidering.
• Search engines such as Bing and Google use web crawlers to provide up-to-date
information in SERPs (search engine results pages). A minimal crawler sketch
follows below.
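
As a hedged sketch (not part of the original slides), the Python snippet below shows
the core loop of a simple crawler using the requests and BeautifulSoup libraries:
fetch a page, record its text, extract its links, and queue unvisited ones. The start
URL and depth limit are illustrative assumptions.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch pages, collect their text, follow links."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)  # the text an indexer would store
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Example (hypothetical start URL):
# pages = crawl("https://example.com", max_pages=5)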
Web Crawler- Use Cases
• Many companies rely on a web crawler to collect data about their
customers, products, and services on the web.
• Data science project starts by formulating the business problem to
solve and then followed by the second stage of collecting the right
data to solve that problem.
• In this stage, you can use web crawlers to collect the data on the
internet that you need for your data science project.
Use Cases
1. Collect Social Media Data for Sentiment Analysis
• Many companies use web crawling to collect posts and comments on various
social media platforms such as Facebook, X and Instagram. Companies use the
collected data to assess how their brand is performing and discover how their
products or services are reviewed by their customers; each review can be
positive, negative or neutral (a minimal scoring sketch is shown after this list).
2. Collect Financial Data for Stock Price Forecasting
• The stock market is full of uncertainty, therefore stock price forecasting is very
important in business. Web crawling is used to collect stock price data from
different platforms over different periods (for example, 54 weeks or 24 months).
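
As a hedged illustration of the sentiment-analysis use case (not part of the original
slides), the snippet below labels crawled comments as positive, negative or neutral
using NLTK's VADER analyzer; the example comments are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER
sia = SentimentIntensityAnalyzer()

comments = [                      # hypothetical crawled comments
    "Love this product, works great!",
    "Terrible support, never buying again.",
    "It arrived on Tuesday.",
]

for text in comments:
    score = sia.polarity_scores(text)["compound"]  # -1 (negative) .. +1 (positive)
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:8s} {score:+.2f}  {text}")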
No-Code Web Crawling Tools
• Octoparse is a visual software tool that you can use to extract different
types of data from the web without writing codes. It also has various
features that make it easier to collect data within a short period.
• Parsehub is another easy-to-learn visual web crawling tool that is simple,
friendly to use, and powerful and flexible for extracting data from the web. It
offers an easy-to-use interface to set up your run and automatically extract
millions of data points from any website in minutes.
• Webscraper is a web crawling tool that does not require you to write code
and it runs within the browser as an extension. You can use this tool to
collect data from the web on an hourly, daily, or weekly basis. It can also
automatically export data to Dropbox, Google Sheets, or Amazon S3.
Process and Parse Data
• An important aspect of parsing is to capture information from data in a way that
fits it into contextual structures.

• Data parsing is used to extract information from large datasets and structure it in a
way humans can understand. Traditional data parsing is done on HTML files, where the
parser converts HTML text into readable data. A data parsing program converts
unstructured data into JSON, CSV, and other file formats, adding structure to the
information.

• However, not all parsers work the same and there are distinct differences in parsing
technologies.

• There are numerous benefits of data parsing for businesses, ranging from automated
data extraction and improved visibility to cutting costs and boosting employee
productivity. A small parsing sketch follows below.
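
As a hedged sketch of HTML parsing (not part of the original slides), the snippet below
uses BeautifulSoup to convert a small, made-up HTML product listing into structured
records and write them out as JSON and CSV.

import csv
import json

from bs4 import BeautifulSoup

html = """<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">999</span></li>
  <li class="product"><span class="name">Phone</span><span class="price">499</span></li>
</ul>"""                                              # hypothetical crawled HTML

soup = BeautifulSoup(html, "html.parser")
records = [
    {"name": li.find(class_="name").get_text(),
     "price": float(li.find(class_="price").get_text())}
    for li in soup.find_all("li", class_="product")
]

with open("products.json", "w") as f:                 # unstructured HTML -> JSON
    json.dump(records, f, indent=2)

with open("products.csv", "w", newline="") as f:      # ... and -> CSV
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)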
Data Manipulation
• Data manipulation refers to the process of adjusting data to make it organised and
easier to read.

• Data manipulation language, or DML, is a programming language that adjusts data by
inserting, deleting and modifying data in a database, such as to cleanse or map the
data. SQL, or Structured Query Language, is a language that communicates with
databases. When using SQL data change statements for data manipulation, four
functions can occur, namely:
• Select
• Update
• Insert
• Delete
Data Manipulation
• These commands tell the database where to select data from and what to do with it.
• Here's how it works (a minimal runnable sketch follows after this list):
• SELECT: The select statement allows users to pull a selection from the database to work
with. You tell the computer what to SELECT and FROM where.
• UPDATE: To change data that already exists, you will use the UPDATE statement. You can
tell the database to update certain sets of information and the new information that
should be input, either with single records or multiple records at a time.
• INSERT: You can add new records to a table, for example to move data from one
location into another, by using the INSERT statement.
• DELETE: To get rid of existing records within a table, you use the DELETE statement. You
tell the system where to delete from and which records to get rid of.
• Since SQL does not allow you to import or export data from outside sources, some
providers can store data and give you the tools to manipulate data for your business
needs.
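
As a hedged, minimal sketch (not part of the original slides), the snippet below runs
each of the four SQL data change statements against an in-memory SQLite database; the
table and values are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add new records to the table
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi"), ("Meera", "Delhi")],
)

# UPDATE: change data that already exists
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Ravi"))

# DELETE: remove existing records
cur.execute("DELETE FROM customers WHERE name = ?", ("Meera",))

# SELECT: pull a selection to work with
for row in cur.execute("SELECT id, name, city FROM customers"):
    print(row)

conn.commit()
conn.close()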
Standard Deviation
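• The slides in this span appear to contain the formulas and worked examples as
images. As a hedged reconstruction, the standard deviation measures how spread out
values are around the mean:

  sample:      $s = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
  population:  $\sigma = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$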
Data Transformation
• Normalization Methods
• A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified
with one of the new values.
• Normalization is a technique often applied as part of data preparation
for data science. The goal of normalization is to change the values of
numeric columns in the dataset to a common scale, without distorting
differences in the ranges of values. Not every dataset requires
normalization; it is needed only when features have different ranges.
Normalization
• For example, consider a data set containing two features, age (x1) and
income (x2), where age ranges from 0–100 while income ranges from roughly
20,000–500,000. Income values are therefore about 1,000 times larger than
age values, so the two features are on very different scales.
• When we do further analysis, the attribute income will intrinsically
influence the result more because of its larger values, but this doesn't
necessarily mean it is more important as a predictor.
Normalization Methods
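The slides in this span appear to show the normalization formulas as images. As a
hedged sketch of two common methods (min-max scaling and z-score standardization,
not confirmed by the original text), the snippet below applies both to the age/income
example above, with invented sample values.

import numpy as np

# Hypothetical sample of the two features from the example above
age = np.array([20, 35, 50, 65, 80], dtype=float)
income = np.array([25_000, 60_000, 120_000, 300_000, 500_000], dtype=float)

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score standardization: mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

print(min_max(age))      # age and income now share the 0..1 scale
print(min_max(income))
print(z_score(income))   # income centered at 0 with unit spread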
