2 Data Science - Managing Data

Managing large scale data involves acquiring data through web crawling, processing and parsing the unstructured data into a structured format, manipulating the data through actions like selecting, updating, inserting, and deleting, and cleaning the data through techniques such as removing outliers, handling missing values, and transforming the data through normalization. Web crawlers are commonly used to index websites and collect data for purposes such as sentiment analysis and stock price forecasting. Tools exist that allow collecting data through visual interfaces without coding.


2. Managing Large Scale Data


Contents
• Types of Data and Data Representations
• Acquire Data (e.g., Crawling)
• Process and Parse Data
• Data Manipulation
• Data Wrangling
• Data Cleaning
Types of Data
• Data is a set of qualitative and quantitative values.
• Quantitative (numerical) variables: discrete and continuous; measured on interval or ratio scales.
• Categorical variables: binary, nominal, ordinal.
• Forms of data: structured, semi-structured, unstructured.
Data Representations

Bar Chart
• A bar chart helps us to represent the collected data visually.
• The collected data, such as amounts and frequencies, can be visualized with
horizontal or vertical bars, and the bars can be single or grouped.
Histogram
• A histogram is a graphical representation of data. It looks similar to a bar
graph, but the two differ: a bar graph measures the frequency of categorical
data, whereas a histogram is used for quantitative data.
• Categorical data is based on two or more categories, like gender, months, etc.
(a short plotting sketch contrasting the two follows below).
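
As a hedged illustration of this distinction (not part of the original slides), the
matplotlib sketch below draws a bar chart for categorical counts and a histogram
for a quantitative variable; the sample data is invented.

import numpy as np
import matplotlib.pyplot as plt

# Categorical data -> bar chart (one bar per category)
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 110]            # hypothetical counts per category

# Quantitative data -> histogram (values grouped into bins)
ages = np.random.default_rng(0).normal(loc=35, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(months, sales)
ax1.set_title("Bar chart: categorical data")
ax2.hist(ages, bins=20)
ax2.set_title("Histogram: quantitative data")
plt.tight_layout()
plt.show()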
Line Graph
• A graph which uses lines and points to present change over time is known as
a line graph. Line graphs can show, for example, the number of animals left
on earth, the growth of the world's population day by day, or the rise and
fall in the number of bitcoins day by day.
• Line graphs tell us about the changes occurring across the world over time,
and a single line graph can show two or more types of change at once.
Pie Chart
• A pie chart is a type of graph that gives a circular graphical representation
of numerical proportion. In most cases it can be replaced by other plots such
as a bar chart or a dot plot.
• Research has shown that it is difficult to compare the different sections of a
given pie chart, or to compare data across different pie charts.
Scatter Plot
• In science, the scatter plot is widely used to present measurements of two or
more related variables.
• It is particularly useful when the values of the variable on the y-axis are
thought to be dependent upon the values of the variable on the x-axis.
• Example: Car ownership increases as household income increases, showing that
there is a positive relationship between these two variables (a small plotting
sketch follows below).
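
As a hedged sketch of the car ownership / income example above (not part of the
original slides), the snippet below draws a scatter plot with matplotlib; the data
values are invented for illustration.

import matplotlib.pyplot as plt

# Hypothetical data: household income (x-axis) vs. cars owned (y-axis)
income = [20, 35, 50, 65, 80, 95, 110, 130]   # thousands per year
cars = [0, 1, 1, 2, 2, 3, 3, 4]

plt.scatter(income, cars)
plt.xlabel("Household income (thousands)")
plt.ylabel("Cars owned")
plt.title("Positive relationship: car ownership vs. income")
plt.show()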
How Do Data Scientists Collect Data?
• Use Existing Datasets
• Use public datasets: There are numerous datasets on the internet to be used as a
benchmark for general computer science problems
• Purchase datasets: There are various online platforms and marketplaces where you
can buy datasets such as environmental data, political data, customer data, etc.
• Company’s datasets: Companies can easily access their own data stack.

• Create a new dataset:
• Create data manually: Data scientists can manually create online surveys to gather
results. Or, they can use old surveys and their results or pay employees to perform
manual tasks of data classification and data labeling.
• Convert existing data into a dataset: Another great way to gather data from the
internet is by crawling websites and downloading public data. This can be done via
dedicated web crawlers or manually through RPA bots that are programmed for web
crawling.
Web Crawler

• Components: crawler, indexer, and page-ranking algorithm.


Acquire Data: Web Crawling
• Web crawling is the technique used to collect a huge amount of data
from different websites and learn what every webpage on the
website is all about. The collected data can help you to retrieve
specific information that you need.
• A web crawler is typically operated by search engines such as Google,
Bing, and Yahoo. The goal is to index the content of different websites
all over the internet so that they can appear on the search engine
result whenever a person tries to find something on the web.
• Once content is indexed, the search engine can receive a search query, apply a
search algorithm over the index, and return relevant information in response to
that query.
Web Crawling
• Ever wondered how a giant search engine like Google collects data to
display in the search engine results pages? Does it use a web crawler
to retrieve data faster?
• A Web crawler, also known as a web robot, a web spider or a spider
bot, is an automated script or program that logically browses the
internet. This automated process of indexing data on web pages is
known as web crawling or spidering.
• Search engines such as Bing and Google use web crawlers to provide up-to-date
information in SERPs (search engine results pages). A minimal crawler sketch
follows below.
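
As a hedged sketch (not part of the original slides), the Python snippet below shows
the core loop of a simple crawler using the requests and BeautifulSoup libraries:
fetch a page, record its text, extract its links, and queue unvisited ones. The start
URL and depth limit are illustrative assumptions.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch pages, collect their text, follow links."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to download
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)  # the text an indexer would store
        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Example (hypothetical start URL):
# pages = crawl("https://example.com", max_pages=5)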
Web Crawler- Use Cases
• Many companies rely on a web crawler to collect data about their
customers, products, and services on the web.
• Data science project starts by formulating the business problem to
solve and then followed by the second stage of collecting the right
data to solve that problem.
• In this stage, you can use web crawlers to collect the data on the
internet that you need for your data science project.
Use Cases
1. Collect Social Media Data for Sentiment Analysis
• Many companies use web crawling to collect posts and comments on various
social media platforms such as Facebook, X and Instagram. Companies use the
collected data to assess how their brand is performing and discover how their
products or services are reviewed by their customers; each review can be
positive, negative or neutral (a minimal scoring sketch is shown after this list).
2. Collect Financial Data for Stock Price Forecasting
• The stock market is full of uncertainty, therefore stock price forecasting is very
important in business. Web crawling is used to collect stock price data from
different platforms over different periods (for example, 54 weeks or 24 months).
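
As a hedged illustration of the sentiment-analysis use case (not part of the original
slides), the snippet below labels crawled comments as positive, negative or neutral
using NLTK's VADER analyzer; the example comments are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER
sia = SentimentIntensityAnalyzer()

comments = [                      # hypothetical crawled comments
    "Love this product, works great!",
    "Terrible support, never buying again.",
    "It arrived on Tuesday.",
]

for text in comments:
    score = sia.polarity_scores(text)["compound"]  # -1 (negative) .. +1 (positive)
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:8s} {score:+.2f}  {text}")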
No-Code Web Crawling Tools
• Octoparse is a visual software tool that you can use to extract different
types of data from the web without writing codes. It also has various
features that make it easier to collect data within a short period.
• Parsehub is another easy-to-learn visual web crawling tool that is simple,
friendly to use, and powerful and flexible for extracting data from the web. It
offers an easy-to-use interface to set up your run and automatically extract
millions of data points from any website in minutes.
• Webscraper is a web crawling tool that does not require you to write code
and it runs within the browser as an extension. You can use this tool to
collect data from the web on an hourly, daily, or weekly basis. It can also
automatically export data to Dropbox, Google Sheets, or Amazon S3.
Process and Parse Data
• An important aspect of parsing is to capture information from data in a way that
fits it into contextual structures.

• Data parsing is used to extract information from large datasets and structure it in a
way humans can understand. Traditional data parsing is done on HTML files, where the
parser converts HTML text into readable data. A data parsing program converts
unstructured data into JSON, CSV, and other file formats, adding structure to the
information.

• However, not all parsers work the same and there are distinct differences in parsing
technologies.

• There are numerous benefits of data parsing for businesses, ranging from automated
data extraction and improved visibility to cutting costs and boosting employee
productivity. A small parsing sketch follows below.
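
As a hedged sketch of HTML parsing (not part of the original slides), the snippet below
uses BeautifulSoup to convert a small, made-up HTML product listing into structured
records and write them out as JSON and CSV.

import csv
import json

from bs4 import BeautifulSoup

html = """<ul>
  <li class="product"><span class="name">Laptop</span><span class="price">999</span></li>
  <li class="product"><span class="name">Phone</span><span class="price">499</span></li>
</ul>"""                                              # hypothetical crawled HTML

soup = BeautifulSoup(html, "html.parser")
records = [
    {"name": li.find(class_="name").get_text(),
     "price": float(li.find(class_="price").get_text())}
    for li in soup.find_all("li", class_="product")
]

with open("products.json", "w") as f:                 # unstructured HTML -> JSON
    json.dump(records, f, indent=2)

with open("products.csv", "w", newline="") as f:      # ... and -> CSV
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)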
Data Manipulation
• Data manipulation refers to the process of adjusting data to make it organised and
easier to read.

• Data manipulation language, or DML, is a programming language that adjusts data by
inserting, deleting and modifying data in a database, such as to cleanse or map the
data. SQL, or Structured Query Language, is a language that communicates with
databases. When using SQL data change statements for data manipulation, four
functions can occur, namely:
• Select
• Update
• Insert
• Delete
Data Manipulation
• These commands tell the database where to select data from and what to do with it.
• Here's how it works (a minimal runnable sketch follows after this list):
• SELECT: The select statement allows users to pull a selection from the database to work
with. You tell the computer what to SELECT and FROM where.
• UPDATE: To change data that already exists, you will use the UPDATE statement. You can
tell the database to update certain sets of information and the new information that
should be input, either with single records or multiple records at a time.
• INSERT: You can add new records to a table, for example to move data from one
location into another, by using the INSERT statement.
• DELETE: To get rid of existing records within a table, you use the DELETE statement. You
tell the system where to delete from and which records to get rid of.
• Since SQL does not allow you to import or export data from outside sources, some
providers can store data and give you the tools to manipulate data for your business
needs.
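
As a hedged, minimal sketch (not part of the original slides), the snippet below runs
each of the four SQL data change statements against an in-memory SQLite database; the
table and values are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add new records to the table
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi"), ("Meera", "Delhi")],
)

# UPDATE: change data that already exists
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Mumbai", "Ravi"))

# DELETE: remove existing records
cur.execute("DELETE FROM customers WHERE name = ?", ("Meera",))

# SELECT: pull a selection to work with
for row in cur.execute("SELECT id, name, city FROM customers"):
    print(row)

conn.commit()
conn.close()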
Standard Deviation
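• The slides in this span appear to contain the formulas and worked examples as
images. As a hedged reconstruction, the standard deviation measures how spread out
values are around the mean:

  sample:      $s = \sqrt{\tfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
  population:  $\sigma = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$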
Data Transformation
• Normalization Methods
• A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified
with one of the new values.
• Normalization is a technique often applied as part of data preparation
for data science. The goal of normalization is to change the values of
numeric columns in the dataset to a common scale, without distorting
differences in the ranges of values. Not every dataset requires
normalization; it is needed only when features have different ranges.
Normalization
• For example, consider a data set containing two features, age (x1) and
income (x2), where age ranges from 0–100 while income ranges from roughly
20,000–500,000. Income values are therefore about 1,000 times larger than
age values, so the two features are on very different scales.
• When we do further analysis, the attribute income will intrinsically
influence the result more because of its larger values, but this doesn't
necessarily mean it is more important as a predictor.
Normalization Methods
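The slides in this span appear to show the normalization formulas as images. As a
hedged sketch of two common methods (min-max scaling and z-score standardization,
not confirmed by the original text), the snippet below applies both to the age/income
example above, with invented sample values.

import numpy as np

# Hypothetical sample of the two features from the example above
age = np.array([20, 35, 50, 65, 80], dtype=float)
income = np.array([25_000, 60_000, 120_000, 300_000, 500_000], dtype=float)

def min_max(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max]."""
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    """Z-score standardization: mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

print(min_max(age))      # age and income now share the 0..1 scale
print(min_max(income))
print(z_score(income))   # income centered at 0 with unit spread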
