Big Data
Big Data refers to the practice of storing, processing, and analyzing data that is too large and complex for traditional RDBMS systems to handle. Several big data technologies, such as Hadoop, make this possible. Big Data emerged only a few decades ago: the growth of digital technologies and the ability to store and process the data they produce marked the start of the big data era.
Industries across sectors realized the benefits of analyzing their data and arrived at several applications. One of the most profitable is personalized marketing. Companies like Amazon and Netflix created recommendation engines personalized to their consumers. Amazon also applies sentiment analysis (NLP) to its customer reviews to provide a better customer experience based on whether the reviews of a product are positive or negative. Every click a user makes in a browser is recorded, and users are served personalized ads that might suit their interests.
MNCs also analyze the collective behavior of their customers in order to make important business decisions. Big data technologies have made a large impact in the medical field. With better symptom detection techniques, many deadly medical conditions can be treated early. Big data also plays a huge part in sequencing billions of genomes, which helps in identifying genetic disorders and mutations.
Diving deeper into the forms of big data: data can be structured (e.g., relational tables), semi-structured (e.g., JSON, XML), or unstructured (e.g., text, images).
Though Big Data is beneficial, not every organization needs to implement big data technologies and process its data differently from its traditional processing methods. The need for big data arises only when the problem meets criteria such as the volume, velocity, or variety of the data exceeding what traditional methods can handle.
If a firm faces one of the above complications, it can use big data to address its problems and get useful insights. Getting value or useful insights out of big data is data science. The professionals who possess skills such as data engineering, the scientific method, math, statistics, advanced computing, domain expertise, a hacker mindset, and visualization are called data scientists, and they make the data science process possible.
Now let’s learn more about how a company builds its big data strategy.
First, the leadership team should take the initiative and support the process. The reasons for involving data science must be understood and supported at all levels for the implementation to be successful. The leadership team should initiate building data science teams with diverse expertise, and these teams should deliver as a unit. The shift can be made more effective by training existing employees rather than recruiting new ones, since existing employees already have better domain knowledge. Organizations can also open R&D labs that research and communicate findings to be implemented at a larger scale.
Sharing data within the organization should be encouraged by removing barriers to data access and eradicating data silos. Data silos are compartmentalized stores of data within an organization that have no connection with each other. They lead to outdated, unsynchronized, and even invisible data, and they hinder opportunity generation for the business.
The organization as a whole should define big data policies covering privacy, the lifetime of the data, curation and quality, interoperability, and regulation. An analytics-driven culture helps the teams work together and deliver better outcomes.
Once a strategy is built, the data scientists need to ask the right questions in order to arrive at the best insights.
The data science process can be divided into five steps: acquiring the data, exploring it, preprocessing it, analyzing it, and reporting the insights. These steps are iterated until the best outcome is obtained.
After establishing its objectives, the organization's first goal is to identify what kind of data it needs to answer its questions and how to acquire all of that data. Data comes from many places and in many formats, and data scientists should choose the appropriate methods to access it (a small acquisition sketch follows the list below).
Traditional DB ⇒ They can be queried from SQL databases such as MySQL and Oracle
Text files ⇒ They can be read by programs written in scripting languages (JavaScript, Python, R, PHP, Ruby)
Remote data ⇒ Data from the internet can be acquired through web services such as SOAP (XML), REST (JSON, XML), and WebSocket streams
NoSQL ⇒ They can be accessed via APIs and web services (HBase, Cassandra, MongoDB)
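As a rough illustration of these access methods, the sketch below pulls rows from a SQL database, reads a local text file, and calls a REST endpoint. The database file, table, CSV file, and URL names are all made up for illustration, so the snippet is a sketch of the pattern rather than a ready-to-run pipeline.

```python
import csv
import json
import sqlite3
from urllib.request import urlopen

# Traditional DB: query a relational database with SQL
# (sqlite3 is used here; MySQL/Oracle would need their own drivers).
conn = sqlite3.connect("sales.db")                        # hypothetical database file
rows = conn.execute("SELECT product, amount FROM orders").fetchall()
conn.close()

# Text files: read a delimited file with a scripting language.
with open("reviews.csv", newline="") as f:                # hypothetical CSV file
    reviews = list(csv.DictReader(f))

# Remote data: fetch JSON from a REST endpoint.
with urlopen("https://api.example.com/products") as resp:  # hypothetical URL
    products = json.loads(resp.read().decode("utf-8"))

print(len(rows), len(reviews), len(products))
```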
The first step in exploring the data is to understand what is present in it and to recognize trends.
We have to figure out whether there is a consistent direction in which the values are moving, that is, identify the general trends
Then check for errors, that is, identify the outliers that might affect the outcome of the analysis (a small sketch follows this list)
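A minimal sketch of this first pass, assuming the data already sits in a pandas DataFrame with a numeric column named value (a made-up name and made-up numbers): a rolling mean exposes the general trend, and a simple z-score flags potential outliers.

```python
import pandas as pd

# Hypothetical measurements; in practice this comes from the acquired data.
df = pd.DataFrame({"value": [10, 12, 11, 13, 95, 14, 15, 16, 14, 17]})

# General trend: a rolling mean smooths the noise and shows the direction of movement.
df["trend"] = df["value"].rolling(window=3, center=True).mean()

# Outliers: flag values far from the mean (a common rule of thumb uses 2 or 3 standard deviations).
z = (df["value"] - df["value"].mean()) / df["value"].std()
df["outlier"] = z.abs() > 2

print(df)
```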
Once we understand the data, we can describe it by applying statistical methods. We can find how far and wide the data is spread through its range and standard deviation.
Then we can visualize the data with the help of heat maps (where the hotspots are), histograms (distribution of the data, skewness or unusual dispersion), box plots (data distribution), line graphs (how values change over time), or scatter plots (correlation between two variables) to get an even better idea of the data.
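Continuing with the same hypothetical DataFrame, the sketch below describes and visualizes the data with pandas and matplotlib; the column name value and the sample numbers are again assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 95, 14, 15, 16, 14, 17]})

# Describe the data: the range and standard deviation show how far the values spread.
value_range = df["value"].max() - df["value"].min()
print("range:", value_range, "std:", df["value"].std())
print(df["value"].describe())  # count, mean, quartiles, etc.

# Visualize the distribution and the change of values over time.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["value"].plot.hist(ax=axes[0], title="Histogram (distribution/skewness)")
df["value"].plot.box(ax=axes[1], title="Box plot (spread and outliers)")
df["value"].plot.line(ax=axes[2], title="Line graph (change over time)")
plt.tight_layout()
plt.show()
```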
After exploring the data, we have to preprocess it into a suitable format, because real-world data is messy. It can contain inconsistent values, duplicate values, missing values, invalid data, and outliers. To improve the quality of the data, we can remove records with missing values, merge duplicate records, generate a best estimate for invalid values where suitable, and remove outliers if necessary.
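A minimal pandas sketch of these clean-up steps; the column names, the sample records, and the fill strategy (using the median for invalid or missing ages) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical messy records: a missing value, a duplicate, and an invalid age.
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "age":      [34, 29, 29, -1, None],     # -1 is invalid, None is missing
    "spend":    [120.0, 80.0, 80.0, 45.0, 60.0],
})

df = df.drop_duplicates()                          # merge/remove duplicate records
df.loc[df["age"] < 0, "age"] = None                # mark invalid values as missing
df["age"] = df["age"].fillna(df["age"].median())   # best estimate for missing ages
# Alternatively, drop records with missing values instead of estimating them:
# df = df.dropna()

print(df)
```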
The data we acquire directly from the data sources might not suit the analysis format, so we can apply certain techniques to manipulate the data to suit our needs (a small sketch follows this list):
Scaling can be done to bring the values into a specified range. This prevents a few features from dominating the results simply because of their units.
When data has too many dimensions, we can reduce its dimensionality by finding a smaller subset of dimensions that captures most of the variation in the data. Principal Component Analysis (PCA) is commonly used for this.
Through feature selection and feature engineering, we can remove redundant or irrelevant features, combine features, or add new ones.
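The sketch below shows scaling and dimensionality reduction with scikit-learn, assuming a small numeric feature matrix X; the numbers and the choice of two principal components are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 5 samples, 3 features with very different ranges.
X = np.array([
    [1.0, 2000.0, 0.5],
    [2.0, 1800.0, 0.7],
    [3.0, 2200.0, 0.2],
    [4.0, 2100.0, 0.9],
    [5.0, 1900.0, 0.4],
])

# Scaling: bring every feature into the range [0, 1] so no single feature
# dominates the analysis because of its units.
X_scaled = MinMaxScaler().fit_transform(X)

# Dimensionality reduction: keep the 2 principal components that capture
# most of the variation in the data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```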
We then analyze the prepared data with techniques suited to the question at hand (a small association-analysis sketch follows this list), for example:
Association analysis: finding rules that capture the association between items (e.g., market basket analysis)
Graph analysis: using graph structures to find connections between entities
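As a tiny illustration of association analysis, the sketch below computes support and confidence for one candidate rule over a handful of made-up market-basket transactions; a real analysis would use dedicated algorithms such as Apriori.

```python
# Hypothetical market-basket transactions (one set of items per customer).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

# Candidate rule: {bread, milk} -> {butter}
antecedent, consequent = {"bread", "milk"}, {"butter"}

both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)  # how often all the items appear together
confidence = both / ante            # how often the rule holds when the antecedent appears
print(f"support={support:.2f} confidence={confidence:.2f}")
```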
Modeling:
To validate a model, we divide the input data into training and test sets and evaluate the model on the held-out part. Different techniques are evaluated using different methods (a small validation sketch follows below).
In classification and regression models, we compare the predicted values against the actual values to evaluate the model.
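A minimal scikit-learn sketch of this validation step for a classification model; the synthetic data, the decision-tree choice, and the 80/20 split are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Divide the input data into two parts: train on one, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare the predicted values against the actual values of the held-out set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```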
Once the model is evaluated, we can determine the next steps for the analysis. If the results are poor, we might have to tune the model and repeat the process. If the results are good, we can act on the outcome.
Once we are satisfied with the outcome of the analysis, we must decide what to present. Generally, all the findings must be presented. We can add more value to the insights by focusing on the main outcome, the value added, and how the results developed from the beginning to the end of the data science process.
Presenting the insights in an understandable way is just as important. One has to choose the best visualization method for each kind of outcome. There are many visualization tools that can help data scientists present their outcomes, such as Tableau, D3, Google Charts, Leaflet, and Timeline JS.
If the leadership team and the stakeholders are satisfied with the outcome, they can use it for decision-making. If they do not find it favorable, the analysis process might have to be repeated. The process is iterated until useful insights are acquired.