
Big Data - An Introduction

Big Data refers to the practice of storing, processing, and analyzing data that is too large and complex for traditional RDBMS systems to handle. Several big data technologies, such as Hadoop, make this possible. Big Data did not exist until a few decades ago; the growth of digital technologies and the ability to store and process the data they produce marked the start of the big data era.

Industries across different sectors realized the benefits of analyzing data and built several applications around it. One of the profitable applications is personalized marketing. Companies like Amazon and Netflix started creating and using personal recommendation engines for their consumers. Amazon also uses sentiment analysis (NLP) on its customer reviews to provide a better customer experience based on whether reviews of a product are positive or negative. Every click a user makes in a browser is recorded, and users are served personalized ads that might suit their interests.

MNCs also analyze the collective behavior of their customers in order to make important business decisions. Big data technologies have made a large impact in the medical field: with better symptom detection techniques, many deadly medical conditions can be treated early. Big data also plays a huge part in sequencing billions of genomes, which helps in identifying genetic disorders and mutations.

Now, diving deeper into the forms of big data, data can be:

 Structured (like RDBMS tables)

 Semi-structured (like log files)

 Unstructured (like images, social media data)

Though Big Data is beneficial, not every organization needs to implement big data technologies and move away from its traditional processing methods. The need for Big Data arises only when the problem meets one of these criteria:

 Volume (high volumes of data in TB or PB or even more)

 Variety (Structured, Unstructured and Semi-structured)

 Velocity (the speed of data generation, loading, and analysis)

 Veracity (the quality and validity of the data)

 Valence (how data items are connected to each other)

If a firm faces one of the above complications, it can use big data to address its problems and gain useful insights. Extracting value or useful insights from Big Data is Data Science. The professionals who possess skills such as data engineering, the scientific method, math, statistics, advanced computing, domain expertise, a hacker mindset, and visualization are called data scientists, and they make the data science process possible.

Big Data + Analysis Question → Insight

Now let’s learn more about how a company builds its big data strategy.

Building a Big Data Strategy


Normally a strategy involves Aim, Policy, Plan, and Action. Here the strategy starts with defining business objectives or goals, which can be long-term or short-term. Business objectives are the questions we need to ask in order to turn big data into insights.

Then, the leadership team should take the initiative and support the process. The reason for involving data science must be understood and supported at all levels for the implementation to be successful. The leadership team should build data science teams with diverse expertise, and those teams should deliver as a team. Leadership can make the shift more effective by training existing employees rather than recruiting new ones, as existing employees have better domain knowledge. They can also open R&D labs that research and communicate findings to be implemented at a larger scale.

Sharing data within the organization should be encouraged by removing barriers to data access and eradicating data silos. Data silos are compartmentalized stores of data within an organization that have no connection with each other. They lead to outdated, unsynchronized, even invisible data, and they hinder opportunity generation for the business.

The organization as a whole should define big data policies covering privacy, the lifetime of the data, curation and quality, interoperability, and regulation. An analytics-driven culture helps the teams work together and produce better outcomes.

Once a strategy is built, the data scientists need to ask the right questions in order to arrive at the
best insights.

Steps in Data Science Process

The data science process can be divided into five steps, which are iterated until the best outcome is obtained.

STEP 1 : ACQUIRING DATA

The first goal of the organization, after establishing its objectives, is to identify what kind of data it requires to find answers and how to acquire that data. Data comes from many places and in many formats, and data scientists should choose the apt methods to access it; a small sketch follows the list below.

The common data formats are:

 Traditional DB ⇒ They can be queried from SQL DBs such as MySQL, Oracle

 Text files ⇒ They can be obtained by programs created with scripting languages (JS, Python,
R, PHP, Ruby)

 Remote data ⇒ Data from the internet can be acquired via SOAP (XML), REST (HTML,
JSON), or WebSocket (RSS, W3C)

 NoSQL ⇒ NoSQL stores (HBase, Cassandra, MongoDB) can be accessed via APIs and web services
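
Below is a minimal sketch, in Python, of two of these access paths. The database file, table name, and URL are hypothetical placeholders, not real endpoints:

import json
import sqlite3                     # stand-in for MySQL/Oracle client drivers
from urllib.request import urlopen

# Traditional DB: query a relational table with SQL
conn = sqlite3.connect("sales.db")                  # hypothetical database
rows = conn.execute("SELECT id, amount FROM orders").fetchall()
conn.close()

# Remote data: fetch JSON from a REST endpoint
with urlopen("https://api.example.com/reviews") as resp:  # hypothetical URL
    reviews = json.load(resp)

print(len(rows), "orders,", len(reviews), "reviews acquired")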

STEP 2 : EXPLORING AND PREPROCESSING DATA

Data exploration → Data understanding → Informed analysis

The first step of exploring the data is to understand what is present in it and to recognize trends.

To understand the data:

 We have to understand the dependencies between different variables in the data (correlations between the variables)

 We have to figure out if there is a consistent direction in which the values are moving, that is, identify the general trends

 Then we check for errors, that is, identify the outliers that might affect the outcome of the analysis

Once we understand the data, we can describe it by applying statistical methods. We can find:

 where the data is centered, through the mean and median

 how far and wide the data is spread, through the range and standard deviation

 the most frequent values, through the mode

Then we can visualize the data with the help of heat maps (where the hotspots are), histograms (distribution of the data, skewness, unusual dispersion), box plots (data distribution), line graphs (how values change over time), or scatter plots (correlation between two variables) to get an even better picture of the data.
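
A minimal sketch of describing and visualizing a dataset with pandas and matplotlib; the CSV file and column names are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")         # hypothetical input file

# Location and spread: count, mean, std, min/max, and quartiles in one call
print(df["value"].describe())
print("median:", df["value"].median())
print("mode:", df["value"].mode().iloc[0])

# Correlations between the numeric variables
print(df.select_dtypes("number").corr())

# Distribution and outliers at a glance
df["value"].plot.hist(bins=30, title="Distribution of value")
df.plot.scatter(x="temperature", y="value")  # correlation between two variables
plt.show()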

After exploring the data, we have to preprocess it into a suitable format, as real-world data is messy. There can be inconsistent values, duplicate values, missing values, invalid data, and outliers. To improve the quality of the data we process, we can remove records with missing values, merge duplicate records, generate the best estimate for invalid values where suitable, and remove outliers if necessary.
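
A minimal sketch of those cleaning steps with pandas; the file and column names are hypothetical:

import pandas as pd

df = pd.read_csv("raw_records.csv")               # hypothetical messy input

df = df.drop_duplicates()                         # merge duplicate records
df = df.dropna(subset=["customer_id"])            # remove records missing a key field
df["age"] = df["age"].fillna(df["age"].median())  # best estimate for missing values

# Remove outliers: keep amounts within 3 standard deviations of the mean
mean, std = df["amount"].mean(), df["amount"].std()
df = df[(df["amount"] - mean).abs() <= 3 * std]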

The data we acquire directly from the data sources might not suit the analysis format, so we can apply certain techniques to manipulate the data to suit our needs (a small sketch follows this list):

 Scaling can be done to change the range of values to be in a specified range. This is to avoid
a few values dominating the data results.

 Data can be transformed to remove noise and reduce variability

 When data has too many dimensions, we can reduce them. This involves finding a
smaller subset of dimensions that captures most of the variation in the data; Principal
Component Analysis (PCA) is commonly used for this.

 We can remove redundant or irrelevant features, combine features, and add new features
through the process of feature selection.
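
A minimal sketch of scaling and dimensionality reduction with scikit-learn; the feature matrix here is random placeholder data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)          # hypothetical 20-dimensional dataset

# Scaling: map every feature into [0, 1] so no single feature dominates
X_scaled = MinMaxScaler().fit_transform(X)

# PCA: keep the smaller set of components that explains 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)               # fewer columns than the original 20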

STEP 3 : ANALYZING THE DATA

INPUT DATA → ANALYSIS TECHNIQUE → MODEL → MODEL OUTPUT

Categories of analysis techniques (a small clustering sketch follows this list):

 Classification: to predict the category of input data

 Regression: to predict a numerical value

 Clustering: to organize similar items into groups

 Association analysis: to find rules to capture the association between items (ex: Market
basket analysis)
 Graph analysis: to use graph structures to find connections between entities
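
A minimal sketch of one of these categories, clustering, using k-means from scikit-learn; the data points and the choice of k = 3 are hypothetical:

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(200, 2)      # hypothetical two-dimensional items

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])           # the group assigned to each item
print(kmeans.cluster_centers_)       # one centre per group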

Modeling:

Select Technique → Build model → Validate model

To validate the model, we divide the input data into two parts: one to build the model and one to evaluate it. Different techniques are evaluated using different methods.

Evaluating the model

 In classification and regression models, we compare the predicted values against the correct
values to evaluate the model.

 In clustering, the outcomes are compared with the business goals

Once the model is evaluated, we can determine what to do next with the analysis. If the results are poor, we might have to tune our model and repeat the process. If the results are good, we can use the outcome.
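
A minimal sketch of the build, validate, and evaluate loop for a classification model, using scikit-learn and its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Divide the input data into two: one part to build, one part to validate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Compare the predicted values against the correct values
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))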

STEP 4: REPORTING INSIGHTS

Once we are satisfied with the outcome of the analysis, we must decide what to present. Generally, all the findings should be presented. We can add more value to the insights by focusing on the main outcome, the value added, and how the results evolved from the beginning to the end of the data science process.

Presenting the insights in an understandable way is just as important. One has to choose the best visualization method for each kind of outcome data. There are many visualization tools that can help data scientists present their outcomes, such as Tableau, D3, Google Charts, Leaflet, and TimelineJS.

STEP 5: TURNING INSIGHTS INTO ACTION

If the leadership team and the stakeholders are satisfied with the outcome, they can use it for decision-making. If they do not find it favorable, the analysis process might have to be repeated. The process is iterated until useful insights are acquired.
