Big Data
Big Data refers to the practice of storing, processing, and analyzing data that is too large and complex for traditional RDBMS systems to handle. Several big data technologies, such as Hadoop, make this possible. Big Data emerged only a few decades ago: the growth of digital technologies and the ability to store and process the data they produce marked the start of the big data era.
Industries across sectors realized the benefits of analyzing their data and arrived at several applications. One of the most profitable is personalized marketing. Companies like Amazon and Netflix created recommendation engines personalized to their consumers. Amazon also applies sentiment analysis (NLP) to its customer reviews to provide a better customer experience based on whether the reviews of a product are positive or negative. Every click a user makes in a browser is recorded, and users are served personalized ads that might suit their interests.
MNCs also analyze the collective behavior of their customers in order to make important business decisions. Big data technologies have made a large impact in the medical field. With better symptom detection techniques, many deadly medical conditions can be treated early. Big data also plays a huge part in sequencing billions of genomes, which helps in identifying genetic disorders and mutations.
Diving deeper into the forms of big data: data can be structured (e.g., relational tables), semi-structured (e.g., JSON, XML), or unstructured (e.g., text, images).
Though Big Data is beneficial, not every organization needs to implement big data technologies and process its data differently from its traditional processing methods. The need for big data arises only when the problem meets criteria such as the volume, velocity, or variety of the data exceeding what traditional methods can handle.
If a firm faces one of the above complications, it can use big data to address its problems and get useful insights. Getting value or useful insights out of big data is data science. The professionals who possess skills such as data engineering, the scientific method, math, statistics, advanced computing, domain expertise, a hacker mindset, and visualization are called data scientists, and they make the data science process possible.
Now let’s learn more about how a company builds its big data strategy.
First, the leadership team should take the initiative and support the process. The reasons for involving data science must be understood and supported at all levels for the implementation to be successful. The leadership team should initiate building data science teams with diverse expertise, and these teams should deliver as a unit. The shift can be made more effective by training existing employees rather than recruiting new ones, since existing employees already have better domain knowledge. Organizations can also open R&D labs that research and communicate findings to be implemented at a larger scale.
Sharing data within the organization should be encouraged by removing barriers to data access and eradicating data silos. Data silos are compartmentalized stores of data within an organization that have no connection with each other. They lead to outdated, unsynchronized, and even invisible data, and they hinder opportunity generation for the business.
The organization as a whole should define big data policies covering privacy, the lifetime of the data, curation and quality, interoperability, and regulation. An analytics-driven culture helps the teams work together and deliver better outcomes.
Once a strategy is built, the data scientists need to ask the right questions in order to arrive at the best insights.
The data science process can be divided into five steps: acquiring the data, exploring it, preprocessing it, analyzing it, and reporting the insights. These steps are iterated until the best outcome is obtained.
After establishing its objectives, the organization's first goal is to identify what kind of data it needs to answer its questions and how to acquire all of that data. Data comes from many places and in many formats, and data scientists should choose the appropriate methods to access it (a small acquisition sketch follows the list below).
Traditional DB ⇒ They can be queried from SQL databases such as MySQL and Oracle
Text files ⇒ They can be read by programs written in scripting languages (JavaScript, Python, R, PHP, Ruby)
Remote data ⇒ Data from the internet can be acquired through web services such as SOAP (XML), REST (JSON, XML), and WebSocket streams
NoSQL ⇒ They can be accessed via APIs and web services (HBase, Cassandra, MongoDB)
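As a rough illustration of these access methods, the sketch below pulls rows from a SQL database, reads a local text file, and calls a REST endpoint. The database file, table, CSV file, and URL names are all made up for illustration, so the snippet is a sketch of the pattern rather than a ready-to-run pipeline.

```python
import csv
import json
import sqlite3
from urllib.request import urlopen

# Traditional DB: query a relational database with SQL
# (sqlite3 is used here; MySQL/Oracle would need their own drivers).
conn = sqlite3.connect("sales.db")                        # hypothetical database file
rows = conn.execute("SELECT product, amount FROM orders").fetchall()
conn.close()

# Text files: read a delimited file with a scripting language.
with open("reviews.csv", newline="") as f:                # hypothetical CSV file
    reviews = list(csv.DictReader(f))

# Remote data: fetch JSON from a REST endpoint.
with urlopen("https://api.example.com/products") as resp:  # hypothetical URL
    products = json.loads(resp.read().decode("utf-8"))

print(len(rows), len(reviews), len(products))
```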
The first step in exploring the data is to understand what is present in it and to recognize trends.
We have to figure out whether there is a consistent direction in which the values are moving, that is, identify the general trends
Then check for errors, that is, identify the outliers that might affect the outcome of the analysis (a small sketch follows this list)
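A minimal sketch of this first pass, assuming the data already sits in a pandas DataFrame with a numeric column named value (a made-up name and made-up numbers): a rolling mean exposes the general trend, and a simple z-score flags potential outliers.

```python
import pandas as pd

# Hypothetical measurements; in practice this comes from the acquired data.
df = pd.DataFrame({"value": [10, 12, 11, 13, 95, 14, 15, 16, 14, 17]})

# General trend: a rolling mean smooths the noise and shows the direction of movement.
df["trend"] = df["value"].rolling(window=3, center=True).mean()

# Outliers: flag values far from the mean (a common rule of thumb uses 2 or 3 standard deviations).
z = (df["value"] - df["value"].mean()) / df["value"].std()
df["outlier"] = z.abs() > 2

print(df)
```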
Once we understand the data, we can describe it by applying statistical methods. We can find how far and wide the data is spread through its range and standard deviation.
Then we can visualize the data with the help of heat maps (where the hotspots are), histograms (distribution of the data, skewness or unusual dispersion), box plots (data distribution), line graphs (how values change over time), or scatter plots (correlation between two variables) to get an even better idea of the data.
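Continuing with the same hypothetical DataFrame, the sketch below describes and visualizes the data with pandas and matplotlib; the column name value and the sample numbers are again assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 95, 14, 15, 16, 14, 17]})

# Describe the data: the range and standard deviation show how far the values spread.
value_range = df["value"].max() - df["value"].min()
print("range:", value_range, "std:", df["value"].std())
print(df["value"].describe())  # count, mean, quartiles, etc.

# Visualize the distribution and the change of values over time.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["value"].plot.hist(ax=axes[0], title="Histogram (distribution/skewness)")
df["value"].plot.box(ax=axes[1], title="Box plot (spread and outliers)")
df["value"].plot.line(ax=axes[2], title="Line graph (change over time)")
plt.tight_layout()
plt.show()
```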
After exploring the data, we have to preprocess it into a suitable format, because real-world data is messy. It can contain inconsistent values, duplicate values, missing values, invalid data, and outliers. To improve the quality of the data, we can remove records with missing values, merge duplicate records, generate a best estimate for invalid values where suitable, and remove outliers if necessary.
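A minimal pandas sketch of these clean-up steps; the column names, the sample records, and the fill strategy (using the median for invalid or missing ages) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical messy records: a missing value, a duplicate, and an invalid age.
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "age":      [34, 29, 29, -1, None],     # -1 is invalid, None is missing
    "spend":    [120.0, 80.0, 80.0, 45.0, 60.0],
})

df = df.drop_duplicates()                          # merge/remove duplicate records
df.loc[df["age"] < 0, "age"] = None                # mark invalid values as missing
df["age"] = df["age"].fillna(df["age"].median())   # best estimate for missing ages
# Alternatively, drop records with missing values instead of estimating them:
# df = df.dropna()

print(df)
```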
The data we acquire directly from the data sources might not suit the analysis format, so we can apply certain techniques to manipulate the data to suit our needs (a small sketch follows this list):
Scaling can be done to bring the values into a specified range. This prevents a few features from dominating the results simply because of their units.
When data has too many dimensions, we can reduce its dimensionality by finding a smaller subset of dimensions that captures most of the variation in the data. Principal Component Analysis (PCA) is commonly used for this.
Through feature selection and feature engineering, we can remove redundant or irrelevant features, combine features, or add new ones.
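The sketch below shows scaling and dimensionality reduction with scikit-learn, assuming a small numeric feature matrix X; the numbers and the choice of two principal components are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 5 samples, 3 features with very different ranges.
X = np.array([
    [1.0, 2000.0, 0.5],
    [2.0, 1800.0, 0.7],
    [3.0, 2200.0, 0.2],
    [4.0, 2100.0, 0.9],
    [5.0, 1900.0, 0.4],
])

# Scaling: bring every feature into the range [0, 1] so no single feature
# dominates the analysis because of its units.
X_scaled = MinMaxScaler().fit_transform(X)

# Dimensionality reduction: keep the 2 principal components that capture
# most of the variation in the data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```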
We then analyze the prepared data with techniques suited to the question at hand (a small association-analysis sketch follows this list), for example:
Association analysis: finding rules that capture the association between items (e.g., market basket analysis)
Graph analysis: using graph structures to find connections between entities
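As a tiny illustration of association analysis, the sketch below computes support and confidence for one candidate rule over a handful of made-up market-basket transactions; a real analysis would use dedicated algorithms such as Apriori.

```python
# Hypothetical market-basket transactions (one set of items per customer).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

# Candidate rule: {bread, milk} -> {butter}
antecedent, consequent = {"bread", "milk"}, {"butter"}

both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / len(transactions)  # how often all the items appear together
confidence = both / ante            # how often the rule holds when the antecedent appears
print(f"support={support:.2f} confidence={confidence:.2f}")
```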
Modeling:
To validate a model, we divide the input data into training and test sets and evaluate the model on the held-out part. Different techniques are evaluated using different methods (a small validation sketch follows below).
In classification and regression models, we compare the predicted values against the actual values to evaluate the model.
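A minimal scikit-learn sketch of this validation step for a classification model; the synthetic data, the decision-tree choice, and the 80/20 split are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Divide the input data into two parts: train on one, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare the predicted values against the actual values of the held-out set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
```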
Once the model is evaluated, we can determine the next steps for the analysis. If the results are poor, we might have to tune the model and repeat the process. If the results are good, we can act on the outcome.
Once we are satisfied with the outcome of the analysis, we must decide what to present. Generally, all the findings must be presented. We can add more value to the insights by focusing on the main outcome, the value added, and how the results developed from the beginning to the end of the data science process.
Presenting the insights in an understandable way is just as important. One has to choose the best visualization method for each kind of outcome. There are many visualization tools that can help data scientists present their outcomes, such as Tableau, D3, Google Charts, Leaflet, and Timeline JS.
If the leadership team and the stakeholders are satisfied with the outcome, they can use it for decision-making. If they do not find it favorable, the analysis process might have to be repeated. The process is iterated until useful insights are acquired.