Big Data Cat 1

1) What is Big Data?

Big data primarily refers to data sets that are too large or complex to be dealt with by
traditional data-processing application software. Data with many entries (rows) offer
greater statistical power, while data with higher complexity (more attributes or columns)
may lead to a higher false discovery rate.
2) What is Structured Data?
Structured data is data that is organized and designed in a specific way to make it easily
readable and understandable by both humans and machines. This is typically achieved
through the use of a well-defined schema or data model, which provides a structure for the data.
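To make the idea of a schema concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are illustrative, not taken from any particular system):

import sqlite3

# An in-memory database with a well-defined schema: every row must
# supply values that match the declared columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        email  TEXT NOT NULL,
        joined DATE
    )
""")
conn.execute(
    "INSERT INTO customers (name, email, joined) VALUES (?, ?, ?)",
    ("Ada Lovelace", "ada@example.com", "2024-01-15"),
)

# Because the structure is known in advance, querying is straightforward.
for row in conn.execute("SELECT name, email FROM customers"):
    print(row)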
3) Difference between Descriptive Analytics and Predictive Analytics?

Descriptive analytics focuses on understanding past events and provides insights into
what has happened.
Predictive analytics aims to forecast future outcomes and understand what could
happen.
4) What is Big Data Visualization?
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way
to see and understand trends, outliers, and patterns in data.
This practice is crucial in the data science process, as it helps to make data more
understandable and actionable for a wide range of users, from business professionals to
data scientists.
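As a minimal sketch of this idea (assuming matplotlib is installed; the monthly figures below are invented for illustration), even a simple line chart makes a trend and an outlier visible at a glance:

import matplotlib.pyplot as plt

# Illustrative monthly sales figures; the dip in May stands out as an outlier.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 160, 90, 210]

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.show()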
5) List the 5 V's of Big Data?
Volume, Velocity, Variety, Veracity, and Value
6) Give any example of using Big Data in today's life?

1. Online shopping – e-commerce sites analyze our browsing and purchase history to
recommend products.

2. Banking – banks use big data to detect fraud and keep our financial information safe.
10 marks
7) List out the types and characteristics of Big Data?

Big data refers to large collections of data that are so complex and expansive that they
cannot be interpreted by humans or by traditional data management systems. When
properly analyzed using modern tools, these huge volumes of data give businesses the
information they need to make informed decisions.

New software developments have recently made it possible to use and track big data sets.
Much of this user information would seem meaningless and unconnected to the human eye.
However, big data analytic tools can track the relationships between hundreds of types and
sources of data to produce useful business intelligence.
CHARACTERISTICS:
The characteristics of big data include several key attributes, commonly known as the “Vs.”
These characteristics are important for understanding the nature of Big Data. These
characteristics of big data are –
Volume
As the name itself suggests, big data involves large amounts of information. Terabytes,
petabytes, and even larger amounts of data are possible. It needs specialized processing
and storage infrastructure to handle such massive quantities.
Example – Google processes over 3.5 billion searches per day, which adds up to an
estimated 1.28 trillion searches per year.
Velocity
Other characteristics of big data include Velocity. This refers to the speed at which data is
generated, processed, and made available for analysis. With real-time data sources like social
media, sensors, and IoT devices, data is often produced at high speeds, requiring quick
processing capabilities.
Data flows in continuously and in large quantities. Velocity defines the data's potential:
how quickly it can be created and processed to satisfy demand.
Example – Facebook's user base has grown by approximately 22% year over year. As of the
latest available data, Facebook had around 2.8 billion monthly active users, reflecting the
rapid pace at which new data is generated.
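To make velocity concrete, here is a minimal Python sketch (the event feed is simulated) of processing records as they arrive, keeping only a running aggregate in memory rather than storing the whole stream first:

import random

def event_stream(n):
    # Simulate a high-velocity feed of sensor readings.
    for _ in range(n):
        yield {"sensor": random.randint(1, 5), "value": random.random()}

# Process each event as it arrives; only the running totals are kept.
count, total = 0, 0.0
for event in event_stream(1_000):
    count += 1
    total += event["value"]

print(f"processed {count} events, mean value {total / count:.3f}")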
Variety
One of the main characteristics of big data is Variety. It includes different types of data,
including structured data (e.g., databases), semi-structured data (e.g., XML, JSON), and
unstructured data (e.g., text, images, videos). Managing and analyzing this variety of data
requires flexible and adaptable processing methods.
Example – YouTube has over 500 hours of video content uploaded every minute. This immense
variety includes videos in different formats, resolutions, and content types.
Veracity
Veracity focuses on the correctness and dependability of the data. Big Data sources may contain
inconsistencies, errors, or noise, making it crucial to ensure the quality of the information for
meaningful analysis.
Example – Google’s search algorithms are designed to filter through and prioritize accurate
information from the vast volume of web pages indexed.
Value
Value is one of the important characteristics of big data. It focuses on the goal of extracting
meaningful insights and value from the data. The primary purpose of dealing with big data is
to get information that can lead to improved decision-making and strategic advantages.
Example – Facebook’s advertising revenue amounted to approximately $84.2 billion in the
most recent fiscal year. The value derived from targeted advertising based on user data
contributes significantly to the company’s revenue.

Variability
Variability refers to the dynamic nature of data flow. Big Data sources show changes in
volume, velocity, and variety over time, requiring flexible processing methods.
Example – Twitter experiences variability in data flow, especially during major events. The
platform sees a rise in tweets and user interactions during such events, requiring adaptable
processing methods to handle the fluctuating data volume.
Visibility
Being one of the characteristics of big data, visibility refers to how accessible and
observable data is: the ability to surface insights drawn from many sources so they are
available when and where they are needed.
Example – Google Maps uses Big Data to provide visibility into real-time traffic conditions.
By analyzing data from smartphones and other sources, Google Maps helps users navigate
efficiently by avoiding busy routes.
Volatility
Volatility captures the temporary nature of certain data. Some data in big data
environments may have a short period of validity or relevance, which requires
organizations to adapt quickly to changes in the data landscape.
Example – Financial markets generate vast amounts of data in real-time. Stock prices, currency
exchange rates, and commodity prices can be highly volatile.
Moving on from the different characteristics of big data, let’s discuss the types of big data.

Types Of Big Data


The structure and organization of data play an important role in shaping how businesses
extract value from this information, and Big Data is classified into different types on that
basis. The different kinds of big data include –
Structured Data
• Structured data is highly organized and formatted information that sticks to a specific
pattern or data model.
• Of the types of big data, structured data is the most noticeable in our everyday life.
• It is typically tabular and fits neatly into relational databases, allowing for easy inquiry
and analysis.
• Rather than keeping all attributes of an entity in a single record, the data is divided
across several tables to improve integrity, and table constraints are applied to enforce
relationships.
• Structured data has a clear schema and predefined data types.
• It is easy to search and access.
• Structured data is commonly stored in traditional SQL databases.
• Example – A monthly budget, with expense categories like rent, utilities, groceries,
and entertainment, laid out in rows and columns.
Unstructured Data
• Unstructured data lacks a predefined data model and doesn't fit into traditional
databases.
• It is flexible in terms of content and format, making it difficult to analyze using
traditional methods.
• Unstructured data has no fixed structure or pattern.
• It has different formats such as text, images, audio, and video.
• Requires advanced analytics for meaningful insights.
• Example – Social media posts, multimedia files, and text documents where information
is not organized in a standardized way.
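As a minimal sketch of why unstructured data needs extra processing (the post text is invented), even a simple word-frequency analysis requires us to impose structure, tokenizing the text, before anything can be counted:

from collections import Counter
import re

# A free-form social media post: no schema, nothing to query directly.
post = "Big data is big. BIG DATA needs big tools, and big ideas!"

# Impose structure first: lowercase the text and split it into word tokens.
tokens = re.findall(r"[a-z]+", post.lower())
print(Counter(tokens).most_common(2))  # [('big', 5), ('data', 2)]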
Semi-Structured Data
• Semi-structured data falls between structured and unstructured data. While it has some
organizational properties, it doesn’t strictly stick to the pattern of relational databases.
• These types of big data do not have a strict structure controlling how the data is
managed or stored. Unlike a spreadsheet, where the data is neatly arranged into rows and
columns, semi-structured data is not stored in a relational format.
• Semi-structured data is sometimes referred to as NoSQL data, as it doesn't require a
structured query language.
• These types of big data can be exchanged across systems with different basic structures
by using a data serialization language.
• It has a hierarchical structure with flexibility.
• May have tags or markers for organization.
• It is suited for NoSQL databases.
• The most common formats for semi-structured data include JSON (JavaScript Object
Notation), XML (eXtensible Markup Language), Zip files, YAML (YAML Ain’t
Markup Language).
• Example – JSON or XML files used in web development, where data has a defined
structure but allows for some variability.
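As a minimal sketch (the records are invented), Python's built-in json module shows how two records in the same semi-structured feed can share a hierarchy yet differ in fields, something a rigid relational schema would not allow:

import json

# Both records are valid JSON, but the second carries an extra
# nested field that the first one lacks.
raw = '''
[
  {"user": "alice", "likes": 10},
  {"user": "bob", "likes": 4, "location": {"city": "Oslo"}}
]
'''

for record in json.loads(raw):
    # .get() tolerates the missing field instead of failing.
    city = record.get("location", {}).get("city", "unknown")
    print(record["user"], record["likes"], city)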
8) Challenges and Limitations of Big Data Analytics?
The data management and architecture field is constantly evolving and has reached an
unprecedented level of sophistication. Globally, more than 2.5 quintillion bytes of data are
created every day, and 90 percent of all the data in the world was generated in the last couple of years
(Forbes). Data is the fuel for machine learning and meaningful insights across industries,
so organizations are getting serious about how they collect, curate, and manage
information.
This article will help you learn more about the vast world of Big Data and its challenges.
And in case you think the challenges of Big Data, and Big Data as a concept, are not a big
deal, here are some facts that may make you reconsider:
• About 300 billion emails get exchanged every day (Campaign Monitor)
• 400 hours of video are uploaded to YouTube every minute (Brandwatch)
• Worldwide retail eCommerce accounts for more than $4 trillion in revenue (Shopify)
• Google receives more than 63,000 search queries every minute (SEO Tribunal)
• By 2025, real-time data will account for more than a quarter of all data (IDC)
What Is Big Data?
To get a handle on the challenges of big data, you need to know what the term "Big Data"
means. When we hear "Big Data," we might wonder how it differs from the more common
"data." The term "data" refers to any unprocessed character or symbol that can be recorded
on media or transmitted via electronic signals by a computer. Raw data, however, is useless
until it is processed somehow.
Before we jump into the challenges of Big Data, let’s start with the five ‘V’s of Big Data.
The Five ‘V’s of Big Data
Big Data is simply a catchall term used to describe data too large and complex to store in
traditional databases. The “five ‘V’s” of Big Data are:
• Volume – The amount of data generated
• Velocity - The speed at which data is generated, collected and analyzed
• Variety - The different types of structured, semi-structured and unstructured data
• Value - The ability to turn data into useful insights
• Veracity - Trustworthiness in terms of quality and accuracy
What Does Facebook Do with Its Big Data?
Facebook collects vast volumes of user data (in the range of petabytes, or 1 million
gigabytes) in the form of comments, likes, interests, friends, and demographics. Facebook
uses this information in a variety of ways:
• To create personalized and relevant news feeds and sponsored ads
• For photo tag suggestions
• Flashbacks of photos and posts with the most engagement
• Safety check-ins during crises or disasters
Next up, let us look at a Big Data case study, understand its nuances, and then look at some
of the challenges of Big Data.
Big Data Case Study
As the number of Internet users grew throughout the last decade, Google was challenged
with how to store so much user data on its traditional servers. With thousands of search
queries raised every second, the retrieval process was consuming hundreds of megabytes
and billions of CPU cycles. Google needed an extensive, distributed, highly fault-tolerant
file system to store and process the queries. In response, Google developed the Google File
System (GFS).
GFS architecture consists of one master and multiple chunk servers or slave machines. The
master machine contains metadata, and the chunk servers/slave machines store data in a
distributed fashion. Whenever a client wants to read data through the API, it contacts the
master, which responds with the metadata. The client then uses this metadata to send
read/write requests to the slave machines, which generate the response.
The files are divided into fixed-size chunks and distributed across the chunk servers or
slave machines. Features of the chunk servers include:
• Each chunk holds 64 MB of data (HDFS, which was inspired by GFS, uses 128 MB blocks
from Hadoop version 2 onwards)
• By default, each chunk is replicated three times across different chunk servers
• If any chunk server crashes, the data remains available on the other chunk servers
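A simplified Python sketch of the chunking and replication scheme described above (chunk size shrunk to bytes for demonstration; the server names and round-robin placement are illustrative, not GFS's actual placement policy):

CHUNK_SIZE = 64   # bytes here for demo; GFS uses 64 MB chunks
REPLICAS = 3      # default replication factor
SERVERS = ["cs0", "cs1", "cs2", "cs3", "cs4"]

def place_chunks(data):
    # Split data into fixed-size chunks and record, for each chunk,
    # which servers hold a copy. This mapping plays the role of the
    # master's metadata.
    metadata = {}
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for idx, _chunk in enumerate(chunks):
        # Replicate each chunk on three distinct servers.
        metadata[idx] = [SERVERS[(idx + r) % len(SERVERS)] for r in range(REPLICAS)]
    return metadata

print(place_chunks(b"x" * 200))  # 4 chunks, each mapped to 3 servers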
Next up let us take a look at the challenges of Big Data, and the probable outcomes too!
Challenges of Big Data
Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially
when the data is in different formats) within legacy systems. Unstructured data cannot be
stored in traditional databases.
Processing
Processing big data refers to reading, transforming, extracting, and formatting useful
information from raw data. Inputting and outputting information in unified formats
continues to present difficulties.
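As a minimal sketch of these read-transform-format steps (field names invented for illustration), using only Python's standard library:

import csv, io, json

# Read: raw input arrives as CSV text.
raw_csv = "id,amount\n1, 19.99 \n2,5.50\n"

records = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    # Transform: cast the fields to proper numeric types.
    records.append({"id": int(row["id"]), "amount": float(row["amount"])})

# Format: emit a unified JSON representation for downstream systems.
print(json.dumps(records))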
Security
Security is a big concern for organizations. Non-encrypted information is at risk of theft or
damage by cyber-criminals. Therefore, data security professionals must balance access to
data against maintaining strict security protocols.
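As a minimal sketch of encryption at rest (this uses the third-party cryptography package, installable with pip install cryptography; real deployments keep the key in a secrets manager, never alongside the data):

from cryptography.fernet import Fernet

# Encrypt sensitive data so that a stolen copy is useless without the key.
key = Fernet.generate_key()
fernet = Fernet(key)

token = fernet.encrypt(b"account=12345; balance=9000")
print(token)                  # ciphertext, safe to store
print(fernet.decrypt(token))  # original bytes, recoverable only with the key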
Finding and Fixing Data Quality Issues
Many of you are probably dealing with challenges related to poor data quality, but solutions
are available. Common approaches to fixing data problems include:
• Correct information in the original database, repairing the source so inaccuracies do
not propagate downstream.
• Use highly accurate methods of determining who someone is across records (see the
sketch below).
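A minimal sketch of that last idea (the names, fields, and normalization rules are illustrative): canonicalizing records before comparing them lets near-duplicates collapse into one identity:

def normalize(record):
    # Canonicalize fields so near-duplicate records compare equal.
    return (record["name"].strip().lower(), record["email"].strip().lower())

records = [
    {"name": "Jane Doe ", "email": "JANE@EXAMPLE.COM"},
    {"name": "jane doe", "email": "jane@example.com"},
]

unique = {normalize(r): r for r in records}
print(len(unique))  # 1 -- both rows describe the same person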
Scaling Big Data Systems
Database sharding, memory caching, moving to the cloud and separating read-only and
write-active databases are all effective scaling methods. While each one of those
approaches is fantastic on its own, combining them will lead you to the next level.
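To illustrate just the sharding idea (the shard count and key scheme are invented for the example), a stable hash can route each key to the same database node every time:

import hashlib

SHARDS = 4

def shard_for(key):
    # Hash the key so the same user always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARDS

for user in ["alice", "bob", "carol"]:
    print(user, "-> shard", shard_for(user))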
9) Explain about Big Data, Traditional Business Intelligence, and Data Warehousing?
Big data refers to extremely large and complex data sets that cannot be easily managed or
analyzed with traditional data processing tools, particularly spreadsheets. Big data includes
structured data, like an inventory database or list of financial transactions; unstructured
data, such as social posts or videos; and mixed data sets, like those used to train large
language models for AI. These data sets might include anything from the works of
Shakespeare to a company’s budget spreadsheets for the last 10 years.
Big data has only gotten bigger as recent technological breakthroughs have significantly
reduced the cost of storage and compute, making it easier and less expensive to store more
data than ever before. With that increased volume, companies can make more accurate and
precise business decisions with their data. But achieving full value from big data isn't only
about analyzing it, which is a whole benefit in itself. It's an entire discovery process that
requires insightful analysts, business users, and executives who ask the right questions,
recognize patterns, make informed assumptions, and predict behavior.
What are the Five “Vs” of Big Data?
Traditionally, we’ve recognized big data by three characteristics: variety, volume, and
velocity, also known as the “three Vs.” However, two additional Vs have emerged over the
past few years: value and veracity.
Those additions make sense because today, data has become capital. Think of some of the
world’s biggest tech companies. Many of the products they offer are based on their data,
which they’re constantly analyzing to produce more efficiency and develop new initiatives.
Success depends on all five Vs.
• Volume. The amount of data matters. With big data, you’ll have to process high
volumes of low-density, unstructured data. This can be data of unknown value, such as
X (formerly Twitter) data feeds, clickstreams on a web page or a mobile app, or sensor-
enabled equipment. For some organizations, this might be tens of terabytes of data. For
others, it may be hundreds of petabytes.
• Velocity. Velocity is the fast rate at which data is received and (perhaps) acted on.
Normally, the highest velocity of data streams directly into memory versus being
written to disk. Some internet-enabled smart products operate in real time or near real
time and will require real-time evaluation and action.
• Variety. Variety refers to the many types of data that are available. Traditional data
types were structured and fit neatly in a relational database. With the rise of big data,
data comes in new unstructured data types. Unstructured and semistructured data types,
such as text, audio, and video, require additional preprocessing to derive meaning and
support metadata.
• Veracity. How truthful is your data—and how much can you rely on it? The idea of
veracity in data is tied to other functional concepts, such as data quality and data
integrity. Ultimately, these all overlap and steward the organization to a data repository
that delivers high-quality, accurate, and reliable data to power insights and decisions.
• Value. Data has intrinsic value in business. But it’s of no use until that value is
discovered. Because big data assembles both breadth and depth of insights, somewhere
within all of that information lies insights that can benefit your organization. This value
can be internal, such as operational processes that might be optimized, or external, such
as customer profile suggestions that can maximize engagement.
Business intelligence (BI) is a technology-driven process that analyzes business data to
provide actionable information, helping executives and managers make better-informed
business decisions.

Business intelligence is a broad term that encompasses data mining, process analysis,
performance benchmarking, and descriptive analytics. BI parses all the data generated by a
business and presents easy-to-digest reports, performance measures, and trends that inform
management decisions.

Understanding Business Intelligence (BI)


The need for BI was derived from the concept that managers with inaccurate or incomplete
information will tend, on average, to make worse decisions than if they had better information.
Creators of financial models recognize this as “garbage in, garbage out.”
BI attempts to solve this problem by analyzing current data that is ideally presented on a
dashboard of quick metrics designed to support better decisions.
Types of BI Tools and Software
BI tools and software come in a wide variety of forms. Let's take a quick look at some common
types of BI solutions.
• Spreadsheets: Spreadsheets like Microsoft Excel and Google Sheets are some of the
most widely used BI tools.
• Reporting software: Reporting software is used to report, organize, filter, and display
data.
• Data visualization software: Data visualization software translates datasets into easy-
to-read, visually appealing graphical representations to quickly gain insights.
• Data mining tools: Data mining tools "mine" large amounts of data for patterns using
things like artificial intelligence, machine learning, and statistics.
• Online analytical processing (OLAP): OLAP tools allow users to analyze datasets
from a wide variety of angles based on different business perspectives.
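As a minimal sketch of the OLAP idea (assuming pandas is installed; the sales data is invented), pivoting flat records into a region-by-quarter view is the kind of multidimensional slice OLAP tools expose interactively:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Rotate the flat records into a region-by-quarter matrix.
print(sales.pivot_table(values="revenue", index="region", columns="quarter"))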
Data warehouse:

A data warehouse, also called an enterprise data warehouse (EDW), is an enterprise data
platform used for the analysis and reporting of structured and semi-structured data from
multiple data sources, such as point-of-sale transactions, marketing automation, customer
relationship management, and more.

Data warehouses include an analytical database and critical analytical components and
procedures. They support ad hoc analysis and custom reporting through data pipelines, queries,
and business applications. They can consolidate and integrate massive amounts of current and
historical data in one place and are designed to give a long-range view of data over time. These
data warehouse capabilities have made data warehousing a primary staple of enterprise
analytics that help support informed business decisions.
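As a minimal sketch of the ad hoc analysis a warehouse supports (a toy star schema in Python's built-in sqlite3; the table and column names are illustrative), a fact table of transactions joins to a dimension table for reporting:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, amount REAL);
    INSERT INTO dim_store VALUES (1, 'Oslo'), (2, 'Bergen');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# An ad hoc analytical query: aggregate sales history by city.
query = """
    SELECT s.city, SUM(f.amount) AS total
    FROM fact_sales AS f JOIN dim_store AS s USING (store_id)
    GROUP BY s.city
"""
for row in conn.execute(query):
    print(row)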

Traditional data warehouses are hosted on-premises, with data flowing in from relational
databases, transactional systems, business applications, and other source systems.
However, they are typically designed to capture a subset of data in batches and store it
based on rigid schemas, making them unsuitable for spontaneous queries or real-time
analysis. Companies also must purchase their own hardware and software with an on-
premises data warehouse, making it expensive to scale and maintain. In a traditional
warehouse, storage is typically limited compared to compute, so data is transformed
quickly and then discarded to keep storage space free.

Today, data analytics has moved to the center of all core business activities, including
revenue generation, cost containment, improving operations, and
enhancing customer experiences. As data evolves and diversifies, organizations need more
robust data warehouse solutions and advanced analytic tools for storing, managing, and
analyzing large quantities of data across their organizations.

These systems must be scalable, reliable, secure enough for regulated industries, and
flexible enough to support a wide variety of data types and big data use cases. They also
need to support flexible pricing and compute, so you only pay for what you need instead of
guessing your capacity. The requirements go beyond the capabilities of most legacy data
warehouses. As a result, many enterprises are turning to cloud-based data warehouse
solutions.
A cloud data warehouse makes no trade-offs relative to a traditional data warehouse; it
extends those capabilities and runs on a fully managed service in the cloud. Cloud data warehousing
offers instant scalability to meet changing business requirements and powerful data
processing to support complex analytical queries.
With a cloud data warehouse, you benefit from the inherent flexibility of a cloud
environment with more predictable costs. The up-front investment is typically much lower,
and lead times are shorter, than with on-premises data warehouse solutions, because the
cloud service provider manages and maintains the physical infrastructure.
