Big Data Intro

Big data refers to large and complex data sets that challenge traditional database management tools, encompassing various types such as relational, text, and streaming data. The exponential growth of data, driven by mobile devices, social media, and scientific instruments, necessitates advanced analytics to extract valuable insights. Hadoop serves as a scalable solution for storing and processing big data, enabling organizations to manage and analyze vast amounts of information effectively.

Uploaded by

madhuri.bitcse
Big data and challenges

Data

BIG DATA
Maximilien Brice, © CERN
The Earthscope
• The Earthscope is the world's
largest science project. Designed to
track North America's geological
evolution, this observatory records
data over 3.8 million square miles,
amassing 67 terabytes of data. It
analyzes seismic slips in the San
Andreas fault, sure, but also the
plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Big Data: Definition
• Big data is a collection of data sets so large
and complex that it becomes difficult to
process using on-hand database management
tools
• The challenges include capture, storage,
search, sharing, analysis, and visualization.
Big Data: A definition
• Put another way, big data is the realization of
greater business intelligence by storing,
processing, and analyzing data that was
previously ignored due to the limitations of
traditional data management technologies

Source: Harness the Power of Big Data: The IBM Big Data Platform
Type of Data
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …

• Streaming Data
– You can only scan the data once
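The single-scan constraint on streaming data can be illustrated with a running-statistics sketch. This is only an illustration, assuming a stream of numeric readings; the function names and the sample values are hypothetical and do not come from the slides. It uses Welford's online update so that the mean and variance are available after one pass, without storing the stream.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) after a single pass over stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream: a generator can be consumed only once."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


if __name__ == "__main__":
    print(running_mean_variance(sensor_readings()))
```

Because each reading is folded into three numbers and then discarded, memory use stays constant no matter how long the stream runs.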
Who’s Generating Big Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• Instead, they are limited by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Challenges

How to transfer Big Data?


Characteristics of Big Data:
1-Scale (Volume)

• Data Volume
– 44x increase from 2009 to 2020
– From 0.8 zettabytes to 35 zettabytes
• Data volume is increasing exponentially
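To make the word "exponentially" concrete, the two figures above can be turned into an implied yearly growth rate. The sketch below assumes steady compound growth between 2009 and 2020, which is a simplification and not something the slides state; the function name is illustrative.

```python
def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes start to end in the given years."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Figures from the slide: 0.8 ZB in 2009, 35 ZB in 2020.
    rate = annual_growth_rate(0.8, 35.0, 2020 - 2009)
    print(f"Implied growth per year: {rate:.1%}")
```

Under that assumption the data volume grows by roughly 41% every year, which means it doubles about every two years.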

Exponential increase in
collected/generated data

• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• A Boeing 737 will generate 240 terabytes of flight data during a
single flight across the US.
• Smartphones, the data they create and consume, and sensors
embedded into everyday objects will soon result in billions of new,
constantly updated data feeds containing environmental, location,
and other information, including video.
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types of
data

To extract knowledge, all these types of data need to be linked together

• Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
• Traditional database systems were designed to address
smaller volumes of structured data, fewer updates or a
predictable, consistent data structure.
• Big Data analysis includes different types of data
Characteristics of Big Data:
3-Speed (Velocity)

• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase history, and what
you like → send promotions right now for the store next to you
– Healthcare monitoring: sensors monitoring your activities and body → any
abnormal measurements require immediate reaction
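The healthcare example can be sketched as a check that runs on each reading as it arrives, instead of on a batch collected later. This is a minimal sketch under stated assumptions: the heart-rate thresholds, the (timestamp, value) reading format, and the function name are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical limits for illustration only, not clinical guidance.
LOW_BPM = 40
HIGH_BPM = 140

Reading = Tuple[int, int]  # (timestamp in seconds, heart rate in beats per minute)


def abnormal_readings(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Yield each out-of-range reading as soon as it is seen."""
    for timestamp, bpm in readings:
        if bpm < LOW_BPM or bpm > HIGH_BPM:
            yield timestamp, bpm


if __name__ == "__main__":
    stream = [(0, 72), (1, 75), (2, 180), (3, 35), (4, 80)]
    for timestamp, bpm in abnormal_readings(stream):
        print(f"t={timestamp}s: abnormal heart rate {bpm} bpm")
```

Because the function is a generator, an alert can be raised the moment an abnormal value appears, without waiting for the rest of the stream.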

• Click streams and ad impressions capture user behavior at
millions of events per second
• High-frequency stock trading algorithms reflect market
changes within microseconds
• Machine-to-machine processes exchange data between
billions of devices
• Infrastructure and sensors generate massive log data in real
time
• Online gaming systems support millions of concurrent users,
each producing multiple inputs per second.
Big Data is a Hot Topic Because Technology Makes
it Possible to Analyze ALL Available Data
Cost effectively manage and analyze
all available data in its native form
unstructured, structured, streaming
Why Big Data and BI

Source: Business Intelligence Strategy: A Framework for Achieving BI Excellence
4 types of Big Data BI
• Prescriptive: Analysis of actions to be taken.
• Predictive: Analysis of scenarios that may
happen.
• Diagnostic: Look at past performance.
• Descriptive: Real time dashboard.
So, in a nutshell
• Big Data is about better analytics!
Problems Associated with reading and
writing data from multiple disks
• Hardware failure
• Combining data from different disks

Hadoop is the solution
– Reliable, scalable platform for storage and analysis
– Open source
An OS for Networks
• Towards an Operating System for Networks

[Figure: Software-Defined Networking (SDN). Labels: Control Programs, Global Network View, Network Operating System, Control via forwarding interface, Protocols]
Source: Business Intelligence Strategy: A Framework for Achieving BI
Excellence
Big Data Conundrum
• Problems:
– Although there is a massive spike in available data,
the percentage of the data that an enterprise can
understand is on the decline
– The data that the enterprise is trying to
understand is saturated with both useful signals
and lots of noise

Source: IBM http://www-01.ibm.com/software/data/bigdata/


The Big Data platform Manifesto
imperatives and underlying technologies
HADOOP

Hadoop File System


Manage & store huge volume of any data
MapReduce
Hadoop
• Hadoop is a distributed file system and data
processing engine that is designed to handle
extremely high volumes of data in any structure.
• Hadoop has two components:
– The Hadoop distributed file system (HDFS), which
supports data in structured relational form, in
unstructured form, and in any form in between
– The MapReduce programming paradigm for managing
applications on multiple distributed servers
• The focus is on supporting redundancy,
distributed architectures, and parallel processing
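The MapReduce paradigm mentioned above can be sketched in plain Python with the classic word-count example. This is not the Hadoop API: it runs in a single process, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. It only shows the shape of the three steps that Hadoop would distribute across servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big insights"]))
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, which is what lets a real cluster run many of them in parallel and rerun one on another machine after a failure.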
Hadoop Related
Names to Know
• Apache Avro: designed for communication between
Hadoop nodes through data serialization
• Cassandra and HBase: non-relational databases that can be
used with Hadoop
• Hive: a data warehouse layer with an SQL-like query language
(HiveQL) that runs on Hadoop
• Mahout: an AI tool designed for machine learning; that is,
to assist with filtering data for analysis and exploration
• Pig: a data-flow language (Pig Latin) and execution framework
for parallel computation
• ZooKeeper: Keeps all the parts coordinated and working
together
Some concepts
• NoSQL (Not Only SQL): Databases that “move
beyond” relational data models (i.e., no tables,
limited or no use of SQL)
– Focus on retrieval of data and appending new data
(not necessarily tables)
– Focus on key-value data stores that can be used to
locate data objects
– Focus on supporting storage of large quantities of
unstructured data
– SQL is not used for storage or retrieval of data
– No ACID (atomicity, consistency, isolation, durability)
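The key-value and append-oriented ideas in the list above can be sketched with a small in-memory store. This is a toy illustration under stated assumptions, not how any particular NoSQL product works; the class name `KeyValueStore` and its methods are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


class KeyValueStore:
    """Toy append-only key-value store: writes are appended, reads go by key."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Any]] = []  # append-only record of every write
        self._index: Dict[str, int] = {}       # key -> position of its latest write

    def put(self, key: str, value: Any) -> None:
        """Append a new value; older values for the key stay in the log."""
        self._index[key] = len(self._log)
        self._log.append((key, value))

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the latest value stored under key, or default if absent."""
        position = self._index.get(key)
        return default if position is None else self._log[position][1]


if __name__ == "__main__":
    store = KeyValueStore()
    # Values need no fixed schema: a dict and a plain string sit side by side.
    store.put("user:1", {"name": "Ada", "tags": ["admin"]})
    store.put("note:7", "free-form text")
    store.put("user:1", {"name": "Ada", "tags": ["admin", "ops"]})
    print(store.get("user:1"))
```

There are no tables, no SQL, and no transactions here: data is located purely by key, and an update is just another appended record.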
Resources
• BigInsights Wiki
• Udacity – Big data and data science
• BigData University
Thank you
