COMPUTER SCIENCE AND ENGINEERING
ANALYSIS OF HISTORICAL MOVIE DATA BY
USING HADOOP SYSTEM
INTERNAL GUIDE: T. CHANDRA SHEKAR REDDY
:
G.VEERABHADRA(13R21A05C8)
• Abstract
• Requirements
• Dataflow Diagram
• Methodology
• Screenshots
• Future Extension
• Conclusion
• References
A recommendation system learns a person's tastes and automatically finds new,
desirable content for them, based on patterns in their likes and in the ratings they
give to different items. In this paper, we propose a recommendation system, built
on the Hadoop framework, for the large amount of data available on the web in the
form of ratings, reviews, opinions, complaints, remarks, feedback, and comments
about any item (product, event, individual, or service).
• Hadoop 2.x
• MySQL
• HDFS
• Hive
• Pig
• Hue
• JDK 1.6
Dataflow Diagram
1. MS Excel (datasets in CSV format)
2. Import into the Cloudera home directory
3. Create the database in MySQL
4. Load the data into MySQL
5. Load the data into Hive using Sqoop
6. Load the data into Hue
Hadoop Distributed File System (HDFS):
• The Hadoop Distributed File System (HDFS) is designed to store very large data
sets reliably, and to stream those data sets at high bandwidth to user applications. In
a large cluster, thousands of servers both host directly attached storage and execute
user application tasks.
• An important characteristic of Hadoop is the partitioning of data and computation
across many (thousands of) hosts, and the execution of application computations in
parallel, close to their data.
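The partitioning described above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming the Hadoop 2.x default block size of 128 MiB and the default replication factor of 3 (both are configurable per cluster); the function name and the 1 GiB file size are hypothetical.

```python
# Assumed values: 128 MiB is the default block size in Hadoop 2.x and 3 is the
# default replication factor; a real cluster may be configured differently.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of blocks, total bytes stored across the cluster)."""
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    # The last block is not padded, so raw storage is simply size * replication.
    return blocks, file_size_bytes * replication


# Hypothetical 1 GiB ratings file.
blocks, stored = hdfs_footprint(1024 ** 3)
print(blocks, stored)
```

Each block can be processed by a separate task on a host that holds a replica, which is what allows computation to run in parallel close to the data.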
HDFS Architecture: [figure not included]
Hive:
• Hive is a data warehousing framework in Hadoop in which data is stored in the form
of tables (structured format). Hive runs on top of HDFS and MapReduce.
• The back-end storage for Hive is HDFS, and the execution model is MapReduce.
• Hive provides an SQL-like language called HiveQL (HQL). HQL is very similar to
SQL.
• Hive is designed for scalability and ease of use.
Hive data types:
• TINYINT (1 byte)
• SMALLINT (2 bytes)
• INT (4 bytes)
• BIGINT (8 bytes)
• FLOAT (4 bytes)
• DOUBLE (8 bytes)
• STRING (max size 2 GB)
• VARCHAR (Hive 0.12.0 supports 1 to 65535 characters)
• BOOLEAN (true/false)
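The byte widths of the integer types determine which values each can hold. The sketch below derives the signed ranges from the widths listed above; the helper names are hypothetical and only illustrate the arithmetic.

```python
# Byte widths of the four integer types in the list above.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}


def signed_range(num_bytes):
    """Return (min, max) for a two's-complement signed integer of this width."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(value):
    """Pick the narrowest listed type that can hold value, or None if none can."""
    for name, width in INTEGER_WIDTHS.items():
        low, high = signed_range(width)
        if low <= value <= high:
            return name
    return None


for name, width in INTEGER_WIDTHS.items():
    print(name, signed_range(width))
```

For example, a release year fits in SMALLINT, while a box-office collection counted in single currency units may need BIGINT.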
• Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system
such as MySQL or Oracle into the Hadoop Distributed File System and then
export the data back into an RDBMS.
• Sqoop automates most of this process, relying on the database to describe the
schema for the data to be imported. Sqoop uses MapReduce to import and export
the data, which provides parallel operation as well as fault tolerance.
Copy the file from Windows to Cloudera.
• To create the database: mysql> create database name;
• To select the database: mysql> use name;
• To create the table: mysql> create table tablename(….);
To import the data set into MySQL, the following command is used:
mysql> load data local infile 'path of the file' into table tablename fields
terminated by ',' enclosed by '"' lines terminated by '\r\n';
mysql> exit;
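The options in the load command describe the file layout: fields separated by commas, optional double quotes around a field, and Windows-style CRLF line endings. As a rough illustration of how such a file splits into rows, here is a small sketch using Python's csv module; the sample rows and the function name are hypothetical, and this is not how MySQL itself parses the file.

```python
import csv
import io

# Hypothetical sample rows: title, release year, rating. The second title
# contains a comma, so it is wrapped in double quotes.
SAMPLE = 'Movie A,2014,7.1\r\n"Movie B, Part 2",2015,8.3\r\n'


def parse_rows(text):
    """Split text into rows using ',' as the separator and '"' as the quote."""
    # newline="" passes line endings through untranslated; csv.reader
    # recognizes both CRLF and LF as row terminators.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"')
    return [row for row in reader]


rows = parse_rows(SAMPLE)
print(rows)
```

The quoted field shows why the enclosing character matters: without it, the comma inside the title would be read as a field separator.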
To import the data from MySQL into Hive, the following command is used:
sqoop import --connect jdbc:mysql://localhost/databasename --username root
--password cloudera --table tablename --fields-terminated-by ',' --hive-import -m 1
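If the import step has to be repeated for several tables, the argument list can be assembled programmatically. The sketch below only builds and prints the command; it does not run Sqoop. The options used (`--connect`, `--username`, `--password`, `--table`, `--fields-terminated-by`, `--hive-import`, `-m`) are standard Sqoop import options, while the function name and the database and table names `moviedb` and `movies` are hypothetical.

```python
import shlex


def sqoop_import_args(database, table, username, password, host="localhost"):
    """Build the argument list for a Sqoop import into Hive (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "--password", password,
        "--table", table,
        "--fields-terminated-by", ",",
        "--hive-import",
        "-m", "1",  # use a single map task
    ]


# 'moviedb' and 'movies' are hypothetical names used only for illustration.
args = sqoop_import_args("moviedb", "movies", "root", "cloudera")
print(shlex.join(args))
```

Keeping the arguments as a list, rather than one long string, avoids quoting mistakes if the list is later passed to a process launcher.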
To log in to HUE:
username: Cloudera
password: Cloudera
Go to the Hive editor. Select the database on the left side; on the right side, run
analytical queries on the tables created. Once a result is displayed, select a chart
type to visualize it, and repeat the same process for each of the respective years.
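As an illustration of the kind of analytical query meant here, the sketch below finds the best-rated movie for each year. It is only a stand-in: it uses Python's built-in sqlite3 module with an in-memory table instead of Hive, and the table layout (title, year, rating, budget, collection) and all rows are hypothetical, since the slides do not show the dataset's columns.

```python
import sqlite3

# Hypothetical rows: (title, year, rating, budget, collection).
ROWS = [
    ("Movie A", 2014, 7.1, 50, 120),
    ("Movie B", 2014, 8.3, 80, 300),
    ("Movie C", 2015, 6.9, 40, 90),
    ("Movie D", 2015, 7.7, 150, 410),
]


def best_rated_per_year(rows):
    """Return (year, title, rating) for the top-rated movie of each year."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies (title TEXT, year INTEGER, rating REAL, "
        "budget INTEGER, collection INTEGER)"
    )
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?, ?)", rows)
    # Join each movie against the maximum rating of its year.
    query = """
        SELECT m.year, m.title, m.rating
        FROM movies m
        JOIN (SELECT year, MAX(rating) AS top FROM movies GROUP BY year) t
          ON m.year = t.year AND m.rating = t.top
        ORDER BY m.year
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(best_rated_per_year(ROWS))
```

Replacing `rating` with `budget` or `collection` in the inner aggregate and the join condition gives the highest-budget and highest-collection reports in the same way. A query of this join-on-subquery shape should also be expressible in HiveQL, though the exact syntax accepted depends on the Hive version.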
Analysis of historical movie data by BHADRA
Big Data is clearly still in its early stages, and much more remains to be discovered.
The technology brings business benefits when it is leveraged across domains such as
Big Data, Business Intelligence, and Analytics.
These business benefits are:
• Speed and accelerated performance
Good query performance for improved decision making, faster data load processes
for low data latency, and accelerated memory planning capabilities.
• New business insights
Self-service BI and more flexible modeling capabilities.
Faster business processes.
• The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software has produced a unique moment in the history of
data analysis. The convergence of these trends means that we have the capabilities
required to analyze astonishing data sets quickly and cost-effectively for the first
time in history. These capabilities are neither theoretical nor trivial. They represent
a genuine leap forward and a clear opportunity to realize enormous gains in terms
of efficiency, productivity, revenue, and profitability. The Age of Big Data is here,
and these are truly revolutionary times if both business and technology
professionals continue to work together and deliver on the promise. Promises of
Big Data include innovation, growth, and long-term sustainability.
• From the results we can analyze the movies and produce reports such as the best
rated, highest budget, and highest collection with a single click.
• https://www.tutorialspoint.com/
• http://hadooptutorials.co.in/tutorials/hadoop/internals-of-hdfs-file-read-operations.html
• http://www.hadooptpoint.com/hadoop-hive-architecture/
• http://downloads.vmware.com/d/info/desktop_downloads/vmware_workstation/7_0
• http://www.cloudera.com/
• Hadoop: The Definitive Guide -- Tom White
• Big Data Analytics -- Wiley
Gantt Chart (definition):
A Gantt chart is a chart in which a series of horizontal lines shows the amount of work
done or production completed in certain periods of time in relation to the amount
planned for those periods.
Future Work:
In the next phase, we will analyze the datasets loaded into Hive using Hue or
the R tool.
Conclusion:
In this project we loaded a large collection of datasets into HDFS using Sqoop and Hive.
The movie data can then be easily analyzed using Hue.