### Big Data Made Easy – A Comprehensive Guide to the Hadoop Ecosystem

#### Introduction to Big Data and Hadoop

In today's digital era, the volume and variety of data generated by businesses and organizations have grown exponentially. Traditional data processing methods are often insufficient for managing this vast amount of information. Big data technologies such as Hadoop provide scalable, efficient solutions for storing, processing, and analyzing large datasets.

**Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset** is a comprehensive resource designed to help readers understand and use the Hadoop ecosystem effectively. The guide covers installation, configuration, data collection, processing, scheduling, moving data, monitoring, cluster management, analytics, ETL (Extract, Transform, Load), and reporting.

#### Chapter 1: The Problem with Data

This chapter delves into the challenges of handling big data. It explains why traditional databases and processing tools are unsuited to managing large volumes of unstructured and semi-structured data. Key points include:

- **Volume**: The sheer amount of data that must be stored and processed.
- **Velocity**: The speed at which data is generated and must be analyzed.
- **Variety**: The different types of data, including structured, semi-structured, and unstructured formats.
- **Veracity**: The uncertainty and quality of the data.

Understanding these challenges is crucial for appreciating the importance of big data technologies like Hadoop.

#### Chapter 2: Storing and Configuring Data with Hadoop, YARN, and ZooKeeper

This chapter focuses on setting up and configuring Hadoop for data storage and processing. It covers the following components:

- **Hadoop Distributed File System (HDFS)**: A distributed file system designed to store large datasets across multiple servers.
- **YARN (Yet Another Resource Negotiator)**: A framework for managing computing resources in a cluster.
- **Apache ZooKeeper**: A coordination service that maintains configuration and naming information and provides distributed synchronization and group services.

The chapter provides step-by-step instructions for installing and configuring these components on CentOS, a popular Linux distribution.
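
To give a feel for the coordination role ZooKeeper plays in such a cluster, here is a minimal sketch of a Java client that publishes and reads back a shared configuration value. The connection string `localhost:2181` and the znode `/demo-config` are illustrative assumptions, not values taken from the book.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        // Block until the session with the ensemble is actually established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // A hypothetical znode holding one shared configuration value.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Every node in the cluster can read the same value from the ensemble.
        byte[] data = zk.getData(path, false, null);
        System.out.println("config: " + new String(data));

        zk.close();
    }
}
```

The same create-and-watch pattern underlies the locking and leader-election recipes that higher-level Hadoop services build on top of ZooKeeper.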

#### Chapter 3: Collecting Data with Nutch and Solr

Data collection is a critical step in the big data pipeline. This chapter discusses two tools for collecting web data:

- **Nutch**: An open-source web crawler that can be used to gather data from the web.
- **Apache Solr**: A powerful search platform for indexing and searching text-based documents.

The chapter provides practical examples of how to use these tools to collect and index web data efficiently.

#### Chapter 4: Processing Data with MapReduce

MapReduce is a programming model and software framework for processing and generating large datasets. This chapter covers:

- **MapReduce Basics**: Understanding the Map and Reduce phases.
- **Programming Models**: Using Java, Pig, Perl, and Hive to implement MapReduce jobs.
- **Performance Tuning**: Techniques for optimizing the performance of MapReduce jobs.
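
As a concrete illustration of the Map and Reduce phases, below is a minimal sketch of the classic word-count job written against the standard `org.apache.hadoop.mapreduce` Java API. Class and path names are illustrative; the book's own examples may differ.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts gathered for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a JAR, a job like this would typically be submitted with `hadoop jar wordcount.jar WordCount <input> <output>`.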

#### Chapter 5: Scheduling and Workflow

Effective scheduling and workflow management are essential for managing tasks in a big data environment. This chapter discusses:

- **Schedulers**: The Fair and Capacity schedulers in Hadoop for managing job priorities.
- **Oozie**: A workflow scheduler for managing Hadoop jobs and complex workflows.

The chapter includes detailed instructions on how to set up and use these tools to automate and schedule data processing tasks.

#### Chapter 6: Moving Data

Data movement is a critical aspect of big data processing. This chapter covers tools and techniques for moving data into and out of Hadoop clusters:

- **Hadoop Commands**: Basic commands for managing files and directories in HDFS.
- **Sqoop**: A tool for transferring bulk data between Hadoop and relational databases.
- **Flume**: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
- **Apache Storm**: A real-time computation system for processing streaming data.
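
The Hadoop file commands covered here also have programmatic equivalents. Below is a minimal sketch using the Java `FileSystem` API to mirror a few common shell operations; the NameNode address and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMoveExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -mkdir -p /data/incoming
        Path dir = new Path("/data/incoming");
        fs.mkdirs(dir);

        // Equivalent of: hdfs dfs -put /tmp/sales.csv /data/incoming
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), dir);

        // Equivalent of: hdfs dfs -ls /data/incoming
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```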

#### Chapter 7: Monitoring Data

Monitoring is essential for ensuring the health and performance of Hadoop clusters. This chapter covers:

- **Hue**: A web interface for interacting with Hadoop clusters.
- **Nagios**: A monitoring tool for tracking the status of Hadoop services.
- **Ganglia**: A scalable distributed monitoring system for high-performance computing systems.

The chapter provides guidance on setting up and using these tools to monitor and troubleshoot issues in Hadoop clusters.

#### Chapter 8: Cluster Management

Managing Hadoop clusters involves various tasks, such as provisioning, configuration, and maintenance. This chapter discusses:

- **Ambari**: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
- **Cloudera Distribution Including Apache Hadoop (CDH)**: A comprehensive Hadoop distribution that includes a wide range of big data technologies.

The chapter provides insights into best practices for managing and scaling Hadoop clusters.

#### Chapter 9: Analytics with Hadoop

Analyzing data is one of the primary goals of using big data technologies. This chapter covers tools and frameworks for performing analytics on Hadoop:

- **Impala**: A high-performance SQL query engine for Hadoop.
- **Apache Hive**: A data warehousing component that provides SQL-like queries for Hadoop.
- **Apache Spark**: A fast and general-purpose cluster-computing system for large-scale data processing.

The chapter includes practical examples of how to use these tools for data analysis.
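
To show how much shorter the analytics layer can make the word-count job from Chapter 4, here is a minimal Spark sketch using the Java API (Spark 2.x style). The input and output paths are illustrative assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read lines from HDFS; the path is a placeholder.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/incoming");

            // Split into words, pair each with a count of 1, then sum per word.
            JavaRDD<String> words = lines.flatMap(
                    line -> Arrays.asList(line.split("\\s+")).iterator());
            JavaPairRDD<String, Integer> counts = words
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.saveAsTextFile("hdfs:///data/out/wordcount");
        }
    }
}
```

The whole pipeline fits in a few transformations because Spark keeps intermediate results in memory rather than writing them between the Map and Reduce phases.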

#### Chapter 10: ETL with Hadoop

Extract, Transform, Load (ETL) processes are fundamental to preparing data for analysis. This chapter discusses:

- **Pentaho**: An open-source data integration tool that supports ETL processes.
- **Talend**: A commercial and open-source platform for data integration.

The chapter provides guidance on using these tools to extract, transform, and load data into Hadoop.

#### Chapter 11: Reporting with Hadoop

Generating reports is a critical part of presenting the results of data analysis. This chapter covers:

- **Splunk**: A tool for searching, monitoring, and analyzing machine-generated big data.
- **Talend**: A platform that also includes reporting capabilities for visualizing data.

The chapter provides examples of how to create reports and visualizations using these tools.

### Conclusion

**Big Data Made Easy: A Working Guide to the Complete Hadoop Toolset** is an invaluable resource for anyone looking to understand and implement big data technologies. By the end of the book, readers will have a deep understanding of the Hadoop ecosystem and will be able to build and manage their own big data systems. Whether you are a developer, data scientist, or IT professional, this guide offers a gentle learning curve through the functional layers of Hadoop-based big data.