Book Detail: Pro Apache Hadoop, 2nd Edition

Book Title: 
Pro Apache Hadoop, 2nd Edition
Resource Category: 
Publisher: 
Publication Year: 
2014
Number of Pages: 
444
ISBN: 
978-1-4302-4863-7
978-1-4302-4864-4
Language: 
English
Edition: 
Second
WishList: 
yes
Available at Shelf: 
No
Description: 

ANALYZE LARGE VOLUMES OF DATA IN AMAZINGLY SHORT WALL-CLOCK INTERVALS

Table of Contents (Summary): 
  1. Motivation for Big Data

  2. Hadoop Concepts  

  3. Getting Started with the Hadoop Framework

  4. Hadoop Administration

  5. Basics of MapReduce Development

  6. Advanced MapReduce Development 

  7. Hadoop Input/Output  

  8. Testing Hadoop Programs  

  9. Monitoring Hadoop 

  10. Data Warehousing Using Hadoop  

  11. Data Processing Using Pig

  12. HCatalog and Hadoop in the Enterprise  

  13. Log Analysis Using Hadoop  

  14. Building Real-Time Systems Using HBase 

  15. Data Science with Hadoop 

  16. Hadoop in the Cloud  

  17. Building a YARN Application

Table of Contents (Expanded): 
  1. Motivation for Big Data

    • What Is Big Data?

    • Key Idea Behind Big Data Techniques

      • Data Is Distributed Across Several Nodes

      • Applications Are Moved to the Data

      • Data Is Processed Local to a Node

      • Sequential Reads Preferred Over Random Reads

      • An Example

    • Big Data Programming Models 

      • Massively Parallel Processing (MPP) Database Systems

      • In-Memory Database Systems

      • MapReduce Systems

      • Bulk Synchronous Parallel (BSP) Systems

    • Big Data and Transactional Systems

    • How Much Can We Scale?  

      • A Compute-Intensive Example

      • Amdahl’s Law

    • Business Use-Cases for Big Data

  2. Hadoop Concepts  

    • Introducing Hadoop 

    • Introducing the MapReduce Model

    • Components of Hadoop 

      • Hadoop Distributed File System (HDFS)  

      • Secondary NameNode

      • TaskTracker 

      • JobTracker

    • Hadoop 2.0

      • Components of YARN 

    • HDFS High Availability

  3. Getting Started with the Hadoop Framework

    • Types of Installation  

      • Stand-Alone Mode 

      • Pseudo-Distributed Cluster

      • Multinode Cluster Installation

      • Preinstalled Using Amazon Elastic MapReduce

    • Setting up a Development Environment with a Cloudera Virtual Machine

    • Components of a MapReduce program 

    • Your First Hadoop Program 

      • Prerequisites to Run Programs in Local Mode

      • WordCount Using the Old API

      • Building the Application 

      • Running WordCount in Cluster Mode

      • WordCount Using the New API

      • Building the Application

      • Running WordCount in Cluster Mode  

    • Third-Party Libraries in Hadoop Jobs

  4. Hadoop Administration

    • Hadoop Configuration Files

    • Configuring Hadoop Daemons

    • Precedence of Hadoop Configuration Files

    • Diving into Hadoop Configuration Files  

      • core-site.xml

      • hdfs-*.xml

      • mapred-site.xml

      • yarn-site.xml

      • Memory Allocations in YARN

    • Scheduler 

      • Capacity Scheduler

      • Fair Scheduler 

      • Fair Scheduler Configuration

      • yarn-site.xml Configurations

      • Allocation File Format and Configurations  

      • Determine Dominant Resource Share in drf Policy 

    • Slaves File 

    • Rack Awareness

      • Providing Hadoop with Network Topology  

    • Cluster Administration Utilities  

      • Check the HDFS

      • Command-Line HDFS Administration

      • Rebalancing HDFS Data

      • Copying Large Amounts of Data from the HDFS

  5. Basics of MapReduce Development

    • Hadoop and Data Processing

    • Reviewing the Airline Dataset 

      • Preparing the Development Environment 

      • Preparing the Hadoop System

    • MapReduce Programming Patterns

      • Map-Only Jobs (SELECT and WHERE Queries)

      • Problem Definition: SELECT Clause 

      • Problem Definition: WHERE Clause

      • Map and Reduce Jobs (Aggregation Queries)

      • Problem Definition: GROUP BY and SUM Clauses

      • Improving Aggregation Performance Using the Combiner 

      • Problem Definition: Optimized Aggregators

      • Role of the Partitioner

      • Problem Definition: Split Airline Data by Month 

    • Bringing It All Together

  6. Advanced MapReduce Development 

    • MapReduce Programming Patterns

      • Introduction to Hadoop I/O 

      • Problem Definition: Sorting 

      • Problem Definition: Analyzing Consecutive Records

      • Problem Definition: Join Using MapReduce

      • Problem Definition: Join Using Map-Only jobs 

      • Writing to Multiple Output Files in a Single MR Job 

      • Collecting Statistics Using Counters

  7. Hadoop Input/Output  

    • Compression Schemes 

      • What Can Be Compressed?

      • Compression Schemes

      • Enabling Compression 

    • Inside the Hadoop I/O processes 

      • Input Format

      • Output Format

      • Custom Output Format: Conversion from Text to XML

      • Custom Input Format: Consuming a Custom XML file 

    • Hadoop Files

      • Sequence File

      • Map Files

      • Avro Files 

  8. Testing Hadoop Programs  

    • Revisiting the Word Counter

    • Introducing MRUnit

      • Installing MRUnit 

      • MRUnit Core Classes 

      • Writing an MRUnit Test Case  

      • Testing Counters 

      • Features of MRUnit 

      • Limitations of MRUnit

    • Testing with LocalJobRunner

      • Limitations of LocalJobRunner 

    • Testing with MiniMRCluster 

      • Setting up the Development Environment  

      • Example for MiniMRCluster

      • Limitations of MiniMRCluster 

    • Testing MR Jobs That Access Network Resources

  9. Monitoring Hadoop 

    • Writing Log Messages in Hadoop MapReduce Jobs

    • Viewing Log Messages in Hadoop MapReduce Jobs 

    • User Log Management in Hadoop 2.x

      • Log Storage in Hadoop 2.x

      • Log Management Improvements

      • Viewing Logs Using Web-Based UI

      • Command-Line Interface

      • Log Retention

    • Hadoop Cluster Performance Monitoring

    • Using YARN REST APIs

    • Managing the Hadoop Cluster Using Vendor Tools

      • Ambari Architecture

  10. Data Warehousing Using Hadoop

    • Apache Hive  

      • Installing Hive

      • Hive Architecture

      • Metastore 

      • Compiler Basics

      • Hive Concepts

      • HiveQL Compiler Details

      • Data Definition Language  

      • Data Manipulation Language

      • External Interfaces

      • Hive Scripts

      • Performance

      • MapReduce Integration  

      • Creating Partitions

      • User-Defined Functions 

    • Impala

      • Impala Architecture

      • Impala Features 

      • Impala Limitations

    • Shark

      • Shark/Spark Architecture

  11. Data Processing Using Pig

    • An Introduction to Pig

    • Running Pig

      • Executing in the Grunt Shell 

      • Executing a Pig Script

      • Embedded Java Program

    • Pig Latin

      • Comments in a Pig Script

      • Execution of Pig Statements

      • Pig Commands

    • User-Defined Functions 

      • Eval Functions Invoked in the Mapper

      • Eval Functions Invoked in the Reducer

      • Writing and Using a Custom FilterFunc

    • Comparison of Pig versus Hive

    • Crunch API

      • How Crunch Differs from Pig 

      • Sample Crunch Pipeline

  12. HCatalog and Hadoop in the Enterprise  

    • HCatalog and Enterprise Data Warehouse Users

    • HCatalog: A Brief Technical Background

      • HCatalog Command-Line Interface

      • WebHCat

      • HCatalog Interface for MapReduce

      • HCatalog Interface for Pig

      • HCatalog Notification Interface

    • Security and Authorization in HCatalog

    • Bringing It All Together 

  13. Log Analysis Using Hadoop  

    • Log File Analysis Applications

      • Web Analytics

      • Security Compliance and Forensics

      • Monitoring and Alerts 

      • Internet of Things

    • Analysis Steps

      • Load

      • Refine 

      • Visualize 

    • Apache Flume

      • Core Concepts 

    • Netflix Suro

    • Cloud Solutions

  14. Building Real-Time Systems Using HBase 

    • What Is HBase?

    • Typical HBase Use-Case Scenarios

    • HBase Data Model 

      • HBase Logical or Client-Side View

      • Differences Between HBase and RDBMSs 

      • HBase Tables

      • HBase Cells 

      • HBase Column Family 

    • HBase Commands and APIs

      • Getting a Command List: help Command

      • Creating a Table: create Command

      • Adding Rows to a Table: put Command 

      • Retrieving Rows from the Table: get Command

      • Reading Multiple Rows: scan Command

      • Counting the Rows in the Table: count Command  

      • Deleting Rows: delete Command

      • Truncating a Table: truncate Command 

      • Dropping a Table: drop Command

      • Altering a Table: alter Command

    • HBase Architecture

      • HBase Components

      • Compaction and Splits in HBase

      • Compaction 

    • HBase Configuration: An Overview

      • hbase-default.xml and hbase-site.xml 

    • HBase Application Design 

      • Tall vs. Wide vs. Narrow Table Design

      • Row Key Design

    • HBase Operations Using Java API

      • HBase Treats Everything as Bytes 

      • Create an HBase Table

      • Administrative Functions Using HBaseAdmin

      • Accessing Data Using the Java API

    • HBase MapReduce Integration 

    • A MapReduce Job to Read an HBase Table

    • HBase and MapReduce Clusters

      • Scenario I: Frequent MapReduce Jobs Against HBase Tables 

      • Scenario II: HBase and MapReduce Have Independent SLAs

  15. Data Science with Hadoop 

    • Hadoop Data Science Methods  

    • Apache Hama

      • Bulk Synchronous Parallel Model

      • Hama Hello World!

      • Monte Carlo Methods 

      • K-Means Clustering

    • Apache Spark

      • Resilient Distributed Datasets (RDDs)

      • Monte Carlo with Spark

      • KMeans with Spark 

    • RHadoop

  16. Hadoop in the Cloud  

    • Economics

      • Self-Hosted Cluster

      • Cloud-Hosted Cluster

      • Elasticity

      • On Demand

      • Bid Pricing

      • Hybrid Cloud  

    • Logistics 

      • Ingress/Egress

      • Data Retention 

    • Security 

    • Cloud Usage Models

    • Cloud Providers

      • Amazon Web Services

      • Google Cloud Platform

      • Microsoft Azure

      • Choosing a Cloud Vendor

    • Case Study: Amazon Web Services

      • Elastic MapReduce

      • Elastic Compute Cloud 

  17. Building a YARN Application

    • YARN: A General-Purpose Distributed System

    • YARN: A Quick Review

    • Creating a YARN Application

      • POM Configuration

    • DownloadService.java Class

    • Client.java 

      • Steps to Launch the Application Master from the Client  

    • ApplicationMaster.java

      • Communication Protocol between Application Master and Resource Manager: Application Master Protocol 

      • Node Manager Communication Protocol: Container Management Protocol

      • Steps to Launch the Worker Tasks

    • Executing the Application Master

      • Launch the Application in Un-Managed Mode

      • Launch the Application in Managed Mode  

 

Appendix A: Installing Hadoop

  • Installing Hadoop 2.2.0 on Windows

  • Installing Hadoop 2.2.0 on Linux 

Appendix B: Using Maven with Eclipse

  • A Quick Introduction to Maven 

  • Using Maven with Eclipse

Appendix C: Apache Ambari

  • Hadoop Components Supported by Apache Ambari

  • Installing Apache Ambari

  • Trying the Ambari Sandbox on Your OS

Index
