
Hadoop 2.0 Internals, Ecosystem and Analytics Training

Course Contents


Mobile: +91 7719882295 / 9730463630
Email: sales@anikatechnologies.com
Website: www.anikatechnologies.com


Day 1
Hadoop Introduction
MapReduce

Day 2
Distributing Data with HDFS
Understanding Hadoop I/O
Advanced MapReduce

Day 3
Writing Map-Reduce Applications
Map-Reduce Internals

Day 4
Managing Hadoop
Map-Reduce Ecosystem
Hadoop Ecosystem (Tools)

Day 5
Map Reduce Design Patterns
Hadoop-2
Analytics



Hadoop Introduction

Hadoop performance and data scale facts.


Volunteer Computing
Move computation not data.
Hadoop Releases
The Apache Hadoop Project.
Grid Computing
Hadoop in the context of other data stores.
The Hadoop Ecosystem.
Apache Hadoop and the Hadoop Ecosystem
A Brief History of Hadoop
Hadoop an inside view: MapReduce and HDFS.
What about NoSQL?
RDBMS
Comparison with Other Systems

MapReduce

Constructing the basic template of a MapReduce program


Running a Distributed MapReduce Job
Data Flow
Combiner Functions
Java MapReduce
Scaling Out
Analyzing the Data with Hadoop
Counting things
Map and Reduce
Hadoop Pipes
Adapting for Hadoop's API changes
Improving performance with combiners

Hadoop Streaming
Ruby

Streaming in Hadoop

Streaming with key/value pairs


Streaming with Unix commands
Streaming with the Aggregate package
Streaming with scripts

Distributing Data with HDFS

Interfaces
Hadoop Filesystems
The Design of HDFS

Parallel Copying with distcp

Anatomy of a File Write


Anatomy of a File Read
Coherency Model

The Command-Line Interface

Limitations

Data Flow

Keeping an HDFS Cluster Balanced


Hadoop Archives

Using Hadoop Archives

Basic Filesystem Operations

The Java Interface

Python

Querying the Filesystem


Reading Data Using the FileSystem API
Directories
Deleting Data
Reading Data from a Hadoop URL
Writing Data

Understanding Hadoop I/O


File-Based Data Structures

MapFile

Serialization

Codecs
Using Compression in MapReduce
Compression and Input Splits

Data Integrity

Implementing a Custom Writable


Serialization Frameworks
The Writable Interface
Writable Classes
Avro

Compression

SequenceFile

ChecksumFileSystem
LocalFileSystem
Data Integrity in HDFS

Advanced MapReduce
Chaining MapReduce jobs

Creating a Bloom filter

What does a Bloom filter do?


Bloom filter in Hadoop version 0.20+
Implementing a Bloom filter

Joining data from different sources

Chaining preprocessing and postprocessing steps


Chaining MapReduce jobs in a sequence
Chaining MapReduce jobs with complex dependency

Reduce-side joining
Replicated joins using DistributedCache
Semijoin: reduce-side join with map-side filtering

Writing Map-Reduce Applications

Hadoop in the Cloud


Hadoop Configuration

Cluster Setup and Installation


YARN Configuration
The Configuration API
Running Locally on Test Data
Configuring the Development Environment
Cluster Specs
Tuning
MapReduce Workflows
Monitoring and debugging on a production cluster
Tuning for performance
Benchmarking a Hadoop Cluster

Map-Reduce Internals
Failures

Anatomy of a MapReduce Job Run

The Reduce Side


The Map Side
Configuration Tuning

Task Execution

Classic MapReduce (MapReduce 1)


YARN (MapReduce 2)

Shuffle and Sort

Failures in YARN
Failures in Classic MapReduce

Skipping Bad Records


Output Committers
The Task Execution Environment
Speculative Execution
Task JVM Reuse

Job Scheduling

The Capacity Scheduler


The Fair Scheduler

Managing Hadoop

Setting permissions
Enabling trash
Adding DataNodes
Managing NameNode and Secondary NameNode
Designing network layout and rack awareness
Checking systems health
Managing quotas
Setting up parameter values for practical use
Removing DataNodes
Recovering from a failed NameNode

Map-Reduce Features

Counters
Sorting
Side Data Distribution
Map-Reduce Library
Joins

Map-Reduce Ecosystem
Hive

Installing and configuring Hive

HiveQL in detail
Example queries
Hive Sum-up

HBase

Introduction
Clients
Concepts
HBase vs RDBMS

Pig

Installing Pig

Running Pig

Thinking like a Pig

Data flow language


User-defined functions
Data types

Speaking Pig Latin

Learning Pig Latin through Grunt


Managing the Grunt shell

Execution optimization
Expressions and functions
Relational operators
Data types and schemas

Hadoop Ecosystem (Tools)


HBase Operations

Co-Processor
Scan Operations
Column Value & Key Pair
Column Families
Index & Query
Counters
CRUD Operations
Result Scanner
Batch and Caching
MapReduce and HBase
Filters
Creating Table Shell and Programming
Importing into HBase

Deep Dive in Hive

Understanding Hive, Architecture, Physical Model, Data Model, Data Types
HiveQL: DDL, DML, other Operations
Playing with huge data and Querying extensively.

User-defined Functions, Optimizing Queries, Tips and Tricks for performance tuning

Tables in Hive, Partitioning, Indexes, Bucketing, Sub Queries, Joining Tables, Data Load and appending data to an existing Table

Deep Dive in Pig

Sqoop

HBase DB Design

Handling Index
Designing Keys
Transaction
Integration for search
Schema Design

Flume

Map Reduce Design Patterns

Advanced Pig Latin, Evaluation and Filter functions, Pig and Ecosystem
Grunt, Script Mode, Data Model
Real time use cases

Metapatterns
Join Patterns
Summarization Patterns
The Effects of YARN
Data Organization Patterns
Filtering Patterns
Input and Output Patterns
Final Thoughts

Hadoop-2
Apache Tez

Apache Tez: A New Chapter in Hadoop Data Processing


Data Processing API in Apache Tez
Writing a Tez Input/Processor/Output

Apache YARN

Agility
global ResourceManager
per-node slave NodeManager
Scalability
Support for workloads other than MapReduce
Compatibility with MapReduce
per-application Container running on a NodeManager
Improved cluster utilization
per-application ApplicationMaster

HDFS-2

Runtime API in Apache Tez


Apache Tez: Dynamic Graph Reconfiguration

High Availability for HDFS


HDFS-append support
HDFS Federation
HDFS Snapshots

Analytics
Clustering

Measuring the similarity of items


Exploring distance measures
Clustering basics

Clustering algorithms in Mahout

Fuzzy k-means clustering


Model-based clustering
K-means clustering
Beyond k-means: an overview of clustering techniques

Topic modeling using latent Dirichlet allocation (LDA)

Taking clustering to production

Batch and online clustering


Tuning clustering performance

Evaluating and improving clustering quality

Inspecting clustering output


Analyzing clustering output
Improving clustering quality


Representing data

Quick-start tutorial for running clustering on Hadoop

Improving quality of vectors using normalization


Representing text documents as vectors
Visualizing vectors
Generating vectors from documents

Classification

Workflow in a typical classification project


The fundamentals of classification systems
Introduction to classification
How classification works
Mahout for classification
Classification example

Training a classifier

Classifying the 20 newsgroups data set with SGD


Preprocessing raw data into classifiable data
Converting classifiable data into vectors

Evaluating and tuning a classifier

Mahout classifier
Choosing an algorithm to train the classifier
Classifying the 20 newsgroups data with naive Bayes
The classifier evaluation API
Process for deployment in huge systems
Thrift-based classification server
Building a training pipeline for large systems
When classifiers go bad
Classifier evaluation in Mahout
Determining scale and speed requirements
Deploying a classifier

Recommendations
Introducing recommenders

Real-world applications of clustering

Finding similar users on Twitter


Analyzing the Stack Overflow data set
Suggesting tags for artists on Last.fm

Representing recommender data

Evaluating the GroupLens data set


Defining recommendation
Evaluating precision and recall
Evaluating a recommender

Coping without preference values


In-memory DataModels
Representing preference data

Making recommendations

Exploring similarity metrics


Slope-one recommender
New and experimental recommenders
Comparison to other recommenders

Distributing recommendation computations

Understanding user-based recommendation


Item-based recommendation
Exploring the user-based recommender
Designing a distributed item-based algorithm
Implementing a distributed algorithm with MapReduce
Analyzing the Wikipedia data set
Pseudo-distributing a recommender

Taking recommenders to production

Analyzing example data from a dating site


Finding an effective recommender
Recommending to anonymous users
Injecting domain-specific information

