Hadoop 2.0 Internals, Ecosystem and Analytics
Training Course Contents
Mobile: +91 7719882295/ 9730463630

Email: sales@anikatechnologies.com
Website: www.anikatechnologies.com
Course Contents
Day 1
Hadoop Introduction
MapReduce
Day 2
Distributing Data with HDFS
Understanding Hadoop I/O
Advanced MapReduce
Day 3
Writing Map-Reduce Applications
Map-Reduce Internals
Day 4
Managing Hadoop
Map-Reduce Ecosystem
Hadoop Ecosystem (Tools)
Day 5
Map Reduce Design Patterns
Hadoop-2
Analytics
Hadoop Introduction
Hadoop performance and data scale facts.

Volunteer Computing
Move computation not data.
Hadoop Releases
The Apache Hadoop Project.
Grid Computing
Hadoop in the context of other data stores.
The Hadoop Ecosystem.
Apache Hadoop and the Hadoop Ecosystem
A Brief History of Hadoop
Hadoop an inside view: MapReduce and HDFS.
What about NoSQL?
RDBMS
Comparison with Other Systems
MapReduce
Constructing the basic template of a MapReduce program

Running a Distributed MapReduce Job
Data Flow
Combiner Functions
Java MapReduce
Scaling Out
Analyzing the Data with Hadoop
Counting things
Map and Reduce
Hadoop Pipes
Adapting for Hadoop's API changes
Improving performance with combiners
Hadoop Streaming
Ruby
Streaming in Hadoop
Streaming with key/value pairs

Streaming with Unix commands
Streaming with the Aggregate package
Streaming with scripts
Distributing Data with HDFS
Interfaces
Hadoop Filesystems
The Design of HDFS
Parallel Copying with distcp
Anatomy of a File Write

Anatomy of a File Read
Coherency Model
The Command-Line Interface
Limitations
Data Flow
Keeping an HDFS Cluster Balanced

Hadoop Archives
Using Hadoop Archives
Basic Filesystem Operations
The Java Interface
Python
Querying the Filesystem

Reading Data Using the FileSystem API
Directories
Deleting Data
Reading Data from a Hadoop URL
Writing Data
Understanding Hadoop I/O

File-Based Data Structures
MapFile
Serialization
Codecs
Using Compression in MapReduce
Compression and Input Splits
Data Integrity
Implementing a Custom Writable

Serialization Frameworks
The Writable Interface
Writable Classes
Avro
Compression
SequenceFile
ChecksumFileSystem
LocalFileSystem
Data Integrity in HDFS
Advanced MapReduce
Chaining MapReduce jobs
Creating a Bloom filter
What does a Bloom filter do?

Bloom filter in Hadoop version 0.20+
Implementing a Bloom filter
Joining data from different sources
Chaining preprocessing and postprocessing steps

Chaining MapReduce jobs in a sequence
Chaining MapReduce jobs with complex dependency
Reduce-side joining
Replicated joins using DistributedCache
Semijoin: reduce-side join with map-side filtering
Writing Map-Reduce Applications
Hadoop in the Cloud

Hadoop Configuration
Cluster Setup and Installation

YARN Configuration
The Configuration API
Running Locally on Test Data
Configuring the Development Environment
Cluster Specs
Tuning
MapReduce Workflows
Monitoring and debugging on a production cluster
Tuning for performance
Benchmarking a Hadoop Cluster
Map-Reduce Internals
Failures
Anatomy of a MapReduce Job Run
The Reduce Side

The Map Side
Configuration Tuning
Task Execution
Classic MapReduce (MapReduce 1)

YARN (MapReduce 2)
Shuffle and Sort
Failures in YARN
Failures in Classic MapReduce
Skipping Bad Records

Output Committers
The Task Execution Environment
Speculative Execution
Task JVM Reuse
Job Scheduling
The Capacity Scheduler

The Fair Scheduler
Managing Hadoop
Setting permissions
Enabling trash
Adding DataNodes
Managing NameNode and Secondary NameNode
Designing network layout and rack awareness
Checking systems health
Managing quotas
Setting up parameter values for practical use
Removing DataNodes
Recovering from a failed NameNode
Map-Reduce Features
Counters
Sorting
Side Data Distribution
Map-Reduce Library
Joins
Map-Reduce Ecosystem
Hive
Installing and configuring Hive
HiveQL in detail
Example queries
Hive Sum-up
HBase
Introduction
Clients
Concepts
HBase vs RDBMS
Pig
Installing Pig
Running Pig
Thinking like a Pig
Data flow language

User-defined functions
Data types
Speaking Pig Latin
Learning Pig Latin through Grunt

Managing the Grunt shell
Execution optimization
Expressions and functions
Relational operators
Data types and schemas
Hadoop Ecosystem (Tools)

HBase Operations
Co-Processor
Scan Operations
Column Value & Key Pair
Column Families
Index & Query
Counters
CRUD Operations
Result Scanner
Batch and Caching
MapReduce and HBase
Filters
Creating Table Shell and Programming
Importing into HBase
Deep Dive in Hive
Understanding Hive: Architecture, Physical Model, Data Model, Data Types
HiveQL: DDL, DML, other operations
Playing with huge data and querying extensively
User-defined functions, optimizing queries, tips and tricks for performance tuning
Tables in Hive, partitioning, indexes, bucketing, subqueries, joining tables, data load and appending data to an existing table
Deep Dive in Pig
Sqoop
HBase DB Design
Handling Index
Designing Keys
Transaction
Integration for search
Schema Design
Flume
Map Reduce Design Patterns
Advanced Pig Latin, Evaluation and Filter functions, Pig and Ecosystem
Grunt, Script Mode, Data Model
Real time use cases
Metapatterns
Join Patterns
Summarization Patterns
The Effects of YARN
Data Organization Patterns
Filtering Patterns
Input and Output Patterns
Final Thoughts
Hadoop-2
Apache Tez
Apache Tez: A New Chapter in Hadoop Data Processing

Data Processing API in Apache Tez
Writing a Tez Input/Processor/Output
Apache YARN
Agility
global ResourceManager
per-node slave NodeManager
Scalability
Support for workloads other than MapReduce
Compatibility with MapReduce
per-application Container running on a NodeManager
Improved cluster utilization
per-application ApplicationMaster
HDFS-2
Runtime API in Apache Tez

Apache Tez: Dynamic Graph Reconfiguration
High Availability for HDFS

HDFS-append support
HDFS Federation
HDFS Snapshots
Analytics
Clustering
Measuring the similarity of items

Exploring distance measures
Clustering basics
Clustering algorithms in Mahout
Fuzzy k-means clustering

Model-based clustering
K-means clustering
Beyond k-means: an overview of clustering techniques
Topic modeling using latent Dirichlet allocation (LDA)
Taking clustering to production
Batch and online clustering

Tuning clustering performance
Evaluating and improving clustering quality
Inspecting clustering output

Analyzing clustering output
Improving clustering quality
Representing data
Quick-start tutorial for running clustering on Hadoop
Improving quality of vectors using normalization

Representing text documents as vectors
Visualizing vectors
Generating vectors from documents
Classification
Work flow in a typical classification project

The fundamentals of classification systems
Introduction to classification
How classification works
Mahout for classification
Classification example
Training a classifier
Classifying the 20 newsgroups data set with SGD

Preprocessing raw data into classifiable data
Converting classifiable data into vectors
Evaluating and tuning a classifier
Mahout classifier
Choosing an algorithm to train the classifier
Classifying the 20 newsgroups data with naive Bayes
The classifier evaluation API
Process for deployment in huge systems
Thrift-based classification server
Building a training pipeline for large systems
When classifiers go bad
Classifier evaluation in Mahout
Determining scale and speed requirements
Deploying a classifier
Recommendations
Introducing recommenders
Real-world applications of clustering
Finding similar users on Twitter

Analyzing the Stack Overflow data set
Suggesting tags for artists on Last.fm
Representing recommender data
Evaluating the GroupLens data set

Defining recommendation
Evaluating precision and recall
Evaluating a recommender
Coping without preference values

In-memory DataModels
Representing preference data
Making recommendations
Exploring similarity metrics

Slope-one recommender
New and experimental recommenders
Comparison to other recommenders
Distributing recommendation computations
Understanding user-based recommendation

Item-based recommendation
Exploring the user-based recommender
Designing a distributed item-based algorithm
Implementing a distributed algorithm with MapReduce
Analyzing the Wikipedia data set
Pseudo-distributing a recommender
Taking recommenders to production
Analyzing example data from a dating site

Finding an effective recommender
Recommending to anonymous users
Injecting domain-specific information