
Hadoop Ecosystem

Apache Hadoop
manual setup of the environment
manual installation of packages
fixing configuration files

HDFS (Hadoop Distributed File System)
It's a special file system.
Master / Name node
holds the paths to files, their blocks and their replicas
Slave / Data node
data is divided into blocks (64/128 MB)

MapReduce
Map
executed in parallel and, where possible, locally on each block
everything that can be processed in parallel is processed in parallel
Combine
aggregates data on local servers; saves intermediate results to disk
Reduce
aggregates data at the highest level
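To make the three phases concrete, here is a minimal pure-Python sketch of the classic word count, simulating Map, Combine and Reduce locally (no Hadoop needed); the block contents and function names are illustrative.

from collections import defaultdict
from itertools import chain

# two HDFS-style blocks of the input text (illustrative)
blocks = [
    "to be or not to be",
    "to know is to know nothing",
]

def map_phase(block):
    # Map: runs in parallel, locally on each block
    return [(word, 1) for word in block.split()]

def combine(pairs):
    # Combine: pre-aggregates on the local server, so less
    # intermediate data is written to disk and shuffled
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_phase(all_pairs):
    # Reduce: final aggregation at the highest level
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

combined = [combine(map_phase(b)) for b in blocks]
print(reduce_phase(chain.from_iterable(combined)))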
Suppliers

Cloudera (CDH, Cloudera Distribution Including Apache Hadoop)
Cloudera Manager installs and monitors all needed packages, with all popular tools
tries to get new features first

Hortonworks (HDP, Hortonworks Data Platform)
one general solution: instead of developing its own tools, it invests in existing Apache products
HDP looks more stable than CDH

MapR
sells its own solutions, not only consulting
pros: a lot of optimizations, a partner program with Amazon
cons: the free M3 edition has cut-down functionality

Spark
uses the same idea of data locality, but does most calculations in memory instead of on disk
RDD: resilient distributed dataset
has interfaces for Scala, Java and Python
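A minimal PySpark sketch of the idea, assuming a local Spark installation; the HDFS path is illustrative. The RDD is cached after the first pass, so the second computation is served from memory.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD (resilient distributed dataset) is partitioned across the
# cluster; cache() keeps it in memory, so repeated passes skip the disk.
lines = sc.textFile("hdfs:///data/logs.txt").cache()  # path is illustrative

errors = lines.filter(lambda l: "ERROR" in l).count()   # first pass reads from storage
warnings = lines.filter(lambda l: "WARN" in l).count()  # second pass is served from memory
print(errors, warnings)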

Engines

Tez
Alternative engine from Hortonworks.
Main principle: the directed acyclic graph (DAG).
Used mainly in Hive so far.

Import: Apache Kafka
Writes messages to disk immediately and keeps the data for a configured number of days.
Easily scalable.
Kafka does not lie about reliability.
Consumer groups do not work (all messages are delivered to all consumers).
The server does not save offsets for consumers.
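A small sketch of the produce/consume cycle, assuming the third-party kafka-python client and a broker on localhost:9092; the topic name is illustrative.

from kafka import KafkaProducer, KafkaConsumer

# Producer: messages are appended to the broker's log on disk
# and kept for the configured retention period.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user signed up")
producer.flush()

# Consumer: in the old versions described here the server kept no
# offsets for you, so each consumer tracked its own position.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)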
SQL tools for analysis of historical records

Hive
Language: HiveQL.
Version 0.13 uses the Tez engine, which has great optimizations and works very fast compared to previous versions.
Has ODBC drivers and can work with Tableau, MicroStrategy and Excel.
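A hedged sketch of querying Hive from Python, assuming the third-party PyHive client and a HiveServer2 on the default port; the table is hypothetical.

from pyhive import hive  # third-party PyHive client (assumption)

conn = hive.Connection(host="localhost", port=10000)  # HiveServer2 defaults
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into MapReduce/Tez jobs.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM access_logs        -- hypothetical table
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)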

Impala
Cloudera product. Uses a C++ engine. Has caching of frequently used blocks and columnar storage. Has an ODBC driver.
Spark SQL
Does not have its own metadata warehouse. Is pretty weak so far.

NoSQL: HBase
Allows working with individual records in real time.
New records are added into a sorted structure in memory, and only when it reaches a restricted volume is it written to disk.
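A minimal sketch of that write path, assuming the third-party happybase client and an HBase Thrift server; table and row names are illustrative.

import happybase  # third-party Thrift-based HBase client (assumption)

connection = happybase.Connection("localhost")  # HBase Thrift server
table = connection.table("users")               # illustrative table

# Writes land in the in-memory sorted structure first; HBase flushes
# it to a file on disk once it grows past a threshold.
table.put(b"user42", {b"profile:name": b"Alice"})

# Individual records can be read back in real time by row key.
row = table.row(b"user42")
print(row[b"profile:name"])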

Advanced Analytics

Mahout
collaborative filtering
clustering algorithms
random forest
So far it uses the MapReduce engine, but this is going to be changed to the Spark engine.
MLlib
basic statistics
linear and logistic regression
SVM
k-means
SVD
PCA
SGD
L-BFGS
Has a Python interface (NumPy).
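A small PySpark sketch of one MLlib algorithm (k-means) on toy data, assuming a local Spark installation.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-demo")

# toy 2-D points; a real job would load an RDD from HDFS
points = sc.parallelize([
    [0.0, 0.0], [0.1, 0.2],
    [9.0, 9.0], [9.1, 8.8],
])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centers
print(model.predict([9.0, 9.0]))   # cluster index for a new point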
Spark Streaming
Can take data from Kafka, ZeroMQ, sockets, Twitter etc.
DStream interface: a collection of small RDDs, each covering a fixed time range.
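A minimal DStream sketch, assuming a local Spark installation and a text source on localhost:9999 (e.g. nc -lk 9999); every 5-second batch becomes one small RDD.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-demo")
ssc = StreamingContext(sc, 5)  # one small RDD per 5-second batch

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # prints each batch's word counts

ssc.start()
ssc.awaitTermination()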
Data Types

Parquet
Columnar format optimized for storing complicated structures with effective compression. Used by Spark and Impala.
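A short PySpark sketch of writing and reading Parquet, assuming a local Spark installation; the path and columns are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)

# each column is stored separately, so it compresses well and a
# query can read only the columns it actually needs
df.write.mode("overwrite").parquet("/tmp/people.parquet")
spark.read.parquet("/tmp/people.parquet").select("name").show()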

ORC
Optimized format for Hive.

Avro
Can send the schema with the data, or can work with dynamically typed objects.
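A minimal sketch using the third-party fastavro library (an assumption); the schema is illustrative and travels inside the file.

from fastavro import writer, reader, parse_schema

# the schema is embedded in the file, so any reader can decode it
schema = parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "alice", "age": 34}])

with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)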
ZooKeeper
Main tool for coordination of elements in the Hadoop infrastructure.
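A small coordination sketch, assuming the third-party kazoo client and a ZooKeeper server on the default port; the znode paths are illustrative.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# services coordinate through small znodes in a shared tree,
# e.g. registering themselves under an ephemeral path
zk.ensure_path("/services/web")
zk.create("/services/web/node-1", b"10.0.0.5:8080", ephemeral=True)

print(zk.get_children("/services/web"))  # currently registered nodes
zk.stop()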
Hue
Web interface for Hadoop services; part of Cloudera Manager.
Flume
Service for organizing streaming data flows.

Managers / task planners

Oozie
Task planner.

Azkaban
Supports the following actions:
running console commands (what else do you need?)
executing on a schedule
application logging
notifying about failed jobs
etc.

Airflow
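A minimal Airflow DAG sketch, assuming Airflow 2.x; the pipeline name, tasks and commands are illustrative.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_import",           # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract >> load  # load runs only after extract succeeds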
