Spark
Uses the data-locality idea, but does most calculations in
memory instead of on disk. Built around the resilient
distributed dataset (RDD). Has interfaces for Scala, Java
and Python, and a lot of optimizations.

MapR
Sells its own solutions, not only consulting.
Pros: partner program with Amazon.
Cons: the M3 edition has cut-down functionality.

Engines
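The in-memory RDD idea above can be sketched in plain Python: transformations are recorded lazily and the data stays in memory until an action forces evaluation. This is only an illustration of the concept, not the real Spark API; the MiniRDD class is invented for the example.

```python
# Minimal sketch of the RDD idea: map/filter are lazy (they build a
# pipeline over generators), and only collect() pulls results through.
class MiniRDD:
    def __init__(self, data):
        self._data = data  # in-memory "partition" (a list or generator)

    def map(self, f):
        # Lazy: nothing is computed yet, just a new pipeline stage.
        return MiniRDD(f(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces the whole pipeline to run, all in memory.
        return list(self._data)

rdd = MiniRDD([1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect()
print(result)  # [1, 9, 25]
```

No intermediate result touches disk here, which is exactly the contrast with classic MapReduce that the notes point at.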
Impala
Cloudera product. Uses a C++ engine. Has caching of
frequently used blocks and columnar storage. Has an
ODBC driver. Allows working with individual records in
real time.

Spark SQL
Does not have its own metadata warehouse. Is pretty
weak so far.

NoSQL: HBase
New records are added into a sorted structure in memory,
and only when it reaches a restricted volume is it sent
to disk.
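The HBase write path described above (a sorted in-memory structure flushed to disk on reaching a size limit) is the LSM-tree idea, and can be sketched in a few lines. The MEMTABLE_LIMIT name and the tiny threshold are invented for the demo; real HBase flushes memstores of many megabytes into HFiles.

```python
# Sketch of LSM-style writes: records land in a sorted in-memory
# "memtable"; when it reaches a limit, the sorted run is flushed to
# "disk" (a list stands in for an HFile here).
import bisect

MEMTABLE_LIMIT = 3        # assumed flush threshold, tiny for the demo
memtable = []             # sorted list of (key, value)
flushed_files = []        # each flush appends one sorted "file"

def put(key, value):
    bisect.insort(memtable, (key, value))      # keep memtable sorted
    if len(memtable) >= MEMTABLE_LIMIT:
        flushed_files.append(list(memtable))   # flush sorted run
        memtable.clear()

for k, v in [("c", 3), ("a", 1), ("b", 2), ("e", 5), ("d", 4)]:
    put(k, v)

print(flushed_files)  # [[('a', 1), ('b', 2), ('c', 3)]]
print(memtable)       # [('d', 4), ('e', 5)]
```

Because every flushed run is already sorted, reads can merge the files cheaply, which is what makes the design fast for write-heavy workloads.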
Advanced Analytics

Mahout
Collaborative filtering.
Clustering algorithms.
Random forest.
So far it uses the MapReduce engine, but this is going to
be changed to the Spark engine.
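Collaborative filtering, the first Mahout feature listed, can be shown on a toy scale. The ratings, user names and cosine-similarity weighting below are invented for illustration; Mahout's value is running this kind of computation over datasets that do not fit on one machine.

```python
# Tiny user-based collaborative filtering sketch: score items a user
# has not seen by similarity-weighted ratings from other users.
import math

ratings = {
    "alice": {"film1": 5, "film2": 4, "film3": 1},
    "bob":   {"film1": 5, "film2": 5, "film4": 4},
    "carol": {"film1": 1, "film3": 5, "film4": 2},
}

def cosine(u, v):
    # Cosine similarity over the items both users rated.
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(user):
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine(user, other)
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get)

print(recommend("alice"))  # film4
```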
MLlib
Basic statistics.
Linear and logistic regression.
SVM.
k-means.
SVD.
PCA.
SGD.
L-BFGS.
Has a Python interface (NumPy).

Spark Streaming
Can take data from Kafka, ZeroMQ, sockets, Twitter etc.
DStream interface: a collection of small RDDs, each
gathered over a fixed time range.

Data Types

Parquet
Columnar format optimized for storing complicated
structures and for effective compression. Used by
Spark and Impala.
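Why a columnar format like Parquet (or ORC) compresses well can be shown without the format itself: storing each column contiguously groups similar values together, so even a naive run-length encoding pays off. The table data and the helper below are invented for the example; Parquet's real encodings are more sophisticated.

```python
# Row layout interleaves values of different columns, so runs of equal
# values are rare; a column layout keeps equal values adjacent.
def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

rows = [
    {"country": "US", "clicks": 10},
    {"country": "US", "clicks": 12},
    {"country": "US", "clicks": 9},
    {"country": "DE", "clicks": 11},
]

row_layout = [v for r in rows for v in r.values()]   # interleaved
country_col = [r["country"] for r in rows]           # contiguous column

print(run_length_encode(country_col))       # [['US', 3], ['DE', 1]]
print(len(run_length_encode(row_layout)))   # 8: no runs to exploit
```

The same locality is what lets engines like Impala read only the columns a query touches instead of whole rows.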
ORC
Optimized format for Hive.

ZooKeeper
Main tool for coordination of elements in the Hadoop
infrastructure.
Hue
Web interface for Hadoop services, part of Cloudera
Manager.

Avro
Can send the schema with the data,
or can work with dynamically typed objects.
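Avro's "schema travels with the data" idea can be sketched with JSON standing in for Avro's binary container format. The record fields below are invented; the point is only that a reader needs no prior knowledge of the schema to decode the records.

```python
# Writer embeds the schema next to the records in one container;
# reader recovers the schema first, then interprets the records.
import json

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "login", "type": "string"},
    ],
}
records = [{"id": 1, "login": "alice"}, {"id": 2, "login": "bob"}]

payload = json.dumps({"schema": schema, "records": records})  # writer

container = json.loads(payload)                               # reader
field_names = [f["name"] for f in container["schema"]["fields"]]
print(field_names)              # ['id', 'login']
print(container["records"][0])  # {'id': 1, 'login': 'alice'}
```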
Managers / task planners
Flume
Service for organizing streaming data
Oozie
Task planner
Airflow
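The core idea shared by task planners like Oozie and Airflow is that tasks form a directed acyclic graph and run only after their dependencies finish. A minimal sketch using Python's standard library; the task names are invented, and a real planner adds scheduling, retries and monitoring on top.

```python
# Tasks and their dependencies form a DAG; a topological order gives
# a valid execution sequence.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (invented example pipeline)
dag = {
    "load": set(),
    "clean": {"load"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['load', 'clean', 'aggregate', 'report']
```

Airflow expresses exactly this kind of graph in Python code, while Oozie describes it in XML workflow definitions.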