
16/04/14 19:05 Big Data A to ZZ A Glossary of my Favorite Data Science Things | MapR

Page 1 of 6 http://www.mapr.com/blog/big-data-zz--glossary-my-favorite-data-science-things#.U064EyjvDPz
Big Data A to ZZ - A Glossary of my Favorite Data Science Things
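Dr. Kirk Borne (/blog/author/dr-kirk-borne)
Professor of Astrophysics and Computational Science, George Mason University
https://twitter.com/kirkdborne
March 21, 2014
Dr. Kirk Borne is a Transdisciplinary Data Scientist and an Astrophysicist. He is Professor of Astrophysics and
Computational Science in the George Mason University School of Physics, Astronomy, and Computational Sciences. He has been
at Mason since 2003, where he teaches and advises students in the graduate and undergraduate Computational Science,
Informatics, and Data Science programs. Previously, he spent nearly 20 years in positions supporting NASA projects,
including an assignment as NASA's Data Archive Project Scientist for the Hubble Space Telescope, and as Project Manager in
NASA's Space Science Data Operations Office. He has extensive experience in big data and data science, including expertise
in scientific data mining and data systems. He has published over 200 articles (research papers, conference papers, and
book chapters), and given over 200 invited talks at conferences and universities worldwide. In these roles, he focuses on
achieving big discoveries from big data through data science, and he promotes the use of information and data-centric
experiences with big data in the STEM education pipeline at all levels. He believes in data literacy for all! Learn more
about him at http://kirkborne.net/. You can follow him on Google+ here (https://plus.google.com/103116560436527485944?) and
on Twitter at @KirkDBorne (https://twitter.com/KirkDBorne), where he has been identified as one of the social network's top
big data influencers.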
Nearly 50 years ago, one of the most popular musical movies of all time was released (The Sound of Music
(http://www.imdb.com/title/tt0059742/)). Perhaps the most memorable song from that production is My Favorite Things
(http://www.youtube.com/watch?v=E3aBB-J9vhg). A remake of the production was shown on live television in December last
year, and it inspired me to think about a few of my favorite things, particularly data science things. So, I started compiling a
list, with the intention of writing an article about those favorites. But soon the list became too long for a single article, and the
idea was born to put the list into an A to Z Glossary format. At that point, the fun challenge was thinking of interesting and
useful big data science concepts that fit the glossary model. Data science is all about fitting models, so I accepted the
challenge. The following is the result of those deliberations. It is a glossary that lists a few of my favorite things about big data
and data science, from A to Z (actually, ZZ), one for each letter. There are no raindrops on roses or whiskers on kittens here,
but ponies and elephants are fair game.
Of course, this glossary represents my own preferences, and there are many other possible choices. Please feel free to add
some of your own favorite things in the comments. The descriptions provided below are very brief (this is just the first
installment); we will look more deeply into some specific ones from these glossary entries in future blog posts. So, here we go:
A - Association rule mining: unsupervised machine learning method
(http://blog.programmableweb.com/2014/02/10/swiftiq-released-innovative-data-mining-api/) for finding frequently occurring
patterns (item sets) in discrete data (numeric or categorical).
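As a rough illustration of the idea, the sketch below counts how often each item set occurs across a handful of shopping baskets and keeps the frequent ones. The basket data, the function names, and the 0.4 support threshold are all made up for this example; real implementations (Apriori, FP-Growth) prune the search far more cleverly than this brute-force count.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for item sets whose support meets min_support."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # sorted so each item set has one canonical tuple
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: support(both) / support(antecedent)."""
    both = tuple(sorted(set(antecedent) | set(consequent)))
    return itemsets[both] / itemsets[tuple(sorted(antecedent))]


# Hypothetical market-basket data, invented for this sketch.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]
itemsets = frequent_itemsets(baskets, min_support=0.4)
rule_confidence = confidence(itemsets, ["bread"], ["milk"])
```

Here the pair (bread, milk) appears in three of five baskets, and the rule "bread implies milk" holds in three of the four baskets that contain bread.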
B - Bayes belief networks (BBN): algorithm for building a network of conditional dependencies among a large number of
variables, which enables prediction, classification, and missing value imputation.
C - Characterization: methodology for generating descriptive parameters that describe the behavior and characteristics of a
data item, for use in any unsupervised learning algorithm to find clusters, patterns, and trends without the bias of
incorporating class labels.
D - Deep learning: one of the hottest new machine learning approaches in recent years, useful for finding a hierarchy of the
most significant features, characteristics, and explanatory variables in complex data sets. It is particularly useful in
unsupervised machine learning of large unlabeled datasets.
E - Ensemble learning: machine learning approach that combines the results from many different algorithms, whose
combined vote (from the ensemble) provides a more robust and accurate predictive output than any single algorithm can
muster.
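A minimal sketch of the voting idea, assuming three deliberately crude threshold classifiers that were invented for this example (the thresholds and the "big"/"small" labels are not from any real model):

```python
from collections import Counter


def make_threshold_classifier(threshold):
    """Build a toy classifier that labels a number 'big' if it exceeds threshold."""
    return lambda x: "big" if x > threshold else "small"


def majority_vote(classifiers, x):
    """Return the label chosen by the largest number of classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical ensemble members that disagree near their thresholds.
ensemble = [make_threshold_classifier(t) for t in (3, 5, 7)]
label_for_6 = majority_vote(ensemble, 6)  # two of three members vote "big"
label_for_4 = majority_vote(ensemble, 4)  # two of three members vote "small"
```

An odd number of members avoids ties in a two-class vote; practical ensembles usually weight the votes as well.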
F - Forests, random: an ensemble classifier that grows a forest of decision trees, each trained on a random sample of the data
and a random subset of the input variables, and combines their votes to yield highly accurate models. A related trick is to
randomize one input variable at a time in order to learn if this randomization actually produces a less accurate classifier.
If it doesn't, then that variable can be ousted from the model.
G - Gaussian mixture models (GMM): an unsupervised learning technique for clustering that generates a mixture of
clusters from the full data set using a Gaussian (normal) data distribution model for each cluster. The GMM's output is a set of
cluster attributes (mean, covariance, and mixing weight) for each cluster, thereby producing a set of characterization metadata
that serves as a compact descriptive model of the full data collection.
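To make the "compact descriptive model" concrete, here is a small sketch that fits a two-component mixture to one-dimensional data with the expectation-maximization (EM) algorithm, which is the usual way of fitting a GMM. The data, the choice of two components, and the initialization at the data extremes are assumptions made for this example; in practice a library implementation would be used.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_iter=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM and return its parameters."""
    # Initialization (an arbitrary choice): means at the extremes, shared variance.
    means = [min(data), max(data)]
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(min_var, var_k)  # floor keeps the density well defined
    return [
        {"weight": w, "mean": m, "variance": v}
        for w, m, v in zip(weights, means, variances)
    ]


# Hypothetical data with two well-separated groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
clusters = fit_gmm_1d(data)
```

Ten data points are summarized by six numbers (a weight, mean, and variance per cluster), which is the characterization metadata described above.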
H - ??: as a hat-tip to my MapR (http://www.mapr.com/) host here, I postpone H to the end!
I - Informatics: data science for data-intensive science. There are many examples: Bioinformatics, Geoinformatics, Climate
informatics, Environment informatics, Health and Medical informatics, Biodiversity informatics, Urban informatics,
Neuroinformatics, Cheminformatics, Astroinformatics (https://asaip.psu.edu/Articles/astroinformatics-in-a-nutshell), etc.
J - JSON and JAQL: emerging lightweight data-interchange format (JavaScript Object Notation) and query language (JSON
Query Language) for unstructured and semi-structured data, likely to surpass XML-based data languages in data-as-a-service
applications.
K - K-anything in data mining: K-Nearest Neighbors algorithm (for classification), K-Means (for clustering), K-itemsets (for
association rule mining; see A, above), K-Nearest Neighbors Data Distributions
(http://link.springer.com/chapter/10.1007/978-1-4614-3520-4_26) (for outlier detection), KD-trees (for indexing and rapid
search of high-dimensional data), and more KDD (Knowledge Discovery from Data) things.
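The first of those K-things can be sketched in a few lines. This is a brute-force K-Nearest Neighbors classifier over made-up two-dimensional points; the training set, labels, and k=3 are assumptions for illustration, and a KD-tree would replace the full sort on larger data.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label query by majority vote among its k nearest training points."""
    # train is a list of (point, label) pairs; distance is Euclidean.
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical labeled points forming two loose groups.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
near_a = knn_classify(train, (2, 2))
near_b = knn_classify(train, (7, 7))
```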
L - Locally linear embedding (LLE): a form of manifold learning (http://mdp-toolkit.sourceforge.net/examples/lle/lle.html%20)
that discovers the true topological shape of your data distribution, which might be quite warped and twisted when seen in the
coordinate space that is represented by your easily available database attributes, though in fact the data may actually lie
on a complex low-dimensional surface, or manifold (i.e., in the natural coordinates of the data domain).
M - Multiple weak classifiers: an example of ensemble learning, applied to classification problems in which you can
generate a large number of different classifiers, where none of them is particularly accurate (hence, they are weak), but
when combined they can yield a strong voting (scoring) metric to determine a data item's most likely classification. A research
paper with a very interesting title was written on this subject: Good Learners for Evil Teachers
(http://research.microsoft.com/pubs/80598/dekelsh09.pdf).
N - Novelty detection: another name for outlier detection, or anomaly detection, or interestingness discovery, which I
prefer to call Surprise Discovery: finding the novel, surprising, and unexpected data points or patterns in your data set that
lie outside the bounds of your expectations. This even applies to social networks, in which you can find interesting subgraphs
within the network.
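One very simple way to put "outside the bounds of your expectations" into code is a z-score test: flag any value that sits several standard deviations from the mean. The readings and the 2.5 threshold below are invented for this sketch, and this is only one of many possible novelty detectors (it is sensitive to the outliers it is trying to find, for one thing).

```python
from statistics import mean, pstdev


def surprising_points(values, threshold=2.5):
    """Return the values lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing is surprising
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious surprise.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
surprises = surprising_points(readings)
```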
O - One-class classifier: an efficient logistic classification technique, which is used to test if a data item belongs to a
particular class or not. This is useful in cases where there are a variety of alternative classes, but your attention is focused on
only one of many possible outcomes. This is also used in novelty detection (see above).
P - Profiling (specifically, data profiling): a collection of data exploration methods
(http://insideanalysis.com/2014/02/data-profiling-four-steps-to-knowing-your-big-data/) that enable you to find the good, bad,
and ugly parts of your data set. For example: examining the unique values for a database attribute, which is a great way to
find typos in discrete categorical data, such as US state names. (I once did this for a NASA project and discovered that there
were over 90 distinct US state names in the database.)
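The unique-values check described above takes only a few lines. The column of state names here is fabricated for the example, and the normalization step (trim whitespace, lowercase) is just one plausible first pass; it catches case and spacing variants but not genuine misspellings.

```python
from collections import Counter


def profile_column(values):
    """Summarize a categorical column: row count, distinct count, value frequencies."""
    counts = Counter(values)
    return {
        "n_rows": len(values),
        "n_distinct": len(counts),
        "counts": counts.most_common(),
    }


def normalized_distinct(values):
    """Distinct values after trimming whitespace and lowercasing."""
    return sorted({v.strip().lower() for v in values})


# Hypothetical messy column of US state names.
states = ["Virginia", "virginia", "Virgina", "VIRGINIA ", "Maryland", "Maryland"]
profile = profile_column(states)
cleaned = normalized_distinct(states)
```

Two real states show up as five distinct raw values; after normalization three remain, and the leftover "virgina" is the typo that still needs a human (or a reference list) to fix.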
Q - Quantified and Tracked: the second half of my new definition of Big Data that I am promoting: Big Data is "Everything,
Quantified and Tracked!" The quantification and measurement (tracking) of anything therefore allows data science to play a
major role in nearly every application domain
(http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation) (hence, job security
and huge job opportunities for data scientists).
R - Recommender engines: these are probably the most fun and most profitable applications of data science to big data
collections. Learn more in these two articles: "Design Patterns for Recommendation Systems - Everyone Wants a Pony"
(http://www.mapr.com/blog/design-patterns-recommendation-systems-%E2%80%93-everyone-wants-pony) and
"Personalization - It's Not Just for Hamburgers Anymore"
(http://www.mapr.com/blog/personalization-%E2%80%93-it%E2%80%99s-not-just-hamburgers-anymore).
S - SVM (Support Vector Machines): powerful Jedi machine learning classifier. Among classification algorithms used in
supervised machine learning, SVM often produces some of the most accurate classifications. Read more about SVM in this article:
"The Importance of Location in Real Estate, Weather, and Machine Learning"
(http://www.mapr.com/blog/importance-location-real-estate-weather-and-machine-learning).
T - Tree indexing schemes: tree-based data structures, brilliantly implemented in the super-fast machine learning
algorithms delivered by SkyTree (http://www.skytree.net/) Corporation, developed at Georgia Tech's FastLab
(http://www.fast-lab.org/).
U - Unsupervised exploration: the purest form of data mining, exploring unlabeled datasets with unsupervised machine
learning algorithms (e.g., Clustering, Association Mining, Link Analysis, PCA, Outlier Detection). One researcher expressed it
this way (http://arxiv.org/abs/0905.1682): "unsupervised exploratory analysis plays an important role in the study of large,
high-dimensional datasets that arise in a variety of applications."
V - Visual analytics: exploratory and explanatory visualizations of large complex datasets. Visual storytelling is a critically
important analytics component of a data scientist's duties: to explore and explain discoveries in big data collections visually,
because a picture is worth a thousand words (i.e., a picture is worth 4 kilobytes).
W - WEKA: free data mining package (http://www.cs.waikato.ac.nz/ml/weka/), for data exploration, profiling, mining, and
visual analytics, containing hundreds of machine learning algorithms, techniques, and methods.
X - XML, specifically PMML: Predictive Model Markup Language
(http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language), which is an XML language for describing and sharing
(machine-to-machine) predictive models learned within a data mining process (such as Data Mining-as-a-Service, or Decision
Science-as-a-Service (http://syntasa.com/)).
Y - YarcData: an important new vendor in the field of big data science (http://www.yarcdata.com/Products/), specifically
developing high-performance computing architectures for linked datasets. Their Urika product is an in-memory graph
database, which can hold up to 0.5 petabytes (= 500 terabytes) of graph data in memory!
ZZ - Zero bias, Zero variance: two of the most common myths in big data analyses
(http://www.statisticsviews.com/details/feature/4911381/Statistical-Truisms-in-the-Age-of-Big-Data.html). These myths suggest
that the sample bias and/or the variance in various parameter values should go to zero as the size of the data set gets larger
and larger. This is simply not true. The sample bias and variance are features of your data collection process, no matter how
much data you collect.
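A small simulation can illustrate the bias half of this point. It assumes a hypothetical instrument that silently drops every reading below 20 while the true values are uniform between 0 and 100 (true mean 50); the instrument, the cutoff, and the sample sizes are all invented for the sketch.

```python
import random
from statistics import mean

TRUE_MEAN = 50.0  # mean of the uniform(0, 100) population being measured


def collect(n, rng, floor=0.0):
    """Record n readings; the (hypothetical) instrument drops values below floor."""
    readings = []
    while len(readings) < n:
        x = rng.uniform(0, 100)
        if x >= floor:
            readings.append(x)
    return readings


rng = random.Random(42)  # fixed seed so the run is repeatable
errors = {
    n: mean(collect(n, rng, floor=20.0)) - TRUE_MEAN
    for n in (1_000, 100_000)
}
```

The estimated mean stays roughly 10 units too high at both sample sizes: collecting a hundred times more data through the same flawed process does not shrink the bias.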
H - Hadoop (http://www.mapr.com/products/apache-hadoop) (of course! Did you think that I forgot about the H?): Hadoop is
the de facto big data computing paradigm. It enables distributed processing of large data sets across
clusters of commodity servers. The Hadoop ecosystem now includes the compute engine, scripting language, file server,
database, analytics tools, query language, and workflow manager. To learn more, check out the Executive's Guide to Big Data
and Apache Hadoop from MapR, which you can download for free here
(http://www.mapr.com/The-Executives-Guide-to-Big-Data-and-Apache-Hadoop).
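For readers new to the paradigm, the MapReduce programming model at the heart of Hadoop's compute engine can be mimicked in a single Python process. This is only a sketch of the map, shuffle, and reduce steps using the classic word-count example; it does not use any Hadoop API, and the function names are made up for illustration. On a real cluster each phase runs in parallel across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big elephants"])
```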
Come back next week to see which of my favorite big data and data science things receive more attention and deeper coverage.
By the way, in case you missed the pony and elephant that were mentioned in the opening paragraph, look again at R
(http://www.mapr.com/blog/design-patterns-recommendation-systems-%E2%80%93-everyone-wants-pony) and H
(https://www.google.com/search?q=hadoop+elephant&tbm=isch).