
Lico, Argie B.

BSCS-1A

Machine Learning: It is all about data

Our world revolves around the sun in about 365 days, which equals a year,
and rotates on its axis in about 24 hours, which equals a day. Within these
measures of time, we rarely notice how many bytes of information are sent
and spread throughout the world, nor how large the data we hold and use
has become. Recent technological innovations have brought us into another
world of speed and abundance. From these huge amounts of data, it takes a
long time, much processing, and great effort to extract patterns and gain
vital information for solving real-life problems. The whole world needs
analysis and ideal approaches to manage it. Furthermore, with the
ever-increasing amounts of data becoming available, there is good reason to
believe that smart data analysis will become even more pervasive as a
necessary ingredient for technological progress (Smola & Vishwanathan, 2008).

In dealing with various types and amounts of data, we need an ideal
approach to address and manage them, and here Machine Learning comes in.
Machine learning has drawn its inspiration from a variety of academic
disciplines, including computer science, statistics, biology, and psychology
(Muhammad and Yan, 2015). It has become one of the fastest-growing areas of
computer science, with far-reaching applications, and it refers to the
automated detection of meaningful patterns in data (Osisanwo, Akinsola,
Awodele, Hinmikaiye, Olakanmi, and Akinjobi, 2017). According to the Limited
Edition IBM Machine Learning for Dummies book authored by Judith Hurwitz and
Daniel Kirsch (2018), Machine Learning is a form of Artificial Intelligence
(AI) that enables a system to learn from data rather than through explicit
programming. They add that it uses a variety of algorithms that iteratively
learn from data to improve, describe data, and predict outcomes. As the
algorithms ingest training data, they can produce more accurate models of
the data itself. AI, meanwhile, is broader than Machine Learning, which is
one type of it: AI is the study of the computations that make it possible to
perceive, reason, and act (Winston, 1992).

Machine Learning is not new; it has existed for decades. In the era of
building machines intelligent enough to rival humans, back in the 1950s, an
IBM researcher named Arthur Lee Samuel developed a self-learning program for
playing the game of checkers. It is considered one of the earliest machine
learning programs. The checkers program played 10,000 games against itself
and worked out which board positions were good and bad depending on wins and
losses. The program gradually learned how to defeat itself by using the data
on winning and losing moves as its foundation. Samuel then coined the term
machine learning (Hurwitz and Kirsch, 2018) and defined it as a field of
study that gives computers the ability to learn without being explicitly
programmed (Al Musawi, 2018). In 1959, he published a paper in the IBM
Journal of Research and Development explaining his approach to the concept
of machine learning.

Machine Learning Styles

Solving real-life problems by instructing computers to use data or past
experiences is one of the core objectives of Machine Learning. The field is
organized into a taxonomy composed of different types and approaches for
dealing with various problems in data handling and management. In this
section, only supervised, unsupervised, and semi-supervised learning will be
discussed.

Supervised Learning

Supervised learning is much like teaching your computer how to do something,
then letting it use its newfound knowledge to do it. Supervised Machine
Learning typically begins with an established set of data and a certain
understanding of how the data is classified, and it is intended to find
patterns that can be applied to an analytics process (Hurwitz and Kirsch,
2018). In any dataset used by machine learning algorithms, every instance is
represented using the same set of features, which may be continuous,
categorical, or binary. If the instances are given known labels (the
corresponding correct outputs), then the learning is called supervised
(Kotsiantis, 2007). A supervised learning algorithm generates a function
that maps inputs to desired outputs. One standard formulation of the
supervised learning task is the classification problem: the learner is
required to learn (to approximate the behavior of) a function which maps a
vector into one of several classes by looking at several input-output
examples of the function (Ayodele, 2010).

Supervised learning is fairly common in classification problems because the
goal is often to get the computer to learn a classification system that we
have created (Osisanwo et al., 2017). According to the study conducted by
Osisanwo et al. (2017), several supervised machine learning algorithms deal
primarily with classification, including the following (some are defined
below):

Linear Classifiers, Logistic Regression, Naïve Bayes Classifier, Perceptron,
Support Vector Machine; Quadratic Classifiers, K-Means Clustering, Boosting,
Decision Tree, Random Forest (RF); Neural Networks, Bayesian Networks, and
so on.

A Linear Classifier is often used in situations where the speed of
classification is an issue, since it is rated the fastest classifier. The
goal of classification with linear classifiers in machine learning is to
group items that have similar feature values into groups (Osisanwo et al.,
2017).
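A minimal sketch of one such linear classifier, the perceptron, trained on a
hypothetical two-feature dataset (the data, learning rate, and epoch count
are illustrative, not from the cited study):

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) predicts the label.

    `samples` is a list of (features, label) pairs with labels +1 or -1.
    """
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Update only on misclassified points: the perceptron rule.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Linearly separable toy data: label +1 when the second feature exceeds the first.
data = [((0, 1), 1), ((1, 2), 1), ((1, 0), -1), ((2, 1), -1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # -> [1, 1, -1, -1]
```

The learned weights define a separating line, which is what makes this
family of classifiers fast: prediction is a single dot product.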

Support Vector Machines (SVMs) are a set of supervised learning methods used
for classification, regression, and outlier detection. One benefit of an SVM
is that it uses only a subset of the training points in its decision
function (called support vectors), so it is also memory-efficient.

Practical Application

“Pattern recognition aims to study the differences of the metabolite
expression profiles acquired under different physiological conditions. There
are two main categories in pattern recognition: supervised and unsupervised
learning. The former one uses the labeled training data to train a model or
classifier for the regression or classification tasks, while the latter one
groups data without prior knowledge.

Many supervised learning algorithms have been developed, such as linear
discriminant classifier, decision tree, nearest-neighbor algorithm,
artificial neural networks, and support vector machine (SVM). The SVM
classifier has been demonstrated as the most popular and effective
classifier in various data classification tasks. The main idea of SVM is to
transform the original input space to a high-dimensional feature space by
using a kernel function and then achieve optimum classification in this new
feature space by choosing the hyperplane that maximizes the margin from the
transformed features.

The most frequently used pattern recognition method in analysis of GC–MS
data is the unsupervised learning. To reduce dimensions of the multivariate,
multi-channeled data to a manageable subset, principal component analysis
and partial least squares can be applied to improve the efficiency of
clustering. Several clustering methods can be used to group the data into
meaningful clusters, for example, k-means clustering, agglomerative
hierarchical clustering, spectral clustering, fuzzy c-means clustering
(Bezdek, 1981), and density-based spatial clustering of applications with
noise (Ester, Kriegel, Sander, & Xu, 1996). To measure the clustering
performance, the clustering accuracy can be represented as the number of
correctly grouped samples divided by the number of all samples.”

Chapter Sixteen - Analysis of Metabolomic Profiling Data Acquired on GC–MS
Imhoi Koo, Xiaoli Wei, Xiang Zhang
Department of Chemistry, University of Louisville, Louisville, Kentucky, USA
Available online 9 June 2014.
https://doi.org/10.1016/B978-0-12-801329-8.00016-7

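The clustering accuracy defined at the end of the excerpt (correctly grouped
samples divided by all samples) can be sketched as a short function; the
sample labels below are illustrative, and the sketch assumes cluster labels
have already been matched to the true classes:

```python
def clustering_accuracy(predicted, truth):
    """Accuracy = number of correctly grouped samples / number of all samples."""
    if len(predicted) != len(truth):
        raise ValueError("label lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)

predicted = ["a", "a", "b", "b", "b"]
truth     = ["a", "a", "a", "b", "b"]
print(clustering_accuracy(predicted, truth))  # -> 0.8
```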
Unsupervised Learning

Unsupervised learning is more about identifying hidden patterns in
unlabelled input data. Unsupervised Machine Learning refers to the ability
to learn and organize information without an error signal to evaluate the
potential solution (Sathya & Abraham, 2013). When a problem involves a
massive amount of unlabeled data, this style of learning is best suited.
Unsupervised learning seems much harder than supervised learning because the
goal is to have the computer learn how to do something that we do not tell
it how to do (Ayodele, 2010) and to use this to determine structure and
patterns in data (Al Musawi, 2018). It takes time, training, and a long
process for the computer to learn new, vital knowledge. Unsupervised
learning is considered important because it is likely to be much more common
in the brain than supervised learning (Dayan).

Unsupervised learning algorithms can help businesses understand large
volumes of new, unlabeled data by segmenting it into groups of examples
(clusters) or groups of features (Hurwitz and Kirsch, 2018). Clustering is
the organization of unlabeled data into similarity groups called clusters. A
cluster is a collection of data items which are “similar” between them and
“dissimilar” to data items in other clusters (Harari, Ullman, Poggio, Zysman
& Seibert, 2014), and which ideally share some common characteristics
(Donalek, 2011). Clustering has been widely used in different areas:

A. Internet

Google News, for example, groups news stories into cohesive clusters.

B. Social Media Applications

Applications such as Twitter, Instagram, Snapchat, and so on use clustering
because they hold large amounts of unlabeled data. Unsupervised learning is
also used in email spam-detecting technology: there are far too many
variables in legitimate and spam emails for an analyst to flag unsolicited
bulk email, so machine learning classifiers based on clustering and
association are applied to identify unwanted email.
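A minimal sketch of k-means, the clustering algorithm most often named in
this section, on hypothetical one-dimensional data (the points, initial
centers, and iteration count are illustrative):

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points by repeatedly assigning each point to its
    nearest center, then moving each center to its cluster's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(sorted(centers))  # -> [1.0, 10.0]
```

No labels are provided anywhere: the two groups emerge purely from the
similarity (here, closeness on the number line) of the data items, which is
exactly the clustering idea described above.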


The goal in unsupervised learning is generally to cluster the data into characteristically different
groups. Unsupervised machine learning is more challenging than supervised learning due to the
absence of labels.

The same data can be clustered into different groups depending upon how the
clustering is done. In a well-known example, 16 animals represented using 13
Boolean features (appearance- and activity-based) can be clustered in two
ways, depending on whether the appearance-based or the activity-based
features are given more weight.

(Image reference: Pampalk et al., 2013)

The first partitioning clusters them into mammals and birds, while the other
clusters them into predators and prey. Both are equally meaningful, and
therefore it is up to the scientist to choose a representation that obtains
the desired clustering, which is obviously a challenging task.

Some widely known applications of unsupervised learning are market
segmentation for targeting appropriate customers, anomaly and fraud
detection in the banking sector, image segmentation, gene clustering for
grouping genes with similar expression levels, deriving climate indices from
clustering of earth science data, document clustering based on content, and
so on.

An application I recently came across is related to ecology. Clustering
techniques are used to group audio recordings captured through microphones
placed at selected spots in a region of interest. These recordings are then
analyzed using unsupervised learning techniques to gauge the biodiversity,
say the number of species of birds and animals, in that region.

Semi-supervised Learning

Semi-supervised learning stands between supervised and unsupervised
learning. The name “semi-supervised learning (SSL)” comes from the fact that
the data used falls between supervised and unsupervised learning (Zhu,
2005). It dates back to the 1960s, to the pioneering works of Scudder
(1965), Fralick (1967), and Agrawala (1970) (Cholaquidis, 2018). SSL exists
because labeled data are expensive and scarce while unlabeled data are
abundant and cheap (Rai, 2011), so to learn better models, we need to devise
ways to properly utilize both the labeled and the unlabeled data. Obtaining
labeled data is difficult and time-consuming, as it requires the effort of
experienced human annotators; unlabeled data, in contrast, may be relatively
easy to collect, but there have been few ways to use them (Zhu, 2005).
Semi-supervised learning was created to address this problem: it uses a
large amount of unlabeled data, together with the labeled data, to build
better classifiers, requiring less human effort and giving higher accuracy
(Zhu, 2005).
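One simple SSL scheme, self-training, can be sketched as follows. The
classifier (1-nearest-neighbor on one-dimensional values), the confidence
rule, and the data are illustrative assumptions, not a method from the cited
sources:

```python
def self_train(labeled, unlabeled):
    """Self-training sketch: repeatedly pseudo-label the unlabeled point
    closest to any labeled one (the most 'confident' prediction of a
    1-nearest-neighbor classifier), then treat it as labeled data."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        def confidence(u):
            # Distance from u to its nearest labeled point (smaller = surer).
            return min(abs(v - u) for v, _ in labeled)
        u = min(unlabeled, key=confidence)
        _, nearest_label = min(labeled, key=lambda pair: abs(pair[0] - u))
        labeled.append((u, nearest_label))
        unlabeled.remove(u)
    return labeled

# Two labeled seeds plus unlabeled points; labels spread outward from the seeds.
seeds = [(0.0, "low"), (10.0, "high")]
pseudo = self_train(seeds, [1.0, 2.0, 8.5, 9.0])
print(sorted(pseudo))  # nearby points inherit the nearest seed's label
```

Only two points needed human labels; the other four were labeled by the
model itself, which is the appeal of SSL when annotation is expensive.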

“One real-world application for semi-supervised learning is webpage
classification. Say you want to classify any given webpage into one of
several categories (like "Educational", "Shopping", "Forum", etc.). This is
a case where it is expensive to go through tens of thousands of webpages and
have humans annotate them (imagine how boring and strenuous it would be).
However, in terms of availability, webpages are abundant: simply write a
Python/Java/etc. crawler, and you can collect millions of pages in a few
hours.

Such a situation is ideal for semi-supervised learning: annotation is
expensive, but the data is cheap and highly available.

For a really good algorithm that fits the above example, have a look at the
Co-training Algorithm; its wiki page is really concise and provides a good
overview. It basically uses two weak learners on a small amount of
hand-labeled data, and then each classifier participates in training-data
expansion by labeling some of the unlabeled set.”
