Sean Jordan

References
Béjar, J. (2014, September). Unsupervised machine learning and data mining. Retrieved from
http://www.cs.upc.edu/~bejar/amlt/material/AMLTTransBook.pdf
This compilation of course slides from the Department of Computer Science at the
Polytechnic University of Catalonia, Spain, gives an overview of unsupervised
machine learning. Unsupervised learning, in contrast to supervised learning, does not
seek to predict a set of outputs from a “learned” set of inputs. Instead, it simply seeks
to find patterns and relationships within a pre-existing dataset. This involves using
tools like standard deviation, arithmetic mean, and other statistical techniques to
analyze datasets and enable humans to draw conclusions from them very easily. The
course describes how data clustering can be used to group datasets into different sets
to enable easy and fast processing and analysis by both machines and humans. Finally,
the course touches on how to "mine," or dig through, processed data in order to form
detailed visuals such as graphs, tables, and charts.
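
To make the clustering idea from the slides concrete, the sketch below groups an unlabeled set of points into two clusters with scikit-learn and prints a few summary statistics; the synthetic data and cluster count are illustrative assumptions, not material from the course.

```python
# Minimal unsupervised clustering sketch, assuming scikit-learn is installed.
# The dataset and number of clusters are hypothetical, not from the slides.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of 2-D points standing in for an unlabeled dataset.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.labels_[:10])                    # cluster assignment for the first ten points
print(model.cluster_centers_)                # learned cluster centers
print(data.mean(axis=0), data.std(axis=0))   # simple summary statistics
```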

This source is useful because it goes over unsupervised machine learning techniques,
which are the main algorithms used by Galileo when it is analyzing categorical
datasets. It also describes quantitative data clustering, which is useful when comparing
it to the density-based categorical data clustering that Galileo uses when fitting a
model to the dataset. Additionally, the course goes over visualization techniques and
data mining. Socrates, the program that contains Galileo, has a visualization component
that creates graphs and color-coded tables detailing what it learned from a large body of
raw data. The information provided in this source can be applied to improving the
efficiency of the Socrates and Galileo algorithms. This is a credible source because it
was published by a leading technical institute in Spain that deals extensively with
machine learning.
Brownlee, J. (2016, June 10). Your first machine learning project in python step-by-step.
Retrieved from
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
This website provides a detailed and step-by-step introduction to what machine
learning actually is, and how to use it with the Python programming language. The
tutorial starts by instructing readers how to download and install Python and set up a
development environment. Once that is finished, the author
talks about why the iris data set is the “Hello World” of the machine learning world. It
is a very well understood dataset, and it is of the perfect size/attributes for a machine
learning newcomer. According to Jason Brownlee, it is a great supervised learning
practice problem because it can easily be examined and graphed. It has only
quantitative data, not categorical, and offers only four characteristics to be tested (with
150 samples in total). Next, the tutorial goes over installing the various add-ons and
required libraries that help the program work as it should without throwing errors.
Next, the tutorial, in chronological order, goes over importing the data set into the
Python environment, importing the required parts of the libraries, and
summarizing the data set using statistical methods. The second part of the tutorial goes
over creating visual graphs such as box and whisker plots, histograms, and regression
scatter plots. The third and final part of the walkthrough shows how to obtain results and
fine-tune the algorithms to achieve a higher true positive rate.

This tutorial is useful because it provides a walkthrough for testing an algorithm
against a data set. The research done in the large scale analytics group utilizes this
method very frequently with the SOCRATES program. This tutorial serves as a
valuable resource whenever the machine learning algorithms do not go as planned, and
something needs to be checked against the tutorial’s methods. This is a credible source
because the author is solely dedicated to machine learning and has multiple degrees in
artificial intelligence.
De Fauw, J. (2015, August 10). Detecting diabetic retinopathy in eye images. Retrieved from
http://blog.kaggle.com/2015/08/10/detecting-diabetic-retinopathy-in-eye-images/
This article, published on the official Kaggle blog, discusses how machine learning
can be used to analyze images as well as datasets related to analyzing patterns seen in
daily life (such as the number of people in a store). In 2015, there was a Kaggle
competition that awarded $100,000 to the person or team that could analyze and detect
diabetic retinopathy in images of eyes with the greatest accuracy. This article was
written by the competitor who came in 5th place globally. The author starts off by
going over an introduction of what diabetic retinopathy is, and why it is important
enough to host a competition of this magnitude. He then goes into his actual
methodology, which involved writing program code that combined a variety of
complex algorithms. He describes the hardest part as "normalizing" the images, which
essentially means making them all look the same so that differences between them are
easier to analyze. Next, he delves into the scoring system for the competition, the
quadratic weighted kappa metric, which evaluates the agreement between
a human rating and the computer's rating. Finally, he discusses his results, and what
he learned from the experience (as well as what the audience should take away from
the article).
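
As a point of reference, the quadratic weighted kappa metric mentioned above can be computed with scikit-learn as sketched below; the ratings here are made up rather than taken from the competition data.

```python
# Quadratic weighted kappa sketch with scikit-learn; labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

human_grades   = [0, 1, 2, 2, 4, 3, 0, 1]   # hypothetical human ratings (0-4 severity)
machine_grades = [0, 1, 2, 3, 4, 2, 0, 0]   # hypothetical model predictions

print(cohen_kappa_score(human_grades, machine_grades, weights="quadratic"))
```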

This blog post is useful because it describes the methodology behind analyzing images
using machine learning techniques, which will be the main topic of interest at the
internship for the latter part of the year. The article also has a link to the source code
that the author used to develop a 5th place algorithm (among 661 competitors). This
source code is crucial to understanding the ins and outs of machine learning with
images.
Ferguson, M. (2016, May 17). What is graph analytics? Retrieved from IBM Big Data and
Analytics Hub website: http://www.ibmbigdatahub.com/blog/what-graph-analytics
This web page, published by the Big Data Analytics Sector of IBM, provides an
intermediate overview on what graph analytics is and why it is important in today’s
society. The article starts by discussing why graph analytics is useful. It can help
detect financial crimes, resolve grid issues, and help prevent terrorism (i.e., hacking),
among other things. It then goes over different types of graph analysis that a
programmer might find when working with algorithms. Path analysis determines the
shortest distance between two nodes on a graph. Connectivity analysis can determine
weaknesses in already existing graph networks. Community analysis helps find major
groups of interacting nodes on a network in order to sort them. Finally, centrality
analysis finds the most influential node on the network. Next, the article delves into
what nodes and edges actually are. Nodes are any thing in the world, whether it be a
data point, a person, or something else. Edges are what connect these nodes to each other; they are
the relationships that bring the shape of the network into perspective. Finally, the
article talks about how MySQL can be used in order to aid the process of analyzing
the graphs. This is a credible source because it was published by IBM, which is a
leading institution in computer science research.
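
The analyses named above can be sketched in a few lines with the networkx package (a library assumption on my part; the article does not prescribe one), using a hypothetical handful of nodes and edges.

```python
# Small graph-analytics sketch using networkx; nodes and edges are hypothetical.
import networkx as nx

g = nx.Graph()
g.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")])

print(nx.shortest_path(g, "A", "C"))        # path analysis: shortest route between two nodes
print(nx.degree_centrality(g))              # centrality analysis: most connected nodes
print(list(nx.connected_components(g)))     # connectivity/community-style grouping
```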

This article is useful because it provides an overview of what exactly graph analytics
is, what it can be used for, and how to use it. The SOCRATES scalable analytics
program will mainly use the community analysis (determining relationships between
different nodes and sorting them based on their similarities). The information on nodes
and edges is helpful in visualizing what a network may look like if it were a physical
object, and then creating program code based off of that image. The Large Scale
Analytics Group (QAS) at JHU/APL uses MySQL in order to back up and store their
data, so the information regarding it in the article is extremely helpful.
Gehr, E. (2017, December 6). [Personal interview by the author].
This interview, conducted at the JHU/APL Asymmetric Operations Holiday Party on
December 6th, 2017, shed light on the inner workings of the QAS group. The
interviewee was Mrs. Elaine Gehr, who is a Systems Engineer and QAS Project
Manager at the Applied Physics Lab. Her job is to oversee all of the work in the QAS
(Large Scale Analytics Group). As the project manager, she gave some valuable
insight into what SOCRATES (the program that houses Galileo) does and where it is
going. The whole point of SOCRATES is to conduct unsupervised and supervised
machine learning on multiple, very large datasets that exist in the world. Galileo
specifically handles categorical machine learning while other algorithms handle
quantitative and mixed aspects of words and numbers in datasets. It is designed to
ingest data, analyze it, and produce a report very quickly. In any given data scenario,
a company needs a backend, an analytics layer in the middle, and a visualization frontend.
Mrs. Gehr described the need for a "Socrates Sandwich," where Socrates is the analytics
layer and the company provides the frontend and backend that tie Socrates to the outside
world. Another aspect of Socrates that she discussed is its proposed future. It will be
designed so that there are three different levels of security clearance when it is
marketed to government institutions. Things like statistics on the US population will
be level 1 (lowest), while locations of enemy military bases will be level 3 (highest).
This has no effect on the performance of the software itself.

This interview was useful because it gave a broader picture as to what Socrates does
and what its future plans are as a program. The past weeks and months have been
filled with focus on Galileo and categorical data analysis, but the interview allowed a
step back to be taken and the whole realm of the QAS and their project to be observed.
The fact that the interviewee was the project manager was especially helpful because of
her broad knowledge of the topic and the structure of her responses.
Géron, A. (2017). The fundamentals of machine learning. In Hands-on machine learning with
Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems
(pp. 3-228). Sebastopol, CA: O'Reilly.
This book provides a very in-depth overview of machine learning using Scikit-Learn
and TensorFlow, which are two of the world’s most recognized tools used for machine
learning algorithms. The book starts off with an introduction to machine learning
altogether. It goes over what machine learning actually is, the difference between
unsupervised and supervised learning, and testing and validating train/test data sets to
improve an algorithm. In the next section, it goes over what a Receiver Operator
Characteristic (ROC) curve is, and how it can be applied to testing and understanding
the multitude of different algorithms used to sort labeled data. The last part of
Scikit-Learn that the book covers is dimensionality reduction, which uses Principal
Component Analysis (PCA) to reduce the number of dimensions in a dataset while
preserving most of its structure, making it easier to fit and visualize a model.
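
A minimal sketch of that dimensionality reduction step is shown below, assuming scikit-learn; the random array simply stands in for a real dataset.

```python
# PCA sketch: project a 10-feature dataset down to three components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))            # 200 samples with 10 features

pca = PCA(n_components=3).fit(X)          # keep the three strongest directions
X_reduced = pca.transform(X)

print(X_reduced.shape)                    # (200, 3)
print(pca.explained_variance_ratio_)      # share of variance kept per component
```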

This book is useful because it provides detailed explanations and examples on how to
use Scikit-Learn, which is the main machine learning library used in the SOCRATES and GALILEO
platforms. A better understanding of these complex algorithms (which this book
provides) will open the door to analyzing images using similar algorithms, which is
supposed to be the culmination of this internship. The second half of the book covers
TensorFlow, which is primarily used for neural networks (e.g., face
recognition). This will likely come in handy later, but it is not needed as of now.
This is a reliable source because the author is a machine learning consultant and
previously led YouTube's video classification work, which involved identifying extremist
or inappropriate content.
Gilat, D. (1972). Convergence in distribution, convergence in probability, and almost sure
convergence of discrete martingales [PDF]. The Annals of Mathematical Statistics, 43(4),
1374-1379. Retrieved from
https://projecteuclid.org/download/pdf_1/euclid.aoms/1177692494
This article, published in the ​Annals of Mathematical Statistics​ journal, gives a brief
but broad overview of high level calculus and statistical concepts. It combines
convergence, commonly taught in calculus II, and probability analysis, commonly
taught in upper-level statistics. One of the main points of the article is that finding
patterns with big datasets of probability mappings is difficult, and there are many
different possibilities of outcomes. One key distinction is convergence in distribution
without convergence in probability, and another is convergence in probability without
almost sure convergence. Convergence occurs when a sequence of random variables (or
of their distributions) settles toward a limiting value or distribution rather than wandering
indefinitely. This analysis is based, in part, on martingales, which are sequences of
random variables in which the known prior values are used to predict the next value,
with the expected value of the next term equal to the current one.
This analysis of martingales is similar to supervised machine learning, but is instead
fit for probability densities, not specifically quantitative or categorical data.
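
As a point of reference, the standard martingale condition and the usual hierarchy of convergence modes (general textbook statements, not formulas taken from the article) can be written as:

```latex
% Martingale property: the conditional expectation of the next term equals the current one.
\mathbb{E}\left[X_{n+1} \mid X_1, \dots, X_n\right] = X_n
% Usual hierarchy of convergence modes for random variables:
X_n \xrightarrow{\text{a.s.}} X \;\Rightarrow\; X_n \xrightarrow{P} X \;\Rightarrow\; X_n \xrightarrow{d} X
```
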
This journal article is useful because it provides information on convergence as it
relates to probability distributions in statistics. Galileo relies heavily on probability
distributions that are constructed and then modified throughout the cluster dropping
process. Since the algorithm aims to find hidden patterns within the dataset and
probability distributions, it uses many of the convergence methods outlined above to
determine how likely a given attribute or point is to belong to a given cluster. This
article is credible because it was published by a renowned professor at Columbia
University.
Groves, B. (n.d.). Causal-comparative research. Retrieved from University of Arizona College
of Agriculture and Life Sciences website:
https://cals.arizona.edu/classes/aed615/documents/causal_comparative_research.ppt
This informative slideshow, published by the University of Arizona College of
Agriculture and Life Science, goes over what this type of research actually is. It is
defined as determining the causes of differences that are pre-existing in datasets. The
simple question asked by the author is what the relationship is between an independent
and dependent variable, and why these variables cause changes to occur in the
experiment. The next part of the slides is about similarities to and differences from
correlational research. While both types seek to discover relationships between
different variables or datasets, causal comparative research focuses on comparing two
or more groups of subjects that have categorical attributes. Finally, the slides preview
some examples of research methods and explain why they are or are not good
examples of causal-comparative research.

This slideshow is useful because it goes over causal-comparative research, which is
the type of experimental research that will be performed when analyzing the Galileo
algorithms for efficiency and effectiveness. The slides give examples denoting how to
perform this type of research so that correlational research and bad examples of causal
comparative research can be avoided through the rest of the year. This source is
credible because it was published by a reputable professor from the University of
Arizona, a leading educational institution in the United States.
He, Z., Xu, X., & Deng, S. (2007). Attribute value weighting in k-modes clustering. Retrieved
from Harbin Institute of Technology - Department of Computer Science and Engineering
website: https://arxiv.org/pdf/cs/0701013.pdf
This document, published by researchers at the Harbin Institute of Technology in
Weihai, China, discusses how clusters and attributes can be weighted in a dataset
model by using the k-modes algorithm. It starts off by describing how the k-modes
algorithm is better than other similarly used algorithms in trying to mine categorical
data. Instead of using means, as other programs do, k-modes uses modes and generates
a heatmap for the different attributes taken from the dataset. This heatmap is then used
to calculate the different probabilities of each value, attribute, and cluster being chosen
amongst their similar types in the dataset model. After proving the credibility of the
mathematical formulas that the algorithm uses, the authors talk about how they
weighted the dataset’s probabilities in order to create a better representation of the
model in the eyes of a human. This is based on the value of k, which is supplied by
the user or function caller.
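
A greatly simplified sketch of the k-modes assignment step is shown below (matching by Hamming distance to cluster modes); the real algorithm also updates the modes iteratively and, in the paper's variant, weights attribute values, and the records and modes here are hypothetical.

```python
# Simplified k-modes assignment sketch; data and cluster modes are made up.
def hamming(record, mode):
    """Count how many categorical attributes differ from the cluster mode."""
    return sum(a != b for a, b in zip(record, mode))

records = [("red", "small", "round"), ("blue", "large", "square"),
           ("red", "large", "round")]
modes = [("red", "small", "round"), ("blue", "large", "square")]  # k = 2 cluster modes

for rec in records:
    distances = [hamming(rec, m) for m in modes]
    print(rec, "-> cluster", distances.index(min(distances)))
```
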
This article is useful because it describes an algorithm that closely resembles Galileo,
with the exception that it lacks Galileo's entropy-based density metrics.
A good portion of Galileo is dedicated to creating a stable weighting system that
models the dataset efficiently and accurately clusters the data into well-defined
groups. It gave insight into the behind-the-scenes work of where Galileo originated
from and why it works. This is a credible source because it was published by the
Harbin Institute of Technology, which is part of China’s elite C9 league, an alliance of
the top nine universities in all of China.
Jain, A. (2016, January 3). 12 useful Pandas techniques in Python for data manipulation [Blog].
Retrieved from
https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manip
ulation/
This blog post, published by Analytics Vidhya, discusses how the pandas library can
be used to easily manipulate all kinds of datasets in the Python programming
language. It is essentially a pandas tutorial with side notes on scikit-learn, a
machine learning toolkit that has strong support for the pandas API. Through the
tutorial, the author covers certain techniques that are lesser known in the pandas
universe, including boolean indexing, advanced indexing, pivot tables, crosstabs, and
plotting the datasets on graphs such as histograms. Each of these sections covers how
to perform the specified operation, provides examples of it in action, and includes a side
note on how it can be used in conjunction with scikit-learn in order to produce a better
structured and more accessible dataset.
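
Two of the techniques named above, boolean indexing and a crosstab, are sketched below; the toy DataFrame is an assumption, not the tutorial's loan dataset.

```python
# Pandas sketch: boolean indexing and a crosstab on a made-up DataFrame.
import pandas as pd

df = pd.DataFrame({
    "gender":   ["M", "F", "F", "M", "F"],
    "approved": ["Y", "N", "Y", "Y", "N"],
    "income":   [42000, 55000, 61000, 38000, 47000],
})

high_income = df[df["income"] > 45000]            # boolean indexing
print(high_income)

print(pd.crosstab(df["gender"], df["approved"]))  # crosstab of two categorical columns
```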

This website is useful because it outlines the more advanced methods seen in the
pandas library. Many of these functions are used in the program code for Galileo and
Socrates and will prove useful for maximizing the efficiency and readability of the
code when creating new data analysis algorithms to add on to the programs. This is a
credible source because it is a journal/magazine website that focuses solely on data
analysis, which means that its writers are most likely well versed in the subject. The author
of this article graduated with an MS in Computer Science from Columbia University in
December of 2017.
Kumar, R., & Indrayan, A. (2011). Receiver operating characteristic (ROC) curve for medical
researchers. Indian Pediatrics - Perspective, 48, 277-287. Retrieved from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.455.207&rep=rep1&type=pdf
This journal article goes over how the ROC curve can be used in the medical field.
First off, the article delves into why receiver operating characteristics (ROCs) are
used, especially in the health industry. Medical diagnostic tests return one of two
results: a positive or negative outcome. These two results can be subdivided into true
and false counterparts detailing whether the diagnosis was correct or incorrect. For
example, a true positive would mean a patient does have the disease and is identified
correctly by diagnostics. However, a false positive means that the patient does not
have the disease but is diagnosed as having it by the program/physician. The false
positive and false negative varieties of diagnosis are especially harmful to patients
because they lead to more time and money spent down the line than if the patients were
properly diagnosed. If one were to graph the ROC curve, it would look similar to a
logarithmic curve: the closer it is to the upper left corner of the grid, the better the
machine is at diagnosing true positives and true negatives. This can be used to accurately
determine the feasibility of a new machine learning algorithm or physician in the
healthcare industry and determine whether the physician should remain in practice.
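
A minimal sketch of computing the points on an ROC curve with scikit-learn is shown below; the scores and labels are invented rather than taken from a real diagnostic test.

```python
# ROC curve sketch with scikit-learn; labels and scores are hypothetical.
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                     # 1 = disease present
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))      # points on the ROC curve (false vs. true positive rate)
print(auc(fpr, tpr))            # area under the curve; closer to 1.0 is better
```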

This article is useful because it discusses the consequences of false positives and false
negatives in the field of medical informatics, which is what Galileo is being tested
against at this moment. It gives insight into the different algorithms that go into a
computer analyzing these results and generating a ROC curve that fits it. In a way,
Galileo is doing the same thing but instead of drawing the actual curve to describe the
data, it creates clusters that model the attributes and values in the dataset. This source
is credible because it was published by the University College of Medical Sciences,
which is a leading institution in medicine and computer science in India.
Orloff, J., & Bloom, J. (2014, April). Conditional probability, independence and Bayes'
theorem. Retrieved from
https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-
spring-2014/readings/MIT18_05S14_Reading3.pdf
This document, published by the Massachusetts Institute of Technology Department of
Mathematics, covers basic probability concepts that are used often in the field of
statistics. It starts off by going over conditional probability, and how it can help
provide data for analysis when a new or slightly modified dataset, somewhat different
from the first one encountered, is provided. It then discusses the meaning of
P(A|B): given that event B has occurred, what fraction of that population also belongs
to A. Next, the lecture delves into why the
multiplication rule is useful in statistics. Based on the law of independence, it can help
separate or divide an “A given B” scenario into more manageable parts if one of the
components of the problem is unknown. This, as the document states, is how Bayes’
theorem was created. It shows how to convert an “A given B” situation into a “B given
A”, which is extremely vital if a statistician does not know the full bounds of one
dataset.
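
For reference, the three ideas the reading covers, conditional probability, the multiplication rule, and Bayes' theorem, can be written in their standard forms:

```latex
% Conditional probability, the multiplication rule, and Bayes' theorem as usually stated:
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad
P(A \cap B) = P(A \mid B)\,P(B), \qquad
P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}
```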

This document is useful because it gives a much needed refresher on basic statistics, as
well as a detailed look into why Bayes’ theorem works for many statisticians. One of
the core components of GALILEO uses Bayes’ theorem, and this document helps
make sense of what at first appeared to be an impenetrable formula. This source is
credible because it was published by MIT, a leading institution in computer science
and statistical research.
Pan, W., Shen, X., & Liu, B. (2013). Cluster analysis: Unsupervised learning via supervised
learning with a non-convex penalty. Journal of Machine Learning Research, 14(1),
1865-1889. Retrieved from http://jmlr.csail.mit.edu/papers/volume14/pan13a/pan13a.pdf
This journal article was issued in the Massachusetts Institute of Technology’s annual
“Machine Learning Research” journal. This specific article, published by the
University of Minnesota Division of Biostatistics, discusses, among other things, the
difference between supervised and unsupervised learning. The article starts off by
grouping different types of specific graph analysis (e.g., clustering analysis, penalized
regression, generalized cross-validation) into two groups - either supervised or
unsupervised learning. It then goes into more detail about the differences between
these two learning methods. Unsupervised graph learning occurs when the base model
is not provided with the desired results during training; the algorithm can only cluster
the input data based on their analyzed statistical properties (mean, median, etc.).
Unsupervised learning usually produces slower and less accurate results, but it can
help users see right away which areas of the code need the most work. Supervised
learning, however, provides the analysis algorithm with the correct results from the
start, yielding faster and more accurate output. This allows users to see the
process of how the computer arrived at its answers and whether it needs any tweaking.

This article is useful because it goes over the different types of analysis and learning
methods that are going to be used when trying to solve problems in SOCRATES. The
research question for the project deals with trying to use unsupervised class
declarations and methods in order to arrive at the correct results. In these trials, no
correct dataset will be provided, and the SOCRATES algorithms will have to adapt to
the dataset in order to try and succeed in analyzing the relationships between data
points. This source is credible because it was published in a journal sponsored by the
Massachusetts Institute of Technology, a leading educational institution in machine
learning research.
Paruchuri, V. (2016, October 18). NumPy tutorial: Data analysis with Python [Blog]. Retrieved
from https://www.dataquest.io/blog/numpy-tutorial-python/
This blog post, published by DataQuest, gives a comprehensive tutorial in NumPy, a
python library that serves as a mathematical add-on to the inherent python language. It
provides useful features such as matrix manipulation, array creation and editing, and more
advanced data types to the user of the library. Additionally, it serves as a fundamental
basis for other Python libraries as well, namely SciPy and Pandas. The website goes
through all of these features of NumPy with the same example of different features of
a dataset containing information about wines, which provides an easy to follow guide
for a beginning or advanced Python programmer. The reader learns how to manipulate
data arrays into a form that works best for their purposes, which gives the tutorial broad
applicability. Finally, it provides a very useful NumPy cheat sheet,
which provides a quick reference when things slip someone's mind, which is
easy to do when manipulating datasets of this magnitude.
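
A few of the NumPy operations the tutorial covers are sketched below on a made-up array rather than the tutorial's wine dataset.

```python
# NumPy sketch: shape, slicing, aggregation, and boolean indexing on a toy array.
import numpy as np

wines = np.array([[7.4, 0.70, 9.4],
                  [7.8, 0.88, 9.8],
                  [6.3, 0.30, 11.0]])   # rows = wines, columns = made-up attributes

print(wines.shape)                      # (3, 3)
print(wines[:, 2])                      # slice one column (e.g., alcohol content)
print(wines.mean(axis=0))               # per-column means
print(wines[wines[:, 2] > 9.5])         # boolean indexing on rows
```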

This tutorial is useful because it provides an expansive review of the NumPy library.
This library is used extensively with both Socrates and Galileo, and knowing all of the
aspects of it will help when creating program code that is efficient and manageable.
Also, understanding NumPy will help when trying to decipher tricky functions in
SciPy and Pandas, which decreases down time when programming. This is a credible
source because DataQuest houses a wide range of articles spanning topics from
computer engineering to software development, and has a good track record of
providing in-depth and up-to-date articles concerning different libraries in a variety of
programming languages.
Rudin, C. (2012, April). K-NN classifiers (S. Ertekin, Ed.). Retrieved from
https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-lear
ning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec06.pdf
This lecture, published by the Massachusetts Institute of Technology, is designed to
supplement their Prediction, Machine Learning and Statistics course by providing an
in-depth overview of the KNN machine learning algorithm. It starts off by explaining
the principles behind KNN. The K-Nearest-Neighbors Classifier is designed to both
classify certain data points as belonging to a certain class or form a regression curve to
do the same job. The KNN algorithm is mainly used with supervised learning, which
is the process of using labeled data to increase an algorithm's accuracy when presented
with unlabeled data. The lecture goes on to talk about what the k-value is: the number
of nearest neighbors the algorithm considers when classifying a new point. The lecture
does warn the reader that while KNN is simple, it is not especially smart; to get
reasonably accurate results on the first run, the distances between the data points must
be relatively small. If they are too large, the accuracy will suffer and the learning
process will be slow.
lecture goes into the pros and cons of the KNN algorithm. On the plus side, it is
simple, powerful, and easy; it requires little to no work to set up. However, it is
quite slow and very resource intensive, because the distance from each query point to
every training point must be computed, which quickly adds up for large datasets.
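
A bare-bones nearest-neighbors sketch is shown below, which makes the cost remark concrete: every query requires a distance to every training point; the points and labels are hypothetical.

```python
# Manual k-nearest-neighbors sketch on made-up 2-D points.
import numpy as np

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array([0, 0, 1, 1])
query = np.array([1.1, 0.9])
k = 3

distances = np.linalg.norm(X_train - query, axis=1)   # distance to every training point
nearest = np.argsort(distances)[:k]                    # indices of the k closest points
votes = y_train[nearest]
print(np.bincount(votes).argmax())                     # majority-vote class label
```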

This lecture is useful because it discusses one of the most well-known algorithms in
the world of machine learning. The SOCRATES program contains an algorithm
named Galileo, which is a classification algorithm based upon KNN and a few other
more complex formulas. Knowing about the KNN classification methods will help
when developing the SOCRATES scalable graph analytics algorithms as well. This
source is credible because it is published by a renowned educational institution that
excels in computerized learning concepts.
Savkli, C., Carr, R., Chapman, M., Chee, B., & Minch, D. (2014). SOCRATES: A system for
scalable graph analytics. ​IEEE Xplore 2014​. Retrieved from
http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/122.pdf
This journal article, published by the Johns Hopkins Applied Physics Laboratory,
provides an in-depth overview of their Socrates computer program for aspiring
computer scientists. The article starts by differentiating
Socrates from other widely used big data analytics software such as Apache Hadoop and
relational database systems (RDBMS), stating how it combines the strengths of both and
removes the problems seen in each one. It then goes over how Socrates uses
“nodes” and “edges” in order to sift through large amounts of data. These elements are
then combined in order to produce a visually pleasing and interactive graph that can
easily be analyzed by humans and by computers. The developers of the program then
go into an overview of the efficiency of the Socrates project, and conclude that the
more nodes (data points) and edges (like-to-like connections between certain nodes)
that a data set / graph has, the faster the program processes this information. Finally,
the journal article goes into how well the software fared when different large data sets
and graphs were inputted as parameters to the Java code, and what could be improved
upon in the future.

This publication is useful because it provides a medium-level overview of the Socrates
program. This program is the main application utilized by the Large Scale Analytics
Group at the Applied Physics Laboratory, and provides the necessary information in
order to get started individually when starting work on a project involving large
amounts of data. This is a credible source because it was published by the JHU/APL, a
government-sponsored institution devoted to scientific research.
Schoening, A. (2012, July). Section 1.3: Data collection and experimental design. Retrieved
from Introduction to Probability and Statistics website:
http://www.math.utah.edu/~anna/Sum12/LessonPlans/Section13.pdf
This course lecture, designed by the Department of Mathematics at the University of
Utah, goes over what experimental design is and how to collect data accurately. First
off, the resource goes over how to effectively design a study that collects meaningful
data from an experiment or set of experiments. When designing a data experiment, a
researcher must carefully design a plan and comb through the procedure to make sure
that there are no errors. Additionally, one must be able to describe the data accurately
using descriptive statistics. It examines the difference between an observational study,
where the experimenter does not change existing conditions, and an experiment, in
which a treatment is prescribed, and the response is measured and recorded as the final
results. Finally, the lecture touches on various sampling techniques that a researcher
can use to describe the data. The most common are systematic sampling (choosing a
random data point and then selecting subsequent data values at regular intervals after
the randomly selected one), cluster sampling (grouping certain data points into
subgroups so that the dataset can be analyzed more easily) and convenience sampling
(in which only available members of the population are analyzed and studied).
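
The systematic sampling idea described above can be sketched in a few lines; the population list and sampling interval below are made up for illustration.

```python
# Systematic sampling sketch: random start, then every n-th element.
import random

population = list(range(1, 101))   # hypothetical population of 100 data points
interval = 10

start = random.randrange(interval)     # random starting index within the first interval
sample = population[start::interval]   # every 10th element after the start
print(sample)
```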

This source is useful because it goes over different types of experimental design and
data collection, which is how the data will be collected from the Galileo program. It
also touches on cluster sampling, which is essentially what Galileo already does with
the categorical data that it is fed. It gives insight into how to do the research so that
common pitfalls are avoided and that no other research type is done accidentally. This
is a credible source because it is published by a well-known professor at a prominent
educational institution, the University of Utah.
Sedgewick, R., & Wayne, K. (2016, August 2). Section 4.1: Analysis of algorithms. Retrieved
from Introduction to Programming in Java website:
http://introcs.cs.princeton.edu/java/41analysis/
This website was published by the Princeton University Department of Computer
Science and offers an overview of how to use algorithms written in the Java
programming language in order to iterate through and subsequently analyze large data
sets. The tutorial begins by stating how to go about setting up and executing the
program code so that it is useful to both the programmer and others studying it. It lists
a five-step method that programmers should take in order to write documented,
understood, and comprehensive code that can make an impact and last for a long
while. Then, it delves into using empirical timing data along with the Stopwatch.java class
in order to accurately measure the running time of the program, which is the
primary measure of efficiency (which is very important in software design). Next, the
tutorial goes into different types of mathematical analysis with graphs, and even
provides iteration statements that model particular growth patterns (logarithmic, cubic,
and so on). Finally, the website goes over memory usage, which is a second way of
determining program efficiency.
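
Although the tutorial itself uses Java's Stopwatch.java, the same doubling-style timing experiment can be sketched in Python as below; the workload function is a hypothetical stand-in.

```python
# Timing sketch analogous to the Stopwatch idea: time a workload at doubling sizes.
import time

def work(n):
    """A stand-in O(n^2) computation to time."""
    return sum(i * j for i in range(n) for j in range(n))

for n in (500, 1000, 2000):
    start = time.perf_counter()
    work(n)
    elapsed = time.perf_counter() - start
    print(f"n = {n:5d}: {elapsed:.3f} s")   # roughly 4x per doubling for quadratic code
```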

This website is useful because it provides an introduction to working with big data in
the Java programming language. Socrates, the computer program used at the
JHU/APL, is written and executed in the Java programming language, and this source
will be very helpful in that the information contained within can be applied to the
Socrates platform and be used in order to efficiently and effectively analyze large
groups of data. This source is credible because it was published by researchers from
Princeton University, a leading institution in the field of computer science.
Sontag, D., Zettlemoyer, L., Guestrin, C., Klein, D., & Gogate, V. (2013). Bayesian methods &
naïve Bayes. Retrieved from
http://people.csail.mit.edu/dsontag/courses/ml13/slides/lecture18.pdf
This lecture, published by New York University, is designed to supplement their 2013
Introduction to Machine Learning Course by introducing Bayesian methods that are
often used to formulate algorithms for supervised machine learning. It starts off by
introducing the fairly complex math and statistics behind the Bayesian model. The
lecture states that the more advanced algorithms in the world of machine learning use
a great deal of advanced calculus and probability concepts that require extended knowledge
of math and statistics. Next, the lecture delves into what a Gaussian curve is, which is
a more basic statistical concept. Gaussian curves are used as a building block of the
Naive Bayes model: the probabilities obtained from the Gaussian curve are fed into the
Naive Bayes algorithm, and a result is computed for each data point. Again, the
lecture goes into the entire mathematical theory behind why this curve works. Finally,
the lecture goes into what Naive Bayes actually is, and what it brings to the world of
machine learning. It defines one example of an application of the Naive Bayes
algorithm: digit recognition. This algorithm can easily recognize a number that a user
has drawn and classify it as a computer digit, even when the two look quite different. It
discusses how, like many other machine learning formulas, the model takes in multiple
inputs in order to classify them as one of many outputs. Unlike other algorithms, however, the
Naive Bayes model can take text data as an input - making categorical data much more
manageable.
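
The digit-recognition example echoed above can be sketched with scikit-learn's small digits dataset and Gaussian Naive Bayes; the lecture itself does not use this library or dataset, so this is an illustrative assumption.

```python
# Gaussian Naive Bayes sketch on scikit-learn's 8x8 handwritten digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)          # 8x8 images of handwritten digits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)   # per-class Gaussian fit to each pixel feature
print(model.score(X_test, y_test))           # classification accuracy on held-out digits
```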

This lecture is useful because the Naive Bayes model is another algorithm that is a
precursor to the Galileo program at JHU/APL. Knowing the ins and outs of this and
other precursor algorithms is useful when developing program code for the
SOCRATES platform. This source is credible because it is published by a renowned
educational institution.
Taylor, J. (2005, January 25). Model selection: General techniques. Retrieved from Statistics
203: Introduction to Regression and Analysis of Variance website:
http://statweb.stanford.edu/~jtaylo/courses/stats203/notes/selection.pdf
This lecture presentation was published by the Stanford University Department of
Statistics for the Statistics 203 course in 2005, and offers an overview of regression
techniques that are used when analyzing both quantitative and categorical data. The
lecture starts off by going over general strategies, tips, and goals for model selection in
advanced statistics. Choosing the best model to fit a large set of data is not an easy task;
in order to begin to comprehend what the best model is, one must "data
mine," which, according to the lecture, is a long and exhausting process.
Computational algorithms have started to make this process somewhat easier, but the
repeated work involved in the algorithms is still resource intensive. Next, the lecture
covers two model selection criteria specifically: AIC and BIC. It goes over how
optimization techniques can be used to balance how closely a model fits the data
points against the number of parameters the model uses. Plotting the AIC score
against model complexity gives a curve that dips down steeply and then gradually
slopes back up, because each parameter added to the model incurs a penalty, which
discourages overly complex models. This penalty is relatively small in AIC but much
larger in BIC, so the BIC graph shows a sharp decrease followed by a steeper
increase in score (and the lowest score indicates the best model).
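
For reference, the standard definitions of the two criteria (general textbook forms, not the lecture's exact notation) are:

```latex
% \hat{L} is the maximized likelihood, k the number of parameters, n the sample size.
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
% Lower scores are better; BIC's \ln n factor penalizes extra parameters more heavily.
```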

This lecture is useful because it goes over two main model selection tools that are
found in the GALILEO algorithm in the SOCRATES program. The whole point is to
obtain an “accuracy” score, which describes how well a specific model works with a
set of data. Understanding how these algorithms work is crucial to gaining an
appreciation for the advanced statistics and mathematics concepts that are used in the
program. This is a credible source because it was published by Stanford University, a
prominent educational institution.
Venkatasubramanian, N. (n.d.). Hadoop, a distributed framework for big data. Retrieved from
CompSci 237 - Spring 2017: Distributed Systems Middleware website:
http://www.csbio.unc.edu/mcmillan/Comp521F16/Lecture37.pdf
This lecture presentation was published by the University of California, Irvine School
of Computing and Information Sciences, and provides an in-depth overview of what
Apache Hadoop does regarding big data. The lecture starts off by introducing the
concept behind Hadoop. It goes over how it is designed to handle large amounts of
data by distributing them across a virtual network by sending them to different
“nodes” - all with low cost and high efficiency. Next, the lecture goes into the pretty
brief history of Hadoop. It was originally designed to be a storage handler for the
Google search engine - however, it was funded by Yahoo and, when funding was cut,
was then sold to Apache (where it lives currently). Finally, the lecture delves into how
Hadoop works and what its structure is like. It discusses how Hadoop receives large
files and proceeds to chop them up into smaller pieces and then replicate them. The
many copies then get sent to different servers linked to the Hadoop command line
interface. It is through this method that Hadoop manages big data.
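
The split-then-combine idea behind Hadoop's processing model can be illustrated in miniature as below; this is plain Python on toy strings, whereas real Hadoop jobs distribute these steps across many machines.

```python
# Toy map/reduce illustration of the split-then-combine idea; chunks are hypothetical.
from collections import Counter
from functools import reduce

chunks = ["big data is big", "data lives on many nodes"]   # stand-ins for file splits

def map_chunk(chunk):
    """Map step: count words within one chunk independently."""
    return Counter(chunk.split())

def combine(a, b):
    """Reduce step: merge partial counts from different chunks."""
    return a + b

print(reduce(combine, map(map_chunk, chunks)))
```
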
This lecture is useful because it outlines how Hadoop works. Hadoop is the precursor
to SOCRATES, and the latter is designed to improve upon the drawbacks seen in the
former. Learning how Hadoop works and what it cannot do currently can help with
determining the improvements to put in the SOCRATES program. This website is
credible because it is published by a leading educational institution in the information
sciences.
Wang, X., Sontag, D., & Wang, F. (2014, August). Unsupervised learning of disease
progression models. Retrieved from
http://people.csail.mit.edu/dsontag/papers/WanSonWan_kdd14.pdf
This research article, published by machine learning scientists from New York
University and IBM Laboratories, goes over how unsupervised machine learning can
be applied to the medical fields, specifically with chronic diseases such as Alzheimer’s
Disease and Diabetes. It describes how the field of precision medicine is rapidly
growing, and how algorithms need to be created that help meet the excessive demand
for data processing capabilities. It goes on, however, to detail the problems and/or
difficulties with developing these large-scale data analytics algorithms. One of the
issues it touches on is the inherent unreliability and unpredictability of the data coming
from the medical field. Additionally, cluster definitions are fuzzy at best, due to each
patient, as a data point, having a unique condition, with different symptoms and
factors affecting the data set. This limits the quality of the clusters created and
provides a model that does not accurately represent the data and patients. The
researchers go on to describe their time-progression method of analyzing data so that
these issues are not as prevalent in the resultant datasets.

This research article is useful because it provides information on the crucial link
between precision medicine (accurately representing electronic patient data) and
unsupervised machine learning. Galileo uses unsupervised machine learning
extensively and is partly designed to help analyze patient data in hospitals. This article
helps reveal the inherent difficulties seen with analyzing hospital datasets, which will
help avoid pitfalls when developing the program code for Galileo. This is a credible
source because it was published by recognized leaders in machine learning research
employed by New York University and IBM Laboratories.
Whitmore, J. C. (2016, July 17). Confusion matrix. Retrieved from
http://cran.cnr.berkeley.edu/web/packages/heuristica/vignettes/confusion-matrix.html
This website, published by the UC Berkeley College of Natural Resources, provides
an overview of what a confusion matrix is and how the information it provides can be
useful for machine learning algorithms. The webpage starts off by outlining what a
confusion matrix tells the researcher. The matrix helps determine what percentage of
the time the machine-calculated results coincide with the actual, true results provided
by the user. The page then jumps into different heuristics and algorithms that are
useful when it comes to separating out a validation set (the true results) and then
executing the regression formula on the test set in order to see how well the machine
learning algorithm fared. Third, the site talks about how guesses and ties are shown in
the confusion matrix, and why they are the most vital part to study. Finally, the site
gives a brief lesson in statistics and discusses the “big four” of machine learning:
accuracy, precision, specificity, and sensitivity.
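
A short sketch of building a confusion matrix and the "big four" metrics named above is shown below using scikit-learn rather than the page's R code; the labels are invented for illustration.

```python
# Confusion matrix and derived metrics with scikit-learn; labels are hypothetical.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))      # recall = sensitivity
print("specificity:", tn / (tn + fp))
```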

This website is useful because it provides both information and example program code
for obtaining and analyzing a confusion matrix. In the world of machine learning,
especially with SOCRATES, confusion matrices are essential because they help the
user discover how well their algorithm worked. This will encompass much of the
research done with unsupervised learning at the large scale analytics group. This
website is credible because it was published by the University of California,
Berkeley, which is a leading educational and research institution.
Willems, K. (2016, November 15). Jupyter notebook tutorial: The definitive guide. Retrieved
from DataCamp Tutorials website:
https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook
This website, published by DataCamp, provides a detailed overview of the Jupyter
Notebook program. First off, it goes into what Jupyter is. It discusses the benefits that
it brings a Python programmer - block-by-block code execution, easy saves, and
cloud-based server hosting. Then, it delves into how to install the program on a
computer. It discusses what pip is and how scientists can use it as a package installer
for things other than Jupyter. Then, once installation is finished, the tutorial goes into
how to run Jupyter in a server setting. It provides the command prompt commands that
are needed in order to start the server’s kernel, as well as organizational tips that help
keep the notebook clean and tidy. Finally, the tutorial states the most important feature
of the Jupyter Notebook: the ability to save work when coding. Usually, when coding
in the interactive Python interpreter, work cannot be saved between sessions. However,
Jupyter adds this functionality, which is a significant help to programmers.

This tutorial is useful because it discusses the ins and outs of the Jupyter Notebook.
The main program used for programming the SOCRATES algorithms is the Jupyter
Notebook, and knowing the aspects of the program will help when it comes to adding
segments to the actual SOCRATES code on the QAS GitLab. This is a credible source
because it provides correct and meaningful tutorials on other programming languages
and modules, and the specific article was written by a renowned Data Scientist in
Belgium.