
An open-source framework for the interactive exploration of Big Data: applications in understanding health care

A. Ravishankar Rao, PhD, Fellow, IEEE
Fairleigh Dickinson University, NJ, USA
raviraodr@gmail.com

Daniel Clarke, Student Member, IEEE
Fairleigh Dickinson University, NJ, USA
danieljbclarke@gmail.com

Abstract

When analyzing data where the relationships between variables are not fully understood, it is typical to engage in visual exploration. However, this is slow and manually intensive, and interesting trends could potentially be missed. Ideally, we would like to focus on patterns and locations within the data that are interesting, such as significant clusters and outliers. We present a novel iterative k-means clustering algorithm to efficiently identify clusters in large datasets. This facilitates rapid visual exploration of new datasets.

We illustrate our framework by performing a detailed analysis of open health care data released by the US Government and New York State. We apply our iterative k-means algorithm to identify clusters of trends in labor force participation to present a unique perspective by area of medical specialty over a 50-year period. Specialties such as nurse practitioners have seen a significant increase in the number of practitioners relative to internal medicine. Our code is open source and available on Github.

I. INTRODUCTION AND MOTIVATION

The area of Big Data has garnered significant interest over the past few years, with applications to several domains including finance, advertising, health, and the Internet of Things [1]. There are many sources for this data, some of which are proprietary, such as financial data, and others that are becoming increasingly open-access. Open-access data is being released by several governments world-wide including, most notably, the USA [2, 3]. This trend is driven by transparency, allowing concerned citizens to better understand how to improve the functioning of their own governments.

Though there are many opportunities to gain new insights from such data sources, there are also significant challenges that need to be overcome. For instance, the storage, retrieval and processing of large datasets is becoming increasingly demanding. Specialized architectures such as Hadoop and Spark [4] have been developed to allow the handling of large and loosely structured datasets. Another challenge is the extraction of knowledge and insight from the data being collected. This is being partially addressed by the use of machine-learning techniques [5]. Usually machine learning algorithms are applied in a supervised fashion, where training data is used to describe outcomes of interest, and algorithms can predict outcomes on new data. A critical part of Big Data analytics is to be able to perform efficient exploratory analysis on new domains or on new data in existing domains. This problem has not received sufficient attention in the literature, and a significant gap exists in our ability to quickly and effectively explore new data and determine relationships between variables.

A typical approach is to apply visualization techniques [6] to the datasets, so that humans can perceive and interpret relevant patterns and trends. Upon subsequent refinement and understanding, the insight gained from the exploration can be codified into specific algorithms. However, such human visual exploration is inherently slow and prone to error. Furthermore, there may be several combinations of variables that could remain unexplored. In order to make visual exploration more efficient, it is desirable to draw the user's attention to portions of the dataset that may contain interesting or meaningful patterns, given their unique relationship with other variables in the data.

We combine techniques from database analysis and machine learning to arrive at a novel solution to this problem. As an initial approach, we consider outliers in

978-1-5090-6182-2/17/$31.00 2017 IEEE 1641


the dataset to be meaningful. We develop a method, the iterative k-means analysis, to identify such outliers. Though this method can be applied to each variable in the dataset at a time or to features derived from the variables, the size of the dataset and the number of variables could make this approach prohibitively expensive. We address the high dimensionality of potential datasets by first applying the split-apply-combine paradigm from the data analysis literature [7]. Our technique is described in detail in Section IV.

We illustrate our approach with specific results from analyzing open health care data. We choose health care for two main reasons. Firstly, it is an issue of national importance, especially in the US, due to the enormous expenditures in this field (about 17 percent of GDP is spent on health care). Secondly, increasing amounts of open data are released by US Government initiatives [8]. Much research is currently being devoted to improving care delivery and reducing costs [9]. Health care is being transformed to a data-driven and evidence-based field [10]. The US Federal Government continuously releases data from the Center for Medicare and Medicaid Services (CMS) [8]. Rao et al. review many of the benefits provided by this open data movement for health care [11, 12].

II. BACKGROUND AND RELATED WORK

Choo and Park [13] recognize the need to apply clustering and machine learning techniques for the visualization of large datasets. Their approach is more focused on the relationship between what is computed and what is visualized on a display. They take into account the desired resolution on a display before determining the precision with which a result can be rendered. This reduces the potential computation time.

Chen et al. [14] propose that it is important to go through a visualization process before data can be converted to knowledge. Transformation of raw data into a visual space makes it more amenable for human consumption, leading to improved perception and cognition of relationships within the data. This is especially useful in new domains of enquiry where such relationships may not have been explored before. They equate visualization with a search process, where one is searching for interesting patterns in the data. The search space inevitably exceeds our capacity to see all relationships visually.

The approach that we use in the current paper is to apply our iterative k-means clustering algorithm to identify outliers and highlight interesting relationships in the data, thereby reducing the search space substantially.

Goldstein and Uchida [15] review different techniques for unsupervised anomaly detection. They focus on techniques that work in a one-shot manner, rather than embedding the algorithm in an iterative framework as presented in the current paper. Furthermore, they do not present an integrated solution that involves visualization to facilitate data exploration.

Bullinger et al. [16] envision an ecosystem for health care which is open, and permits innovation. This facilitates its widespread adoption by patients, caregivers, physicians, family members and the interested public. We take this concept further by adding an open-source dimension to the health care ecosystem, which should accelerate the analysis of available data by patients and the interested public. We expect concerned citizens to play an important role in determining the future of health care by exercising continuous vigilance about the options available to them, and by understanding the relationship between government health care expenditure and public health outcomes. Typical questions that are of interest include the following. What is the expected cost of a specific medical procedure, say hip replacement? What are some of the outliers in such costs, i.e. which hospitals charge the most? What are the important trends in the evolution of cost over time, e.g. which medical procedure has seen the fastest increase in costs? Which hospital or county has seen the biggest jump in costs? It would be highly desirable to have a semi-automatic way to quickly draw a user's attention to interesting views of the data.

Krumholz [17] advocates the use of machine learning and advanced analytic techniques to further our understanding of health care data. Analysis has now become a bottleneck in the learning process. The research presented in our current paper is squarely aimed at addressing this bottleneck.

III. DESIGN

In our earlier papers, we presented the architecture of an open-source system for the analysis of open health data [11, 12]. We use a Python-based solution with the following components: Python Pandas, Scikit-Learn and Matplotlib [11]. Python Pandas integrates SQL-like database querying and tabular arithmetic with Matplotlib, a visualization and plotting package. We use the iPython toolkit for development as it facilitates rapid exploration of datasets and algorithms. The Scikit-Learn Python

library [18] provides several basic machine-learning capabilities including clustering, classification and prediction. The data and results of processing can be visualized through packages such as Matplotlib and Graphviz.

Our original data analysis framework [12] utilized the following sequence of steps, as shown in Figure 1: data cleansing/ETL, data joining, feature engineering, clustering, classification and prediction, visualization of results, interpretation and reporting.

Figure 1: Framework for data analysis.

Our current paper is focused on the module indicated by Clustering, classification, Prediction, Analytics. We utilize multiple Python libraries including SciPy, scikit-learn, Matplotlib and Pandas.

IV. METHODS

We first describe the iterative k-means algorithm based on the steps shown in Figure 2. The algorithm operates by successively running k-means, treating substantially small clusters of points as outliers and removing them, and repeating the algorithm until no substantially small clusters are present. The resulting clusters capture the most relevant groups of data that are not considered to be outliers. In Figure 2 we use hypothetical data and show four iterations of the application of our algorithm in Steps (a) - (d). We first select a desired number of clusters, say k=4 in this illustration, and then apply the following operations in each step.

1. Apply the k-means clustering algorithm.
2. If single-element or substantially small clusters (e.g. size < 2 or 3) exist, these are treated as outliers. These outliers are removed and we continue with the rest of the data.
3. If there are no substantially small clusters, we terminate the iterative k-means algorithm.

This approach uses the sensitivity of k-means to outliers as a way to discover them. Another approach to tackle this same problem is the k-means-- (k-means minus minus) algorithm described in [19], which tries to find the outliers and cluster centers at the same time. While quicker, k-means-- works on a pre-specified number of clusters and outliers. Though our approach requires the number of clusters to be specified, we can find an arbitrary number of outliers at each iteration, thus identifying a greater number of potential outliers without any prior knowledge about the number of outliers present. We can also run our algorithm for multiple iterations until the user is satisfied with the results.

Figure 2: A description of the iterative k-means algorithm. The algorithm works sequentially on the original data as shown in steps (a)-(d). The solid filled circles represent individual data points. The blue outline identifies clusters at a given step. The red outline identifies outliers that are not considered for further analysis.

Our method is best described by a specific example. We use Medicare-provided data from the following location: https://data.medicare.gov/Physician-Compare/National-Downloadable-File/s63f-csi6, which we refer to as the Physician-Compare-National-Downloadable-File.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    def iterative_kmeans(data, min_size=1, iters=range(3), n_clusters=8, **kwargs):
        outliers = []
        iteration = 0
        km = None  # stays None if no clustering pass is run
        # run at most len(iters) passes, stopping early once the data is exhausted
        while len(data) != 0 and iteration < len(iters):
            # create a kmeans object with a viable number of clusters
            km = KMeans(n_clusters=min(len(data), n_clusters), **kwargs)
            # find kmeans predictions with current data
            preds = km.fit_predict(data)
            # count the results in each cluster
            counts = pd.Series(preds).value_counts()
            # retrieve current outliers (clusters with at most min_size members)
            current_outliers = counts[counts <= min_size].index
            # any new outliers?
            if len(current_outliers) != 0:
                # add new outliers to the outlier list
                outliers += [[data[preds == ind] for ind in current_outliers]]
            else:
                break  # no more outliers found
            # take outliers out of data, keeping only points in larger clusters
            data = data[~np.isin(preds, current_outliers)]
            # keep track of number of iterations
            iteration += 1
        return (
            km,  # the final kmeans object
            data,  # remaining data (no outliers)
            outliers,  # outliers pulled out
            iteration,  # actual number of iterations ran
        )

Figure 3: The Python code used to implement the iterative k-means algorithm.

The data consists of several columns including the hospital id and practitioner information such as name, specialty and graduation year. A synopsis of this data is given in Figure 4, where some of the original entries have been modified to safeguard privacy. We note that the original file contains such data for 895,431 health care practitioners, which is 7.2% of the total US health care workforce of 12.4 million workers [20]. This translates to a coverage of more than 75% of current practitioners in the entire USA for several specialties, such as Internal Medicine, which is highly representative for a freely available dataset.

An important issue in the health care arena concerns trends in the supply of practitioners across different specialties. This is of interest from multiple perspectives, such as the expected population growth in the US, the increasing number of aging citizens, and the graying of practitioners in certain medical specialties such as surgery [21, 22]. Since the passing of the Affordable Care Act in 2010, there has been considerable interest in understanding the effect of this act on developments in health care. Questions of interest include: what are the overall trends in the graduation of practitioners across different specialties? Are there groups of related specialties based on these trends? How has the Affordable Care Act affected graduation rates? We now demonstrate how insights into these questions can be obtained from the Physician-Compare-National-Downloadable-File described earlier.

    Field                                   Value
    NPI                                     123456789
    PAC ID                                  11223344
    Professional Enrollment ID              22334455
    Last Name                               Doe
    First Name                              John
    Gender                                  M
    Credential
    Medical school name                     SOME SCHOOL
    Graduation year                         2003
    Primary specialty                       ANESTHESIOLOGY
    Organization legal name                 ANY MEDICAL
    Group Practice PAC ID                   33445566
    Number of Group Practice members        6
    Line 1 Street Address                   Main Street
    City                                    Anytown
    State                                   NY
    Zip Code                                0
    Claims based hospital affiliation CCN 1 445566

Figure 4: An example of the data fields in the Physician Compare National Downloadable file. We highlight the entries for Graduation Year and Primary Specialty as these fields are used to drive our iterative k-means clustering algorithm.

We have created a processing pipeline for exploratory data analysis that uses the following steps, summarized in Figure 5.

1. Use the split-apply-combine method described in [7]. We use the Pandas groupby command to group the data by Primary specialty (one of the highlighted fields in Figure 4).
2. This produces grouped items for each specialty, which are then further binned into histograms over the field Graduation Year. This is done through the Numpy histogram command. The result is a feature vector for each specialty which consists of the count of practitioners who graduated within a given year. The bin size we use for the histogram is 1 year, in order to make the results easier to interpret.
3. The feature vectors for each specialty are then fed into the iterative k-means algorithm described in Figure 3.
4. Using our interactive visualization technique, the user is able to understand the organization of clusters of trends in the graduation years across specialties. This is shown in Figure 6 and Figure 7.

Figure 5: Processing pipeline for exploratory data analysis.

We also use a second dataset, provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) program, which releases de-identified in-patient discharge information about disease diagnoses and costs [23]. This data runs into the tens of gigabytes and is released annually. The aggregate data [3] contains 15,213,123 rows of patient data from 254 hospitals and 58 counties. There are a total of 264 different diagnosis descriptions. We computed the trends for the percentage increase in total costs over all medical diagnoses for each hospital in New York State in the database, and processed them for outliers using the iterative k-means algorithm.

V. RESULTS

We applied the iterative k-means algorithm described in Section IV with k=8 on the Physician-Compare-National-Downloadable-File. The results are shown in Figure 7 and Figure 8. Figure 9 shows the result of applying iterative k-means clustering with k=8 on the New York SPARCS dataset. We computed the percentage change in total costs for medical procedures reported to New York State from the years 2009-2014 by using the year 2009 as the baseline. Outliers in the cost increases were identified through our technique, and are shown in the legend of Figure 9.

VI. DISCUSSION

Figure 7 and Figure 8 show that Cluster 2 contains specialties with fast-rising numbers of graduates in recent years. This is in sharp contrast to Cluster 5, which shows declines in specialties such as Family Practice and Geriatric Medicine. We observe that Cluster 2 contains specialties such as Nurse Practitioner, whose deployment has been identified as a mechanism to contain rising health care costs. This is achieved by having trained nurse practitioners complement the role of physicians by being able to prescribe drugs in limited situations. In the US, this is one of the strategies being deployed since the passing of the Affordable Care Act of 2010. At the same time, the number of practitioners in Geriatric Medicine has been steadily declining, though the US is faced with an increase in the population of the elderly requiring such care. This has definite labor participation and capacity planning implications. It is interesting to see that a Big Data approach with the interactive visualization tools that have been developed can provide researchers with a ready capacity to start with the raw data provided by the US Government and quickly identify and explore interesting trends. We expect the widespread adoption of such tools to enable concerned citizens to draw their own conclusions from important national data sources without having to deal with potential biases in reporting.
Figure 6: This figure shows the outliers detected at each iteration of the k-means algorithm. The iteration number and cluster number are displayed on top of the plot, along with the name of the specialty.

Figure 7: The final clusters after termination of the iterative k-means algorithm (continued on next column).
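
The curves clustered in Figures 6 and 7 are built from per-specialty histograms of graduation year (steps 1 and 2 of the processing pipeline). A minimal sketch of that construction is given below. It assumes column names matching the fields in Figure 4; the sample rows, the 2000-2012 year range and all variable names are hypothetical, and the binning uses `numpy.histogram` with one-year bins.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Physician Compare file.
records = pd.DataFrame({
    "Primary specialty": [
        "ANESTHESIOLOGY", "ANESTHESIOLOGY",
        "NURSE PRACTITIONER", "NURSE PRACTITIONER", "NURSE PRACTITIONER",
    ],
    "Graduation year": [2003, 2003, 2010, 2011, 2011],
})

# One-year bins covering an assumed range of graduation years.
first_year, last_year = 2000, 2012
bin_edges = np.arange(first_year, last_year + 2)  # edges 2000, 2001, ..., 2013

# Split by specialty, apply the histogram, combine into one vector per specialty.
features = {
    specialty: np.histogram(group["Graduation year"], bins=bin_edges)[0]
    for specialty, group in records.groupby("Primary specialty")
}

# Stack the vectors into a (specialties x years) matrix for clustering.
specialties = list(features)
feature_matrix = np.vstack([features[s] for s in specialties])
```

Each row of `feature_matrix` is then one candidate input to the clustering step, with one column per graduation year.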

Figure 9 shows that the total costs at most hospitals
reported in the New York SPARCS dataset changed by
about 100% during the 2009-2014 time frame. However,
there are a few outliers with much higher cost increases,
shown in the legend for Figure 9. These results can be
combined with news releases during that time period to
gain a more detailed perspective on health care
developments.
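
The percentage change relative to the 2009 baseline can be computed along the following lines. This is a sketch only: the hospital names and cost totals are made-up placeholders, and the table layout (one row per hospital, one column per year) is an assumption about how the aggregated SPARCS totals might be arranged.

```python
import pandas as pd

# Hypothetical total costs per hospital and year (illustrative values only).
costs = pd.DataFrame(
    {2009: [100.0, 200.0], 2012: [150.0, 400.0], 2014: [200.0, 700.0]},
    index=["Hospital A", "Hospital B"],
)

# Percentage change of every year relative to the 2009 baseline.
baseline = costs[2009]
pct_change = costs.sub(baseline, axis=0).div(baseline, axis=0) * 100.0
```

In this toy table, Hospital A doubles its costs (a 100% change by 2014) while Hospital B rises by 250%. Each row of `pct_change` is a trend vector of the kind that could be passed to the clustering step to flag such a hospital as an outlier.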
For instance, the Democrat and Chronicle reported that
the CEO of Monroe Community Hospital was fired
(http://www.democratandchronicle.com/story/news/local
/2013/12/26/todd-spring-seeks-3m-in-suit-against
monroe-county/4207877/). The Times Herald-Record
reported in 2013 that the CEO of the Catskill Regional
Medical Center was replaced, and that Federal
sequestration cuts hit hard and fast prior to this event
(http://www.recordonline.com/article/20130520/NEWS/130529987).
It is likely that this was a contributing cause
to the costs at this hospital rising faster than most other
hospitals in New York State. These results demonstrate
that we can quickly determine interesting and relevant
trends in large health-care related datasets. This capability
could provide concerned citizens with an unbiased data-
driven interpretation of breaking news events in their
regions as well as nationally.
In order to facilitate widespread adoption of the
techniques presented in this paper, we have made our
framework and code available freely to the research
community at github.com/fdudatamining/.
Figure 8 (continued from Figure 7): The final clusters after termination of the iterative k-means algorithm.

Figure 9: This figure shows the result of applying iterative k-means clustering to the New York SPARCS dataset. We used 3 iterations, whereby the outliers in the legend were identified, such as the Monroe Community Hospital.

VII. CONCLUSION

We presented an open-source toolkit based on Python that can be utilized to analyze and interpret large datasets, thereby driving insight. Many government agencies, such as Medicare in the US, release detailed data about their inner workings, but the capabilities of tools to interpret this data have not kept pace. Since the relationships between variables in these large datasets are not fully known, users typically engage in visual exploration, which tends to be slow and manually intensive. We have developed a machine learning approach, called iterative k-means, where clusters and outliers are automatically identified and presented to the user. This facilitates rapid visual exploration of new datasets.

We applied our toolkit to analyze health care data released by two government agencies, the Center for Medicare/Medicaid Services, and New York State Statewide Planning and Research Cooperative System (SPARCS). Our technique was able to identify interesting and meaningful trends in graduation rates of health care professionals over a 50-year period. This has definite

labor capacity planning implications for policy makers. We produced novel insights by identifying hospitals in New York State with significantly different patterns of cost increases over the past few years. This could enable policy makers to understand the implications of monetary incentives for hospitals and the impact they have on their communities. Our approach should prove valuable to researchers, software developers and concerned citizens who want to analyze large publicly available datasets.

REFERENCES:

[1] L. Da Xu, W. He, and S. Li, "Internet of things in industries: A survey," IEEE Transactions on Industrial Informatics, vol. 10, pp. 2233-2243, 2014.
[2] "http://www.medicare.gov/hospitalcompare/data/total-performance-scores.html."
[3] "https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/rmwa-zns4."
[4] V. S. Agneeswaran, Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives: FT Press, 2014.
[5] R. Fang, S. Pouyanfar, Y. Yang, S.-C. Chen, and S. Iyengar, "Computational health informatics in the big data age: a survey," ACM Computing Surveys (CSUR), vol. 49, p. 12, 2016.
[6] C. K. Leung, V. V. Kononov, A. G. Pazdor, and F. Jiang, "PyramidViz: Visual Analytics and Big Data Visualization for Frequent Patterns," in 2016 IEEE 14th Intl Conf on Pervasive Intelligence and Computing, 2016, pp. 913-916.
[7] H. Wickham, "The split-apply-combine strategy for data analysis," Journal of Statistical Software, vol. 40, pp. 1-29, 2011.
[8] "https://data.medicare.gov/Physician-Compare/National-Downloadable-File/s63f-csi6."
[9] A. L. Schwartz, B. E. Landon, A. G. Elshaug, M. E. Chernew, and J. M. McWilliams, "Measuring low-value care in Medicare," JAMA Intern Med, vol. 174, pp. 1067-76, Jul 2014.
[10] S. Schneeweiss, "Learning from big health care data," N Engl J Med, vol. 370, pp. 2161-3, Jun 5 2014.
[11] A. R. Rao and D. Clarke, "A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications," in IEEE International Conference on Healthcare Informatics ICHI, Chicago, 2016, pp. 255-261.
[12] A. R. Rao, A. Chhabra, R. Das, and V. Ruhil, "A framework for analyzing publicly available healthcare data," in 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom), 2015, pp. 653-656.
[13] J. Choo and H. Park, "Customizing computational methods for visual analytics with big data," IEEE Computer Graphics and Applications, vol. 33, pp. 22-28, 2013.
[14] M. Chen, D. Ebert, H. Hagen, R. S. Laramee, R. Van Liere, K.-L. Ma, et al., "Data, information, and knowledge in visualization," IEEE Computer Graphics and Applications, vol. 29, pp. 12-19, 2009.
[15] M. Goldstein and S. Uchida, "A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data," PLOS one, vol. 11, p. e0152173, 2016.
[16] A. C. Bullinger, M. Rass, S. Adamczyk, K. M. Moeslein, and S. Sohn, "Open innovation in health care: analysis of an open health platform," Health Policy, vol. 105, pp. 165-75, May 2012.
[17] H. M. Krumholz, "Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system," Health Affairs (Millwood), vol. 33, pp. 1163-70, Jul 2014.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[19] S. Chawla and A. Gionis, "k-means--: A Unified Approach to Clustering and Outlier Detection," in SDM, 2013, pp. 189-197.
[20] "Total Health Care Employment," The Henry J. Kaiser Family Foundation, 2015.
[21] P. J. Schenarts and S. Cemaj, "The Aging Surgeon: Implications for the Workforce, the Surgeon, and the Patient," Surg Clin North Am, vol. 96, pp. 129-38, Feb 2016.
[22] J. M. Kupfer, "The Graying of US Physicians: Implications for Quality and the Future Supply of Physicians," JAMA, vol. 315, pp. 341-2, Jan 26 2016.
[23] New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS). Available: https://www.health.ny.gov/statistics/sparcs/

