library [18] provides several basic machine-learning capabilities including clustering, classification and prediction. The data and results of processing can be visualized through packages such as Matplotlib and Graphviz.

Our original data analysis framework [12] utilized the following sequence of steps, as shown in Figure 1: data cleansing/ETL, data joining, feature engineering, clustering, classification and prediction, visualization of results, interpretation and reporting.

number of clusters, say k=4 in this illustration, and then apply the following operations in each step.

1. Apply the k-means clustering algorithm.
2. If single-element or substantially small clusters (e.g. size < 2 or 3) exist, these are treated as outliers. These outliers are removed and we continue with the rest of the data.
3. If there are no substantially small clusters, we terminate the iterative k-means algorithm.

This approach uses the sensitivity of k-means to outliers as a way to discover them. Another approach to the same problem is the k-means-- (k-means minus minus) algorithm described in [19], which tries to find the outliers and cluster centers at the same time. While quicker, k-means-- requires both the number of clusters and the number of outliers to be specified in advance. Though our approach also requires the number of clusters to be specified, it can find an arbitrary number of outliers at each iteration, thus identifying a greater number of potential outliers without requiring any prior knowledge of how many outliers are present. We can also run our algorithm for multiple iterations until the user is satisfied with the results.
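
For contrast with the iterative scheme above, the following is a minimal sketch of the k-means-- idea from [19], written from the description of that algorithm rather than taken from any reference implementation. The function name kmeans_minus_minus, its parameters, and the optional init argument are assumptions made for illustration. In each pass, the n_outliers points farthest from their nearest center are set aside before the centers are recomputed.

```python
import numpy as np


def kmeans_minus_minus(X, k, n_outliers, init=None, n_iter=100, seed=0):
    """Illustrative sketch: jointly estimate k centers and n_outliers outliers."""
    X = np.asarray(X, dtype=float)
    if init is None:
        # assumption: start from k distinct data points chosen at random
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centers = np.array(init, dtype=float)
    outlier_idx = np.array([], dtype=int)
    for _ in range(n_iter):
        # distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        nearest_dist = dists.min(axis=1)
        # set aside the n_outliers points farthest from their nearest center
        order = np.argsort(nearest_dist)
        outlier_idx = order[len(X) - n_outliers:]
        inlier_mask = np.ones(len(X), dtype=bool)
        inlier_mask[outlier_idx] = False
        # recompute each center from the inliers assigned to it
        new_centers = centers.copy()
        for j in range(k):
            members = X[inlier_mask & (nearest == j)]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, np.sort(outlier_idx)
```

Unlike iterative k-means, the caller has to supply n_outliers up front, which is the trade-off discussed above.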
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def iterative_kmeans(data, min_size=1, iters=range(3), n_clusters=8, **kwargs):
    outliers = []
    iteration = 0
    km = None
    # loop over iters directly; calling next() on a range raises TypeError
    for _ in iters:
        if len(data) == 0:
            break
        # create a kmeans object with a viable number of clusters
        km = KMeans(n_clusters=min(len(data), n_clusters), **kwargs)
        # find kmeans predictions with current data
        preds = km.fit_predict(data)
        # count the results in each cluster
        counts = pd.Series(preds).value_counts()
        # retrieve current outliers (clusters with at most min_size members)
        current_outliers = counts[counts <= min_size].index
        # any new outliers?
        if len(current_outliers) == 0:
            break  # no more outliers found
        # add new outliers to the outlier list
        outliers += [[data[preds == ind] for ind in current_outliers]]
        # take outliers out of data by keeping only the larger clusters
        data = data[np.isin(preds, counts[counts > min_size].index)]
        # keep track of number of iterations
        iteration += 1
    return (
        km,         # the last kmeans object that was fitted
        data,       # remaining data (no outliers)
        outliers,   # outliers pulled out
        iteration,  # actual number of iterations ran
    )

Figure 3: The Python code used to implement the iterative k-means algorithm.

The data consists of several columns including the hospital id and practitioner information such as name, specialty and graduation year. A synopsis of this data is given in Figure 4, where some of the original entries have been modified to safeguard privacy. We note that the original file contains such data for 895,431 health care practitioners, which is 7.2% of the total US health care workforce of 12.4 million workers [20]. This translates to a coverage of more than 75% of current practitioners in the entire USA for several specialties, such as Internal Medicine, which is highly representative for a freely available dataset.

An important issue in the health care arena concerns trends in the supply of practitioners across different specialties. This is of interest from multiple perspectives, such as the expected population growth in the US, the increasing number of aging citizens, and the graying of practitioners in certain medical specialties such as surgery [21, 22]. Since the passing of the Affordable Care Act in 2010, there has been considerable interest in understanding the effect of this act on developments in health care. Questions of interest include: what are the overall trends in the graduation of practitioners across different specialties? Are there groups of related specialties based on these trends? How has the Affordable Care Act affected graduation rates? We now demonstrate how insights into these questions can be obtained from the Physician-Compare-National-Downloadable-File described earlier.

Field                                      Value
NPI                                        123456789
PAC ID                                     11223344
Professional Enrollment ID                 22334455
Last Name                                  Doe
First Name                                 John
Gender                                     M
Credential
Medical school name                        SOME SCHOOL
Graduation year *                          2003
Primary specialty *                        ANESTHESIOLOGY
Organization legal name                    ANY MEDICAL
Group Practice PAC ID                      33445566
Number of Group Practice members           6
Line 1 Street Address                      Main Street
City                                       Anytown
State                                      NY
Zip Code                                   0
Claims based hospital affiliation CCN 1    445566

Figure 4: An example of the data fields in the Physician Compare National Downloadable file. We highlight the entries for Graduation Year and Primary Specialty (marked with * here) as these fields are used to drive our iterative k-means clustering algorithm.
We have created a processing pipeline for exploratory data analysis that uses the following steps, summarized in Figure 5.

1. Use the split-apply-combine method described in [7]. We use the Pandas groupby command to group the data by Primary specialty (one of the highlighted fields in Figure 4).
2. This produces grouped items for each specialty, which are then further binned into histograms over the field Graduation Year. This is done through the NumPy histogram command. The result is a feature vector for each specialty which consists of the count of practitioners who graduated within a given year. The bin size we use for the histogram is 1 year, in order to make the results easier to interpret.
3. The feature vectors for each specialty are then fed into the iterative k-means algorithm described in Figure 3.
4. Using our interactive visualization technique, the user is able to understand the organization of clusters of trends in the graduation years across specialties. This is shown in Figure 6 and Figure 7.

Figure 5: Processing pipeline for exploratory data analysis.

We also use a second dataset, provided by the New York State Statewide Planning and Research Cooperative System (SPARCS) program, which releases de-identified in-patient discharge information about disease diagnoses and costs [23]. This data runs into the tens of gigabytes and is released annually. The aggregate data [3] contains 15,213,123 rows of patient data from 254 hospitals and 58 counties. There are a total of 264 different diagnosis descriptions. We computed the trends for the percentage increase in total costs over all medical diagnoses for each hospital in New York State in the database, and processed them for outliers using the iterative k-means algorithm.

V. RESULTS

We applied the iterative k-means algorithm described in Section IV with k=8 on the Physician-Compare-National-Downloadable-File. The results are shown in Figure 7 and Figure 8. Figure 9 shows the result of applying iterative k-means clustering with k=8 on the New York SPARCS dataset. We computed the percentage change in total costs for medical procedures reported to New York State from the years 2009-2014 by using the year 2009 as the baseline. Outliers in the cost increases were identified through our technique, and are shown in the legend of Figure 9.

VI. DISCUSSION

Figure 7 and Figure 8 show that Cluster 2 contains specialties with fast-rising numbers of graduates in recent years. This is in sharp contrast to Cluster 5, which shows declines in specialties such as Family Practice and Geriatric Medicine. We observe that Cluster 2 contains specialties such as Nurse Practitioner whose deployment has been identified as a mechanism to contain rising health care costs. This is achieved by having trained nurse practitioners complement the role of physicians by being able to prescribe drugs in limited situations. In the US, this is one of the strategies being deployed since the passing of the Affordable Care Act of 2010. At the same time, the number of practitioners in Geriatric Medicine has been steadily declining, though the US is faced with an increase in the population of the elderly requiring such care. This has definite labor participation and capacity planning implications. It is interesting to see that a Big Data approach with the interactive visualization tools that have been developed can provide researchers with a ready capacity to start with the raw data provided by the US Government and quickly identify and explore interesting trends. We expect the widespread adoption of such tools to enable concerned citizens to draw their own conclusions from important national data sources without having to deal with potential biases in reporting.
Figure 6: This figure shows the outliers detected at each iteration of the k-means algorithm. The iteration number and cluster number are displayed on top of the plot, along with the name of the specialty.

Figure 7: The final clusters after termination of the iterative k-means algorithm (continued on next column).
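
As a concrete illustration of steps 1 and 2 of the processing pipeline in Figure 5, which produce the feature vectors clustered in Figures 6 and 7, the sketch below groups practitioners by specialty and builds a 1-year-bin histogram of graduation years for each group. The function name specialty_feature_vectors and the year-range parameters are hypothetical, and the column names are assumed to match the field names shown in Figure 4; this is not the authors' code.

```python
import numpy as np
import pandas as pd


def specialty_feature_vectors(df, first_year, last_year,
                              specialty_col="Primary specialty",
                              year_col="Graduation year"):
    """Return one graduation-year histogram (1-year bins) per specialty."""
    years = np.arange(first_year, last_year + 1)
    # edges at half-integers so that each bin holds exactly one calendar year
    edges = np.arange(first_year, last_year + 2) - 0.5
    features = {}
    # split by specialty, apply the histogram, combine into a single table
    for specialty, group in df.groupby(specialty_col):
        counts, _ = np.histogram(group[year_col].dropna(), bins=edges)
        features[specialty] = counts
    # rows are specialties, columns are graduation years
    return pd.DataFrame.from_dict(features, orient="index", columns=years)
```

Each row of the returned table is a feature vector for one specialty, so the table (or its underlying array) could then be handed to a clustering routine as in step 3.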
Figure 9 shows that the total costs at most hospitals
reported in the New York SPARCS dataset changed by
about 100% during the 2009-2014 time frame. However,
there are a few outliers with much higher cost increases,
shown in the legend for Figure 9. These results can be
combined with news releases during that time period to
gain a more detailed perspective on health care
developments.
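
The percentage change relative to the 2009 baseline that underlies Figure 9 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name percent_change_from_baseline and the column names hospital, year and total_costs are hypothetical and do not reflect the actual SPARCS column headers.

```python
import pandas as pd


def percent_change_from_baseline(costs, baseline_year=2009,
                                 hospital_col="hospital", year_col="year",
                                 cost_col="total_costs"):
    """Percentage change in total cost per hospital relative to baseline_year."""
    # sum costs per hospital and year: hospitals as rows, years as columns
    totals = costs.pivot_table(index=hospital_col, columns=year_col,
                               values=cost_col, aggfunc="sum")
    baseline = totals[baseline_year]
    # (cost in year - cost in baseline year) / cost in baseline year * 100
    return totals.sub(baseline, axis=0).div(baseline, axis=0) * 100.0
```

Under this convention a value of 100 means that total costs doubled relative to 2009, and each row of the result is a per-hospital trend that could be passed to the outlier detection step.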
For instance, the Democrat and Chronicle reported that the CEO of Monroe Community Hospital was fired (http://www.democratandchronicle.com/story/news/local/2013/12/26/todd-spring-seeks-3m-in-suit-against-monroe-county/4207877/). The Times Herald-Record reported in 2013 that the CEO of the Catskill Regional Medical Center was replaced, and that Federal sequestration cuts hit hard and fast prior to this event (http://www.recordonline.com/article/20130520/NEWS/130529987). It is likely that this was a contributing cause of the costs at this hospital rising faster than those at most other hospitals in New York State. These results demonstrate
that we can quickly determine interesting and relevant
trends in large health-care related datasets. This capability
could provide concerned citizens with an unbiased data-
driven interpretation of breaking news events in their
regions as well as nationally.
In order to facilitate widespread adoption of the
techniques presented in this paper, we have made our
framework and code available freely to the research
community at github.com/fdudatamining/.
Figure 8 (continued from Figure 7): The final clusters after termination of the iterative k-means algorithm.

Figure 9: This figure shows the result of applying iterative k-means clustering to the New York SPARCS dataset. We used 3 iterations, whereby the outliers in the legend were identified, such as the Monroe Community Hospital.

VII. CONCLUSION

We presented an open-source toolkit based on Python that can be utilized to analyze and interpret large datasets, thereby driving insight. Many government agencies, such as Medicare in the US, release detailed data about their inner workings, but the capabilities of tools to interpret this data have not kept pace. Since the relationships between variables in these large datasets are not fully known, users typically engage in visual exploration, which tends to be slow and manually intensive. We have developed a machine learning approach, called iterative k-means, where clusters and outliers are automatically identified and presented to the user. This facilitates rapid visual exploration of new datasets.

We applied our toolkit to analyze health care data released by two government agencies, the Centers for Medicare & Medicaid Services and the New York State Statewide Planning and Research Cooperative System (SPARCS). Our technique was able to identify interesting and meaningful trends in graduation rates of health care professionals over a 50-year period. This has definite
labor capacity planning implications for policy makers. We produced novel insights by identifying hospitals in New York State with significantly different patterns of cost increases over the past few years. This could enable policy makers to understand the implications of monetary incentives for hospitals and the impact they have on their communities. Our approach should prove valuable to researchers, software developers and concerned citizens who want to analyze large publicly available datasets.

REFERENCES:

[1] L. Da Xu, W. He, and S. Li, "Internet of things in industries: A survey," IEEE Transactions on Industrial Informatics, vol. 10, pp. 2233-2243, 2014.
[2] "http://www.medicare.gov/hospitalcompare/data/total-performance-scores.html."
[3] "https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/rmwa-zns4."
[4] V. S. Agneeswaran, Big Data Analytics Beyond Hadoop: Real-Time Applications with Storm, Spark, and More Hadoop Alternatives. FT Press, 2014.
[5] R. Fang, S. Pouyanfar, Y. Yang, S.-C. Chen, and S. Iyengar, "Computational health informatics in the big data age: a survey," ACM Computing Surveys (CSUR), vol. 49, p. 12, 2016.
[6] C. K. Leung, V. V. Kononov, A. G. Pazdor, and F. Jiang, "PyramidViz: Visual Analytics and Big Data Visualization for Frequent Patterns," in 2016 IEEE 14th Intl Conf on Pervasive Intelligence and Computing, 2016, pp. 913-916.
[7] H. Wickham, "The split-apply-combine strategy for data analysis," Journal of Statistical Software, vol. 40, pp. 1-29, 2011.
[8] "https://data.medicare.gov/Physician-Compare/National-Downloadable-File/s63f-csi6."
[9] A. L. Schwartz, B. E. Landon, A. G. Elshaug, M. E. Chernew, and J. M. McWilliams, "Measuring low-value care in Medicare," JAMA Intern Med, vol. 174, pp. 1067-76, Jul 2014.
[10] S. Schneeweiss, "Learning from big health care data," N Engl J Med, vol. 370, pp. 2161-3, Jun 5 2014.
[11] A. R. Rao and D. Clarke, "A fully integrated open-source toolkit for mining healthcare big-data: architecture and applications," in IEEE International Conference on Healthcare Informatics ICHI, Chicago, 2016, pp. 255-261.
[12] A. R. Rao, A. Chhabra, R. Das, and V. Ruhil, "A framework for analyzing publicly available healthcare data," in 2015 17th International Conference on E-health Networking, Application & Services (IEEE HealthCom), 2015, pp. 653-656.
[13] J. Choo and H. Park, "Customizing computational methods for visual analytics with big data," IEEE Computer Graphics and Applications, vol. 33, pp. 22-28, 2013.
[14] M. Chen, D. Ebert, H. Hagen, R. S. Laramee, R. Van Liere, K.-L. Ma, et al., "Data, information, and knowledge in visualization," IEEE Computer Graphics and Applications, vol. 29, pp. 12-19, 2009.
[15] M. Goldstein and S. Uchida, "A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data," PLOS one, vol. 11, p. e0152173, 2016.
[16] A. C. Bullinger, M. Rass, S. Adamczyk, K. M. Moeslein, and S. Sohn, "Open innovation in health care: analysis of an open health platform," Health Policy, vol. 105, pp. 165-75, May 2012.
[17] H. M. Krumholz, "Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system," Health Affairs (Millwood), vol. 33, pp. 1163-70, Jul 2014.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[19] S. Chawla and A. Gionis, "k-means--: A Unified Approach to Clustering and Outlier Detection," in SDM, 2013, pp. 189-197.
[20] "Total Health Care Employment," The Henry J. Kaiser Family Foundation, 2015.
[21] P. J. Schenarts and S. Cemaj, "The Aging Surgeon: Implications for the Workforce, the Surgeon, and the Patient," Surg Clin North Am, vol. 96, pp. 129-38, Feb 2016.
[22] J. M. Kupfer, "The Graying of US Physicians: Implications for Quality and the Future Supply of Physicians," JAMA, vol. 315, pp. 341-2, Jan 26 2016.
[23] New York State Department Of Health, Statewide Planning and Research Cooperative System (SPARCS). Available: https://www.health.ny.gov/statistics/sparcs/