
Marc Martínez Gómez, Jesús Hernández Herrero, Guglielmo Enrico Mezzadra

PCA APPLICATION: 2014 DEMOGRAPHICAL AND SOCIAL DATA WORLDWIDE

Abstract
Continuous technological innovation and the recent surge in digital equipment have
become an ambitious but fascinating challenge for every company and organization.
Every hour, devices register millions of records, from customer transactions,
patterns and general trends to traffic monitoring information, and the memory needed to store
all that knowledge is no longer a deterrent. Thus, the challenge nowadays is not
to gather as much data as possible, but to acquire the specific skills that allow handling and
processing such a tremendous amount of information. Likewise, a multidimensional data system is liable to present a high level of redundancy. That is
why dimensionality reduction is so important: it can help to better
understand the behavior of a specific data set and provide efficient management in
different applications.
In this manner, the main purpose of this work is to go through a highly powerful
approach, Principal Component Analysis (PCA), and apply it to a data set of
worldwide information related to social and demographic variables. The approach
relies on variance maximization and decorrelation of the low-dimensional variables and
makes it possible, for instance, to represent a multidimensional space on a straight line, a plane or
a 3D space, depending on the redundancy of the data and the variability that one
accepts to lose.

In this application, a first qualitative analysis of the data set is required in order to turn
some variables into relative terms. In the same manner, both developed and
under-developed countries are selected in order to maximize the variability of the data.
The scope of the PCA is analyzed, seeking to gain meaningful insights into the social
and cultural situation of the different countries, and the
performance and usefulness of the procedure are discussed.


DATA-SET PROCESSING
The first step in conducting the analysis is obtaining the data set
from a specific organization. In this case the data are obtained from the United States
Census Bureau. The data set initially consists of 12 social variables from 2014 for 127 countries around
the world, both developed and under-developed. Exhibit 1 provides the description
of the different variables.

Exhibit 1 Description of the initial variables

It is worth mentioning that some variables are expressed in absolute
values while others are percentages or relative quantities. This could distort
the weight of the different variables, so after conducting some analysis
and observing the results, the data set is slightly modified by removing some of the
variables and turning others into relative terms. Hence, the final data set on which the PCA
will be carried out is formed by 8 variables and displayed in Exhibit 2.

Exhibit 2 Final Data-set selected.

To summarize, our final data set consists of 8 variables and 127 measurements.
For clarification, the under-5 mortality rate refers to mortality in children under 5
years and the infant mortality rate to mortality in infants under 1 year. In the same manner, net
migration can be either positive or negative depending on whether immigration or emigration
dominates.


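Again, making the sample non-dimensional is essential to avoid weighting problems in the
process. The different measurements can be made non-dimensional by
considering the standardized variables

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}} $$

where $x_{ij}$ is the value of variable $j$ for country $i$, $\bar{x}_j$ is the sample mean of variable $j$, and $S = (s_{jk})$ is the sample covariance matrix.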
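
The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and a small made-up data matrix; the function name `standardize` and the toy values are hypothetical and are not taken from the Census Bureau data.

```python
import numpy as np

def standardize(X):
    """Center each column and divide it by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)   # sample covariance matrix (n - 1 denominator)
    std = np.sqrt(np.diag(S))     # square root of each diagonal entry s_jj
    return (X - mean) / std

# Hypothetical toy data: 5 "countries", 3 variables on very different scales.
X = np.array([
    [1.2, 70.0, 0.010],
    [2.5, 55.0, 0.045],
    [0.3, 81.0, 0.004],
    [1.9, 62.0, 0.030],
    [0.8, 76.0, 0.008],
])
Z = standardize(X)
```

After this step every column of `Z` has zero mean and unit sample variance, so no variable dominates the analysis merely because of its units.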

The main goal of the project, hence, is to perform the dimensionality reduction with
PCA, seeking to gain meaningful insights into the current social
situation of different countries worldwide. The selected data, highly correlated and
with high variability, were considered appropriate for the PCA application, so a high
performance of the method is expected.

PCA ANALYSIS
Spectrum analysis
One of the most important aspects of the PCA method is the spectrum evolution (S) as
a function of the number of principal components considered. S expresses the
variability that is retained once a specific reduction is selected. For PCA
to achieve an effective dimensionality reduction, the S curve should be
steep, rising rapidly and significantly over the first components. The spectrum
evolution obtained is shown in the next exhibit:
Hence, the results obtained by separating the data set can be summarized as follows:


Principal Components   1        2        3        4        5        6        7        8
Variability Retained   0.63447  0.84728  0.95764  0.98191  0.99746  0.99907  0.99978  1

Exhibit 3 Spectrum Evolution along Principal Components
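
A curve like the one in Exhibit 3 can be computed as sketched below. This is only an illustration, assuming that the spectrum evolution is the cumulative sum of the eigenvalues of the sample covariance matrix of the standardized data, divided by their total. The data are synthetic (127 rows, 8 columns driven by two hidden factors), and the function name `spectrum_evolution` is hypothetical.

```python
import numpy as np

def spectrum_evolution(Z):
    """Cumulative fraction of variance retained by the first d components."""
    S = np.cov(Z, rowvar=False)              # sample covariance of the data
    eigvals = np.linalg.eigvalsh(S)[::-1]    # eigenvalues, largest first
    return np.cumsum(eigvals) / eigvals.sum()

# Synthetic, highly correlated data: 8 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(127, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(127, 8))

# Standardize before computing the spectrum.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
retained = spectrum_evolution(Z)
```

The more redundant the variables are, the faster `retained` approaches 1, which is the behavior the steep curve in the exhibit reflects.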

As can be observed, the results obtained from the spectral point of view clearly
show that, in this case, the selected data set is appropriate for conducting the PCA.
The suitability of the data, as was explained above, relies on the high correlation of the
different variables. By taking into consideration just one principal component, 63% of the global
variance is retained. By using a 2D representation, the figure reaches 85%.
The dimensionality reduction can therefore be accepted as efficient and useful, since two
or three components are able to capture 85-95% of the total variability. A common practice is
to choose d, the number of principal components to be used, such that the variability retained exceeds a chosen threshold.
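
One common form of such a rule, assumed here for illustration, is to take the smallest d whose cumulative retained variability reaches a chosen threshold. The sketch below applies it to the values of Exhibit 3. The function name `choose_d` and the thresholds 0.80 and 0.90 are illustrative assumptions, not values stated in this work.

```python
import numpy as np

# Cumulative variability retained, copied from Exhibit 3.
retained = np.array([0.63447, 0.84728, 0.95764, 0.98191,
                     0.99746, 0.99907, 0.99978, 1.0])

def choose_d(retained, threshold):
    """Smallest number of components whose retained variability >= threshold."""
    if not 0.0 < threshold <= retained[-1]:
        raise ValueError("threshold must lie in (0, 1]")
    # retained is non-decreasing, so a binary search finds the first index
    # whose value is at least the threshold; add 1 for a component count.
    return int(np.searchsorted(retained, threshold, side="left")) + 1

d_80 = choose_d(retained, 0.80)   # assumed threshold of 80%
d_90 = choose_d(retained, 0.90)   # assumed threshold of 90%
```

With these assumed thresholds the rule selects two components at 80% and three at 90%, in line with the 85-95% range discussed above.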

Understanding the Principal Components.


Once the dimensionality reduction has been carried out, it is important to present the
data in a clear and meaningful way in order to identify significant trends or
patterns that are difficult to figure out in a high-dimensional space. Classical
representations are thus 2D or 3D, depending on the spectrum evolution of the
sample. For this analysis, for simplicity's sake and as a result of the fast spectrum
evolution, a 2D representation is selected, retaining around 85% of the total
variability while allowing the results to be represented in a plane.
One of the negative aspects of PCA is that the principal components are
not easily interpreted and cannot always be fully defined in terms of physical or social
characteristics, especially for complex, highly multivariate data. In some
cases, thanks to previous knowledge of the data set and some qualitative
assessment, this obstacle can be overcome by relying on the eigenvector
components of each principal component.
Analyzing the results obtained from the procedure and considering only the first two
principal components, each principal component can be understood as a linear
combination of the initial variables, according to the next exhibit:


Variable                                  PC1            PC2
GROWTH RATE (%)                           -0.304424308    0.552561335
BIRTHS PER WOMAN                          -0.422540976    0.020892323
LIFE EXPECTANCY                            0.407769693    0.159269608
BIRTHS OVER POPULATION (%)                -0.431217324    0.048922373
DEATHS OVER POPULATION (%)                -0.060589917   -0.266446443
UNDER 5 MORTALITY RATE OVER BIRTHS (%)    -0.431255989   -0.078633932
INFANT MORTALITY RATE OVER BIRTHS (%)     -0.427889529   -0.068259014
NET MIGRATION OVER POPULATION (%)          0.061791147    0.578597541
Interpretation                            Quality of life   Growth rate due to migration

Exhibit 4 Description of the first two principal components
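
The linear combination described by Exhibit 4 can be sketched as a matrix product between a standardized observation and the loading matrix. The loadings below are copied from the exhibit. The input vector is a hypothetical country that sits exactly at the sample mean for every variable except life expectancy, where it is one standard deviation above the mean; the function name `project` is also an illustrative assumption.

```python
import numpy as np

variables = [
    "GROWTH RATE (%)",
    "BIRTHS PER WOMAN",
    "LIFE EXPECTANCY",
    "BIRTHS OVER POPULATION (%)",
    "DEATHS OVER POPULATION (%)",
    "UNDER 5 MORTALITY RATE OVER BIRTHS (%)",
    "INFANT MORTALITY RATE OVER BIRTHS (%)",
    "NET MIGRATION OVER POPULATION (%)",
]

# Columns are PC1 and PC2, rows follow the order of `variables` (Exhibit 4).
loadings = np.array([
    [-0.304424308,  0.552561335],
    [-0.422540976,  0.020892323],
    [ 0.407769693,  0.159269608],
    [-0.431217324,  0.048922373],
    [-0.060589917, -0.266446443],
    [-0.431255989, -0.078633932],
    [-0.427889529, -0.068259014],
    [ 0.061791147,  0.578597541],
])

def project(z):
    """Map a standardized 8-variable observation to its (PC1, PC2) scores."""
    return np.asarray(z, dtype=float) @ loadings

# Hypothetical country: average everywhere, +1 standard deviation in life expectancy.
z = np.zeros(len(variables))
z[variables.index("LIFE EXPECTANCY")] = 1.0
scores = project(z)
```

For this input the scores are simply the life expectancy row of the table, which makes the sign and size of each loading easy to read: a higher life expectancy moves a country to the right along PC1.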

Thus, analyzing the first principal component, it can be seen that it is positively
correlated with the life expectancy of a specific country and, at the same time,
negatively correlated with infant mortality and births per woman. By accepting some
simplification and with the goal of matching the component with one physical variable,
it may be sensible to define PC1 as the quality of life in a specific country, in the
sense that where life expectancy is higher and infant mortality is lower,
living conditions are usually better.
On the other hand, PC2 is positively correlated with
the growth rate and the net migration. Taking into account that the total growth rate
is understood as the sum of several components such as migration, births and
deaths, it may also be assumed that PC2 is related to the growth rate due to
migration, in the sense that the variable will be significant in those countries where the
total growth rate is mainly led by the migration component, with deaths and births being
nearly equal and hence cancelling out.
As was explained above, this interpretation of the principal components is open to
assumptions, so there is implicitly a certain uncertainty around it. The
interpretation must always be aligned with the results obtained and is only possible if
the content and variability of the data are easy to interpret.
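
The reasoning behind this reading of PC2 can be written as a simple balance. This is a sketch under the assumption that growth, births, deaths and net migration are all expressed as percentages of the population over the same period; the subscripted symbols are introduced here for illustration only.

```latex
r_{\text{growth}} = r_{\text{births}} - r_{\text{deaths}} + r_{\text{migration}},
\qquad
r_{\text{births}} \approx r_{\text{deaths}}
\;\Longrightarrow\;
r_{\text{growth}} \approx r_{\text{migration}}.
```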


FINAL RESULTS
The presentation of the results focuses first on the 2D
representation of the data in the plane. Following the procedure and applying a
first qualitative split of the data into developed and non-developed countries,
the following results are obtained:

Exhibit 5. 2D data representation. The results are differentiated by developed and under-developed countries

The first conclusion that can be drawn is that there is a significant difference between
the developed and under-developed countries, especially in terms of variability. While
the developed countries are all located in the same region, showing low variability and
large values of PC1 associated with good quality of life, the under-developed ones are
spread over the whole plane, many of them clearly reflecting poor living conditions.
The envelope expresses the variability of the data, which is significantly larger for the non-developed countries. In order to gain better insight, the following representation is
produced:


Exhibit 6. 2D data representation. Specific Countries

Exhibit 6 selects some countries among the 126 measurements, taking the
edges and limits of Exhibit 5 and others that may be worth analyzing. According to
it, African countries such as Burkina Faso and Chad perform
very poorly in terms of PC1, but the worst is Afghanistan, heavily damaged by the loss of
human life as a result of war and other socio-political conflicts. At the other
extreme, Monaco is the country that performs best in PC1, followed by other
countries that are economic and social references.



In terms of PC2, it is curious to note that Qatar is the country with the largest component
value. As was defined above, PC2 is related to a high growth rate
led mainly by immigration, and Qatar, as a result of its high economic activity,
especially in infrastructure and construction, is attracting a lot of people from
several countries. Just as a curiosity, the surge in population during recent years can
be easily understood by knowing that there were 900,000 inhabitants in 2007 and
nowadays there are more than 1.7 million, of whom only 300,000 (less than 20%) are actually
from Qatar. Another important insight is the fact that Spain, for instance, is better
placed than the USA in terms of both PC1 and PC2. From the PC1
point of view, this could be explained by public access to the health care system in Spain, which avoids
many deaths, especially among infants. From the PC2 point of view, it is clear that Spain has
high migration activity due to its location close to Africa and also because of the shared
language with South America. On the other hand, in the USA migration is strictly
controlled and the requirements to establish a new life there are much more severe.
To drill down into the analysis, some correlation graphs were produced. For
instance, the correlation between PC1 and life expectancy was examined,
obtaining the results displayed in the next exhibit:

Exhibit 7. PC1 VS Life Expectancy



As can be observed, the general trend is a high correlation between both variables, but there
is one specific country, highlighted in red, which does not follow the general pattern. This
country turns out to be Botswana, a relatively rich African country characterized by having
one of the smallest births-per-woman ratios but highly damaged by inequalities and
HIV. The PC1 component of that country is hence not explained by its life
expectancy.
Another important insight can be drawn from Exhibit 8. It can be observed that the
general correlation also holds, but in this case there is a significant number of countries which
are not aligned with the pattern, having a higher growth rate than their PC2 component suggests. These
countries are mainly African countries, in which the growth rate is
not led by the migration rate, but by the significant number of births.

Exhibit 8 PC2 VS Growth Rate
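
Checks of this kind, measuring the overall correlation and then spotting the countries that depart from it, can be sketched as follows. This is an illustration only, not the procedure used for Exhibits 7 and 8: it assumes a straight-line fit and a robust residual threshold based on the median absolute deviation. The data are made up, and the point at index 20 is a hypothetical anomalous country whose life expectancy is far below what its PC1 score suggests.

```python
import numpy as np

def correlation_and_outliers(x, y, k=3.0):
    """Return the Pearson correlation and a mask of points far from the fitted line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)       # least-squares straight line
    residuals = y - (slope * x + intercept)
    center = np.median(residuals)
    mad = np.median(np.abs(residuals - center))  # median absolute deviation
    scale = 1.4826 * mad                         # comparable to a standard deviation
    flags = np.abs(residuals - center) > k * scale
    return r, flags

# Made-up data: life expectancy rises linearly with PC1, plus a small wiggle.
pc1 = np.linspace(-4.0, 4.0, 41)
wiggle = np.where(np.arange(41) % 2 == 0, 0.3, -0.3)
life_expectancy = 70.0 + 4.0 * pc1 + wiggle
life_expectancy[20] -= 15.0                      # hypothetical anomalous country

r, flags = correlation_and_outliers(pc1, life_expectancy)
anomalous = np.flatnonzero(flags)                # indices of flagged points
```

The correlation stays high because a single point barely changes the overall trend, while the residual test isolates the country that the component does not explain.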



CONCLUSIONS
The application of PCA to a data set formed by demographic information on
different countries has made it possible to gain meaningful insights into the current social
situation worldwide. Relying on the results, the following conclusions can be drawn:

The PCA method is a powerful approach to deal with high-dimensional data, a crucial
problem which many organizations are facing nowadays. The more variables we
collect, the higher the tendency for them to be redundant, so the method is definitely essential when
analyzing large volumes of stored data. It is worth emphasizing the importance of the
spectrum. In practice, PCA works well if the spectrum of S decays quickly, so
the analyst must be aware of the data set's behavior and of how many principal
components are needed to retain a significant percentage of the variability.

Before going through the analysis, a prior qualitative assessment is required. All the
data must be processed and analyzed in order to determine the most appropriate
method to deal with them and obtain significant findings. The more is known about
the data, the more meaningful the analysis will be. In this application, the initial data set
was turned into a final one in which many variables were relative. Moreover, the
variables must always be made non-dimensional by using the sample covariance matrix. In
the same manner, an initial separation into developed and non-developed countries was also essential.

The suitability of the data in terms of high correlation has enabled the PCA application
and the subsequent analysis of the results. Firstly, the different behavior of developed and non-developed countries in terms of performance and variability has been observed.
Secondly, an analysis has been carried out to check the consistency of the data
and to identify the anomalous points which did not follow the general trends. The procedure
followed has shown the power of the method and has provided important knowledge
for its application in other important fields beyond the social and
demographic context.

Hence, a statistical learning procedure has been needed to carry out the whole
analysis, and this knowledge is required to perform any data analysis.
In parallel, becoming familiar with Big Data and the way to manage it effectively
has been really worthwhile, and it will help to better understand future challenges in which the
correlation among the different variables is not so well described. PCA, despite being
widely used and having many applications, also has some flaws that can be overcome by
using other interesting methods such as kernel analysis or non-linear dimensionality
reduction.
