
Streamlined Life Cycle Assessment Through Materials Classification and Under-Specification
Lynn Reis Materials Systems Laboratory, MIT, lynnreis@mit.edu
Elsa Olivetti Materials Systems Laboratory, MIT, elsao@mit.edu
Randolph Kirchain Materials Systems Laboratory, MIT, kirchain@mit.edu
Matthew Pietrzykowski GE Global Research, pietrzyk@research.ge.com
Abstract. Data gaps in life cycle assessment (LCA) lead to a reliance on proxy data and
introduce uncertainty into assessments. This work shows how data mining techniques can be
used to quantitatively understand the important characteristics driving environmental impacts of
materials when more than one impact is considered. The results can subsequently be used to
fill data gaps and predict a material's environmental performance. This paper explores the use
of clustering and principal component analysis to identify potential material classifications, and
then the use of regression trees and other statistical methods to evaluate the classifications and
develop a material taxonomy. This method is applied to a life cycle inventory dataset for several
metals, and the TRACI 2.0 impact assessment method is used to illustrate the observed
reduction in uncertainty as more information is specified across the hierarchy using this
proposed classification. The analysis indicates that function, price, and recycled content are
significant classifiers for metals. Taxonomies built from these factors reduce uncertainty both
effectively and efficiently: the analysis shows a 68-78% reduction in
the sum of squares error within (SSW) of the worst case groups from level 1 to level 2 of the
taxonomy and an 84-91% overall reduction in SSW of the worst case groups from level 1 to
level 3. Using this materials taxonomy reduces the amount of information required to specify a
product, therefore enabling one component of streamlined life cycle assessment.
Introduction. Performing an LCA is resource intensive. If the application of LCA is going to
scale such that it is broadly used, this challenge must be addressed. In particular, there is a
need to reduce data collection requirements. One key aspect of this challenge is that primary
data are often not available to perform an LCA, or are outside the control of the party conducting
the LCA. In fact, even gathering bill-of-materials information for highly-outsourced products can
be challenging. This can be especially problematic in screening LCAs where materials
information is important, but may not be fully determined or may be particularly difficult to obtain.
Various methods have been explored to select proxy information where primary data are not
available and appropriate secondary database information has not been collected. These
include directly substituting information on other processes, scaling known data to compensate
for unknown data, averaging a set of proxy values, or extrapolating from existing data [1, 2]. For
all of these methods, experts are often required to identify the appropriate criteria for selecting
substitutes, but experts differ in their knowledge, and there has been little study of which
criteria are most appropriate [3]. Reliance on database or proxy data introduces a source of
uncertainty and subsequent bias.

Proceedings of the International Symposium on Sustainable Systems and Technologies (ISSN 2329-9169) is
published annually by the Sustainable Conoscente Network. Melissa Bilec and Jun-ki Choi, co-editors.
ISSSTNetwork@gmail.com.
Copyright 2013 by Lynn Reis, Elsa Olivetti, Randolph Kirchain, Matthew Pietrzykowski. Licensed under CC-BY 3.0.
Cite as: Lynn Reis, Elsa Olivetti, Randolph Kirchain, Matthew Pietrzykowski. Streamlined Life Cycle
Assessment Through Materials Classification and Under-Specification. Proc. ISSST,
http://dx.doi.org/10.6084/m9.figshare.805145. v1 (2013).

Data mining methods have been proposed to improve on some of the proxy selection methods
described above. One method is to use neural network models alone or in combination with
process data [2, 4]. Neural networks have also been applied at the product level [5]. However,
these models obscure variable relationships and make it difficult to interpret the true drivers.
An alternative to proxy selection is to use probabilistic under-specification [6]. Under-
specification involves hierarchical categorization of materials or other life cycle activities
according to available attributes, and considers uncertainty using categorical data at various
levels of the hierarchy. This allows less information to be provided about the BOM, and
leverages the aggregation of existing data for groups of materials. Previous work demonstrated
that under-specification can streamline data collection requirements. However, this work
focused predominantly on a single impact, Cumulative Energy Demand (CED), and did not use
quantitative mapping to develop the classification structure. Thus, there are opportunities to
build on this assessment.
Research Question. The research as a whole explores whether applying a hierarchical
structure to life cycle inventory data can be used to estimate data gaps and enable streamlining
of LCA. This paper more specifically seeks to answer the question of whether there are
quantitative methods that can help create such a hierarchical structure that is more effective and
efficient for LCA.
Investigative Method. This work identifies the key characteristics for improved materials
classification when multiple impact categories are considered. It further explores how data
mining methods can be used to determine these characteristics to inform hierarchical
classification. The goal is to be able to produce an effective and efficient taxonomy of materials
to be used in streamlined assessment through under-specification. To achieve this, a database
of materials and classifiers was generated. Using this dataset, the analysis was done in two
parts: 1) exploratory analysis of the data along with expert judgment to identify possible
classifiers related to underlying patterns in the data, and 2) evaluation of those classifiers for
their potential for uncertainty reduction and predictive abilities. The results were translated into a
taxonomy or hierarchical structure.
Database Formation. Data was derived from available life cycle inventory data. Boundary
conditions are set within the individual datasets, and so differ between them. The goal was to
have a broad and representative database, so all were used despite these differences in
boundaries, etc. Each material in the database was evaluated on a variety of environmental
midpoints. TRACI 2.0 was selected to provide a range of midpoints to assess, and because of
its relevance to North America [7]. Classifier information was collected from three main sources:
1) characteristics of individual process datasets, 2) categorization through qualitative
assessment of materials categories, and 3) materials property data from the CES database [8].
Exploratory Analysis and Classifier Identification. Data mining methods can be useful in
exploratory analysis to identify hidden patterns in the data. Clustering is one such method,
where similar data elements are grouped together depending on specified criteria. Clustering
analysis has been applied in LCA to identify data gaps, fill in missing information, and enable
more rapid assessment of large BOMs [9-12]. Multiple clustering methods are available for
determining similarity, and thus can produce different groupings. For this analysis, focus was
placed on hierarchical clustering, because of the goal to create a hierarchical classification and

lack of knowledge about the number of clusters a priori. Multiple hierarchical methods are
available that use different distance metrics in assigning groups. The goal of this assessment
was to reduce uncertainty within the clusters. Thus, the Ward method was chosen, which
minimizes the sum of squares at a given step. The values used in the clustering analysis were
the environmental impacts. If the data were highly skewed, they were log transformed. All
midpoints were assumed to be equally important, and so were scaled by the mean and standard
deviation. After running the cluster analysis, selection of the best number of clusters was done
by examining the knee in the joining distance scree plot.
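The clustering procedure described above can be sketched as follows; the impact values, cluster-count cutoff, and dataset size are hypothetical stand-ins, since the paper's metals dataset is not reproduced here.

```python
# Sketch of the clustering step: log transform, standardize, Ward linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy stand-in for per-material midpoint impacts (rows: materials, cols: midpoints).
impacts = rng.lognormal(mean=0.0, sigma=2.0, size=(30, 4))

# Log-transform the skewed impacts, then standardize each midpoint so all
# midpoints are weighted equally, as in the text.
X = np.log(impacts)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage minimizes the within-cluster sum of squares at each merge.
Z = linkage(X, method="ward")

# The joining distances (third column of Z) form the scree plot; the "knee"
# suggests a cluster count. Here we simply cut at an assumed distance.
labels = fcluster(Z, t=5, criterion="distance")
print(len(set(labels)), "clusters")
```

In practice the cut height would come from inspecting the scree plot of `Z[:, 2]` rather than a fixed value.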

Principal component analysis (PCA) was also examined as an exploratory method for identifying
important patterns in the data and has previously been used within LCA [9, 13]. PCA is a useful
method for reducing dimensionality by identifying a set of orthogonal principal components
(PCs) which capture the variation of the original dataset [14]. PCA can identify groupings of
materials and impacts which are more closely aligned, as well as potential outliers and leverage
data points. Appropriate classifiers may need to account for the factors that differentiate these points.
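As a rough illustration of the PCA step, a minimal sketch via SVD on a standardized toy impact matrix (the values are illustrative, not from the actual database):

```python
# Minimal PCA sketch for a standardized impact matrix.
import numpy as np

rng = np.random.default_rng(1)
impacts = rng.lognormal(sigma=2.0, size=(30, 4))
X = np.log(impacts)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)        # variance fraction per PC
scores = X @ Vt.T                      # material coordinates on each PC

print("PC1 variance share:", round(float(explained[0]), 3))
```

Materials with extreme scores on the first PCs are candidate outliers or leverage points of the kind discussed above.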

The methods described above identify exploratory groupings and key materials within the
dataset. It was then up to the user to identify the key characteristics describing each group and
differentiating certain materials. Expert judgment was useful to examine unifying themes within
clusters to indicate possible material classifiers that may not have otherwise been considered.
This generated a list of possible classifiers which are evaluated in the next section.

Classifier Evaluation. Materials may be grouped in many ways to identify similarity, and it is
difficult to know which groupings are most related to the environmental impacts [1]. In this study,
multiple classification structures were examined to identify which reduced uncertainty the most.
Multiple measures of uncertainty exist, and so a few methods were explored to give a more
complete sense of the data. Both continuous and categorical classifiers were considered.
Continuous classifiers were evaluated for contribution to uncertainty, and the most relevant
were translated into categorical classifiers to enable the creation of hierarchical data structures.
The classifiers were evaluated in terms of six criteria, described below. In each case, the top
classifiers were identified, and the results were aggregated to explore trends across methods.

The first evaluation used data visualization of the classifiers compared to rotated scores from
PCA. As the first few PCs may account for a large portion of the variation in the assessment, the
correlation of potential continuous classifiers with these PCs can also be explored to see which
contribute most to each portion of the uncertainty. Graphical examination of spread and
distinction between groups was observed for categorical classifiers.

The second and third evaluations used uncertainty metrics to compare groups. The former
considered the sum of squares error within the groupings (SSW) as a percentage of the total
sum of squares error of the entire dataset (SST). This measures how much uncertainty remains
within the groups. The latter metric examines the F-statistic, the ratio
of the sum of squares error between groups to the SSW. This metric is examined because it
gives a penalty for creating too many groups, analogous to requiring too much information in a
streamlining process. The F-statistic forms the basis for ANOVA, which requires certain
assumptions to be met, and should be considered cautiously in case they are not.
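A minimal sketch of the two uncertainty metrics, assuming standardized one-dimensional impact values and a hypothetical two-group classifier:

```python
import numpy as np

def ssw_sst_and_F(values, groups):
    """Within-group sum of squares as a fraction of total, and the one-way
    F-statistic, for a 1-D array of (standardized) impact values."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    sst = np.sum((values - values.mean()) ** 2)
    labels = np.unique(groups)
    ssw = sum(np.sum((values[groups == g] - values[groups == g].mean()) ** 2)
              for g in labels)
    ssb = sst - ssw
    k, n = len(labels), len(values)
    # The degrees of freedom penalize splitting into many small groups.
    F = (ssb / (k - 1)) / (ssw / (n - k))
    return ssw / sst, F

# Hypothetical two-group example: well-separated groups give a low SSW/SST.
vals = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
grp = np.array(["low", "low", "low", "high", "high", "high"])
frac, F = ssw_sst_and_F(vals, grp)
print(round(frac, 3), round(F, 1))
```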

Models were also examined to determine which classifiers were most effective at capturing the
variation. Ordinary least squares regression was considered using continuous classifiers and
categorical classifiers translated into dummy variables to predict the standardized environmental
impact. 20% of the data was reserved for validation of the model, and the analysis was done in

a stepwise fashion. A few possible models were selected and then compared using the
validation data. The best one was selected and the important classifiers were identified.
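The OLS evaluation might look roughly like the following sketch; the classifiers, coefficients, and 80/20 split are invented for illustration, and the stepwise selection step is omitted:

```python
# Illustrative OLS sketch with one continuous classifier (log price) and a
# dummy-coded categorical one; not the paper's actual model.
import numpy as np

rng = np.random.default_rng(2)
n = 100
log_price = rng.normal(size=n)
is_ferrous = rng.integers(0, 2, size=n)          # dummy variable
impact = 2.0 * log_price - 1.0 * is_ferrous + rng.normal(scale=0.3, size=n)

# 80/20 split: hold out 20% for validation, as in the text.
idx = rng.permutation(n)
train, valid = idx[:80], idx[80:]

X = np.column_stack([np.ones(n), log_price, is_ferrous])
beta, *_ = np.linalg.lstsq(X[train], impact[train], rcond=None)

# Score the fitted model on the held-out 20%.
pred = X[valid] @ beta
resid = impact[valid] - pred
r2 = 1 - np.sum(resid**2) / np.sum((impact[valid] - impact[valid].mean())**2)
print("validation R^2:", round(float(r2), 3))
```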

The inverse model prediction was also performed, where each individual environmental impact
was considered as an independent variable to predict classification. This was done using
logistic regression. Once again, 20% of the data was kept for validation, and the models for
each classification were compared based on their improvement over baseline values (i.e.,
selecting the largest group to represent all samples).
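The improvement-over-baseline comparison can be sketched as below; the material labels and predictions are hypothetical, and the logistic model itself is omitted:

```python
# Sketch of the baseline-improvement metric used to compare the logistic
# regression models.
import numpy as np

def improvement_over_baseline(y_true, y_pred):
    """Reduction in misclassification rate relative to always predicting
    the largest class (the 'baseline' in the text)."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    baseline_err = 1.0 - counts.max() / len(y_true)
    model_err = np.mean(y_true != np.asarray(y_pred))
    return (baseline_err - model_err) / baseline_err

# Hypothetical labels: baseline error 0.4, model error 0.1.
y_true = ["steel"] * 6 + ["copper"] * 3 + ["zinc"]
y_pred = ["steel"] * 6 + ["copper"] * 2 + ["steel", "zinc"]
print(improvement_over_baseline(y_true, y_pred))
```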

Lastly, regression trees were considered. Regression trees are beneficial since they are non-
parametric and can simultaneously analyze continuous and categorical classifiers [15]. The
regression trees were run using 5-fold cross-validation to create the best tree structure. The
trees were then pruned based on the k-fold R² values. The variables were assessed for their
contributions to the sum of squares, and those that contributed the most were selected.
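A sketch of the regression-tree step, assuming synthetic data and using scikit-learn's cost-complexity pruning as a stand-in for the pruning procedure in the text:

```python
# Regression-tree sketch: fit, choose a pruning level by 5-fold CV, and read
# off variable importances. Data are synthetic stand-ins.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 200
recycled = rng.uniform(0, 1, size=n)
log_price = rng.normal(size=n)
# Impact driven mainly by recycled content and price, echoing the findings.
impact = 3.0 * (recycled < 0.1) + 1.5 * log_price + rng.normal(scale=0.2, size=n)
X = np.column_stack([recycled, log_price])

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, impact)
best_alpha, best_score = None, -np.inf
for alpha in np.clip(path.ccp_alphas[:-1], 0, None):  # last alpha leaves one leaf
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    score = cross_val_score(tree, X, impact, cv=5).mean()  # 5-fold R^2
    if score > best_score:
        best_alpha, best_score = alpha, score

final = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, impact)
print("importances:", final.feature_importances_.round(2))
```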

Taxonomy Formation. The important classifiers in each of the six analyses from the classifier
evaluation section were aggregated to identify those that were most significant across multiple
analyses. These were combined in multiple arrangements to explore possible hierarchies.
These hierarchies were examined graphically to show the comparative advantages and
improvement over the baseline, where the baseline is a previously determined structure based
only on qualitative rather than quantitative information [16]. The goal of the analysis is to reduce
the uncertainty at each level (effective) without having to provide too much information (efficient).

Case Study and Results. This methodology was applied to a dataset of metals to demonstrate
the ability to reduce group variation as well as to gain insights about parameters that lead to
groups of materials with similar environmental impacts.

Database Formation. Data for this analysis was derived from a combination of datasets from the
European Aluminum Association (EAA), European Copper Institute (ECI), ecoinvent 2.2,
European Reference Life Cycle Database (ELCD), Eurofer, GaBi 5 database from PE
International, United States Life Cycle Inventory (USLCI), and World Steel. A set of 215 metals
were examined in the exploratory analysis. For the classifier evaluation step, this set was
reduced to keep boundaries more consistent, leaving 169 metals. Only four impacts from TRACI
2.0 were considered (acidification, global warming potential, non-carcinogenics, and smog) due
to differences in the evaluation procedure for some of the impacts across the data sources. Due
to the highly skewed nature of the data, all of these impacts were log transformed.

Exploratory Analysis and Classifier Identification. Clustering was performed on these four
impacts, and 14 clusters were evident. Expert evaluation highlighted that commercial purity,
density, function, price, and recycled content might be useful classifiers. The structure for
under-specification in prior work grouped materials based on whether they were ferrous or non-ferrous
[16]. However, clustering showed this classifier to be mixed among clusters, indicating an
opportunity for improvement.

Next, PCA was conducted on the database. The first PC alone accounted for 78% of the variation,
and the first two PCs made up 92% of the variation. Some of the more influential points
indicated that again recycled content, price, purity, or function might be good classifiers.

Classifier Evaluation. Using the exploratory analysis results, a set of classifier data was
collected. The classifiers included aspects of the process information (i.e., data source, recycled
content, level of processing, location), materials categorization (i.e., ferrous vs. non-ferrous,

function, groupings in the CES software), and intrinsic and manifest materials properties
(density, price, tensile strength, crustal abundance).

PCA showed that the logarithm of the price was correlated (r=0.79) with the first PC, and
recycled content was correlated to a lesser degree (r=-0.53). Once these factors were
accounted for, density and crustal abundance were aligned with the additional uncertainty found
in PC2. Examining histograms of categorical classifiers against the PCs showed that there was
distinction between the groups for function on both PC1 and PC2.

Using the SSW/SST metric, a few groupings were able to significantly decrease the overall
uncertainty. Grouping based on price showed the biggest change, where the data were split into
three parts for a 60% reduction as compared to the whole dataset. A few other classifiers
demonstrated a 40% reduction in uncertainty. The F-statistic highlighted a different set of
classifiers because of both an increased number of categories for some items (function, CES
software grouping) as well as insufficient separation between some groups.

All of the logistic regression models showed significant predictive capability. The best logistic
regression models achieved approximately 20-25% misclassification rates on the
validation data. However, given that some categories made up large portions of the data, this
represented approximately 45-55% improvement over the baseline.

The regression results were similar to those of the PCA analysis. They indicated that price and
recycled content were very significant, but that density and data source made up some of the
residual uncertainty. Thus, some error remains from the data source that we cannot fully
account for. The parameters chosen performed the best on the validation data, and all were
shown to be significant without much multicollinearity, as demonstrated by the VIF values. The
residuals approximate a normal distribution, with some deviations at the tails.

Lastly, the pruned regression trees with only five splits showed that recycled content and price
were the main factors of importance. Eliminating some additional splits, a simplified proposed
hierarchy was created from the analysis. This showed that first, very low recycled content
materials should be segregated. Then within each of those groups, different price point splits
are relevant. The simplified pruned regression tree is shown in Figure 1.


Figure 1: Regression tree derived from the data showing splits based on recycled content and log(price).

A summary of the important parameters from all of the methods is given in Table 1. More
classifiers were considered in the analysis, but this subset contains those that remained relevant.
Recycled content and price appeared within every method evaluated.






Table 1. Aggregated results showing important classifiers for each of the six classifier evaluation metrics.

                      Ferrous/      Data     Recycled                                CES
                      Non-Ferrous   Source   Content    Price   Function   Density   Grouping
PCA                                             x          x        x          x
SSW/SST                                         x          x        x                    x
F-Statistic                x          x         x          x
Logistic Regression                             x          x        x
Regression                            x         x          x                   x
Regression Trees                                x          x        x

Taxonomy Formation. Using the most significant classifiers from the classifier evaluation, a few
possible taxonomies were created and then compared to the baseline classification structure
from prior work. For this analysis, splits were made to create three possible levels of
information: level 1 is the aggregate of all samples, level 2 represents groups created by a first
split based on a classifier, and level 3 represents a subsequent split using a second classifier.
The baseline grouped metals by ferrous vs. non ferrous metals at level 2 and by material at
level 3. Figure 2 shows the SSW uncertainty metric at all three levels for each of the proposed
taxonomies as well as the baseline, where each bubble represents a group formed at that level
with the size related to the number of items in the group. Both methods from the analysis in this
study reduce the uncertainty more quickly than in the baseline case. When comparing to the
worst group, where the worst group is the group with the highest SSW in a given level, the two
taxonomies derived from this analysis significantly improve on the baseline, demonstrating an
effective structure. At level 3, improved performance is seen with significantly fewer groups than
the baseline, showing efficiency. These differences are summarized in Table 2.
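The worst-group SSW comparison across levels can be sketched as follows, with a toy two-level hierarchy in place of the actual taxonomies (group labels and impact values are illustrative only):

```python
# Sketch of the worst-group SSW comparison across taxonomy levels.
import numpy as np

def worst_group_ssw(values, groups):
    """Largest within-group sum of squares among the groups at one level."""
    values, groups = np.asarray(values, float), np.asarray(groups)
    return max(np.sum((values[groups == g] - values[groups == g].mean()) ** 2)
               for g in np.unique(groups))

rng = np.random.default_rng(4)
impact = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
level2 = np.array(["low_rc"] * 50 + ["high_rc"] * 50)        # first split
level3 = np.array([f"{g}_{'cheap' if i % 2 else 'dear'}"      # second split
                   for i, g in enumerate(level2)])

ssw1 = worst_group_ssw(impact, np.zeros(100))   # level 1: one group of everything
ssw2 = worst_group_ssw(impact, level2)
ssw3 = worst_group_ssw(impact, level3)
print("reduction L1->L2:", round(1 - ssw2 / ssw1, 2))
print("reduction L1->L3:", round(1 - ssw3 / ssw1, 2))
```

An effective taxonomy drives the worst-group SSW down quickly; an efficient one does so with few groups per level.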


Figure 2: Bubble plots showing three proposed hierarchies for the taxonomy, where the baseline is on the
left. The y-axis is the uncertainty as represented by the SSW, and the x-axis shows the splits at each level
of the hierarchical classification structure. The size of the bubbles indicates the number of items in the
groups at a given level.

Table 2. For each possible taxonomy, the number of groupings and relative reduction in the SSW between
the worst case groups in each level is reported.

                                          Level 2:      Level 2:        Level 3:      Level 3:
                                          # of Groups   SSW Reduction   # of Groups   SSW Reduction
Baseline: Ferrous-NonFerrous/Material          2             42%             30            89%
Function/Recycled Content                      5             78%             12            91%
Recycled Content/Price                         2             68%              4            84%

Discussion. This analysis reduced uncertainty in classification structures by creating groups with
less variation than the baseline case. The methodology achieves higher-resolution results while
requiring less information to be provided. One
possible critique is that recycled content values may be difficult to obtain. However, the recycled

content cutoffs between groups, although they are given specific numeric values, split off only
those with very low recycled content, and thus may not be too burdensome to collect. Price is
often known more readily, and so is less of a concern. One caution with price is that it is variable,
and so materials may move from one category to another over time.

Another benefit this work provides is an understanding of key characteristics about the data. For
instance, it is very important to know during comparative analysis that the data source may be
what is driving the difference in results. This analysis is also very useful in identifying outlier
materials that are distinct in their environmental impact, which can be done by clustering or
through examining the influential points in the PCA. This helps identify which materials may be
poorly suited to a streamlined assessment and which may need further specification.

Future Work. In the future, more work can be done to understand the drivers causing classifiers
to be significant in our taxonomy and to understand relative costs of acquiring information on
classifiers. This method will also be expanded to other material types to create a comprehensive
taxonomical structure. Furthermore, product case studies will be conducted to analyze how
much improvement is seen from the new taxonomy and what reduction is possible in the
specified set of interest for under-specification. Lastly, this work did not include an exhaustive
look at data mining methods, so others could be considered.

Acknowledgment. Thank you to everyone at the Materials Systems Laboratory for your
assistance with this project.

References
[1] Llorenc Mila i Canals, Adisa Azapagic, Gabor Doka, Donna Jeffries, Henry King,
Christopher Mutel, Thomas Nemecek, Anne Roches, Sarah Sim, Heinz Stichnothe, Greg
Thoma, and A. Williams. 2011. Approaches for Addressing Life Cycle Assessment Data
Gaps for Bio-based Products. Journal of Industrial Ecology. 15: 707-725.
[2] Gregor Wernet, Stefanie Hellweg, and K. Hungerbuhler. 2012. A Tiered Approach to
Estimate Inventory Data and Impacts of Chemical Products and Mixtures. Int J Life
Cycle Assess. 17: 720-728.
[3] Vee Subramanian, Eric Williams, Joby Carlson, and J. Golden. Patching Data Gaps
Through Expert Elicitation: A Case Study of Laundry Detergents. In IEEE International
Symposium on Sustainable Systems & Technology, 2011, Chicago, IL.
[4] Gregor Wernet, Stefanie Hellweg, Ulrich Fischer, Stavros Papadokonstantakis, and K.
Hungerbuhler. 2008. Molecular-Structure-Based Models of Chemical Inventories using
Neural Networks. Environ. Sci. Technol. 42: 6717-6722.
[5] I. Sousa, D. Wallace, and J. L. Eisenhard. 2000. Approximate Life-Cycle Assessment of
Product Concepts Using Learning Systems. Journal of Industrial Ecology. 4: 61-81.
[6] Elsa Olivetti, Siamrut Patanavanich, and R. Kirchain. Exploring the Viability of
Probabilistic Underspecification to Streamline Life-Cycle Assessment. Accepted to
Environmental Science & Technology.
[7] J. Bare. 2011. TRACI 2.0: The Tool for the Reduction and Assessment of Chemical and
Other Environmental Impacts 2.0. Clean Techn Environ Policy. 13: 687-695.
[8] Granta Material Intelligence, "Granta's CES EduPack," Granta Design Limited, 2012.
[9] M. Pietrzykowski. Data Mining and LCA: A Survey of Possible Marriages. In LCA IX,
October 2009, Boston, MA.
[10] Manish Marwah, Amip Shah, Cullen Bash, Chandrakant Patel, and N. Sundaravaradan.
2011. Using Data Mining to Help Design Sustainable Products. IEEE Computer.
[11] Naren Sundaravaradan, Manish Marwah, Amip Shah, and N. Ramakrishnan. Data
Mining Approaches for Life Cycle Assessment. In IEEE International Symposium on
Sustainable Systems and Technology, 2011, Chicago, IL.
[12] A. J. Izenman. 2008. Modern Multivariate Statistical Techniques: Regression,
Classification, and Manifold Learning. Springer Science+Business Media, LLC.
[13] L. Basson and J. G. Petrie. 2007. An Integrated Approach for the Consideration of
Uncertainty in Decision Making Supported by Life Cycle Assessment. Environmental
Modelling & Software. 22: 167-176.
[14] I. T. Jolliffe. 2002. Principal Component Analysis. Springer-Verlag.
[15] A. Saldivar-Sali, "A Global Typology of Cities: Classification Tree Analysis of Urban
Resource Consumption," Department of Architecture, Massachusetts Institute of
Technology, Cambridge, MA, 2010.
[16] S. Patanavanich, "Exploring the Viability of Probabilistic Underspecification as a Viable
Streamlining Method for LCA," Materials Science and Engineering, Massachusetts
Institute of Technology, Cambridge, MA, 2011.
