
Volume 15 Number 2 2nd Quarter 2010

THE LEADING PUBLICATION FOR BUSINESS INTELLIGENCE AND DATA WAREHOUSING PROFESSIONALS

BI-based Organizations
Hugh J. Watson

Beyond Business Intelligence
Barry Devlin

Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game  17
Sule Balkan and Michael Goul

BI Case Study: SaaS Helps HR Firm Better Analyze Sales Pipeline  26
Linda L. Briggs

Enabling Agile BI with a Compressed Flat Files Architecture  29
William Sunna and Pankaj Agrawal

BI Experts' Perspective: Pervasive BI  36
Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom

BI and Sentiment Analysis  41
Mukund Deshpande and Avik Sarkar

Dashboard Platforms  51
Alexander Chiang

BI Training Solutions: As Close as Your Conference Room

We know you can't always send people to training, especially in today's economy. So TDWI Onsite Education brings the training to you. The same great instructors, the same great BI/DW education as a TDWI event, brought to your own conference room at an affordable rate.

It's just that easy. Your location, our instructors, your team. Contact Yvonne Baho at 978.582.7105 or ybaho@tdwi.org for more information.
www.tdwi.org/onsite

VOLUME 15 NUMBER 2

From the Editor

BI-based Organizations
Hugh J. Watson

Beyond Business Intelligence
Barry Devlin

17  Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game
    Sule Balkan and Michael Goul

26  BI Case Study: SaaS Helps HR Firm Better Analyze Sales Pipeline
    Linda L. Briggs

29  Enabling Agile BI with a Compressed Flat Files Architecture
    William Sunna and Pankaj Agrawal

36  BI Experts' Perspective: Pervasive BI
    Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom

41  BI and Sentiment Analysis
    Mukund Deshpande and Avik Sarkar

50  Instructions for Authors

51  Dashboard Platforms
    Alexander Chiang

56  BI Statshots


VOLUME 15 NUMBER 2
tdwi.org

EDITORIAL BOARD
Editorial Director: James E. Powell, TDWI
Managing Editor: Jennifer Agee, TDWI
Senior Editor: Hugh J. Watson, TDWI Fellow, University of Georgia
Director, TDWI Research: Wayne W. Eckerson, TDWI
Senior Manager, TDWI Research: Philip Russom, TDWI
Associate Editors: David Flood, TDWI Fellow, Novo Nordisk; Mark Frolick, Xavier University; Paul Gray, Claremont Graduate University; Claudia Imhoff, TDWI Fellow, Intelligent Solutions, Inc.; Graeme Shanks, University of Melbourne; James Thomann, TDWI Fellow, DecisionPath Consulting; Barbara Haley Wixom, TDWI Fellow, University of Virginia

Advertising Sales: Scott Geissler, sgeissler@tdwi.org, 248.658.6365.

President: Rich Zbylut
Director, Online Products & Marketing: Melissa Parrish
Graphic Designer: Rod Gosser

President & Chief Executive Officer: Neal Vitale
Senior Vice President & Chief Financial Officer: Richard Vitale
Executive Vice President: Michael J. Valenti
Senior Vice President, Audience Development & Digital Media: Abraham M. Langer
Vice President, Finance & Administration: Christopher M. Coates
Vice President, Information Technology & Application Development: Erik A. Lindgren
Vice President, Attendee Marketing: Carmel McDonagh
Vice President, Event Operations: David F. Myers
Chairman of the Board: Jeffrey S. Klein

Reaching the staff
Staff may be reached via e-mail, telephone, fax, or mail.
E-mail: To e-mail any member of the staff, please use the following form: FirstinitialLastname@1105media.com
Renton office (weekdays, 8:30 a.m.-5:00 p.m. PT): Telephone 425.277.9126; Fax 425.687.2842; 1201 Monster Road SW, Suite 250, Renton, WA 98057
Corporate office (weekdays, 8:30 a.m.-5:30 p.m. PT): Telephone 818.814.5200; Fax 818.734.1522; 9201 Oakdale Avenue, Suite 101, Chatsworth, CA 91311

List Rentals: 1105 Media, Inc., offers numerous e-mail, postal, and telemarketing lists targeting business intelligence and data warehousing professionals, as well as other high-tech markets. For more information, please contact our list manager, Merit Direct, at 914.368.1000 or www.meritdirect.com.

Reprints: For single article reprints (in minimum quantities of 250-500), e-prints, plaques and posters contact: PARS International, Phone: 212.221.9595, E-mail: 1105reprints@parsintl.com, www.magreprints.com/QuickQuote.asp

Copyright 2010 by 1105 Media, Inc. All rights reserved. Reproductions in


whole or in part are prohibited except by written permission. Mail requests to
Permissions Editor, c/o Business Intelligence Journal, 1201 Monster Road SW,
Suite 250, Renton, WA 98057. The information in this journal has not undergone
any formal testing by 1105 Media, Inc., and is distributed without any warranty
expressed or implied. Implementation or use of any information contained herein
is the reader's sole responsibility. While the information has been reviewed for
accuracy, there is no guarantee that the same or similar results may be achieved
in all environments. Technical inaccuracies may result from printing errors,
new developments in the industry, and/or changes or enhancements to either
hardware or software components. Printed in the USA. [ISSN 1547-2825]
Product and company names mentioned herein may be trademarks and/or
registered trademarks of their respective companies.


Business Intelligence Journal


(article submission inquiries)
Jennifer Agee
E-mail: jagee@tdwi.org
tdwi.org/journalsubmissions
TDWI Membership
(inquiries & changes of address)
E-mail: membership@tdwi.org
tdwi.org/membership
425.226.3053
Fax: 425.687.2842

From the Editor


In good economies and bad, the secret to success is to meet your customers' or clients' needs. Your enterprise has to respond to changing conditions and emerging trends, and it has to do so quickly. Your organization must be, in a word, agile.

"Agile" has been used to describe an application development methodology designed to help IT get more done in less time. We're expanding the meaning of agile to include the techniques and best practices that will help an organization as a whole be more responsive to the marketplace, especially as it relates to its business intelligence efforts.

In our cover story, William Sunna and Pankaj Agrawal note that rapid results in active data warehousing become vital if organizations are to manage and make optimal use of their data. Their compressed flat-file architecture helps an enterprise develop less costly solutions and do so faster, which is at the very heart of agile BI.

Sule Balkan and Michael Goul explain how in-database analytics advance predictive modeling processes. Such technology can significantly reduce cycle times for rebuilding and redeploying updated models. It will benefit analysts who are under pressure to develop new models in less time and help enterprises fine-tune their business rules and react in record time; that is, boost agility.

Barry Devlin notes that businesses need more from IT than just BI. Transaction processing and social networking must be considered. Devlin points out how agility is a major driver of operational environment evolution, and how the need for agility in the face of change is driving the need for a new architecture. Alexander Chiang looks at dashboard platforms (the technologies, business challenges, and solutions) and how rapid deployment of agile dashboard development reduces costs and puts dashboards into the hands of users quickly.

Also in this issue, senior editor Hugh J. Watson looks at enterprises that have immersed BI in the business environment, where work processes and BI intermingle and are highly interdependent. Mukund Deshpande and Avik Sarkar explain how sentiment data (opinions, emotions, and evaluations) can be mined and assessed as part of your overall business intelligence. In our Experts' Perspective column, Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom suggest best practices for correcting data quality issues.

We're always interested in your comments about our publication and specific articles you've enjoyed. Please send your comments to jpowell@1105media.com. I promise to be agile in my reply.



BI-based Organizations

Hugh J. Watson

A growing number of companies are becoming BI-based. For these firms, business intelligence is not just "nice to have"; rather, it is a necessity for competing in the marketplace. These firms literally cannot survive without BI (Wixom and Watson, 2010).
Hugh J. Watson is Professor of MIS and
C. Herman and Mary Virginia Terry Chair
of Business Administration in the Terry
College of Business at the University
of Georgia. hwatson@terry.uga.edu

In BI-based organizations, BI is immersed in the business environment.1 Work processes and BI intermingle,
are highly interdependent, and influence one another.
Business intelligence changes the way people work as
individuals, in groups, and in the enterprise. People
perform their work following business processes that
have BI embedded in them. Business intelligence extends
beyond organizational boundaries and is used to connect
and inform suppliers and customers.

1. The concept of a BI-based organization is similar to the immersion view of IT introduced in O.A. El Sawy, 2003.

An Example of a BI-based Organization


I recently completed a case study of a major online
retailer. (The well-known company asked that its name
not be used in this article.) Business intelligence permeates its operations. The company has a data warehouse
group that maintains the decision-support data repository
and a decision-support team with analysts scattered
throughout the business to help develop and implement
BI applications. The applications include:

Forecasting product demand


Determining the selling price for products, both
initially and later for products with sales below
expectations
Market basket analysis



Customer segmentation analysis


Product recommendations, both while customers are
on the Web site and in follow-on communications

Customer and product profitability analysis

Campaign planning and management

Supply chain integration

Web analytics

Fact-based decision making

Some of the details of these applications are interesting. For example, the customer profitability analysis considers whether a customer typically buys products at full price or only those that are discounted, and whether a customer has a history of returning products. When a product has an excessive return rate, it is pulled off the Web site and assigned to an investigative team to work with the vendor to identify and fix the problem. When the problem resides with the vendor, the vendor is expected to make good on the costs incurred. Hundreds of tests are also run each year to see what marketing approaches work best, such as the content of offers and the most effective e-mail subject lines.
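To make this kind of analysis concrete, the sketch below computes two of the signals just described, the share of purchases made at full price and the return rate, from hypothetical order records. The record layout, names, and threshold are illustrative assumptions, not details of the retailer's actual systems.

# Illustrative sketch only: field names and the review threshold are assumptions,
# not details of the retailer's BI applications described in this article.

from collections import defaultdict

orders = [
    # (customer_id, product_id, paid_full_price, was_returned)
    ("C1", "P9", True,  False),
    ("C1", "P7", False, True),
    ("C2", "P9", True,  False),
    ("C2", "P3", True,  False),
]

def customer_profile(rows):
    """Per-customer share of full-price purchases and return rate."""
    stats = defaultdict(lambda: {"orders": 0, "full_price": 0, "returns": 0})
    for cust, _prod, full_price, returned in rows:
        s = stats[cust]
        s["orders"] += 1
        s["full_price"] += full_price
        s["returns"] += returned
    return {
        cust: {
            "full_price_share": s["full_price"] / s["orders"],
            "return_rate": s["returns"] / s["orders"],
        }
        for cust, s in stats.items()
    }

def flag_products_for_review(rows, max_return_rate=0.30):
    """Products whose return rate exceeds a (hypothetical) threshold."""
    counts = defaultdict(lambda: [0, 0])  # product -> [orders, returns]
    for _cust, prod, _full, returned in rows:
        counts[prod][0] += 1
        counts[prod][1] += returned
    return [p for p, (n, r) in counts.items() if r / n > max_return_rate]

print(customer_profile(orders))
print(flag_products_for_review(orders))

In practice, the same logic would run against the warehouse rather than in-memory lists, but the shape of the analysis is the same: simple per-customer and per-product ratios that feed pricing, offer, and vendor-review decisions.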
When asked to describe his company, the retailer's CEO said he does not view it as a retailer or even an information-based company. Rather, he sees it as a BI company. The use of analytics is key to its success.

Becoming a BI-based Organization

If companies are becoming increasingly BI-based, what are the drivers behind this trend and what are the requirements for success? Although there are exceptions, the following conditions are typical.

A Highly Competitive Business Environment
Nearly all firms face a competitive business environment and can benefit from BI in some way. This is especially true for firms that serve high-volume markets using standardized products and processes. Think of retailers (e.g., Walmart), telecommunications firms (AT&T), and financial institutions (Bank of America). They all use BI to understand and communicate with customers and to optimize their operations.

Consider the online retailer discussed earlier. It must compete against large pure-plays such as Amazon.com and the traditional brick-and-mortar companies that also have a strong online presence. To be successful, it must use analytics to understand and anticipate the needs and wants of its millions of customers, offer products at appealing yet profitable prices, communicate and make offers that are wanted (no spam), and acquire and deliver products in an efficient and effective way.

A Strategic Vision and Executive Support for BI
Several years ago I worked with a regional bank that was close to going under (Cooper et al., 2000). The new management team stopped the bleeding by cutting costs but knew that this was not a sustainable business strategy. For the long run, the bank implemented a strategy based on knowing its customers exceptionally well. The CEO had a vision for how the strategy would work. He wanted everyone on the project team to understand and buy into the strategy and to have the necessary skills to execute it. When the vice president of marketing proved unable to execute the vision, he was replaced, despite having been hired specifically for the job.

Because time was of the essence and in-house data warehousing and BI skills were lacking, consultants were used extensively. It was an expensive, bet-the-bank approach, but it proved highly successful. With senior management's vision and support, the bank became a leader in the use of analytics and an emerging leader in the industry.

Smart companies have formal BI vision documents and use them to guide and communicate their plans for BI. Sponsors must understand their responsibilities, such as being visible users of the system and helping to handle political problems.


An Analytical Culture
The bank we mentioned had 12 marketing specialists prior to implementing its customer intimacy strategy, and had 12 different people afterwards. All of the original dozen employees had moved to other positions or left the bank. The bank's CEO said their idea of marketing was handing out balloons and suckers at the teller line and running focus groups. The new marketing jobs were very analytical, and the previous people couldn't or didn't want to do that kind of work.

At Harrah's Entertainment, decisions used to be made based on "Harrahisms," pieces of conventional wisdom that were believed to be true (Watson and Volonino, 2002). As Harrah's moved to fact-based decision making, these Harrahisms were replaced by analyses and tests of what worked best. Using this strategy, Harrah's evolved from a blue-collar casino into the industry leader.

In the short run, a company either has an analytical culture or it doesn't. Change needs to originate at the top, and it may require replacing people who don't have analytical skills.
A Comprehensive Data Infrastructure
A company's BI efforts cannot be any better than the available data. That is why so much time and effort is devoted to building data marts and warehouses, enhancing data quality, and putting data governance in place. Once these exist, however, it is relatively easy to realize the benefits of BI.

Continental Airlines has a comprehensive data warehouse that includes marketing, revenue, operations, flight and crew data, and more (Watson et al., 2006). Because the data is in place, and the BI team and business users are familiar with the data and have the ability to build applications, new applications can be developed in days rather than months, allowing Continental to be very agile.
Talented BI Professionals
BI groups need a mix of technical and business skills.
Although good technical talent is a must, an enterprise
must have people who can work effectively with users.


I have been most impressed with those firms that have hybrid employees; that is, people with excellent technical and business skills.
At Continental Airlines, it is not always clear whether
you are talking with someone from the BI group or
from one of the business units. Many people understand both BI and the business, and there is a good
reason for this: Some of the people in BI used to work
in the business, and vice versa. This approach can help
eliminate the chasm that is so common between IT and
business people.

Conclusion
My list of drivers and requirements for a BI-based organization is not all-inclusive, but if you get these things right,
you are well on your way to creating a successful BI-based
organization.

References
Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue [2000]. "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, December, pp. 547-567.

El Sawy, O.A. [2003]. "The IS Core - The 3 Faces of IS Identity: Connection, Immersion, and Fusion," Communications of the Association for Information Systems, Vol. 12, pp. 588-598.

Watson, H.J., B.H. Wixom, J.A. Hoffer, R. Anderson-Lehman, and A.M. Reynolds [2006]. "Real-time Business Intelligence: Best Practices at Continental Airlines," Information Systems Management, Winter, pp. 7-18.

Watson, H.J., and L. Volonino [2002]. "Customer Relationship Management at Harrah's Entertainment," Decision-Making Support Systems: Achievements and Challenges for the Decade, Forgionne, G.A., J.N.D. Gupta, and M. Mora (eds.), Idea Group Publishing.

Wixom, B.H., and H.J. Watson [2010]. "The BI-Based Organization," International Journal of Business Intelligence Research, January-March, pp. 13-25.


Beyond Business Intelligence
Barry Devlin
Abstract

Barry Devlin, Ph.D., is a founder of the


data warehousing industry and among the
foremost worldwide authorities on business
intelligence and the emerging field of
business insight. He is a widely respected
consultant, lecturer, and author of the seminal
book, Data Warehouse: From Architecture to
Implementation. He is founder and principal
of 9sight Consulting (www.9sight.com).
barry@9sight.com

It has been almost 25 years since the original data warehouse


was conceived. Although the term business intelligence (BI)
has since been introduced, little has changed from the original
architecture. Meanwhile, business needs have expanded dramatically and technology has advanced far beyond what was
ever envisioned in the 1980s. These business and technology
changes are driving a broader and more inclusive view of what
the business needs from IT, not just in BI but across the entire spectrum, from transaction processing to social networking.
If BI is to be at the center of this revolution, we practitioners
must raise our heads above the battlements and propose a new,
inclusive architecture for the future.
Business integrated insight (BI2) is that architecture. This article focuses on the information component of BI2: the business information resource. I introduce a data topography and a new modeling approach that can help data warehouse implementers look beyond the traditional hard information content of BI and consider new ways of addressing such diverse areas as operational BI and (so-called) unstructured content. This is an opportunity to take the next step beyond BI to provide complete business insight.

The Evolution of an Architecture


The first article describing a data warehouse architecture
was published in 1988 in the IBM Systems Journal (Devlin
and Murphy, 1988), based on work in IBM Europe over
the previous three years. At almost 25 years old, data
warehousing might thus be considered venerable. It has
also been successful; almost all of that original architecture is clearly visible in today's approaches.
The structure and main components of that first
warehouse architecture are shown in Figure 1, inverted
to match later bottom-to-top flows but otherwise
unmodified. Despite changes in nomenclature, all
but one of the major components of the modern data


[Figure 1. Data warehouse architecture, 1988]

[Figure 2. The layered data warehouse architecture (Devlin, 1997)]

warehouse architecture appear. The data interface clearly


corresponds to ETL. The business data directory was
later labeled metadata. The absence of data marts is more
apparent than real. The business data warehouse explicitly
described data at different levels of granularity, derivation,
and usageall the characteristics that later defined data
marts. The only missing component, seen only recently
in data warehouses, is enterprise information integration
(EII) or federated access.

A key mutation occurred in the architecture in the


early 1990s. This mutation, shown in Figure 2, split the
singular business data warehouse (and all informational
data) into two horizontal layers, the enterprise data warehouse (EDW) and the data marts, and also vertically split the data mart layer into separate stovepipes of
data for different informational needs. The realignment
was driven largely by the need for better query performance in relational databases. The highly normalized
tables in the EDW usually required extensive and
expensive joins of such tables to answer user queries.
Another driver was slice-and-dice analysis, which is
most easily supported using dimensional models and
even specialized data stores.

Figure 1 is a logical architecture. It shows two distinct


types of data, operational and informational, and
recognizes the fundamental differences between them.
Operational data was the ultimate source of all data
in the warehouse, but was beyond the scope of the
warehouse: fragmented, often unreliable, and in need
of cleansing and conditioning before being loaded. The
warehouse data, on the other hand, was cleansed, consistent, and enterprisewide. This dual view of data informed
how decision support was viewed by both business and IT
since its invention in the 1960s (Power, 2007).


This redrawing of the original, logical architecture


picture has had significant consequences for subsequent
thinking about data warehousing. First was a level of
mental confusion about whether the architecture picture
was supposed to be logical or physical. Such a basic
architectural misunderstanding divides the community


into factions debating the "right" architecture; recall the Inmon versus Kimball battles of the 1990s.


Second, and more important, is the disconnect from


a key requirement of the original architecture: that
decision-support information must be consistent and
integrated across the whole enterprise. When viewed as a
physical picture, Figure 2 can encourage fragmentation of
the information vertically (based on data granularity or
structure) and horizontally (for different organizational/
user needs or divisions). The implication is that data
should be provided to users through separate data stores,
optimized for specific query types, performance needs,
etc. Vendors of data mart tools thus promoted quick
solutions to specific data and analysis needs, paying lip service (at best) to the EDW. In truth, most general-purpose databases struggled to provide the performance
required across all types of queries. The EDW is often
little more than a shunting yard for data on its way to
data marts or a basic repository for predefined reporting.


The third, and more subtle, consequence is that thinking


about logical and physical data models and storage has
also split into two camps. Enterprise architecture focuses
on data consistency and integrity, often assuming that the
model may never be physically instantiated. On the other
hand are solution developers who focus on application
performance at the expense of creating yet more copies of
data. The result is dysfunctional IT organizations where
corporate and departmental factions promote diametrically opposed principles to the detriment of the business
as a whole.
Of course, Figure 2 is not the end of the architecture
evolution. Today's pictures show even more data storage
components. Metadata is split off into a separate layer
or pillar. The EDW is complemented by stores such as
master data management (MDM) and the operational
data store (ODS). Data marts have multiplied into
various types based on usage, function, and data type.
The connectivity of EII has been added in recent years.
In truth, these modern pictures have become more like
graphical inventories of physical components than true
logical architectures; they have begun to look like the
spaghetti diagrams beloved by BI vendors to show the current mess in decision support that will be cured by data warehousing.

This brief review of the evolution of data warehousing poses three questions:

1. After 25 years of changing business needs, do we need a new architecture to meet the current and foreseen business demands?

2. What would a new logical data architecture look like?

3. What new modeling and implementation approaches are needed to move to the new architecture?

What Business Needs from IT in the 21st Century


The concepts of operational BI and unstructured content
analytics point to the most significant changes in what
business expects of IT over the past decade. The former
reflects a huge increase in speed and agility required by
modern business; the latter points to a fundamental shift
in focus by decision makers and a significant expansion in
the scope of their attention.
Speed has become one of the key drivers of business
success today. Decisions or processes that 20 years ago
took days or longer must now be completed in hours or
even minutes. The data required for such activities must
now be up to the minute rather than days or weeks old.
Increasing speed may require eliminating people from
decision making, which drives automation of previously
manual work and echoes the prior automation of blue
collar work. As a result, the focus of data warehousing
has largely shifted from consistency to speed of delivery.
In truth, of course, delivering inconsistent data more
quickly is actually worse in the long term than delivering
it slowly, but this obvious consideration is often conveniently ignored.
As the term operational BI implies, decision making
is being driven into the operational environment by this
trend. Participants from IT in operational BI seminars repeatedly ask: "How is this different from what goes on in the operational systems?" The answer is: not a lot. This
response has profound implications for data warehouse


architecture, disrupting the division that has existed


between operational and informational data since
the 1960s. If BI architects can no longer distinguish
between operational and informational activities, how
will users do so?
Agility, how easily business systems cope with and respond to internal and external change, is a major
driver of evolution in the operational environment.
Current thinking favors service-oriented architecture
(SOA) as a means of allowing rapid and easy modification
of workflows and exchange of business-level services as
business dictates. Such rapid change in the operational
environment creates problems for data loading using
traditional ETL tools with more lengthy development
cycles. On the plus side, the message-oriented interfaces
between SOA services can provide the means to load data
continuously into the warehouse.
Furthermore, the operational-informational boundary
becomes even more blurred as SOA becomes pervasive,
especially as it is envisaged that business users may
directly modify business processes. Users simply do
not distinguish between operational and informational
functions. They require any and all services to operate
seamlessly in a business workflow. In this environment,
the old warehousing belief that operational data is
inconsistent while warehouse data is reliable simply
cannot be maintained. Operational data will have to be
cleansed and made consistent at the source, and as this
occurs, one rationale for the EDW, as the dependable source of consistent data, disappears.
Turning to the growing interest in and importance of
unstructured data, we encounter further fundamental
challenges to our old thinking about decision making
and how to support it. We are constantly reminded of the
near-exponential growth in these data volumes and the
consequent storage and performance problems. However,
this is really not the issue.
The real problem lies in the oxymoron "unstructured data." All data is structured, by definition. Structured data, as it's known, is designed to be internally consistent
and immediately useful to IT systems that record and


analyze largely numerical and categorized information.


Such hard information is modeled and usually stored in
tabular or relational form. Unstructured information,
in reality, has some other structure less amenable to
numerical use or categorization. This soft information
often contains or is related to hard information. For
example, a business order can exist as: (1) a message on
a voicemail system; (2) a scanned, handwritten note; (3)
an e-mail message; (4) an XML document; and (5) a row
in a relational database. As we proceed along this list,
the information becomes harder, that is, more usable by
a computer. On the other hand, we may lose some value
inherent in the softer information: the tone of voice in the
voicemail message may alert a person to the urgency of
the order or some dissatisfaction of the buyer.
Business decision makers, especially at senior levels, have
always used soft information, often from beyond the
enterprise, in their work. Such information was gleaned
from the press and other sources, gathered in conversations with peers and competitors, and grafted together in
face-to-face interactions between team members. Today,
these less-structured decision-making processes are electronically supported and computerized. The basic content
is digitized, stored, and used online. Conversations occur
via e-mail and instant messaging. Conferences are remote
and Web-based.
For data warehousing, as a result, the implications extend
far beyond the volumes of unstructured data that must be
stored. These volumes would pose major problemsthe
viability of copying so much data into the data warehouse
and management of potentially multiple copiesif we
accepted the current architecture. However, of deeper
significance is the question of how soft information and
associated processes can be meaningfully and usefully
integrated with existing hard information and processes.
At its core, this is an architectural question. How can
existing modeling and design approaches for hard
information extend to soft information? Assuming they
can, how can soft information, with its loose and fluid
structure, be mined on the fly for the metadata inherent
in its content? Although these questions are not new,
there is little consensus so far about how this will be

done. As was the case for enterprise data modeling, which matured in tandem with the data warehouse architecture, methods of dealing with soft information will surface as a new architecture for life beyond BI is defined.

[Figure 3. The business integrated insight architecture: the personal action domain, business function assembly, and business information resource layers.]
new architecture for life beyond BI is defined.
In the case of operational BI and SOA, the direction
is clear and the path is emerging: The barrier between
operational and informational data is collapsing, and
improvements in database technology suggest that we
can begin to envisage something of a common store. For
the structured/unstructured divide, the direction is only
now emerging and the path is yet unclear. However,
the direction echoes that for operational/informational
stores: the barriers we have erected between these data
types no longer serve the business. We need to tear
down the walls.

Business Integrated Insight and the Business Information Resource
Business integrated insight (BI2), a new architecture
that shows how to break down the walls, is described
elsewhere (Devlin, 2009). As Figure 3 shows, this is
again a layered architecture, but one where the layers are
information, process, and people, and all information
resides in a single layer.

As seen in the business directions described earlier, a single, consistent, and integrated set of all information used by the organization, from minute-to-minute operations to strategic decision making, is needed. At
its most comprehensive, this comprises every disparate
business data store on every computer in the organization,
all relevant information on business partners' computers,
and all relevant information on the Internet! It includes
in-flight transaction data, operational databases, data
warehouses and data marts, spreadsheets, e-mail repositories, and content stores of all shapes and sizes inside the
business and on the Web.
This article focuses on the business information resource
(BIR), the information layer in BI2, to provide an
expanded and improved view of that component of
Figure 3. The BIR provides a single, logical view of the
entire information foundation of the business that aims
to significantly reduce the physical tendency to separate
and then duplicate data in multiple stores. This BIR is
a unified information space with a conceptual structure
that allows for reasoned decisions about where to draw
boundaries of significant business interest or practical
implementation viability. As business changes or technology evolves, the BIR allows boundaries to change in
response without reinventing the logical architecture
or defining new physical components to simply store
alternative representations of the same information.



The structure of the BIR is based on data topography, with a set of three continuously variable axes characterizing the data space. Data topography refers to the type and use of data in a general sense: easy to recognize but often difficult to define. This corresponds to physical topography, where most people can easily recognize a hill or a mountain when they see one, but formal definitions of the difference between them seldom make much sense. Similarly, most business or IT professionals can distinguish between hard and soft information as discussed earlier, but creating definitions of the two and drawing a boundary between them can be problematic.

The three axes of data topography, as shown in Figure 4, provide business and IT with a common language to understand information needs and technological possibilities and constraints. Placing data elements or sets along the axes of the data space defines their business usage and directs us to the appropriate technology.

[Figure 4. The axes of the business information resource: timeliness/consistency (in-flight, live, stable, reconciled, historical), knowledge density (atomic, derived, compound, multiplex), and reliance/usage (vague, personal, local, enterprise, global)]

The Timeliness/Consistency Axis
The timeliness/consistency (TC) axis defines the time period over which data validly exists and its level of consistency with logically related data. These two factors reside on the same axis because there is a distinct, and often difficult, inverse technical relationship between them. From left to right, timeliness moves from data that is ephemeral to eternal; consistency moves from standalone to consistent, integrated data. When data is very timely (i.e., close to real time), ensuring consistency between related data items can be challenging. As timeliness is relaxed, consistency is more easily ensured. Satisfying a business need for high consistency in near-real-time data can be technically challenging and ultimately very expensive.

Along this axis, in-flight data consists of messages on the wire or the enterprise service bus; data is valid only at the instant it passes by. This data-in-motion might be processed, used, and discarded. However, it is normally recorded somewhere, at which stage it becomes live. Live data has a limited period of validity and is subject to continuous change. It also is not necessarily completely consistent with other live data. That is the characteristic of stable and reconciled data, which are stable over the medium term. In addition to its stability, reconciled data is also internally consistent in meaning and timing. Historical data is where the period of validity and consistency is, in principle, forever.

The TC axis broadly mirrors the lifecycle of data from


creation through use to disposal or archival. Within its
lifecycle, data traverses the TC axis from left to right,
although some individual data items may traverse only
part of the axis or may be transformed en route. A
financial transaction, for example, starts life in-flight and
exists unchanged right across the axis to the historical
view. On the other hand, customer information usually
appears first in live data, often in inconsistent subsets that
are transformed into a single set of reconciled data and
further expanded with validity time frame data in the
historical stage.
It is vital to note that this axis (like the others) is a
continuum. The words in-flight, live, and so on denote
broad phases in the continuous progression of timeliness
from shorter to longer periods of validity and consistency
from less- to more-easily achieved. They are not discrete
categories of data. Nor are there five data layers between


which data must be copied and transformed. They


represent broad, descriptive levels of data timeliness and
consistency against which business needs and technical
implementation can be judged. Placing data at the left
end of the axis emphasizes the need for timeliness; at the
right end, consistency is more important.
It should be clear that the TC axis is the primary one
along which data warehousing has traditionally operated.
The current architecture splits data along this axis into
discrete layers, assigning separate physical storage to each
layer and distributing responsibility for the layers across
the organization. Reuniting these layers, at first logically
and perhaps eventually physically, is a key aim of BI2.
The Knowledge Density Axis
The knowledge density (KD) axis shows the amount
of knowledge contained in a single data instance and
reflects the ease with which meaning can be discerned in
information. In principle, this measure could be numerical. For example, a single data item, such as Order Item
Quantity, contains a single piece of information, while
another data item, such as a Sales Contract, contains
multiple pieces of information. In practice, however,
counting and agreeing on information elements in more
complex data items is difficult and, as with the TC axis,
the KD axis is more easily described in terms of general,
loosely bounded classes.
At the lowest density level is atomic data, containing a
single piece of information (or fact) per data item. Atomic
data is extensively modeled and is most often structured
according to the relational model. It is the most basic and
simple form of data, and the most amenable to traditional
(numerical) computer processing. The modeling process
generates the separate descriptions of the data (the
metadata) without which the actual data is meaningless. At the next level of density is derived data, which
typically consists of multiple occurrences of atomic data
that have been manipulated in some way. Such data may
be derived or summarized from atomic data; the latter
process may result in data loss. Derived data is usually
largely modeled, and the metadata is also separate from
the data itself.

Compound data is the third broad class on the KD axis


and refers to XML and similar data structures, where the
descriptive metadata has been included (at least in part)
with the data and where the combined data and metadata
is stored in more complex or hierarchical structures.
These structures may be modeled, but their inherent flexibility allows for less rigorous implementation. Although
well suited to SOA and Web services approaches, such
looseness can impact internal consistency and cause
problems when combining with atomic or derived data.
The final class is multiplex data, which includes documents,
general content, image, video, and all sorts of binary large
object (BLOB) data. In such data, much of the metadata
about the meaning of the content is often implicit in the
content itself. For example, in an e-mail message, the To:
and From: fields clearly identify recipient and sender, but
we need to apply judgment to the content of the fields and
even the message itself to decide whether the sender is a
person or an automated process.
This axis allows us to deal with the concepts of hard and
soft information mentioned earlier. The KD axis also
relates to the much-abused terms structured, semistructured, and unstructured. Placing information on
this axis is increasingly important in modern business
as more soft information is used. Given that such data
makes up 80 percent or more of all stored data, it makes
sense that much useful information can be found here,
for example, by text mining and automated modeling
tools. Just as we have traditionally transformed and
moved information along the TC axis in data warehousing, we now face decisions about whether and how to
transform and move data along the KD axis. In this case,
the direction of movement is likely to be from multiplex
to compound, with further refinement into atomic or
derived. The challenge is to do so with minimal copying.
The Reliance / Usage Axis
The final axis, reliance/usage (RU), has been largely
ignored in traditional data warehousing, which confines
itself to centrally managed and allegedly dependable
data. However, the widespread use of personal data, such
as spreadsheets, has always been problematic for data
management (Eckerson and Sherman, 2008). Similarly,


data increasingly arrives from external sources: from


trusted business partners all the way to the "world wild west" of the Internet. All this unmanaged and undependable information plays an increasingly important role in
running a business. It is becoming clear that centrally
managed and certified information is only a fraction of
the information resource of any business.
The RU axis, therefore, classifies information according to
how much faith can be placed in it and the uses to which
it can be put. Global and enterprise information is strongly
managed, either at an enterprise level or more widely
by government, industry, or other regulatory bodies. It
adheres to a well-defined and controlled information
model, is highly consistent, and may be subject to audit.
By definition, reconciled and historical information fall
into these classes. Local information is also strongly managed, but only within a departmental or similar scope.
Internal operational systems, with their long history of
management and auditability, usually contain local or
enterprise-class data. Information produced and managed
by a single individual is personal and can be relied upon
and used only within a very limited scope. A collaborative
effort by a group of individuals produces information of
higher reliability and wider usage and thus has a higher
position on the RU axis.
Vague information is the most unreliable and poorly
controlled. Internet information is vague, requiring
validation and verification before use. Information from
other external sources, such as business partners, has
varying levels of reliability and usage.
The placement of information on this axis and the
definition of rules and methods for handling different
levels of reliance and usage are topics that are still in their
infancy, but they will become increasingly important
as the volumes and value of less closely managed and
controlled data grow.
A Note about Metadata
The tendency of prior data warehouse architectures to carve
up business information is also evident in their positioning
of metadata as a separate layer or pillar. Such separation
was always somewhat arbitrary and is no longer reasonable.


We have probably all encountered debates about whether


timestamps, for example, are business data or metadata.
This new architecture places metadata firmly and fully in
the business information resource for three key reasons.
First, as discussed earlier, metadata is actually embedded in
the compound and multiplex information classes by definition. Second, metadata is highly valuable and useful to the
business. This is obvious for business metadata, but even
so-called technical metadata is often used by power users
and business analysts as they search for innovative ways
to combine and use existing data. Third, as SOA exposes
business services to users, their metadata will become
increasingly important in creating workflows. Integrating
metadata into the BIR simply makes life easier for business
and IT alike. Metadata, when extracted from business
information, resides in the compound data class.

Introducing Data Space Modeling and Implementation


The data topography and data space described above
recognize and describe a fact of life for the vast majority
of modern business processes: Any particular business
process (or, in many cases, a specific task) requires
information that is distributed over the data space. A call
center, for example, uses live, stable, and historical data
along the TC axis; atomic, derived, and multiplex data
along the KD axis; and local and enterprise data on the
RU axis, as shown in Figure 5.
Although this data space illustration provides a valuable
visual representation of the data needs of the process
and their inherent complexity, a more formal method of
describing the data relationships is required to support
practical implementation: data space modeling. Its aim
is to create a data model beyond the traditional scope
of hard information. Data space modeling includes soft
information and describes the data relationships that exist
within and across all data elements used by a process or
task, irrespective of where they reside in the data space.
To do this, I introduce a new modeling construct, the
information nugget, and propose that a new, dynamic
approach to modeling is needed, especially for soft
information. It should be noted that much work remains
to bring data space modeling to fruition.
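As a rough illustration of what placing information in the data space might look like in practice, the sketch below tags a few call center data elements with positions on the three axes and bundles them for one task. All class, element, and value names here are hypothetical; the article defines the concepts, not this particular representation.

# Hypothetical sketch of the data space described above; the axis values mirror
# the broad classes named in the article, but the representation is illustrative.

from dataclasses import dataclass
from enum import Enum

class TC(Enum):          # timeliness/consistency
    IN_FLIGHT = 1; LIVE = 2; STABLE = 3; RECONCILED = 4; HISTORICAL = 5

class KD(Enum):          # knowledge density
    ATOMIC = 1; DERIVED = 2; COMPOUND = 3; MULTIPLEX = 4

class RU(Enum):          # reliance/usage
    VAGUE = 1; PERSONAL = 2; LOCAL = 3; ENTERPRISE = 4; GLOBAL = 5

@dataclass
class DataElement:
    name: str
    tc: TC
    kd: KD
    ru: RU

# A call-center task draws on elements spread across the data space.
call_center_bundle = [
    DataElement("open_order_status", TC.LIVE,       KD.ATOMIC,    RU.LOCAL),
    DataElement("customer_history",  TC.HISTORICAL, KD.DERIVED,   RU.ENTERPRISE),
    DataElement("complaint_email",   TC.STABLE,     KD.MULTIPLEX, RU.LOCAL),
]

# Placement along the axes can drive simple implementation decisions, e.g.,
# which elements might be candidates for federated (EII) access rather than copying.
federate = [e.name for e in call_center_bundle
            if e.kd is KD.MULTIPLEX or e.tc is TC.LIVE]
print(federate)

The point of such a representation is not the code itself but that business and IT share one vocabulary for where each element sits, and hence how it should be stored and accessed.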


[Figure 5. Sample data space mapping for the call center process]

The Information Nugget
An information nugget is the smallest set of related data (wherever it resides in or is distributed through the data space) that is of value to a business user in a particular context. It is the information equivalent of an SOA service, also defined in terms of the smallest piece of business function from a user viewpoint. An information nugget can thus be as small as a single record when dealing with an individual transaction or as large as an array of data sets used by a business process at a particular time. As with SOA services, information nuggets may be composed of smaller nuggets or be part of many larger nuggets. They are thus granular, reusable, modular, composable, and interoperable. They often span traditional information types.

As modeled, an information nugget exists only once in the BIR, although it may be widely dispersed along the three axes. At a physical level, it ideally maps to a single data instantiation, although the usual technology performance and access constraints may require some duplication. However, the purpose of this new modeling concept is to ensure that information, as seen by business users, is uniquely and directly related to its use, while minimizing the level of physical data redundancy. When implemented, the information nugget leads to rational decisions about when and how data should be duplicated and to what extent federation/EII approaches can be used.

Modeling Soft Information
Traditional information modeling approaches focus on (and, indeed, define and create) hard information. It is a relatively small step from such traditional modeling to envision how the relationships between multiple sets of hard information used in a particular task can be represented through simple extensions of existing models to describe information nuggets. The real problem arises with soft information, particularly that represented by the multiplex data class on the KD axis. Such data elements are most often modeled simply as text or object entities at the highest level, with no recognition that more fundamental data elements exist within these high-level entities.

Returning to the call center example, consider the customer complaint information that is vital to interactions between agents and customers. When such information arrives in the form of an e-mail or voicemail message from the customer, we can be sure that within the content exists real, valuable, detailed information including product name, type of defect, failure conditions, where purchased, name of customer, etc. In order to relate such information to other data of interest, we must model the complaint information (multiplex data) at a lower level, internal to the usual text or object class.

Such modeling must recognize and handle two characteristics of soft information. First is the level of uncertainty about the information content and our ability to recognize the data items and values contained therein. For example, "the clutch failed when climbing a hill" and "I lost the clutch going up the St. Gotthard Pass" contain the same information about the conditions of a clutch failure, but may be difficult to recognize immediately. Second, because soft information may contain lower-level information elements in different instances of the same text/object entity, each instance must be individually modeled on the fly as it arrives in the store.


Automated text mining and semantic and structural


analysis are key components in soft information modeling
given the volumes and variety of information involved. Such
tools essentially extract the tacit metadata from multiplex
data and store it in a usable form. This enables multiplex
data to be used in combination with the simpler atomic, reconciled, and derived classes on the KD axis. By storing this
metadata in the BIR and using it as pointers to the actual
multiplex data, we can avoid the need to transform, extract,
and copy vast quantities of soft information into traditional
warehouse data stores. We may also decide to extract certain
key elements for performance or referential integrity needs.
The important point is that we need to automatically model
soft information at a lower level of detail to enable such
decisions and to use this information class fully.
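A minimal sketch of that idea follows: it pulls a few likely hard elements out of a complaint message and stores only this extracted metadata plus a pointer back to the original content, rather than copying the message into the warehouse. The patterns, vocabulary, and field names are assumptions for illustration; production text mining tools would use far richer semantic and structural analysis.

# Illustrative only: naive pattern matching stands in for the text mining and
# semantic analysis tools discussed above; vocabulary and field names are assumptions.

import re

PRODUCTS = {"clutch", "gearbox", "battery"}   # hypothetical product vocabulary

def extract_tacit_metadata(message_id: str, text: str) -> dict:
    """Extract a few hard elements and keep a pointer to the soft source."""
    lowered = text.lower()
    products = sorted(p for p in PRODUCTS if p in lowered)
    # Very rough failure-condition cue: a clause beginning with "when ..."
    condition = re.search(r"when ([^.,;]+)", lowered)
    return {
        "source_pointer": message_id,   # where the multiplex data itself lives
        "products": products,           # atomic elements recognized in the text
        "failure_condition": condition.group(1).strip() if condition else None,
    }

msg = "The clutch failed when climbing a hill near the St. Gotthard Pass."
print(extract_tacit_metadata("email:2010-04-17:0042", msg))

Only the small dictionary of extracted metadata and the pointer would be stored in the BIR; the message itself stays where it is, which is the copy-avoidance point made above.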

Conclusions
This article posed three questions: (1) Do we need a new architecture for data warehousing after 25 years of evolution of business needs and technology? (2) If so, what would such an architecture look like? and (3) What new approaches would we need to implement it? The answers are clear.

1. Business needs and technology have evolved dramatically since the first warehouse architecture. Speed of response, agility in the face of change, and a significantly wider information scope for all aspects of the business demand a new, extensive level of information and process integration beyond any previously attempted. We need a new data warehouse architecture as well as a new enterprise IT architecture of which data warehousing is one key part.

2. Business integrated insight (BI2) is a proposed new architecture that addresses these needs while taking into account current trends in technology. It is an architecture with three layers: information, process, and people. Contrary to the traditional data warehouse approach, all information is placed in a single layer, the business information resource, to emphasize the comprehensive integration of information needed and the aim to eliminate duplication of data.

3. An initial step toward implementing this architecture is to describe and model a new topography of data based on broad types and uses of information. A data space mapped along three axes is proposed and a new modeling concept, the information nugget, introduced. The architecture also requires dynamic, in-flight modeling, particularly of soft information, to handle the expanded data scope.

Although seemingly of enormous breadth and impact, the BI2 architecture builds directly on current knowledge and technology. Prior work to diligently model and implement a true enterprise data warehouse will contribute greatly to this important next step beyond BI to meet future enterprise needs for complete business insight.

References
Devlin, B. [1997]. Data Warehouse: From Architecture to Implementation, Addison-Wesley.

Devlin, B. [2009]. "Business Integrated Insight (BI2): Reinventing enterprise information management," white paper, September. http://www.9sight.com/bi2_white_paper.pdf

Devlin, B., and P.T. Murphy [1988]. "An architecture for a business and information system," IBM Systems Journal, Vol. 27, No. 1, p. 60.

Eckerson, W.W., and R.P. Sherman [2008]. "Strategies for Managing Spreadmarts: Migrating to a Managed BI Environment," TDWI Best Practices Report, Q1. http://tdwi.org/research/2008/04/strategies-for-managing-spreadmarts-migrating-to-a-managed-bi-environment.aspx

Power, D.J. [2007]. "A Brief History of Decision Support Systems," v 4.0, March 10. http://dssresources.com/history/dsshistory.html


Advances in Predictive Modeling: How In-Database Analytics Will Evolve to Change the Game
Sule Balkan and Michael Goul
Abstract

Sule Balkan is clinical assistant


professor at Arizona State University,
department of information systems.
sule.balkan@asu.edu

Organizations using predictive modeling will benefit from


recent efforts in in-database analytics, especially when they
become mainstream, and after the advantages evolve over
time as adoption of these analytics grows. This article posits
that most benefits will remain under-realized until campaigns
apply and adapt these enhancements for improved productivity. Campaign managers and analysts will fashion in-database
analytics (in conjunction with their database experts) to support their most important and arduous day-to-day activities. In
this article, we review issues related to building and deploying
analytics with an eye toward how in-database solutions
advance the technology. We conclude with a discussion of how
analysts will benefit when they take advantage of the tighter
coupling of databases and predictive analytics tool suites,
particularly in end-to-end campaign management.

Michael Goul is professor and chair at Arizona State University, department of information systems. michael.goul@asu.edu

Introduction

Decoupling data management from applications has


provided significant advantages, mostly related to data
independence. It is therefore surprising that many vendors
are more tightly coupling databases and data warehouses
with tool suites that support business intelligence (BI)
analysts who construct and manage predictive models.
These analysts and their teams construct and deploy models
for guiding campaigns in areas such as marketing, fraud
detection, and credit scoring, where unknown business
patterns and/or inefficiencies can be discovered.
In-database analytics includes the embedding of
predictive modeling functionalities into databases or data
warehouses. It differs from in-memory analytics, which is


designed to minimize disk access. In-database analytics focuses on the movement of data between the database or data warehouse and analysts' workbenches. In the simplest form of in-database analytics, the computation of aggregates such as average, variance, and other statistical summaries can be performed by parallel database engines quickly and efficiently, especially in contrast to
performing computations inside an analytics tool suite
with comparatively slow file management systems. In
tightly coupled environments, those aggregates can be
passed from the data engine to the predictive modeling
tool suite when building analytical models such as statistical regression models, decision trees, and even neural
networks. In-database analytics also enable streamlining
of modeling processes.
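To make this concrete, consider a minimal sketch, our own illustration rather than any vendor's tool suite, in which the aggregation is pushed down to the database engine and only the summary rows travel back to the modeling environment. SQLite and the table and column names here are stand-ins:

    import sqlite3

    # SQLite stands in for a parallel database or data warehouse engine.
    conn = sqlite3.connect("warehouse.db")

    # Push the aggregation into the database: only one summary row per
    # customer segment travels back to the modeling tool, not the raw
    # transactions themselves.
    query = """
        SELECT segment,
               COUNT(*)             AS n_transactions,
               AVG(purchase_amount) AS avg_purchase,
               AVG(purchase_amount * purchase_amount)
                 - AVG(purchase_amount) * AVG(purchase_amount) AS variance
        FROM transactions
        GROUP BY segment
    """
    aggregates = conn.execute(query).fetchall()

    # The predictive modeling tool suite can now build regression, tree,
    # or neural network models from these compact summaries.
    for segment, n, avg_purchase, variance in aggregates:
        print(segment, n, avg_purchase, variance)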
The typical modeling processes referred to as CRISP-DM,
SEMMA, and KDD contain common BI steps or phases.
Knowledge Discovery in Databases (KDD) refers to the
broad process of finding knowledge using data mining
(DM) methods (Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, 1996). KDD relies on using a database
along with any required preprocessing, sub-sampling, and
transformation of values in that database. Another version
of a DM process approach was developed by SAS Institute:
Sample, Explore, Modify, Model, Assess (SEMMA) refers
to the lifecycle of conducting a DM project.
Another approach, CRISP-DM, was developed by a
consortium of Daimler Chrysler, SPSS, and NCR. It stands
for CRoss-Industry Standard Process for Data Mining,
and its cycle has six stages: business understanding, data
understanding, data preparation, modeling, evaluation,
and deployment (Azevedo and Santos, 2008). All three
methodologies address data mining processes. Even though
the three methodologies are different, their common
objective is to produce BI by guiding the construction of
predictive models based on historical data.
A traditional way of discussing methodologies for predictive analytics involves a sense, assess, and respond cycle
that organizations and managers should apply in making
effective decisions (Houghton, El Sawy, Gray, Donegan,
and Joshi, 2004). Using historical data to enable managers
to sense what is happening in the environment has been the


foundation of the recent thrust to vitalize evidence-based


management (Pfeffer and Sutton, 2006). Predictive models
help managers assess and respond to the environment in
ways that are informed by historical data and the patterns
within that data. Predictive models help to scale responses
because, for example, scoring models can be constructed
to enable the embedding of decision rules into business
processes. In-database analytics can streamline elements of
the sense, assess, and respond cycle beyond those steps or
phases in KDD, SEMMA, and CRISP-DM.
This article explains how basic in-database analytics
will advance predictive modeling processes. However,
we argue that the most important advancements will
be discovered when actual campaigns are orchestrated
and campaign managers access the new, more tightly
coupled predictive modeling tool suites and database/data
warehouse engines. We assert that the most important
practical contribution of in-database analytics will occur
when analysts are under pressure to produce models
within time-constrained campaigns, and performances
from earlier campaign steps need to be incorporated to
inform follow-up campaign steps.
The next section discusses current impediments to predictive analytics and how in-database analytics will attempt
to address them. We also discuss the benefits to be realized
after more tightly coupled predictive analytics tool suites
and databases/data warehouses become widely available.
These benefits will be game-changers and will occur in such
areas as end-to-end campaign management.

What is Wrong with Current


Predictive Analytics Tool Suites?
Current analytics solutions require many steps and take
a great deal of time. For analysts who build, maintain,
deploy, and track predictive models, the process consists
of many distributed processes (distributed among
analysts, tool suites, and so on). This section discusses
challenges that analysts face when building and deploying
predictive models.
Time-Consuming Processes
To build a predictive model, an analyst may have to tap
into many different data sources. Data sources must contain known values for target variables in order to be used when constructing a predictive model. All the attributes that might be independent variables in a model may reside in different tables or even different databases. It takes time and effort to collect and synthesize this data.

Figure 1. SEMMA methodology supported by the SAS Enterprise Miner environment: Sample (input data, sampling, data partition); Explore (ranks, plots, variable selection); Modify (transform variables, filter outliers, missing-value imputation); Model (regression, tree, neural network); Assess (assessment, score, report).
Once all of the needed data is merged, each of the independent variables is evaluated to ascertain the relations,
correlations, patterns, and transformations that will be
required. However, most of the data is not ready to be
analyzed unless it has been appropriately customized. For
example, character variables such as gender need to be
manipulated, as do numeric variables such as ZIP code.
Some continuous variables may need to be converted into
scales. After all of this preparation, the modeling process
continues through one of the many methodologies such as
KDD, CRISP-DM, or SEMMA. For our purposes in this
article, we will use SEMMA (see Figure 1).
The first step of SEMMA is data sampling and data
partitioning. A random sample is drawn from a population to prevent bias in the model that will be developed.
Then, a modeling data set is partitioned into training and
validation data sets. Next is the Explore phase, where each
explanatory variable is evaluated and its associations with
other variables are analyzed. This is a time-consuming step,
especially if the problem at hand requires evaluating many
independent variables.
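As a simple, hedged illustration of the Sample step (the file and column names are hypothetical, and pandas stands in for the analyst's workbench), a modeling data set might be sampled and partitioned as follows:

    import pandas as pd

    # Hypothetical modeling data set with known target values.
    population = pd.read_csv("modeling_data.csv")

    # Sample: draw a random subset of the population.
    sample = population.sample(frac=0.10, random_state=42)

    # Partition: roughly 70 percent training, 30 percent validation.
    training = sample.sample(frac=0.70, random_state=42)
    validation = sample.drop(training.index)

    print(len(training), "training rows;", len(validation), "validation rows")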
In the Modify phase, variables are transformed; outliers
are identified and filtered; and for those variables that are
not fully populated, missing value imputation strategies
are determined. Rectifying and consolidating different
analysts' perspectives with respect to the Modify phase
can be arduous and confusing. In addition, when applying
transformations and inserting missing values in large data

sets, a tool suite must apply operations to all observations


and then store the resulting transformations within the tool
suite's file management system.
Many techniques can be used in the Model phase of
SEMMA, such as regression analysis, decision trees, and
neural networks. In constructing models, many tool suites
suffer from slow file management systems, which can
constrain the number and quality of models that an analyst
can realistically construct.
The last phase of SEMMA is the Assess phase, where all
models built in the modeling phase are assessed based
on validation results. This process is handled within
tool suites, and it takes considerable time and many
steps to complete.
Multiple Versions and Sources of the Truth
Another difficulty in building and maintaining predictive
models, especially in terms of campaign management,
is the risk that modelers may be basing their analysis on
multiple versions and sources of data. That base data is
often referred to as "the truth," and the problem is often referred to as having "multiple versions of the truth."
To complete the time-consuming tasks of building
predictive models as just described, each modeler extracts
data from a data warehouse into an analytics workstation.
This may create a situation where different modelers are
working from different sources of truth, as modelers
might extract data snapshots at different times (Gray and
Watson, 2007). Also, having multiple modelers working on
different predictive models can mean that each modeler is
analyzing the data and creating different transformations
from the same raw data without adopting a standardized
method or a naming convention. This makes deploying


multiple models very difficult, as the same raw data may


be transformed in different ways using different naming
conventions. It also makes transferring or sharing models
across different business areas challenging.
Another difficulty relates to the computing resources on
each modeler's workbench when multiple modelers are
going through similar, redundant steps of data preparation, transformation, segmentation, scoring, and all the
other functions that can take a great deal of disk space
and CPU time.
The Challenges of Leveraging Unstructured Data and Web
Data Mining in Modeling Environments
Modelers often tap into readily available raw data in the
database or data warehouse. However, unstructured data
is rarely used during these phases because handling data
in the form of text, e-mail documents, and images is
computationally difficult and time consuming. Converting unstructured data into information is costly in a
campaign management environment, so it isn't often
done. The challenges of creating reusable and repeatable
variables for deployment make using unstructured data
even more difficult.
Web data mining spiders and crawlers are often used
to gather unstructured data. Current analyst tool suite
processes for unstructured data require that modelers
understand archaic processing commands expressed in
specialized, non-standard syntax. There are impediments
to both gathering and manipulating unstructured data,
and there are difficulties in capturing and applying
predictive models that deal with unstructured data. For
example, clustering models may facilitate identifying rules
for detecting what cluster a new document is most closely
aligned with. However, exporting that clustering rule from
the predictive modeling workbench into a production
environment is very difficult.
Managing BI Knowledge Worker Training and
Standardization of Processes
In most organizations, there is a centralized BI group that
builds, maintains, and deploys multiple predictive models
for different business units. This creates economies of scale,
because having a centralized BI group is definitely more


cost effective than the alternative. However, the economies


of scale do not cascade into standardization of processes
among analyst teams. Each individual contributor usually
ends up with customized versions of code. Analysts may
not be aware of the latest constructs others have advanced.

What Basic Changes Will In-Database


Analytics Foster?
In-database analytics' major advantage is the efficiencies
it brings to predictive model construction processes due
to processing speeds made possible by harnessing parallel
database/warehouse engine capabilities. Time savings are
generated in the completion of computationally intensive
modeling tasks. Faster transformations, missing-value
imputations, model building, and assessment operations
create opportunities by leaving more time available
for fine-tuning model portfolios. Thanks to increasing
cooperation between database/warehouse experts and
predictive modeling practitioners, issues associated with
non-standardized metadata may also be addressed. In
addition, there is enhanced support for analyses of very
large data sets. This couldn't come at a better time, because
data volumes are always growing.
In-database analytics make it easier to process and use
unstructured data by converting complicated statistical processes into manageable queries. Tapping into
unstructured data and creating repeatable and reusable
information, and combining this into the model-building process, may aid in constructing much better predictive
models. For example, moving clustering rules into the
database eliminates the difficulty of exporting these rules to
and from tool suites. It also eliminates most temporary data
storage difficulties for analyst workbenches.
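For example, a clustering rule developed in a tool suite might reduce to an expression that can run where the data lives. The following sketch is illustrative only; the table, columns, and thresholds are assumptions, and SQLite stands in for the warehouse engine:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # stand-in for the warehouse engine

    # A clustering rule rewritten as SQL so documents are scored where they
    # already reside instead of being exported to an analyst workbench.
    conn.execute("""
        UPDATE documents
        SET cluster_id = CASE
            WHEN term_freq_support > 0.5 AND term_freq_billing < 0.2 THEN 1
            WHEN term_freq_billing >= 0.2                            THEN 2
            ELSE 3
        END
    """)
    conn.commit()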
Shared environments created by in-database analytics may
bring business units together under common goals. As
different business units tap into the same raw data, including all possible versions of transformations and metadata,
productivity can be enhanced. When new ways of building
models are available, updates can be made in-database.
All individual contributors have access to the latest
developments, and no single business unit or individual
is left behind. Saving time in the labor-intensive steps of
model building, working from a single source of truth,

having access to repeatable and reusable structured and unstructured data, and making sure all the business units are working with the same standards and updates all make it easier to transfer knowledge as new analysts join or move across business units. Table 1 summarizes the preliminary benefits of in-database analytics for modelers.

Process | Benefits
Data set creation and preparation | Reduce cycle time by parallel-processing multiple functions; accurate and timely completion of tasks by functional embedding
Data processing and model building by multiple analysts | Eliminate multiple versions of truth and large data set movements to and from analytical tool suites
Unstructured data management | Broaden analytics capability by streamlining repeatability and reusability
Training and standardization | Create operational and analytical efficiencies; access to latest developments; automatically update metadata

Table 1. Preliminary benefits of in-database analytics

Context for In-Database Analytics Innovation

To drive measurable business results from predictive models, SEMMA (or a similar methodology) is followed by a deployment cycle. That cycle may involve the continued application of models in a (recurring) campaign, refinement when model performance results are used to revise other models, making decisions on whether completely new models are required given model performance, and so on. We distinguish deployment from the SEMMA-supported phase (intelligence) because deployment often engages the broader organization and requires a predictive model (or models) to be put into actual business use. This section introduces a new methodology we created to describe deployment: DEEPER (Design, Embed, Empower, Performance-measurement, Evaluate, and Re-target). Figure 2 depicts the iterative relationship between SEMMA and DEEPER.

Figure 2. DEEPER phases guide the deployment, adoption, evaluation, and recalibration of predictive models. The outer deployment cycle (Design, Embed, Empower, Performance measurement, Evaluate, Re-target) surrounds the inner SEMMA intelligence cycle (Sample, Explore, Modify, Model, Assess).

The DEEPER phases delineate, in sequential fashion, the types of activities involved in model deployment with a special emphasis on campaign management. The design phase involves making plans for how to transition a scoring model (or models) from the tool suite (where it was developed) to actual application in a business context. It also involves thinking about how to capture the results of applying the model and storing those results for subsequent analysis. There may also be other data that a campaign manager wishes to capture, such as the time taken before seeing a response from a target. A proper design can eliminate missteps in a campaign. For example, if a targeted catalog mailing is enabled by a scoring model developed using SEMMA, then users must choose which deciles to target first, how to capture the results of the campaign (e.g., actual purchases or requests for new service), and what new data might be appropriate to capture during the campaign.

Once designed, the model must be accurately embedded


into business processes. Model score views must be secured;
developers must ensure scores appear in user interfaces at
the right time; and process managers must be able to insert
scores into automated business process logic. Embedding
a predictive model may require safeguards for exceptions: if there are exceptions to the application of a model, other safeguards need to be considered.
Making the results of a predictive model (e.g., a score)
available to people and systems is just the first step in
ensuring it is used. In the empower phase, employees
may need to be trained to interpret model results; they
may have to learn to look at data in a certain way using
new interfaces; or they may need to learn the benefits of
evidence-based management approaches as supported by
predictive modeling. Similarly, if people are involved, testing may be required to ensure that training approaches are
working as intended. The empower step ensures appropriate
behaviors by both systems and people as they pertain to the
embedding of the predictive model into business processes.
A campaign begins in earnest after the empower phase.
Targets receive their model-prescribed treatments, and
reactions are collected as planned for in the design phase
of DEEPER. This reactions-directed phase, performance
measurement, involves ensuring the reactions and events
subsequent to a predictive models application are captured
and stored for later analysis. The results may also be
captured and made available in real-time support for
campaign managers. Dashboards may be appropriate for
monitoring campaign progress, and alerts may support
managers in making corrections should a campaign stray
from an intended path. If there is an anomaly, or when a
campaign has reached a checkpoint, campaign managers
take time to evaluate the effectiveness or current progress of
the campaign. The objective is to address questions such as:

Are error levels acceptable?

Were campaign results worth the investment in the predictive analytics solution?

How is actual behavior different from predicted behavior for a model or a model decile?

This is the phase when the campaign's effectiveness and


current progress are assessed.
The results of the evaluate phase of DEEPER may lead to a
completely new modeling effort. This is depicted in Figure
2 by the gray background arrow leading from evaluate to
the sample phase of SEMMA. This implies a transition
from deployment back to what we have referred to as
intelligence. However, there is not always time to return
to the intelligence cycle, and minor alterations to a model
might be deemed more appropriate than starting over. The
latter decision is most prevalent in time-pressured, recurring campaigns. We refer to this phase as re-target, which
requires analysts to take into account new information
gathered as part of the performance-measurement deployment phase. It also takes advantage of the plans for how
this response information was encoded per the design phase
of deployment.
The most important consideration involves interpreting
results from the campaign and managing non-performing
targets. A non-performing target is one that scored high in
a predictive model, for example, but that did not respond
as predicted. In a recurring campaign, there may be an
effort to re-target that subset. There could also be an effort
to re-target the campaign to another set of targets, e.g.,
those initially scored into other deciles. Re-targeting can
be a time-consuming process; new data sets with response
results need to be made available to predictive modeling
tool suites, and findings from tracking need to be incorporated into decisions.
DEEPER provides the context for considering how
improvements to in-database analytics can be game-changers. In-database analytics can make significant inroads to
DEEPER processes that take time and are under-supported
by predictive modeling tool suites. However, these improvements will be driven by analysts who work closely with
their organizations database experts. This combination
of analyst and data management skills, experience, and
knowledge will spur innovation significantly beyond
current expectations.


How Might In-Database Analytics


for DEEPER Evolve?
Extending in-database analytics to DEEPER processes
requires considering how each DEEPER phase might be
streamlined given tighter coupling between predictive
modeling tool suites and databases/data warehouses.
Although many of the advantages of this tighter coupling
may be realized differently by different organizations, there
are generic value streams to guide efforts. Here the phrase
value stream refers to process flows central to DEEPER.
This section discusses these generic value streams: (1)
intelligence-to-plan, (2) plan-to-implementation, (3)
implementation-to-use, (4) use-to-results, (5) results-to-evaluation, and (6) evaluation-to-decision.
In the design phase of DEEPER, planning can be facilitated by examining possible end-user database views that
could be augmented with predictive intelligence. Instead
of creating new interfaces, it is possible that Web pages
equipped with embedded database queries can quickly
retrieve and display predictive model scores to decision
makers or front-line employees. Many of these displays are
already incorporated into business processes, so opportunities to use the tables and queries to supply model results can
streamline implementation. When additional data items
need to be captured, that data may be captured at the point
of sale or other customer touch points. A review of current
metadata may speed up the design of a suitable deployment
strategy. In addition to pushing model intelligence to
interfaces, there may also be ways of pulling data from
the database/warehouse to facilitate re-targeting or for
initiating new SEMMA cycles.
For example, it may be possible to design queries to
automate the retrieval of data items such as target response
times from operational data stores. Similarly, it may be
possible to use SQL to aggregate the information needed
for this type of next-step analysis. For example, total
sales to a customer within a specified time period can be
aggregated using a query and then used in the re-targeting
phase to reflect whether a target performed as predicted.
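A hedged sketch of such a query (table and column names are assumptions) might aggregate each target's purchases during the campaign window so they can be compared with the behavior the model predicted:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # stand-in for the warehouse engine

    # Total sales per targeted customer during the campaign window, joined
    # to the predicted decile so re-targeting can compare actual behavior
    # with what the model expected.
    rows = conn.execute("""
        SELECT t.customer_id,
               t.predicted_decile,
               COALESCE(SUM(s.sale_amount), 0) AS actual_sales
        FROM campaign_targets AS t
        LEFT JOIN sales AS s
               ON s.customer_id = t.customer_id
              AND s.sale_date BETWEEN '2010-01-01' AND '2010-03-31'
        GROUP BY t.customer_id, t.predicted_decile
    """).fetchall()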
In-database analytics can support the design phase because
it eliminates many of the traditional bottlenecks such as
complex requirements gathering and the creation of formal
specification documents (including use cases). Instead,

existing use cases can be reviewed and augmented, and


database/warehouse-supported metadata facilities can
support the design of schema for capturing new target
response data. We refer to this as an intelligence-to-plan
value stream for the in-database analytics supported design
deployment phase.
In the embed phase, transferring scored model results
to tables is a first step in considering ways to make use
of database/warehouse capabilities to support DEEPER.
Once the scores are appropriately stored in tables, there are
many opportunities to use queries to embed the scores into
people-supported and automated business processes. For
example, coding to retrieve scores for inclusion in front-line
employee interfaces can be done in a manner consistent
with other embedded SQL applications. This saves time
in training interface developers because it implies that
the same personnel who implemented the interfaces can
effectively alter them to include new intelligence.
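As a minimal sketch of this pattern (the scores table and its columns are hypothetical), an interface component could retrieve the latest score with the same kind of embedded query it already uses for other data:

    import sqlite3
    from typing import Optional

    def score_for_customer(conn: sqlite3.Connection, customer_id: int) -> Optional[float]:
        """Return the most recent model score for a customer from a scores
        table maintained in the warehouse (hypothetical schema)."""
        row = conn.execute(
            """
            SELECT score
            FROM model_scores
            WHERE customer_id = ?
            ORDER BY scored_at DESC
            LIMIT 1
            """,
            (customer_id,),
        ).fetchone()
        return row[0] if row is not None else None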
There is also no need for additional project governance
functions or specialized software. In fact, database/
warehouse triggers and alerts can be used to ensure that
predictive analytics are used only when model deployment
assumptions are relevant. As the database/warehouse is the
same place where analytic model results reside, there are
numerous implementation advantages. We refer to this as
a plan-to-implementation value stream for the in-database
analytics supported embed deployment phase.
After implementation, testing will ensure that model
results/scores are understandable to decision makers (the
empower phase) and that their performance can scale
when production systems are at high capacity. Such stress
tests can be conducted in a manner similar to database
view tests. Because of the inherent speed of database/
warehouse systems, their performance will likely exceed
separate, isolated workbench performance. Global roll-out
can be eased by tried-and-true database/warehouse roll-out
processes. We refer to this as an implementation-to-use value
stream for the in-database analytics supported empower
deployment phase.
Similarly, the use-to-results value stream is that part of a
campaign when actions are taken and targets respond.


In this performance-measurement phase of deployment,


dashboards can be used to track performance, database
tables can automatically collect and store ongoing
campaign results, queries can aggregate responses over
time as part of automating responses, and many other
in-database solutions can help to streamline related
processes. This information is central to the evaluate phase,
where the results-to-evaluation value stream can enable
careful scrutiny of the predictive analytics model portfolio.
Queries can be written to compare actual results to those
predicted during SEMMA phases. When more than one
model has been constructed in the SEMMA processes, all
can be re-examined in light of the new information about
responses. If-then statements can be embedded in queries
to identify target segments that have responded according
to business goals, and remaining non-responders can be
quickly identified.
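An evaluation query of this kind might look like the following sketch, where the campaign_results table and its columns are assumptions used only for illustration:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # stand-in for the warehouse engine

    # Compare predicted deciles with observed responses; non-responders in
    # the top deciles become candidates for the re-target phase.
    evaluation = conn.execute("""
        SELECT predicted_decile,
               COUNT(*)                                       AS targets,
               SUM(CASE WHEN responded = 1 THEN 1 ELSE 0 END) AS responders,
               SUM(CASE WHEN responded = 0 THEN 1 ELSE 0 END) AS non_responders
        FROM campaign_results
        GROUP BY predicted_decile
        ORDER BY predicted_decile
    """).fetchall()

    for decile, targets, responders, non_responders in evaluation:
        print(decile, targets, responders, non_responders)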

Such analysis can be done for each analytical model in the portfolio and for each decile of predicted respondents associated with those models. This has been an enormously time-consuming process in the past, but the database/warehouse query engine can conduct this type of post-analysis efficiently. Queries can also identify subsets of respondents that outperformed the predicted model performance, and those that significantly under-performed. This type of analysis can be quickly supported through queries, and it can provide significant insight for the re-target phase.

Following the results-to-evaluation value stream of the deployment cycle, the evaluation-to-decision value stream focuses on whether a new intelligence cycle (a repeat of SEMMA processes) is required. If performance results indicate major model failures, then a repeat is likely necessary to resurrect and continue a campaign. Even if there weren't major failures, environmental changes such as economic conditions may have rendered models outdated. Data collected in the performance evaluation phase may help to streamline the decision process. If costs aren't being recovered, then it is likely that either the campaign will cease or a new intelligence cycle is necessary.

Often a portfolio of models is created in the initial intelligence cycle. It may be possible to use queries to automate the process of recalculating the prior and anticipated performance of the models in the portfolio. If models exist that were not used but appear to perform better, those models may be used in the next DEEPER cycle. Alternatively, a combination or pooling of models might be most appropriate. Again, automated queries might be able to provide decision support for such pooling options, and they can aid in scheduling the appropriate model for the data sets as the DEEPER cycle progresses. In addition, it may be possible to use queries to apply business rules to manage data sets, and prior results could inform the scheduling of resting periods for targets such that each target isn't inundated with catalog mailings, for example.

Table 2 summarizes key generic value streams that can be supported by in-database analytics and briefly describes the possibilities discussed in this section. Opportunities to evolve in-database analytics are likely to be numerous.

Intelligence-to-plan | Planning is streamlined; push and pull strategies are feasible; schema design can support planning
Plan-to-implementation | Scores maintained in-database; embedded SQL in HTML can facilitate view deployment; triggers and alerts can be used to guard for exceptions
Implementation-to-use | Stress testing and global rollout follow database/warehouse methodologies and rely on common human and physical resources
Use-to-results | Dashboards can be readily adapted; database/warehouse tables can be used as response aggregators
Results-to-evaluation | Re-examine all created models efficiently in light of response information; embed if-then logic to re-target non-responders
Evaluation-to-decision | Consider applying different models; allow targeted respondents to rest; use database to provide decision support for deciding to re-target or re-enter the intelligence cycle

Table 2. Generic value streams and areas for innovation with in-database analytics

Conclusion

In-database analytics create an environment where functions are embedded and processed in parallel, thereby streamlining the steps of both intelligence (e.g., SEMMA) and deployment (e.g., DEEPER) cycles. As data sources are updated, attribute names and formats may change, yet they are sharable. In-database analytics can support quality checks and create warning messages if the range, format, and/or type of data differ from a previous version or model assumptions. If external data has attributes that were not in the data dictionary, metadata can be updated automatically. Data conversions can be handled in-database and only once instead of being repeated by multiple modelers. In-database analytics fosters stability, enhances efficiency, and improves productivity across business units.

In-database analytics will be critical to a company's bottom line when models are deployed and there is time pressure for multiple, successive campaigns where ongoing results can be used to build updated, improved predictive models. Enhancements can be realized in a host of value streams. For example, in-database analytics can significantly reduce cycle times for rebuilding and redeploying updated models to meet campaign deadlines. As multiple models are constructed, in-database analytics will enable managing them as a portfolio. Timely responses, tracking, and fast interpretation of early responders to campaigns will enable companies to fine-tune business rules and react in record time.
As the fine line between intelligence and deployment cycles
fades because of the fast-paced environment supported
by in-database analytics, businesses may move away
from the concept of campaign management into trigger-based, lights-out processing, where all data feeds are automatically updated and processed, and there is no need to compile data into periodic campaigns. There will be real-time decision making with instant scoring each time there is an update in one of the important independent variables. Analysts will spend their time fine-tuning model performance, building business rules, analyzing early results, monitoring data movements, and optimizing the use of multiple models, instead of dealing with the manual tasks
of data preparation, data cleansing, and managing file
movements and basic statistical processes that have been
moved into the database/warehouse.
Although lights-out processing is not on the near-term
horizon, the evolution of in-database analytics promises to
move organizations in that direction. Once in the hands of
analysts and their database/warehouse teams, in-database
analytics will be a game-changer.

References
Azevedo, Ana, and Manuel Felipe Santos [2008]. "KDD, SEMMA and CRISP-DM: A Parallel Overview," IADIS European Conference on Data Mining, pp. 182-185.

Fayyad, U. M., Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy [1996]. Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press.

Gray, Paul, and Hugh J. Watson [2007]. "What Is New in BI," Business Intelligence Journal, Vol. 12, No. 1.

Houghton, Bob, Omar A. El Sawy, Paul Gray, Craig Donegan, and Ashish Joshi [2004]. "Vigilant Information Systems for Managing Enterprises in Dynamic Supply Chains: Real-Time Dashboards at Western Digital," MIS Quarterly Executive, Vol. 3, No. 1.

Pfeffer, Jeffrey, and Robert I. Sutton [2006]. "Evidence-Based Management," Harvard Business Review, January.


BI Case Study
SaaS Helps HR Firm Better Analyze Sales Pipeline
By Linda L. Briggs
When Tom Svec joined Taleo as marketing operations manager, he
immediately ran up against what he calls "The Beast," a massive, 100-MB-plus sales and marketing report in Microsoft Excel.
Ugly as it was, the monster Excel report, created weekly from Salesforce.com data, served a critical function in helping with basic sales trend analysis. Each Monday, data imported from Salesforce.com offered snapshots of the previous week's patterns to provide guidance on upcoming sales opportunities.
The information was critical to Taleo's sales managers. The publicly traded
company, with 900 employees and just under $200 million in reported
revenue in 2009, provides software-as-a-service (SaaS) solutions for talent
management. Its products are designed to help HR departments attract,
hire, and retain talent; they range from recruiting and performance
management functions to compensation and succession planning tools.
Given Taleo's current needs and projected continuing rapid growth, Svec says he realized that along with the need for more sales visibility, especially for senior managers, the risks of manipulating such critical data in
Excel had increased to an unacceptable level. He also needed a tool that
could manipulate data and provide information faster than Excel could.
"I needed to look for a scalable solution, a reliable solution, and a low-risk solution," Svec notes. He thus began a search for a BI tool to help manage
the sales opportunity data, particularly entry and pipeline metrics, for the
demand-generation group as well as for Taleo's sales organization overall.
The tool would need to work with Salesforce.com initially, but eventually
might be used with other data as well. For example, Taleo uses a front-end
marketing automation and demand-generation platform called Eloqua to
execute and measure marketing activity. In time, Svec says, the company
may want to import and manipulate Eloqua data directly in its BI solution.
As the company's only marketing operations expert, with lots of overlap with the sales operations team as well, Svec needed a complete lifecycle view of both sales and marketing data. "The demand-generation team and I are very, very focused on everything from the top of the funnel all the way through to close of business," he explains. That includes involvement in



the sales pipeline side of things, all


of which made The Beast a challenge when users needed to glean
useful information quickly.

The Search for SaaS


Embarking on the search for a BI
solution, Svec turned to Salesforce.com's AppExchange, an online
marketplace that lists more than
800 cloud-based applications that
work with Salesforce.com. From that
list, Svec selected and considered
vendors including Cloud9 Analytics,
LucidEra (which closed in 2009),
and other SaaS BI vendors.
"Initially, this was supposed to be just a departmental solution for specific use in managing marketing demand generation and sales pipeline data," Svec says. "Rather than a large ERP system or on-premises solution, we were looking for a solution to solve a specific issue."
With limited technology resources
to call on within the company, he
wanted a quick, easy implementation that he could accomplish
without IT involvement and that
could be ramped up quickly while
providing rock-solid security.
As a SaaS company itself, Taleo was
in a good position to understand
and appreciate the SaaS concept of
on-demand software hosted offsite
by the vendor. In that vein, the
company eventually selected PivotLink, which offers an on-demand
BI solution that includes technologies such as in-memory analytics
and columnar data storage.

Svec says a key PivotLink feature


was its ability to handle data from
any source. That helped it stand
out from the many other solutions he found on AppExchange
that seemed geared specifically
toward working with Salesforce.
With more anticipated growth
ahead, both organic and through
acquisitions, the company needed
something more versatile.

"Today our [focus] is Salesforce," Svec says, "but looking down the road 6, 12, however many months, we wanted something built to accommodate other data sources."
During a relatively quick six- to eight-week implementation, Svec worked closely with PivotLink in a collaborative process, "pushing them a bit," he says, to integrate more deeply with Salesforce. He was pleased overall with how the integration proceeded, in particular with the vendor relationship: "I think [PivotLink] was discovering new things along the way, particularly some of the historical snapshot requirements we had."
The end result: A master set of
locked-down sales reports built in
PivotLink that sales and marketing
managers can use for a detailed
view of the demand-generation
funnel and analysis of the sales
pipeline for historical trending.

Looking under the Hood


A key concern during the selection
process was the long-term financial
viability of any SaaS provider. "We were very cognizant of financial viability," Svec says (and in fact,
SaaS company LucidEra closed its
doors just weeks after Taleo signed
its deal with PivotLink).
To avoid potential problems,
Taleo examined PivotLink closely,
weighing factors including funding history: When was the last
round of funding? When is the
next round of funding scheduled?
Where is the company in the
fundraising process? They also
considered number of customers
and growth rate.
Taleo also conducted an extensive
security review. "As a SaaS company, we're very, very serious about security," Svec stresses. Having
conducted a SAS-70 compliance
process himself as VP of operations
at a SaaS company earlier in his
career, Svec was highly conscious
of what he required in terms of
security from any hosted software
vendor. The focus was especially
sharp because PivotLink is a
relatively new company, he says.


"We asked, 'Is this reliable? Is this secure?' The answers were a very important part of why we chose PivotLink."
PivotLink users so far at Taleo make
up a small group in marketing
management; data they prepare is
consumed by users in key positions
such as the chief marketing officer,
head of sales, head of finance, and
regional sales VPs. Over time, Svec
hopes to begin work on the "holy grail," as he calls it: using predictive analytics to examine data and
predict future events.
A new version of PivotLink will
help spread the tool to more users,
Svec hopes, based on what he's seen so far of the revamped user interface. "The improvements [PivotLink] has made in the [user interface], especially in the dashboard itself, make it much more conducive to casual use. That's really going to help bring [more casual users] to the tool at Taleo," he predicts.

Future Plans
To that end, Svec first wants to
boost user interest in and adoption
of PivotLink throughout the company. Taleo's finance department, for example, is enviously eyeing the PivotLink-produced reports coming from Svec's group and is
thus a candidate for adoption.
Second, he envisions incorporating
additional data sources (and this is where PivotLink's ability to handle disparate data sources will be important), thus giving him


and his team a more unified view of


marketing and demand-generation
activity. He wants to see "how we can better leverage our marketing metrics by bringing in data from other sources beyond Salesforce."
Those sources include the
company's marketing automation
system, as well as large data sets
that are sent to outside companies
for cleanup and validation, which
then need to be re-imported and
manipulated by Svec and his
team. The ability to examine a
before-and-after view of the data
can reveal what changes have been
made and how data quality has
been increased.

Although the return on investment
from making better decisions is
always an elusive measure, Svec says
that PivotLink's pricing model has
proven economical for a company
the size of Taleo. Certainly, issues
such as better risk management
from avoiding the manipulation

of copious amounts of data in an


Excel spreadsheet are part of the
savings equation. Svec also sees
everyday time savings gleaned from
simply making it more efficient
for users to dig into data. "Those are things that you can't really measure," he says. "[PivotLink] is definitely allowing us to measure and see things in different ways [or more efficiently] than we were able to do in the past."
In a nutshell, what the BI tool
really does, Svec says, is allow sales
and marketing management to
zero in on circumstance; that
is, to identify situations where
trending patterns are evident from
the data in the sales pipeline.
Exposing that data much more
clearly in order to find patterns
and anomalies allows users to
drill down and perform further
comparative analysis.
Excel, while still put to everyday
use throughout the company for
one-off data extracts, manipulations, validations, and the like, is
no longer the primary analysis tool.
"Our reliance on it isn't as great," Svec says. "We've mitigated risk and improved our scalability by using PivotLink instead."
Linda L. Briggs writes about technology in corporate, education, and
government markets. She is based in
San Diego. lbriggs@lindabriggs.com

AGILE BI

Enabling Agile BI with a Compressed Flat Files Architecture
William Sunna and Pankaj Agrawal
Dr. William Sunna is a principal consultant with Compact Solutions. william.sunna@compactsolutionsllc.com

Pankaj Agrawal is CTO of Compact Solutions. pankaj.agrawal@compactsolutionsllc.com

Abstract

As data volumes explode and business needs continually


change in large organizations, the need for agile business
intelligence (BI) becomes crucial. Furthermore, business
analysts often need to perform studies on granular data for
strategic and tactical decision making such as risk or fraud
analysis and pricing analysis. Rapid results in active data
warehousing become vital in order for organizations to better
manage and make optimal use of their data. All of this triggers
the need for new approaches to data warehousing that can
support both agility and access to granular data.
This article presents a new approach to agile BI: the compressed flat files (CFF) architecture, a file-based analytics
solution in which large amounts of core enterprise transactional data are stored in compressed flat files instead
of an RDBMS. The data is accessed via a metadata-driven,
high-performance query engine built using a standard ETL or
software tool.


When compared to traditional solutions, the CFF architecture is


substantially faster and less costly to build thanks to its simplicity. It does not use any commercial database management
systems; is quick and easy to maintain and update (making it
highly agile); and could potentially become the single version
of truth in an organization and therefore act as an authoritative
data source for downstream applications.

Introduction
Large enterprises often find themselves unable to use
their core data effectively to perform BI. This is mainly
due to a lack of agility in their information systems and
the delays required to update their data warehouses with
new information. As business climates change rapidly,
new dimensions, key performance indicators, and derived
facts need to be added quickly to the data warehouse so the


business can stay competitive. In addition, access to historical, low-granularity transaction data is vital for tactical and
strategic decision making.
Traditional data warehouse solutions that use relational
databases and implement complicated models may not be
sufficient to satisfy the agility needs of such BI environments. Introducing new data into a warehouse often
involves relatively long development and testing cycles.
Furthermore, the traditional data warehouse architectures
do not adequately cope with many years of transactional
data while meeting the performance expectations of end
users. Enterprises often settle for summarized data in the
warehouse, but this severely compromises their ability
to perform advanced analytics that require access to vast
amounts of low-level transactional data.
With all of these inconveniences, the need for an agile
solution that can handle these challenges has become
acute. This article presents an innovative architecture that

offers a cost-effective solution to create large transactional
repositories to support complex data analytics and has agile
development and maintenance phases.
In this architecture, the core enterprise data is extracted
from operational sources and stored in a denormalized
form on a more granular level in compressed flat files. The
data is then extracted using a high-performance extraction
engine that performs SQL-like queries including selection,
filtering, aggregation, and join operations. Power users


can extract transactional or aggregated data using a simple


graphical interface. More casual business users can use a
standard OLAP tool to access data from the compressed
flat files.
The benefits of CFF architecture are manifold. The
infrastructure cost of a CFF solution is substantially lower,
as RDBMS license costs are eliminated. Storing data in
standard compressed flat files can reduce disk storage
requirements by an order of magnitude. This not only
reduces cost, but also allows an organization to provide
many more years of transactional data to its analysts,
allowing for much richer analysis. In addition, the simplified architecture can be built and supported by a much
smaller team. This article will explain the CFF architecture
through a simple case study; discuss the metadata-driven
feature of the architecture; and compare the CFF architecture to traditional architecture, with an emphasis on agility.

Case Study
We will use a simple case study to demonstrate the CFF
architecture. Suppose researchers and pricing analysts in
a major retail chain want to study the sales trends and
profitability of the products sold at their stores located in all
50 states. To support their analyses, they need 10 years of
detailed sales transaction data available online.
Let's assume the chain sells more than 30 categories of
products such as automotive and hardware. Each category
contains a wide range of products. For example, the
automotive category contains engine oil, windshield washer
fluid, and wiper blades; each of these products has a unique
product code. Once a day, all the stores send a flat file
containing point-of-sale (POS) transactions to headquarters. In addition to product and geographical information,
the transactions also contain other information such as the
manufacturer code, sales channel, cost of the product, and
sale price.
Assume that most users' analysis is based on the geographic location, product category, and the accounting month in which the products are sold. Let's refer to such attributes
as major key attributes. For example, a business analyst
may request a profitability report for a selected number of
products in a given category in Illinois in the first quarter
of 2009.


Figure 1. Overview of CFF architecture: data from operational data sources is loaded via an ETL process into compressed flat files, and a high-performance query engine executes business analysts' queries against those data files.

The Compressed Flat Files Architecture


The CFF architecture (Figure 1) is best characterized by
its simplicity, yet it delivers many invaluable benefits. The
architecture is highly metadata-driven to allow flexibility
and agility in development and maintenance. The architecture also allows the enterprise to implement a security layer
to regulate data access. This section describes how the data
is generated, organized, and extracted, along with the CFF
metadata-driven characteristics.
Data Generation and Organization
In this step, the data is extracted from operational data
sources using a standard ETL tool. The data can be
extracted from legacy systems, operational databases, flat
files, or any other available data sources. The extracted
data can be on a very granular level, such as POS
transactions for a certain retail chain, as described in the
case study. For the widest possible use, we recommend
storing the data at the most granular level. The data
is then cleansed and transformed according to a set of
business rules, then partitioned, compressed, and stored
in multiple files on hard disk. A set of key performance
indicators (KPIs) is also calculated at this stage, again at
the most granular level.
The way the data is organized and distributed in
compressed flat files is a key factor in the success of this
architecture. Similar to commercial database partition

elimination mechanisms, the compressed flat files should


be organized to optimize extraction as much as possible. In
other words, the main goal is to read as few files as possible
to satisfy any given query. To ensure this is the case, some
extraction patterns should be analyzed before finalizing the
organization of the files.
For our case study, it makes sense to split the data files by
their major key attributes. For example, we can split the
files by product category, state, and accounting month
because these three attributes are used in almost all the
extractions. If we are storing 10 years of data for 50 states
and 30 product categories, then the number of compressed
flat files will be 10 years x 12 months x 50 states x 30
categories = 180,000 files. Each compressed file should be
named in a way that describes its contents. For example,
given a compressed flat file, a user should be able to identify
what product categories it contains for what state and what
accounting month. If the file names do not describe the
major key attributes, then there should be a mapping file to
link the file name to its major key attributes.
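As a hedged sketch of such a naming convention (the directory layout and field order are our assumptions, not part of the case study), each file name can encode its major key attributes:

    from pathlib import Path

    def cff_path(root: str, category: str, state: str, acct_month: str) -> Path:
        """Build the path of the compressed flat file holding one
        category/state/accounting-month slice, e.g. cat01_IL_200901.csv.gz."""
        return Path(root) / f"cat{category}_{state}_{acct_month}.csv.gz"

    # Automotive category (01), Illinois, January 2009.
    print(cff_path("/data/cff", "01", "IL", "200901"))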
Data Extraction
The extraction process starts with the end users, who
compose their requests in a simple, standard user interface
that can be developed using Java or .NET. The user
interface should allow users to specify the data attributes
they would like to see and what measures or metrics they


would like to calculate. In addition, it should provide filters to further refine the data.

Let's take a data extraction request for our case study: A user wishes to perform a profitability analysis for four products (with codes 01, 02, 03, and 04) in the automotive category, which has a code of 01, for the state of Illinois (IL) in the first quarter of 2009. The user interface allows the user to select attributes (category code, state, product code, transaction date, number of items sold, sales amount, and cost amount) and specify the relevant filters, as shown in Figure 2.

Once the user submits the request, the query details are passed to the high-performance query engine that is responsible for extracting data directly from the compressed flat files. The query engine will first build a list of the compressed flat files needed for the extraction based on the major key attributes selected. In our example, only one category code has been requested for one state during a three-month period. Therefore, only three compressed flat files out of the 180,000 total files are needed to satisfy the request. This early selection of files represents a huge up-front performance gain in query processing and is one of the major strengths of the CFF architecture.

The query engine then reads the data in the relevant files and applies additional data filters such as the product code. The next step will be aggregating the measures requested (sales amount and cost) by product code and presenting the results to the analyst. The resulting data sets can be produced in any format, such as comma-separated or SAS-formatted files. Note that the user interface presented here is to be used as a data extraction interface, as opposed to a standard reporting or presentation interface. Standard BI tools such as MicroStrategy and Business Objects are also supported by this architecture.

Figure 2. Example of a query user interface

The high-performance query engine can be implemented


with any commercial or open source ETL tool, or it
can be built using any programming language. If the
organization uses such tools and software, then there will
be no need to purchase additional licenses for a database
management system.
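The sketch below illustrates the engine's core idea under assumed file names and schema (pandas is used here only as a stand-in for an ETL tool or a custom program): candidate files are first narrowed using the major key attributes encoded in their names, and only those files are read, filtered, and aggregated.

    import glob
    import pandas as pd

    def run_query(root, categories, states, months, product_codes):
        """Read only the compressed flat files matching the major key
        attributes, then filter and aggregate the requested measures."""
        frames = []
        for cat in categories:
            for state in states:
                for month in months:
                    # Early file selection: e.g., 3 of 180,000 files are read.
                    for path in glob.glob(f"{root}/cat{cat}_{state}_{month}.csv.gz"):
                        df = pd.read_csv(path, compression="gzip")
                        frames.append(df[df["product_code"].isin(product_codes)])
        data = pd.concat(frames, ignore_index=True)
        # Aggregate the requested measures by product code.
        return data.groupby("product_code")[["sales_amount", "cost_amount"]].sum()

    # Automotive (01) products 01-04 in Illinois, first quarter of 2009.
    result = run_query("/data/cff", ["01"], ["IL"],
                       ["200901", "200902", "200903"],
                       ["01", "02", "03", "04"])
    print(result)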


Figure 3. Metadata-driven architecture: users log in to the user interface, which is driven by the metadata management module based on the schema files and the security grid; user requests are stored in the requests configuration repository, scheduled by the query control process, and executed by the high-performance query engine against the CFF, which returns the results.

Metadata-driven Approach
The CFF architecture is highly metadata-driven to allow for
maximum agility in both the initial build of the application
and any required maintenance in the future. Due to the
simplicity of the data model manifested in the CFF, the
data layouts (schema files) of the CFF are leveraged to
generate the contents of the user interface via the metadata
management module, as shown in Figure 3. Therefore, the
addition of new fields or modifications to existing fields are
reflected in the user interface unit without requiring any
programming effort.
The metadata management module also takes into
consideration the classification of attributes in the data
as specified in the schema files; it distinguishes major key
attributes from other dimensional attributes and measures.
Furthermore, it provides user privileges information to the
interface by consulting the security grid module, which
contains privileges and security rules for data access. The
user interface builds custom data extraction menus for
different users depending on what they are allowed to
query or extract.
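A hedged sketch of this metadata-driven behavior (the JSON schema layout and the privilege set are assumptions for illustration) shows how the interface can derive its menus from the schema files and security grid rather than from hard-coded lists:

    import json

    def build_menu(schema_path: str, user_privileges: set) -> dict:
        """Derive the extraction menu from a CFF schema file, exposing only
        the fields this user is allowed to query (hypothetical layout)."""
        with open(schema_path) as f:
            schema = json.load(f)  # e.g. {"fields": [{"name": ..., "role": ...}]}

        menu = {"major_keys": [], "attributes": [], "measures": []}
        for field in schema["fields"]:
            if field["name"] not in user_privileges:
                continue  # security grid: hide fields this user may not see
            if field["role"] == "major_key":
                menu["major_keys"].append(field["name"])
            elif field["role"] == "measure":
                menu["measures"].append(field["name"])
            else:
                menu["attributes"].append(field["name"])
        return menu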

All user requests are deposited in the requests configuration


repository, a standard, secure location that contains the
specifics of each request. This allows users to access any
requests they submitted in the past, modify them if needed,
and resubmit them. The query control process gathers new
requests from the request configuration repository and submits them to the high-performance query engine. Queuing
of requests, priorities, and other scheduling considerations
are implemented in the query control process.

Traditional Architectures, CFF, and Agility


The CFF presents an alternate way to implement complex
data analytics solutions with huge gains. Compared to
traditional architectures, it is significantly faster to build
due to its simplicity. It is far easier to maintain due to its
metadata-driven characteristics. Systems based on this
architecture can provide very rich information to analysts
because very large amounts of highly granular data can
be kept online at a fraction of the cost of traditional
architectures. In one implementation of this architecture,
more than 100 power users at a large insurance company
perform complex analytics on 22 years' worth of claims
and premium transactions.


Figure 4. Traditional data warehouse architecture: data from operational source systems moves through staged, conformed, enterprise data warehouse, and presentation layers via extract/acquire/stage, conform, synchronize/integrate, load EDW, and present ETL processes, supported by reprocessing, automated balance and audit control, admin and notification services, reporting and monitoring, and layers of ETL processes, components, and metadata.

Traditional data warehousing solutions based on relational databases require many layers of data models with corresponding ETL processes, making the architecture very complex, as shown in Figure 4. The traditional data architectures usually require separate models to be built for staged data, conformed data, the operational data store, a data warehouse or data mart, and presentation layers. These models are populated by multiple ETL processes.
Because this architecture depends heavily on an RDBMS
for storing data, data is often aggregated to provide better
performance and manage data growth. Because of the
very large data volumes involved, it is extremely expensive
to store many years of transactional data in such data
warehouses. Therefore, most such solutions keep a small
amount of granular data (say a few months) in base tables,
and rely heavily on aggregated data to meet user demands.
Such aggregated data is often of limited use for applications
such as risk and fraud analysis, price modeling, and other
analytics that require a longer historical perspective.
If we compare the CFF solution to a traditional data warehousing solution on basic development and maintenance activities, we can easily recognize the agility gains offered by the CFF architecture. Table 1 compares the CFF architecture with the traditional architecture along some key criteria.
During the development phase of the CFF architecture,
adding new attributes or deleting/updating existing
attributes requires making changes to only one repository
and one ETL application, whereas the traditional architecture requires changes in many places. In fact, this simple
difference can save substantial time, money, and resources
because it eliminates the need to build many sophisticated
models, whether dimensional or normalized in a relational
database. Since data is stored in only one repository (the
set of compressed flat files), only one set of ETL routines
needs to be developed, saving time and money. Thus, the
architecture is intrinsically agile.
During the maintenance phase, inserting new attributes
in the data is easier in CFF because of its metadata-driven
nature. Once the CFF layout is modified, the rest of the
updates are done automatically all the way to the user
interface. In a traditional solution, the new attributes have
to be propagated from one process to another and from one
data model to another, requiring significant development
and testing. Making a small change requires the involvement of data modelers, database administrators, ETL
developers, and testers.

Phase | Change Step | Traditional Architecture | CFF
Development | New attribute | Updates to several layouts, data models, and ETL processes | Updates to only one layout and one ETL process
Development | Delete attribute | Updates to several layouts, data models, and ETL processes | Updates to only one layout and one ETL process
Development | Update attributes | Updates to several ETL processes | Updates to only one ETL process
Maintenance | Insert attributes | NULL for historical data and layouts; updates to several ETL processes going forward | Easier with metadata automation; updates to only one ETL process
Maintenance | Delete attributes | Nullify column; updates to several ETL processes going forward | Easier with metadata automation; updates to only one ETL process going forward
Maintenance | Update attributes | Updates to several ETL processes | Updates to only one ETL process

Table 1. Comparison of the CFF solution and a traditional data warehouse solution

Summary
According to Forrester Research principal analyst
Boris Evelson, the slightest change in a traditional data
warehouse solution can trigger massive amounts of work
involving changing multiple ETL routines, operational data
store attributes, facts, dimensions, major key performance
indicators, filters, reports, cubes, and dashboards. Such
changes cost time and money. This frustrates IT managers
and business users alike. The need for agile data management has, therefore, become acute. Such solutions should
not be driven by what tools are available but by smart
strategies and architectures.
In response to business needs for agility and lower cost,
we have presented a new but proven data management
architecture, the compressed flat files architecture. We have
demonstrated the simplicity of this architecture and how
it can be used to satisfy business needs in an agile environment. We have shown how this architecture is independent
of any technologies or tools. We also demonstrated how
it allows business users to analyze vast amounts of data at
the most granular level without any loss of detail, a feature
that would be prohibitively expensive to build using a
traditional solution.

We compared the CFF architecture with traditional


architectures to demonstrate the agility of CFF in multiple
activities in the development and maintenance phases. We
have shown that the CFF architecture offers important
benefits: reduced development time due to simplicity and
metadata-driven architecture; reduced cost from eliminating the need to use a relational database management
system; and the ability to store much larger amounts of
data on smaller storage devices.
A solution based on the CFF architecture has already
proved its value at a large corporation where it handles
more than 50 TB of raw historical transactional data.
Furthermore, the CFF architecture has been recognized by
data warehousing experts such as Bill Inmon and industry
analysts such as Forrester as an important evolutionary step
in data management and BI.
Today's BI challenges require non-traditional solutions to rein in the cost and complexity of managing data, as well as more agile responses to business changes. The CFF architecture meets these requirements.


BI EXPERTS PERSPECTIVE

BI Experts Perspective:
Pervasive BI
Jonathan G. Geiger, Arkady Maydanchik, and Philip Russom

Jonathan G. Geiger, CBIP, is an executive vice president with Intelligent Solutions, Inc. He presents frequently at national and international conferences, has written more than 30 articles, and is a co-author of three books: Data Stores, Data Warehousing and the Zachman Framework: Managing Enterprise Knowledge; Building the Customer-Centric Enterprise; and Mastering Data Warehouse Design. jggeiger@earthlink.net

Arkady Maydanchik is a recognized practitioner, author, and educator in the field of data quality and information integration. Arkady's data quality methodology and breakthrough ARKISTRA technology were used to provide services to numerous organizations. He is co-author of Data Quality Assessment for Practitioners. arkadym@dataqualitygroup.com

Philip Russom is the senior manager of TDWI Research at The Data Warehousing Institute (TDWI), where he oversees many of TDWI's research-oriented publications, services, and events. Before joining TDWI in 2005, Russom was an industry analyst covering business intelligence (BI) at Forrester Research, Giga Information Group, and Hurwitz Group. prussom@tdwi.org

Kelsey Graham has recently taken over as business intelligence (BI) director at Omega, a manufacturer of office products. She inherits a BI staff that has been in place for four years and boasts many accomplishments, including an enterprise data warehouse, performance dashboards, forecasting models, and pricing models. There are eight BI professionals on staff; they perform roles and tasks that vary: planning the BI architecture, developing and maintaining the warehouse, and developing enterprisewide applications.

One of Kelsey's charges is to make BI more pervasive. Senior management wants decision support data, tools, and applications available to more employees and trading partners along the supply chain. Although Kelsey is on board with this initiative, she is concerned about the quality of both the data in the warehouse and the metadata.

Her predecessor didn't make much progress in working with some of the business units to correct the data quality problems originating in the source systems, and there is limited metadata that informs users about the quality of the data they are accessing. Kelsey knows that as BI becomes more pervasive, these data quality issues will demand more attention. She needs to think through what actions to take.

1. How should Kelsey start a dialogue with senior management about correcting the data quality problems in the source systems? Her sense is that she needs senior management's help to get the business units to allocate the necessary resources to address the problems.

2. What metadata about data quality does Kelsey need to provide to users? Should she use categorical indicators such as excellent, good, fair, or poor, or specific numerical indicators such as 90 percent accurate?

3. Should the indicators of quality be placed at the warehouse or the application level? Kelsey knows that data quality is related to the data's intended use, but providing data quality metrics at the application level would be much more labor intensive for her staff.



Jonathan G. Geiger
Kelsey is dealing with
a BI program that
is perceived to be sufficiently
successful to be widely adopted
but that has some gaps under the
covers. In addition, her team seems
to be oblivious to the data quality
issues. Fortunately, she recognizes
that she needs to address the
deficiencies before providing wider
access to the data. She needs to
address the teams attitude, get a
realistic assessment of the situation,
gain senior management support,
and provide information on the
actual data quality.
Team Attitude
Kelsey's team is proud of its
accomplishments, and probably
with good reason. They have, after
all, implemented a data warehouse
that provides data to its intended
audience, and this information
is used to provide benefits to
the organization. If Kelsey is to
address her data quality concerns,
she must first discuss these
concerns with her team.
Kelsey should speak with her team,
individually and collectively, to discuss the strengths and risks of the
existing environment. If there is
any merit to her suspicions, at least
some of the team members will
mention concerns about data quality. Being careful to give the team
credit for its accomplishments, as
the manager, Kelsey is in a position
to determine which deficiencies
need to be addressed first. If she
feels that the most significant issue
to be addressed prior to widespread implementation is data quality, she should communicate this to the team and gain its understanding and support. (There may be other high-priority issues, but they are outside the scope of this article.)
Data Quality Assessment
Kelsey is not in a position to start a
dialogue with senior management
until she can substantiate her
concerns about the data's quality.
If she were to simply approach
management with her concerns,
she would probably be perceived
as a naysayer and would lose the
support of both senior management
and her proud team. By the same
token, she does not have the luxury
of time to conduct a full data
profiling effort.
Once she has enlisted the teams
understanding and support, Kelsey
should solicit input from the team
about the areas in which they are
most concerned about data quality.
The team should then conduct
some quick analysis to identify
specific examples and possible root
causes. The root causes are likely
to include aspects of both business
processes and operational systems.
Kelsey should accumulate this
information and project how these
deficiencies might impact the
quality of the decisions people
make if they are using the poor-quality data. Kelsey should develop
a realistic plan for providing a
pervasive BI environment (that
includes addressing the major
issues such as data quality). This
places her in a position of presenting management with a solution that addresses its objectives.
Management Support and
Commitment
Kelsey is now prepared to have a
dialogue with senior management.
She should structure her discussion
as a plan for meeting the goal
of having a more pervasive BI
environment. Within that plan,
she needs to point out the need to
address data quality deficiencies
and the business involvement that
will be needed to make it happen.
It's probably premature to establish
a formal data stewardship program,
but her presentation should lay
the foundation for the subsequent
introduction of such a program.
Key business roles to be discussed
include setting the quality
expectations, ensuring that the
business processes support the
desired levels, and ultimate
responsibility for the data quality.
In addition to describing the
business roles, Kelsey should
discuss how the data quality can be
measured and reported.
Data Quality Metrics and Metadata
There are three basic sets of data quality metrics that should be developed. These involve:

The data quality of the source systems: This implicitly involves the business processes. The information should be initially collected during the data profiling (if conducted), and then through the ETL process on an ongoing basis. It will be useful when deficiencies identified by the third set of metrics must be addressed.

The audit and control metadata: This is the measurement of the quality of the ETL process, and it should confirm that no errors were introduced during that process. It is of primary interest to the BI team, as it must address any deficiencies.

The business-facing set of metrics: These are measures of the quality of the delivered data. This is where information needs to be provided to the business community so it can determine if the data is good enough for its intended purpose. The metrics should yield an indication of how well each relevant quality expectation is being met. (Supporting, lower-level measures could also be available to guide preventive and corrective actions.)

Kelsey correctly recognizes that if the business users don't trust the data, the program will ultimately fail. By addressing her data quality concerns head on, she will be better positioned to ensure the program's success.

Arkady Maydanchik
Data quality in data warehousing and BI is a common problem because the data comes to data warehouses from numerous source systems and through numerous interfaces. Existing source data problems migrate to the data warehouse and mutate along the way. New problems are inevitably created in ETL processes because of inconsistencies between the data in various source applications. As a result, data quality in the data warehouse is often the lowest among all databases.

In theory, given a known data problem, the best course of action would be for Kelsey to perform a root cause analysis and fix the problems at the source. This way she does not just reactively cleanse individual erroneous data elements, but rather proactively prevents all future problems of the same kind before they occur. Regardless of the ideal, however, it is not practical to expect that data quality will be ensured at the source. There are several reasons for this.

First, in most organizations, comprehensive data quality management is a distant dream. Many source systems lack adequate controls. Source system stewards and data owners often do not know that their data is bad, or at least do not have any specific knowledge of which data are bad and what impact the data quality problem has on the business.
Second, data warehouses obtain
data from multiple source systems.
Oftentimes, the data coming from
each source seems consistent and
accurate when examined independently from the other sources. It
is only when data from multiple
sources is put together that the
inconsistencies and inaccuracies
can be discovered.
Finally, lack of data quality
controls is sometimes a conscious
financial decision. Data quality
management is not free! Thus, it
is often decided that the existing data quality is adequate for the purposes for which the data is used within the source system, and that further investment in data quality improvement is not worthwhile. Of course,
such calculations typically ignore
the impact of poor source data
quality on downstream systems
such as data warehouses.
To attack the problem, Kelsey must
start by assessing data quality in
the data warehouse. A systemic
data quality assessment project can
be executed with limited resources
and in a short time period. Data
quality assessment produces a
detailed data quality metadata
warehouse that shows all identified
individual data errors, as well as a
data quality scorecard that allows
for aggregating the results and
estimating the financial impact of the bad data on various data uses and business processes.
One important category of
aggregate scores is by data source.
These scores indicate where the
bad data came from. Another
important category incorporates
the time dimension, showing the
trends in data quality overall and
by the data source.
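As a toy illustration of these two aggregate views (not code from any particular assessment methodology), the snippet below rolls an assumed error log up by data source and by source and month:

from collections import Counter

errors = [
    {"source": "claims_sys", "month": "2010-01", "rule": "missing_dob"},
    {"source": "claims_sys", "month": "2010-02", "rule": "invalid_state"},
    {"source": "policy_sys", "month": "2010-02", "rule": "duplicate_key"},
]

by_source = Counter(e["source"] for e in errors)                      # where the bad data came from
by_source_month = Counter((e["source"], e["month"]) for e in errors)  # trend over time

print(by_source)        # Counter({'claims_sys': 2, 'policy_sys': 1})
print(by_source_month)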
Armed with this information,
Kelsey can go to the source data
stewards and discuss the financial
implications of their data quality
problems. Hopefully, understanding the downstream implications of bad data will make an adequate case for data quality management at the source. Also,
such findings may give the source
data stewards a glimpse into their
own data quality and thus a better
understanding of what it may cost
them directly.
The next step is setting up data
quality monitoring solutions for
the data interfaces through which
source data comes to the data
warehouse. This is necessary even
if the source data systems have
adequate data quality controls in
place. The reality is that it is simply
impossible to completely ensure
data quality at the source and
guarantee that all data coming via
interfaces to downstream systems is
accurate. Monitoring data quality
in each interface is a necessary part
of any data integration solution.
There are different types of data quality monitors. Error monitors look for individual erroneous data elements. Change monitors look for unexpected changes in data structure and meaning. Of course, monitoring data quality in data interfaces is not free. Advanced monitors require greater investment of time and money. The desired level of data quality monitoring in the interfaces is a financial decision and requires analysis of the ramifications of bad data.

The final question is how much
Kelsey wants to expose data quality
metadata to the data users. There is
no right answer to this question. A
good guideline is that any information must be actionable. Providing
too much detail is of little value to
someone who cannot act upon it.
Another factor to consider is that in a data warehouse that receives large volumes of data from numerous sources at breakneck speed, it may be impossible to ensure that the data quality metadata are always current. In that case, providing detailed information to the users may be counterproductive, as it may lead to a perception that the information about data quality is absolutely accurate and up-to-date.
In any case, this is a decision that can be made and changed many times as Kelsey's data quality management program matures.
Once she sets up the processes
for data quality assessment at
the data warehouse, data quality
monitoring for the interfaces, and
root-cause analysis and data quality
management at the source, she has
all the ingredients to fine-tune data
quality reporting to the individual
needs of the users.

Philip Russom
I envy Kelsey. Then again, I don't. Kelsey's position is strong because
the BI team has an impressive
track record of producing a wide
range of successful BI solutions.
More strength comes from senior
managements direct support of an
expansion of BI solutions to more
employees and partnering companies. Kelsey and team have useful
and exciting work ahead, backed
up by an executive mandate. I envy
them shamelessly.
That's the good news. Here comes
the bad.
Kelsey is contemplating crossing
the line by sticking her nose into
another team's business so she can tell them their hard work isn't
good enough. Not only is this a
tall hurdle, but Kelsey will face
fearsome opposition on the other
side. Her chances of success are



slim if she goes it alone. Frankly, I don't envy this part of her job.
You see, fixing data quality problems isn't really a technology problem, at least not in the beginning. Getting a data quality program started and organized is 90 percent organizational dynamics. That's a euphemism for turf wars, office politics, and "my IT system ain't broke so it don't need fixin'." You have to work through these barriers and build a big foundation for your data quality program. Way down the road, you eventually get to fix something. I'm exaggerating for dramatic effect, but you get the point.
Kelsey cannot, and should not, lead the campaign for data quality. After all, it's not her job, and, to emphasize my point, she'd probably drown in the torrent of organizational dynamics anyway. Instead, the march into a data quality program should be led in a way that defuses most organizational dynamics. Essentially, any initiative that involves coordination and change across multiple teams and business units (as does data quality) will need a strong executive sponsor who's placed high enough in the organizational chart to be impossible to ignore.
The sponsor needs to carry a big
stick and speak softly. The stick is
a firmly stated executive mandate,
the kind that limits your career
should you fail to deliver on it. To
avoid insurgencies, however, there must be soft speaking that clearly defines goals for data quality and how improving data will improve the business for everyone. The sponsor needs to parachute in unannounced and repeat this pep talk occasionally. Furthermore, the soft speaking needs to avoid blame. If getting a data quality campaign started depends on a unilateral pardon of all data-related crimes, then so be it.
By this point, you're probably sick of hearing about organizational dynamics relative to data quality, but there's more.
An executive mandate forms a
required foundation, but you (and
Kelsey) still have to build a team
or organizational structure on top
of it. This is an immutable truth,
not just my assertion. The fact that
data quality work is almost always
identified and approved via a data
stewardship program corroborates
my assertion. In recent years, data
stewardship has evolved into (or
been swallowed by) data governance.
Kelsey needs to pick one of these team types, based on Omega's corporate culture and pre-existing organizational structures. Next, she'll need staffing that's appropriate to the team type. For example, data governance is often overseen by a committee that's populated part-time by people who have
day jobs in (or influenced by)
data management. Finally, the
team must institute a process for
proposing changes.

Yes, effective data quality improvements come down to a credible, non-ignorable process for change orders. Why didn't I just jump to this conclusion earlier, and save us all a lot of time? It's because the change management process only works when built atop a strong foundation. The foundation is required because the changes that are typical of data quality improvements reach across multiple lines of business, plus their managers, technical staff, application users, and others.
As Kelsey will soon discover, that's quite a number of people, technologies, and businesses to coordinate. She's right to start with a conversation with senior management, not the owners of offending applications. She's also right not to go it alone.
Kelsey has a firm conviction that data quality is a critical success factor for BI. With any luck, she'll convince the right business sponsor, who'll start building the big foundation that cross-business-unit data quality solutions demand.

SENTIMENT ANALYSIS

BI and Sentiment
Analysis
Mukund Deshpande and Avik Sarkar
Dr. Mukund Deshpande is senior architect at the business intelligence competency center of Persistent Systems. He has helped enterprises, e-commerce companies, and ISVs make better business decisions for the past 10 years by using machine learning and data mining techniques. mukund_deshpande@persistent.co.in

Dr. Avik Sarkar is technical lead at the analytics competency center at Persistent Systems and has over nine years of experience using analytics, data mining, and statistical modeling techniques across different industry vertical markets. avik_sarkar@persistent.co.in

Overview

Over the past two decades, there has been explosive growth
in the volume of information and articles published on the
Internet. With this enormous increase in online content
came the challenge of quickly finding specific information.
Google, AltaVista, MSN, Yahoo, and other search sites
stepped in and developed novel technologies to efficiently
search and harness the massive amount of Internet
information. Some search engines indexed keywords; others
used information hierarchies, arranging Web pages in a
structured way for easy browsing and for quickly locating
requested information. Text classification, also known as
text categorization, and text-clustering-based techniques
advanced, allowing Web pages to be automatically
organized into relevant hierarchies.
Web sites frequently discuss consumer products or services, from movies and restaurants to hotels and politics. These shared opinions, termed "the voice of the customer," have become highly valuable to businesses and
organizations large and small. In fact, a recent study by
Deloitte found that 82 percent of purchase decisions have
been directly influenced by reviews. The rapid spread of
information over the Internet and the heightened impact
of the media have broken down physical and geographical
boundaries and caused organizations to become increasingly cautious about their reputations.
Businesses and market research firms have carried out
traditional sentiment analysis (also referred to as opinion
analysis or reputation analysis) for some time, but it
requires significant resources (travel to a given location;
staffing the survey process; offering survey respondents
incentives; and collecting, aggregating, and analyzing
results). Such analysis is cumbersome, time-consuming,
and costly.



Automated sentiment analysis based on text mining techniques offers a simpler, more cost-effective solution by providing timely and focused analysis of huge, ever-increasing volumes of content. The concept of automated sentiment analysis is gaining prominence as companies seek to provide better products and services to capture market share and increase revenues, especially in a challenging global economy. Understanding market trends and buzz enables enterprises to better target their campaigns and determine the degree to which sentiment is positive, negative, or neutral for a given market segment.

Text Mining
Research and business communities are using text
mining to harness large amounts of unstructured textual
information and transform it into structured information.
Text mining refers to a collection of techniques and
algorithms from multiple domains, such as data mining,
artificial intelligence, natural language processing (NLP),
machine learning, statistics, linguistics, and computational
linguistics. The objective of text mining is to put the
already accumulated data to better use and enhance an
organizations profitability. With a variety of customer
trends and behavior and increasing competition in each
market segment, the better the quality of the intelligence,
the better the chances of increasing profitability.

The major text mining techniques include:

Text clustering: The automated grouping of textual documents based on their similarity; for example, clustering documents in an enterprise to understand its broad areas of focus.

Text classification or categorization: The automated assignment of documents into some specific topics or categories; for example, assigning topics such as politics, sports, or business to an incoming stream of news articles.

Entity extraction: The automated tagging or extraction of entities from text; for example, extracting names of people, organizations, or locations.

Document summarization: An automated technique for deriving a short summary of a longer text document.

Sentiment analysis applies these techniques to assign sentiment or opinion information to certain entities within text. Sentiment evaluation is another step in the process of converting unstructured content into structured content so that data can be tracked and analyzed to identify trends and patterns.

Sentiment Analysis
Sentiment analysis broadly refers to the identification and assessment of opinions, emotions, and evaluations, which, for the purposes of computation, might be defined as written expressions of subjective mental states.

For example, consider this unstructured English sentence in the context of a digital camera review:

Canon PowerShot A540 had good aperture combined with excellent resolution.

Consider how sentiment analysis breaks down the information. First, the entities of interest are extracted from the sentence:

Digital camera model: Canon PowerShot A540

Camera dimensions or features: aperture, resolution

Sentiments are further extracted and associated for each entity, as follows:

Digital camera model = Canon PowerShot A540; Dimension = aperture, Sentiment = good (positive)

Digital camera model = Canon PowerShot A540; Dimension = resolution, Sentiment = excellent (positive)

Based on the individual sentence-level sentiments, aggregated and summarized sentiment about the digital camera is obtained and stored in the database for reporting purposes.
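To show what that structured output might look like, the short Python sketch below stores the two sentence-level sentiments as records and rolls them up per model; the field names and scoring are illustrative assumptions, not a prescribed schema.

records = [
    {"model": "Canon PowerShot A540", "dimension": "aperture",   "polarity": +1},  # "good"
    {"model": "Canon PowerShot A540", "dimension": "resolution", "polarity": +1},  # "excellent"
]

# Aggregate sentence-level sentiment into a per-model summary for reporting.
summary = {}
for r in records:
    counts = summary.setdefault(r["model"], {"positive": 0, "negative": 0})
    counts["positive" if r["polarity"] > 0 else "negative"] += 1

print(summary)  # {'Canon PowerShot A540': {'positive': 2, 'negative': 0}}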

[Figure 1. Sentiment analysis steps: fetch/crawl and cleanse, text classification, entity extraction, sentiment extraction, sentiment summary, and reports/charts.]

The following sections delve into the technical details and algorithms used for this type of sentiment analysis.

Sentiment Analysis Steps
Suppose we are interested in deriving the sentiment or opinion of various digital cameras across dimensions such as price, usability, and features. Figure 1 illustrates the steps we will follow in this analysis.
Step 1: Fetch, Crawl, and Cleanse
Comments about digital cameras might be available on gadget review sites or in discussion forums about digital cameras, as well as in specialized blogs. Data from all of these sources needs to be collected to give a holistic view of all the ongoing discussions about digital cameras. Web crawlers (simple applications that grab the content of a Web page and store it on a local disk) fetch data from the targeted sites. The downloaded Web pages are in HTML format, so they need to be cleansed to retain only the textual content and strip the HTML tags used for rendering the page on the Web site.
Step 2: Text Classification
The sites from which data is fetched might contain extra
information and discussions about other electronic gadgets,
but our current interest is limited to digital cameras. A text
classifier determines whether the page or discussions on it
are related to digital cameras; based on the decision of the
classifier, the page is either retained for further analysis or
discarded from the system.

The text classifier is provided with a list of relevant (positive) and irrelevant (negative) words. This list consists of a base list of words supplied by the software provider, which is typically enhanced by the user (the enterprise) to make it relevant to the particular domain. A simple rule-based classifier determines the polarity of the page based on the proportion of positive or negative words it contains. You can train complex and robust classifiers by feeding them samples of positive and negative pages. These samples allow you to build probabilistic models based on machine-learning principles. Then, these models are applied to unknown pages to determine each page's relevance.
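A minimal sketch of such a rule-based relevance check is shown below; the word lists and the threshold are placeholders that an enterprise would tune for its own domain.

RELEVANT = {"camera", "lens", "megapixel", "aperture", "shutter", "zoom"}
IRRELEVANT = {"phone", "laptop", "printer", "television"}

def is_relevant(page_text, threshold=0.6):
    """Keep a page if the proportion of relevant words meets the threshold."""
    tokens = page_text.lower().split()
    pos = sum(t in RELEVANT for t in tokens)
    neg = sum(t in IRRELEVANT for t in tokens)
    if pos + neg == 0:
        return False                         # no signal either way; discard the page
    return pos / (pos + neg) >= threshold    # proportion of relevant words

print(is_relevant("great zoom and aperture on this camera"))  # True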
Commercial forums, blog aggregation services, and search
engines (such as BoardReader and Moreover) have become
popular recently, eliminating the need to build in-house
text classifiers. You can use these services to specify
keywords or a taxonomy of interest (in this case, digital
camera models), and they will fetch the matching forums
or blog articles.
Step 3: Entity Extraction
Entity extraction involves extracting the entities from the articles or discussions. In this example, the most important entity is the name or model of the digital camera; if the name is incorrectly extracted, the entire sentiment or opinion analysis becomes irrelevant. There are three major approaches for entity extraction:

Dictionary or taxonomy: A dictionary or taxonomy of available and known models of digital cameras is provided to the system. Whenever the system finds a name in the article, it tags it as a digital camera entity. This technique, though simple to set up, needs frequent updates on every subsequent model launch, so it's not robust.

Rules: A digital camera model name has a certain pattern, such as "Canon PowerShot A540." Therefore, a rule may be written to tag any alphanumeric token following the string "Canon PowerShot" as a digital camera model (a minimal sketch of such a rule follows this list). Such techniques are more robust than the dictionary-based method, but if Canon decides to launch a new model, say the SuperShot, such rules must be updated manually.
Machine learning: This algorithm learns the extraction
rules automatically based on a sample of articles with
the entities properly tagged. The rules are learned by
forming graphical and probabilistic models of the
entities and the arrangement of other terms adjoining
them. Popular machine learning models for entity
extraction are based on hidden Markov models (HMM)
and conditional random fields (CRF).
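The sketch below illustrates the rules approach referenced above with a single regular expression; the pattern and the example sentence are assumptions for illustration only.

import re

MODEL_RULE = re.compile(r"\bCanon PowerShot ([A-Za-z0-9]+)\b")

def extract_models(sentence):
    """Tag alphanumeric tokens that follow 'Canon PowerShot' as camera models."""
    return ["Canon PowerShot " + m for m in MODEL_RULE.findall(sentence)]

print(extract_models("Canon PowerShot A540 had good aperture combined with excellent resolution."))
# ['Canon PowerShot A540']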

Step 4: Sentiment Extraction


Sentiment extraction involves spotting sentiment words within a particular sentence. This is typically achieved using a dictionary of sentiment terms and their semantic orientations. There are obvious limitations to the dictionary-based approach. For example, the sentiment word "high" in the context of price might have a negative polarity, whereas "high" in the context of camera resolution will be of positive polarity. (Approaches to dealing with varying and domain-specific sentiment words and their semantic orientation are discussed in the next section.)
Once an entity of interest (for example, the digital camera
model or sentiment word) is identified, structured sentiment is extracted from the sentence in the form of {model
name, score}, where score is the positive or negative polarity
value of the identified sentiment word in the sentence. If
some dimension (such as price or resolution) is also
found in the sentence, then the sentiment is extracted in
the form of {model name, dimension, score}. We may also
choose to report the source name or source ID to associate
the extracted sentiment back to that source.


The presence of negation words, such as "not," "no," "didn't," and "never," requires special attention. These keywords lead to a transformation in the polarity value of the sentiment words and hence in their reported score. Natural-language techniques are used to detect the effect of the negation word on the adjoining sentiment word. If the negation effect is detected, then the polarity of the sentiment word is inverted.
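The following simplified sketch shows one way such negation handling might be coded; the lexicons, the window size, and the scores are assumptions, and a production system would rely on fuller natural-language analysis.

NEGATIONS = {"not", "no", "didn't", "never"}
POLARITY = {"good": +1, "excellent": +1, "poor": -1, "blurry": -1}

def score_sentence(sentence, window=3):
    """Score sentiment words, inverting polarity when a negation word precedes them."""
    tokens = sentence.lower().split()
    scores = []
    for i, tok in enumerate(tokens):
        if tok in POLARITY:
            score = POLARITY[tok]
            if any(t in NEGATIONS for t in tokens[max(0, i - window):i]):
                score = -score               # negation inverts the polarity
            scores.append((tok, score))
    return scores

print(score_sentence("the resolution is not good"))  # [('good', -1)]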
The extracted sentiment data is now in a structured format
that can be loaded into relational databases for further
transformation and reporting.
Step 5: Sentiment Summary
The raw sentiments extracted in Step 4 come from
individual sentences that are specific to certain entities.
To make the data meaningful for reporting, it must
be aggregated. One of the obvious aggregations in the
context of digital cameras will be model-name-based
aggregation; in this case, all of the positive, negative, or
neutral entries in the database are grouped together. Again,
model- and dimension-based sentiment aggregation would
allow the discovery of detailed, dimension-wise sentiment
distribution for every model. Based on the reporting needs,
different levels of aggregation and summarization need to
be carried out and stored in a database or data warehouse.
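As a small illustration of these aggregations, the sketch below computes net sentiment by model and by model and dimension from a handful of assumed records; the record layout and the net-score measure are assumptions for the example.

from collections import defaultdict

records = [
    ("Canon PowerShot A540", "resolution", +1),
    ("Canon PowerShot A540", "price", -1),
    ("Kodak V570", "resolution", +1),
]

by_model = defaultdict(int)
by_model_dimension = defaultdict(int)
for model, dimension, polarity in records:
    by_model[model] += polarity                         # overall net sentiment per model
    by_model_dimension[(model, dimension)] += polarity  # dimension-wise breakdown

print(dict(by_model))
print(dict(by_model_dimension))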
Step 6: Reports/Charts
Reports and charts can be generated directly from the
database or data warehouse where the aggregated data is
stored in a structured format. Such reporting falls under
the purview of traditional BI and reporting, and is not
related to the core sentiment analysis steps.
The steps described above have been used to transform
the unstructured textual data in blogs and forums to
structured, quantifiable numeric sentiment data related to
the entity of interest.

Sentiment Analysis Challenges


There are challenges in sentiment analysis, but fortunately
some simple tactics can help you overcome them. The
challenges discussed in this section are related to sentiment
assignment, co-reference resolution, and assigning domain-specific polarity values to sentiment words.


Sentiment Assignment
Suppose a sentence mentions digital camera features such
as resolution, usage, and megapixels; the sentence also
mentions a sentiment word, say, good. Should we relate
all or only some of the features to the sentiment word?
The issue becomes even more challenging when multiple
sentiment words or model names are mentioned in the
same sentence. Simple heuristics, such as assigning the model name or feature to the nearest occurring sentiment word, yield acceptable but limited accuracy. Deep NLP techniques may be used
to identify the model names or features (nouns) that are
related to the sentiment word (adjective or adverb) in the
context of that sentence.
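A sketch of the nearest-sentiment-word heuristic appears below; the lexicons are placeholders, and the final pairing also shows how the heuristic can misassign a sentiment, which is why its accuracy is limited.

FEATURES = {"resolution", "usage", "megapixels", "price"}
SENTIMENTS = {"good", "excellent", "poor"}

def pair_features(sentence):
    """Pair each feature with the closest sentiment word by token distance."""
    tokens = sentence.lower().replace(",", " ").split()
    feat_pos = [(i, t) for i, t in enumerate(tokens) if t in FEATURES]
    sent_pos = [(i, t) for i, t in enumerate(tokens) if t in SENTIMENTS]
    pairs = []
    for fi, feat in feat_pos:
        if sent_pos:
            _, nearest = min(sent_pos, key=lambda s: abs(s[0] - fi))
            pairs.append((feat, nearest))    # feature takes the closest sentiment word
    return pairs

print(pair_features("good resolution and megapixels, but poor usage"))
# [('resolution', 'good'), ('megapixels', 'poor'), ('usage', 'poor')]  -- note the heuristic misreads "megapixels"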
Reviews often include comparative comments about
multiple digital camera models within single sentences. For
example:

Kodak V570 is better than the Canon PowerShot A460.
Kodak V570 scores more points than Canon
PowerShot A460 in terms of resolution.
In comparing the Kodak V570 and Canon PowerShot A460, the latter wins in terms of resolution.
Nikon D200 is good in terms of resolution, while
Kodak V570 and Canon PowerShot A460 have
better usability.

Dealing with such comparative sentences requires building complex natural-language rules to understand the impact and span of every word. For example, the word "better" would signal a positive sentiment extraction for one camera model or feature and negative sentiment data for another.
Co-reference Resolution
Suppose a discussion about a digital camera mentions
the model in the beginning of the article, but subsequent references use pronouns such as "it" or phrases such as "the camera." Referring to a proper noun by using a pronoun is
called co-reference.

Co-reference is a common feature of the English language. Ignoring sentences that use it will lead to a loss in data and incorrect reporting. Co-reference resolution, also referred to as anaphora resolution, is a vast area of research in the NLP and computational linguistics communities. It is achieved using rule-based methods or machine-learning-based techniques. Open source co-reference resolution systems such as GATE (General Architecture for Text Engineering) provide the accuracy required for sentiment analysis.
Domain-specific Polarity Values and Sentiment Words
As discussed earlier, sentiment words have different interpretations in different contexts. For example, "long" in the context of movies might convey a negative sentiment, whereas in sports it would indicate positive polarity. Similarly, "unpredictable" might convey positive sentiment for movies, but would indicate negative polarity when used to describe digital cameras or mobile phones.

This problem can be tackled by using a domain-specific sentiment word list. Such a list is created by analyzing all the adjectives, adverbs, and phrases in the domain-specific document collection. The analysis calculates the proximity of these words to generic positive words such as "good" and generic negative words such as "bad." Another calculation is called point-wise mutual information, which provides a measure of whether two terms are related and hence jointly occurring, rather than showing up together by chance. These calculations can be performed for the word across all documents to determine whether a word occurs more often in the positive sense than in the negative sense.
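For reference, the point-wise mutual information of two terms, and a semantic orientation score built from it, are commonly written as follows; this formulation follows Turney [2002] in the bibliography and uses "excellent" and "poor" as illustrative seed words, rather than being the article's prescribed formula:

\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}

\mathrm{SO}(w) = \mathrm{PMI}(w, \text{excellent}) - \mathrm{PMI}(w, \text{poor})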
These techniques work well if a certain sentiment word has
a fixed polarity interpretation within a certain domain.
Now, suppose we have the sentiment word "high," which in the digital camera domain could indicate negative sentiment for price but positive sentiment for camera resolution. Such cases are a bit more difficult to handle and can often lead to errors in sentiment analysis. To tackle such scenarios, the system has to store some mapping of the entity, the sentiment word, and its associated polarity; for example, {high, price, -ve} and {high, resolution, +ve}.
Creating and verifying such mappings involves considerable
manual work on top of automated techniques.



Examples
Sentiment Analysis of Digital Camera Reviews
There are many Web sites that contain reviews related to
digital cameras. Suppose a consumer is looking to buy a
particular digital camera and would like to get a complete
understanding of the camera's different features, strengths, and weaknesses. She would then compare this information
to other contemporary digital camera models of the same
or competing brands. This would involve manual research
across all related Web sites, which might require days
or even months of research. Rather than doing this, the
consumer is more likely to gather incomplete information
by visiting just a few sites.
Automated sentiment analysis and BI-based reporting can
come to the rescue by providing a complete overview of
the many discussions about digital camera models and
their features.
First, a list of available digital camera models is collected
from the various companies' catalogs to create a comprehensive taxonomy of digital camera models. An initial list
of digital camera features or dimensions is also collected
from these catalogs. All online discussion pages are
collected from the digital camera review Web sites.
One important consideration during taxonomy creation is
the grouping of synonymous entities. For example, Canon PowerShot A540 may also be referred to as PowerShot 540 or Canon A540. All of these should be grouped as a single entity. Again, the dimension camera resolution may be referred to as "resolution," "megapixel," or simply "MP"; all should be aligned to the single entity "resolution."
The presence of the camera model name on a given page
indicates that it should be considered for further analysis.
The next challenge is to extract the entities of interest from the text; that is, the digital camera model names and features. A taxonomy-based method is used to extract those that are known. Machine-learning-based approaches can extract the others. Here, documents tagged with existing model names and features are provided as training to the machine-learning algorithm, which uses the data to learn the extraction rules. These rules are then used to
extract entities from other incoming articles.


Raw sentiment is extracted from the sentiment-bearing sentences using the approaches described above. A list of sentiment-bearing words, along with their polarity values, is provided as input. Based on the raw sentiments, sentiment aggregation is carried out on two dimensions: digital camera model and digital camera features. Further aggregation can be carried out for each Web site to identify any site-specific bias in the extracted sentiments. These aggregated values are then stored in the data warehouse for reporting purposes.
Sentiment Analysis of Election Campaigns
The most recent U.S. presidential election saw a large
number of online Web sites discussing the post-election
policies and agendas of Democratic nominee Barack
Obama and Republican John McCain. These discussions
come from people who are very likely to be legitimate
American voters (rather than, say, children or people residing outside the U.S.). Political parties such as Democrats
and Republicans employ armies of people across the U.S.
to survey people about their opinions on the policies of the
presidential candidates. These surveys incur huge costs and
delays in information collection and analysis.
Automated BI and sentiment analysis can work magic here
by continuously analyzing the comments posted on Web
sites and providing prompt, sentiment-based reporting.
For example, a popular presidential debate on television
one evening will lead to comments on the Web. Sentiment
analysis performed on the comments can be completed in
real time, and the political parties can gauge the response
to the debate and to the policy matters discussed. Smart
technology use and intelligent data collection can provide
in-depth, state-wide sentiment analysis of the comments.
Such analysis would be extremely powerful in determining
the future election campaigning strategy in each state.
Considering the sensitivity and impact of the analysis,
careful attention must be paid to generating the taxonomy,
which consists of two main entities: the presidential nominees and the policies or issues discussed. The presidential
nominees' list is finite, corresponding to the major political
parties. Variations in the names, acronyms, or synonyms
should also be carefully studied and collated.


Generating the taxonomy of issues or policies is far more challenging. Each issue is defined in terms of keywords
or phrases; some of these will appear in multiple issues or
policies. Variations among keywords and phrases can be
quite large, and capturing them requires considerable time
and effort. Automated methods may be used for many of
these steps, but manual verification and editing is required
to remove discrepancies. Another challenge is determining
the location of each person entering comments. This can
be done by capturing their Internet protocol (IP) addresses,
then associating them with physical and geographical
locations. Comments from outside the country are ignored.
Other comments are associated with states (or cities, as
available). Finally, carefully selected, election-specific
sentiment words are added to the taxonomy.
Once the taxonomy is in place, the raw sentiments
may be extracted from the comments. They are in two
primary forms:

{Presidential Nominee, Location, Sentiment}, which captures generic sentiment about the presidential candidate regardless of issue

{Presidential Nominee, Issue or Policy, Location, Sentiment}, which captures the sentiment or opinion about the particular issue for the presidential candidate

A single comment may lead to the extraction of more than one raw sentiment, as shown above. Next, the data is aggregated along dimensions such as presidential nominee, policy issue, or location. The aggregated results are stored in a warehouse for quick access and reporting.
In the future, many Web sites will likely collect further
details about the people making the comments, including
age group, income, education, religion, race, ethnic origin,
and number of family members. This would allow more
detailed analysis and drill-down of the sentiment results,
which would aid in advanced campaign management such
as micro-targeting specific groups of voters.

[Figure 2. Sample election campaign voter sentiment report: a map of the United States showing positive and negative sentiment toward Obama and McCain by state.]



Other Applications of BI and Sentiment Analysis
Additional applications of sentiment analysis and BI-based reporting include:

Online product reviews. These contributed to the development of sentiment analysis. Product reviews are analyzed to provide an overall idea about the features of the product along with its strengths and weaknesses.

Online movie reviews. These are available in abundance, which led to the discovery of a new domain of sentiment analysis that analyzes people's opinions about movies.

Company news. Analyzing news articles and discussions related to a company can provide detailed sentiment analysis about an organization's performance, along with criteria such as profit, customer satisfaction, and products.

Online videos. Sentiment analysis helps to capture opinions about both video quality and the events portrayed.

Hotels, vacation homes, holiday destinations, and restaurants. Sentiment analysis helps people make informed decisions about holiday plans or where to dine out.

Movie stars, popular sports figures, and television personalities. Sentiment analysis can capture the sentiments and opinions of large groups of people by analyzing discussions or articles related to such public figures.

Existing Research in Sentiment Analysis
Sentiment/opinion analysis is an emerging area of research in text mining. Early researchers rated movie reviews on a positive/negative scale by treating each review as a bag of words and applying machine-learning algorithms like Naïve Bayes. Successive research progressed to detecting sentence-level sentiment and hence reporting higher accuracy figures. In contrast to the research on movie reviews, experts from the finance domain analyzed the sentiment in published news articles to predict the price of a certain stock for the following day.

Experts also discovered new techniques for using Web search to determine the semantic orientation of words, which is at the core of quantifying the sentiment expressed in a sentence. See the bibliography at the end of this article for additional studies and reports.

Final Thoughts
In closing, we would like to spotlight two observations that
highlight the growing need for sentiment analysis:
With the explosion of Web 2.0 platforms such as blogs, discussion forums, peer-to-peer networks, and various other types of social media, all of which continue to proliferate across the Internet at lightning speed, consumers have at their disposal a soapbox of unprecedented reach and power by which to share their brand experiences and opinions, positive or negative, regarding any product or service. As major companies are increasingly coming to realize, these consumer voices can wield enormous influence in shaping the opinions of other consumers and, ultimately, their brand loyalties, their purchase decisions, and their own brand advocacy. Companies can respond to the consumer insights they generate through social media monitoring and analysis by modifying their marketing messages, brand positioning, product development, and other activities accordingly.
Jeff Zabin and Alex Jefferies [2008]. Social Media
Monitoring and Analysis: Generating Consumer Insights from
Online Conversation, Aberdeen Group Benchmark Report.
Marketers have always needed to monitor media for information related to their brands, whether it's for public relations activities, fraud violations, or competitive intelligence. But fragmenting media and changing consumer behavior have crippled traditional monitoring methods. Technorati estimates that 75,000 new blogs are created daily, along with 1.2 million new posts each day, many discussing consumer opinions on products and services. Tactics [of the traditional sort] such as clipping services, field agents, and ad hoc research simply can't keep pace.

Peter Kim [2006]. The Forrester Wave: Brand Monitoring, Q3 2006, Forrester Research.

Bibliography
Baeza-Yates, Ricardo, and B. Ribeiro-Neto [1999]. Modern Information Retrieval. Addison-Wesley Longman Publishing Company.

Cunningham, Hamish, Diana Maynard, Kalina Bontcheva, and Valentin Tablan [2002]. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, PA.

Das, Sanjiv Ranjan, and Mike Y. Chen [2001]. Yahoo! for Amazon: Sentiment Parsing from Small Talk on the Web. Proceedings of the 8th Asia Pacific Finance Association Annual Conference.

Esuli, Andrea, and Fabrizio Sebastiani [2006]. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. Proceedings of LREC-06, 5th Conference on Language Resources and Evaluation, Genova, Italy, pp. 417–422.

Hurst, Matthew, and Kamal Nigam [2004]. Retrieving Topical Sentiments from Online Document Collections. Document Recognition and Retrieval XI, pp. 27–34.

Lafferty, John, Andrew McCallum, and Fernando Pereira [2001]. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289.

Nigam, Kamal, and Matthew Hurst [2004]. Towards a Robust Metric of Opinion. AAAI Spring Symposium on Exploring Attitude and Affect in Text.

Pang, Bo, and Lillian Lee [2005]. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. Proceedings of the ACL, pp. 115–124.

Pang, Bo, and Lillian Lee [2004]. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the ACL, pp. 271–278.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan [2002]. Thumbs up? Sentiment Classification Using Machine Learning Techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10, pp. 79–86.

Rabiner, Lawrence R. [1989]. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286.

Sebastiani, Fabrizio [2002]. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47.

Turney, Peter D. [2002]. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417–424. Philadelphia, PA.

Turney, Peter D., and Michael L. Littman [2003]. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. ACM Transactions on Information Systems, Vol. 21, No. 4, pp. 315–346.


AUTHOR INSTRUCTIONS

Editorial Calendar and Instructions for Authors
The Business Intelligence Journal is a quarterly journal that
focuses on all aspects of data warehousing and business
intelligence. It serves the needs of researchers and practitioners in this important field by publishing surveys of
current practices, opinion pieces, conceptual frameworks,
case studies that describe innovative practices or provide
important insights, tutorials, technology discussions, and
annotated bibliographies. The Journal publishes educational articles that do not market, advertise, or promote
one particular product or company.

Editorial Topics for 2010
Journal authors are encouraged to submit articles of interest to business intelligence and data warehousing professionals, including the following timely topics:

Agile business intelligence
Project management and planning
Architecture and deployment
Data design and integration
Data management and infrastructure
Data analysis and delivery
Analytic applications
Selling and justifying the data warehouse

Editorial Acceptance
All articles are reviewed by the Journal's editors before they are accepted for publication.

The publisher will copyedit the final manuscript to conform to its standards of grammar, style, format, and length.

Articles must not have been published previously without the knowledge of the publisher. Submission of a manuscript implies the author's assurance that the same work has not been, will not be, and is not currently submitted elsewhere.

Authors will be required to sign a release form before the article is published; this agreement is available upon request (contact journal@tdwi.org).

The Journal will not publish articles that market, advertise, or promote one particular product or company.

Submissions
tdwi.org/journalsubmissions
Materials should be submitted to:
Jennifer Agee, Managing Editor
E-mail: journal@tdwi.org

Upcoming Submissions Deadlines
Volume 15, Number 4
Submission Deadline: September 3, 2010
Distribution Date: December 2010

Volume 16, Number 1
Submission Deadline: December 17, 2010
Distribution Date: March 2011


Dashboard Platforms
Alexander Chiang
Introduction
This article discusses the importance of a platform-based dashboard solution for business professionals
responsible for developing a digital dashboard. The first
two sections focus on business users and information
workers such as business analysts. The latter sections
speak to technologists, including software developers.
Alexander Chiang is director of consulting
services for Dundas Data Visualization, Inc.
alexanderc@dundas.com

We will take a brief look at the technologies in the context of the BI stack to help readers put the significance
of dashboard platforms into perspective. Next, we
will present the business challenges of dashboards,
followed by an explanation of how these challenges can
be addressed with a dashboard solution that is based on
a platform.

A Brief History of the BI Stack


The business intelligence (BI) community has mature
technologies for several components of the BI stack. In
particular, the data and the analytics layers have been focal
points for most BI solution vendors for the last few decades.
This makes sense, considering those layers represent the
basic foundations of storing and analyzing data.
The data layer has received the most attention, and
technologists have implemented the majority of features
necessary to address the challenges of storing, consolidating, and retrieving data pertinent to organizations.
The analytics layer has been revitalized since the dot-com boom. As more information was brought online,
massive amounts of unstructured data began floating
in cyberspace, and analysts realized the value proposition of mining and disseminating all this useful data.
Content analysis tools were built to solve the challenge of making sense of all this data, drawing on a ready supply of existing analysis tools on which to build. Other
sectors still maturing in this area include predictive
analysis, which allows organizations to analyze historical data in search of insight about future trends.

Finally, there is the presentation layer. This is the


mechanism for delivering to end users all the information provided by the data and analytics layers.
Traditionally, the information is presented in the
form of scorecards, reports, and/or dashboards. This
article discusses the ideal presentation layer solution
for dashboards. In general, the concepts covered here
can be applied to other areas (such as scorecards and
reporting) as well as the newer advances in analytics.

The Dashboard Platform


A dashboard platform is a software framework designed
to address the visualization needs of businesses.
The platform must provide interfaces and common
functionality to help users address common business
cases with minimum involvement from technologists.
It must also be both highly customizable and extensible
to address complex needs. These software concepts
can be summarized by a statement from computer
scientist Alan Kay: "Simple things should be simple and complex things should be possible" (Leuf and Cunningham, 2001).
A dashboard platform serves specifically to develop and deploy dashboards. Out of the box, the ideal dashboard platform should provide:

An accelerated development and deployment timeline

A collaborative workflow

An open application programming interface (API)

Rapid Dashboard Development and Deployment

There are many ways to develop and deploy a dashboard. The best place to start is to define the business metrics needed. By starting with business metrics, the data behind the business metrics can be discovered while the dashboards are being designed. This parallel workflow significantly decreases the time needed to create a dashboard, assuming there are enough resources to execute this concurrent development.

From a technology perspective, a dashboard solution should facilitate defining business metrics without requiring the underlying data to be prepared first. The personnel responsible for finding the data can use these business metric definitions as a communication medium; that is, they can start looking for the columns and preparing the calculations necessary to satisfy the definitions. Simultaneously, those responsible for designing and creating the dashboard can use these business metric definitions to begin choosing appropriate visualizations and adding interactivity. Once the data and the design are ready, the dashboard can be deployed.

This approach will accelerate production of the dashboard solution thanks to the concurrent workflow between the data team and the dashboard designers.
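To make the idea of a metric definition as a communication medium concrete, here is a minimal sketch in TypeScript. All names are hypothetical and no specific product defines metrics this way; it simply shows a definition an analyst could draft before any data is prepared.

// Hypothetical sketch of a business metric definition used as a
// communication medium between analysts and the data team.
interface BusinessMetricDefinition {
  name: string;            // e.g., "Gross Margin %"
  description: string;     // business meaning, in plain language
  formula: string;         // agreed calculation, expressed informally
  grain: string;           // level of detail, e.g., "per region per month"
  targetSource?: string;   // filled in later by the data team
}

// Analysts can draft definitions before any data is prepared ...
const grossMargin: BusinessMetricDefinition = {
  name: "Gross Margin %",
  description: "Profitability of sales after direct costs",
  formula: "(revenue - cost_of_goods_sold) / revenue",
  grain: "per product line per quarter",
};

// ... while the data team later resolves each definition to real columns.
grossMargin.targetSource = "sales_mart.fct_orders";

The dashboard designers can build against the definition while the data team fills in the physical source, which is exactly the concurrent workflow described above.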

Collaborative Dashboard Development

I discussed the key players and processes of a dashboard initiative in detail in a previous Business Intelligence Journal article (see References). To summarize, the following resources are generally needed:

Business users to utilize the dashboards and confirm the business metrics needed

Business analysts to determine the business metrics and design the dashboards

Database administrators to discover the underlying data used in the business metrics

IT workers to maintain and integrate any technology needed for delivering a BI solution

A dashboard solution should take advantage of all the


players participating in a dashboard initiative. This
goal can be accomplished by implementing interfaces
and functionality specific to the particular audience.
Business users should have a portal to access the
dashboards; business analysts should have a work area
so they can define business metrics and design dashboards. Database administrators should have a work area


area that allows them to connect to their data sources
and manipulate the data so it can satisfy the business
metric definitions. Finally, IT should have interfaces
for administering those who have access to particular
dashboards and interfaces within the system. By
providing work areas, tools, and functionalities specific
to particular tasks and resources, expertise is leveraged
to achieve maximum resource efficiency.
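As a rough illustration of audience-specific capabilities, the mapping of roles to what each work area permits might look like the following sketch. The roles and permission names are hypothetical, not any particular product's model.

// Illustrative only: a minimal role-to-capability map for a dashboard
// platform. Real products expose this through administration screens.
type Role = "businessUser" | "businessAnalyst" | "dba" | "itAdmin";

const capabilities: Record<Role, string[]> = {
  businessUser:    ["viewDashboards"],
  businessAnalyst: ["viewDashboards", "defineMetrics", "designDashboards"],
  dba:             ["connectDataSources", "mapDataToMetrics"],
  itAdmin:         ["manageUsers", "assignDashboardAccess"],
};

function canPerform(role: Role, action: string): boolean {
  return capabilities[role].includes(action);
}

// Example: an analyst may design dashboards but not manage users.
console.log(canPerform("businessAnalyst", "designDashboards")); // true
console.log(canPerform("businessAnalyst", "manageUsers"));      // false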

Open Application Programming Interface


An API allows technologists to effortlessly leverage
all the services and functionalities within a platform.
A well-designed dashboard platform will have been
developed with this paradigm in mind. Such a platform
should leverage its own API to add new features.
Furthermore, dashboard platforms (in fact, any software
platform) won't necessarily have all the features an
organization needs at the time they are evaluated, but
the organization should understand that the platform
will allow for customization so it can meet future needs.
Before examining the details of the open API, it is
important to recognize the technology challenges
dashboard solutions need to address.
Dashboard Technology Challenges
The three key technical challenges in leveraging
dashboards as an information delivery mechanism are:

System integration

Data source compatibility

Specialized data visualizations

System Integration

In general, organizations have an existing IT infrastructure in place, including corporate Web portals.
Ideally, the chosen dashboard solution should be easy
to integrate within this infrastructure; traditionally,
most dashboard solutions and their respective tools
were standalone desktop applications. It is difficult to

couple the corporate Web portal with such applications


because they are two different types of technologies.
Dashboard vendors recognized this and moved their
tools toward Web-based solutions. The full benefits of
moving to this type of solution are beyond the scope of
this article, but scalability and maintainability are the
two major advantages. An IT infrastructure usually has
a security subsystem. The dashboard solution should
leverage this existing subsystem so the IT team won't
have to maintain two different security systems.
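A minimal sketch of this delegation, assuming the portal already authenticates users and can report their roles, might look like the code below. Every interface shown is a hypothetical stand-in rather than a real product API.

// Hypothetical delegation of dashboard authorization to an existing
// portal security subsystem, so IT does not maintain two user stores.
interface PortalSecurity {
  // Assumed to be provided by the corporate portal / SSO layer.
  getRoles(sessionToken: string): Promise<string[]>;
}

class DashboardAccessControl {
  constructor(
    private security: PortalSecurity,
    private dashboardRoles: Map<string, string[]>
  ) {}

  // A dashboard is visible only if the portal says the user holds
  // one of the roles the dashboard was published to.
  async canView(sessionToken: string, dashboardId: string): Promise<boolean> {
    const userRoles = await this.security.getRoles(sessionToken);
    const required = this.dashboardRoles.get(dashboardId) ?? [];
    return required.some((role) => userRoles.includes(role));
  }
}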
Data Source Compatibility

Data source neutrality is important for dashboard vendors. That is, these solutions must connect to multiple
data sources to feed into the dashboards. Although
most dashboard products provide connectivity to
popular databases and analytics packages, the challenge
arises when an organization has to use a homegrown
analytics engine or a more specialized database. For
those businesses investing in complete BI solutions
provided by bigger vendors, this is a non-issue, as they
can leverage their consolidation technologies. For the
mid-market, choosing an end-to-end solution may
not be practical or within the budget. This makes it
important for the dashboard solution to provide a way
to connect to various types of data sources.
Specialized Data Visualizations

There are various dashboard types (e.g., strategic,


tactical, operational) as well as dashboards targeted at specific verticals. Generally, vertical dashboards
require particular types of visualizations. For example,
a media company may be interested in a dashboard that
analyzes social networks so the company can target
specific individuals or groups with many ties to other
individuals and groups. This requires a specific type of
network diagram that is not found in most dashboard
products. As a result, the media company might
consider creating a custom solution.
These challenges make it difficult for an organization
to choose a vendor and understand what effect that choice has on its long-term strategy and growth

prospects. For example, the same media company may


decide to provide television broadcasting services and
may require real-time dashboards to monitor ratings.
This scenario would require visualizations specifically
created for real-time presentation, which typically
entails performance challenges.
The point is that dashboard solutions provide basic
visualizations such as standard chart types, gauges,
and maps, but they do not generally provide more
specialized visualizations. How do we address these
technology issues?
The Dashboard Platform API
A dashboard platform should address the technology
problems previously described: system integration, data
source compatibility, and specialized visualizations.
These can be resolved by an API that affords the
following:

A standalone dashboard viewer

Data source connectors

Third-party data visualization integration

Standalone Dashboard Viewer

A standalone dashboard viewer is a separate control that allows developers to integrate dashboards into other applications. Most organizations have a Web-based portal, which suggests that a dashboard platform should, at a minimum, include a Web-based viewer. Although rare, company portals built around desktop technologies are not necessarily out of luck. Most thick-client development tools have a standalone browser control that will allow the viewer to be embedded.

Many businesses display sensitive data on dashboards, and the viewer should take this into consideration. The viewer should leverage a company's security system to allow dashboard access using existing role-based credentials. This allows for role- and parameter-specific dashboards to be shown rather than generic dashboards. With adequate integration, further supporting data and files can be paired and shared with dashboards. In addition, files can be created from dashboard data and exported to a variety of file formats using export APIs.

With these areas exposed for customization, the majority of the integration requirements typical of a BI infrastructure are addressed.
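The following sketch shows how a portal page might embed such a viewer and pass along its existing security context. The viewer API shown here is hypothetical; vendors expose their own controls, though the shape is typically similar.

// Illustrative embedding of a Web-based dashboard viewer in a portal page.
interface ViewerOptions {
  container: HTMLElement;               // portal placeholder element
  dashboardId: string;
  sessionToken: string;                 // reuse the portal's existing credentials
  parameters?: Record<string, string>;  // e.g., region for the signed-in user
}

// Hypothetical viewer factory and portal session object (assumptions).
declare function createDashboardViewer(options: ViewerOptions): void;
declare const portalSession: { token: string; userRegion: string };

// The portal page passes its own security context so the platform can
// show role- and parameter-specific dashboards rather than generic ones.
createDashboardViewer({
  container: document.getElementById("sales-dashboard")!,
  dashboardId: "sales-pipeline",
  sessionToken: portalSession.token,
  parameters: { region: portalSession.userRegion },
});

Because the portal's existing token is reused, there is no second login and no separate user store for IT to maintain.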

Data Source Connectors

A good data source connector API should provide standard data schemas for consumption by the platform. Developers would then develop an adapter that connects to the unsupported data sources and manipulates the data they contain until they map to a dashboard platform data schema. Once completed, the platform can consume the data source.

There are many types of data sources, such as Web services and database engines. Their importance is apparent: to facilitate the connection of dashboards to new data sources without third-party consolidation software. This will keep the door wide open for emerging data technologies. Each newly supported data type should be accessible through an appropriate user interface, either by reuse of an existing screen or the creation of a custom one.

Third-Party Data Visualization Integration

A dashboard designer interface generally comes with a


set of standard charts, gauges, and maps for visualizing
data. However, there are many types of additional
visualizations for dashboards, and a dashboard solution
may not have all that are needed to satisfy an organization's requirements.
A good plug-in API should provide a standard interface
for developers to integrate third-party visualizations
into the platforms dashboard designer. This interface
should allow KPIs defined in the platform to be hooked
up to the visualization. In addition, it should define
common events associated with dashboard interactivity (such as mouse clicks). This allows developers to
customize any interaction that may be associated with
the visualization. One example is a workflow diagram.
When the dashboard user clicks on a particular block


of the workflow, the visualization may zoom in and


show sub-workflows of that block.
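A plug-in contract along these lines might look like the following sketch. The interfaces are hypothetical; the workflow diagram is simply the example from the text.

// Hypothetical plug-in contract for integrating a third-party
// visualization into a platform's dashboard designer. The platform
// supplies KPI values and forwards user interaction events.
interface KpiValue {
  kpiName: string;
  value: number;
  timestamp: Date;
}

interface VisualizationPlugin {
  // Called once when the visualization is placed on a dashboard.
  mount(container: HTMLElement): void;
  // Called whenever bound KPIs produce new values.
  update(values: KpiValue[]): void;
  // Common dashboard interactivity, e.g., drilling into a workflow block.
  onClick?(elementId: string): void;
}

// A workflow diagram plug-in that zooms into sub-workflows on click.
class WorkflowDiagramPlugin implements VisualizationPlugin {
  private container!: HTMLElement;

  mount(container: HTMLElement): void {
    this.container = container;
    this.container.textContent = "workflow diagram placeholder";
  }

  update(values: KpiValue[]): void {
    // Color each workflow block by its bound KPI value (details omitted).
  }

  onClick(blockId: string): void {
    // Zoom in and render the sub-workflow for the clicked block.
    this.container.textContent = `sub-workflow of ${blockId}`;
  }
}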
The standard data visualizations (DV) that come
with the platform should also be incorporated into an
extensible API. For example, a chart type not provided by the platform may still share many properties of a standard chart, such as X and Y axes. Consider a real-time line chart: it has similar properties to a line chart, but the key difference is that it changes with time and should move the window of time forward as new data points are received. With a DV API, developers can leverage the basic functionality and properties of the platform's charts and customize them to their organization's needs.
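A sketch of what that might look like in code follows. BaseLineChart and its members are hypothetical stand-ins for whatever a platform's DV API actually exposes; this is illustrative, not a vendor API.

// Extending a platform's built-in line chart into a real-time variant
// that slides its time window as new points arrive.
class BaseLineChart {
  protected points: Array<{ x: Date; y: number }> = [];

  addPoint(x: Date, y: number): void {
    this.points.push({ x, y });
  }

  render(): void {
    // Platform-provided drawing of X/Y axes and the series.
  }
}

class RealTimeLineChart extends BaseLineChart {
  constructor(private windowMs: number) {
    super();
  }

  // Reuse the base chart's properties, but keep only the points that
  // fall inside a moving window of time before re-rendering.
  addPoint(x: Date, y: number): void {
    super.addPoint(x, y);
    const cutoff = x.getTime() - this.windowMs;
    this.points = this.points.filter((p) => p.x.getTime() >= cutoff);
    this.render();
  }
}

// Example: keep the last five minutes of data on screen.
const chart = new RealTimeLineChart(5 * 60 * 1000);
chart.addPoint(new Date(), 42);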

Choosing a platform that allows third-party visualizations to be integrated into the dashboard design provides comfort to a company that is unsure of what types of DVs it will need in the future.

Final Note
A dashboard solution should facilitate accelerated dashboard production, infuse a sense of collaboration among the personnel involved in development, and provide an open API that allows for a customized solution. Companies seeking a flexible and customizable dashboard solution should look for these features.

The benefits are apparent and should be realized immediately. Rapid dashboard development and deployment decreases development costs and gets dashboards into the hands of decision makers more quickly. Interfaces and workflows designed for specific resources reduce the learning curve and increase the likelihood of corporate adoption so the software doesn't just sit on a shelf. Finally, an open API will allow an organization to customize a solution specific to its requirements, lowering the risk of choosing an inappropriate solution for its immediate and long-term needs. Viewing these areas as checkboxes during a product evaluation will help an organization select the right solution.

References
Chiang, Alexander [2009]. "Creating Dashboards: The Players and Collaboration You Need for a Successful Project," Business Intelligence Journal, Vol. 14, No. 1, pp. 59–63.

Leuf, Bo, and Ward Cunningham [2001]. The Wiki Way: Quick Collaboration on the Web, Addison-Wesley.

BI StatShots
Unified Data Management

Barriers. According to our research survey, unified data management is most often stymied by turf issues. These include a corporate culture based on silos, data ownership, and other politics. UDM also suffers when there's a lack of governance or stewardship, a lack of business sponsorship, or unclear business goals for data.

Strategic Value. To test perceptions of UDM's strategic status, this report's survey asked respondents to rate UDM's possible strategic value. A whopping 59 percent reported that it could be highly strategic, whereas an additional 22 percent felt it could be very highly strategic. Few survey respondents said that UDM is not very strategic (5 percent), and no one felt it's not strategic at all (0 percent).

In the perceptions of survey respondents, UDM has a strong potential for high strategic impact. By extension, UDM is indeed strategic (despite its supporting role) when it is fully aligned with and satisfying the data requirements of strategic business initiatives and strategic business goals.

Philip Russom

In your organization, what are the top potential barriers to coordinating multiple data management practices? (Select six or fewer.)

Corporate culture based on silos: 61%
Data ownership and other politics: 60%
Lack of governance or stewardship: 44%
Lack of business sponsorship: 42%
Poor master data or metadata: 32%
Inadequate budget for data management: 31%
Data management over multiple organizations: 28%
Inadequate data management infrastructure: 28%
Unclear business goals for data: 28%
Poor quality of data: 24%
Independence of data management teams: 23%
Consolidation/reorganization of data management teams: 20%
Existing tools not conducive to UDM: 20%
Lack of compelling business case: 19%
Poor integration among data management tools: 14%
Other: 4%

Figure 1. Based on 857 responses from 179 respondents (4.8 average responses per respondent).
Source: Unified Data Management, TDWI Best Practices Report, Q2 2010.

