
Business Intelligence at Ahold Netherlands

Pallas

the Albert Heijn Data Warehouse

A Description of its Architecture

Lidwine van As
Wouter van Aerle

October 2004
Table of contents:
1. Introduction
   1.1 Work Method and Organisation of Documents
   1.2 Explaining the Choices
   1.3 References
   1.4 About the Authors
2. Rationale
   2.1 Defining the Problem
   2.2 Solution
3. A Bird's-eye View of Pallas
4. Functional Capability
   4.1 Point of Departure
   4.2 Process Orientation
   4.3 Quality of the Data
   4.4 Performance Capability for Reporting
   4.5 Other Performance Capabilities
5. Technology
   5.1 Infrastructure
   5.2 Architecture
      5.2.1 Overall Architecture
      5.2.2 Transaction Repository
      5.2.3 Data Marts
   5.3 Data Staging Areas (DSA)
      5.3.1 Processing
   5.4 Interfaces and Interaction with Other Systems
6. Approach
   6.1 Organisational Measures
      6.1.1 Current Services
      6.1.2 Metadata
      6.1.3 New Services
   6.2 Methods, Techniques and Documentation
7. Costs and Benefits
   7.1 General Situation
   7.2 Costs
   7.3 Benefits
   7.4 Future Prospects



1. INTRODUCTION
This architecture description is the entry submitted by the Albert Heijn Competence Centre for Business Intelligence (CC-BI) to the Dutch Championship Architecture competition of 2004. The IEEE recommended practice for architecture descriptions, IEEE 1471, has been used as a guiding principle.

1.1 Work Method and Organisation of Documents

In compliance with IEEE 1471, the primary stakeholders of the data warehouse and their concerns have been described. The concerns are then translated into and grouped under four angles of approach:

Stakeholder: User
Concerns: What processes are supported? What functional capability is offered? What is the quality level of the data?
Angle of approach: Functional Capability

Stakeholder: Owner
Concerns: What does the data warehouse cost? What does it produce? What are its prospects for the future?
Angle of approach: Costs and Benefits

Stakeholder: Internal Accounting Service
Concerns: What is the quality level of the data? How are security and privacy guaranteed? How is traceability guaranteed?
Angle of approach: Functional Capability

Stakeholder: IT Management
Concerns: How do we control and reduce costs?
Angle of approach: Approach

Stakeholder: DWH technical team (development team, DWH architect, management team)
Concerns: How is the data warehouse (DWH) organised? How is the demand for functional capability translated into a technical solution?
Angle of approach: Technology, Approach

Stakeholder: IT Management
Concerns: How does the data warehouse fit in with AH's IT environment?
Angle of approach: Technology

The chapter on ‘Rationale’ explains the reason for the system’s existence; after that the
chapter on ‘A Bird’s-eye View of Pallas’ provides an overall description of the system.
Then the chapters on ‘Functional Capability’, ‘Technology’, ‘Approach’ and ‘Costs and
Benefits’ elaborate further from the various angles of approach.

1.2 Explaining the Choices

Considering that the original architectural design for Pallas, written in 2000, filled 73 pages, the present document is inevitably worded briefly. Not all subjects can be treated with the same degree of depth, so a selection had to be made from the available information.

The most important principle applied here is that what is unique was preferred over what is trivial. For that reason, little attention is devoted to the privacy and security aspects of the data warehouse, since no measures have been taken in that area that differ essentially from what one may expect from any system of this type.



1.3 References

[1] "The Data Warehouse Life Cycle Toolkit", Ralph Kimball et al., ISBN 0-471-25547-
5, Wiley, 1998
[2] Kimball Design Tip #59: "The Surprising Value of Data Profiling", September 2004
[3] "Corporate Information Factory, 2nd edition", W.H. Inmon et al, ISBN 0-471-39961-2,
Wiley, 2000

1.4 About the Authors

Lidwine van As (las@grey-matter.nl) is a self-employed IT consultant working in the field of software development and business intelligence. She has been involved as a data warehouse architect in the Pallas project practically from its inception.

Wouter van Aerle (wouter-van.aerle@ahold.com) started working in 2001 for Albert Heijn in CC-BI. As an information analyst, he was involved in augmenting the data warehouse with bonus card information and in creating a separate data mart for bonus card analysis. In addition to this, he took part in many other new developments. He is currently occupied with further increasing the professionalism of the information analysis specialisation within the CC-BI.



2. RATIONALE

2.1 Defining the Problem

In 1999 it was established that the quality of the management information then provided at Albert Heijn was far from satisfactory. The situation was characterised by poor controllability, incompleteness, contradictory figures, delayed availability and the lack of a single, integrated way of accessing the information sources.

Apart from that, new developments in the business, such as farther-reaching differentiation, made improvements desirable. This led to an increasing need for information with a greater level of detail (branch and check-out level) than could be provided at that time. Because of their closed character, the management information systems then available could not provide this, or were not equipped for the volume of data that could be expected.

That is why the Pallas project was launched in 1999 with the following objective:
“To create a solution extending over the whole of Albert Heijn for the problem relating to
the provision of information within the organisation, to raise this to a higher level and to
lay the foundation for further expansion”.

An additional objective derived from this was to add value by protecting and supporting the complete value chain and by facilitating comprehensive use of the available information. In addition, the new solution had to put an end to the Babel-like confusion that resulted from using several types of information environments side by side. Establishing one central source of historical data would make it possible to guarantee a common reference framework for information and business definitions, or "one copy of the truth".
A final objective was to derive cost advantages by reusing the environment for other
Ahold operating companies.

2.2 Solution

To achieve these objectives, an overall information environment was developed to support the decision-making processes in the AH organisation. It would bring together information from all the business processes and would be called Pallas, the enterprise data warehouse.

The following starting points and supplementary conditions for developing the
environment were formulated to increase the chances that the new environment would
be successful:
• The data warehouse was to be business-driven: the information needs of the business were the primary driver for developing and expanding the data warehouse.
• A pragmatic approach was to be used without losing sight of the theory.
One way in which this was done was to seek a connection with already
existing best practices in the field of data warehousing.
• Think big, act small: The final objective is an enterprise-wide data warehouse,
but it is an enormous chore to get this set up all at once. Instead of doing that, the
first step was to lay the foundation (the architecture), and then build in small
increments.
• Proven technology was to be the preferred choice.



3. A BIRD’S-EYE VIEW OF PALLAS
The Pallas architecture is constructed around a central data warehouse, the Transaction
Repository (TR). The TR is filled from the different source systems within Albert Heijn.
The data marts from which the end users can request data are generated from the TR; they
are customised to suit the support needed by specific business demands and processes.

Pallas has a delay of one day compared to the source systems. The source information is supplied every night and is loaded and processed into the TR and the data marts, so that the results are available the next day for reporting and analysis. Every week approximately 600 million records are loaded into the data warehouse. The feeding systems will gradually switch over to supplying data in real time, which will make the data warehouse's delay increasingly shorter. In the meantime, reports are already available that can display the turnover as of ten minutes ago.

After the foundation for the architecture was laid in 2000, Pallas was developed and elaborated incrementally. Although the expansions were triggered in the first place by, and implemented for the benefit of, solutions to specific business issues, the idea of integration always lay at the basis of the work: the added information was to be integrated with the information already available. In this way it would be possible to gain extra insight into the underlying business processes. The kinds of information now available extend over nearly all elements of the retail value chain, from distribution centre (DC) deliveries to check-out transactions. The information is used to support nearly all of the business processes in the company. Thanks to the rapid availability of the information and the high degree of detail, the data warehouse can provide both management and operational reports.

Since the start of the project, the Pallas user group has grown steadily. Every week 1800 individual users at all levels, from the board of directors to the shop floor and the DC employees, use the reports and the analysis environments.



4. FUNCTIONAL CAPABILITY

4.1 Point of Departure

The objective of a business intelligence environment is simple: providing the right information, at the right time, to the right person, in the right way. Making management and other information available to end users is what determines the fundamental functional capability of the BI environment. Seen from a functional perspective, this can be done in several ways: from static reports to extensive facilities for searching the data. Seen from a technical perspective, there are innumerable alternatives for achieving all of this. The different areas of applicability have been classified in a display sheet.

Figure 1: Display Sheet depicting end-user functional capability

As can be seen on this display sheet, all types of end-user functional capability have been put into operation. The choice was made to realise each type of functional capability (reporting, analysis, and the like) with standard front-end tools that offer both out-of-the-box reporting capability and modules for constructing one's own specific types of reports (such as scorecards). This approach determines a priori what is and what is not functionally possible: after all, the available functional components and modalities of the standard tools determine the way in which data can be reported and made available. In addition, the classification is a good means of communication when co-ordinating information needs with users and the way in which these needs will be made operational.



4.2 Process Orientation

Each type of business (merchandising, fulfilment, replenishment, etc.) has its own unique information requirements. This has been anticipated in the architectural choices made at the TR and data mart levels. Relevant choices in this context include the application of conformed dimensions and taking "one data mart per business process" as a starting point. (In some cases, applying this principle meant that exactly the same data had to be included in several different data marts. Take, for instance, the sales information that is required for the management of several different primary processes such as store operation, merchandising, and the like.) The chosen tooling supports the creation of several different reporting environments that can all be reached via one portal. Thanks to this, a dedicated reporting environment, fed with its own underlying data, could be created within the chosen tool for each supported process or area of attention.

The following bus matrix [1] gives an impression of the primary processes that Pallas
opens up and supports, and the data mart dimensions that are related to this.

Data mart dimensions: Time, Article, Branch, Sales Formula, Campaign, Distribution Centre, Customer, Employee, Supplier.
Primary processes: Brand mgmt / CRM, Format mgmt, Merchandising, Replenishment, Fulfilment store, Fulfilment logistics.

Figure 2: Bus matrix: achieved sourcing and support with regard to primary processes.

Layout and navigation characteristics have been standardised across the various reporting environments. This gives each report and each environment the same look and feel. This affords a uniform and thus "peaceful" presentation for the user, and also creates a sense of recognition as to which information does and does not originate from the data warehouse.

Besides this, the management of data definitions ensures uniformity of information across the various environments. For instance, the name for the monetary value of sales in one environment can be "Turnover", while another environment calls it "Customer Sales", because the underlying definitions differ. The data definitions are documented in a central location and are accessible to the end user, so that the latter can interpret the information offered correctly. Information that has the same definition will also have the same name in each environment, be it at the back-end side, at the database level, at the front-end side or in reports and cubes.



4.3 Quality of the Data

A primary driver for the added value and effectiveness of the data warehouse is the quality of the data stored in it (note 1). Designing and developing for high-quality data was a fundamental principle from the very beginning. The following measures were applied to accomplish this:

1. Drafting KDDs (note 2) with respect to the source systems and source data: since a data warehouse is fed from source systems, the quality of its content is determined to a large degree by the quality of those source systems. The KDDs encompass the following:

• Original source: data is only transferred from the source system that created it, and not from other systems that, for efficiency reasons, also have access to this data but are not its source.

• The highest possible level of detail: information is opened up and stored at the highest level of detail that is available. Thanks to this, the potential of the data warehouse is not unnecessarily restricted by the data that has been incorporated, should more detailed data be required in the future.

• Complete source: when a source table is opened up, the whole source table is
sourced and not only the data that is required at that moment.

2. It is always presupposed that sources are contaminated. Consequently, a full technical and functional source analysis is performed on every new source, down to the attribute level (data profiling [2]). This provides insight into how a source is actually being used. Identified issues are tackled in one of the following ways (a simplified sketch of this issue handling follows after this list):

• The source owner is requested to remedy the issue.

• A check is built into the ETL process to prevent the issue from arising. Each prevention is logged; in this way, for instance, a report can be generated on the number of times that a specific issue has been resolved.

• In addition, a customised follow-up (note 3) is applied to the issue.

3. Because the TR is the single repository for all data, duplications are immediately apparent. When the same type of information arises in various sources, it is, where applicable, collected into one and the same table in the TR. On this score, the TR proves its added value as data integrator. For example, stock levels of articles in distribution centres and in branches can be stored separately in the operational systems, while facilities are created in the TR to combine this data so that comprehensive reports on the total stock position of an article can be produced.

Notes:
1. Efficiency aspects are established in advance by the shape of the output used. It could be stated that the same decision or steering measure could be made with either the data sheet or a dashboard/exception report, provided that the underlying data is the same and of the same quality. The latter, however, is not determined by the tools, but by the back-end of the data warehouse.
2. Key Design Decisions: fundamental decisions with respect to the functional capability and the architecture. See the chapter on 'Approach'.
3. Reject, Ignore, Correct, Suspend.

4. In each data mart, reference data has the same modelling and the same content. This principle of conformed dimensions [1] imposes a uniform pattern on all information. Moreover, conformed dimensions offer the prospect of integrating all data relating to the same subject (article, branch, customer, and the like). One challenge on this score is how to deal with informal stratifications of things like articles or branches: there appears to be a variety of alternative groupings that are not supported by any source system. A strict policy is pursued with respect to this: such data is included in the data warehouse only when a formal definition and a formal source are available for the stratification.

5. Once the software to open up new information has been constructed, a verification of the data processing is performed. This check examines whether everything that was supplied has arrived successfully in the data warehouse and the data mart.
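
Although the Pallas ETL itself is implemented in Informatica Powercenter, the issue-handling mechanism described under point 2 can be sketched in a few lines of Python. The fragment below is purely illustrative: the rule names, record fields and resolution codes are invented for this example; it only shows how a known issue can be resolved (reject, ignore, correct or suspend) while each prevention is logged so that it can be reported on later.

```python
# Illustrative sketch only: rule names, fields and data are invented.
from collections import Counter
from datetime import date

issue_log = Counter()   # counts per (issue, resolution), reported on later
suspended = []          # records parked for manual follow-up

def handle_record(record, rules):
    """Apply each data-quality rule; resolve and log known issues."""
    for rule in rules:
        if not rule["check"](record):
            continue
        issue_log[(rule["name"], rule["resolution"])] += 1
        if rule["resolution"] == "reject":
            return None                       # drop the record entirely
        if rule["resolution"] == "correct":
            record = rule["fix"](record)      # repair and continue loading
        if rule["resolution"] == "suspend":
            suspended.append(record)          # park for manual follow-up
            return None
        # "ignore": load as-is, but the occurrence is still logged
    return record

# Example rule: a sale date in the future is assumed to be a source error.
rules = [{
    "name": "sale_date_in_future",
    "check": lambda r: r["sale_date"] > date.today(),
    "resolution": "correct",
    "fix": lambda r: {**r, "sale_date": date.today()},
}]

row = {"article": 4711, "sale_date": date(2099, 1, 1), "qty": 2}
print(handle_record(row, rules), dict(issue_log))
```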

4.4 Performance Capability for Reporting

In conformity with the display sheet, output in the form of reports on the data warehouse ('Reporting') makes up the bulk of the functional capability supplied from the Pallas architecture. The ways of filling in this type of functional capability vary from Excel-style spreadsheets with large amounts of detailed information up to and including pushed exception reports, balanced scorecards and management dashboards with aggregated information. All these options are supported by one and the same standard tool. To keep the number and variety of reports manageable, a taxonomy of reports has been drafted; newly developed reports must be classifiable within this taxonomy, otherwise they are not developed. It is, however, still difficult to provide for specific functional requirements with these standard forms. This is due to procedural rather than technological reasons: to a certain degree, standard report forms require standardised and structured procedures, but the procedures are far from always standardised. Nevertheless, one positive effect is that the business's use of the BI environment has provided an occasion to address this subject.

4.5 Other Performance Capabilities

The second level in the display sheet is Analysis. Within Pallas, this is understood as posing a series of ad-hoc questions to the database, where each following question is determined by the answers to the preceding ones. This requires first-rate performance from the environment; this functional capability is made operational within Pallas with the aid of (M)OLAP capability (multi-dimensional cubes). Target groups for this type of functional capability are the knowledge workers in Planning & Control, Market Research and similar support departments.
As well as this, a limited amount of statistical analysis also takes place, and a proof-of-concept has been performed with data-mining technology.



Two forms of functional capability that are not directly related to BI, but which Pallas does provide, deserve a brief mention:

• The delivery of information: Pallas is not only the platform for management information within Albert Heijn; the chosen data warehouse architecture has also made it the central storage place for historical information. This characteristic, combined with the high quality of the data, makes Pallas the ideal supplier of historical information. This takes the form of interfaces leading back to the operational systems. We can thus speak of closing the loop: Pallas supplies the required information to operational applications that need historical information to perform their task. In this way, tasks such as predicting volumes for automatic deliveries to stores, forecasting campaigns and designing shelf plans for stores can be supported.

• Operational reporting: because the selected standard tool for reporting can also be
used on operational databases, it was decided to use this front-end tool to standardise
all reporting (not only management information, but also operational reports).

This makes it possible to dismantle the many varieties of other reporting tools that are
often included in standard packages. In practice this means creating a supplementary
reporting environment, this time, however, in the operational system instead of in the data
warehouse. One example of the use of such environments is in the logistics systems. A
high degree of transparency is guaranteed for the user: with the same portal, tools and
layout, he/she can retrieve both management and operational reports.



5. TECHNOLOGY

5.1 Infrastructure

The largest part of the Pallas data warehouse runs under AIX on an IBM p670 with 16 1.45 GHz processors. Besides this, one other performance-intensive data mart runs, also under AIX, on an IBM S85 with eight 750 MHz processors. All data is stored in an EMC disk cabinet that can hold more than 11 TB.

Oracle is used as the RDBMS. An ETL tool is used to transform and load the data into the data warehouse; given the expected number of processing jobs, this was preferable to hand-coded procedures for reasons of manageability. Informatica Powercenter was chosen. The orchestration and scheduling of the Powercenter workflows is established and managed in Control-M (BMC); part of the scheduling and workflow management will gradually be shifted to the Powercenter Workflow Manager.

The end users do not retrieve information from the data warehouse directly, but via end-
user tools. Microstrategy is used for standard reporting. Most users view their reports in a
web browser via Microstrategy Web, or have these delivered as PDF files, Excel
spreadsheets or SMS messages sent by Microstrategy Narrowcaster. A few web servers
have been set up using Microsoft IIS to support Microstrategy Web. These web servers
can balance the load of user requests among one another.

Hyperion Essbase is available for ad-hoc analyses, with Hyperion Analyzer and Temtec Executive Viewer as front-ends.



5.2 Architecture

5.2.1 Overall Architecture

The Pallas architecture is constructed around a central data warehouse, the Transaction Repository (TR). This architecture is in fact a hybrid form between Inmon's Corporate Information Factory [3] on the one hand and Kimball's data warehouse based on dimensional data marts [1] on the other. The TR is filled with source information; this information first arrives in the data staging area, from where it is processed further. The data marts are built from the TR, and multi-dimensional cubes for ad-hoc analysis can be generated from the data marts.

Figure 3: Overall Architecture of Pallas



5.2.2 Transaction Repository

The Transaction Repository (TR) functions as the central ‘one copy of the truth’, the
storage place for historical information at the most detailed possible level. The TR is
optimised for bulk loading and queries on detailed information and, in principle, end users
do not retrieve information directly from it. The ultimate horizon for the TR is 5 years,
with a database size of approximately 2.5 TB (raw data).

With a view to controllability, it is not desirable to change the TR model frequently: the central source for historical information should not need to be completely redesigned, implemented and migrated every six months. Therefore, when the TR model was designed, the greatest possible independence was sought from the source systems on the one hand and from the business requirements of the moment on the other. A generic manner of modelling based on the relational model was used for reference data; reference data is stored according to the following pattern:

Figure 4: PRODUCT Modelling subject area (simplified)



The entity PRODUCT constitutes the heart of the PRODUCT subject area: the domain in
the TR where all reference data relating to sales articles are stored. This entity contains
only unchangeable PRODUCT attributes; it is also the place where the mapping between
the operational keys and surrogate keys takes place. Only the surrogate keys are used
within the data warehouse. All changeable attributes of PRODUCT are stored in
PRODUCT_HIST. This table contains several snapshots of PRODUCT, as it were,
provided with an indication of the time; each snapshot is stored with a starting and ending
date for the situation concerned. This also contains relations with reference data such as
unit of measure and brand name. In this way, the history is stored even when the source
does not do this itself.

PRODUCT groupings are stored in PRODUCT_GROUP. A link to the table PRODUCT_GROUP_TYPE establishes the type of grouping that is stored. The table PRODUCT_GROUP_IN_PGRP_HIST is used to link groupings to one another hierarchically. The table PRODUCT_IN_PRODUCT_GR_HIST situates a product in a hierarchy. The last two tables again also contain a starting and ending date, because hierarchies can change over time.

Thanks to this way of modelling it is possible to add and expand hierarchies and
hierarchy levels without consequences for the modelling.
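
The generic reference-data pattern described above can be illustrated with a small schema sketch. The following Python/SQLite fragment is purely illustrative: Pallas itself runs on Oracle, and the column definitions below are simplified assumptions based on the PRODUCT example, not the real TR tables. It only shows the pattern of an immutable key table, a time-stamped history table, and date-bounded grouping and hierarchy tables.

```python
# Illustrative only: simplified column names, SQLite instead of Oracle.
import sqlite3

ddl = """
CREATE TABLE product (                        -- immutable attributes and key mapping
    product_key     INTEGER PRIMARY KEY,      -- surrogate key used throughout the DWH
    source_key      TEXT NOT NULL             -- operational (source system) key
);
CREATE TABLE product_hist (                   -- changeable attributes, time-stamped snapshots
    product_key     INTEGER REFERENCES product,
    start_date      DATE,
    end_date        DATE,
    description     TEXT,
    unit_of_measure TEXT,
    brand_name      TEXT
);
CREATE TABLE product_group_type (group_type_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_group (
    group_key       INTEGER PRIMARY KEY,
    group_type_key  INTEGER REFERENCES product_group_type,
    name            TEXT
);
CREATE TABLE product_group_in_pgrp_hist (     -- hierarchy between groupings, with validity dates
    child_group_key  INTEGER REFERENCES product_group,
    parent_group_key INTEGER REFERENCES product_group,
    start_date DATE, end_date DATE
);
CREATE TABLE product_in_product_gr_hist (     -- placement of a product in a hierarchy, with validity dates
    product_key INTEGER REFERENCES product,
    group_key   INTEGER REFERENCES product_group,
    start_date DATE, end_date DATE
);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)   # new hierarchies or levels only add rows, never new tables
```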

5.2.3 Data Marts

Business Requirements

While the TR has a permanent character, the data marts are oriented exclusively towards fulfilling a demand for information from the business, without attention to such requirements as future-proofing, reusability, etc. The idea behind this is that the TR serves as an unchangeable historical source, while the way in which the end user regards the information can change, depending on the support that he/she needs to perform his/her task. If the end user's way of looking at the data changes, this can mean that a data mart must be completely remodelled and regenerated. The data marts thus have a much more transitory character than the TR. As was indicated in the chapter on 'Functional Capability', when modelling to suit the users' needs, an attempt is made to construct the data marts in such a way that they can act as a whole in supporting a given business process.

Another difference between the TR and the data marts is that the data marts usually contain no detailed information, but primarily aggregated information. The detailed information from the TR serves only as a basis for composing the views that the user needs to support his/her work. In this way, the details of the check-out transactions stored in the TR can serve as a basis for composing the turnover per branch per quarter. The user is only interested in the aggregated information, so the data mart need only contain the aggregation. (One exception is the data mart used for bonus card analysis, where the analysis takes place at the check-out transaction level.) The time horizon of each data mart is also determined by the wishes of the user for whom it has been developed. The data mart with the broadest historical coverage contains information aggregated over three years, but there is also a data mart with a six-week history. Besides this, the cubes have technical restrictions on the amount of information that they can store.



Tool-determined Requirements; Dimensional Modelling

In addition to the business drivers, the requirements imposed by the end-user tool are also relevant to the modelling of the data marts. Most end-user tools require dimensional modelling (star schemas), so the Kimball [1] dimensional modelling method is followed for the data marts. Besides this, Microstrategy places several specific demands on the dimensional model. For instance, Microstrategy's performance improves significantly when the hierarchies of the dimensions in the model are explicitly modelled ('snowflaking'). On this point there is a divergence from the Kimball modelling, which prescribes far-reaching de-normalisation. The performance of the snowflake is improved still further by letting the keys from the higher levels be inherited by lower levels in the hierarchy, so that the number of joins needed to reach a hierarchical level is kept to a minimum. For instance, the entity ARTICLE contains the keys of all the higher-situated levels. (For that matter, it also holds here that should Microstrategy be replaced by another end-user tool, this would probably impose such different requirements on the modelling of the data marts that these would have to be remodelled and regenerated. Thanks to the separation between the TR and the data marts and the independent modelling of the TR, this could be achieved quickly and without problems.) In the data marts, the users always see the facts from 'today's perspective', whether these are yesterday's facts or facts from two years ago. This means that the facts in the data mart are always combined with the most recently known version of the dimensions (in Kimball terms: a type 1 approach to slowly changing dimensions is followed). Although the TR can be used to generate other 'perspectives', there has thus far been no business demand for this. Thus, the data marts have not yet made any use of the history available in the TR.
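
A minimal sketch of the type 1 (overwrite) behaviour described above, with invented table and attribute names: the data mart dimension simply receives the latest known version of each article, so all facts, old and new, are viewed from today's perspective.

```python
# Illustrative only: invented dimension rows and attributes.
def refresh_type1_dimension(dim_rows, latest_reference):
    """dim_rows: {surrogate_key: attributes} already in the data mart.
    latest_reference: newest attribute values per surrogate key from the TR."""
    for key, attrs in latest_reference.items():
        dim_rows[key] = attrs          # overwrite: no history kept in the mart
    return dim_rows

article_dim = {1001: {"description": "Cola 1L", "assortment_group": "Soft drinks"}}
latest = {1001: {"description": "Cola 1L", "assortment_group": "Beverages"}}
print(refresh_type1_dimension(article_dim, latest))
# Facts keyed on 1001 now roll up under "Beverages", including last year's sales.
```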

The cubes are generated from the data marts, and not from the TR, because the dimensions in the cubes are constructed in the same way as those in the data marts: once the dimensions have been generated in the data marts, they are immediately available in the correct format for the cubes.

All the dimensions in the data marts are designed as conformed dimensions, so that the information in the different star schemas can be compared and related to one another ('drilling across').
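
The effect of conformed dimensions can be shown with a small, invented example of drilling across: because two star schemas share the same article keys, figures from two different fact tables can be combined into one report.

```python
# Illustrative only: made-up fact rows keyed on a shared (week, article_key).
sales_facts = {("wk40", 1001): 1200.0, ("wk40", 1002): 800.0}   # turnover per week/article
stock_facts = {("wk40", 1001): 300,   ("wk40", 1002): 45}       # units on hand per week/article

report = {
    key: {"turnover": sales_facts.get(key, 0.0), "stock": stock_facts.get(key, 0)}
    for key in set(sales_facts) | set(stock_facts)
}
print(report)   # one combined row per (week, article), drawn from two fact tables
```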

5.3 Data Staging Areas (DSA)

Source files are received in the first DSA, where the source data is transformed into the TR model. This is where quality controls, cleaning and enrichment of the data take place. The first DSA consists both of files and of tables. In the second DSA, it is decided which factual information must be loaded onward (see the section on Deltas), and the conformed dimensions are constructed on the basis of the latest state of the reference data in the TR and transferred as a whole to the data marts.

The DSA tables are a physical part of the TR database.



5.3.1 Processing

The information from the source systems is usually batch-processed at night. That keeps
the processor capacity during the day available for producing reports on the data marts
and making analyses in the cubes; moreover, some information can only be supplied after
the close of the day. The processing is subdivided into several main flows, each of which
loads a particular type of data: check-out transactions, logistic movements, condition of
the stock, etc. The main flows are, in their turn, composed of approximately 1500
Powercenter jobs. Managing the underlying dependencies and starting these flows is done
in Control-M.

Information can be used in several data marts; there is thus an m:n correspondence between data flows and data marts. Originally, source information was loaded onward in one flow from the source via the TR to the data marts where it was needed. Sometimes, for reasons of performance, the information was loaded from other data marts instead of from the TR, because the tables needed were already present in the desired basic form in another data mart. This meant that the various flows became increasingly interwoven, which had an unfavourable effect on the underlying dependencies and the expandability of the environment.

That is why a new approach was developed based on working with semi-finished products. This means that the information flows are cut in two: first, the information from all flows is processed into the TR and the underlying DSA, where the so-called semi-finished products are produced. These semi-finished products then serve as the basis for constructing the separate data marts. The shared basic tables thus now arise in the DSA.

Figure 5: ETL processing by means of semi-finished products

A procedure for undoing the interweaving and partially reorganising the standing
environment according to the semi-finished product principle has already started and will
run until the end of 2004.
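
The semi-finished product principle can be sketched as a two-phase job. The data and names below are invented for illustration: phase 1 derives a shared intermediate table once from the TR detail, and phase 2 builds each data mart from that intermediate only, so the marts no longer read from one another.

```python
# Illustrative only: invented names and data.
def build_semi_finished(tr_sales):
    """Phase 1: shared daily turnover per (date, article), derived once from TR detail."""
    daily = {}
    for line in tr_sales:                                  # one record per check-out line
        key = (line["date"], line["article"])
        daily[key] = daily.get(key, 0.0) + line["amount"]
    return daily

def build_article_mart(daily):
    """Phase 2: one of several marts, each built from the semi-finished table only."""
    per_article = {}
    for (_, article), amount in daily.items():
        per_article[article] = per_article.get(article, 0.0) + amount
    return per_article

tr_sales = [{"date": "2004-10-01", "article": 1001, "amount": 2.5},
            {"date": "2004-10-01", "article": 1001, "amount": 2.5}]
semi = build_semi_finished(tr_sales)
print(build_article_mart(semi))
```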



Deltas

Some sources can deliver changes (corrections) to factual information that has already been loaded. A 'smart' delta mechanism is used to process these corrections: it signals when 'movements' in the factual information have taken place and ascertains which information has changed since the last processing run or has been newly supplied. This is used to create a To Do list that indicates which facts must be updated. The ETL processing uses the To Do list to determine which information must then be passed on for loading into the data marts. Altered information is first removed from the tables and then entered anew. The fact tables are thus constructed incrementally instead of being completely recreated each time.

This not only reduces the total amount of processing in comparison with "dumb" bulk loading, it also provides an administrative record and log of the information that has changed.
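
The delta mechanism can be summarised in a few lines of illustrative Python (the actual Powercenter implementation is of course different, and all names below are invented): detect which facts are new or changed, put them on the To Do list, and apply them by deleting the old rows and inserting the new ones, while logging what was changed.

```python
# Illustrative only: invented keys and values.
def build_todo_list(incoming, already_loaded):
    """Return the keys whose fact rows are new or changed since the last run."""
    return [key for key, row in incoming.items() if already_loaded.get(key) != row]

def apply_deltas(fact_table, incoming, todo, log):
    for key in todo:
        fact_table.pop(key, None)        # remove the old version, if any
        fact_table[key] = incoming[key]  # insert the corrected or new fact
        log.append(key)                  # administration of what was changed
    return fact_table

facts = {("2004-10-01", 1001): 5.0}
incoming = {("2004-10-01", 1001): 4.0, ("2004-10-02", 1001): 6.0}  # a correction and a new day
log = []
todo = build_todo_list(incoming, facts)
print(apply_deltas(facts, incoming, todo, log), log)
```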

Re-aggregation

The use of today's perspective raises an issue for aggregated information. After all, aggregates are built on the basis of the perspective at the time of aggregation; factual information that has not changed is not reloaded, so changes in the corresponding reference information (an adjustment of the 'perspective') are not carried through into the aggregates. The picture of reality that the aggregates provide thus gradually becomes less and less accurate. This applies to aggregates built over dimensions in which entities can be assigned to a different grouping, for instance the article dimension, when articles are shifted from one assortment group to another.
That is why these aggregates must regularly be reconstructed. Given the magnitude of the available history in some data marts, it is not practicable to include re-aggregation as an element of the nightly processing: it would simply require too much time. When the time for re-aggregation arrives, a copy of the basic tables is made and the re-aggregation process is started in the background. When it has completed, the old and new aggregate tables are swapped.
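
The background rebuild and swap can be sketched as follows; the data and grouping are invented, and in reality this happens at the database level rather than in application code.

```python
# Illustrative only: invented facts, dimension and table names.
import copy

def rebuild_aggregate(base_facts, current_dimension):
    """Re-aggregate all history against today's version of the dimension."""
    agg = {}
    for (day, article), amount in base_facts.items():
        group = current_dimension[article]["assortment_group"]   # today's grouping
        agg[(day, group)] = agg.get((day, group), 0.0) + amount
    return agg

tables = {"agg_sales": {("2004-10-01", "Soft drinks"): 5.0}}     # stale grouping
base = {("2004-10-01", 1001): 5.0}
dim = {1001: {"assortment_group": "Beverages"}}                  # article was regrouped

snapshot = copy.deepcopy(base)              # work on a copy while nightly loads continue
new_agg = rebuild_aggregate(snapshot, dim)  # runs in the background
tables["agg_sales"] = new_agg               # the swap: old and new aggregate exchanged
print(tables)
```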

5.4 Interfaces and Interaction with Other Systems

By definition, a data warehouse that has been created to integrate information from other
systems has links to these other systems.

In integrating information from different systems, Pallas has had an important advantage
from the very start with respect to many of its fellow data warehouses: within Albert
Heijn the source systems were already in a highly standardised and uniform state
because a corporate data model had been in use. That meant that for each type of source
data there was, in principle, only one source system in use; delivery of similar types of
data from source systems that differed in terms of modelling and had dissimilar data
definitions was thus not an issue.

In 2000, it was decided to remodel the AH application landscape. To better anticipate the flexibility and differentiation demanded by business requirements, it was decided to switch over to a service-oriented approach. Applications were henceforth set up as independent services, each of which had its own data store and could communicate with other applications asynchronously, in an event-driven manner, by means of messages sent over a message broker.



It is clear that this also had repercussions on the delivery of source data to Pallas:
although the majority of the data would still be delivered as an interface file in a batch
process, a growing number of applications delivered their source data in near-real-time in
the form of messages, to the extent that this was practicable and meaningful. Although
this made new demands on the data warehouse’s reception and loading mechanism, it did
offer new potential. In the meantime, Pallas has linked up to the message flow for check-
out transactions: each check-out transaction in the store is forwarded in near-real-time to
headquarters where it is processed in various relevant systems, including the data
warehouse. The data warehouse collects the real-time information in an ODS (operational
data store) and sends this onward to the TR in small batches for loading every ten
minutes. This is called “trickle feed”: the information ‘trickles’, as it were, into the data
warehouse. This not only gave Pallas more room in its nightly batch window – after all, the more information that trickles in during the day, the less there is that needs to be processed at night – it also meant that users could be provided with up-to-date information in near-real-time. The point of attention in shifting all or part of the processing to day-time hours is the balance between the time gained by loading during the day and the users' experience of reporting performance. After all, if the loading is done during the day, this can come at the expense of the processor capacity available for running reports.
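
The trickle feed can be sketched as a small buffering component. The class, message format and parameter names below are assumptions made for this illustration only; they are not the actual Pallas implementation.

```python
# Illustrative only: invented class, message fields and parameters.
import time

class TrickleFeed:
    def __init__(self, load_batch, interval_seconds=600):
        self.buffer = []                     # the ODS role: collect incoming messages
        self.load_batch = load_batch         # loads one micro-batch into the TR
        self.interval = interval_seconds     # e.g. every ten minutes
        self.last_flush = time.monotonic()

    def on_message(self, checkout_txn):
        self.buffer.append(checkout_txn)
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.load_batch(self.buffer)     # one small TR load instead of one big night batch
            self.buffer = []
        self.last_flush = time.monotonic()

feed = TrickleFeed(load_batch=lambda batch: print(f"loading {len(batch)} transactions"),
                   interval_seconds=0)       # interval 0 so this example flushes immediately
feed.on_message({"store": 1532, "amount": 12.40})
```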



6. APPROACH

6.1 Organisational Measures

Within Albert Heijn, the development and maintenance of Pallas is entrusted to the
Competence Centre Business Intelligence (CC-BI). This is a component unit of AH IT,
Albert Heijn's IT department. In addition to this, the CC-BI also carries out BI activities for various other Ahold operating companies on the basis of the existing infrastructure and working methods. The CC-BI is divided into two sub-departments: Operations & Improvements (O&I) and Projects. The activities performed cover everything from DBA work to end-user support; only the management of the hardware is situated elsewhere. Depending on the number of current projects, there are between 30
and 50 people working in the department. The CC-BI supplies services to its users. A
service can be seen as a logical whole of information on offer, including the complete
flows leading up to it, from source to report.

6.1.1 Current Services

The delivery of existing services is provided by O&I. O&I is primarily responsible for the daily operation and maintenance of the existing architecture and BI solutions; this also includes implementing cost-saving improvements to the architecture and infrastructure. A Service Level Agreement is concluded with each user group for the service it receives, to give direction to the management of that service. It establishes agreements on such items as the expected time at which the requested reports will be available each morning and the maximum down time of a service. In addition, O&I supplies support and training for the end users of the data warehouse and provides Quick Services: upon request, information that users cannot access themselves but which is stored in the data warehouse is supplied via reports or cubes.

Finally, standards and directives for developing new solutions are drafted and monitored
within O&I and the Pallas Delivery Framework is also managed there. The development
processes and deliverables for new projects are defined within this framework.

6.1.2 Metadata

Metadata (information about the information) is indispensable for proper management of the environment. Two types of metadata are particularly interesting in this context. The first is dynamic metadata, metadata about the process: how many reports are run, how heavily the machines are loaded during the day, what service level has been attained, etc. The data warehouse, as the standardised environment for reporting, reports information on its own operation: for this purpose, the system's own tooling registers facts and dimensions in the TR. These are processed into a separate data mart from which reporting is also done using the standard tools. The second type of metadata supporting the operation is static metadata: information on the structure of the data warehouse. Various services can have certain components (machines, databases, ETL processes, etc.) in common, so the failure or modification of one such component can have an impact on several services. Mapping out the mutual dependencies between services and evaluating the impact of changes and service interruptions on the various services is still primarily manual work. The aim is to automate this further with the help of an integrated metadata solution.
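
The impact analysis that such a metadata solution should automate amounts to a simple dependency lookup, sketched below with invented services and components.

```python
# Illustrative only: the services and components are invented.
service_dependencies = {
    "Sales reporting":         {"TR", "sales_mart", "microstrategy_web"},
    "Replenishment reporting": {"TR", "replenishment_mart", "microstrategy_web"},
    "Bonus card analysis":     {"TR", "bonus_mart", "essbase_cube"},
}

def impacted_services(component):
    """Return every service affected by a failure or change of one component."""
    return sorted(s for s, deps in service_dependencies.items() if component in deps)

print(impacted_services("TR"))             # every service depends on the TR
print(impacted_services("essbase_cube"))   # only bonus card analysis
```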



6.1.3 New Services

The Projects sub-department is organised into project teams that develop new services on
the basis of the existing environment and architecture. These project teams do not work in
isolation: O&I employees are involved in projects as reviewers to guarantee the tie-in of
newly developed solutions with the existing environment. Ultimately O&I determines
whether and when a new solution is put into production.

The central framework for each project team is the Pallas Delivery Framework. Because
each project or release usually has the same basic elements – data logistics and front-end
functional capability – a high degree of standardisation in the activities is attainable.
Although a lot of room must still be reserved for specific solutions, everyone operates on the basis of the same reference framework. This is also necessary because everyone is in fact working on the same environment.

6.2 Methods, Techniques and Documentation

• Fundamental decisions with respect to the functional capability and the architecture of
the data warehouse are recorded as Key Design Decisions (KDDs). Originally created
as a medium for formalising decision-making, they form a good record and reference
for the principles that lie at the foundation of Pallas.

• Besides this, directives and best practices are formulated and established for daily development practice. After experimenting with various forms of recording, design patterns have recently been introduced for this purpose. Design patterns have been drafted for processing facts with an unknown reference, for logging data quality problems and for processing messages delivered in real time.

• Analysis: information analysis for BI applications appears to be a difficult matter. Ultimately, it is not very difficult to design the desired output in the shape of reports once it has been established for which management or decision-making processes the information is needed. However, analysing, modelling and describing all this is still undeveloped territory. Ideas and experiences within the CC-BI are being developed into best practices, including varieties of prototyping, solution scenarios and menu cards. A menu card is a representation method in which an operational or other process is analysed: for each process step it is determined which management or other information is needed and which source can provide that information.

• In any case, experience has shown that the analysis methods used for 'normal' system development are inadequate. For instance, it has been established that Use Case Engineering (RUP/UML) offers hardly any support, since each use case would amount to 'Print Report'. Experience with function point analysis has also shown that the character of information analysis and functional design for BI applications is fundamentally different from that for transactional systems.



7. COSTS AND BENEFITS

7.1 General Situation

While drafting a general cost/benefit analysis for IT is difficult, the job is particularly thorny for BI applications. After all, how do you determine the benefits of a new, improved BI environment? Ultimately it comes down to quantifying the effect of better decision making and steering, which seems to be an impossible exercise. Added to this is that investments must often be weighed against one another. Think of the comparative assessment that the management of a company must make when deciding to invest either in logistics or in BI: it is probably easier to make the tangible benefits credible for the former than for the latter.

As was already indicated in the rationale, several considerations played a role in Albert
Heijn’s decision to invest in BI:

• Controllability: there was a high degree of fragmentation in the old solutions: a multiplicity of technical solutions, each of which was managed by a different IT department. From a management perspective, it was desirable to consolidate all this and make it uniform. The cost/benefit case for this consideration could be expressed in terms of lower management costs.

• Necessity: every company needs the functional capability to provide its management and steering information. As the complexity of the organisation increases, the demand for management and steering information also increases. It was evident that future developments within Albert Heijn (including the differentiation strategy) would pose demands that could no longer be met by the then current solutions, so something new was necessary. Within the framework of these considerations it was important to set an acceptable investment level.

• Ideology: it was believed – also at the level of the board of directors – that creating one integrated information environment would provide many advantages for the company in terms of data quality, integration and combination of information, availability of detailed data, the 'one copy of the truth' principle and the like. Without being able to quantify this in hard figures, there was a strong conviction that this solution could lay the foundation for future benefits. This last consideration in particular contributed to the decision to actually make the substantial investment.

It is understandable that the combination of these considerations was decisive in the ultimate choice to make the investment.



7.2 Costs

In this field, the rule of thumb is that more than 70% of the development costs are incurred in back-end development, specifically in the development of the ETL processing (excluding hardware and licence costs). This rule of thumb also applied to Pallas; for this reason, much steering was focused on this part of the system development. One of the measures taken to control costs was, when opening up sources, to read in the entire source table and not only the attributes for which there were specific requirements: the additional costs of adding attributes later proved to be substantially higher than the extra costs of including them in the initial development. In addition, considerable attention was devoted to data quality analysis, ETL design and integration testing. Inevitably, there were costs attached to learning from experience, and in some sub-areas increments were initially chosen that were too large. In general, however, practice showed that the foundation that had been laid was solid and future-proof.

Just as with any other information system, Pallas is continually expanding and there is a
permanent demand for changes. To prevent uncontrolled proliferation, a steering group
was created. In addition to keeping the total cost of Pallas under control, it also had the
task of keeping watch over the quality and coherence of the content of the BI
environment as a whole.

7.3 Benefits

On the benefit side, the advantages were mainly found in the reusability of elements
of the data warehouse architecture. Relevant elements in this context are specifically:
• knowledge
• standard tools
• approach and methodology.

Several architectural variants were developed as spin-offs from Pallas by combining the above-mentioned components. These could be used to serve different types of needs in the area of management information; by 'different types' we mean smaller-scale, with a smaller scope and lower demands on integration and business processes. Within the CC-BI, these were referred to as 'the Ferrari and the Volkswagen': the competence centre has the architecture, the knowledge, the tools and the approach in house to build a Ferrari, but there is no reason why these could not also be used to build a Volkswagen.

These alternative variants have thus far been used for other Ahold operating companies, including Gall&Gall, Ahold Vastgoed, albert.nl and the holding company itself. For instance, the data warehouse solution for Gall&Gall was created at relatively very low cost. The management costs are proportionately low, certainly when compared with a scenario in which Gall&Gall would have kept the development and management in its own hands. Strictly speaking, these benefits do not accrue to Albert Heijn, but to Ahold as a group. This certainly also played a role in the considerations when deciding to make the initial investments.



For Albert Heijn, the integration and combination of information has opened up new opportunities. For instance, in the past there was little, if any, insight into the financial effects of write-offs in stores: the information relevant for determining this was spread over various systems and could not be combined because of differing data definitions. Only after the information was combined and given one data definition in Pallas did it become possible for the first time to take structural action to stem the flow of write-offs.

Another important advantage from which Albert Heijn benefited directly was the short lead time for delivering information on an ad-hoc basis. Many examples could be given in which the integrated collection of information and the speed with which it could be accessed have proven their worth:

• Setting out clearly the hoarding behaviour after major calamities (11 September,
Iraq war)
• Supporting the initial price offensive in the fall of 2003
• Interim evaluation of specific campaigns

In addition to this, it has become evident that preserving the history of both reference and
factual information offers considerable added value. This not only produces better insight
into the historical behaviour of campaigns, the condition of the stock and the turnover, for
instance, but the historical behaviour of reference data, such as the development of
purchasing prices over the years, can also be analysed.

7.4 Future Prospects

Now that a large part of the information available in source systems is accessible from the
data warehouse, attention on the supply side will shift more to ‘filtering out’ the golden
nuggets from the enormous amount of information that is available: ‘less is more’. Users
will, for instance, be more frequently informed via automatic alerts and proactive
exception reports based on business rules. It is as if intelligence were being added to the
information.

On the demand side, initiatives have been launched to give third parties, such as suppliers, access to elements of the data warehouse. In house, the integration aspect will come to play a greater role: while in past years the primary objective in opening up most subject areas was to support separate business processes, in the future the demand for comprehensive information across the entire chain will increase. This movement is visible, for instance, on the replenishment side: after separate business requirements for supporting DC replenishment and store supply, the demand now arises for information on stock movements throughout the whole chain. Thanks to the chosen approach and architecture, this type of information is in most cases already present in an integrated form in the enterprise data warehouse. On this point, too, the choices made with regard to approach and architecture will continue to yield increasing benefit.

