
Reporting - Data Architecture Strategy

(Draft 2012-06-27 for Discussion)

BUILDING THE DATA FOUNDATION FOR REPORTING & ANALYTICS

Table of Contents

EXECUTIVE SUMMARY
1.1 INTRODUCTION
1.2 ARCHITECTURE DIAGRAM
1.3 PRIMARY TOPICS

WHAT DATA WILL BE MADE AVAILABLE FOR REPORTING & ANALYTICS
2.1 SELECTING DATA FROM SOURCE SYSTEMS AT THE TABLE LEVEL
2.2 IDENTIFICATION OF DESIRED SAP TABLES
2.2.1 COMMONLY USED TABLES BY OTHER SAP CUSTOMERS
2.2.2 ADDING SAP TABLES BASED ON ANTICIPATED BUSINESS NEED
2.2.3 FLEXIBILITY TO ADD ADDITIONAL SAP TABLES IN FUTURE
2.3 IDENTIFICATION OF DESIRED TABLES FROM OTHER SYSTEMS
2.4 CONSIDERATIONS RELATED TO DATA TYPE AND MOVEMENT
2.4.1 TYPE OF DATA
2.4.2 TOTAL VOLUME OF DATA
2.4.3 METHOD OF DATA MOVEMENT

HOW WILL THE DATA BE MOVED?
3.1 PROPRIETARY EXTRACTORS FOR PRE-DEVELOPED CONTENT
3.2 PROPRIETARY/GENERIC EXTRACTORS FOR TABLE-LEVEL EXTRACTION AND LOADING
3.3 RFCS AND BAPIS
3.4 REPLICATION
3.5 SUMMARY OF DATA MOVEMENT RECOMMENDATIONS

WHERE THE DATA WILL BE MOVED: INITIAL TARGET IS ODS
4.1 WHY AN ODS AS THE INITIAL TARGET FOR DATA?
4.2 ESTABLISHMENT OF THE ODS
4.2.1 SINGLE INSTANCE: ONE ODS DATABASE
4.2.2 SCHEMAS
4.2.3 BUILDING TABLES TO RECEIVE THE IMPORTED TABLES
4.2.4 CATALOG TABLES
4.3 POPULATING THE ODS
4.4 DEVELOPING VIEWS
4.5 MAINTAINING THE ODS

DATA WAREHOUSE: SEQUENTIALLY FOLLOWS THE ODS
5.1 WHY TWO DATA MODELS FOR ANALYTICS?
5.2 TABULAR MODEL
5.2.1 ESTABLISHING THE TABULAR MODEL
5.2.2 POPULATING THE TABULAR DATABASE
5.2.3 DEVELOPMENT OF DATA SETS FOR REPORTING USING THE TABULAR MODEL
5.2.4 MAINTENANCE
5.3 MULTIDIMENSIONAL MODEL
5.3.1 ESTABLISHING THE MULTIDIMENSIONAL MODEL
5.3.2 POPULATING THE MULTIDIMENSIONAL WAREHOUSE
5.3.3 CUBE DEVELOPMENT
5.3.4 MAINTENANCE

REPORTING DEVELOPMENT ENVIRONMENT
6.1 OPERATIONAL REPORTING
6.1.1 TRANSACTIONAL REPORTS
6.1.2 OPERATIONAL REPORTS
6.1.3 AD HOC QUERIES
6.1.4 SELF SERVICE
6.1.5 DATA DISCOVERY
6.2 ANALYTICS
6.2.1 DATA MINING/PREDICTIVE ANALYTICS
6.2.2 RELATION TO PLANNING AND FORECASTING
6.3 DELIVERY LAYER
6.3.1 PORTAL
6.3.2 DASHBOARD / DRILLDOWNS
6.3.3 REMOTE ACCESS TO KIEWIT NETWORK
6.3.4 MOBILE DEVICES AND BYOD

Executive Summary

1.1 Introduction

This Data Architecture and Strategy document (the "Data Architecture") is the second in a series of three documents that together outline and memorialize the reporting and analytics strategy of Kiewit. The other two documents are the [Reporting Process Strategy] document and the [Reporting Tool Selection] document. The Data Architecture comprises a diagram (see Section 1.2) and accompanying text that describes each aspect of the diagram.

The purpose of the Data Architecture is to provide an end-to-end view of where Kiewit is headed with respect to the data layer needed to support transactional and operational reporting, as well as a variety of analytical applications.

The Data Architecture needs to allow for rapid, incremental success in meeting transactional and operational reporting needs, while at the same time laying the groundwork for advanced and sophisticated achievements in areas such as predictive analytics. The architecture must also be a readily adaptable bridge between the more stable domain of data collection in transactional systems and the evolving marketplace of new front-end tools and delivery methods.
1.2 Architecture Diagram

See [attached / appendix]


1.3 Primary Topics

The primary topics described are:

WHAT data will be included?

HOW will the data be moved?

WHERE will the data be moved?

WHAT development will take place on the data prior to consumption for reporting and
analytics? WHO will do that development, and WHEN during the process?

These questions will be addressed according to key strategies and principles, without regard to
specific toolsets.

What Data Will be Made Available for Reporting & Analytics

The data will consist of both SAP data and non-SAP structured data from other Kiewit systems
such as Hard Dollar and Telematics. Unstructured data (e.g., retained email for legal
compliance) is not included in the scope.
The first step in the strategy is to identify the data needed to support anticipated reporting and
analytics needs. Not all of the data collected in SAP and non-SAP systems will have relevance
for reporting and analytics. The strategy in this section outlines the process to determine the
subset of relevant data to include.
2.1 Selecting Data from Source Systems at the Table Level

The first strategic decision is whether to pull pre-developed content (i.e., fixed data sets for specific reporting or analytical use) or, alternatively, to pull table-level data from the source systems. The primary pros and cons of each alternative are:

Pre-developed content
  o Pros:
    - Matches to a specific data source and/or output tool, giving rapid
      results for the defined scope
  o Cons:
    - Only matches to a specific data source and/or output tool
    - Not transparent, which makes modification difficult
    - Pulls may take more time, since pre-defined extractors are more
      complicated

Table-by-table basis
  o Pros:
    - Flexible for developers later when reporting needs change, since
      whole tables are available and specific new extractors do not need
      to be developed
    - Method proven during the Kiewit POC for V0
    - [Other?]

To lay the proper groundwork for long-term viability, the Data Architecture relies on pulling data at the table level. Existing data assets that have been sourced as pre-developed content will be maintained as needed; however, new development should focus on a data foundation built from tables pulled from the SAP and non-SAP source systems.
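As a concrete illustration of the table-level approach, the minimal sketch below copies whole tables from a source database into an ODS staging schema through one generic loop. It assumes ODBC access on both sides and a pre-built staging schema; the connection strings and table names are illustrative, and the real movement mechanism will follow the tool selection discussed in Section 3.

```python
import pyodbc  # assumes ODBC drivers exist for both the source and the ODS

SOURCE_DSN = "DSN=SourceSystem"   # hypothetical connection strings
TARGET_DSN = "DSN=ReportingODS"

# Whole tables are pulled, so a change in reporting needs means editing this
# list rather than developing a new special-purpose extractor.
TABLES = ["MARA", "VBAK", "VBAP"]  # illustrative SAP table names

with pyodbc.connect(SOURCE_DSN) as src, pyodbc.connect(TARGET_DSN) as tgt:
    src_cur, tgt_cur = src.cursor(), tgt.cursor()
    for table in TABLES:
        # A production tool would stream rows in chunks; fetchall() keeps
        # this sketch short.
        rows = src_cur.execute(f"SELECT * FROM {table}").fetchall()
        if rows:
            marks = ", ".join(["?"] * len(rows[0]))
            tgt_cur.executemany(
                f"INSERT INTO staging.{table} VALUES ({marks})", rows)
    tgt.commit()
```

The point of the sketch is architectural: every table moves through the same generic path, so adding a table to the reporting universe is a one-line change rather than a new extractor development effort.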
2.2 Identification of Desired SAP Tables

Of primary importance is determining the desired SAP tables to include in the reporting and analytics universe. Since SAP has well over 80,000 tables, it is impractical and actually counterproductive for the development team to simply pull them all. On the other hand, an adequate cushion of tables is sought to ensure that progress in the development of reports and analytic outputs is not derailed by the need to stop the process while new tables are brought into the reporting environment.
2.2.1 Commonly Used Tables by Other SAP Customers

SAP subject matter experts will provide a list of "usual suspects": tables commonly used by SAP customers for their reporting and analytics. This list should include both tables with substantive data and ancillary related tables. It should be viewed as a starting point, not as a final list that would meet Kiewit's needs.

Another list to inform Kiewit's decision is the list of SAP tables currently being pulled for reporting by TIC, attached as [Appendix A-1]. There are differences in the SAP environments and the reporting needs of Kiewit and TIC, so this list should likewise not be viewed as a final list to meet Kiewit's needs. However, it is a helpful comparison point.

In the next step, the SME list will be compared with the TIC list. A new list will be created, called the Baseline SAP Table List, which will include all tables that appear on either the SME list or the TIC list. The Baseline SAP Table List will comprise the minimum set of tables to be pulled from SAP into the reporting and analytics environment.
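As a minimal sketch, the union that produces the Baseline SAP Table List can be expressed in a few lines of Python. The file names are hypothetical; each file is assumed to hold one SAP table name per line.

```python
def load_list(path):
    """Read one SAP table name per line, e.g. 'VBAK'."""
    with open(path) as f:
        return {line.strip().upper() for line in f if line.strip()}

sme_list = load_list("sme_tables.txt")  # the SME "usual suspects" list
tic_list = load_list("tic_tables.txt")  # the TIC list ([Appendix A-1])

# Any table appearing on either list is included in the baseline.
baseline = sorted(sme_list | tic_list)
print(f"{len(baseline)} tables in the Baseline SAP Table List")
```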
2.2.2 Adding SAP Tables Based on Anticipated Business Need

Since Kiewit is new to SAP, in the near term the business users are unlikely to know which SAP tables are of interest. Rather than seeking to gather their direct input at this time, the strategy is to anticipate their likely needs and pull adequate tables to cover those needs. In addition to the baseline list of tables described in Section 2.2.1, the following tables will also be included:

- Tables utilized for the V0 Proof of Concept (list from KieCore)
- Tables anticipated for usage in V1 (table list to be developed by
  consultation among [WHO] and an SAP subject matter expert)
- Other tables of likely business interest not already included, based on
  the installed modules of SAP (table list to be developed by consultation
  among [WHO] and an SAP subject matter expert)
- Ancillary tables that are needed for meaningful reporting and analytics
  (table list to be developed by a subject matter expert with knowledge and
  tools to find related ancillary tables)

These tables are then added to the table list developed in Section 2.2.1 to form the comprehensive initial set of SAP tables. This set is intended to be comprehensive and should need only minimal additions in the future.

2.2.3 Flexibility to Add Additional SAP Tables in Future

Ideally, the process described in Sections 2.2.1 and 2.2.2 will result in pulling all of the SAP data needed to meet the near- and intermediate-term desires of the business for reporting and analytics, without cluttering the reporting landscape with thousands of tables that are clearly not relevant. By pulling tables, and not pre-designed outputs, there is always the flexibility to develop and redevelop data assets starting with the tables, rather than having to go back to the first step of building a new specific extractor. In the event that a handful of tables are not included in the initial universe and are later identified as important for the reporting landscape, those additional tables could be readily added. The protocol for adding tables should be established after the tool selection for data movement.
2.3 Identification of Desired Tables from Other Systems

KieCore has currently identified [670] tables from Hard Dollar, Telematics, and other applications
that are desired for the reporting and analytics environment. These tables are attached in
[Appendix ___]. Changes to this list would be made based on the [SEE PROCESS
DOCUMENT.] As with the SAP tables, there is future flexibility to add additional tables from a
variety of non-SAP source databases.
2.4 Considerations Related to Data Type and Movement

After the desired initial data scope is determined, the characteristics of the data should be evaluated, along with any implications of the nature of the data and/or the proposed methods of moving it. Some of the implications are highly dependent on the tool selection for data movement, and on the tool selection for, and nature of, the target reporting system.
2.4.1 Type of Data

Some categories of data present more technical constraints than others. For example, SAP
cluster tables are not accessible in the same ways as SAP non-clustered tables. Another
example is that SAP data from different SAP functional modules (e.g., SD and FI) cannot be
readily combined for reporting purposes inside the SAP landscape, making cross-functional
reporting a challenge.
2.4.2 Total Volume of Data

At certain thresholds, very large data volumes become unwieldy and more expensive to manage. Compression of data and the design of the database can reduce the total volume of data in the reporting environment. If the source SAP data is not already compressed, then compression in the target reporting environment would often be in the range of 7:1 to 10:1, depending on the specific SAP table being compressed. [Do Hard Dollar, Telematics and other applications pose similar big data challenges?] With regard to database design, a compressed ODS could hold a fraction of the data of the source systems, while some multidimensional data warehouse designs could have a data volume that grows non-linearly, faster than the growth rate of the source systems.
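As a sizing illustration only (the source volume below is a placeholder, not a measured Kiewit figure), the cited compression range brackets the target volume as follows:

```python
# Rough sizing sketch using the 7:1 to 10:1 compression range cited above.
source_gb = 2048  # hypothetical uncompressed SAP source volume, in GB

for ratio in (7, 10):
    print(f"{ratio}:1 compression -> about {source_gb / ratio:,.0f} GB")
# 7:1 compression -> about 293 GB
# 10:1 compression -> about 205 GB
```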
2.4.3 Method of Data Movement

In some methods of data movement, there is an economy of scale when moving multiple tables. In other methods there is no economy of scale, whether in establishing the initial pull of data or in ongoing maintenance.

How will the Data be Moved?

The Data Architecture analyzes a number of methods to move the data from the source SAP
and non-SAP systems into the target, and identifies the recommended methods for both the
near and longer term. The methods reviewed are:

- Proprietary extractors for pre-developed content
- Proprietary/generic extractors for table-level extraction and loading
- RFCs and BAPIs
- Replication

3.1 Proprietary Extractors for Pre-Developed Content

A variety of proprietary tools exist in the market to act as a go-between for specific data sets and specific outputs. Without reviewing specific tools, which is beyond the scope of the Data Architecture, we reviewed the pros and cons of this approach from an architectural standpoint. On the positive side, when absolutely no modifications or customizations are required, the use of these proprietary tools can speed the time to deployment. On the negative side, very few implementations are truly "out of the box." When customizations are required, they make the data movement layer brittle, in addition to negating the primary benefit of speed to solution.

Another downside to extractors is that they work through the central instance of SAP or of the other application. In the case of SAP, extractions of pre-defined content can be extremely expensive, and must be done in windows where the added load will not interfere with SAP transactions.

Because the Data Architecture is a long-term strategy and not the means to meet a specific short-term need, proprietary extractors for pre-defined content are excluded from the going-forward road map. They do not provide an adequate groundwork for sourcing data for myriad purposes in the future, from operational reporting to predictive analytics, or for the variety of tools (and versions/variations) that could be expected over the lifecycle of the SAP deployment.
3.2 Proprietary/Generic Extractors for Table-Level Extraction and Loading

[Describe method used in V0 Proof of Concept]


3.3 RFCs and BAPIs

RFCs (Remote Function Calls) and BAPIs (Business Application Programming Interfaces) call the SAP central instance and communicate data requests via the ABAP programming language. The benefit is the ability to access all types of SAP data, including cluster tables. The downside is a potentially tremendous burden placed on the SAP central instance by the requests. Again, the result is often running the RFCs and BAPIs during quiet windows.
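For illustration, the sketch below shows what one such call can look like from Python, using the standard RFC_READ_TABLE function module via the open-source pyrfc package (which requires SAP's NetWeaver RFC SDK). The connection parameters are placeholders, and the table chosen is only an example.

```python
from pyrfc import Connection  # requires the SAP NetWeaver RFC SDK

# Placeholder connection parameters; real values come from SAP Basis.
conn = Connection(ashost="sap-host", sysnr="00", client="100",
                  user="RFC_USER", passwd="secret")

# RFC_READ_TABLE returns table rows through the central instance. Every
# call adds load on the SAP application server, which is why such pulls
# are often scheduled in quiet windows.
result = conn.call("RFC_READ_TABLE",
                   QUERY_TABLE="VBAK",  # sales document headers
                   DELIMITER="|",
                   ROWCOUNT=10)         # keep the request small

for row in result["DATA"]:
    print(row["WA"])  # one delimited record per line
```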

3.4 Replication

Replication technology is used for a wide variety of SAP purposes, such as mirroring, reporting, and HA/DR. (See SAP Technical Note _______________.) However, care must be taken not to use trigger-based replication strategies with SAP data sources. Replication is also a viable strategy for the non-SAP data sources needed for the reporting environment.

Replication can be of a push or pull variety, but some basics are always present. There is always some type of initializing snapshot of data, followed by a continual stream of updates. The initial snapshot presents some, but not all, of the challenges of a batch extract. After the initial snapshot and the placing of tables under replication, there is very little ongoing administration needed. The ongoing administration consists primarily of:
- Monitoring system resources (memory)
- Enforcing change management procedures with regard to activities that
  cause contention with replication, such as initiating SAP Transports
- Following protocols when updates are made to SAP and SAP tables
- Updating replication parameters when changes to the network occur (such
  as renaming of the source database)

Pros:

- Eliminates batches, deltas, queues, and other challenges inherent in
  extractor-based approaches
- Higher uptime for reporting and analytics: there is less chance of
  replication failing than of a batch failing
- Data is always complete
- Data is always on; no shutdowns for lengthy batch runs
- Efficiency of scale in pulling large numbers of tables and large data
  volumes
- No development of custom extracts
- Resilient rather than brittle
- Makes real-time data available; data as of a particular time is still
  available
- Enables new methods to drive SAP workflows
- Not burdensome on the SAP transactional system: offloads the reporting
  burden, resulting in better performance for both the reports and the SAP
  transactional system

Cons:

- Replication alone does not result in easily usable SAP data; additional
  software is needed
- Limited software vendors offer replication-based data movement tools
  specifically for SAP data
- Perception that the purpose of replication is only for real-time data

3.5 Summary of Data Movement Recommendations

The Data Architecture recommends that initial efforts focus on moving data via extractors for table-level extraction and loading (Section 3.2). Longer term, one or more of these extractors could be replaced by a replication-based approach (Section 3.4). The criteria for switching tables to replication include:

- The batch has a frequent failure rate (a specific batch, or various
  batches in the aggregate)
- The batch processing time has outgrown the available time window (which
  may occur as data volumes grow)
- A need to drive up efficiency over the longer term (i.e., redeploy
  resources from batch management to other initiatives)
- Certain data needs to be refreshed more frequently than the batch window
  allows (e.g., to facilitate month-end close)

If some tables are updated by batch and others by replication, time coherence may be managed
by the design of the reporting and analytics queries. (Time coherence is also of concern when
all tables are loaded by batch, although management techniques may differ.)
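As one illustration of managing time coherence through query design, both tables can be pinned to a common "as of" moment, assuming each ODS table carries an update timestamp such as the audit columns discussed in Section 4.2.3. The table and column names below are hypothetical, and excluding rows changed after the batch window is a deliberately coarse technique; exact point-in-time state would require versioned or temporal tables.

```python
AS_OF = "2012-06-26 23:00:00"  # e.g., completion time of the last batch load

# order_header is replicated continuously; order_item arrives by nightly
# batch. Filtering both on the same timestamp keeps the join coherent.
QUERY = """
SELECT h.order_id, h.order_date, i.line_no, i.amount
FROM   ods.order_header AS h
JOIN   ods.order_item   AS i ON i.order_id = h.order_id
WHERE  h.etl_update_ts <= ?
  AND  i.etl_update_ts <= ?
"""
# cursor.execute(QUERY, (AS_OF, AS_OF))
```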
Extractors for pre-defined content (Section 3.1) are not part of the going-forward road map for new development, based on the limits described in that section. Any such extractors used for V0 reporting will be maintained unless the cost of maintenance outweighs the benefits and a determination is made to redevelop the content based on one of the recommended methods of data movement.

RFCs and BAPIs (Section 3.3) are recommended only for specific data types, such as SAP cluster tables. Longer term, these RFCs would be eligible for replacement by replication-based methods according to the same criteria described above.


Where the Data Will be Moved: Initial Target is ODS

This section of the Data Architecture describes the initial target for the data identified in Section 2, whose movement is described in Section 3. This initial target is a single Operational Data Store (ODS). This section describes why an ODS is the first target, strategic considerations for the establishment of the ODS, and the basic development that will take place inside the ODS.
4.1 Why an ODS as the Initial Target for Data?

One design constraint is to move data out of the multiple source systems only once (i.e., into
one target only). If data is needed in multiple reporting/analytic environments, then it should be
moved from the initial target into subsequent downstream systems. This design constraint is
necessary to achieve a number of objectives related to data integrity, management of extracts,
multi-source data integration, etc. It therefore becomes necessary to carefully select the initial
target for the data moved out of the source systems.
An ODS is only one alternative considered as the initial target for the data; other alternatives
considered were a multidimensional data warehouse and a tabular database.
The Data Architecture takes an "AND" approach instead of an "OR" approach to database models. Relational and multidimensional databases offer different functionality and benefits, and both are included in the Data Architecture (see diagram in Section 1.2). An ODS (relational database) can act as a steppingstone to data warehouses (multidimensional databases) and tabular databases (a tabular model with many benefits similar to multidimensional databases). Conversely, neither a data warehouse nor a tabular database is suited to act as a steppingstone to the other alternatives. Therefore, the logical first target for the data is the ODS. This choice sets up the pathway to the downstream multidimensional and tabular environments. It also serves as the data source for operational reporting.
While logic dictates targeting the ODS to receive the data moved from Kiewit's source systems, there is the positive side benefit that the ODS offers the shortest time to solution. While reporting against the ODS will not meet all of Kiewit's business needs, it is capable of meeting the vast majority of operational reporting needs. Meeting these needs quickly will provide a win-win scenario, or an upward spiral: because business users see useful reports from their transactional systems, they put more focus and effort into ensuring data is entered properly into those systems.
Multidimensional data warehouses and tabular in-memory databases offer functionality and performance that are not available from an ODS. These alternatives are discussed in later sections of this Data Architecture.


4.2 Establishment of the ODS

This Data Architecture includes a single ODS database composed of multiple schemas related to the SAP and non-SAP data sources.
4.2.1 Single Instance: One ODS Database

Having a single database instance will optimize performance, since joins between tables in a single instance take advantage of the database's indexes and optimizers. When joins are made in a single instance, there is no need for distributed transactions, which largely negate the performance improvements of modern ANSI SQL databases. Also, joins made in a single database require less maintenance effort than joins across databases.

Sometimes having multiple databases is useful, to allow different security access to be enforced at the database level. Having different databases also allows for different backup options (by database). However, this Data Architecture concludes that the benefits of multiple databases are outweighed by the benefits of a single instance.
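To illustrate the single-instance benefit, the hypothetical query below joins an SAP-sourced schema to a Hard Dollar schema inside the one ODS instance; the database plans it like any local join, using its own indexes and optimizer. All schema, table, and column names are made up.

```python
# Local join across two schemas in the single ODS instance.
CROSS_SOURCE_JOIN = """
SELECT s.project_id, s.actual_cost, h.estimated_cost
FROM   sap.project_costs    AS s   -- schema holding SAP-sourced tables
JOIN   harddollar.estimates AS h   -- schema holding Hard Dollar tables
       ON h.project_id = s.project_id
"""
# Split across two database instances, the same join would need a linked
# server or a distributed transaction, forfeiting most optimizer benefits.
```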
4.2.2 Schemas

4.2.3 Building Tables to Receive the Imported Tables

Each table from both SAP and non-SAP sources will need a corresponding table established in the ODS. A determination will need to be made on whether to add audit columns for these topics:

- Create date/time
- Created by [ID of program doing the pull]
- Update date/time
- Updated by [ID of program doing the update]

[discussion of cost of these audit columns, and benefit of having them]
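A hypothetical DDL sketch of an ODS table that mirrors one source table and carries the four audit columns under consideration is shown below; the column names and types are illustrative, not a proposed standard.

```python
CREATE_ODS_TABLE = """
CREATE TABLE sap.VBAK_ods (
    -- ... columns mirroring the source table ...
    etl_create_ts   DATETIME    NOT NULL,  -- create date/time
    etl_created_by  VARCHAR(64) NOT NULL,  -- ID of program doing the pull
    etl_update_ts   DATETIME    NULL,      -- update date/time
    etl_updated_by  VARCHAR(64) NULL       -- ID of program doing the update
)
"""
```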


4.2.4 Catalog Tables

The database engine will automatically create the catalog tables after [STEP X]. [Discuss how the catalogs can help developers find data.] Using one database will result in one set of catalog tables. A unified catalog is one aspect of a streamlined development environment that positions IT for faster development cycle times and faster response times for business requests.
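For example, because all schemas share one catalog, a developer can locate every table that carries a given column with a single query against the ANSI-standard INFORMATION_SCHEMA views; the column name searched for here is illustrative.

```python
FIND_COLUMN = """
SELECT table_schema, table_name
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  column_name = ?
ORDER  BY table_schema, table_name
"""
# cursor.execute(FIND_COLUMN, ("PROJECT_ID",))  # hypothetical column name
# With multiple databases there would be one catalog per database, and this
# search would have to be run against each one and merged by hand.
```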


4.3 Populating the ODS

4.4 Developing Views

4.5 Maintaining the ODS


Data Warehouse: Sequentially Follows the ODS

This section of the Data Architecture describes the strategy for leveraging the data that has been sourced into the ODS, as described in Section 4, via the establishment of a data warehouse. First, there is a discussion of the design considerations for establishing both tabular and multidimensional models in [either the same or separate data warehouses]. Next, this section describes strategic considerations for the establishment of each model, populating the data warehouse with data, the further development of the data, and maintenance considerations.

Because the tabular model is relatively new when compared with the relational and multidimensional models, this section provides some additional background information about the tabular model as a level set.
5.1 Why Two Data Models for Analytics?

Kiewit's data is a corporate asset, and managing that asset for the greatest return requires an "AND" approach rather than an "OR" approach to analytic data models. While there is some overlap in the functionality and benefits of the tabular model and the multidimensional model, each has specific strengths and drawbacks. The Data Architecture makes both models available, allowing the best attributes of each to support business decisions and processes.

The establishment of the tabular and multidimensional models may be performed linearly or concurrently. If done linearly, the Data Architecture recommends establishing the tabular model first, since it is simpler, faster to solution, and a useful prototyping tool for the multidimensional model.

The tabular and multidimensional models can reside in one data warehouse. [DISCUSS MORE ABOUT THIS ARCHITECTURE, AND CRITERIA FOR SEPARATING]
5.2 Tabular Model

A tabular database is built on a tabular model and runs in memory. There is less installed base and history with tabular databases than with the traditional multidimensional data warehouse model, because the multidimensional model predates in-memory database developments.

The popularity of the tabular model has been growing because it offers a unique bundle of benefits when compared with both relational databases and multidimensional databases. These benefits include:

- Rapid initial development: the ability to leverage the existing relational
  model, without the need for building star schemas and dealing with the
  resultant ETL complexities
- Faster, simpler development of data for specific reporting and analytic
  needs when compared with multidimensional development
- Eliminates snapshots (e.g., quantity by time period), because the
  calculations can be done on the fly at query time, thanks to the power of
  the in-memory database (see the sketch below)
- Extremely fast end-user experience
- Faster performance for distinct counts when compared with the
  multidimensional model
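As a small illustration of the snapshot-elimination benefit noted above, the aggregate below is computed at query time from the base rows rather than read from a pre-built periodic snapshot table. pandas stands in here for the in-memory engine, and the data is made up.

```python
import pandas as pd

orders = pd.DataFrame({
    "period":   ["2012-05", "2012-05", "2012-06", "2012-06"],
    "quantity": [10, 25, 5, 40],
})

# Quantity by time period is calculated on the fly at query time; no
# nightly snapshot build is required.
print(orders.groupby("period")["quantity"].sum())
```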

Drawbacks of the tabular model, which can be overcome with the multidimensional model, include:

- Not suited for very complex models and data sets
- Doesn't support many-to-many relationships
- No writeback support
- [Other?]
5.2.1 Establishing the Tabular Model

5.2.2 Populating the Tabular Database

5.2.3 Development of Data Sets for Reporting Using the Tabular Model

5.2.4 Maintenance

5.3 Multidimensional Model

5.3.1 Establishing the Multidimensional Model

5.3.2 Populating the Multidimensional Warehouse

[IMPORT FROM DOC #1 REGARDING ETL]


5.3.3 Cube Development

5.3.4 Maintenance


Reporting Development Environment

[introduction]
6.1 Operational Reporting
6.1.1 Transactional Reports

[discuss reports available directly from source systems]

6.1.2 Operational Reports

6.1.3 Ad Hoc Queries

6.1.4 Self Service

6.1.5 Data Discovery (overlaps ad hoc, self-service & analytics)

6.2 Analytics

6.2.1 Data Mining/Predictive Analytics

6.2.2 Relation to Planning and Forecasting

6.3 Delivery Layer

6.3.1 Portal

6.3.2 Dashboard / Drilldowns

6.3.3 Remote Access to Kiewit Network

6.3.4 Mobile Devices and BYOD

