160 BROUGHT TO YOU IN PARTNERSHIP WITH

Data Warehousing: Best Practices for Collecting, Storing, and Delivering Decision-Support Data

UPDATED BY ROI AVINOAM, CTO AND CO-FOUNDER OF PANOPLY
PREVIOUSLY UPDATED BY ALON BRODY, LEAD DATA ARCHITECT OF PANOPLY
ORIGINAL BY DAVID HAERTZEN, BIG DATA ANALYTICS ARCHITECT

CONTENTS

• WHAT IS DATA WAREHOUSING

• DATA WAREHOUSE ARCHITECTURE

• DATA

• DATA MODELING

• NORMALIZED DATA

• ATOMIC DATA WAREHOUSE

• SUPPORTING TABLES

• DIMENSIONAL DATABASE

• FACTS

• DIMENSIONS

• DATA INTEGRATION

• AND MORE...
WHAT IS DATA WAREHOUSING?

Data warehousing is a process for collecting, storing, and delivering decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point-by-point in this Refcard. A data warehouse is one of the artifacts created in the data warehousing process.

William (Bill) H. Inmon has provided an alternate and useful definition of a data warehouse: "A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."

As a total architecture, data warehousing involves people, processes, and technologies to achieve the goal of providing decision-support data that is consistent, integrated, standardized, and easy to understand.

See the book The Analytical Puzzle: Profitable Data Warehousing, Business Intelligence and Analytics (ISBN 978-1935504207) for details.

WHAT A DATA WAREHOUSE IS AND IS NOT

A data warehouse is a database whose data includes a copy of operational data. This data is often obtained from multiple data sources and is useful for strategic decision-making. It does not, however, contain original data.

"Data warehouse," by the way, is not another name for "database." Some people incorrectly use the term "data warehouse" as if it's a generic name for a database. A data warehouse does not only consist of historic data — it can be made up of analytics and reporting data, too. Transactional data that is managed in application data stores will not reside in a data warehouse.

DATA WAREHOUSE ARCHITECTURE

DATA WAREHOUSE ARCHITECTURE COMPONENTS

The data warehouse's technical architecture includes data sources, data integration, BI/analytics data stores, and data access.

DATA WAREHOUSE TECH STACK


Metadata Repository: A software tool that contains data that describes other data. The two kinds of metadata are business metadata and technical metadata.

Data Modeling Tool: A software tool that enables the design of data and databases through graphical means. This tool provides a detailed design capability that includes the design of tables, columns, relationships, rules, and business definitions.

Data Profiling Tool: A software tool that supports understanding data through exploration and comparison. This tool accesses the data and explores it, looking for patterns such as typical values, ranges, and allowed values. It is meant to help you better understand the content and quality of the data.

Data Integration Tools: ETL (extract, transform, and load) tools, as well as real-time integration tools like ESB (enterprise service bus) software. These tools copy data from place to place and also scrub and clean the data.

RDBMS (Relational Database Management Systems): Software that stores data in a relational format using SQL (Structured Query Language). This is the system that will maintain robust data and store it. The expandability of the system is also important.

MOLAP (Multidimensional OLAP): Database software designed for data mart-type operations. This software organizes data into multiple dimensions, known as "cubes," to support analytics.

Big Data Store: Software that manages huge amounts of data that other types of software, such as relational databases, cannot.

Reporting and Query Tools: Business-intelligence software tools that select data through queries and present it as reports and/or graphical displays. The business user or analyst can explore the data and produce the reports and outputs needed to understand it.

Data Mining Tools: Software tools that find patterns in stores of data or databases. These tools are useful for predictive analytics and optimization analytics.

INFRASTRUCTURE ARCHITECTURE

The data warehouse tech stack is built on a fundamental framework of hardware and software known as the infrastructure.

Using a data warehouse appliance or a dedicated database infrastructure helps support the data warehouse. This technique tends to yield the highest performance. The data warehouse appliance is optimized to provide database services using a massively parallel processing (MPP) architecture. It includes multiple tightly coupled computers with specialized functions, plus at least one array of storage devices that are accessed in parallel. Specialized functions include system controller, database access, data load, and data backup.

Data warehouse appliances provide high performance. They can be up to 100x faster than the typical database server. Consider a data warehouse appliance when more than 2TB of data must be stored.

DATA ARCHITECTURE

Data architecture is a blueprint for the management of data in an enterprise. The data architect builds a picture of how multiple sub-domains work. Some of these subdomains are data governance, data quality, ILM (information lifecycle management), data framework, metadata and semantics, master data, and, finally, business intelligence.

DATA ARCHITECTURE SUB-DOMAINS

Data Governance (DG): The overall management of data and information, including the people, processes, and technologies that improve the value obtained from data and information by treating data as an asset. It is the cornerstone of the data architecture.

Data Quality Management (DQM): The discipline of ensuring that data is fit for use by the enterprise. It includes obtaining requirements and rules that specify the dimensions of quality required, such as accuracy, completeness, timeliness, and allowed values.

Information Lifecycle Management (ILM): The discipline of specifying and managing information through its life, from conception to disposal. Activities that make up ILM include classification, creation, distribution, use, maintenance, and disposal.

Data Framework: A description of data-related systems in terms of a set of fundamental parts and the recommended methods for assembling those parts using patterns. The data framework can include database management, data storage, and data integration.

Metadata and Semantics: Information that describes and specifies data-related objects. This description can include the structure and storage of data, the business use of data, and the processes that act on the data. "Semantics" refers to the meaning of the data.

Master Data Management (MDM): An activity focused on producing and making available a "golden record" of master data and essential business entities, such as customers, products, and financial accounts. Master data is data describing major subjects of interest that is shared by multiple applications.

Business Intelligence: The people, tools, and processes that support planning and decision making, both strategic and operational, for an organization.

DATA FLOW

The diagram below displays how data flows through the data warehouse system. Data first originates from the data sources, such as inventory systems, and is stored in data warehouses and operational data stores. The data stores are formatted to expose data in the data marts, which are then accessed using BI and analytics tools.

DATA

Data is the raw material through which we can gain understanding. It is a critical element in data modeling, statistics, and data mining. It is the foundation of the pyramid that leads to wisdom and to informed action.

DATA ATTRIBUTE CHARACTERISTICS

Name: Each attribute has a name, such as "Account Balance Amount." An attribute name is a string that identifies and describes an attribute. In the early stages of data design, you may just list names without adding clarifying information, called metadata.


Datatype: The datatype, also known as the "data format," could have a value such as decimal (12.4). This is the format used to store the attribute. It specifies whether the information is a string, a number, or a date. In addition, it specifies the size of the attribute.

Domain: A domain, such as Currency Amounts, is a categorization of attributes by function.

Initial Value: An initial value, such as 0.0000, is the default value that an attribute is assigned when it is first created.

Rules: Rules are constraints that limit the values that an attribute can contain. An example rule is: "the attribute must be greater than or equal to 0.0000." Use of rules helps to improve data quality.

Definition: A narrative that conveys or describes the meaning of an attribute. For example: Account Balance Amount is a measure of the monetary value of a financial account, such as a bank account or an investment account.

DATA MODELING

Three levels of data modeling are developed in sequence:

1. Conceptual data model: A high-level model that describes a problem using entities, attributes, and relationships.

2. Logical data model: A detailed data model that describes a solution in business terms, and that also uses entities, attributes, and relationships.

3. Physical data model: A detailed data model that defines database objects, such as tables and columns. This model is needed to implement the models in a database and produce a working solution.

ENTITIES

An entity is a core part of any conceptual and logical data model. An entity is an object of interest to an enterprise: it can be a person, organization, place, thing, activity, event, abstraction, or idea. Entities are represented as rectangles in the data model. Think of entities as singular nouns.

ATTRIBUTES

An attribute is a characteristic of an entity. Attributes are categorized as primary keys, foreign keys, alternate keys, and non-keys, as depicted in the diagram below.

RELATIONSHIPS

A relationship is an association between entities. Such a relationship is diagrammed by drawing a line between the related entities. The following diagram depicts two entities, Customer and Order, that have a relationship specified by the verb phrase "places" in this way: Customer Places Order.

CARDINALITY

Cardinality specifies the number of entities that may participate in a given relationship, expressed as one-to-one, one-to-many, or many-to-many, as depicted in the following example.

Cardinality is expressed as minimum and maximum numbers. In the first example below, an instance of entity A may have one instance of entity B, and entity B must have one and only one instance of entity A. Cardinality is specified by putting symbols on the relationship line near each of the two entities that are part of the relationship.

In the second case, entity A may have one or more instances of entity B, and entity B must have one and only one instance of entity A.
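The one-to-many case can be sketched in SQL with a foreign key. This is a minimal illustration only; the customer and order tables and their columns are invented for the example, not taken from the Refcard's diagrams:

```python
import sqlite3

# Hypothetical sketch of the Customer-places-Order relationship.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

con.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    )""")

# One-to-many: every order references exactly one customer (NOT NULL
# plus the foreign key), while a customer may place many orders.
con.execute("""
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT NOT NULL
    )""")

con.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
con.execute("""INSERT INTO "order" VALUES (100, 1, '2018-01-15')""")
con.execute("""INSERT INTO "order" VALUES (101, 1, '2018-01-16')""")

# One customer instance, two related order instances.
n = con.execute(
    """SELECT COUNT(*) FROM "order" WHERE customer_id = 1""").fetchone()[0]
print(n)  # 2
```

The NOT NULL foreign key enforces the mandatory "one and only one" side; the "many" side needs no constraint at all.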


Minimum cardinality is expressed by the symbol farther away from the entity. A circle indicates that an entity is optional, while a bar indicates that an entity is mandatory: at least one is required.

Maximum cardinality is expressed by the symbol closest to the entity. A bar means that a maximum of one entity can participate, while a crow's foot (a three-prong connector) means that many entities may participate, meaning a large, unspecified number.

NORMALIZED DATA

Normalization is a data modeling technique that organizes data by breaking it down to its lowest level, i.e., its "atomic" components, to avoid duplication. This method is used to design the atomic data warehouse part of the data warehousing system.

First Normal Form (1NF): Entities contain no repeating groups of attributes.

Second Normal Form (2NF): The entity is in first normal form, and attributes that depend on only part of a composite key are separated into new entities.

Third Normal Form (3NF): The entity is in second normal form, and non-key attributes representative of an entity's facts are separated into more independent, multi-valued entities.

ATOMIC DATA WAREHOUSE

The atomic data warehouse (ADW) is an area where data is broken down into low-level components in preparation for export to data marts. The ADW is designed using normalization and methods that make for speedy history loading and recording.

HEADER AND DETAIL ENTITIES

The ADW is organized into non-changing data with logical keys and changeable data that supports tracking of changes and rapid load/insert. Use an integer as the primary surrogate key. Then, add the effective date to track changes.

ASSOCIATIVE ENTITIES

Track the history of relationships between entities using an associative entity with effective dates and expiration dates.

ATOMIC DW SPECIALIZED ATTRIBUTES

Use specialized attributes to improve ADW efficiency and effectiveness. Identify these attributes using a prefix of ADW_.

dw_xxx_id: Data warehouse assigned surrogate key. Replace 'xxx' with a reference to the table name, such as 'dw_customer_dim_id'.

dw_insert_date: The date and time when a row was inserted into the data warehouse.

dw_effective_date: The date and time when a row in the data warehouse began to be active.
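A minimal sketch of this ADW loading pattern, using an integer surrogate key plus the dw_insert_date and dw_effective_date attributes. The dw_customer table, its columns, and the load function are hypothetical examples, not part of the Refcard:

```python
import sqlite3
from datetime import datetime, timezone

# Sketch of an ADW table: a surrogate key plus date metadata.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE dw_customer (
        dw_customer_id    INTEGER PRIMARY KEY,  -- surrogate key
        customer_nbr      TEXT NOT NULL,        -- logical (source) key
        customer_name     TEXT NOT NULL,        -- changeable data
        dw_insert_date    TEXT NOT NULL,
        dw_effective_date TEXT NOT NULL
    )""")

def load_customer(nbr, name):
    """Insert-only load: each change becomes a new row, so history is
    preserved and loads stay fast (no updates in place)."""
    now = datetime.now(timezone.utc).isoformat()
    con.execute(
        "INSERT INTO dw_customer (customer_nbr, customer_name,"
        " dw_insert_date, dw_effective_date) VALUES (?, ?, ?, ?)",
        (nbr, name, now, now))

load_customer("C-42", "Acme Corp")
load_customer("C-42", "Acme Corporation")  # name change adds a row

rows = con.execute(
    "SELECT COUNT(*) FROM dw_customer WHERE customer_nbr = 'C-42'"
).fetchone()[0]
print(rows)  # 2 history rows for one logical key
```

Because the load only inserts, the effective date of each row marks when that version of the customer became active.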


dw_expire_date: The date and time when a row in the data warehouse stopped being active.

dw_data_process_log_id: A reference to the data process log. The log is a record of the process of how data was loaded or modified in the data warehouse.

SUPPORTING TABLES

Supporting data is required to enable the data warehouse to operate smoothly. Here is some supporting data:

• Code management and translation.

• Data source tracking.

• Error logging.

CODE TRANSLATION

Data warehousing requires that codes, such as gender code and units of measure, be translated to standard values aided by code-translation tables like these:

• Code set: A group of codes, such as "gender code."

• Code: An individual code value.

• Code translation: A mapping between code values.

DATA SOURCE TRACKING AND LOGGING

Data source tracking provides a means of tracing where data originated within a data warehouse:

• Data source: Identifies the system or database.

• Data process: Traces the data integration procedure.

• Data process log: Traces each data warehouse load.

MESSAGE LOGGING

Message logging provides a record of events that occur while loading the data warehouse:

• Data process log: Traces each data warehouse load.

• Message type: Specifies the kind of message.

• Message log: Contains an individual message.

DIMENSIONAL DATABASE

A dimensional database is a database that is optimized for query and analysis and is not normalized like the atomic data warehouse. It consists of fact and dimension tables, where each fact is connected to one or more dimensions.

SALES ORDER FACT

The sales order fact includes the measures order quantity and currency amount. Dimensions of Calendar Date, Product, Customer, Geo Location, and Sales Organization put the sales order fact into context. This star schema supports looking at orders in a cubical way, enabling slicing and dicing by customer, time, and product.

FACTS

A fact is a set of measurements. It tends to contain quantitative data that gets presented to users. It often contains amounts of money and quantities of things. Facts are surrounded by dimensions that categorize the fact.

ANATOMY OF A FACT

Facts are SQL tables that include:

• Table name: A descriptive name usually containing the word "Fact."

• Primary keys: Attributes that uniquely identify each fact occurrence and relate it to dimensions.

• Measures: Quantitative metrics.

FACT-LESS FACT

The fact-less fact tracks an association between dimensions rather than quantitative metrics. Examples include miles, event attendance, and sales promotions.

EVENT FACT EXAMPLE

Event facts record single occurrences, such as financial transactions, sales, complaints, or shipments.

SNAPSHOT FACT

The snapshot fact captures the status of an item at a point in time, such as a general ledger balance or inventory level.

CUMULATIVE SNAPSHOT FACT

The cumulative snapshot fact adds accumulated data, such as year-to-date amounts, to the snapshot fact.

AGGREGATED FACT

Aggregated facts provide summary information, such as general ledger totals during a period of time or complaints per product per store per month.

DIMENSIONS

A dimension is a database table that contains properties that identify and categorize. The attributes serve as labels for reports and as data points for summarization. In the dimensional model, dimensions surround and qualify facts.

DATE AND TIME DIMENSIONS

Date dimensions support trend analysis. Date dimensions include the date and its associated week, month, quarter, and year. Time-of-day dimensions are used to analyze daily business volume.

MULTIPLE-DIMENSION ROLES

One dimension can play multiple roles. The date dimension could play the roles of a snapshot date, a project start date, and a project end date.

DEGENERATE DIMENSION

A degenerate dimension has a dimension key without a dimension table. Examples include transaction numbers, shipment numbers, and order numbers.
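The multiple-role pattern described above can be sketched by joining one date dimension twice under different aliases. The shipment fact and all names here are illustrative assumptions, not tables from the Refcard:

```python
import sqlite3

# One date dimension playing two roles: order date and ship date.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE date_dim (
        date_key   INTEGER PRIMARY KEY,
        full_date  TEXT,
        month_name TEXT,
        year       INTEGER
    )""")
con.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)",
                [(20180105, '2018-01-05', 'January', 2018),
                 (20180109, '2018-01-09', 'January', 2018)])

con.execute("""
    CREATE TABLE shipment_fact (
        order_date_key INTEGER,  -- role 1 of the date dimension
        ship_date_key  INTEGER,  -- role 2 of the same dimension
        quantity       INTEGER
    )""")
con.execute("INSERT INTO shipment_fact VALUES (20180105, 20180109, 10)")

# The same dimension table is joined once per role, under an alias.
row = con.execute("""
    SELECT od.full_date, sd.full_date
    FROM shipment_fact f
    JOIN date_dim od ON f.order_date_key = od.date_key
    JOIN date_dim sd ON f.ship_date_key  = sd.date_key
""").fetchone()
print(row)  # ('2018-01-05', '2018-01-09')
```

Only one physical date table is maintained; each role is just a different foreign key into it.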


SLOWLY CHANGING DIMENSIONS

Changes to dimensional data can be categorized into levels:

SCD Type 0: Data is non-changing. It is inserted once and never changed.

SCD Type 1: Data is overwritten. New data overwrites and updates existing values.

SCD Type 2: Adds a new row. Each change adds a new row in which all the values are the same except for the changed fields. A new field (or fields) is added to mark the rows and state which one is effective.

SCD Type 3: Adds a new attribute. Changes are tracked in another field (or fields). In addition to the effective field, the last X values are stored in other fields, each in a separate field; usually, only one history record is saved for each field.

SCD Type 4: History table. In addition to the effective table, we keep a history table that holds all the changes that occurred in the base table. It creates a snapshot of every row that was changed and saves it with the relevant timestamp.

SCD Type 6: A hybrid method of types 1, 2, and 3.

DATA INTEGRATION

Data integration is a technique for moving data or otherwise making data available across data stores. The data integration process can include extraction, movement, validation, cleansing, transformation, standardization, and loading.

EXTRACT TRANSFORM LOAD (ETL)

In the ETL pattern of data integration, data is extracted from the data source and then transformed while in flight to a staging database. Data is then loaded into the data warehouse. ETL is great for batch processing of bulk data.

EXTRACT LOAD TRANSFORM (ELT)

In the ELT pattern of data integration, data is extracted from the data source and loaded to staging without transformation. After that, data is transformed within staging and then loaded to the data warehouse.

CHANGE DATA CAPTURE (CDC)

The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated near real time at staging. This information is then transformed and loaded to the data warehouse. CDC is a great technique for supporting real-time data warehouses.

INNOVATION IN DATA WAREHOUSE TECHNOLOGY

Thanks to the agility offered by today's cloud-based data warehouse solutions, there are cutting-edge innovations that can automate some of the key aspects of data warehousing. For example, the ETL process described earlier has changed considerably thanks to natural language processing, ultimately resulting in its complete automation. What's more, data warehouse storage and compute have also benefited from automated optimization through machine learning, saving data analysts time on tasks associated with querying, storage, and scalability, which dramatically cuts down costs, coding time, and resources.

As a modern solution, cloud data warehouse automation saves endless hours of coding and modeling for data ingestion, integration, and transformation. The tasks listed below can now be easily automated and seamlessly connected to third-party solutions, such as business intelligence visualization tools, via the cloud:

• Automate data source connections.

• Seamlessly connect to third-party SaaS APIs.

• Easily connect to the most common storage services.

New technology now exists that automates data schema modeling, where an adaptive schema changes in real time along with the data, and changes are seamless. You only need to upload the data sources; everything else is automated, including the following tasks:


• Data types are automatically discovered, and a schema is generated based on the initial data structure.

• Likely relationships between tables are automatically detected and used to model a relational schema.

• Aggregations are automatically generated.

• Table history is stored for data uploaded from API data sources, allowing easy comparison and analysis of data from different time periods.

AUTOMATED QUERY OPTIMIZATION

Automated cloud data warehousing technology exists that can automatically re-index the schema and perform a series of optimizations on the queries and data structure to improve runtime, based on algorithms that assess usage, so that:

• Re-indexing happens automatically whenever the algorithm detects changes in query patterns.

• Redistributing the data across nodes to improve data locality and join performance is done automatically.

SOLVING CONCURRENCY ISSUES

To remedy concurrency issues, new cloud data warehousing technologies today can separate storage from compute and increase the compute nodes based on the number of connections. Consequently, the number of available clusters scales with the number of users and the intensity of the workload, supporting hundreds of parallel queries that are load-balanced between clusters.

STORAGE OPTIMIZATION

Data warehouse automation has also vastly improved how data is stored and used. New "smart" data warehouse technologies constantly run periodic processes to mark data and optimize the storage based on usage. Smart data warehouse technology scales up and down based on the data volume. Scaling happens automatically behind the scenes, keeping clusters available for both reads and writes, so ingestion can continue uninterrupted. When the scaling is complete, the old and new clusters are swapped instantly. Data warehouse maintenance itself has been greatly improved as well, by automating the cleaning and compressing of tables to boost database performance.
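The table cleaning and compression just described can be illustrated with a local database engine. SQLite is only a stand-in here for the automated maintenance a cloud warehouse runs behind the scenes; the events table is a made-up example:

```python
import sqlite3

# Illustrative stand-in for automated table maintenance: after heavy
# deletes, compact the table and refresh optimizer statistics.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO events (payload) VALUES (?)",
                [("x" * 100,) for _ in range(1000)])
con.execute("DELETE FROM events WHERE id % 2 = 0")  # leaves dead space
con.commit()

con.execute("VACUUM")   # rebuild the database, reclaiming deleted space
con.execute("ANALYZE")  # refresh the statistics used by the query planner

remaining = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(remaining)  # 500
```

A smart warehouse schedules this kind of compaction and statistics refresh automatically, based on observed usage, instead of leaving it to the analyst.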

Written by Roi Avinoam, CTO and Co-Founder of Panoply


Roi Avinoam is CTO and Co-Founder of Panoply and has decades of experience in systems engineering, high-scalability software, big data architectures, and technical leadership. Roi served as CTO of Mytopia and Win.com, managing teams of over 60 engineers across multiple global locations. Prior to that, Roi led the semantic database and infrastructure team at Metacafe (acquired by Collective Digital Services). Roi has built software from a very young age, is active in the open-source community on several high-profile projects, and has extensive technical know-how.

DZone, Inc.
150 Preston Executive Dr. Cary, NC 27513
888.678.0399 919.678.0300

DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects, and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code, and more. "DZone is a developer's dream," says PC Magazine.

Copyright © 2018 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
