
Data Modeling 101


This essay is taken from Chapter 3 of Agile Database Techniques.

The goals of this chapter are to overview fundamental data modeling skills that all developers
should have, skills that can be applied both on traditional projects that take a serial approach and
on agile projects that take an evolutionary approach. My personal philosophy is that every IT
professional should have a basic understanding of data modeling. They don't need to be experts
at data modeling, but they should be prepared to be involved in the creation of such a model, be
able to read an existing data model, understand when and when not to create a data model, and
appreciate fundamental data design techniques. This chapter is a brief introduction to these
skills. The primary audience for this chapter is application developers who need to gain an
understanding of some of the critical activities performed by an Agile DBA. This understanding
should lead to an appreciation of what Agile DBAs do and why they do it, and it should help to
bridge the communication gap between these two roles.

Table of Contents
 The role of the Agile DBA
 What is Data Modeling?
o How are Data Models Used in Practice?
o What About Conceptual Models?
o Common Data Modeling Notations
 How to Model Data
o Identify entity types
o Identify attributes
o Apply naming conventions
o Identify relationships
o Apply data model patterns
o Assign keys
o Normalize to reduce data redundancy
o Denormalize to improve performance
 Evolutionary data modeling
 Agile data modeling
 How to Become Better At Modeling Data
 References
 Acknowledgements
 Let us help

1. The Role of the Agile DBA


Although you wouldn’t think it, data modeling can be one of the most challenging tasks that an
Agile DBA can be involved with on an agile software development project. Your approach to data
modeling will often be at the center of any controversy between the agile software developers and
the traditional data professionals within your organization. Agile software developers will lean
towards an evolutionary approach where data modeling is just one of many activities whereas
traditional data professionals will often lean towards a “big design up front (BDUF)” approach
where data models are the primary artifacts, if not THE artifacts. This problem results from a
combination of the cultural impedance mismatch and “normal” political maneuvering within your
organization. As a result Agile DBAs often find that navigating the political waters is an important
part of their data modeling efforts.

Additionally, when it comes to data modeling Agile DBAs will:

 Mentor application developers in fundamental data modeling techniques.


 Mentor experienced enterprise architects and administrators in evolutionary
modeling techniques.
 Ensure that the team follows data modeling standards and conventions.
 Develop and evolve the data model(s), in an evolutionary (iterative and
incremental) manner, to meet the needs of the project team.
 Keep the database schema(s) in sync with the physical data model(s).

2. What is Data Modeling?


Data modeling is the act of exploring data-oriented structures. Like other modeling artifacts data
models can be used for a variety of purposes, from high-level conceptual models to physical data
models. From the point of view of an object-oriented developer data modeling is conceptually
similar to class modeling. With data modeling you identify entity types whereas with class
modeling you identify classes. Data attributes are assigned to entity types just as you would
assign attributes and operations to classes. There are associations between entities, similar to
the associations between classes – relationships, inheritance, composition, and aggregation are
all applicable concepts in data modeling.

Data modeling is different from class modeling because it focuses solely on data – class models
allow you to explore both the behavior and data aspects of your domain, whereas with a data
model you can only explore data issues. Because of this focus data modelers have a tendency to
be much better at getting the data “right” than object modelers.

Although the focus of this chapter is data modeling, there are often alternatives to data-oriented
artifacts (never forget Agile Modeling’s Multiple Models principle). For example, when it comes
to conceptual modeling ORM diagrams aren’t your only option – in addition to LDMs it is quite
common for people to create UML class diagrams and even Class Responsibility Collaborator
(CRC) cards instead. In fact, my experience is that CRC cards are superior to ORM diagrams
because it is very easy to get project stakeholders actively involved in the creation of the model.
Instead of a traditional, analyst-led drawing session you can instead facilitate stakeholders
through the creation of CRC cards (Ambler 2001a).
2.1. How are Data Models Used in Practice?

Although methodology issues are covered later, we need to discuss how data models can
be used in practice to better understand them. You are likely to see two basic styles of
data model:

 Conceptual data models. These models, sometimes called domain models, are
typically used to explore domain concepts with project stakeholders. Conceptual data
models are often created as the precursor to LDMs or as alternatives to LDMs.
 Logical data models (LDMs). LDMs are used to explore the domain concepts,
and their relationships, of your problem domain. This could be done for the scope of a
single project or for your entire enterprise. LDMs depict the logical entity types,
typically referred to simply as entity types, the data attributes describing those entities,
and the relationships between the entities.
 Physical data models (PDMs). PDMs are used to design the internal schema of
a database, depicting the data tables, the data columns of those tables, and the
relationships between the tables. The focus of this chapter is on physical modeling.

Although LDMs and PDMs sound very similar, and in fact they are, the level of detail
that they model can be significantly different. This is because the goals for each diagram
are different – you can use an LDM to explore domain concepts with your stakeholders
and the PDM to define your database design. Figure 1 presents a simple LDM and Figure 2
a simple PDM, both modeling the concept of customers and addresses as well as the
relationship between them. Both diagrams apply the Barker (1990) notation, summarized
below. Notice how the PDM shows greater detail, including an associative table required
to implement the association as well as the keys needed to maintain the relationships.
More on these concepts later. PDMs should also reflect your organization’s database
naming standards, in this case an abbreviation of the entity name is appended to each
column name and an abbreviation for “Number” was consistently introduced. A PDM
should also indicate the data types for the columns, such as integer and char(5). Although
Figure 2 does not show them, lookup tables for how the address is used as well as for
states and countries are implied by the attributes ADDR_USAGE_CODE, STATE_CODE,
and COUNTRY_CODE.

Figure 1. A simple logical data model.


Figure 2. A simple physical data model.

An important observation about Figures 1 and 2 is that I’m not slavishly following
Barker’s approach to naming relationships. For example, between Customer and Address
there really should be two names “Each CUSTOMER may be located in one or more
ADDRESSES” and “Each ADDRESS may be the site of one or more CUSTOMERS”.
Although these names explicitly define the relationship, I personally think that they’re
visual noise that clutters the diagram. I prefer simple names such as “has” and then trust
my readers to interpret the name in each direction. I’ll only add more information where
it’s needed; in this case I think that it isn’t. However, a significant advantage of
describing the names the way that Barker suggests is that it’s a good test to see if you
actually understand the relationship – if you can’t name it then you likely don’t
understand it.

Data models can be used effectively at both the enterprise level and on projects.
Enterprise architects will often create one or more high-level LDMs that depict the data
structures that support your enterprise, models typically referred to as enterprise data
models or enterprise information models. An enterprise data model is one of several
critical views that your organization’s enterprise architects will maintain and support –
other views may explore your network/hardware infrastructure, your organization
structure, your software infrastructure, and your business processes (to name a few).
Enterprise data models provide information that a project team can use both as a set of
constraints and as a source of important insights into the structure of their system.

Project teams will typically create LDMs as a primary analysis artifact when their
implementation environment is predominantly procedural in nature, for example they are
using structured COBOL as an implementation language. LDMs are also a good choice
when a project is data-oriented in nature, perhaps a data warehouse or reporting system is
being developed. However LDMs are often a poor choice when a project team is using
object-oriented or component-based technologies because the developers would rather
work with UML diagrams or when the project is not data-oriented in nature. As Agile
Modeling (Ambler 2002) advises, Apply The Right Artifact(s) for the job. Or, as your
grandfather likely advised you, use the right tool for the job.

When a relational database is used for data storage, project teams are best advised to
create a PDM to model its internal schema. My experience is that a PDM is often one of
the critical design artifacts for business application development projects.

2.2. What About Conceptual Models?


Halpin (2001) points out that many data professionals prefer to create an Object-Role Model
(ORM), an example is depicted in Figure 3, instead of an LDM for a conceptual model. The
advantage is that the notation is very simple, something your project stakeholders can quickly
grasp, although the disadvantage is that the models become large very quickly. ORMs enable
you to first explore actual data examples instead of simply jumping to a potentially incorrect
abstraction – for example Figure 3 examines the relationship between customers and addresses
in detail. For more information about ORM, visit www.orm.net.

Figure 3. A simple Object-Role Model.

My experience is that people will capture information in the best place that they know. As a result
I typically discard ORMs after I’m finished with them. I sometimes use ORMs to explore the
domain with project stakeholders but later replace them with a more traditional artifact such as an
LDM, a class diagram, or even a PDM. As a “generalizing specialist” (Ambler 2003b), someone
with one or more specialties who also strives to gain general skills and knowledge, this is an easy
decision for me to make; I know that this information that I’ve just “discarded” will be captured in
another artifact – a model, the tests, or even the code – that I understand. A specialist who only
understands a limited number of artifacts and therefore “hands off” their work to other specialists
doesn’t have this option. Not only are they tempted to keep the artifacts that they create
but also to invest even more time to enhance those artifacts. My experience is that generalizing
specialists are more likely than specialists to travel light.
2.3. Common Data Modeling Notations
Figure 4 presents a summary of the syntax of four common data modeling notations: Information
Engineering (IE), Barker, IDEF1X, and the Unified Modeling Language (UML). This diagram isn’t
meant to be comprehensive; instead its goal is to provide a basic overview. Furthermore, for the
sake of brevity I wasn’t able to depict the highly-detailed approach to relationship naming that
Barker suggests. Although I provide a brief description of each notation in Table 1, I highly
suggest David Hay’s (1999) paper A Comparison of Data Modeling Techniques as he goes
into greater detail than I do.

Figure 4. Comparing the syntax of common data modeling notations.


Table 1. Discussing common data modeling notations.

IE: The IE notation (Finkelstein 1989) is simple and easy to read, and is well suited
for high-level logical and enterprise data modeling. The only drawback of this
notation, arguably an advantage, is that it does not support the identification of
attributes of an entity. The assumption is that the attributes will be modeled
with another diagram or simply described in the supporting documentation.

Barker: The Barker (1990) notation is one of the more popular ones; it is supported by
Oracle’s toolset and is well suited for all types of data models. Its approach to
subtyping can become clunky with hierarchies that go several levels deep.

IDEF1X: This notation is overly complex. It was originally intended for physical modeling
but has been misapplied for logical modeling as well. Although popular within
some U.S. government agencies, particularly the Department of Defense
(DoD), this notation has been all but abandoned by everyone else. Avoid it if
you can.

UML: This is not an official data modeling notation (yet). Although several
suggestions for a data modeling profile for the UML exist, including Naiburg and
Maksimchuk’s (2001) and my own (Ambler 2001a), none are complete and,
more importantly, none are “official” UML yet. Having said that, considering the
popularity of the UML, the other data-oriented efforts of the Object Management
Group (OMG), and the lack of a notational standard within the data community,
it is only a matter of time until a UML data modeling notation is accepted within
the IT industry.

3. How to Model Data

It is critical for application developers to have a grasp of the fundamentals of data
modeling so they can not only read data models but also work effectively with the Agile
DBAs who are responsible for the data-oriented aspects of the project. Your goal in
reading this section is not to learn how to become a data modeler; instead it is simply to
gain an appreciation of what is involved.

The following tasks are performed in an iterative manner:

 Identify entity types


 Identify attributes
 Apply naming conventions
 Identify relationships
 Apply data model patterns
 Assign keys
 Normalize to reduce data redundancy
 Denormalize to improve performance

Very good practical books about data modeling include Joe Celko’s Data & Databases (Celko
1999) and Data Modeling for Information Professionals (Schmidt 1998) as they both focus on
practical issues with data modeling. The Data Modeling Handbook (Reingruber and Gregory
1994) and Data Model Patterns (Hay 1996) are both excellent resources once you’ve
mastered the fundamentals. An Introduction to Database Systems (Date 2001) is a good
academic treatise for anyone wishing to become a data specialist.

3.1 Identify entity types

An entity type, also simply called “entity”, is similar conceptually to object-orientation’s


concept of a class – an entity type represents a collection of similar objects. An entity
could represent a collection of people, places, things, events, or concepts. Examples of
entities in an order entry system would include Customer, Address, Order, Item, and Tax.
If you were class modeling you would expect to discover classes with the exact same
names. However, the difference between a class and an entity type is that classes have
both data and behavior whereas entity types just have data.

Ideally an entity should be “normal”, the data modeling world’s version of cohesive. A
normal entity depicts one concept, just like a cohesive class models one concept. For
example, customer and order are clearly two different concepts; therefore it makes sense
to model them as separate entities.

3.2 Identify Attributes

Each entity type will have one or more data attributes. For example, in Figure 1 you saw
that the Customer entity has attributes such as First Name and Surname and in Figure 2
that the TCUSTOMER table had corresponding data columns CUST_FIRST_NAME and
CUST_SURNAME (a column is the implementation of a data attribute within a relational
database).

Attributes should also be cohesive from the point of view of your domain, something that
is often a judgment call – in Figure 1 we decided that we wanted to model the fact that
people had both first and last names instead of just a name (e.g. “Scott” and “Ambler” vs.
“Scott Ambler”) whereas we did not distinguish between the sections of an American zip
code (e.g. 90210-1234-5678). Getting the level of detail right can have a significant
impact on your development and maintenance efforts. Refactoring a single data column
into several columns can be quite difficult (database refactoring is described in detail in
Database Refactoring), although overspecifying an attribute (e.g. having three attributes for
zip code when you only needed one) can result in overbuilding your system, and hence
you incur greater development and maintenance costs than you actually needed to.

3.3 Apply Data Naming Conventions

Your organization should have standards and guidelines applicable to data modeling,
something you should be able to obtain from your enterprise administrators (if they don’t
exist you should lobby to have some put in place). These guidelines should include
naming conventions for both logical and physical modeling; the logical naming
conventions should be focused on human readability whereas the physical naming
conventions will reflect technical considerations. You can clearly see that different
naming conventions were applied in Figures 1 and 2.

As you saw in the Introduction to Agile Modeling chapter, AM includes the Apply Modeling
Standards practice. The basic idea is that developers should agree to and follow a
common set of modeling standards on a software project. Just as there is value in
following common coding conventions – clean code that follows your chosen coding
guidelines is easier to understand and evolve than code that doesn’t – there is similar value
in following common modeling conventions.

3.4 Identify Relationships


In the real world entities have relationships with other entities. For example, customers PLACE
orders, customers LIVE AT addresses, and line items ARE PART OF orders. Place, live at, and
are part of are all terms that define relationships between entities. The relationships between
entities are conceptually identical to the relationships (associations) between objects.

Figure 5 depicts a partial LDM for an online ordering system. The first thing to notice is the
various styles applied to relationship names and roles – different relationships require different
approaches. For example the relationship between Customer and Order has two names, places
and is placed by, whereas the relationship between Customer and Address has one. In this
example having a second name on the relationship, the idea being that you want to specify how
to read the relationship in each direction, is redundant – you’re better off finding a clear wording
for a single relationship name, decreasing the clutter on your diagram. Similarly you will often
find that specifying the roles that an entity plays in a relationship negates the need to
give the relationship a name (although some CASE tools may inadvertently force you to do this).
For example the role of billing address and the label billed to are clearly redundant; you really
only need one. Similarly, the part of role that Line Item has in its relationship with Order is
sufficiently obvious without a relationship name.

Figure 5. A logical data model (Information Engineering notation).


You also need to identify the cardinality and optionality of a relationship (the UML combines the
concepts of optionality and cardinality into the single concept of multiplicity). Cardinality
represents the concept of “how many” whereas optionality represents the concept of “whether
you must have something.” For example, it is not enough to know that customers place orders.
How many orders can a customer place? None, one, or several? Furthermore, relationships are
two-way streets: not only do customers place orders, but orders are placed by customers. This
leads to questions like: how many customers can be involved in placing any given order, and is it possible
to have an order with no customer involved? Figure 5 shows that customers place one or more
orders and that any given order is placed by one customer and one customer only. It also shows
that a customer lives at one or more addresses and that any given address has zero or more
customers living at it.

Although the UML distinguishes between different types of relationships – associations,


inheritance, aggregation, composition, and dependency – data modelers often aren’t as
concerned with this issue as object modelers are. Subtyping, one application of
inheritance, is often found in data models, an example of which is the is a relationship between
Item and its two “sub entities” Service and Product. Aggregation and composition are much less
common and typically must be implied from the data model, as you see with the part of role that
Line Item takes with Order. UML dependencies are typically a software construct and therefore
wouldn’t appear on a data model, unless of course it was a very highly detailed physical model
that showed how views, triggers, or stored procedures depended on the schema of one or more
tables.

3.5 Apply Data Model Patterns

Some data modelers will apply common data model patterns – David Hay’s (1996) book
Data Model Patterns is the best reference on the subject – just as object-oriented
developers will apply analysis patterns (Fowler 1997; Ambler 1997) and design patterns
(Gamma et al. 1995). Data model patterns are conceptually closest to analysis patterns
because they describe solutions to common domain issues. Hay’s book is a very good
reference for anyone involved in analysis-level modeling, even when you’re taking an
object approach instead of a data approach because his patterns model business structures
from a wide variety of business domains.
3.6 Assign Keys
First, some terminology. A key is one or more data attributes that uniquely identify an entity. A
key that is two or more attributes is called a composite key. A key that is formed of attributes that
already exist in the real world is called a natural key. For example, U.S. citizens are issued a
Social Security Number (SSN) that is unique to them. SSN could be used as a natural key,
assuming privacy laws allow it, for a Person entity (assuming the scope of your organization is
limited to the U.S.). An entity type in a logical data model will have zero or more candidate keys,
also referred to simply as unique identifiers. For example, if we only interact with American
citizens then SSN is one candidate key for the Person entity type and the combination of name
and phone number (assuming the combination is unique) is potentially a second candidate key.
Both of these keys are called candidate keys because they are candidates to be chosen as the
primary key, an alternate key (also known as a secondary key), or perhaps not even a key at all
within a physical data model. A primary key is the preferred key for an entity type whereas an
alternate key (also known as a secondary key) is an alternative way to access rows within a table.
In a physical database a key would be formed of one or more table columns whose value(s)
uniquely identifies a row within a relational table.

Figure 6 presents an alternative design to that presented in Figure 2 – a different naming
convention was adopted and the model itself is more extensive. In Figure 6 the Customer table
has the CustomerNumber column as its primary key and SocialSecurityNumber as an alternate
key. This indicates that the preferred way to access customer information is through the value of
a person’s customer number although your software can get at the same information if it has the
person’s social security number. The CustomerHasAddress table has a composite primary key,
the combination of CustomerNumber and AddressID. A foreign key is one or more attributes in
an entity type that represents a key, either primary or secondary, in another entity type. Foreign
keys are used to maintain relationships between rows. For example, the relationships between
rows in the CustomerHasAddress table and the Customer table are maintained by the
CustomerNumber column within the CustomerHasAddress table. The interesting thing about the
CustomerNumber column is the fact that it is part of the primary key for CustomerHasAddress as
well as the foreign key to the Customer table. Similarly, the AddressID column is part of the
primary key of CustomerHasAddress as well as a foreign key to the Address table to maintain the
relationship with rows of Address.

Figure 6. Customer and Address revisited (UML notation).
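
Since Figure 6 itself is not reproduced here, the following is a rough Oracle-style DDL sketch of the
three tables it describes. The column list is abbreviated and the data types are assumptions for
illustration, not taken from the figure.

create table Customer (
    CustomerNumber       integer not null primary key,   -- primary key
    SocialSecurityNumber char(11) unique,                 -- alternate (secondary) key
    FirstName            varchar(100),
    Surname              varchar(100)
);

create table Address (
    AddressID   integer not null primary key,             -- surrogate key
    Street      varchar(100),
    City        varchar(100),
    StateCode   char(2),
    ZipCode     varchar(10),
    CountryCode char(3)
);

-- the associative table; each foreign key is also part of the composite primary key
create table CustomerHasAddress (
    CustomerNumber integer not null references Customer,
    AddressID      integer not null references Address,
    primary key (CustomerNumber, AddressID)
);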


There are two strategies for assigning keys to tables. The first is to simply use a natural key, one
or more existing data attributes that are unique to the business concept. For the Customer table
there were two candidate keys, in this case CustomerNumber and SocialSecurityNumber. The
second strategy is to introduce a new column to be used as a key. This new column is called a
surrogate key, a key that has no business meaning, an example of which is the AddressID
column of the Address table in Figure 6. Addresses don’t have an “easy” natural key because
you would need to use all of the columns of the Address table to form a key for itself, therefore
introducing a surrogate key is a much better option in this case. The primary advantage of
natural keys is that they exist already; you don’t need to introduce a new “unnatural” value to your
data schema. However, the primary disadvantage of natural keys is that because they have
business meaning it is possible that they may need to change if your business requirements
change. For example, if your users decide to make CustomerNumber alphanumeric instead of
numeric then in addition to updating the schema for the Customer table (which is unavoidable)
you would have to change every single table where CustomerNumber is used as a foreign key. If
the Customer table instead used a surrogate key then the change would have been localized to
just the Customer table itself (CustomerNumber in this case would just be a non-key column of
the table). Naturally, if you needed to make a similar change to your surrogate key strategy,
perhaps adding a couple of extra digits to your key values because you’ve run out of values, then
you would have the exact same problem. The fundamental problem is that keys are a significant
source of coupling within a relational schema, and as a result they are difficult to change. The
implication is that you want to avoid keys with business meaning because business meaning
changes.
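
To make the coupling argument concrete, here is a hedged sketch of the ripple effect; CustomerOrder
and Payment are hypothetical stand-ins for “every table where CustomerNumber is used as a foreign
key”.

-- natural key: the type change hits the Customer table and every referencing table
alter table Customer      modify (CustomerNumber varchar(20));
alter table CustomerOrder modify (CustomerNumber varchar(20));
alter table Payment       modify (CustomerNumber varchar(20));

-- surrogate key: CustomerNumber is just a non-key column, so only one table changes
alter table Customer modify (CustomerNumber varchar(20));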

This points out the need to set a workable surrogate key strategy. There are several common
options:
1. Key values assigned by the database. Most of the leading database vendors –
companies such as Oracle, Sybase, and Informix – implement a surrogate key strategy
called incremental keys. The basic idea is that they maintain a counter within the
database server, writing the current value to a hidden system table to maintain
consistency, which they use to assign a value to newly created table rows. Every time a
row is created the counter is incremented and that value is assigned as the key value for
that row. The implementation strategies vary from vendor to vendor – sometimes the
values assigned are unique across all tables whereas sometimes values are unique only
within a single table – but the general concept is the same (a minimal sketch follows this list).
2. MAX() + 1. A common strategy is to use an integer column, start the value for
the first record at 1, then for a new row set the value to the maximum value in this
column plus one using the SQL MAX function. Although this approach is simple it
suffers from performance problems with large tables and only guarantees a unique key
value within the table.
3. Universally unique identifiers (UUIDs). UUIDs are 128-bit values that are
created from a hash of the ID of your Ethernet card, or an equivalent software
representation, and the current datetime of your computer system. The algorithm for
doing this is defined by the Open Software Foundation (www.opengroup.org).
4. Globally unique identifiers (GUIDs). GUIDs are a Microsoft standard that
extend UUIDs, following the same strategy if an Ethernet card exists and if not then they
hash a software ID and the current datetime to produce a value that is guaranteed unique
to the machine that creates it.
5. High-low strategy. The basic idea is that your key value, often called a persistent
object identifier (POID) or simply an object identifier (OID), is in two logical parts: a
unique HIGH value that you obtain from a defined source and an N-digit LOW value that
your application assigns itself. Each time that a HIGH value is obtained the LOW value
will be set to zero. For example, if the application that you’re running requests a value
for HIGH it will be assigned the value 1701. Assuming that N, the number of digits for
LOW, is four, then all persistent object identifiers that the application assigns to objects
will be 17010000, 17010001, 17010002, and so on up to 17019999. At
this point a new value for HIGH is obtained, LOW is reset to zero, and you continue
again. If another application requests a value for HIGH immediately after you it will be
given the value of 1702, and the OIDs that will be assigned to the objects that it creates will
be 17020000, 17020001, and so on. As you can see, as long as HIGH is unique then all
POID values will be unique. An implementation of a HIGH-LOW generator can be found
on www.theserverside.com.
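
Here is the sketch of strategy #1 promised above, using an Oracle-style sequence and trigger; the
Customer_POID column name is an assumption for illustration, not part of the earlier figures.

-- the sequence acts as the database-maintained counter
create sequence customer_poid_seq;

-- assign the next counter value whenever a row arrives without a key
create or replace trigger customer_assign_poid
before insert on Customer
for each row
when (new.Customer_POID is null)
begin
    select customer_poid_seq.nextval into :new.Customer_POID from dual;
end;
/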

As noted earlier, keys are a significant source of coupling within a relational schema, and as a
result they are difficult to change; the implication is that you want to avoid keys with business
meaning because business meaning changes. However, at the same time you need to
remember that some data is commonly accessed by unique identifiers, for example customers via
their customer number and American employees via their Social Security Number (SSN). In
these cases you may want to use the natural key instead of a surrogate key such as a UUID or
POID.

How can you be effective at assigning keys? Consider the following tips:
1. Avoid “smart” keys. A “smart” key is one that contains one or more subparts
which provide meaning. For example the first two digits of a U.S. zip code indicate the
state that the zip code is in. The first problem with smart keys is that they have business
meaning. The second problem is that their use often becomes convoluted over time. For
example some large states have several codes – California has zip codes beginning with 90
and 91 – making queries based on state codes more complex. Third, they often increase
the chance that the strategy will need to be expanded. Considering that zip codes are nine
digits in length (the last four digits are used at the discretion of the owners of the buildings
uniquely identified by the zip codes), it’s far less likely that you’d run out of nine-digit
numbers before running out of two-digit codes assigned to individual states.
2. Consider assigning natural keys for simple “look up” tables. A “look up” table
is one that is used to relate codes to detailed information. For example, you might have a
look up table relating color codes to the names of colors, where the code 127
represents “Tulip Yellow”. Simple look up tables typically consist of a code column and
a description/name column whereas complex look up tables consist of a code column and
several informational columns (a sketch of a simple look up table follows this list).
3. Natural keys don’t always work for “look up” tables. Another example of a
look up table is one that contains a row for each state, province, or territory in North
America. For example there would be a row for California, a US state, and for Ontario, a
Canadian province. The primary goal of this table is to provide an official list of these
geographical entities, a list that is reasonably static over time (the last change to it would
have been in the late 1990s when the Northwest Territories, a territory of Canada, was
split into Nunavut and Northwest Territories). A valid natural key for this table would be
the state code, a unique two character code – e.g. CA for California and ON for Ontario.
Unfortunately this approach doesn’t work because the Canadian government decided to keep
the same code, NW, for the two territories.
4. Your applications must still support “natural key searches”. If you choose to
take a surrogate key approach to your database design you mustn’t forget that your
applications must still support searches on the domain columns that still uniquely identify
rows. For example, your Customer table may have a Customer_POID column used as a
surrogate key as well as a Customer_Number column and a Social_Security_Number
column. You would likely need to support searches based on both the customer number
and the social security number. Searching is discussed in detail in Finding Objects in a
Relational Database.
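
Here is the simple look up table sketch promised in tip #2; the table and column names are
illustrative only.

create table ColorLookup (
    color_code  integer not null primary key,  -- the natural key, e.g. 127
    description varchar(50) not null           -- e.g. 'Tulip Yellow'
);

insert into ColorLookup (color_code, description) values (127, 'Tulip Yellow');

-- a complex look up table would simply add several informational columns beside the code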

3.7 Normalize to Reduce Data Redundancy


Data normalization is a process in which data attributes within a data model are organized to
increase the cohesion of entity types. In other words, the goal of data normalization is to reduce
and even eliminate data redundancy, an important consideration for application developers
because it is incredibly difficult to store objects in a relational database that maintains the same
information in several places. Table 2 summarizes the three most common normalization rules,
describing how to put entity types into a series of increasing levels of normalization. Higher levels
of data normalization (Date 2000) are beyond the scope of this book. With respect to
terminology, a data schema is considered to be at the level of normalization of its least
normalized entity type. For example, if all of your entity types are at second normal form (2NF) or
higher then we say that your data schema is at 2NF.

Table 2. Data Normalization Rules.

First normal form (1NF): An entity type is in 1NF when it contains no repeating groups of data.

Second normal form (2NF): An entity type is in 2NF when it is in 1NF and all of its non-key attributes
are fully dependent on its primary key.

Third normal form (3NF): An entity type is in 3NF when it is in 2NF and all of its attributes are directly
dependent on the primary key.

Figure 7 depicts a database schema in 0NF whereas Figure 8 depicts a normalized schema.
Read the Introduction to Data Normalization essay for details.

Why data normalization? The advantage of having a highly normalized data schema is
that information is stored in one place and one place only, reducing the possibility of
inconsistent data. Furthermore, highly-normalized data schemas in general are closer
conceptually to object-oriented schemas because the object-oriented goals of promoting
high cohesion and loose coupling between classes result in similar solutions (at least
from a data point of view). This generally makes it easier to map your objects to your
data schema. Unfortunately, normalization usually comes at a performance cost. With
the data schema of Figure 7 all the data for a single order is stored in one row (assuming
orders of up to nine order items), making it very easy to access. With the data schema of
Figure 7 you could quickly determine the total amount of an order by reading the single
row from the Order0NF table. To do so with the data schema of Figure 8 you would need
to read data from a row in the Order table, data from all the rows from the OrderItem
table for that order and data from the corresponding rows in the Item table for each order
item. For this query, the data schema of Figure 7 very likely provides better performance.

Figure 7. An Initial Data Schema for Order (UML Notation).


Figure 8. A normalized schema (UML Notation).
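
Since Figures 7 and 8 are not reproduced here, the following is a hedged sketch of the contrast they
illustrate; the table and column names are assumptions, with the 0NF table carrying a repeating
group of up to nine items per order.

-- 0NF: everything about an order flattened into one wide row (Figure 7 style)
create table Order0NF (
    order_number   integer not null primary key,
    customer_name  varchar(100),
    item1_number   integer, item1_quantity integer, item1_price numeric(9,2),
    item2_number   integer, item2_quantity integer, item2_price numeric(9,2),
    -- ... and so on up to item9 ...
    order_total    numeric(9,2)
);

-- normalized: one concept per table (Figure 8 style); "Orders" avoids the reserved word ORDER
create table Orders (
    order_number  integer not null primary key,
    customer_name varchar(100)
);

create table Item (
    item_number integer not null primary key,
    item_name   varchar(100),
    item_price  numeric(9,2)
);

create table OrderItem (
    order_number integer not null references Orders,
    item_number  integer not null references Item,
    quantity     integer not null,
    primary key (order_number, item_number)
);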
3.8 Denormalize to Improve Performance

Normalized data schemas, when put into production, often suffer from performance
problems. This makes sense – the rules of data normalization focus on reducing data
redundancy, not on improving performance of data access. An important part of data
modeling is to denormalize portions of your data schema to improve database access
times. For example, the data model of Figure 9 looks nothing like the normalized schema
of Figure 8. To understand why the differences between the schemas exist you must
consider the performance needs of the application. The primary goal of this system is to
process new orders from online customers as quickly as possible. To do this customers
need to be able to search for items and add them to their order quickly, remove items
from their order if need be, then have their final order totaled and recorded quickly. The
secondary goal of the system is to process, ship, and bill the orders afterwards.

Figure 9. A Denormalized Order Data Schema (UML notation).


To denormalize the data schema the following decisions were made:

1. To support quick searching of item information the Item table was left alone.
2. To support the addition and removal of order items to an order the concept of an
OrderItem table was kept, albeit split in two to support outstanding orders and fulfilled
orders. New order items can easily be inserted into the OutstandingOrderItem table, or
removed from it, as needed.
3. To support order processing the Order and OrderItem tables were reworked into
pairs to handle outstanding and fulfilled orders respectively. Basic order information is
first stored in the OutstandingOrder and OutstandingOrderItem tables; when the
order has been shipped and paid for, the data is removed from those tables and
copied into the FulfilledOrder and FulfilledOrderItem tables respectively (a sketch of
this move follows the list). Data access time to the two tables for outstanding orders is
reduced because only the active orders are being stored there. On average an order may be
outstanding for a couple of days, whereas for financial reporting reasons a fulfilled order may
be stored in the fulfilled order tables for several years until archived. There is a performance
penalty under this scheme because of the need to delete outstanding orders and then resave
them as fulfilled orders, clearly something that would need to be processed as a transaction.
4. The contact information for the person(s) the order is being shipped and billed to
was also denormalized back into the Order table, reducing the time it takes to write an
order to the database because there is now one write instead of two or three. The retrieval
and deletion times for that data would also be similarly improved.
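
Here is the sketch of the move described in decision #3, assuming the fulfilled tables mirror the
outstanding tables column-for-column and that :order_number holds the order being closed out.

-- all four statements must succeed together; on any error, roll back instead of committing
insert into FulfilledOrder
    select * from OutstandingOrder where order_number = :order_number;
insert into FulfilledOrderItem
    select * from OutstandingOrderItem where order_number = :order_number;
delete from OutstandingOrderItem where order_number = :order_number;
delete from OutstandingOrder     where order_number = :order_number;
commit;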

Note that if your initial, normalized data design meets the performance needs of your
application then it is fine as is. Denormalization should be resorted to only when
performance testing shows that you have a problem with your objects and subsequent
profiling reveals that you need to improve database access time. As my grandfather says,
if it ain’t broke don’t fix it.

4. Evolutionary Data Modeling


Evolutionary data modeling is data modeling performed in an iterative and incremental manner.
The essay Evolutionary Development explores evolutionary software development in greater
detail.

5. Agile Data Modeling


Agile data modeling is evolutionary data modeling done in a collaborative manner. The essay
Agile Data Modeling: From Domain Modeling to Physical Modeling works through a case
study which shows how to take an agile approach to data modeling.

6. How to Become Better At Modeling Data


How do you improve your data modeling skills? Practice, practice, practice. Whenever you get a
chance you should work closely with Agile DBAs, volunteer to model data with them, and ask
them questions as the work progresses. Agile DBAs will be following the AM practice Model
With Others so should welcome the assistance as well as the questions – one of the best ways
to really learn your craft is to have someone ask “why are you doing it that way?” You should be
able to learn physical data modeling skills from Agile DBAs, and often logical data modeling skills
as well.

Similarly you should take the opportunity to work with the enterprise architects within your
organization. As you saw in Agile Enterprise Architecture they should be taking an active role
on your project, mentoring your project team in the enterprise architecture (if any), mentoring you
in modeling and architectural skills, and aiding in your team’s modeling and development efforts.
Once again, volunteer to work with them and ask questions when you are doing so. Enterprise
architects will be able to teach you conceptual and logical data modeling skills as well as instill an
appreciation for enterprise issues.

You also need to do some reading. Although this chapter is a good start it is only a brief
introduction. The best approach is to simply ask the Agile DBAs that you work with what they
think you should read.

My final word of advice is that it is critical for application developers to understand and
appreciate the fundamentals of data modeling. This is a valuable skill to have and has been
since the 1970s. It also provides a common framework within which you can work with Agile
DBAs, and may even prove to be the initial skill that enables you to make a career transition into
becoming a full-fledged Agile DBA.

Data modeling is the hardest and most important activity in the RDBMS world. If you get
the data model wrong, your application might not do what users need, it might be
unreliable, it might fill up the database with garbage. Why then do we start a SQL tutorial
with the most challenging part of the job? Because you can't do queries, inserts, and
updates until you've defined some tables. And defining tables is data modeling.

When data modeling, you are telling the RDBMS the following (a sketch follows this list):

 what elements of the data you will store


 how large each element can be
 what kind of information each element can contain
 what elements may be left blank
 which elements are constrained to a fixed range
 whether and how various tables are to be linked
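
As a hedged sketch, here is a hypothetical product table annotated to show where each of those
decisions surfaces in the DDL (the suppliers table it references is assumed to exist).

create table product (
    product_id  integer primary key,          -- an element we will store; other tables link via this key
    name        varchar(100) not null,        -- how large it can be; may not be left blank
    description varchar(4000),                -- may be left blank (NULL)
    price       numeric(9,2) not null
        check (price >= 0),                   -- what kind of information; constrained to a range
    supplier_id integer references suppliers  -- whether and how this table links to another
);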

Three-Valued Logic

Programmers in most computer languages are familiar with Boolean logic. A variable
may be either true or false. Pervading SQL, however, is the alien idea of three-valued
logic. A column can be true, false, or NULL. When building the data model you must
affirmatively decide whether a NULL value will be permitted for a column and, if so,
what it means.
For example, consider a table for recording user-submitted comments to a Web site. The
publisher has made the following stipulations:

 comments won't go live until approved by an editor


 the admin pages will present editors with all comments that are pending approval,
i.e., have been submitted but neither approved nor disapproved by an editor
already

Here's the data model:


create table user_submitted_comments (
comment_id integer primary key,
user_id integer not null references users,
submission_time date default sysdate not null,
ip_address varchar(50) not null,
content clob,
approved_p char(1) check(approved_p in ('t','f'))
);
Implicit in this model is the assumption that approved_p can be NULL and that, if not
explicitly set during the INSERT, that is what it will default to. What about the check
constraint? It would seem to restrict approved_p to values of "t" or "f". NULL, however,
is a special value and if we wanted to prevent approved_p from taking on NULL we'd
have to add an explicit not null constraint.

How do NULLs work with queries? Let's fill user_submitted_comments with some
sample data and see:

insert into user_submitted_comments
(comment_id, user_id, ip_address, content)
values
(1, 23069, '18.30.2.68', 'This article reminds me of Hemingway');

1 row created.

SQL> select first_names, last_name, content,
       user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id;

FIRST_NAMES  LAST_NAME  CONTENT                               APPROVED_P
------------ ---------- ------------------------------------- ----------
Philip       Greenspun  This article reminds me of Hemingway
We've successfully JOINed the user_submitted_comments and users table to get both
the comment content and the name of the user who submitted it. Notice that in the select
list we had to explicitly request user_submitted_comments.approved_p. This is
because the users table also has an approved_p column.
When we inserted the comment row we did not specify a value for the approved_p
column. Thus we expect that the value would be NULL and in fact that's what it seems to
be. Oracle's SQL*Plus application indicates a NULL value with white space.

For the administration page, we'll want to show only those comments where the
approved_p column is NULL:

SQL> select first_names, last_name, content,
       user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id
and user_submitted_comments.approved_p = NULL;

no rows selected
"No rows selected"? That's odd. We know for a fact that we have one row in the
comments table and that is approved_p column is set to NULL. How to debug the query?
The first thing to do is simplify by removing the JOIN:
SQL> select * from user_submitted_comments where approved_p = NULL;

no rows selected
What is happening here is that any expression involving NULL evaluates to NULL,
including one that effectively looks like "NULL = NULL". The WHERE clause is
looking for expressions that evaluate to true. What you need to use is the special test IS
NULL:
SQL> select * from user_submitted_comments where approved_p is NULL;

COMMENT_ID    USER_ID SUBMISSION_T IP_ADDRESS
---------- ---------- ------------ ----------
CONTENT                              APPROVED_P
------------------------------------ ----------
         1      23069 2000-05-27   18.30.2.68
This article reminds me of Hemingway
An adage among SQL programmers is that the only time you can use "= NULL" is in an
UPDATE statement (to set a column's value to NULL). It never makes sense to use
"= NULL" in a WHERE clause.

The bottom line is that as a data modeler you will have to decide which columns can
be NULL and what that value will mean.

Back to the Mailing List

Let's return to the mailing list data model from the introduction:
create table mailing_list (
email varchar(100) not null primary key,
name varchar(100)
);
create table phone_numbers (
email varchar(100) not null references mailing_list,
number_type varchar(15) check (number_type in
('work','home','cell','beeper')),
phone_number varchar(20) not null
);
This data model locks you into some realities:
 You will not be sending out any physical New Year's cards to folks on your
mailing list; you don't have any way to store their addresses.
 You will not be sending out any electronic mail to folks who work at companies
with elaborate Lotus Notes configurations; sometimes Lotus Notes results in
email addresses that are longer than 100 characters.
 You are running the risk of filling the database with garbage since you have not
constrained phone numbers in any way. American users could add or delete digits
by mistake. International users could mistype country codes.
 You are running the risk of not being able to serve rich people because the
number_type column may be too constrained. Suppose William H. Gates the
Third wishes to record some extra phone numbers with types of "boat", "ranch",
"island", and "private_jet". The check (number_type in
('work','home','cell','beeper')) statement prevents Mr. Gates from doing
this.
 You run the risk of having records in the database for people whose name you
don't know, since the name column of mailing_list is free to be NULL.
 Changing a user's email address won't be the simplest possible operation. You're
using email as a key in two tables and therefore will have to update both tables (see the
sketch after this list). The references mailing_list constraint keeps you from making the
mistake of only updating mailing_list and leaving orphaned rows in phone_numbers. But if
users changed their email addresses frequently, you might not want to do things
this way.
 Since you've no provision for storing a password or any other means of
authentication, if you allow users to update their information, you run a minor risk
of allowing a malicious change. (The risk isn't as great as it seems because you
probably won't be publishing the complete mailing list; an attacker would have to
guess the names of people on your mailing list.)
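
Here is the sketch promised above. Because Oracle has no ON UPDATE CASCADE and
phone_numbers references mailing_list, the change takes three steps inside one transaction
(the email addresses are placeholders):

insert into mailing_list (email, name)
    select 'new@example.com', name from mailing_list where email = 'old@example.com';

update phone_numbers
set email = 'new@example.com'
where email = 'old@example.com';

delete from mailing_list where email = 'old@example.com';

commit;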

These aren't necessarily bad realities in which to be locked. However, a good data
modeler recognizes that every line of code in the .sql file has profound implications for
the Web service.

Papering Over Your Mistakes with Triggers

Suppose that you've been using the above data model to collect the names of Web site
readers who'd like to be alerted when you add new articles. You haven't sent any notices
for two months. You want to send everyone who signed up in the last two months a
"Welcome to my Web service; thanks for signing up; here's what's new" message. You
want to send the older subscribers a simple "here's what's new" message. But you can't do
this because you didn't store a registration date. It is easy enough to fix the table:
alter table mailing_list add (registration_date date);
But what if you have 15 different Web scripts that use this table? The ones that query it
aren't a problem. If they don't ask for the new column, they won't get it and won't realize
that the table has been changed (this is one of the big selling features of the RDBMS).
But the scripts that update the table will all need to be changed. If you miss a script,
you're potentially stuck with a table where various random rows are missing critical
information.

Oracle has a solution to your problem: triggers. A trigger is a way of telling Oracle "any
time anyone touches this table, I want you to execute the following little fragment of
code". Here's how we define the trigger mailing_list_registration_date:

create trigger mailing_list_registration_date
before insert on mailing_list
for each row
when (new.registration_date is null)
begin
:new.registration_date := sysdate;
end;
Note that the trigger only runs when someone is trying to insert a row with a NULL
registration date. If for some reason you need to copy over records from another database
and they have a registration date, you don't want this trigger overwriting it with the date
of the copy.

A second point to note about this trigger is that it runs for each row. This is called a
"row-level trigger" rather than a "statement-level trigger", which runs once per
statement and is usually not what you want here.

A third point is that we're using the magic Oracle function sysdate, which returns
the current time. The Oracle date type is precise to the second even though the default is
to display only the day.

A fourth point is that, starting with Oracle 8, we could have done this more cleanly by
adding a default sysdate instruction to the column's definition.
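
For instance, had the column been added with a default in the first place, something along these
lines would have covered plain INSERTs that omit the column (though not ones that explicitly
insert NULL):

alter table mailing_list add (registration_date date default sysdate);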

The final point worth noting is the :new. syntax. This lets you refer to the new values
being inserted. There is an analogous :old. feature, which is useful for update triggers:

create or replace trigger mailing_list_update
before update on mailing_list
for each row
when (new.name <> old.name)
begin
-- user is changing his or her name
-- record the fact in an audit table
insert into mailing_list_name_changes
(old_name, new_name)
values
(:old.name, :new.name);
end;
/
show errors
This time we used the create or replace syntax. This keeps us from having to drop
trigger mailing_list_update if we want to change the trigger definition. We added a
comment using the SQL comment shortcut "--". The syntax new. and old. (without colons) is
used in the when clause of the trigger definition, limiting the conditions under which the trigger
runs. Between the begin and end, we're in a PL/SQL block. This is Oracle's procedural
language, described later, in which new.name would mean "the name element from the record
in new", so there you have to use :new instead. It is obscurities like this that lead to competent
Oracle consultants being paid $200+ per hour.

The "/" and show errors at the end are instructions to Oracle's SQL*Plus program. The
slash says "I'm done typing this piece of PL/SQL, please evaluate what I've typed." The
"show errors" says "if you found anything to object to in what I just typed, please tell
me".

The Discussion Forum -- philg's personal odyssey

Back in 1995, I built a threaded discussion forum, described ad nauseam in
http://philip.greenspun.com/wtr/dead-trees/53013.htm. Here's how I stored the postings:

create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
email varchar(200),
name varchar(200),
one_line varchar(700),
message clob,
notify char(1) default 'f' check (notify in ('t','f')),
posting_time date,
sort_key varchar(600)
);
German order reigns inside the system itself: messages are uniquely keyed with msg_id,
refer to each other (i.e., say "I'm a response to msg X") with refers_to, and a thread can
be displayed conveniently by using the sort_key column.

Italian chaos is permitted in the email and name columns; users could remain
anonymous, masquerade as "president@whitehouse.gov" or give any name.

This seemed like a good idea when I built the system. I was concerned that it work
reliably. I didn't care whether or not users put in bogus content; the admin pages made it
really easy to remove such postings and, in any case, if someone had something
interesting to say but needed to remain anonymous, why should the system reject their
posting?
One hundred thousand postings later, as the moderator of the photo.net Q&A forum, I
began to see the dimensions of my data modeling mistakes.

First, anonymous postings and fake email addresses didn't come from Microsoft
employees revealing the dark truth about their evil bosses. They came from complete
losers trying and failing to be funny or wishing to humiliate other readers. Some fake
addresses came from people scared by the rising tide of spam email (not a serious
problem back in 1995).

Second, I didn't realize how the combination of my email alert systems, fake email
addresses, and Unix mailers would result in my personal mailbox filling up with
messages that couldn't be delivered to "asdf@asdf.com" or "duh@duh.net".

Although the solution involved changing some Web scripts, fundamentally the fix was to
add a column to store the IP address from which a post was made:

alter table bboard add (originating_ip varchar(16));


Keeping these data enabled me to see that most of the anonymous posters were people
who'd been using the forum for some time, typically from the same IP address. I just sent
them mail and asked them to stop, explaining the problem with bounced email.

After four years of operating the photo.net community, it became apparent that we
needed ways to

 display site history for users who had changed their email addresses
 discourage problem users from burdening the moderators and the community
 carefully tie together user-contributed content in the various subsystems of
photo.net

The solution was obvious to any experienced database nerd: a canonical users table and
then content tables that reference it. Here's a simplified version of the data model, taken
from a toolkit for building online communities, described in
http://philip.greenspun.com/panda/community:

create table users (
user_id integer not null primary key,
first_names varchar(100) not null,
last_name varchar(100) not null,
email varchar(100) not null unique,
...
);

create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
topic varchar(100) not null references bboard_topics,
category varchar(200), -- only used for categorized Q&A forums
originating_ip varchar(16), -- stored as string, separated by periods
user_id integer not null references users,
one_line varchar(700),
message clob,
-- html_p - is the message in html or not
html_p char(1) default 'f' check (html_p in ('t','f')),
...
);

create table classified_ads (
classified_ad_id integer not null primary key,
user_id integer not null references users,
...
);
Note that a contributor's name and email address no longer appear in the bboard table.
That doesn't mean we don't know who posted a message. In fact, this data model can't
even represent an anonymous posting: user_id integer not null references
users requires that each posting be associated with a user ID and that there actually be a
row in the users table with that ID.
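To see what the reference buys us, here is the kind of query the normalized model makes natural. This is only a sketch: the msg_id value is invented, and it uses just the columns shown in the simplified definitions above.

-- display a posting together with its author's current name and email;
-- no name or email is stored in bboard itself
select u.first_names, u.last_name, u.email, b.one_line
from bboard b, users u
where b.msg_id = '000037'   -- hypothetical message ID
and b.user_id = u.user_id;

Because the name and email live only in users, the result always reflects the contributor's current identity, no matter how many postings they have made.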

First, let's talk about how much fun it is to move a live-on-the-Web 600,000 hit/day
service from one data model to another. In this case, note that the original bboard data
model had a single name column. The community system has separate columns for first
and last names. A conversion script can easily split up "Joe Smith" but what is it to do
with William Henry Gates III?

How do we copy over anonymous postings? Remember that Oracle is not flexible or
intelligent. We said that we wanted every row in the bboard table to reference a row in
the users table. Oracle will abort any transaction that would result in a violation of this
integrity constraint. So we either have to drop all those anonymous postings (and any
non-anonymous postings that refer to them) or we have to create a user called
"Anonymous" and assign all the anonymous postings to that person. The technical term
for this kind of solution is kludge.
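For what it's worth, the kludge itself is only a couple of statements. A hedged sketch, assuming we reserve user ID 0 for the dummy row (the email address is invented purely to satisfy the not null unique constraint):

-- a dummy user to own the old anonymous postings
insert into users (user_id, first_names, last_name, email)
values (0, 'Anonymous', 'Poster', 'anonymous@example.com');

-- the conversion script then assigns user_id 0 to every old posting whose
-- email address can't be matched to a row in users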

A more difficult problem than anonymous postings is presented by long-time users who
have difficulty typing and/or keeping a job. Consider a user who has identified himself as

1. Joe Smith; jsmith@ibm.com
2. Jo Smith; jsmith@ibm.com (typo in name)
3. Joseph Smith; jsmth@ibm.com (typo in email)
4. Joe Smith; cantuseworkaddr@hotmail.com (new IBM policy)
5. Joe Smith-Jones; joe_smithjones@hp.com (got married, changed name, changed
jobs)
6. Joe Smith-Jones; jsmith@somedivision.hp.com (valid but not canonical corporate
email address)
7. Josephina Smith; jsmith@somedivision.hp.com (sex change; divorce)
8. Josephina Smith; josephina_smith@hp.com (new corporate address)
9. Siddhartha Bodhisattva; josephina_smith@hp.com (change of philosophy)
10. Siddhartha Bodhisattva; thinkwaitfast@hotmail.com (traveling for awhile to find
enlightenment)

Contemporary community members all recognize these postings as coming from the
same person but it would be very challenging even to build a good semi-automated
means of merging postings from this person into one user record.

Once we've copied everything into this new normalized data model, notice that we
can't dig ourselves into the same hole again. If a user has contributed 1000 postings, we
don't have 1000 different records of that person's name and email address. If a user
changes jobs, we need only update one column in one row in one table.

The html_p column in the new data model is worth mentioning. In 1995, I didn't
understand the problems of user-submitted data. Some users will submit plain text, which
seems simple, but in fact you can't just spit this out as HTML. If user A typed < or >
characters, they might get swallowed by user B's Web browser. Does this matter?
Consider that "<g>" is interpreted in various online circles as an abbreviation for "grin"
but by Netscape Navigator as an unrecognized (and therefore ignored) HTML tag.
Compare the meaning of

"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest
Americans combined. After all, he invented the personal computer, the graphical user
interface, and the Internet."
with
"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest
Americans combined. After all, he invented the personal computer, the graphical user
interface, and the Internet. <g>"

It would have been easy enough for me to make sure that such characters never got
interpreted as markup. In fact, with AOLserver one can do it with a single call to the
built-in procedure ns_quotehtml. However, consider the case where a nerd posts some
HTML. Other users would then see

"For more examples of my brilliant thinking and modesty, check out <a
href="http://philip.greenspun.com/">my home page</a>."
I discovered that the only real solution is to ask the user whether the submission is an
HTML fragment or plain text, show the user an approval page where the content may be
previewed, and then remember what the user told us in an html_p column in the
database.

Is this data model perfect? Permanent? Absolutely. It will last for at least... Whoa!
Wait a minute. I didn't know that Dave Clark was replacing his original Internet Protocol,
which the world has been running since around 1980, with IPv6
(http://www.faqs.org/rfcs/rfc2460.html). In the near future, we'll have IP addresses that are
128 bits long. That's 16 bytes, each of which takes two hex characters to represent. So we
need 32 characters plus at least 7 more for the colons that separate the groups of hex digits. We might
also need a couple of characters in front to say "this is a hex representation". Thus our
brand new data model in fact has a crippling deficiency. How easy is it to fix? In Oracle:

alter table bboard modify (originating_ip varchar(50));


You won't always get off this easy. Oracle won't let you shrink a column from a
maximum of 50 characters to 16, even if no row has a value longer than 16 characters.
Oracle also makes it tough to add a column that is constrained not null.

Representing Web Site Core Content

Free-for-all Internet discussions can often be useful and occasionally are compelling, but
the anchor of a good Web site is usually a set of carefully authored extended documents.
Historically these have tended to be stored in the Unix file system and they don't change
too often. Hence I refer to them as static pages. Examples of static pages on the photo.net
server include this book chapter and the tutorial on light for photographers at
http://www.photo.net/making-photographs/light.

We have some big goals to consider. We want the data in the database to

 help community experts figure out which articles need revision and which new
articles would be most valued by the community at large.
 help contributors work together on a draft article or a new version of an old
article.
 collect and organize reader comments and discussion, both for presentation to
other readers and to assist authors in keeping content up-to-date.
 collect and organize reader-submitted suggestions of related content out on the
wider Internet (i.e., links).
 help point readers to new or new-to-them content that might interest them, based
on what they've read before or based on what kind of content they've said is
interesting.

The big goals lead to some more concrete objectives:


 We will need a table that holds the static pages themselves.
 Since there are potentially many comments per page, we need a separate table to
hold the user-submitted comments.
 Since there are potentially many related links per page, we need a separate table
to hold the user-submitted links.
 Since there are potentially many authors for one page, we need a separate table to
register the author-page many-to-many relation.
 Considering the "help point readers to stuff that will interest them" objective, it
seems that we need to store the category or categories under which a page falls.
Since there are potentially many categories for one page, we need a separate table
to hold the mapping between pages and categories.

create table static_pages (
        page_id            integer not null primary key,
        url_stub           varchar(400) not null unique,
        original_author    integer references users(user_id),
        page_title         varchar(4000),
        page_body          clob,
        obsolete_p         char(1) default 'f' check (obsolete_p in ('t','f')),
        members_only_p     char(1) default 'f' check (members_only_p in ('t','f')),
        price              number,
        copyright_info     varchar(4000),
        accept_comments_p  char(1) default 't' check (accept_comments_p in ('t','f')),
        accept_links_p     char(1) default 't' check (accept_links_p in ('t','f')),
        last_updated       date,
        -- used to prevent minor changes from looking like new content
        publish_date       date
);

create table static_page_authors (
        page_id   integer not null references static_pages,
        user_id   integer not null references users,
        notify_p  char(1) default 't' check (notify_p in ('t','f')),
        unique(page_id, user_id)
);

Note that we use a generated integer page_id key for this table. We could key the table
by the url_stub (filename), but that would make it very difficult to reorganize files in
the Unix file system (something that should actually happen very seldom on a Web
server; it breaks links from foreign sites).

How to generate these unique integer keys when you have to insert a new row into
static_pages? You could

 lock the table


 find the maximum page_id so far
 add one to create a new unique page_id
 insert the row
 commit the transaction (releases the table lock)

Much better is to use Oracle's built-in sequence generation facility:


create sequence page_id_sequence start with 1;
Then we can get new page IDs by using page_id_sequence.nextval in INSERT
statements (see the Transactions chapter for a fuller discussion of sequences).
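For example, inserting a new static page might look something like the following sketch (the url_stub and title here are invented):

insert into static_pages (page_id, url_stub, page_title, publish_date)
values (page_id_sequence.nextval, '/samples/new-article.html',
        'A hypothetical new article', sysdate);

Because the sequence hands out each number exactly once, two scripts inserting at the same moment can't collide on page_id and no table lock is needed.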

Reference
Here is a summary of the data modeling tools available to you in Oracle, each
hyperlinked to the Oracle documentation. This reference section covers the following:
 data types
 statements for creating, altering, and dropping tables
 constraints

Data Types
For each column that you define for a table, you must specify the data type of that
column. Here are your options:
Character Data
char(n) A fixed-length character string, e.g., char(200) will take up 200 bytes
regardless of how long the string actually is. This works well when the data
truly are of fixed size, e.g., when you are recording a user's sex as "m" or
"f". This works badly when the data are of variable length. Not only does it
waste space on the disk and in the memory cache, but it makes comparisons
fail. For example, suppose you insert "rating" into a comment_type column
of type char(30) and then your Tcl program queries the database. Oracle
sends this column value back to procedural language clients padded with
enough spaces to make up 30 total characters. Thus if you have a
comparison within Tcl of whether $comment_type == "rating", the
comparison will fail because $comment_type is actually "rating" followed
by 24 spaces.

The maximum length char in Oracle8 is 2000 bytes.
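To make the padding behavior concrete, here is a minimal sketch you could try in SQL*Plus (the table name is invented):

create table char_padding_demo (
        comment_type char(30)
);

insert into char_padding_demo values ('rating');

-- reports 30, not 6, because Oracle pads the value with trailing spaces
select length(comment_type) from char_padding_demo;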


varchar(n) A variable-length character string, up to 4000 bytes long in Oracle8. These
are stored in such a way as to minimize disk space usage, i.e., if you only
put one character into a column of type varchar(4000), Oracle only
consumes two bytes on disk. The reason that you don't just make all the
columns varchar(4000) is that the Oracle indexing system is limited to
indexing keys of about 700 bytes.
clob A variable-length character string, up to 4 gigabytes long in Oracle8. The
CLOB data type is useful for accepting user input from such applications as
discussion forums. Sadly, Oracle8 has tremendous limitations on how
CLOB data may be inserted, modified, and queried. Use varchar(4000) if
you can and prepare to suffer if you can't.

In a spectacular demonstration of what happens when companies don't
follow the lessons of The Mythical Man Month, the regular string functions
don't work on CLOBs. You need to call identically named functions in the
DBMS_LOB package. These functions take the same arguments but in
different orders. You'll never be able to write a working line of code without
first reading the DBMS_LOB section of the Oracle8 Server Application Developer's
Guide.
nchar, nvarchar, nclob   The n prefix stands for "national character set". These work like
char, varchar, and clob but for multi-byte characters (e.g., Unicode; see
http://www.unicode.org).
Numeric Data
number Oracle actually only has one internal data type that is used for storing
numbers. It can handle 38 digits of precision and exponents from -130 to
+126. If you want to get fancy, you can specify precision and scale limits.
For example, number(3,0) says "round everything to an integer [scale 0]
and accept numbers that range from -999 to +999". If you're American and
commercially minded, number(9,2) will probably work well for storing
prices in dollars and cents (unless you're selling stuff to Bill Gates, in which
case the billion dollar limit imposed by the precision of 9 might prove
constraining). If you are Bill Gates, you might not want to get distracted by
insignificant numbers: Tell Oracle to round everything to the nearest
million with number(38,-6).
integer In terms of storage consumed and behavior, this is not any different from
number(38) but I think it reads better and it is more in line with ANSI SQL
(which would be a standard if anyone actually implemented it).
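A quick sketch of how precision and scale play out in practice (the table and column names are invented):

create table number_demo (
        price          number(9,2),    -- cents precision, up to 9,999,999.99
        approx_wealth  number(38,-6)   -- rounded to the nearest million
);

-- Oracle rounds on insert: stored as 20 and 75123000000
insert into number_demo values (19.999, 75123456789);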
Dates and Date/Time Intervals (Version 9i and newer)
timestamp A point in time, recorded with sub-second precision. When creating a
column you specify the number of digits of precision beyond one second
from 0 (single second precision) to 9 (nanosecond precision). Oracle's
calendar can handle dates between January 1, 4712 BC and
December 31, 9999 AD. You can put in values with the to_timestamp
function and query them out using the to_char function. Oracle offers
several variants of this datatype for coping with data aggregated across
multiple timezones.
interval year to month   An amount of time, expressed in years and months.
interval day to second   An amount of time, expressed in days, hours, minutes, and
seconds. Can be precise down to the nanosecond if desired.
Dates and Date/Time Intervals (Versions 8i and earlier)
date Obsolete as of version 9i. A point in time, recorded with one-second
precision, between January 1, 4712 BC and December 31, 4712 AD. You
can put in values with the to_date function and query them out using the
to_char function. If you don't use these functions, you're limited to
specifying the date with the default system format mask, usually 'DD-
MON-YY'. This is a good recipe for a Year 2000 bug since January 23,
2000 would be '23-JAN-00'. On better-maintained systems, this is often the
ANSI default: 'YYYY-MM-DD', e.g., '2000-01-23' for January 23, 2000.
number Hey, isn't this a typo? What's number doing in the date section? It is here
because this is how Oracle versions prior to 9i represented date-time
intervals, though their docs never say this explicitly. If you add numbers to
dates, you get new dates. For example, tomorrow at exactly this time is
sysdate+1. To query for stuff submitted in the last hour, you limit to
submitted_date > sysdate - 1/24.
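For example, using the posting_time date column from the original bboard definition (assuming it survives into the newer model behind the "..."), postings submitted in the last hour can be pulled out with plain arithmetic:

-- sysdate - 1/24 is "one hour ago"
select msg_id, one_line
from bboard
where posting_time > sysdate - 1/24;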
Binary Data
blob BLOB stands for "Binary Large OBject". It doesn't really have to be all that
large, though Oracle will let you store up to 4 GB. The BLOB data type
was set up to permit the storage of images, sound recordings, and other
inherently binary data. In practice, it also gets used by fraudulent
application software vendors. They spend a few years kludging together
some nasty format of their own. Their MBA executive customers demand
that the whole thing be RDBMS-based. The software vendor learns enough
about Oracle to "stuff everything into a BLOB". Then all the marketing and
sales folks are happy because the application is now running from Oracle
instead of from the file system. Sadly, the programmers and users don't get
much because you can't use SQL very effectively to query or update what's
inside a BLOB.
bfile A binary file, stored by the operating system (typically Unix) and kept track
of by Oracle. These would be useful when you need to get to information
both from SQL (which is kept purposefully ignorant about what goes on in
the wider world) and from an application that can only read from standard
files (e.g., a typical Web server). The bfile data type is pretty new but to my
mind it is already obsolete: Oracle 8.1 (8i) lets external applications view
content in the database as though it were a file on a Windows NT server. So
why not keep everything as a BLOB and enable Oracle's Internet File
System?
Despite this plethora of data types, Oracle has some glaring holes that torture developers.
For example, there is no Boolean data type. A developer who needs an approved_p
column is forced to use char(1) check (this_column in ('t','f')) and then,
instead of the clean query where approved_p, is forced into where approved_p = 't'.
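A minimal sketch of the workaround in action (the table and column names are invented):

create table moderation_queue (
        posting_id  integer primary key,
        -- the conventional fake Boolean: 't' or 'f'
        approved_p  char(1) default 'f' check (approved_p in ('t','f'))
);

-- what you'd like to write: where approved_p
-- what you actually have to write:
select posting_id from moderation_queue where approved_p = 't';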

Oracle8 includes a limited ability to create your own data types. Covering these is beyond
the scope of this book. See Oracle8 Server Concepts, User-Defined Datatypes.

Tables
The basics:
CREATE TABLE your_table_name (
the_key_column key_data_type PRIMARY KEY,
a_regular_column a_data_type,
an_important_column a_data_type NOT NULL,
... up to 996 intervening columns in Oracle8 ...
the_last_column a_data_type
);
Even in a simple example such as the one above, there are a few items worth noting. First, I
like to define the key column(s) at the very top. Second, the primary key constraint has
some powerful effects. It forces the_key_column to be non-null. It causes the creation of
an index on the_key_column, which will slow down updates to your_table_name but
improve the speed of access when someone queries for a row with a particular value of
the_key_column. Oracle checks this index when inserting any new row and aborts the
transaction if there is already a row with the same value for the_key_column. Third, note
that there is no comma following the definition of the last column. If you are careless and
leave the comma in, Oracle will give you a very confusing error message.

If you didn't get it right the first time, you'll probably want to

alter table your_table_name add (new_column_name a_data_type any_constraints);
or
alter table your_table_name modify (existing_column_name new_data_type new_constraints);
In Oracle 8i you can drop a column:
alter table your_table_name drop column existing_column_name;
(see http://www.oradoc.com/keyword/drop_column).

If you're still in the prototype stage, you'll probably find it easier to simply

drop table your_table_name;


and recreate it. At any time, you can see what you've got defined in the database by
querying Oracle's Data Dictionary:
SQL> select table_name from user_tables order by table_name;

TABLE_NAME
------------------------------
ADVS
ADV_CATEGORIES
ADV_GROUPS
ADV_GROUP_MAP
ADV_LOG
ADV_USER_MAP
AD_AUTHORIZED_MAINTAINERS
AD_CATEGORIES
AD_DOMAINS
AD_INTEGRITY_CHECKS
BBOARD
...
STATIC_CATEGORIES
STATIC_PAGES
STATIC_PAGE_AUTHORS
USERS
...
after which you will typically type describe table_name_of_interest in SQL*Plus:
SQL> describe users;
Name Null? Type
------------------------------- -------- ----
USER_ID NOT NULL NUMBER(38)
FIRST_NAMES NOT NULL VARCHAR2(100)
LAST_NAME NOT NULL VARCHAR2(100)
PRIV_NAME NUMBER(38)
EMAIL NOT NULL VARCHAR2(100)
PRIV_EMAIL NUMBER(38)
EMAIL_BOUNCING_P CHAR(1)
PASSWORD NOT NULL VARCHAR2(30)
URL VARCHAR2(200)
ON_VACATION_UNTIL DATE
LAST_VISIT DATE
SECOND_TO_LAST_VISIT DATE
REGISTRATION_DATE DATE
REGISTRATION_IP VARCHAR2(50)
ADMINISTRATOR_P CHAR(1)
DELETED_P CHAR(1)
BANNED_P CHAR(1)
BANNING_USER NUMBER(38)
BANNING_NOTE VARCHAR2(4000)
Note that Oracle displays its internal data types rather than the ones you've given, e.g.,
number(38) rather than integer and varchar2 instead of the specified varchar.

Constraints
When you're defining a table, you can constrain single rows by adding some magic words
after the data type:
 not null; requires a value for this column
 unique; two rows can't have the same value in this column (side effect in Oracle:
creates an index)
 primary key; same as unique except that no row can have a null value for this
column and other tables can refer to this column
 check; limit the range of values for a column, e.g., rating integer
check(rating > 0 and rating <= 10)
 references; this column can only contain values present in another table's
primary key column, e.g., user_id not null references users in the bboard
table forces the user_id column to only point to valid users. An interesting twist
is that you don't have to give a data type for user_id; Oracle assigns this column
to whatever data type the foreign key has (in this case integer).

Constraints can apply to multiple columns:


create table static_page_authors (
        page_id   integer not null references static_pages,
        user_id   integer not null references users,
        notify_p  char(1) default 't' check (notify_p in ('t','f')),
        unique(page_id, user_id)
);
Oracle will let us keep rows that have the same page_id and rows that have the same
user_id but not rows that have the same value in both columns (which would not make
sense; a person can't be the author of a document more than once). Suppose that you run
a university distinguished lecture series. You want speakers who are professors at other
universities or at least PhDs. On the other hand, if someone controls enough money, be it
his own or his company's, he's in. Oracle stands ready:
create table distinguished_lecturers (
lecturer_id integer primary key,
name_and_title varchar(100),
personal_wealth number,
corporate_wealth number,
check (instr(upper(name_and_title),'PHD') <> 0
or instr(upper(name_and_title),'PROFESSOR') <> 0
or (personal_wealth + corporate_wealth) > 1000000000)
);

insert into distinguished_lecturers
values
(1,'Professor Ellen Egghead',-10000,200000);

1 row created.

insert into distinguished_lecturers
values
(2,'Bill Gates, innovator',75000000000,18000000000);

1 row created.

insert into distinguished_lecturers
values
(3,'Joe Average',20000,0);

ORA-02290: check constraint (PHOTONET.SYS_C001819) violated


As desired, Oracle prevented us from inserting some random average loser into the
distinguished_lecturers table, but the error message was confusing in that it refers to
a constraint given the name of "SYS_C001819" and owned by the PHOTONET user. We
can give our constraint a name at definition time:
create table distinguished_lecturers (
lecturer_id integer primary key,
name_and_title varchar(100),
personal_wealth number,
corporate_wealth number,
constraint ensure_truly_distinguished
check (instr(upper(name_and_title),'PHD') <> 0
or instr(upper(name_and_title),'PROFESSOR') <> 0
or (personal_wealth + corporate_wealth) > 1000000000)
);

insert into distinguished_lecturers
values
(3,'Joe Average',20000,0);

ORA-02290: check constraint (PHOTONET.ENSURE_TRULY_DISTINGUISHED) violated

Now the error message is easier for application programmers to understand.

Creating More Elaborate Constraints with Triggers


The default Oracle mechanisms for constraining data are not always adequate. For
example, the ArsDigita Community System auction module has a table called
au_categories. The category_keyword column is a unique shorthand way of referring
to a category in a URL. However, this column may be NULL because it is not the
primary key to the table. The shorthand method of referring to the category is optional.
create table au_categories (
category_id integer primary key,
-- shorthand for referring to this category,
-- e.g. "bridges", for use in URLs
category_keyword varchar(30),
-- human-readable name of this category,
-- e.g. "All types of bridges"
category_name varchar(128) not null
);
We can't add a UNIQUE constraint to the category_keyword column. That would allow
the table to only have one row where category_keyword was NULL. So we add a
trigger that can execute an arbitrary PL/SQL expression and raise an error to prevent an
INSERT if necessary:
create or replace trigger au_category_unique_tr
before insert on au_categories
for each row
declare
        existing_count integer;
begin
        select count(*) into existing_count from au_categories
        where category_keyword = :new.category_keyword;
        if existing_count > 0
        then
                raise_application_error(-20010, 'Category keywords must be unique if used');
        end if;
end;
This trigger queries the table to find out if there are any matching keywords already
inserted. If there are, it calls the built-in Oracle procedure raise_application_error to
abort the transaction.
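Assuming a "bridges" category were already in the table, a second insert reusing the keyword would be rejected with our custom error (the IDs and names here are invented):

insert into au_categories (category_id, category_keyword, category_name)
values (1, 'bridges', 'All types of bridges');

-- reusing the keyword trips the trigger and aborts the insert
insert into au_categories (category_id, category_keyword, category_name)
values (2, 'bridges', 'Another bridge category');

ORA-20010: Category keywords must be unique if used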

Cardinality defines the numeric relationships between occurrences of the entities on either end of the relationship line.

The Entity-Relationship Model


The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76]
as a way to unify the network and relational database views. Simply stated, the ER model
is a conceptual data model that views the real world as entities and relationships. A basic
component of the model is the Entity-Relationship diagram, which is used to visually
represent data objects. Since Chen wrote his paper the model has been extended, and
today it is commonly used for database design. For the database designer, the utility of the
ER model is:
 it maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables.
 it is simple and easy to understand with a minimum of training. Therefore, the model
can be used by the database designer to communicate the design to the end user.
 In addition, the model can be used as a design plan by the database developer to
implement a data model in specific database management software.

Basic Constructs of E-R Modeling


The ER model views the real world as a construct of entities and associations between
entities.

Entities
Entities are the principal data objects about which information is to be collected. Entities
are usually recognizable concepts, either concrete or abstract, such as persons, places,
things, or events which have relevance to the database. Some specific examples of
entities are EMPLOYEES, PROJECTS, and INVOICES. An entity is analogous to a table in
the relational model.

Entities are classified as independent or dependent (in some methodologies, the terms
used are strong and weak, respectively). An independent entity is one that does not rely
on another for identification. A dependent entity is one that relies on another for
identification.

An entity occurrence (also called an instance) is an individual occurrence of an entity. An
occurrence is analogous to a row in a relational table.

Special Entity Types


Associative entities (also known as intersection entities) are entities used to associate two
or more entities in order to reconcile a many-to-many relationship.

Subtype entities are used in generalization hierarchies to represent a subset of instances
of their parent entity, called the supertype, but which have attributes or relationships that
apply only to the subset.

Associative entities and generalization hierarchies are discussed in more detail below.

Relationships
A Relationship represents an association between two or more entities. An example of a
relationship would be:
employees are assigned to projects

projects have subtasks

departments manage one or more projects

Relationships are classified in terms of degree, connectivity, cardinality, and existence.


These concepts will be discussed below.

Attributes
Attributes describe the entity with which they are associated. A particular instance of an
attribute is a value. For example, "Jane R. Hathaway" is one value of the attribute Name.
The domain of an attribute is the collection of all possible values an attribute can have.
The domain of Name is a character string.

Attributes can be classified as identifiers or descriptors. Identifiers, more commonly
called keys, uniquely identify an instance of an entity. A descriptor describes a
non-unique characteristic of an entity instance.

Classifying Relationships
Relationships are classified by their degree, connectivity, cardinality, direction, type, and
existence. Not all modeling methodologies use all these classifications.

Degree of a Relationship
The degree of a relationship is the number of entities associated with the relationship.
The n-ary relationship is the general form for degree n. Special cases are the binary and
ternary relationships, where the degree is 2 and 3, respectively.

Binary relationships, the association between two entities, are the most common type in the
real world. A recursive binary relationship occurs when an entity is related to itself. An
example might be "some employees are married to other employees".

A ternary relationship involves three entities and is used when a binary relationship is
inadequate. Many modeling approaches recognize only binary relationships. Ternary or
n-ary relationships are decomposed into two or more binary relationships.

Connectivity and Cardinality


The connectivity of a relationship describes the mapping of associated entity instances in
the relationship. The values of connectivity are "one" or "many". The cardinality of a
relationship is the actual number of related occurrences for each of the two entities. The
basic types of connectivity for relations are: one-to-one, one-to-many, and many-to-many.

A one-to-one (1:1) relationship is when at most one instance of an entity A is associated
with one instance of entity B. For example, "employees in the company are each assigned
their own office. For each employee there exists a unique office and for each office there
exists a unique employee."

A one-to-many (1:N) relationship is when for one instance of entity A, there are zero,
one, or many instances of entity B, but for one instance of entity B, there is only one
instance of entity A. An example of a 1:N relationship is

a department has many employees

each employee is assigned to one department

A many-to-many (M:N) relationship, sometimes called non-specific, is when for one
instance of entity A, there are zero, one, or many instances of entity B and for one
instance of entity B there are zero, one, or many instances of entity A. An example is:

employees can be assigned to no more than two projects at the same time;

projects must have assigned at least three employees

A single employee can be assigned to many projects; conversely, a single project can
have assigned to it many employees. Here the cardinality for the relationship from
employees to projects is two and the cardinality from projects to employees is three.
Many-to-many relationships cannot be directly translated to relational tables but instead
must be transformed into two or more one-to-many relationships using associative
entities.
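In SQL this is the same trick used for static_page_authors earlier in the chapter. A minimal sketch, assuming employees and projects tables keyed by employee_id and project_id already exist:

create table employee_project_map (
        employee_id  integer not null references employees,
        project_id   integer not null references projects,
        -- an employee can appear on a given project at most once
        unique(employee_id, project_id)
);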

Direction
The direction of a relationship indicates the originating entity of a binary relationship.
The entity from which a relationship originates is the parent entity; the entity where the
relationship terminates is the child entity.

The direction of a relationship is determined by its connectivity. In a one-to-one
relationship the direction is from the independent entity to a dependent entity. If both
entities are independent, the direction is arbitrary. With one-to-many relationships, the
entity occurring once is the parent. The direction of many-to-many relationships is
arbitrary.

Type
An identifying relationship is one in which the child entity is also a dependent
entity. A non-identifying relationship is one in which both entities are independent.

Existence
Existence denotes whether the existence of an entity instance is dependent upon the
existence of another, related, entity instance. The existence of an entity in a relationship is
defined as either mandatory or optional. If an instance of an entity must always occur for
an entity to be included in a relationship, then it is mandatory. An example of mandatory
existence is the statement "every project must be managed by a single department". If the
instance of the entity is not required, it is optional. An example of optional existence is
the statement, "employees may be assigned to work on projects".

Generalization Hierarchies
A generalization hierarchy is a form of abstraction that specifies that two or more entities
that share common attributes can be generalized into a higher level entity type called a
supertype or generic entity. The lower-level entities become the subtypes, or categories,
of the supertype. Subtypes are dependent entities.

Generalization occurs when two or more entities represent categories of the same real-
world object. For example, Wages_Employees and Classified_Employees represent
categories of the same entity, Employees. In this example, Employees would be the
supertype; Wages_Employees and Classified_Employees would be the subtypes.

Subtypes can be either mutually exclusive (disjoint) or overlapping (inclusive). A
mutually exclusive category is when an entity instance can be in only one category. The
above example is a mutually exclusive category. An employee can be either wages or
classified but not both. An overlapping category is when an entity instance may be in two
or more subtypes. An example would be a person who works for a university and who is
also a student at that same university. The completeness constraint requires that all
instances of the subtype be represented in the supertype.

Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a
supertype of another. The level of nesting is limited only by the constraint of simplicity.
Subtype entities may be the parent entity in a relationship but not the child.
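One common way to carry a generalization hierarchy into a relational schema is a supertype table plus one table per subtype, each subtype keyed by the supertype's primary key. A sketch, with all table and column names invented:

create table employees (
        employee_id  integer primary key,
        name         varchar(100) not null
);

create table wages_employees (
        employee_id  integer primary key references employees,
        hourly_rate  number(9,2)
);

create table classified_employees (
        employee_id    integer primary key references employees,
        annual_salary  number(11,2)
);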

ER Notation
There is no standard for representing data objects in ER diagrams. Each modeling
methodology uses its own notation. The original notation used by Chen is widely used in
academic texts and journals but rarely seen in either CASE tools or publications by non-
academics. Today, there are a number of notations in use; among the more common are
Bachman, crow's foot, and IDEF1X.
All notational styles represent entities as rectangular boxes and relationships as lines
connecting boxes. Each style uses a special set of symbols to represent the cardinality of
a connection. The notation used in this document is from Martin. The symbols used for
the basic ER constructs are:

 entities are represented by labeled rectangles. The label is the name of the entity.
Entity names should be singular nouns.
 relationships are represented by a solid line connecting two entities. The name of the
relationship is written above the line. Relationship names should be verbs.
 attributes, when included, are listed inside the entity rectangle. Attributes which are
identifiers are underlined. Attribute names should be singular nouns.
 cardinality of many is represented by a line ending in a crow's foot. If the crow's foot
is omitted, the cardinality is one.
 existence is represented by placing a circle or a perpendicular bar on the line.
Mandatory existence is shown by the bar (which looks like a 1) next to the entity for which an
instance is required. Optional existence is shown by placing a circle next to the entity
that is optional.

Examples of these symbols are shown in Figure 1 below:

Figure 1: ER Notation

Summary
The Entity-Relationship Model is a conceptual data model that views the real world as
consisting of entities and relationships. The model visually represents these concepts by
the Entity-Relationship diagram. The basic constructs of the ER model are entities,
relationships, and attributes. Entities are concepts, real or abstract, about which
information is collected. Relationships are associations between the entities. Attributes
are properties which describe the entities. Next, we will look at the role of data modeling
in the overall database design process and a method for building the data model. To
proceed, see Data Modeling As Part of Database Design.

Data Modeling Definition:

The analysis of data objects and their relationships to other data objects. Data modeling is often the first
step in database design and object-oriented programming as the designers first create a conceptual
model of how data items relate to each other. Data modeling involves a progression from conceptual model
to logical model to physical schema.
