
Advanced Database Systems

Chapter-I
Introduction

The database is the heart of any system. If the design is wrong then the whole
application will be wrong, either in effectiveness or performance, or even both. No
amount of clever coding can compensate for a bad database design. Sometimes
when building an application we may encounter a problem, which can only be
solved effectively by changing the database rather than by changing the code.
The biggest problem we have encountered in all these years is where different teams
handle the database design and software development. The database designers build
something according to their rules, and they then expect the developers to write code
around this design. This approach is often fraught with disaster as the database
designers often have little or no development experience, so they have little or no
understanding of how the development language can use that design to achieve the
expected results.

What is a database?
This may seem a pretty fundamental question, but unless you know what a
database consists of you may find it difficult to build one that can be used effectively.
Here is a simple definition of a database:

A database is a collection of information that is organized so that it can
easily be accessed, managed, and updated.

A database engine may comply with a combination of any of the following:


• The database is a collection of tables, files or datasets.
• Each table is a collection of fields, columns or data items.
• One or more columns in each table may be selected as the primary key.
• There may be additional unique keys or non-unique indexes to assist in
data retrieval.
• Columns may be fixed length or variable length.
• Records may be fixed length or variable length.
• Table and column names may be restricted in length (8, 16 or 32
characters).
• Table and column names may be case-sensitive.

Over the years there have been several different ways of constructing databases,
amongst which have been the following:

• The Hierarchical Data Model
• The Network Data Model
• The Relational Data Model

1. The Hierarchical Data Model
The Hierarchical Data Model can be represented as follows:
Figure 1 - The Hierarchical Data Model

A hierarchical database consists of the following:


1. It contains nodes connected by branches.
2. The top node is called the root.
3. If multiple nodes appear at the top level, the nodes are called root segments.
4. The parent of node nx is a node directly above nx and connected to nx by a branch.
5. Each node (with the exception of the root) has exactly one parent.
6. The child of node nx is the node directly below nx and connected to nx by a branch.
7. One parent may have many children.
By introducing data redundancy, complex network structures can also be
represented as hierarchical databases. This redundancy is eliminated in physical
implementation by including a 'logical child'. The logical child contains no data but
uses a set of pointers to direct the database management system to the physical child
in which the data is actually stored. Associated with a logical child are a physical
parent and a logical parent. The logical parent provides an alternative (and possibly
more efficient) path to retrieve logical child information.

2. The Network Data Model


The Network Data Model can be represented as follows:

Figure 2 - The Network Data Model

Like the Hierarchical Data Model the Network Data Model also consists of nodes
and branches, but a child may have multiple parents within the network structure.
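
The difference between the two models comes down to how many parents a node may have. The following is a minimal Python sketch of that one rule, not a model of any real hierarchical or network DBMS; the `Node` class, the `link` function and the node names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity, since the links are cyclic
class Node:
    name: str
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def link(parent, child, model="network"):
    """Connect child to parent, enforcing the single-parent rule for hierarchies."""
    if model == "hierarchical" and child.parents:
        raise ValueError(f"{child.name} already has a parent")
    parent.children.append(child)
    child.parents.append(parent)


# Network model: one child may sit under two parents.
customer, product, order_line = Node("CUSTOMER"), Node("PRODUCT"), Node("ORDER_LINE")
link(customer, order_line, model="network")
link(product, order_line, model="network")

# Hierarchical model: a second parent is rejected.
tree_root, tree_child, other_root = Node("ROOT"), Node("CHILD"), Node("OTHER")
link(tree_root, tree_child, model="hierarchical")
try:
    link(other_root, tree_child, model="hierarchical")
    second_parent_allowed = True
except ValueError:
    second_parent_allowed = False
```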

Both hierarchical and network databases suffered from the following deficiencies
(when compared with relational databases):
• Access to the database was not via SQL query strings, but by a specific set of
APIs.
• It was not possible to provide a variable WHERE clause. The only selection
mechanism was to read entries from a child table for a specific entry on a related
parent table with any filtering being done within the application code.
• It was not possible to provide an ORDER BY clause. Data was presented in the
order in which it existed in the database. This mechanism could be tuned by
specifying sort criteria to be used when each record was inserted, but this had
several disadvantages:

• Only a single sort sequence could be defined for each path (link to a
parent), so all records retrieved on that path would be provided in that sequence.

• It could make inserts rather slow when attempting to insert into the
middle of a large collection, or where a table had multiple paths each with its own set
of sort criteria.

3. The Relational Data Model


The Relational Data Model has the relation at its heart, but then a whole series of rules
governing keys, relationships, joins, functional dependencies, transitive dependencies,
multi-valued dependencies, and modification anomalies.

The Relation

The Relation is the basic element in a relational data model.

Figure 3 - Relations in the Relational Data Model

A relation is subject to the following rules:


1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of
the same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in
that column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur
multiple times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent
this).
11. Cells must be single-valued (but can be variable length). Single valued
means the following:
• Cannot contain multiple values such as 'A1, B2, and C3'.
• Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means
one thing and 'XYZ' another.
A relation may be expressed using the notation R (A, B, C,...) where:
• R = the name of the relation.
• (A, B, C,...) = the attributes within the relation.
• A = the attribute(s) which form the primary key.

Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely
identifies a row. A candidate key must possess the following properties:
• Unique identification - For every row the value of the key must uniquely
identify that row.
• Non-redundancy - No attribute in the key can be discarded without
destroying the property of unique identification.
4. A primary key is the candidate key, which is selected as the principal
unique identifier. Every relation must contain a primary key. The primary key is
usually the key selected to identify a row when the database is physically
implemented. For example, a part number is selected instead of a part
description.
5. A superkey is any set of attributes that uniquely identifies a row. A
superkey differs from a candidate key in that it does not require the non-
redundancy property.
6. A foreign key is an attribute (or set of attributes) in one relation whose values
match the primary key of another relation. It is usually a non-key attribute in the
relation that holds it, but it may also form part of that relation's primary key.
• A many-to-many relationship can only be implemented by introducing an
intersection or link table, which then becomes the child in two one-to-
many relationships. The intersection table therefore has a foreign key for
each of its parents, and its primary key is a composite of both foreign
keys.
• A one-to-one relationship requires that the child table have no more than
one occurrence for each parent, which can only be enforced by letting the
foreign key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an
obvious meaning to the user or the data. For example, a semantic primary key
for a COUNTRY entity might contain the value 'USA' for the occurrence
describing the United States of America. The value 'USA' has meaning to the
user.
8. A technical or surrogate or artificial key is a key for which the possible
values have no obvious meaning to the user or the data. These are used instead
of semantic keys for any of the following reasons:
• When the value in a semantic key is likely to be changed by the user, or
can have duplicates. For example, on a PERSON table it is unwise to use
PERSON_NAME as the key as it is possible to have more than one person
with the same name, or the name may change such as through marriage.
• When none of the existing attributes can be used to guarantee uniqueness.
In this case adding an attribute whose value is generated by the system, e.g.
from a sequence of numbers, is the only way to provide a unique value.
Typical examples would be ORDER_ID and INVOICE_ID. The value
'12345' has no meaning to the user, as it conveys nothing about the entity
to which it relates.
9. A key functionally determines the other attributes in the row, thus it is
always a determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an
index, which does not allow duplicate entries.
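
The difference between a superkey and a candidate key can be checked mechanically against sample data. The sketch below is a minimal Python illustration, assuming a relation held as a list of dictionaries; the PART rows and the function names `is_superkey` and `is_candidate_key` are hypothetical. Sample data can only show that a set of attributes fails to be a key. Whether it is a key for every valid instance remains a design decision.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two rows share the same combination of values for attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, attrs):
    """A superkey from which no attribute can be dropped (non-redundancy)."""
    if not is_superkey(rows, attrs):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(attrs))
        for subset in combinations(attrs, size)
    )


# Hypothetical sample rows for a PART relation.
parts = [
    {"part_no": "P1", "description": "bolt", "colour": "grey"},
    {"part_no": "P2", "description": "bolt", "colour": "black"},
    {"part_no": "P3", "description": "nut", "colour": "grey"},
]

part_no_is_candidate = is_candidate_key(parts, ("part_no",))
wide_key_is_superkey = is_superkey(parts, ("part_no", "description"))
wide_key_is_candidate = is_candidate_key(parts, ("part_no", "description"))
description_is_superkey = is_superkey(parts, ("description",))
```

Here (part_no, description) identifies every row but is redundant, because part_no alone already does so. It is therefore a superkey but not a candidate key.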
Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation
of relational joins at runtime.

1. A relationship is between two tables in what is known as a one-to-many
or parent-child or master-detail relationship, where an occurrence on
the 'one' or 'parent' or 'master' table may have any number of associated
occurrences on the 'many' or 'child' or 'detail' table. To achieve this the child
table must contain fields which link back to the primary key on the parent
table. These fields on the child table are known as a foreign key, and the
parent table is referred to as the foreign table (from the viewpoint of the
child).

2. It is possible for a record on the parent table to exist without
corresponding records on the child table, but it should not be possible for an
entry on the child table to exist without a corresponding entry on the parent
table.

3. A child record without a corresponding parent record is known as an
orphan.

4. It is possible for a table to be related to itself. For this to be possible it
needs a foreign key which points back to the primary key. Note that these
two keys cannot be comprised of exactly the same fields, otherwise the record
could only ever point to itself.

5. A table may be the subject of any number of relationships, and it may
be the parent in some and the child in others.

6. Some database engines allow a parent table to be linked via a
candidate key, but if this were changed it could result in the link to the child
table being broken.

7. Some database engines allow relationships to be managed by rules
known as referential integrity or foreign key constraints. These will prevent
entries on child tables from being created if the foreign key does not exist on
the parent table, or will deal with entries on child tables when the entry on
the parent table is updated or deleted.
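
As a concrete illustration of such rules, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The table names, column names and values are assumptions made for the example. SQLite is only one possible engine, and it enforces foreign keys only after `PRAGMA foreign_keys = ON` has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL"
    "   REFERENCES customer(customer_id) ON DELETE CASCADE)"
)

conn.execute("INSERT INTO customer VALUES (456, 'Example Ltd')")
conn.execute("INSERT INTO orders VALUES (123, 456)")

# An order for a customer that does not exist would be an orphan, so it is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (124, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent row removes its child rows because of ON DELETE CASCADE.
conn.execute("DELETE FROM customer WHERE customer_id = 456")
remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

ON DELETE CASCADE is only one way of dealing with child entries. An engine may instead refuse the delete or set the foreign key to NULL, depending on how the constraint is declared.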

Relational Joins

The join operator is used to combine data from two or more relations (tables) in
order to satisfy a particular query. Two relations may be joined when they share at
least one common attribute. The join is implemented by considering each row in an
instance of each relation. A row in relation R1 is joined to a row in relation R2 when
the value of the common attribute(s) is equal in the two relations. The join of two
relations is often called a binary join.

The join of two relations creates a new relation. The notation 'R1 x R2' indicates the
join of relations R1 and R2. For example, consider the following:

Relation R1      Relation R2      Relation R1 x R2
A  B  C          B  D  E          A  B  C  D  E
1  5  3          4  7  4          1  5  3  7  8
2  4  5          6  2  3          2  4  5  7  4
8  3  5          5  7  8          8  3  5  2  2
9  3  3          7  2  3          9  3  3  2  2
1  6  5          3  2  2          1  6  5  2  3
5  4  3                           5  4  3  7  4
2  7  5                           2  7  5  2  3

Note that the instances of relation R1 and R2 contain the same data values for
attribute B. Data normalization is concerned with decomposing a relation (e.g. R
(A,B,C,D,E)) into smaller relations (e.g. R1 and R2). The data values for attribute B
in this context will be identical in R1 and R2. The instances of R1 and R2 are
projections of the instances of R (A,B,C,D,E) onto the attributes (A,B,C) and
(B,D,E) respectively. A projection will not eliminate data values - duplicate rows are
removed, but this will not remove a data value from any attribute.
The join of relations R1 and R2 is possible because B is a common attribute. The
result of the join is shown above. The row (2 4 5 7 4) was formed by joining the row
(2 4 5) from relation R1 to the row (4 7 4) from relation R2. The two rows were
joined since each contained the same value for the common attribute B. The row (2
4 5) was not joined to the row (6 2 3) since the values of the common attribute (4
and 6) are not the same. The relations joined in the preceding example shared
exactly one common attribute. However, relations may share multiple common
attributes. All of these common attributes must be used in creating a join. For
example, the instances of relations R1 and R2 in the following example are joined
using the common attributes B and C:

Before the join:

Relation R1      Relation R2
A  B  C          B  C  D
6  1  4          1  4  9
8  1  4          1  4  2
5  1  2          1  2  1
2  7  1          7  1  2
                 7  1  3

After the join:

Relation R1 x R2
A  B  C  D
6  1  4  9
6  1  4  2
8  1  4  9
8  1  4  2
5  1  2  1
2  7  1  2
2  7  1  3

The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row
(1 4 9) from relation R2. The join was created since the common set of attributes (B
and C) contained identical values (1 and 4). The row (6 1 4) from R1 was not joined
to the row (1 2 1) from R2 since the common attributes did not share identical
values - (1 4) in R1 and (1 2) in R2. The join operation provides a method for
reconstructing a relation that was decomposed into two relations during the
normalization process. The join of two rows, however, can create a new row that
was not a member of the original relation. Thus invalid information can be created
during the join process.
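
The join rule described above (combine two rows when every common attribute holds the same value) can be sketched in a few lines of Python. This is a minimal nested-loop illustration, assuming each relation is a list of dictionaries keyed by attribute name. The function name `natural_join` is hypothetical, and a real DBMS would use far more efficient join algorithms. The data is the two-attribute example (common attributes B and C).

```python
def natural_join(r1, r2):
    """Join two relations (lists of dicts) on every attribute name they share."""
    if not r1 or not r2:
        return []
    common = [a for a in r1[0] if a in r2[0]]
    joined = []
    for row1 in r1:
        for row2 in r2:
            if all(row1[a] == row2[a] for a in common):
                merged = {**row1, **row2}
                if merged not in joined:  # a relation holds no duplicate rows
                    joined.append(merged)
    return joined


r1 = [dict(zip("ABC", t)) for t in [(6, 1, 4), (8, 1, 4), (5, 1, 2), (2, 7, 1)]]
r2 = [dict(zip("BCD", t)) for t in [(1, 4, 9), (1, 4, 2), (1, 2, 1), (7, 1, 2), (7, 1, 3)]]

result = [tuple(row[a] for a in "ABCD") for row in natural_join(r1, r2)]
```

With this data `result` holds the seven rows of the joined relation, starting with (6, 1, 4, 9) and ending with (2, 7, 1, 3).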

Lossless Joins

A set of relations satisfies the lossless join property if the instances can be joined
without creating invalid data (i.e. new rows). The term lossless join may be
somewhat confusing. A join that is not lossless will contain extra, invalid rows. A
join that is lossless will not contain extra, invalid rows. To give an example of
incorrect information created by an invalid join let us take the following data
structure:

R (student, course, instructor, hour, room, grade)

Assuming that only one section of a class is offered during a semester we can define
the following functional dependencies:

1. (HOUR, ROOM) → COURSE

2. (COURSE, STUDENT) → GRADE

3. (INSTRUCTOR, HOUR) → ROOM

4. (COURSE) → INSTRUCTOR

5. (HOUR, STUDENT) → ROOM

Take the following sample data:

STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM  GRADE
Smith    Math 1   Jenkins     8:00  100   A
Jones    English  Goldman     8:00  200   B
Brown    English  Goldman     8:00  200   C
Green    Algebra  Jenkins     9:00  400   A

The following four relations, each in 4th normal form, can be generated from the
given and implied dependencies:

• R1 (STUDENT, HOUR, COURSE)

• R2 (STUDENT, COURSE, GRADE)

• R3 (COURSE, INSTRUCTOR)

• R4 (INSTRUCTOR, HOUR, ROOM)

Note that the dependencies (HOUR, ROOM) → COURSE and (HOUR, STUDENT) → ROOM
are not explicitly represented in the preceding
decomposition. The goal is to develop relations in 4th normal form that can be
joined to answer any ad hoc inquiries correctly. This goal can be achieved without
representing every functional dependency as a relation. Furthermore, several sets of
relations may satisfy the goal.

The preceding sets of relations can be populated as follows:

R1
STUDENT  HOUR  COURSE
Smith    8:00  Math 1
Jones    8:00  English
Brown    8:00  English
Green    9:00  Algebra

R2
STUDENT  COURSE   GRADE
Smith    Math 1   A
Jones    English  B
Brown    English  C
Green    Algebra  A

R3
COURSE   INSTRUCTOR
Math 1   Jenkins
English  Goldman
Algebra  Jenkins

R4
INSTRUCTOR  HOUR  ROOM
Jenkins     8:00  100
Goldman     8:00  200
Jenkins     9:00  400

Now suppose that a list of courses with their corresponding room numbers is
required. Relations R1 and R4 contain the necessary information and can be joined
using the attribute HOUR. The result of this join is:

R1 x R4
STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM
Smith    Math 1   Jenkins     8:00  100
Smith    Math 1   Goldman     8:00  200   *
Jones    English  Jenkins     8:00  100   *
Jones    English  Goldman     8:00  200
Brown    English  Jenkins     8:00  100   *
Brown    English  Goldman     8:00  200
Green    Algebra  Jenkins     9:00  400

This join creates the following invalid information (denoted by the rows marked *):

• Smith, Jones, and Brown take the same class at the same time from two
different instructors in two different rooms.

• Jenkins (the Maths teacher) teaches English.

• Goldman (the English teacher) teaches Maths.

• Both instructors teach different courses at the same time.

Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result
would be:

R3 x R4
COURSE   INSTRUCTOR  HOUR  ROOM
Math 1   Jenkins     8:00  100
Math 1   Jenkins     9:00  400
English  Goldman     8:00  200
Algebra  Jenkins     8:00  100
Algebra  Jenkins     9:00  400

This join creates the following invalid information:

• Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.

A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting
relation with R4 (using both INSTRUCTOR and HOUR). The result would be:

R1 x R3
STUDENT  COURSE   INSTRUCTOR  HOUR
Smith    Math 1   Jenkins     8:00
Jones    English  Goldman     8:00
Brown    English  Goldman     8:00
Green    Algebra  Jenkins     9:00

(R1 x R3) x R4
STUDENT  COURSE   INSTRUCTOR  HOUR  ROOM
Smith    Math 1   Jenkins     8:00  100
Jones    English  Goldman     8:00  200
Brown    English  Goldman     8:00  200
Green    Algebra  Jenkins     9:00  400

Extracting the COURSE and ROOM attributes (and eliminating the duplicate row
produced for the English course) would yield the desired result:

COURSE   ROOM
Math 1   100
English  200
Algebra  400

The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless
join property.
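
The two join sequences can be compared directly. The sketch below is a minimal Python illustration using the R1, R3 and R4 rows given earlier. The helper name `join` and the lower-case variable names are assumptions made for the example. It shows that joining R1 to R4 on HOUR alone yields course and room pairs that do not exist in the original data, while joining through R3 first does not.

```python
def join(r1, r2):
    """Natural join of two relations held as lists of dicts."""
    out = []
    for a in r1:
        for b in r2:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                row = {**a, **b}
                if row not in out:
                    out.append(row)
    return out


r1 = [
    {"STUDENT": "Smith", "HOUR": "8:00", "COURSE": "Math 1"},
    {"STUDENT": "Jones", "HOUR": "8:00", "COURSE": "English"},
    {"STUDENT": "Brown", "HOUR": "8:00", "COURSE": "English"},
    {"STUDENT": "Green", "HOUR": "9:00", "COURSE": "Algebra"},
]
r3 = [
    {"COURSE": "Math 1", "INSTRUCTOR": "Jenkins"},
    {"COURSE": "English", "INSTRUCTOR": "Goldman"},
    {"COURSE": "Algebra", "INSTRUCTOR": "Jenkins"},
]
r4 = [
    {"INSTRUCTOR": "Jenkins", "HOUR": "8:00", "ROOM": 100},
    {"INSTRUCTOR": "Goldman", "HOUR": "8:00", "ROOM": 200},
    {"INSTRUCTOR": "Jenkins", "HOUR": "9:00", "ROOM": 400},
]

lossy = join(r1, r4)               # common attribute: HOUR only
lossless = join(join(r1, r3), r4)  # COURSE first, then INSTRUCTOR and HOUR

lossy_pairs = {(row["COURSE"], row["ROOM"]) for row in lossy}
correct_pairs = {(row["COURSE"], row["ROOM"]) for row in lossless}
spurious_pairs = lossy_pairs - correct_pairs
```

The lossy join produces seven rows instead of four, and `spurious_pairs` contains the two invalid pairs ("Math 1", 200) and ("English", 100).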

A relational database is in 4th normal form when the lossless join property can be
used to answer unanticipated queries. However, the choice of joins must be
evaluated carefully. Many different sequences of joins will recreate an instance of a
relation. Some sequences are more desirable since they result in the creation of less
invalid data during the join operation.

Suppose that a relation is decomposed using functional dependencies and multi-valued
dependencies. Then at least one sequence of joins on the resulting relations
exists that recreates the original instance with no invalid data created during any of
the join operations. For example, suppose that a list of grades by room number is
desired. This question, which was probably not anticipated during database design,
can be answered without creating invalid data by either of the following two join
sequences:

R1 x R3
(R1 x R3) x R2
((R1 x R3) x R2) x R4

or

R1 x R3
(R1 x R3) x R4
((R1 x R3) x R4) x R2

Determinant and Dependent

The terms determinant and dependent can be described as follows:

1. The expression X → Y means 'if we know the value of X, then we can
obtain the value of Y' (in a table or somewhere).

2. In the expression X → Y, X is the determinant and Y is the dependent
attribute.

3. The value X determines the value of Y.

4. The value Y depends on the value of X.


Functional Dependencies (FD)

A functional dependency can be described as follows:

1. An attribute is functionally dependent if its value is determined by
another attribute.

2. That is, if we know the value of one (or several) data items, then we can
find the value of another (or several).

3. Functional dependencies are expressed as X → Y, where X is the
determinant and Y is the functionally dependent attribute.

4. If A → (B,C) then A → B and A → C.

5. If (A,B) → C, then it is not necessarily true that A → C and B → C.

6. If A → B and B → A, then A and B are in a 1-1 relationship.

7. If A → B then for A there can only ever be one value for B.
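
Rule 7 suggests a simple test: a functional dependency is violated as soon as one determinant value appears with two different dependent values. The sketch below is a minimal Python check along those lines. The function name `fd_holds` and the three sample rows are hypothetical. A check against sample data can disprove a dependency, but it cannot prove that the dependency holds for all future data.

```python
def fd_holds(rows, determinant, dependent):
    """Check one instance: equal determinant values must give equal dependent values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in determinant)
        y = tuple(row[a] for a in dependent)
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical rows chosen to illustrate rules 4 and 5.
rows = [
    {"A": 1, "B": 1, "C": 10},
    {"A": 1, "B": 2, "C": 20},
    {"A": 2, "B": 1, "C": 30},
]

ab_determines_c = fd_holds(rows, ("A", "B"), ("C",))  # (A,B) determines C
a_determines_c = fd_holds(rows, ("A",), ("C",))       # A alone does not
b_determines_c = fd_holds(rows, ("B",), ("C",))       # B alone does not
c_determines_ab = fd_holds(rows, ("C",), ("A", "B"))  # C determines (A,B)
c_determines_a = fd_holds(rows, ("C",), ("A",))       # so C determines A
```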

Transitive Dependencies (TD)

A transitive dependency can be described as follows:

1. An attribute is transitively dependent if its value is determined by
another attribute, which is not a key.

2. If X → Y and X is not a key then this is a transitive dependency.

3. A transitive dependency exists when A → B → C, so that C depends on A
only indirectly, through B, and B is not itself a key.

Multi-Valued Dependencies (MVD)

A multi-valued dependency can be described as follows:

1. A table involves a multi-valued dependency if it may contain multiple
values for an entity.

2. A multi-valued dependency may arise as a result of enforcing 1st
normal form.

3. X →→ Y, i.e. X multi-determines Y, when for each value of X we can have
more than one value of Y.

4. If A →→ B and A →→ C then we have a single attribute A, which
multi-determines two other independent attributes, B and C.

5. If A →→ (B,C) then we have an attribute A which multi-determines a set
of associated attributes, B and C.

Join Dependencies (JD)

A join dependency can be described as follows:

1. If a table can be decomposed into three or more smaller tables, it must
be capable of being joined again on common keys to form the original table.

Modification Anomalies

A major objective of data normalization is to avoid modification anomalies. These
come in two flavors:

1. An insertion anomaly is a failure to place information about a new
database entry into all the places in the database where information about that
new entry needs to be stored. In a properly normalized database, information
about a new entry needs to be inserted into only one place in the database. In
an inadequately normalized database, information about a new entry may need
to be inserted into more than one place, and, human fallibility being what it is,
some of the needed additional insertions may be missed.

2. A deletion anomaly is a failure to remove information about an
existing database entry when it is time to remove that entry. In a properly
normalized database, information about an old, to-be-gotten-rid-of entry needs
to be deleted from only one place in the database. In an inadequately
normalized database, information about that old entry may need to be deleted
from more than one place, and, human fallibility being what it is, some of the
needed additional deletions may be missed.

An update of a database involves modifications that may be additions, deletions, or
both. Thus 'update anomalies' can be either of the kinds of anomalies discussed
above. All these kinds of anomalies are highly undesirable, since their occurrence
constitutes corruption of the database. Properly normalized databases are much less
susceptible to corruption than are unnormalized databases.

Entity-Relationship Diagram (ERD)

An entity-relationship diagram (ERD) is a data modeling technique that creates a
graphical representation of the entities, and the relationships between entities, within
an information system. Any ER diagram has an equivalent relational table, and any
relational table has an equivalent ER diagram. ER diagramming is an invaluable aid
to engineers in the design, optimization, and debugging of database programs.
• The entity is a person, object, place or event for which data is collected. It is
equivalent to a database table. An entity can be defined by means of its
properties, called attributes. For example, the CUSTOMER entity may have
attributes for such things as name, address and telephone number.

• The relationship is the interaction between the entities. It can be described
using a verb such as:

o A customer places an order.

o A sales rep serves a customer.

o An order contains a product.

o A warehouse stores a product.

In an entity-relationship diagram entities are rendered as rectangles, and
relationships are portrayed as lines connecting the rectangles. One way of indicating
which is the 'one' or 'parent' and which is the 'many' or 'child' in the relationship is to
use an arrowhead, as in figure 4.

Figure 4 - One-to-Many relationship using arrowhead notation

This can produce an ERD as shown in figure 5:

Figure 5 - ERD with arrowhead notation

Another method is to replace the arrowhead with a crow’s foot, as shown in figure 6:
Figure 6 - One-to-Many relationship using crow’s foot notation

The relating line can be enhanced to indicate cardinality, which defines the
relationship between the entities in terms of numbers. An entity may be optional
(zero or more) or it may be mandatory (one or more).

• A single bar indicates one.

• A double bar indicates one and only one.

• A circle indicates zero.

• A crow’s foot or arrowhead indicates many.

As well as using lines and circles the cardinality can be expressed using numbers, as
in:

• One-to-One expressed as 1:1

• Zero-to-Many expressed as 0:M

• One-to-Many expressed as 1:M

• Many-to-Many expressed as N:M

This can produce an ERD as shown in figure 7:

Figure 7 - ERD with crow’s foot notation and cardinality

In plain language the relationships can be expressed as follows:

• 1 instance of a SALES REP serves 1 to many CUSTOMERS

• 1 instance of a CUSTOMER places 1 to many ORDERS

• 1 instance of an ORDER lists 1 to many PRODUCTS

• 1 instance of a WAREHOUSE stores 0 to many PRODUCTS

We have now completed the logical data model, but before we can construct the
physical database there are several steps that must take place:

• Assign attributes (properties or values) to all the entities. After all, a table
without any columns will be of little use to anyone.

• Refine the model using a process known as 'normalisation'. This ensures that
each attribute is in the right place. During this process it may be necessary to
create new tables and new relationships.

Data Normalization

Data normalization is a set of rules and techniques concerned with:

• Identifying relationships among attributes.

• Combining attributes to form relations.

• Combining relations to form a database.

It follows a set of rules worked out by E. F. Codd in 1970. A normalized relational
database provides several benefits:

• Elimination of redundant data storage.

• Close modeling of real world entities, processes, and their relationships.

• Structuring of data so that the model is flexible.

The guidelines for developing relations in 3rd Normal Form can be summarized as
follows:

1. Define the attributes.

2. Group logically related attributes into relations.

3. Identify candidate keys for each relation.

4. Select a primary key for each relation.

5. Identify and remove repeating groups.

6. Combine relations with identical keys (1st normal form).


7. Identify all functional dependencies.

8. Decompose relations such that each non-key attribute is dependent on
all the attributes in the key.

9. Combine relations with identical primary keys (2nd normal form).

10. Identify all transitive dependencies.

• Check relations for dependencies of one non-key attribute with another
non-key attribute.

• Check for dependencies within each primary key (i.e. dependencies of
one attribute in the key on other attributes within the key).

11. Decompose relations such that there are no transitive dependencies.

12. Combine relations with identical primary keys (3rd normal form) if
there are no transitive dependencies.

1st Normal Form

A table is in first normal form if all the key attributes have been defined
and it contains no repeating groups.

Taking the ORDER entity in figure 7 as an example we could end up with a set of
attributes like this:

ORDER
order_id  customer_id  product1  product2  product3
123       456          abc1      def1      ghi1
456       789          abc2

This structure creates the following problems:

• Order 123 has no room for more than 3 products.

• Order 456 has wasted space for product2 and product3.

In order to create a table that is in first normal form we must extract the repeating
groups and place them in a separate table, which we call ORDER_LINE.

ORDER
order_id  customer_id
123       456
456       789

We removed 'product1', 'product2' and 'product3', so there are no repeating groups.

ORDER_LINE
order_id  product
123       abc1
123       def1
123       ghi1
456       abc2

Each row contains one product for one order, so this allows an order to contain any
number of products.
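
The extraction of the repeating group can be sketched as a small transformation. The following Python example is an illustration only, assuming the unnormalized rows are held as dictionaries with empty product slots stored as None. The function name `to_first_normal_form` is hypothetical.

```python
wide_orders = [
    {"order_id": 123, "customer_id": 456,
     "product1": "abc1", "product2": "def1", "product3": "ghi1"},
    {"order_id": 456, "customer_id": 789,
     "product1": "abc2", "product2": None, "product3": None},
]


def to_first_normal_form(rows, group_columns):
    """Split each wide row into one ORDER row plus one ORDER_LINE row per product."""
    orders, order_lines = [], []
    for row in rows:
        orders.append({"order_id": row["order_id"], "customer_id": row["customer_id"]})
        for column in group_columns:
            if row[column] is not None:  # empty slots produce no row at all
                order_lines.append({"order_id": row["order_id"], "product": row[column]})
    return orders, order_lines


orders, order_lines = to_first_normal_form(
    wide_orders, ["product1", "product2", "product3"]
)
```

Order 123 ends up with three ORDER_LINE rows and order 456 with one, so neither the three-product limit nor the wasted space remains.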

This results in a new version of the ERD, as shown in figure 8:


Figure 8 - ERD with ORDER and ORDER_LINE

The new relationships can be expressed as follows:

• 1 instance of an ORDER has 1 to many ORDER LINES

• 1 instance of a PRODUCT has 0 to many ORDER LINES

2nd Normal Form

A table is in second normal form (2NF) if and only if it is in 1NF and
every non-key attribute is fully functionally dependent on the whole of the
primary key (i.e. there are no partial dependencies).

1. Anomalies can occur when attributes are dependent on only part of a
multi-attribute (composite) key.

2. A relation is in second normal form when all non-key attributes are
dependent on the whole key. That is, no attribute is dependent on only a part
of the key.

3. Any relation having a key with a single attribute is in second normal
form.

Take the following table structure as an example:

order(order_id, cust, cust_address, cust_contact, order_date, order_total)


Here we should realize that cust_address and cust_contact are functionally
dependent on cust but not on order_id, therefore they are not dependent on the
whole key. To make this table 2NF these attributes must be removed and placed
somewhere else.

3rd Normal Form

A table is in third normal form (3NF) if and only if it is in 2NF and every
non-key attribute is non-transitively dependent on the primary key (i.e.
there are no transitive dependencies).

1. Anomalies can occur when a relation contains one or more transitive
dependencies.

2. A relation is in 3NF when it is in 2NF and has no transitive
dependencies.

3. A relation is in 3NF when 'All non-key attributes are dependent on the
key, the whole key and nothing but the key'.

Take the following table structure as an example:

order(order_id, cust, cust_address, cust_contact, order_date, order_total)

Here we should realize that cust_address and cust_contact are functionally
dependent on cust, which is not a key. To make this table 3NF these attributes must
be removed and placed somewhere else.
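
One way to read 'placed somewhere else' is a separate customer relation keyed on cust. The Python sketch below illustrates that decomposition under stated assumptions: the three sample rows, the relation name `customer` and the helper `project` are all hypothetical, and only the column names come from the structure above.

```python
unnormalized = [
    {"order_id": 1, "cust": "C1", "cust_address": "1 High Street",
     "cust_contact": "Ann", "order_date": "2024-01-05", "order_total": 100},
    {"order_id": 2, "cust": "C1", "cust_address": "1 High Street",
     "cust_contact": "Ann", "order_date": "2024-02-10", "order_total": 250},
    {"order_id": 3, "cust": "C2", "cust_address": "9 Mill Lane",
     "cust_contact": "Bob", "order_date": "2024-02-11", "order_total": 75},
]


def project(rows, attrs):
    """Keep only the named attributes and drop duplicate rows."""
    out = []
    for row in rows:
        item = {a: row[a] for a in attrs}
        if item not in out:
            out.append(item)
    return out


customer = project(unnormalized, ("cust", "cust_address", "cust_contact"))
order = project(unnormalized, ("order_id", "cust", "order_date", "order_total"))

# A change of address is now a single update, however many orders exist.
for row in customer:
    if row["cust"] == "C1":
        row["cust_address"] = "2 New Road"

address_of = {row["cust"]: row["cust_address"] for row in customer}
addresses_seen_by_orders = [address_of[row["cust"]] for row in order]
```

In the unnormalized table the same change would have to be applied to every order row for that customer, which is exactly where a missed update creates an anomaly.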

Boyce-Codd Normal Form

A table is in Boyce-Codd normal form (BCNF) if and only if it is in 3NF
and every determinant is a candidate key.

1. Anomalies can occur in relations in 3NF if there is a composite key in
which part of that key has a determinant which is not itself a candidate key.

2. This can be expressed as R(A,B,C), C → A where:

o The relation contains attributes A, B and C.

o A and B form a candidate key.

o C is the determinant for A (A is functionally dependent on C).

o C is not part of any key.


3. Anomalies can also occur where a relation contains several candidate
keys where:

• The keys contain more than one attribute (they are composite keys).

• An attribute is common to more than one key.

Take the following table structure as an example:

schedule(campus, course, class, time, room/bldg)

Take the following sample data:

campus  course       class  time         room/bldg
East    English 101  1      8:00-9:00    212 AYE
East    English 101  2      10:00-11:00  305 RFK
West    English 101  3      8:00-9:00    102 PPR

Note that no two buildings on any of the university campuses have the same name,
thus ROOM/BLDG → CAMPUS. As the determinant is not a candidate key this
table is NOT in Boyce-Codd normal form. This table should be decomposed into the
following relations:

R1(course, class, room/bldg, time)

R2(room/bldg, campus)
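
The decomposition can be checked on the sample data. The Python sketch below is an illustration only. The fourth schedule row is a hypothetical addition, included so that one room appears twice and the saving becomes visible. It builds R1 and R2 as sets of tuples, confirms that each room maps to a single campus, and rejoins the two relations.

```python
schedule = [
    ("East", "English 101", 1, "8:00-9:00", "212 AYE"),
    ("East", "English 101", 2, "10:00-11:00", "305 RFK"),
    ("West", "English 101", 3, "8:00-9:00", "102 PPR"),
    ("East", "English 102", 1, "10:00-11:00", "212 AYE"),  # hypothetical extra row
]

# R1(course, class, room/bldg, time) and R2(room/bldg, campus)
r1 = {(course, cls, room, time) for campus, course, cls, time, room in schedule}
r2 = {(room, campus) for campus, course, cls, time, room in schedule}

# The dependency holds on this data when every room has exactly one campus.
room_determines_campus = len({room for room, _ in r2}) == len(r2)

# Joining R1 and R2 on room/bldg rebuilds the original rows.
rejoined = {
    (campus, course, cls, time, room)
    for course, cls, room, time in r1
    for r2_room, campus in r2
    if room == r2_room
}
```

After the split the campus of room 212 AYE is recorded once in R2, however many classes are scheduled there.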

As another example take the following structure:

enrol(student#, s_name, course#, c_name, date_enrolled)

This table has the following candidate keys:

• (student#, course#)

• (student#, c_name)

• (s_name, course#) - this assumes that s_name is a unique identifier

• (s_name, c_name) - this assumes that c_name is a unique identifier

The relation is in 3NF but not in BCNF because of the following dependencies:

• student# → s_name
• course# → c_name
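
One way to verify these claims is to compute attribute closures. The sketch below is illustrative only: the function names are hypothetical, and the dependency list encodes the uniqueness assumptions stated above plus the assumption that (student#, course#) determines date_enrolled.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


ALL = {"student#", "s_name", "course#", "c_name", "date_enrolled"}

fds = [
    ({"student#"}, {"s_name"}),
    ({"course#"}, {"c_name"}),
    ({"s_name"}, {"student#"}),  # assumes s_name is a unique identifier
    ({"c_name"}, {"course#"}),   # assumes c_name is a unique identifier
    ({"student#", "course#"}, {"date_enrolled"}),  # assumed from the key
]


def is_superkey(attrs):
    """A set of attributes is a superkey if its closure is the whole relation."""
    return closure(attrs, fds) == ALL


# Each pair listed as a candidate key determines every attribute.
print(is_superkey({"student#", "course#"}), is_superkey({"s_name", "c_name"}))
# student# is a determinant but not a superkey, so BCNF is violated.
print(is_superkey({"student#"}))
```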

4th Normal Form

A table is in fourth normal form (4NF) if and only if it is in BCNF and
contains no more than one multi-valued dependency.

1. Anomalies can occur in relations in BCNF if there is more than one
multi-valued dependency.

2. If A →→ B and A →→ C but B and C are unrelated, i.e. A →→ (B,C) is false,
then we have more than one multi-valued dependency.

3. A relation is in 4NF when it is in BCNF and has no more than one
multi-valued dependency.

Take the following table structure as an example:

info(employee#, skills, hobbies)

Take the following sample data:

employee#   skills        hobbies

1           Programming   Golf
1           Programming   Bowling
1           Analysis      Golf
1           Analysis      Bowling
2           Analysis      Golf
2           Analysis      Gardening
2           Management    Golf
2           Management    Gardening

This table is difficult to maintain since adding a new hobby requires multiple new
rows corresponding to each skill. This problem is created by the pair of multi-valued
dependencies EMPLOYEE# →→ SKILLS and EMPLOYEE# →→ HOBBIES. A much
better alternative would be to decompose INFO into two relations:

skills(employee#, skill)

hobbies(employee#, hobby)
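
As a rough illustration, the Python sketch below projects the sample data onto the two relations, checks that joining them on employee# gives back INFO, and counts the rows needed to record one new hobby. The new hobby "Chess" is an invented value.

```python
info = [
    (1, "Programming", "Golf"),
    (1, "Programming", "Bowling"),
    (1, "Analysis", "Golf"),
    (1, "Analysis", "Bowling"),
    (2, "Analysis", "Golf"),
    (2, "Analysis", "Gardening"),
    (2, "Management", "Golf"),
    (2, "Management", "Gardening"),
]

# Project INFO onto the two smaller relations.
skills = sorted({(e, s) for e, s, _ in info})
hobbies = sorted({(e, h) for e, _, h in info})

# Joining on employee# reproduces INFO, so nothing was lost.
rejoined = sorted((e, s, h) for e, s in skills for e2, h in hobbies if e == e2)
lossless = rejoined == sorted(info)

# Recording a new hobby (hypothetical: "Chess") for employee 1:
rows_needed_in_info = sum(1 for e, _ in skills if e == 1)  # one per skill
rows_needed_in_hobbies = 1                                 # a single row

print(lossless, rows_needed_in_info, rows_needed_in_hobbies)
```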

5th (Projection-Join) Normal Form

A table is in fifth normal form (5NF) or Projection-Join Normal Form
(PJNF) if it is in 4NF and it cannot have a lossless decomposition into any
number of smaller tables.

Another way of expressing this is:

... and each join dependency is a consequence of the candidate keys.

Yet another way of expressing this is:

... and there are no pairwise cyclical dependencies in the primary key
comprised of three or more attributes.

• Anomalies can occur in relations in 4NF if the primary key has three or more
fields.

• 5NF is based on the concept of join dependence: if a relation cannot be
decomposed any further then it is in 5NF.

• Pairwise cyclical dependency means that:

• You always need to know two values (pairwise).

• For any one you must know the other two (cyclical).

Take the following table structure as an example:

buying(buyer, vendor, item)

This is used to track buyers, what they buy, and from whom they buy.

Take the following sample data:

buyer   vendor          item

Sally   Liz Claiborne   Blouses
Mary    Liz Claiborne   Blouses
Sally   Jordach         Jeans
Mary    Jordach         Jeans
Sally   Jordach         Sneakers

The question is, what do you do if Claiborne starts to sell Jeans? How many records
must you create to record this fact?

The problem is there are pairwise cyclical dependencies in the primary key. That is,
in order to determine the item you must know the buyer and vendor, and to
determine the vendor you must know the buyer and the item, and finally to know the
buyer you must know the vendor and the item. The solution is to break this one table
into three tables: Buyer-Vendor, Buyer-Item, and Vendor-Item.
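
The sketch below applies that three-way split to the sample data and answers the question about Claiborne and Jeans. It assumes the business rule behind the join dependency holds: a buyer buys an item from a vendor whenever the buyer deals with that vendor, the buyer buys that item, and the vendor sells it. The function names project and join are hypothetical.

```python
buying = [
    ("Sally", "Liz Claiborne", "Blouses"),
    ("Mary", "Liz Claiborne", "Blouses"),
    ("Sally", "Jordach", "Jeans"),
    ("Mary", "Jordach", "Jeans"),
    ("Sally", "Jordach", "Sneakers"),
]


def project(rows):
    """Return the Buyer-Vendor, Buyer-Item and Vendor-Item projections."""
    bv = {(b, v) for b, v, _ in rows}
    bi = {(b, i) for b, _, i in rows}
    vi = {(v, i) for _, v, i in rows}
    return bv, bi, vi


def join(bv, bi, vi):
    """Three-way join: keep (b, v, i) only if all three pairs are present."""
    return {(b, v, i) for b, v in bv for v2, i in vi if v == v2 and (b, i) in bi}


bv, bi, vi = project(buying)
lossless = join(bv, bi, vi) == set(buying)

# Claiborne starts to sell Jeans: one new Vendor-Item row is enough.
vi_after = vi | {("Liz Claiborne", "Jeans")}
new_rows = join(bv, bi, vi_after) - set(buying)

print(lossless)
print(sorted(new_rows))
```

In the single buying table the same fact needs one new record for every buyer who deals with Claiborne and buys Jeans, which is two records for this sample.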

6th (Domain-Key) Normal Form

A table is in Domain-Key normal form (DKNF), presented here as the sixth
normal form, if it is in 5NF and if all constraints and dependencies that should
hold on the relation can be enforced simply by enforcing the domain
constraints and the key constraints specified on the relation.

Another way of expressing this is:

... if every constraint on the table is a logical consequence of the definition
of keys and domains.

1. A domain constraint (better called an attribute constraint) is simply a
constraint to the effect that a given attribute A of R takes its values from some
given domain D.

2. A key constraint is simply a constraint to the effect that a given set A,
B, ..., C of R constitutes a key for R.

If relation R is in DKNF, then it is sufficient to enforce the domain and key
constraints for R, and all constraints on R will be enforced automatically. Enforcing
those domain and key constraints is, of course, very simple. To be specific,
enforcing domain constraints just means checking that attribute values are always
values from the applicable domain (i.e., values of the right type); enforcing key
constraints just means checking that key values are unique.
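
As a small illustration of how simple those two checks are, the sketch below uses Python's sqlite3 module with a hypothetical table r(a, b). The table, its columns, and the five-character domain for b are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE r (
        a INTEGER PRIMARY KEY,                    -- key constraint
        b TEXT NOT NULL CHECK (length(b) <= 5)    -- domain constraint
    )
    """
)


def try_insert(row):
    """Insert one row and report whether the DBMS accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO r (a, b) VALUES (?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


results = [
    try_insert((1, "x")),             # valid row
    try_insert((1, "y")),             # duplicate key value
    try_insert((2, "toolongvalue")),  # value outside the domain
]
print(results)
```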

Unfortunately lots of relations are not in DKNF in the first place. For example,
suppose there's a constraint on R to the effect that R must contain at least ten tuples.
Then that constraint is certainly not a consequence of the domain and key
constraints that apply to R, and so R is not in DKNF. The sad fact is, not all relations
can be reduced to DKNF; nor do we know the answer to the question "Exactly when
can a relation be so reduced?"

Introduction to Database Design

Database design is a complex subject, no matter how easy some people think it is. This
session only scratches the surface, but it is a good scratch. A properly designed database
is a model of a business, or some "thing" in the real world. Like their physical model
counterparts, data models enable you to get answers about the facts that make up the
objects being modeled. It's the questions that need answers that determine which facts
need to be stored in the data model. In the relational model, data is organized in tables
that have the following characteristics: every record has the same number of facts; every
field contains the same type of facts in each record; there is only one entry for each fact;
no two records are exactly the same; the order of the records and fields is not important.

Why Design?

Accurate design is crucial to the operation of a reliable and efficient information system.
The design of a database has to do with the way data is stored and how that data is
related. The design process is performed after you determine exactly what information
needs to be stored and how it is to be retrieved. The more carefully you design, the better
the physical database meets users' needs. In the process of designing a complete system,
you must consider user needs from a variety of viewpoints.

Problems Resulting from Poor Design

A myriad of problems can manifest themselves as a result of poor database design:

• The database and/or application may not function properly.
• Data may be unreliable or inaccurate.
• Performance may be degraded.
• Flexibility may be lost.

The following section explains some common problems resulting from poor database
design. The problems can be grouped under two categories: redundant data and
modification anomalies.

Redundant Data
Modification Anomaly
Deletion Anomaly
Insertion Anomaly

A Method of Database Design

As you have seen, database design plays a major role in the stability and the reliability of
your data. In this section, we show you the process of designing a database. To help
illustrate the design process, a database named Rags is created for a fictitious wholesale
clothing manufacturer called Unlimited Rags.

Although there are a number of rules that can be followed in designing a database
structure, the design process is as much an art as it is a science. Follow these rules when
at all possible, but not to the point where the database loses the functionality that is so
important to the user. Doing a paper design first has several advantages:

• Saves time, money, and problems
• Makes system more reliable; avoids potential data-modification problems
• Serves as a blueprint for discussion
• Helps in estimating costs and size

A good design should have the following objectives:

• Meet the users' needs
• Solve the problem
• Be free of modification anomalies
• Have a reliable and stable database, where the tables are as independent as
possible
• Be easy to use

Design of the Database Model

The design of the database structure requires the following steps:

1. List the objects.
2. List the facts about the objects.
3. Turn the objects and facts into tables and columns.
4. Determine the relationship among objects.
5. Determine the key columns.
6. Determine the linking columns.
7. Determine the constraints.
8. Evaluate the design model.
9. Implement the database.

Step 1: List the Objects

Make a list of all objects. An object is a single theme, similar to a paragraph. At
Unlimited Rags the objects are:

Customer    Ship Rate
Product     Invoice
Employee    Dependent

Step 2: List the Facts About the Objects

There is a great deal of information associated with every object. In this step, you should
list the facts about an object and then eliminate the facts that are not important to the
solution of the problem. The customer object, for example, can have many facts
associated with it: company name, address, city, founders, number of employees, stock
price. In this case, it is not important to keep information about the number of employees,
stock price, or founders. Unlimited Rags needs only the information it will use now and
possibly in the future.

Object       Important Facts About the Object

Employee     employee, name, birth date, gender, SSN, marital status
Customer     company name, address, city, state, zip, contact, title
Invoice      date, salesperson, customer, quantity, shipping charge, tax, freight
Product      product name, description, cost, markup
Dependent    name, date of birth
Ship Rates   state, rates

Step 3: Turn the Objects and Facts into Tables and Columns

Objects automatically become tables, and facts become columns once the column
domains are determined. Recall that a domain is a set of values that a column can have.
Every column has a domain, which has both physical and logical properties. For example,
the column for employee last name is defined as TEXT 15. TEXT 15 is the physical
property of the column. Because of this definition, its domain is the set of all employee
last names with 15 characters or less. The following is a list of the preliminary tables,
columns, and domains for Unlimited Rags:

Table: CUSTOMER

Name       Type   Length
COMPANY    TEXT   45
CADD1      TEXT   30
CADD2      TEXT   30
CCITY      TEXT   25
CSTATE     TEXT   2
CZIP       TEXT   10
CAC        TEXT   3
CTELPH     TEXT   7
CONTACT    TEXT   30
TITLE      TEXT   30

Table: PRODUCT

Name       Type   Length
PRODNAME   TEXT   30
PRODDESC   TEXT   50
PRODCOST   CURR
PMARKUP    NUMB

Table: DEPENDENT

Name       Type   Length
DLAST      TEXT   15
DFIRST     TEXT   10
DDOB       D/T

Step 4: Determine the Relationship Among Objects

To determine the relationship among the objects, take each object and look at how that
object may be related to another. The relationships that are important are those that allow
you to model the database after the real-world situation that the database represents.

One-to-one relationships. For any given row in Table A, there is only one row in Table
B. For any given row in Table B, there is only one row in Table A.

One-to-many relationships. For any given row in Table A, there are many rows in Table
B. For any given row in Table B, there is only one row in Table A.

Many-to-many relationships. For any given row in Table A, there are many rows in
Table B. For any given row in Table B, there are many rows in Table A.

An effective method to find the type of relationship is to ask whether a specific record in
Table A can point to (is linked to) one or to many rows in Table B, and then reverse the
tables and ask the question again.
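
The two-way question can be sketched in Python as follows. The function name relationship_type and the sample link pairs are hypothetical, and sample data can only suggest the relationship type; the real answer comes from the business rules. The value "many-to-one" is simply one-to-many with the tables reversed.

```python
from collections import defaultdict


def relationship_type(links):
    """Classify a relationship from observed (a_key, b_key) link pairs."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in links:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    # Can a row in A point to many rows in B? Then reverse the question.
    a_has_many = any(len(bs) > 1 for bs in a_to_b.values())
    b_has_many = any(len(as_) > 1 for as_ in b_to_a.values())
    if a_has_many and b_has_many:
        return "many-to-many"
    if a_has_many:
        return "one-to-many"
    if b_has_many:
        return "many-to-one"
    return "one-to-one"


# Hypothetical links: employees to dependents, invoices to products.
employee_dependent = [("E1", "D1"), ("E1", "D2"), ("E2", "D3")]
invoice_product = [("I1", "P1"), ("I1", "P2"), ("I2", "P1")]

print(relationship_type(employee_dependent))
print(relationship_type(invoice_product))
```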

Step 5: Determine the Key Columns

A key can be an account number, social security number, or any other numeric value or
combination of characters that is unique. A complex (composite) key is one that is
derived from more than one column. No other row in the table can have the same value in
the key column(s). Other tables may share the same set of key information. If a company
name is universally unique, it can be used as a unique row identifier. However, if there
is any possibility that another company could have the same name, then it is not unique
and must not be employed as a key column. Do not use any column as a key where the
possibility exists for a duplicate. A key column cannot contain null values. By
definition, all key columns should be indexed. Because text names are usually not unique
and cannot be used in math operations, it is useful to make key columns a sequential
numeric value. In many cases, it is easier to develop your own unique row identifier.
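
A quick way to test a candidate key against existing data is sketched below. The function name is_valid_key and the sample rows are hypothetical. Passing the check on current data does not prove that duplicates can never occur, which is why the advice above favours a generated identifier.

```python
def is_valid_key(rows, key_columns):
    """Return True if the key columns are never null and never repeat."""
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if any(value is None for value in key) or key in seen:
            return False
        seen.add(key)
    return True


# Hypothetical customer rows: two companies share a name.
customers = [
    {"custid": 1, "company": "Acme", "cstate": "NY"},
    {"custid": 2, "company": "Acme", "cstate": "TX"},
    {"custid": 3, "company": "Bolt", "cstate": "NY"},
]

print(is_valid_key(customers, ["company"]))            # company name repeats
print(is_valid_key(customers, ["custid"]))             # sequential identifier
print(is_valid_key(customers, ["company", "cstate"]))  # key over two columns
```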

Step 6: Determine the Linking Columns

If you have been careful about designating key columns, you also have determined the
linking columns. Links provide a way to tie information (rows) in one table to another
table. If a table has a key column, that column can generally serve as the link. Tables are
linked together through their key columns. However, the placement of the key is
important, and where the link is placed depends on the type of relationship between the
tables.

To determine the placement of the links, you must first know the type of relationship
among the objects or tables. Once you know the type of relationship among tables, it is
much easier to determine where to place the linking column to tie two tables together.

Linking in a one-to-one relationship. In one-to-one relationships the link should be the
most stable column or should be from the table where the key column is created. The
most stable is the column least likely to change. If an automatic numbering system is
being used, then use that column as the linking column.

Linking in a one-to-many relationship. In one-to-many relationships the linking
column should come from the one table. The key column from the employee table (one
side) should be placed in the dependent table (many side). When the key empid is placed
in the dependent table, it is referred to as a foreign key in the dependent table.

Linking in a many-to-many relationship. The many-to-many relationship causes
problems when attempting to retrieve data and when relating a value in one table to its
corresponding value in the other table. It is important to understand this relationship to be
able to recognize and control this situation when it arises.
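
A common way to control a many-to-many relationship is an intersecting table that holds one row per pair of linked rows. The sketch below uses Python's sqlite3 module; the invoice, product, and invoice_item tables with their columns and sample values are assumptions made for illustration, not the final Unlimited Rags design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoice (invid INTEGER PRIMARY KEY, invdate TEXT);
    CREATE TABLE product (prodid INTEGER PRIMARY KEY, pname TEXT);

    -- Intersecting table: turns one many-to-many relationship into
    -- two one-to-many relationships.
    CREATE TABLE invoice_item (
        invid    INTEGER NOT NULL REFERENCES invoice(invid),
        prodid   INTEGER NOT NULL REFERENCES product(prodid),
        quantity INTEGER NOT NULL,
        PRIMARY KEY (invid, prodid)
    );

    INSERT INTO invoice VALUES (1, '2024-01-05'), (2, '2024-01-06');
    INSERT INTO product VALUES (10, 'Shirt'), (20, 'Jeans');
    INSERT INTO invoice_item VALUES (1, 10, 5), (1, 20, 2), (2, 10, 1);
    """
)

# Each value in one table is related to its counterparts through the links.
rows = conn.execute(
    """
    SELECT i.invid, p.pname, ii.quantity
    FROM invoice AS i
    JOIN invoice_item AS ii ON ii.invid = i.invid
    JOIN product AS p ON p.prodid = ii.prodid
    ORDER BY i.invid, p.prodid
    """
).fetchall()
print(rows)
```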

Step 7: Determine the Relationship Constraints

Often the information we get from a database comes from more than one table. For
example, if we want to know who the parent of a particular dependent is, the name is
determined by using the value in empid to look up the correct row in the employee table.
The question of who the parent is can be answered only if there is a row in the employee
table with an empid value corresponding to that in the dependent table. To ensure the
integrity of the data in our database, our model should require, for example, that no row
could be added to the dependent table, unless there is already a corresponding row in the
employee table. This requirement is known as a relationship constraint. In this case, a
constraint must exist on the dependent table that ensures that the employee (parent)
exists. There are at least four methods to implement relationship constraints:

• Built-in controls in the DBMS
• Data entry and access procedures
• Programming
• Implementation of rules
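
The first method, built-in controls in the DBMS, might look like the following sketch, which uses SQLite through Python's sqlite3 module. The column names other than empid and the sample values are assumptions. SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.execute("CREATE TABLE employee (empid INTEGER PRIMARY KEY, elast TEXT)")
conn.execute(
    """
    CREATE TABLE dependent (
        depid INTEGER PRIMARY KEY,
        empid INTEGER NOT NULL REFERENCES employee(empid),  -- relationship constraint
        dlast TEXT
    )
    """
)

conn.execute("INSERT INTO employee VALUES (1, 'Smith')")
conn.execute("INSERT INTO dependent VALUES (1, 1, 'Smith')")  # parent exists

try:
    # No employee row has empid 99, so the DBMS should refuse this row.
    conn.execute("INSERT INTO dependent VALUES (2, 99, 'Jones')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```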

Step 8: Evaluate the Design

The next step in the design process is the evaluation of the design. In this step, you
should look for any design flaws that could cause the data to be unreliable, unstable, or
redundant.

Every table should be evaluated by asking the following questions:

1. Does each table have a single theme? It should. Each column should be a fact
about the key.
2. Does each table have a key column(s)? It should.
3. Are there any dependencies? Only logical consequences of the key should exist.
4. Are the domains unique among tables? Do not mix domains unless the column is
common between tables.
5. Are the restrictions domain or key?
6. Is the table easy to use?

Evaluation of the Customer Table

CUSTID  COMPANY  CADD1  CADD2  CCITY  CSTATE  CZIP  CAC  CTELPH  CONTACT  TITLE

The table has a single theme: customers.

The table has a key: custid.

The table does not have any dependencies that are not logical consequences of the key.
Given custid, a company and company address can be uniquely determined. Given a
company, we cannot determine any particular custid. Given a state, we cannot determine
any particular custid. Therefore, the customer table does not have any dependencies
other than those on the key. The column names are not used in any other tables except
for custid, which is a foreign key in other tables. The restrictions are domain or key.

Step 9: Implement the Design

Once the database has been designed on paper, the next step is to implement the design
in Microsoft Access or Oracle or any other development tool. The following is a list of
the final tables, columns, and domains for Unlimited Rags, including linking columns:

Table: CUSTOMER

Name       Type      Length
CUSTID     COUNTER
COMPANY    TEXT      45
CADD1      TEXT      30
CADD2      TEXT      30
CCITY      TEXT      25
CSTATE     TEXT      2
CZIP       TEXT      10
CAC        TEXT      3
CTELPH     TEXT      7
CONTACT    TEXT      30
TITLE      TEXT      30

Table: PRODUCT

Name       Type      Length
PRODID     COUNTER
PNAME      TEXT      30
PDESCRIP   TEXT      50
PCOST      CURR
PMARKUP    NUMB

Table: SHIP RATE

Name       Type      Length
SHIPST     TEXT      2
SHIPRATE   NUMB
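
As one possible implementation, the sketch below creates these three tables in SQLite through Python's sqlite3 module. The type mapping is an assumption: COUNTER becomes an auto-incrementing integer key, TEXT lengths become CHECK constraints, and CURR and NUMB become NUMERIC. Treating SHIPST as the key of the ship rate table is also an assumption, as is the sample customer row.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customer (
    custid  INTEGER PRIMARY KEY AUTOINCREMENT,     -- COUNTER
    company TEXT CHECK (length(company) <= 45),
    cadd1   TEXT CHECK (length(cadd1) <= 30),
    cadd2   TEXT CHECK (length(cadd2) <= 30),
    ccity   TEXT CHECK (length(ccity) <= 25),
    cstate  TEXT CHECK (length(cstate) <= 2),
    czip    TEXT CHECK (length(czip) <= 10),
    cac     TEXT CHECK (length(cac) <= 3),
    ctelph  TEXT CHECK (length(ctelph) <= 7),
    contact TEXT CHECK (length(contact) <= 30),
    title   TEXT CHECK (length(title) <= 30)
);
CREATE TABLE product (
    prodid   INTEGER PRIMARY KEY AUTOINCREMENT,    -- COUNTER
    pname    TEXT CHECK (length(pname) <= 30),
    pdescrip TEXT CHECK (length(pdescrip) <= 50),
    pcost    NUMERIC,                              -- CURR
    pmarkup  NUMERIC                               -- NUMB
);
CREATE TABLE ship_rate (
    shipst   TEXT PRIMARY KEY CHECK (length(shipst) <= 2),  -- assumed key
    shiprate NUMERIC                               -- NUMB
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# The DBMS generates custid, so the insert supplies only the facts.
cur = conn.execute(
    "INSERT INTO customer (company, ccity, cstate) VALUES (?, ?, ?)",
    ("Example Co", "Springfield", "IL"),
)
new_custid = cur.lastrowid

tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
]
print(new_custid, tables)
```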

Summary

By following the nine-step design process, the problems of data redundancy, changing
multiple occurrences of data, and deletion and insertion anomalies can be avoided. It is
well worth the time spent in the design process to ensure a reliable and flexible system.

Design to the point where redundancy is eliminated or controlled. As you design your
database, keep in mind the following list of common database errors to avoid:

• Trash table: putting everything in the same table
• No unique row identifier (key column or columns)
• No linking or common columns
• Mixing logical and physical descriptions of domains
• Putting the linking column in the wrong table
• Restrictions not enforced
• Many-to-many relationships without intersecting tables
