Question 1

Reasons for Normalization

There are two main reasons to normalize a database. The first is to eliminate redundant data, that
is, the same data stored in more than one place. The second is to ensure that data dependencies
make sense by storing only related data in the same table. Both goals are important because they
reduce the amount of space a database consumes and ensure that data is logically stored.

Example

SalesStaff

EmployeeId  SalesPerson  SalesOffice  OfficeNumber  Customer1  Customer2  Customer3
1003        Mary Smith   Chicago      312-555-1212  Ford       GM
1004        John Hunt    New York     312-555-1212  Dell       hp         Apple
1005        Martin Hap   Chicago      312-555-1212  Boeing

The first thing to notice is that this table serves many purposes, including:

1. Identifying the organization’s salespeople
2. Listing the sales offices and phone numbers
3. Associating a salesperson with a sales office
4. Showing each salesperson’s customers

As a DBA, this raises a red flag for me. In general, I like to see tables that have one purpose.
Having the table serve many purposes introduces several challenges, namely data duplication, data
update issues, and increased effort to query data.

Data Duplication and Modification Anomalies

Notice that for each SalesPerson we have listed both the SalesOffice and OfficeNumber. This
information is repeated for every salesperson who works in the same office. Duplicated information
presents two problems:

1. It increases storage and decreases performance.
2. It makes data changes more difficult to maintain.

For example:

• Consider what happens if we move the Chicago office to Evanston, IL. To properly reflect this in
our table, we need to update the entries for all the salespeople currently in Chicago. Our table
is a small example, but in a larger table this could involve hundreds of updates.

• Also consider what would happen if John Hunt quits. If we remove his entry, then we lose the
information for the New York office.

These situations are modification anomalies. There are three modification anomalies that can
occur:

Insert Anomaly

There are facts we cannot record until we know the information for the entire row. In our example,
we cannot record a new sales office until we also know a salesperson who works there. Why? Because
in order to create the record, we need to provide a primary key. In our case, this is the EmployeeId.

SalesStaff

EmployeeId  SalesPerson  SalesOffice  OfficeNumber  Customer1  Customer2  Customer3
1003        Mary Smith   Chicago      312-555-1212  Ford       GM
1004        John Hunt    New York     312-555-1212  Dell       hp         Apple
1005        Martin Hap   Chicago      312-555-1212  Boeing
???         ???          Atlanta      312-555-1212
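
The insert anomaly can be reproduced with a minimal sketch. It assumes the table is declared with EmployeeId as a mandatory primary key; the DDL below is an assumption (the text does not give one), the customer columns are omitted for brevity, and Python's built-in sqlite3 module is used only so the SQL can be run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed DDL for SalesStaff; the customer columns are left out for brevity.
# "INT NOT NULL" (not "INTEGER") keeps SQLite from auto-generating a key value.
conn.execute("""
    CREATE TABLE SalesStaff (
        EmployeeId   INT  NOT NULL PRIMARY KEY,
        SalesPerson  TEXT NOT NULL,
        SalesOffice  TEXT,
        OfficeNumber TEXT
    )
""")

def try_insert_office_only(office, number):
    """Try to record an office that has no salesperson yet; True on success."""
    try:
        conn.execute(
            "INSERT INTO SalesStaff (SalesOffice, OfficeNumber) VALUES (?, ?)",
            (office, number),
        )
        return True
    except sqlite3.IntegrityError:
        # Rejected: the row has no EmployeeId, so it has no primary key.
        return False

inserted = try_insert_office_only("Atlanta", "312-555-1212")
row_count = conn.execute("SELECT COUNT(*) FROM SalesStaff").fetchone()[0]
```

Under these assumptions the Atlanta row is rejected, so the office cannot be stored until a salesperson exists for it.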

Update Anomaly

The same information is recorded in multiple rows. For instance, if the office number changes,
then multiple rows need to be updated. If these updates are not successfully completed across all
rows, then an inconsistency occurs.

SalesStaff

EmployeeId  SalesPerson  SalesOffice  OfficeNumber  Customer1  Customer2  Customer3
1003        Mary Smith   Chicago      312-555-1212  Ford       GM
1004        John Hunt    New York     312-555-1212  Dell       hp         Apple
1005        Martin Hap   Chicago      312-555-1212  Boeing
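
As a rough illustration of the update anomaly, the sketch below changes the Chicago office number in only one row and then in all Chicago rows. The DDL and the new phone number are assumptions made for the example, and the customer columns are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed DDL for SalesStaff; the customer columns are left out for brevity.
conn.execute("""
    CREATE TABLE SalesStaff (
        EmployeeId   INT  NOT NULL PRIMARY KEY,
        SalesPerson  TEXT NOT NULL,
        SalesOffice  TEXT,
        OfficeNumber TEXT
    )
""")
conn.executemany(
    "INSERT INTO SalesStaff VALUES (?, ?, ?, ?)",
    [
        (1003, "Mary Smith", "Chicago", "312-555-1212"),
        (1004, "John Hunt", "New York", "312-555-1212"),
        (1005, "Martin Hap", "Chicago", "312-555-1212"),
    ],
)

NEW_NUMBER = "312-555-0100"  # hypothetical new Chicago number

def chicago_numbers():
    """Return the distinct phone numbers currently stored for Chicago."""
    rows = conn.execute(
        "SELECT DISTINCT OfficeNumber FROM SalesStaff "
        "WHERE SalesOffice = 'Chicago' ORDER BY OfficeNumber"
    )
    return [r[0] for r in rows]

# An incomplete update reaches only one of the Chicago rows.
conn.execute(
    "UPDATE SalesStaff SET OfficeNumber = ? WHERE EmployeeId = 1003",
    (NEW_NUMBER,),
)
inconsistent = chicago_numbers()  # one office, two different numbers

# The complete update has to reach every Chicago row.
cur = conn.execute(
    "UPDATE SalesStaff SET OfficeNumber = ? WHERE SalesOffice = 'Chicago'",
    (NEW_NUMBER,),
)
rows_touched = cur.rowcount
consistent = chicago_numbers()
```

After the partial update the table holds two numbers for the same office; only the second statement, which touches both Chicago rows, restores consistency.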

Deletion Anomaly

Deletion of a row can cause more than one set of facts to be removed. For instance, if John Hunt
retires, then deleting that row causes us to lose the information about the New York office.

SalesStaff

EmployeeId  SalesPerson  SalesOffice  OfficeNumber  Customer1  Customer2  Customer3
1003        Mary Smith   Chicago      312-555-1212  Ford       GM
1004        John Hunt    New York     312-555-1212  Dell       hp         Apple
1005        Martin Hap   Chicago      312-555-1212  Boeing
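
The deletion anomaly can be sketched the same way. The DDL is again an assumption and the customer columns are omitted; the point is only that removing the one New York salesperson also removes the only record of the New York office.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed DDL for SalesStaff; the customer columns are left out for brevity.
conn.execute("""
    CREATE TABLE SalesStaff (
        EmployeeId   INT  NOT NULL PRIMARY KEY,
        SalesPerson  TEXT NOT NULL,
        SalesOffice  TEXT,
        OfficeNumber TEXT
    )
""")
conn.executemany(
    "INSERT INTO SalesStaff VALUES (?, ?, ?, ?)",
    [
        (1003, "Mary Smith", "Chicago", "312-555-1212"),
        (1004, "John Hunt", "New York", "312-555-1212"),
        (1005, "Martin Hap", "Chicago", "312-555-1212"),
    ],
)

def known_offices():
    """Return every office the table currently knows about."""
    rows = conn.execute(
        "SELECT DISTINCT SalesOffice FROM SalesStaff ORDER BY SalesOffice"
    )
    return [r[0] for r in rows]

before = known_offices()
# Removing the salesperson also removes the office facts stored in his row.
conn.execute("DELETE FROM SalesStaff WHERE SalesPerson = 'John Hunt'")
after = known_offices()
```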

Search and Sort Issues

The last reason we’ll consider is making it easier to search and sort your data. In the SalesStaff
table, if you want to search for a specific customer such as Ford, you would have to write a query
like:

SELECT SalesOffice
FROM SalesStaff
WHERE Customer1 = 'Ford' OR
      Customer2 = 'Ford' OR
      Customer3 = 'Ford';

Clearly, if the customers were stored in a single column, our query would be simpler. Also, consider
what happens if you want to run a query and sort by customer. The way the table is currently
defined, this isn’t possible unless you use three separate queries with a UNION.

These anomalies can be eliminated or reduced by separating the data into different tables, each of
which serves a single purpose. The process to do this is called normalization, and the various
stages you can achieve are called the normal forms.
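
To make the idea of single-purpose tables concrete, here is one possible decomposition of SalesStaff. It is a sketch, not the only valid design: the table names Office, Staff and Customer are hypothetical, and the data is copied from the example table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical single-purpose tables: offices, salespeople, and customers.
conn.executescript("""
    CREATE TABLE Office (
        SalesOffice  TEXT PRIMARY KEY,
        OfficeNumber TEXT
    );
    CREATE TABLE Staff (
        EmployeeId  INT  NOT NULL PRIMARY KEY,
        SalesPerson TEXT NOT NULL,
        SalesOffice TEXT NOT NULL REFERENCES Office(SalesOffice)
    );
    CREATE TABLE Customer (
        CustomerName TEXT NOT NULL,
        EmployeeId   INT  NOT NULL REFERENCES Staff(EmployeeId),
        PRIMARY KEY (CustomerName, EmployeeId)
    );
    INSERT INTO Office VALUES
        ('Chicago', '312-555-1212'), ('New York', '312-555-1212');
    INSERT INTO Staff VALUES
        (1003, 'Mary Smith', 'Chicago'),
        (1004, 'John Hunt', 'New York'),
        (1005, 'Martin Hap', 'Chicago');
    INSERT INTO Customer VALUES
        ('Ford', 1003), ('GM', 1003), ('Dell', 1004),
        ('hp', 1004), ('Apple', 1004), ('Boeing', 1005);
""")

# Searching for a customer now needs one predicate instead of three.
ford_offices = [r[0] for r in conn.execute("""
    SELECT s.SalesOffice
    FROM Customer c
    JOIN Staff s ON s.EmployeeId = c.EmployeeId
    WHERE c.CustomerName = 'Ford'
""")]

# Sorting by customer is a plain ORDER BY, with no UNION required.
customers_sorted = [r[0] for r in conn.execute(
    "SELECT CustomerName FROM Customer ORDER BY CustomerName"
)]
```

With this layout each office number is stored once, an office can exist without a salesperson, and a salesperson can have any number of customers rather than at most three.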

Question 2

The critical issues in denormalizing are:

• Deciding what data to duplicate and why
• Planning how to keep the data in sync
• Refactoring the queries to use the denormalized fields

One of the easiest types of denormalizing is to copy an identity field into other tables to avoid a
join. As identities should never change, the issue of keeping the data in sync rarely comes up. For
instance, we copy our client id into several tables because we often need to query them by client
and do not necessarily need, in those queries, any of the data in the tables that would sit between
the client table and the table we are querying if the data were fully normalized. You still have to
do one join to get the client name, but that is better than joining to 6 parent tables to get the
client name when that is the only piece of data you need from outside the table you are querying.

However, there would be no benefit to this if our queries usually needed data from the intervening
tables anyway, because the joins would still be required.
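
A minimal sketch of this kind of denormalization follows. The Client, Project and Task tables and their contents are hypothetical, and the chain is shortened to a single intervening table to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical chain Client -> Project -> Task. Task also carries ClientId,
# a denormalized copy of the value it could otherwise reach through Project.
conn.executescript("""
    CREATE TABLE Client (
        ClientId   INTEGER PRIMARY KEY,
        ClientName TEXT NOT NULL
    );
    CREATE TABLE Project (
        ProjectId INTEGER PRIMARY KEY,
        ClientId  INTEGER NOT NULL REFERENCES Client(ClientId)
    );
    CREATE TABLE Task (
        TaskId    INTEGER PRIMARY KEY,
        ProjectId INTEGER NOT NULL REFERENCES Project(ProjectId),
        ClientId  INTEGER NOT NULL REFERENCES Client(ClientId),
        Title     TEXT NOT NULL
    );
    INSERT INTO Client VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO Project VALUES (10, 1), (20, 2);
    INSERT INTO Task VALUES
        (100, 10, 1, 'Draft contract'),
        (101, 10, 1, 'Send invoice'),
        (200, 20, 2, 'Kickoff call');
""")

# Fully normalized route: reach the client through the intervening table.
via_join = [r[0] for r in conn.execute("""
    SELECT t.Title
    FROM Task t
    JOIN Project p ON p.ProjectId = t.ProjectId
    WHERE p.ClientId = 1
    ORDER BY t.TaskId
""")]

# Denormalized route: filter on the copied ClientId, no join needed.
direct = [r[0] for r in conn.execute(
    "SELECT Title FROM Task WHERE ClientId = 1 ORDER BY TaskId"
)]
```

Both queries return the same tasks; the second one simply skips the intervening table, which is where the saving grows as the chain of parent tables gets longer.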

Another common denormalization is to add a name field to other tables. As names are inherently
changeable, you need to keep the copies in sync, for example with triggers. But if this lets you
join to 2 tables instead of 5, it can be worth the cost of the slightly longer insert or update.
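
One way to keep a copied name in sync is an update trigger on the source table. The sketch below assumes hypothetical Client and Invoice tables and uses SQLite trigger syntax; other database systems spell triggers differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: Invoice keeps a denormalized copy of the client name.
conn.executescript("""
    CREATE TABLE Client (
        ClientId   INTEGER PRIMARY KEY,
        ClientName TEXT NOT NULL
    );
    CREATE TABLE Invoice (
        InvoiceId  INTEGER PRIMARY KEY,
        ClientId   INTEGER NOT NULL REFERENCES Client(ClientId),
        ClientName TEXT NOT NULL
    );
    -- Propagate every rename to the copies held in Invoice.
    CREATE TRIGGER SyncClientName
    AFTER UPDATE OF ClientName ON Client
    BEGIN
        UPDATE Invoice
        SET ClientName = NEW.ClientName
        WHERE ClientId = NEW.ClientId;
    END;
    INSERT INTO Client VALUES (1, 'Acme');
    INSERT INTO Invoice VALUES (500, 1, 'Acme'), (501, 1, 'Acme');
""")

# Renaming the client fires the trigger, which rewrites both invoice rows.
conn.execute("UPDATE Client SET ClientName = 'Acme Corp' WHERE ClientId = 1")
invoice_names = [r[0] for r in conn.execute(
    "SELECT ClientName FROM Invoice ORDER BY InvoiceId"
)]
```

The extra work done by the trigger is the "slightly longer update" mentioned above, paid once per rename instead of on every read.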

A better example of where denormalization can play an important part is managing user permissions.
This is not to be confused with database security (which is completely useless for
user/application-level permission management). Let’s say you have a tree-based system of widgets.
Widgets can have sub-widgets and so on, such that each widget has only one parent but can have many
children. Now let’s say users within the application (again, not to be confused with database
users) are assigned read/write roles on individual widgets as well as on entire subtrees of
widgets. For example, Bob may be able to read the entire tree, but he can only write to the
individual widgets A, B, and C and to the sub-tree rooted at widget D, which contains X, Y, and Z.
Now, how would you design a database system that allows quick determination of a user’s permissions
on a node, considering that the user may be accessing dozens or hundreds of nodes at a time, such
as in a tree viewer?

Since the height of the tree is unbounded, we would need a database schema that allowed for
arbitrary height. A common solution is to have a single table, let’s call it WidgetTable, with a field
for the parent reference, let’s call it ParentId. Obviously, the root node would have a null ParentId.
Given a node X, how do I quickly determine if X is writable? In this case, the user may have write
access to X directly or through a parent node somewhere up the tree.
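
The normalized lookup described above can be sketched with a recursive query that walks from a node up to the root. WidgetTable and ParentId come from the text; the WidgetWriteGrant table, the WidgetId and UserName columns, and the sample tree are assumptions, and for simplicity every grant is treated as covering the whole subtree below the granted widget.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WidgetTable (
        WidgetId TEXT PRIMARY KEY,
        ParentId TEXT REFERENCES WidgetTable(WidgetId)  -- NULL for the root
    );
    -- Hypothetical grants table: a grant covers the widget and its subtree.
    CREATE TABLE WidgetWriteGrant (
        UserName TEXT NOT NULL,
        WidgetId TEXT NOT NULL REFERENCES WidgetTable(WidgetId),
        PRIMARY KEY (UserName, WidgetId)
    );
    INSERT INTO WidgetTable VALUES
        ('Root', NULL), ('A', 'Root'), ('D', 'Root'), ('X', 'D'), ('Y', 'D');
    INSERT INTO WidgetWriteGrant VALUES ('Bob', 'A'), ('Bob', 'D');
""")

def can_write(user, widget_id):
    """True if the user has a grant on the widget or on any of its ancestors."""
    row = conn.execute("""
        WITH RECURSIVE Ancestors(WidgetId) AS (
            SELECT ?                      -- start at the requested widget
            UNION ALL
            SELECT w.ParentId             -- then step to each parent in turn
            FROM WidgetTable w
            JOIN Ancestors a ON w.WidgetId = a.WidgetId
            WHERE w.ParentId IS NOT NULL
        )
        SELECT EXISTS (
            SELECT 1
            FROM WidgetWriteGrant g
            JOIN Ancestors a ON a.WidgetId = g.WidgetId
            WHERE g.UserName = ?
        )
    """, (widget_id, user)).fetchone()
    return bool(row[0])

x_writable = can_write("Bob", "X")        # granted through parent D
root_writable = can_write("Bob", "Root")  # no grant on the root itself
```

Each call climbs the tree once, so checking hundreds of nodes in a tree viewer repeats that walk hundreds of times; that per-node cost is what makes this case a candidate for denormalization.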

Question 3

Business rules impact on both database normalization and the decision to denormalize database tables

Business Rules

You need to conduct research at your company before you can begin to normalize a database. You
need to perform a requirements analysis, which will identify policies and procedures and will list
the business rules for them. You must have consensus on what the rules mean. By consensus, I
mean that everyone who uses the database must agree on the definition and the use of these data
items. Without consensus, if you ask three people in the company to define what customer means,
you might get three different answers. To one person, a customer is the company that buys products
and services. To a second person, the customer is the contact person for the company that buys
products and services. To a third person, the customer is someone who might be interested in buying
products and services. Some terms are standard, but under no circumstances can you assume the
definition or meaning of a term. Confirm meanings of terms, and confirm how your company uses
these terms.

You can use schemes, or methodologies, to guide you through the requirements-analysis phase of
a database design. Think of schemes as playbooks or recipes for database design. If you are using
a methodology and like it, continue to use it, no matter how small or insignificant the database
design project. If you don't have a favorite methodology, you might want to explore techniques
such as Bachman, IDEF1X (favored by the Department of Defense), or the new object-oriented
Unified Modeling Language (UML—the scheme for the Microsoft Repository).
