
Tera-Tom on

Teradata Basics

Teradata explained through unimaginable simplicity

Written by Tom Coffing and Morgan Jones



First Edition 2001


Web Page: www.Tera-Tom.com
E-Mail addresses:
Tom: Tcoffing@aol.com
Teradata, NCR, and BYNET are registered trademarks of NCR Corporation,
Dayton, Ohio, U.S.A., IBM and DB2 are registered trademarks of IBM Corporation,
ORACLE is a registered trademark of Oracle, SYBASE is a registered trademark of
SYBASE, ANSI is a registered trademark of the American National Standards Institute.
In addition to these product names, all brands and product names in this document are
registered names or trademarks of their respective holders.
Coffing Data Warehousing shall have neither liability nor responsibility to any person or
entity with respect to any loss or damages arising from the information contained in this
book or from the use of programs or program segments that are included. The manual is
not a publication of NCR Corporation, nor was it produced in conjunction with NCR
Corporation.
Copyright 2001 by Coffing Publishing
All rights reserved. No part of this book shall be reproduced, stored in a retrieval system,
or transmitted by any means, electronic, mechanical, photocopying, recording, or
otherwise, without written permission from the publisher. No patent liability is assumed
with respect to the use of information contained herein. Although every precaution has
been taken in the preparation of this book, the publisher and author assume no
responsibility for errors or omissions, neither is any liability assumed for damages
resulting from the use of information contained herein. For information, address:
Coffing Publishing
7810 Kiester Rd.
Middletown, OH 45042
International Standard Book Number: ISBN 0-9704980-1-2

Printed in the United States of America


All terms mentioned in this book that are known to be trademarks or service marks have been
stated. Coffing Publishing cannot attest to the accuracy of this information. Use of a
term in this book should not be regarded as affecting the validity of any trademark or
service mark.

Acknowledgements and Special Thanks

This book is dedicated to Americans and friends of liberty and freedom

We also want to thank our wives, Leona Coffing and Janie Jones

Thanks to a great editor and friend, Cheryl N. Buford

TABLE OF CONTENTS

INTRODUCTION
Rule # 1 Start Building Towards A Central Data Warehouse
Rule # 2 Build for the User
Rule # 3 Let the IT Department Lead the Way to User Utopia
Rule # 4 Build the Foundation Around Detail Data
Rule # 5 Build Data Marts from the Detail
Rule # 6 Make Scalability Your Best Friend
Rule # 7 Model the Data Correctly
Rule # 8 Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic
Rule # 9 Take a Building Block Approach
Rule # 10 Buy a Teradata Data Warehouse

Teradata: The Shining Star
Parallel Processing
Components of a Personal Computer
Teradata Spreads Data over Multiple Processors
A Logical View of the Teradata Architecture
Parsing Engine (PE)
Access Module Processor (AMP)
The BYNET
Teradata Building Block Approach
Teradata Tables
Arthur Schopenhauer
Teradata Spreads the Data Evenly Across the AMPs
Primary Indexes
There are two types of Primary Indexes
The Hash Map
How the Hash Map and Primary Index Work Together
Retrieving the Data
The Full Table Scan
Secondary Indexes
Join Indexes

Teradata Databases, Users and Space
Databases and Users
Three Types of Teradata Space
What is a View?
What is a Macro?
Access Rights for Teradata Users
Automatic, Implicit, and Explicit Rights

Data Protection
Transaction Concept & Transient Journal
FALLBACK Protection
Down AMP Recovery Journal (DARJ)
Redundant Array of Independent Disks (RAID)
Cliques
Permanent Journal
Locking Modes in Teradata
Referential Integrity

Loading the Data
Fastload
Multiload
Tpump

Conclusion A Final Thought on Teradata

INTRODUCTION
A full 40% of Fortune's "U.S. Most Admired" companies use Teradata.
What do they know that your company needs to know? I've been in
the computer business for more than 27 years. I've witnessed so
much since the early days of punch cards, assembler languages, and
COBOL programming. With that in mind, the most magnificent,
ingenious technology that I've ever seen is a database from the NCR
Corporation called Teradata.

"The wave of the future is coming and there is no fighting it."
Anne Morrow Lindbergh
Teradata is absolutely the wave of the future in data warehousing. I
introduced this technology to a great friend, Morgan Jones. He
immediately recognized that Teradata is the gold standard for all data
warehousing, and as a result, we've partnered to write this book. So,
sit back, relax, and enjoy. With our guidance, you will soon realize
why Teradata is the greatest technology on the planet!

The Ten Rules of Data Warehousing

What weapon was deemed so powerful that experts claimed it would
end all wars? Believe it or not, it was the crossbow! Throughout
history, people have improved technology and advanced society
through foresight and ingenuity. Just when we think something is
impossible it becomes a reality. Who would have dreamed we could
send a person to the moon, or that someone could run a mile in under
four minutes? Ingenuity and the desire to improve are attributes of
the human race, and both are found in numerous avenues, from sports
to business.

"Expect the unexpected, or you won't find it."
Roger von Oech
When Frank Lloyd Wright began to design the Imperial Hotel in Tokyo,
he discovered the unexpected: just eight feet below the surface of the
ground lay a sixty-foot bed of soft mud. Since Japan is a land of
frequent shakes and tremors, Wright was faced with what appeared to
be an insurmountable obstacle. This gave him an idea: Why not float
the Imperial Hotel building on the bed of mud, and let it absorb the
shock of any quake? Critics and cynics alike laughed at such an
impossible idea. Frank Lloyd Wright built the hotel anyway. Shortly
after the grand opening of the hotel, Japan suffered its worst
earthquake in fifty-two years. All around Tokyo buildings were
destroyed, but the Imperial Hotel stood firm.
For a long time the mainframe and OLTP industry laughed at those
who recommended the data warehouse design principles set forth in
this book. But those companies that build one based upon these rules
will join the ranks of the elite. Consider this: ten of the top 13 global
communications companies use Teradata; nine of the top 16 global
retailers use Teradata; and eight of the top 20 global banks use
Teradata.

The ability to continually improve is one of Teradata's greatest
strengths. The database was designed in 1976 and has continually
improved ever since. Teradata has averaged one data warehouse
installation per week for the past decade. Through continual
improvement based on customer feedback from many of the largest
data warehouse sites, Teradata has been able to establish itself as the
data warehouse of choice for award-winning data warehouses.
This book begins with the 10 cardinal rules to follow for data
warehouse success. It illustrates how Teradata helps customers follow
these rules. Then it explains the brilliance of how Teradata works. By
the end, the reader will have a real grasp of essential Teradata
concepts.

Rule # 1 Start Building Towards A Central Data Warehouse

Moments after midnight on July 30, 1945, the Navy cruiser USS
Indianapolis suffered a fatal torpedo hit from a Japanese submarine.
It had been traveling unescorted through the Philippine Sea. Within
12 minutes of the deadly hit, the ship sank. Over 300 men were killed
and nearly 900 were stranded in shark-infested seas. Tragically, those
who survived until daylight faced four torturous days in the water, and
battled continuous shark attacks before being stumbled upon by a
passing ship. In the end, only 316 souls survived. With a crew of
1,199 people, this was one of the worst military disasters of World War
II for the United States.
Most people assume that war is cruel, but the heart-wrenching story
above becomes even more tragic when the following facts are
revealed: First, the ship's captain did not have all of the facts, and
second, the Navy did not provide the captain with a single version of
the truth. The captain's request for a destroyer escort was denied
even though the regional Naval command knew another ship had been
attacked just two days earlier and that multiple enemy sightings had
occurred within the previous five days. Not only were these crucially
relevant facts withheld, but also the captain of the Indianapolis was
told that his passage route was clear and there would be no need for a
destroyer escort.

"To withhold news is to play God."


John Hess
Had everyone involved with the USS Indianapolis adhered to a single
version of the truth, with detail data to back it up, this disaster
might never have occurred. Likewise, if your company doesn't maintain
detail data in a Centralized Data Warehouse, you will never know
which version of the truth to believe. Each division of a business will
have its own view of the truth. Summarized data, such as a data mart,
does have its place in knowledge management, but it should always be
built from the detail data within the central data warehouse.
Most companies don't have a Central Data Warehouse. Why? Because
they don't have proper leadership or direction. Company leaders often
let different branches of the company create data marts that are
effective short-term solutions. These solutions are based on
departmental leadership that is most interested in short-term
solutions. Such leaders don't plan on being with a particular
department forever, so they are only interested in keeping things
simple, controlled, and beneficial to them.

"We're all in this alone."
Lily Tomlin
For example, imagine a company that made cars on an assembly line.
Instead of using a giant plant with the latest and greatest technology,
the company builds cars in 300 small garages. Each garage is owned
by a different department, and has different needs. In addition, every
user's access is restricted to his or her own garage. With this structure,
leaders feel safe, but building cars, logistically, is a nightmare. In fact,
just moving cars from one garage to the next would be a joke. This
scenario may seem simple-minded, but that is how most data
warehouses are built. Each part of some data warehouses operates
alone.
Now, imagine a giant car assembly plant where the assembly line was
managed by the idea that "There is no I in Team." This plant would
continually improve processes, finding better ways to work together.
Everyone has an idea what the others are doing, and new ideas are
welcome. Management is able to run the entire plant with one team of
dedicated professionals, and decisions are made cooperatively,
concisely, and clearly.
This style of management is the idea behind a central data warehouse.
From the top layer of management down through the entire company,
they are one solid team. An experienced data warehouse team saves
valuable money and resources, plus users can manage the entire data
warehouse. Executives may ask any question targeted to any part of
the business. Decisions are made with long-term vision, and every
employee is confident that when they need answers, the data
warehouse will provide them.

"If I have seen further it is by standing on the shoulders of giants."
Isaac Newton
When asked how he had discovered the Law of Gravity, Isaac Newton
did not grab all of the glory for himself. He claimed that his work
stood on the foundation of those early scientists who had gone before
him. Likewise, a central data warehouse allows users to stand on the
shoulders of another giant. This giant, built right, allows major
corporations to make decisions and act on those decisions quickly.
In 1993, I was asked to train one of the world's largest retailers on its
Teradata data warehouse. I flew to Bentonville, Arkansas, and an
employee met me at the airport then escorted me to the classroom.
As we walked down the hallways, most employees seemed to be at a
pace I had never seen before. They were practically running. I asked,
"What's up? Why is everyone hurrying?" The employee replied, "It's
work time!" I was shocked. In all of the places I had previously worked,
we strolled. This place had a leadership that I've never
encountered anywhere. H. Ross Perot described this kind of team
when he said, "When building a team, I first look for people who love
to win. If I can't find any of those, then I look for people who hate to
lose." This was a concise team of employees so motivated and so
empowered that they thought they could take over the world!
As I grew to know the team, I asked them how long it took top
management to make a decision and how long it took to
implement that decision at thousands of stores nationwide. They
simply said, "About two hours!" I was amazed. Today, this team
continues to have one of the single greatest data warehouses ever
built. They use it extensively and it grows stronger every day.

While visiting with this team, management decided at one point that
stores across the country should place Halloween displays and candy
near the cash registers. In less than two hours, stores moved their
Halloween candy from the normal candy aisles to end-caps near the
cash register. Every store participated but one!
When asked why he didn't participate, the store manager said he had
simply run out of time to create the displays plus move the Halloween
candy from his normal candy aisle to the end-caps. Management was
ticked. Telling the manager they would get back to him, they then
asked the DBA to query the data warehouse to see how much this
snafu had cost the company. The DBA came back and reported that
the store actually sold almost the same amount of Halloween candy as
forecasted. Management was surprised and honestly a little
disappointed with the answer. But then the DBA added somewhat
sheepishly, "I found something else, too." "Go ahead," replied
members of the management team. He said, "I found out they
actually sold about 40% more normal candy than we forecasted for
this holiday." Management got on the phone immediately and told the
other thousand stores: "Move those goblins and Halloween candy back
to the normal candy aisles!"
What that DBA did was to use his instinct and the data warehouse to
find out exactly what was going on with the business at that time. He
was armed with a system that had cross-functional analysis. A central
data warehouse gives good management great confidence because
they see the whole picture. When users can ask any question, at any
time, and on any data, their knowledge is unlimited.
Most Teradata Central Data Warehouse sites will tell you most of their
Return On Investment (ROI) came from areas they never suspected.
Thomas Jefferson once said, "We don't know one millionth of a percent
about anything." When we explained Teradata to Jefferson he did not
build another Monticello, but he did retract his statement! Companies
with a centralized data warehouse know about a million percent more
than companies that have invested in stovepipe applications and 300
different data marts.
Actually, any company planning on competing in this millennium must
think long-term and begin building a centralized data warehouse. If
not, that company will be on the short end of the stick when
competing with a company that chose to build one. That thought
should sound scarier than a goblin near the cash registers on
Halloween!

If you think about it, every major decision in business makes someone
happy. If you are armed with facts supported by a central data
warehouse and you do your homework, your business decisions will
make your shareholders happy. However, if you are making decisions
with a data mart strategy, those decisions are more likely to make
your competitors happy.
There are many companies that are fearful of such an undertaking.
They want a central data warehouse, but wonder: What if it fails?
Which database should we choose? What type of hardware do we
need? Should we do an RFP? Decisions, decisions! It would literally
take me about 30 seconds to make a decision on Teradata. There
would be no RFP. We used to wade in swimming pools of data; today
we are swamped in oceans of data. Teradata is built for this type of
environment. This book explains the fundamentals of Teradata.
Anyone with any experience or knowledge about data warehouse
environments will clearly see why Teradata is the best solution.


Rule # 2 Build for the User

"A learned person is not one who gives the right answers; it is the one
who asks the right questions."
Claude Levi-Strauss
The user is the heart of the data warehouse, and users get better with
each day of experience. The user makes decisions that affect the
company's bottom line. That's why the data warehouse is built around
the business user. Building a data warehouse is simple: find out what
data the business users need and what type of queries they want to
ask but are not able to ask today. Then, find out if the data is
available and if the queries can be answered. With those answers, you
will exceed users' expectations.
An experienced data warehouse user is usually shocked when he or
she first uses Teradata. Its sheer power and flexibility enable users
to ask questions they have never been able to ask before. On a recent
consulting trip of mine, a young DBA got antsy when a particular
query took more than a minute or so with Teradata. So I asked,
"Well, how long did that same query take with your OLTP-based data
warehouse?" He retorted, "We couldn't even run this query on the old
system." I said, "So, what's wrong with two minutes?" He added, "You
know, some of our business users are so used to how long our queries
used to run that they will be sitting, staring at the screen, without
realizing that Teradata has already brought back the answer!" With
Teradata, users can expand their thinking by using intuition and keen
business sense without technology barriers.
The building of an enterprise data warehouse begins with top
management, but then cascades down to a relationship between the IT
department and the business user community.
The IT department must realize they have a supporting role. That role
is to please the business user by making data available so the business
user can easily ask questions and get answers. It's also the IT
department's role to build a system that allows users to ask questions
on their own without IT intervention. Forget about building a system
where users ask IT to run the queries for them. When users need
information, the IT department should eventually be able to say, "Ask
the question yourself; it is all available to you."
The business users are actually the stars; however, the entire business
community must take responsibility for the warehouse's success.
These users must continually educate themselves and other users on
the capabilities of the data warehouse, new tools, and new techniques
that will enhance its potential. Those same users must help IT help
them. If both understand their respective roles and work together to
help the company, then the data warehouse will be a huge success.


Rule # 3 Let the IT Department Lead the Way to User Utopia

Few sports challenges are as grueling or demanding as the Tour de
France. But victory at this event eluded Lance Armstrong, a powerful
young cyclist from Austin, Texas. Lance excelled in individual
competition, even winning the World Championships. But despite his
hard work, Lance could not overcome the Europeans' strong and proud
tradition at the Tour de France. A few years ago, Lance was thrown
into the battle of his life, not against others but against himself. He
discovered that he had cancer and was given virtually no chance of
surviving. Suddenly he found out how little cycling really meant in life.
With all his might, Lance battled his way back to health, beating the
odds. Now he found out how very much cycling could mean in life.
His bicycle became a tool to reclaim the future. He found a spot as a
team member for the U.S. Postal Service team. With a new
perspective and a new depth of character, Lance led that team to
victory in the next Tour de France. And he repeated this victory
for the next two years!
To win the premier event in the cycling world, Lance Armstrong had to
totally rethink his role. In the same way, the key members of any
company seeking success with its data warehouse must rethink their
roles. The IT department plays a key role in a data warehouse. What
do users know about technical issues? Not enough to build a data
warehouse. So, technical issues are the responsibility of the IT
department. The danger with this train of thought is that while the IT
department has years of experience with handling company
transactions through production databases and applications, most are
new at data warehousing. A data-warehousing environment can be
extremely different from anything an IT department has ever built or
used before. Therefore, it's a bad idea to build a data warehouse
without the help of experienced people.
An OLTP environment gets more and more predictable each month. It
is designed to be tweaked and tuned in order to maximize a
company's environment. On the other hand, a data warehouse is an
unpredictable environment where the only way to gain control is to
actually give up control. In data warehousing, users must be
allowed the freedom to ask the questions, and they will blossom in an
environment where flexibility is accepted and welcomed.

"The only sure weapon against bad ideas is better ideas."
A. Whitney Griswold
If the IT department decides to build hundreds of data marts that will
please each and every department, then they are missing the boat.
Data warehouse experience is a hard teacher because it gives the test
first, and the lesson afterwards. Abraham Lincoln once said, "A house
divided cannot stand." With that in mind, build the data warehouse
so it will stand strong for a long time.
What's the formula? First and foremost, start by building your data
warehouse around detail data. Bring transaction data, along with key
details, from the OLTP systems into the data warehouse. Then, as
known queries are identified, build data marts to enhance their
performance, and also insist that data marts are created and
maintained directly from the detail data. Doing so will build a
foundation that will stand.
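To make the formula concrete, here is a minimal sketch. It is not Teradata syntax and is not taken from any real site. It assumes Python's built-in sqlite3 module standing in for the warehouse, and every table and column name (sales_detail, daily_sales_mart, store_id, and so on) is hypothetical. It shows a data mart being built directly from the detail rows, while the detail stays available for a question the mart cannot answer.

```python
import sqlite3


def build_mart_from_detail(conn):
    """Rebuild a summary 'data mart' table directly from the detail rows.

    Table and column names are illustrative assumptions, not a real schema.
    """
    conn.execute("DROP TABLE IF EXISTS daily_sales_mart")
    conn.execute(
        """
        CREATE TABLE daily_sales_mart AS
        SELECT store_id,
               sale_date,
               SUM(amount) AS total_amount,
               COUNT(*)    AS txn_count
        FROM sales_detail
        GROUP BY store_id, sale_date
        """
    )


# An in-memory database stands in for the central data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_detail (
        txn_id    INTEGER PRIMARY KEY,
        store_id  INTEGER,
        sale_date TEXT,
        item      TEXT,
        amount    REAL
    )
    """
)

# Detail data: one row per transaction, as it would arrive from OLTP.
rows = [
    (1, 101, "2001-10-30", "candy", 4.00),
    (2, 101, "2001-10-30", "mask", 6.50),
    (3, 102, "2001-10-30", "candy", 3.00),
    (4, 101, "2001-10-31", "candy", 5.00),
]
conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?, ?, ?)", rows)

# A known query (daily sales per store) gets a mart built from the detail.
build_mart_from_detail(conn)
mart = conn.execute(
    "SELECT store_id, sale_date, total_amount, txn_count "
    "FROM daily_sales_mart ORDER BY store_id, sale_date"
).fetchall()

# A new question (sales by item) is not in the mart, so go back to detail.
candy_total = conn.execute(
    "SELECT SUM(amount) FROM sales_detail WHERE item = 'candy'"
).fetchone()[0]

print(mart)
print(candy_total)
```

The mart answers the known question quickly, but it has already thrown away the item column. Because the mart is derived from the detail rather than from the OLTP feed, it can be rebuilt at any time, and a new question such as candy sales simply goes back to the detail table.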
Next, the IT department needs to keep an open mind about creating
an environment called "User Utopia." Have you ever been there? In
User Utopia, the user confidently asks queries without fear of being
charged by the minute. The user has meta-data so he or she becomes
intimate with the data, then makes informed decisions. The user
should also be able to ask monster queries with the full backing of IT.
Recently, on one such query, the IT department wanted to pull the
plug. But the DBA held out, granting the user more time. When the
query finished running, the information it brought back from the detail
data saved the company millions of dollars. Overall, a user will get
the majority of his or her answers back quickly from data marts, but
he or she also needs the capability of going back to the detail data for
more information. This is User Utopia.
Here is the message for IT: Don't follow the idea that "if you build it,
they will come." Instead, become a leader: go to the users and
build it together.


Rule # 4 Build the Foundation Around Detail Data

Business is always trying to predict the unpredictable! The US Air
Force Reserve's 53rd Weather Reconnaissance Squadron is a special
force that flies its planes directly into tropical storms and hurricanes.
Using a WC-130 Hercules aircraft they fly into storms at low altitudes
between 1,000 and 10,000 feet, taking weather readings that are
relayed to the National Hurricane Center in Florida. They measure wind
speeds, measure the pressure and structure of the storm, and, most
importantly, locate the eye of the storm. The data collected by these
Hurricane Hunters is used to determine when and where a storm
might hit the coast and how strong it will be at that time. Teradata has
no fear of detail data; its virtual processors will fly right into the thick of
your data warehouse to bring back valuable information for decision
support. You see, Teradata enables you to understand the storms in
your business today while helping you predict when and where the
next storm will hit tomorrow.
I estimate that 80% of today's data warehouses are built on summary
(summarized) data. Therefore, 80% of all data warehouses will never
come close to realizing their full potential. Your data warehouse does
not have to be one of them!

"A bird does not sing because it has the answers, it sings because it
has a song."
A data warehouse built on detail data does not sing because it has a
song, it sings because it has the answers. When you capture detail
data, answers to an infinite number of questions are available. But, if
this is truly the case, then why doesn't everybody build around detail
data? Well, there are two reasons. One is price! Like a bird, many
companies decide to go "cheap, cheap." But watch out! The real
expense is not the cost of the data warehouse; it is the money that
you will not make without one. The second reason is power! Many
companies don't have the wingspan to fly through the detail, so they
soar with the summary. In addition, some companies don't want to
pay for the disk space it actually takes to keep detail data, but believe
me, that cost is a small price to pay for success.

"Once you miss the first buttonhole it becomes difficult to button your
shirt."
Many companies use the same database for their data warehouse as
they use for their OLTP system. This is a critical mistake. In
essence, they have missed the first buttonhole and most likely will lose
their shirt on their data warehouse adventure.
At this point, companies no longer have a choice of using detail data.
They must summarize for performance reasons. As one marine told
his boot camp soldiers jokingly, "The beatings will continue until
morale improves." Similarly, a database designed for OLTP takes a
continual beating when it tries to query large amounts of detail data.
Companies building true data warehouses don't compromise on price,
and will have a data warehouse that is built for decision support, not
one that specializes in OLTP. With this decision, you have buttoned
the first buttonhole and are well on your way to reaching the top.
Detail data is the foundation that data warehouses are built upon.
Users can ask any question, anytime, and conduct data mining, OLAP,
ROLAP, SQL and SPL functions, build data marts directly from the
detail data, and can easily maintain and grow the environment on a
daily basis. Now that's a tune well worth singing. Make a note of it!

Rule # 5 Build Data Marts from the Detail

You cannot teach a man anything; you can
only help him find it within himself.
Galileo

Galileo was a smart man. How did he know so much about life and
data marts? When we explained data marts to Galileo he said, "You
cannot build a data mart directly from the OLTP systems, you can only
build a data mart directly from the detail within." He was right!
Many companies build data mart after data mart directly from the
OLTP systems and their universe begins to revolve around continual
maintenance. Then, as things get worse, as Galileo predicted, their
universe begins to revolve around the son: the son of a gun sent in to
replace them!
Why does this happen? At first, things work out great, but soon there
are more and more requests for additional information. As a result,
more and more data marts are created, and soon the system looks like
a giant spider web. Different data marts start to yield different results
on like data, and the actual maintenance of this complicated spider
web takes up most of IT's time. Meanwhile, short-term dreams turn
into long-term nightmares like this one: A man and his wife had had a
big argument just before he went on a business trip. Feeling rather
contrite about his harsh words, he arranged to send his wife some
flowers and asked the florist to write on the card, "I'm sorry. I love
you." The beautiful bouquet arrived at the door. But then his wife
read the words the florist had actually written in haste, "I'm sorry I
love you."

The top reasons to build data marts directly from detail data are:

Users can get answers from the data mart, but must validate
their findings or check out additional information from the detail
that built it.

There is only one consistent version of the truth

Maintenance is easy

If a user comes up with a data mart answer that does not make sense,
then he or she has the ability to drill down into the detail and
investigate. Sometimes summary data can spark interest and finding
out the why can result in big bucks.
If users don't trust the data, they won't use the system. When a data
warehouse is built on a foundation of detail data and then data marts
are erected from that foundation, you have a winning combination.
The results will always be consistent and trustworthy. However, you
should only build data marts when there is a credible business case,
and you should be ready to drop them when they are no longer
needed. The life span of a data mart is short relative to that of its
mother and father (better known as the detail data). If you build
data marts from the detail, they are easy to manage, easy to
drop, and easy to change.

Rule # 6 Make Scalability Your Best Friend

Plan your life for a million tomorrows, and
live your life as if tomorrow may be
your last.
Morgan Jones
The roar of class-6 rapids on a river in Suriname can be almost
deafening against the dense walls of the jungle. Especially when you
are 9 years old. Our mission was to lower our canoe down the
waterfall with ropes. The Trio Amer-Indian who anchored our 40-foot
dugout canoe let go of the anchor rope too quickly. Without warning,
the heavy boat began a freefall through the rocky water with my
father hanging onto the side for dear life. He disappeared under the
rocky waters and I knew for sure we had lost him. My heart pounded
against my chest. As I rallied myself to grasp this loss as only a
nine year old can, the Indians abruptly began cheering wildly above
the roar of the river.
My dad had resurfaced a hundred yards
downstream, battered and bruised; but he was alive! In just one short
minute I determined that I would love my family every day as if there
were no tomorrow.
Just as I made my family my best friend, a data warehouse must make
scalability its best friend. A data warehouse that does not scale will
have no tomorrow. It is only a matter of time until the warehouse
disappears in rocky waters, never to come up for air. Don't let go
of the anchor rope.
The data-warehousing environment will throw obstacles in your way
every single day. A data warehouse must be planned to meet today's
needs. But it must also be capable of meeting tomorrow's challenges.
The future cannot be predicted, so plan for unlimited growth, or linear
scalability, both vertical and horizontal. There are so many data
warehouses that start out with sizzling performance, but as they grow,
they eventually and inevitably hit the scalability wall. However,
before they hit the wall, there is a pattern of diminishing performance.
A data warehouse designed without scalability in mind is doomed
before it is begun. It can never reach its potential. Take the
scalability question out of the equation by investing in a database that
allows you to start small, but grows linearly.
In today's fast-paced world, Gigabytes soon become Terabytes. It
may not sound like much, but it weighs a ton on the shoulders of
giants. Listen to these measurements and pick your data warehouse's
life span. For example, if you lived for a million seconds (Megabyte),
then you would live for about 11.6 days. In comparison, if you lived for a
billion seconds (Gigabyte), then you would live for about 31.7 years. Plus, if
you lived for a trillion seconds (Terabyte), then you would live for
about 31,688 years!

How nice it would be on your 31,688th
birthday that people would say, "You sure
look good for your age."
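
The lifetimes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are ours, and it assumes an average year of 365.25 days, which is the assumption that yields the 31,688-year figure.

```python
# Convert the "lifetimes" in the text from seconds into days and years.
# Assumption: a year is 365.25 days (an average that includes leap years).
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_to_days(seconds):
    """Return the number of days in the given number of seconds."""
    return seconds / SECONDS_PER_DAY


def seconds_to_years(seconds):
    """Return the number of (365.25-day) years in the given number of seconds."""
    return seconds / SECONDS_PER_YEAR


for label, seconds in [("million", 10**6), ("billion", 10**9), ("trillion", 10**12)]:
    print(f"A {label} seconds = {seconds_to_days(seconds):,.1f} days "
          f"= {seconds_to_years(seconds):,.1f} years")
```

Each step from Megabyte to Gigabyte to Terabyte is a factor of a thousand, which is why days turn into decades and decades into tens of millennia.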
Data warehouses hit the wall of scalability because they cannot grow
at the same rate as the amount of data being gathered.
Teradata allows for unlimited linear scalability. Linear Scalability is
a building block approach to data warehousing that ensures that as
building blocks are added, the system continues at the same
performance level.
This is why the largest data warehouses in the world use Teradata. I
was lucky to be in the right place at the right time, and taught
during the beginning stages at what are considered the two largest data
warehouse sites in the world: Southwestern Bell (SBC) and Wal-Mart.
Wal-Mart's data warehouse started with less than 30 gigabytes, and
SBC started with less than 200 gigabytes and 100 users. Both
warehouses:

Started small and simple;

Used Teradata from the beginning;

Have built the largest Enterprise Data Warehouse in their
respective industries;

Continue to realize additional Return On Investment (ROI) on an
annual basis;

Have grown to more than 10 Terabytes of data, and are still
growing;

Have thousands of users (some estimates are shocking);

Have educated and experienced data warehouse staffs;

Have educated and experienced data warehouse users;

Experience continual growth without boundaries;

Have experienced linear performance by Teradata in every single
upgrade (from gigabytes to terabytes and from terabytes to tens
of terabytes);

Both companies are impressed with Teradata's power and
performance;

And both SBC and Wal-Mart are committed to the excellence of
Teradata.

A data warehouse is built in small building blocks. Linear Scalability is
described in three ways:
First, building blocks are added until the performance requirements of
your environment are met. (Guaranteed Success);
Second, every time the data doubles, building blocks are doubled, and
the system maintains its performance level. (Guaranteed Success);
and
Third, any time the environment changes, building blocks are added
until performance requirements are met. (Guaranteed Success)
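
The second description (double the data, double the building blocks, keep the same performance) can be sketched as simple arithmetic. The capacity of 50 GB per block is a made-up number for illustration only, and the function names are ours, not Teradata's.

```python
import math

GB_PER_BLOCK = 50  # hypothetical capacity of one building block


def blocks_needed(data_gb, gb_per_block=GB_PER_BLOCK):
    """Building blocks required so that no block holds more than gb_per_block."""
    return math.ceil(data_gb / gb_per_block)


def data_per_block(data_gb, blocks):
    """Data each block must scan; used here as a stand-in for response time."""
    return data_gb / blocks


for data_gb in (200, 400, 800):
    blocks = blocks_needed(data_gb)
    print(f"{data_gb} GB -> {blocks} blocks, "
          f"{data_per_block(data_gb, blocks)} GB per block")
```

Because the work per block stays constant as data and blocks double together, the idealized response time stays flat. Real systems also pay some coordination cost, which this sketch ignores.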
Scalability is not just about growing the data volume. It also means
growing, or increasing, the number of users. Many systems work
flawlessly until as few as 5 users are added, then they slow down to a
crawl. Companies need a system where growth and performance are
easily calculated and implemented. That means where the number of
users, size and complexity of queries, volume of data, and number of
applications being used can be calculated and compared to the current
system's actual size. If more power, speed, or size is needed, then
the company can simply add building blocks to the system until the
requirements are met.

Rule # 7 Model the Data Correctly

You will find only what you bring in.
Yoda, Jedi Master in Star Wars
We model a database for the same reasons that Boeing builds an
aircraft model to test flight characteristics in a wind tunnel. It's
simpler and cheaper to model than to reconstruct the plane by
iterations until you get it right. A proper data model should be
designed to reflect the business components and possible
relationships.
Here are three rules for modeling data in a data warehouse:

1) Model the data quickly
2) Normalize the detail data
3) Use a dimensional model for data marts.

3rd Normal Form requires that each column in a table be
directly related to the primary key, the whole key, and nothing but the
key. Data is placed into tables where it makes the most sense and has
no repeating groups, derived data, or optional columns. This allows
users to ask any question, at any time, on all data within the
enterprise. Users do not have to strive for 3rd Normal Form, but just
normalize the data the best they can. There will be fewer columns in a
table, but a lot more tables overall. This model is easier to maintain,
incredibly flexible, and allows a user to ask any question on any data
at any time.
A Star-Schema model consists of a fact table and a number of
dimension tables. The fact table is a table with a multi-part key. Each
element of the key is itself a foreign key to a single dimension table.
The remaining fields in the fact table are known as facts, and are
numeric, continuously valued, and additive. Facts can be thought of
as measurements taken at the intersection of all of the dimensions.
Dimension attributes are mostly textual, and are almost always the
source of constraints and report breaks.
This model enhances
performance on known queries, or in other words, queries users run
repeatedly day after day.
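
A small Star-Schema can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are invented for illustration, and SQLite merely stands in for a real warehouse engine. The fact table has a three-part key, each part is a foreign key to one dimension table, and the single fact (amount) is numeric and additive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE day_dim     (day_id     INTEGER PRIMARY KEY, calendar_date TEXT);

-- Fact table: a multi-part key in which every part points at one dimension.
CREATE TABLE sales_fact (
    store_id   INTEGER REFERENCES store_dim(store_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    day_id     INTEGER REFERENCES day_dim(day_id),
    amount     REAL,  -- the additive fact
    PRIMARY KEY (store_id, product_id, day_id)
);

INSERT INTO store_dim   VALUES (1, 'Dayton'), (2, 'San Diego');
INSERT INTO product_dim VALUES (10, 'Shirts'), (20, 'Shoes');
INSERT INTO day_dim     VALUES (1, '2001-01-01'), (2, '2001-01-02');
INSERT INTO sales_fact  VALUES
    (1, 10, 1, 25.0),
    (1, 20, 1, 60.0),
    (2, 10, 1, 30.0),
    (2, 10, 2, 15.0);
""")


def sales_by_city(category):
    """Constrain on a dimension attribute and break the report by another."""
    return conn.execute(
        """
        SELECT s.city, SUM(f.amount)
        FROM sales_fact f
        JOIN store_dim s   ON s.store_id = f.store_id
        JOIN product_dim p ON p.product_id = f.product_id
        WHERE p.category = ?
        GROUP BY s.city
        ORDER BY s.city
        """,
        (category,),
    ).fetchall()


print(sales_by_city("Shirts"))
```

The constraint (category) and the report break (city) both come from dimension tables, while the number being summed comes from the fact table, which is the pattern the paragraph above describes.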
Most database modelers prefer to create a logical model in 3rd
Normal Form, but most database engines are overcome by physical
limitations, so they must compromise the model. The four most
difficult functions for a database to handle are:

Join tables
Aggregate data
Sort data
Scan large volumes of data.

In order to get around these system limitations, vendors will suggest a
model to avoid joins, use summarized data to avoid aggregation, store
data in sorted order to avoid sorts, and overuse indexes to avoid large
scans. With these limitations, vendors are also going to avoid being
able to compete! That is like placing a ball and chain around the
runner's leg and saying, "I wish you all the best in the marathon!"
Come on! Whose side are these vendors really on?
Teradata is the only database engine I have seen that has the power
and maturity to use a 3rd Normal Form physical model on databases
exceeding a terabyte in size. Because of the physical limitations, other
databases have had to use a Star-Schema model to enhance
performance, but have given up on the ability to perform ad-hoc
queries and data mining.
A normalized model is one that should be used for the central data
warehouse. It allows users to ask any question, at any time, on
information from any place within the enterprise. This is the central
philosophy of a data warehouse. It leads to the power of ad-hoc
queries and data mining, whereby advanced tools discover
relationships that are not easily detected, but do exist naturally in the
business environment.
A Star-Schema model enhances performance on known queries
because we build our assumptions into the model.
While these
assumptions may be correct for the first application, they may not be
correct for others. Flexibility is a big issue, but data marts can be
dropped and added with relative ease if each is built directly from the
detail data.
Remember, build the data warehouse around detail data using a
normalized model. Then, as query patterns emerge and performance
for well-known queries becomes a priority, Star Schema data marts
can be created by extracting summarized or departmental data from
the centralized data warehouse. The user will then have access to
both the data marts for repetitive queries, and the central warehouse
for other queries.
Because data marts can be an administrative nightmare, Teradata
enables Star-Schema access without requiring physical data marts.
By setting up a join index as the intersection of your Star-Schema
model, you can create a Star-Schema structure directly from your
3rd Normal Form data model. Best of all, once it is created, the data
is automatically maintained as the underlying tables are updated.
Keep in mind, 80% of data warehouse queries are repetitive, but 80%
of the Return On Investment (ROI) is actually provided by the other
20% of the queries that go against detailed data in an iterative
environment. By using a normalized model for your central data
warehouse and a Star-Schema model on data marts, you can
enhance the possibility of realizing an 80% Return on Investment and
still enhance the performance on 80% of your queries.

Rule # 8 Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic

"Experience is a hard teacher
because she gives the test first,
the lesson afterwards."
Scottish Proverb
Did you know that 3/4ths of the people in the world hate fractions, and
that 40% of data warehouse failures are caused by a technical
issue? There are many traps and pitfalls in every data warehouse
venture. One winter day a hunter met a bear in the forest. The bear
said, "I'm hungry. I want a full stomach." The man replied, "Well, I'm
cold. I would like a fur coat." "Let's compromise," said the bear, and
he quickly gobbled up the hunter. They both got what they asked for.
The bear went away with a full belly and the man left wrapped in a fur
coat. With that in mind, good judgment comes from experience;
experience comes from bad judgment.
You have shown good
judgment by reading this book; so let our experience keep your
company from having a bad data warehouse experience.
Author Daniel Boorstin wrote in The Discoverers, "The greatest obstacle
to discovering the shape of the earth, the continents, and the oceans
was not ignorance, but rather the illusion of knowledge." There is a lot
of illusion of knowledge being spread around in the data-warehousing
environment. Before you decide on any data warehouse
product, ask yourself, and the vendor, these questions:

As my data demands increase, will the system be able to physically
load the data? Our experience shows that many systems are not
capable of handling very large volumes of data. Do the math!

As the data grows in volume, can the system meet the performance
requirements? Do the math!

As the number of users grows, will the system be able to scale? Do
the math!

As my environment changes, will the system be flexible enough to
allow changes quickly and easily? Do the math!

Will the system need so many Database Administrators (DBAs) that
my system's cost skyrockets? Do the math!

If we suddenly merged with another company and needed to
incorporate into their mainframe or LAN environment, would the
system be able to connect and include them? Do the math!

Can I continue to meet my batch window timeframes? Do the math!

Could I become the hero of the company one day, only to have
some technical glitch blamed on me because of my poor foresight
and be thrown out of the company into a giant mud puddle? Do the
bath!

Rule # 9 Take a Building Block Approach

Be not afraid of growing slowly; be afraid
only of standing still.
Chinese Proverb
Ever since Vasco de Balboa discovered the Pacific coast of Panama in
1513, kings and businessmen alike dreamed of the impossible: to cut
a waterway across the mountainous isthmus, creating a shortcut
between the Atlantic and Pacific Oceans. Those dreams turned into
reality during the Industrial Revolution. It took almost forty years of
trial and error before the world's greatest engineering feat since the
Pyramids was completed in 1914. Ships move through the locks of the
canal, rising 85' above sea level before they descend to the opposite
side.
Since its grand opening in 1920, the Panama Canal has
revolutionized trans-oceanic traffic joining East and West. Its 50-mile
stretch saves every vessel about 8,000 extra nautical miles of travel
around the bottom tip of South America. Several modifications have
been engineered through the years to accommodate the increasing
size of ships.
Data warehouses, like the Panama Canal, must be built over time and
changed over time to meet new demands. A data warehouse must
grow with the environment, but the environment is unpredictable. All
sailors know that they can't direct the wind, but that they can adjust
their sails. In comparison, all data warehouse users know they can't
direct the environment, but they can adjust their warehouse.
Sometimes the data warehouse will grow quickly and sometimes it will
grow slowly, but it should always be growing.
So, take a building block approach to data warehousing. Teradata
allows you to expand without boundaries - one building block at a
time. Plus, adding on building blocks is easy.
There are two aspects to a building block approach. First, you need to
add applications to your data warehouse in three- to six-month
intervals. Once the first application works, then you are ready for
more projects. As you become more experienced with this approach,
you can add multiple projects in parallel by involving multiple
organizations.
The second aspect of the building block approach is in the actual data
warehouse architecture. It doesn't matter if yours is the smallest data
warehouse in the world, the largest, or falls somewhere in between,
power and scalability always fuel success.
Not long ago a customer flew out to San Diego for a Teradata
demonstration and benchmark. The benchmark ran late into the
evening, but the numbers were more than 50% better than the
competition. The customer was extremely impressed, but before
buying he demanded to see the system scalability that everyone had
been talking about. Although it was already late, a Teradata employee
was called in the middle of the night, arrived within 10 minutes (in
pajamas), hooked up the building blocks, and ran a utility called
config. She ran another called reconfig, and in less than two hours
the system size doubled.
As the environment changes in terms of users, data, complexity,
capacity, batch windows, time changes, events, or opportunities, users
should be able to continue building applications and architecture. The
more a Teradata system grows, the more Teradata outshines the
competition.

Rule # 10 Buy a Teradata Data Warehouse

Men occasionally stumble over the truth,
but most of them pick themselves up and
hurry off as if nothing had happened.
Winston Churchill
Winston Churchill led Britain through World War II, during what he
called that country's finest hour. When users see consistent data, the
system, too, is in its finest hour. Teradata gives users the ability to
ask questions they could never ask before. Users trust Teradata
because of its industry performance and reputation, and because it
never gives in. Constant use gives users optimal business
experience, and no matter what a user asks, the system responds with
a hearty, "Yes, Sir!"
When we explained Teradata to Churchill he said,

"A data WARe-house that consists of 250 Data marts is like
poison; and if I were the MIS department responsible for
maintaining them, I'd take it."
Teradata guarantees an Enterprise Data Warehouse with no scalability
issues. Data loads like lightning and system administration is a
breeze.
You can pick the performance level that meets your
requirements for today and forever. The database can be normalized
around detail data, and because of Teradata's power, users have the
flexibility to ask any question, at any time, on any data.
All other databases are suspect in data loading capabilities, scalability,
reference sites, decades of data warehouse experience, flexibility,
ease of system administration, and the ability to handle the complex
queries of today's users. These users are good!

Teradata: The Shining Star


Teradata has always been at the top of the data warehouse game,
even if the experts weren't bright enough to know it. The incredible
vision that the original designers had was tremendous. It was so far
to the left of genius that most thought the idea was impossible.

Only he who attempts the ridiculous may
achieve the impossible.
Don Quixote
The Teradata database was originally designed in 1976, and many of
the fundamental concepts still remain today. Nearly 25 years later,
Teradata is still considered ahead of its time.
In 1976, IBM mainframes dominated the computer business.
Everyone who was anyone had an IBM Mainframe. However, the
original founders of Teradata noticed that it took about 4 years for
IBM to produce a new mainframe. They also noticed a little company
called Intel. Intel created a new PC chip every 2 years. With
mainframes moving forward every 4 years and PC chips every 2
years, Teradata recognized its vision: to network enough PC chips
together that the mainframe would be overpowered, yet costs would
be hundreds of times lower than those of a mainframe. The Teradata team
estimated the power surge would come in 1990.
IBM laughed out loud! They said, "Let's get this straight... you are
going to network a bunch of PC chips together and overpower our
mainframes? That's like plowing a field with 1,000 chickens!" In
fact, IBM salespeople are still trying to dismiss Teradata as "just a
bunch of PCs in a cabinet."
Teradata was convinced it could produce a product that would power
large amounts of data and achieve the impossible: using PC
technology in mainframe territory. Its founders agreed with Napoleon
Bonaparte who asserted, "The word impossible is not in my
dictionary!" Sure enough, when we looked in his dictionary, that word
was not there. And it is not in Teradata's Data Dictionary, either! The
Teradata team set two goals: build a database that could

Perform parallel processing; and

Accommodate a Terabyte of data

Driving in the car one evening, Morgan's eight-year-old daughter Kara
piped up from the back seat, "Daddy, can you buy Teradata in the
store? I mean, what does Teradata really do?" Morgan thought for a
moment and then replied, "Do you remember when you went on the
Easter egg hunt last spring? Well, imagine that we had fifty eggs and
you were the only child there. If I asked you to find all the purple
eggs, would you be able to do that?" Kara said, "Sure! But it might
take me a long time." Morgan continued, "What if we now let fifty
children go in and I asked them to show me all of the purple eggs.
How long would that take?" His daughter responded, "It wouldn't take
any time at all because each child would only have to look at one egg."
That is precisely how Teradata works. It divides up huge tasks among
its processors and tackles each portion simultaneously, with amazing
speed. And it doesn't matter if you have a trillion eggs in your basket!
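
The egg hunt can be sketched in a few lines of Python. Everything here is illustrative: the egg colors are made up, and the threads only model the division of work among children. Python threads do not make this tiny job faster, and this is not how Teradata is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical basket: fifty eggs, every fifth one purple.
EGGS = ["purple" if i % 5 == 0 else "yellow" for i in range(50)]


def split_evenly(items, workers):
    """Deal the items out round-robin, like handing eggs to children."""
    return [items[w::workers] for w in range(workers)]


def count_purple(basket):
    """One child checks every egg in his or her own basket."""
    return sum(1 for egg in basket if egg == "purple")


def parallel_purple_count(items, workers):
    """Return (total purple eggs, most eggs any single child had to check)."""
    baskets = split_evenly(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_purple, baskets))
    return sum(partial_counts), max(len(basket) for basket in baskets)


print(parallel_purple_count(EGGS, 1))   # one child checks all fifty eggs
print(parallel_purple_count(EGGS, 50))  # fifty children check one egg each
```

The answer is the same either way; what changes is how many eggs the busiest child must look at, and that is what determines how long the hunt takes.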
In 1984, the DBC/1012 was introduced. Since then, Teradata has
been the dominant force in data warehousing. Teradata got the
chickens plowing, and is considered outstanding. Meanwhile, IBM's
plow is out rusting in its field.

Parallel Processing

"An invasion of armies can be resisted, but
not an idea whose time has come."
Victor Hugo
Parallel processing gives Teradata the ability to have
unlimited users, unlimited power, and unlimited scalability. This is an
idea whose time has come. So what is parallel processing? Let us explain:
It was 10 p.m. on a Saturday night and two friends were having dinner
and drinks. One of the friends looked at his watch and said, "I have to
get going." The other friend responded, "What's the hurry?" His
friend went on to tell him that he had to leave to do his laundry at the
Laundromat. The other friend could not believe his ears. He
responded, "What?! You're leaving to do your laundry on a Saturday
night?! Do it tomorrow!" His buddy went on to explain that there
were only 10 washing machines at the laundry. "If I wait until
tomorrow, it will be crowded and I will be lucky to get one washing
machine. I have 10 loads of laundry, so I will be there all day. If I go
now, there will be nobody there, and I can do all 10 loads at the same
time. I'll be done in less than an hour and a half."
This story describes what we call Parallel Processing. Teradata is the
only database in the world that loads data, backs up data, and
processes data in parallel. Teradata was born to be parallel, and
instead of allowing just 10 loads of wash to be done simultaneously,
Teradata allows for hundreds, even thousands, of loads to be done
simultaneously. Teradata users may not be washing clothes, but this
is the technology that has been cleaning every other database's clock in
performance tests.
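
The laundry arithmetic is worth spelling out. In this sketch the 90 minutes per load is an assumption taken from the friend's "less than an hour and a half", and the function name is ours.

```python
import math

MINUTES_PER_LOAD = 90  # assumed upper bound for one full load


def laundry_minutes(loads, machines, minutes_per_load=MINUTES_PER_LOAD):
    """Wall-clock time when loads run in batches of `machines` at a time."""
    batches = math.ceil(loads / machines)
    return batches * minutes_per_load


crowded_sunday = laundry_minutes(loads=10, machines=1)    # one machine free
empty_saturday = laundry_minutes(loads=10, machines=10)   # all ten machines free
print(crowded_sunday / 60, "hours versus", empty_saturday / 60, "hours")
```

The total amount of washing is identical in both cases; only the number of loads running at once changes, and with it the elapsed time.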

Tera-Tom Parallel Processing Laundry Mat

Only one customer allowed at a time

After enlightenment, the laundry
Zen Proverb

After parallel processing the laundry,
enlightenment!
Teradata Zen Proverb
With the computer world seeing Terabytes of data, hundreds to
thousands of users are asking a wide variety of complex questions,
and need instantaneous access to data.
In short, this is the
technology needed in a data warehouse environment. What we find
most fascinating is that Teradata has unlimited power, and grows
without boundaries, and was born out of the PC (personal computer)
world by people with vision.

Components of a Personal Computer

A ship in harbor is safe, but that's not why
ships are built.
John Shedd
In 1805 the pivotal Battle of Trafalgar matched Britain's fleet of
battleships against the combined fleets of France and Spain. Spain had huge
battleships, some having four tiered decks of cannons. But Britain's
Admiral Horatio Nelson used two lines of ships to sail circles around
the enemy fleet, attacking its ships at their most vulnerable point, the stern.
That battle paralyzed the combined fleet and turned the world of naval
warfare upside down. Teradata stunned the data-warehousing world
by taking personal computer technology right into the mighty,
mainframe-dominated environment and beating them on their own
turf. Armed with a lightweight technology built on Intel processor
chips, memory, a hard drive, and an operating system, Teradata
achieved the unthinkable: lightning-fast processing speed managing
terabytes of data.
A Personal Computer (PC) is made up of the following components:
Processor Chip - This is the brain of the computer. All tasks are
done at the direction of the processor.

Memory - This is the hand of the computer. The memory allows data
to be viewed, manipulated, changed, or altered. Data is brought in
from the hard drive and the processor works with the data in memory.
Once changes are made in memory, the processor can command that
the information be written back to disk.

Hard Drive - This is the spine of the computer. The hard drive
stores data, applications, and the Operating System inside the PC.
The hard drive, also called the disk drive, holds the contents of the
data for the system on its disk.

For example, suppose you made three new good friends this month
and want to add their names to your list. Opening that document
brings it up from the hard drive and displays it on your screen. As you
type in the new names, the processor applies your changes to the
copy of the document held in memory. Upon
completion, you close the document and the processor writes all the
changes to the disk where it is stored.
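
The same disk-to-memory-to-disk round trip can be sketched in Python. The file name best_friends.txt, the temporary directory, and the split of five old friends plus three new ones are assumptions made for this illustration.

```python
import os
import tempfile


def add_friends(path, new_names):
    """Read the list from disk, change it in memory, write it back to disk."""
    # Opening the document brings it up from the hard drive into memory.
    with open(path, "r", encoding="utf-8") as f:
        friends = f.read().splitlines()
    # The processor applies the changes to the in-memory copy.
    friends.extend(new_names)
    # Closing the document writes the changes back to the disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(friends) + "\n")
    return friends


# Set up a small "hard drive" file holding the friends made before this month.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "best_friends.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Ben Hon\nJoe Davis\nMary Gray\nJohn Davis\nDon Roy\n")

updated = add_friends(path, ["Sam Mills", "Kyle Marx", "Lyn Jones"])
print(len(updated), "friends now stored on disk")
```

Nothing is permanent until the final write: the three new names live only in memory until the file is written back.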
In the picture below, we see the basic components of a Personal
Computer. Note that its disk also holds a file called Best_Friends,
which lists eight best friends.

[Figure: a Personal Computer with a Processor, Memory, and a Disk holding the BEST FRIENDS file]

BEST FRIENDS
1 Ben Hon
2 Joe Davis
3 Mary Gray
4 John Davis
5 Don Roy
6 Sam Mills
7 Kyle Marx
8 Lyn Jones

Teradata Spreads Data over Multiple Processors

I don't mind starting the season with
unknowns. I just don't like finishing the
season with them.
Coach Lou Holtz
With Teradata you will never finish with any unknowns about your
business; you can know it all! One reason why this assertion is true
can be found by looking at the unique way this database places the
data into the system and processes it. Teradata takes every table in
the system and spreads the data across multiple processors. Each
processor works on its portion of the database in parallel when
requested to do so. This is why we call it parallel processing. In the
previous example, one processor listed eight best friends on its disk.
In that case, Teradata would read eight rows.
The Teradata example on the next page shows two processors, each
having direct access to its own physical disk. The Best_Friends table
has been spread out evenly across both processors. When we ask the
system for a list of best friends, both processors will retrieve data in
parallel and will return combined results over the connecting network.
This example could easily run twice as fast as the previous
example.
Even though we still need to read eight records, each processor is only
responsible for reading four records and simultaneously the other
processor reads the remaining four records. So, how could we double
the speed of this system again?
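
The two-processor example can be sketched as follows. The contiguous split (rows 1 to 4 on one disk, rows 5 to 8 on the other) mirrors the picture; how Teradata actually decides which processor owns a row is not covered in this section, so the spread function is only an assumption. The threads stand in for the two processors.

```python
from concurrent.futures import ThreadPoolExecutor

BEST_FRIENDS = [
    (1, "Ben Hon"), (2, "Joe Davis"), (3, "Mary Gray"), (4, "John Davis"),
    (5, "Don Roy"), (6, "Sam Mills"), (7, "Kyle Marx"), (8, "Lyn Jones"),
]


def spread(rows, processors):
    """Give each processor an equal, contiguous share of the rows."""
    size = -(-len(rows) // processors)  # ceiling division
    return [rows[i * size:(i + 1) * size] for i in range(processors)]


def read_disk(disk_rows):
    """Stand-in for one processor reading every row on its own disk."""
    return list(disk_rows)


def list_best_friends(rows, processors):
    """Return the combined result and the number of rows each processor read."""
    disks = spread(rows, processors)
    with ThreadPoolExecutor(max_workers=processors) as pool:
        partial_results = list(pool.map(read_disk, disks))
    combined = [row for part in partial_results for row in part]
    return combined, [len(disk) for disk in disks]


combined, rows_read_each = list_best_friends(BEST_FRIENDS, 2)
print(rows_read_each)
```

All eight rows still come back, but no single processor had to read more than four of them.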

[Figure: two Processors joined by a Network, each with its own Memory and its own disk]

BEST FRIENDS (first processor)
1 Ben Hon
2 Joe Davis
3 Mary Gray
4 John Davis

BEST FRIENDS (second processor)
5 Don Roy
6 Sam Mills
7 Kyle Marx
8 Lyn Jones

Teradata has Linear Scalability

Every ceiling, when reached becomes a
floor upon which one walks and now can see
a new ceiling.
Tom Stoppard
There is no ceiling on the Teradata database's ability to grow. Any
time you want to double the speed, simply double the number of
processors. This is called Linear Scalability. This allows unlimited
growth with minimal effects on response time. Each time a new
processor is added in Teradata, a new storage disk is also added. By
doing so, the system can continually grow, and there are no worries
about the disk becoming the bottleneck of data.
Notice in the system below there are four processors, and that each is
assigned two rows of data. When we ask for our Best_Friends, the
system will read all eight rows. Since data is spread evenly over four
processors, each of the four processors reads its two rows at the same
time. Now, the system is four times faster.
Most data warehouses have tables that hold millions, even billions of
rows. Teradata allows you to decide how many processors are needed
to get the desired response time. This is called the Divide and
Conquer theory. To accommodate desired response rates, some
customers have thousands of processors. Tasks are divided up
among the processors, which Teradata calls AMPs, and processed in parallel.
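
The Divide and Conquer arithmetic can be sketched directly. The helper names are ours, the round-robin placement simply reproduces the four-processor picture, and the speedup is idealized because it ignores the cost of coordinating the processors.

```python
def place_rows(row_ids, processors):
    """Deal row ids out to the processors in turn."""
    return [row_ids[p::processors] for p in range(processors)]


def rows_per_processor(total_rows, processors):
    """Largest number of rows any one processor must read."""
    return -(-total_rows // processors)  # ceiling division


def speedup(total_rows, processors):
    """Idealized speedup compared with a single processor."""
    return total_rows / rows_per_processor(total_rows, processors)


def processors_needed(total_rows, max_rows_each):
    """Processors required so that none reads more than max_rows_each rows."""
    return -(-total_rows // max_rows_each)


print(place_rows(list(range(1, 9)), 4))
for n in (1, 2, 4):
    print(n, "processors:", rows_per_processor(8, n), "rows each,",
          speedup(8, n), "times faster")
print(processors_needed(1_000_000_000, 1_000_000), "processors for a billion rows")
```

Turning the question around, as processors_needed does, is how a response-time target becomes a processor count: decide how many rows one processor can read in the time allowed, then divide.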

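[Figure: four processors, each with its own disk holding two rows of BEST FRIENDS]

BEST FRIENDS (first processor)
1 Ben Hon
5 Don Roy

BEST FRIENDS (second processor)
2 Joe Davis
6 Sam Mills

BEST FRIENDS (third processor)
3 Mary Gray
7 Kyle Marx

BEST FRIENDS (fourth processor)
4 John Davis
8 Lyn Jones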

A Logical View of the Teradata Architecture

You are either making history
or you are history.
Leonard Sweet
A frustrated choral director was preparing for a concert, then suddenly
stopped and said, "I've got to tell you, eight years ago I was directing
another choir in this anthem, and they made the same mistake you're
making." He continued, "Do any of you have a clue as to what the
mistake is?" Just then a voice from the choir called out, "Same
director!"
Many data warehouse environments have an architecture that is not
designed for Decision Support, yet company officials wonder why their
data warehouse failed, when they actually never had a chance to
succeed. In ancient days, Solomon wrote, "Where there is no vision,
the people perish." It is no different today. Company leaders must
cast a new vision that enables Decision Support with technology that
can handle it, or their companies, too, will be history.
The following picture shows a logical view of Teradata. The illustration
shows a proper architecture for a data warehouse. In the example, a
user logs on to Teradata from a LAN or mainframe host, and then is
given a session with a Parsing Engine processor (PE). The user then
asks a specific query using SQL.
The PE checks SQL syntax, then checks to see if the user has proper
rights (authority) to access the table. Next, the PE creates a plan for
the Access Module Processors (AMPs) to execute. The PE passes the
plan to the AMPs over the BYNET. The AMPs retrieve the requested data
from their disks, then pass it to the PE over the BYNET. The PE then
passes the data back to the user.

[Figure: logical view of Teradata. A Parsing Engine (PE) connects over
the BYNET network to eight AMPs.]

Parsing Engine (PE)

"Even a stopped clock is right twice a day."


Polish Proverb
A man and his son were riding a bicycle built for two when they came
to a steep hill. It took a great deal of struggle for them to complete
what proved to be a very steep climb. When they got to the top, the
father in front said, "Boy, that sure was a hard climb!" His son in the
back responded, "Yes it was, Dad. And if I hadn't kept the brakes on
all the way we would have rolled down backwards." Teradata has an
ingenious way to keep this type of situation from happening inside the
data warehouse. Most databases make educated guesses about the
best way to retrieve data. The Teradata PE, or "Optimizer," has both
the experience and design to KNOW the best way to retrieve data.
When users log on to Teradata, they are connecting to a Parsing
Engine (PE). When a user submits a query, the PE takes action.
The PE creates a PLAN that tells the AMPs exactly what to do in order
to get the data. The PE knows how many AMPs are in the system,
how many rows are in the table, and the best way to get to the data.
Teradata's PE has been continually enhanced since 1984. It has such
a great reputation for speeding up data access that it has earned the
name "The OPTIMIZER."
The PE loves to serve valid Teradata users, but it was raised like a
guard dog. A good guard dog loves its family, but it barks and may
bite when strangers approach. The PE will always check a user's security
(access) rights to ensure the user has the proper authority to obtain
the information that is being requested. If the user has authority, the
PE instructs the AMPs to get the data. If the user doesn't have proper
access rights, the query is rejected.
The PE doesn't like to brag, but it did graduate at the top of its class.
Customers like Wal-Mart, Anthem Blue Cross and Blue Shield, Bank of
America, AT&T, and SouthWestern Bell have continually pushed the
data warehouse envelope. This has given the PE years of experience
in guiding AMPs to answer complex questions, some of which have
never been asked before in their respective industries. This
experience allows users to ask any question regardless of its
complexity. The PE isn't called "The Optimizer" for nothing. It needs
no tuning by a Database Administrator (DBA) or hints from the user.
Teradata users ask the questions, and Teradata returns the answers.

Access Module Processor (AMP)

"Wise men talk because they have something to say;
fools talk because they have to say something."
Plato
Two men decided to go ice fishing. They found a good spot on some
ice and began digging. As soon as they finished the hole, they heard a
voice from above saying, "There are no fish here." Taking that as a
sign, they moved about thirty feet and began digging again. A second
time they heard the voice saying, "There are no fish here." So they
moved another thirty feet and began to dig a third hole. This time the
impatient voice spoke from above, "There are no fish here in this ice
skating rink!" Some people just don't listen. But this is never the
case with Teradata's Access Module Processors.
The Access Module Processor (AMP) is a processor of few words. It
keeps its mouth shut and its ears open. Each AMP listens to the PE
via the BYNET network for instructions. Each AMP retrieves data from
its disk or writes data to its disk. The AMP is the worker bee of the
system. It is the perfect employee. It never complains, rarely calls in
sick, and lives to take direction from its boss, the Parsing Engine (PE).
The best way to picture each AMP is as a computer processor
attached to its own disk.
Every AMP has its own disk, and it's the only AMP allowed to read or
write data to that disk. This design is referred to as a Shared-Nothing
architecture. Although AMPs are the perfect workers, they
are not the perfect playmates. Even as children, AMPs would never
share toys with other AMPs on the playground. Each AMP has its own
disk, and it shares this with no other AMP; hence a Shared-Nothing
architecture.

Teradata spreads the rows of a table evenly across all AMPs in the
system. When the PE asks the AMPs to get the data, each AMP will
read only the rows on its own disk. Because this is done
simultaneously, all AMPs should finish at about the same time. As a
matter of fact, when we explained this philosophy to Confucius he
stated, "A query is only as fast as the slowest AMP." Confucius,
however, did say not to quote him!
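The "slowest AMP" idea can be shown with a tiny Python sketch. It
assumes, purely for illustration, that every row takes the same time
to read and that all AMPs start together; the function name query_time
and the one-second-per-row figure are made up.

```python
def query_time(rows_per_amp, seconds_per_row=1.0):
    """All AMPs read in parallel, so the query lasts as long as the busiest AMP."""
    return max(rows_per_amp) * seconds_per_row


even = [2, 2, 2, 2]    # eight rows spread evenly over four AMPs
skewed = [5, 1, 1, 1]  # the same eight rows, spread unevenly

print(query_time(even))    # 2.0
print(query_time(skewed))  # 5.0
```

Both layouts hold eight rows, but the uneven one takes as long as its
most heavily loaded AMP, which is why an even spread matters.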
Again, an AMP's job is to read and write data to its disk. The AMP
takes its direction from the Parsing Engine (PE). The number of AMPs
varies per system. Today, some Teradata systems have just four
AMPs, while others have more than 2,000!

The BYNET

"Even if you're on the right track, you'll still get run over if you
just sit there."
Will Rogers
The BYNET ensures communication between AMPs and PEs is on the
right track and that it happens rapidly. When communication between
AMPs and PEs is necessary, the BYNET operates as a communication
superhighway.
There are always two BYNETs per system. They are called BYNET 0
and BYNET 1. The duplication is insurance in case one BYNET fails,
and it also enhances performance. As an example, think of two
BYNETs as two telephone lines in your home. AMPs and PEs can talk
to one another over either BYNET, or over both.
Morgan Jones, co-author, has been talking to his four-year-old son,
David, about AMPs, PEs, and the BYNET. Little David asked, "Daddy,
what happens when the AMPs and PEs get lonely?" Morgan replied,
"They talk to each other over the BYNET."
Here are the steps that outline exactly how the AMPs, PEs, and BYNETs
work together: A user performs a LOGON to Teradata. A PE is
assigned to manage all SQL for that particular user. The user then
asks Teradata a question. Next,

The PE checks the user's SQL syntax;

The PE checks the user's security rights;

The PE comes up with a plan for the AMPs to follow;

The PE passes the plan along to the AMPs over the BYNET;

The AMPs follow the plan and retrieve the data requested;

The AMPs pass the data to the PE over the BYNET; and

The PE then passes the final data to the user.
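The steps above can be sketched as a toy Python model. Everything here
is hypothetical and greatly simplified: the user name, the access
rights table, the crude syntax check, and the function names are ours,
and the BYNET is reduced to an ordinary function call.

```python
# Four AMP "disks", each holding two Best Friends rows.
AMP_DISKS = [
    [(1, "Ben Hon"), (5, "Don Roy")],
    [(2, "Joe Davis"), (6, "Sam Mills")],
    [(3, "Mary Gray"), (7, "Kyle Marx")],
    [(4, "John Davis"), (8, "Lyn Jones")],
]
ACCESS_RIGHTS = {"tom": {"Best_Friends"}}  # hypothetical user and rights


def amp_execute(plan, disk):
    """Each AMP follows the plan and reads only the rows on its own disk."""
    if plan["action"] == "read_all":
        return list(disk)
    return []


def parsing_engine(user, sql):
    # Step 1: check the SQL syntax (a deliberately crude check).
    words = sql.rstrip(";").split()
    if len(words) != 4 or words[0].upper() != "SELECT" or words[2].upper() != "FROM":
        raise ValueError("syntax error")
    table = words[3]
    # Step 2: check the user's security rights.
    if table not in ACCESS_RIGHTS.get(user, set()):
        raise PermissionError("access denied")
    # Step 3: come up with a plan for the AMPs to follow.
    plan = {"action": "read_all", "table": table}
    # Steps 4-6: pass the plan to every AMP and collect what each returns.
    answer = []
    for disk in AMP_DISKS:
        answer.extend(amp_execute(plan, disk))
    # Step 7: pass the final data back to the user.
    return answer


rows = parsing_engine("tom", "SELECT * FROM Best_Friends;")
print(len(rows), "rows returned")
```

A user without rights to the table is rejected at step 2, before any
AMP is asked to do work.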


Teradata Building Block Approach

"Better a diamond with a flaw, than a pebble without one."
Anonymous
Teradata builds its data warehouses in building blocks called nodes.
Each building block is a gem composed of four Intel processors. Each
node is connected flawlessly to other nodes through two BYNETs. The
AMPs and PEs reside inside the node's memory. Each node is
connected to a disk array where each AMP has direct access to one
virtual disk.
Below is a picture of a Teradata node. It has four Intel processors,
and the AMPs and PEs reside in memory. Each AMP is directly
attached to its one virtual disk.

[Figure: a single node containing Intel processors and memory, with the
AMPs and PEs residing in memory and the node attached to its virtual
disks.]

The following picture shows two nodes connected together over the
BYNETs.

[Figure: Node 1 and Node 2, each with Intel processors, memory holding
AMPs and PEs, and its own virtual disks, connected to each other over
the BYNET.]

Teradata Tables

"Nearly everyone takes the limits of his own vision for the limits of
the world. A few do not. Join them."
Arthur Schopenhauer
Do you have one of those notoriously messy junk drawers in your
kitchen? You know the one we're talking about, the one next to the
silverware drawer. This drawer may often contain old washer and
dryer warranties, matches, half-used flashlight batteries, straws, odd
nuts, bolts and washers, corncob holders, etc. Fortunately, the
dresser drawers in your bedroom are typically much more organized!
In fact, you probably store your clothing in those drawers much more
neatly so you can get to what you need quickly.
Relational databases store data much like we organize our dresser
drawers: just as you might put all of your t-shirts in one drawer and
your socks in another, the database will store data about one topic in
one table and data that pertains to another topic in another
table. For example, a database might contain a CustomerTable
containing items to track such as CustomerID, CustomerName,
CityName, and Order Number. Another table, the OrderTable, might hold
data like Order Number, Order Date, Item No, Quantity, and CustomerID.
An example of each table follows:
CUSTOMER TABLE called CustomerTable

CustomerID (PK)  CustomerName  CityName  Order Number (FK)  Customer Rep
1001             JC Penney     Dallas    105372             Dreyer
1002             Office Depot  Columbia  105799             Crocker
1003             Dillards      Atlanta   106227             Smith

ORDER TABLE called OrderTable

Order Number (PK)  Order Date  Item No  Quantity  CustomerID (FK)
105372             03/07/2001  212      20        1001
105799             04/18/2001  296      52        1002
106227             10/17/2001  325      17        1003

The data stored in the CustomerTable is logically related to the data
stored in the OrderTable. The two tables both have columns called
Order Number. These tables make up an extended family, joined by
the marriage of the columns named Order Number in each table.
Earlier programming languages referred to files, records, and fields.
Relational databases use the terms Tables, Rows, and
Columns. Each Row of a table is made up of one or more fields,
each identified by a column name. A Row is the smallest unit that can
be inserted into a table. A column is the smallest unit within a table
that can be updated, or modified. The data value stored in each
column must match the data type for that column. For example, you
cannot enter the name of a city in a column that is defined as a
decimal data type. Columns that are defined but have no data value
will display a null, sometimes represented by a "?".
One column, or combination of columns, in each table is chosen to be
the Primary Key (PK). This is a logical modeling term. The primary
key contains a unique value for each row, and enforces the uniqueness
of that row. The PK cannot be null, and should contain values that will
not change. In the CustomerTable, the primary key is the CustomerID
column. Each customer has a unique CustomerID. The data in the
columns of every row must be consistent with the unique CustomerID
for that row. The rows in a table need not be stored in any particular
order. This is also called being arbitrary, or an unordered set.
Before the table is defined, the order of the columns is also arbitrary.
It doesn't matter if you place CustomerName before CityName or after
it. However, once the table is created, the order of the columns (i.e.,
the row format for the table) must remain the same. Plus, you cannot
have multiple row formats within a table.
What forms the relationship between the tables in a relational
database? A key that is common to each table forms it. A Foreign
Key (FK) is a key in a table that is a Primary Key (PK) in another
table. The PK and FK relationship allows the two tables to relate to
one another. When you need to display data from more than one
table, you can JOIN the two tables by matching a common key
between the two tables. A great choice is to match the primary key of
one table to the foreign key of the other table. Remember that a table
may have only one PK, but it may have multiple FKs.
Here is a quick reference chart for Primary and Foreign Keys:

PRIMARY KEY                        FOREIGN KEY
Not optional                       Optional
Comprised of one or more columns   Comprised of one or more columns
Can only have one PK per table     Can have multiple FKs per table
No duplicates allowed              Duplicates allowed
No changes allowed                 Changes allowed
No nulls allowed                   Nulls allowed
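To see a PK/FK join in action, here is a small sketch that uses
Python's built-in sqlite3 module. Note the assumptions: SQLite is not
Teradata, the column names are written without spaces, and the
CustomerTable is trimmed to three columns to keep the example short.
The data values come from the sample tables in this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerTable (
    CustomerID   INTEGER PRIMARY KEY,
    CustomerName TEXT NOT NULL,
    CityName     TEXT
);
CREATE TABLE OrderTable (
    OrderNumber INTEGER PRIMARY KEY,
    OrderDate   TEXT,
    ItemNo      INTEGER,
    Quantity    INTEGER,
    CustomerID  INTEGER REFERENCES CustomerTable (CustomerID)
);
INSERT INTO CustomerTable VALUES
    (1001, 'JC Penney', 'Dallas'),
    (1002, 'Office Depot', 'Columbia'),
    (1003, 'Dillards', 'Atlanta');
INSERT INTO OrderTable VALUES
    (105372, '2001-03-07', 212, 20, 1001),
    (105799, '2001-04-18', 296, 52, 1002),
    (106227, '2001-10-17', 325, 17, 1003);
""")

# JOIN by matching the PK of CustomerTable to the FK in OrderTable.
joined = conn.execute("""
    SELECT c.CustomerName, o.OrderNumber, o.Quantity
    FROM CustomerTable AS c
    JOIN OrderTable AS o ON o.CustomerID = c.CustomerID
    ORDER BY o.OrderNumber
""").fetchall()
for row in joined:
    print(row)

# A duplicate primary key value is rejected: no duplicates allowed.
try:
    conn.execute("INSERT INTO CustomerTable VALUES (1001, 'Duplicate', 'Nowhere')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate PK rejected:", duplicate_rejected)
```

The join pairs each order with its customer through CustomerID, and the
failed insert shows the "no duplicates" rule from the chart above.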


Teradata Spreads the Data Evenly Across the AMPs

"A chain is only as strong as its weakest link."
Because Teradata spreads data evenly, no AMP or disk is ever the
weakest link. Teradata is the only database that strings hundreds and
thousands of processors together to achieve awesome processing
power for today's data warehouses. Today, the AMPs (Access
Module Processors) are software processors that reside in memory.
Teradata always attempts to spread data evenly so each AMP will
manage approximately the same amount of data. As a result, the
rows of every table are distributed across all of the AMPs. In other
words, every AMP stores a portion of every table in the database on its
virtual disk (VDISK). If a data warehouse has 200 tables, then each
AMP will hold a portion of 200 tables. This method of data distribution
is unique to Teradata.
There are some significant benefits to handling data this way:
First, when each AMP has nearly the same quantity of table rows, then
no one AMP becomes a data bottleneck. AMPs can all retrieve their
portion of the data in parallel so you do not have AMPs sitting idle
while one or two others are chugging away. Baseball phenomenon
Casey Stengel once said, "It's easy to get good players. Gettin' em to
play together, that's the hard part." AMPs love to work together in
parallel.
Second, each AMP is unaware of any data except its own portion. The
only AMP that can read or write to a particular row of data is the AMP
that actually owns that row. This makes retrieving data from a
particular row very efficient as all AMPs do their own work.
Third, each AMP automatically groups all of its rows by the tables from
which they come. Have you ever been to a large aquarium and seen
one of the displays that look like a very tall, clear cylinder? As you
walk around the glass, the fish tend to swim in schools. Similarly,
Teradata does this with the rows on the AMPs to boost performance.
When you ask for data from any given table, an AMP will immediately
go to that particular group of rows, and then select what you need. It
doesn't need to look through the rows of many tables before it finds
what you need. This is how parallel processing works. The AMPs
retrieve data in parallel, then pass it over the BYNET to the Parsing
Engine (PE), and the PE ensures the data is delivered to the user.
Keep in mind, the BYNET is an internal Teradata network over which
the PEs and the AMPs communicate.
The example below shows the information we have just discussed.
Notice that the system has four AMPs, and three tables: Employee,
Customer, and Order. Notice each AMP holds a portion of the rows
for every table. AMP 1, for example, holds 1/4th of the Employee table
rows, 1/4th of the Customer table rows, and 1/4th of the Order table
rows.
Plus, the data is spread evenly for all tables. If a query asks for all
rows in the Customer table, then each AMP will retrieve its
Customer table rows in parallel with the other AMPs. Each AMP will
then pass its data to the PE via the BYNET. Because the data in the
Customer table is spread evenly among all AMPs, each should finish
reading at about the same time.
Also, notice how each AMP separates each table. Just like schools of
fish, the rows of the Employee table are grouped together. In
addition, the rows of the Customer and Order tables are each grouped
together. This is important in a data warehouse environment because
most queries read millions of rows. Performance is
enhanced when table rows are grouped together and Teradata is
permitted to bring blocks of rows into memory.
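A minimal Python sketch of this layout follows. It is an illustration
only: the row contents are made up, and a simple remainder (key % 4)
stands in for Teradata's real distribution method. Each AMP is modeled
as a dictionary that keeps its rows grouped by table name.

```python
NUM_AMPS = 4

# One dictionary per AMP; inside it, rows are grouped by table name.
amps = [dict() for _ in range(NUM_AMPS)]


def store(table, key, row):
    amp = amps[key % NUM_AMPS]  # simple stand-in for real hashing
    amp.setdefault(table, []).append(row)


for key in range(8):
    store("Employee", key, ("emp", key))
    store("Customer", key, ("cust", key))
    store("Order", key, ("order", key))


def read_table(table):
    # Each AMP goes straight to its own group of rows for that table.
    return [row for amp in amps for row in amp.get(table, [])]


print([len(amp["Customer"]) for amp in amps])  # [2, 2, 2, 2]
print(len(read_table("Customer")))             # 8
```

Every AMP ends up with a share of all three tables, and a request for
Customer rows never has to look through Employee or Order rows.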

[Figure: four AMPs, each holding its own portion of the Employee,
Customer, and Order tables]
AMP 1      AMP 2      AMP 3      AMP 4
Employee   Employee   Employee   Employee
Customer   Customer   Customer   Customer
Order      Order      Order      Order

Primary Indexes

"Every road has two directions."
Russian Proverb
When world-renowned explorer, Dr. David Livingstone, was working in
Africa, a group of friends wrote to him saying, "We would like to send
other men to you. Have you found a good road into your area yet?"
According to a member of his family, Dr. Livingstone sent this
message in response, "If you have men who will only come if there is
a good road, I don't want them. I want men who will come if there is
no road at all."
Although it doesn't have to cut its way through the dense African
jungle, the PRIMARY INDEX (PI) is the trailblazer in Teradata that
paves the way for the rest of the data to follow. The PI is so important
to Teradata functionality that every table in the database is required to
have one. As the quote above states, "Every road has two directions."
The Primary Index is used in two directions:
1. The Primary Index WILL DETERMINE which rows go to
which AMPs; and
2. The Primary Index is ALWAYS the FASTEST RETRIEVAL
method.
If the user doesn't define a PRIMARY INDEX when creating a table, the
system will automatically choose one by default. Once it is defined,
the PI column cannot be dropped or changed. The table would need to
be re-created in order to change the PI.


There are two types of Primary Indexes

"A man who chases two rabbits catches none."
Roman Proverb

"A man who chases two rabbits misses both by a HARE! A person who
chases two Primary Indexes misses both by an ERR!"
Tera-Tom Coffing
Each table may only have one Primary Index, but every table must
have a Primary Index defined. It is either an UPI or a NUPI; in other
words, a Unique Primary Index (UPI) or a Non-Unique Primary Index
(NUPI). The Primary Index is created when the table is created. An
example of creating a Unique Primary Index on the column EMP
follows:
CREATE TABLE employee
(
 emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
UNIQUE PRIMARY INDEX (emp);

An example of creating a Non-Unique Primary Index is listed below.
Notice you never see the prefix NON:
CREATE TABLE TomC.employee
(
 emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
PRIMARY INDEX (dept);

PRIMARY INDEXES may be defined on one column, or on a set of
columns viewed as a composite unit. Up to 16 columns may be
defined as a Primary Index. An example of creating a multi-column
Unique Primary Index follows:
CREATE TABLE employee
(
 emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE
)
UNIQUE PRIMARY INDEX (emp, dept);

"Being related hardly insures relatability."
Michael E. Angier
All of the tables in a Teradata database are related to each other. But
the Primary Key and Primary Index ensure their relatability in
day-to-day use. What is the difference between a PRIMARY KEY and a
PRIMARY INDEX? A Primary Key is a logical term used to label
column(s) that enforce the uniqueness of each row in a table. PKs
determine relationships among tables. A Primary Index is a
physical term used to label column(s) that are used to store and locate
rows of data.

To illustrate, imagine a library. The Primary Key, the logical side, is
like the actual construction of the library. Do you know what part of the
library is reserved for fiction? What about for non-fiction? Plus, where
will the card catalog reside? Once the library is logically correct, it is
ready to receive books. A Primary Key on a table helps to logically
determine what data to track in the table.
The Primary Index is much like a card catalog in the library. Inside
the card catalog drawers are thousands of index cards that provide
the book's title, author, publisher, and the Dewey Decimal number.
By taking that index card, you can immediately find where that book
is shelved within the library. The Primary Index column value for a
Teradata table tells where the row should reside. It's also the fastest
mechanism to retrieve data.
Teradata uses the Primary Index to distribute each table's rows to the
proper AMPs. Teradata also uses the Primary Index to retrieve rows at
lightning speed.
Exactly how does Teradata accomplish this? Well, I'm glad
you asked! Let's look at the HASH MAP next:


The Hash Map

"The map is not the territory."
Alfred Korzybski
The first map of all the known lands in the world has been attributed
to the Greek philosopher Anaximander of Miletus (610-ca. 546 BC). He
may have been the first person to attempt such a map, although
others had drawn local maps before. The "Hash Map" was created by
a group of individuals so Teradata could maximize its parallel
processing roots. It's the hash map that tells which AMP holds a
particular row. It does not contain any data rows; it just shows where
to find them. Overall, the idea of the hash map is to spread the data
as equally as possible.
Once a travel agent received a call from a man asking, "Is it possible
to see England from Canada?" The agent said, "No." The man replied,
"But they look so close on the map!"
Teradata uses a map called the HASH MAP in combination with the
PRIMARY INDEX to distribute data rows. The HASH MAP is not a
two-dimensional array, although it appears that way in diagrams. It is
more like a honeycomb with myriad buckets. But, while the
honeycomb holds honey in its buckets, the HASH MAP buckets contain
just one thing: the number of an AMP. All AMPs and PEs use the
very same HASH MAP.
The picture on the following page shows the hash map for a four-AMP
system. This is shown for simulation purposes. The actual hash map
has 65,536 buckets. On the diagram, notice that inside each bucket is
an AMP number, and that AMP number goes 1, 2, 3, 4, then starts
over again. Why? It's because this is the hash map for a four-AMP
system.
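As a rough sketch, the simplified hash map in the diagrams can be
modeled in Python as a list of buckets, each holding one AMP number.
The round-robin fill below mirrors the diagrams only; how a real system
assigns buckets to AMPs is internal to Teradata, and the function name
build_hash_map is our own.

```python
def build_hash_map(num_amps, num_buckets=65536):
    """Return a list where entry i holds the AMP number for bucket i + 1."""
    return [(bucket % num_amps) + 1 for bucket in range(num_buckets)]


# A small 48-bucket map for display, like the diagram for four AMPs.
four_amp_map = build_hash_map(4, num_buckets=48)
print(four_amp_map[:8])  # [1, 2, 3, 4, 1, 2, 3, 4]

# With the full 65,536 buckets, every AMP owns the same number of buckets.
full_map = build_hash_map(4)
print(full_map.count(1), full_map.count(4))  # 16384 16384
```

Because each AMP number appears equally often, rows that land in
buckets evenly also land on AMPs evenly.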

Hash Map (four-AMP system; buckets read left to right, top to bottom)
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4

The next diagram shows the hash map for an eight-AMP system. As
before, this is for simulation purposes. Notice that the AMP number
for this hash map goes 1, 2, 3, 4, 5, 6, 7, 8, and then starts over
again. Why? Because this hash map is for an eight-AMP system.

[Figure: hash map for an eight-AMP system. The AMP numbers in the
buckets run 1, 2, 3, 4, 5, 6, 7, 8 and then repeat.]

How the Hash Map and Primary Index Work Together

"Choice, not chance, determines destiny."
Anonymous
The choice made for the Primary Index determines the exact AMP
destination for each row in a table. It must not be left up to chance!
Here is how the Hash Map and Primary Index work together: When a
table is being loaded with data, the rows will be spread among all
AMPs. The Hash Map determines the actual DESTINATION AMP for
each row of the table.
Destination is determined using the Whiz-Bang Formula (a secret NCR
formula). First, we'll explain the theory, and then we will invent our
own Whiz-Bang Formula to show you how it works conceptually.
Let's start with a table to load on our four-AMP system. Imagine you
have listed your eight best friends in a table called Best_Friends.
You have two columns in the table. They are titled Friend_Num
and Friend_Name. We've chosen only even numbers for Friend_Num
because our friends are so even tempered. We have also made
Friend_Num a Unique Primary Index (UPI) on the table.

Best_Friends Table
Friend_Num  Friend_Name
2           Ben Hon
4           Joe Davis
6           Mary Gray
8           John Davis
10          Don Roy
12          Sam Mills
14          Kyle Marx
16          Lyn Jones


For this example, Teradata will attempt to spread the table rows
among the four-AMP system. A picture of the four-AMP configuration
follows:

[Figure: a PE connected over the BYNET network to four AMPs, each with
its own portion of the Best Friends table.]

Since there is a four-AMP configuration, the system will use a four-AMP
hash map. Here is an illustration:

1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4

Instead of trying to figure out the NCR Whiz-Bang formula (a secret),
we can show you the theory of distributing data and retrieving data
with our own formula. It is called the:
Coffing/Jones Whiz-Bang formula: Take a table's Primary Index
and divide the column value by 2. The answer points to a hash map
bucket, and that bucket tells which AMP will hold the row.
Let's take our first row and determine on which AMP it will reside.
Remember, we will get the Primary Index value of the row, apply
the Coffing/Jones Whiz-Bang formula (divide by 2), and the answer will
point to a bucket in the hash map. Inside that bucket will be the
number of the AMP on which the row will reside. Here is our first row:
Friend_Num  Friend_Name
2           Ben Hon
Since we designated Friend_Num as the Primary Index, we merely
apply the Coffing/Jones Whiz-Bang Formula (divide by 2) to the value
of Friend_Num (2):
2 divided by 2 = 1
The hash map bucket number is one. Let's check the hash map to see
bucket number 1 and to see what AMP number is inside that bucket.
As seen in the picture below, the first bucket in the hash map says the
row's destination is AMP 1.

1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
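The bucket lookup can be sketched in a few lines of Python. This uses
the Coffing/Jones teaching formula (divide by 2), not the real Teradata
hash, and a 48-bucket list standing in for the simplified four-AMP hash
map. The names bucket_for and amp_for are ours.

```python
HASH_MAP = [1, 2, 3, 4] * 12  # 48 buckets, as in the simplified diagram


def bucket_for(friend_num):
    """Teaching formula: divide the Primary Index value by 2."""
    return friend_num // 2


def amp_for(friend_num):
    bucket = bucket_for(friend_num)  # buckets are numbered from 1
    return HASH_MAP[bucket - 1]      # Python lists are numbered from 0


print(bucket_for(2), amp_for(2))  # 1 1
```

Friend_Num 2 points to bucket 1, and bucket 1 holds AMP number 1, so
that is where the row goes.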

Let's look at another random row:
Friend_Num  Friend_Name
16          Lyn Jones
Since we designated Friend_Num as the Primary Index, we merely
apply the Coffing/Jones Whiz-Bang Formula (divide by 2) to the value
of Friend_Num (16), and the answer is:
16 divided by 2 = 8
Thus, the hash map bucket number is now eight. Let's check our hash
map to see bucket number eight and determine which AMP number is
inside that bucket. As you can see below, bucket eight in the hash
map says the row's destination is AMP four.

1  2  3  4  1  2
3  4* 1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4

If we continue the process until all data is laid out, the system would
look like this:

AMP 1          AMP 2          AMP 3          AMP 4
BEST FRIENDS   BEST FRIENDS   BEST FRIENDS   BEST FRIENDS
2 Ben Hon      4 Joe Davis    6 Mary Gray    8 John Davis
10 Don Roy     12 Sam Mills   14 Kyle Marx   16 Lyn Jones

Best_Friends Table
Friend_Num  Friend_Name
2           Ben Hon
4           Joe Davis
6           Mary Gray
8           John Davis
10          Don Roy
12          Sam Mills
14          Kyle Marx
16          Lyn Jones

HASH MAP
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4

Remember, the Teradata hashing formula is a secret, and the
Coffing/Jones Whiz-Bang Formula did not crack the code. Its purpose
is to show you how the hash map works, in theory, to distribute and
locate rows. Simply, you should understand that the real formula is
mathematical (similar to the Coffing/Jones Whiz-Bang Formula) and
that it is consistent. When we divided Friend_Num two by two, we got
bucket one in the hash map. If we ran the formula on this
value a million times, we would still get the same result.

"If you always do what you always did, you'll always get what you
always got."
Verne Hill
In summary, Teradata will always be able to find a row if it knows the
Primary Index. It can rerun the hash formula, point to the bucket in
the hash map, and then retrieve the row from the correct AMP. The
Teradata hashing formula always does what it always did, and always
gets what it always got. Since it always runs the same formula, it is
consistent.
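The whole distribution can be sketched in Python as follows. It again
uses the teaching formula (divide by 2) and a simplified 48-bucket
four-AMP hash map rather than anything Teradata actually runs, and the
function names are hypothetical.

```python
from collections import defaultdict

BEST_FRIENDS = [
    (2, "Ben Hon"), (4, "Joe Davis"), (6, "Mary Gray"), (8, "John Davis"),
    (10, "Don Roy"), (12, "Sam Mills"), (14, "Kyle Marx"), (16, "Lyn Jones"),
]
HASH_MAP = [1, 2, 3, 4] * 12  # simplified four-AMP hash map


def amp_for(friend_num):
    bucket = friend_num // 2  # teaching formula, not the real hash
    return HASH_MAP[bucket - 1]


def distribute(rows):
    """Send every row to the AMP named in its hash map bucket."""
    amps = defaultdict(list)
    for friend_num, friend_name in rows:
        amps[amp_for(friend_num)].append((friend_num, friend_name))
    return dict(amps)


layout = distribute(BEST_FRIENDS)
for amp in sorted(layout):
    print("AMP", amp, layout[amp])

# Running the formula again gives the same answer every time.
assert all(amp_for(2) == 1 for _ in range(1000))
```

Each AMP receives two rows, and because the formula is deterministic,
the same Primary Index value always leads back to the same AMP.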


Retrieving the Data


When Teradata needs to retrieve data, the fastest and most efficient
way is via the Primary Index. An example of SQL showing how
Teradata retrieves the data follows:
SELECT Friend_Num, Friend_Name
FROM Best_Friends
WHERE Friend_Num = 8;
The Parsing Engine understands that the user wants to have two
columns, titled Friend_Num and Friend_Name, returned. The PE
gets excited when it notices that we are after Friend_Num eight. It
recognizes that Friend_Num is the PRIMARY INDEX. The PE then
runs the hash formula for eight. For explanation purposes the
Coffing/Jones hash formula is used, which merely divides the PI by two.
When the PE divides the value eight by two, it receives an answer
of four. It looks in bucket four and sees AMP number four. The PE
passes a plan to retrieve the data to ONLY AMP number four, as this is
a one-AMP operation.
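A Primary Index retrieval can be sketched in Python like this. As
before, the divide-by-2 formula and the 48-bucket hash map are teaching
devices, and the dictionary of AMPs and the function name
select_by_primary_index are assumptions made for illustration.

```python
HASH_MAP = [1, 2, 3, 4] * 12  # simplified four-AMP hash map

# Rows as they were distributed: AMP number -> {Friend_Num: Friend_Name}
AMPS = {
    1: {2: "Ben Hon", 10: "Don Roy"},
    2: {4: "Joe Davis", 12: "Sam Mills"},
    3: {6: "Mary Gray", 14: "Kyle Marx"},
    4: {8: "John Davis", 16: "Lyn Jones"},
}


def select_by_primary_index(friend_num):
    bucket = friend_num // 2             # rerun the teaching formula
    amp_number = HASH_MAP[bucket - 1]    # look inside that bucket
    amps_contacted = [amp_number]        # only one AMP gets the plan
    name = AMPS[amp_number].get(friend_num)
    return amp_number, name, amps_contacted


print(select_by_primary_index(8))  # (4, 'John Davis', [4])
```

Only AMP 4 is contacted; the other three AMPs do no work for this
query.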

AMP 1          AMP 2          AMP 3          AMP 4
BEST FRIENDS   BEST FRIENDS   BEST FRIENDS   BEST FRIENDS
2 Ben Hon      4 Joe Davis    6 Mary Gray    8 John Davis
10 Don Roy     12 Sam Mills   14 Kyle Marx   16 Lyn Jones

HASH MAP
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4
1  2  3  4  1  2
3  4  1  2  3  4

The Full Table Scan

"What matters is not the size of the dog in the fight, but the size of
the fight in the dog."
Coach Bear Bryant
When we travel the globe teaching Teradata classes, we often ask
students, "Are Full Table Scans acceptable in a data warehouse?"
About 80% of the time students respond, "NO!" After we complete
training they respond, "Heck YES!"
Tom told me that he wrestled his way through high school and college.
I said, "Really? I didn't think the classes were that difficult myself!"
Actually, Tom earned a wrestling scholarship to college and achieved
the All-American level. His wrestling coach drilled into the wrestlers'
minds that the size of the opponent is not to be feared, but the size of
their will. The truth is that most databases do not have the FIGHT in
them to handle a Full Table Scan. That's why so many students are
surprised at Teradata's ability to actually handle Full Table Scans.
A Full Table Scan (FTS) is a query that reads every row of a table. The
table may be small or have billions of rows. With Teradata, a Full
Table Scan (FTS) means every AMP reads only the rows it owns in
parallel with all other AMPs in the system. Doing so speeds up a Full
Table Scan hundreds to thousands of times.
For example, imagine a table that has 100 rows in a system that has
10 AMPs. Each AMP owns 10 rows. On a Full Table Scan, each AMP
reads its 10 rows. Next, each AMP passes the information over the
BYNET to the PE. With the work split 10 ways, this is roughly 10 times
faster than a single processor reading all 100 rows.
But what happens with systems that have hundreds, or even
thousands, of AMPs? Well, one major telecommunications company
copied a 3.5 billion-row table in just 18 minutes. The 1,900 AMPs in
its system helped return results very rapidly. Talk about efficiency!

Most Full Table Scans bring traditional databases to their knees, but
Teradata was born to be parallel. Teradata was specifically designed
for data warehousing. When you ask decision support questions like,
"Who are my best and worst customers?" then you are asking the system
to read through an entire table. Full Table Scans are a fundamental and
important part of data warehousing. They allow users to literally ask
any question, about any data, at any time. Teradata has the
experience, power, and architecture to allow Full Table Scans.
An example of a query asking for a Full Table Scan is:
SELECT Friend_Num, Friend_Name
FROM Best_Friends;

AMP 1          AMP 2          AMP 3          AMP 4
BEST FRIENDS   BEST FRIENDS   BEST FRIENDS   BEST FRIENDS
2 Ben Hon      4 Joe Davis    6 Mary Gray    8 John Davis
10 Don Roy     12 Sam Mills   14 Kyle Marx   16 Lyn Jones

In this example, the Parsing Engine receives the SQL and checks the
syntax and security. If the user passes these tests, the query
continues. The PE knows this query asks to return all records. This is
a Full Table Scan. Therefore, it passes the AMPs a plan that says,
"Retrieve all of your Best_Friends table rows, and then pass
them to me (PE) over the BYNET." With that in mind:

Each AMP reads the Best_Friends rows it owns.

Each AMP passes its rows to the PE over the BYNET.
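
The two steps above can be pictured with a small Python sketch. This is only a toy model, not how Teradata is built: the dictionary AMPS, the functions amp_scan and full_table_scan, and the use of a thread pool to stand in for the BYNET are all assumptions made for illustration. The row placement follows the four-AMP picture.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed layout: which AMP owns which Best_Friends rows (taken from the picture).
AMPS = {
    1: [(2, "Ben Hon"), (10, "Don Roy")],
    2: [(4, "Joe Davis"), (12, "Sam Mills")],
    3: [(6, "Mary Gray"), (14, "Kyle Marx")],
    4: [(8, "John Davis"), (16, "Lyn Jones")],
}


def amp_scan(amp_id):
    """One AMP reads only the rows it owns."""
    return list(AMPS[amp_id])


def full_table_scan():
    """The 'PE' gives every AMP the same plan and gathers what they send back."""
    with ThreadPoolExecutor(max_workers=len(AMPS)) as pool:
        # Every AMP scans at the same time; no AMP touches another AMP's rows.
        per_amp = list(pool.map(amp_scan, AMPS))
    return [row for rows in per_amp for row in rows]


answer = full_table_scan()
print(len(answer), "rows returned")
```

Each worker reads two rows instead of eight, which is the whole point: the more AMPs share the table, the less each one has to read.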

Let's run through the SQL again and see the result:
SELECT Friend_Num, Friend_Name
FROM Best_Friends;
8 rows returned

Friend_Num  Friend_Name
         6  Mary Gray
        14  Kyle Marx
         8  John Davis
        16  Lyn Jones
         2  Ben Hon
        10  Don Roy
         4  Joe Davis
        12  Sam Mills

In this chapter, we have shown you two opposite approaches to
retrieving data. In our first query, we used the Primary Index to
retrieve one row. In the next query, we used a Full Table Scan (FTS)
to retrieve all the rows. One approach is the fastest way, and the
other is the slowest way. But are these the only options for retrieving
data? No. There is another option: the Secondary Index.

Secondary Indexes

"Measure a thousand times and cut once."
Turkish Proverb
Secondary Indexes provide an alternate path to the data, and should
be used on queries that run thousands of times. Teradata runs
extremely well without secondary indexes, but since secondary
indexes use up space and overhead, they should only be used on
KNOWN QUERIES or queries that are run over and over again. Once
you know the data warehouse environment, you can create secondary
indexes to enhance its performance.

"Measure a thousand query times and
create a secondary index."
Turkish Teradata Certified Professional
Furthermore, there are two types of secondary indexes: Unique
Secondary Indexes (USI) and Non-Unique Secondary Indexes (NUSI).
A table may have up to 32 secondary indexes.
The good news about secondary indexes is that they speed up
queries. The bad news is that every time someone creates a
secondary index on a table, Teradata creates and maintains a separate
secondary index sub-table. This action not only takes up space, but
also adds overhead.

A classical secondary index is itself a table made up of rows having
two main parts. The first is the data column inside the secondary
index table, and the second part is a pointer showing the location of
the row in the base table. Teradata brilliantly uses the hash formula
and the hash map to build its secondary index sub-tables.
There are three values stored in every secondary index sub-table row:
Secondary Index data value
Secondary Index Row-ID (This is the hashed version of the value)
Primary Index Row-ID (This locates the AMP and the base row)
When a secondary index is created, the Teradata PE tells each AMP to
hash the secondary index column value for each of its rows. The PE
also tells the AMPs to place each hash in a secondary index sub-table
along with the ROW-ID that points to the base row where the desired
value resides.
Let's create a secondary index on our Best_Friends table. The syntax
to create a secondary index on the column Friend_Name in the table
called Best_Friends is:
CREATE UNIQUE INDEX (Friend_Name) ON Best_Friends;

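[Figure: Best_Friends base rows (top of each AMP's disk) and secondary index sub-table rows (bottom of each AMP's disk)]

                 AMP 1           AMP 2           AMP 3           AMP 4
BEST FRIENDS      2 Ben Hon       4 Joe Davis     6 Mary Gray     8 John Davis
(base rows)      10 Don Roy      12 Sam Mills    14 Kyle Marx    16 Lyn Jones

Secondary index  Lyn Jones       Ben Hon         John Davis      Don Roy
sub-table        Kyle Marx       Joe Davis       Mary Gray       Sam Mills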

The example above shows the theory behind creating a secondary
index. There are four AMPs in this system. The base table is the
Best_Friends table seen near the top of each AMP's disk. We created a
Unique Secondary Index (USI) on Friend_Name, and Teradata
automatically created a secondary index sub-table on each AMP. Next,
the AMPs hashed the secondary index values. These values went to
the AMP to which they hashed, along with a pointer to the base row.
The design is simple for display purposes. A symbol represents the
base row-id. For example, Ben Hon, who is Friend_Num 2, has a
smiley face for his symbol. Notice that in the Secondary Index
Sub-table (located at the bottom of the AMP's disk) there is also a
smiley face. Here is how the design works for retrieval. Let's look at
how the following query plays out:
SELECT Friend_Num, Friend_Name
FROM Best_Friends
WHERE Friend_Name = 'Ben Hon';
The Teradata Parsing Engine takes the SQL and checks the syntax and
security access rights. If all is well, the PE notices that the WHERE
clause of the query asks for Friend_Name = 'Ben Hon'.
The PE recognizes that Friend_Name is a Unique Secondary Index.
The PE will hash 'Ben Hon', and then use the hash map to find the
AMP that holds 'Ben Hon' in its secondary index sub-table. As you
can see, the AMP involved is number two (notice the smiley face on
AMP 2). The PE instructs AMP 2 to retrieve the 'Ben Hon' Secondary
Index Sub-table row. Once complete, Teradata can see the real row-id
and find the base row. In our example, once the 'Ben Hon' Secondary
Index Sub-table row is found, the row-id (the smiley face in this
example) is revealed, and the PE can find the matching smiley face in
the base table.
This approach allows all USI requests in the WHERE clause of SQL to
become two-AMP operations.
A NUSI used in the WHERE clause still requires all AMPs, but the AMPs
can easily check the secondary index sub-table to see if they have one
or more qualifying rows.
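
Here is a rough Python sketch of both lookups. It is a teaching model under stated assumptions: amp_for uses CRC32 modulo the number of AMPs as a stand-in for Teradata's real hash formula and hash map, so the AMP numbers it produces will not match the picture. The dictionaries base, usi, and nusi and the two lookup functions are hypothetical names. The sketch also assumes that NUSI sub-table rows live on the same AMP as the base rows they describe.

```python
import zlib

NUM_AMPS = 4


def amp_for(value):
    """Stand-in for hash formula + hash map (assumption: CRC32 mod AMP count)."""
    return zlib.crc32(str(value).encode()) % NUM_AMPS


base = {amp: {} for amp in range(NUM_AMPS)}  # Friend_Num -> Friend_Name
usi = {amp: {} for amp in range(NUM_AMPS)}   # Friend_Name -> (base AMP, Friend_Num)
nusi = {amp: {} for amp in range(NUM_AMPS)}  # AMP-local: Friend_Name -> [Friend_Num]

friends = [(2, "Ben Hon"), (4, "Joe Davis"), (6, "Mary Gray"), (8, "John Davis"),
           (10, "Don Roy"), (12, "Sam Mills"), (14, "Kyle Marx"), (16, "Lyn Jones")]

for num, name in friends:
    base_amp = amp_for(num)                     # the Primary Index picks the base AMP
    base[base_amp][num] = name
    usi[amp_for(name)][name] = (base_amp, num)  # sub-table row goes where the name hashes
    nusi[base_amp].setdefault(name, []).append(num)


def usi_lookup(name):
    """At most two AMPs: the sub-table AMP first, then the base-row AMP."""
    index_amp = amp_for(name)
    base_amp, num = usi[index_amp][name]
    return {"amps_touched": [index_amp, base_amp], "row": (num, base[base_amp][num])}


def nusi_lookup(name):
    """Every AMP is asked, but each one only checks its own small sub-table."""
    hits = []
    for amp in range(NUM_AMPS):
        for num in nusi[amp].get(name, []):
            hits.append((num, base[amp][num]))
    return hits


print(usi_lookup("Ben Hon")["row"])
print(nusi_lookup("Sam Mills"))
```

Notice that usi_lookup never loops over the AMPs, while nusi_lookup always does. That difference is the cost of allowing duplicate values in the index.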
Create secondary indexes only on columns used repeatedly in the
WHERE clause of on-going queries. Secondary indexes take up space
and add overhead, but boy can they speed up queries.

Join Indexes

"A bend in the road is not the end of the
road unless you fail to make the turn."
A join is an SQL query that gathers its information from more than one
table. Teradata can join up to 64 tables in a single query. Many
databases can't handle join processing, so either the database is
modeled in a dimensional fashion or summary tables are created.
Teradata allows you to travel down a faster and straighter highway.
Because data marts or summary tables can be an administrative
nightmare, Teradata enables join access without requiring physical
data marts. This is accomplished by creating a join index. When you
create a join index, the tables involved are pre-joined. There is an
actual table built containing the joined data. The users don't ever
query the join index directly. They run their normal joins and the PE
will check to see if the join can be satisfied by the Join Index table.
If it can, Teradata will pull the data from the Join Index table. Best
of all, once it is created, the data is automatically maintained as the
underlying base tables are updated.
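
The idea of a pre-joined table that keeps itself current can be sketched in a few lines of Python. Everything here is hypothetical (the customers and orders tables, the function names, and the brute-force rebuild), and a real system would maintain the join index far more cleverly than by rebuilding it on every insert.

```python
customers = {1: "Ben Hon", 2: "Joe Davis"}   # Table A: Cust_Num -> Cust_Name
orders = [(100, 1, 25.00), (101, 2, 40.00)]  # Table B: (Order_Num, Cust_Num, Amount)
join_index = []                              # the pre-joined rows of A and B


def rebuild_join_index():
    """Pre-join Table A and Table B on Cust_Num."""
    join_index[:] = [(order_num, cust_num, customers[cust_num], amount)
                     for order_num, cust_num, amount in orders
                     if cust_num in customers]


def insert_order(order_num, cust_num, amount):
    """Changing a base table also refreshes the join index; the user does nothing extra."""
    orders.append((order_num, cust_num, amount))
    rebuild_join_index()


def orders_with_names():
    """The user's normal join, answered from the pre-joined rows."""
    return list(join_index)


rebuild_join_index()
insert_order(102, 1, 15.50)
print(orders_with_names())
```

The user only ever calls orders_with_names. Whether the answer came from a fresh join or from the pre-joined rows is invisible to them.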

[Figure: Table A and Table B are pre-joined into the Join Index Table of A and B]

Teradata Databases, Users and Space

"Choose a job you like and you will never
have to work a day of your life."
Confucius
When a Teradata system arrives at your doorstep, it has been carefully
configured to provide adequate permanent disk space that will store,
manage, and back up your company's data. All of the space that
comes with the system belongs to the user called "DBC". DBC loves
its job because it is the top dog. Every Teradata system that was ever
built has a user called DBC. The acronym is derived from the first
Teradata machine called the DBC/1012. DBC stands for Database
Computer, and 1012 stands for 10 to the 12th power, or a Terabyte.
There is no user with greater privileges than the DBC.
The DBC owns all permanent space in a Teradata system. It also
contains system tables that hold information about the entire system.
These system tables are known as the Data Dictionary/Directory
(DD). The Data Dictionary acts like a "Dictionary" to users who want
to look up system information, and as a "Directory" to the Parsing
Engine (PE). The PE looks in the Directory for help with creating "The
Plan". The Dictionary directs the PE on topics such as security, access
rights, table columns, indexes, macros, views, etc.
So, if your system comes with 100 Gigabytes of permanent disk space,
then the DBC owns 100 Gigabytes of PERM space. Teradata is
hierarchical in nature, so it is up to DBC to dole out space to other
databases or users. In the beginning, the DBC owns all PERM space.
No space is unassigned. As DBC begins to give space to other
users/databases, they take ownership. Keep in mind, all space is
owned. Space never goes unaccounted for -- if the space is not
owned by DBC, then it's owned by someone under DBC.

[Figure: Logical Picture of System Space. DBC, which holds the Data Dictionary/Directory (DD), owns the entire 100 Gigabyte Data Warehouse.]


The DBC is now ready to distribute space, but because DBC is so
powerful this can be dangerous. What if the DBC user forgets the
password? What if a disgruntled employee knows the DBC password
and is looking for revenge? The DBC password must be protected and
as a result, many companies create a new user called SYSDBA. This
user owns about 80% of the space, while the DBC owns the remaining
20% that is allocated for the Data Dictionary and the Transient Journal
(see Data Protection chapter). The DBC password can then be locked
in a safe, and it is now up to the SYSDBA to distribute space.

[Figure: Logical Picture of System Space. In the 100 Gigabyte Data Warehouse, DBC (with the Data Dictionary/Directory) keeps 20% and SYSDBA owns 80%.]


The SYSDBA now owns 80% of the system space. The user does NOT
have to be called SYSDBA. It could be called Morgan, or Tom, or
anything. SYSDBA, however, is a standard name that most systems
utilize.
As you can see in the following picture, the DBC still owns about 20%
of the total space. The user SYSDBA has given some space to a
database called MRKT, and to another one called SALES. It has
also given space to a user called Morgan.
NOTE: Morgan has given some of his space to Tom. Therefore, both
Morgan and Tom can now own tables.

[Figure: Logical Picture of System Space. DBC (with the Data Dictionary/Directory) keeps 20%. SYSDBA has given space to MRKT, Sales, and Morgan, and Morgan has given space to Tom.]


Remember, either a database or a user can own space. What's the
difference between a database and a user? That topic follows.

Databases and Users


Unlike other database products, Teradata sees little difference
between a user and a database. Both need space to contain or own
data. In fact, the only real difference is that a user has a password
and he or she can log-on and submit SQL requests.
Both a database and a user can own perm space; therefore both can
actually own tables.
When we stated that relational databases are much like an extended
family, we were not kidding! Below is a diagram showing a hierarchy
of space ownership in Teradata. Any user or database sitting
anywhere above you in the hierarchy is referred to as your "parent"
or "owner". Any object below you is a "child". Your extended family
will grow as you add users and databases.
Take a look at the following diagram, and then tell us who is the owner
of Tom. The answer is Morgan, SYSDBA, and DBC. Each of these
is listed above Tom in the hierarchy, so each is a parent or
owner. With this hierarchy in effect, parents (or owners) have the
ability to GRANT or REVOKE rights from Tom.

DBC
  SYSDBA
    MRKT
    Sales
    Morgan
      Tom
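
If it helps to see the family tree as data, here is a small Python sketch. The PARENT dictionary simply records who sits directly above whom in the diagram. The function names owners and children are made up for this illustration and are not Teradata commands.

```python
# Who sits directly above whom in the diagram.
PARENT = {
    "SYSDBA": "DBC",
    "MRKT": "SYSDBA",
    "Sales": "SYSDBA",
    "Morgan": "SYSDBA",
    "Tom": "Morgan",
}


def owners(name):
    """Everyone above `name` in the hierarchy, nearest parent first."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain


def children(name):
    """Everyone anywhere below `name` in the hierarchy."""
    return [child for child in PARENT if name in owners(child)]


print(owners("Tom"))
print(children("SYSDBA"))
```

Walking up the chain from Tom gives the same answer as the text: Morgan, SYSDBA, and DBC.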

Three Types of Teradata Space


There are three types of space with Teradata. They are:

Perm Space,
Spool Space, and
Temp Space

Perm space defines the upper limit of space that a database or user
can use to hold tables, secondary index sub-tables, and permanent
journals (see the Data Protection chapter).
Spool space defines the upper limit of space that a user has to run a
query. When a user runs a query, AMPs build the answer set in spool
space. Once the query is done, the spool space is released. If the
query exceeds the spool space's upper limit, the query aborts because
the user is out of spool space.
Temp space defines the upper limit that a user or database can have
to hold Global Temporary tables. These tables will be
discussed in another chapter.
The SYSDBA knows that tenaciously holding onto its space will not
provide any value to your company. A bank that holds onto all of its
capital will not be successful, or will it? If it's destined for success, it
will lend out its capital in the form of credit lines or mortgages. These
actions will provide the bank with a healthy profit. The SYSDBA
likewise gladly gives up space to each new user or database in an
effort to make the Teradata system profitable.
SYSDBA gives out two kinds of space: Perm space and Spool
space. When you receive a credit card from the bank, you are given
an upper limit to your line of credit. In order to spend more than that
limit, you must get approval from the bank. In the same way, the
SYSDBA gives a new user an upper limit of space to use. When that
amount is used up, the user must request an increase. Another way to
free up some space is to drop some tables from the database.

Perm space is actually used to store real data such as tables,
secondary index sub-tables and permanent journals. If you give some
of your perm space to a child object, then you must subtract that same
amount from the total perm space you own.
Spool space is the area where AMPs temporarily place the answer to
a query. Once the answer is delivered to the person making the query,
the AMPs release that spool space to be used for another query!
Unlike perm space, spool space is not lost if it is given away. You can
actually give users below you as much spool as you would like, yet still
have the original amount. Spool is like a speed limit on the highway.
If your own speed limit is 65 mph, you can still allow every other
driver to drive up to 65 mph. Some users may not receive perm space
if their job is just to run queries -- not create tables. These users will
just receive spool.
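
The difference between giving away perm and giving away spool can be shown with a little bookkeeping. The sketch below is a simplified model: the Space class, its create_child method, and the spool figures are assumptions for illustration. Only the 100 Gigabyte system and the 80% SYSDBA share come from the text. Following the speed-limit analogy, the model lets a child's spool limit be as high as its parent's, but not higher.

```python
class Space:
    """A user or database with a PERM allotment and a SPOOL limit (both in gigabytes)."""

    def __init__(self, name, perm, spool):
        self.name = name
        self.perm = perm    # PERM this object still owns
        self.spool = spool  # upper limit of work space for queries

    def create_child(self, name, perm, spool):
        if perm > self.perm:
            raise ValueError(f"{self.name} does not own {perm} GB of PERM")
        if spool > self.spool:
            raise ValueError(f"{name} cannot exceed {self.name}'s spool limit")
        self.perm -= perm                # PERM given away is subtracted from the giver
        return Space(name, perm, spool)  # SPOOL is only a limit, so nothing is subtracted


dbc = Space("DBC", perm=100, spool=50)
sysdba = dbc.create_child("SYSDBA", perm=80, spool=50)
morgan = sysdba.create_child("Morgan", perm=10, spool=50)
query_only_user = sysdba.create_child("Reporter", perm=0, spool=50)

print(dbc.perm, sysdba.perm, morgan.perm)     # PERM shrinks as it is handed down
print(dbc.spool, sysdba.spool, morgan.spool)  # everyone keeps the same spool limit
```

The user called Reporter is the query-only case from the text: no perm at all, just spool.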
The following picture shows a logical view of a CustomerTable. Note:
the table is stored in PERM space. When a user submits a query
against this table, the answer is stored temporarily in SPOOL. When
the query is completed, the answer is delivered to the user, and then
the SPOOL is released.
The next picture shows a logical Teradata system. In the PERM area
there is a table called Employee. This table has five columns: Emp,
Dept, Lname, Fname, and Sal. The table has four employees. Notice
the SQL statement at the bottom of the picture is asking to see all
columns where the employee's department is equal to 10. To
complete the query, the AMPs will read the rows of the table and each
time they find a row where Dept is equal to 10, a row is added to
spool. Then, when the answer is returned, the spool is released.

Teradata System (Logical)

PERM SPACE: Employee Table

Emp  Dept  Lname   Fname  Sal
1    10    Jones   Dave   45000.00
2    20    Smith   Mary   50000.00
3    30    Chang   Vu     65000.00
4    10    Wilson  Sue    44000.00

SPOOL SPACE

EMP  DEPT  LNAME   FNAME  SAL
1    10    Jones   Dave   45000.00
4    10    Wilson  Sue    44000.00

SELECT *
FROM Employee
WHERE DEPT = 10;
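
The life of spool during this query can be modeled in Python. This is a sketch under assumptions: the function run_query, the exception OutOfSpool, and the idea of measuring the spool limit in rows rather than bytes are all simplifications invented for the example.

```python
EMPLOYEE = [  # the Employee table sitting in PERM space
    (1, 10, "Jones", "Dave", 45000.00),
    (2, 20, "Smith", "Mary", 50000.00),
    (3, 30, "Chang", "Vu", 65000.00),
    (4, 10, "Wilson", "Sue", 44000.00),
]


class OutOfSpool(Exception):
    """Raised when a query needs more spool than the user's upper limit."""


def run_query(table, predicate, spool_limit_rows):
    spool = []  # temporary work area for the answer set
    try:
        for row in table:  # read every row of the table
            if predicate(row):
                if len(spool) >= spool_limit_rows:
                    raise OutOfSpool("query aborted: out of spool space")
                spool.append(row)  # qualifying rows are added to spool
        return list(spool)  # deliver the answer to the user
    finally:
        spool.clear()  # spool is released whether the query finished or aborted


answer = run_query(EMPLOYEE, lambda row: row[1] == 10, spool_limit_rows=10)
for row in answer:
    print(row)
```

Try the same call with spool_limit_rows=1 and the query aborts on the second qualifying row, which is what running out of spool looks like to a user.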

What is a View?
At Christmas time no one cares about the past or the future. All that
matters is the present! One year, my wife and I were in New York City
during the holiday season. We had always heard about how wonderful
the window displays are in the large department stores. As we
window-shopped, we got lots of ideas for gifts. We could see products
displayed in the windows, but we could not actually touch them. We
only had a pleasant view. Display windows are designed to show
shoppers what store management wants you to see. In Teradata, a
view is like a department store window because you can see selected
portions of a table, yet you aren't able to see sensitive data. Instead,
you can view data within your access rights and you determine what
data portions you want others to see.
Views are real sticklers for protecting sensitive data from inquiring
eyes. For example, the Human Resources database might contain an
employee table. Management can create a view of the table that hides
the salary column, yet still allows an administrative associate to view
names, phone numbers and department numbers of employees. In
this scenario, the salary column is not shown. As a result, views are
the best choice for protecting sensitive data.
Another benefit of views is that their definitions are stored in the Data
Dictionary. When you select a view of a table(s), the data is not
stored on the disks, so it does not duplicate data and take up more
space. In this scenario, you are looking at a filtered picture of the
data.

The Employee Table

Emp  Dept  Lname     Fname   Sal
1    10    Johnson   Manny   100000
2    20    Carlsbad  Jan     100000
22   30    Winter    Steve   77000
25   10    Lester    Bonnie  56000
33   10    Samuels   Todd    120000
99   20    Walter    Misha   104000

The previous table shows the employee table. In nearly every
company, employees are curious about the salaries of co-workers.
Providing access to the employee table above will actually allow users
to see everyone else's salary. To avoid disclosing salary information, a
view should be created to limit certain columns and rows.
It's simple to create a view:
CREATE VIEW EMPLOY_V AS
SELECT Emp
,Dept
,Lname
,Fname
FROM EMPLOYEE;
In the SQL statement above, salary is not selected. If users
are denied access to the employee table, but are given access rights to
the EMPLOY_V view, security is enhanced. With this restriction,
those users cannot see the list of employee salaries.
Perm Space is required to create a table, but it is not needed to create
a view. The creation and definition of a view are both stored in the
Data Dictionary, and are monitored by the DBC. However, anyone can
create a view, provided that person has the proper privileges.
Once a view has been created, users can select data from the view.
An example is:
SELECT *
FROM Employ_V;
6 rows returned

Emp  Dept  Lname     Fname
1    10    Johnson   Manny
2    20    Carlsbad  Jan
22   30    Winter    Steve
25   10    Lester    Bonnie
33   10    Samuels   Todd
99   20    Walter    Misha
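
If you don't have a Teradata system handy, the same view can be tried out with Python's built-in sqlite3 module. Keep in mind this is SQLite, not Teradata: it shows that a view stores only a definition and exposes only the chosen columns, but SQLite has no users or access rights, so the security side of the story is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Emp INTEGER, Dept INTEGER, Lname TEXT, Fname TEXT, Sal INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "Johnson", "Manny", 100000),
        (2, 20, "Carlsbad", "Jan", 100000),
        (22, 30, "Winter", "Steve", 77000),
        (25, 10, "Lester", "Bonnie", 56000),
        (33, 10, "Samuels", "Todd", 120000),
        (99, 20, "Walter", "Misha", 104000),
    ],
)

# The view stores only its definition; no employee rows are copied.
conn.execute("CREATE VIEW Employ_V AS SELECT Emp, Dept, Lname, Fname FROM Employee")

cursor = conn.execute("SELECT * FROM Employ_V")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

The Sal column never appears in the output, even though the query was SELECT *.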

What is a Macro?

"The axe soon forgets, but the tree always
remembers."
Anonymous
When you run specific queries often, or if you want to ensure you don't
forget an SQL step, you should use a macro. The user sometimes
forgets, but the macro always remembers. A macro is a group of one
or more SQL statements that are given a name and that are executed
with a simple command. If there are multiple commands, Teradata
treats them as one single transaction. In other words, either they all
work or none of them work. Like views, the definition statement for a
macro is stored in the Data Dictionary.
If your manager asks you for three reports, he may want to know:

What employees are in department 10;

What employees are in department 20; and

A list of employee names sorted by last name;

A macro can easily be created to run all three commands. The syntax
would be:
CREATE MACRO Emp_mac AS
(
SELECT * FROM Employ_v WHERE dept = 10;
SELECT * FROM Employ_v WHERE dept = 20;
SELECT * FROM Employ_v ORDER BY lname;
);

Once the macro has been created and stored in the Data Dictionary,
it's time for a test run. To run this macro, the user merely executes
the SQL:
EXECUTE Emp_mac;
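
The "either they all work or none of them work" behavior can be imitated with Python's sqlite3 module. SQLite has no macros, so the MACROS dictionary and the functions create_macro and execute_macro below are hypothetical stand-ins. The example is only meant to show a named group of statements running as one transaction.

```python
import sqlite3

MACROS = {}  # hypothetical stand-in for macro definitions kept in the Data Dictionary


def create_macro(name, statements):
    MACROS[name] = list(statements)


def execute_macro(conn, name):
    """Run every statement of the macro as one transaction: all work or none do."""
    results = []
    try:
        with conn:  # commits if every statement succeeds, rolls back otherwise
            for sql in MACROS[name]:
                results.append(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # the whole macro failed; nothing it did is kept
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Emp INTEGER, Dept INTEGER, Sal INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(1, 10, 45000), (2, 20, 50000)])
conn.commit()

create_macro("Emp_mac", ["SELECT * FROM Employee WHERE Dept = 10",
                         "SELECT * FROM Employee WHERE Dept = 20"])
create_macro("Bad_mac", ["UPDATE Employee SET Sal = Sal * 2",
                         "SELECT * FROM No_Such_Table"])

good = execute_macro(conn, "Emp_mac")
bad = execute_macro(conn, "Bad_mac")
salaries = conn.execute("SELECT Sal FROM Employee ORDER BY Emp").fetchall()
print(good, bad, salaries)
```

In Bad_mac the UPDATE succeeds but the second statement fails, so the raise is undone along with everything else.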
Here is a handy reference chart that compares views with macros:

Views                                         Macros
We select from views.                         We execute macros.
Uses the keyword AS                           Uses the keyword AS
Definition is stored in the Data Dictionary   Definition is stored in the Data Dictionary
Accesses certain portions of the data         Accesses the real data itself
Is changed using the keyword REPLACE          Is changed using the keyword REPLACE

Access Rights for Teradata Users

"Never insult seven men when all you're
packing is a six gun."
Wild West Slogan
I taught in one place that was so rough security actually checked me
for weapons. When they found out that I had no weapons, they gave
me some!
Actually, on a recent consulting trip I was signed in each morning by a
friendly security guard. This customer site had tons of highly sensitive
data. As long as I stayed in my assigned work area, the guard and I
got along just fine. However, as soon as I needed to move to a
different room, someone had to accompany me and give me access.
In Teradata, the Parsing Engine is the vigilant guard who never lets
someone get close to data if he or she doesn't have the right
permissions.
Every time an SQL request comes to the PE, it checks the SQL syntax
for validity first. Its next step, every single time, is to see if the user
has permission to perform a given operation on a specified Teradata
object.

Automatic, Implicit, and Explicit Rights


Teradata uses three types of privileges, and records of these rights are
stored in the DBC. Owners or Parents have Implicit Rights. These
rights allow the owners (parents) to grant and revoke privileges on
any users listed below them in the hierarchy. In real life, parents have
these privileges too. Think about it: nearly every teenager has heard
the statement, "I'm revoking your privilege to drive the family car until
those grades come up. Hand over the keys!"
Explicit Rights are any privileges granted from someone else. For
example, Tom might grant Mary permission to create a table in his
database even though Mary works in the marketing (MRKT)
department.
Automatic Rights are system assigned privileges. When a new user
or database is created, it receives 16 different access rights. The
creator of the new object gets 20 rights. Similarly, when a baby is
born in the United States he or she is granted some basic rights by the
U.S. Constitution.

DBC
  SYSDBA
    MRKT
      Mary
    Sales
    Morgan
      Tom

In the picture above, the DBC has Implicit rights on all databases and
users. Plus, SYSDBA has Implicit rights on every database and user
listed below it. MRKT has Implicit rights over Mary, and Morgan has the
same rights over Tom. Implicit rights simply means it is implied that
those listed above you (in a hierarchy chart) can GRANT or REVOKE
privileges on you.
For example, if Tom or Morgan decides to give certain privileges to
Mary, either person could EXPLICITLY give her those permissions.
In comparison, Automatic Rights means that when Morgan created Tom,
Morgan automatically received 20 access rights (on Tom), plus Tom was
given 16 access rights on himself.
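
To see how implicit and explicit rights fit together, here is a deliberately simplified Python model. It is not Teradata's real rights machinery: the names explicit_rights, has_implicit_rights, grant, and pe_allows are invented, and the rule that only an object or one of its owners may grant rights on it is an assumption drawn from the hierarchy described above.

```python
PARENT = {"SYSDBA": "DBC", "MRKT": "SYSDBA", "Sales": "SYSDBA",
          "Morgan": "SYSDBA", "Mary": "MRKT", "Tom": "Morgan"}

explicit_rights = set()  # (grantee, privilege, object) rows, as if stored by DBC


def owners(name):
    """Everyone above `name` in the hierarchy."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain


def has_implicit_rights(owner, target):
    """Owners (parents) may GRANT or REVOKE on anything below them."""
    return owner in owners(target)


def grant(grantor, grantee, privilege, obj):
    """Assumed rule: only the object itself or one of its owners may grant on it."""
    if grantor != obj and not has_implicit_rights(grantor, obj):
        raise PermissionError(f"{grantor} has no rights over {obj}")
    explicit_rights.add((grantee, privilege, obj))


def pe_allows(user, privilege, obj):
    """The PE's check: a user may act on itself or where a right was explicitly granted."""
    return user == obj or (user, privilege, obj) in explicit_rights


grant("Tom", "Mary", "CREATE TABLE", "Tom")  # Tom lets Mary create tables in his space
print(pe_allows("Mary", "CREATE TABLE", "Tom"))
print(has_implicit_rights("MRKT", "Tom"))
```

Mary gets in because of an explicit grant. MRKT, which sits in a different branch of the tree, has no implicit rights over Tom at all.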

Data Protection
As a man was driving down the interstate highway, his cell phone
rang. When he answered he heard his wife warn him urgently,
"George, I just heard on the news that there's a car going the wrong
way on I-26!" George replied, "I'm on I-26 right now and it's not just
one car. It's hundreds of them!"
How do you protect your data when things go the wrong way?
Murphy's law states, "The more mission critical a data warehouse,
the more likely the system will crash at the most critical moment of
the mission." Ironically, most DBAs think Murphy was an optimist.

"Please sleep on it tonight, and if you wake
up in the morning,
let me know what you think."
Morgan's Life Insurance Agent
A database not prepared to defend itself is like an unsigned contract.
It is not worth the paper it is written on. However, Teradata is
always prepared and it will protect your data better than a wild pit
bull. As a matter of fact, the difference between Teradata and a pit
bull is that eventually the pit bull will get bored and let go.
System and user errors are inevitable in any large system. For
example, an associate may accidentally give everyone a 100% raise
instead of a 10% raise. Or, what if a million-dollar transaction fails
right at the wrong time? Or an AMP or DISK goes down? In any of
these cases, Teradata will have many ways to protect your data.
Some processes for protection are automatic and some of them are
optional.

The protection features we will discuss are:

Transaction Concept
Transient Journal
FALLBACK
RAID
Clustering
Cliques
Permanent Journaling

Transaction Concept & Transient Journal

"The afternoon knows what the morning
never suspected."
Swedish Proverb
At any time something could go wrong with a transaction. An old
proverb suggests, "The afternoon often knows what the morning
never suspected." Likewise, the Transient Journal knows what the
transaction never suspected.
What good would it do if you could gather, store and analyze
terabytes of data, but doubted the integrity of the data? Teradata
makes every effort to ensure a database doesn't become corrupt.
Fundamental to this assurance is the Transaction Concept, which
means that an SQL statement is viewed as a transaction. Simply
stated, either it works or it fails.
The Transient Journal's job is to ensure that if things do fail, the
rows affected can be reverted to their original state. In
Teradata, all SQL statements are considered transactions. This
applies whether you have one statement or multiple statements
executing (MACRO). If all SQL statements cannot be performed
successfully, the following happens:

The user receives immediate feedback in the form of a failure message;

The entire transaction is rolled back, and any changes made to the database are reversed;

Locks are released; and

Spool files are discarded.

Wouldn't it be great if every time you got a haircut, the barber or
stylist took a picture of your hairdo before cutting a single strand?
Then, after cutting your hair, he or she asked if you liked it? If you
didn't like it, then you could ask to have it restored? Well, that is what the
Transient Journal does. If a row is going to change because of an
INSERT, UPDATE, or DELETE, it takes a BEFORE picture. If the
transaction fails, then the journal restores it to the way it was.
The TRANSIENT JOURNAL is an automatic system function. It is not
optional. The BEFORE image is actually stored in the AMP's Transient
Journal. Every AMP has a transient journal that is maintained in DBC's
PERM space. If the transaction is aborted for any reason, the AMP
restores the data to match the before-image stored in the Transient
Journal. The data will then revert to its original state. When a
transaction is successful, the PE and the AMPs shake hands on it and
the Transient Journal is wiped clean. The handshake is called the
COMMIT. After a COMMIT, all the AMPs have a party to celebrate,
and the user is invited to join in the festivities! In other words,
"Transient Journal Cleanliness is next to Godliness." If it is clean,
then things went well!
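
The BEFORE picture idea is easy to model. The Python sketch below is a toy: the Amp class and its write, delete, commit, and rollback methods are assumptions made for illustration, and a real transient journal tracks far more than a key and an old value.

```python
class Amp:
    """One AMP with its rows and a transient journal of BEFORE images."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.transient_journal = []  # (key, before image); None means "row did not exist"

    def write(self, key, new_value):
        """INSERT or UPDATE: take the BEFORE picture, then change the row."""
        self.transient_journal.append((key, self.rows.get(key)))
        self.rows[key] = new_value

    def delete(self, key):
        """DELETE: take the BEFORE picture, then remove the row."""
        self.transient_journal.append((key, self.rows.get(key)))
        self.rows.pop(key, None)

    def commit(self):
        """The handshake: the transaction worked, so the journal is wiped clean."""
        self.transient_journal.clear()

    def rollback(self):
        """The transaction failed: restore BEFORE images, newest change first."""
        while self.transient_journal:
            key, before = self.transient_journal.pop()
            if before is None:
                self.rows.pop(key, None)  # undo an INSERT
            else:
                self.rows[key] = before   # undo an UPDATE or DELETE


amp = Amp({1: 45000.00, 4: 44000.00})  # Emp -> Sal
for emp in list(amp.rows):
    amp.write(emp, amp.rows[emp] * 2)  # oops: a 100% raise instead of 10%
amp.write(7, 30000.00)                 # a new row in the same transaction
amp.rollback()                         # the transaction is aborted
print(amp.rows, amp.transient_journal)
```

After the rollback, the salaries are back where they started, the new row is gone, and the journal is empty.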

FALLBACK Protection
I asked my dentist if I had to floss all my teeth, and he responded,
"No, just the ones you want to keep."

"If you're not TRUE to your teeth,
they'll be FALSE to you."
Morgan's Dentist
FALLBACK is a table protection feature used in case an AMP fails.
You can use FALLBACK on all tables, some tables or no tables. When
I asked my dentist if I should use FALLBACK on all tables, he
responded, "No, just the ones you want to keep running when an
AMP fails."
Below is the four-AMP system and the Best_Friends table. In this
example, data is spread evenly and the system is ready to run in
parallel. It is brilliant, but vulnerable. What happens if we lose AMP
one? We can no longer get to the Best_Friends rows containing Ben
Hon and Don Roy. FALLBACK, however, will correct this situation.

AMP 1           AMP 2           AMP 3           AMP 4
BEST FRIENDS    BEST FRIENDS    BEST FRIENDS    BEST FRIENDS
1 Ben Hon       2 Joe Davis     3 Mary Gray     4 John Davis
5 Don Roy       6 Sam Mills     7 Kyle Marx     8 Lyn Jones

In the picture below, you can see the Best_Friends table and the
FALLBACK protected rows.

                 AMP1            AMP2            AMP3            AMP4
BEST FRIENDS      2 Ben Hon       4 Joe Davis     6 Mary Gray     8 John Davis
(base rows)      10 Don Roy      12 Sam Mills    14 Kyle Marx    16 Lyn Jones

FALLBACK rows    16 Lyn Jones     2 Ben Hon       8 John Davis   10 Don Roy
                 14 Kyle Marx     6 Mary Gray     4 Joe Davis    12 Sam Mills

In this picture, the BASE table Best_Friends is illustrated at the top of
the disk and the FALLBACK rows are placed at the bottom of the disk.
If we lose AMP1, then we can get Ben Hon from AMP2 and Don
Roy from AMP4.
Keep in mind, FALLBACK tables use twice as much disk space as
NON-FALLBACK tables. In the picture above there were eight base
rows in the Best_Friends table and eight Best_Friends
FALLBACK rows. With FALLBACK, we can lose any AMP and still get
to the data.
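
The picture can be turned into a quick check in Python. The BASE and FALLBACK dictionaries copy the row placement shown in the picture. The functions locate and read_all are hypothetical helpers written for this example.

```python
BASE = {  # base rows on each AMP (top of the disk in the picture)
    1: {2: "Ben Hon", 10: "Don Roy"},
    2: {4: "Joe Davis", 12: "Sam Mills"},
    3: {6: "Mary Gray", 14: "Kyle Marx"},
    4: {8: "John Davis", 16: "Lyn Jones"},
}
FALLBACK = {  # FALLBACK copies on each AMP (bottom of the disk in the picture)
    1: {16: "Lyn Jones", 14: "Kyle Marx"},
    2: {2: "Ben Hon", 6: "Mary Gray"},
    3: {8: "John Davis", 4: "Joe Davis"},
    4: {10: "Don Roy", 12: "Sam Mills"},
}


def locate(friend_num, down_amp=None):
    """Find a live copy of a row: try the base copy first, then the FALLBACK copy."""
    for kind, copies in (("base", BASE), ("fallback", FALLBACK)):
        for amp, rows in copies.items():
            if amp != down_amp and friend_num in rows:
                return kind, amp
    return None  # no live copy anywhere


def read_all(down_amp=None):
    """Where each of the eight rows would be read from with one AMP down."""
    all_nums = [num for rows in BASE.values() for num in rows]
    return {num: locate(num, down_amp) for num in all_nums}


print(locate(2, down_amp=1))   # Ben Hon with AMP1 down
print(locate(10, down_amp=1))  # Don Roy with AMP1 down
```

Because no row keeps its FALLBACK copy on the same AMP as its base copy, read_all finds all eight rows no matter which single AMP is down.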

"You can't step into the same river twice."
Heraclitus
The data in a company's database tables is constantly changing,
much like a flowing river. As every footstep really encounters a
different river, likewise each update really makes a different table.
That is why Fallback protection can be vital for mission critical tables.
It actually allows the user to step into the same table twice, if
necessary.

If we can lose any one AMP/disk, what happens if we lose two? The
chance of losing two AMPs in a four-AMP system is small; however,
some systems have nearly 2,000 AMPs, and the chance of
losing two AMPs in a 2,000-AMP system is much greater than in a
four-AMP system. That's why Teradata designed Clustering. Let's
look at this next example with a little larger system:

CLUSTER 1
                 AMP1            AMP2            AMP3            AMP4
BEST FRIENDS     2 Ben Hon       4 Joe Davis     6 Mary Gray     8 John Davis
FALLBACK rows    6 Mary Gray     2 Ben Hon       8 John Davis    4 Joe Davis

CLUSTER 2
                 AMP5            AMP6            AMP7            AMP8
BEST FRIENDS     10 Don Roy      12 Sam Mills    14 Kyle Marx    16 Lyn Jones
FALLBACK rows    16 Lyn Jones    10 Don Roy      12 Sam Mills    14 Kyle Marx

Let's discuss the picture above in detail: This is an eight-AMP system.
Four AMPs are in cluster one and four AMPs are in cluster two. The
base table Best_Friends (listed at the top of all disks) is spread
evenly across all eight AMPs. Taking the Primary Index and running
it through the hashing algorithm completes this allocation. Next, the
output of the hashing algorithm points to a bucket in the hash map,
and inside that bucket is the AMP number, or the row's destination.
Notice the FALLBACK rows in this example. In the top cluster
(cluster 1), FALLBACK rows are backups for the top cluster's base
rows. In the bottom cluster (cluster 2), FALLBACK rows are backups
for the bottom cluster's base rows.
With this protection, WE CAN AFFORD TO LOSE ONE AMP IN EACH
CLUSTER!
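
That rule is simple enough to write down as a one-line check. In this Python sketch, the CLUSTERS dictionary mirrors the eight-AMP picture and data_available is a made-up helper. It assumes, as the picture shows, that a row and its FALLBACK copy always sit on two different AMPs of the same cluster.

```python
CLUSTERS = {1: [1, 2, 3, 4], 2: [5, 6, 7, 8]}  # cluster number -> its AMPs


def data_available(down_amps):
    """All rows stay reachable as long as no cluster has lost more than one AMP."""
    down = set(down_amps)
    return all(len(down & set(members)) <= 1 for members in CLUSTERS.values())


print(data_available({1, 5}))  # one AMP down in each cluster
print(data_available({1, 2}))  # two AMPs down in the same cluster
```

Losing AMP1 and AMP5 together is survivable. Losing AMP1 and AMP2 together is not, because some row in cluster 1 has both of its copies on those two AMPs.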

The brilliance behind this protection is the Hash Map. There is a Base
Row Hash Map used to distribute the base rows. It's called the
Primary Hash Map. There is also the Fallback Hash Map that knows
exactly how AMPs are clustered and which AMP should host a
FALLBACK row.
In most systems, AMPs are clustered in a group of four. The next
most popular clustering scheme is a group of three. The
minimum number of AMPs per cluster is two, and the maximum
number of AMPs per cluster is 16. Let's look at the two extremes
(two versus 16).
The advantage of clustering in groups of two is that both AMPs would
have to fail before the system stopped. The disadvantage is that if
one AMP fails, the other must do its work plus the work of the down
AMP. With clustering in a group of two, every complex query will
take twice as long to process while an AMP is down.
The advantage to clustering in groups of 16 is that if one AMP fails,
there are 15 other AMPs doing their work and sharing in the work of
the failed AMP. The disadvantage to this type of clustering is there is
an increased risk of losing two AMPs in the cluster.
This is the reason four-AMP cluster configurations are so popular.
The chances of losing two AMPs out of four are quite low. However, if
one AMP is lost, the other three will share in the extra work.
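
The trade-off comes down to simple arithmetic, shown here as a short Python sketch. The function load_factor is a made-up name, and it assumes the failed AMP's work is shared evenly among the surviving AMPs of its cluster.

```python
def load_factor(cluster_size, failed=1):
    """How much work each surviving AMP does, relative to normal, after a failure."""
    survivors = cluster_size - failed
    if survivors < 1:
        raise ValueError("no AMPs left in the cluster")
    return cluster_size / survivors


for size in (2, 3, 4, 16):
    print(f"cluster of {size:2d}: each survivor does {load_factor(size):.2f}x its normal work")
```

A cluster of two doubles the survivor's work, a cluster of four adds about a third to each survivor, and a cluster of 16 adds only about 7 percent, but with 16 AMPs in the cluster there are many more ways for a second one to fail.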
FALLBACK is an optional means of protection specified at the
database or table level. It may be requested when the table is first
created, or you may add or drop FALLBACK at any time by using the
ALTER TABLE command. (For more information, refer to Teradata
SQL Unleash the Power by Mike Larkins and Tom Coffing).
Let's review FALLBACK and clarify related issues: When a new row is
inserted into a table, FALLBACK always places a second copy of that
row on another AMP in the same group, or cluster. Keep in mind that
a cluster usually consists of four AMPs. From that point on, any
manipulation of the data in the primary row also happens to the
FALLBACK row. FALLBACK rows are distributed evenly across all the
AMPs within the same cluster. If one AMP fails, processing continues,
and all subsequent changes to that AMP's rows are made to their
FALLBACK copies.
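The review above can be modeled in a few lines. This is a toy sketch, not Teradata's implementation: the eight-AMP layout, the function names, and the modulo rule standing in for the Fallback Hash Map are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical 8-AMP system split into two clusters of four.
CLUSTERS = {1: [1, 2, 3, 4], 2: [5, 6, 7, 8]}


def cluster_of(amp):
    """Return the list of AMPs in the cluster that contains `amp`."""
    for members in CLUSTERS.values():
        if amp in members:
            return members
    raise ValueError(f"unknown AMP {amp}")


def fallback_amp(primary_amp, row_key):
    """Pick a different AMP in the same cluster to hold the FALLBACK copy."""
    others = [a for a in cluster_of(primary_amp) if a != primary_amp]
    return others[row_key % len(others)]  # toy stand-in for the Fallback Hash Map


def insert(primary, fallback, row_key, row, primary_amp):
    """Write the base row and its FALLBACK copy."""
    primary[primary_amp][row_key] = row
    fallback[fallback_amp(primary_amp, row_key)][row_key] = row


def read(primary, fallback, row_key, primary_amp, down_amps=()):
    """Read from the primary AMP, or from the FALLBACK copy if it is down."""
    if primary_amp not in down_amps:
        return primary[primary_amp][row_key]
    backup = fallback_amp(primary_amp, row_key)
    if backup in down_amps:
        raise RuntimeError("two AMPs down in one cluster: row unavailable")
    return fallback[backup][row_key]


primary, fallback = defaultdict(dict), defaultdict(dict)
insert(primary, fallback, 2, "Ben Hon", primary_amp=1)
insert(primary, fallback, 10, "Don Roy", primary_amp=5)
print(read(primary, fallback, 2, primary_amp=1, down_amps={1}))
```

Losing one AMP in each cluster still leaves every row readable; losing an AMP together with the AMP that holds its FALLBACK copy does not.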
FALLBACK provides an optional insurance policy for a failed AMP;
however, there is a cost for that insurance. FALLBACK requires twice
as much disk space to store both the primary and duplicate rows of
a table. Another cost that should not be overlooked is that twice the
I/O (Input/Output) applies to inserts, updates and deletes because
there are always two copies to write. However, because Teradata AMPs
operate in parallel, both rows are placed on their respective AMPs at
nearly the same time.
Although FALLBACK may be used on any, all or no tables, its extra
cost causes most companies to use it only for mission-critical tables.
As you might suspect, the Data Dictionary is automatically FALLBACK
protected. FALLBACK may not protect your system from all failures,
but it certainly is an excellent fault-tolerant solution.


Down AMP Recovery Journal (DARJ)


The blockbuster movie "While You Were Sleeping," starring Sandra
Bullock, told a fascinating love story. A young woman who collected
tolls for the Chicago elevated train system fell in love with a man
who boarded the train each day at her station. However, the man
only knew her as the woman in the booth who collected his fare. One
day, the dashing young man tripped and fell into the path of the
train. Only quick action by the lovesick toll clerk kept him from
certain death. Although he avoided death, he fell into a coma. While
he was in a coma, the man's family fell in love with the toll clerk who
visited him in the hospital. Because she visited so often, the man's
family actually thought she was the man's fiancée. The movie
continues to tell how the man regains consciousness and the events
that immediately follow. At the end of the movie, it turns out that all
along the woman had been telling the man everything that happened
"while you were sleeping."
The Down AMP Recovery Journal (DARJ) is a special journal used only
for FALLBACK rows when an AMP is not working. Like the
TRANSIENT JOURNAL, the DARJ, also known as the RECOVERY
JOURNAL, gets its space from DBC's PERM space. When an AMP
fails, the rest of the AMPs in its cluster initiate a DARJ. The DARJ
keeps track of any changes that would have been written to the failed
AMP. When the AMP comes back online, the DARJ is used to catch the
AMP up on everything that occurred while it was sleeping. Then the
DARJ is discarded.
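The journal's life cycle (open on failure, record missed changes, replay on recovery, discard) can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are invented for the example, one journal is kept per cluster rather than per AMP, and only one AMP is allowed to be down at a time.

```python
class Cluster:
    """Toy cluster that keeps a recovery journal while one AMP is down."""

    def __init__(self, amps):
        self.rows = {amp: {} for amp in amps}  # rows stored on each AMP
        self.down = None                       # the AMP that is down, if any
        self.darj = []                         # changes the down AMP has missed

    def fail(self, amp):
        self.down = amp
        self.darj = []                         # the journal is opened on failure

    def write(self, amp, key, value):
        if amp == self.down:
            self.darj.append((amp, key, value))  # journal it instead of writing
        else:
            self.rows[amp][key] = value

    def recover(self):
        for amp, key, value in self.darj:      # replay in the original order
            self.rows[amp][key] = value
        self.down, self.darj = None, []        # then the journal is discarded


cluster = Cluster([1, 2, 3, 4])
cluster.fail(1)
cluster.write(1, 2, "Ben Hon")     # AMP 1 is sleeping, so this is journaled
cluster.write(2, 4, "Joe Davis")   # AMP 2 is up, so this is written directly
cluster.recover()
print(cluster.rows[1])
```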


[Figure: The Best_Friends table spread across two four-AMP clusters.
In CLUSTER 1 (AMP1 through AMP4), AMP1 is down and AMP2, AMP3, and
AMP4 each have a DARJ open. In CLUSTER 2 (AMP5 through AMP8), no DARJ
is open.]
In the previous picture there are two clusters, but notice that AMP 1
has failed. After the failure, the other AMPs in the top cluster open
the Down AMP Recovery Journal (DARJ). None of the AMPs in the
bottom cluster has the DARJ open. Why? Simply because the
FALLBACK rows for the down AMP are housed within its own cluster. If
anything happens while the AMP is sleeping, it has three extremely
cute ticket takers that will store all information pertaining to the down
AMP.


Redundant Array of Independent Disks (RAID)

"I know that you believe that you understand what you think I said,
but I am not sure you realize that what you heard is not what I meant."
Sign on Pentagon office wall
RAID never gets confused. It always knows exactly what the disk
said and it mirrors it exactly! The disks in the Disk Array modules
accessed by the AMPs are similar to a hard disk drive in a personal
computer. No doubt you have heard people complain that their hard
drive crashed. Well, disk drives crash inside modules that store
multiple disks, too. Redundant Array of Independent Disks (RAID)
protects against a disk failure. There are many levels of RAID in the
data storage industry. The most common level, and one that is used
by Teradata, is RAID-1, also called MIRRORING. With RAID-1, each
primary disk has a mirror image, or an exact copy of all its data on
another disk. The contents of both disks are identical.
When data is written on the primary disk, it is also written on the
mirror disk. However, the dual-write process is invisible to the user.
This is the reason RAID-1 is also called transparent mirroring.
Mirrored disks provide a high degree of reliability because when a
disk fails no data is lost; it's still fully accessible on the mirror
disk. Operations continue while the Disk Array Controller copies the
data from the mirror disk to a replacement primary disk. The down
side of RAID-1, like FALLBACK, is its disk space overhead: everything
is stored twice, so 50% of the raw capacity holds mirror copies.
Mirroring has typically been provided at the application or operating
system level. Teradata's RAID solutions, however, manage mirroring
at the Disk Array Controller level because it boosts performance. The
AMPs can read data from either the primary disk or its mirror. Plus,
the controller decides which read/write assembly (drive actuator) is
closest to the requested data.
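The behavior of a mirrored pair can be illustrated with a small model. This is a conceptual sketch only: the class, its method names, and the use of dictionaries as "disks" are assumptions for the example, and it does not model the controller's choice of the closest actuator.

```python
class MirroredPair:
    """Toy RAID-1 pair: every write lands on both disks."""

    def __init__(self):
        self.disks = {"primary": {}, "mirror": {}}
        self.failed = set()

    def write(self, block, data):
        for name, disk in self.disks.items():
            if name not in self.failed:
                disk[block] = data             # the dual write is invisible to callers

    def read(self, block):
        for name, disk in self.disks.items():
            if name not in self.failed:
                return disk[block]             # either healthy copy will do
        raise RuntimeError("both disks in the pair have failed")

    def fail(self, name):
        self.failed.add(name)
        self.disks[name] = {}                  # that disk's contents are gone

    def replace(self, name):
        survivor = "mirror" if name == "primary" else "primary"
        if survivor in self.failed:
            raise RuntimeError("no surviving copy to rebuild from")
        self.disks[name] = dict(self.disks[survivor])  # rebuild from the survivor
        self.failed.discard(name)


pair = MirroredPair()
pair.write(0, "2 Ben Hon")
pair.fail("primary")
print(pair.read(0))            # still readable from the mirror
pair.write(1, "10 Don Roy")    # operations continue on the surviving disk
pair.replace("primary")        # the replacement disk is brought up to date
```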
In the next example, an AMP is shown with its virtual disk. However,
this is conceptual. In actuality, each AMP has four physical disks.
Since only the AMP illustrated can get to its information, we like to
explain this concept as a single virtual disk. This concept is called a
"Shared Nothing" environment. Having four physical disks does not
break the shared nothing environment, because only the AMP that
owns the virtual disk can access its four disks.

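[Figure: Each AMP has one Virtual Disk]

[Figure: An AMP connected through the Disk Array Controller to Four
Physical Disks (two Data disks, each with its Mirror) that together
form One Virtual Disk. The rows "2 Ben Hon" and "10 Don Roy" appear on
a Data disk and again on its Mirror.]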

In the picture above, one AMP has one Virtual Disk, but it also has four
physical disks: two data disks, each with a mirror in case a disk is
lost. The four disks together form a Rank of Disks. Two disks in a
rank may be lost so long as they are not a data disk and
its own mirror. In this example, the data from the Best_Friends table is
displayed. It is on the first disk, and a mirrored copy of the same
information is on the second disk. If a disk goes down, the system does
not even flinch. It sends the operations personnel a message about the
failure, and keeps on running.
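The rule about which two disks a rank can afford to lose is easy to state as a check. The disk names below are hypothetical labels for the two data disks and their mirrors in the picture.

```python
# Hypothetical names for the four disks in one rank: two data/mirror pairs.
PAIRS = [("data1", "mirror1"), ("data2", "mirror2")]


def rank_survives(failed):
    """True as long as no data disk has failed together with its own mirror."""
    failed = set(failed)
    return all(not (data in failed and mirror in failed) for data, mirror in PAIRS)


print(rank_survives({"data1", "mirror2"}))   # two losses, different pairs
print(rank_survives({"data1", "mirror1"}))   # a data disk and its own mirror
```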


Cliques
In high school you can walk into the cafeteria and immediately
identify the cliques (pronounced "clicks"). In other words, they are
groups of students that hang around together because they have
formed a common identity and a common bond. The cliques in
Teradata are similar to, yet different from, high school cliques.
CLIQUES (pronounced "cleeks") in Teradata are a method of system
protection against the failure of an entire node. Multiple processing
nodes (SMPs) are not only connected with an unbroken line to their
own disks, but are also connected with a dotted line to each other's
disks. This shared disk arrangement forms a CLIQUE. If a node fails,
then its virtual processors (AMPs and PEs) migrate to other nodes in
its CLIQUE like birds flying south in winter. In a two-node clique, the
receiving node now has twice as many VPROCs, so its performance
slows down. The important factor is that the migrated VPROCs can still
access their own disks, and business continues until the failed node is
repaired or replaced.

[Figure: Node 1 (AMP 1 through AMP 16) and Node 2 (AMP 17 through
AMP 32), each with Intel Processors and Memory. AMP 16 and AMP 17 are
each connected only to their own Virtual Disk. This is NOT a clique.]


The picture above shows two nodes. A node can be thought of as a
powerful PC with four Intel Processors. AMPs and PEs reside inside the
node's memory, and there are about 10 to 16 AMPs per node and two to
three PEs per node. This configuration is a two-node, 32-AMP system.

Let's focus on AMP 16 in node one and AMP 17 in node two (look at the
arrows). AMP 16 has its own virtual disk and similarly, AMP 17 has
its own virtual disk. Remember, no other AMP is allowed in another
AMP's virtual disk.
What if an entire node, say node one, is lost? Well, then AMPs 1-16
cannot access any disks. To prevent this, let's create a clique in our
next picture. The idea of a clique is to connect both nodes to one
another's disks. That way, if either node goes down, its AMPs can
migrate over the BYNET and join the other 16 AMPs in the surviving
node's memory. However, each AMP will still have a connection to its
original virtual disk.

[Figure: The same two nodes with Clique Cables added, so that each
node is also connected to the other node's virtual disks. This is a
clique.]
In the illustration above, cables have been added. If node one or node
two goes down, the AMPs can migrate to the other node and still have
access to their own disks. The only difference is that the migrating
AMPs now reside in memory on a different node, plus they are accessing
their own virtual disk via a different physical cable.
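The migration can be sketched as a simple re-placement of AMPs. This is an illustration under assumptions: the function name, the node names, and the round-robin spreading of AMPs across surviving nodes are invented for the example, and the ownership of virtual disks is not modeled because it never changes.

```python
def fail_node(amps_by_node, clique, failed):
    """Return the AMP placement after `failed` goes down.

    amps_by_node: {node name: [AMP numbers]}
    clique: set of node names cabled to one another's disks
    """
    if failed not in clique or len(clique) < 2:
        raise RuntimeError(f"{failed} is not in a clique; its AMPs lose their disks")
    survivors = sorted(node for node in clique if node != failed)
    placement = {n: list(amps) for n, amps in amps_by_node.items() if n != failed}
    for i, amp in enumerate(amps_by_node[failed]):
        placement[survivors[i % len(survivors)]].append(amp)  # AMP moves in memory only
    return placement


nodes = {"node1": list(range(1, 17)), "node2": list(range(17, 33))}
after = fail_node(nodes, clique={"node1", "node2"}, failed="node1")
print(len(after["node2"]))  # node2 now hosts its own AMPs plus the migrants
```

With only two nodes in the clique, the surviving node ends up hosting all 32 AMPs, which is why its performance slows down.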
People who come from the colder climates to spend their winters in
sunny Florida are often called snowbirds. Do you know what bird
migrates farther than any other bird on the planet? It is the Arctic
tern. This bird leaves its Arctic Circle home in August for its winter
vacation home in Antarctica, a round trip of more than 11,000 miles!


In the same way, when a node goes down, the software AMPs and PEs
migrate over the BYNET to a temporary home on another node.

[Figure: NODE Crash. Node 1 has crashed, and all 16 of its AMPs
migrate to the new node. AMP 16 reaches its Virtual Disk over a new
path.]


Permanent Journal

"The absent are always in the wrong."
English Proverb
If a system had five million rows and used FALLBACK protection, then
it would have five million FALLBACK rows. However, this would be
quite costly because FALLBACK actually stores a duplicate copy of all
the rows on other AMPs within the same cluster. FALLBACK is used
either because the system is mission-critical or because the system is
not backed up regularly. For customers who back up data regularly,
another option for data restoration is the Permanent Journal. When
a company can tolerate the couple of hours a restoration takes to
complete, this is a very good option. The Permanent
Journal works in conjunction with backup procedures, plus it's a lot
more cost-effective than FALLBACK.
The Permanent Journal stores only images of rows that have been
changed due to an INSERT, UPDATE, or DELETE command. It keeps
track of all new, deleted or modified data since the last Permanent
Journal backup. This option is usually less expensive than storing the
additional five million FALLBACK rows.
Like FALLBACK, the Permanent Journal is optional. It may be used on
specific tables of your choosing or on no tables at all. It provides the
flexibility to customize a Journal to meet specific needs. The
Permanent Journal must be manually purged from time to time.
There are four image options for the Permanent Journal:
1. The BEFORE JOURNAL stores an image of a table row before it
changes. It is used to perform a manual rollback to a specific point
in time should there be a programming error.
2. The AFTER JOURNAL stores an image of a table row after it
changes. It is used to manually roll forward from a specific point
in time.
3. A DUAL BEFORE JOURNAL captures two images of a table row
before it changes. This type of journal stores the duplicate images
on two different AMPs.
4. A DUAL AFTER JOURNAL captures two images of a table row after
it changes and stores those images on two different AMPs.
In order to explain journaling, let's say that the Customer
Representative table is created with a BEFORE Journal. After it's
created, a programmer is told to move every Customer Representative
from the Western Region to the newly designated Southwest Region.
However, every representative from every region is accidentally
transferred to the Southwest Region. Because there is a BEFORE
Journal, a programmer has the ability to manually roll back the data to
the specific point in time BEFORE this update occurred. Note that this
was not a transaction failure. The update was successful but it was
not accurate. The BEFORE Journal saves the day!
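The idea of rolling back with BEFORE images can be sketched in a few lines. This is a conceptual model only: the function names, the timestamps, and the dictionary standing in for the table are assumptions, not Teradata's journal format.

```python
def update_with_before_journal(table, journal, key, new_row, timestamp):
    """Save the BEFORE image (None if the row is new), then change the row."""
    journal.append((timestamp, key, table.get(key)))
    table[key] = new_row


def rollback_to(table, journal, point_in_time):
    """Undo every journaled change made after point_in_time, newest first."""
    while journal and journal[-1][0] > point_in_time:
        _, key, before = journal.pop()
        if before is None:
            table.pop(key, None)   # the row did not exist before the change
        else:
            table[key] = before    # put the BEFORE image back


reps = {1: "Western", 2: "Eastern", 3: "Northern"}
journal = []
for t, key in enumerate(list(reps), start=1):   # the faulty update hits every row
    update_with_before_journal(reps, journal, key, "Southwest", timestamp=t)
rollback_to(reps, journal, point_in_time=0)
print(reps)
```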
The AFTER JOURNAL works in the opposite way. In this scenario,
company officials decided not to use FALLBACK on any tables. The
data was not mission-critical, and it could be restored from backup
tapes if necessary. A FULL SYSTEM BACKUP takes place on the first
day of each month. Plus, an AFTER JOURNAL has been placed on all
the tables in the system. Every time a new row is added or a change
is made to an existing row, Teradata captures the AFTER image.
Suppose a hardware failure occurs on the 5th day of the month and
data is lost.
To recover the data, the hardware problem should be fixed, and then
the data should be reloaded from the FULL SYSTEM BACKUP done on
the 1st of the month. The AFTER JOURNAL is then used to reapply the
transactions that either added or modified data between the 1st and 5th
day of the month. As you can see, an AFTER JOURNAL is used to
roll forward, and this is usually done to restore data lost as a result
of a hardware problem.
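The recovery sequence (reload the backup, then reapply the AFTER images) can be sketched as follows. The function name and the journal layout of (day, key, after image) tuples are assumptions for this example, and the journal is assumed to be in chronological order.

```python
def roll_forward(backup, after_journal, backup_day, failure_day):
    """Rebuild a table from a full backup plus AFTER images captured since then."""
    table = dict(backup)                    # step 1: reload the full backup
    for day, key, after_image in after_journal:
        if backup_day <= day <= failure_day:
            table[key] = after_image        # step 2: reapply each insert or update
    return table


backup = {1: "Ben Hon"}                     # full system backup taken on day 1
after_journal = [
    (2, 2, "Joe Davis"),                    # row added on day 2
    (4, 1, "Ben Hon Jr."),                  # row changed on day 4
]
print(roll_forward(backup, after_journal, backup_day=1, failure_day=5))
```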


The following example shows the use of FALLBACK and the
PERMANENT JOURNAL:
CREATE TABLE TomC.employee, FALLBACK,
    BEFORE JOURNAL,
    DUAL AFTER JOURNAL
(
 emp        INTEGER
,dept       INTEGER
,lname      CHAR(20)
,fname      VARCHAR(20)
,salary     DECIMAL(10,2)
,hire_date  DATE FORMAT
)
UNIQUE PRIMARY INDEX(emp);

The example above creates the table called Employee in the TomC
database, and the table is FALLBACK protected. A BEFORE Journal and a
DUAL AFTER Journal are specified. Remember that both FALLBACK and
JOURNALING have defaults of NO, meaning that if you don't specify this
protection at either the table or database level, the default is NO
FALLBACK and NO JOURNALING.


Locking Modes in Teradata

"You just obey instructions; we'll take care of the obstructions."
David Seamands
A private pilot was flying into a new town when the weather turned
suddenly cloudy and he became confused. Not very experienced in
landing by instrument, he began to panic, thinking of the hills, trees
and buildings below. But the local air traffic controller commanded
him, "You just obey instructions; we'll take care of the obstructions."
Many database systems can become confused when the number of
users begins to grow. But like a master air traffic controller, Teradata
uses a brilliant locking logic that gets each user to the right data at the
proper time without conflicting or disastrous results.
Teradata allows hundreds, even thousands of users, to access the data
warehouse concurrently. However, there would be a lot of confusion
about which user had access to a table first if it were not for the
LOCKING MODES. No one likes to be waiting for a long time in a line
only to have someone cut in front of him or her. Teradata uses LOCKS
to help maintain data integrity. Locks are activated on the targeted
database, table, or row while the SQL request is executed. Those locks
are released upon query completion.
There are four modes of locking:
1) The EXCLUSIVE LOCK is the mother of all locks. It's placed only
on databases or tables, and restricts access to them whenever a
structural change is made. EXCLUSIVE LOCKing reminds me of
what happens when there is a structural change being made to a
parking garage. A construction company will wrap what seems
like thousands of yards of bright orange plastic fencing around
the garage in order to keep people out and protect them from
falling debris. To this day, I have not seen a database or table
fall on top of a user! The EXCLUSIVE LOCK prevents any access,
period.
2) The WRITE LOCK jumps to action whenever a user asks for an
INSERT, DELETE, or UPDATE. Keep in mind, these commands
are writing actions. No other EXCLUSIVE, WRITE, or READ locks can
cut in line ahead of an existing WRITE LOCK. The only exception
is an ACCESS LOCK, one that allows a user to read data that
may not be totally accurate due to modifications being made at
the time it is accessed. This kind of read is called a "stale" or
"dirty" read.
3) Everybody loves the READ LOCK. It's placed whenever the
SELECT command is used. With a READ LOCK a thousand users
can simultaneously SELECT from a table. A READ LOCK will
prevent either an EXCLUSIVE or WRITE LOCK from jumping ahead
in the queue.
4) When a user is not concerned with precisely accurate data, he or
she may request an ACCESS LOCK. This lock can jump in line
ahead of either a READ or WRITE LOCK, but not an EXCLUSIVE
LOCK.
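The four modes above boil down to a compatibility table: which lock can be granted while another is already held. The sketch below encodes that table; the dictionary and function names are made up for the example, and queueing order is not modeled.

```python
# For each requested lock, the set of already-held locks it can coexist with.
COMPATIBLE = {
    "ACCESS":    {"ACCESS", "READ", "WRITE"},
    "READ":      {"ACCESS", "READ"},
    "WRITE":     {"ACCESS"},
    "EXCLUSIVE": set(),
}


def can_grant(requested, held):
    """True if `requested` can be granted while every lock in `held` is in place."""
    return all(lock in COMPATIBLE[requested] for lock in held)


print(can_grant("READ", ["READ", "READ"]))   # many readers at once
print(can_grant("WRITE", ["READ"]))          # a writer waits for readers
print(can_grant("ACCESS", ["WRITE"]))        # a dirty read slips past a writer
print(can_grant("ACCESS", ["EXCLUSIVE"]))    # nothing gets past EXCLUSIVE
```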


Referential Integrity
Just how important is it to protect the integrity of your data? This story
says it all: After reading an advertisement offering split, dry firewood
for $60 a cord (including delivery), Jeff decided to place a phone order.
Upon delivery, Jeff was upset when the deliveryman finished stacking
the wood. Jeff objected, "That's not a full cord of wood!" "Well,
that's what I call a cord," the man answered firmly. Grudgingly, Jeff
pulled some money out of his pocket and thrust it into the man's
hands. "Hey, just a minute," the man said after counting the money.
"You only gave me $30!" Jeff shrugged his shoulders and replied,
"Well, that's what I call $60."
Imagine getting fired from your job and the company deletes you from
its employee table, but forgets to delete you from the payroll table.
That's not like getting fired; it's more like getting fired up for a
Bahamas vacation. Referential Integrity would have stopped this
oversight. RI, as it is called, would not allow anyone to be deleted
from the employee table unless he or she was also deleted from the
payroll table.
REFERENTIAL INTEGRITY (RI) is the relational concept that mandates
that a row cannot be inserted into a table unless the value in its
referencing column also exists in another table within the database.
Conversely, a row whose value is referenced by rows in another table
may not be deleted until that value is first removed from the
referencing table.
An important function of RI on a newly created table is that it will not
allow invalid data values to be entered into a column. If RI is added
to an existing table that already contains RI violations, the ALTER
TABLE will still proceed. Plus, it will store a copy of the table's
RI violations for review and correction. The user will then need to
locate that copy and make corrections to the original table.
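The two RI rules (no insert without a matching value, no delete while a value is still referenced) can be sketched with the employee and payroll tables from the story. The dictionaries, function names, and exception class are assumptions made for illustration; a real system enforces these rules inside the database.

```python
class RIViolation(Exception):
    """Raised when a change would break referential integrity."""


employee = {}   # emp number -> name (the referenced table)
payroll = {}    # emp number -> salary (the referencing table)


def insert_payroll(emp, salary):
    """Reject a payroll row whose emp value has no match in employee."""
    if emp not in employee:
        raise RIViolation(f"employee {emp} does not exist")
    payroll[emp] = salary


def delete_employee(emp):
    """Reject the delete while a payroll row still references the employee."""
    if emp in payroll:
        raise RIViolation(f"employee {emp} still has a payroll row")
    employee.pop(emp, None)


employee[1] = "Jeff"
insert_payroll(1, 60.00)
try:
    delete_employee(1)          # blocked: the payroll row must go first
except RIViolation as err:
    print(err)
del payroll[1]
delete_employee(1)              # now the delete is allowed
```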


Loading the Data


One night I said to my son, "When Abraham Lincoln was your age, he
studied by candlelight." My son retorted, "When Abraham Lincoln was
your age, he was president."
Just as Lincoln will go down as one of the greatest presidents in
history, Teradata will not go down when it loads history. Data within a
warehouse environment is often historic in nature, so the sheer
volume of data can overwhelm many systems. But not Teradata!
Teradata is so far ahead of the data loading game that other database
vendors can't hold a candle to it. A data warehouse brings enormous
amounts of data into the system. This is an area that most companies
overlook when purchasing a data warehouse. Most company officials
think loading data is simply that: just loading data. Some people
actually ask, "Are data loads that critical?" Come on, ASCII stupid
question and get a stupid ANSI.
Seriously though, there are data warehouses in existence today that
simply can't load data once it reaches a certain volume. As one
Teradata developer said, "It is not the load that brings them down, but
the way they carry it." Even an experienced body builder must use a
good technique to lift the weight over his head. While most database
vendors are new to the game, Teradata has had 15 years of practice
loading the largest data warehouses in the world. Now, the
combination of Fastload, Multiload, and Tpump can load millions, even
billions, of records in record time.


Fastload
Fastload is designed to load flat file data from a mainframe or LAN
directly into an empty Teradata table. This is how a Teradata table is
populated the first time. I have personally seen Teradata load over
one billion large rows in less than 6 hours. Plus, I have seen Teradata
load millions of rows in minutes. Teradata has the quickest time to
solution, and has the most powerful performance in the data
warehousing industry. How are Teradata's speed and performance
accomplished? It's done through parallel processing.
Fastload understands one SQL command: INSERT. It inserts rows
into an empty table. The process is as follows: A flat file is prepared
for loading on a mainframe or LAN. The FASTLOAD utility needs three
pieces of information to process: where the flat file is located, what
its file definition is, and what table the data should be loaded into in
Teradata.
When the Fastload utility starts, the Parsing Engine comes up with a
plan for the AMPs. The Parsing Engine then steps back and lets the
AMPs do their work. The data is loaded in large 64K blocks. Each AMP
is given a 64K block of rows for loading. Like a line of workers trying
to pass sand bags to prevent a flood, Teradata passes these blocks
from AMP to AMP until all the data is on Teradata. Next, all AMPs take
the blocks they received, hash the rows in those blocks (in parallel)
and send the rows to the proper AMP over the BYNET. Once this is
done, each AMP sorts its data by Row ID and the table is ready for
business.
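The two phases described above (hand out blocks, then hash and redistribute, then sort) can be sketched as a toy model. Everything here is an assumption for illustration: the function names, the CRC-32 checksum standing in for Teradata's hashing algorithm, the modulo rule standing in for the hash map, and the tiny block size.

```python
import zlib


def row_hash(primary_index_value):
    """Toy stand-in for Teradata's row hashing (an assumption of this sketch)."""
    return zlib.crc32(str(primary_index_value).encode())


def fastload(rows, num_amps, rows_per_block=3):
    """Load (primary_index_value, data) rows into an empty table on num_amps AMPs."""
    # Phase 1: deal fixed-size blocks to the AMPs in turn, regardless of content.
    received = [[] for _ in range(num_amps)]
    for start in range(0, len(rows), rows_per_block):
        block = rows[start:start + rows_per_block]
        received[(start // rows_per_block) % num_amps].extend(block)

    # Phase 2: each AMP hashes its rows and sends each one to its owning AMP.
    owned = [[] for _ in range(num_amps)]
    for amp_rows in received:
        for key, data in amp_rows:
            h = row_hash(key)
            owned[h % num_amps].append((h, key, data))

    # Finally, each AMP sorts its rows by row hash (standing in for the Row ID).
    for amp_rows in owned:
        amp_rows.sort(key=lambda row: row[0])
    return owned


rows = [(k, f"row {k}") for k in range(1, 21)]
for amp, amp_rows in enumerate(fastload(rows, num_amps=4)):
    print(amp, [key for _, key, _ in amp_rows])
```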
Fastload Basics:
- Loads data to Teradata from a Mainframe or LAN flat file;
- Only one table may be loaded at a time;
- The table to be loaded must be empty;
- There can be no secondary indexes, referential integrity, or
  triggers;
- It doesn't support Multi-set tables; and
- It locks at the table level.


[Figure: DATA from a Mainframe or LAN passes through the PE and across
the BYNET to the AMPs in 64K blocks.]


Multiload
Where Fastload is meant to populate empty tables with INSERTS,
Multiload is meant to process INSERTS, UPDATES, and DELETES on
tables that have existing data. Multiload is extremely fast. One major
Teradata data warehouse company processes 120 million inserts,
updates, and deletes during its nightly batch.
Multiload works similarly to Fastload. Data originates as a flat file on
either a mainframe or LAN. When the Multiload utility is executed, the
Parsing Engine creates a plan for the AMPs to follow. The data is then
passed to the AMPs, in parallel, in 64K blocks, and the AMPs hash the
rows to the proper AMP. Last, the INSERTS, UPDATES, and DELETES
are applied.
In the previous diagram the mainframe/LAN is talking to the Parsing
Engine. The PE passes the data across the BYNET for the AMPs to
retrieve. Keep in mind, many systems have hundreds to thousands of
AMPs. The load takes place, continually, in parallel when the 64K
packets are delivered to the AMPs. Multiload has been designed for
users who have a need for speed.
Multiload locks at the table level. Therefore, while Multiload is running,
the table is unavailable.
Multiload Basics:
- Loads data to Teradata from a Mainframe or LAN flat file;
- Up to 20 INSERTS, UPDATES, or DELETES may be executed on
  up to 5 tables;
- Receiving tables are usually populated;
- There can be no Unique secondary indexes, referential integrity,
  or triggers;
- It doesn't support Multi-set tables; and
- It locks at the table level.


Tpump
The Tpump utility is designed to allow OLTP transactions to
immediately load into a data warehouse. When I started working with
Teradata, more than 10 years ago, most companies loaded data on a
monthly basis. Suddenly, companies began to load data weekly.
Today, most companies load data nightly, and industry leaders are
loading data hourly. Tpump is the beginning step of an Active Data
Warehouse (ADW). ADW combines OLTP transactions with a
Decision Support System (DSS).

"You don't drown by falling into the water; you drown by staying in
the water."
Edwin Louis Cole
If the data is not flowing, a company can drown in it! The utility is
called Tpump because it acts like a water faucet. Tpump
can be set to full throttle to load millions of transactions during
off-peak hours or turned down to "trickle" small amounts of data during
the data warehouse rush hour. It can also be preset to automatically
load at different rates at certain times during the day, and its
settings can be modified at any time.
Also, Tpump locks at a row level so users have access to the rest of
the rows while the table is being loaded.
Tpump Basics:
- Loads data to Teradata from a Mainframe or LAN flat file;
- Processes INSERTS, UPDATES, or DELETES;
- Tables are usually populated;
- It can have secondary indexes, triggers, and referential
  integrity;
- It doesn't support Multi-set tables; and
- It locks at the row level.


Conclusion: A Final Thought on Teradata

"Genius is one percent inspiration and ninety-nine percent
perspiration."
Thomas Alva Edison
Thomas Edison only averaged 4 hours of sleep every night. That is not
surprising because that stupid light was always on. Teradata
developers averaged about 4 hours of sleep because as their brilliance
continued to unfold the light kept going on. Teradata was originally
designed to handle large amounts of data back in 1976. Most other
databases were designed to handle On-line Transaction Processing
(OLTP). Teradata has been able to continually improve on its design
for the past 15 years at many of the largest data warehouse sites in
the world. As someone once said, "Before you can eat the fruit you
must climb the tree." Teradata has been climbing to the top for over a
decade. The fruits of labor have paid off big for both Teradata and
Teradata customers. Here is why Teradata was made for e-business
data warehousing:
- Parallel processing for unlimited performance
- Unlimited scalability of data, users, and applications
- Ability to answer extremely complex queries
- Ease of setup and maintenance: only one DBA needed
- Ability to load data at lightning speeds from a mainframe or LAN
- Ability to answer any question on any data without any DBA
  intervention or tuning
- Performance capabilities to model detail data in 3rd Normal Form
  or Dimensional Models
