Teradata Basics
We also want to thank our wives, Leona Coffing and Janie Jones.
TABLE OF CONTENTS
INTRODUCTION
Rule # 1 Start Building Towards A Central Data Warehouse
Rule # 2 Build for the User
Rule # 3 Let the IT Department Lead the Way to User Utopia
Rule # 4 Build the Foundation Around Detail Data
Rule # 5 Build Data Marts from the Detail
Rule # 6 Make Scalability Your Best Friend
Rule # 7 Model the Data Correctly
Rule # 8 Don't Let a Technical Issue Make Your Data Warehouse a Failure Statistic
Rule # 9 Take a Building Block Approach
Rule # 10 Buy a Teradata Data Warehouse
INTRODUCTION
A full 40% of Fortune's "U.S. Most Admired" companies use Teradata.
What do they know that your company needs to know? I've been in
the computer business for more than 27 years. I've witnessed so
much since the early days of punch cards, assembler languages, and
COBOL programming. With that in mind, the most magnificent,
ingenious technology that I've ever seen is a database from the NCR
Corporation called Teradata.
Moments after midnight on July 30, 1945, the Navy cruiser USS
Indianapolis suffered a fatal torpedo hit from a Japanese submarine.
It had been traveling unescorted through the Philippine Sea. Within
12 minutes of the deadly hit, the ship sank. Over 300 men were killed
and nearly 900 were stranded in shark-infested seas. Tragically, those
who survived until daylight faced four torturous days in the water, and
battled continuous shark attacks before being stumbled upon by a
passing ship. In the end, only 316 souls survived. With a crew of
1,199 people, this was one of the worst military disasters of World War
II for the United States.
Most people assume that war is cruel, but the heart-wrenching story
above becomes even more tragic when the following facts are
revealed: First, the ship's captain did not have all of the facts, and
second, the Navy did not provide the captain with a single version of
the truth. The captain's request for a destroyer escort was denied
even though the regional Naval command knew another ship had been
attacked just two days earlier, plus multiple enemy sightings had
occurred within the previous five days. Not only were these crucially
relevant facts withheld, but also the captain of the Indianapolis was
told that his passage route was clear and there would be no need for a
destroyer escort.
they are one solid team. An experienced data warehouse team saves
valuable money and resources, and users can manage the entire data
warehouse. Executives may ask any question targeted to any part of
the business. Decisions are made with long-term vision, and every
employee is confident that when they need answers, the data
warehouse will provide them.
While visiting with this team, management decided at one point that
stores across the country should place Halloween displays and candy
near the cash registers. In less than two hours, stores moved their
Halloween candy from the normal candy aisles to end-caps near the
cash register. Every store participated but one!
When asked why he didn't participate, the store manager said he had
simply run out of time to create the displays and move the Halloween
candy from his normal candy aisle to the end-caps. Management was
ticked. Telling the manager they would get back to him, they then
asked the DBA to query the data warehouse to see how much this
snafu had cost the company. The DBA came back and reported that
the store actually sold almost the same amount of Halloween candy as
forecasted. Management was surprised and honestly a little
disappointed with the answer. But then the DBA added somewhat
sheepishly, "I found something else, too." "Go ahead," replied
members of the management team. He said, "I found out they
actually sold about 40% more normal candy than we forecasted for
this holiday." Management got on the phone immediately and told the
other thousand stores: "Move those goblins and Halloween candy back
to the normal candy aisles!"
What that DBA did was to use his instinct and the data warehouse to
find out exactly what was going on with the business at that time. He
was armed with a system that had cross-functional analysis. A central
data warehouse gives good management great confidence because
they see the whole picture. When users can ask any question, at any
time, and on any data, their knowledge is unlimited.
Most Teradata Central Data Warehouse sites will tell you most of their
Return On Investment (ROI) came from areas they never suspected.
Thomas Jefferson once said, "We don't know one millionth of a percent
about anything." When we explained Teradata to Jefferson he did not
build another Monticello, but he did retract his statement! Companies
with a centralized data warehouse know about a million percent more
than companies that have invested in stovepipe applications and 300
different data marts.
Actually, any company planning on competing in this millennium must
think long-term and begin building a centralized data warehouse. If
not, that company will be on the short end of the stick when
competing with a company that chose to build one. That thought
should sound scarier than a goblin near the cash registers on
Halloween!
If you think about it, every major decision in business makes someone
happy. If you are armed with facts supported by a central data
warehouse and you do your homework, your business decisions will
make your shareholders happy. However, if you are making decisions
with a data mart strategy, those decisions are more likely to make
your competitors happy.
There are many companies that are fearful of such an undertaking.
They want a central data warehouse, but wonder: What if it fails?
Which database should we choose? What type of hardware do we
need? Should we do an RFP? Decisions, decisions! It would literally
take me about 30 seconds to make a decision on Teradata. There
would be no RFP. We used to wade in swimming pools of data; today
we are swamped in oceans of data. Teradata is built for this type of
environment.
This book explains the fundamentals of Teradata.
Anyone with any experience or knowledge about data warehouse
environments will clearly see why Teradata is the best solution.
user can easily ask questions and get answers. It's also the IT
department's role to build a system that allows users to ask questions
on their own without IT intervention. Forget about building a system
where users ask IT to run the queries for them. When users need
information, the IT department should eventually be able to say, "Ask
the question yourself; it is all available to you."
The business users are actually the stars; however, the entire business
community must take responsibility for the warehouse's success.
These users must continually educate themselves and other users on
the capabilities of the data warehouse, new tools, and new techniques
that will enhance its potential. Those same users must help IT help
them. If both understand their respective roles and work together to
help the company, then the data warehouse will be a huge success.
pay for the disk space it actually takes to keep detail data, but believe
me, that cost is a small price to pay for success.
Galileo was a smart man. How did he know so much about life and
data marts? When we explained data marts to Galileo, he said, "You
cannot build a data mart directly from the OLTP systems; you can only
build a data mart directly from the detail within." He was right!
Many companies build data mart after data mart directly from the
OLTP systems, and their universe begins to revolve around continual
maintenance. Then, as things get worse, as Galileo predicted, their
universe begins to revolve around the son. The son of a gun sent in to
replace them!
Why does this happen? At first, things work out great, but soon there
are more and more requests for additional information. As a result,
more and more data marts are created, and soon the system looks like
a giant spider web. Different data marts start to yield different results
on like data, and the actual maintenance of this complicated spider
web takes up most of IT's time. Meanwhile, short-term dreams turn
into long-term nightmares like this one: A man and his wife had had a
big argument just before he went on a business trip. Feeling rather
contrite about his harsh words, he arranged to send his wife some
flowers and asked the florist to write on the card, "I'm sorry. I love
you." The beautiful bouquet arrived at the door. But then his wife
read the words the florist had actually written in haste, "I'm sorry I
love you."
The top reasons to build data marts directly from detail data are:
Users can get answers from the data mart, but must validate
their findings or check out additional information from the detail
that built it.
Maintenance is easy
If a user comes up with a data mart answer that does not make sense,
then he or she has the ability to drill down into the detail and
investigate. Sometimes summary data can spark interest and finding
out the why can result in big bucks.
If users don't trust the data, they won't use the system. When a data
warehouse is built on a foundation of detail data and then data marts
are erected from that foundation, you have a winning combination.
The results will always be consistent and trustworthy. However, you
should only build data marts when there is a credible business case,
and you should be ready to drop them when they are no longer
needed. The life span of a data mart is short relative to that of its
mother and father (better known as the detail data). If you build the
data marts from the detail, they are easy to manage, easy to
drop, and easy to change.
Join tables
Aggregate data
Sort data
Scan large volumes of data.
the centralized data warehouse. The user will then have access to
both the data marts for repetitive queries, and the central warehouse
for other queries.
Because data marts can be an administrative nightmare, Teradata
enables Star-Schema access without requiring physical data marts.
By setting up a join index as the intersection of your Star-Schema
model, you can create a Star-Schema structure directly from your
3rd Normal Form data model. Best of all, once it is created, the data
is automatically maintained as the underlying tables are updated.
Keep in mind, 80% of data warehouse queries are repetitive, but 80%
of the Return On Investment (ROI) is actually provided by the other
20% of the queries that go against detailed data in an iterative
environment. By using a normalized model for your central data
warehouse and a Star-Schema model on data marts, you can
enhance the possibility of realizing an 80% Return on Investment and
still enhance the performance on 80% of your queries.
As the data grows in volume, can the system meet the performance
requirements? Do the math!
Could I become the hero of the company one day, only to have
some technical glitch blamed on me because of my poor foresight
and be thrown out of the company into a giant mud puddle? Do the
bath!
intervals. Once the first application works, then you are ready for
more projects. As you become more experienced with this approach,
you can add multiple projects in parallel by involving multiple
organizations.
The second aspect of the building block approach is in the actual data
warehouse architecture. It doesn't matter if yours is the smallest data
warehouse in the world, the largest, or falls somewhere in between,
power and scalability always fuel success.
Not long ago a customer flew out to San Diego for a Teradata
demonstration and benchmark. The benchmark ran late into the
evening, but the numbers were more than 50% better than the
competition. The customer was extremely impressed, but before
buying he demanded to see the system scalability that everyone had
been talking about. Although it was already late, a Teradata employee
was called in the middle of the night, arrived within 10 minutes (in
pajamas), hooked up the building blocks, and ran a utility called
config. She ran another called reconfig, and in less than two hours
the system size doubled.
As the environment changes in terms of users, data, complexity,
capacity, batch windows, time changes, events, or opportunities, users
should be able to continue building applications and architecture. The
more a Teradata system grows, the more Teradata outshines the
competition.
Driving in the car one evening, Morgan's eight-year-old daughter Kara
piped up from the back seat, "Daddy, can you buy Teradata in the
store? I mean, what does Teradata really do?" Morgan thought for a
moment and then replied, "Do you remember when you went on the
Easter egg hunt last spring? Well, imagine that we had fifty eggs and
you were the only child there. If I asked you to find all the purple
eggs, would you be able to do that?" Kara said, "Sure! But it might
take me a long time." Morgan continued, "What if we now let fifty
children go in and I asked them to show me all of the purple eggs?
How long would that take?" His daughter responded, "It wouldn't take
any time at all because each child would only have to look at one egg."
That is precisely how Teradata works. It divides up huge tasks among
its processors and tackles each portion simultaneously, with amazing
speed. And it doesn't matter if you have a trillion eggs in your basket!
In 1984, the DBC/1012 was introduced. Since then, Teradata has
been the dominant force in data warehousing. Teradata got the
chickens plowing, and is considered outstanding. Meanwhile, IBM's
plow is out rusting in its field.
Parallel Processing
Memory: This is the hand of the computer. The memory allows data
to be viewed, manipulated, changed, or altered. Data is brought in
from the hard drive and the processor works with the data in memory.
Once changes are made in memory, the processor can command that
the information be written back to disk.
Hard Drive: This is the spine of the computer. The hard drive
stores data, applications, and the Operating System inside the PC.
The hard drive, also called the disk drive, holds the contents of the
data for the system on its disk.
For example, suppose you made three new good friends this month
and want to add their names to your list. Opening that document
brings it up from the hard drive and displays it on your screen. As you
type in the new names, the processor executes your request onto the
document while it is still being displayed in memory. Upon
completion, you close the document and the processor writes all the
changes to the disk where it is stored.
In the picture below, we see the basic components of a Personal
Computer. Note that it also holds a file called Best_Friends, which
lists eight best friends.
[Figure: A Personal Computer with a Processor, Memory, and Disk. The BEST FRIENDS file lists 1 Ben Hon, 2 Joe Davis, 3 Mary Gray, 4 John Davis, 5 Don Roy, 6 Sam Mills, 7 Kyle Marx, and 8 Lyn Jones.]
[Figure: Two processors connected by a Network, each with its own Memory. One holds BEST FRIENDS 1 through 4 (Ben Hon, Joe Davis, Mary Gray, John Davis); the other holds BEST FRIENDS 5 through 8 (Don Roy, Sam Mills, Kyle Marx, Lyn Jones).]
[Figure: The BEST FRIENDS list split four ways: 1 Ben Hon and 5 Don Roy; 2 Joe Davis and 6 Sam Mills; 3 Mary Gray and 7 Kyle Marx; 4 John Davis and 8 Lyn Jones.]
[Figure: A Parsing Engine connected over the BYNET Network to eight AMPs.]
Teradata spreads the rows of a table evenly across all AMPs in the
system. When the PE asks the AMPs to get the data, each AMP
reads only the rows on its own disk. Because this is done
simultaneously, all AMPs should finish at about the same time. As a
matter of fact, when we explained this philosophy to Confucius he
stated, "A query is only as fast as the slowest AMP." Confucius,
however, did say not to quote him!
Again, an AMP's job is to read and write data to its disk. The AMP
takes its direction from the Parsing Engine (PE). The number of AMPs
varies per system. Today, some Teradata systems have just four
AMPs, while others have more than 2,000!
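To see why even distribution matters, here is a small Python sketch. It is a toy model, not Teradata code: the read rate and the row counts are made-up numbers, and the only idea it borrows from the text is that all AMPs work at once, so the query takes as long as the busiest AMP.

```python
# Toy model (not Teradata code): how row distribution affects scan time.
# Assumption: every AMP reads rows at the same fixed, made-up rate.

ROWS_PER_SECOND = 1_000  # hypothetical read rate per AMP


def scan_seconds(rows_per_amp):
    """All AMPs scan at once, so the query takes as long as the busiest AMP."""
    return max(rows_per_amp) / ROWS_PER_SECOND


even = [25_000, 25_000, 25_000, 25_000]    # 100,000 rows spread evenly
skewed = [70_000, 10_000, 10_000, 10_000]  # same 100,000 rows, one overloaded AMP

print(scan_seconds(even))    # 25.0
print(scan_seconds(skewed))  # 70.0
```

Both lists hold the same number of rows, yet the skewed layout takes almost three times as long in this model because one AMP does most of the work.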
The BYNET
The PE passes the plan along to the AMPs over the BYNET;
The AMPs follow the plan and retrieve the data requested.
The AMPs pass the data to the PE over the BYNET; and
[Figure: A Node containing Intel Processors, Memory, AMPs, and PEs, attached to Virtual Disks.]
The following picture shows two nodes connected together over the
BYNETs.
[Figure: Node 1 and Node 2 connected by the BYNET. Each node contains Intel Processors, Memory, AMPs, and PEs, and is attached to its own Virtual Disks.]
Teradata Tables
Customer      City      Order Number (FK)  Customer Rep
JC Penney     Dallas    105372             Dreyer
Office Depot  Columbia  105799             Crocker
Dillards      Atlanta   106227             Smith

Item No  Quantity  CustomerID (FK)
212      20        1001
296      52        1002
325      17        1003
table, you can JOIN the two tables by matching a common key
between the two tables. A great choice is to match the primary key of
one table to the foreign key of the other table. Remember that a table
may have only one PK, but it may have multiple FKs.
Here is a quick reference chart for Primary and Foreign Keys:
PRIMARY KEY                        FOREIGN KEY
Not optional                       Optional
Comprised of one or more columns   Comprised of one or more columns
Can only have one PK per table     Can have multiple FKs per table
No duplicates allowed              Duplicates allowed
No changes allowed                 Changes allowed
No nulls allowed                   Nulls allowed
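The PK-to-FK join described above can be tried out with a short Python sketch. It uses SQLite through Python's standard library rather than Teradata, so treat it as an illustration only. The table layout (Customer and Orders) and the pairing of order numbers with customers are assumptions made for the example; only the sample values are borrowed from the tables shown earlier.

```python
import sqlite3

# In-memory SQLite database standing in for a real warehouse (illustration only).
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE Customer (
    CustomerID   INTEGER PRIMARY KEY,                    -- one PK per table
    CustomerName TEXT NOT NULL
);
CREATE TABLE Orders (
    OrderNumber INTEGER PRIMARY KEY,
    CustomerID  INTEGER REFERENCES Customer (CustomerID), -- FK to Customer
    Quantity    INTEGER
);
INSERT INTO Customer VALUES
    (1001, 'JC Penney'), (1002, 'Office Depot'), (1003, 'Dillards');
INSERT INTO Orders VALUES
    (105372, 1001, 20), (105799, 1002, 52), (106227, 1003, 17);
""")

# JOIN the two tables by matching the primary key to the foreign key.
rows = con.execute("""
    SELECT c.CustomerName, o.OrderNumber, o.Quantity
    FROM Customer AS c
    JOIN Orders AS o ON o.CustomerID = c.CustomerID
    ORDER BY o.OrderNumber
""").fetchall()

for row in rows:
    print(row)
```

Each Orders row finds its customer through the CustomerID value it shares with the Customer table's primary key.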
walk around the glass, the fish tend to swim in schools. Similarly,
Teradata does this with the rows on the AMPs to boost performance.
When you ask for data from any given table, an AMP will immediately
go to that particular group of rows, and then select what you need. It
doesn't need to look through the rows of many tables before it finds
what you need. This is how parallel processing works. The AMPs
retrieve data in parallel, then pass it over the BYNET to the Parsing
Engine (PE), and the PE ensures the data is delivered to the user.
Keep in mind, the BYNET is an internal Teradata network over which
the PEs and the AMPs communicate.
The example below shows the information we have just discussed.
Notice that the system has four AMPs, and three tables: Employee,
Customer, and Order. Notice each AMP holds a portion of the rows
for every table. AMP1, for example, holds 1/4th of the Employee table
rows, 1/4th of the Customer table rows, and 1/4th of the Order table
rows.
Plus, the data is spread evenly for all tables. If a query asks for all
rows in the Customer table, then each AMP will retrieve its
Customer table rows in parallel with the other AMPs. Each AMP will
then pass its data to the PE via the BYNET. Because the data in the
Customer table is spread evenly among all AMPs, each should finish
reading at about the same time.
Also, notice how each AMP separates each table. Just like schools of
fish, the rows of the Employee table are grouped together. Likewise,
the Customer rows are grouped together, as are the Order rows. This is
important in a data warehouse environment because most queries
read millions of rows to satisfy a single request. Performance is
enhanced when table rows are grouped together and Teradata is
permitted to bring blocks of rows into memory.
[Figure: Four AMPs, each holding its share of the Employee, Customer, and Order table rows.]
Primary Indexes
Well, I'm glad
Hash Map
[Figure: Simulated hash map for a four-AMP system. The entries hold the AMP numbers 1, 2, 3, and 4, repeating.]
The next diagram shows the hash map for an eight-AMP system. As
before, this is for simulation purposes. Notice that the AMP number
for this hash map goes 1, 2, 3, 4, 5, 6, 7, 8, and then starts over
again. Why? Because this hash map is for an eight-AMP system.
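The repeating pattern can be sketched in a few lines of Python. This is a simulation in the same spirit as the diagrams, not Teradata's actual hash map: the 48-entry size is simply the number of entries drawn in the simulated maps, and `build_hash_map` is a hypothetical helper name.

```python
def build_hash_map(num_amps, num_buckets=48):
    """Assign bucket 0 to AMP 1, bucket 1 to AMP 2, and so on, then start over.

    num_buckets=48 mirrors the size of the simulated diagrams (an assumption).
    """
    return [(bucket % num_amps) + 1 for bucket in range(num_buckets)]


four_amp_map = build_hash_map(4)
eight_amp_map = build_hash_map(8)

print(four_amp_map[:8])    # [1, 2, 3, 4, 1, 2, 3, 4]
print(eight_amp_map[:10])  # [1, 2, 3, 4, 5, 6, 7, 8, 1, 2]
```

Because the AMP numbers cycle, every AMP ends up owning the same number of entries in the map, which is what lets rows spread evenly.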
[Figure: Simulated hash map for an eight-AMP system. The entries hold the AMP numbers 1 through 8, repeating.]
Best_Friends Table
Friend_Num  Friend_Name
2           Ben Hon
4           Joe Davis
6           Mary Gray
8           John Davis
10          Don Roy
12          Sam Mills
14          Kyle Marx
16          Lyn Jones
For this example, Teradata will attempt to spread the table rows
among the four-AMP system. A picture of the four-AMP configuration
follows:
[Figure: The four-AMP configuration: a PE connected over the BYNET NETWORK to four AMPs, each with room for BEST FRIENDS rows, shown above the four-AMP hash map.]
[Figure: A Best_Friends row is hashed and its AMP is looked up in the four-AMP hash map.]
[Figure: The row with Friend_Num 16 and Friend_Name Lyn Jones is hashed. The marked entry in the four-AMP hash map points to AMP 4.]
If we continue the process until all data is laid out, the system would
look like this:
[Figure: The four AMPs after all Best_Friends rows are laid out: 2 Ben Hon and 10 Don Roy; 4 Joe Davis and 12 Sam Mills; 6 Mary Gray and 14 Kyle Marx; 8 John Davis and 16 Lyn Jones. The Best_Friends table and the HASH MAP are shown alongside.]
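The whole layout process can be replayed in Python. This is a minimal sketch under loud assumptions: `toy_hash` is a made-up stand-in chosen only so that the result matches the picture, and the 48-entry round-robin hash map is the simulated one, not the real Teradata hash formula or hash map.

```python
from collections import defaultdict

NUM_AMPS = 4
NUM_BUCKETS = 48  # size of the simulated hash map (an assumption)

# Simulated hash map: bucket 0 -> AMP 1, bucket 1 -> AMP 2, ..., then start over.
HASH_MAP = [(bucket % NUM_AMPS) + 1 for bucket in range(NUM_BUCKETS)]

BEST_FRIENDS = [
    (2, "Ben Hon"), (4, "Joe Davis"), (6, "Mary Gray"), (8, "John Davis"),
    (10, "Don Roy"), (12, "Sam Mills"), (14, "Kyle Marx"), (16, "Lyn Jones"),
]


def toy_hash(friend_num):
    """Hypothetical stand-in for the hash formula: returns a bucket number."""
    return (friend_num // 2 - 1) % NUM_BUCKETS


def distribute(rows):
    """Hash each row's Friend_Num, look up the AMP, and store the row there."""
    amps = defaultdict(list)
    for friend_num, friend_name in rows:
        amp = HASH_MAP[toy_hash(friend_num)]
        amps[amp].append((friend_num, friend_name))
    return dict(amps)


layout = distribute(BEST_FRIENDS)
for amp in sorted(layout):
    print("AMP", amp, layout[amp])
```

With this toy hash, every AMP receives exactly two of the eight rows, and Friend_Num 16 (Lyn Jones) lands on AMP 4.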
[Figure: The four AMPs holding the distributed BEST FRIENDS rows, shown with the HASH MAP.]
After we complete
Tom told me that he wrestled his way through high school and college.
I said, "Really? I didn't think the classes were that difficult myself!"
Actually, Tom earned a wrestling scholarship to college and achieved
the All-American level. His wrestling coach drilled into the wrestlers'
minds that the size of the opponent is not to be feared, but the size of
their will. The truth is that most databases do not have the FIGHT in
them to handle a Full Table Scan. That's why so many students are
surprised at Teradata's ability to actually handle Full Table Scans.
A Full Table Scan (FTS) is a query that reads every row of a table. The
table may be small or have billions of rows. With Teradata, a Full
Table Scan means every AMP reads only the rows it owns, in
parallel with all other AMPs in the system. On systems with hundreds or
thousands of AMPs, this can speed up a Full Table Scan by a
comparable factor.
For example, imagine a table that has 100 rows in a system that has
10 AMPs. Each AMP owns 10 rows. On a Full Table Scan, each AMP
reads its 10 rows. Next, each AMP passes the information over the
BYNET to the PE. Because the 10 AMPs work at the same time, the scan
takes roughly one-tenth as long as it would if a single processor read
all 100 rows.
But what happens with systems that have hundreds, or even
thousands of AMPs? Well, one major telecommunications company
copied a 3.5 billion-row table in just 18 minutes. The 1,900 AMPs in
its system helped return results very rapidly. Talk about efficiency!
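The 100-row, 10-AMP example can be acted out in Python. This is only a sketch: worker threads stand in for AMPs, the rows are just the numbers 1 through 100, and the round-robin placement is an assumption made to give each AMP exactly 10 rows. Python threads show the division of labor, not Teradata's real speed.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_AMPS = 10
TABLE = list(range(1, 101))  # 100 rows; each number stands for one row

# Assumed round-robin placement: every AMP owns exactly 10 rows.
amp_rows = {amp: TABLE[amp::NUM_AMPS] for amp in range(NUM_AMPS)}


def amp_scan(amp):
    """One AMP reads only the rows it owns."""
    return list(amp_rows[amp])


# All AMPs scan at the same time, then hand their rows back (the PE's role).
with ThreadPoolExecutor(max_workers=NUM_AMPS) as pool:
    partial_results = list(pool.map(amp_scan, range(NUM_AMPS)))

answer_set = [row for part in partial_results for row in part]

print([len(part) for part in partial_results])  # ten 10s
print(len(answer_set))                          # 100
```

No AMP ever looks at more than its own 10 rows, yet together they return all 100.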
Most Full Table Scans bring traditional databases to their knees, but
Teradata was born to be parallel. Teradata was specifically designed for data
warehousing. When you ask decision support questions like, "Who are
my best and worst customers?" then you are asking the system to
read through an entire table. Full Table Scans are a fundamental and
important part of data warehousing. They allow users to literally ask
any question, about any data, at any time. Teradata has the
experience, power, and architecture to allow Full Table Scans.
An example of a query asking for a Full Table Scan is:
SELECT Friend_Num, Friend_Name
FROM Best_Friends;
[Figure: The four AMPs holding the BEST FRIENDS rows: 2 Ben Hon and 10 Don Roy; 4 Joe Davis and 12 Sam Mills; 6 Mary Gray and 14 Kyle Marx; 8 John Davis and 16 Lyn Jones.]
In this example, the Parsing Engine receives the SQL and checks the
syntax and security. If the user passes these tests, the query
continues. The PE knows this query asks to return all records. This is
a Full Table Scan. Therefore, it passes the AMPs a plan that says,
"Retrieve all of your Best_Friends table rows, and then pass
them to me (PE) over the BYNET." With that in mind:
Let's run through the SQL again and see the result:
SELECT Friend_Num, Friend_Name
FROM Best_Friends;
8 rows returned
Friend_Num  Friend_Name
6           Mary Gray
14          Kyle Marx
8           John Davis
16          Lyn Jones
2           Ben Hon
10          Don Roy
4           Joe Davis
12          Sam Mills
Secondary Indexes
the row in the base table. Teradata brilliantly uses the hash formula
and the hash map to build its secondary index sub-tables.
There are three values stored in every secondary index sub-table row:
Secondary Index data value
Secondary Index Row-ID (This is the hashed version of the value)
Primary Index Row-ID (This locates the AMP and the base row)
When a secondary index is created, the Teradata PE tells each AMP to
hash the secondary index column value for each of its rows. It also
directs that the hash be placed in a secondary index sub-table along with the
Row-ID that points to the base row where the desired value resides.
Let's create a secondary index on our Best_Friends table. The syntax
to create a unique secondary index on the column Friend_Name in the table
called Best_Friends is:
CREATE UNIQUE INDEX (Friend_Name) ON Best_Friends;
[Figure: The four AMPs, each holding BEST FRIENDS base rows and Secondary Index Sub-table rows. Base rows: 2 Ben Hon and 10 Don Roy; 4 Joe Davis and 12 Sam Mills; 6 Mary Gray and 14 Kyle Marx; 8 John Davis and 16 Lyn Jones. Sub-table rows: Lyn Jones and Kyle Marx; Ben Hon and Joe Davis; John Davis and Mary Gray; Don Roy and Sam Mills.]
face. Here is how the design works for retrieval. Let's look at how the
following query plays out:
SELECT Friend_Num, Friend_Name
FROM Best_Friends
WHERE Friend_Name = 'Ben Hon';
The Teradata Parsing Engine takes the SQL and checks the syntax and
security access rights. If all is well, the PE notices that the WHERE
clause of the query is asking WHERE Friend_Name = 'Ben Hon'.
The PE recognizes that Friend_Name is a Unique Secondary Index.
The PE will hash 'Ben Hon', and then use the hash map to find the
AMP that holds 'Ben Hon' in its secondary index sub-table. As you
can see, the AMP involved is number two (notice the smiley face on
AMP 2). The PE instructs AMP 2 to retrieve the 'Ben Hon' Secondary
Index Sub-table row. Once complete, Teradata can see the real row-id and
find the base row. In our example, once the 'Ben Hon' Secondary
Index Sub-table row is found, the row-id (smiley face in this
example) is revealed, and the PE can find the matching smiley face in
the base table.
This approach allows all Unique Secondary Index (USI) requests in the
WHERE clause of SQL to become two-AMP operations.
A Non-Unique Secondary Index (NUSI) used in the WHERE clause still
requires all AMPs, but the AMPs
can easily check the secondary index sub-table to see if they have one
or more qualifying rows.
Create secondary indexes only on columns used repeatedly in the
WHERE clause of on-going queries. Secondary indexes take up space
and add overhead, but boy can they speed up queries.
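The two-AMP lookup can be sketched in Python. Everything here is a simplified stand-in: `amp_for` replaces the real hash formula and hash map with a CRC32 checksum (an assumption made purely so the example runs), and the dictionaries play the role of base tables and secondary index sub-tables.

```python
import zlib

NUM_AMPS = 4

BEST_FRIENDS = [
    (2, "Ben Hon"), (4, "Joe Davis"), (6, "Mary Gray"), (8, "John Davis"),
    (10, "Don Roy"), (12, "Sam Mills"), (14, "Kyle Marx"), (16, "Lyn Jones"),
]


def amp_for(value):
    """Hypothetical stand-in for 'hash the value, then consult the hash map'."""
    return zlib.crc32(str(value).encode()) % NUM_AMPS + 1


# base[amp][friend_num] = friend_name           (base table rows on each AMP)
# usi[amp][friend_name] = (base_amp, friend_num) (sub-table rows on each AMP)
base = {amp: {} for amp in range(1, NUM_AMPS + 1)}
usi = {amp: {} for amp in range(1, NUM_AMPS + 1)}

for friend_num, friend_name in BEST_FRIENDS:
    base_amp = amp_for(friend_num)   # primary index decides the base row's AMP
    base[base_amp][friend_num] = friend_name
    usi[amp_for(friend_name)][friend_name] = (base_amp, friend_num)


def lookup_by_name(friend_name):
    """Return the base row and the AMPs touched (index AMP, then base AMP)."""
    index_amp = amp_for(friend_name)                    # first AMP: sub-table row
    base_amp, friend_num = usi[index_amp][friend_name]
    row = (friend_num, base[base_amp][friend_num])      # second AMP: base row
    return row, [index_amp, base_amp]


row, amps_touched = lookup_by_name("Ben Hon")
print(row)                # (2, 'Ben Hon')
print(len(amps_touched))  # 2
```

However many AMPs the system has, the lookup visits one AMP for the sub-table row and one for the base row, never all of them.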
Join Indexes
[Figure: Table A and Table B.]
[Figure: A 100 Gigabyte Data Warehouse owned by DBC, which holds the Data Dictionary Directory (DD).]
[Figure: The 100 Gigabyte Data Warehouse split between DBC (20%, Data Dictionary Directory) and SYSDBA (80%).]
[Figure: DBC (20%, Data Dictionary Directory), SYSDBA, MRKT, Sales, Morgan, and Tom.]
[Figure: The hierarchy of DBC, SYSDBA, MRKT, Sales, Morgan, and Tom.]
Perm Space,
Spool Space, and
Temp Space
Perm space defines the upper limit of space that a database or user
can use to hold tables, secondary index sub-tables, and permanent
journals (See protection features).
Spool space defines the upper limit of space that a user has to run a
query. When a user runs a query, AMPs build the answer set in spool
space. Once the query is done, the spool space is released. If the
query exceeds the spool space's upper limit, the query aborts because
the user is out of spool space.
Temp space defines the upper limit that a user or database can have
to hold Global Temporary tables. These tables will be
discussed in another chapter.
The SYSDBA knows that tenaciously holding onto its space will not
provide any value to your company. A bank that holds onto all of its
capital will not be successful, or will it? If it's destined for success, it
will lend out its capital in the form of credit lines or mortgages. These
actions will provide the bank with a healthy profit. The SYSDBA
likewise gladly gives up space to each new user or database in an
effort to make the Teradata system profitable.
SYSDBA gives out two kinds of space: Perm space and Spool
space. When you receive a credit card from the bank, you are given
an upper limit to your line of credit. In order to spend more than that
limit, you must get approval from the bank. In the same way, the
SYSDBA gives a new user an upper limit of space to use. When that
amount is used up, the user must request an increase. Another way to
free up some space is to drop some tables from the database.
Perm space is actually used to store real data such as tables,
secondary index sub-tables, and permanent journals. If you give some of
your perm space to a child object,
then you must subtract that same amount from the total perm space
you own.
Spool space is the area where AMPs temporarily place the answer to
a query. Once the answer is delivered to the person making the query,
the AMPs release that spool space to be used for another query!
Unlike perm space, spool space is not lost if it is given away. You can
actually give users below you as much spool as you would like, yet still
have the original amount. Spool is like a speed limit on the highway.
If your own speed limit is 65 mph, you can still allow every other
driver to drive up to 65 mph. Some users may not receive perm space
if their job is just to run queries -- not create tables. These users will
just receive spool.
The following picture shows a logical view of a Customer table. Note:
the table is stored in PERM space. When a user submits a query
against this table, the answer is stored temporarily in SPOOL. When
the query is completed, the answer is delivered to the user, and then
the SPOOL is released.
The next picture shows a logical Teradata system. In the PERM area
there is a table called Employee. This table has five columns: Emp,
Dept, Lname, Fname, and Sal. The table has four employees. Notice
the SQL statement at the bottom of the picture is asking to see all
columns where the employee's department is equal to 10. To
complete the query, the AMPs will read the rows of the table and each
time they find a row where Dept is equal to 10, a row is added to
spool. When the answer is returned, the spool is released.
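The difference between giving away perm space and giving away spool space can be captured in a small Python model. It is a sketch, not a Teradata feature: the `SpaceOwner` class, its method names, and the gigabyte figures for MRKT are all hypothetical. It also assumes, following the speed-limit analogy, that a child's spool limit may not exceed its owner's.

```python
class SpaceOwner:
    """Toy model of a user or database and its space limits (in gigabytes)."""

    def __init__(self, name, perm_gb, spool_gb):
        self.name = name
        self.perm_gb = perm_gb
        self.spool_gb = spool_gb

    def create_child(self, name, perm_gb, spool_gb):
        if perm_gb > self.perm_gb:
            raise ValueError(f"{self.name} does not own {perm_gb} GB of perm")
        if spool_gb > self.spool_gb:  # assumption drawn from the speed-limit analogy
            raise ValueError(f"{self.name} cannot grant more spool than its own limit")
        self.perm_gb -= perm_gb       # perm given away is subtracted from the owner
        return SpaceOwner(name, perm_gb, spool_gb)  # spool is a limit, so nothing is subtracted

    def run_query(self, spool_needed_gb):
        if spool_needed_gb > self.spool_gb:
            raise RuntimeError(f"{self.name}: query aborted, out of spool space")
        return "answer delivered, spool released"


sysdba = SpaceOwner("SYSDBA", perm_gb=80, spool_gb=80)
mrkt = sysdba.create_child("MRKT", perm_gb=20, spool_gb=80)

print(sysdba.perm_gb, sysdba.spool_gb)  # 60 80
print(mrkt.perm_gb, mrkt.spool_gb)      # 20 80
print(mrkt.run_query(spool_needed_gb=5))
```

After the grant, SYSDBA owns 20 GB less perm space but its spool limit is unchanged, which is the credit-line versus speed-limit distinction in miniature.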
[Figure: The Employee table in PERM SPACE, with the answer set built in SPOOL SPACE.]
Emp  Dept  Lname   Fname  Sal
1    10    Jones   Dave   45000.00
2    20    Smith   Mary   50000.00
3    30    Chang   Vu     65000.00
4    10    Wilson  Sue    44000.00
SELECT *
FROM Employee
WHERE DEPT = 10;
What is a View?
At Christmas time no one cares about the past or the future. All that
matters is the present! One year, my wife and I were in New York City
during the holiday season. We had always heard about how wonderful
the window displays are in the large department stores. As we
window-shopped, we got lots of ideas for gifts. We could see products
displayed in the windows, but we could not actually touch them. We
only had a pleasant view. Display windows are designed to show
shoppers what store management wants you to see. In Teradata, a
view is like a department store window because you can see selected
portions of a table, yet you aren't able to see sensitive data. Instead,
you can view data within your access rights, and you determine what
portions of the data you want others to see.
Views are real sticklers for protecting sensitive data from inquiring
eyes. For example, the Human Resources database might contain an
employee table. Management can create a view of the table that hides
the salary column, yet still allows an administrative associate to view
names, phone numbers and department numbers of employees. In
this scenario, the salary column is not shown. As a result, views are
the best choice for protecting sensitive data.
Another benefit of views is that their definitions are stored in the Data
Dictionary. When you select from a view of one or more tables, the
data is not stored on the disks a second time, so a view does not
duplicate data or take up more space. In this scenario, you are
looking at a filtered picture of the data.
Employee table:

Dept  Lname     Fname   Sal
10    Johnson   Manny   100000
20    Carlsbad  Jan     100000
30    Winter    Steve    77000
10    Lester    Bonnie   56000
10    Samuels   Todd    120000
20    Walter    Misha   104000

View of the Employee table (the Sal column is hidden):

Dept  Lname     Fname
10    Johnson   Manny
20    Carlsbad  Jan
30    Winter    Steve
10    Lester    Bonnie
10    Samuels   Todd
20    Walter    Misha
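
The same idea can be tried out with Python's built-in sqlite3 module. SQLite is only a stand-in here, not Teradata, and its view syntax is not guaranteed to match Teradata's. The view name Employ_v is borrowed from the macro example in this chapter, and the rows are the ones in the pictured Employee table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (Dept INTEGER, Lname TEXT, Fname TEXT, Sal INTEGER)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (10, "Johnson", "Manny", 100000),
        (20, "Carlsbad", "Jan", 100000),
        (30, "Winter", "Steve", 77000),
        (10, "Lester", "Bonnie", 56000),
        (10, "Samuels", "Todd", 120000),
        (20, "Walter", "Misha", 104000),
    ],
)

# The view stores only a definition; no rows are copied and Sal is left out.
conn.execute("CREATE VIEW Employ_v AS SELECT Dept, Lname, Fname FROM Employee")

cursor = conn.execute("SELECT * FROM Employ_v WHERE Dept = 10 ORDER BY Lname")
view_columns = [d[0] for d in cursor.description]
view_rows = cursor.fetchall()

print(view_columns)
print(view_rows)
```

Anyone who is given access only to the view can see names and departments, but the salary column never appears in the result.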
What is a Macro?
A macro stores one or more SQL statements under a single name. For
example, a macro can easily be created to run all three of the
commands shown here. The syntax would be:
CREATE MACRO Emp_mac AS
(
  SELECT * FROM Employ_v WHERE dept = 10;
  SELECT * FROM Employ_v WHERE dept = 20;
  SELECT * FROM Employ_v ORDER BY lname;
);
Once the macro has been created and stored in the Data Dictionary,
it's time for a test run. To run this macro, the user merely executes
the SQL:
EXECUTE Emp_mac;
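
For readers who want to experiment without a Teradata system, the behavior of a macro can be imitated in Python. This is only an imitation: SQLite has no macros, so a plain dictionary stands in for the Data Dictionary, and `create_macro` and `execute_macro` are hypothetical helper names. The three sample rows are taken from the Employee picture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (Dept INTEGER, Lname TEXT, Fname TEXT, Sal INTEGER);
    INSERT INTO Employee VALUES (10, 'Johnson', 'Manny', 100000);
    INSERT INTO Employee VALUES (20, 'Carlsbad', 'Jan', 100000);
    INSERT INTO Employee VALUES (30, 'Winter', 'Steve', 77000);
    CREATE VIEW Employ_v AS SELECT Dept, Lname, Fname FROM Employee;
""")

# Stand-in for the Data Dictionary: macro name -> stored SQL statements.
macros = {}


def create_macro(name, statements):
    """Store the statements under one name; nothing is run yet."""
    macros[name] = list(statements)


def execute_macro(name):
    """Run every stored statement in order; return one result set per statement."""
    return [conn.execute(sql).fetchall() for sql in macros[name]]


create_macro("Emp_mac", [
    "SELECT * FROM Employ_v WHERE Dept = 10",
    "SELECT * FROM Employ_v WHERE Dept = 20",
    "SELECT * FROM Employ_v ORDER BY Lname",
])

results = execute_macro("Emp_mac")
for result_set in results:
    print(result_set)
```

One call runs all three statements and hands back three result sets, which is the convenience a macro offers.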
Here is a handy reference chart that compares views with macros:
Views | Macros
We execute macros.
Uses the keyword AS
Definition is stored in the Data Dictionary
Accesses the real data itself
Is changed using the keyword REPLACE
[Figure: hierarchy chart showing DBC, SYSDBA, MRKT, Mary, Sales, Morgan, and Tom]
In the picture above, the DBC has Implicit rights on all databases and
users. Plus, SYSDBA has Implicit rights on every person listed below
him. MRKT has explicit rights over Mary, and Morgan has the same
rights over Tom. Implicit rights simply means it is implied that those
people listed above you (in a hierarchy chart) can GRANT or REVOKE
privileges on you.
For example, if Tom or Morgan decides to give certain privileges to
Mary, either person could EXPLICITLY give her those permissions.
In comparison, Automatic Rights means that when Morgan created Tom,
Morgan automatically received 20 access rights (on Tom), plus Tom was
given 16 access rights on himself.
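
Implicit rights follow the hierarchy, so they can be worked out by walking up the chart. The sketch below is a guess at the chart's shape, not a copy of it: the parent links in `OWNER` are assumptions based on the names in the picture and the sentences above, and both function names are hypothetical.

```python
# Assumed hierarchy: each user or database maps to its immediate owner.
# The picture names these seven objects; the exact links are a guess.
OWNER = {
    "SYSDBA": "DBC",
    "MRKT": "SYSDBA",
    "Sales": "SYSDBA",
    "Mary": "MRKT",
    "Morgan": "Sales",
    "Tom": "Morgan",
}


def implicit_owners(name):
    """Everyone above `name` in the hierarchy, nearest owner first."""
    owners = []
    while name in OWNER:
        name = OWNER[name]
        owners.append(name)
    return owners


def has_implicit_rights(owner, target):
    """True if `owner` sits somewhere above `target` in the chart."""
    return owner in implicit_owners(target)


print(implicit_owners("Tom"))
print(has_implicit_rights("DBC", "Mary"))
print(has_implicit_rights("Morgan", "Mary"))
```

Under these assumed links, Morgan is not above Mary in the chart, which is why any privilege Morgan passes to Mary would have to be granted explicitly.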
Data Protection
As a man was driving down the interstate highway, his cell phone
rang. When he answered he heard his wife warn him urgently,
"George, I just heard on the news that there's a car going the wrong
way on I-26!" George replied, "I'm on I-26 right now and it's not just
one car. It's hundreds of them!"
How do you protect your data when things go the wrong way?
Murphy's law states, "The more mission-critical a data warehouse,
the more likely the system will crash at the most critical moment of
the mission." Ironically, most DBAs think Murphy was an optimist.
Transaction Concept
Transient Journal
FALLBACK
RAID
Clustering
Cliques
Permanent Journaling
FALLBACK Protection
I asked my dentist if I had to floss all my teeth, and he responded,
"No, just the ones you want to keep."
Best_Friends table:

AMP 1        AMP 2        AMP 3        AMP 4
1 Ben Hon    2 Joe Davis  3 Mary Gray  4 John Davis
5 Don Roy    6 Sam Mills  7 Kyle Marx  8 Lyn Jones
In the picture below, you can see the Best_Friends table and the
FALLBACK protected rows.
Best_Friends table with FALLBACK:

               AMP1          AMP2          AMP3          AMP4
Base rows      2 Ben Hon     4 Joe Davis   6 Mary Gray   8 John Davis
               10 Don Roy    12 Sam Mills  14 Kyle Marx  16 Lyn Jones
FALLBACK rows  16 Lyn Jones  2 Ben Hon     8 John Davis  10 Don Roy
               14 Kyle Marx  6 Mary Gray   4 Joe Davis   12 Sam Mills
If we can lose any one AMP/disk, what happens if we lose two? Losing
two AMPs in a four-AMP system is rare; however, some systems have
nearly 2,000 AMPs. The chance of losing two AMPs in a 2,000-AMP
system is much greater than in a four-AMP system. That's why
Teradata designed Clustering. Let's look at this next example with a
slightly larger system:
CLUSTER 1
               AMP1          AMP2          AMP3          AMP4
Base row       2 Ben Hon     4 Joe Davis   6 Mary Gray   8 John Davis
FALLBACK row   Mary Gray     Ben Hon       John Davis    Joe Davis

CLUSTER 2
               AMP5          AMP6          AMP7          AMP8
Base row       10 Don Roy    12 Sam Mills  14 Kyle Marx  16 Lyn Jones
FALLBACK row   16 Lyn Jones  10 Don Roy    12 Sam Mills  14 Kyle Marx
The brilliance behind this protection is the Hash Map. There is a Base
Row Hash Map used to distribute the base rows. It's called the
Primary Hash Map. There is also the Fallback Hash Map that knows
exactly how AMPs are clustered and which AMP should host a
FALLBACK row.
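
The two hash maps can be pictured as two small functions. The Python below is a sketch under stated assumptions and not Teradata's algorithm: `zlib.crc32` stands in for the real hashing, AMPs are numbered from 0, and the system is assumed to have eight AMPs in two clusters of four. The one property it does demonstrate is that a FALLBACK copy always lands on a different AMP inside the same cluster.

```python
import zlib

CLUSTER_SIZE = 4  # assumption: clusters of four AMPs
NUM_AMPS = 8      # assumption: two clusters, AMPs numbered 0 to 7


def primary_amp(key):
    """Pick the AMP for the base row (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode()) % NUM_AMPS


def fallback_amp(key):
    """Pick a different AMP in the same cluster for the FALLBACK copy."""
    base = primary_amp(key)
    cluster_start = base - base % CLUSTER_SIZE
    # The step is between 1 and CLUSTER_SIZE - 1, never 0, so the copy
    # can never land on the AMP that already holds the base row.
    step = 1 + (zlib.crc32(str(key).encode()) // NUM_AMPS) % (CLUSTER_SIZE - 1)
    return cluster_start + (base - cluster_start + step) % CLUSTER_SIZE


placement = {key: (primary_amp(key), fallback_amp(key)) for key in range(2, 18, 2)}
for key, (base, copy) in placement.items():
    print(f"row {key}: base on AMP {base}, FALLBACK on AMP {copy}")
```

Because both copies stay inside one cluster, losing a single AMP never makes a row unreachable.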
In most systems, AMPs are clustered in groups of four. The next
most popular clustering scheme is a group of three. The minimum
number of AMPs per cluster is two, and the maximum number of AMPs
per cluster is 16. Let's look at the two extremes (two versus 16).
The advantage of clustering in groups of two is that both AMPs in
the same cluster would have to fail before the system stopped. The
disadvantage is that if one AMP fails, the other must do its own work
plus the work of the down AMP. With clustering in a group of two,
every complex query can take up to twice as long to process while
one AMP is down.
The advantage to clustering in groups of 16 is that if one AMP fails,
there are 15 other AMPs doing their work and sharing in the work of
the failed AMP. The disadvantage to this type of clustering is there is
an increased risk of losing two AMPs in the cluster.
This is the reason four-AMP cluster configurations are so popular.
The chances of losing two AMPs out of four are quite low. However, if
one AMP is lost, the other three will share in the extra work.
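
The trade-off is simple arithmetic, shown here as a short Python sketch. It assumes the down AMP's work is shared evenly by the surviving AMPs in its cluster. Both function names are made up for the example.

```python
def load_after_one_failure(cluster_size):
    """Relative workload of each surviving AMP when one AMP in the cluster is down.

    Assumes the down AMP's work is shared evenly by the survivors.
    """
    if cluster_size < 2:
        raise ValueError("a cluster needs at least two AMPs")
    return cluster_size / (cluster_size - 1)


def amps_at_risk(cluster_size):
    """How many remaining AMPs could fail next and take the cluster down."""
    if cluster_size < 2:
        raise ValueError("a cluster needs at least two AMPs")
    return cluster_size - 1


for size in (2, 4, 16):
    print(size, round(load_after_one_failure(size), 3), amps_at_risk(size))
```

A cluster of two doubles the load on the survivor but leaves only one AMP whose loss would stop the system. A cluster of 16 barely raises the load but leaves 15 such AMPs. Four sits between the two.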
FALLBACK is an optional means of protection specified at the
database or table level. It may be requested when the table is first
created, or you may add or drop FALLBACK at any time by using the
ALTER TABLE command. (For more information, refer to Teradata
SQL Unleash the Power by Mike Larkins and Tom Coffing).
Let's review FALLBACK and clarify related issues: When a new row is
inserted into a table, FALLBACK always places a second copy of that
row on another AMP in the same group, or cluster. Keep in mind that
a cluster usually consists of four AMPs. From that point on, any
manipulation of the data in the primary row also happens to the
FALLBACK row. FALLBACK rows are distributed evenly across all the
AMPs within the same cluster. If one AMP fails, processing continues,
along with all subsequent changes to that AMP's rows.
FALLBACK provides an optional insurance policy for a failed AMP;
however, there is a cost for that insurance. FALLBACK requires twice
as much disk space to store both the primary and duplicate rows of
a table. Another cost that should not be overlooked is that twice the
I/O (Input/Output) applies to inserts, updates and deletes because
there are always two copies to write. However, because Teradata AMPs
operate in parallel, both rows are placed on their respective AMPs at
nearly the same time.
Although FALLBACK may be created on any, all or no tables, its extra
cost causes most companies to use it only for mission critical tables.
As you might suspect, the Data Dictionary is automatically FALLBACK
protected. FALLBACK may not protect your system from all failures,
but it certainly is an excellent fault tolerant solution.
[Figure: the same two clusters as before. In CLUSTER 1, AMP1 is down and the DARJ is open on AMP2, AMP3, and AMP4. In CLUSTER 2, AMP5 through AMP8 are unchanged.]
In the previous picture there are two clusters, but notice that AMP one
has failed. After failure, the other AMPs in the top cluster open the
Down AMP Recovery Journal (DARJ). Also, none of the AMPs in the
bottom cluster have the DARJ open. Why? Simply, because the
FALLBACK rows for the down AMP are housed within the cluster. If
anything happens while the AMP is sleeping, it has three extremely
cute ticket takers that will store all information pertaining to the down
AMP.
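
The ticket takers can be modeled in a few lines of Python. This is a loose sketch of the idea, not the real journal format: the `Cluster` class and its method names are hypothetical, and a single shared list stands in for the DARJ that the surviving AMPs keep.

```python
class Cluster:
    """Toy cluster: each AMP holds rows; one list models the DARJ."""

    def __init__(self, amp_ids):
        self.rows = {amp: {} for amp in amp_ids}  # amp -> {row_id: value}
        self.down = set()
        self.darj = []                            # (amp, row_id, value) entries

    def fail(self, amp):
        self.down.add(amp)

    def write(self, amp, row_id, value):
        """Apply a change to a row owned by `amp`."""
        if amp in self.down:
            # The owner is down, so the survivors journal the change.
            self.darj.append((amp, row_id, value))
        else:
            self.rows[amp][row_id] = value

    def recover(self, amp):
        """Bring the AMP back and replay everything it missed."""
        self.down.discard(amp)
        for owner, row_id, value in self.darj:
            if owner == amp:
                self.rows[owner][row_id] = value
        self.darj = [entry for entry in self.darj if entry[0] != amp]


cluster1 = Cluster([1, 2, 3, 4])
cluster1.write(1, 2, "Ben Hon")
cluster1.fail(1)
cluster1.write(1, 2, "Ben Hone")       # arrives while AMP 1 is down
journal_size_while_down = len(cluster1.darj)
cluster1.recover(1)
print(cluster1.rows[1], cluster1.darj)
```

When the AMP comes back it catches up from the journal, and the journal entries for it are cleared.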
[Figure: one AMP and its virtual disk. The rows 2 Ben Hon and 10 Don Roy appear on a data disk and again on its mirror.]
In the picture above, one AMP has one Virtual Disk, but it also has four
physical disks. Plus, each disk has a mirror in case of the loss of a
disk. The four disks together form a Rank of Disks. Two disks in a
rank may be lost as long as they are not a data disk and its own
mirror. In this example, the data from the Best_Friends table is
displayed. It is on the first disk, and there is a mirrored copy of the
information on the second disk. If a disk goes down, the system does
not even flinch. It sends the operations personnel a message about
the failure, and keeps on running.
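
The rule about which disks can be lost is easy to check in code. The Python sketch below assumes the rank of four disks is made up of two mirrored pairs. The disk names and the `data_survives` function are invented for the example.

```python
# Assumption: a rank of four disks is two data disks, each with one mirror.
MIRROR_PAIRS = [("data1", "mirror1"), ("data2", "mirror2")]


def data_survives(failed_disks):
    """True if every mirrored pair still has at least one working disk."""
    failed = set(failed_disks)
    return all(
        not (data in failed and mirror in failed)
        for data, mirror in MIRROR_PAIRS
    )


print(data_survives(["data1"]))               # one disk lost
print(data_survives(["data1", "mirror2"]))    # two disks from different pairs
print(data_survives(["data1", "mirror1"]))    # a data disk and its own mirror
```

Only the last case loses data, because both copies of the same information are gone.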
Cliques
In high school you can walk into the cafeteria and immediately
identify the cliques (pronounced clicks). In other words, they are
groups of students that hang around together because they have
formed a common identity and a common bond. The cliques in
Teradata are similar to, yet different from high school cliques.
CLIQUES (pronounced cleeks) in Teradata are a method of system
protection against the failure of an entire node. Multiple processing
nodes (SMPs) are not only connected with an unbroken line to their
own disks, but are also connected with a dotted line to each other's
disks. This shared disk arrangement forms a CLIQUE. If a node fails,
then its virtual processors (AMPs and PEs) migrate to other nodes in
its CLIQUE like birds flying south in winter. The receiving node now
has twice as many VPROCs, so its performance slows down. The
important factor is that the migrated VPROCs can still access their
own disks, and business continues until the failed node is repaired or
replaced.
[Figure: Node 1 (AMPs 1 to 16) and Node 2 (AMPs 17 to 32), each with Intel processors and memory. AMP 16 and AMP 17 are each shown with their own virtual disk.]
Let's focus on AMP 16 in node one and AMP 17 in node two (look at the
arrows). AMP 16 has its own virtual disk and similarly, AMP 17 has
its own virtual disk. Remember, no other AMP is allowed in another
AMP's virtual disk.
What if an entire node is lost? Well, then AMPs 1-16 cannot access any
disks. To prevent this, let's create a clique in our next picture. The
idea of a clique is to connect both nodes to one another's disks. That
way, if either node goes down, the AMPs can migrate over the BYNET
and join the other 16 AMPs in memory. However, each AMP will still
have a connection to its original virtual disk.
[Figure: the same two nodes, now joined by clique cables so that each node can reach the other's disks. This is a clique.]
In the illustration above, cables have been added. If node one or node
two goes down, the AMPs can migrate to the other node and still have
access to their own disks. The only difference is that the migrating
AMPs now reside in memory on a different node, plus they are
accessing their own virtual disk via a different physical cable.
People who come from the colder climates to spend their winters in
sunny Florida are often called snowbirds. Do you know what bird
migrates farther than any other bird on the planet? It is the Arctic
tern. This bird leaves its Arctic Circle home in August for its winter
vacation home in Antarctica, a round trip of more than 11,000 miles!
In the same way, when a node goes down the software AMPs and PEs
migrate over the BYNET to a temporary home on another node.
[Figure: NODE Crash. Node 1 is down and all 16 of its AMPs migrate to the new node, so Node 2 now runs AMPs 1 to 32. AMP 16 and AMP 17 still have their own virtual disks.]
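
The migration can be sketched in Python. This is a simplified picture and not how the BYNET actually moves processes: the `migrate` function, the round-robin placement, and the `vdisk` names are assumptions made for the example. The node and AMP numbers match the pictures.

```python
def migrate(nodes, failed_node):
    """Move the failed node's AMPs to the surviving nodes of the clique.

    `nodes` maps a node name to the AMP numbers running in its memory.
    A new mapping is returned; the input is left unchanged.
    """
    survivors = [name for name in nodes if name != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node in the clique")
    result = {name: list(nodes[name]) for name in survivors}
    for i, amp in enumerate(nodes[failed_node]):
        # Round-robin is an assumption; with two nodes every AMP goes to one.
        result[survivors[i % len(survivors)]].append(amp)
    return result


clique = {"Node 1": list(range(1, 17)), "Node 2": list(range(17, 33))}
# Each AMP keeps its own virtual disk no matter which node hosts it.
virtual_disk = {amp: f"vdisk{amp}" for amp in range(1, 33)}

after = migrate(clique, "Node 1")
print(len(after["Node 2"]))
print(virtual_disk[16])
```

After the crash, Node 2 hosts all 32 AMPs, and AMP 16 still maps to the same virtual disk it had before.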
Permanent Journal
[Example: shows the use of FALLBACK and the ...]
The example above creates a table called Employee in the TomC
database, and the table is FALLBACK protected. A BEFORE Journal and
a DUAL AFTER Journal are specified. Remember that both FALLBACK and
JOURNALING have defaults of NO, meaning that if you don't specify
this protection at either the table or database level, the default is
NO FALLBACK and NO JOURNALING.
Referential Integrity
Just how important is it to protect the integrity of your data? This story
says it all: After reading an advertisement offering split, dry firewood
for $60 a cord (including delivery), Jeff decided to place a phone order.
Upon delivery, Jeff was upset when the deliveryman finished stacking
the wood. Jeff objected, "That's not a full cord of wood!" "Well,
that's what I call a cord," the man answered firmly. Grudgingly, Jeff
pulled some money out of his pocket and thrust it into the man's
hands. "Hey, just a minute," the man said after counting the money.
"You only gave me $30!" Jeff shrugged his shoulders and replied,
"Well, that's what I call $60."
Imagine getting fired from your job and the company deletes you from
its employee table, but forgets to delete you from the payroll table.
That's not like getting fired; it's more like getting fired up for a
Bahamas vacation. Referential Integrity would have stopped this
oversight. RI, as it is called, would not allow anyone to be deleted
from the employee table unless he or she was also deleted from the
payroll table.
REFERENTIAL INTEGRITY (RI) is the relational concept that mandates
that a row cannot be inserted into a table unless its referencing
column value also exists in another table within the database.
Conversely, a row whose value is still referenced from another table
may not be deleted unless the referencing value is first removed from
that other table.
An important function of RI on a newly created table is that it will not
allow invalid data values to be entered into a column. If RI is added
to an existing table that already has RI violations, the ALTER TABLE
will still proceed. Plus, it will copy and store the table and any
related RI violations for review and correction. The user will then
need to locate the table copy and make corrections to the original
table.
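
The payroll story can be acted out with Python's built-in sqlite3 module. SQLite is only a stand-in for Teradata, and its RI syntax and behavior may differ from Teradata's. The table layouts and the `try_sql` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces RI only when asked
conn.execute("CREATE TABLE Employee (Emp INTEGER PRIMARY KEY, Lname TEXT)")
conn.execute(
    "CREATE TABLE Payroll ("
    " Emp INTEGER REFERENCES Employee(Emp),"
    " Sal REAL)"
)
conn.execute("INSERT INTO Employee VALUES (1, 'Jones')")
conn.execute("INSERT INTO Payroll VALUES (1, 45000.00)")


def try_sql(sql):
    """Run one statement and report whether RI allowed it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# A payroll row for an employee who does not exist.
orphan_insert = try_sql("INSERT INTO Payroll VALUES (99, 50000.00)")
# Deleting the employee while the payroll row still points at him.
early_delete = try_sql("DELETE FROM Employee WHERE Emp = 1")
# Remove the payroll row first, then the employee.
conn.execute("DELETE FROM Payroll WHERE Emp = 1")
late_delete = try_sql("DELETE FROM Employee WHERE Emp = 1")

print(orphan_insert, early_delete, late_delete)
```

The first two statements are turned away, and the employee can be deleted only after the payroll row is gone.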
Fastload
Fastload is designed to load flat file data from a mainframe or LAN
directly into an empty Teradata table. This is how a Teradata table is
populated the first time. I have personally seen Teradata load over
one billion large rows in less than 6 hours. Plus, I have seen Teradata
load millions of rows in minutes. Teradata has the quickest time to
solution, and has the most powerful performance in the data
warehousing industry.
How is Teradata's speed and performance accomplished? It's done
through parallel processing.
Fastload understands one SQL command - INSERT. It inserts rows
into an empty table. The process is as follows: A flat file is prepared
for loading on a mainframe or LAN. The FASTLOAD utility needs three
pieces of information to process: where the flat file is located, what
its file definition is, and what table the data should be loaded into in
Teradata.
When the Fastload utility starts, the Parsing Engine comes up with a
plan for the AMPs. The Parsing Engine then steps back and lets the
AMPs do their work. The data is loaded in large 64K blocks. Each AMP
is given a 64K block of rows for loading. Like a line of workers trying
to pass sand bags to prevent a flood, Teradata passes these blocks
from AMP to AMP until all the data is on Teradata. Next, all AMPs take
the blocks they received, hash the rows in those blocks (in parallel)
and send the rows to the proper AMP over the BYNET. Once this is
done, each AMP sorts its data by Row ID and the table is ready for
business.
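
The three phases described above can be walked through in Python. This is a toy model under several assumptions: blocks are counted in rows instead of bytes, `zlib.crc32` of the first column stands in for Teradata's row hash, the sort uses that hash in place of the Row ID, and everything runs in one process instead of in parallel.

```python
import zlib

NUM_AMPS = 4
BLOCK_ROWS = 3  # stand-in for a 64K block; real blocks are sized in bytes


def row_hash(row):
    """crc32 of the first column stands in for the real row hash."""
    return zlib.crc32(str(row[0]).encode())


def fastload(flat_file_rows):
    # Phase 1: hand blocks to the AMPs in turn, without looking inside them.
    received = [[] for _ in range(NUM_AMPS)]
    for n, start in enumerate(range(0, len(flat_file_rows), BLOCK_ROWS)):
        received[n % NUM_AMPS].extend(flat_file_rows[start:start + BLOCK_ROWS])
    # Phase 2: each AMP hashes its rows and sends every row to its proper AMP.
    table = [[] for _ in range(NUM_AMPS)]
    for amp_rows in received:
        for row in amp_rows:
            table[row_hash(row) % NUM_AMPS].append(row)
    # Phase 3: each AMP sorts the rows it now owns.
    for amp_rows in table:
        amp_rows.sort(key=row_hash)
    return table


rows = [(i, f"name{i}") for i in range(1, 21)]
table = fastload(rows)
for amp, amp_rows in enumerate(table):
    print(amp, [r[0] for r in amp_rows])
```

Every row ends up on the AMP its hash points to, no matter which AMP first received the block it arrived in.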
Fastload Basics:
[Figure: DATA from the mainframe or LAN goes to the PE and then over the BYNET to the AMPs in 64K blocks.]
Multiload
Where Fastload is meant to populate empty tables with INSERTS,
Multiload is meant to process INSERTS, UPDATES, and DELETES on
tables that have existing data. Multiload is extremely fast. One major
Teradata data warehouse company processes 120 million inserts,
updates, and deletes during its nightly batch.
Multiload works similarly to Fastload. Data originates as a flat file on
either a mainframe or LAN. When the Multiload utility is executed, the
Parsing Engine creates a plan for the AMPs to follow. The data is then
passed to the AMPs, in parallel, in 64K blocks, and the AMPs hash the
rows to the proper AMP. Last, the INSERTS, UPDATES, and DELETES
are applied.
In the previous diagram the mainframe/LAN is talking to the Parsing
Engine. The PE passes the data across the BYNET for the AMPs to
retrieve. Keep in mind, many systems have hundreds to thousands of
AMPs. The load takes place, continually, in parallel when the 64K
packets are delivered to the AMPs. Multiload has been designed for
users who have a need for speed.
Multiload locks at the table level. Therefore, while Multiload is running,
the table is unavailable.
Multiload Basics:
Tpump
The Tpump utility is designed to allow OLTP transactions to
immediately load into a data warehouse. When I started working with
Teradata, more than 10 years ago, most companies loaded data on a
monthly basis. Suddenly, companies began to load data weekly.
Today, most companies load data nightly, and industry leaders are
loading data hourly. Tpump is the beginning step of an Active Data
Warehouse (ADW). ADW combines OLTP transactions with a
Decision Support System (DSS).