
Teradata DBA

Rolf Hanusa

A Lesson in Outer Joins (Learned the Hard Way!)
Using outer joins to ease the query process
Outer joins are extremely powerful tools, but they can be very difficult to understand
and use properly. A lack of understanding can give you unexpected and costly results.
(For example, your company might mail promotional fliers to 17 million customers,
instead of the 17,000 customers that you intended to target.) Although this article won't
make you an expert on outer joins, it will help you understand their complexities using
real-world examples and explanations.
Before we get too deep into the SQL syntax, we need a framework on which to build.
Because most of the outer joins that I see are used in Teradata, I'll limit this article to
queries written and executed in a Teradata environment. The rules should still apply with
other RDBMSs, but the queries may execute differently.

OUTER JOIN: A LOGICAL DEFINITION

An outer join is defined logically as the UNION ALL of up to three pieces. Which pieces
are pulled together depends on the type of outer join:
• Piece 1: The inner join result of the two tables as described by the full ON
clause, with all conditions applied.
• Piece 2: All rows from the left table not included in Piece 1, extended with NULL
values for each column of the right table.
• Piece 3: All rows from the right table not included in Piece 1, extended with NULL
values for each column of the left table.

Left Outer Join is Piece 1
UNION ALL Piece 2.
Right Outer Join is Piece 1
UNION ALL Piece 3.
Full Outer Join is Piece 1
UNION ALL Piece 2
UNION ALL Piece 3.

For each type of outer join (left, right, full), just put the proper "pieces" together using
UNION ALL.
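
As a minimal sketch of this definition, the following builds a left outer join by hand from Piece 1 and Piece 2. The tables left_tbl and right_tbl, their columns, and the sample rows are hypothetical and exist only for illustration; they are not part of the examples later in this article.

```sql
-- Hypothetical demo tables (illustration only).
CREATE TABLE left_tbl  (id INTEGER, val VARCHAR(10));
CREATE TABLE right_tbl (id INTEGER, amt INTEGER);

INSERT INTO left_tbl  VALUES (1, 'a');
INSERT INTO left_tbl  VALUES (2, 'b');
INSERT INTO left_tbl  VALUES (3, 'c');
INSERT INTO right_tbl VALUES (1, 10);
INSERT INTO right_tbl VALUES (3, 30);
INSERT INTO right_tbl VALUES (4, 40);

-- Piece 1: the inner join, with the full ON clause applied (ids 1 and 3).
SELECT l.id, l.val, r.amt
FROM left_tbl l
INNER JOIN right_tbl r
   ON l.id = r.id
UNION ALL
-- Piece 2: left rows not included in Piece 1 (id 2), extended with NULL
-- for the right table's column.
SELECT l.id, l.val, CAST(NULL AS INTEGER)
FROM left_tbl l
WHERE NOT EXISTS (SELECT 1 FROM right_tbl r WHERE r.id = l.id)
ORDER BY 1;
```

This should return the same three rows as left_tbl LEFT OUTER JOIN right_tbl ON l.id = r.id. The right table's row with id 4 would appear only if Piece 3 were added, which is what a full outer join does.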

SOME BASIC RULES AND RECOMMENDATIONS

One or more join conditions, also called "connecting terms," are required in the ON
clause for each relation in an outer join. These join conditions are used to define the rows
in the outer table that take part in the match to the inner table.
I recommend that you use only join conditions in ON clauses. However, when a
search condition (used for row selection) is required on the inner table, it should be put in
the ON clause as well. A search condition in the ON clause of the inner table will not
limit the number of rows in the answer set. It only defines the rows eligible to take part in
the match to the outer table.
An outer join can also include a WHERE clause; however, the results you get when
you do include it may be surprising--or at least not obvious. This will be explained in
more detail later in the article. To limit the number of qualifying rows in the outer table
(and therefore the answer set), the search condition for the outer table must be in the
WHERE clause. Note: The WHERE clause is applied only after the outer join has been
produced.
Here's a little-known (or less understood) outer join rule: If a search condition on the
inner table is placed in the WHERE clause, the JOIN is logically equivalent to an INNER
JOIN, even if you code OUTER JOIN in the query. Read on to see how this can impact
your results.
These rules are not strange concepts unique to Teradata. This is a fully SQL-92-
compliant implementation (for better or worse). Teradata's optimizer does, however, take
advantage of these concepts in processing these queries. Instead of executing the outer
join just as it is defined, the optimizer rewrites the query to roll the whole, complex
process into a single step, as well as to eliminate outer joins that really aren't.

FROM THEORY TO REAL-WORLD ANALYSIS

The following examples represent actual cases that I have encountered as a DBA.
Although I've changed them slightly to avoid any conflict of interest, the basic syntax and
counts remain accurate. Since Teradata EXPLAINs may be new to some readers, they
have been altered slightly for clarity (that is, aliases were replaced with database names,
and so forth).
Before writing a query, it is important to understand the business question that it is
supposed to answer. Here is a simple explanation of the business question we are trying
to answer in the remainder of this article:
We want to know all the customers (using table CUSTOMER, which contains over
17 million rows):
• Who reside in the DISTRICT of K,
And:
• Who have a SERVICE_TYPE of ABC or XYZ,
And:
• Their monthly revenue (using table REVENUE, which contains over 234
million rows) for the month of July 1997, using DATA_DATE = 199707,
And (here's where the outer join comes in):
• If the customer revenue is unknown (that is, if no revenue records are found), we
want to keep the customer record with a NULL for MONTHLY_REVENUE.
Sounds simple enough, doesn't it? I thought so too until I started analyzing my
original answer sets and found them to be incorrect and, in some cases, very surprising.
In fact, until I researched several coding alternatives and repeatedly questioned one of
NCR's developers (who now probably uses caller ID to screen my calls), I was convinced
that Teradata's optimizer had lost its mind. It hadn't, but I almost did. You'll see what I
mean as we go through the following examples and analyze the results.
The first example (see Listing 1) is a single table select, which provides the base of
customer records that we want. The second example (see Listing 2) is an inner join that
will help explain the remaining queries and results. It starts with the same base of
customer records but matches them with revenue records for a particular month. Note that
not all customer records found a matching revenue record: only 13,010 of the 18,034 did.

Listing 1. Single table select.

SELECT C.CUSTNUM
FROM SAMPDB.CUSTOMER C
WHERE C.DISTRICT='K'
AND (C.SERVICE_TYPE= 'ABC'
OR C.SERVICE_TYPE= 'XYZ')
ORDER BY 1;

Result: This query returns 18,034 rows.

Listing 2. Inner join.


SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
, SAMPDB2.REVENUE B
WHERE
C.CUSTNUM = B.CUSTNUM
AND C.DISTRICT = 'K'
AND B.DATA_DATE = 199707
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 13,010 rows.

In Listing 3, an outer join is requested, but if we apply the rules stated above, we end up
with a surprising result. Although we are asking for a LEFT OUTER JOIN, it is in fact
treated as an inner join. Because all the selection criteria are in the WHERE clause, they
are logically applied only after the outer join processing has been completed. This means
that Listings 2 and 3 are logically equivalent and will provide the same result. It is
important to note that Teradata recognizes that this query is the same as an inner join and
executes it as such (see EXPLAIN Exq3). Therefore, it executes with the speed of an inner join.

Listing 3. Outer join. (But is it?)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
WHERE
C.DISTRICT = 'K'
AND B.DATA_DATE = 199707
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 13,010 rows.

Note: For those of you who are unfamiliar with a Teradata EXPLAIN, it is a textual
description of the processing steps that the Teradata Optimizer will use to execute an
SQL query.

EXPLAIN Exq3:

1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with a condition of ("(SAMPDB.T1.DISTRICT = 'K') and
((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ'))"),
which is joined to SAMPDB2.REVENUE with a condition of
("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and
SAMPDB2.REVENUE are joined using a merge join, with a join condition of
("(SAMPDB.CUSTOMER.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The
input table SAMPDB.CUSTOMER will not be cached in memory. The result goes into
Spool 1, which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the
sort key in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The
estimated time for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 6 minutes and 2 seconds.
The NCR/Teradata developer's explanation: Logically, terms in the WHERE
clause are supposed to be applied after the outer join has been performed using the terms
in the ON clause. If we do that, there will be 18,034 rows in the result. But when we
apply the term B.DATA_DATE= 199707 afterward, it will eliminate all rows where
B.DATA_DATE is null (these are the rows where no inner table rows matched outer
table rows). Thus, it is quite reasonable to expect that this client's request should return
fewer than 18,034 rows.
Perhaps (since the EXPLAIN will not appear to reflect the logic I've described) I
should mention that we do not really apply the term B.DATA_DATE= 199707 after
doing the outer join. The optimizer recognizes that outer joins with a WHERE clause
containing a term referencing the inner table, which would not evaluate true when the
column is null, are logically equivalent to an inner join. In such cases, the optimizer
generates a plan to perform an inner join. (Note that step 2 of the EXPLAIN says that we
do a merge join, not an outer merge join.)
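
To make the logical order described above visible, Listing 3 can be rewritten so that the outer join is produced first inside a derived table and the WHERE terms are applied to its output afterward. This is a sketch for reading only, not a recommendation: the derived-table alias J is a name introduced here for illustration, and the optimizer does not actually execute the query this way.

```sql
SELECT J.CUSTNUM, J.MONTHLY_REVENUE
FROM (
    -- Logical step 1: the outer join, using only the ON clause.
    -- Customers with no revenue rows come out with NULL in
    -- DATA_DATE and MONTHLY_REVENUE.
    SELECT C.CUSTNUM, C.DISTRICT, C.SERVICE_TYPE,
           B.DATA_DATE, B.MONTHLY_REVENUE
    FROM SAMPDB.CUSTOMER C
    LEFT OUTER JOIN SAMPDB2.REVENUE B
      ON C.CUSTNUM = B.CUSTNUM
) J
-- Logical step 2: the WHERE terms, applied to the joined rows.
-- NULL = 199707 does not evaluate true, so every NULL-extended
-- row is discarded here and only matched rows survive.
WHERE J.DISTRICT = 'K'
  AND J.DATA_DATE = 199707
  AND (J.SERVICE_TYPE = 'ABC' OR J.SERVICE_TYPE = 'XYZ')
ORDER BY 1;
```

Written this way, it is easier to see why the term on DATA_DATE turns the request into an inner join: the only rows that distinguish an outer join from an inner join are the ones that the term removes.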
My explanation of the developer's explanation: Notice that the restrictions on the
outer table are in the WHERE clause. This causes the left table to be reduced from
17,713,502 to 18,034 rows. The restrictions on the inner table are also in the WHERE
clause (instead of the ON clause), so they will be applied afterward to remove all rows
containing NULLs (as a result of the outer join). This reduces the answer set to 13,010
rows.
Confusing, yes. But it gets worse. Our next example (see Listing 4) is an outer join,
but the answer set returned is vastly different from the desired result, as we shall see. This
query was the most confusing for me to understand, at least at first. As the developer told
me, it is counterintuitive.

Listing 4. Outer join. (Yes, but is this what you want?)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND C.DISTRICT = 'K'
AND B.DATA_DATE = 199707
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 17,713,502 rows.

The NCR/Teradata developer's explanation: As long as there is no WHERE
clause, the result of an outer join will always have at least one row in the result for every
row in the outer relation. That is what we have here. Listing 4 demonstrates the result of
one of the possible placements of single-relation terms on the outer relation. When such
terms are placed in the ON clause, they do not eliminate any rows from the result. Outer
table rows that do not satisfy DISTRICT = 'K' and (SERVICE_TYPE = 'ABC' OR
SERVICE_TYPE = 'XYZ') are considered to be nonmatches with the inner table whether or
not the join terms (those that reference both inner and outer relations) all evaluate as
true. In other words, every outer relation row for which those two terms do not evaluate
true does not match any inner relation rows, even if all the connecting terms in the ON
clause evaluate true for that outer row and some inner relation row.
My explanation of the developer's explanation: The selection criteria (search
conditions) in the ON clause only determine which rows find a match and which rows are
extended with NULLs as nonmatching rows (see EXPLAIN Exq4). This means that all the
rows (17,713,502 of them) in the left table (CUSTOMER) will be returned. But only the
rows (13,010) with a SERVICE_TYPE of "ABC" or "XYZ" in DISTRICT "K" that have
matching rows in the right table (REVENUE) for month 199707 will have a non-NULL value
for MONTHLY_REVENUE. This query will also perform more slowly since there are no
WHERE conditions to limit the query ... well, almost none. Teradata is smart enough to
apply the DATA_DATE = 199707 condition to the right table as it is read, which limits the
rows that take part in the join. Otherwise, this query would run much longer. Note that
when you review EXPLAIN Exq4, you will see the words "left outer joined using a merge
join." This statement confirms that this query is in fact an outer join.

EXPLAIN Exq4:

1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with no residual conditions, which is joined to
SAMPDB2.REVENUE with a condition of ("SAMPDB2.REVENUE.DATA_DATE =
199707"). SAMPDB.CUSTOMER and SAMPDB2.REVENUE are left outer joined
using a merge join, with condition(s) used for nonmatching on left table
("((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ'))
and (SAMPDB.T1.DISTRICT = 'K')"), with a join condition of
("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table
SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1,
which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key
in spool field1. The size of Spool 1 is estimated to be 17,713,502 rows. The estimated
time for this step is 7 minutes and 15 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 7 minutes and 15 seconds.
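
One way to see what Listing 4 actually returned, without scrolling through millions of rows, is to count the matched and NULL-extended rows separately. The sketch below keeps the FROM and ON clauses of Listing 4 unchanged; the column aliases ROW_KIND and ROW_COUNT and the two labels are names introduced here for illustration. It relies on the fact that B.CUSTNUM can be NULL in the result only for a NULL-extended row, because a matched row must satisfy C.CUSTNUM = B.CUSTNUM.

```sql
SELECT CASE WHEN B.CUSTNUM IS NULL
            THEN 'NULL-extended'
            ELSE 'matched'
       END AS ROW_KIND,
       COUNT(*) AS ROW_COUNT
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND C.DISTRICT = 'K'
AND B.DATA_DATE = 199707
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
-- Group on the same expression so that the query yields one row per kind.
GROUP BY CASE WHEN B.CUSTNUM IS NULL
              THEN 'NULL-extended'
              ELSE 'matched'
         END;
```

Given the counts reported for Listing 4, one would expect the matched group to hold about 13,010 rows and the NULL-extended group to hold the rest of the 17,713,502.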
Listing 5 is another example of a query where an outer join is requested, but it is
logically an inner join and is therefore transformed into one.

Listing 5. Outer join. (Not! This will be treated as an inner join.)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND C.DISTRICT = 'K'
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
WHERE B.DATA_DATE = 199707
ORDER BY 1;

Result: This query returns 13,010 rows.

My explanation of Listing 5: Using what we have learned from the previous
examples, we can quickly see the similarity to Listing 3. Again, this query is treated as an
inner join, even though we asked for an outer join. The WHERE clause on the right
(inner) table logically changes this query from an outer join to an inner join (see
EXPLAIN Exq5). As in previous examples, the WHERE clause is logically applied after
the outer join processing has been completed, removing all rows that were NULLed in
the process (that is, nonmatching rows between left and right table). As before, the
optimizer knows to execute this as an inner join to improve the performance of the query.

EXPLAIN Exq5 (As you can see, this EXPLAIN output is identical to EXPLAIN
Exq3 and, as expected, so is the answer set.):

1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with a condition of ("(SAMPDB.T1.DISTRICT = 'K') and
((SAMPDB.T1.SERVICE_TYPE = 'ABC') or (SAMPDB.T1.SERVICE_TYPE = 'XYZ'))"),
which is joined to SAMPDB2.REVENUE with a condition of
("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and
SAMPDB2.REVENUE are joined using a merge join, with a join condition of
("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table
SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1,
which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key
in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time
for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 6 minutes and 2 seconds.
Finally, we have the correct answer. Listing 6 is an outer join whose answer set
answers the original business question.

Listing 6. Outer join. (The correct answer.)

SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND B.DATA_DATE = 199707
WHERE C.DISTRICT = 'K'
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ')
ORDER BY 1;

Result: This query returns 18,034 rows, of which 13,010 have non-NULL values for
MONTHLY_REVENUE.

In this query, the left (outer) table is limited by the search conditions in the WHERE
clause, and the search condition in the ON clause for the right (inner) table defines the
NULL-able nonmatching rows. This EXPLAIN confirms that this is in fact an outer join
(see EXPLAIN Exq6).

EXPLAIN Exq6:

1. First, we lock SAMPDB.CUSTOMER for access, and we lock
SAMPDB2.REVENUE for access.
2. Next, we do an all-AMPs JOIN step from SAMPDB.CUSTOMER by way of a
RowHash match scan with a condition of ("((SAMPDB.T1.SERVICE_TYPE = 'ABC')
or (SAMPDB.T1.SERVICE_TYPE = 'XYZ')) and (SAMPDB.T1.DISTRICT = 'K')"),
which is joined to SAMPDB2.REVENUE with a condition of
("SAMPDB2.REVENUE.DATA_DATE = 199707"). SAMPDB.CUSTOMER and
SAMPDB2.REVENUE are left outer joined using a merge join, with a join condition of
("(SAMPDB.T1.CUSTNUM = SAMPDB2.REVENUE.CUSTNUM)"). The input table
SAMPDB.CUSTOMER will not be cached in memory. The result goes into Spool 1,
which is built locally on the AMPs. Then we do a SORT to order Spool 1 by the sort key
in spool field1. The size of Spool 1 is estimated to be 1,328,513 rows. The estimated time
for this step is 6 minutes and 2 seconds.
3. Finally, we send out an END TRANSACTION step to all AMPs involved in
processing the request.
--> The contents of Spool 1 are sent back to the user as the result of statement 1. The
total estimated time is 0 hours and 6 minutes and 2 seconds.
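
For readers who find the ON-versus-WHERE placement hard to keep straight, an alternative formulation filters each table in its own derived table before joining them. The sketch below should be logically equivalent to Listing 6, because each search condition references only one table. It was not part of the original case, its performance has not been measured, and the plan should be checked with EXPLAIN before relying on it.

```sql
SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM (
    -- Outer table: only the customers we want in the answer set.
    SELECT CUSTNUM
    FROM SAMPDB.CUSTOMER
    WHERE DISTRICT = 'K'
      AND (SERVICE_TYPE = 'ABC' OR SERVICE_TYPE = 'XYZ')
) C
LEFT OUTER JOIN (
    -- Inner table: only the revenue rows eligible to match.
    SELECT CUSTNUM, MONTHLY_REVENUE
    FROM SAMPDB2.REVENUE
    WHERE DATA_DATE = 199707
) B
  ON C.CUSTNUM = B.CUSTNUM
ORDER BY 1;
```

With both filters applied up front, the ON clause holds only the join condition and there is no outer WHERE clause left to remove the NULL-extended rows.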

GETTING THE "CORRECT" ANSWER

As the previous examples show, outer joins, when used properly, provide additional
information from a single query that formerly required multiple queries and/or steps to
achieve. However, the proper use of outer joins requires training and/or experience
because simple logic does not always apply. Use the following steps to be sure that you're
getting the "correct" answer (that is, the one you expect to get):
1. Make sure that you understand the question you are trying to answer; you should
have a pretty good idea what the answer set should look like.
2. Write the query, keeping in mind the proper placement of join conditions and
search conditions:
• All join conditions are placed in the ON clause.
• Search conditions for the inner table are placed in the ON clause, while search
conditions on the outer table are placed in the WHERE clause.
3. Always EXPLAIN the query before executing it. Look for the words "outer join."
If you don't see them, it's not one.
4. Run the query and compare the result with your expectations.
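
Steps 3 and 4 might look like the following for the query in Listing 6. This is a sketch: the EXPLAIN request modifier is simply placed in front of the query, and the second statement is a hypothetical sanity check whose column aliases TOTAL_ROWS and MATCHED_ROWS are introduced here for illustration.

```sql
-- Step 3: ask the optimizer for its plan without running the query,
-- then look for the words "outer join" in the output.
EXPLAIN
SELECT C.CUSTNUM, B.MONTHLY_REVENUE
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND B.DATA_DATE = 199707
WHERE C.DISTRICT = 'K'
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ');

-- Step 4: compare the row counts with what you expect.
-- COUNT(*) counts every row in the answer set; COUNT(B.CUSTNUM) skips
-- NULLs, so it counts only the rows that found a revenue match.
SELECT COUNT(*) AS TOTAL_ROWS,
       COUNT(B.CUSTNUM) AS MATCHED_ROWS
FROM SAMPDB.CUSTOMER C
LEFT OUTER JOIN
SAMPDB2.REVENUE B
ON C.CUSTNUM = B.CUSTNUM
AND B.DATA_DATE = 199707
WHERE C.DISTRICT = 'K'
AND (C.SERVICE_TYPE = 'ABC' OR
C.SERVICE_TYPE = 'XYZ');
```

If TOTAL_ROWS equals MATCHED_ROWS for a query that was meant to keep unmatched customers, the outer join has probably been turned into an inner join by a misplaced search condition.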
If your answer set matches your expectations, it is probably correct. If not, check the
locations of any selection criteria that you have placed in the ON and/or WHERE clauses.
As this article demonstrates, many results are possible and the correct solution is not
necessarily intuitive, especially in a more complex query. Now let's look at that 12-way
complex join...

Rolf Hanusa is the project leader and lead DBA for Southwestern Bell's Corporate
Data Warehouse (CDW) Project. Rolf has more than 10 years of experience as a DBA,
supporting both Teradata and DB2 DSS systems. He is also an active member of the
Partners Product Advisory Council, a group of NCR/Teradata customers that provides
input to NCR on the product direction of NCR's large system products, as well as
enhancements to the Teradata RDBMS. You can reach him via email at
rh9151@stlmail1.sbc.com.
