
Calculus And Linear Algebra


Sebastian Shaqiri

February 26, 2017


I Calculus and Linear Algebra

1 Limit and Derivatives
1.1 Limits
1.1.1 Evaluating Limits
1.1.2 Continuous Functions
1.2 Derivatives
1.2.1 Definition of the Derivative
1.2.2 Properties of Derivatives
1.2.3 General Characteristics of Differentiable Functions
1.2.4 L'Hôpital's rule

2 Integrals and Antiderivatives
2.1 General Characteristics of Anti-derivatives
2.1.1 Partial integration
2.1.2 Variable Substitution
2.2 Integrals
2.2.1 The Riemann Integral
2.2.2 Integration of Continuous Functions
2.2.3 Properties and Estimates
2.2.4 Fundamental theorem of calculus
2.2.5 Improper Integrals
2.2.6 Integrals in Probability Theory

3 Linear Algebra
3.1 System of Linear Equations
3.2 Matrix Multiplication
3.3 The Matrix Equation Ax = b
3.3.1 Properties of the Matrix-vector product Ax
3.4 The Inverse of a Matrix
3.5 Matrix Factorizations
3.5.1 The LU Factorization
3.6 Subspaces of R^n
3.7 Eigenvectors and Eigenvalues
3.8 Diagonalization
3.9 Inner Product, Length and Orthogonality
3.9.1 The Inner Product
3.9.2 The Length of a Vector
3.9.3 Orthogonal Vectors
3.9.4 Orthogonal Sets
3.9.5 Orthogonal Projections
3.10 The Gram-Schmidt process
3.11 Least-Squares Problems
3.12 Further Reading (Optional)

Part I

Calculus and Linear Algebra


Limit and Derivatives

"Nothing takes place in the world whose meaning is not that of some maximum or minimum."
Leonhard Euler

1.1 Limits
In the case $x \to +\infty$ we define the limit as follows.
Definition 1. Assume that $f(x)$ is a function whose domain $D_f$ contains arbitrarily large real numbers. We say that $f(x)$ has the limit $A$ as $x$ approaches infinity if for every given number $\varepsilon > 0$ there is a number $\omega$ such that
\[ x > \omega,\ x \in D_f \implies |f(x) - A| < \varepsilon. \]
This is written
\[ f(x) \to A \quad \text{when } x \to +\infty \]
or alternatively
\[ \lim_{x \to +\infty} f(x) = A. \]

The meaning of the definition is as follows: the function has the limit $A$ when $x \to +\infty$ if the function values $f(x)$ satisfy any given tolerance requirement of the form
\[ A - \varepsilon < f(x) < A + \varepsilon \]
as soon as $x$ is sufficiently large, that is, for all $x > \omega$. The greater the accuracy required, i.e. the smaller $\varepsilon$ is, the greater $\omega$ has to be chosen for the tolerance requirement to be fulfilled for all $x > \omega$.
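As a numerical illustration of this tolerance reading, the following sketch uses an example function chosen here (not taken from the notes), $f(x) = (2x+1)/x$ with limit $A = 2$. The helper name `omega_for` is hypothetical; it returns a threshold $\omega$ that works for a given $\varepsilon$.

```python
def f(x):
    # Example function chosen for illustration: f(x) = 2 + 1/x, so the limit A is 2.
    return (2 * x + 1) / x

def omega_for(eps):
    # |f(x) - 2| = 1/x < eps holds exactly when x > 1/eps.
    return 1.0 / eps

A = 2.0
for eps in (0.1, 0.01, 0.001):
    omega = omega_for(eps)
    # Sample a few points beyond omega and check the tolerance requirement.
    samples = [omega * s for s in (1.5, 2.0, 10.0, 1000.0)]
    ok = all(abs(f(x) - A) < eps for x in samples)
    print(f"eps={eps}: omega={omega}, tolerance met on samples: {ok}")
```

A smaller `eps` forces a larger `omega`, which is the trade-off described in the paragraph above. Sampling finitely many points does not prove the limit; the inequality in the comment does.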

Definition 1 applies in particular to sequences $(a_n)_{n=0}^{\infty}$, which of course can be seen as functions with the natural numbers as domain. For sequences, but not for other functions, there is also the following terminology: if the sequence has a limit as $n \to \infty$ it is said to be convergent, otherwise divergent.

The definition of the limit above is just one of several similar definitions that are needed. When we examine the elementary functions we have to work with
\[
\begin{aligned}
f(x) &\to A \quad \text{when } x \to -\infty,\\
f(x) &\to A \quad \text{when } x \to a,\\
f(x) &\to A \quad \text{when } x \to a^+,\\
f(x) &\to A \quad \text{when } x \to a^-.
\end{aligned}
\]
These notions are defined analogously to the prototype $\lim_{x \to +\infty} f(x) = A$ treated above, and they are designated by corresponding lim notations. For example, in the case $x \to a$ we have:
Let $f$ be a function and assume that every neighbourhood of the point $a$ contains points from $D_f$. Then $f$ is said to have the limit $A$ as $x$ approaches $a$ if, for every number $\varepsilon > 0$, there exists a number $\delta > 0$ such that
\[ |x - a| < \delta,\ x \in D_f \implies |f(x) - A| < \varepsilon. \]

In particular, if the point $a$ itself belongs to the domain, we can select $x = a$. The condition then reads $|f(a) - A| < \varepsilon$ for each $\varepsilon > 0$, and the only possibility is that $A = f(a)$. Thus, if $f$ is defined at $a$ and has a limit as $x \to a$, the limit has to be equal to the function value $f(a)$.
Limits of the type $x \to a^+$ and $x \to a^-$ are called right and left limits, respectively. Their definitions are obtained by replacing the condition $|x - a| < \delta$ above by $a \le x < a + \delta$ and $a - \delta < x \le a$, respectively. Evidently $f(x)$ has a limit as $x \to a$ exactly when the right and left limits exist and are equal.

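In the discussion above it is implicit but important that $A$ is a number.

We also need to introduce concepts such as
\[
\begin{aligned}
f(x) &\to +\infty \quad \text{when } x \to +\infty,\\
f(x) &\to +\infty \quad \text{when } x \to a,\\
f(x) &\to -\infty \quad \text{when } x \to a^+.
\end{aligned}
\]
Such limits are called improper limits. They are defined by analogy with the former (proper) limits.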

1.1.1 Evaluating Limits
We usually try to avoid working directly with the definition when determining limits. Instead, we use some basic properties together with a set of standard limits.
The basic properties that we are about to establish are perceived by most as intuitively obvious and are applied almost automatically in problem solving. The rules are valid for all types of limits, $x \to +\infty$, $x \to a$, $x \to a^+$ etc., and in the formulation below we therefore make no stipulation in this regard unless it is necessary for the sake of clarity.

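Theorem 1. If $\lim f(x) = 0$ and the function $g(x)$ is bounded, then
\[ f(x)g(x) \to 0. \]

Proof. Since $g$ is bounded there exist two numbers $C$ and $\omega_0$ such that
\[ x > \omega_0 \implies |g(x)| < C. \]
Let $\varepsilon$ be a positive number. There then exists a number $\omega_1$ such that
\[ x > \omega_1 \implies |f(x)| < \frac{\varepsilon}{C}. \]
Let $\omega = \max(\omega_0, \omega_1)$. We then have
\[ x > \omega \implies |f(x)g(x)| = |f(x)|\,|g(x)| < \frac{\varepsilon}{C} \cdot C = \varepsilon, \]
and by the definition of a limit we have that $f(x)g(x) \to 0$ when $x \to \infty$.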

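Theorem 2 (Addition, Product and Quotient properties). If
\[ \lim f(x) = A \quad \text{and} \quad \lim g(x) = B \]
then
\[ f(x) + g(x) \to A + B, \tag{1.1} \]
\[ f(x)g(x) \to AB. \tag{1.2} \]
Furthermore, if $B \neq 0$, then
\[ \frac{f(x)}{g(x)} \to \frac{A}{B}. \tag{1.3} \]

Proof. Proof of (1.1): Let $\varepsilon$ be a positive number. By the definition there exist two numbers $\omega_1$ and $\omega_2$ such that
\[ x > \omega_1 \implies |f(x) - A| < \frac{\varepsilon}{2}, \]
\[ x > \omega_2 \implies |g(x) - B| < \frac{\varepsilon}{2}. \]
Let $\omega = \max(\omega_1, \omega_2)$. By the triangle inequality we then get, for $x > \omega$,
\[ |f(x) + g(x) - (A + B)| = |(f(x) - A) + (g(x) - B)| \le |f(x) - A| + |g(x) - B| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon, \]
and thereby the proof is done.
Proof of (1.2): Consider the identity
\[ f(x)g(x) - AB = (f(x) - A)g(x) + A(g(x) - B). \]
Since $f(x) - A \to 0$ and $g(x) - B \to 0$, and since $g(x)$ has a limit and is therefore bounded for large $x$, a combination of Theorem 1 and (1.1) shows that the right-hand side tends to 0.
Proof of (1.3): We first show that
\[ \frac{1}{g(x)} \to \frac{1}{B} \quad \text{when } x \to \infty. \]
If $B$ is positive we get, for sufficiently large $x$,
\[ g(x) > B - \frac{B}{2} = \frac{B}{2}, \]
and then
\[ 0 < \frac{1}{g(x)} < \frac{2}{B}. \]
Hence $1/(Bg(x))$ is bounded, and since
\[ \frac{1}{g(x)} - \frac{1}{B} = \frac{1}{Bg(x)}\,(B - g(x)) \]
with $B - g(x) \to 0$, Theorem 1 gives $1/g(x) \to 1/B$. The case $B < 0$ is analogous. Finally, (1.3) follows from (1.2) applied to the product $f(x) \cdot \frac{1}{g(x)}$, and thus the proof is done.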

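Theorem 3 (Squeeze Theorem). If $f(x)$ and $g(x)$ have the same limit $A$ and
\[ f(x) \le h(x) \le g(x), \]
then $\lim h(x) = A$.

Proof. Let $\varepsilon$ be a given positive number. According to the assumptions there exist two numbers $\omega_1$ and $\omega_2$ such that
\[ x > \omega_1 \implies A - \varepsilon < f(x) < A + \varepsilon, \]
\[ x > \omega_2 \implies A - \varepsilon < g(x) < A + \varepsilon. \]
Consequently
\[ A - \varepsilon < f(x) \le h(x) \le g(x) < A + \varepsilon \]
for all $x > \max(\omega_1, \omega_2)$, and by the definition this means that $h(x)$ has the limit $A$.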

1.1.2 Continuous Functions
Definition 2. A function $f$ is said to be continuous at a point $x_0$ if $x_0$ belongs to the domain and the limit
\[ \lim_{x \to x_0} f(x) \]
exists and equals $f(x_0)$. If a function is continuous at each point in its domain it is called continuous.
Points at which a function is not continuous are often referred to as discontinuities. Sometimes we also talk about a singularity at such a point. The meaning of continuity is that a small variation of the variable $x$ only causes a small change in the function value $f(x)$. A sudden change of function values thus indicates the presence of a discontinuity.

The elementary functions are continuous

From the basic properties of limits it follows immediately that if $f$ and $g$ are continuous functions, then so are
\[ f + g, \quad fg, \quad \frac{f}{g}, \quad f \circ g \]
in their respective domains. Since the function $f(x) = x$ is trivially continuous, repeated use of these rules shows that each polynomial and each rational function is continuous. We accept without detailed proof that the process for introducing powers leads to continuous power functions and exponential functions. The geometric situation in the introduction of the trigonometric functions indicates that these are continuous: a small change in the arc length $x$ gives rise to small changes in the coordinates $\cos(x)$ and $\sin(x)$ of the corresponding point on the unit circle. The hyperbolic functions are built from exponential functions and are therefore continuous. Finally, the logarithmic functions and the inverse trigonometric functions are also continuous, according to the following theorem.
Theorem 4. The inverse of a strictly monotonic and continuous function is continuous.
All the elementary functions, from polynomials to inverse functions, are thus continuous. The same is true for all functions that are built from them using addition, multiplication, division and composition.

1.2 Derivatives
There are many practical questions about how quickly a particular course of change proceeds, such as "how fast is the car going?", "how quickly does the air pressure change with increasing height above the surface of the earth?", "how much does the tax increase with a growing income?", etc. We will equip ourselves with a mathematical tool that measures the speed of such changes. Let $f(x)$ denote the temperature distribution along a thin rod positioned along the $x$-axis. We assume that the temperature is measured in degrees Celsius and that the unit of length is the meter. If we have full knowledge of the function $f(x)$, it should also be possible to answer the question: how fast, expressed in degrees per meter, is the temperature changing along the rod at a certain point $x_0$? In other words, there should be an expression formed only by means of $f(x)$ that reasonably measures the change in temperature per meter at a given point $x_0$. To find this expression, we first note that from the point $x_0$ to a nearby point $x_0 + h$, the temperature has changed by
\[ f(x_0 + h) - f(x_0) \]
degrees. In the interval with endpoints $x_0$ and $x_0 + h$, the temperature increase (or decrease) is accordingly on average
\[ \frac{f(x_0 + h) - f(x_0)}{h} \tag{1.4} \]
degrees per meter. The expression (1.4) gives no precise answer to the question of how large the growth rate is at the point $x_0$, but the smaller the interval we use, the closer we should get to a precise value. If the limit
\[ \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} \tag{1.5} \]
exists, it is therefore reasonable to regard it as the measure of the rate of change of the temperature at the point $x_0$. The limit (1.5) thus represents the expression we are looking for.
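The behaviour of the average rate of change for shrinking $h$ can be sketched numerically. The temperature profile below is a hypothetical example chosen for illustration (it is not from the notes), and `difference_quotient` is simply the quotient formed above.

```python
def difference_quotient(f, x0, h):
    # Average rate of change of f between x0 and x0 + h.
    return (f(x0 + h) - f(x0)) / h

def temperature(x):
    # Hypothetical temperature profile in degrees Celsius at position x (meters).
    return x ** 2

x0 = 3.0
for h in (1.0, 0.1, 0.01, 0.001):
    # For this profile the quotient equals 6 + h exactly, so it approaches 6.
    print(h, difference_quotient(temperature, x0, h))
```

For this particular profile the quotient simplifies to $6 + h$, so the printed values settle towards 6 degrees per meter as $h$ shrinks.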

1.2.1 Definition of the Derivative

The analysis in the previous section shows that limits of the form (1.5) are of great interest, and we now begin a systematic study of them.

Definition 3. Suppose that the function $f$ is defined in a neighbourhood of the point $x_0$. If the limit
\[ \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} \]
exists, then $f$ is said to be differentiable at the point $x_0$. The limit is called the derivative of $f$ at $x_0$ and is denoted
\[ f'(x_0), \quad \frac{df}{dx}(x_0) \quad \text{or} \quad Df(x_0). \]

If a function $f$ is differentiable at every point in its domain we say briefly that $f$ is differentiable. The function
\[ x \mapsto f'(x), \quad x \in D_f \]
is called the derivative of $f$.

Geometric Interpretation of the Derivative

We define the tangent at the point $(x_0, f(x_0))$ as the line whose equation is
\[ y - f(x_0) = f'(x_0)(x - x_0). \tag{1.6} \]
We also speak of $f'(x_0)$ as the slope or steepness of the function curve at the point $(x_0, f(x_0))$.

Second derivative
There can, of course, in some cases be reason to study the growth rate of the derivative $f'$ of a function $f$. One then forms $(f')'$. This function is called the second derivative of $f$ and is denoted in any of the ways
\[ f'', \quad f^{(2)}, \quad D^2 f \quad \text{and} \quad \frac{d^2 f}{dx^2}. \]
According to its definition, $f''(x)$ measures how fast the growth rate changes at the point $x$. For example, if $s(t)$ denotes the distance travelled in meters at time $t$ seconds, then $s''(t)$ is the ordinary acceleration in m/s$^2$. Depending on the interpretation of $f(x)$, however, the "acceleration" $f''(x)$ has completely different units. In the example where $f(x)$ is the temperature ($^\circ$C) at $x$ (m), the unit of $f''$ is $^\circ$C/m$^2$.

1.2.2 Properties of Derivatives

We will now derive a number of basic properties of differentiation. For this purpose we need, in particular, the following result.
Theorem 5. If a function $f$ is differentiable, then it is continuous.
Proof. Suppose that $f$ is differentiable at the point $x_0$. According to the definition of continuity we must show that $f(x_0 + h) \to f(x_0)$ when $h \to 0$. We have
\[ f(x_0 + h) - f(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} \cdot h \to f'(x_0) \cdot 0 = 0 \quad \text{when } h \to 0, \]
which proves the theorem.

The converse of Theorem 5 is not true. For example, the function $f(x) = |x|$ is continuous but not differentiable at the point $x = 0$.

Algebraic Properties
Theorem 6. Let $f$ and $g$ be differentiable functions and let $\alpha$ be a constant. Then the functions $\alpha f$, $f + g$, $fg$ and $f/g$ are differentiable in their respective domains. We have the following formulas for their derivatives:
\[ (\alpha f)'(x) = \alpha f'(x) \tag{1.7} \]
\[ (f + g)'(x) = f'(x) + g'(x) \tag{1.8} \]
\[ (fg)'(x) = f'(x)g(x) + f(x)g'(x) \tag{1.9} \]
\[ \left(\frac{f}{g}\right)'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2} \tag{1.10} \]
Proof. Proof of (1.8): By the definition of the derivative we get
\[ \frac{f(x + h) + g(x + h) - (f(x) + g(x))}{h} = \frac{f(x + h) - f(x)}{h} + \frac{g(x + h) - g(x)}{h} \to f'(x) + g'(x) \]
when $h \to 0$, which is (1.8).
Proof of (1.9): By the definition of the derivative we get
\[ \frac{f(x + h)g(x + h) - f(x)g(x)}{h} = \frac{[f(x + h) - f(x)]\,g(x + h) + f(x)g(x + h) - f(x)g(x)}{h} \]
\[ = \frac{f(x + h) - f(x)}{h}\,g(x + h) + f(x)\,\frac{g(x + h) - g(x)}{h}. \]
Since $g$ is differentiable and thus continuous we get
\[ g(x + h) \to g(x) \quad \text{when } h \to 0. \]
Hence (1.9) follows.

Proof of (1.10): We first show that
\[ D\,\frac{1}{g(x)} = -\frac{g'(x)}{g(x)^2}. \]
By the definition of the derivative we get
\[ \frac{\frac{1}{g(x+h)} - \frac{1}{g(x)}}{h} = \frac{g(x) - g(x + h)}{h\,g(x)g(x + h)} = -\frac{\frac{g(x+h) - g(x)}{h}}{g(x)g(x + h)}, \]
where the denominator goes to $g(x)^2$ when $h \to 0$ and the numerator (with its sign) goes to $-g'(x)$. Formula (1.10) then follows by applying the product rule (1.9) to $f \cdot \frac{1}{g}$, hence the theorem is true.

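Derivatives of Composite Functions
Theorem 7 (Chain rule). Let $g$ be differentiable at $x$ and let $f$ be differentiable at $g(x)$. Let $y = f(g(x))$ and $u = g(x)$. Then $y$ is differentiable at $x$ and
\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = f'(g(x))\,g'(x). \]
Proof. We will use the fact that if $y = h(x)$ is differentiable at $x$ then
\[ \Delta y = h'(x)\Delta x + \varepsilon \Delta x, \]
where $\varepsilon \to 0$ when $\Delta x \to 0$. We have that
\[ \Delta u = g'(x)\Delta x + \varepsilon_1 \Delta x, \quad \text{where } \varepsilon_1 \to 0 \text{ when } \Delta x \to 0, \]
\[ \Delta y = f'(u)\Delta u + \varepsilon_2 \Delta u, \quad \text{where } \varepsilon_2 \to 0 \text{ when } \Delta u \to 0. \]
Substituting $\Delta u$ from the first equation into the second and dividing by $\Delta x$,
\[ \frac{\Delta y}{\Delta x} = \left(f'(u) + \varepsilon_2\right)\left(g'(x) + \varepsilon_1\right). \]
Taking the limit as $\Delta x \to 0$ (so that also $\Delta u \to 0$, since $g$ is continuous at $x$) gives
\[ \frac{dy}{dx} = f'(u)\,g'(x) = \frac{dy}{du} \cdot \frac{du}{dx}. \]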
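The chain rule can be checked numerically on a concrete pair of functions. The sketch below is an illustration only: the choices $f = \sin$ and $g(x) = x^2 + 1$ and the helper `derivative` (a symmetric difference quotient) are assumptions made here, not part of the notes.

```python
import math

def derivative(func, x, h=1e-6):
    # Symmetric difference quotient, a numerical stand-in for the derivative.
    return (func(x + h) - func(x - h)) / (2 * h)

def g(x):
    return x ** 2 + 1          # inner function u = g(x)

f = math.sin                   # outer function y = f(u)

def composite(x):
    return f(g(x))             # y = f(g(x))

x = 0.7
lhs = derivative(composite, x)                  # dy/dx computed directly
rhs = derivative(f, g(x)) * derivative(g, x)    # (dy/du) * (du/dx)
exact = math.cos(g(x)) * 2 * x                  # f'(g(x)) g'(x) in closed form
print(lhs, rhs, exact)
```

All three printed numbers should agree to several decimals; small discrepancies come from the finite step `h` in the numerical derivative.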

Derivative of Inverse Function

Theorem 8. Assume that the function $f$ has an inverse function $f^{-1}$ which is continuous. If $f$ is differentiable at the point $x$ and $f'(x) \neq 0$, then $f^{-1}$ is differentiable at the point $y = f(x)$ and
\[ (Df^{-1})(y) = \frac{1}{f'(x)}. \]
Proof. For each small increment $k \neq 0$ of $y$, we can write
\[ y + k = f(x + h), \]
where the increment $h$ of $x$ is determined by $f^{-1}(y + k) = x + h$, i.e.
\[ h = f^{-1}(y + k) - f^{-1}(y). \]
Since $f^{-1}$ is assumed continuous, $h \to 0$ as $k$ approaches 0. Now consider the difference quotient of $f^{-1}$ at the point $y$:
\[ \frac{f^{-1}(y + k) - f^{-1}(y)}{k} = \frac{h}{f(x + h) - f(x)} = \frac{1}{\frac{f(x+h) - f(x)}{h}}. \]
It follows that
\[ \frac{f^{-1}(y + k) - f^{-1}(y)}{k} \to \frac{1}{f'(x)} \quad \text{when } k \to 0, \]
and thereby the theorem is proved.

1.2.3 General Characteristics of Differentiable Functions
Definition 4. Let $x_0$ be a point in the domain $D_f$ of a function $f$. We say that $f$ has a local maximum at $x_0$ if there is a number $\delta > 0$ such that
\[ |x - x_0| < \delta,\ x \in D_f \implies f(x) \le f(x_0). \]

We then call $x_0$ a local maximum point of $f$ and the function value $f(x_0)$ a local maximum. Moreover, if $f(x) < f(x_0)$ when $x \neq x_0$ we speak of a strict local maximum point and a strict local maximum.
Similarly we define a (strict) local minimum point and a (strict) local minimum value.

Local maximum and local minimum points are jointly called local extreme points. We also say that $f$ has local extreme values at these points. Note carefully that the concept of a local extreme value only describes the function's behaviour in the immediate surroundings of a point. A local maximum is not necessarily the function's largest value, but of course it could be.

Theorem 9. If the function $f$ has a local extreme value at an interior point $x_0$ of the domain interval and if $f$ is differentiable at $x_0$, then
\[ f'(x_0) = 0. \]

Proof. We consider the case where $f$ has a local maximum at $x_0$; the proof in the other case is analogous. For all sufficiently small values of $|h|$ we have, according to the definition of a local maximum,
\[ \frac{f(x_0 + h) - f(x_0)}{h} \;\begin{cases} \ge 0 & \text{if } h < 0,\\ \le 0 & \text{if } h > 0. \end{cases} \]
Letting $h \to 0$ from the left and from the right, respectively, gives
\[ 0 \le f'(x_0) \le 0, \]
so we get that $f'(x_0) = 0$.

Points at which $f$ has derivative zero, i.e. where the function's growth rate is zero, are usually called critical points. The meaning of Theorem 9 is that, apart from possible end points, extreme values of a differentiable function can only occur at critical points. Among other things, in order to determine whether a given critical point is an extreme point or not, we need additional connections between a function and its derivative. The following theorem is fundamental in deriving such connections.

Theorem 10 (Mean value theorem). Suppose that $f$ is continuous in the closed interval $a \le x \le b$ and differentiable in the open interval $a < x < b$. Then there exists at least one point $\xi$, $a < \xi < b$, such that
\[ f(b) - f(a) = f'(\xi)(b - a). \]
Proof. Consider the following auxiliary function:
\[ \varphi(x) = f(x) - \frac{f(b) - f(a)}{b - a}\,(x - a). \]
Inserting $x = a$ and $x = b$ gives $\varphi(a) = \varphi(b) = f(a)$. Furthermore, $\varphi$ is continuous in the interval $[a, b]$ and differentiable in the interval $(a, b)$, with
\[ \varphi'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}. \]
Rolle's theorem states that $\varphi$ has a critical point $x = \xi$ in $(a, b)$, which gives us
\[ \varphi'(\xi) = f'(\xi) - \frac{f(b) - f(a)}{b - a} = 0, \]
which is equivalent to
\[ f(b) - f(a) = f'(\xi)(b - a), \]
and thereby we have proved the theorem.
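To make the statement concrete, the point $\xi$ can be located numerically for a specific function. The sketch below is an illustration under assumptions made here: the function $f(x) = x^3$ on $[0, 2]$ is an example chosen for this purpose, and `bisect` is a small hypothetical helper that finds a sign change of $f'(x) - \frac{f(b)-f(a)}{b-a}$.

```python
def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

def bisect(func, lo, hi, tol=1e-12):
    # Assumes func(lo) and func(hi) have opposite signs.
    flo = func(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        fmid = func(mid)
        if (fmid < 0) == (flo < 0):
            lo, flo = mid, fmid   # sign change lies in the right half
        else:
            hi = mid              # sign change lies in the left half
    return (lo + hi) / 2

a, b = 0.0, 2.0
slope = (f(b) - f(a)) / (b - a)                   # mean slope over [a, b]
xi = bisect(lambda x: f_prime(x) - slope, a, b)   # solve f'(xi) = slope
print(xi, f_prime(xi), slope)
```

For this example the equation $3\xi^2 = 4$ can also be solved by hand, giving $\xi = 2/\sqrt{3}$, which lies strictly between $a$ and $b$ as the theorem requires.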

Theorem 11. If the function $f$ is differentiable in the interval $a < x < b$, and if $f'(x) = 0$ for all $x$ in this interval, then $f$ is a constant function.
Proof. Let $c$ be a fixed point and let $x$ be an arbitrary point in the interval. Since differentiability entails continuity, the prerequisites of the mean value theorem are met in the interval with endpoints $c$ and $x$. Thus,
\[ f(x) - f(c) = f'(\xi)(x - c) \]
for some $\xi$ between $c$ and $x$. However, the derivative is equal to 0 at all points, so we get that $f(x) - f(c) = 0$, i.e.
\[ f(x) = f(c) \]
for all $x \in\, ]a, b[$. Thus the proof is finished.

Corollary 1. If the functions $f$ and $g$ are differentiable in $]a, b[$ and
\[ f'(x) = g'(x), \quad a < x < b, \]
it follows that
\[ f(x) = g(x) + C \]
for some constant $C$.
Proof. The assertion follows directly by applying Theorem 11 to the function $f(x) - g(x)$.

1.2.4 L'Hôpital's rule
Let $x_0$ be a real number or $\pm\infty$, and let $f(x)$ and $g(x)$ be differentiable functions. Suppose that $\lim_{x \to x_0} f(x) = 0$ and $\lim_{x \to x_0} g(x) = 0$. If $\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$ exists and there is an interval $(a, b)$ containing $x_0$ such that $g'(x) \neq 0$ for all $x \in (a, b)$, then $\lim_{x \to x_0} \frac{f(x)}{g(x)}$ exists and
\[ \lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f'(x)}{g'(x)}. \]

Also suppose $\lim_{x \to x_0} f(x) = \pm\infty$ and $\lim_{x \to x_0} g(x) = \pm\infty$. If $\lim_{x \to x_0} \frac{f'(x)}{g'(x)}$ exists and there is an interval $(a, b)$ containing $x_0$ such that $g'(x) \neq 0$ for all $x \in (a, b)$, then $\lim_{x \to x_0} \frac{f(x)}{g(x)}$ exists and
\[ \lim_{x \to x_0} \frac{f(x)}{g(x)} = \lim_{x \to x_0} \frac{f'(x)}{g'(x)}. \]

When $x_0 = \pm\infty$, the intervals are of the form $(-\infty, b)$ and $(a, \infty)$.
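A quick numerical sanity check of the rule in the $0/0$ case, on an example chosen here for illustration (not from the notes): $f(x) = e^x - 1$ and $g(x) = \sin x$ at $x_0 = 0$, where $g'(x) = \cos x \neq 0$ near 0.

```python
import math

def f(x):
    return math.expm1(x)   # e^x - 1, computed accurately for small x

def g(x):
    return math.sin(x)

def f_prime(x):
    return math.exp(x)

def g_prime(x):
    return math.cos(x)     # non-zero in a neighbourhood of 0

for x in (0.1, 0.01, 0.001):
    # Both ratios should approach the same limit as x -> 0.
    print(x, f(x) / g(x), f_prime(x) / g_prime(x))
```

Both columns approach $e^0 / \cos 0 = 1$, in agreement with the rule. The table only illustrates the statement; it is not a proof.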


Integrals and Antiderivatives

"Love can reach the same level of talent, and even genius, as the
discovery of differential calculus."
Lev Vygotsky

2.1 General Characteristics of Anti-derivatives

Definition 5. Let $f$ be defined in an interval $I$. A differentiable function $F$ is called an anti-derivative of $f$ if
\[ F'(x) = f(x), \quad x \in I. \]

It is obvious that if $F(x)$ is an anti-derivative of $f(x)$ then so is
\[ F(x) + C \]
for every constant $C$. On the other hand, an additive constant is the only indeterminacy an anti-derivative has. Namely, if $G$ is another anti-derivative of $f$, so that
\[ G'(x) = F'(x) = f(x), \quad x \in I, \]
then the corollary of Theorem 11 shows that
\[ G(x) = F(x) + C. \]

Thus, if we can find one anti-derivative $F$ of $f$, we obtain all anti-derivatives of $f$ by adding constants to $F$.
Instead of saying that $f'(x)$ is the derivative of $f(x)$, we can say that $f'(x)\,dx$ is the differential of $f(x)$. The reverse problem can be formulated similarly: we are looking for a function $F(x)$ whose differential is equal to $f(x)\,dx$. This is the background to letting
\[ \int f(x)\,dx \tag{2.1} \]
denote an anti-derivative of $f$. We will soon see that this differential notation has large computational advantages over other designations that may seem closer at hand, for example $\int f(x)$.

2.1.1 Partial integration

Theorem 12 (Partial integration). If $F$ is an anti-derivative of $f$ then
\[ \int f(x)g(x)\,dx = F(x)g(x) - \int F(x)g'(x)\,dx. \tag{2.2} \]

Proof. It is enough to demonstrate that the derivative of the right side is equal to $f(x)g(x)$. But the rule for differentiation of a product and the definition of an anti-derivative immediately give
\[ D\left(F(x)g(x) - \int F(x)g'(x)\,dx\right) = F'(x)g(x) + F(x)g'(x) - F(x)g'(x) = F'(x)g(x) = f(x)g(x). \]
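As an illustration of (2.2), take the example $f(x) = \cos x$ (so $F(x) = \sin x$) and $g(x) = x$, chosen here and not taken from the notes. Formula (2.2) then suggests the anti-derivative $x \sin x - \int \sin x\,dx = x \sin x + \cos x$. The sketch below checks numerically that its derivative is $f(x)g(x) = x\cos x$; the helper `derivative` is a hypothetical symmetric difference quotient.

```python
import math

def derivative(func, x, h=1e-6):
    # Symmetric difference quotient, a numerical stand-in for the derivative.
    return (func(x + h) - func(x - h)) / (2 * h)

def candidate(x):
    # Right-hand side of (2.2) for f = cos, F = sin, g(x) = x.
    return x * math.sin(x) + math.cos(x)

def integrand(x):
    # f(x) g(x) = x cos(x)
    return x * math.cos(x)

for x in (0.3, 1.0, 2.5):
    print(x, derivative(candidate, x), integrand(x))
```

The two printed columns should agree up to the accuracy of the numerical derivative.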

2.1.2 Variable Substitution

A general method in all types of mathematical problem solving is to change variables. In this way one might simplify the problem, or gain a new perspective on it. The calculation of anti-derivatives is no exception in this regard.
A change from a variable $x$ to a new variable $t$ has in this context the form
\[ x = g(t), \tag{2.3} \]
where the function $g$ is injective, i.e. $g$ has an inverse $t = g^{-1}(x)$. Thus, we can return to the variable $x$ after having solved our problem in the variable $t$.
The following theorem shows how the calculation of $\int f(x)\,dx$ is transformed into the calculation of another anti-derivative through the change of variables in (2.3).
Theorem 13 (Variable Substitution). Suppose that $g$ in (2.3) is a differentiable function. Then
\[ \int f(x)\,dx = \left[\int f(g(t))\,g'(t)\,dt\right]_{t = g^{-1}(x)}. \]

Proof. Let $F$ denote an anti-derivative of $f$. We will show that, except for a constant,
\[ F(x) = \left[\int f(g(t))\,g'(t)\,dt\right]_{t = g^{-1}(x)}. \]
This is equivalent to
\[ F(g(t)) = \int f(g(t))\,g'(t)\,dt. \]
According to the chain rule and the definition of an anti-derivative, the derivative with respect to $t$ of the left-hand side is equal to
\[ F'(g(t))\,g'(t) = f(g(t))\,g'(t). \]
But again according to the definition of an anti-derivative, the right-hand side has the same derivative. Thus, by the corollary of Theorem 11, the two sides are equal except for a constant. The proof is done.

Note that the change of variables in (2.3) gives $\frac{dx}{dt} = g'(t)$, which in differential form is written
\[ dx = g'(t)\,dt. \]
This notation is convenient in practice. In the anti-derivative to be calculated,
\[ F(x) = \int f(x)\,dx, \]
we perform the substitution $x = g(t)$ by replacing everything, including $dx$, by the corresponding expressions in $t$. The hope when making a change of variables of this kind is that the new anti-derivative $\int f(g(t))\,g'(t)\,dt$ will prove easier to calculate than the original $\int f(x)\,dx$. If so, one carries out this calculation and finishes the solution by returning to the variable $x$.

2.2 Integrals
Integral of Step Functions
A function $\Phi$ on the interval $[a, b]$ is called a step function if there is a subdivision of $[a, b]$ into sub-intervals on each of which $\Phi$ has a constant value. More precisely, if the division points are
\[ a = x_0 < x_1 < \dots < x_{n-1} < x_n = b \]
then $\Phi$ is defined as
\[ \Phi(x) = c_k \quad \text{when } x_{k-1} < x < x_k, \quad k = 1, 2, \dots, n, \tag{2.4} \]
where the $c_k$ are constants.

For the step function (2.4) we define the area between its graph and the $x$-axis as the number
\[ I(\Phi) = \sum_{k=1}^{n} c_k (x_k - x_{k-1}). \tag{2.5} \]

This definition is of course based on our experience of the area of a rectangle. Each term in the sum can be interpreted as such an area. The parts of the region below the $x$-axis, however, are assigned a negative measure, as a closer look at (2.5) immediately shows. We shall see later that this convention is very practical. It will also prove beneficial no longer to speak of the area between the graph of $\Phi$ and the $x$-axis, but rather to consider $I(\Phi)$ in (2.5) as a number associated with the function $\Phi$.
Definition 6. The number
\[ I(\Phi) = \sum_{k=1}^{n} c_k (x_k - x_{k-1}) \]
is called the integral of the step function $\Phi$. We also use the notation
\[ I(\Phi) = \int_a^b \Phi(x)\,dx. \]

To each step function belongs, as we have seen, a subdivision of its interval of definition $[a, b]$. It is of course conceivable to add further division points without the function itself being changed. We then say that the division is refined. It is obvious that the value of the integral $I(\Phi)$ is not affected by such a refinement of the division. This observation has the consequence that if we have two step functions on the same interval $[a, b]$, there is no restriction in assuming that they are generated from the same division of the interval. Against this background, it is not difficult to see that the following theorem is correct.
Theorem 14. The following properties hold for the integrals of step functions $\Phi$ and $\Psi$ on the interval $[a, b]$:
\[ I(\alpha\Phi) = \alpha I(\Phi), \quad \alpha \text{ constant}, \tag{2.6} \]
\[ I(\Phi + \Psi) = I(\Phi) + I(\Psi), \tag{2.7} \]
\[ \Phi \le \Psi \implies I(\Phi) \le I(\Psi), \tag{2.8} \]
\[ I(\Phi) = \int_a^b \Phi(x)\,dx = \int_a^c \Phi(x)\,dx + \int_c^b \Phi(x)\,dx \quad \text{if } a \le c \le b. \tag{2.9} \]
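The integral of a step function in Definition 6 is a finite sum and is easy to compute directly. The sketch below uses a hypothetical helper `step_integral` that takes the division points $x_0 < \dots < x_n$ and the constants $c_1, \dots, c_n$; the example values are chosen here for illustration. It also checks the remark that refining the division does not change the value.

```python
def step_integral(points, values):
    # points: division points x_0 < x_1 < ... < x_n
    # values: constants c_1, ..., c_n, one per sub-interval
    if len(points) != len(values) + 1:
        raise ValueError("need exactly one value per sub-interval")
    # Sum of c_k * (x_k - x_{k-1}) over all sub-intervals.
    return sum(c * (x1 - x0) for c, x0, x1 in zip(values, points, points[1:]))

# A step function on [0, 4] with values 2, -1 and 0.5.
coarse = step_integral([0, 1, 3, 4], [2, -1, 0.5])

# The same function with the extra division point x = 2 added (a refinement).
refined = step_integral([0, 1, 2, 3, 4], [2, -1, -1, 0.5])

print(coarse, refined)
```

The piece below the $x$-axis (value $-1$ on $]1, 3[$) contributes $-2$ to the sum, which is the negative measure mentioned in the text.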

2.2.1 The Riemann Integral

Definition 7. A bounded function $f$ defined on a bounded interval $[a, b]$ is said to be (Riemann) integrable over this interval if for every real number $\varepsilon > 0$ there exist two step functions $\Phi$ and $\Psi$ satisfying
\[ \Phi(x) \le f(x) \le \Psi(x), \quad a \le x \le b, \]
and such that
\[ I(\Psi) - I(\Phi) < \varepsilon. \]

The definition has the consequence that if a function is integrable then its graph can be covered by finitely many axis-parallel rectangles with arbitrarily small total area. For the region between the graphs of $\Phi$ and $\Psi$ in the definition consists of such rectangles and occupies an area of less than $\varepsilon$.
It remains to define the integral of an integrable function. The following theorem is the basis for this.

Theorem 15. If the function $f$ is integrable, there exists exactly one number $\lambda$ such that
\[ I(\Phi) \le \lambda \le I(\Psi) \]
for all step functions $\Phi$ and $\Psi$ with $\Phi \le f \le \Psi$.

Given the geometric meaning of $I(\Phi)$ and $I(\Psi)$, the number $\lambda$ should be an adequate measure of the area of the region between the graph of $f$ and the $x$-axis. We are therefore led to the following definition.

Definition 8. Assume that the function $f$ is integrable over the interval $[a, b]$. The uniquely determined number $\lambda$ in Theorem 15 is called the integral of $f$ over $[a, b]$ and is written
\[ \int_a^b f(x)\,dx \]
or sometimes $I(f)$, when there is no doubt about which interval is referred to.

2.2.2 Integration of Continuous Functions

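Theorem 16. If the function $f$ is continuous in the closed interval $[a, b]$, then $f$ is integrable over this interval.

Proof. Let $\varepsilon$ be a given positive number. We will construct two step functions $\Phi$ and $\Psi$ with
\[ \Phi(x) \le f(x) \le \Psi(x) \]
and $I(\Psi) - I(\Phi) < \varepsilon$. Since $f$ is continuous on a closed bounded interval, it is uniformly continuous there, so there exists a number $\delta > 0$ such that
\[ |f(x) - f(y)| < \frac{\varepsilon}{b - a}, \quad x, y \in [a, b] : |x - y| < \delta. \]
With this $\delta$ we now make a division
\[ D : a = x_0 < x_1 < \dots < x_n = b \]
of $[a, b]$ such that the length $l(D)$ of the longest sub-interval satisfies
\[ l(D) < \delta. \]
Then we define the numbers $m_k$ and $M_k$ as the minimum and maximum value of $f$ in the interval $x_{k-1} \le x \le x_k$. In particular, then,
\[ M_k - m_k < \frac{\varepsilon}{b - a}, \quad k = 1, 2, \dots, n. \]
Finally, we define two step functions $\Phi_D$ and $\Psi_D$ belonging to this division by putting
\[ \Phi_D = m_k \quad \text{and} \quad \Psi_D = M_k \quad \text{for } x_{k-1} < x < x_k. \]
Then $\Phi_D \le f \le \Psi_D$ and
\[ I(\Psi_D) - I(\Phi_D) = \sum_{k=1}^{n} M_k (x_k - x_{k-1}) - \sum_{k=1}^{n} m_k (x_k - x_{k-1}) = \sum_{k=1}^{n} (M_k - m_k)(x_k - x_{k-1}) \]
\[ < \frac{\varepsilon}{b - a} \sum_{k=1}^{n} (x_k - x_{k-1}) = \frac{\varepsilon}{b - a}\,(b - a) = \varepsilon. \]
Thus $f$ is integrable over $[a, b]$ in the sense of Definition 7, and the theorem is proved.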
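The two step functions in the proof can be imitated numerically. The sketch below is a simplification made here for illustration: it assumes that $f$ is increasing on $[a, b]$, so that the minimum and maximum on each sub-interval sit at its left and right endpoints, and it uses equally long sub-intervals. The function `square` and the helper `lower_upper_sums` are hypothetical names.

```python
def lower_upper_sums(f, a, b, n):
    # Assumes f is increasing on [a, b]: then m_k = f(x_{k-1}) and M_k = f(x_k).
    xs = [a + (b - a) * k / n for k in range(n + 1)]
    lower = sum(f(x0) * (x1 - x0) for x0, x1 in zip(xs, xs[1:]))  # integral of the lower step function
    upper = sum(f(x1) * (x1 - x0) for x0, x1 in zip(xs, xs[1:]))  # integral of the upper step function
    return lower, upper

def square(x):
    return x * x

for n in (10, 100, 1000):
    lower, upper = lower_upper_sums(square, 0.0, 1.0, n)
    print(n, lower, upper, upper - lower)
```

For this example the gap between the upper and lower sums equals $(f(1) - f(0))/n = 1/n$ up to rounding, so it can be made smaller than any given $\varepsilon$, and both sums enclose the value $1/3$.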

By the theorem above, we know that $\int_a^b f(x)\,dx$ is a well-defined quantity for each function $f$ that is continuous on $[a, b]$. With very small changes in the proof we also get the more general result that each piecewise continuous function is integrable. (By a piecewise continuous function we mean in this context a function which is continuous in the whole interval $[a, b]$ except at finitely many points, where it is allowed to have a jump.)

Riemann sum
Let $f$ be a continuous function on the interval $[a, b]$, and consider the division
\[ D : a = x_0 < x_1 < \dots < x_n = b \]
of this interval. Denote by $l(D)$ the length of the largest sub-interval. This number can be regarded as a measure of the fineness of the subdivision. Choose arbitrarily in each sub-interval a point $\xi_k$, so that $x_{k-1} \le \xi_k \le x_k$, and form the sum
\[ R_D = \sum_{k=1}^{n} f(\xi_k)(x_k - x_{k-1}). \tag{2.10} \]

Such a sum is called a Riemann sum. Geometrically, it is interpreted as a sum of rectangle areas. It is reasonable that this sum can be made arbitrarily close to the integral of $f$ by choosing a sufficiently fine division, i.e. a division $D$ with a sufficiently small value of $l(D)$. This is the meaning of the following theorem.

Theorem 17. Suppose that $f$ is continuous on $[a, b]$. For the Riemann sum (2.10) it holds that
\[ R_D = \sum_{k=1}^{n} f(\xi_k)(x_k - x_{k-1}) \to \int_a^b f(x)\,dx \tag{2.11} \]
as the subdivision is refined indefinitely, i.e. as $l(D) \to 0$.

Proof. We use the notation from the proof of Theorem 16. Since
\[ m_k \le f(\xi_k) \le M_k, \quad k = 1, 2, \dots, n, \]
we have
\[ I(\Phi_D) \le R_D \le I(\Psi_D). \]
Seeing that also
\[ I(\Phi_D) \le I(f) \le I(\Psi_D) \]
by the definition of $I(f)$, we get
\[ |R_D - I(f)| \le I(\Psi_D) - I(\Phi_D) < \varepsilon \quad \text{when } l(D) < \delta. \]

This shows that $R_D \to I(f)$ when the fineness $l(D)$ of the division goes to zero.
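The convergence of Riemann sums can be observed numerically. The sketch below makes simplifying assumptions chosen here: equally long sub-intervals, and a tag point placed at the same relative position `t` in every sub-interval (`t = 0` gives left endpoints, `t = 1` right endpoints). The helper name `riemann_sum` and the test function $\sin$ on $[0, \pi]$, whose integral is 2, are illustrative choices.

```python
import math

def riemann_sum(f, a, b, n, t=0.5):
    # Equal sub-intervals of width (b - a) / n.
    # Tag point in the k-th sub-interval: x_{k-1} + t * width, with 0 <= t <= 1.
    width = (b - a) / n
    return sum(f(a + (k + t) * width) * width for k in range(n))

for n in (10, 100, 1000):
    sums = [riemann_sum(math.sin, 0.0, math.pi, n, t) for t in (0.0, 0.5, 1.0)]
    print(n, sums)
```

Whatever tag position is used, the sums approach the same value as `n` grows, which is the content of Theorem 17 for this example.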

2.2.3 Properties and Estimates

Theorem 18. If the functions $f$ and $g$ are integrable over $[a, b]$, then so are the functions $\alpha f$ ($\alpha$ constant) and $f + g$. Furthermore, we have
\[ \int_a^b \alpha f(x)\,dx = \alpha \int_a^b f(x)\,dx, \tag{2.12} \]
\[ \int_a^b (f(x) + g(x))\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx, \tag{2.13} \]
\[ f(x) \le g(x) \text{ in } [a, b] \implies \int_a^b f(x)\,dx \le \int_a^b g(x)\,dx, \tag{2.14} \]
\[ \int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx. \tag{2.15} \]

We refrain from a detailed proof of these properties. The proof is easiest when $f$ and $g$ are piecewise continuous. The formulas then follow directly from the limit (2.11) for Riemann sums and the properties (2.6)-(2.9) for step functions.
A priori, (2.15) only makes sense when $a \le c \le b$. However, it is convenient to define, for $b \le a$,
\[ \int_a^b f(x)\,dx = -\int_b^a f(x)\,dx. \tag{2.16} \]
In particular, $\int_a^a f(x)\,dx = 0$. With this convention, we see that (2.15) is a correct formula for all relative positions of the points $a$, $b$ and $c$, provided that the integrals exist.
An important special case of (2.14) is
\[ g(x) \ge 0 \text{ in } [a, b] \implies \int_a^b g(x)\,dx \ge 0. \]

Theorem 19 (Mean Value Theorem). If the function $f$ is continuous in $[a, b]$, there exists a point $\xi$ with $a \le \xi \le b$ such that
\[ \int_a^b f(x)\,dx = f(\xi)(b - a). \]

Proof. We put
\[ m = \min_{a \le x \le b} f(x), \quad M = \max_{a \le x \le b} f(x). \]
Then
\[ m \le f(x) \le M \quad \text{when } a \le x \le b, \]
which gives
\[ m(b - a) = \int_a^b m\,dx \le \int_a^b f(x)\,dx \le \int_a^b M\,dx = M(b - a). \]
We put
\[ C = \frac{1}{b - a} \int_a^b f(x)\,dx. \]
These inequalities then imply $m \le C \le M$. But $f$ is continuous and therefore assumes every value between $m$ and $M$ in the interval $[a, b]$. In particular, there is a $\xi$ in this interval for which $f(\xi) = C$, and we have proved the theorem.

2.2.4 Fundamental theorem of calculus

Theorem 20 (Fundamental theorem of calculus). Suppose that the function $f$ is continuous in the interval $a \le x \le b$, and put
\[ S(x) = \int_a^x f(t)\,dt. \]
Then the function $S$ is differentiable with the derivative
\[ S'(x) = f(x). \]

Proof. To show that $S(x)$ is differentiable, we must go back to the definition of the derivative. We therefore form the difference quotient
\[ \frac{S(x + h) - S(x)}{h} = \frac{1}{h}\left(\int_a^{x+h} f(t)\,dt - \int_a^x f(t)\,dt\right) = \frac{1}{h}\int_x^{x+h} f(t)\,dt. \]
Now we use the mean value theorem for integrals, and we get
\[ \frac{S(x + h) - S(x)}{h} = \frac{1}{h}\,f(\xi_h)(x + h - x) = f(\xi_h) \]
for some point $\xi_h$ between $x$ and $x + h$. When $h \to 0$, $\xi_h$ goes towards $x$. Since $f$ is continuous it follows that
\[ f(\xi_h) \to f(x) \quad \text{when } h \to 0. \]
Thus, the function $S(x)$ is differentiable with the derivative $S'(x) = f(x)$.
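The statement $S'(x) = f(x)$ can be checked numerically. The sketch below is an illustration under assumptions made here: the integral is approximated by a midpoint Riemann sum (the helper `integral`), and the example $f = \cos$ with $a = 0$ is chosen so that $S(x) = \sin x$ is known in closed form.

```python
import math

def integral(f, a, b, n=10_000):
    # Midpoint Riemann sum with n equal sub-intervals, approximating the integral of f over [a, b].
    width = (b - a) / n
    return width * sum(f(a + (k + 0.5) * width) for k in range(n))

def S(x, f=math.cos, a=0.0):
    # S(x) = integral of f from a to x (numerical approximation).
    return integral(f, a, x)

x, h = 1.0, 1e-4
quotient = (S(x + h) - S(x)) / h   # difference quotient of S at x
print(quotient, math.cos(x))       # should be close to f(x) = cos(1)
```

The difference quotient agrees with $f(x)$ up to an error of order $h$, mirroring the step in the proof where $f(\xi_h) \to f(x)$.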

2.2.5 Improper Integrals

The definition of the Riemann integral assumes that we are working with
bounded functions on bounded intervals. In practice, one needs to extend the
integral concept to include unbounded functions and intervals. The Riemann
integral is then combined with a limit process. We begin by studying the
two simple cases where only one of the two boundedness requirements fails.

Infinite Domain of Integration

Consider a function $f$ defined in the interval $[a, \infty[$ which is (Riemann) integrable
on the restricted domain $a \le x \le X$ for each $X$. We associate with $f$ a formal
improper integral
\[ \int_a^\infty f(x)\,dx, \tag{2.17} \]
for which we define the following concept.

Definition 9. If the limit
\[ \lim_{X \to +\infty} \int_a^X f(x)\,dx \]
exists, say equal to $A$, the improper integral (2.17) is said to be convergent.
The number $A$ is called its value. If the limit does not exist, we say that the
improper integral is divergent.

For convergent improper integrals we usually use the symbol
\[ \int_a^\infty f(x)\,dx \]
not only for the integral but also for its value $A$.

Unbounded integrand
We now consider a function $f$ that is defined in a bounded interval $a < x \le b$ and is
bounded and Riemann integrable in each sub-interval $[a + \varepsilon, b]$, $\varepsilon > 0$. The
function is assumed not to be bounded throughout $]a, b]$. For such a function $f$,
\[ \int_a^b f(x)\,dx \tag{2.18} \]
is an improper integral.

Definition 10. If the limit
\[ \lim_{\varepsilon \to 0+} \int_{a+\varepsilon}^b f(x)\,dx = A \]
exists, we say that the improper integral (2.18) is convergent with the value
$A$. If the limit does not exist, it is said to be divergent.
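As a worked illustration of Definitions 9 and 10, consider three standard integrands. They are examples chosen here for illustration and do not appear in the text above. The first has an infinite domain of integration, the second an integrand that is unbounded near the left end point, and the third shows a divergent case.

```latex
% Infinite domain: f(x) = 1/x^2 on [1, \infty[
\[
\int_1^X \frac{dx}{x^2} = 1 - \frac{1}{X} \to 1 \quad (X \to +\infty),
\qquad\text{so}\qquad \int_1^\infty \frac{dx}{x^2} = 1 .
\]
% Unbounded integrand: f(x) = 1/\sqrt{x} on ]0, 1]
\[
\int_{\varepsilon}^1 \frac{dx}{\sqrt{x}} = 2 - 2\sqrt{\varepsilon} \to 2 \quad (\varepsilon \to 0+),
\qquad\text{so}\qquad \int_0^1 \frac{dx}{\sqrt{x}} = 2 .
\]
% Divergent case: f(x) = 1/x on [1, \infty[
\[
\int_1^X \frac{dx}{x} = \ln X \to +\infty \quad (X \to +\infty).
\]
```

The first two integrals are convergent with values 1 and 2. The third is divergent, since $\ln X$ has no finite limit.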

Integrals improper in more than one way

There will be integrals that are improper in more ways than one. It may for
example be a question of $\int_{-\infty}^{\infty}$, or an integral $\int_a^{\infty}$ which is also improper at
the end point $a$. In such cases, we divide the integral into two (or more) parts,
each improper in just one way, and say that the whole integral converges
if each of the pieces does. Otherwise, it is said to be divergent. For a
convergent integral, its value is defined as the sum of the values of the individual pieces.

2.2.6 Integrals in Probability Theory

In probability theory we often analyze random phenomena, for example
in finance. As a model we often use a so-called density function, i.e., a
non-negative function $f(x)$ defined on the real axis and such that
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]
The density function can be interpreted as a probability density:
$f(x)\,dx$ is interpreted as the probability that the outcome of the trial will be
a number in a small range around $x$ with width $dx$. The probability that
the outcome of the experiment ends up in a certain interval $[a, b]$ is obtained
by summation over these sub-intervals, i.e., it is equal to
\[ \int_a^b f(x)\,dx. \]

One also works with the distribution function $F(x)$, which is related to the density
function by
\[ F(x) = \int_{-\infty}^x f(t)\,dt; \]
the number $F(x)$ is thus the probability that the outcome of
the trial is less than or equal to $x$. If $f$ is continuous, $F$ is differentiable and
$F'(x) = f(x)$ according to the fundamental theorem of calculus.
As a measure of the location of the density function we use the so-called mean or
expected value. This is defined as the number
\[ m = \int_{-\infty}^{\infty} x f(x)\,dx. \]
The analogy with the center of mass in mechanics is clear: the expected value
coincides with the location of the center of gravity for a mass distribution along
the entire real axis with density $f(x)$ and total mass 1.

It is also of interest to what extent the function $f$ is concentrated near the
mean value. As a measure of this concentration, we use the standard deviation $\sigma$,
which is the positive number that satisfies
\[ \sigma^2 = \int_{-\infty}^{\infty} (x - m)^2 f(x)\,dx. \]
The number $\sigma^2$ is called the variance. Here, too, we can make a comparison
with mechanics: the variance corresponds to the moment of inertia of the mass
distribution $f(x)$ with respect to an axis through $m$ perpendicular to
the $x$-axis.

One of the most important probability density functions is
\[ \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \]
which belongs to the so-called normal distribution. The corresponding distribution
function is
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\,dt. \]
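The defining properties of a density function can be verified numerically for the normal density. The sketch below is an illustration, not part of the original text. It assumes that the tails beyond $|x| = 10$ are negligible, so the interval $[-10, 10]$ stands in for the real axis, and it uses the standard identity $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$ for the distribution function. All function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def integrate(g, a, b, n=20_000):
    """Midpoint Riemann sum of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

# Assumption: [-10, 10] replaces the whole real axis (the tails are negligible).
total = integrate(phi, -10, 10)                                  # total mass
m = integrate(lambda x: x * phi(x), -10, 10)                     # expected value
variance = integrate(lambda x: (x - m) ** 2 * phi(x), -10, 10)   # sigma^2
prob = Phi(1) - Phi(-1)   # probability that the outcome lies in [-1, 1]
```

Under these assumptions the total mass is close to 1, the mean close to 0 and the variance close to 1, while `prob` is the familiar value of roughly 0.68 for one standard deviation around the mean.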


Linear Algebra

"But in my opinion, all things in nature occur mathematically."

René Descartes

3.1 System of Linear Equations

A linear equation in the variables $x_1, \dots, x_n$ is an equation that can be
written in the form
\[ a_1 x_1 + a_2 x_2 + \dots + a_n x_n = b \tag{3.1} \]
where $b$ and the coefficients $a_1, \dots, a_n$ are real or complex numbers, usually
known in advance. The subscript $n$ may be any positive integer.
A system of linear equations is a collection of one or more linear equations
involving the same variables, say $x_1, \dots, x_n$. A solution of the system is a list
$(s_1, s_2, \dots, s_n)$ of numbers that makes each equation a true statement when
the values $s_1, \dots, s_n$ are substituted for $x_1, \dots, x_n$ respectively.
The set of all possible solutions is called the solution set of the linear
system. Two linear systems are equivalent if they have the same solution
set. That is, each solution of the first system is a solution of the second
system, and each solution of the second system is a solution of the first.
Finding the solution set of a system of two linear equations in two variables
is easy because it amounts to finding the intersection of two lines. A
system of linear equations has
1. no solution, or
2. exactly one solution, or
3. infinitely many solutions.
A system of linear equations is said to be consistent if it has either one
solution or infinitely many solutions; a system is inconsistent if it has no
solution.
Matrix Notation
The essential information of a linear system can be recorded compactly in
a rectangular array called a matrix. Given the system
\[
\begin{aligned}
x_1 - 2x_2 + x_3 &= 0 \\
2x_2 - 8x_3 &= 8 \\
5x_1 - 5x_3 &= 10
\end{aligned} \tag{3.2}
\]
with the coefficients of each variable aligned in columns, the matrix
\[ \begin{bmatrix} 1 & -2 & 1 \\ 0 & 2 & -8 \\ 5 & 0 & -5 \end{bmatrix} \]
is called the coefficient matrix of the system (3.2), and
\[ \begin{bmatrix} 1 & -2 & 1 & 0 \\ 0 & 2 & -8 & 8 \\ 5 & 0 & -5 & 10 \end{bmatrix} \]
is called the augmented matrix of the system. An augmented matrix of a
system consists of the coefficient matrix with an added column containing
the constants from the right sides of the equations.
The size of a matrix tells how many rows and columns it has. If $m$ and
$n$ are positive integers, an $m \times n$ matrix is a rectangular array of numbers
with $m$ rows and $n$ columns. Matrix notation will simplify the calculations
in the examples that follow.

Row Reduction and Echelon Forms

Definition 11. A rectangular matrix is in echelon form (or row echelon
form) if it has the following three properties:

1. All nonzero rows are above any rows of all zeros.

2. Each leading entry of a row is in a column to the right of the leading
entry of the row above it.

3. All entries in a column below a leading entry are zeros.

If a matrix in echelon form satisfies the following additional conditions, then
it is in reduced echelon form (or reduced row echelon form):

1. The leading entry in each nonzero row is 1.

2. Each leading 1 is the only nonzero entry in its column.

An echelon matrix is one that is in echelon form. Property 2 says that
the leading entries form an echelon ("steplike") pattern that moves down and to
the right through the matrix. Property 3 is a simple consequence of property
2, but we include it for emphasis.
The matrices
\[ \begin{bmatrix} 2 & 3 & 2 & 1 \\ 0 & 1 & 4 & 8 \\ 0 & 0 & 0 & 5/2 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 1 & 0 & 0 & 29 \\ 0 & 1 & 0 & 16 \\ 0 & 0 & 1 & 3 \end{bmatrix} \]
are in echelon form. In fact, the second matrix is in reduced echelon form.

Any nonzero matrix may be row reduced into more than one matrix
in echelon form, using different sequences of row operations. However, the
reduced echelon form one obtains from a matrix is unique.

Theorem 21 (Uniqueness of the reduced row echelon form). Each matrix is
row equivalent to one and only one reduced echelon matrix.

If a matrix $A$ is row equivalent to an echelon matrix $U$, we call $U$ an
echelon form of $A$; if $U$ is in reduced echelon form, we call $U$ the reduced
echelon form of $A$.

Pivot Positions
When row operations on a matrix produce an echelon form, further row
operations to obtain the reduced echelon form do not change the positions
of the leading entries. Since the reduced echelon form is unique, the leading
entries are always in the same positions in any echelon form obtained from a
given matrix. These leading entries correspond to leading 1s in the reduced
echelon form.

Definition 12. A pivot position in a matrix $A$ is a location in $A$ that corresponds
to a leading 1 in the reduced echelon form of $A$. A pivot column is a
column of $A$ that contains a pivot position.
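Row reduction to reduced echelon form, together with the pivot columns, can be sketched in a few lines. This is one possible implementation under stated assumptions, not the text's own algorithm: it uses exact rational arithmetic, picks the first nonzero entry in each column as pivot, and is applied to the augmented matrix of system (3.2), read here as $x_1 - 2x_2 + x_3 = 0$, $2x_2 - 8x_3 = 8$, $5x_1 - 5x_3 = 10$. The function name `rref` is hypothetical.

```python
from fractions import Fraction

def rref(rows):
    """Return (R, pivots): the reduced echelon form and the pivot column indices."""
    M = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(M), len(M[0])
    pivots = []
    r = 0  # index of the row that receives the next pivot
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        p = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if p is None:
            continue                      # no pivot in this column
        M[r], M[p] = M[p], M[r]           # row interchange
        lead = M[r][c]
        M[r] = [x / lead for x in M[r]]   # scale so the leading entry is 1
        for i in range(n_rows):           # clear the rest of the column
            if i != r and M[i][c] != 0:
                factor = M[i][c]
                M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return M, pivots

augmented = [[1, -2, 1, 0],
             [0, 2, -8, 8],
             [5, 0, -5, 10]]
R, pivots = rref(augmented)
```

For this matrix the first three columns are pivot columns, so the system is consistent with a unique solution, which can be read off from the last column of `R`.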

3.2 Matrix Multiplication

When a matrix $B$ multiplies a vector $x$, it transforms $x$ into the vector $Bx$.
If this vector is then multiplied in turn by a matrix $A$, the resulting vector
is $A(Bx)$. Thus $A(Bx)$ is produced from $x$ by a composition of mappings.
Our goal is to represent this composite mapping as multiplication by a single
matrix, denoted $AB$, so that
\[ A(Bx) = (AB)x. \tag{3.3} \]
If $A$ is $m \times n$, $B$ is $n \times p$, and $x \in \mathbb{R}^p$, denote the columns of $B$ by
$b_1, \dots, b_p$ and the entries in $x$ by $x_1, \dots, x_p$. Then
\[ Bx = x_1 b_1 + \dots + x_p b_p. \]
By the linearity of multiplication by $A$,
\[ A(Bx) = A(x_1 b_1) + \dots + A(x_p b_p) = x_1 A b_1 + \dots + x_p A b_p. \]
The vector $A(Bx)$ is a linear combination of the vectors $Ab_1, \dots, Ab_p$, using
the entries in $x$ as weights. In matrix notation, this linear combination is
written as
\[ A(Bx) = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix} x. \]
Thus multiplication by $\begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}$ transforms $x$ into $A(Bx)$.

Definition 13. If $A$ is an $m \times n$ matrix, and if $B$ is an $n \times p$ matrix with
columns $b_1, \dots, b_p$, then the product $AB$ is the $m \times p$ matrix whose columns
are $Ab_1, \dots, Ab_p$. That is,
\[ AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_p \end{bmatrix} = \begin{bmatrix} Ab_1 & Ab_2 & \cdots & Ab_p \end{bmatrix}. \]

This definition makes equation (3.3) true for all $x \in \mathbb{R}^p$. Equation (3.3)
proves that the composite mapping is a linear transformation and that its
standard matrix is $AB$. Multiplication of matrices corresponds to composition
of linear transformations.
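The column-by-column definition of the product translates directly into code. The sketch below is an illustration with hypothetical helper names and example matrices chosen here: it builds $AB$ as $[Ab_1 \; \cdots \; Ab_p]$ and compares $A(Bx)$ with $(AB)x$, as in equation (3.3).

```python
def mat_vec(A, x):
    """Ax as a linear combination of the columns of A with weights from x."""
    m, n = len(A), len(A[0])
    result = [0] * m
    for j in range(n):            # column j of A, weighted by x[j]
        for i in range(m):
            result[i] += x[j] * A[i][j]
    return result

def columns(B):
    """List of the columns of B."""
    return [list(col) for col in zip(*B)]

def mat_mul(A, B):
    """AB = [Ab_1 ... Ab_p], built column by column."""
    product_cols = [mat_vec(A, b) for b in columns(B)]
    return [list(row) for row in zip(*product_cols)]

A = [[2, 3], [1, -5]]            # 2 x 2
B = [[4, 3, 6], [1, -2, 3]]      # 2 x 3
x = [1, 2, -1]                   # vector in R^3

AB = mat_mul(A, B)
left = mat_vec(A, mat_vec(B, x))   # A(Bx)
right = mat_vec(AB, x)             # (AB)x
```

Here `left` and `right` coincide, which is exactly what the definition of $AB$ was designed to achieve.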

Properties of Matrix Multiplication

Theorem 22. Let $A$ be an $m \times n$ matrix, and let $B$ and $C$ have sizes for
which the indicated sums and products are defined.
1. $A(BC) = (AB)C$

2. $A(B + C) = AB + AC$

3. $(B + C)A = BA + CA$

4. $r(AB) = (rA)B = A(rB)$ for any scalar $r$

5. $I_m A = A = A I_n$
Proof. We will just prove property (1). Property (1) follows from the fact
that matrix multiplication corresponds to composition of linear transformations,
and it is known that the composition of functions is associative.

3.3 The Matrix Equation Ax = b
A fundamental idea in linear algebra is to view a linear combination of vectors
as the product of a matrix and a vector.

Definition 14. If $A$ is an $m \times n$ matrix, with columns $a_1, \dots, a_n$, and if
$x \in \mathbb{R}^n$, then the product of $A$ and $x$, denoted by $Ax$, is the linear combination
of the columns of $A$ using the corresponding entries in $x$ as weights; that is,
\[ Ax = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = x_1 a_1 + \dots + x_n a_n. \]

Theorem 23. If $A$ is an $m \times n$ matrix, with columns $a_1, \dots, a_n$, and if
$b \in \mathbb{R}^m$, the matrix equation
\[ Ax = b \tag{3.4} \]
has the same solution set as the vector equation
\[ x_1 a_1 + \dots + x_n a_n = b \tag{3.5} \]
which, in turn, has the same solution set as the system of linear equations
whose augmented matrix is
\[ \begin{bmatrix} a_1 & a_2 & \cdots & a_n & b \end{bmatrix}. \tag{3.6} \]

Theorem 23 provides a powerful tool for gaining insight into problems in
linear algebra, because a system of linear equations may now be viewed in
three different but equivalent ways: as a matrix equation, as a vector equation,
or as a system of linear equations. Whenever you construct a mathematical
model of a problem in real life, you are free to choose whichever
viewpoint is the most natural. Then you may switch from one formulation
of a problem to another whenever it is convenient. In any case, the matrix
equation (3.4), the vector equation (3.5), and the system of equations are all
solved in the same way: by row reducing the augmented matrix (3.6).

Existence of solutions
Theorem 24. Let $A$ be an $m \times n$ matrix. The following statements are
logically equivalent. That is, for a particular $A$, either they are all true or
they are all false.

1. For each $b \in \mathbb{R}^m$, the equation $Ax = b$ has a solution.

2. Each $b \in \mathbb{R}^m$ is a linear combination of the columns of $A$.

3. The columns of $A$ span $\mathbb{R}^m$.

4. $A$ has a pivot position in every row.

Statements (1), (2) and (3) are equivalent because of the definition of $Ax$
and what it means for a set of vectors to span $\mathbb{R}^m$.

3.3.1 Properties of the Matrix-vector product Ax

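Theorem 25. If $A$ is an $m \times n$ matrix, $u$ and $v$ are in $\mathbb{R}^n$ and $c$ is a scalar, then
\[ A(u + v) = Au + Av; \tag{3.7} \]
\[ A(cu) = c(Au). \tag{3.8} \]

Proof. For simplicity, take $n = 3$, $A = \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix}$, and $u, v \in \mathbb{R}^3$. For
$i = 1, 2, 3$, let $u_i$ and $v_i$ be the $i$th entries in $u$ and $v$, respectively. To prove
statement (3.7), compute $A(u + v)$ as a linear combination of the columns
of $A$ using the entries in $u + v$ as weights:
\[
\begin{aligned}
A(u + v) &= \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ u_3 + v_3 \end{bmatrix} \\
&= (u_1 + v_1) a_1 + (u_2 + v_2) a_2 + (u_3 + v_3) a_3 \\
&= (u_1 a_1 + u_2 a_2 + u_3 a_3) + (v_1 a_1 + v_2 a_2 + v_3 a_3) \\
&= Au + Av.
\end{aligned}
\]
To prove statement (3.8), compute $A(cu)$ as a linear combination of the
columns of $A$ using the entries in $cu$ as weights:
\[
\begin{aligned}
A(cu) &= \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \begin{bmatrix} cu_1 \\ cu_2 \\ cu_3 \end{bmatrix} = (cu_1) a_1 + (cu_2) a_2 + (cu_3) a_3 \\
&= c(u_1 a_1) + c(u_2 a_2) + c(u_3 a_3) \\
&= c(u_1 a_1 + u_2 a_2 + u_3 a_3) \\
&= c(Au).
\end{aligned}
\]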

3.4 The Inverse of a Matrix
Matrix algebra provides tools for manipulating matrix equations and creating
various useful formulas in ways similar to doing ordinary algebra with
real numbers.
Recall that the multiplicative inverse of a number such as 5 is $1/5$ or $5^{-1}$.
This inverse satisfies the equations
\[ 5^{-1} \cdot 5 = 1 \quad\text{and}\quad 5 \cdot 5^{-1} = 1. \]
The matrix generalization requires both equations and avoids the slanted-line
notation (for division) because matrix multiplication is not commutative.
Furthermore, a full generalization is possible only if the matrices involved
are square.
An $n \times n$ matrix $A$ is said to be invertible if there is an $n \times n$ matrix $C$
such that
\[ CA = I \quad\text{and}\quad AC = I \]
where $I = I_n$, the $n \times n$ identity matrix. In this case, $C$ is an inverse of $A$.
In fact, $C$ is uniquely determined by $A$, because if $B$ were another inverse
of $A$, then $B = BI = B(AC) = (BA)C = IC = C$. This unique
inverse is denoted by $A^{-1}$, so that
\[ A^{-1} A = I \quad\text{and}\quad A A^{-1} = I. \]
A matrix that is not invertible is sometimes called a singular matrix, and
an invertible matrix is called a nonsingular matrix.
Theorem 26. Let
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \]
If $ad - bc \ne 0$, then $A$ is invertible and
\[ A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \]
If $ad - bc = 0$, then $A$ is not invertible.
Theorem 27. If $A$ is an invertible $n \times n$ matrix, then for each $b$ in $\mathbb{R}^n$, the
equation $Ax = b$ has the unique solution $x = A^{-1} b$.
Proof. Take any $b$ in $\mathbb{R}^n$. A solution exists because if $A^{-1} b$ is substituted
for $x$, then $Ax = A(A^{-1} b) = (A A^{-1}) b = Ib = b$. So $A^{-1} b$ is a solution. To
prove that the solution is unique, we show that if $u$ is any solution, then $u$
must in fact be $A^{-1} b$. Indeed, if $Au = b$, we can multiply both sides by $A^{-1}$
and obtain
\[ A^{-1} A u = A^{-1} b, \qquad Iu = A^{-1} b, \qquad u = A^{-1} b. \]

The formula in Theorem 27 is seldom used to solve an equation $Ax = b$
numerically because row reduction of $[A \; b]$ is nearly always faster. One
possible exception is the $2 \times 2$ case. In this case, mental computations to solve
$Ax = b$ are sometimes easier using the formula for $A^{-1}$.
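For the $2 \times 2$ case, the formula of Theorem 26 and the solution $x = A^{-1} b$ of Theorem 27 can be sketched as follows. The function name and the example system are hypothetical, and exact fractions are used so that no rounding occurs.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix via the formula (1 / (ad - bc)) [[d, -b], [-c, a]]."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("ad - bc = 0, so A is not invertible")
    s = Fraction(1) / det
    return [[s * d, -s * b], [-s * c, s * a]]

def mat_vec(A, x):
    """Matrix-vector product Ax using the row-column rule."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[3, 4], [5, 6]]          # ad - bc = 18 - 20 = -2
b = [3, 7]
A_inv = inverse_2x2(A)
x = mat_vec(A_inv, b)         # the unique solution of Ax = b
```

Substituting `x` back into $Ax$ reproduces `b`, in line with the existence part of the proof of Theorem 27.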

Theorem 28. (a) If $A$ is an invertible matrix, then $A^{-1}$ is invertible and
\[ (A^{-1})^{-1} = A. \]

(b) If $A$ and $B$ are $n \times n$ invertible matrices, then so is $AB$, and the inverse
of $AB$ is the product of the inverses of $A$ and $B$ in the reverse order.
That is,
\[ (AB)^{-1} = B^{-1} A^{-1}. \]

(c) If $A$ is an invertible matrix, then so is $A^T$, and the inverse of $A^T$ is the
transpose of $A^{-1}$. That is,
\[ (A^T)^{-1} = (A^{-1})^T. \]

Proof. To verify statement (a), find a matrix $C$ such that
\[ A^{-1} C = I \quad\text{and}\quad C A^{-1} = I. \]
In fact, these equations are satisfied with $A$ in place of $C$. Hence $A^{-1}$ is
invertible, and $A$ is its inverse. Next, to prove statement (b), compute
\[ (AB)(B^{-1} A^{-1}) = A(B B^{-1})A^{-1} = A I A^{-1} = A A^{-1} = I. \]
A similar calculation shows that $(B^{-1} A^{-1})(AB) = I$. For statement (c), use
the fact that $(AB)^T = B^T A^T$. We then get $(A^{-1})^T A^T = (A A^{-1})^T = I^T = I$.
Similarly, $A^T (A^{-1})^T = (A^{-1} A)^T = I^T = I$. Hence $A^T$ is invertible, and its inverse is
$(A^{-1})^T$.

3.5 Matrix Factorizations

A factorization of a matrix A is an equation that expresses A as a product of
two or more matrices. Whereas matrix multiplication involves a synthesis of
data, matrix factorization is an analysis of data. In the language of computer
science, the expression of A as a product amounts to a preprocessing of data
in A, organizing that data into two or more parts whose structures are
more useful in some way, perhaps more accessible for computation.

3.5.1 The LU Factorization
The LU factorization, described below, is motivated by the fairly common
industrial and business problem of solving a sequence of equations, all with
the same coefficient matrix:
\[ Ax = b_1, \quad Ax = b_2, \quad \dots, \quad Ax = b_p. \tag{3.9} \]
When $A$ is invertible, one could compute $A^{-1}$ and then compute $A^{-1} b_1$, $A^{-1} b_2$,
and so on. However, it is more efficient to solve the first equation in the
sequence (3.9) by row reduction and obtain an LU factorization of $A$ at the
same time. Thereafter, the remaining equations in sequence (3.9) are solved
with the LU factorization.
At first, assume that $A$ is an $m \times n$ matrix that can be row reduced
to echelon form without row interchanges. Then $A$ can be written in the
form $A = LU$, where $L$ is an $m \times m$ lower triangular matrix with 1s on the
diagonal and $U$ is an $m \times n$ echelon form of $A$. Such a factorization is called an
LU factorization of $A$. The matrix $L$ is invertible and is called a unit lower
triangular matrix.

Before studying how to construct $L$ and $U$, we should look at why they
are so useful. When $A = LU$, the equation $Ax = b$ can be written as
$L(Ux) = b$. Writing $y$ for $Ux$, we can find $x$ by solving the pair of equations
\[ Ly = b \tag{3.10} \]
\[ Ux = y. \tag{3.11} \]
First solve $Ly = b$ for $y$, and then solve $Ux = y$ for $x$. Each equation
is easy to solve because $L$ and $U$ are triangular.
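The two triangular solves in (3.10) and (3.11) can be sketched as forward and back substitution. This is a minimal illustration, assuming a square system in which $U$ has nonzero diagonal entries; the factors $L$ and $U$ below are an example chosen here (they multiply to the matrix with rows $(2, 1, 1)$, $(4, -6, 0)$, $(-2, 7, 2)$), and the function names are hypothetical.

```python
from fractions import Fraction

def forward_substitution(L, b):
    """Solve Ly = b for a unit lower triangular L, from the top row down."""
    y = []
    for i in range(len(b)):
        y.append(Fraction(b[i]) - sum(L[i][j] * y[j] for j in range(i)))
    return y

def back_substitution(U, y):
    """Solve Ux = y for a square upper triangular U with nonzero diagonal."""
    n = len(y)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

L = [[1, 0, 0], [2, 1, 0], [-1, -1, 1]]
U = [[2, 1, 1], [0, -8, -2], [0, 0, 1]]
b = [5, -2, 9]

y = forward_substitution(L, b)   # solves Ly = b
x = back_substitution(U, y)      # solves Ux = y, hence LUx = b
```

Each solve touches every entry of the triangular matrix once, which is why reusing $L$ and $U$ for many right-hand sides is cheaper than row reducing $[A \; b]$ from scratch each time.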

An LU Factorization Algorithm
Suppose $A$ can be reduced to an echelon form $U$ using only row replacements
that add a multiple of one row to another row below it. In this
case, there exist unit lower triangular elementary matrices $E_1, \dots, E_p$ such
that
\[ E_p \cdots E_1 A = U. \tag{3.12} \]
Then
\[ A = (E_p \cdots E_1)^{-1} U = LU \tag{3.13} \]
where
\[ L = (E_p \cdots E_1)^{-1}. \tag{3.14} \]
It can be shown that products and inverses of unit lower triangular matrices
are also unit lower triangular. Thus $L$ is unit lower triangular.

Note that the row operations in equation (3.12), which reduce $A$ to $U$,
also reduce the $L$ in equation (3.14) to $I$, because
$E_p \cdots E_1 L = (E_p \cdots E_1)(E_p \cdots E_1)^{-1} = I$. This observation is the key to constructing $L$.
Definition 15 (Algorithm for an LU Factorization).
1. Reduce $A$ to an echelon form $U$ by a sequence of row replacement operations, if possible.
2. Place entries in $L$ such that the same sequence of row operations reduces $L$ to $I$.
Step 1 is not always possible, but when it is, the argument above shows
that an LU factorization exists. By construction, $L$ will satisfy
\[ (E_p \cdots E_1) L = I \]
using the same $E_1, \dots, E_p$ as in equation (3.12). Thus $L$ will be invertible,
by the invertible matrix theorem, with $E_p \cdots E_1 = L^{-1}$. From (3.12),
$L^{-1} A = U$, and $A = LU$. So step 2 will produce an acceptable $L$.
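The two steps of the algorithm can be sketched in code. This is one possible realization under stated assumptions, not a definitive implementation: it uses only row replacements, raises an error when a row interchange would be required, and records each multiplier in $L$ at the position where the corresponding row operation acted, so that the same operations reduce $L$ to $I$. The function name `lu_factor` and the example matrix are hypothetical.

```python
from fractions import Fraction

def lu_factor(A):
    """Return (L, U) with A = LU, using only row replacements (no interchanges)."""
    m, n = len(A), len(A[0])
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    row = 0  # row that holds the next pivot
    for col in range(n):
        if row >= m:
            break
        if U[row][col] == 0:
            if any(U[i][col] != 0 for i in range(row + 1, m)):
                raise ValueError("a row interchange would be needed")
            continue  # no pivot in this column
        for i in range(row + 1, m):
            factor = U[i][col] / U[row][col]
            # The operation "row i minus factor times pivot row" is undone
            # by placing factor in L at position (i, row).
            L[i][row] = factor
            U[i] = [a - factor * b for a, b in zip(U[i], U[row])]
        row += 1
    return L, U

A = [[2, 1, 1],
     [4, -6, 0],
     [-2, 7, 2]]
L, U = lu_factor(A)
```

Multiplying the returned factors gives back $A$, and $L$ is unit lower triangular as the argument above predicts.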

3.6 Subspaces of Rn
Definition 16. A subspace of $\mathbb{R}^n$ is any set $H$ in $\mathbb{R}^n$ that has three
properties:
(a) The zero vector is in $H$.
(b) For each $u$ and $v$ in $H$, the sum $u + v$ is in $H$.
(c) For each $u$ in $H$ and each scalar $c$, the vector $cu$ is in $H$.
In words, a subspace is closed under addition and scalar multiplication.

Column Space and Null Space of a Matrix

Subspaces of Rn usually occur in applications and theory in one of two ways.
In both cases, the subspace can be related to a matrix.
Definition 17. The column space of a matrix $A$ is the set $\operatorname{Col} A$ of all
linear combinations of the columns of $A$.
If $A = [a_1 \; \cdots \; a_n]$, with the columns in $\mathbb{R}^m$, then $\operatorname{Col} A$ is the same as
$\operatorname{Span}\{a_1, \dots, a_n\}$. Note that $\operatorname{Col} A$ equals $\mathbb{R}^m$ only when the columns of $A$ span $\mathbb{R}^m$.
Otherwise, $\operatorname{Col} A$ is only part of $\mathbb{R}^m$.
Definition 18. The null space of a matrix $A$ is the set $\operatorname{Nul} A$ of all solutions
of the homogeneous equation $Ax = 0$.
When $A$ has $n$ columns, the solutions of $Ax = 0$ belong to $\mathbb{R}^n$, and
the null space of $A$ is a subset of $\mathbb{R}^n$. In fact, $\operatorname{Nul} A$ has the properties of a
subspace of $\mathbb{R}^n$.

Theorem 29. The null space of an $m \times n$ matrix $A$ is a subspace of $\mathbb{R}^n$. Equivalently,
the set of all solutions of a system $Ax = 0$ of $m$ homogeneous linear
equations in $n$ unknowns is a subspace of $\mathbb{R}^n$.
Proof. The zero vector is in $\operatorname{Nul} A$ (because $A0 = 0$). To show that $\operatorname{Nul} A$
satisfies the other two properties required for a subspace, take any $u$ and
$v$ in $\operatorname{Nul} A$. That is, suppose $Au = 0$ and $Av = 0$. Then, by a property of
matrix multiplication,
\[ A(u + v) = Au + Av = 0 + 0 = 0. \]
Thus $u + v$ satisfies $Ax = 0$, so $u + v$ is in $\operatorname{Nul} A$. Also, for any scalar $c$,
\[ A(cu) = c(Au) = c(0) = 0, \]
which shows that $cu$ is in $\operatorname{Nul} A$.

To test whether a given vector $v$ is in $\operatorname{Nul} A$, just compute $Av$ to see
whether $Av$ is the zero vector. Because $\operatorname{Nul} A$ is described by a condition
that must be checked for each vector, we say that the null space is defined
implicitly. In contrast, the column space is defined explicitly, because the
vectors in $\operatorname{Col} A$ can be constructed (by linear combinations) from the columns
of $A$. To create an explicit description of $\operatorname{Nul} A$, solve the equation $Ax = 0$
and write the solution in parametric vector form.
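The implicit membership test for the null space is a single matrix-vector product. The sketch below uses an example matrix and vectors chosen here for illustration, with integer entries so that the comparison with zero is exact (with floating-point entries one would compare against a small tolerance instead). The function names are hypothetical.

```python
def mat_vec(A, x):
    """Matrix-vector product Ax using the row-column rule."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def in_null_space(A, v):
    """True when Av is the zero vector, i.e. v solves Ax = 0."""
    return all(entry == 0 for entry in mat_vec(A, v))

A = [[1, -3, -2],
     [-5, 9, 1]]
u = [5, 3, -2]     # candidate vector in R^3
w = [1, 0, 0]      # another candidate

u_in = in_null_space(A, u)                          # Au = (0, 0)
w_in = in_null_space(A, w)                          # Aw = (1, -5)
scaled_in = in_null_space(A, [3 * ui for ui in u])  # closure under scalar multiples
```

The last check illustrates the closure property used in the proof of Theorem 29: a scalar multiple of a vector in the null space stays in the null space.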

Basis for a Subspace

Because a subspace typically contains an infinite number of vectors, some
problems involving a subspace are handled best by working with a small finite
set of vectors that span the subspace. The smaller the set, the better. It can be
shown that the smallest possible spanning set must be linearly independent.
Definition 19. A basis for a subspace $H$ of $\mathbb{R}^n$ is a linearly independent set
in $H$ that spans $H$.
Theorem 30. The pivot columns of a matrix $A$ form a basis for the column
space of $A$.
Definition 20. The dimension of a nonzero subspace $H$, denoted by $\dim H$,
is the number of vectors in any basis for $H$. The dimension of the zero
subspace is defined to be zero.
Definition 21. The rank of a matrix $A$, denoted by $\operatorname{rank} A$, is the dimension
of the column space of $A$.

3.7 Eigenvectors and Eigenvalues

Definition 22. An eigenvector of an $n \times n$ matrix $A$ is a nonzero vector
$x$ such that $Ax = \lambda x$ for some scalar $\lambda$. A scalar $\lambda$ is called an eigenvalue
of $A$ if there is a nontrivial solution $x$ of $Ax = \lambda x$; such an $x$ is called an
eigenvector corresponding to $\lambda$.

A scalar $\lambda$ is an eigenvalue of an $n \times n$ matrix $A$ if and only if the equation
\[ (A - \lambda I)x = 0 \tag{3.15} \]
has a nontrivial solution. The set of all solutions of (3.15) is just the null
space of the matrix $A - \lambda I$. So this set is a subspace of $\mathbb{R}^n$ and is called
the eigenspace of $A$ corresponding to $\lambda$. The eigenspace consists of the zero
vector and all the eigenvectors corresponding to $\lambda$.

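Theorem 31. The eigenvalues of a triangular matrix are the entries on its
main diagonal.

Proof. For simplicity, consider the $3 \times 3$ case. If $A$ is upper triangular, then
$A - \lambda I$ has the form
\[
A - \lambda I = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix} - \begin{bmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda \end{bmatrix}
= \begin{bmatrix} a_{11} - \lambda & a_{12} & a_{13} \\ 0 & a_{22} - \lambda & a_{23} \\ 0 & 0 & a_{33} - \lambda \end{bmatrix}.
\]
The scalar $\lambda$ is an eigenvalue of $A$ if and only if the equation $(A - \lambda I)x = 0$
has a nontrivial solution, that is, if and only if the equation has a free
variable. Because of the zero entries in $A - \lambda I$, it is easy to see that
$(A - \lambda I)x = 0$ has a free variable if and only if at least one of the entries on the
diagonal of $A - \lambda I$ is zero. This happens if and only if $\lambda$ equals one of the
entries $a_{11}$, $a_{22}$, $a_{33}$ in $A$.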

What does it mean for a matrix A to have an eigenvalue of 0? This

happens if and only if the equation

Ax = 0x (3.16)

has a nontrivial solution. But (3.16) is equivalent to Ax = 0, which has a

nontrivial solution if and only if A is not invertible. Thus 0 is an eigenvalue
of A if and only if A is not invertible.

Theorem 32. If $v_1, \dots, v_r$ are eigenvectors that correspond to distinct eigenvalues
$\lambda_1, \dots, \lambda_r$ of an $n \times n$ matrix $A$, then the set $\{v_1, \dots, v_r\}$ is linearly
independent.

Proof. Suppose $\{v_1, \dots, v_r\}$ is linearly dependent. Since $v_1$ is nonzero,
one of the vectors in the set is a linear combination of the preceding
vectors. Let $p$ be the least index such that $v_{p+1}$ is a linear combination
of the preceding (linearly independent) vectors. Then there exist scalars
$c_1, \dots, c_p$ such that
\[ c_1 v_1 + \dots + c_p v_p = v_{p+1}. \tag{3.17} \]
Multiplying both sides of (3.17) by $A$ and using the fact that $A v_k = \lambda_k v_k$
for each $k$, we obtain
\[ c_1 A v_1 + \dots + c_p A v_p = A v_{p+1}, \]
\[ c_1 \lambda_1 v_1 + \dots + c_p \lambda_p v_p = \lambda_{p+1} v_{p+1}. \tag{3.18} \]
Multiplying both sides of (3.17) by $\lambda_{p+1}$ and subtracting the result from
(3.18), we have
\[ c_1 (\lambda_1 - \lambda_{p+1}) v_1 + \dots + c_p (\lambda_p - \lambda_{p+1}) v_p = 0. \tag{3.19} \]
Since $\{v_1, \dots, v_p\}$ is linearly independent, the weights in (3.19) are all zero. But
none of the factors $\lambda_i - \lambda_{p+1}$ are zero, because the eigenvalues are distinct.
Hence $c_i = 0$ for $i = 1, \dots, p$. But then (3.17) says that $v_{p+1} = 0$, which is
impossible for an eigenvector.
Hence $\{v_1, \dots, v_r\}$ cannot be linearly dependent and therefore must be linearly
independent.
The next theorem illustrates one use of the characteristic polynomial, and
it provides the foundation for several iterative methods that approximate
eigenvalues. If $A$ and $B$ are $n \times n$ matrices, then $A$ is similar to $B$ if there is
an invertible matrix $P$ such that $P^{-1} A P = B$, or equivalently, $A = P B P^{-1}$.
Writing $Q$ for $P^{-1}$, we have $Q^{-1} B Q = A$. So $B$ is also similar to $A$, and
we say simply that $A$ and $B$ are similar. Changing $A$ into $P^{-1} A P$ is called a
similarity transformation.

Theorem 33. If $n \times n$ matrices $A$ and $B$ are similar, then they have the
same characteristic polynomial and hence the same eigenvalues.

Proof. If $B = P^{-1} A P$, then
\[ B - \lambda I = P^{-1} A P - \lambda P^{-1} P = P^{-1}(AP - \lambda P) = P^{-1}(A - \lambda I)P. \]
Using the fact that $\det(AB) = (\det A)(\det B)$, we get
\[ \det(B - \lambda I) = \det[P^{-1}(A - \lambda I)P] = \det(P^{-1}) \cdot \det(A - \lambda I) \cdot \det(P). \tag{3.20} \]
Since $\det(P^{-1}) \cdot \det(P) = \det(P^{-1} P) = \det I = 1$, we see from equation
(3.20) that $\det(B - \lambda I) = \det(A - \lambda I)$.

3.8 Diagonalization
In many cases, the eigenvalue-eigenvector information contained within a
matrix $A$ can be displayed in a useful factorization of the form $A = P D P^{-1}$
where $D$ is a diagonal matrix. In this section, the factorization enables us
to compute $A^k$ quickly for large values of $k$, a fundamental idea in several
applications of linear algebra.

A square matrix $A$ is said to be diagonalizable if $A$ is similar to a diagonal
matrix, that is, if $A = P D P^{-1}$ for some invertible matrix $P$ and some diagonal
matrix $D$. The next theorem gives a characterization of diagonalizable
matrices and tells how to construct a suitable factorization.
Theorem 34 (The diagonalization theorem). An $n \times n$ matrix $A$ is diagonalizable
if and only if $A$ has $n$ linearly independent eigenvectors.
In fact, $A = P D P^{-1}$, with $D$ a diagonal matrix, if and only if the
columns of $P$ are $n$ linearly independent eigenvectors of $A$. In this case, the
diagonal entries of $D$ are eigenvalues of $A$ that correspond, respectively, to
the eigenvectors in $P$.
In other words, $A$ is diagonalizable if and only if there are enough eigenvectors
to form a basis of $\mathbb{R}^n$. We call such a basis an eigenvector basis of
$\mathbb{R}^n$.

Proof. First, observe that if $P$ is any $n \times n$ matrix with columns $v_1, \dots, v_n$, and
if $D$ is any diagonal matrix with diagonal entries $\lambda_1, \dots, \lambda_n$, then
\[ AP = A \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} Av_1 & Av_2 & \cdots & Av_n \end{bmatrix} \tag{3.21} \]
while
\[ PD = P \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix}. \tag{3.22} \]
Now suppose $A$ is diagonalizable and $A = P D P^{-1}$. Then, right-multiplying
this relation by $P$, we have $AP = PD$. In this case, equations (3.21) and
(3.22) imply that
\[ \begin{bmatrix} Av_1 & Av_2 & \cdots & Av_n \end{bmatrix} = \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix}. \tag{3.23} \]
Equating columns, we find that
\[ Av_1 = \lambda_1 v_1, \quad Av_2 = \lambda_2 v_2, \quad \dots, \quad Av_n = \lambda_n v_n. \tag{3.24} \]
Since $P$ is invertible, its columns $v_1, \dots, v_n$ must be linearly independent.
Also, since these columns are nonzero, the equations in (3.24) show that
$\lambda_1, \dots, \lambda_n$ are eigenvalues and $v_1, \dots, v_n$ are corresponding eigenvectors. This
argument proves the "only if" parts of the first and second statements, along
with the third statement, of the theorem.
Finally, given any $n$ eigenvectors $v_1, \dots, v_n$, use them to construct the
columns of $P$ and use the corresponding eigenvalues $\lambda_1, \dots, \lambda_n$ to construct $D$.
By equations (3.21)-(3.23), $AP = PD$. This is true without any condition on
the eigenvectors. If, in fact, the eigenvectors are linearly independent, then
$P$ is invertible, and $AP = PD$ implies that $A = P D P^{-1}$.
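The use of the factorization for computing powers can be illustrated on a small case. The sketch below is an example chosen here, not taken from the text: the matrix $A$ has eigenvalues 5 and 3 with eigenvectors $(1, -1)$ and $(1, -2)$, these form the columns of $P$, and $A^k$ is computed as $P D^k P^{-1}$. The helper names are hypothetical.

```python
def mat_mul(A, B):
    """Matrix product using the row-column rule."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_pow(A, k):
    """A^k by repeated multiplication, used only as a cross-check."""
    n = len(A)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, A)
    return result

A = [[7, 2], [-4, 1]]
P = [[1, 1], [-1, -2]]        # columns are eigenvectors of A
P_inv = [[2, 1], [-1, -1]]    # inverse of P from the 2 x 2 formula (det P = -1)
eigenvalues = [5, 3]          # diagonal entries of D, in the order of P's columns
D = [[5, 0], [0, 3]]

def power_via_diagonalization(k):
    """A^k = P D^k P^{-1}; D^k only needs the k-th powers of the eigenvalues."""
    D_k = [[eigenvalues[i] ** k if i == j else 0 for j in range(2)] for i in range(2)]
    return mat_mul(mat_mul(P, D_k), P_inv)

A_cubed = power_via_diagonalization(3)
```

The relation $AP = PD$ from the proof can be checked directly with `mat_mul(A, P)` and `mat_mul(P, D)`, and the result for $k = 3$ agrees with repeated multiplication.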

3.9 Inner Product, Length and Orthogonality

3.9.1 The Inner Product
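If $u$ and $v$ are vectors in $\mathbb{R}^n$, then we regard $u$ and $v$ as $n \times 1$ matrices.
The transpose $u^T$ is a $1 \times n$ matrix, and the matrix product $u^T v$ is a $1 \times 1$
matrix, which we write as a single real number (a scalar) without brackets.
The number $u^T v$ is called the inner product of $u$ and $v$, and is often written
$u \cdot v$. This inner product is also referred to as the dot product. If
\[ u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \quad\text{and}\quad v = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}, \]
then the inner product of $u$ and $v$ is
\[ \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = u_1 v_1 + u_2 v_2 + \dots + u_n v_n. \]

Theorem 35. Let $u$, $v$ and $w$ be vectors in $\mathbb{R}^n$, and let $c$ be a scalar. Then

(a) $u \cdot v = v \cdot u$.

(b) $(u + v) \cdot w = u \cdot w + v \cdot w$.

(c) $(cu) \cdot v = c(u \cdot v) = u \cdot (cv)$.

(d) $u \cdot u \ge 0$, and $u \cdot u = 0$ if and only if $u = 0$.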

3.9.2 The Length of a Vector

If $v$ is in $\mathbb{R}^n$, with entries $v_1, \dots, v_n$, then the square root of $v \cdot v$ is defined
because $v \cdot v$ is nonnegative.

Definition 23. The length (or the norm) of $v$ is the nonnegative scalar $\|v\|$
defined by
\[ \|v\| = \sqrt{v \cdot v} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} \quad\text{and}\quad \|v\|^2 = v \cdot v. \]
Suppose $v$ is in $\mathbb{R}^2$, say $v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$. If we identify $v$ with a geometric
point in the plane, as usual, then $\|v\|$ coincides with the standard notion of
the length of the line segment from the origin to $v$. This follows from the
Pythagorean Theorem applied to a right triangle.
A similar calculation with the diagonal of a rectangular box shows that
the definition of length of a vector $v$ in $\mathbb{R}^3$ coincides with the usual notion
of length.
For any scalar $c$, the length of $cv$ is $|c|$ times the length of $v$. That is,
\[ \|cv\| = |c| \, \|v\|. \]
A vector whose length is 1 is called a unit vector. If we divide a nonzero
vector $v$ by its length, that is, multiply by $1/\|v\|$, we obtain a unit vector $u$
because the length of $u$ is $(1/\|v\|)\|v\| = 1$. The process of creating $u$ from $v$ is
sometimes called normalizing $v$, and we say that $u$ is in the same direction
as $v$.

Distance in Rn
Recall that if $a$ and $b$ are real numbers, the distance on the number line
between $a$ and $b$ is the number $|a - b|$. This definition of distance in $\mathbb{R}$ has
a direct analogue in $\mathbb{R}^n$.
Definition 24. For $u$ and $v$ in $\mathbb{R}^n$, the distance between $u$ and $v$, written
as $\operatorname{dist}(u, v)$, is the length of the vector $u - v$. That is,
\[ \operatorname{dist}(u, v) = \|u - v\|. \]
In $\mathbb{R}^2$ and $\mathbb{R}^3$, this definition of distance coincides with the usual formulas
for the Euclidean distance between two points.
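The inner product, length, normalization and distance can be sketched with a few short functions. The names and the example vectors are chosen here for illustration. Since square roots are involved, the computations use floating-point numbers.

```python
import math

def dot(u, v):
    """Inner product u . v = u_1 v_1 + ... + u_n v_n."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Length of v, the square root of v . v."""
    return math.sqrt(dot(v, v))

def normalize(v):
    """Unit vector in the same direction as a nonzero vector v."""
    length = norm(v)
    if length == 0:
        raise ValueError("the zero vector cannot be normalized")
    return [x / length for x in v]

def dist(u, v):
    """Distance between u and v, the length of u - v."""
    return norm([a - b for a, b in zip(u, v)])

v = [1, -2, 2, 0]
length_v = norm(v)            # sqrt(1 + 4 + 4 + 0)
u = normalize(v)              # unit vector in the direction of v
d = dist([7, 1], [3, 2])      # length of (4, -1)
```

Because of rounding, the length of the normalized vector should be compared with 1 up to a small tolerance rather than tested for exact equality.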

3.9.3 Orthogonal Vectors

Consider $\mathbb{R}^2$ or $\mathbb{R}^3$ and two lines through the origin determined by vectors
$u$ and $v$. The two lines are geometrically perpendicular if and only if the
distance from $u$ to $v$ is the same as the distance from $u$ to $-v$.
Definition 25. Two vectors $u$ and $v$ in $\mathbb{R}^n$ are orthogonal if $u \cdot v = 0$.
Theorem 36 (The Pythagorean Theorem). Two vectors $u$ and $v$ are orthogonal
if and only if $\|u + v\|^2 = \|u\|^2 + \|v\|^2$.
Proof. $\|u + v\|^2 = (u + v) \cdot (u + v) = u \cdot u + u \cdot v + v \cdot u + v \cdot v = \|u\|^2 +
\|v\|^2 + 2 u \cdot v$. The right side equals $\|u\|^2 + \|v\|^2$ if and only if $u \cdot v = 0$,
that is, if and only if the two vectors are orthogonal.

Orthogonal Complements
If a vector $z$ is orthogonal to every vector in a subspace $W$ of $\mathbb{R}^n$, then $z$
is said to be orthogonal to $W$. The set of all vectors $z$ that are orthogonal
to $W$ is called the orthogonal complement of $W$ and is denoted by $W^\perp$.
Theorem 37. Let $A$ be an $m \times n$ matrix. The orthogonal complement of
the row space of $A$ is the null space of $A$, and the orthogonal complement of
the column space of $A$ is the null space of $A^T$:
\[ (\operatorname{Row} A)^\perp = \operatorname{Nul} A \quad\text{and}\quad (\operatorname{Col} A)^\perp = \operatorname{Nul} A^T. \]

Proof. The row-column rule for computing $Ax$ shows that if $x$ is in $\operatorname{Nul} A$,
then $x$ is orthogonal to each row of $A$. Since the rows of $A$ span the row
space, $x$ is orthogonal to $\operatorname{Row} A$. Conversely, if $x$ is orthogonal to $\operatorname{Row} A$,
then $x$ is certainly orthogonal to each row of $A$, and hence $Ax = 0$. This
proves the first statement of the theorem. Since this statement is true for
any matrix, it is true for $A^T$. That is, the orthogonal complement of the
row space of $A^T$ is the null space of $A^T$. This proves the second statement,
because $\operatorname{Row} A^T = \operatorname{Col} A$.

3.9.4 Orthogonal Sets

A set of vectors $\{u_1, \dots, u_p\}$ in $\mathbb{R}^n$ is said to be an orthogonal set if each pair
of distinct vectors from the set is orthogonal, that is, if $u_i \cdot u_j = 0$ whenever
$i \ne j$.
Theorem 38. If $S = \{u_1, \dots, u_p\}$ is an orthogonal set of nonzero vectors
in $\mathbb{R}^n$, then $S$ is linearly independent and hence is a basis for the subspace
spanned by $S$.
Proof. If $0 = c_1 u_1 + \dots + c_p u_p$ for some scalars $c_1, \dots, c_p$, then
\[
\begin{aligned}
0 = 0 \cdot u_1 &= (c_1 u_1 + c_2 u_2 + \dots + c_p u_p) \cdot u_1 \\
&= (c_1 u_1) \cdot u_1 + (c_2 u_2) \cdot u_1 + \dots + (c_p u_p) \cdot u_1 \\
&= c_1 (u_1 \cdot u_1) + c_2 (u_2 \cdot u_1) + \dots + c_p (u_p \cdot u_1) \\
&= c_1 (u_1 \cdot u_1),
\end{aligned}
\]
because $u_1$ is orthogonal to $u_2, \dots, u_p$. Since $u_1$ is nonzero, $u_1 \cdot u_1$ is not
zero and so $c_1 = 0$. Similarly, $c_2, \dots, c_p$ must be zero. Thus $S$ is linearly
independent.
Definition 26. An orthogonal basis for a subspace $W$ of $\mathbb{R}^n$ is a basis for
$W$ that is also an orthogonal set.
The next theorem suggests why an orthogonal basis is much nicer than
other bases: the weights in a linear combination can be computed easily.

Theorem 39. Let $\{u_1, \dots, u_p\}$ be an orthogonal basis for a subspace $W$ of
$\mathbb{R}^n$. For each $y$ in $W$, the weights in the linear combination

\[
y = c_1 u_1 + \dots + c_p u_p
\]

are given by
\[
c_j = \frac{y \cdot u_j}{u_j \cdot u_j} \qquad (j = 1, \dots, p).
\]
Proof. As in the preceding proof, the orthogonality of $\{u_1, \dots, u_p\}$ shows
\[
y \cdot u_1 = (c_1 u_1 + c_2 u_2 + \dots + c_p u_p) \cdot u_1 = c_1 (u_1 \cdot u_1).
\]
Since $u_1 \cdot u_1$ is not zero, the equation can be solved for $c_1$. To find $c_j$ for
$j = 2, \dots, p$, compute $y \cdot u_j$ and solve for $c_j$.
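
The weight formula of Theorem 39 can be checked numerically. The following is a minimal Python sketch and not part of the original notes: the helper names `dot` and `orthogonal_weights` and the sample vectors are assumptions made for illustration. Exact fractions are used so that no rounding occurs.

```python
from fractions import Fraction

def dot(a, b):
    """Inner product of two vectors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def orthogonal_weights(y, basis):
    """Weights c_j = (y . u_j) / (u_j . u_j).

    Assumes `basis` is an orthogonal set of nonzero integer vectors.
    """
    return [Fraction(dot(y, u), dot(u, u)) for u in basis]

# Illustrative orthogonal basis of R^3 (an assumption for this example).
u1, u2, u3 = [3, 1, 1], [-1, 2, 1], [-1, -4, 7]
y = [6, 1, -8]

weights = orthogonal_weights(y, [u1, u2, u3])

# Rebuild y from the weights to confirm y = c1*u1 + c2*u2 + c3*u3.
rebuilt = [sum(c * u[i] for c, u in zip(weights, [u1, u2, u3])) for i in range(3)]
print(weights, rebuilt == y)
```

No linear system has to be solved here: each weight needs only two inner products, which is the practical advantage of an orthogonal basis.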

Orthonormal Sets
A set $\{u_1, \dots, u_p\}$ is an orthonormal set if it is an orthogonal set of unit vectors.
If $W$ is the subspace spanned by such a set, then $\{u_1, \dots, u_p\}$ is an orthonormal
basis for $W$, since the set is automatically linearly independent, by Theorem 38.
The simplest example of an orthonormal set is the standard basis $\{e_1, \dots, e_n\}$
for $\mathbb{R}^n$. Any nonempty subset of $\{e_1, \dots, e_n\}$ is orthonormal, too.

Theorem 40. An $m \times n$ matrix $U$ has orthonormal columns if and only if

\[
U^T U = I.
\]

Proof. To simplify notation, we suppose that $U$ has only three columns,
each a vector in $\mathbb{R}^m$. The proof of the general case is essentially the same.
Let $U = \begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}$ and compute

\[
U^T U =
\begin{bmatrix} u_1^T \\ u_2^T \\ u_3^T \end{bmatrix}
\begin{bmatrix} u_1 & u_2 & u_3 \end{bmatrix}
=
\begin{bmatrix}
u_1^T u_1 & u_1^T u_2 & u_1^T u_3 \\
u_2^T u_1 & u_2^T u_2 & u_2^T u_3 \\
u_3^T u_1 & u_3^T u_2 & u_3^T u_3
\end{bmatrix}.
\tag{3.25}
\]

The entries in the matrix at the right are inner products, using transpose
notation. The columns of $U$ are orthogonal if and only if

\[
u_1^T u_2 = u_2^T u_1 = 0, \quad u_1^T u_3 = u_3^T u_1 = 0, \quad u_2^T u_3 = u_3^T u_2 = 0. \tag{3.26}
\]

The columns of $U$ all have unit length if and only if

\[
u_1^T u_1 = 1, \quad u_2^T u_2 = 1, \quad u_3^T u_3 = 1. \tag{3.27}
\]

The theorem follows immediately from (3.25)-(3.27).
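
To make Theorem 40 concrete, the sketch below forms $U^T U$ for a small matrix with orthonormal columns. It is an illustration only: the matrix `U` and the helpers `transpose` and `matmul` are assumptions introduced here, not part of the notes.

```python
from fractions import Fraction as F

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Matrix product A B using the row-column rule."""
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Illustrative 3 x 2 matrix whose columns are orthonormal (an assumption).
U = [[F(3, 5), F(-4, 5)],
     [F(4, 5), F(3, 5)],
     [F(0),    F(0)]]

gram = matmul(transpose(U), U)    # U^T U, a 2 x 2 matrix
outer = matmul(U, transpose(U))   # U U^T, a 3 x 3 matrix

print(gram)    # the 2 x 2 identity
print(outer)   # not the 3 x 3 identity
```

Note that the theorem concerns $U^T U$ only. When $U$ is not square, as here, $U U^T$ is generally not the identity.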

Theorem 41. Let $U$ be an $m \times n$ matrix with orthonormal columns, and
let $x$ and $y$ be in $\mathbb{R}^n$. Then

(a) $\|Ux\| = \|x\|$

(b) $(Ux) \cdot (Uy) = x \cdot y$

(c) $(Ux) \cdot (Uy) = 0$ if and only if $x \cdot y = 0$.
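
The notes state Theorem 41 without proof. One possible short derivation, sketched here under the assumption that Theorem 40 is available, starts from property (b):

```latex
\[
(Ux) \cdot (Uy) = (Ux)^T (Uy) = x^T U^T U y = x^T I y = x^T y = x \cdot y .
\]
% (a): set y = x in (b), so that \|Ux\|^2 = \|x\|^2, and take square roots.
% (c): by (b) the two inner products are equal, so one is zero exactly when the other is.
```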

3.9.5 Orthogonal Projections

The orthogonal projection of a point in $\mathbb{R}^2$ onto a line through the origin has
an important analogue in $\mathbb{R}^n$. Given a vector $y$ and a subspace $W$ in $\mathbb{R}^n$,
there is a vector $\hat{y}$ in $W$ such that (1) $\hat{y}$ is the unique vector in $W$ for which
$y - \hat{y}$ is orthogonal to $W$, and (2) $\hat{y}$ is the unique vector in $W$ closest to
$y$. These properties of $\hat{y}$ provide the key to finding least-squares solutions of
linear systems.
To prepare for the first theorem, observe that whenever a vector $y$ is
written as a linear combination of vectors $u_1, \dots, u_n$ in $\mathbb{R}^n$, the terms in the
sum for $y$ can be grouped into two parts so that $y$ can be written as

\[
y = z_1 + z_2
\]

where $z_1$ is a linear combination of some of the $u_i$ and $z_2$ is a linear combination
of the rest of the $u_i$. The idea is particularly useful when $\{u_1, \dots, u_n\}$ is an
orthogonal basis.

Theorem 42 (The Orthogonal Decomposition Theorem). Let $W$ be a subspace
of $\mathbb{R}^n$. Then each $y$ in $\mathbb{R}^n$ can be written uniquely in the form

\[
y = \hat{y} + z \tag{3.28}
\]

where $\hat{y}$ is in $W$ and $z$ is in $W^\perp$. In fact, if $\{u_1, \dots, u_p\}$ is any orthogonal
basis of $W$, then
\[
\hat{y} = \frac{y \cdot u_1}{u_1 \cdot u_1} u_1 + \dots + \frac{y \cdot u_p}{u_p \cdot u_p} u_p \tag{3.29}
\]
and $z = y - \hat{y}$.

The vector $\hat{y}$ in (3.28) is called the orthogonal projection of $y$ onto $W$
and often is written as $\operatorname{proj}_W y$.

Proof. Let $\{u_1, \dots, u_p\}$ be any orthogonal basis for $W$, and define $\hat{y}$ by (3.29).
Then $\hat{y}$ is in $W$ because $\hat{y}$ is a linear combination of the basis $u_1, \dots, u_p$. Let
$z = y - \hat{y}$. Since $u_1$ is orthogonal to $u_2, \dots, u_p$, it follows from (3.29) that

\begin{align*}
z \cdot u_1 = (y - \hat{y}) \cdot u_1 &= y \cdot u_1 - \frac{y \cdot u_1}{u_1 \cdot u_1} (u_1 \cdot u_1) - 0 - \dots - 0 \\
&= y \cdot u_1 - y \cdot u_1 = 0.
\end{align*}
Thus $z$ is orthogonal to $u_1$. Similarly, $z$ is orthogonal to each $u_j$ in the basis
for $W$. Hence $z$ is orthogonal to every vector in $W$. That is, $z$ is in $W^\perp$.
To show that the decomposition in (3.28) is unique, suppose $y$ can also
be written as $y = \hat{y}_1 + z_1$ with $\hat{y}_1$ in $W$ and $z_1$ in $W^\perp$. Then $\hat{y} + z = \hat{y}_1 + z_1$,
and so
\[
\hat{y} - \hat{y}_1 = z_1 - z.
\]
This equality shows that the vector $v = \hat{y} - \hat{y}_1$ is in $W$ and in $W^\perp$. Hence
$v \cdot v = 0$, which shows that $v = 0$. This proves that $\hat{y} = \hat{y}_1$ and also
$z_1 = z$.

The uniqueness of the decomposition (3.28) shows that the orthogonal
projection $\hat{y}$ depends only on $W$ and not on the particular basis used in
(3.29).
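
The decomposition $y = \hat{y} + z$ can be computed directly from formula (3.29). The following Python sketch is an illustration under stated assumptions: the function name `project` and the sample vectors are introduced here and are not from the notes, and the basis passed in is assumed to be orthogonal.

```python
from fractions import Fraction

def dot(a, b):
    """Inner product of two vectors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def project(y, basis):
    """Orthogonal projection of y onto Span(basis), following formula (3.29).

    Assumes `basis` is an orthogonal set of nonzero integer vectors.
    """
    y_hat = [Fraction(0)] * len(y)
    for u in basis:
        c = Fraction(dot(y, u), dot(u, u))            # weight (y . u) / (u . u)
        y_hat = [h + c * ui for h, ui in zip(y_hat, u)]
    return y_hat

# Illustrative data (assumptions): u1 and u2 are orthogonal, W = Span{u1, u2}.
u1, u2 = [2, 5, -1], [-2, 1, 1]
y = [1, 2, 3]

y_hat = project(y, [u1, u2])                 # component of y in W
z = [yi - hi for yi, hi in zip(y, y_hat)]    # component of y in the complement of W

print(y_hat, z, dot(z, u1), dot(z, u2))
```

The two printed inner products are zero, confirming that $z$ is orthogonal to the basis of $W$ and therefore to all of $W$.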

Properties of Orthogonal Projections

If $\{u_1, \dots, u_p\}$ is an orthogonal basis for $W$ and if $y$ happens to be in $W$,
then the formula for $\operatorname{proj}_W y$ is exactly the same as the representation of $y$
given in Theorem 39. In this case $\operatorname{proj}_W y = y$.
Theorem 43 (The Best Approximation Theorem). Let $W$ be a subspace of
$\mathbb{R}^n$, let $y$ be any vector in $\mathbb{R}^n$, and let $\hat{y}$ be the orthogonal projection of $y$
onto $W$. Then $\hat{y}$ is the closest point in $W$ to $y$, in the sense that

\[
\|y - \hat{y}\| < \|y - v\| \tag{3.30}
\]

for all $v$ in $W$ distinct from $\hat{y}$.
The vector $\hat{y}$ in Theorem 43 is called the best approximation to $y$ by
elements of $W$. The distance from $y$ to $v$, given by $\|y - v\|$, can be regarded
as the "error" of using $v$ in place of $y$. Theorem 43 says that this error is
minimized when $v = \hat{y}$.
Inequality (3.30) leads to a new proof that $\hat{y}$ does not depend on the
particular orthogonal basis used to compute it. If a different orthogonal
basis for $W$ were used to construct an orthogonal projection of $y$, then this
projection would also be the closest point in $W$ to $y$, namely $\hat{y}$.

Proof. Take $v$ in $W$ distinct from $\hat{y}$. Then $\hat{y} - v$ is in $W$. By the orthogonal
decomposition theorem, $y - \hat{y}$ is orthogonal to $W$. In particular, $y - \hat{y}$ is
orthogonal to $\hat{y} - v$. Since

\[
y - v = (y - \hat{y}) + (\hat{y} - v),
\]

the Pythagorean Theorem gives

\[
\|y - v\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} - v\|^2 .
\]

Now $\|\hat{y} - v\|^2 > 0$ because $\hat{y} - v \neq 0$, and so inequality (3.30) follows
immediately.
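
Inequality (3.30) can be spot-checked on a finite sample of points in $W$. The sketch below is only a numerical illustration, not a proof, and every name and vector in it (`combo`, `dist_sq`, `u1`, `u2`, `y`, the sampling grid) is an assumption made for the example.

```python
from fractions import Fraction

def dot(a, b):
    """Inner product of two vectors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative data (assumptions): u1 and u2 are orthogonal, W = Span{u1, u2}.
u1, u2 = [5, -2, 1], [1, 2, -1]
y = [-1, -5, 10]

def combo(a, b):
    """The point a*u1 + b*u2 of W."""
    return [a * p + b * q for p, q in zip(u1, u2)]

def dist_sq(v):
    """Squared distance from y to v."""
    d = [yi - vi for yi, vi in zip(y, v)]
    return dot(d, d)

# Orthogonal projection of y onto W via the weights (y . u_j) / (u_j . u_j).
c1 = Fraction(dot(y, u1), dot(u1, u1))
c2 = Fraction(dot(y, u2), dot(u2, u2))
y_hat = combo(c1, c2)
best = dist_sq(y_hat)

# Sample other points of W on a grid of half-integer weights.
others = [combo(Fraction(a, 2), Fraction(b, 2))
          for a in range(-8, 9) for b in range(-8, 9)]
strictly_closer = all(dist_sq(v) > best for v in others if v != y_hat)

print(y_hat, best, strictly_closer)
```

Every sampled point other than $\hat{y}$ is strictly farther from $y$, as the theorem predicts. Squared distances are compared to keep the arithmetic exact.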

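Theorem 44. If $\{u_1, \dots, u_p\}$ is an orthonormal basis for a subspace $W$ of
$\mathbb{R}^n$, then
\[
\operatorname{proj}_W y = (y \cdot u_1) u_1 + (y \cdot u_2) u_2 + \dots + (y \cdot u_p) u_p . \tag{3.31}
\]
If $U = \begin{bmatrix} u_1 & u_2 & \cdots & u_p \end{bmatrix}$, then
\[
\operatorname{proj}_W y = U U^T y \quad \text{for all } y \in \mathbb{R}^n . \tag{3.32}
\]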
Proof. Formula (3.31) follows immediately from (3.29). Also, (3.31) shows
that $\operatorname{proj}_W y$ is a linear combination of the columns of $U$ using the weights
$y \cdot u_1, y \cdot u_2, \dots, y \cdot u_p$. The weights can be written as $u_1^T y, u_2^T y, \dots, u_p^T y$,
showing that they are the entries in $U^T y$ and justifying (3.32).
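
The two formulas of Theorem 44 can be compared on a small example. The sketch below is illustrative only: the orthonormal vectors `u1`, `u2`, the vector `y`, and the variable names are assumptions chosen here so that all arithmetic stays exact.

```python
from fractions import Fraction as F

def dot(a, b):
    """Inner product of two vectors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative orthonormal basis of a plane W in R^3 (an assumption).
u1 = [F(3, 5), F(4, 5), F(0)]
u2 = [F(-4, 5), F(3, 5), F(0)]
y = [1, 2, 3]

# Formula (3.31): sum of (y . u_j) u_j.
proj_sum = [dot(y, u1) * a + dot(y, u2) * b for a, b in zip(u1, u2)]

# Formula (3.32): U (U^T y), where U has columns u1 and u2.
U = [list(row) for row in zip(u1, u2)]        # 3 x 2 matrix stored as rows
Ut_y = [dot(col, y) for col in (u1, u2)]      # entries of U^T y
proj_matrix = [dot(row, Ut_y) for row in U]   # U times (U^T y)

print(proj_sum, proj_matrix)
```

Computing $U(U^T y)$ in two matrix-vector steps, rather than forming $U U^T$ first, mirrors the proof: $U^T y$ holds the weights and $U$ combines the columns.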

3.10 The Gram-Schmidt process

The Gram-Schmidt process is a simple algorithm for producing an orthogonal
or orthonormal basis for any nonzero subspace of $\mathbb{R}^n$.
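Theorem 45 (The Gram-Schmidt Process). Given a basis $\{x_1, \dots, x_p\}$ for
a nonzero subspace $W$ of $\mathbb{R}^n$, define
\begin{align*}
v_1 &= x_1 \\
v_2 &= x_2 - \frac{x_2 \cdot v_1}{v_1 \cdot v_1} v_1 \\
v_3 &= x_3 - \frac{x_3 \cdot v_1}{v_1 \cdot v_1} v_1 - \frac{x_3 \cdot v_2}{v_2 \cdot v_2} v_2 \\
&\;\;\vdots \\
v_p &= x_p - \frac{x_p \cdot v_1}{v_1 \cdot v_1} v_1 - \frac{x_p \cdot v_2}{v_2 \cdot v_2} v_2 - \dots - \frac{x_p \cdot v_{p-1}}{v_{p-1} \cdot v_{p-1}} v_{p-1}.
\end{align*}
Then $\{v_1, \dots, v_p\}$ is an orthogonal basis for $W$, and
$\operatorname{Span}\{v_1, \dots, v_k\} = \operatorname{Span}\{x_1, \dots, x_k\}$ for $1 \le k \le p$.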
Proof. For $1 \le k \le p$, let $W_k = \operatorname{Span}\{x_1, \dots, x_k\}$. Set $v_1 = x_1$, so that
$\operatorname{Span}\{v_1\} = \operatorname{Span}\{x_1\}$. Suppose, for some $k < p$, we have constructed
$v_1, \dots, v_k$ so that $\{v_1, \dots, v_k\}$ is an orthogonal basis for $W_k$. Define
\[
v_{k+1} = x_{k+1} - \operatorname{proj}_{W_k} x_{k+1} . \tag{3.33}
\]
By the orthogonal decomposition theorem, $v_{k+1}$ is orthogonal to $W_k$. Note
that $\operatorname{proj}_{W_k} x_{k+1}$ is in $W_k$ and hence also in $W_{k+1}$. Since $x_{k+1}$ is in $W_{k+1}$,
so is $v_{k+1}$. Furthermore, $v_{k+1} \neq 0$ because $x_{k+1}$ is not in $W_k = \operatorname{Span}\{x_1, \dots, x_k\}$.
Hence $\{v_1, \dots, v_{k+1}\}$ is an orthogonal set of nonzero vectors in the $(k + 1)$-dimensional
space $W_{k+1}$. By the basis theorem, this set is an orthogonal
basis for $W_{k+1}$. Hence $W_{k+1} = \operatorname{Span}\{v_1, \dots, v_{k+1}\}$. When $k + 1 = p$, the
process stops.

Theorem 45 shows that any nonzero subspace $W$ of $\mathbb{R}^n$ has an orthogonal
basis, because an ordinary basis $\{x_1, \dots, x_p\}$ is always available and the
Gram-Schmidt process depends only on the existence of orthogonal projections
onto subspaces of $W$ that already have orthogonal bases.

Orthonormal Bases
An orthonormal basis is constructed easily from an orthogonal basis $\{v_1, \dots, v_p\}$:
simply normalize all the $v_k$. When working problems by hand, this is easier
than normalizing each $v_k$ as soon as it is found.
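
The process of Theorem 45, followed by normalization, might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `gram_schmidt` and `normalize` and the sample basis are introduced here, the input vectors are assumed to be linearly independent, and normalization uses floating point because square roots are generally irrational.

```python
import math
from fractions import Fraction

def dot(a, b):
    """Inner product of two vectors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(xs):
    """Return the orthogonal vectors v_1, ..., v_p of Theorem 45.

    Assumes the vectors in `xs` are linearly independent.
    """
    vs = []
    for x in xs:
        v = [Fraction(c) for c in x]
        for u in vs:
            # Subtract the component of x along the earlier vector u.
            coeff = Fraction(dot(x, u)) / dot(u, u)
            v = [vi - coeff * ui for vi, ui in zip(v, u)]
        vs.append(v)
    return vs

def normalize(v):
    """Scale v to unit length (floating point)."""
    length = math.sqrt(dot(v, v))
    return [float(c) / length for c in v]

# Illustrative basis of a 3-dimensional subspace of R^4 (an assumption).
xs = [[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]]

vs = gram_schmidt(xs)              # orthogonal basis, exact fractions
qs = [normalize(v) for v in vs]    # orthonormal basis, floats

for v in vs:
    print(v)
```

All normalization is deferred to the end, as the paragraph above recommends, so the orthogonalization step stays in exact rational arithmetic.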

3.11 Least-Squares Problems

Definition 27. If $A$ is $m \times n$ and $b$ is in $\mathbb{R}^m$, a least-squares solution of
$Ax = b$ is an $\hat{x}$ in $\mathbb{R}^n$ such that

\[
\|b - A\hat{x}\| \le \|b - Ax\|
\]

for all $x$ in $\mathbb{R}^n$.

The most important aspect of the least-squares problem is that no matter
what $x$ we select, the vector $Ax$ will necessarily be in the column space,
$\operatorname{Col} A$. So we seek an $x$ that makes $Ax$ the closest point in $\operatorname{Col} A$ to $b$.

Solution of the General Least-Squares Problem

Given $A$ and $b$ as above, apply the best approximation theorem to the
subspace $\operatorname{Col} A$. Let
\[
\hat{b} = \operatorname{proj}_{\operatorname{Col} A} b .
\]
Because $\hat{b}$ is in the column space of $A$, the equation $Ax = \hat{b}$ is consistent,
and there is an $\hat{x}$ in $\mathbb{R}^n$ such that

\[
A\hat{x} = \hat{b} . \tag{3.34}
\]
Since $\hat{b}$ is the closest point in $\operatorname{Col} A$ to $b$, a vector $\hat{x}$ is a least-squares solution
of $Ax = b$ if and only if $\hat{x}$ satisfies (3.34). Such an $\hat{x}$ in $\mathbb{R}^n$ is a list of weights
that will build $\hat{b}$ out of the columns of $A$.

Suppose $\hat{x}$ satisfies $A\hat{x} = \hat{b}$. By the orthogonal decomposition theorem,
the projection $\hat{b}$ has the property that $b - \hat{b}$ is orthogonal to $\operatorname{Col} A$, so
$b - A\hat{x}$ is orthogonal to each column of $A$. If $a_j$ is any column of $A$, then
$a_j \cdot (b - A\hat{x}) = 0$, and $a_j^T (b - A\hat{x}) = 0$. Since each $a_j^T$ is a row of $A^T$,

\[
A^T (b - A\hat{x}) = 0 . \tag{3.35}
\]

Thus $A^T b - A^T A \hat{x} = 0$, that is,
\[
A^T A \hat{x} = A^T b .
\]
These calculations show that each least-squares solution of $Ax = b$ satisfies
the equation
\[
A^T A x = A^T b . \tag{3.36}
\]
The matrix equation (3.36) represents a system of equations called the normal
equations for $Ax = b$. A solution of (3.36) is often denoted by $\hat{x}$.

Theorem 46. The set of least-squares solutions of $Ax = b$ coincides with
the nonempty set of solutions of the normal equations $A^T A x = A^T b$.

Proof. As shown above, the set of least-squares solutions is nonempty and
each least-squares solution $\hat{x}$ satisfies the normal equations. Conversely,
suppose $\hat{x}$ satisfies $A^T A \hat{x} = A^T b$. Then $\hat{x}$ satisfies (3.35) above, which
shows that $b - A\hat{x}$ is orthogonal to the rows of $A^T$ and hence is orthogonal
to the columns of $A$. Since the columns of $A$ span $\operatorname{Col} A$, the vector $b - A\hat{x}$
is orthogonal to all of $\operatorname{Col} A$. Hence the equation

\[
b = A\hat{x} + (b - A\hat{x})
\]

is a decomposition of $b$ into the sum of a vector in $\operatorname{Col} A$ and a vector
orthogonal to $\operatorname{Col} A$. By the uniqueness of the orthogonal decomposition,
$A\hat{x}$ must be the orthogonal projection of $b$ onto $\operatorname{Col} A$. That is, $A\hat{x} = \hat{b}$,
and $\hat{x}$ is a least-squares solution.
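
Theorem 46 suggests a direct way to compute a least-squares solution: form and solve the normal equations. The Python sketch below illustrates this under two assumptions that are not stated in the notes: the columns of $A$ are linearly independent, so that $A^T A$ is invertible, and exact fractions are acceptable. The helper names (`transpose`, `matmul`, `matvec`, `solve`) and the data are chosen here for the example.

```python
from fractions import Fraction

def transpose(M):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(A, B):
    """Matrix product A B using the row-column rule."""
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination.

    Assumes M is square and invertible.
    """
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]          # scale the pivot row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Illustrative inconsistent system Ax = b (an assumption for this example).
A = [[4, 0], [0, 2], [1, 1]]
b = [2, 0, 11]

At = transpose(A)
x_hat = solve(matmul(At, A), matvec(At, b))             # normal equations (3.36)
residual = [bi - ai for bi, ai in zip(b, matvec(A, x_hat))]

print(x_hat, residual, matvec(At, residual))
```

The last printed vector is $A^T(b - A\hat{x})$, which is zero as required by (3.35). For ill-conditioned problems, forming $A^T A$ explicitly can lose accuracy in floating point, so this sketch is meant to mirror the theory rather than to serve as a numerical method.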

3.12 Further Reading (Optional)

1. Determinants

2. Orthogonal Matrices

3. Singular Value Factorization