
Multimedia Systems

What is Multimedia?
When different people mention the term multimedia, they often have quite
different, or even opposing, viewpoints
A PC vendor
A PC that has sound capability, a DVD-ROM drive, and perhaps the superiority of
multimedia-enabled microprocessors that understand additional multimedia
instructions
A consumer entertainment vendor
interactive cable TV with hundreds of digital channels available, or a cable TV-like
service delivered over a high-speed Internet connection.
A Computer Science (CS) student
Applications that use multiple modalities, including text, images, drawings (graphics),
animation, video, sound including speech, and interactivity.
What is Multimedia?
Multimedia is
Multiple forms of information content and information processing (e.g. text,
audio, graphics, animation, video, interactivity) to inform or entertain
Computer-controlled integration of text, graphics, drawings, still and moving
images (Video), animation, audio, and any other media where every type of
information can be represented, stored, transmitted and processed digitally
Characterized by the processing, storage, generation, manipulation and
rendition of Multimedia information
Multimedia may be broadly divided into linear and non-linear categories.
Linear content progresses without any navigation control for the viewer, such as a
cinema presentation.
Non-linear content offers user interactivity to control progress, as in a computer
game or in self-paced computer-based training.

History of Multimedia and Hypermedia
Newspaper:
perhaps the first mass communication medium, uses text, graphics, and
images
Motion pictures:
conceived of in the 1830s in order to observe motion too rapid for perception
by the human eye
Wireless radio transmission:
Guglielmo Marconi, at Pontecchio, Italy, in 1895
Television:
the new medium for the 20th century, established video as a commonly
available medium and has since changed the world of mass
communications
The connection between computers and ideas about multimedia
covers only a short period
Characteristics of a Multimedia System
A Multimedia system has four basic characteristics:
Multimedia systems must be computer controlled
Multimedia systems are integrated
The information they handle must be represented digitally
The interface to the final presentation of media is usually interactive
Challenges of a Multimedia System
Very High Processing Power
needed to deal with large data processing and real time delivery of media.
Multimedia Capable File System
needed to deliver real-time media, e.g. video/audio streaming. Special hardware/software is needed, e.g. RAID
technology.
Data Representations/File Formats that support multimedia
Data representations/file formats should be easy to handle yet allow for compression/decompression in real-time
Efficient and High I/O
input and output to the file subsystem needs to be efficient and fast. Needs to allow for real-time recording as well as
playback of data. e.g. Direct to Disk recording systems.
Special Operating System
to allow access to the file system and process data efficiently and quickly. Needs to support direct transfers to disk,
real-time scheduling, fast interrupt processing, I/O streaming etc.
Storage and Memory
large storage units (of the order of 50-100 GB or more) and large memory (50-100 MB or more).
Network Support
Client-server and distributed systems are common.
Software Tools
user friendly tools needed to handle media, design and develop applications, deliver media.
Application of multimedia
Multimedia finds its application in various areas including
Advertisements
Art
Education
Entertainment
Engineering
Medicine
Mathematics
Business
Scientific research and
Spatial, temporal applications
Application of multimedia
World Wide Web
Hypermedia courseware
Video conferencing
Video-on-demand
Interactive TV
Groupware
Home shopping
Games
Virtual reality
Digital video editing and production systems
Multimedia Database systems
Topics
Issues in Multimedia (Authoring and Design)
Multimedia authoring versus programming
Difference between multimedia authoring and programming
Multimedia application design
Design stages, storyboarding
Multimedia software tools
Audio sequencing, image/graphics editing, animation, multimedia authoring
Text
Fonts and faces, Character set and alphabets, Font Editing and Design tools
Images/graphics
Digital images, Image data types, colors
Audio
Sound digitization, audio file formats
Topics contd
Video
Text compression
Image compression
Audio Compression
Video Compression
Multimedia Hardware & Software
Content-based Multimedia Retrieval
Multimedia Network Communications
Use of previous programming skill


Multimedia Authoring and Tools
Multimedia authoring: creation of multimedia productions, sometimes
called movies or presentations
Why should you use an authoring system?
An authoring System has pre-programmed elements for the development of interactive
multimedia software titles.
Authoring systems vary widely in orientation, capabilities, and learning curve.
There is no such thing as a completely point-and-click automated authoring system
authoring is actually just a speeded-up form of programming
we are mostly interested in interactive applications
we also have a look at still-image editors such as Adobe Photoshop, and simple video
editors such as Adobe Premiere since they help to create interactive multimedia projects
The level of interaction ranges from no interactivity to virtual reality creation
Users can control the pace (e.g., clicking Next), the sequence, and the objects
Multimedia Authoring
Paradigms/methodology
Multimedia Authoring Metaphors
Multimedia Production
Multimedia Presentation
Automatic Authoring
Multimedia Authoring Metaphors
Scripting Language Metaphor: use a special language to enable interactivity
(buttons, mouse, etc.), and to allow conditionals, jumps, loops, functions/macros,
etc. E.g., a small Toolbook program illustrates this metaphor.
Multimedia Authoring
Multimedia authoring: creation of multimedia productions,
sometimes called movies or presentations.
Authoring involves the assembly and bringing together of Multimedia
with possibly high level graphical interface design and some high
level scripting.
Programming involves low level assembly and construction and
control of Multimedia and involves real languages like C and Java.
An authoring System has
Pre-programmed elements for the development of interactive
multimedia software
Vary widely in orientation, capabilities, and learning curve
There is no completely point-and-click automated authoring system
A speeded-up form of programming, taking roughly 1/8 of straight programming
development time
Multimedia Authoring
The focus is on interactive applications. Why?
The level of interaction ranges from no interactivity to virtual
reality creation
Users can control the pace (e.g., clicking Next), the sequence, the
objects, and even an entire simulation
It also includes image editors such as Adobe Photoshop, and
simple video editors such as Adobe Premiere since they help
to create interactive multimedia projects
In this section, we take a look at
Multimedia Authoring Metaphors
Multimedia application Production
Automatic Authoring
Multimedia Authoring Metaphors
1. Scripting Language Metaphor
Use a special language to enable interactivity (buttons, mouse,
etc.), and to allow conditionals, jumps, loops, functions/macros etc.
Closest to programming
Tends to have longer development time
Runtime speed is slow



2. Iconic/Flow-control Metaphor
Graphical icons are available in a toolbox, and authoring proceeds
by creating a flow chart with icons attached
Speediest in development time and suited for short projects
Suffers least from runtime speed problems
global gNavSprite

on exitFrame
  go the frame
  play sprite gNavSprite
end
Fig. 2.1: Authorware flowchart
4. Hierarchical Metaphor
Represented by embedded objects and iconic properties
User-controllable elements are organized into a tree structure
learning curve is non-trivial
Often used in menu-driven applications
5. Frames Metaphor
Like Iconic/Flow-control Metaphor; however links between icons are
more conceptual, rather than representing the actual flow of the
program
This is a very fast development system but requires a good auto-
debugging function










Fig. 2.2: Quest Frame
7. Cast/Score/Scripting Metaphor:

Time is shown horizontally; like a spreadsheet:
rows, or tracks, represent instantiations of
characters in a multimedia production.

Multimedia elements are drawn from a cast of
characters, and scripts are basically event-
procedures or procedures that are triggered by
timer events.

Director, by Macromedia, is the chief example of
this metaphor. Director uses the Lingo scripting
language, an object-oriented event-driven
language.

Multimedia Application Production
The multimedia design phase consists of:
Storyboarding
helps to plan the general organization and content of a presentation by recording
and organizing ideas on index cards placed on a board or wall. It ensures media
are collected and organized
Flowcharting
Adds navigation information to the storyboard: the multimedia
concept structure and user interaction, followed by a detailed
functional requirement specification
Prototyping and user testing
parallel media production
Two types of design considerations also need to be made:
Multimedia content design and technical design
Multimedia Content Design
Content design deals with what to say and what vehicle to use.
There are five ways to format and deliver your message. You can write it,
illustrate it, wiggle it, hear it, and interact with it.
Writing (Scripting)
Understand your audience and correctly address them
Keep your writing as simple as possible (e.g., write out the full message(s)
first, then shorten them)
Make sure technologies used complement each other
Illustrate (Graphics)
Make use of pictures to effectively deliver your messages.
Create your own (draw, (color) scanner, PhotoCD, ...), or keep "copy files"
of art works
Graphic styles
Fonts
colors
Multimedia Content Design
Graphics Styles: Human visual dynamics impact how
presentations must be constructed.

(a) Color principles and guidelines: Some color
schemes and art styles are best combined with a
certain theme or style. A general hint is to not use too
many colors, as this can be distracting.

(b) Fonts: For effective visual communication in a
presentation, it is best to use large fonts (i.e., 18 to 36
points), and no more than 6 to 8 lines per screen
(fewer than on this screen!). Fig. 2.4 shows a
comparison of two screen projections:








Fig. 2.4: Colours and fonts [from Ron Vetter].
(c) A color contrast program: If the text color is some triple
(R, G, B), a legible color for the background is that color
subtracted from the maximum (here assuming max = 1):

(R, G, B) → (1 − R, 1 − G, 1 − B)   (2.1)

Some color combinations are more pleasing than others;
e.g., a pink background and forest green foreground, or a
green background and mauve foreground. Fig. 2.5 shows
a small VB program (textcolor.exe) in operation:
Fig. 2.5: Program to investigate colours and
readability.
Fig. 2.6: Colour wheel
Fig. 2.6 shows a colour wheel, with opposite colours equal to
(1 − R, 1 − G, 1 − B)
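As a minimal illustration of Eq. (2.1), here is a short Python sketch (the function name legible_background is my own, not from the slides); RGB components are assumed normalized to [0, 1]:

def legible_background(r, g, b):
    # Colour-wheel complement of a text colour, per Eq. (2.1):
    # (R, G, B) -> (1 - R, 1 - G, 1 - B), with components in [0, 1].
    return (1.0 - r, 1.0 - g, 1.0 - b)

# Example: a dark green text colour suggests a light magenta-ish background.
print(legible_background(0.25, 0.5, 0.25))    # (0.75, 0.5, 0.75)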
wiggling (Animation)
1. Types of animation
Character Animation - humanize an object
Highlights and Sparkles
To pop a word in/out of the screen, to sparkle a logo
Moving Text
Video - live video or digitized video
2. When to Animate
Only animate when it has a specific purpose
Enhance emotional impact
Make a point
Improve information delivery
Indicate passage of time
Provide a transition to next subsection
Video Transitions
Video transitions: to signal scene changes.

Many different types of transitions:
1. Cut: an abrupt change of image contents
formed by abutting two video frames
consecutively. This is the simplest and most
frequently used video transition.

2. Wipe: a replacement of the pixels in a region of
the viewport with those from another video. Wipes
can be left-to-right, right-to-left, vertical, horizontal,
like an iris opening, swept out like the hands of a
clock, etc.



3. Dissolve: replaces every pixel with a mixture over
time of the two videos, gradually replacing the first
by the second. Most dissolves can be classified as
two types: cross dissolve and dither dissolve.
Type I: Cross Dissolve
Every pixel is affected gradually. It can be
defined by:

D = (1 − α(t)) · A + α(t) · B   (2.2)

where A and B are the color 3-vectors for
video A and video B. Here, α(t) is a transition
function, which is often linear:

α(t) = k · t, with k · t_max ≡ 1   (2.3)
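A minimal NumPy sketch of Eqs. (2.2) and (2.3); the frame arrays and names here are illustrative, not from the slides:

import numpy as np

def cross_dissolve(frame_a, frame_b, t, t_max):
    # D = (1 - alpha(t)) * A + alpha(t) * B, with linear alpha(t) = k*t and k*t_max = 1.
    alpha = t / float(t_max)
    return (1.0 - alpha) * frame_a + alpha * frame_b

a = np.zeros((480, 640, 3))          # video A: a black frame
b = np.full((480, 640, 3), 255.0)    # video B: a white frame
print(cross_dissolve(a, b, t=5, t_max=10)[0, 0])   # halfway: [127.5 127.5 127.5]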
Type II: Dither Dissolve
Determined by α(t), increasingly more pixels in video A
will abruptly (instead of gradually as in Type I) change
to video B.


Fade-in and fade-out are special types of
Type I dissolve: video A or B is black (or
white). Wipes are special forms of Type II
dissolve in which changing pixels follow a
particular geometric pattern.

Build-your-own-transition: Suppose we
wish to build a special type of wipe which
slides one video out while another video
slides in to replace it: a slide (or push).
(a) Unlike a wipe, we want each video frame not to be
held in place, but instead to move progressively farther
into (or out of) the viewport.

(b) Suppose we wish to slide Video_L in from the left,
and push out Video_R. Figure 2.9 shows this
process:

Fig. 2.9: (a): Video_L. (b): Video_R. (c): Video_L sliding
into place and pushing out Video_R.
Hearing (Audio)
Types of audio in multimedia application
Music - set the mood of the presentation, enhance the
emotion, illustrate points
Sound effects - to make specific points, e.g., squeaky
doors, explosions, wind, ...
Narration - most direct message, often effective
Interactivity (interacting)

Interactive multimedia systems
People remember 70% of what they interact with
Menu driven programs/presentations
-often a hierarchical structure (main menu, sub-menus, ...)
Hypermedia
less structured: cross-links between subsections of the same
subject; nonlinear, with quick access to information. A plus:
it is easier to introduce more multimedia features.
Simulations / Performance-dependent Simulations
e.g., Games - SimCity, Flight Simulators
Technical Design Issues
1. Computer Platform: Much software is ostensibly portable
but cross-platform software relies on run-time modules which
may not work well across systems.

2. Video format and resolution: The most popular video
formats, NTSC, PAL, and SECAM, are not compatible, so
a conversion is required before a video can be played on a
player supporting a different format.

3. Memory and Disk Space Requirement: At least 128 MB
of RAM and 20 GB of hard-disk space should be available for
acceptable performance and storage for multimedia
programs.
4. Delivery Methods:
Not everyone/everywhere has rewriteable DVD
drives, as yet.

CD-ROMs: may not provide enough storage to hold
a multimedia presentation. As well, access time
for CD-ROM drives is longer than for hard-disk
drives.

Electronic delivery is an option, but depends on
network bandwidth at the user side (and at
server). A streaming option may be available,
depending on the presentation.
Automatic Authoring
Hypermedia documents: Generally, three
steps:

1. Capture of media: From text or using an audio
digitizer or video frame-grabber; is highly developed
and well automated.

2. Authoring: How best to structure the data in order
to support multiple views of the available data, rather
than a single, static view.

3. Publication: i.e. Presentation, is the objective of
the multimedia tools we have been considering.
Externalization versus linearization:

(a) Fig. 2.12(a) shows the essential problem involved in
communicating ideas without using a hypermedia
mechanism.

(b) In contrast, hyperlinks allow us the freedom to partially
mimic the author's thought process (i.e., externalization).

(c) Using, e.g., Microsoft Word, one can create a hypertext version
of a document by following the layout already set up in
chapters, headings, and so on. But problems arise when
we actually need to automatically extract semantic
content and find links and anchors (even considering just
text and not images, etc.). Fig. 2.13 displays the problem.

Fig. 2.12: Communication using hyperlinks [from David Lowe].
(d) Once a dataset becomes large we should
employ database methods. The issues
become focused on scalability (to a large
dataset), maintainability, addition of material,
and reusability.
Fig. 2.13: Complex information space [from David Lowe].
Semi-automatic migration of hypertext
The structure of hyperlinks for text information is simple:
nodes represent semantic information and these are
anchors for links to other pages.
Fig. 2.14: Nodes and anchors in hypertext [from David Lowe].
Hyperimages
We need an automated method to help
us produce true hypermedia:
Fig. 2.15: Structure of hypermedia [from David Lowe].
Can manually delineate syntactic image elements by
masking image areas. Fig. 2.16 shows a
hyperimage, with image areas identified and
automatically linked to other parts of a document:

Fig. 2.16: Hyperimage [from David Lowe].
2.2 Some Useful Editing and Authoring
Tools
One needs real vehicles for showing that one understands the
principles of, and can create, multimedia; straight
programming in C++ or Java is not always the best
way of showing your knowledge and creativity.

Some popular authoring tools include the following:
Adobe Premiere 6
Macromedia Director 8 and MX
Flash 5 and MX
Dreamweaver MX

Assignments for this section
2.2.1 Adobe Premiere

2.2.2 Macromedia Director

2.2.3 Macromedia Flash

2.2.4 Dreamweaver
At the convergence of technology and creative
invention in multimedia is virtual reality
It places you inside a lifelike experience
Take a step forward, and the view gets closer; turn
your head, and the view rotates
Reach out and grab an object; your hand moves in
front of you. Maybe the object explodes in a 90-
decibel crescendo as you wrap your fingers around it.
Or it slips out from your grip, falls to the floor, and
hurriedly escapes through a mouse hole at the bottom
of the wall
In VR, your cyberspace is made up of many thousands
of geometric objects plotted in three-dimensional space
The more objects and the more points that describe the
objects, the higher resolution and the more realistic your
view
As the user moves about, each motion or action requires
the computer to recalculate the position, angle, size, and
shape of all the objects that make up your view, and
many thousands of computations must occur as fast as
30 times per second to seem smooth.
2.3 VRML (Virtual Reality Modelling
Language)
Overview

(a) VRML: conceived at the first international conference of the
World Wide Web as a platform-independent language that
would be viewed on the Internet.

(b) Objective of VRML: capability to put coloured objects into
a 3D environment.

(c) VRML is an interpreted language; however it has been
very influential since it was the first method available for
displaying a 3D world on the World Wide Web.
History of VRML

VRML 1.0 was created in May of 1995, with a revision for
clarification called VRML 1.0C in January of 1996:

VRML is based on a subset of the Open Inventor file format
created by Silicon Graphics Inc.

VRML 1.0 allowed for the creation of many simple 3D
objects such as a cube and sphere as well as user-defined
polygons. Materials and textures can be specified for
objects to make the objects more realistic.
The last major revision of VRML was VRML 2.0,
standardized by ISO as VRML97:

This revision added the ability to create an interactive
world. VRML 2.0, also called Moving Worlds, allows for
animation and sound in an interactive virtual world.

New objects were added to make the creation of virtual
worlds easier.

Java and Javascript have been included in VRML to allow
for interactive objects and user-defined actions.

VRML 2.0 was a large change from VRML 1.0 and they
are not compatible with each other. However, conversion
utilities are available to convert VRML 1.0 to VRML 2.0
automatically.
VRML Shapes
VRML contains basic geometric shapes that can be combined to
create more complex objects. Fig. 2.28 displays some of these
shapes:






Fig. 2.28: Basic VRML shapes.

Shape node is a generic node for all objects in VRML.

Material node specifies the surface properties of an object. It can
control what color the object is by specifying the red, green and blue
values of the object.
There are three kinds of texture nodes that
can be used to map textures onto any object:

1. ImageTexture: The most common one that can
take an external JPEG or PNG image file and
map it onto the shape.

2. MovieTexture: allows the mapping of a movie
onto an object; can only use MPEG movies.

3. PixelTexture: simply means creating an image
to use with ImageTexture within VRML.
VRML world

Fig. 2.29 displays a simple VRML scene from one viewpoint:
Openable-book VRML simple world!:

The position of a viewpoint can be specified with the position
node and it can be rotated from the default view with the
orientation node.

Also the camera's angle for its field of view can be changed
from its default 0.78 radians with the fieldOfView node.

Changing the field of view can create a telephoto effect.
Fig. 2.29: A simple VRML scene.
Three types of lighting can be used in a VRML world:

DirectionalLight node shines a light across the whole world in a
certain direction.

PointLight shines a light from all directions from a certain point in
space.

SpotLight shines a light in a certain direction from a point.

RenderMan: rendering package created by Pixar.

The background of the VRML world can also be specified using
the Background node.

A Panorama node can map a texture to the sides of the world. A
panorama is mapped onto a large cube surrounding the VRML
world.
Animation and Interactions
The only method of animation in VRML is tweening, done by
slowly changing an object that is specified in an interpolator node.

This node will modify an object over time, based on the six types of
interpolators: color, coordinate, normal, orientation, position, and
scalar.

(a) All interpolators have two nodes that must be specified: the key and
keyValue.

(b) The key consists of a list of two or more numbers starting with 0 and
ending with 1, and defines how far along the animation is.

(c) Each key element must be complemented with a keyValue
element, which defines what values should change.
To time an animation, a TimeSensor node should be used:

(a) TimeSensor has no physical form in the VRML world and just keeps
time.

(b) To notify an interpolator of a time change, a ROUTE is needed to
connect two nodes together.

(c) Most animation can be accomplished through the method of routing
a TimeSensor to an interpolator node, and then the interpolator node
to the object to be animated.

Two categories of sensors can be used in VRML to obtain input
from a user:

(a) Environment sensors: three kinds of environmental sensor nodes:
VisibilitySensor, ProximitySensor, and Collision.

(b) Pointing device sensors: touch sensor and drag sensors.
VRML Specifics
Some VRML Specifics:
(a) A VRML file is simply a text file with a ".wrl" extension.

(b) A VRML97 file needs to include the line #VRML V2.0 utf8 as its first
line; this header tells the VRML client what version of VRML to
use.

(c) VRML nodes are case sensitive and are usually built in a
hierarchical manner.

(d) All Nodes begin with { and end with } and most can contain
nodes inside of nodes.

(e) Special nodes called group nodes can cluster together multiple
nodes and use the keyword children followed by [ ... ].
(f) Nodes can be named using DEF and be used again
later
by using the keyword USE. This allows for the creation of
complex objects using many simple objects.

A simple VRML example to create a box in VRML:
one can accomplish this by typing:

Shape {
  geometry Box {}
}

The Box defaults to a 2-meter long cube in the
center of the screen. Putting it into a Transform node
can move this box to a different part of the scene. We
can also give the box a different color, such as red.
Transform {
  translation 0 10 0
  children [
    Shape {
      geometry Box {}
      appearance Appearance {
        material Material {
          diffuseColor 1 0 0
        }
      }
    }
  ]
}
Text
Introduction to Text
Words and symbols in any form, spoken or written, are the
most common system of communication
Deliver the most widely understood meaning
Typeface usually includes many type sizes and styles
A font is a collection of characters of a single size and
style belonging to a particular typeface family.
Typical font styles are bold face and italic
Other style attributes, such as underlining and outlining of
characters, may be added at the user's choice
Text is used in multimedia projects in many ways
Web pages
Video
Computer-based training
Presentations
Uses for Text in Multimedia
Uses for Text in Multimedia
Text is also used in multimedia projects in these ways.
Games rely on text for rules, chat, character
descriptions, dialog, background story, and many
more elements.
Educational games rely on text for content, directions,
feedback, and information.
Kiosks use text to display information, directions, and
descriptions.
Formatting Text
Formatting text controls the way the text looks.
You can choose:
Fonts
Text sizes and colors
Text alignment
Text spacing: line spacing or spacing between
individual characters
Advanced formatting: outlining, shadow, superscript,
subscript, watermarks, embossing, engraving, or
animation
Text wraps
Typefaces
Typefaces are broadly characterized as serif or sans serif
Serif
Times, Times New Roman, Bookman
Used for body of text
Sans serif
Arial, Optima, Verdana
Used for headings
Guidelines for Using Fonts
Avoid using many varying font styles in the same
project.
When possible, use fonts that come with both
Windows and Mac OS.
Use bitmap fonts on critical areas such as buttons,
titles, or headlines.
More Tips for Using Fonts
Use fancy or whimsical fonts sparingly
for special effects or emphasis.
Keep paragraphs and line lengths short.
Use bold, italic, and underlining options
sparingly for emphasis.
More Guidelines for Using
Fonts
Avoid using text in all uppercase letters.
Use font, style options, size, and color
consistently.
Provide adequate contrast between text
and background when choosing colors.
Always check spelling and grammar.
Formatting for Screen Display
Apply these guidelines to multimedia
applications for display rather than to printed
documents.
Test your presentation on monitors in several sizes.
Avoid patterned backgrounds.
Use small amounts of text on each screen display.
Text for a presentation that will be viewed by a large group of
people must be visible from the back of the room.
For interactive displays, use consistent placement of hypertext
links.
Character set and alphabets
ASCII Character set
Uses 7-bit codes
Assigns a numeric value to 128 characters, including both
lower- and uppercase letters, punctuation marks, Arabic
numerals and math symbols.
32 control characters are used for device control messages, such
as carriage return, line feed, tab and form feed.
The extended ASCII character set uses 8 bits.
Character set and alphabets
UNICODE Character set
Use 16-bit architecture for multilingual text and
character encoding.
Unicode uses about 65,000 characters from all known
languages and alphabets in the world.
Where several languages share a set of symbols that have a
historically related derivation, the shared symbols of those
languages are unified into collections of symbols (called scripts).
Font Technologies
Understanding font technologies can be important when
creating multimedia projects. The most popular font
technologies are:
Scalable fonts: Postscript, TrueType, and OpenType
Bitmap fonts which are not scalable but provide more
control over the appearance of text.
Font Editing and Design tools
In some multimedia projects it may be required to create
special characters.
Using font editing tools it is possible to create special
symbols and use them throughout the text.
Software that can be used for editing and creating fonts
Fontographer
Fontmonger
Cool 3D text


Graphics and Image Data
Representations

Why use Images?

To show information that is visual and can't be easily
communicated except as an image, for instance, maps,
charts or diagrams
To clarify interpretation of information by applying color
schemes or other visuals that help make meaning more
obvious
To create an evident context for information by using
images that your audience can associate with your tone or
message
Bitmap/Vector images

In a bitmap or raster image, visual data is mapped as spots of
color or pixels.
The more pixels in a bitmap image, the finer the detail will be
Because photographs have high levels of detail and a variety
of tones and colors, they are best represented as bitmap
images. Scanners and digital cameras produce bitmap
images
Vector or object oriented graphics use mathematical formulas
to describe outlines and fills for image objects
Vector graphics can be enlarged or reduced with no loss of
data and no change in image quality e.g. CorelDraw,
Illustrator, Freehand, AutoCAD & Flash create vector images
What is image
For digitizing an image, the image is discretized both in terms of its
spatial coordinates and its amplitude values
Discretization of the spatial coordinates (x, y) is called
image sampling
Discretization of the amplitude values f(x, y) is called
grey-level (intensity) quantization
A digital image is represented by a matrix of numeric
values each representing a quantized intensity value
When I is a two-dimensional matrix, then I(r,c) is the
intensity value at the position corresponding to row r and
column c of the matrix
Image
Each element of the array is called a pixel
Pixel Neighbours
A pixel p1 is a neighbour of another pixel p2 if their spatial
coordinates (x1, y1) and (x2, y2) are not more than a unit
distance apart. Types of neighbours often used in image
processing:
Horizontal neighbours
Vertical neighbours
Diagonal neighbours
Arithmetic and Logic Operations
Addition: p1 + p2, used in image averaging
Subtraction: p1 - p2, used in image motion analysis and background
removal
Multiplication: p1 * p2, used in colour and image shading operations
Division: p1 / p2, used in colour processing
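A small NumPy sketch of these pixel-wise operations (the tiny test arrays are illustrative):

import numpy as np

p1 = np.array([[100., 150.], [200., 250.]])   # two tiny "images"
p2 = np.array([[ 90., 160.], [210., 240.]])

average    = (p1 + p2) / 2        # addition: image averaging
difference = p1 - p2              # subtraction: motion analysis / background removal
shaded     = p1 * (p2 / 255.0)    # multiplication: shading / masking
ratio      = p1 / p2              # division: used in colour processing
print(average, difference, shaded, ratio, sep="\n\n")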
Color
Reflection of light is simply the bouncing of light waves from
an object back toward the light's source or other directions
Energy is often absorbed from the light (and converted into
heat or other forms) when the light reflects off an object, so
the reflected light might have slightly different properties
Light is the portion of electromagnetic radiation that is visible to
the human eye
Visible light has a wavelength of about 400 to 780
nanometers
The adjacent frequencies of infrared on the lower end and
ultraviolet on the higher end are still called light, even though
they are not visible to the human eye


Color
Cameras store and reproduce light as images and video
The device consists of a box with a hole in one side
Light from an external scene passes through the hole and
strikes a surface inside where it is reproduced, upside-
down, but with both color and perspective preserved
At first, the image was projected onto light-sensitive chemical
plates; later these became chemical film, and now
photosensitive electronics record images in a
digital format


Human Color perception
The retina contains two types of light-sensitive
photoreceptors: rods and cones.
The rods are responsible for monochrome perception,
allowing the eyes to distinguish between black and white.
The cones are responsible for color vision.
In humans, there are three types of cones: maximally
sensitive to long-wavelength, medium-wavelength, and
short-wavelength light or Red, Green and Blue
The color perceived is the combined effect of stimuli to
these three types of cone cells. Overall there are more
rods than cones, so color perception is less accurate than
black and white contrast perception.
Monochrome Images
Each pixel is stored as a single bit (0 or 1), so it is also referred to as a
binary image
Such an image is also called a 1-bit monochrome image since
it contains no color
A 640 x 480 monochrome image requires 37.5 KB of storage.

8-bit Gray-level Images
Each pixel has a gray-value between 0 and 255.
Each pixel is represented by a single byte; e.g., a dark pixel might
have a value of 10, and a bright one might be 230.
A 640 x 480 grayscale image requires over 300 KB of storage.
8-Bit Colour Image
One byte for each pixel
Supports 256 out of the millions of colours possible; acceptable colour quality
Requires Colour Look-Up Tables (LUTs)
A 640 x 480 8-bit colour image requires 307.2 KB of storage (the
same as 8-bit greyscale)

24-bit Color Images
Each pixel is represented by three bytes (e.g., RGB)
Supports 256 x 256 x 256 possible combined colours
(16,777,216)
A 640 x 480 24-bit colour image would require 921.6 KB
of storage
Most 24-bit images are 32-bit images, the extra byte of
data for each pixel is used to store an alpha value
representing special effect information
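The storage figures above come directly from width × height × bytes per pixel; a quick Python check (note that the slides mix 1 KB = 1024 bytes for the 1-bit case with 1 KB = 1000 bytes for the others):

def image_kb(width, height, bits_per_pixel, kb=1000):
    # Uncompressed image size in kilobytes.
    return width * height * bits_per_pixel / 8 / kb

print(image_kb(640, 480, 1, kb=1024))   # 37.5 KB  (1-bit monochrome)
print(image_kb(640, 480, 8))            # 307.2 KB (8-bit grayscale or colour)
print(image_kb(640, 480, 24))           # 921.6 KB (24-bit colour)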
Assignment
How do CRT monitors create images
How do Flat panel displays create images
How do scanners digitize image
How do printers create image with colors
How can we create 3D images
Briefly describe the different image formats,
GIF,JPEG,PNG,TIFF

Image resolution refers to the number of pixels in a digital
image (higher resolution always yields better quality).
- Fairly high resolution for such an image might be 1,600 x 1,200,
whereas lower resolution might be 640 x 480.
Frame buffer: Hardware used to store bitmap.
- Video card (actually a graphics card) is used for this purpose.
- The resolution of the video card does not have to match the
desired resolution of the image, but if not enough video card
memory is available then the data has to be shifted around in RAM
for display.
8-bit image can be thought of as a set of 1-bit bit-planes, where
each plane consists of a 1-bit representation of the image at
higher and higher levels of elevation: a bit is turned on if the
image pixel has a nonzero value that is at or above that bit level.
Fig. 3.2 displays the concept of bit-planes graphically.

Fig. 3.2: Bit-planes for 8-bit grayscale image.
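A short NumPy sketch of extracting the bit-planes of an 8-bit grayscale image (the toy array is illustrative):

import numpy as np

img = np.array([[0, 17, 128],
                [200, 255, 64]], dtype=np.uint8)     # toy 8-bit image

# Plane k is 1 wherever bit k of the pixel value is set (k = 7 is the most significant).
planes = [(img >> k) & 1 for k in range(8)]
print(planes[7])    # 1 only for pixels with value >= 128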
3.2 Popular File Formats
8-bit GIF : one of the most important formats because of its
historical connection to the WWW and HTML markup language as
the first image type recognized by net browsers.

JPEG: currently the most important common file format.

GIF
GIF standard: (We examine GIF standard because it is so
simple! yet contains many common elements.)
Limited to 8-bit (256) color images only, which, while
producing acceptable color images, is best suited for images
with few distinctive colors (e.g., graphics or drawing).

GIF standard supports interlacing: successive display of
pixels in widely-spaced rows by a 4-pass display process.

GIF actually comes in two flavors:

1. GIF87a: The original specification.
2. GIF89a: The later version. Supports simple animation via a
Graphics Control Extension block in the data, provides simple
control over delay time, a transparency index, etc.

GIF87
For the standard specification, the general file format of a GIF87
file is as in Fig. 3.12.

Fig. 3.12: GIF file
format.
Screen Descriptor comprises a set of attributes that belong to
every image in the file. According to the GIF87 standard, it is
defined as in Fig. 3.13.

Fig. 3.13: GIF screen descriptor.

Color Map is set up in a very simple fashion as in Fig. 3.14.
However, the actual length of the table equals 2^(pixel+1), where the
pixel field is given in the Screen Descriptor.

Fig. 3.14: GIF color map.

Each image in the file has its own Image Descriptor, defined as
in Fig. 3.15.

Fig. 3.15: GIF image descriptor.

If the interlace bit is set in the local Image Descriptor, then the
rows of the image are displayed in a four-pass sequence
(Fig.3.16).

Fig. 3.16: GIF 4-pass interlace display row order.

We can investigate how the file header works in practice by
having a look at a particular GIF image. Fig. 3.7 is an 8-
bit color GIF image; in UNIX, issue the command:
od -c forestfire.gif | head -2
and we see the first 32 bytes interpreted as characters:
G I F 8 7 a \208 \2 \188 \1 \247 \0 \0 \6 \3 \5
J \132 \24 | ) \7 \198 \195 \ \128 U \27 \196 \166 & T

To decipher the remainder of the file header (after GIF87a), we
use hexadecimal:
od -x forestfire.gif | head -2
with the result
4749 4638 3761 d002 bc01 f700 0006 0305 ae84 187c 2907 c6c3 5c80
551b c4a6 2654
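The same first bytes can be read with a few lines of Python instead of od; the file name forestfire.gif is taken from the example above, and any GIF file would do:

with open("forestfire.gif", "rb") as f:
    header = f.read(13)              # 6-byte signature + 7-byte screen descriptor

print(header[:6])                        # b'GIF87a' or b'GIF89a'
width  = header[6] | (header[7] << 8)    # screen width, little-endian (here 0x02d0 = 720)
height = header[8] | (header[9] << 8)    # screen height, little-endian (here 0x01bc = 444)
print(width, height)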

JPEG
JPEG: The most important current standard for image
compression.

The human vision system has some specific limitations and JPEG
takes advantage of these to achieve high rates of compression.

JPEG allows the user to set a desired level of quality, or
compression ratio (input divided by output).

As an example, Fig. 3.17 shows our forestfire image, with a quality
factor Q=10%.
- This image is a mere 1.5% of the original size. In comparison, a JPEG
image with Q=75% yields an image size 5.6% of the original, whereas
a GIF version of this image compresses down to 23.0% of
uncompressed image size.

Fig. 3.17: JPEG image with low quality specified by user.
PNG
PNG format: standing for Portable Network Graphics
meant to supersede the GIF standard, and extends it in
important ways.

Special features of PNG files include:

1. Support for up to 48 bits of color information a large
increase.

2. Files may contain gamma-correction information for correct
display of color images, as well as alpha-channel information for
such uses as control of transparency.

3. The display progressively displays pixels in a 2-dimensional
fashion by showing a few pixels at a time over seven passes
through each 8 × 8 block of an image.

TIFF
TIFF: stands for Tagged Image File Format.

The support for attachment of additional information (referred to
as tags) provides a great deal of flexibility.

1. The most important tag is a format signifier: what type of
compression etc. is in use in the stored image.

2. TIFF can store many different types of image: 1-bit,
grayscale, 8-bit color, 24-bit RGB, etc.

3. TIFF was originally a lossless format but now a new JPEG
tag allows one to opt for JPEG compression.

4. The TIFF format was developed by the Aldus Corporation in
the 1980's and was later supported by Microsoft.
EXIF
EXIF (Exchangeable Image File Format) is an image format for digital
cameras:

1. Compressed EXIF files use the baseline JPEG format.

2. A variety of tags (many more than in TIFF) are available to facilitate
higher quality printing, since information about the camera and picture-
taking conditions (flash, exposure, light source, white balance, type of
scene, etc.) can be stored and used by printers for possible color correction
algorithms.

3. The EXIF standard also includes specification of file format for audio that
accompanies digital images. As well, it also supports tags for information
needed for conversion to FlashPix (initially developed by Kodak).
Audio
Sound
What is Sound?
Sound is a wave phenomenon like light, but is macroscopic and
involves molecules of air being compressed and expanded
under the action of some physical device.
(a) For example, a speaker in an audio system vibrates back
and forth and produces a longitudinal pressure wave that we
perceive as sound.
(b) Since sound is a pressure wave, it takes on continuous
values, as opposed to digitized ones.
(c) If we wish to use a digital version of sound waves we
must form digitized representations of audio information.
The perception of sound in any organism is limited to a
certain range of frequencies (about 20 Hz to 20,000 Hz for humans)
Infrasound: elephants
Ultrasound: bats

Digitization of Sound
Digitization means conversion to a stream of numbers,
and preferably these numbers should be integers for
efficiency.
Example of Sound
Fig. 6.1: An analog signal: continuous
measurement of pressure wave.
The graph in Fig. 6.1 has to be made digital in both time and
amplitude. To digitize, the signal must be sampled in each
dimension: in time, and in amplitude.
(a) Sampling means measuring the quantity we are interested
in, usually at evenly-spaced intervals.

(b) The first kind of sampling, using measurements only at
evenly spaced time intervals, is simply called, sampling. The rate
at which it is performed is called the sampling frequency

(c) For audio, typical sampling rates are from 8 kHz (8,000
samples per second) to 48 kHz. This range is determined by the
Nyquist theorem, discussed later.

(d) Sampling in the amplitude or voltage dimension is called
quantization.
Fig. 6.2: Sampling and Quantization. (a): Sampling the
analog signal in the time dimension. (b): Quantization is
sampling the analog signal in the amplitude dimension.
(a) (b)
Regardless of what vibrating object is creating the sound
wave, the particles of the medium through which the sound
moves is vibrating in a back and forth motion at a given
frequency.
The frequency of a wave refers to how often the particles of
the medium vibrate when a wave passes through the
medium. The frequency of a wave is measured as the
number of complete back-and-forth vibrations of a particle of
the medium per unit of time. If a particle of air undergoes
1000 longitudinal vibrations in 2 seconds, then the
frequency of the wave would be 500 vibrations per second.
A commonly used unit for frequency is the Hertz
(abbreviated Hz), where
1 Hertz = 1 vibration/second
Few Terminologies
The sensation of a frequency is commonly referred to as the
pitch
A high pitch sound corresponds to a high frequency sound
wave and a low pitch sound to a low frequency sound wave.
Musically trained people are capable of detecting a difference
in frequency between two separate sounds that is as little as 2
Hz, while ordinary listeners need about 7 Hz
Any two sounds whose frequencies make a 2:1 ratio are said to
be separated by an octave

Fourier Series
The representation of a periodic function as an infinite sum of
sinusoids





Harmonics: any series of musical tones whose frequencies
are integral multiples of the frequency of a fundamental tone
Fig. 6.3: Building up a complex signal by superposing
sinusoids
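In the spirit of Fig. 6.3, a tiny NumPy sketch that builds a more complex periodic signal by superposing a fundamental and two integer-multiple harmonics (all values illustrative):

import numpy as np

t = np.linspace(0.0, 1.0, 1000, endpoint=False)   # one "second" of time
f0 = 1.0                                          # fundamental frequency (Hz)

# Superpose the fundamental and its 2nd and 3rd harmonics, with decreasing amplitude.
signal = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in (1, 2, 3))
print(signal.shape, signal.max())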
Thus to decide how to digitize audio data
we need to answer the following questions:
1. What is the sampling rate?
2. How finely is the data to be quantized, and is
quantization uniform?
Digitization
The Nyquist theorem states how frequently we must sample
in time to be able to recover the original sound.

(a) Fig. 6.4(a) shows a single sinusoid: it is a single, pure,
frequency (only electronic instruments can create such
sounds).

(b) If the sampling rate just equals the actual frequency, Fig. 6.4(b)
shows that a false signal is detected: it is simply a constant, with
zero frequency.

(c) Now if we sample at 1.5 times the actual frequency, Fig. 6.4(c)
shows that we obtain an incorrect (alias) frequency that is lower
than the correct one: it is half the correct one (the wavelength,
from peak to peak, is double that of the actual signal).

(d) Thus for correct sampling we must use a sampling rate equal
to at least twice the maximum frequency content in the signal.
This rate is called the Nyquist rate.
Nyquist theorem
Fig. 6.4: Aliasing.

(a): A single frequency.





(b): Sampling at exactly the frequency
produces a constant.





(c): Sampling at 1.5 times per cycle
produces an alias perceived frequency.
Nyquist Theorem: If a signal is band-limited, i.e.,
there is a lower limit f_1 and an upper limit f_2 of
frequency components in the signal, then the sampling
rate should be at least 2(f_2 − f_1).

Nyquist frequency: half of the Nyquist rate.
Since it would be impossible to recover frequencies higher
than Nyquist frequency in any event, most systems have
an antialiasing filter that restricts the frequency content in
the input to the sampler to a range at or below Nyquist
frequency.

The relationship among the Sampling Frequency, True
Frequency, and the Alias Frequency is as follows:

f_alias = f_sampling − f_true,  for f_true < f_sampling < 2 × f_true   (6.1)
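A small Python sketch of Eq. (6.1) (the function name is my own); it only covers the range stated in the equation:

def alias_frequency(f_true, f_sampling):
    # Eq. (6.1): f_alias = f_sampling - f_true, valid for f_true < f_sampling < 2*f_true.
    if f_sampling >= 2 * f_true:
        return f_true                 # Nyquist rate satisfied: no aliasing
    return f_sampling - f_true

print(alias_frequency(3000, 8000))    # 3000 Hz (no aliasing)
print(alias_frequency(5500, 8000))    # 2500 Hz (folded below the 4000 Hz folding frequency)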
In general, the apparent frequency of a sinusoid is the
lowest frequency of a sinusoid that has exactly the
same samples as the input sinusoid. Fig. 6.5 shows
the relationship of the apparent frequency to the input
frequency.







Fig. 6.5: Folding of sinusoid frequency which is
sampled at 8,000 Hz. The folding frequency, shown
dashed, is 4,000 Hz.
[Illustrations: a 1 Hz wave sampled at 2 Hz, 3 Hz, and 1.5 Hz, demonstrating aliasing.]
Exercise
If the sampling rate is 4000 Hz, what is the highest frequency of a
sine wave that can be recovered?
If the highest frequency is 4000 Hz, what is the minimum
sampling rate?
What is the alias of a 2000 Hz wave frequency sampled at
1500 Hz?

Signal to Noise Ratio (SNR)
The ratio of the power of the correct signal and the noise is
called the signal to noise ratio (SNR), a measure of the
quality of the signal.

The SNR is usually measured in decibels (dB), where 1 dB is
a tenth of a bel. The SNR value, in units of dB, is defined in
terms of base-10 logarithms of squared voltages, as follows:


SNR = 10 log10 (V_signal² / V_noise²) = 20 log10 (V_signal / V_noise)   (6.2)
a) The power in a signal is proportional to the
square of the voltage. For example, if the
signal voltage V_signal is 10 times the noise voltage,
then the SNR is 20 log10(10) = 20 dB.

b) In terms of power, if the power from ten
violins is ten times that from one violin
playing, then the ratio of power is 10 dB, or
1 B.

c) To remember: for power, the factor is 10; for signal
voltage, the factor is 20.
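A short Python check of Eq. (6.2) and of example (a) above:

import math

def snr_db(v_signal, v_noise):
    # SNR in dB: 20 * log10(V_signal / V_noise), per Eq. (6.2).
    return 20 * math.log10(v_signal / v_noise)

print(snr_db(10, 1))    # 20.0 dB: signal voltage ten times the noise voltage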
The usual levels of sound we hear around us are described in terms of decibels, as a ratio to
the quietest sound we are capable of hearing. Table 6.1 shows approximate levels for these
sounds.

Table 6.1: Magnitude levels of common sounds, in decibels














Threshold of hearing 0
Rustle of leaves 10
Very quiet room 20
Average room 40
Conversation 60
Busy street 70
Loud radio 80
Train through station 90
Riveter 100
Threshold of discomfort 120
Threshold of pain 140
Damage to ear drum 160
Signal to Quantization Noise Ratio
(SQNR)
Aside from any noise that may have been present
in the original analog signal, there is also an
additional error that results from quantization.

(a) If voltages are actually in 0 to 1 but we have only 8
bits in which to store values, then effectively we force
all continuous values of voltage into only 256 different
values.

(b) This introduces a roundoff error. It is not really
noise. Nevertheless it is called quantization noise
(or quantization error).
The quality of the quantization is
characterized by the Signal to
Quantization Noise Ratio (SQNR).
(a) Quantization noise: the difference between
the actual value of the analog signal, for the
particular sampling time, and the nearest
quantization interval value.

(b) At most, this error can be as much as
half of the interval.
(c) For a quantization accuracy of N bits per sample, the
SQNR can be simply expressed:

SQNR = 20 log10 (V_signal / V_quan_noise) = 20 log10 (2^(N−1) / (1/2))
     = 20 N log10 2 ≈ 6.02 N (dB)   (6.3)

Notes:

(a) We map the maximum signal to 2^(N−1) − 1 (≈ 2^(N−1)) and the
most negative signal to −2^(N−1).

(b) Eq. (6.3) is the peak signal-to-quantization-noise ratio, PSQNR: peak
signal and peak noise.
(c) The dynamic range is the ratio of maximum to
minimum absolute values of the signal: V_max / V_min. The
max. abs. value V_max gets mapped to 2^(N−1) − 1; the min.
abs. value V_min gets mapped to 1. V_min is the smallest
positive voltage that is not masked by noise. The
most negative signal, −V_max, is mapped to −2^(N−1).

(d) The quantization interval is ΔV = (2 V_max) / 2^N, since
there are 2^N intervals. The whole range from V_max down to
(V_max − ΔV/2) is mapped to 2^(N−1) − 1.

(e) The maximum noise, in terms of actual voltages, is
half the quantization interval: ΔV/2 = V_max / 2^N.
6.02N is the worst case. If the input
signal is sinusoidal, the quantization error
is statistically independent, and its
magnitude is uniformly distributed between
0 and half of the interval, then it can be
shown that the expression for the SQNR
becomes:

SQNR = 6.02 N + 1.76 (dB)   (6.4)
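A short Python sketch of Eqs. (6.3) and (6.4):

import math

def peak_sqnr_db(n_bits):
    # Eq. (6.3): peak SQNR of an N-bit uniform quantizer, about 6.02 * N dB.
    return 20 * n_bits * math.log10(2)

def sinusoid_sqnr_db(n_bits):
    # Eq. (6.4): SQNR for a full-scale sinusoidal input.
    return 6.02 * n_bits + 1.76

print(peak_sqnr_db(16))       # ~96.3 dB
print(sinusoid_sqnr_db(16))   # ~98.1 dB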
Linear and Non-linear
Quantization
Linear format: samples are typically stored as uniformly quantized values.

Non-uniform quantization: set up more finely-spaced levels where
humans hear with the most acuity.

Weber's Law, stated formally, says that equally perceived differences have values
proportional to absolute levels:

ΔResponse ∝ ΔStimulus / Stimulus   (6.5)

Inserting a constant of proportionality k, we have a differential equation that states:

dr = k (1/s) ds (6.6)

with response r and stimulus s.
Integrating, we arrive at a solution

r = k ln s + C (6.7)

with constant of integration C.
Stated differently, the solution is

r = k ln(s / s_0)   (6.8)

where s_0 is the lowest level of stimulus that causes a response (r = 0 when s = s_0).

Nonlinear quantization works by first transforming an analog signal from the raw s space
into the theoretical r space, and then uniformly quantizing the resulting values.

Such a law for audio is called μ-law encoding (or u-law). A very similar rule, called
A-law, is used in telephony in Europe.

The equations for these very similar encodings are as follows:
μ-law:

r = [sgn(s) / ln(1 + μ)] · ln(1 + μ |s / s_p|),  for |s / s_p| ≤ 1   (6.9)

A-law:

r = [A / (1 + ln A)] · (s / s_p),                    for |s / s_p| ≤ 1/A
r = [sgn(s) / (1 + ln A)] · [1 + ln(A |s / s_p|)],   for 1/A ≤ |s / s_p| ≤ 1   (6.10)

where sgn(s) = 1 if s > 0, and −1 otherwise, and s_p is the peak signal value.

Fig. 6.6 shows these curves. The parameter μ is set to μ = 100 or
μ = 255; the parameter A for the A-law encoder is usually set to A =
87.6.
Fig. 6.6: Nonlinear transform for audio signals


The μ-law in audio is used to develop a nonuniform
quantization rule for sound: uniform quantization of r gives
finer resolution in s at the quiet end.
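A hedged Python sketch of the μ-law transform in Eq. (6.9); s is assumed to be already normalized by the peak value s_p, so it lies in [-1, 1]:

import math

def mu_law(s, mu=255.0):
    # Eq. (6.9): r = sgn(s) * ln(1 + mu*|s|) / ln(1 + mu), for |s| <= 1.
    sign = 1.0 if s >= 0 else -1.0
    return sign * math.log(1.0 + mu * abs(s)) / math.log(1.0 + mu)

# Quiet samples get relatively finer resolution; loud samples are compressed:
print(mu_law(0.01))   # ~0.23
print(mu_law(0.5))    # ~0.88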
Audio Filtering
Prior to sampling and AD conversion, the audio signal is also usually
filtered to remove unwanted frequencies. The frequencies kept depend on
the application:

(a) For speech, typically from 50Hz to 10kHz is retained, and other frequencies
are blocked by the use of a band-pass filter that screens out lower and higher
frequencies.

(b) An audio music signal will typically contain from about 20Hz up to 20kHz.

(c) At the DA converter end, high frequencies may reappear in the output
because, after sampling and then quantization, the smooth input signal is replaced by a
series of step functions containing all possible frequencies.

(d) So at the decoder side, a lowpass filter is used after the DA circuit.
Audio Quality vs. Data Rate
The uncompressed data rate increases as more bits are
used for quantization. Stereo doubles the bandwidth needed
to transmit a digital audio signal.

Table 6.2: Data rate and bandwidth in sample audio applications









Quality     Sample Rate (kHz)   Bits per Sample   Mono/Stereo   Data Rate (uncompressed, kB/s)   Frequency Band (kHz)
Telephone   8                   8                 Mono          8                                0.200-3.4
AM Radio    11.025              8                 Mono          11.0                             0.1-5.5
FM Radio    22.05               16                Stereo        88.2                             0.02-11
CD          44.1                16                Stereo        176.4                            0.005-20
DAT         48                  16                Stereo        192.0                            0.005-20
DVD Audio   192 (max)           24 (max)          6 channels    1,200 (max)                      0-96 (max)
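The uncompressed data rates in Table 6.2 follow from sample rate × bits per sample × channels; a quick Python check (kB here means 1000 bytes, as in the table):

def audio_rate_kb_per_sec(sample_rate_hz, bits_per_sample, channels):
    # Uncompressed audio data rate in kB/s.
    return sample_rate_hz * bits_per_sample * channels / 8 / 1000

print(audio_rate_kb_per_sec(8000, 8, 1))     # 8.0   kB/s (telephone)
print(audio_rate_kb_per_sec(44100, 16, 2))   # 176.4 kB/s (CD)
print(audio_rate_kb_per_sec(48000, 16, 2))   # 192.0 kB/s (DAT)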
Synthetic Sounds
1. FM (Frequency Modulation): one
approach to generating synthetic sound:

x(t) = A(t) · cos[ω_c π t + I(t) · cos(ω_m π t + φ_m) + φ_c]   (6.11)
Fig. 6.7: Frequency Modulation. (a): A single frequency. (b): Twice the
frequency. (c): Usually, FM is carried out using a sinusoid argument to
a sinusoid. (d): A more complex form arises from a carrier frequency
2πt and a modulating frequency 4πt cosine inside the sinusoid.
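A minimal NumPy sketch of Eq. (6.11), with a constant envelope A(t) and modulation index I(t) for simplicity; the parameter values are illustrative:

import numpy as np

def fm_tone(t, w_c=2.0, w_m=4.0, amp=1.0, index=1.0, phi_m=0.0, phi_c=0.0):
    # x(t) = A(t) * cos[w_c*pi*t + I(t) * cos(w_m*pi*t + phi_m) + phi_c]
    return amp * np.cos(w_c * np.pi * t +
                        index * np.cos(w_m * np.pi * t + phi_m) + phi_c)

t = np.linspace(0.0, 1.0, 1000)
x = fm_tone(t)
print(x[:5])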


2. Wave Table synthesis:

A more accurate way of generating
sounds from digital signals. Also known,
simply, as sampling.
In this technique, the actual digital
samples of sounds from real instruments
are stored. Since wave tables are stored in
memory on the sound card, they can be
manipulated by software so that sounds
can be combined, edited, and enhanced.
Quantization and Transmission of Audio
Coding of Audio: Quantization and
transformation of data are collectively known as
coding of the data.

a) For audio, the μ-law technique for companding
audio signals is usually combined with an algorithm
that exploits the temporal redundancy present in
audio signals.

b) Differences in signals between the present and a
past time can reduce the size of signal values and
also concentrate the histogram of sample values
(differences, now) into a much smaller range.

c) The result of reducing the variance of
values is that lossless compression methods
produce a bitstream with shorter bit lengths
for more likely values
In general, producing quantized sampled
output for audio is called PCM (Pulse
Code Modulation). The differences version
is called DPCM (and a crude but efficient
variant is called DM). The adaptive version
is called ADPCM.
Pulse Code Modulation
The basic techniques for creating digital
signals from analog signals are sampling
and quantization.
Quantization consists of selecting
breakpoints in magnitude, and then re-
mapping any value within an interval to
one of the representative output levels.
Fig. 6.2: Sampling and Quantization.
a) The set of interval boundaries are called
decision boundaries, and the representative
values are called reconstruction levels.

b) The boundaries for quantizer input intervals
that will all be mapped into the same output level
form a coder mapping.

c) The representative values that are the output
values from a quantizer are a decoder mapping.

d) Finally, we may wish to compress the data, by
assigning a bit stream that uses fewer bits for the
most prevalent signal values (Chap. 7).
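A hedged NumPy sketch of a uniform quantizer: the decision boundaries split the range [-v_max, v_max] into 2^N intervals, and the decoder mapping returns each interval's midpoint as the reconstruction level (names are my own):

import numpy as np

def uniform_quantize(samples, n_bits, v_max=1.0):
    # Coder mapping: which of the 2**n_bits intervals each sample falls into.
    n_levels = 2 ** n_bits
    step = 2.0 * v_max / n_levels                       # quantization interval
    idx = np.clip(np.floor((samples + v_max) / step), 0, n_levels - 1)
    # Decoder mapping: the reconstruction level (interval midpoint).
    return -v_max + (idx + 0.5) * step

x = np.array([-0.9, -0.2, 0.0, 0.31, 0.95])
print(uniform_quantize(x, n_bits=3))    # values snapped to 8 midpoints, step 0.25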
Every compression scheme has three stages:
A. The input data is transformed to a new
representation that is easier or more efficient to
compress.

B. We may introduce loss of information.
Quantization is the main lossy step: we use a
limited number of reconstruction levels, fewer
than in the original signal.

C. Coding. Assign a codeword (thus forming a
binary bitstream) to each output level or symbol.
This could be a fixed-length code, or a variable
length code such as Huffman coding
For audio signals, we first consider PCM
for digitization. This leads to Lossless
Predictive Coding as well as the DPCM
scheme; both methods use differential
coding. As well, we look at the adaptive
version, ADPCM, which can provide better
compression.
PCM in Speech Compression
Assuming a bandwidth for speech from about 50 Hz to about 10
kHz, the Nyquist rate would dictate a sampling rate of 20 kHz.

(a) Using uniform quantization without companding, the minimum
sample size we could get away with would likely be about 12 bits.
Hence for mono speech transmission the bit-rate would be 240 kbps.

(b) With companding, we can reduce the sample size down to about 8
bits with the same perceived level of quality, and thus reduce the bit-rate
to 160 kbps.

(c) However, the standard approach to telephony in fact assumes that
the highest-frequency audio signal we want to reproduce is only about 4
kHz. Therefore the sampling rate is only 8 kHz, and the companded bit-
rate thus reduces this to 64 kbps.
However, there are two small wrinkles we must
also address:

1. Since only sounds up to 4 kHz are to be considered,
all other frequency content must be noise.
Therefore, we should remove this high-frequency
content from the analog input signal. This is done
using a band-limiting filter that blocks out high, as
well as very low, frequencies.

Also, once we arrive at a pulse signal, such
as that in Fig. 6.13(a) below, we must still perform
DA conversion and then construct a final output
analog signal. But, effectively, the signal we arrive
at is the staircase shown in Fig. 6.13(b).
Fig. 6.13: Pulse Code Modulation (PCM). (a) Original analog
signal and its corresponding PCM signals. (b) Decoded staircase
signal. (c) Reconstructed signal after low-pass filtering.
2. A discontinuous signal contains not just
frequency components due to the original
signal, but also a theoretically infinite set of
higher-frequency components:

(a) This result is from the theory of Fourier
analysis, in signal processing.

(b) These higher frequencies are extraneous.

(c) Therefore the output of the digital-to-analog
converter goes to a low-pass filter that allows
only frequencies up to the original maximum to be
retained.
The complete scheme for encoding and decoding
telephony signals is shown as a schematic in Fig.
6.14. As a result of the low-pass filtering, the output
becomes smoothed and Fig. 6.13(c) above showed
this effect.








Fig. 6.14: PCM signal encoding and decoding.
Differential Coding of Audio
Audio is often stored not in simple PCM but
instead in a form that exploits differences, which
are generally smaller numbers and so offer
the possibility of using fewer bits of storage.

(a) If a time-dependent signal has some
consistency over time (temporal redundancy),
the difference signal, subtracting the current
sample from the previous one, will have a more
peaked histogram, with a maximum around zero.

Fundamental Concepts in
Video
Digital Video
One may be excused for thinking that the capture
and playback of digital video is simply a matter of
capturing each frame, or image, and playing them
back in a sequence at 25 frames per second.
A single image or frame with a window size or
screen resolution of 640 x 480 pixels and 24 bit
colour (16.8 million colours) occupies
approximately 1MB of disc space.
Roughly 25 MB of disc space are needed for
every second of video, 1.5 GB for every minute.
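As a quick check of this storage arithmetic, an illustrative Python sketch (assuming uncompressed 24-bit frames):

# Illustrative sketch of the storage arithmetic above (uncompressed frames).
width, height, bytes_per_pixel = 640, 480, 3      # 24-bit colour
frame_bytes = width * height * bytes_per_pixel    # 921,600 bytes, roughly 1 MB
fps = 25
per_second = frame_bytes * fps                    # ~23 MB/s (rounded above to 25 MB)
per_minute = per_second * 60                      # ~1.4 GB/min (rounded above to 1.5 GB)
print(frame_bytes / 1e6, per_second / 1e6, per_minute / 1e9)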
The three basic problems of digital video

There are three basic problems with digital video
Size of video window, Frame rate and Quality of image
Size of video window
Digital video stores a lot of information about each pixel in each
frame
It takes time to display those pixels on your computer screen
If the window size is small, then the time taken to draw the pixels is
less. If the window size is large, there may not be enough time to
display the image or single frame before its time to start the next
one
Choosing an appropriate window size helps, but may not always produce
the desired result
Frame Rates
Too many pixels and not enough time.
Depending on the size of video window chosen, you may also be
able to reduce file size by reducing the number of frames per
second to, for example, 12 frames per second.
5.1 Types of Video Signals
Component video

Component video: Higher-end video systems make use of three
separate video signals for the red, green, and blue image planes.
Each color channel is sent as a separate video signal.

(a) Most computer systems use Component Video, with separate signals
for R, G, and B.

(b) For any color separation scheme, Component Video gives the best
color reproduction since there is no crosstalk between the three
channels.

(c) This is not the case for S-Video or Composite Video, discussed next.
Component video, however, requires more bandwidth and good
synchronization of the three components.
Composite Video 1 Signal
Composite video: color (chrominance) and intensity (luminance) signals
are mixed into a single carrier wave.

a) Chrominance is a composition of two color components (I and Q, or U and V).

b) In NTSC TV, e.g., I and Q are combined into a chroma signal, and a color subcarrier is
then employed to put the chroma signal at the high-frequency end of the signal
shared with the luminance signal.

c) The chrominance and luminance components can be separated at the receiver end
and then the two color components can be further recovered.

d) When connecting to TVs or VCRs, Composite Video uses only one wire and video
color signals are mixed, not sent separately. The audio and sync signals are
additions to this one signal.

Since color and intensity are wrapped into the same signal, some interference
between the luminance and chrominance signals is inevitable.
S-Video 2 Signals
S-Video: as a compromise, (separated video, or Super-video, e.g., in
S-VHS) uses two wires, one for luminance and another for a
composite chrominance signal.

As a result, there is less crosstalk between the color information and
the crucial gray-scale information.

The reason for placing luminance into its own part of the signal is that
black-and-white information is most crucial for visual perception.

In fact, humans are able to differentiate spatial resolution in grayscale
images with a much higher acuity than for the color part of color images.

As a result, we can send less accurate color information than intensity
information: we can only see fairly large blobs of color, so it makes
sense to send less color detail.
5.2 Analog Video
An analog signal f(t) samples a time-varying image. So-called
progressive scanning traces through a complete picture (a frame)
row-wise for each time interval.

In TV, and in some monitors and multimedia standards as well,
another system, called interlaced scanning is used:

a) The odd-numbered lines are traced first, and then the even-numbered
lines are traced. This results in odd and even fields; two fields
make up one frame.

b) In fact, the odd lines (starting from 1) end up at the middle of a line
at the end of the odd field, and the even scan starts at a half-way point.
Fig. 5.1: Interlaced raster scan

c) Figure 5.1 shows the scheme used. First the solid (odd) lines are traced, P to Q, then R to S,
etc., ending at T; then the even field starts at U and ends at V.

d) The jump from Q to R, etc. in Figure 5.1 is called the horizontal retrace, during which the
electronic beam in the CRT is blanked. The jump from T to U or V to P is called the
vertical retrace.
Because of interlacing, the odd and even
lines are displaced in time from each other.
This is generally not noticeable except when
very fast action is taking place on screen,
when blurring may occur.

For example, in the video in Fig. 5.2, the
moving helicopter is blurred more than is
the still background.
Fig. 5.2: Interlaced scan produces two fields for each frame. (a) The
video frame, (b) Field 1, (c) Field 2, (d) Difference of Fields
Since it is sometimes necessary to change the frame rate,
resize, or even produce stills from an interlaced source video,
various schemes are used to de-interlace it.

a) The simplest de-interlacing method consists of discarding one
field and duplicating the scan lines of the other field. The
information in one field is lost completely using this simple
technique.

b) Other more complicated methods that retain information from
both fields are also possible.

Analog video uses a small voltage offset from zero to indicate
black, and another value, such as zero, to indicate the start of
a line. For example, we could use a blacker-than-black zero
signal to indicate the beginning of a line.
Fig. 5.3 Electronic signal for one NTSC scan line.
Digital Video
The advantages of digital representation for video are
many. For example:
(a) Video can be stored on digital devices or in memory,
ready to be processed (noise removal, cut and paste, etc.),
and integrated to various multimedia applications;

(b) Direct access is possible, which makes nonlinear video
editing achievable as a simple, rather than a complex,
task;

(c) Repeated recording does not degrade image quality;

(d) Ease of encryption and better tolerance to channel noise.
Chroma Subsampling
Since humans see color with much less spatial
resolution than they see black and white, it makes
sense to decimate the chrominance signal.
Interesting (but not necessarily informative!) names
have arisen to label the different schemes used.
To begin with, numbers are given stating how many
pixel values, per four original pixels, are actually
sent:

(a) The chroma subsampling scheme 4:4:4 indicates
that no chroma subsampling is used: each pixel's Y,
Cb, and Cr values are transmitted, 4 for each of Y, Cb,
Cr.

(b) The scheme 4:2:2 indicates horizontal subsampling of
the Cb, Cr signals by a factor of 2. That is, of four pixels
horizontally labelled as 0 to 3, all four Ys are sent, and
every two Cb's and two Cr's are sent, as (Cb0, Y0)(Cr0,
Y1)(Cb2, Y2)(Cr2, Y3)(Cb4, Y4), and so on (or averaging
is used).

(c) The scheme 4:1:1 subsamples horizontally by a factor
of 4.

(d) The scheme 4:2:0 subsamples in both the horizontal
and vertical dimensions by a factor of 2. Theoretically, an
average chroma pixel is positioned between the rows and
columns, as shown in Fig. 5.6.

Scheme 4:2:0 along with other schemes is commonly used
in JPEG and MPEG (see later chapters in Part 2).
Fig. 5.6: Chroma subsampling
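As an illustration of 4:2:0, here is a minimal NumPy sketch (the helper name subsample_420 is an assumption, not from the text) that keeps one averaged chroma sample per 2x2 block while Y stays at full resolution:

import numpy as np

def subsample_420(chroma):
    # Average each 2x2 block of a chroma plane (assumes even dimensions).
    h, w = chroma.shape
    blocks = chroma.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

cb = np.arange(16, dtype=float).reshape(4, 4)
print(subsample_420(cb).shape)   # (2, 2): one chroma sample per 2x2 luma block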
Lossless Compression
Algorithms
Introduction
Compression: the process of coding that will
effectively reduce the total number of bits needed
to represent certain information.





Fig. 7.1: A General Data Compression Scheme.



Introduction (contd)
If the compression and decompression processes
induce no information loss, then the compression
scheme is lossless; otherwise, it is lossy.

Compression ratio:

    compression ratio = B0 / B1    (7.1)

where B0 is the number of bits before compression and B1 is the
number of bits after compression.
Compression basically employs
redundancy in the data:

Temporal -- in 1D data, 1D signals, Audio etc.
Spatial -- correlation between neighbouring pixels or
data items
Spectral -- correlation between colour or luminance
components. This uses the frequency domain to exploit
relationships between the frequency of change in data.
Psycho-visual -- exploits perceptual properties of the
human visual system.
Basics of Information Theory
The entropy η of an information source with alphabet S = {s1, s2, . . . , sn} is:

    η = H(S) = Σ_{i=1..n} p_i log2(1/p_i)    (7.2)
             = − Σ_{i=1..n} p_i log2 p_i     (7.3)

where p_i is the probability that symbol s_i will occur in S.

log2(1/p_i) indicates the amount of information (self-information
as defined by Shannon) contained in s_i, which corresponds to
the number of bits needed to encode s_i.
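A minimal Python sketch of Eq. (7.3), added for illustration; the two-level source with probabilities 1/3 and 2/3 is only an assumed example that lands near the 0.92 figure quoted for Fig. 7.2(b) below:

import math

def entropy(probabilities):
    # Eq. (7.3): H = -sum p_i log2 p_i, skipping zero-probability symbols.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1 / 256] * 256))   # 8.0 bits/symbol: the uniform case
print(entropy([1 / 3, 2 / 3]))    # ~0.918 bits/symbol: a skewed two-level source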
Distribution of Gray-Level
Intensities





Fig. 7.2 Histograms for Two Gray-level Images.

Fig. 7.2(a) shows the histogram of an image with uniform distribution
of gray-level intensities, i.e., p_i = 1/256 for all i. Hence, the entropy of this
image is:

    log2 256 = 8    (7.4)

Fig. 7.2(b) shows the histogram of an image with two possible values.
Its entropy is 0.92.
Entropy and Code Length
As can be seen in Eq. (7.3), the entropy η is a
weighted sum of the terms log2(1/p_i); hence it
represents the average amount of information
contained per symbol in the source S.

The entropy η specifies the lower bound for the
average number of bits to code each symbol in S,
i.e.,

    η ≤ l̄    (7.5)

where l̄ is the average length (measured in bits) of the
codewords produced by the encoder.
Simple Repetition Suppression
For example, the sequence
89400000000000000000000000000000000
can be encoded as 894f32, where f flags a run of zeros and 32 gives its length.
Typical uses of such suppression of zeros in a file (Zero Length Suppression):
Silence in audio data, pauses in conversation
Bitmaps
Blanks in text or program source files
Backgrounds in images
Run-Length Coding
This encoding method is frequently
applied to images (or pixels in a scan line).
For example:
111122233333311112222 can be
encoded as: (1,4),(2,3),(3,6),(1,4),(2,4)
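A one-line run-length encoder over the example string (an illustrative sketch, not part of the original slides):

from itertools import groupby

def run_length_encode(s):
    # Turn each run of identical symbols into a (symbol, run-length) pair.
    return [(ch, len(list(run))) for ch, run in groupby(s)]

print(run_length_encode("111122233333311112222"))
# [('1', 4), ('2', 3), ('3', 6), ('1', 4), ('2', 4)]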
Variable-Length Coding (VLC)
Representing symbols with a variable number of bits.
Shannon-Fano Algorithm: a top-down approach

1. Sort the symbols according to the frequency count of their
occurrences.

2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts contain
only one symbol.

An Example: coding of HELLO



Frequency count of the symbols in HELLO:

    Symbol   H   E   L   O
    Count    1   1   2   1
Fig. 7.3: Coding Tree for HELLO by Shannon-Fano.
Table 7.1: Result of Performing Shannon-Fano on HELLO

    Symbol   Count   log2(1/p_i)   Code   # of bits used
    L        2       1.32          0      2
    H        1       2.32          10     2
    E        1       2.32          110    3
    O        1       2.32          111    3
    TOTAL # of bits: 10
Fig. 7.4 Another coding tree for HELLO by Shannon-
Fano.
Table 7.2: Another Result of Performing Shannon-Fano on HELLO (see Fig. 7.4)

    Symbol   Count   log2(1/p_i)   Code   # of bits used
    L        2       1.32          00     4
    H        1       2.32          01     2
    E        1       2.32          10     2
    O        1       2.32          11     2
    TOTAL # of bits: 10
Huffman Coding
ALGORITHM 7.1 Huffman Coding Algorithm: a bottom-up approach

1. Initialization: Put all symbols on a list sorted according to their frequency
counts.

2. Repeat until the list has only one symbol left:

(1) From the list pick two symbols with the lowest frequency counts. Form a Huffman
subtree that has these two symbols as child nodes and create a parent node.

(2) Assign the sum of the children's frequency counts to the parent and insert it into the
list such that the order is maintained.

(3) Delete the children from the list.

3. Assign a codeword for each leaf based on the path from the root.
Fig. 7.5: Coding Tree for HELLO using the Huffman Algorithm.
Huffman Coding (contd)
In Fig. 7.5, new symbols P1, P2, P3 are
created to refer to the parent nodes in the
Huffman coding tree. The contents in the list
are illustrated below:

After initialization: L H E O
After iteration (a): L P1 H
After iteration (b): L P2
After iteration (c): P3
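A compact Python sketch of Algorithm 7.1 using a heap (illustrative; ties between equal counts are broken arbitrarily, so the individual codes can differ from Fig. 7.5 while the 10-bit total stays the same):

import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry: (subtree frequency, unique tie-breaker, {symbol: code-so-far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)    # two lowest-frequency subtrees
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (c1 + c2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("HELLO")
print(codes)                                  # codes may differ from Fig. 7.5 but are optimal
print(sum(len(codes[ch]) for ch in "HELLO"))  # 10 bits in total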
Properties of Huffman Coding
1. Unique Prefix Property: No Huffman code is a prefix of any other
Huffman code - precludes any ambiguity in decoding.

2. Optimality: minimum redundancy code - proved optimal for a given
data model (i.e., a given, accurate, probability distribution):

The two least frequent symbols will have the same length for their
Huffman codes, differing only at the last bit.

Symbols that occur more frequently will have shorter Huffman codes than
symbols that occur less frequently.

The average code length for an information source S is strictly less than
+ 1. Combined with Eq. (7.5), we have:

(7.6)
1 l q < +
Dictionary-based Coding
LZW uses fixed-length codewords to represent
variable-length strings of symbols/characters that
commonly occur together, e.g., words in English
text.

The LZW encoder and decoder build up the same
dictionary dynamically while receiving the data.

LZW places longer and longer repeated entries into
a dictionary, and then emits the code for an
element, rather than the string itself, if the element
has already been placed in the dictionary.
ALGORITHM 7.2 - LZW Compression

BEGIN
s = next input character;
while not EOF
{
c = next input character;

if s + c exists in the dictionary
s = s + c;
else
{
output the code for s;
add string s + c to the dictionary with a new code;
s = c;
}
}
output the code for s;
END
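A minimal Python rendering of Algorithm 7.2, assuming the initial dictionary {A:1, B:2, C:3} of Example 7.2:

def lzw_compress(text, initial_dictionary):
    dictionary = dict(initial_dictionary)        # string -> code
    next_code = max(dictionary.values()) + 1
    s, output = "", []
    for c in text:
        if s + c in dictionary:
            s = s + c                            # keep extending the current string
        else:
            output.append(dictionary[s])         # emit code for the longest match
            dictionary[s + c] = next_code        # add the new string to the dictionary
            next_code += 1
            s = c
    output.append(dictionary[s])                 # flush the last accumulated string
    return output

print(lzw_compress("ABABBABCABABBA", {"A": 1, "B": 2, "C": 3}))
# [1, 2, 4, 5, 2, 3, 4, 6, 1]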
Example 7.2 LZW compression for string
ABABBABCABABBA

Let's start with a very simple dictionary (also
referred to as a string table), initially containing
only 3 characters, with codes as follows:





Now if the input string is ABABBABCABABBA,
the LZW compression algorithm works as follows:
Code String
1 A
2 B
3 C
The output codes are: 1 2 4 5 2 3 4 6 1. Instead of sending 14 characters,
only 9 codes need to be sent (compression ratio = 14/9 = 1.56).
    S     C     Output   Code   String
                         1      A
                         2      B
                         3      C
    A     B     1        4      AB
    B     A     2        5      BA
    A     B
    AB    B     4        6      ABB
    B     A
    BA    B     5        7      BAB
    B     C     2        8      BC
    C     A     3        9      CA
    A     B
    AB    A     4        10     ABA
    A     B
    AB    B
    ABB   A     6        11     ABBA
    A     EOF   1
ALGORITHM 7.3 LZW Decompression (simple version)

BEGIN
s = NIL;
while not EOF
{
k = next input code;
entry = dictionary entry for k;
output entry;
if (s != NIL)
add string s + entry[0] to dictionary with a new code;
s = entry;
}
END

Example 7.3: LZW decompression for string ABABBABCABABBA.
Input codes to the decoder are 1 2 4 5 2 3 4 6 1.
The initial string table is identical to what is used by the encoder.
Apparently, the output string is ABABBABCABABBA, a truly lossless result!
The LZW decompression algorithm then works as follows:

    S     K     Entry/output   Code   String
                               1      A
                               2      B
                               3      C
    NIL   1     A
    A     2     B              4      AB
    B     4     AB             5      BA
    AB    5     BA             6      ABB
    BA    2     B              7      BAB
    B     3     C              8      BC
    C     4     AB             9      CA
    AB    6     ABB            10     ABA
    ABB   1     A              11     ABBA
    A     EOF
ALGORITHM 7.4 LZW Decompression (modified)

BEGIN
s = NIL;
while not EOF
{
k = next input code;
entry = dictionary entry for k;

/* exception handler */
if (entry == NULL)
entry = s + s[0];

output entry;
if (s != NIL)
add string s + entry[0] to dictionary with a new code;
s = entry;
}
END
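A matching Python sketch of Algorithm 7.4, including the exception handler, again assuming the 3-entry initial dictionary:

def lzw_decompress(codes, initial_dictionary):
    dictionary = dict(initial_dictionary)        # code -> string
    next_code = max(dictionary) + 1
    s, output = None, []
    for k in codes:
        entry = dictionary.get(k)
        if entry is None:                        # code not yet in the dictionary:
            entry = s + s[0]                     # exception handler, entry = s + s[0]
        output.append(entry)
        if s is not None:
            dictionary[next_code] = s + entry[0]
            next_code += 1
        s = entry
    return "".join(output)

print(lzw_decompress([1, 2, 4, 5, 2, 3, 4, 6, 1], {1: "A", 2: "B", 3: "C"}))
# ABABBABCABABBA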
LZW Coding (contd)
In real applications, the code length l is kept in
the range [l0, lmax]. The dictionary initially
has a size of 2^l0. When it is filled up, the code
length will be increased by 1; this is allowed
to repeat until l = lmax.

When lmax is reached and the dictionary is
filled up, it needs to be flushed (as in Unix
compress), or to have the LRU (least recently
used) entries removed.
Arithmetic Coding
Arithmetic coding is a more modern coding method that
usually out-performs Huffman coding.

Huffman coding assigns each symbol a codeword
which has an integral bit length. Arithmetic coding can
treat the whole message as one unit.

A message is represented by a half-open interval [a, b)
where a and b are real numbers between 0 and 1.
Initially, the interval is [0, 1). When the message
becomes longer, the length of the interval shortens
and the number of bits needed to represent the
interval increases.
ALGORITHM 7.5 Arithmetic Coding Encoder

BEGIN
low = 0.0; high = 1.0; range = 1.0;

while (symbol != terminator)
{
get (symbol);
high = low + range * Range_high(symbol);
low = low + range * Range_low(symbol);
range = high - low;
}

output a code so that low <= code < high;
END
Example: Encoding in Arithmetic Coding










Fig. 7.8: Arithmetic Coding: Encode Symbols CAEE$
(a) Probability distribution of symbols:

    Symbol   Probability   Range
    A        0.2           [0, 0.2)
    B        0.1           [0.2, 0.3)
    C        0.2           [0.3, 0.5)
    D        0.05          [0.5, 0.55)
    E        0.3           [0.55, 0.85)
    F        0.05          [0.85, 0.9)
    $        0.1           [0.9, 1.0)

Fig. 7.8(b): Graphical display of shrinking ranges.
Example: Encoding in Arithmetic Coding








(c) New low, high, and range generated.

Fig. 7.8 (contd): Arithmetic Coding: Encode Symbols
CAEE$
    Symbol   Low       High      Range
             0         1.0       1.0
    C        0.3       0.5       0.2
    A        0.30      0.34      0.04
    E        0.322     0.334     0.012
    E        0.3286    0.3322    0.0036
    $        0.33184   0.33220   0.00036
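The interval shrinking in the table above can be reproduced with a short illustrative sketch (note that both bounds are computed from the interval's previous low, as in Algorithm 7.5 above):

RANGES = {"A": (0.0, 0.2), "B": (0.2, 0.3), "C": (0.3, 0.5), "D": (0.5, 0.55),
          "E": (0.55, 0.85), "F": (0.85, 0.9), "$": (0.9, 1.0)}

def encode_interval(message):
    low, high = 0.0, 1.0
    for symbol in message:
        rng = high - low
        sym_low, sym_high = RANGES[symbol]
        high = low + rng * sym_high        # both updates use the old low
        low = low + rng * sym_low
        print(symbol, round(low, 5), round(high, 5), round(high - low, 5))
    return low, high                       # any code in [low, high) identifies the message

encode_interval("CAEE$")
# C 0.3 0.5 0.2 / A 0.3 0.34 0.04 / E 0.322 0.334 0.012 / E 0.3286 0.3322 0.0036 / ...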
PROCEDURE 7.2 Generating Codeword for Encoder

BEGIN
code = 0;
k = 1;
while (value(code) < low)
{
assign 1 to the kth binary fraction bit
if (value(code) > high)
replace the kth bit by 0
k = k + 1;
}
END

The final step in Arithmetic encoding calls for the generation of a
number that falls within the range [low, high). The above algorithm
will ensure that the shortest binary codeword is found.
ALGORITHM 7.6 Arithmetic Coding Decoder

BEGIN
get binary code and convert to
decimal value = value(code);
Do
{
find a symbol s so that
Range_low(s) <= value < Range_high(s);
output s;
low = Range_low(s);
high = Range_high(s);
range = high - low;
value = [value - low] / range;
}
Until symbol s is a terminator
END
Table 7.5 Arithmetic coding: decode symbols CAEE$







    Value        Output Symbol   Low    High   Range
    0.33203125   C               0.3    0.5    0.2
    0.16015625   A               0.0    0.2    0.2
    0.80078125   E               0.55   0.85   0.3
    0.8359375    E               0.55   0.85   0.3
    0.953125     $               0.9    1.0    0.1
Lossless Image Compression
Approaches of Differential Coding of Images:

Given an original image I(x, y), using a simple difference operator
we can define a difference image d(x, y) as follows:

    d(x, y) = I(x, y) − I(x − 1, y)    (7.9)

or use the discrete version of the 2-D Laplacian operator to
define a difference image d(x, y) as:

    d(x, y) = 4 I(x, y) − I(x, y − 1) − I(x, y + 1) − I(x + 1, y) − I(x − 1, y)    (7.10)

Due to the spatial redundancy that exists in normal images I, the
difference image d will have a narrower histogram and hence
a smaller entropy, as shown in Fig. 7.9.
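An illustrative NumPy sketch of Eq. (7.9); the random-walk test image is an assumption, used only to show the entropy drop:

import numpy as np

def entropy(values):
    # Empirical entropy of the values in an array (Eq. 7.3 on observed frequencies).
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
img = np.cumsum(rng.integers(-2, 3, size=(64, 64)), axis=1)  # smooth-ish synthetic rows
d = img[:, 1:] - img[:, :-1]                                 # d(x, y) = I(x, y) - I(x - 1, y)
print(entropy(img), entropy(d))   # the difference image has the narrower histogram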
Fig. 7.9: Distributions for Original versus Derivative Images. (a,b):
Original gray-level image and its partial derivative image; (c,d):
Histograms for original and derivative images.

(This figure uses a commonly employed image called Barb.)
Lossless JPEG
Lossless JPEG: A special case of the JPEG image
compression.

The Predictive method
1. Forming a differential prediction: A predictor combines
the values of up to three neighboring pixels as the predicted
value for the current pixel, indicated by X in Fig. 7.10.
The predictor can use any one of the seven schemes
listed in Table 7.6.

2. Encoding: The encoder compares the prediction with the
actual pixel value at the position X and encodes the
difference using one of the lossless compression
techniques we have discussed, e.g., the Huffman coding
scheme.
Fig. 7.10: Neighboring Pixels for Predictors in Lossless JPEG.

Note: Any of A, B, or C has already been decoded before it is used
in the predictor, on the decoder side of an encode-decode cycle.
Table 7.6: Predictors for Lossless JPEG

    Predictor   Prediction
    P1          A
    P2          B
    P3          C
    P4          A + B − C
    P5          A + (B − C) / 2
    P6          B + (A − C) / 2
    P7          (A + B) / 2
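An illustrative NumPy sketch of predictor P4 (A + B − C), with A = left, B = above, C = above-left of the current pixel; border pixels are left unpredicted in this sketch:

import numpy as np

def predict_p4(img):
    img = img.astype(int)
    pred = np.zeros_like(img)
    # A + B - C for every pixel that has a left, above and above-left neighbour.
    pred[1:, 1:] = img[1:, :-1] + img[:-1, 1:] - img[:-1, :-1]
    return pred

img = np.array([[52, 55, 61], [62, 59, 55], [63, 65, 66]])
residual = img - predict_p4(img)   # the encoder entropy-codes this difference
print(residual)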
Table 7.7: Comparison with other lossless compression programs (compression ratios)

    Compression Program       Lena   Football   F-18   Flowers
    Lossless JPEG             1.45   1.54       2.29   1.26
    Optimal Lossless JPEG     1.49   1.67       2.71   1.33
    Compress (LZW)            0.86   1.24       2.21   0.87
    Gzip (LZ77)               1.08   1.36       3.10   1.05
    Gzip -9 (optimal LZ77)    1.08   1.36       3.13   1.05
    Pack (Huffman coding)     1.02   1.12       1.19   1.00

Lossy Compression
Algorithms
Introduction
Lossless compression algorithms do not
deliver compression ratios that are high
enough. Hence, most multimedia
compression algorithms are lossy.

What is lossy compression?
The compressed data is not the same as the
original data, but a close approximation of it.
Yields a much higher compression ratio than
that of lossless compression.
Distortion Measures
The three most commonly used distortion measures in image compression are:

mean square error (MSE), σ²:

    σ² = (1/N) Σ_{n=1..N} (x_n − y_n)²    (8.1)

where x_n, y_n, and N are the input data sequence, reconstructed data sequence, and length
of the data sequence, respectively.

signal-to-noise ratio (SNR), in decibel units (dB):

    SNR = 10 log10 (σ_x² / σ_d²)    (8.2)

where σ_x² is the average square value of the original data sequence and σ_d² is the MSE.

peak signal-to-noise ratio (PSNR):

    PSNR = 10 log10 (x_peak² / σ_d²)    (8.3)
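A minimal sketch of Eqs. (8.1)-(8.3), assuming 8-bit data so that x_peak = 255:

import numpy as np

def mse(x, y):
    return float(np.mean((x.astype(float) - y.astype(float)) ** 2))

def snr_db(x, y):
    # Ratio of average signal power to the MSE, in dB.
    return 10 * np.log10(np.mean(x.astype(float) ** 2) / mse(x, y))

def psnr_db(x, y, peak=255.0):
    return 10 * np.log10(peak ** 2 / mse(x, y))

x = np.array([100, 120, 130, 140], dtype=np.uint8)
y = np.array([101, 118, 131, 139], dtype=np.uint8)
print(mse(x, y), snr_db(x, y), psnr_db(x, y))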
The Rate-Distortion Theory
Provides a framework for the study of tradeoffs
between Rate and Distortion.








Fig. 8.1: Typical Rate Distortion Function.
Quantization
Reduce the number of distinct output values to
a much smaller set.

Main source of the loss in lossy compression.

Three different forms of quantization.
Uniform: midrise and midtread quantizers.
Nonuniform: companded quantizer.
Vector Quantization.
Uniform Scalar Quantization
A uniform scalar quantizer partitions the domain of
input values into equally spaced intervals, except
possibly at the two outer intervals.

The output or reconstruction value corresponding to each
interval is taken to be the midpoint of the interval.

The length of each interval is referred to as the step size,
denoted by the symbol Δ.

Two types of uniform scalar quantizers:

Midrise quantizers have even number of output levels.
Midtread quantizers have odd number of output levels,
including zero as one of them (see Fig. 8.2).
For the special case where Δ = 1, we can simply compute the output values
for these quantizers as:

    Q_midrise(x) = ⌈x⌉ − 0.5       (8.4)
    Q_midtread(x) = ⌊x + 0.5⌋      (8.5)

Performance of an M-level quantizer. Let B = {b_0, b_1, . . . , b_M} be the set of
decision boundaries and Y = {y_1, y_2, . . . , y_M} be the set of reconstruction or
output values.

Suppose the input is uniformly distributed in the interval [−X_max, X_max]. The rate
of the quantizer is:

    R = ⌈log2 M⌉    (8.6)
Fig. 8.2: Uniform Scalar Quantizers: (a) Midrise, (b) Midtread.
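A minimal sketch of the Δ = 1 quantizers of Eqs. (8.4)-(8.5):

import math

def q_midrise(x):
    return math.ceil(x) - 0.5      # even number of levels, zero is not an output value

def q_midtread(x):
    return math.floor(x + 0.5)     # odd number of levels, zero included as an output

print([q_midrise(v) for v in (-0.2, 0.2, 1.3)])   # [-0.5, 0.5, 1.5]
print([q_midtread(v) for v in (-0.2, 0.2, 1.3)])  # [0, 0, 1]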
Quantization Error of Uniformly
Distributed Source
Granular distortion: quantization error caused by the quantizer for bounded
input.

To get an overall figure for granular distortion, notice that decision boundaries b_i for a
midrise quantizer are [(i − 1)Δ, iΔ], i = 1..M/2, covering positive data X (and another
half for negative X values).

Output values y_i are the midpoints iΔ − Δ/2, i = 1..M/2, again just considering the positive
data. The total distortion is twice the sum over the positive data, or

    D_gran = 2 Σ_{i=1..M/2} ∫_{(i−1)Δ}^{iΔ} ( x − (2i − 1)Δ/2 )² · (1 / (2 X_max)) dx    (8.8)

Since the reconstruction values y_i are the midpoints of each interval, the
quantization error must lie within the values [−Δ/2, Δ/2]. For a uniformly
distributed source, the graph of the quantization error is shown in Fig. 8.3.
Fig. 8.3: Quantization error of a uniformly distributed source.
Fig. 8.4: Companded quantization.

Companded quantization is nonlinear.

As shown above, a compander consists of a compressor
function G, a uniform quantizer, and an expander function G⁻¹.

The two commonly used companders are the µ-law and A-law
companders.
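An illustrative µ-law sketch (µ = 255, inputs normalized to [−1, 1]); the function names compress and expand are assumptions for this example, not a standard API:

import numpy as np

MU = 255.0

def compress(x):                   # compressor function G
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def expand(y):                     # expander function G^-1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.array([-0.9, -0.01, 0.001, 0.5])
levels = 256
y = np.round(compress(x) * (levels / 2 - 1)) / (levels / 2 - 1)  # uniform quantizer in the middle
print(expand(y))                   # small samples keep much better relative accuracy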
Vector Quantization (VQ)
According to Shannon's original work on information
theory, any compression system performs better if it
operates on vectors or groups of samples rather than
individual symbols or samples.

Form vectors of input samples by simply concatenating
a number of consecutive samples into a single vector.

Instead of single reconstruction values as in scalar
quantization, in VQ code vectors with n components
are used. A collection of these code vectors form the
codebook.
Fig. 8.5: Basic vector quantization
procedure.
Transform Coding
The rationale behind transform coding:
If Y is the result of a linear transform T of the input vector X
in such a way that the components of Y are much less
correlated, then Y can be coded more efficiently than X.
If most information is accurately described by the first few
components of a transformed vector, then the remaining
components can be coarsely quantized, or even set to zero,
with little signal distortion.
Example
A simple transform encoding procedure may be
described by the following steps for a 2x2 block of
monochrome pixels:
Take top left pixel as the base value for the block, pixel A.
Calculate three other transformed values by taking the difference
between these (respective) pixels and pixel A, i.e. B−A, C−A, D−A.
Store the base pixel and the differences as the values of the
transform.
Any redundancy in the data has been transformed into the values Xi,
which are small differences, so we can compress the data by using fewer
bits to represent them. For example, if we use 8 bits per
pixel, then the 2x2 block uses 32 bits; if we keep 8 bits for
the base pixel, X0, and assign 4 bits for each difference,
then we only use 20 bits.
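A tiny sketch of this 2x2 block transform and its inverse (illustrative only):

def forward(block):               # block = (A, B, C, D)
    a, b, c, d = block
    return (a, b - a, c - a, d - a)

def inverse(values):              # values = (X0, B-A, C-A, D-A)
    a, db, dc, dd = values
    return (a, a + db, a + dc, a + dd)

block = (120, 125, 118, 122)
values = forward(block)           # (120, 5, -2, 2): small differences
assert inverse(values) == block   # exactly invertible if the differences are kept in full
print(values)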
Spatial Frequency and DCT
Spatial frequency indicates how many times
pixel values change across an image block.

The DCT formalizes this notion with a
measure of how much the image contents
change in correspondence to the number of
cycles of a cosine wave per block.

The role of the DCT is to decompose the
original signal into its DC and AC
components; the role of the IDCT is to
reconstruct (re-compose) the signal.
Definition of DCT:
Given an input function f(i, j) over two integer variables i and j (a
piece of an image), the 2D DCT transforms it into a new function
F(u, v), with integer u and v running over the same range as i
and j. The general definition of the transform is:

    F(u, v) = (2 C(u) C(v) / √(MN)) Σ_{i=0..M−1} Σ_{j=0..N−1} cos((2i+1)uπ / 2M) · cos((2j+1)vπ / 2N) · f(i, j)    (8.15)

where i, u = 0, 1, . . . , M − 1; j, v = 0, 1, . . . , N − 1; and the
constants C(u) and C(v) are determined by

    C(ξ) = √2 / 2   if ξ = 0,
           1        otherwise.    (8.16)
2D Discrete Cosine Transform (2D DCT):

    F(u, v) = (C(u) C(v) / 4) Σ_{i=0..7} Σ_{j=0..7} cos((2i+1)uπ / 16) · cos((2j+1)vπ / 16) · f(i, j)    (8.17)

where i, j, u, v = 0, 1, . . . , 7, and the constants C(u) and C(v) are
determined by Eq. (8.16).

2D Inverse Discrete Cosine Transform (2D IDCT):
The inverse function is almost the same, with the roles of f(i, j) and F(u,
v) reversed, except that now C(u)C(v) must stand inside the sums:

    f(i, j) = Σ_{u=0..7} Σ_{v=0..7} (C(u) C(v) / 4) cos((2i+1)uπ / 16) · cos((2j+1)vπ / 16) · F(u, v)    (8.18)

where i, j, u, v = 0, 1, . . . , 7.
1D Discrete Cosine Transform (1D DCT):

    F(u) = (C(u) / 2) Σ_{i=0..7} cos((2i+1)uπ / 16) · f(i)    (8.19)

where i = 0, 1, . . . , 7, u = 0, 1, . . . , 7.

1D Inverse Discrete Cosine Transform (1D IDCT):

    f(i) = Σ_{u=0..7} (C(u) / 2) cos((2i+1)uπ / 16) · F(u)    (8.20)

where i = 0, 1, . . . , 7, u = 0, 1, . . . , 7.
Fig. 8.6: The 1D DCT basis functions.


Fig. 8.6 (contd): The 1D DCT basis
functions.
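Before looking at the example signals of Fig. 8.7, a direct (unoptimized) implementation of the 8x8 2D DCT of Eq. (8.17) may help make the formula concrete; in practice an FFT-based routine such as SciPy's dct would be used instead:

import math

def C(k):
    return math.sqrt(2) / 2 if k == 0 else 1.0

def dct2_8x8(f):
    # Direct evaluation of Eq. (8.17) for an 8x8 block f.
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(f[i][j]
                    * math.cos((2 * i + 1) * u * math.pi / 16)
                    * math.cos((2 * j + 1) * v * math.pi / 16)
                    for i in range(8) for j in range(8))
            F[u][v] = C(u) * C(v) / 4 * s
    return F

flat = [[100] * 8 for _ in range(8)]       # a constant (DC-only) block
print(round(dct2_8x8(flat)[0][0], 2))      # 800.0: only the DC term is non-zero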
Fig. 8.7: Examples of 1D Discrete Cosine Transform: (a) A DC signal f1(i), (b) An AC signal f2(i).
Fig. 8.7 (contd): Examples of 1D Discrete Cosine Transform: (c) f3(i) = f1(i) + f2(i), and (d) an arbitrary signal f(i).
Fig. 8.8 An example of 1D IDCT.
Fig. 8.8 (contd): An example of 1D IDCT.
The DCT is a linear transform:
In general, a transform T (or function) is linear,
iff

    T(αp + βq) = α T(p) + β T(q)    (8.21)

where α and β are constants, and p and q are any
functions, variables or constants.

From the definition in Eq. 8.17 or 8.19, this
property can readily be proven for the DCT
because it uses only simple arithmetic
operations.
The Cosine Basis Functions
Functions B_p(i) and B_q(i) are orthogonal if

    Σ_i [B_p(i) · B_q(i)] = 0    if p ≠ q    (8.22)

Functions B_p(i) and B_q(i) are orthonormal if they are orthogonal
and

    Σ_i [B_p(i) · B_q(i)] = 1    if p = q    (8.23)

It can be shown that:

    Σ_{i=0..7} cos((2i+1)pπ / 16) · cos((2i+1)qπ / 16) = 0    if p ≠ q

    Σ_{i=0..7} (C(p)/2) cos((2i+1)pπ / 16) · (C(q)/2) cos((2i+1)qπ / 16) = 1    if p = q
Fig. 8.9: Graphical Illustration of 8 8 2D DCT basis.


2D Separable Basis
The 2D DCT can be separated into a sequence of
two 1D DCT steps:

    G(i, v) = (1/2) C(v) Σ_{j=0..7} cos((2j+1)vπ / 16) · f(i, j)    (8.24)

    F(u, v) = (1/2) C(u) Σ_{i=0..7} cos((2i+1)uπ / 16) · G(i, v)    (8.25)

It is straightforward to see that this simple change saves
many arithmetic steps. The number of iterations
required is reduced from 8 × 8 to 8 + 8.
Comparison of DCT and DFT
The discrete cosine transform is a close counterpart to the Discrete Fourier Transform (DFT).
DCT is a transform that only involves the real part of the DFT.
For a continuous signal, we define the continuous Fourier transform F as follows:

    F(ω) = ∫ f(t) e^(−iωt) dt    (8.26)

Using Euler's formula, we have:

    e^(ix) = cos(x) + i sin(x)    (8.27)

Because the use of digital computers requires us to discretize the input signal, we define a
DFT that operates on 8 samples of the input signal {f_0, f_1, . . . , f_7} as:

    F_ω = Σ_{x=0..7} f_x · e^(−2πiωx / 8)    (8.28)
Writing the sine and cosine terms explicitly, we have:

    F_ω = Σ_{x=0..7} f_x cos(2πωx / 8) − i Σ_{x=0..7} f_x sin(2πωx / 8)    (8.29)

The formulation of the DCT that allows it to use
only the cosine basis functions of the DFT is that
we can cancel out the imaginary part of the DFT
by making a symmetric copy of the original input
signal.

The DCT of 8 input samples corresponds to the DFT of the
16 samples made up of the original 8 input samples
and a symmetric copy of these, as shown in Fig.
8.10.
Fig. 8.10 Symmetric extension of the ramp function.

A Simple Comparison of DCT and
DFT
Table 8.1 and Fig. 8.11 show the comparison of DCT and DFT on a ramp
function, if only the first three terms are used.

Table 8.1 DCT and DFT coefficients of the ramp function











Ramp DCT DFT
0 9.90 28.00
1 -6.44 -4.00
2 0.00 9.66
3 -0.67 -4.00
4 0.00 4.00
5 -0.20 -4.00
6 0.00 1.66
7 -0.51 -4.00
Fig. 8.11: Approximation of the ramp function: (a) 3 Term DCT
Approximation, (b) 3 Term DFT Approximation.
Karhunen-Loève Transform
(KLT)
The Karhunen-Loève transform is a reversible linear transform that
exploits the statistical properties of the vector representation.

It optimally decorrelates the input signal.

To understand the optimality of the KLT, consider the autocorrelation
matrix R_X of the input vector X, defined as

    R_X = E[X X^T]    (8.30)

        = [ R_X(1,1)  R_X(1,2)  ...  R_X(1,k)
            R_X(2,1)  R_X(2,2)  ...  R_X(2,k)
            ...
            R_X(k,1)  R_X(k,2)  ...  R_X(k,k) ]    (8.31)
Our goal is to find a transform T such that the components of the output Y are
uncorrelated, i.e., E[Y_t Y_s] = 0 if t ≠ s. Thus, the autocorrelation matrix of Y
takes on the form of a positive diagonal matrix.

Since any autocorrelation matrix is symmetric and non-negative definite, there
are k orthogonal eigenvectors u_1, u_2, . . . , u_k and k corresponding real and
nonnegative eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_k ≥ 0.

If we define the Karhunen-Loève transform as

    T = [u_1, u_2, . . . , u_k]^T    (8.32)

then the autocorrelation matrix of Y becomes

    R_Y = E[Y Y^T] = E[T X X^T T^T] = T R_X T^T    (8.35)

        = diag(λ_1, λ_2, . . . , λ_k)    (8.36)
KLT Example
To illustrate the mechanics of the KLT, consider the four 3D input vectors x_1 = (4, 4, 5),
x_2 = (3, 2, 5), x_3 = (5, 7, 6), and x_4 = (6, 7, 7).

Estimate the mean:

    m_x = (1/4) [18, 20, 23]^T

Estimate the autocorrelation matrix of the input:

    R_X = (1/n) Σ_{i=1..n} x_i x_i^T − m_x m_x^T    (8.37)

        = [ 1.25  2.25  0.88
            2.25  4.50  1.50
            0.88  1.50  0.69 ]
The eigenvalues of R_X are λ_1 = 6.1963, λ_2 = 0.2147, and λ_3 = 0.0264. The
corresponding eigenvectors are (up to sign):

    u_1 = [0.4385, 0.8471, 0.3003]^T
    u_2 = [0.4460, −0.4952, 0.7456]^T
    u_3 = [0.7803, −0.1929, −0.5949]^T

The KLT is given by the matrix

    T = [ 0.4385   0.8471   0.3003
          0.4460  −0.4952   0.7456
          0.7803  −0.1929  −0.5949 ]
Subtracting the mean vector from each input vector and applying the KLT (with the sign
convention above):

    y_1 = [−1.2916, −0.2870, 0.2490]^T
    y_2 = [−3.4242, 0.2573, −0.1453]^T
    y_3 = [1.9885, −0.5809, −0.1445]^T
    y_4 = [2.7273, 0.6107, 0.0408]^T

Since the rows of T are orthonormal vectors, the inverse transform is just the
transpose: T⁻¹ = T^T, and

    x = T^T y + m_x    (8.38)

In general, after the KLT most of the energy of the transform coefficients is
concentrated within the first few components. This is the energy
compaction property of the KLT.
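The whole example can be checked with a few lines of NumPy (an illustrative sketch; the eigenvector signs returned by the library are arbitrary):

import numpy as np

X = np.array([[4, 4, 5], [3, 2, 5], [5, 7, 6], [6, 7, 7]], dtype=float)
m = X.mean(axis=0)
R = (X.T @ X) / len(X) - np.outer(m, m)    # autocorrelation estimate, Eq. (8.37)

eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
T = eigvecs[:, order].T                    # rows = eigenvectors, largest eigenvalue first
print(np.round(eigvals[order], 4))         # ~[6.1963 0.2147 0.0264], as quoted above

Y = (X - m) @ T.T                          # each row is y_i = T (x_i - m_x)
print(np.round((Y ** 2).mean(axis=0), 4))  # energy compaction: first component dominates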

Image Compression Standards
The JPEG Standard
9.1 The JPEG Standard
JPEG is an image compression standard that was
developed by the Joint Photographic Experts Group. JPEG
was formally accepted as an international standard in 1992.

JPEG is a lossy image compression method. It employs a
transform coding method using the DCT (Discrete Cosine
Transform).

An image is a function of i and j (or conventionally x and y) in
the spatial domain. The 2D DCT is used as one step in JPEG
in order to yield a frequency response which is a function F(u,
v) in the spatial frequency domain, indexed by two integers u
and v.
Observations for JPEG Image
Compression
The effectiveness of the DCT transform coding
method in JPEG relies on 3 major observations:

Observation 1: Useful image contents change
relatively slowly across the image, i.e., it is unusual
for intensity values to vary widely several times in a
small area, for example, within an 8×8 image block.

much of the information in an image is repeated,
hence spatial redundancy.
Observations for JPEG Image
Compression(contd)
Observation 2: Psychophysical experiments suggest that
humans are much less likely to notice the loss of very high
spatial frequency components than the loss of lower
frequency components.

the spatial redundancy can be reduced by largely reducing
the high spatial frequency contents.

Observation 3: Visual acuity (accuracy in distinguishing
closely spaced lines) is much greater for gray (black and
white) than for color.

chroma subsampling (4:2:0) is used in JPEG.
Fig. 9.1: Block diagram for JPEG encoder.
Main Steps in JPEG Image Compression

Transform RGB to YIQ or YUV and subsample
color
DCT on image blocks
Quantization
Zig-zag ordering and run-length encoding
Entropy coding
DCT on image blocks
Each image is divided into 8 × 8 blocks.
The 2D DCT is applied to each block
image f(i, j), with output being the DCT
coefficients F(u, v) for each block.
Using blocks, however, has the effect of
isolating each block from its neighboring
context. This is why JPEG images look
choppy (blocky) when a high
compression ratio is specified by the user.
The input image is N by M;
f(i,j) is the intensity of the pixel in row i and column j;
F(u,v) is the DCT coefficient in row u and column v of
the DCT coefficient matrix.
For most images, much of the signal energy lies at low
frequencies; these appear in the upper left corner of the
DCT.
Compression is achieved since the lower right values
represent higher frequencies, and are often small, small
enough to be neglected with little visible distortion.
The DCT input is an 8 by 8 array of integers. This array
contains each pixel's gray scale level;
8 bit pixels have levels from 0 to 255.
The basic operation of the DCT
Quantization

    F̂(u, v) = round( F(u, v) / Q(u, v) )    (9.1)

F(u, v) represents a DCT coefficient, Q(u, v) is a quantization matrix
entry, and F̂(u, v) represents the quantized DCT
coefficients, which JPEG will use in the succeeding entropy coding.

The quantization step is the main source for loss in JPEG
compression.

The entries of Q(u, v) tend to have larger values towards the lower right
corner. This aims to introduce more loss at the higher spatial
frequencies, a practice supported by Observations 1 and 2.

Tables 9.1 and 9.2 show the default Q(u, v) values obtained from
psychophysical studies with the goal of maximizing the compression
ratio while minimizing perceptual losses in JPEG images.
Table 9.1: The Luminance Quantization Table

    16  11  10  16  24   40   51   61
    12  12  14  19  26   58   60   55
    14  13  16  24  40   57   69   56
    14  17  22  29  51   87   80   62
    18  22  37  56  68  109  103   77
    24  35  55  64  81  104  113   92
    49  64  78  87 103  121  120  101
    72  92  95  98 112  100  103   99

Table 9.2: The Chrominance Quantization Table

    17  18  24  47  99  99  99  99
    18  21  26  66  99  99  99  99
    24  26  56  99  99  99  99  99
    47  66  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
    99  99  99  99  99  99  99  99
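An illustrative sketch of Eq. (9.1) with the luminance table, showing how the larger lower-right entries zero out high-frequency coefficients (the constant test block is an assumption, not the Lena data below):

import numpy as np

Q_LUM = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99]])

def quantize(F):
    return np.round(F / Q_LUM).astype(int)   # Eq. (9.1): F_hat(u, v)

def dequantize(F_hat):
    return F_hat * Q_LUM                      # approximate reconstruction of F(u, v)

F = np.full((8, 8), 30.0)                     # toy DCT coefficients
print(quantize(F))                            # many high-frequency entries become 0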
An 8 × 8 block from the Y image of Lena









Fig. 9.2: JPEG compression for a smooth image block.
200 202 189 188 189 175 175 175
200 203 198 188 189 182 178 175
203 200 200 195 200 187 185 175
200 200 200 200 197 187 187 187
200 205 200 200 195 188 187 175
200 200 200 200 200 190 187 175
205 200 199 200 191 187 187 175
210 200 200 200 188 185 187 186
f(i, j)
515 65 -12 4 1 2 -8 5
-16 3 2 0 0 -11 -2 3
-12 6 11 -1 3 0 1 -2
-8 3 -4 2 -2 -3 -5 -2
0 -2 7 -5 4 0 -1 -4
0 -3 -1 0 4 1 -1 0
3 -2 -3 3 3 -1 -1 3
-2 5 -2 4 -2 2 -3 0
F(u, v)
Fig. 9.2 (contd): JPEG compression for a smooth image block.
Another 8 × 8 block from the Y image of Lena









Fig. 9.3: JPEG compression for a textured image block.
70 70 100 70 87 87 150 187
85 100 96 79 87 154 87 113
100 85 116 79 70 87 86 196
136 69 87 200 79 71 117 96
161 70 87 200 103 71 96 113
161 123 147 133 113 113 85 161
146 147 175 100 103 103 163 187
156 146 189 70 113 161 163 197
f(i, j)
-80 -40 89 -73 44 32 53 -3
-135 -59 -26 6 14 -3 -13 -28
47 -76 66 -3 -108 -78 33 59
-2 10 -18 0 33 11 -21 1
-1 -9 -22 8 32 65 -36 -1
5 -20 28 -46 3 24 -30 24
6 -20 37 -28 12 -35 33 17
-5 -23 33 -30 17 -5 -4 20
F(u, v)
Fig. 9.3 (contd): JPEG compression for a textured image block.
Run-length Coding (RLC) on AC coefficients
RLC aims to turn the values into sets {#-zeros-to-skip, next
non-zero value}.

To make it most likely to hit a long run of zeros: a zig-zag scan is
used to turn the 8×8 matrix into a 64-vector.









Fig. 9.4: Zig-Zag Scan in JPEG.
DPCM on DC coefficients
The DC coefficients are coded separately
from the AC ones. Differential Pulse Code
modulation (DPCM) is the coding method.

If the DC coefficients for the first 5 image
blocks are 150, 155, 149, 152, 144, then
the DPCM would produce 150, 5, −6, 3, −8,
assuming d_i = DC_{i+1} − DC_i and d_0 = DC_0.
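A minimal sketch of this DC-coefficient DPCM and its inverse:

def dpcm(dc_values):
    # First value kept as-is, then successive differences.
    return [dc_values[0]] + [b - a for a, b in zip(dc_values, dc_values[1:])]

def undo_dpcm(diffs):
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out

codes = dpcm([150, 155, 149, 152, 144])
print(codes)                                       # [150, 5, -6, 3, -8]
assert undo_dpcm(codes) == [150, 155, 149, 152, 144]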
Entropy Coding
The DC and AC coefficients finally undergo an entropy coding step
to gain a possible further compression.

Use DC as an example: each DPCM coded DC coefficient is
represented by (SIZE, AMPLITUDE), where SIZE indicates how
many bits are needed for representing the coefficient, and
AMPLITUDE contains the actual bits.

In the example we are using, the codes 150, 5, −6, 3, −8 will be turned
into

(8, 10010110), (3, 101), (3, 001), (2, 11), (4, 0111).

SIZE is Huffman coded, since smaller SIZEs occur much more often.
AMPLITUDE is not Huffman coded; its value can change widely, so
Huffman coding has no appreciable benefit.
Video Compression Standards
Demand for efficient video compression comes from:
mobile phones and digital video players;
service providers, such as online video storage and telecommunications companies;
the video surveillance industry, where there are demands for high
frame rates and high resolution, such as in the surveillance of
highways, airports and casinos, where the use of 30/25 frames per
second is the norm;
the adoption of megapixel cameras, since highly efficient
compression technology can reduce the large file sizes and bit rates
generated without compromising image quality.
Video Compression
Video compression is about reducing and removing
redundant video data
The process involves applying an algorithm to the source
video to create a compressed file
To play the compressed file, an inverse algorithm is
applied to produce a video.
The time it takes to compress, send, decompress and
display a file is called latency.
Different video compression standards utilize different
methods of reducing data, and the results differ in
Bit rate
quality
latency
Frames
I-frames, P-frames and B-frames

Basic Methods of Reducing Data
Within an image frame, removing unnecessary information
In a series of frames, video data can be reduced by such
methods as difference coding
In difference coding, a frame is compared with a reference
frame (i.e. earlier I- or P-frame) and only pixels that have
changed with respect to the reference frame are coded
The amount of encoding can be further reduced if detection
and encoding of differences is based on blocks of pixels
(macroblocks) rather than individual pixels
Example
Original image / Encoded image
Video Coding
Difference coding, however, would not significantly reduce
data if there is a lot of motion in a video
Block-based motion compensation can be used.
Block-based motion compensation takes into account that
much of what makes up a new frame in a video sequence
can be found in an earlier frame, but in a different location.
This technique divides a frame into a series of macroblocks.
Block by block, a new frame (for instance, a P-frame) can
be composed or predicted by looking for a matching block
in a reference frame. If a match is found, the encoder simply
codes the position where the matching block is to be found
in the reference frame.
Coding the motion vector, as it is called, takes up fewer bits
than if the actual content of a block were to be coded.
Audio Compression
Compresses without considering the detailed nature of the audio source,
relying only on the perceptual capacity of human hearing.
Sampling rates: 32, 44.1, or 48 kHz.
The key to audio compression is quantization based on the perceptual
capacity of human beings.
With a 6:1 compression ratio, 16 bits per sample and a 48 kHz sampling
rate, the result is perceptually lossless.

Compression Process
Filter Bank
Analyzes the frequency (spectral) components of the audio signal
by calculating a frequency transform of a window of signal values.
Decomposes the signal into subbands by using a bank of filters
(Layers 1 & 2: quadrature-mirror; Layer 3: adds a DCT;
psychoacoustic model: Fourier transform).

Psychoacoustics
The range of human hearing is about 20 Hz to about 20
kHz
The dynamic range, the ratio of the maximum sound
amplitude to the quietest sound that humans can hear, is
on the order of about 120 dB
A strong sound signal can temporarily mask (render inaudible)
weaker signals that occur near it in time or frequency.
The masking ability of a given signal depends on its
frequency position and its loudness.
Frequency Masking
Lossy audio data compression methods, such as MPEG/Audio
encoding, remove some sounds which are masked anyway

The general situation in regard to masking is as follows:

A lower tone can effectively mask (make us unable to hear) a higher
tone

The reverse is not true: a higher tone does not mask a lower tone well.

The greater the power in the masking tone, the wider is its influence,
i.e. the broader the range of frequencies it can mask.

As a consequence, if two tones are widely separated in frequency then
little masking occurs
Fig. 14.4: Effect of masking tone at three different frequencies
Critical Bands
Critical bandwidth represents the ear's resolving power for
simultaneous tones or partials.

At the low-frequency end, a critical band is less than 100 Hz
wide, while for high frequencies the width can be greater than
4 kHz

Experiments indicate that the critical bandwidth:

for masking frequencies < 500 Hz: remains approximately
constant in width ( about 100 Hz)
for masking frequencies > 500 Hz: increases approximately
linearly with frequency
Table 14.1 25-Critical Bands and Bandwidth
Temporal Masking
Phenomenon: any loud tone will cause
the hearing receptors in the inner ear to
become saturated and require time to
recover

The following figures show the results of
Masking experiments:
Fig. 14.6: The louder is the test tone, the shorter it takes for our
hearing to get over hearing the masking.
Fig. 14.7: Effect of temporal and frequency maskings
depending on both time and closeness in frequency.
Fig. 14.8: For a masking tone that is played for a longer time, it takes
longer before a test tone can be heard. Solid curve: masking tone
played for 200 msec; dashed curve: masking tone played for 100 msec.
MPEG Audio Strategy (contd)
Frequency masking: by using a psychoacoustic
model to estimate the just noticeable noise level:
Encoder balances the masking behavior and the
available number of bits by discarding inaudible
frequencies
Scaling quantization according to the sound level that is
left over, above masking levels

May take into account the actual width of the critical
bands:
For practical purposes, audible frequencies are divided
into 25 main critical bands (Table 14.1)
To keep simplicity, adopts a uniform width for all
frequency analysis filters, using 32 overlapping subbands
Basic Algorithm (contd)
The algorithm proceeds by dividing the input into 32
frequency subbands, via a filter bank
A linear operation taking 32 PCM samples, sampled in time;
output is 32 frequency coefficients

In the Layer 1 encoder, the sets of 32 PCM values are first
assembled into a set of 12 groups of 32s
an inherent time lag in the coder, equal to the time to
accumulate 384 (i.e., 12 × 32) samples

Fig.14.11 shows how samples are organized
A Layer 2 or Layer 3 frame actually accumulates more than 12
samples for each subband: a frame includes 1,152 samples
Fig. 14.11: MPEG Audio Frame Sizes
Bit Allocation Algorithm
Aim: ensure that all of the quantization noise is below the masking
thresholds

One common scheme:
For each subband, the psychoacoustic model calculates the Signal-to-
Mask Ratio (SMR)in dB
Then the Mask-to-Noise Ratio (MNR) is defined as the difference (as
shown in Fig. 14.12):

    MNR_dB = SNR_dB − SMR_dB    (14.6)

The lowest MNR is determined, and the number of code-bits allocated
to this subband is incremented.

Then a new estimate of the SNR is made, and the process iterates
until there are no more bits to allocate.
Fig. 14.12: MNR and SMR. A qualitative view of SNR,
SMR and MNR is shown, with one dominant masker
and m bits allocated to a particular critical band.
