
Advanced Web Programming

XML

XML Dr. Mbale J. 1


What Is XML?
• eXtensible Markup Language
• A simplified version of SGML
• Maintains the most useful parts of SGML
• Designed so that SGML can be delivered over the
Web
• More flexible and adaptable than HTML
• XHTML -- a reformulation of HTML 4 in XML 1.0



What is XML? (contd.)
• A markup language is used to provide
information about a document.
• Tags are added to the document to provide
the extra information.
• HTML tags tell a browser how to display
the document.
• XML tags give a reader some idea what
some of the data means.



What is XML Used For?
• XML documents are used to transfer data from one
place to another, often over the Internet.
• XML subsets are designed for particular
applications.
• One is RSS (Rich Site Summary or Really Simple
Syndication). It is used to send breaking news
bulletins from one web site to another.
• A number of fields have their own subsets. These
include chemistry, mathematics, and book
publishing.
• Most of these subsets are registered with the
W3C (World Wide Web Consortium) and are available for anyone's use.



Advantages of XML
• XML is text (Unicode) based.
  - Takes up less space.
  - Can be transmitted efficiently.
• One XML document can be displayed
differently in different media.
  - HTML, video, CD, DVD
  - You only have to change the XML document in
    order to change all the rest.
• XML documents can be modularized. Parts
can be reused.



Difference between XML and HTML

XML was designed to carry data, not to display data

• XML is not a replacement for HTML.

• Different goals:
XML was designed to describe data and to focus on what data is.
HTML was designed to display data and to focus on how data
looks.

• HTML is about displaying information, XML is about
describing information.



Difference Between HTML and XML (contd.)

• HTML tags have a fixed meaning and browsers
know what it is.
• XML tags are different for different applications,
and users know what they mean.
• HTML tags are used for display.
• XML tags are used to describe documents and
data.



XML comes from SGML

• Standard Generalized Markup Language
  - Based on IBM's GML (Goldfarb, et al.)
  - ISO standard (ISO 8879) since 1986
  - Used for large-scale document management
    (Boeing 747 user's manual)
  - Expensive, complex to implement
  - Not Web-friendly (no “well-formed” SGML)
  - Too many options (e.g., tag minimization)

XML, HTML, & XHTML

• HTML: display-oriented, SGML-based
scheme for making Web pages
  - Syntax & allowed elements (semantics) are fixed
• XML: set of rules for defining markup
schemes
  - Element set is fully extensible
  - Syntax is fixed
• XHTML: HTML modified to be XML-compliant
(not just SGML-compliant)
Markup languages compared
• XML syntax is stricter than HTML or SGML
  - Must explicitly close all elements
  - Attribute values must be enclosed in quotes
  - All markup is case-sensitive
• XML & SGML: no fixed tags, no predefined
style
• XML & SGML are extensible
  - Fixed elements (HTML) vs. rules (XML, SGML)
  - HTML elements describe how to present content
  - XML elements can describe the content itself

A different XML for every
community

• XML is a set of rules used for defining &
encoding intellectual structures
• XML is extensible & customizable
  - Its greatest strength
  - Its greatest weakness
• HTML was invented by physicists
  - What if it had been lawyers, or teachers, or
    bureaucrats, or librarians, or …?

XML Rules
• Tags are enclosed in angle brackets.
• Tags come in pairs with start-tags and end-tags.
• Tags must be properly nested.
  - <name><email>…</name></email> is not allowed.
  - <name><email>…</email></name> is.
• Tags that do not have end-tags must be terminated
by a '/'.
  - <br /> is an XHTML example.
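The nesting and termination rules can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: the inner element closes before the outer one.
good = "<name><email>a@example.org</email></name>"
# Overlapping tags: the outer element closes first, which is an error.
bad = "<name><email>a@example.org</name></email>"

print(is_well_formed(good))
print(is_well_formed(bad))
# An element without an end-tag must be written as <br />.
print(is_well_formed("<p>line<br /></p>"))
print(is_well_formed("<p>line<br></p>"))
```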



More XML Rules
• Tags are case sensitive.
  - <address> is not the same as <Address>
• Names beginning with "xml", in any combination
of cases, are reserved and may not be used for
your own tags.
• Tag names may not contain '<' or '&'.
• Tag names roughly follow Java naming conventions,
except that a colon (reserved for namespaces),
hyphens, and periods are also allowed. They must
begin with a letter or underscore and may not
contain white space.
• Documents must have a single root element
that contains all the others.
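A short sketch of how a parser reacts to two of these rules (case sensitivity and the single root), assuming Python's standard xml.etree.ElementTree; the helper name parses and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if the text is accepted as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag differ only in case, so they do not match.
case_mismatch = "<address>12 Main St</Address>"
# Two top-level elements: there is no single root.
two_roots = "<a/><b/>"
# The same two elements wrapped in one root element.
one_root = "<doc><a/><b/></doc>"

for sample in (case_mismatch, two_roots, one_root):
    print(sample, "->", parses(sample))
```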



Encoding
• XML (like Java) uses Unicode to encode characters.
• Unicode comes in several encoding forms. The most
common one used in the West is UTF-8.
• UTF-8 is a variable-length code. Characters are
encoded in 1, 2, 3, or 4 bytes.
• The first 128 characters in Unicode are ASCII;
UTF-8 encodes each of them in a single byte.
• The code points between 128 and 255 cover
some of the more common characters used in
western Europe, such as ã, á, å, or ç. UTF-8
encodes them in two bytes.
• Two-byte codes also cover other alphabets, such as
Greek, Cyrillic, Hebrew, and Arabic.
• Three-byte codes cover the rest of the Basic
Multilingual Plane, including most Asian ideographs.
• Four-byte codes handle the remaining characters,
such as rarer ideographs.
• Those working mainly in non-western languages may
want to investigate other encodings such as UTF-16.
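The variable-length behaviour of UTF-8 is easy to observe directly. A small sketch in Python; the four sample characters are chosen here for illustration and are not from the slides.

```python
# One sample character for each UTF-8 length:
# ASCII letter, Latin-1 letter, CJK ideograph, emoji.
samples = ["A", "\u00e7", "\u4e2d", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```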



An example of XML
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
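To show how an application might consume this document, here is a minimal sketch assuming Python's standard xml.etree.ElementTree. The document is passed as bytes so that the parser honours the declared ISO-8859-1 encoding; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The note document from the slide, as bytes.
note_bytes = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

note = ET.fromstring(note_bytes)

# Each child element becomes a (tag, text) pair.
fields = {child.tag: child.text for child in note}
for tag, text in fields.items():
    print(f"{tag}: {text}")
```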



Overview of Main Features of XML

XML DTD
<!ELEMENT Book (Each)+>
<!ELEMENT Each (Title, Abs)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Abs (#PCDATA)>
<!ATTLIST Abs Lang (K|E|O) "K">

XML File
<Book>
  <Each>
    <Title Lang="K"> Dr.Q</Title>
    <Abs Lang="E">aaaaaa
    </Abs>
  </Each>
</Book>

HTML DTD
<!ELEMENT Table (Tr, Td )*>
<!ELEMENT Tr (#PCDATA)>
<!ELEMENT Td (#PCDATA)>

HTML File
<table>
  <tr>
    <td>Dr.Q</td>
    <td>aaaaaa</td>
  </tr>
</table>

[Figure labels: HTML file -> HTML browser; XML file -> DB/application, XML browser]



Why Is XML Important?
• Plain Text
  - Easy to edit
  - Useful for storing small amounts of data
  - Possible to efficiently store large amounts of
    XML data through an XML front end to a
    database
• Data Identification
  - Tells you what kind of data you have
  - Can be used in different ways by different
    applications



Why Is XML Important? (contd.)
• Stylability
  - Inherently style-free
  - XSL: Extensible Stylesheet Language
  - Different XSL formats can then be used to display the
    same data in different ways

• Inline Reusability
  - Can be composed from separate entities
  - Modularize your documents without resorting to links



Why is XML important? (contd.)
• Linkability: XLink and XPointer
  - Simple unidirectional hyperlinks
  - Two-way links
  - Multiple-target links
  - “Expanding” links

• Easily Processed
  - Regular and consistent notation
  - Vendor-neutral standard



Why is XML important? (contd.)
• Hierarchical
  - Faster to access
  - Easier to rearrange



Markup Language
A markup language must specify
• What markup is allowed
• What markup is required
• How markup is to be distinguished from text
• What the markup means

* XML itself only specifies the first three; what the markup means is left to the application (a DTD constrains structure, not meaning)



SGML(ISO 8879)
• Standard Generalized Markup Language
• The international standard for defining descriptions of
structure and content in text documents

• Interchangeable: device-independent, system-independent

• tags are not predefined

• Using DTD to validate the structure of the document

• Large, powerful, and very complex

• Heavily used in industry and commerce for over a decade



HTML(RFC 1866)
• HyperText Markup Language

• A small SGML application used on the Web (a DTD and
a set of processing conventions)

• Can only use a predefined set of tags



XML Specifications
• XML 1.0
Defines the syntax of XML

• XPointer, XLink
Defines a standard way to represent links between resources

• XSL
Defines the standard stylesheet language for XML



Ordered hierarchies of content objects
• Premise: A text is the sum of its component
parts
  - A <Book> could be defined as containing:
    <FrontMatter>, <Chapter>s, <BackMatter>
  - <FrontMatter> could contain:
    <BookTitle> <Author>s <PubInfo>
  - A <Chapter> could contain:
    <ChapterTitle> <Paragraph>s
  - A <Paragraph> could contain:
    <Sentence>s or <Table>s or <Figure>s …
• Components chosen should reflect anticipated
use

Ordered hierarchies of content
objects
• OHCO is a useful, albeit imperfect, model
  - Exposes an object's intellectual structure
  - Supports reuse & abstraction of components
  - Better than a bit-mapped page image
  - Better than a model of text as a stream of
    characters plus formatting instructions
  - Data management system for document-like
    objects
  - Does not allow overlapping content objects
  - Incomplete; requires infrastructure
Content objects in a book
Book
  FrontMatter
    BookTitle
    Author(s)
    PubInfo
  Chapter(s)
    ChapterTitle
    Paragraph(s)
  BackMatter
    References
    Index

Content objects in a catalog card
Card
  CallNumber
  MainEntry
  TitleStatement
    TitleProper
    StatementOfResponsibility
  Imprint
  SummaryNote
  AddedEntrySubject(s)
  AddedEntryPersonalName(s)

A simple XML fragment
<Book>
  <FrontMatter>
    <BookTitle>XML Is Easy</BookTitle>
    <Author>Tim Cole</Author>
    <Author>Tom Habing</Author>
    <PubInfo>CDP Press, 2002</PubInfo>
  </FrontMatter>
  <Chapter>
    <ChapterTitle>First Was SGML</ChapterTitle>
    <Paragraph>Once upon a time …</Paragraph>
  </Chapter>
</Book>
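The ordered hierarchy in this fragment can be recovered by walking the parsed tree. A minimal sketch, assuming Python's standard xml.etree.ElementTree; the outline helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

book_xml = """<Book>
  <FrontMatter>
    <BookTitle>XML Is Easy</BookTitle>
    <Author>Tim Cole</Author>
    <Author>Tom Habing</Author>
    <PubInfo>CDP Press, 2002</PubInfo>
  </FrontMatter>
  <Chapter>
    <ChapterTitle>First Was SGML</ChapterTitle>
    <Paragraph>Once upon a time …</Paragraph>
  </Chapter>
</Book>"""


def outline(element, depth=0):
    """Return the element names in document order, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(book_xml)
print("\n".join(outline(root)))

# Repeated elements keep their document order.
authors = [author.text for author in root.iter("Author")]
print(authors)
```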

Terminology
• Document instance
• Document class
• Document Type Definition (DTD), or
schema
• Well-formed XML
• Valid XML
• Stylesheets
• XML Transformations
• Document Object Model (DOM)

What's it good for?
Discussion points:
• Smarter documents
• Full text
• Metadata
• Machine-to-machine interactions

Using XML for metadata

• Consistency in applying schema
  - Optional versus required elements
  - Consistent use of elements
  - Granularity & depth of information
• XML schemas still evolving
  - Attributes versus elements
  - Mixing namespaces
  - Schema languages
  - Philosophical issues

Machine-to-machine interactions

• Web services
  - Facilitating machine-to-machine communications
    via XML
  - Simple Object Access Protocol (SOAP)
  - XML Protocol Working Group
• Semantic Web
  - Abstract representation of data on the Web
• XML and Databases

How does it work?
In XML, there's content and there's markup.
• Markup
  - Elements
  - Attributes
  - Comments
  - Processing instructions
• Content
  - Entities
  - Encoded (Unicode) characters

Elements
Elements are markup that enclose content
• <element_name>…</element_name>
or <element_name />
• Content models
  - Parsed Character Data Only
  - Child Elements Only
  - Mixed

<author>Cole, T</author>

Elements (contd.)
• Element
  - Delimited by angle brackets
  - Identifies the nature of the content it
    surrounds
  - General format: <element> … </element>
  - Empty element: <empty-element />

Attributes

Associate a name-value pair with an element


• <tag name1="value1" name2='value2'>…</tag>
  - Can be used to embellish content…
  - or to associate added content to an element

<author order='1'>Cole, T</author>


<author name='Habing, T' />
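A brief sketch of how these attributes look from a program, assuming Python's standard xml.etree.ElementTree. The wrapping <authors> element is hypothetical and added only so that the two examples form one well-formed document.

```python
import xml.etree.ElementTree as ET

# <authors> is a made-up wrapper; the two <author> elements are from the slide.
doc = ET.fromstring("""<authors>
  <author order='1'>Cole, T</author>
  <author name='Habing, T' />
</authors>""")

first, second = doc.findall("author")

# Attribute used to embellish content: the name is the element's text.
print(first.get("order"), first.text)

# Attribute used to carry the content itself: the element is empty.
print(second.get("name"), second.text)

# A missing attribute yields the supplied default instead of an error.
print(second.get("order", "unspecified"))
```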

Attributes (contd.)

• Attribute
  - Name-value pairs that occur inside start-tags after the
    element name, like: <element attribute="value">

Comments
Human-readable annotations
• Can be inserted anywhere after headers
• Not part of the document structure
• Usually ignored by XML parsers
• Do not have to be passed to application

<!-- This is a comment -->

Processing instructions
Machine-readable & application-specific
• Must be passed through by XML Parsers
• XML Declaration is a special PI
• When present, the XML Declaration must be the very first thing in the file

<?xml version='1.0' encoding='UTF-8' ?>

<?MyApp indent='on' linefeeds='off' ?>

Entities
• Placeholders for internal or external content
  - Placeholder for a single character…
  - or string of text…
  - or external content (images, audio, etc.)
• Implementation specifics may vary

<!ENTITY copyright "&#xA9;" >


&copyright; is replaced by ©
<!ENTITY pic SYSTEM "mugshot.gif" NDATA gif >
&pic; is replaced by graphic image
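The internal-entity case can be tried out with a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree, which expands entities declared in the internal DTD subset; the <notice> element and the year are made up for this example, and the external (NDATA) case is not covered.

```python
import xml.etree.ElementTree as ET

# The entity declaration sits in the internal DTD subset of the document.
DOC_TEXT = """<!DOCTYPE notice [
  <!ENTITY copyright "&#xA9;">
]>
<notice>&copyright; 2002 &#169;</notice>"""

notice = ET.fromstring(DOC_TEXT)

# Both the named entity and the numeric character reference
# arrive in the application as the copyright sign itself.
print(notice.text)
```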

Character Encoding Issues

• XML Parsers must accept UTF-8 & UTF-16
• Also must accept character references: &#nnnn; (decimal) or &#xhhhh; (hexadecimal)
• MARC-8 encodings must be converted to
Unicode for use in XML

http://lcweb.loc.gov/marc/specifications/specchartables.html

XML schema language

• New in XML
  - Uses XML syntax
  - Supports datatyping
  - Richer and more complex
<book xsi:noNamespaceSchemaLocation='HTTP://…'>
<xsd:element name='Book'>
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name='Front' minOccurs='1'
                   maxOccurs='1'
                   type='frontType'/>…

Namespaces

• Qualify element and attribute names
• Allow modularization of schemas
  - Mix and match elements from multiple schemas
    in document instances
  - Import or include from one XML Schema into
    another

<oai:metadata xmlns:oai='http:…' xmlns:oai_dc='…' xmlns:dc='…'>
  <oai_dc:dc>
    <dc:title>…</dc:title>
    <dc:creator>…</dc:creator>
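A sketch of how namespace-qualified names look to a program, assuming Python's standard xml.etree.ElementTree. The slide elides the namespace URIs, so the example.org URIs below are placeholders invented for this example, as are the title and creator values.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs: the real ones are elided in the slide.
NS = {
    "oai": "http://example.org/ns/oai",
    "oai_dc": "http://example.org/ns/oai_dc",
    "dc": "http://example.org/ns/dc",
}

record = """<oai:metadata xmlns:oai='http://example.org/ns/oai'
                          xmlns:oai_dc='http://example.org/ns/oai_dc'
                          xmlns:dc='http://example.org/ns/dc'>
  <oai_dc:dc>
    <dc:title>XML Is Easy</dc:title>
    <dc:creator>Tim Cole</dc:creator>
  </oai_dc:dc>
</oai:metadata>"""

root = ET.fromstring(record)

# The parser replaces each prefix with its URI ("{uri}localname").
print(root.tag)

# Lookups use a prefix-to-URI mapping, so prefixes stay readable.
title = root.find("oai_dc:dc/dc:title", NS)
creator = root.find("oai_dc:dc/dc:creator", NS)
print(title.text, "/", creator.text)
```

The prefix itself carries no meaning; two documents may use different prefixes for the same URI and still refer to the same element names.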

XML & Cascading Style Sheets

• Attach styling instructions directly to XML
files
  - <?xml-stylesheet href="http:…" type="text/css" ?>
  - Supported by newest browsers: IE5+, Mozilla, Opera
• Can style but not rearrange elements
  - Block or inline style
  - Bold, italic, underline, font, color, etc.
  - Margins, positioning
  - Generated content (browser support not good)

front author {color:red; font-weight:bold; font-family:serif;}

XSLT — Transforming Stylesheets

• Language for transforming XML documents
  - Into HTML, text, or other XML documents
  - Supported in new browsers (IE5+, Mozilla; not
    Opera)
  - Usually applied on the server or in batch mode
• Valuable for interoperability or reusability
<xsl:template match='//author'>
  <xsl:element name='dc:creator'>
    <xsl:value-of select='lastname'/>
    <xsl:text>, </xsl:text>
    <xsl:value-of select='firstname'/>
  </xsl:element>
</xsl:template>
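Python's standard library has no XSLT processor, so the sketch below does not run the template. It hand-codes the same mapping (each author becomes a dc:creator holding "lastname, firstname") with xml.etree.ElementTree to show what the transformation produces. The namespace URI, the source document, and the to_creator helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for the dc prefix (not taken from the slides).
DC = "http://example.org/ns/dc"
ET.register_namespace("dc", DC)

source = ET.fromstring(
    "<front>"
    "<author><firstname>Tim</firstname><lastname>Cole</lastname></author>"
    "<author><firstname>Tom</firstname><lastname>Habing</lastname></author>"
    "</front>"
)


def to_creator(author):
    """Mimic the template: emit lastname, a comma and space, then firstname."""
    creator = ET.Element(f"{{{DC}}}creator")
    creator.text = f"{author.findtext('lastname')}, {author.findtext('firstname')}"
    return creator


# match='//author' selects every author element in the document.
creators = [to_creator(author) for author in source.iter("author")]
for creator in creators:
    print(ET.tostring(creator, encoding="unicode"))
```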

XPath, XPointer, & XLink

• XPath
  - Allows addressing of parts of an XML document
  - Used in XSLT, XPointer, and XQuery
  - /document/front/author/@number
• XPointer (working draft)
  - Used as a fragment id in an XML URI reference
  - http://.../some.xml#xpointer(/document/front/author)
• XLink
  - Creates and describes extended or simple links between resources
  - Used for HTML-style hrefs or imgs, tables of contents, etc.

<aulink xlnk:type="simple" xlnk:href="…"
        xlnk:actuate="onRequest">
Cole, T
</aulink>
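XPath-style addressing can be tried with Python's standard xml.etree.ElementTree, which implements only a limited subset of XPath. It cannot select attribute nodes such as /document/front/author/@number directly, so this sketch selects the elements and then reads the attribute. The sample document and its values are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<document>
  <front>
    <author number="1">Cole, T</author>
    <author number="2">Habing, T</author>
    <author>Unnumbered, A</author>
  </front>
</document>""")

# Paths are relative to the root element, hence "./front/author".
# The [@number] predicate keeps only authors that carry the attribute.
numbered = doc.findall("./front/author[@number]")
numbers = [author.get("number") for author in numbered]
print(numbers)

# A predicate can also test the attribute's value.
second = doc.find("./front/author[@number='2']")
print(second.text)
```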

XQuery (XML query language)

• Treat an XML document or collection of
documents as a database
• Equivalent to SQL SELECT statements,
only for XML
• Some support in XML databases
(but working draft only)

Programming standards
• “Platform- and language-neutral interfaces that allow
programs and scripts to dynamically access and update the
content, structure, and style of XML documents.”
• Document Object Model (DOM)
  - Object-based
  - Better for complex documents
  - High memory usage, slower
  - Documents can be updated
• Simple API for XML (SAX)
  - Event-based
  - Better for simple documents
  - Low memory usage, faster
  - Documents cannot be updated
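The contrast between the two models can be sketched with Python's standard library, using xml.dom.minidom for DOM and xml.sax for SAX. The sample document and the ChapterCounter class are illustrative assumptions.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = "<Book><Chapter>One</Chapter><Chapter>Two</Chapter></Book>"

# DOM: the whole document is loaded as a tree of objects,
# which can then be navigated and updated in place.
dom = minidom.parseString(XML_TEXT)
chapters = dom.getElementsByTagName("Chapter")
chapters[0].firstChild.data = "Intro"
dom_result = dom.documentElement.toxml()
print(dom_result)


# SAX: the parser reports events as it reads; nothing is kept
# unless the handler chooses to remember it.
class ChapterCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "Chapter":
            self.count += 1


counter = ChapterCounter()
xml.sax.parseString(XML_TEXT.encode("utf-8"), counter)
print(counter.count)
```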

Other XML-related standards
• XBase
• XForms
• XML Encryption
• XML Signature
• Many more …

XML authoring tools
• XML editors
  - XMetaL (Corel/SoftQuad)
  - Epic Editor (Arbortext)
  - TurboXML (Tibco Extensibility)
• Standard Office Tools
  - WordPerfect 2002 (Corel)
  - Microsoft Office XP
  - OpenOffice
• Plain Text Editors

Other XML tools
• Validating parsers & transformation tools
  - MSXML (Microsoft)
  - Xerces, Xalan (Apache Software Foundation)
  - XSV (U. of Edinburgh)
• Document management & database tools
  - Tamino (Software AG)
  - XMLCanon/Developer (Tibco/Extensibility)
  - DLXS/XPAT (U. of Michigan/OpenText)
• XML-aware browsers

XML resources on the Web

• World Wide Web Consortium
• OASIS
• Microsoft Developer Network
• Sun Microsystems
• Apache XML Project
• XML.COM (O'Reilly)
• XML.ORG (OASIS)
• ZVON.ORG

The infrastructure of XML

• Required to make it work…
  - DTDs & schemas: defining document classes
  - Reusing & integrating schemas (using namespaces)
  - Stylesheets for presentation & transformation
  - Standards for linking, querying, & pointing
  - Programming standards

Document Type Definitions (DTD)

• Legacy from SGML; part of XML standard

<!DOCTYPE Book SYSTEM 'http://…'>

<!ELEMENT Book (Front, Chapter+, Back?)>
<!ATTLIST Book
  type (series|monograph) #REQUIRED>

Document Type Definitions (contd.)
• A DTD describes the tree structure of a document
and something about its data.
• There are two data types, PCDATA and CDATA.
  - PCDATA is parsed character data.
  - CDATA is character data, not usually parsed.
• A DTD determines how many times a node may
appear, and how child nodes are ordered.



DTD for address Example (contd.)
<!ELEMENT address (name, email, phone, birthday)>
<!ELEMENT name (first, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT birthday (year, month, day)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT month (#PCDATA)>
<!ELEMENT day (#PCDATA)>
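Python's standard library does not validate against DTDs, so the sketch below is not a DTD validator. It hand-copies the content models above into a dictionary and checks child order with xml.etree.ElementTree, to illustrate what "conforms to the DTD" means for this example. The CONTENT_MODEL table, the conforms helper, and the sample address data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Child sequences copied from the DTD; None marks a #PCDATA leaf.
CONTENT_MODEL = {
    "address": ["name", "email", "phone", "birthday"],
    "name": ["first", "last"],
    "birthday": ["year", "month", "day"],
    "first": None, "last": None, "email": None, "phone": None,
    "year": None, "month": None, "day": None,
}


def conforms(element):
    """Check that children appear exactly in the declared order."""
    if element.tag not in CONTENT_MODEL:
        return False
    expected = CONTENT_MODEL[element.tag]
    children = list(element)
    if expected is None:
        # #PCDATA elements hold text only, no child elements.
        return len(children) == 0
    if [child.tag for child in children] != expected:
        return False
    return all(conforms(child) for child in children)


valid_doc = ET.fromstring(
    "<address><name><first>Ada</first><last>Lovelace</last></name>"
    "<email>ada@example.org</email><phone>555-0100</phone>"
    "<birthday><year>1815</year><month>12</month><day>10</day></birthday>"
    "</address>"
)
# Same record without the required <phone> element.
missing_phone = ET.fromstring(
    "<address><name><first>Ada</first><last>Lovelace</last></name>"
    "<email>ada@example.org</email>"
    "<birthday><year>1815</year><month>12</month><day>10</day></birthday>"
    "</address>"
)

print(conforms(valid_doc))
print(conforms(missing_phone))
```

Both sample documents are well-formed; only the first also matches the declared structure, which is the difference between well-formed and valid XML.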



Revision



XML Syntax
• All XML elements must have a closing tag
• XML tags are case sensitive
• All XML elements must be properly nested
• All XML documents must have a root tag
• Attribute values must always be quoted
• With XML, white space is preserved
• With XML, a new line is always stored as LF
• Comments in XML: <!-- This is a comment -->



XML Elements

• XML Elements are Extensible
  XML documents can be extended to carry more information
• XML Elements have Relationships
  Elements are related as parents and children
• Elements have Content
  Elements can have different content types: element content,
  mixed content, simple content, or empty content, and can also have attributes
• XML elements must follow the naming rules



XML Attributes
• Located in the start tag of elements
• Provide additional information about elements
• Often provide information that is not a part of data
• Must be enclosed in quotes
• Should I use an element or an attribute?
  As a rule of thumb, metadata (data about data) should be stored as
  attributes, and the data itself should be stored as elements



XML Validation
• "Well Formed" XML document
  - Has correct XML syntax
• "Valid" XML document
  - Is “well formed”
  - Conforms to the rules of a DTD (Document Type
    Definition)
• XML DTD
  - Defines the legal building blocks of an XML document
  - Can be inline in the XML document or an external reference
• XML Schema
  - An XML-based alternative to DTD, more powerful
  - Supports namespaces and data types



Applications

• Data exchange applications
  - EDI, RDF, MCF ...
• Document publishing applications
  - Markup of semi-structured data



Data exchange application

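[Figure: records from a database (DB, RDB) exchanged as XML]

DB tables:
  account  332        Peter  332
  R&D      567        Jim    332
                      Sue    912

XML:
<telephone number="332">
  <dept>account</dept>
  <person>Peter</person>
  <person>Jim</person>
</telephone>

Other labels in the figure: 332 (account; Peter, Jim), 567 (R&D)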
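A sketch of the exchange step in the figure: rows taken from a database are serialized as one XML record per telephone number, using Python's standard xml.etree.ElementTree. Reading the figure as two tables (departments and people keyed by number) is an assumption, as is the telephone_record helper.

```python
import xml.etree.ElementTree as ET

# Assumed reading of the figure: two tables keyed by telephone number.
departments = {"account": "332", "R&D": "567"}
people = [("Peter", "332"), ("Jim", "332"), ("Sue", "912")]


def telephone_record(number):
    """Build one <telephone> element for the given number."""
    record = ET.Element("telephone", number=number)
    for dept, dept_number in departments.items():
        if dept_number == number:
            ET.SubElement(record, "dept").text = dept
    for person, person_number in people:
        if person_number == number:
            ET.SubElement(record, "person").text = person
    return record


xml_332 = ET.tostring(telephone_record("332"), encoding="unicode")
xml_567 = ET.tostring(telephone_record("567"), encoding="unicode")
print(xml_332)
# The serializer escapes the ampersand in "R&D" automatically.
print(xml_567)
```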



[Figure: Data exchange applications. Components shown: XML, DTD, DBMS, Parser, SAX events, DOM API, DOM tree, XSL Processor, HTML, Browser, and application code (ASP, Java, VB)]

DOM (Document Object Model)
SAX (Simple API for XML)
XSL (eXtensible Stylesheet Language)
ASP (Active Server Pages)



Displaying XML
• XML documents do not carry information about
how to display the data

• We can add display information to XML with
  - CSS (Cascading Style Sheets)
  - XSL (eXtensible Stylesheet Language), preferred



XML Application 1: Separate Data
XML can Separate Data from HTML
• Store data in separate XML files
• Using HTML for layout and display
• Using Data Islands
• Data Islands can be bound to HTML elements

Benefits:
Changes in the underlying data will not require any changes
to your HTML



XML Application 2: Exchange Data
XML is used to Exchange Data
• Text format
• Software-independent, hardware-independent
• Exchange data between incompatible systems, given that
they agree on the same tag definition.
• Can be read by many different types of applications

Benefits:
• Reduce the complexity of interpreting data
• Easier to expand and upgrade a system



XML Application 3: Store Data
XML can be used to Store Data
• Plain text file
• Store data in files or databases
• Application can be written to store and retrieve information
from the store
• Other clients and applications can access your XML files as
data sources

Benefits:
Accessible to more applications



XML Application 4: Create New Languages
XML can be used to Create new Languages
• WML (Wireless Markup Language), used to mark up Internet
applications for handheld devices like mobile phones (WAP)
• MusicXML, used to publish musical scores



XML support in IE 5.0+
Internet Explorer 5.0 has the following XML support:
• Viewing of XML documents
• Full support for W3C DTD standards
• XML embedded in HTML as Data Islands
• Binding XML data to HTML elements
• Transforming and displaying XML with XSL
• Displaying XML with CSS
• Access to the XML DOM (Document Object Model)

* Netscape 6.0 also has full XML support



Microsoft XML Parser
• Comes with IE 5.0
• The parser features a language-neutral
programming model that supports:
  - JavaScript, VBScript, Perl, VB, Java, C++ and more
  - W3C XML 1.0 and XML DOM
  - DTD and validation



Java APIs for XML
• JAXP: Java API for XML Processing
• JAXB: Java Architecture for XML Binding
• JDOM: Java DOM
• DOM4J: an alternative to JDOM
• JAXM: Java API for XML Messaging
(asynchronous)
• JAX-RPC: Java API for XML-based Remote
Procedure Calls (synchronous)
• JAXR: Java API for XML Registries



Conclusion
• XML is a self-descriptive language
• XML is a powerful language for describing structured
data for web applications
• XML is currently applied in many fields
• Many vendors already support or will support
XML
