Web publishing
In this section, we look more closely into the basic concepts behind Web publishing. The
first subsections give basic background information. This information is not essential to
follow the book but it may help you to get the big picture and understand how the various
parts fit together. Such an understanding can be vital in abnormal situations, where
something does not work as it should. The last sections are essential for the remaining
parts of the book.
2.1. HTTP
Today, almost all Web publishing uses HTTP (HyperText Transfer Protocol) [RFC
2616] or its "secured" version HTTPS [RFC 2818], which is HTTP over TLS (Transport
Layer Security) [RFC 2246] or its predecessor SSL (Secure Sockets Layer).
HTTP is a very simple protocol: a client sends a request to a server; the server processes
the request and replies with a response. Requests and responses are sent as messages
over a TCP connection. The message format is essentially MIME (Multipurpose Internet
Mail Extensions) [RFC 2045-2049], a format that is also used to transfer multimedia mail
messages across the Internet. A MIME message consists of a set of headers and a body,
also known as the message entity. The body is optional for HTTP messages. The standard
speaks of it as the HTTP entity. For HTTP, a request line or a response line, respectively, is
prepended to the message.
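The message layout described above can be made concrete with a few lines of Python. This is only a minimal sketch: the function name parse_http_request and the example host are assumptions made for illustration, and a real server has to deal with many details (continuation lines, repeated headers, character sets) that are ignored here.

```python
def parse_http_request(raw: bytes):
    """Split a raw HTTP request into request line parts, headers and body."""
    # An empty line separates the request line and headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The request line: method, resource locator, protocol version.
    method, locator, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, locator, version, headers, body


raw = (b"GET /folder/page.html?lang=en HTTP/1.1\r\n"
       b"Host: www.example.com\r\n"
       b"Accept: text/html\r\n"
       b"\r\n")
method, locator, version, headers, body = parse_http_request(raw)
print(method, locator, version, headers, body)
```

The body is empty in this example, as is usual for GET requests.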
A request line consists of the request method, the resource locator and the protocol
version. A resource can be anything: an HTML page, an image, a file, a database, a
service, an application. It is identified by the resource locator, a path that makes it easy to
locate the resource in a hierarchical structure such as a file system or Zope's folder structure.
HTTP uses the URL syntax for the resource locator. It is up to the receiving HTTP server
to determine which resource the resource locator really identifies.
HTTP knows a set of request methods. The most essential are GET, POST, PUT and HEAD.
GET requests all information about the resource to be transferred in the response. HTML's
link traversal and image references are mapped to GET requests. HTML form submission
may use a GET request. Additional request headers can turn the request into a
conditional request or a request to transfer only part of the information. Both request
modifications are intended to reduce communication traffic. A GET request should not
have side effects and should be idempotent. Idempotence means that repeating a request
has the same effect as issuing it once, so a repetition can be expected to yield the same
response. HTTP clients often use this fact and assume that GET requests can be cached.
They save the response to such a request and take the response from their cache when a
later GET request targets the same resource. For a Zope Web site, many requests do have
side effects. This is especially true if a session management device is employed. In such
cases, it may be necessary to fight the caching behavior.
A POST request sends information to a resource, which the resource should integrate as
subordinate to itself. POST requests usually have side effects. You would use a POST
request, for example, to create a new database record, update the properties of a Zope
object or post a news item to a discussion board. HTML form submission often uses POST
requests.
A PUT request sends information that the server should use to create a resource located by
the request resource locator. If there is already such a resource, it may be overwritten.
PUT requests are not used from HTML. HTML editing tools use PUT requests to publish
an object on a Web site.
A HEAD request is similar to a GET request. It is expected to have no side effects and be
idempotent. It should return the same headers as a GET targeted to the same resource but
should not transfer the message body. HEAD is not used from HTML. It is, however,
used by link validation and indexing tools to efficiently check the existence, type,
size and other meta information of a resource. It is difficult for Zope to meet the
requirements for this request type. Most of its objects are templates that need to be
rendered in order to obtain full header information. Rendering, however, can have
unwanted side effects. Zope, therefore, returns only approximate, sometimes even wrong,
information in response to HEAD requests.
A resource may be finer grained than the location of an object in a hierarchical structure.
Resource locators for GET and HEAD requests may have a trailing query string that
provides additional parameters. The query string is started with a ? which is followed by
a sequence of & separated parameter definitions of the form name=value. For POST and
PUT requests, parameters may be present in the request body. Their packaging there is
indicated by the request's Content-Type header (e.g. application/x-www-form-urlencoded
or multipart/form-data). In Zope, ZPublisher takes care of the
parameters, regardless of whether they are provided as part of the resource locator or in
the request body, and makes them accessible in a standard way.
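To illustrate the idea of a uniform parameter view, here is a small sketch using Python's standard urllib.parse module. It is not ZPublisher's implementation; the function request_parameters and its arguments are hypothetical names chosen for this example, and only the application/x-www-form-urlencoded packaging is handled.

```python
from urllib.parse import parse_qs, urlsplit


def request_parameters(locator, body=b"", content_type=""):
    """Collect parameters from the query string and a urlencoded body."""
    # Parameters after the ? in the resource locator.
    params = parse_qs(urlsplit(locator).query)
    # Parameters packaged in the request body (POST and PUT).
    if content_type == "application/x-www-form-urlencoded":
        for name, values in parse_qs(body.decode("ascii")).items():
            params.setdefault(name, []).extend(values)
    return params


print(request_parameters("/search?q=zope&lang=en"))
print(request_parameters("/add?x=1", b"x=2&name=Bob",
                         "application/x-www-form-urlencoded"))
```

Each name maps to a list of values because a parameter may occur several times in a request.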
A response begins with a response line. This consists of the protocol version, the
numerical status code and the textual status phrase. The status code, a three digit decimal
number, tells the client what happened with the request. The codes are divided into classes
based on the first digit.
The codes 2xx tell the client that the request has been successfully executed. Usually, the
remaining response contains an entity as the primary request result. The usual return code
is 200. Other codes indicate that special information is available in response headers or
that the browser should behave in a special way.
The 3xx class calls for redirections. The request is not completed but requires special
actions from the user or the user agent. For security reasons, HTTP requires that
redirections are only performed automatically for GET and HEAD requests. All other
request types require user confirmation. The redirect method of Zope's RESPONSE
object uses a 302 status code with the Location header set to the new URL. For some
objects, especially files and images, Zope responds with a 304 response to GET requests
made conditional with an If-Modified-Since header, if the object has not been
modified since the given date. In fact, this response is not a redirection. It completes the
response without sending the entity data. Conditional requests of this type are usually
sent for objects in a client's cache when the cache validity should be checked. In this case,
a 304 response indicates that the cache entry is still valid and the request can be
served from the cache. If the object has been changed since the given date, Zope responds
with a 200 response that contains the new information. By default, Zope does not employ
this mechanism for template objects, as their modification date does not determine
whether or not the generated page remains the same. As it is too difficult to get
this right in the general case, Zope always processes such requests as unconditional.
However, applications with special efficiency concerns may explicitly generate a 304
response if they can guarantee validity. As of version 2.3, Zope provides an integrated
cache manager that can help you to control caches both inside and outside of Zope.
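The decision between a 304 and a 200 response for a conditional GET can be sketched as follows. This is an illustration of the mechanism under simplifying assumptions, not Zope's code: the function name conditional_status is hypothetical, and only the If-Modified-Since header is considered.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still valid, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable date: treat the request as unconditional
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have a resolution of one second.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


modified = datetime(2001, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status(modified, "Tue, 01 May 2001 12:00:00 GMT"))
print(conditional_status(modified, "Mon, 30 Apr 2001 12:00:00 GMT"))
```

An application that generates a 304 response itself would apply a test of this kind to whatever date it considers decisive for the generated page.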
The 4xx status codes indicate a client error. Usually, the response contains an entity that
explains the problem and what can be done about it. The most essential codes are
400 (Bad Request)
401 (Unauthorized)
404 (Not Found)
The requested resource is not found. HTTP allows the server to cheat: the code is
a catch-all for all types of client errors for which the server does not want to give a
more detailed description.
5xx status codes indicate a server error. Zope uses code 500 (internal server error) when
some application code tries to set an invalid status code or when it raises an exception
that Zope is unable to map to another status code (such as redirect or unauthorized).
When Zope is accessed through a proxy, a client may observe other status codes from
this class. They usually indicate that either Zope or another proxy on the way to Zope died
or that a connection broke down.
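The division of status codes into classes by their first digit is easy to express in code. The following sketch uses the status phrases from Python's standard http module; the names STATUS_CLASSES, status_class and describe_status are made up for this example. The 1xx class (informational) is not discussed in the text above but is included for completeness.

```python
from http import HTTPStatus

# Status classes, keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the class of a three digit HTTP status code."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def describe_status(code: int) -> str:
    """Combine code, standard status phrase and class into one string."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "unknown"
    return f"{code} {phrase} ({status_class(code)})"


for code in (200, 302, 304, 401, 404, 500):
    print(describe_status(code))
```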
HTTP is a stateless protocol. This means that a request must contain all information
necessary to process it. The server is not expected to have saved state information from
previous requests that may be necessary to process this one. This HTTP property makes it
quite hard to build more complex Web applications. Of course, users expect most
applications to be aware of their preferences and to remember essential facts from
previous interactions. A whole mess of kludges has been defined and implemented to
work around this limitation: authentication headers, hidden form variables, cookies,
session products. We will learn about these concepts later in this book. I expect future
HTTP versions to remove this limitation.
2.2. URL
The URL (Uniform Resource Locator) is one of the most essential Web publishing
concepts. As the name says, it is used to locate a resource. As we explained in the last
section, resource is used in a very wide sense: it can be almost anything, a person, an
object, a service, an application etc. Almost the only requirement is that it can be
identified with an identifier, a URI (Uniform Resource Identifier). There are different
kinds of identifiers. Some kinds contain a description of how to locate the resource. These
form the subclass of the resource locators; the others are the resource names, URNs. The
URI syntaxes for the various kinds have many commonalities. Therefore, the common
aspects can be described in a single URI syntax standard [RFC 2396]. Each URI kind is
identified by a scheme. Although the scheme determines the precise syntax, URIs,
especially URLs, usually consist of up to 4 components: the scheme, an authority, a path
and a query. The scheme is always present. It determines which of the other components
may or must be present. In the notation of [RFC 2396], the generic syntax looks like:
<scheme>://<authority><path>?<query>
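Python's standard urllib.parse module splits a URL into exactly these components. The URL below is an invented example; note that urlsplit calls the authority component netloc.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/folder/page.html?lang=en")

print(parts.scheme)   # the scheme
print(parts.netloc)   # the authority
print(parts.path)     # the path
print(parts.query)    # the query
```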
For resource locators, the scheme usually identifies a protocol which can be used to
access the resource. The remaining URL parts provide the parameters necessary for this
access. The most prominent protocols in the Web publishing domain are http and its
secured version https, as well as mailto and ftp. The mailto protocol accesses the
resource, usually a mailbox or mail group, by sending an email to it. The URLs use only
the authority part which usually has the format user@host.
The other listed protocols belong to the family of hierarchical URI schemes. Their
commonality is the use of the path component, a sequence of (path) segments separated
by /. Paths can be used to navigate in a (typically) hierarchical structure: to locate
path/segment, segment is used as a local selector in the context of the resource located
by path. You know this type of navigation from your file system, and indeed the
resources are often folders and files in a standard file system, with the URL path
component directly mapped to the file hierarchy.
For hierarchical URI schemes, the authority component typically has the form
//[userinfo@]host[:port]
The host identifies the name or Internet address of a host where a service at port should
resolve the URI into a resource. If port is not specified, a protocol specific default is
used. This is 80 for HTTP, 443 for HTTPS and 21 for FTP. If present, userinfo
identifies the user for whom the resolution and maybe an associated request should be
performed. It has the form username[:password].
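The authority component can be taken apart with urllib.parse as well. The sketch below adds the protocol specific default ports mentioned above; the helper authority_parts and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Default ports used when the URL does not specify one.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def authority_parts(url):
    """Return (username, password, host, port) for a hierarchical URL."""
    parts = urlsplit(url)
    port = parts.port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return parts.username, parts.password, parts.hostname, port


print(authority_parts("ftp://anna:secret@ftp.example.com/pub/file.txt"))
print(authority_parts("https://www.example.com:8443/"))
```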
What is usually used in documents are not URIs but rather a more general concept, a
URI reference. A URI reference is essentially a URI, but two aspects make it more
general than a URI. A URI reference may have an attached fragment identifier,
introduced with #. It identifies a fragment, a part of the resource identified by the URI.
And the URI reference may be relative, i.e. it may only specify part of the complete URI
with the missing parts given by a base URI. The fragment part is only interpreted by user
agents, usually to position the view onto the displayed resource. It is never sent in a
request. Likewise, relative URI references are resolved with respect to their base URI to
form an absolute URI and only these absolute URIs are sent in a request. A relative URI
reference does not follow the above mentioned URI syntax; it is rather a suffix thereof.
In particular, it does not have the scheme component. The rules to resolve a relative URI
reference with respect to its base into an effective absolute URI are as follows:
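The effect of reference resolution can be tried out with urljoin from Python's standard urllib.parse module. This is a sketch with an invented base URI; urljoin follows the current URI standard, which agrees with [RFC 2396] for the common cases shown here. The companion function urldefrag separates the fragment, which is never sent in a request.

```python
from urllib.parse import urldefrag, urljoin

base = "http://www.example.com/a/b/page.html"

# A relative path replaces the last segment of the base path.
print(urljoin(base, "other.html"))
# ".." moves one level up in the hierarchy.
print(urljoin(base, "../img/logo.png"))
# A leading / replaces the complete base path.
print(urljoin(base, "/top.html"))
# A leading // replaces the authority as well; only the scheme is kept.
print(urljoin(base, "//other.example.org/x"))
# A fragment alone refers to a part of the base document itself.
print(urljoin(base, "#top"))

# The user agent strips the fragment before sending the request.
url, fragment = urldefrag("http://www.example.com/a/b/page.html#top")
print(url, fragment)
```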
As we speak about Web publishing, most resource references embedded in pages will be
URL references: references to images, other pages, mail addresses. If the resources are
local, you should usually try to reference them by relative references with respect to the
current page. Relative references have the advantage that you can often rename or move a
substructure without the need to change all your references. Using relative URLs with
respect to the current page works because the default base URI for resolution of relative
URIs is the URI of the current document. HTML provides the base tag, a header
component, to explicitly specify the base URI. Under some circumstances, the URL used
by the HTTP request is not the canonical URL of an object, as would be necessary for
correct resolution of relative URLs. Zope knows about many of these circumstances and
automatically generates a corresponding base tag. For cases where this is not possible,
Zope provides a method that allows the application to set the base.
As we have seen, there are many characters that have special meaning for URI parsing. If
they need to be used literally, i.e. as part of one of the URI components rather than as
component separator, then they must be encoded. Furthermore, some characters have
platform dependent representations. This induces problems for cross platform
applications such as Web publishing. There are other problems with some control
characters. The URI standard therefore severely restricts the set of characters that can
occur unencoded as part of the various URI components. The only characters that can be
used unrestrictedly are the ASCII letters (upper and lower case), the ASCII digits and the
characters from the set -_.!~*'(). Depending on the URI component, other characters
may be allowed, too. For example, the characters :@&=+$, are additionally allowed in
path segments. However, you should think twice about whether you really want to rely on
such facts. Any character not allowed in a context must be encoded. The encoding consists of
% followed by the two hex digits representing the character's code in the ISO-8859-1
encoding, also known as Latin-1. This is a superset of the ASCII character set. You do
not need to worry about the coding details: Zope provides a function url_quote that
encodes strings correctly to be used as URI components. You must be aware, however,
that encoding is necessary at some places and use url_quote at these places. Zope
decodes URLs automatically. Thus, there is usually no need to worry about this aspect.
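The encoding rule can be demonstrated with quote and unquote from Python's standard urllib.parse module. This is a stand-in for illustration, not Zope's url_quote; the helper name encode_component is hypothetical. The safe argument lists the punctuation characters that may be used unrestrictedly, and ISO-8859-1 is selected explicitly because quote would otherwise use UTF-8.

```python
from urllib.parse import quote, unquote

# Punctuation that may occur unencoded in any URI component.
UNRESERVED_MARKS = "-_.!~*'()"


def encode_component(text):
    """Encode text for literal use inside a single URI component."""
    return quote(text, safe=UNRESERVED_MARKS, encoding="iso-8859-1")


print(encode_component("a b/c"))        # space and / are encoded
print(encode_component("M\u00fcller"))  # one Latin-1 byte becomes %FC
print(encode_component("it's(ok)!"))    # unreserved marks stay as they are
print(unquote("M%FCller", encoding="iso-8859-1"))
```

Note that / is encoded here because the text is meant to be a single component; a complete path would be encoded segment by segment.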
2.3. HTML
Although a basic HTML understanding is necessary to build dynamic Web sites with
Zope, it is beyond this book's scope to provide a thorough HTML introduction. I,
personally, look into the HTML 4.0 specification when I need information about HTML.
In my view, it is a very good specification which provides introductions, well structured
overviews and detailed information combined with good navigation support such as
element and attribute indexes. In this book, we will only look at HTML forms, as they are
especially important for dynamic Web sites.
An HTML form is the major device that allows users to provide input for Web
applications. It is implemented by the HTML form tag. The form tag contains special
form controls besides normal text and HTML markup. Controls have a name, an initial
value and a current value. The user interacts with the form by changing the current value
of its controls, either directly or through script invocations. The user may then submit the
form. Form submission results in a request being sent to an agent, e.g. an email server or an
HTTP server. Depending on the context of submission, some controls are
considered successful. For each successful control, the request contains an association
control_name=current_value. The order is the same as that in which the controls appear
in the document.
Controls are implemented as HTML tags. Their name is given by a name attribute. Their
initial value is usually given by a value attribute, for some controls by their content
(textarea; option if no value attribute is present). The current value is initially set to
the initial value and can later be changed either by the user or a script. Values are strings.
There are controls for (single line) text input, image, submit, check, radio and reset
buttons and file input, all implemented by the input tag. Menus are implemented by
select, which is a container for options and is available both as a single and a multiple
selection. Multi line text input is implemented by textarea. HTML 4 provides additional
button and object controls. As a special case, there are hidden controls, also
implemented by input. They are used not for user interaction but to transfer information
between the page generation process and the request processing after form submission.
Such a transfer may be necessary to work around HTTP's lack of state which requires that
each request is self describing. Cookies provide an alternative to the use of hidden
controls.
Whether a control is successful during form submission is usually determined by its type
and its current value. Text input and hidden controls are always successful. Check and
radio button controls are only successful if checked. For selections, each selected option
defines a successful control associated with the selection's name. Thus, there may be one,
several or no successful controls for a single (multiple) menu control in the submitted
form data. A submit or image button is only successful if it was used to submit the
form.
It should be noted that unsuccessful checkbox controls can cause problems during form
processing. Similar problems result from multiple selections when no option has been
selected. In all these cases, the submitted form data does not contain a definition for the
associated control name. The application must take care to interpret this lack of a value
correctly. Zope provides various facilities to handle these cases.
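The rules for successful controls can be summarized in a short sketch. The representation of controls as dictionaries and the function name successful_controls are assumptions made for this example; the value "on" for a checkbox without a value attribute reflects common browser behavior and is likewise an assumption.

```python
def successful_controls(controls, submit_name=None):
    """Return the (name, value) pairs a browser would submit."""
    pairs = []
    for control in controls:
        kind, name = control["type"], control.get("name")
        if not name:
            continue  # a control without a name is never submitted
        if kind in ("text", "hidden", "textarea"):
            pairs.append((name, control.get("value", "")))
        elif kind in ("checkbox", "radio"):
            if control.get("checked"):
                # "on" is an assumed default, as used by common browsers
                pairs.append((name, control.get("value", "on")))
        elif kind == "select":
            # one pair per selected option, possibly none at all
            pairs.extend((name, value) for value in control.get("selected", []))
        elif kind in ("submit", "image"):
            if name == submit_name:
                pairs.append((name, control.get("value", "")))
    return pairs


controls = [
    {"type": "text", "name": "firstname", "value": "Ann"},
    {"type": "radio", "name": "sex", "value": "Male"},
    {"type": "radio", "name": "sex", "value": "Female", "checked": True},
    {"type": "select", "name": "interests", "selected": []},
    {"type": "checkbox", "name": "info", "value": "yes"},
    {"type": "hidden", "name": "sessionId", "value": "2417369"},
    {"type": "submit", "name": "send", "value": "Send"},
]
form_data = dict(successful_controls(controls))
# The unchecked checkbox and the empty selection leave no trace at all,
# so the application has to supply defaults itself.
print(form_data.get("info", "no"), form_data.get("interests", []))
```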
The form tag has one required attribute, action. Its value is a URI reference and
specifies the resource that should process the form data when it is submitted. Usually, it
is either a mailto or an http/https URI. In the first case, the form data is sent by email to
the given URI; in the second case, an HTTP request is sent. form has several optional
attributes, the most essential being method and enctype. method's value is either GET (the
default) or POST. When the GET method is used, the form data is provided as the query
string in the resource locator of an HTTP GET request. As we have noted, the allowed
characters inside a URI are severely restricted. Characters not allowed must be encoded,
which results in a three byte code for each single byte character. Therefore, this method is
inefficient for non-ASCII strings or binary data. You should use POST when your form
transmits large non-ASCII strings or even files. If the method is POST and the action
specifies an HTTP URI, then an HTTP POST request is used to transfer the form data. Here,
the form data is contained not in the resource locator but in the request body. In this case,
the enctype (encoding type) determines the content type of the request body (and thereby
the encoding of the form data). The default enctype value is
application/x-www-form-urlencoded, which is the same encoding as used for URL encoding
and is therefore inefficient in the same cases as the GET method. Do not use it when your
form contains files. Use multipart/form-data in these cases. This uses a multipart MIME
message to encode the form data. It can contain binary parts and therefore transfers binary
data efficiently. If the form data is sent as email to a human, then text/plain may be
appropriate as the value for enctype. With this encoding, each successful control usually
results in a line of the form name=value without an encoding of characters in name or value.
This is adequate for humans. If the email recipient is a program, one of the other encodings
may be more appropriate, as they present no parsing ambiguity and there are standard tools
for parsing.
If a form is submitted to Zope, any of the request methods and encoding types (except
text/plain) are handled transparently and the form data is made accessible in a
convenient way.
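The inefficiency of the default encoding for non-ASCII and binary data is easy to measure. The following sketch uses urlencode and quote_plus from Python's standard urllib.parse module; the helper urlencoded_length and the sample data are invented for this example.

```python
from urllib.parse import quote_plus, urlencode


def urlencoded_length(raw: bytes) -> int:
    """Length of raw data after application/x-www-form-urlencoded encoding."""
    return len(quote_plus(raw))


# Form data as it would appear in a query string or a urlencoded body.
form_data = {"name": "M\u00fcller", "remarks": "a b&c"}
body = urlencode(form_data, encoding="iso-8859-1")
print(body)

# Ten non-ASCII bytes grow to thirty characters: three per byte.
print(urlencoded_length(bytes(range(128, 138))))
# Plain ASCII letters are transferred unchanged.
print(urlencoded_length(b"abc"))
```

With multipart/form-data, the ten bytes would be transferred as they are, at the price of some fixed overhead for the part boundaries and headers.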
<!-- Every control needs a name attribute to take part in form submission. -->
<FORM action="http://somesite.com/prog/adduser" method="post">
  <P>
    <LABEL for="firstname">First name: </LABEL>
    <INPUT type="text" id="firstname" name="firstname"><BR>
    <LABEL for="lastname">Last name: </LABEL>
    <INPUT type="text" id="lastname" name="lastname"><BR>
    <LABEL for="email">email: </LABEL>
    <INPUT type="text" id="email" name="email"><BR>
    <INPUT type="radio" name="sex" value="Male"> Male<BR>
    <INPUT type="radio" name="sex" value="Female"> Female<BR>
    Interests:
    <SELECT name="interests" multiple>
      <OPTION value="1">Sports</OPTION>
      <OPTION value="2">Politics</OPTION>
      <OPTION value="3">Arts</OPTION>
      <OPTION value="4">Economics</OPTION>
      <OPTION value="5">Family</OPTION>
    </SELECT><BR>
    Origin continent:
    <SELECT name="origin">
      <OPTION>North America</OPTION>
      <OPTION>South America</OPTION>
      <OPTION>Asia</OPTION>
      <OPTION>Australia</OPTION>
      <OPTION>Europe</OPTION>
    </SELECT><BR>
    Interested in further information:
    <INPUT name="info" type="checkbox" checked>
  </P>
  <H4>Remarks:</H4>
  <P><TEXTAREA name="remarks" cols=60 rows=10></TEXTAREA></P>
  <P>
    <INPUT type="submit" value="Send"> <INPUT type="reset">
    <INPUT type="hidden" name="sessionId" value="2417369">
  </P>
</FORM>
This example (partly stolen from the HTML 4.0 specification) shows a simple form
containing most available controls.
The Forms chapter of the HTML specification contains more detailed information about
forms and form processing. It is highly recommended reading.
2.4. Authentication
Unlike a static Web site where visitors usually can only retrieve data, a dynamic Web site
built with Zope allows in principle all types of site extensions and modifications
performed through the Web. It is clear that an administrator wants to control who is
entitled to perform such operations. Authentication, the determination of the identity of a
requesting agent, is vital for a dynamic Web site.
Basic authentication has two weak points. The first is security: username and password
are essentially sent in clear text with every request. (They are sent base64 encoded;
however, it is trivial to reconstruct the original information from the encoding.) Anyone who
intercepts such a request can extract the username and password and use them to assume a
false identity. The second is comfort: the lifetime of the login information is controlled
by the browser. It usually maintains it during the current session (i.e. the lifetime of the
browser process). In this case, the user has to reauthenticate each time the browser is
restarted.
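How little protection the base64 encoding offers can be seen in a few lines of Python. The sketch assumes the usual format of the basic authentication header, Authorization: Basic followed by the base64 encoding of username:password; the function names and the example credentials are chosen for illustration only.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an Authorization header for basic authentication."""
    credentials = f"{username}:{password}".encode("iso-8859-1")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def decode_basic_auth(header_value):
    """Recover username and password from an intercepted header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a basic authentication header")
    decoded = base64.b64decode(token).decode("iso-8859-1")
    username, _, password = decoded.partition(":")
    return username, password


header_value = basic_auth_header("Aladdin", "open sesame")
print(header_value)
print(decode_basic_auth(header_value))
```

Decoding needs no secret at all, which is why basic authentication should only be used over a secured connection when the password matters.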
Recently, a more secure authentication scheme has been defined for HTTP: digest
authentication. However, it is not widely implemented. In particular, Zope does not yet
support it (and neither do many browsers).
The authentication scheme used by a Zope Web site is not hard-wired into Zope. Instead,
a component, the so-called UserFolder, decides all authentication aspects. The
standard UserFolder which comes as part of Zope supports only basic HTTP
authentication. There are, however, products that use cookie authentication.
Some people (I am one of them) do not like cookies because of privacy concerns.
Cookies are often used by Web sites to identify their visitors across visits, collect long
term information about their visits and visit patterns and use this information in various
ways: to improve their Web site (good), to analyze their visitors' interests and use them for
personalized marketing (I do not like that), maybe even to sell this information (I hate that).
Therefore, I regularly look into my cookie file (where the browser maintains long living
cookies). When I detect cookies with a lifetime of more than a month or so, I get very
suspicious about the site's intentions. I delete such cookies and may disable cookies
altogether when visiting such a site.
2.5. Cookies
HTTP is a stateless protocol. This means that each request must be self contained. There
is nothing like a context built from previous requests that can be used to interpret the
current request. On the other hand, many applications need to be stateful. Think of a
shopping cart. When you look at your cart, it must of course contain the items you have
sent to it in previous requests. Or think of a form with a complex form field. To fill it,
you may need to look at supporting information. When you come back, the form fields
you have filled previously must of course retain their values even though the new visit is
a new request to the server. How can such applications be implemented despite the
stateless HTTP protocol?
There are several workarounds for this HTTP deficiency. Usually they combine two
strategies: first, store information on the server associated with an id, and second, encode
the id somehow in the URI or the HTML content. To encode something in the URI, either
a path segment or a query parameter might be appropriate. Hidden form controls are
appropriate to encode state information inside HTML forms. Usually, these workarounds
are tedious and several encoding techniques must be used in combination, for example
hidden variables for pages with forms and ids encoded in URI references for link
traversal. That's where cookies come in.
The cookie mechanism is very similar to the HTTP authentication scheme we have seen
in the last section. Authentication is a typical example where you want stateful behavior.
You should not need to authenticate for each request separately. After you have logged in
once, all following requests should use the login information you provided during this
first login. This is possible despite the stateless HTTP protocol, because the user agent
provides this information with each request. With a cookie, it is very similar.
A cookie is a named value, defined by the HTTP server and sent to the user agent. The
user agent stores the cookie and automatically includes all cookies defined by a given
server when it sends a request to this server. Looking at its cookies, the server can access
information effectively determined by earlier requests. That's a bit simplified but it gives
the general idea. Cookies are great at providing state information for HTTP processing
without the need to switch such information between query strings and hidden variables.
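The round trip of a cookie can be sketched with the SimpleCookie class from Python's standard http.cookies module. The cookie names and values are invented; in a real exchange, the first string travels in a Set-Cookie response header and the second in a Cookie request header.

```python
from http.cookies import SimpleCookie

# Server side: define a cookie to be sent with the response.
response_cookies = SimpleCookie()
response_cookies["sessionId"] = "2417369"
set_cookie_value = response_cookies["sessionId"].OutputString()
print("Set-Cookie:", set_cookie_value)

# Later request: the user agent sends back all cookies it holds for
# this server, and the server parses them again.
request_cookies = SimpleCookie()
request_cookies.load("sessionId=2417369; lang=en")
session_id = request_cookies["sessionId"].value
print(session_id, request_cookies["lang"].value)
```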
Cookies were invented by Netscape. The cookie specification can be found on
Netscape's web site. As cookies solve a fundamental problem with HTTP, they were soon
implemented by other browsers. Nowadays, almost all browsers support cookies.
Earlier I said that a cookie is a named value and that all cookies defined by a server are sent
with any request to this server. As already mentioned, this was a simplification. Actually,
a cookie is described by the following attributes:
name
the cookie's name. The name must not contain whitespace, equal signs, commas or
semicolons.
value
the cookie's value. The value is a string not containing whitespace, commas or
semicolons. The value is usually encoded to prevent such forbidden characters from
slipping in.
expires
the cookie's expiration date. This is an HTTP datetime, also known as an RFC
822 time [RFC 822]. The time zone is fixed to GMT. The format is
Wdy, DD-Mon-YYYY HH:MM:SS GMT. The user agent should delete the cookie when this
time arrives. If the cookie creating server does not specify an expiration date, the
cookie lives as long as the browser process. It is not stored persistently.
domain
The domain controls to which servers the cookie may be sent. A cookie may be
sent to a server when domain is a suffix of the server's host name. This implies
that a cookie can also be sent to a server different from the one defining the cookie, as
long as it is in the domain given by domain. To make abuse more difficult, a server
that sets a cookie can only specify a domain it belongs to. Moreover, the domain
must be sufficiently specific: domain must contain at least 2 or 3 periods, depending
on its top-level domain. If the cookie creating server does not specify a domain, the
server's host name is used.
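path
The path restricts the cookie to requests whose URL path begins with path. If the
cookie creating server does not specify a path, the path of the resource that set the
cookie is used.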
secure
If the cookie is marked secure, the browser will only send it over secure
connections. This currently means an HTTPS connection, i.e. HTTP over SSL or
TLS.
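The attributes listed above can be put together with Python's standard http.cookies module. The following is a sketch under simplifying assumptions: the helper names are invented, the domain test is the plain suffix comparison described above, the path test is a simple prefix comparison as in Netscape's specification, and the check for a sufficient number of periods is left out.

```python
import time
from http.cookies import SimpleCookie


def cookie_expires(epoch_seconds):
    """Format a timestamp as Wdy, DD-Mon-YYYY HH:MM:SS GMT."""
    return time.strftime("%a, %d-%b-%Y %H:%M:%S GMT",
                         time.gmtime(epoch_seconds))


def build_set_cookie(name, value, expires=None, domain=None,
                     path=None, secure=False):
    """Return the value of a Set-Cookie header with the given attributes."""
    cookie = SimpleCookie()
    cookie[name] = value
    morsel = cookie[name]
    if expires is not None:
        morsel["expires"] = cookie_expires(expires)
    if domain:
        morsel["domain"] = domain
    if path:
        morsel["path"] = path
    if secure:
        morsel["secure"] = True
    return morsel.OutputString()


def cookie_applies(host, path, cookie_domain, cookie_path="/",
                   secure=False, is_https=False):
    """Decide whether a user agent would send the cookie with a request."""
    if not host.lower().endswith(cookie_domain.lower()):
        return False  # domain must be a suffix of the host name
    if not path.startswith(cookie_path):
        return False  # assumed prefix rule for the path attribute
    return is_https or not secure


print(build_set_cookie("sessionId", "2417369", expires=0,
                       domain=".example.com", path="/shop", secure=True))
print(cookie_applies("shop.example.com", "/shop/cart", ".example.com", "/shop"))
```

A cookie without the expires attribute, as produced by build_set_cookie with its defaults, lives only as long as the browser process.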
User agents usually impose limits on the number and size of cookies. There is a
total limit (300) and a limit per server or domain for the number of cookies (20). The
name and value of a cookie together must not exceed 4 kB.
Cookies can pose a significant threat to privacy. Be aware that some potential users will
disable cookies in their browser.