
Chapter 2

Data Structures
VECTOR DATA STRUCTURES
There are numerous spatial data structures or data models used in GIS software. Figure
1.1 diagrams the relationships between common data models. We are primarily
concerned with two data structures: coverages and shapefiles. There are two aspects of
the structures: the folder and file structure, that is, how the files that contain the data are
organized on disk, and the actual data model, or the organization of both the spatial and
non-spatial (attribute) data.

Folder and File structure


Figure 1.1. A schematic of data models. After Chang, 2002.
[Figure labels: GI; Spatial Data; Attribute Data; Access; DBase; Other DBs; Vector; Raster;
Grid; IDRISI; Non-Topological (Shapefile); Topological (Coverage); Simple Data Models;
High level Data Models; TIN; Regions; Dynamic segmentation; GeoDataBase; Object Oriented.]

There are two primary spatial data file structures we need to discuss: coverages and
shapefiles, the two file types we are going to be using in this course. It is very important
that you understand the differences in how these data files are stored on the disk because
incorrect management is a major problem for beginning users of these files. Although we
are going to be using only coverages, shapefiles, and several kinds of attribute data files,
there are many other forms of GIS data, as outlined in Figure 1.1. The vector data model,
as described in Chapter 1, is based on points and their X,Y coordinate pairs. These points
are used to construct the other feature types (lines and polygons). Under Vector models,
the figure shows topological and non-topological models. Shapefiles are considered to be
non-topological. It might be better to say that shapefiles are designed so that topology
can be constructed on the fly and is not built into the data. As a result, shapefiles load
and draw faster than coverages.
Coverages are the topological spatial data structures that we will be using.
TINs are triangulated irregular networks used to represent continuous data like elevation
and pollution concentrations. TINs are made up of triangles whose edges connect three
points defining, say, elevation at spots. The slope and aspect of the triangular polygons
can be calculated easily, and other raster-type operations can be carried out with this
data structure. Regions are an advanced polygon structure in which the polygons may
overlap and the logical polygon may be made up of several unconnected graphic polygons.
Dynamic segmentation is a model based on the line or arc feature type and allows for the
assignment of different attributes to different parts of an arc. The GeoDataBase is a
complex structure that uses point, line, and polygon geometries to represent the graphic
part of GI. Point features can consist of a single point or a set of points. A line feature is
made up of a set of arcs that are not necessarily connected. A polygon feature is made up
of one or many rings that may or may not be connected. With the other structures the
attribute data is stored separately from the spatial data; that is not the case with the
GeoDataBase, where both spatial and attribute data are stored in the same database.
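
The remark that slope and aspect of a TIN triangle "can be calculated easily" can be
illustrated with a short sketch. It is only an illustration under stated assumptions, not
how any particular GIS package does it: the function name is hypothetical, each vertex is
an (x, y, z) tuple with +Y taken as north, and aspect is reported as the compass bearing
of the downhill direction.

```python
import math

def triangle_slope_aspect(a, b, c):
    """Return (slope_deg, aspect_deg) for one TIN triangle with (x, y, z) vertices.

    Assumptions of this sketch: +Y is north, aspect is the compass bearing
    (clockwise from north) of the downhill direction, and aspect is None
    when the triangle is flat.
    """
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    # Normal of the triangle's plane: cross product of two edge vectors.
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if nz == 0:
        raise ValueError("triangle is vertical or degenerate in plan view")
    if nz < 0:
        # Flip the normal so that it points upward.
        nx, ny, nz = -nx, -ny, -nz
    horizontal = math.hypot(nx, ny)
    slope = math.degrees(math.atan2(horizontal, nz))
    if horizontal == 0:
        return 0.0, None
    # The horizontal part of an upward normal points downhill.
    aspect = math.degrees(math.atan2(nx, ny)) % 360.0
    return slope, aspect
```

For example, a triangle whose elevation rises by one unit for each unit travelled north
has a slope of 45 degrees and faces south (aspect 180).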

The way in which coverage and shapefile data are stored is quite different and can cause
considerable confusion for novice users. We will take up the file structure of each of
these data model types separately.

Coverages
A coverage is a topological data structure (see later for definition of topological). On
disk, a coverage is defined by the data stored in a folder. Figure 2.2 shows the coverages
for a database called Yellowstone. Note that the Yellowstone database includes a
folder called info as well as coverages holding various versions of land use. The info
folder is important since that is where the data is actually stored, but the really important
point from a data management point of view is that any folder of GIS data that contains
an info folder is a workspace. Any folders inside a workspace, that is, any folder
containing an info folder, cannot be moved by dragging or by copying and pasting. The
files inside the coverages actually contain only pointers to data in the info folder, so
moving them in Explorer or in DOS will cause destruction of the coverage! You must use
the data management tools in the ESRI software.

Folders inside a workspace, that is, any folder containing an INFO folder, cannot
be moved by dragging or copying and pasting!

Figure 2.2. Windows Explorer view of the database called Yellowstone.

In Figure 2.2, the coverage is called INLANBUF and it is located in a workspace called
LANDED, although you can't tell that from the figure. You can tell it is a workspace from
the fact that there is an INFO folder. The contents of some of the files in the coverage
INLANBUF are:

pat.adf   Polygon or point attribute table. Contains information about the polygons
          (or points if it is a point coverage).
aat.adf   Not in Figure 2.2 but, if present, contains the arc attribute table.
tic.adf   Contains the tic coordinates for the coverage. Tics locate the coverage in
          space.
bnd.adf   Contains the coordinates of the current bounding rectangle for the
          coverage. The bounding rectangle is also called the extent of the data.
prj.adf   Contains projection, datum, and units information about the coverage.
arc.adf   Contains information about the arcs in the coverage.

If the coverage were a PC ArcInfo coverage, then the files with names like pat.dbf and
tic.dbf would be, respectively, the actual polygon and tic database files and NOT
pointers to data in an info folder. Since the files in workstation ArcInfo have an .adf
extension, you know they contain only pointers to the data stored in the info folder.
Coverages produced and used by PC ArcInfo do NOT have the info folder structure and
thus may be moved or copied and pasted in Windows Explorer or DOS.
Make sure you understand the points made above about the folder and file structure of
coverages. If you don't understand it, you will undoubtedly make frustrating if not
serious errors in the management of coverages.
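
The workspace rule above can be turned into a simple check before touching a folder by
hand. This is a minimal sketch, assuming only what the text states (a workspace is any
folder that contains a folder named info); the function names are hypothetical and the
check is not part of any ESRI tool.

```python
from pathlib import Path

def is_workspace(folder):
    """True if the folder contains a sub-folder named info (any letter case)."""
    folder = Path(folder)
    return folder.is_dir() and any(
        child.is_dir() and child.name.lower() == "info"
        for child in folder.iterdir()
    )

def safe_to_move_with_explorer(folder):
    """False when the folder sits directly inside a workspace.

    Such a folder (for example a coverage) depends on data held in the
    workspace's info folder, so it should be moved with the GIS software's
    own data management tools rather than by drag and drop.
    """
    return not is_workspace(Path(folder).resolve().parent)
```

With the layout described for Figure 2.2, `safe_to_move_with_explorer` would return False
for the INLANBUF folder inside LANDED, while the LANDED workspace as a whole would pass
the check.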

Shapefiles
Shapefiles do not have the same kind of topology as coverages. The graphic features in
shapefiles are just that, graphic objects, and, as you will see, the kinds of analyses that
can be performed on shapefiles are quite different from the analyses that can be carried
out with coverages. Figure 2.3 shows the data files that make up a shapefile called
AccessRd. The file with the .dbf extension is the attribute data file, the .shp file stores
the geometry, and the .shx file stores the index of the feature geometry. These are the
three files that must be present for a working shapefile. Other files that may be in a
shapefile are the .sbn and .sbx files that store the spatial index of the features and the
.fbn and .fbx files that store the spatial index of the features for shapefiles that are
read-only. There may also be .ain and .aih files that store the attribute index of the active
fields in a table or a theme's attribute table, the .xml file that stores metadata for use in
ArcInfo 8, and the .avl file that stores legend information. For now the important point is
that if you move a shapefile in DOS or Windows Explorer you have to make sure you get
all the parts.

Figure 2.3. Contents of a simple shapefile.
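
The "get all the parts" advice can be sketched as a small check on a folder listing. The
helper names below are hypothetical, and the sketch assumes that every part of a
shapefile shares the text before the first dot in its file name.

```python
from pathlib import PurePath

# The three extensions the text lists as mandatory for a working shapefile.
REQUIRED = {".shp", ".shx", ".dbf"}

def shapefile_parts(filenames, base_name):
    """Return the files whose name (up to the first dot) matches base_name."""
    wanted = base_name.lower()
    return sorted(
        name for name in filenames
        if PurePath(name).name.lower().split(".", 1)[0] == wanted
    )

def missing_required(filenames, base_name):
    """Return the mandatory extensions that are absent for this shapefile."""
    present = {
        PurePath(name).suffix.lower()
        for name in shapefile_parts(filenames, base_name)
    }
    return sorted(REQUIRED - present)
```

Calling `shapefile_parts` before a move gives the full list of files to take along, and a
non-empty result from `missing_required` signals that the copy would not be a working
shapefile.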

Spatial Data Structure
The basic features in spatial data files can be points, lines (arcs), or polygons (areas)
(see footnote 2). There are other feature types that we will take up later. We will take
up the structure of coverages first and then shapefiles. However, before we get into the
structure of the two GI data structures we need to discuss topology.

Topology
Topology is a term that is very common in GIS literature and conversations. It is usually
taken to mean that the data structure is designed in such a way that it is easy for the
software to figure out what is next to what. However, that is NOT the only reason that
some structures have topology. Coverages are usually spoken of as topological structures
while shapefiles are not considered to be topological and so we need to spend some time
on the subject of topology.
Mathematical topology assumes that geographic objects are located on a 2-dimensional
plane. These 2-D features are:
1. Nodes or non-dimensional points defined by X,Y coordinates, sometimes called
   0-dimensional cells (see footnote 3)
2. Edges, arcs, or lines defined by two or more nodes, sometimes called 1-dimensional
   cells
3. Polygons or areas defined by 3 or more arcs and nodes, sometimes called 2-dimensional
   cells
For a topological database the points, lines, and areas exist in 2-D space. The rules for
this space say that lines cannot cross without a node at the intersection, in which case the
crossed lines become 3 or more lines, and polygons cannot overlap or have multiple
parts. Note also that a polygon is a closed set of arcs containing a label point to which
the polygon attributes are attached.
The major advantage of applying topological rules to a data set being constructed was
that the software could enforce those rules and thereby help reduce errors in
digitizing. Lines that intersected without a node being present, polygons that did not
close, and overlapping polygons could be identified and corrected relatively easily. ESRI
products contain at least 2 modules to do this work: BUILD and CLEAN. CLEAN is
used to remove overshooting arcs and create nodes where arcs intersect, while BUILD
actually builds the topological structure. Any coverage that contains a pat and/or an aat
file has been built, but that is not proof that the errors have been corrected.

2 You might as well get used to the fact that the terms arc and line are used interchangeably: they mean the
same thing and you will find both terms being used in the GIS community. The same is true for polygon
and area; they mean the same thing.
3 In the U.S. Census data descriptions.

Having a clean topological structure also allowed for relatively easy identification of
polygons that were adjacent to one another, and that was important for analysis, although,
as you will see, it is not necessary for analysis.
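
The rule that lines cannot cross without a node at the intersection can be checked with a
standard orientation test. The sketch below is an illustration only and is not the
algorithm used by CLEAN; the function names and the representation of an arc as an
ordered list of (x, y) points are assumptions. It reports only crossings that fall
strictly inside both segments, so arcs that merely meet at a shared end point are
accepted.

```python
from itertools import combinations

def orientation(p, q, r):
    """Turn direction p -> q -> r: > 0 counter-clockwise, < 0 clockwise, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a, b, c, d):
    """True if segment a-b and segment c-d cross at a point interior to both."""
    d1 = orientation(a, b, c)
    d2 = orientation(a, b, d)
    d3 = orientation(c, d, a)
    d4 = orientation(c, d, b)
    return d1 * d2 < 0 and d3 * d4 < 0

def arcs_crossing_without_node(arcs):
    """Return pairs of arc ids whose segments cross away from any end point.

    arcs maps an arc id to its ordered list of (x, y) points.
    """
    problems = []
    for (id1, pts1), (id2, pts2) in combinations(sorted(arcs.items()), 2):
        segs1 = list(zip(pts1, pts1[1:]))
        segs2 = list(zip(pts2, pts2[1:]))
        if any(segments_cross(a, b, c, d) for a, b in segs1 for c, d in segs2):
            problems.append((id1, id2))
    return problems
```

Two diagonal arcs that cross in the middle are reported as a pair; once they are split at
the intersection into four arcs that share a node there, nothing is reported.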

Coverage structure
In coverages, the basic data is points. Points are defined spatially by X, Y coordinate
pairs. X and Y are usually in real world coordinates like longitude and latitude but may
be in table inches or any other coordinate system. In arc and polygon coverages there are
two types of points: nodes and vertices. Nodes mark the beginning and ends of arcs.
Nodes that connect only two arcs, or a single arc that closes on itself, are called
pseudo-nodes and may be errors. Vertices are used to shape the arc and are not connected
to any other arcs. Lines (arcs) are built from points and polygons (areas) are built from
lines. Figure 2.4 shows a simple coverage that contains 4 polygons, 6 arcs, 5 nodes and
6 vertices. The X and Y coordinates locate the structure in space. Note that each arc has
a direction as shown by the arrow and that individual arcs may have 1 or more segments
and 0 or more vertices.

Figure 2.4. Showing graphic structure of a polygon coverage.
[Legend: N = node, V = vertex, numerals = arc numbers, letters A to D = polygons; the X
and Y axes both run from 0 to 7.]

Table 2.1 shows the node and vertex coordinates making up each arc in Figure 2.4 while
Table 2.2 shows the topological structure.
Table 2.1. List of nodes and vertices and their coordinates making up each arc in Figure 2.4

Arc #   List of Nodes and Vertices (name@X,Y)
1       N1@4,6  V1@2,6  N2@1,3
2       N2@1,3  V2@3,2  N3@5,3
3       N3@5,3  N4@5,5
4       N1@4,6  N4@5,5
5       N5@2,4  N5@2,4
6       N1@4,6  V6@6,7  V5@7,6  V4@6,5  V3@6,4  N3@5,3

Table 2.2. Table showing the topological structure of the coverage in Figure 2.4

Arc #   From Node   To Node   Left Poly   Right Poly
1       N1          N2        B           A
2       N2          N3        B           A
3       N3          N4        B           D
4       N1          N4        D           B
5       N5          N5        B           C
6       N1          N3        A           D
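
As an illustration of how a table like Table 2.2 supports adjacency queries, the sketch
below stores its rows in a dictionary and derives which polygons share which arcs. The
variable and function names are hypothetical; only the table values come from the text.

```python
from collections import defaultdict

# Rows of Table 2.2: arc number -> (from node, to node, left polygon, right polygon)
ARC_TOPOLOGY = {
    1: ("N1", "N2", "B", "A"),
    2: ("N2", "N3", "B", "A"),
    3: ("N3", "N4", "B", "D"),
    4: ("N1", "N4", "D", "B"),
    5: ("N5", "N5", "B", "C"),
    6: ("N1", "N3", "A", "D"),
}

def shared_arcs(topology):
    """Map each unordered pair of neighbouring polygons to the arcs between them."""
    pairs = defaultdict(list)
    for arc, (_from_node, _to_node, left, right) in sorted(topology.items()):
        if left != right:
            pairs[frozenset((left, right))].append(arc)
    return dict(pairs)

def neighbours(topology, polygon):
    """Return the polygons that share at least one arc with the given polygon."""
    return sorted(
        other
        for pair in shared_arcs(topology)
        for other in pair
        if polygon in pair and other != polygon
    )
```

Here `shared_arcs(ARC_TOPOLOGY)[frozenset("BD")]` gives `[3, 4]`, and
`neighbours(ARC_TOPOLOGY, "C")` gives `["B"]`, because the only arc bounding C has B on
its other side.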
There are several important features of this structure. The most important are its
topological properties. Contiguity is maintained through the fact that each arc has
direction and thus the polygons on the right and left of the arc can be determined. This
means that the system knows, for example, that polygons B and D are next to each
other across arcs 3 and 4. This fact is useful in carrying out analyses dependent on
knowing what polygons are adjacent to one another. Connectivity is maintained because
arcs connect to each other at nodes. Another feature of the structure is that there are no
duplicate arcs between polygons. In some GISs this is not true and polygons B and D
would have completely closed arcs or rings of their own. This means that the arcs
between the two polygons might NOT be the same and some error could be introduced.
The structure also shows that arcs can be simple arcs, that is, straight lines between nodes,
or they can be more complex structures using vertices to define curved or non-straight
arcs. Still another feature is that the polygons must have labels or else there is no way to
identify right and left polygons for the arcs. Arc 5 has a special node, N5, called a
pseudonode. A pseudonode is any node that has only two arcs connected to it or, as
here, a single arc that begins and ends at it. In the case of a closed polygon like C (see
Table 2.2) this is not an error. A pseudonode is considered an error when it just connects
two arcs and there is no other reason for it to be there.
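
The pseudonode rule can be applied mechanically to the From Node and To Node columns
of Table 2.2 by counting how many arc ends meet at each node. This is a sketch under the
definition given above (exactly two arc ends at a node); the names are hypothetical.

```python
from collections import Counter

# From Node and To Node for each arc, taken from Table 2.2.
ARC_ENDS = {
    1: ("N1", "N2"),
    2: ("N2", "N3"),
    3: ("N3", "N4"),
    4: ("N1", "N4"),
    5: ("N5", "N5"),
    6: ("N1", "N3"),
}

def arc_ends_per_node(arc_ends):
    """Count how many arc ends meet at each node."""
    counts = Counter()
    for from_node, to_node in arc_ends.values():
        counts[from_node] += 1
        counts[to_node] += 1
    return counts

def pseudonodes(arc_ends):
    """Nodes where exactly two arc ends meet."""
    return sorted(n for n, c in arc_ends_per_node(arc_ends).items() if c == 2)

def closed_arcs(arc_ends):
    """Arcs that start and end at the same node, such as an island boundary."""
    return sorted(arc for arc, (f, t) in arc_ends.items() if f == t)
```

Run on the table, `pseudonodes(ARC_ENDS)` returns N2, N4, and N5. N5 belongs to the
closed arc 5 and is legitimate; N2 and N4 each join only two arcs, so under the rule above
they are the kind of node a reviewer would want to inspect.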
The coverage could have been an arc coverage. In this case Table 2.2 would not exist
since all that is needed is the list of nodes and vertices for each arc. If there were no arcs
then the coverage would have been a point coverage and the only information stored
would be the coordinates of each point.
Note that since each polygon in a polygon coverage must have a label point it is
impermissible to have both points and polygons in the same coverage. Although it is
possible to have both arcs and points in one coverage good practice says to keep the
different coverage types separate.

Shapefile structure
Like coverages, a set of ArcView files can represent points, lines, or polygons (and
regions; see later). Figure 2.6 shows a collection of arcs and their nodes as they
would be constructed in a shapefile. These are called polylines. The polylines are defined
by the ordered sequence of nodes making up the feature. The feature at A is a Polyline
with 3 connected components or parts but has only one record in the attribute table.
However, there is no rule that says Polyline features have to be connected. Thus, even
though the arcs are not connected in Figure 2.6, they could still be a single arc as far as
ArcView is concerned and be represented as one record in the attribute table. Hence the
term Polyline is used.
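
The idea of one feature with several parts but a single attribute record can be sketched
as a small data structure. This is not the shapefile binary layout; the class, field
names, and sample attribute are hypothetical and serve only to show the relationship
between parts and records.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Polyline:
    """One line feature: any number of parts, one attribute record."""
    parts: list                                     # each part is an ordered list of (x, y) points
    attributes: dict = field(default_factory=dict)  # the feature's single record

    @property
    def part_count(self):
        return len(self.parts)

    def length(self):
        """Total length over all parts, connected or not."""
        return sum(
            math.dist(p, q)
            for part in self.parts
            for p, q in zip(part, part[1:])
        )

# Three parts that do not touch each other, still one feature and one record.
road = Polyline(
    parts=[
        [(0, 0), (3, 4)],
        [(10, 0), (10, 5)],
        [(20, 0), (20, 1), (21, 1)],
    ],
    attributes={"NAME": "AccessRd"},
)
attribute_table = [road.attributes]
```

The attribute table built here has one row even though the feature is drawn as three
separate pieces.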

Figure 2.6. Polyline features in a shapefile.

Polygon structures in shapefiles are a little more complex. There can be multiple parts to
a polygon shape just like for arcs in an arc shape. Figure 2.7 shows a typical polygon
theme with vertices identified by number. Technically the polygons are rings and the
lines connecting the vertices always go in a clockwise direction. Look at the polygon
attribute file for the polygons A, B, and C. Note that for polygon A the vertices start at 1
and then go clockwise to 14, 13, and end up at 1 again. Since the arcs defining the
polygons always go clockwise the polygon is always to the right of the bounding arcs.
This is a fairly complex polygon theme because it shows both an island polygon (C) and
a polygon (B) with a hole in it. Although it appears that polygons A and B have a common
boundary, they, in fact, do not, as is shown in Figure 2.8. In this figure, we have dragged
polygon B away from A so that you can see that each polygon is complete. There are no
shared boundaries in this case. The theme was constructed using the autocomplete
function in ArcView that forces the vertices that appear to be common in Figure 2.7 to be
duplicated for polygon B. Each of the polygons is complete; there are NO common edges!
With shapefiles it is also possible to have polygons with multiple parts. Figure 2.9 shows
3 polygons, 1, 2, and 3. Polygons 2 and 3 have had a strip erased so that these 2 polygons
appear to be 4 polygons. However, the attribute table shows that as far as ArcView is
concerned there are still only 3 polygons in the view. Polygons with multiple parts can
be created during analysis, and although this can sometimes be a problem it is a real
advantage when dealing with tax parcels, for example, that have been split by a new road
right of way.

Figure 2.7. Structure of a polygon shape file showing the polygons, the vertices, and the attribute
files for both.

Figure 2.8. ArcView polygons are complete with all boundaries.

Figure 2.9. What appear to be 5 polygons are, as far as
ArcView is concerned, actually only 3.
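
The clockwise ordering of polygon rings described above can be tested with the shoelace
formula for signed area. This is a minimal sketch with hypothetical function names; a
ring is assumed to be a list of (x, y) vertices, with or without the first vertex
repeated at the end.

```python
def signed_area(ring):
    """Shoelace formula: negative for clockwise rings, positive for counter-clockwise."""
    total = 0.0
    # Pair every vertex with the next one, wrapping from the last back to the first.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0

def is_clockwise(ring):
    """True if the ring's vertices are listed in clockwise order."""
    return signed_area(ring) < 0
```

A square listed as (0,0), (0,2), (2,2), (2,0), (0,0) has a signed area of -4 and is
therefore clockwise, with the polygon interior on the right of each edge. In the published
shapefile specification, rings that bound holes are stored in the opposite,
counter-clockwise order, so the same test can tell an outer ring from a hole.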

Although shapefiles do not have topology, they can still be used in analysis because the
software can compute the necessary relationships on the fly. With modern computers,
this is not a problem since computational speeds are very high. Because of the relatively
simple data structure, shapefiles draw more rapidly than do coverages.

E00 AND OTHER EXCHANGE FILES


In order to move data between different systems, ESRI has what are called Exchange files,
commonly referred to as E00 files because that is the extension on the files. These files
are ASCII (text) files and can be read by most ESRI software regardless of platform
(Windows, Unix, etc.). They are disk space hogs, however. With today's software,
shapefiles are a better alternative but cannot be read by all software on all systems.

TERMINOLOGY REVIEW
Shapefile: A drawing file without topology (in the usual sense) designed for use
           with ArcView and ArcGIS
Coverage:  An ESRI vector data structure with topology
Directory: AKA Windows folder
E00 file:  ArcInfo ASCII exchange file
Folder:    AKA directory
Node:      A point that is used to start and stop arcs
Polygon:   An arc structure that closes on itself
TIN:       Triangulated Irregular Network used to model continuous surfaces
Topology:  Explicit spatial relationships between features
Vertex:    A point that is used to control the shape of an arc
Workspace: A directory (folder) that contains an INFO folder and is used to
           hold geographic data pertaining to some project or set of data.

SUMMARY
There are two aspects of GIS data files that are important to the efficient use of the
technology: the actual data model or structure of the GI data files and their file structure.
The important aspects of file structure for coverages and shapefiles are as follows:
1. Coverages are folders (or directories if you prefer). If the coverage is other than a
   PC ArcInfo coverage then it is located in a workspace. Having an INFO
   folder identifies a workspace. Coverages in a workspace MUST be copied or
   moved using the management tools in the software and must NEVER be copied
   or moved by DOS commands or Windows click and drag techniques. PC ArcInfo
   coverages are also folders but can be moved with normal DOS or Windows commands.
2. Coverages are topological databases and have a complex data structure. They are
   folders (directories) containing a set of files that, for the most part, contain
   pointers to data stored in the INFO directory within the workspace. With Windows
   or DOS commands you can move only entire workspaces. You must use ArcInfo data
   management tools or ArcCatalog to move individual coverages.
3. Shapefiles are simpler than coverages and can be moved through the use of DOS
   and Windows techniques. But you have to be careful that you get all of the files
   describing a shape. Shapefiles are not topological.
