• It takes about 120 nanoseconds to access information from RAM, while it takes about
30 milliseconds if it is on disk.
• This is analogous to spending 20 seconds looking up a topic in the index of a book versus
spending 58 days searching the whole library for the same thing!
• Transfer of data from disks to RAM and vice versa takes more time than calculations
made in the Central Processing Unit of a computer.
• RAM chips are located on the motherboard so the distance the electrical
signals have to travel from the CPU to RAM or in the opposite direction is
much shorter compared to the distance between the CPU and secondary
storage devices. The shorter the distance, the faster the processing.
• It takes a few nanoseconds for the CPU to access RAM, but several milliseconds
to access secondary storage. Working with secondary storage also involves
mechanical operations such as spinning the disk.
• Volatile primary storage: when the power is off, all contents of RAM are lost. That is why data
from RAM is saved as files on secondary storage, which is non-volatile and almost permanent
(it eventually wears out or its technology becomes outdated)
• Virtually infinite secondary storage: when you run out of space on one disk, you use another.
On the contrary there is a limited amount of RAM that can be accessed by the CPU. Some
programs will not run on a particular computer system because there is not enough RAM
available.
• Cheaper secondary storage: secondary storage is cheaper than RAM in terms of cost per unit
of data. Therefore, we need a file structure design that combines the speed of RAM with the
capacity of disks.
• Dynamic nature of data: insertions, deletions, searching and updates of information are
common.
• Need for fast searching and retrieval: finding data from a collection of items in least time, even
after deletion and insertion of considerable amount of information, without reorganizing the
data. File organization must adjust gracefully to these operations.
Ideally, we would like to get information with one access to the disk; when that is not possible,
with as few accesses as possible. We also want file structures that group information into what
we call records, so that we get a useful bulk of information in a single disk access.
A file structure design that meets the above goals is easy to come up with for files that never
change. But in reality, we add, delete and update files, which makes reaching these goals much
more difficult.
Solutions Made
• Tapes were the early storage devices, on which access was sequential.
• Indexes were then used so that a key could be searched quickly before its actual information
was accessed from the disk.
• Binary trees were applied in the 1960s, but with deletions and additions they become
unbalanced, which results in long searches requiring more disk accesses.
• Tree-structured file organization: B-tree, B*-tree and B+-tree.
• Direct access design in the form of hashing.
A file on a disk, i.e., one that exists physically, is called a physical file. A file as viewed by the
user is called a logical file. Consider the following Pascal code:
assign(input_file, 'input.txt');
This statement asks the operating system to find the physical file ‘input.txt’ from the disk and to
hook it up to its logical filename input_file. This logical file is what we use to refer to the physical
file inside the program.
Secondary storage devices hold files that are not currently being used. For a file to be used it must
be copied to main memory first. After any modifications files must be saved to secondary storage.
In designing, we are always responsive to the constraints of the medium and the environment.
The point here is we should know the constraints of the medium we are using.
• Direct Access Storage Devices (DASDs) ⇒ also known as random access devices: the
system maintains a list of data locations, so the required piece of data can be found
quickly. The most common DASD media are magnetic disks (hard disks, floppy disks)
and optical disks
• Serial Devices ⇒ use media such as magnetic tapes that permit only serial access.
Goal: Since accessing secondary storage devices is a bottleneck, we need to have a good design
to arrange data in ways that minimize access cost.
Disks
Organization of Disks
All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and
cylinders.
• Platters ⇒ the rigid disks in a drive; data is recorded on their coated surfaces
• Tracks ⇒ concentric circular rings on one side of a platter, arranged successively,
each consisting of sectors
• Sectors ⇒ the smallest addressable portion of a disk.
• Disk pack ⇒ consists of platters on a spindle when a disk drive uses a number of platters.
• Cylinder ⇒ tracks that are directly below and above one another.
⇒ Accessing data on tracks on the same cylinder would not require additional seek
time
Seeking ⇒ Moving the arm that holds the r/w head to the right location on the disk.
⇒ Usually the slowest part of reading information.
Disk Capacity
• Amount of data on a track depends upon how densely bits can be stored: a low-density disk
can hold about 40 kilobytes per track and 35 tracks per surface. A top-of-the-line disk can hold
about 50 kilobytes per track and can have more than 1,000 tracks on a surface.
Example: What is the capacity of the drive with the following characteristics?
#bytes/sector = 512
#sectors/track = 40
#tracks/cylinder = 11
#cylinders = 1331
Three main factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer
time [tbt]
Accessing a block involves three steps:
1. seek ⇒ move the access arm to the correct cylinder (seek time)
2. rotational delay ⇒ rotate the disk under the head to the correct sector
3. transfer ⇒ transfer the block between the disk and main memory (block transfer time)
Seek Time
minimum seek time ⇒ time it takes to move the access arm from a track to its adjacent track.
maximum seek time ⇒ time it takes to move the access arm from the outermost to the innermost
track.
average seek time ⇒ the average of the minimum and maximum seek times.
Rotational Delay
• Time it takes for the head to reach the right place on the track.
• Assumption: Correct block can be recognized by some flag or block identifier at the beginning.
rotational time (tr) ⇒ time needed for a disk pack to complete one whole revolution.
• Average latency time: tr/2 = time it takes the disk drive to make ½ revolution.
• If the rotation speed is given in rpm instead, tr = 60,000 / rpm milliseconds.
• Hard disks usually rotate at about 5,400 rpm ~ 1 rev. per 11.1 ms, so the average
rotational delay is about 5.55 ms.
Block Transfer Time
tbt = B / rbt, where B is the block size in bytes and rbt is the transfer rate in bytes per second.
Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk
drive has the following characteristics:
ts/2 = 30 ms
tr = 16.67 ms
tr/2 = 8.3 ms
rbt = 806 KB/sec
= 825,344 bytes/sec
Magnetic Tape
• A sequential-access storage device in which blocks of data are stored serially along the length
of the tape and can only be accessed in a serial manner
• Logical position of a byte within a file corresponds directly to its physical location relative to the
start of the file.
• Tape is a good medium for archival storage and for transporting data.
• Tape drives come in many shapes, sizes and speeds. Performance of each drive can be
measured in terms of the following quantities:
• Tape (or recording) density ⇒ number of characters or bytes of data that can be
stored per inch (bpi)
• Tape Speed ⇒ commonly 30 to 200 inches per second (ips)
DISK                                       TAPE
Random access                              Sequential access
Used to store files in the short term      Long-term storage of files
Generally serves many processes            Dedicated to one process
Expensive                                  Less expensive
Used as main secondary storage             Considered to be tertiary storage
RAID Levels
Data is stored in a physical storage for later retrieval. In order to organize the data for easy
retrieval, physical file organization and access mechanism have to be determined.
Field and record organization refers to the physical structure of records to store. Records could be
fixed or variable in length.
A Stream File
PROGRAM: writstrm
get output file name and open it with the logical name OUTPUT
get LAST name as input
while (LAST name has a length > 0)
get FIRST name, ADDRESS, CITY, STATE, and ZIP as input
write LAST to the file OUTPUT
write FIRST to the file OUTPUT
write ADDRESS to the file OUTPUT
write CITY to the file OUTPUT
write STATE to the file OUTPUT
write ZIP to the file OUTPUT
get LAST name as input
endwhile
close OUTPUT
end PROGRAM
Given an input, it is written to the file precisely as specified: as a stream of bytes containing no
added information. Once we put all that information together as a single byte stream, there is no
way to take it apart again. The integrity of the fundamental organizational units of the input data
is lost.
Field Structures
Field
A conceptual tool used in file processing.
Does not necessarily exist in any physical sense, yet it is important to the file’s structure.
Smallest logically meaningful unit of information in a file.
A subdivision of a record containing a single attribute of the entity the record describes.
The following are the methods most commonly used to organize a record into fields:
1. Fixed-length fields
• In this method, we can pull the fields back out of the file simply by counting our way
to the end of each field.
• One disadvantage is that padding every field to its full length can make the file larger.
• Data larger than the size of the field will not fit in it. A solution is to fix the size of the
field to cover all cases, but this results in internal fragmentation.
• An appropriate method when the fields are fixed in length or vary little in length.
• In C:
struct {
char last[10];
char first[10];
char address[15];
char city[15];
char state[2];
char zip[9];
} set_of_fields;
Record Structures
Records
The following are the methods most commonly used to organize a file into records:
1. Fixed-length records
• All records contain the same number of bytes.
• Most commonly used method for organizing files.
• Having a fixed length does not imply that the sizes or number of fields in the record
must be fixed – it is possible to have fixed-sized fields, variable-sized fields, or a
combination of both
File Organization
Sequential file organization is the oldest type of file organization since during the 1950s and
1960s, the foundation of many information systems was sequential processing.
Sequential Files
Physical Characteristics
• With relative file organization, there exists a predictable relationship between the key used to
identify a record and that record’s absolute address on an external file.
• Relative addressing allows access to a record directly given only the key, regardless of the
position of the record in the file.
• The file is characterized as providing random access because the logical organization of the
file need not correspond to its physical organization.
• The simplest relative file organization is when the key value corresponds directly to the
physical location of the record in the file.
This method is useful for dense keys, i.e., when values of consecutive keys differ only by one.
If the key collection is not dense, it may result in wasted space. This can be solved by
mapping the large range of non-dense key values into a smaller range of record positions in
the file. ⇒ The key is then no longer the address of the record in the file. Hashing or indexing
may be used.
FILE ACCESS
Sequential Access
• Data is accessed sequentially, starting at the beginning of the file.
• O(n) – expensive if done directly to the disk
• Not advisable for most serious retrieval situations, but it is still the best choice for
some applications: a file with only a few records, searching for a pattern, or a
search where a large number of matches is expected.
Relative Access
• Addresses of records can be obtained directly from a key.
• Uses an indexing technique that allows a user to locate a record in a file with as few
accesses as possible, ideally with just one access.
• The indexing scheme could be one of the following:
• Direct addressing
• Binary search tree
• B-trees
• Multiple-key indexing
• Hashing
If access is sequential, file organization used will not have significant effect on the access cost.
However, relative access entails fixing the size of records or using an index.
Indexing Mechanisms
Index
A tool that associates keys with reference fields, used to locate records in a file.
Key
Identifies a record based on the record’s contents not on the sequence of the records.
Primary keys
Keys that uniquely identify a single record.
Primary keys should be unchanging.
Secondary keys
Used to overcome the shortcomings of the primary key.
Used to access records according to data content.
• Separating logical organization of files from their physical organization. The physical file may
be unorganized, but there may be logically ordered indexes for accessing it.
• Conducting efficient searches on the files
• Indexes hold information about location of records with specific values
• Index structures (hopefully) fit in main memory so index searching is fast
Types of Indexes
Simple Index
• The index is a simple array of structures that contain the keys and reference fields. It allows
binary search on the index table and access to the physical records/blocks directly from the
index table.
• The physical records are not ordered; the index provides the order.
• Physical files are entry-sequenced.
• To create an index of this type:
• Append records to the physical file as they are inserted
• Build an index (primary and/or secondary) on this file
• Deletion/update of physical records requires reorganization of the file and
reorganization of the primary index
Secondary Index
• Secondary indexes can be combined with a boolean AND operation, specifying the
intersection of two subsets of the data file.
Hashing
• Used to obtain addresses from keys. A hash function is used to map a range of key values into
a smaller range of relative addresses.
• Hashing is like indexing in that it involves associating a key with a relative record address.
• Unlike indexing, with hashing there is no obvious connection between the key and the
generated address, since the function "randomly selects" a relative address for a specific
key value, without regard to the physical sequence of the records in the file. Thus it is also
referred to as a randomizing scheme.
• Problem in hashing: Presence of collision. Collision happens when two or more input keys
generate the same address when the same hash function is used.
• Solution: Collision resolution technique or use of perfect hash functions.
Disadvantages of hashing
• Records may collide
• Records may be unevenly distributed; in the worst case, all records hash
into a single address
Disadvantages of perfect hash functions
• Require knowledge of the set of key values in order to generate a perfectly
uniform distribution of keys
• Quite complicated to use
• Have a lot of prerequisites
• It is hard to find a function that produces no collisions
If the index to a file is too large to be stored in the main memory, index access and maintenance
must be done on secondary storage. The disadvantage of this approach is that binary searching
on secondary storage requires several seeks. Also, index rearrangement requires shifting or
sorting records on secondary storage, which is expensive.
Alternatives: use hashing for direct access or tree-structured index for both keyed access
and sequential access
It is not sufficient to organize files merely to store data. To provide efficient access, we must
know how to organize files to improve performance. Data compression and reclaiming unused
space are some of the ways to improve space utilization and file access times.
Data Compression
Data compression is the process of making files smaller. It involves encoding the information in a
file in such a way as to take up less space.
Since smaller files use less storage, compression results in cost savings. In addition, smaller
files can be transmitted faster and processed sequentially faster.
Several methods for data compression exist, and in this course we'll cover:
• Using a different notation
• Suppressing repeated sequences, and
• Assigning variable-length codes
Redundancy Reduction
⇒ Compression by reducing redundancy in data representation
⇒ The three techniques that follow use this method.
Compact Notation ⇒ a compression technique in which we decrease the number of bits in the
data representation.
Run-Length Encoding ⇒ a compression technique in which runs of repeated codes are
replaced by a count of the number of repetitions of the code, followed by the code that is
repeated
The algorithm:
Read through the pixels in the image, copying the values to the output sequence, except
where the same pixel value occurs more than once in succession.
Replace each run of repeated pixel values with the following 3 bytes:
• The run-length code indicator;
• The pixel value that is repeated; and
• The number of times it is repeated.
For example, we want to compress the following sequence of pixel values (in hexadecimal),
where 0xFF does not appear in the image and can serve as the run-length code indicator:
22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
RLE does not guarantee any particular amount of space savings. It may even result in larger
files.
Variable-Length Codes
This is another redundancy reduction technique of data compression, in which the most
frequently used piece of data is assigned the shortest code. Morse code is a classic example:
the most frequent letters get the shortest codes.
E ⇒ •        T ⇒ -
I ⇒ ••       M ⇒ --
S ⇒ •••      O ⇒ ---
H ⇒ ••••
Letter a b c d e f g
Probability 0.4 0.1 0.1 0.1 0.1 0.1 0.1
Code 1 010 011 0000 0001 0010 0011
• Letter a has the greatest probability of occurring, so it is assigned the one-bit code
• Seven letters could be stored using only three bits each, but in this example codes of up
to four bits are used so that the distinct codes can be stored together without delimiters
and still be recognized