
E-528-529, Sector-7, Dwarka, New Delhi-110075 (Nr. Ramphal Chowk and Sector 9 Metro Station)
Ph. 011-47350606, (M) 7838010301-04, www.eduproz.in

Educate Anytime...Anywhere...
"Greetings For The Day" About Eduproz We, at EduProz, started our voyage with a dream of making higher education available for everyone. Since its inception, EduProz has been working as a stepping-stone for the students coming from varied backgrounds. The best part is the classroom for distance learning or correspondence courses for both management (MBA and BBA) and Information Technology (MCA and BCA) streams are free of cost. Experienced faculty-members, a state-of-the-art infrastructure and a congenial environment for learning are the few things that we offer to our students. Our panel of industrial experts, coming from various industrial domains, lead students not only to secure good marks in examination, but also to get an edge over others in their professional lives. Our study materials are sufficient to keep students abreast of the present nuances of the industry. In addition, we give importance to regular tests and sessions to evaluate our students progress. Students can attend regular classes of distance learning MBA, BBA, MCA and BCA courses at EduProz without paying anything extra. Our centrally air-conditioned classrooms, well-maintained library and wellequipped laboratory facilities provide a comfortable environment for learning.

Honing specific skills is essential to succeed in an interview. Keeping this in mind, EduProz has a career counselling and career development cell where we help students prepare for interviews. Our dedicated placement cell has been helping students land their dream jobs on completion of the course.

EduProz is strategically located in Dwarka, West Delhi (walking distance from Dwarka Sector 9 Metro Station and a four-minute drive from the national highway); students can easily reach our centre from anywhere in Delhi and from neighbouring Gurgaon, Haryana, and avail of a quality-oriented education facility at no extra cost.

Why choose EduProz for distance learning?

EduProz provides classroom facilities free of cost. At EduProz, classroom teaching is conducted by experienced faculty. Classrooms are spacious and fully air-conditioned, ensuring a comfortable ambience. The course fee is not overly expensive. Placement assistance and student counselling facilities are provided. Unlike several other distance learning course providers, EduProz strives to help and motivate pupils to get high grades, thus ensuring that they are well placed in life. Students are groomed and prepared to face interview boards. Mock tests, unit tests and examinations are held to evaluate progress. Special care is taken in the personality development department.

"HAVE A GOOD DAY"

Karnataka State Open University (KSOU) was established on 1st June 1996 with the assent of H.E. the Governor of Karnataka as a full-fledged University in the academic year 1996, vide Government notification No/EDI/UOV/dated 12th February 1996 (Karnataka State Open University Act 1992). The Act was promulgated with the objective of incorporating an Open University at the State level for the introduction and promotion of Open University and Distance Education systems in the education pattern of the State and the country, and for the co-ordination and determination of the standards of such systems. Keeping in view the educational needs of the country in general, and of the State in particular, the policies and programmes have been geared to cater to the needy. Karnataka State Open University is a UGC-recognised University of the Distance Education Council (DEC), New Delhi, a regular member of the Association of Indian Universities (AIU), Delhi, a permanent member of the Association of Commonwealth Universities (ACU), London, UK, and the Asian Association of Open Universities (AAOU), Beijing, China, and also has an association with the Commonwealth of Learning (COL). Karnataka State Open University is situated at the north-western end of the Manasagangotri campus, Mysore. The campus, which is about 5 km from the city centre, has a serene atmosphere ideally suited for academic pursuits. The University at present houses the Administrative Office, Academic Block, Lecture Halls, a well-equipped Library, Guest House Cottages, a Moderate Canteen, a Girls' Hostel and a few cottages providing limited accommodation to students coming to Mysore for attending the Contact Programmes or Term-end examinations.

Unit 1: Overview of the Operating Systems: This unit covers the introduction to and evolution of operating systems. It also covers the OS components and their services.

Introduction to Operating Systems
Programs, Code Files, Processes and Threads

A sequence of instructions telling the computer what to do is called a program. The user normally uses a text editor to write their program in a high level language, such as Pascal, C, Java, etc. Alternatively, they may write it in assembly language. Assembly language is a computer language whose statements have an almost one to one correspondence to the instructions understood by the CPU of the computer. It provides a way of specifying in precise detail what machine code the assembler should create.

A compiler is used to translate a high level language program into assembly language or machine code, and an assembler is used to translate an assembly language program into machine code. A linker is used to combine relocatable object files (code files corresponding to incomplete portions of a program) into executable code files (complete code files, for which the addresses have been resolved for all global functions and variables).

The text for a program written in a high level language or assembly language is normally saved in a source file on disk. Machine code for a program is normally saved in a code file on disk. The machine code is loaded into the virtual memory for a process when the process attempts to execute the program.

The notion of a program is becoming more complex nowadays, because of shared libraries. In the old days, the user code for a process was all in one file. However, with GUI libraries becoming so large, this is no longer possible. Library code is now stored in memory that is shared by all processes that use it. Perhaps it is best to use the term program for the machine code stored in or derived from a single code file.

Code files contain more than just machine code. On UNIX, a code file starts with a header, containing information on the position and size of the code (text), initialised data, and uninitialised data segments of the code file. The header also contains other information, such as the initial value to give the program counter (the entry point) and global pointer register. The data for the code and initialised data segments then follows.

As well as the above information, code files can contain a symbol table: a table indicating the names of all functions and global variables, and the virtual addresses they correspond to. The symbol table is used by the linker, when it combines several relocatable object files into a single executable code file, to resolve references to functions in shared libraries. The symbol table is also used for debugging. The structure of UNIX code files on the Alpha is very complex, due to the use of shared libraries.

When a user types the name of a command into the UNIX shell, this results in the creation of what is called a process. On any large computer, especially one with more than one person using it at the same time, there are normally many processes executing at any given time. Under UNIX, every time a user types in a command, they create a separate process. If several users execute the same command, then each one creates a different process. The Macintosh is a little different from UNIX: if the user double clicks on several data files for an application, only one process is created, and this process manages all the data files.

A process consists of the virtual memory, information on open files, and other operating system resources shared by its threads of execution, all executing in the same virtual memory. The threads in a process execute not only the code from a user program; they can also execute the shared library code, operating system kernel code, and (on the Alpha) what is called PALcode.

A process is created to execute a command. The code file for the command is used to initialise the virtual memory containing the user code and global variables. The user stack for the initial thread is cleared, and the parameters to the command are passed as parameters to the main function of the program. Files are opened corresponding to the standard input and output (keyboard and screen, unless file redirection is used). When a process is created, it is created with a single thread of execution. Conventional processes never have more than a single thread of execution, but multi-threaded processes are now becoming commonplace.

We often speak about a program executing, or a process executing a program, when we really mean that a thread within the process executes the program. In UNIX, a new process executing a new program is created by the fork() system call (which creates an almost identical copy of an existing process, executing the same program), followed by the exec() system call (which replaces the program being executed by the new program).

In the Java programming language, a new process executing a new program is created by the exec() method in the Runtime class. The Java exec() is probably implemented as a combination of the UNIX fork() and exec() system calls.
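To make the fork()/exec() pattern described above concrete, here is a minimal C sketch (not taken from the text); the command /bin/ls is only an illustrative placeholder, and error handling is kept to a minimum.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();            /* create an almost identical copy of this process */
        if (pid == 0) {
            /* child: replace the copied program with a new one */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            perror("exec failed");     /* reached only if exec could not run the program */
            exit(1);
        } else if (pid > 0) {
            waitpid(pid, NULL, 0);     /* parent waits for the child to finish */
            printf("child %d finished\n", (int)pid);
        } else {
            perror("fork failed");
        }
        return 0;
    }

After fork(), the parent and child briefly run the same program; exec() then overlays the child's address space with the new program, which is exactly the two-step creation of a new process executing a new program described above.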

A thread is an instance of execution (the entity that executes). All the threads that make up a process share access to the same user program, virtual memory, open files, and other operating system resources. Each thread has its own program counter, general purpose registers, and user and kernel stack. The program counter and general purpose registers for a thread are stored in the CPU when the thread is executing, and saved away in memory when it is not executing.

The Java programming language supports the creation of multiple threads. To create a thread in Java, we create an object that implements the Runnable interface (has a run() method), and use this to create a new Thread object. To initiate the execution of the thread, we invoke the start() method of the thread, which invokes the run() method of the Runnable object. The threads that make up a process need to use some kind of synchronisation mechanism to avoid more than one thread accessing shared data at the same time. In Java, synchronisation is done by synchronised methods. The wait(), notify(), and notifyAll() methods in the Object class are used to allow a thread to wait until the data has been updated by another thread, and to notify other threads when the data has been altered. In UNIX C, the pthreads library contains functions to create new threads, and to provide the equivalent of synchronised methods, wait(), notify(), etc. The Java mechanism is in fact based on the pthreads library. In Java, synchronisation is built into the design of the language (the compiler knows about synchronised methods). In C, there is no syntax to specify that a function (method) is synchronised, and the programmer has to explicitly put in code at the start and end of the method to gain and relinquish exclusive access to a data structure.

Some people call threads lightweight processes, and processes heavyweight processes. Some people call processes tasks. Many application programs, such as Microsoft Word, are starting to make use of multiple threads. For example, there is a thread that processes the input, and a thread for doing repagination in the background. A compiler could have multiple threads: one for lexical analysis, one for parsing, and one for analysing the abstract syntax tree. These can all execute in parallel, although the parser cannot execute ahead of the lexical analyser, and the abstract syntax tree analyser can only process the portion of the abstract syntax tree already generated by the parser. The code for performing graphics can easily be sped up by having multiple threads, each painting a portion of the screen. File and network servers have to deal with multiple external requests, many of which block before the reply is given. An elegant way of programming servers is to have a thread for each request.
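As a small, hedged illustration of the pthreads facilities mentioned above, the following C sketch creates two threads that share a counter and protect it with a mutex (the C counterpart of guarding data with a synchronised method); the names worker and counter exist only for this example.

    #include <pthread.h>
    #include <stdio.h>

    /* shared data, protected by a mutex so only one thread updates it at a time */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* gain exclusive access */
            counter++;
            pthread_mutex_unlock(&lock);  /* relinquish exclusive access */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        /* each thread gets its own stack and registers but shares counter */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 200000 with the mutex in place */
        return 0;
    }

Compile with cc -pthread. Without the mutex the two threads could interleave their updates and lose increments, which is the kind of shared-data problem the synchronisation mechanisms above are designed to prevent.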

Multi-threaded processes are becoming very important, because computers with multiple processors are becoming commonplace, as are distributed systems and servers. It is important that you learn how to program in this manner. Multi-threaded programming, particularly dealing with synchronisation issues, is not trivial, and a good conceptual understanding of synchronisation is essential. Synchronisation is dealt with fully in the stage 3 operating systems paper.

Objectives
An operating system can be thought of as having three objectives:
Convenience: An operating system makes a computer more convenient to use.
Efficiency: An operating system allows the computer system resources to be used in an efficient manner.
Ability to evolve: An operating system should be constructed in such a way as to permit the effective development, testing and introduction of new system functions without interfering with the services currently provided.

What is an Operating System?
An operating system (OS) is a program that controls the execution of application programs and acts as an interface between the user and the computer hardware. The purpose of an OS is to provide an environment in which a user can execute programs in a convenient and efficient manner. The operating system must provide certain services to programs and to the users of those programs in order to make the programming task easier; these services differ from one OS to another.

Functions of an Operating System
Modern operating systems generally have the following three major goals. Operating systems generally accomplish these goals by running processes in a low-privilege state and providing service calls that invoke the operating system kernel in a high-privilege state.

To hide details of hardware
An abstraction is software that hides lower level details and provides a set of higher-level functions. An operating system transforms the physical world of devices, instructions, memory, and time into a virtual world that is the result of abstractions built by the operating system. There are several reasons for abstraction.

First, the code needed to control peripheral devices is not standardized. Operating systems provide subroutines called device drivers that perform operations on behalf of programs, for example, input/output operations. Second, the operating system introduces new functions as it abstracts the hardware. For instance, the operating system introduces the file abstraction so that programs do not have to deal with disks. Third, the operating system transforms the computer hardware into multiple virtual computers, each belonging to a different program. Each program that is running is called a process. Each process views the hardware through the lens of abstraction. Fourth, the operating system can enforce security through abstraction.

Resource Management
An operating system, as a resource manager, controls how processes (the active agents) may access resources (passive entities). One can view operating systems from two points of view: resource manager and extended machine. From the resource-manager point of view, operating systems manage the different parts of the system efficiently; from the extended-machine point of view, operating systems provide a virtual machine to users that is more convenient to use. Structurally, operating systems can be designed as a monolithic system, a hierarchy of layers, a virtual machine system, a micro-kernel, or using the client-server model. The basic concepts of operating systems are processes, memory management, I/O management, the file systems, and security.

Provide an effective user interface
The user interacts with the operating system through the user interface and is usually interested in the look and feel of the operating system. The most important components of the user interface are the command interpreter, the file system, on-line help, and application integration. The recent trend has been toward increasingly integrated graphical user interfaces that encompass the activities of multiple processes on networks of computers.

Evolution of Operating System
Operating systems and computer architecture have had a great deal of influence on each other. To facilitate the use of the hardware, OSs were developed. As operating systems were designed and used, it became obvious that changes in the design of the hardware could simplify them.

Early Systems
In the earliest days of electronic digital computing, everything was done on the bare hardware. Very few computers existed, and those that did exist were experimental in nature.

The researchers who were making the first computers were also the programmers and the users. They worked directly on the bare hardware. There was no operating system. The experimenters wrote their programs in assembly language, and a running program had complete control of the entire computer. Debugging consisted of a combination of fixing both the software and hardware, rewriting the object code and changing the actual computer itself.

The lack of any operating system meant that only one person could use a computer at a time. Even in the research lab, there were many researchers competing for limited computing time. The first solution was a reservation system, with researchers signing up for specific time slots. The high cost of early computers meant that it was essential that the rare computers be used as efficiently as possible. The reservation system was not particularly efficient. If a researcher finished work early, the computer sat idle until the next time slot. If the researcher's time ran out, the researcher might have to pack up his or her work in an incomplete state at an awkward moment to make room for the next researcher. Even when things were going well, a lot of the time the computer actually sat idle while the researcher studied the results (or studied the memory of a crashed program to figure out what went wrong).

The solution to this problem was to have programmers prepare their work off-line on some input medium (often on punched cards, paper tape, or magnetic tape) and then hand the work to a computer operator. The computer operator would load up jobs in the order received (with priority overrides based on politics and other factors). Each job still ran one at a time with complete control of the computer, but as soon as a job finished, the operator would transfer the results to some output medium (punched cards, paper tape, magnetic tape, or printed paper) and deliver the results to the appropriate programmer. If the program ran to completion, the result would be some end data. If the program crashed, memory would be transferred to some output medium for the programmer to study (because some of the early business computing systems used magnetic core memory, these became known as core dumps).

Soon after the first successes with digital computer experiments, computers moved out of the lab and into practical use. The first practical application of these experimental digital computers was the generation of artillery tables for the British and American armies. Much of the early research in computers was paid for by the British and American militaries. Business and scientific applications followed.

As computer use increased, programmers noticed that they were duplicating the same efforts. Every programmer was writing his or her own routines for I/O, such as reading input from a magnetic tape or writing output to a line printer. It made sense to write a common device driver for each input or output device and then have every programmer share the same device drivers rather than each programmer writing his or her own.

Some programmers resisted the use of common device drivers in the belief that they could write more efficient or faster or better device drivers of their own. Additionally, each programmer was writing his or her own routines for fairly common and repeated functionality, such as mathematics or string functions. Again, it made sense to share the work instead of everyone repeatedly reinventing the wheel. These shared functions would be organized into libraries and could be inserted into programs as needed. In the spirit of cooperation among early researchers, these library functions were published and distributed for free, an early example of the power of the open source approach to software development.

Simple Batch Systems
When punched cards were used for user jobs, processing of a job involved physical actions by the system operator, e.g., loading a deck of cards into the card reader, pressing switches on the computer's console to initiate a job, etc. These actions wasted a lot of central processing unit (CPU) time.

[Figure: memory divided into an Operating System area and a User Program Area]
Figure 1.1: Simple Batch System

To speed up processing, jobs with similar needs were batched together and were run as a group. Batch processing (BP) was implemented by locating a component of the BP system, called the batch monitor or supervisor, permanently in one part of the computer's memory. The remaining memory was used to process a user job (the current job in the batch), as shown in Figure 1.1 above.

The delay between job submission and completion was considerable in a batch-processed system, as a number of programs were put in a batch and the entire batch had to be processed before the results were printed. Further, card reading and printing were slow, as they used slower mechanical units compared to the CPU, which was electronic. The speed mismatch was of the order of 1000. To alleviate this problem, programs were spooled. Spool is an acronym for simultaneous peripheral operation on-line. In essence, the idea was to use a cheaper processor, known as a peripheral processing unit (PPU), to read programs and data from cards and store them on a disk. The faster CPU read programs/data from the disk, processed them, and wrote the results back to the disk. The cheaper processor then read the results from the disk and printed them.

Multi Programmed Batch Systems
Even though disks are faster than a card reader/printer, they are still two orders of magnitude slower than the CPU. It is thus useful to have several programs ready to run waiting in the main memory of the CPU. When one program needs input/output (I/O) from disk, it is suspended and another program whose data is already in main memory (as shown in Figure 1.2 below) is taken up for execution. This is called multiprogramming.

[Figure: memory containing the Operating System and Programs 1 to 4]


Figure 1.2: Multi Programmed Batch Systems

Multiprogramming (MP) increases CPU utilization by organizing jobs such that the CPU always has a job to execute. Multiprogramming is the first instance where the operating system must make decisions for the user. The MP arrangement ensures concurrent operation of the CPU and the I/O subsystem. It ensures that the CPU is allocated to a program only when it is not performing an I/O operation.

Time Sharing Systems
Multiprogramming features were superimposed on BP to ensure good utilization of the CPU, but from the point of view of a user the service was poor, as the response time, i.e., the time elapsed between submitting a job and getting the results, was unacceptably high. The development of interactive terminals changed the scenario. Computation became an on-line activity. A user could provide inputs to a computation from a terminal and could also examine the output of the computation on the same terminal. Hence, the response time needed to be drastically reduced. This was achieved by storing programs of several users in memory and providing each user a slice of time on the CPU to process his/her program.

Distributed Systems
A recent trend in computer systems is to distribute computation among several processors. In loosely coupled systems the processors do not share memory or a clock. Instead, each processor has its own local memory. The processors communicate with one another using a communication network. The processors in a distributed system may vary in size and function, and are referred to by a number of different names, such as sites, nodes, computers and so on, depending on the context. The major reasons for building distributed systems are:

Resource sharing: If a number of different sites are connected to one another, then a user at one site may be able to use the resources available at another.
Computation speed-up: If a particular computation can be partitioned into a number of sub-computations that can run concurrently, then a distributed system may allow a user to distribute the computation among the various sites to run them concurrently.
Reliability: If one site fails in a distributed system, the remaining sites can potentially continue operations.
Communication: There are many instances in which programs need to exchange data with one another. A distributed database system is an example of this.

Real-time Operating System
The advent of timesharing provided good response times to computer users. However, timesharing could not satisfy the requirements of some applications. Real-time (RT) operating systems were developed to meet the response requirements of such applications. There are two flavors of real-time systems. A hard real-time system guarantees that critical tasks complete at a specified time. A less restrictive type of real-time system is a soft real-time system, where a critical real-time task gets priority over other tasks, and retains that priority until it completes. Areas in which this type is useful include multimedia, virtual reality, and advanced scientific projects such as undersea exploration and planetary rovers. Because of the expanded uses for soft real-time functionality, it is finding its way into most current operating systems, including major versions of Unix and Windows NT.

A real-time operating system is one which helps to fulfill the worst-case response time requirements of an application. An RT OS provides the following facilities for this purpose:
1. Multitasking within an application.
2. Ability to define the priorities of tasks.
3. Priority-driven or deadline-oriented scheduling.
4. Programmer-defined interrupts.

A task is a sub-computation in an application program, which can be executed concurrently with other sub-computations in the program, except at specific places in its execution called synchronization points. Multi-tasking, which permits the existence of many tasks within the application program, provides the possibility of overlapping the CPU and I/O activities of the application with one another. This helps in reducing its elapsed time. The ability to specify priorities for the tasks provides additional controls to a designer while structuring an application to meet its response-time requirements.

Real time operating systems (RTOS) are specifically designed to respond to events that happen in real time. This can include computer systems that run factory floors, computer systems for emergency room or intensive care unit equipment (or even the entire ICU), computer systems for air traffic control, or embedded systems. RTOSs are grouped according to the response time that is acceptable (seconds, milliseconds, microseconds) and according to whether or not they involve systems where failure can result in loss of life. Examples of real-time operating systems include QNX, Jaluna-1, ChorusOS, LynxOS, Windows CE .NET, and VxWorks AE.

Self assessment questions
1. What do the terms program, process, and thread mean?
2. What is the purpose of a compiler, assembler and linker?

3. What is the structure of a code file? What is the purpose of the symbol table in a code file?
4. Why are shared libraries essential on modern computers?

Operating System Components
Even though not all systems have the same structure, many modern operating systems share the same goal of supporting the following types of system components.

Process Management
The operating system manages many kinds of activities, ranging from user programs to system programs like the printer spooler, name servers, file server, etc. Each of these activities is encapsulated in a process. A process includes the complete execution context (code, data, PC, registers, OS resources in use, etc.). It is important to note that a process is not a program. A process is only one instance of a program in execution. There can be many processes running the same program. The five major activities of an operating system in regard to process management are:
1. Creation and deletion of user and system processes.
2. Suspension and resumption of processes.
3. A mechanism for process synchronization.
4. A mechanism for process communication.
5. A mechanism for deadlock handling.
A small C sketch illustrating creation, suspension, resumption and termination of a process is given below.
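As a hedged sketch of some of these activities, the following C fragment creates a child process and then suspends, resumes and finally terminates it using signals; the timings and messages are arbitrary and chosen only for illustration.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();             /* creation of a new process */
        if (pid == 0) {
            for (;;) {                  /* child: work until terminated */
                puts("child working");
                sleep(1);
            }
        }
        sleep(2);
        kill(pid, SIGSTOP);             /* suspension of the process */
        puts("child suspended");
        sleep(2);
        kill(pid, SIGCONT);             /* resumption of the process */
        puts("child resumed");
        sleep(2);
        kill(pid, SIGTERM);             /* deletion (termination) of the process */
        waitpid(pid, NULL, 0);
        return 0;
    }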

Main-Memory Management
Primary memory or main memory is a large array of words or bytes. Each word or byte has its own address. Main memory provides storage that can be accessed directly by the CPU. That is to say, for a program to be executed, it must be in main memory. The major activities of an operating system in regard to memory management are:
1. Keep track of which parts of memory are currently being used and by whom.
2. Decide which processes are loaded into memory when memory space becomes available.
3. Allocate and de-allocate memory space as needed.

File Management
A file is a collection of related information defined by its creator. Computers can store files on disk (secondary storage), which provides long term storage. Some examples of storage media are magnetic tape, magnetic disk and optical disk. Each of these media has its own properties, like speed, capacity, data transfer rate and access methods. A file system is normally organized into directories to ease their use. These directories may contain files and other directories. The five major activities of an operating system in regard to file management are:
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.

I/O System Management
The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the device driver knows the peculiarities of the specific device to which it is assigned.

Secondary-Storage Management
Generally speaking, systems have several levels of storage, including primary storage, secondary storage and cache storage. Instructions and data must be placed in primary storage or cache to be referenced by a running program. Because main memory is too small to accommodate all data and programs, and its data are lost when power is lost, the computer system must provide secondary storage to back up main memory. Secondary storage consists of tapes, disks, and other media designed to hold information that will eventually be accessed in primary storage. Storage (primary, secondary, cache) is ordinarily divided into bytes or words consisting of a fixed number of bytes. Each location in storage has an address; the set of all addresses available to a program is called an address space. The three major activities of an operating system in regard to secondary storage management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for memory access.

Networking
A distributed system is a collection of processors that do not share memory, peripheral devices, or a clock. The processors communicate with one another through communication lines, which form a network. The communication-network design must consider routing and connection strategies, and the problems of contention and security.

Protection System
If a computer system has multiple users and allows the concurrent execution of multiple processes, then the various processes must be protected from one another's activities. Protection refers to a mechanism for controlling the access of programs, processes, or users to the resources defined by a computer system.

Command Interpreter System
A command interpreter is an interface of the operating system with the user. The user gives commands, which are executed by the operating system (usually by turning them into system calls). The main function of a command interpreter is to get and execute the next user-specified command. The command interpreter is usually not part of the kernel, since multiple command interpreters (shells, in UNIX terminology) may be supported by an operating system, and they do not really need to run in kernel mode. There are two main advantages of separating the command interpreter from the kernel.
1. If we want to change the way the command interpreter looks, i.e., change the interface of the command interpreter, we are able to do that only if the command interpreter is separate from the kernel; we cannot change the code of the kernel, so we could not modify an interface that was built into it.

2. If the command interpreter is a part of the kernel, it is possible for a malicious process to gain access to certain parts of the kernel that it should not have. To avoid this scenario, it is advantageous to have the command interpreter separate from the kernel. A minimal sketch of such a command loop is given below.
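The sketch below is an assumption-laden toy command interpreter in C (single-word commands, no arguments, no built-in commands), not the code of any real shell; the prompt string mysh> is invented for the example.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        char line[256];
        for (;;) {
            printf("mysh> ");                  /* prompt for the next command */
            if (fgets(line, sizeof line, stdin) == NULL)
                break;                         /* end of input: leave the shell */
            line[strcspn(line, "\n")] = '\0';  /* strip the trailing newline */
            if (line[0] == '\0')
                continue;
            pid_t pid = fork();                /* create a process for the command */
            if (pid == 0) {
                execlp(line, line, (char *)NULL);  /* run the requested program */
                perror("exec failed");         /* reached only if exec fails */
                exit(1);
            } else if (pid > 0) {
                waitpid(pid, NULL, 0);         /* wait for the command to finish */
            } else {
                perror("fork failed");
            }
        }
        return 0;
    }

Each command typed by the user is turned into fork(), exec() and wait() system calls, which is what is meant above by the interpreter turning commands into system calls.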

Self Assessment Questions
1. Discuss the various components of an OS.
2. Explain memory management and file management in brief.
3. Write notes on:
1. Secondary-Storage Management
2. Command Interpreter System

Operating System Services
Following are the five services provided by operating systems for the convenience of the users.

Program Execution
The purpose of a computer system is to allow the user to execute programs. So the operating system provides an environment where the user can conveniently run programs. The user does not have to worry about memory allocation or multitasking or anything else; these things are taken care of by the operating system. Running a program involves allocating and de-allocating memory and CPU scheduling in the case of multiple processes. These functions cannot be given to user-level programs. So user-level programs cannot help the user to run programs independently, without help from the operating system.

I/O Operations
Each program requires input and produces output. This involves the use of I/O. The operating system hides from the user the details of the underlying hardware for the I/O. All the user sees is that the I/O has been performed, without any of the details. So the operating system, by providing I/O, makes it convenient for the users to run programs. For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.

File System Manipulation

The output of a program may need to be written into new files or input taken from some files. The operating system provides this service. The user does not have to worry about secondary storage management. The user gives a command for reading from or writing to a file and sees his/her task accomplished. Thus the operating system makes it easier for user programs to accomplish their task. This service involves secondary storage management. The speed of I/O, which depends on secondary storage management, is critical to the speed of many programs, and hence it is best relegated to the operating system rather than giving individual users control of it. It is not difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if this service is left with the operating system.
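As a sketch of how a user program relies on this service, the following C fragment copies one file to another using only the open(), read() and write() system calls; the file names are placeholders and error handling is minimal.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        ssize_t n;
        int in  = open("input.txt", O_RDONLY);                            /* existing file */
        int out = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* new file */
        if (in < 0 || out < 0) {
            perror("open");
            return 1;
        }
        /* the operating system hides disk layout, allocation and buffering
           behind these two simple calls */
        while ((n = read(in, buf, sizeof buf)) > 0)
            write(out, buf, (size_t)n);
        close(in);
        close(out);
        return 0;
    }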

Communications
There are instances where processes need to communicate with each other to exchange information. It may be between processes running on the same computer or running on different computers. By providing this service, the operating system relieves the user from the worry of passing messages between processes. In cases where messages need to be passed to processes on other computers through a network, this can be done by user programs. The user program may be customized to the specifications of the hardware through which the message transits, and provides the service interface to the operating system.
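A hedged C sketch of communication between two processes on the same computer, using a pipe provided by the operating system; the message text is arbitrary.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fd[2];
        if (pipe(fd) == -1) {            /* fd[0] is the read end, fd[1] the write end */
            perror("pipe");
            return 1;
        }
        if (fork() == 0) {
            /* child: send a message through the kernel-managed pipe */
            const char *msg = "hello from the child process";
            close(fd[0]);
            write(fd[1], msg, strlen(msg) + 1);
            close(fd[1]);
            _exit(0);
        }
        /* parent: receive the message; buffering and synchronisation are handled by the OS */
        char buf[64];
        close(fd[1]);
        read(fd[0], buf, sizeof buf);
        printf("parent received: %s\n", buf);
        close(fd[0]);
        wait(NULL);
        return 0;
    }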

Error Detection
An error in one part of the system may cause malfunctioning of the complete system. To avoid such a situation, the operating system constantly monitors the system for detecting errors. This relieves the user from the worry of errors propagating to various parts of the system and causing malfunctions. This service cannot be allowed to be handled by user programs, because it involves monitoring and, in some cases, altering areas of memory, de-allocating memory for a faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs. A user program, if given these privileges, can interfere with the correct (normal) operation of the operating system.

Self Assessment Questions

1. Explain the five services provided by the operating system.

Operating Systems for Different Computers
Operating systems can be grouped according to functionality: operating systems for supercomputers, computer clusters, mainframes, servers, workstations, desktops, handheld devices, real-time systems, or embedded systems.

OS for Supercomputers: Supercomputers are the fastest computers, very expensive, and are employed for specialized applications that require immense amounts of mathematical calculation, for example weather forecasting, animated graphics, fluid dynamic calculations, nuclear energy research, and petroleum exploration. Of the many operating systems used for supercomputing, UNIX and Linux are the most dominant ones.

Computer Clusters Operating Systems: A computer cluster is a group of computers that work together so closely that in many respects they can be viewed as a single computer. The components of a cluster are commonly connected to each other through fast local area networks. Besides many open source operating systems and two versions of Windows 2003 Server, Linux is popularly used for computer clusters.

Mainframe Operating Systems: Mainframes used to be the primary form of computer. Mainframes are large centralized computers, and at one time they provided the bulk of business computing through time sharing. Mainframes are still useful for some large-scale tasks, such as centralized billing systems, inventory systems, database operations, etc. Minicomputers were smaller, less expensive versions of mainframes for businesses that couldn't afford true mainframes. The chief difference between a supercomputer and a mainframe is that a supercomputer channels all its power into executing a few programs as fast as possible, whereas a mainframe uses its power to execute many programs concurrently. Besides various versions of operating systems by IBM, from those for its early System/360 to the newest Z series operating system z/OS, Unix and Linux are also used as mainframe operating systems.

Servers Operating Systems: Servers are computers or groups of computers that provide services to other computers, connected via a network. Based on the requirements, there are various versions of server operating systems from different vendors, starting with Microsoft's servers from Windows NT to Windows 2003, OS/2 servers, UNIX servers, Mac OS servers, and various flavors of Linux.

Workstation Operating Systems: Workstations are more powerful versions of personal computers. Like desktop computers, a particular workstation is often used by only one person, and it runs a more powerful version of a desktop operating system. Most of the time, workstations are used as clients in a network environment. The popular workstation operating systems are Windows NT Workstation, Windows 2000 Professional, OS/2 Clients, Mac OS, UNIX, Linux, etc.

Desktop Operating Systems: A personal computer (PC) is a microcomputer whose price, size, and capabilities make it useful for individuals; PCs are also known as desktop computers or home computers. Desktop operating systems are used for personal computers, for example DOS, Windows 9x, Windows XP, Macintosh OS, Linux, etc.

Embedded Operating Systems: Embedded systems are combinations of processors and special software that are inside another device, such as the electronic ignition system on cars. Examples of embedded operating systems are Embedded Linux, Windows CE, Windows XP Embedded, FreeDOS, FreeRTOS, etc.

Operating Systems for Handheld Computers: Handheld operating systems are much smaller and less capable than desktop operating systems, so that they can fit into the limited memory of handheld devices. These operating systems include Palm OS, Windows CE, EPOC, and many Linux versions such as Qt Palmtop, Pocket Linux, etc.

Summary
An operating system (OS) is a program that controls the execution of application programs and acts as an interface between the user and the computer hardware. The objectives of an operating system are convenience, efficiency, and the ability to evolve. Besides this, the operating system performs functions such as hiding details of the hardware, resource management, and providing an effective user interface. The process management component of the operating system is responsible for the creation, termination and other state transitions of a process. The memory management unit is mainly responsible for the allocation and de-allocation of memory to processes, and for keeping track of memory usage by different processes. The operating system services are program execution, I/O operations, file system manipulation, communication and error detection.

Terminal Questions

1. What is an operating system?
2. What are the objectives of an operating system?
3. Describe, in brief, the functions of an operating system.
4. Explain the evolution of operating systems in brief.
5. Write a note on batch OS. Discuss how it differs from multi-programmed batch systems.
6. What is the difference between multi-programming and timesharing operating systems?
7. What are the typical features that an operating system provides?
8. Explain the functions of the operating system as a file manager.
9. What are the different services provided by an operating system?
10. Write notes on:
1. Mainframe Operating Systems
2. Embedded Operating Systems
3. Servers Operating Systems
4. Desktop Operating Systems


Unit 2: Operating System Architecture: This unit deals with the simple structure, the extended machine, and layered approaches. It covers different methodologies (models) for OS design. It introduces virtual machines, virtual environments and machine aggregation, and also describes implementation techniques.

Introduction
A system as large and complex as a modern operating system must be engineered carefully if it is to function properly and be modified easily. A common approach is to partition the task into small components rather than have one monolithic system. Each of these modules should be a well-defined portion of the system, with carefully defined inputs, outputs, and functions. In this unit, we discuss how various components of an operating system are interconnected and melded into a kernel.

Objective: At the end of this unit, readers would be able to understand:

What is a Kernel?
Monolithic Kernel Architecture
Layered Architecture
Microkernel Architecture
Operating System Components
Operating System Services

OS as an Extended Machine
We can think of an operating system as an Extended Machine standing between our programs and the bare hardware.

As shown in Figure 2.1 above, the operating system interacts with the hardware, hiding it from the application programs and the user. Thus it acts as an interface between user programs and the hardware.

Self Assessment Questions
1. What is the role of an Operating System?

Simple Structure Many commercial systems do not have well-defined structures. Frequently, such operating systems started as small, simple, and limited systems and then grew beyond their original scope. MS-DOS is an example of such a system. It was originally designed and implemented by a few people who had no idea that it would become so popular. It was written to provide the most functionality in the least space, so it was not divided into

modules carefully. In MS-DOS, the interfaces and levels of functionality are not well separated. For instance, application programs are able to access the basic I/O routines to write directly to the display and disk drives. Such freedom leaves MS-DOS vulnerable to errant (or malicious) programs, causing entire system crashes when user programs fail. Of course, MS-DOS was also limited by the hardware of its era. Because the Intel 8088 for which it was written provides no dual mode and no hardware protection, the designers of MS-DOS had no choice but to leave the base hardware accessible.

Another example of limited structuring is the original UNIX operating system, which was also initially limited by hardware functionality. It consists of two separable parts:

the kernel and the system programs

The kernel is further separated into a series of interfaces and device drivers, which have been added and expanded over the years as UNIX has evolved. We can view the traditional UNIX operating system as being layered. Everything below the system call interface and above the physical hardware is the kernel. The kernel provides the file system, CPU scheduling, memory management, and other operating-system functions through system calls. Taken in sum, that is an enormous amount of functionality to be combined into one level. This monolithic structure was difficult to implement and maintain. Self Assessment Questions 1. In MS-DOS, the interfaces and levels of functionality are not well separated. Comment on this. 2. What are the components of a Unix Operating System?

Layered Approach
With proper hardware support, operating systems can be broken into pieces that are smaller and more appropriate than those allowed by the original MS-DOS or UNIX systems. The operating system can then retain much greater control over the computer and over the applications that make use of that computer. Implementers have more freedom in changing the inner workings of the system and in creating modular operating systems. Under the top-down approach, the overall functionality and features are determined and then separated into components. Information hiding is also important, because it leaves programmers free to implement the low-level routines as they see fit, provided that the external interface of the routine stays unchanged and that the routine itself performs the advertised task.

A system can be made modular in many ways. One method is the layered approach, in which the operating system is broken up into a number of layers (levels). The bottom layer (layer 0) is the hardware; the highest (layer N) is the user interface.

Users
File Systems
Inter-process Communication
I/O and Device Management
Virtual Memory
Primitive Process Management
Hardware

Fig. 2.2: Layered Architecture

An operating-system layer is an implementation of an abstract object made up of data and the operations that can manipulate those data. A typical operating-system layer, say layer M, consists of data structures and a set of routines that can be invoked by higher-level layers. Layer M, in turn, can invoke operations on lower-level layers.

The main advantage of the layered approach is simplicity of construction and debugging. The layers are selected so that each uses functions (operations) and services of only lower-level layers. This approach simplifies debugging and system verification. The first layer can be debugged without any concern for the rest of the system, because, by definition, it uses only the basic hardware (which is assumed correct) to implement its functions. Once the first layer is debugged, its correct functioning can be assumed while the second layer is debugged, and so on. If an error is found during debugging of a particular layer, the error must be on that layer, because the layers below it are already debugged. Thus, the design and implementation of the system is simplified.

Each layer is implemented with only those operations provided by lower-level layers. A layer does not need to know how these operations are implemented; it needs to know only what these operations do. Hence, each layer hides the existence of certain data structures, operations, and hardware from higher-level layers.

The major difficulty with the layered approach involves appropriately defining the various layers. Because a layer can use only lower-level layers, careful planning is necessary. For example, the device driver for the backing store (disk space used by virtual-memory algorithms) must be at a lower level than the memory-management routines, because memory management requires the ability to use the backing store. Other requirements may not be so obvious. The backing-store driver would normally be above the CPU scheduler, because the driver may need to wait for I/O and the CPU can be rescheduled during this time. However, on a larger system, the CPU scheduler may have more information about all the active processes than can fit in memory. Therefore, this information may need to be swapped in and out of memory, requiring the backing-store driver routine to be below the CPU scheduler.

A final problem with layered implementations is that they tend to be less efficient than other types. For instance, when a user program executes an I/O operation, it executes a system call that is trapped to the I/O layer, which calls the memory-management layer, which in turn calls the CPU-scheduling layer, and the request is then passed to the hardware. At each layer, the parameters may be modified, data may need to be passed, and so on. Each layer adds overhead to the system call; the net result is a system call that takes longer than one on a non-layered system. These limitations have caused a small backlash against layering in recent years. Fewer layers with more functionality are being designed, providing most of the advantages of modularized code while avoiding the difficult problems of layer definition and interaction.

Self Assessment Questions
1. What is the layered architecture of UNIX?
2. What are the advantages of layered architecture?

Micro-kernels
We have already seen that as UNIX expanded, the kernel became large and difficult to manage. In the mid-1980s, researchers at Carnegie Mellon University developed an operating system called Mach that modularized the kernel using the microkernel approach. This method structures the operating system by removing all nonessential components from the kernel and implementing them as system-level and user-level programs. The result is a smaller kernel. There is little consensus regarding which services should remain in the kernel and which should be implemented in user space. Typically, however, micro-kernels provide minimal process and memory management, in addition to a communication facility.

[Figure: client processes, a file server, device drivers and a virtual-memory server running in user space on top of the microkernel and the hardware]

Fig. 2.3: Microkernel Architecture

The main function of the microkernel is to provide a communication facility between the client program and the various services that are also running in user space. Communication is provided by message passing. The client program and a service never interact directly; rather, they communicate indirectly by exchanging messages with the microkernel.

One benefit of the microkernel approach is ease of extending the operating system. All new services are added to user space and consequently do not require modification of the kernel. When the kernel does have to be modified, the changes tend to be fewer, because the microkernel is a smaller kernel. The resulting operating system is easier to port from one hardware design to another. The microkernel also provides more security and reliability, since most services run as user rather than kernel processes; if a service fails, the rest of the operating system remains untouched.

Several contemporary operating systems have used the microkernel approach. Tru64 UNIX (formerly Digital UNIX) provides a UNIX interface to the user, but it is implemented with a Mach kernel. The Mach kernel maps UNIX system calls into messages to the appropriate user-level services. The following figure shows the UNIX operating system architecture. At the centre is the hardware, covered by the kernel. Above that are the UNIX utilities and the command interface, such as the shell (sh), etc.

Self Assessment Questions
1. What other facilities does a micro-kernel provide in addition to a communication facility?
2. What are the benefits of a micro-kernel?

UNIX Kernel Components
The UNIX kernel has components as depicted in Figure 2.5 below. The figure is divided into three levels: user mode, kernel mode, and hardware. The user mode contains user programs, which can access the services of the kernel components using the system call interface. The kernel mode has four major components: system calls, the file subsystem, the process control subsystem, and hardware control. The system calls are the interface between user programs and the file and process control subsystems. The file subsystem is responsible for file and I/O management through device drivers.

The process control subsystem contains the scheduler, inter-process communication and memory management. Finally, hardware control is the interface between these two subsystems and the hardware.

Fig. 2.5: Unix kernel components

Another example is QNX. QNX is a real-time operating system that is also based on the microkernel design. The QNX microkernel provides services for message passing and process scheduling. It also handles low-level network communication and hardware interrupts. All other services in QNX are provided by standard processes that run outside the kernel in user mode.

Unfortunately, microkernels can suffer from performance decreases due to increased system-function overhead. Consider the history of Windows NT. The first release had a layered microkernel organization. However, this version delivered low performance compared with that of Windows 95. Windows NT 4.0 partially redressed the performance problem by moving layers from user space to kernel space and integrating them more closely. By the time Windows XP was designed, its architecture was more monolithic than microkernel.

Self Assessment Questions
1. What are the components of the UNIX kernel?

2. Under what circumstances may a micro-kernel suffer from a performance decrease?

Modules
Perhaps the best current methodology for operating-system design involves using object-oriented programming techniques to create a modular kernel. Here, the kernel has a set of core components and dynamically links in additional services either during boot time or during run time. Such a strategy uses dynamically loadable modules and is common in modern implementations of UNIX, such as Solaris, Linux and Mac OS X. For example, the Solaris operating system structure is organized around a core kernel with seven types of loadable kernel modules:
1. Scheduling classes
2. File systems
3. Loadable system calls
4. Executable formats
5. STREAMS formats
6. Miscellaneous
7. Device and bus drivers

Such a design allows the kernel to provide core services, yet also allows certain features to be implemented dynamically. For example, device and bus drivers for specific hardware can be added to the kernel, and support for different file systems can be added as loadable modules (a minimal sketch of such a loadable module, for Linux, follows the questions below). The overall result resembles a layered system in that each kernel section has defined, protected interfaces; but it is more flexible than a layered system in that any module can call any other module. Furthermore, the approach is like the microkernel approach in that the primary module has only core functions and knowledge of how to load and communicate with other modules; but it is more efficient, because modules do not need to invoke message passing in order to communicate.

Self Assessment Questions
1. Which strategy uses dynamically loadable modules and is common in modern implementations of UNIX?
2. What are the different types of loadable kernel modules around which the Solaris operating system structure is organized?
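As an illustration of a dynamically loadable module, here is a minimal, hedged sketch for Linux (one of the systems named above). It assumes a kernel build environment and an out-of-tree module Makefile, which are not shown, and it does nothing beyond logging when it is loaded and unloaded.

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal loadable kernel module sketch");

    static int __init hello_init(void)
    {
        pr_info("hello module loaded\n");    /* runs when the module is inserted */
        return 0;
    }

    static void __exit hello_exit(void)
    {
        pr_info("hello module unloaded\n");  /* runs when the module is removed */
    }

    module_init(hello_init);
    module_exit(hello_exit);

The module would be inserted with insmod and removed with rmmod, at which point the kernel calls hello_init() and hello_exit() respectively; real modules register file systems, drivers or other services in the same way.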

Introduction to Virtual Machine The layered approach of operating systems is taken to its logical conclusion in the concept of virtual machine. The fundamental idea behind a virtual machine is to abstract the hardware of a single computer (the CPU, Memory, Disk drives, Network Interface Cards, and so forth) into several different execution environments and thereby creating the illusion that each separate execution environment is running its own private

computer. By using CPU scheduling and virtual memory techniques, an operating system can create the illusion that a process has its own processor with its own (virtual) memory. Normally a process has additional features, such as system calls and a file system, which are not provided by the bare hardware. The virtual machine approach does not provide any such additional functionality, but rather an interface that is identical to the underlying bare hardware. Each process is provided with a (virtual) copy of the underlying computer. Hardware Virtual Machine The original meaning of virtual machine, sometimes called a hardware virtual machine, is that of a number of discrete identical execution environments on a single computer, each of which runs an operating system (OS). This can allow applications written for one OS to be executed on a machine which runs a different OS, or provide execution sandboxes which provide a greater level of isolation between processes than is achieved when running multiple processes on the same instance of an OS. One use is to provide multiple users the illusion of having an entire computer, one that is their private machine, isolated from other users, all on a single physical machine. Another advantage is that booting and restarting a virtual machine can be much faster than with a physical machine, since it may be possible to skip tasks such as hardware initialization. Such software is now often referred to with the terms virtualization and virtual servers. The host software which provides this capability is often referred to as a virtual machine monitor or hypervisor. Software virtualization can be done in three major ways:
Emulation, full system simulation, or full virtualization with dynamic recompilation: the virtual machine simulates the complete hardware, allowing an unmodified OS for a completely different CPU to be run.
Paravirtualization: the virtual machine does not simulate hardware but instead offers a special API that requires OS modifications. An example of this is XenSource's XenEnterprise (www.xensource.com).
Native virtualization and full virtualization: the virtual machine only partially simulates enough hardware to allow an unmodified OS to be run in isolation, but the guest OS must be designed for the same type of CPU. The term native virtualization is also sometimes used to indicate that hardware assistance through Virtualization Technology is used.

Application virtual machine Another meaning of virtual machine is a piece of computer software that isolates the application being used by the user from the computer. Because versions of the virtual

machine are written for various computer platforms, any application written for the virtual machine can be run on any of these platforms, instead of having to produce separate versions of the application for each computer and operating system. The application is run on the computer using an interpreter or Just-In-Time compilation. One of the best known examples of an application virtual machine is Sun Microsystems' Java Virtual Machine.
Self Assessment Questions 1. What do you mean by a Virtual Machine? 2. Differentiate Hardware Virtual Machines and Software Virtual Machines.

Virtual Environment A virtual environment (otherwise referred to as a virtual private server) is another kind of virtual machine. In fact, it is a virtualized environment for running user-level programs (i.e. not the operating system kernel and drivers, but applications). Virtual environments are created using software implementing the operating system-level virtualization approach, such as Virtuozzo, FreeBSD Jails, Linux-VServer, Solaris Containers, chroot jail and OpenVZ. Machine Aggregation A less common use of the term is to refer to a computer cluster consisting of many computers that have been aggregated together as a larger and more powerful virtual machine. In this case, the software allows a single environment to be created spanning multiple computers, so that the end user appears to be using only one computer rather than several. PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are two common software packages that permit a heterogeneous collection of networked UNIX and/or Windows computers to be used as a single, large, parallel computer. Thus large computational problems can be solved more cost-effectively by using the aggregate power and memory of many computers than with a traditional supercomputer. The Plan 9 operating system from Bell Labs uses this approach. Boston Circuits released the gCore (grid-on-chip) Central Processing Unit (CPU) with 16 ARC 750D cores and a Time-machine hardware module to provide a virtual machine that uses this approach.

Self Assessment Questions 1. What is a Virtual Environment? 2. Explain Machine Aggregation.

Implementation Techniques Emulation of the underlying raw hardware (native execution) This approach is described as full virtualization of the hardware, and can be implemented using a Type 1 or Type 2 hypervisor. (A Type 1 hypervisor runs directly on the hardware; a Type 2 hypervisor runs on another operating system, such as Linux.) Each virtual machine can run any operating system supported by the underlying hardware. Users can thus run two or more different guest operating systems simultaneously, in separate private virtual computers. The pioneer system using this concept was IBM's CP-40, the first (1967) version of IBM's CP/CMS (1967-1972) and the precursor to IBM's VM family (1972-present). With the VM architecture, most users run a relatively simple single-user interactive operating system, CMS, as a guest on top of the VM control program (VM-CP). This approach kept the CMS design simple, as if it were running alone; the control program quietly provides multitasking and resource management services behind the scenes. In addition to CMS, VM users can run any of the other IBM operating systems, such as MVS or z/OS. z/VM is the current version of VM, and is used to support hundreds or thousands of virtual machines on a given mainframe. Some installations use Linux for zSeries to run Web servers, where Linux runs as the operating system within many virtual machines. Full virtualization is particularly helpful in operating system development, when experimental new code can be run at the same time as older, more stable versions, each in a separate virtual machine. (The process can even be recursive: IBM debugged new versions of its virtual machine operating system, VM, in a virtual machine running under an older version of VM, and even used this technique to simulate new hardware.) The x86 processor architecture as used in modern PCs does not actually meet the Popek and Goldberg virtualization requirements. Notably, there is no execution mode where all sensitive machine instructions always trap, which would allow per-instruction virtualization. Despite these limitations, several software packages have managed to provide virtualization on the x86 architecture, even though dynamic recompilation of privileged code, as first implemented by VMware, incurs some performance overhead compared with a VM running on a natively virtualizable architecture such as the IBM System/370 or Motorola MC68020. By now, several other software packages such as Virtual PC, VirtualBox, Parallels Workstation and Virtual Iron manage to implement virtualization on x86 hardware.

On the other hand, plex86 can run only Linux under Linux using a specific patched kernel. It does not emulate a processor, but uses bochs for emulation of motherboard devices. Intel and AMD have introduced features to their x86 processors to enable virtualization in hardware.

Emulation of a non-native system Virtual machines can also perform the role of an emulator, allowing software applications and operating systems written for one computer processor architecture to be run on another. Some virtual machines emulate hardware that only exists as a detailed specification. For example:

One of the first was the p-code machine specification, which allowed programmers to write Pascal programs that would run on any computer running virtual machine software that correctly implemented the specification.
The specification of the Java virtual machine.
The Common Language Infrastructure virtual machine at the heart of the Microsoft .NET initiative.
Open Firmware allows plug-in hardware to include boot-time diagnostics, configuration code, and device drivers that will run on any kind of CPU.

This technique allows diverse computers to run any software written to that specification; only the virtual machine software itself must be written separately for each type of computer on which it runs.
Self Assessment Questions 1. What are the techniques used to realize the Virtual Machine concept? 2. What are the advantages of Virtual Machines?

Operating system-level virtualization Operating System-level Virtualization is a server virtualization technology which virtualizes servers on an operating system (kernel) layer. It can be thought of as partitioning: a single physical server is sliced into multiple small partitions (otherwise called virtual environments (VE), virtual private servers (VPS), guests, zones etc); each such partition looks and feels like a real server, from the point of view of its users. The operating system level architecture has low overhead that helps to maximize efficient use of server resources. The virtualization introduces only a negligible overhead and allows running hundreds of virtual private servers on a single physical server. In contrast,

approaches such as full virtualization (like VMware) and paravirtualization (like Xen or UML) cannot achieve such a level of density, due to the overhead of running multiple kernels. On the other hand, operating system-level virtualization does not allow running different operating systems (i.e. different kernels), although different libraries, distributions, etc. are possible.
Self Assessment Questions 1. Describe Operating System-level Virtualization.

Summary The virtual machine concept has several advantages. In this environment, there is complete protection of the various system resources. Each virtual machine is completely isolated from all other virtual machines, so there are no protection problems. At the same time, however, there is no direct sharing of resources; two approaches to provide sharing have been implemented. A virtual machine is a perfect vehicle for operating systems research and development. The operating system, as an extended machine, acts as an interface between the hardware and user application programs. The kernel is the essential centre of a computer operating system, i.e. the core that provides basic services for all other parts of the operating system. It includes the interrupt handler, scheduler, operating system address space manager, etc. In the layered architecture of operating systems, the components of the kernel are built as layers on one another, and each layer can interact with its neighbours through interfaces. In the micro-kernel architecture, most of these components are not part of the kernel but act as a layer above the kernel, and the kernel comprises only the essential, basic components.

Terminal Questions
1. Explain operating system as extended machine.
2. What is a kernel? What are the main components of a kernel?
3. Explain monolithic type of kernel architecture in brief.
4. What is a micro-kernel? Describe its architecture.
5. Compare micro-kernel with layered architecture of operating system.
6. Describe UNIX kernel components in brief.
7. What are the components of operating system?
8. Explain the responsibilities of operating system as process management.
9. Explain the function of operating system as file management.
10. What are different services provided by an operating system?

Unit 3: Process Management: This unit covers process management and threads. It briefly describes process creation, termination, process states and process control, and discusses processes vs. threads, types of threads, etc.

Introduction This unit discusses the definition of a process, process creation, process termination, process states, and process control. It also deals with threads and thread types. A process can be simply defined as a program in execution. A process, along with the program code, comprises the program counter value, processor register contents, values of variables, the stack and program data. A process is created and terminated, and it follows some or all of the states of process transition, such as New, Ready, Running, Waiting, and Exit. A thread is a single sequence stream within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. There are two types of threads: user-level threads (ULT) and kernel-level threads (KLT). User-level threads are mostly used on systems where the operating system does not support threads, but they can also be combined with kernel-level threads. Threads also have properties similar to processes, e.g. execution states, context switch etc.

Objectives: At the end of this unit, you will be able to understand:
What is a Process?
Process Creation, Process Termination, Process States, Process Control
Threads and Types of Threads

What is a Process? The notion of a process is central to the understanding of operating systems. The term process is used somewhat interchangeably with task or job. There are quite a few definitions presented in the literature, for instance:
A program in execution.
An asynchronous activity.
The entity to which processors are assigned.
The dispatchable unit.

And many more, but the definition "a program in execution" seems to be the most frequently used, and this is the concept we will use in the present study of operating systems. Now that we have agreed upon the definition of a process, the question is: what is the relation between a process and a program? Are they the same thing under different names, or is it that when the program is not executing it is called a program and when it is executing it becomes a process? To be precise, a process is not the same as a program. A process is more than the program code. A process is an active entity, as opposed to a program, which is considered a passive entity. As we all know, a program is an algorithm expressed in some programming language. Being passive, a program is only a part of a process. A process, on the other hand, includes:
Current value of the Program Counter (PC)
Contents of the processor's registers
Values of the variables
The process stack, which typically contains temporary data such as subroutine parameters, return addresses, and temporary variables
A data section that contains global variables
A process is the unit of work in a system. In the process model, all software on the computer is organized into a number of sequential processes. A process includes the PC, registers, and variables. Conceptually, each process has its own virtual CPU. In reality, the CPU switches back and forth among processes. Process Creation In general-purpose systems, some way is needed to create processes as needed during operation. There are four principal events that lead to process creation:
System initialization.
Execution of a process-creation system call by a running process.
A user request to create a new process.
Initialization of a batch job.

Foreground processes interact with users. Background processes stay in the background, sleeping but suddenly springing to life to handle activity such as email, web pages, printing, and so on; background processes are called daemons. In UNIX, a process may create a new process by executing the fork system call; this call creates an exact clone of the calling process. The creating process is called the parent process and the created one is called the child process. Only one parent is needed to create a child process. This creation of processes yields a hierarchical structure of processes. Note that each child has only one parent, but each parent may have many children. After the fork, the two processes, the parent and the child, initially have the same memory image, the same environment strings and the same open files. After a process is created, both the parent and the child have their own distinct address spaces. Following are some reasons for the creation of a process (a small sketch of fork in use is given after this list):
1. User logs on.
2. User starts a program.
3. Operating system creates a process to provide a service, e.g., to manage a printer.
4. Some program starts another process.
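The following is a minimal sketch of process creation with fork() on UNIX, as described above; the printed messages are illustrative only. The parent and the child both continue from the fork, distinguished by its return value, and the parent collects the child's exit status with waitpid().

/* fork_demo.c - minimal illustration of process creation in UNIX */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();              /* create an exact clone of the caller */
    if (pid < 0) {
        perror("fork");              /* creation failed */
        exit(1);
    } else if (pid == 0) {
        printf("child: pid=%d, parent=%d\n", getpid(), getppid());
        exit(0);                     /* normal exit of the child */
    } else {
        int status;
        waitpid(pid, &status, 0);    /* parent collects the child's exit status */
        printf("parent: child %d terminated\n", pid);
    }
    return 0;
}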

Creation of a process involves the following steps:
1. Assign a unique process identifier to the new process, followed by making a new entry in the process table for this process.
2. Allocate space for the process: this operation involves finding how much space is needed by the process and allocating space to the parts of the process such as the user program, user data, stack and process attributes. The space requirement can be taken by default based on the type of the process, or from the parent process if the process is spawned by another process.
3. Initialize the Process Control Block: the PCB contains various attributes required to execute and control a process, such as process identification, processor state information and control information. It can be initialized to standard default values plus the attributes that have been requested for this process.
4. Set the appropriate linkages: the operating system maintains various queues related to processes in the form of linked lists; the newly created process should be attached to one of these queues.

5. Create or expand other data structures: depending on the implementation, an operating system may need to create some data structures for this process, for example to maintain an accounting file for billing or performance assessment. Process Termination A process terminates when it finishes executing its last statement. Its resources are returned to the system, it is purged from any system lists or tables, and its process control block (PCB) is erased, i.e., the PCB's memory space is returned to a free memory pool. A process terminates, usually for one of the following reasons:

Normal Exit: Most processes terminate because they have done their job. The corresponding call in UNIX is exit.
Error Exit: A process discovers a fatal error; for example, a user tries to compile a program that does not exist.
Fatal Error: An error caused by the process due to a bug in the program, for example, executing an illegal instruction, referencing non-existent memory or dividing by zero.
Killed by another Process: A process executes a system call telling the operating system to terminate some other process (this case is sketched below).
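To illustrate the last case, the sketch below shows one process terminating another with the kill() system call; the one-second sleep is only there to let the child start first, and the whole example is illustrative rather than a prescribed technique.

/* kill_demo.c - a parent terminates a looping child with kill() */
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                  /* child: do nothing until terminated */
        for (;;)
            pause();                 /* wait for a signal */
    }
    sleep(1);                        /* give the child time to start */
    kill(pid, SIGTERM);              /* ask the OS to terminate the child */
    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("child %d killed by signal %d\n", pid, WTERMSIG(status));
    return 0;
}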

Process States A process goes through a series of discrete process states during its lifetime. Depending on the implementation, operating systems may differ in the number of states a process goes through. Though there are various state models, ranging from two states to nine states, we will first see a five-state model and then a seven-state model, as the smaller models are now obsolete. Five State Process Model Following are the states of the five-state process model. Figure 3.1 shows these state transitions.

New State: The process is being created. Terminated State: The process has finished execution.

Blocked (Waiting) State: When a process blocks, it does so because logically it cannot continue, typically because it is waiting for input that is not yet available. Formally, a process is said to be blocked if it is waiting for some event to happen (such as an I/O completion) before it can proceed. In this state a process is unable to run until some external event happens. Running State: A process is said to be running if it currently has the CPU, that is, it is actually using the CPU at that particular instant. Ready State: A process is said to be ready if it could use a CPU if one were available. It is runnable but temporarily stopped to let another process run.

Logically, the Running and Ready states are similar. In both cases the process is willing to run; only in the case of the Ready state, there is temporarily no CPU available for it. The Blocked state is different from the Running and Ready states in that the process cannot run even if the CPU is available. Following are the six possible transitions among the above-mentioned five states. Transition 1 occurs when a process discovers that it cannot continue. If a running process initiates an I/O operation before its allotted time expires, the running process voluntarily relinquishes the CPU. This state transition is: Block (process): Running → Blocked. Transition 2 occurs when the scheduler decides that the running process has run long enough and it is time to let another process have CPU time. This state transition is:

Time-Run-Out (process): Running → Ready. Transition 3 occurs when all other processes have had their share and it is time for the first process to run again. This state transition is: Dispatch (process): Ready → Running. Transition 4 occurs when the external event for which a process was waiting (such as the arrival of input) happens. This state transition is: Wakeup (process): Blocked → Ready. Transition 5 occurs when the process is created. This state transition is: Admitted (process): New → Ready. Transition 6 occurs when the process has finished execution. This state transition is: Exit (process): Running → Terminated. Swapping Many operating systems follow the process model shown above. However, in operating systems which do not employ virtual memory, the processor will be idle most of the time, considering the difference between the speed of I/O and the processor. There will be many processes waiting for I/O in memory, exhausting the memory. If there is no ready process to run, new processes cannot be created, as there is no memory available to accommodate a new process. Thus the processor has to wait till any of the waiting processes becomes ready after completion of an I/O operation. This problem can be solved by adding two more states to the above process model, using the swapping technique. Swapping involves moving part or all of a process from main memory to disk. When none of the processes in main memory is in the ready state, the operating system swaps one of the blocked processes out onto disk, into a suspend queue. This is a queue of existing processes that have been temporarily shifted out of main memory, or suspended. The operating system then either creates a new process or brings in a swapped process from the disk which has become ready.
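The six transitions of the five-state model can be summarized, purely as a hypothetical sketch and not as any real kernel's code, by a small C state machine; the enum names and event names are invented for illustration only.

/* five_state.c - toy state machine for the five-state process model */
#include <stdio.h>

enum proc_state { NEW, READY, RUNNING, BLOCKED, TERMINATED };
enum proc_event { ADMITTED, DISPATCH, TIME_RUN_OUT, BLOCK, WAKEUP, EXIT_EVT };

/* Return the next state for a (state, event) pair, mirroring the six
   transitions described above; unknown combinations leave the state unchanged. */
enum proc_state next_state(enum proc_state s, enum proc_event e)
{
    if (s == NEW     && e == ADMITTED)     return READY;
    if (s == READY   && e == DISPATCH)     return RUNNING;
    if (s == RUNNING && e == TIME_RUN_OUT) return READY;
    if (s == RUNNING && e == BLOCK)        return BLOCKED;
    if (s == BLOCKED && e == WAKEUP)       return READY;
    if (s == RUNNING && e == EXIT_EVT)     return TERMINATED;
    return s;
}

int main(void)
{
    enum proc_state s = NEW;
    s = next_state(s, ADMITTED);   /* New -> Ready */
    s = next_state(s, DISPATCH);   /* Ready -> Running */
    printf("state code: %d\n", s); /* prints 2 (RUNNING) */
    return 0;
}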

Seven State Process Model The following figure 3.2 shows the seven-state process model, which uses the above-described swapping technique.

Apart from the transitions we have seen in the five-state model, following are the new transitions which occur in the above seven-state model.

Blocked to Blocked / Suspend: If there are no ready processes in main memory, at least one blocked process is swapped out to make room for another process that is not blocked.
Blocked / Suspend to Blocked: If a process terminates, making space in main memory, and there is a high-priority process which is blocked but suspended and is expected to become unblocked very soon, that process is brought into main memory.
Blocked / Suspend to Ready / Suspend: A process is moved from Blocked / Suspend to Ready / Suspend if the event on which the process was waiting occurs, but there is still no space for it in main memory.
Ready / Suspend to Ready: If there are no ready processes in main memory, the operating system has to bring one into main memory to continue execution. Sometimes this transition takes place even when there are ready processes in main memory, if they have lower priority than one of the processes in the Ready / Suspend state; the higher-priority process is then brought into main memory.
Ready to Ready / Suspend: Normally the blocked processes are the ones suspended by the operating system, but sometimes, to free a large block of memory, a ready process may be suspended. In this case, normally the low-priority processes are suspended.

New to Ready / Suspend: When a new process is created, it should be added to the Ready state. But sometimes sufficient memory may not be available to allocate to the newly created process. In this case, the new process is shifted to Ready / Suspend.

Process Control In this section we will study the structure of a process, the process control block, modes of process execution, and process switching. Process Structure Having studied the process states, we will now see where a process resides and what its physical manifestation is. The location of the process depends on the memory management scheme being used. In the simplest case, a process is maintained in secondary memory, and to manage this process, at least a small part of it is maintained in main memory. To execute the process, the entire process or part of it is brought into main memory, and for that the operating system needs to know the location of the process.

Figure 3.3: Process Image (process identification, processor state information, process control information, user stack, private user address space (program, data), shared address space)

The obvious contents of a process are the user program to be executed and the user data associated with that program. Apart from these, there are two other major parts of a process: the system stack, which is used to store parameters and calling addresses for procedure and system calls, and the process control block, which is a collection of process attributes needed by the operating system to control the process. The collection of user program, data, system stack, and process control block is called the process image, as shown in figure 3.3 above. Process Control Block

A process control block, as shown in figure 3.4 below, contains various attributes required by the operating system to control a process, such as the process state, program counter, CPU state, CPU scheduling information, memory management information, I/O state information, etc. These attributes can be grouped into three general categories as follows: process identification, processor state information, and process control information.

The first category stores information related to process identification, such as the identifier of the current process, the identifier of the process which created this process (to maintain the parent-child process relationship), and the user identifier, i.e. the identifier of the user on whose behalf this process is being run. The processor state information consists of the contents of the processor registers, such as the user-visible registers, the control and status registers (which include the program counter and program status word), and stack pointers. The third category, process control information, is mainly required for the control of a process. This information includes: scheduling and state information, data structuring, inter-process communication, process privileges, memory management, and resource ownership and utilization.

Figure 3.4: Process Control Block (typical fields: pointer, process state, process number, program counter, registers, memory limits, list of open files, ...)
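A simplified, hypothetical C rendering of such a process control block is sketched below; the field names and sizes are purely illustrative, and real systems (for example Linux's task_struct) differ considerably.

/* pcb_sketch.c - a toy process control block grouped into the three categories */
#include <stdio.h>
#include <stdint.h>

#define MAX_OPEN_FILES 16

enum proc_state { NEW, READY, RUNNING, BLOCKED, TERMINATED };

struct pcb {
    /* Process identification */
    int pid;                           /* identifier of this process       */
    int ppid;                          /* identifier of the parent process */
    int uid;                           /* user on whose behalf it runs     */

    /* Processor state information */
    uint32_t program_counter;          /* saved PC                         */
    uint32_t registers[16];            /* saved general-purpose registers  */
    uint32_t psw;                      /* program status word              */
    uint32_t stack_pointer;

    /* Process control information */
    enum proc_state state;             /* scheduling / state information   */
    int priority;
    uint32_t mem_base, mem_limit;      /* memory-management data           */
    int open_files[MAX_OPEN_FILES];    /* resource ownership               */
    struct pcb *next;                  /* linkage into a ready/blocked queue */
};

int main(void)
{
    struct pcb p = { .pid = 42, .ppid = 1, .uid = 1000,
                     .state = READY, .priority = 5 };
    printf("pid=%d state=%d\n", p.pid, p.state);
    return 0;
}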

Modes of Execution

In order to ensure the correct execution of each process, an operating system must protect each process's private information (executable code, data, and stack) from uncontrolled interference by other processes. This is accomplished by suitably restricting the memory address space available to a process for reading/writing, so that the OS can regain CPU control through hardware-generated exceptions whenever a process violates those restrictions. Also, the OS code needs to execute in a privileged condition with respect to normal code: to manage processes, it needs to be able to execute operations which are forbidden to normal processes. Thus most processors support at least two modes of execution. Certain instructions can only be executed in the more privileged mode. These include reading or altering a control register such as the program status word, primitive I/O instructions, and memory management instructions. The less privileged mode is referred to as user mode, as typically user programs are executed in this mode, and the more privileged mode, in which important operating system functions are executed, is called kernel mode, system mode or control mode. The current mode information is stored in the PSW, i.e. whether the processor is running in user mode or kernel mode. The mode change is normally done by executing a change-mode instruction, typically after a user process invokes a system call or whenever an interrupt occurs, as these are operating system functions and need to be executed in privileged mode. After the completion of the system call or interrupt routine, the mode is changed back to user mode to continue the user process's execution. Context Switching To give each process on a multiprogrammed machine a fair share of the CPU, a hardware clock generates interrupts periodically. This allows the operating system to schedule all processes in main memory (using a scheduling algorithm) to run on the CPU at regular intervals. Each time a clock interrupt occurs, the interrupt handler checks how much time the current running process has used. If it has used up its entire time slice, then the CPU scheduling algorithm (in the kernel) picks a different process to run. Each switch of the CPU from one process to another is called a context switch. A context is the contents of a CPU's registers and program counter at any point in time. Context switching can be described as the kernel (i.e., the core of the operating system) performing the following activities with regard to processes on the CPU: (1) suspending the progression of one process and storing the CPU's state (i.e., the context) for that process somewhere in memory, (2) retrieving the context of the next process from memory and restoring it in the CPU's registers and (3) returning to the location indicated by the program counter (i.e., returning to the line of code at which the process was interrupted) in order to resume the process. Figure 3.5 below depicts the process of a context switch from process P0 to process P1.

Figure 3.5: Process switching
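The three steps of a context switch can be mimicked in user space, purely as a toy illustration; a real switch saves and restores hardware registers in kernel mode, which plain C cannot show, so the sketch below only mirrors the bookkeeping, and the structures and values are invented.

/* switch_toy.c - user-space illustration of the three context-switch steps */
#include <stdio.h>
#include <string.h>

struct context { unsigned long pc, sp, regs[8]; };
struct toy_pcb { int pid; struct context ctx; };

static struct context cpu;                 /* stands in for the real CPU registers */

void context_switch(struct toy_pcb *from, struct toy_pcb *to)
{
    memcpy(&from->ctx, &cpu, sizeof cpu);  /* (1) save the state of the old process */
    memcpy(&cpu, &to->ctx, sizeof cpu);    /* (2) restore the state of the new one  */
    /* (3) a real kernel would now resume execution at cpu.pc */
}

int main(void)
{
    struct toy_pcb p0 = { 0, { 0x1000, 0x8000, {0} } };
    struct toy_pcb p1 = { 1, { 0x2000, 0x9000, {0} } };
    cpu = p0.ctx;                          /* P0 is running */
    context_switch(&p0, &p1);              /* switch P0 -> P1 */
    printf("CPU now at pc=0x%lx (P%d)\n", cpu.pc, p1.pid);
    return 0;
}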

Self Assessment Questions: 1. Discuss process states with the five-state process model. 2. Explain the seven-state process model. 3. What is process control? Discuss the process control block. 4. Write a note on context switching.

A context switch is sometimes described as the kernel suspending execution of one process on the CPU and resuming execution of some other process that had previously been suspended. A context switch occurs due to an interrupt, a trap (an error due to the current instruction) or a system call, as described below:

Clock interrupt: when a process has used up the time quantum allocated to it, the process must be switched from the running state to the ready state, and another process must be dispatched for execution. I/O interrupt: whenever any I/O-related event occurs, the OS is interrupted; the OS has to determine the reason for the interrupt and take the necessary action for that event. Thus the current process is switched to the ready state and the interrupt routine is loaded to handle the interrupt event (e.g. after an I/O interrupt the OS moves all the processes which were blocked on the event from the blocked state to the ready state, and from blocked/suspend to ready/suspend). After completion of the interrupt-related actions, one might expect the process which was switched out to be brought back for execution, but that does not happen. At this point the

scheduler decides afresh which of all the ready processes is to be scheduled for execution. This is important, as it allows any high-priority process added to the ready queue during the interrupt-handling period to be scheduled. Memory fault: when the virtual memory technique is used for memory management, it often happens that a process refers to a memory address which is not present in main memory and needs to be brought in. As the memory block transfer takes time, another process should be given a chance to execute and the current process should be blocked. Thus the OS blocks the current process, issues an I/O request to get the memory block into memory, switches the current process to the blocked state, and loads another process for execution. Trap: if the instruction being executed causes an error or exception, then depending on the severity of the error/exception and the design of the operating system, the OS may either move the process to the exit state, or continue executing the current process after a possible recovery.

System call: often a process has to invoke a system call for a privileged job; for this the current process is blocked and the respective operating system's system call code is executed. Thus the context of the current process is switched to the system call code. Example: UNIX Process Let us see the example of UNIX System V, which makes use of a simple but powerful process facility that is highly visible to the user. The following figure shows the model followed by UNIX, in which most of the operating system executes within the environment of a user process. Thus, two modes, user and kernel, are required. UNIX uses two categories of processes: system processes and user processes. System processes run in kernel mode and execute operating system code to perform administrative and housekeeping functions, such as allocation of memory and process swapping. User processes operate in user mode to execute user programs and utilities, and in kernel mode to execute instructions belonging to the kernel. A user process enters kernel mode by issuing a system call, when an exception (fault) is generated or when an interrupt occurs.

A total of nine process states are recognized by the UNIX operating system, as explained below:

User Running: Executing in user mode.
Kernel Running: Executing in kernel mode.
Ready to Run, in Memory: Ready to run as soon as the kernel schedules it.
Asleep in Memory: Unable to execute until an event occurs; the process is in main memory (a blocked state).
Ready to Run, Swapped: The process is ready to run, but the swapper must swap it into main memory before the kernel can schedule it to execute.
Sleeping, Swapped: The process is awaiting an event and has been swapped to secondary storage (a blocked state).
Preempted: The process is returning from kernel mode to user mode, but the kernel preempts it and does a process switch to schedule another process.
Created: The process is newly created and not yet ready to run.

Zombie: Process no longer exists, but it leaves a record for its parent process to collect.

UNIX employs two Running states to indicate whether the process is executing in user mode or kernel mode. A distinction is made between the two states (Ready to Run, in Memory) and (Preempted). These are essentially the same state, as indicated by the dotted line joining them. The distinction is made to emphasize the way in which the Preempted state is entered. When a process is running in kernel mode (as a result of a supervisor call, clock interrupt, or I/O interrupt), there will come a time when the kernel has completed its work and is ready to return control to the user program. At this point, the kernel may decide to preempt the current process in favor of one that is ready and of higher priority. In that case, the current process moves to the Preempted state. However, for purposes of dispatching, those processes in the Preempted state and those in the Ready to Run, in Memory state form one queue. Preemption can only occur when a process is about to move from kernel mode to user mode. While a process is running in kernel mode, it may not be preempted. This makes UNIX unsuitable for real-time processing. Two processes are unique in UNIX. Process 0 is a special process that is created when the system boots; in effect, it is predefined as a data structure loaded at boot time. It is the swapper process. In addition, process 0 spawns process 1, referred to as the init process; all other processes in the system have process 1 as an ancestor. When a new interactive user logs onto the system, it is process 1 that creates a user process for that user. Subsequently, the user process can create child processes in a branching tree, so that any particular application can consist of a number of related processes. Threads A thread is a single sequence stream within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Within a process, threads allow multiple streams of execution. In many respects, threads are a popular way to improve applications through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel. Like a traditional process (i.e. a process with one thread), a thread can be in any of several states (Running, Blocked, Ready or Terminated). Each thread has its own stack, since a thread will generally call different procedures and thus have a different execution history; this is why a thread needs its own stack. In an operating system that has a thread facility, the basic unit of CPU utilization is a thread. A thread consists of a program counter (PC), a register set, and a stack space. Threads are not independent of one another like processes; as a result, threads share with the other threads of their process (also known as the task) its code section, data section and OS resources, such as open files and signals. Processes Vs Threads

As we mentioned earlier, in many respects threads operate in the same way as processes. Some of the similarities and differences are: Similarities

Like processes, threads share the CPU, and only one thread is running at a time. Like processes, threads within a process execute sequentially. Like processes, a thread can create children. And like processes, if one thread is blocked, another thread can run.

Differences

Unlike processes, threads are not independent of one another. Unlike processes, all threads can access every address in the task. Unlike processes, threads are designed to assist one another. (Processes might or might not assist one another, because processes may originate from different users.)

Why Threads? Following are some reasons why we use threads in designing operating systems. 1. A process with multiple threads makes a great server, for example a printer server. 2. Because threads can share common data, they do not need to use inter-process communication. 3. Because of their very nature, threads can take advantage of multiprocessors. Threads are cheap in the sense that: 1. They only need a stack and storage for registers; therefore, threads are cheap to create. 2. Threads use very few resources of the operating system in which they are working. That is, threads do not need a new address space, global data, program code or operating system resources. 3. Context switching is fast when working with threads, because we only have to save and/or restore the PC, SP and registers. Advantages of Threads over Multiple Processes

Context Switching: Threads are very inexpensive to create and destroy, and they are inexpensive to represent. For example, they require space to store the PC, the SP, and the general-purpose registers, but they do not require space for memory information, information about open files or I/O devices in use, etc. With so little context, it is much faster to switch between threads. In other words, a context switch is relatively cheaper when using threads.

Sharing: Threads allow the sharing of many resources that cannot be shared between processes, for example, the code section, data section, and operating system resources like open files, etc.

A proxy server satisfying the requests of a number of computers on a LAN would benefit from being a multi-threaded process. In general, any program that has to do more than one task at a time could benefit from multithreading. For example, a program that reads input, processes it, and writes output could have three threads, one for each task (a small sketch using POSIX threads is given after the disadvantages below). Disadvantages of Threads over Multiple Processes

Blocking: The major disadvantage is that if the kernel is single-threaded, a system call by one thread will block the whole process, and the CPU may be idle during the blocking period. Security: Since there is extensive sharing among threads, there is a potential security problem. It is quite possible that one thread overwrites the stack of another thread (or damages shared data), although it is very unlikely, since threads are meant to cooperate on a single task.

Any sequential process that cannot be divided into parallel tasks will not benefit from threads, as each task would block until the previous one completes. For example, a program that displays the time of day would not benefit from multiple threads.
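As a small sketch of the multithreaded style discussed above, the following POSIX threads program creates three threads in one process that share a global counter; the mutex guards the shared data, and the work done by each thread is illustrative only (compile with -pthread).

/* threads_demo.c - three threads in one process sharing a counter */
#include <stdio.h>
#include <pthread.h>

static int shared_counter = 0;                 /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int id = *(int *)arg;
    pthread_mutex_lock(&lock);                 /* threads share data, so guard it */
    shared_counter++;
    printf("thread %d: counter = %d\n", id, shared_counter);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[3];
    int ids[3] = { 0, 1, 2 };
    for (int i = 0; i < 3; i++)
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);            /* wait for all threads to finish */
    printf("final counter: %d\n", shared_counter);
    return 0;
}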

Self Assessment Questions


1. Define Thread. 2. Discuss Processes vs. Threads. 3. State the advantages and disadvantages of Threads over multiple processes. Types of Threads There are two types of threads: user-level threads (ULT) and kernel-level threads (KLT). User Level Threads User-level threads are implemented in user-level libraries, rather than via system calls, so thread switching does not need to call the operating system or cause an interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and manages them as if they were single-threaded processes, as shown in figure 3.7 below.

Figure 3.7: User Level Thread

Advantages: The most obvious advantage of this technique is that a user-level threads package can be implemented on an Operating System that does not support threads. Some other advantages are

User-level threads do not require modification to the operating system. Simple Representation: Each thread is represented simply by a PC, registers, a stack and a small control block, all stored in the user process's address space. Simple Management: Creating a thread, switching between threads and synchronization between threads can all be done without intervention of the kernel. Fast and Efficient: Thread switching is not much more expensive than a procedure call.

Disadvantages:

There is a lack of coordination between threads and the operating system kernel. Therefore, the process as a whole gets one time slice, irrespective of whether the process has one thread or 1000 threads within it. It is up to each thread to relinquish control to the other threads. User-level threads require non-blocking system calls, i.e., a multithreaded kernel. Otherwise, the entire process will be blocked in the kernel, even if there are runnable threads left in the process. For example, if one thread causes a page fault, the whole process blocks.

Kernel Level Threads: As shown in figure 3.8 below, in this method the kernel knows about and manages the threads. No runtime system is needed in this case. Instead of a thread table in each process, the kernel has a thread table that keeps track of all threads in the system. In addition, the kernel also maintains the traditional process table to keep track of processes. The operating system kernel provides system calls to create and manage threads.

Figure 3.8: Kernel Level Thread

Advantages:

Because the kernel has full knowledge of all threads, the scheduler may decide to give more time to a process having a large number of threads than to a process having a small number of threads. Kernel-level threads are especially good for applications that frequently block.

Disadvantages:

Kernel-level threads are slower and less efficient. For instance, thread operations can be hundreds of times slower than those of user-level threads, since the kernel must manage and schedule threads as well as processes. It requires a full thread control block (TCB) for each thread to maintain information about threads. As a result there is significant overhead and increased kernel complexity.

Thread States Like processes, threads also go through similar states, as depicted in the figure below. The figure only shows the three main states, i.e. ready, running and blocked. Apart from these states there are new and terminated states, very similar to the process states.

Figure 3.9: Thread States

The only difference between thread states and process states is that, depending on the implementation, in a running process there may be many threads, but only one will be in the running state and the others will be in blocked or ready states. Thus a process may be running while there is a blocked thread inside the process. Also, with user-level threads, a process may be blocked due to an I/O request by a thread, or a process may be switched to the ready state after executing for some time, but the thread which was in the running state at the time of the switch or I/O request will remain marked as running. Thus the process is not in the running state, but a thread within the process is. Self Assessment Questions 1. Write the advantages and disadvantages of user-level threads. 2. Write a note on kernel-level threads.

Summary A process can be simply defined as a program in execution. A process, along with the program code, comprises the program counter value, processor register contents, values of variables, the stack and program data. A process is created and terminated, and it follows some or all of the states of process transition, such as New, Ready, Running, Waiting, and Exit. A thread is a single sequence stream within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. There are two types of threads: user-level threads (ULT) and kernel-level threads (KLT). User-level threads are mostly used on systems where the operating system does not support threads, but they can also be combined with kernel-level threads. Threads also have properties similar to processes, e.g. execution states, context switch etc.

Terminal Questions
1. Define process. Explain the major components of a process.
2. What are the events for process creation?
3. Explain the reasons for termination of a process.
4. Explain the process state transition with a diagram.
5. Explain the event for the transition of a process 1. from New to Ready 2. from Ready to Running 3. from Running to Blocked.
6. What are threads?
7. State the advantages and disadvantages of threads over processes.
8. What are the different types of threads? Explain.

Unit 4: Memory Management: This unit covers the memory hierarchy, paging and segmentation and the related paging policies. It discusses cache memory and its performance, fetch and write mechanisms, and replacement policy, and also covers associative memory.

Introduction The part of the operating system which handles the responsibility of managing primary memory is called the memory manager. Since every process must have some amount of primary memory in order to execute, the performance of the memory manager is crucial to the performance of the entire system. Virtual memory refers to the technology in which some space on hard disk is used as an extension of main memory, so that a user program need not worry if its size exceeds the size of the main memory. For paging memory management, each process is associated with a page table. Each entry in the table contains the frame number of the corresponding page in the virtual address space of the process. This same page table is also the central data structure for the virtual memory mechanism based on paging, although more facilities are needed; the unit covers the control bits, multi-level page tables, etc. Segmentation is another popular method for both memory management and virtual memory.

Basic Cache Structure : The idea of cache memories is similar to virtual memory in that some active portion of a low-speed memory is stored in duplicate in a higher-speed cache memory. When a memory request is generated, the request is first presented to the cache memory, and if the cache cannot respond, the request is then presented to main memory. Content-Addressable Memory (CAM) is a special type of computer memory used in certain very high speed searching applications. It is also known as associative memory, associative storage, or associative array, although the last term is more often used for a programming data structure.

Objectives: At the end of this unit, you will be able to understand:
The memory hierarchy and memory allocation strategies
Virtual memory and its mechanism
Paging and Segmentation
Replacement policy and replacement algorithms, etc.

Memory Hierarchy In addition to the responsibility of managing processes, the operating system must efficiently manage the primary memory of the computer. The part of the operating system which handles this responsibility is called the memory manager. Since every process must have some amount of primary memory in order to execute, the performance of the memory manager is crucial to the performance of the entire system. Nutt explains: The memory manager is responsible for allocating primary memory to processes and for assisting the programmer in loading and storing the contents of the primary memory. Managing the sharing of primary memory and minimizing memory access time are the basic goals of the memory manager.

The real challenge of efficiently managing memory is seen in the case of a system which has multiple processes running at the same time. Since primary memory can be space-multiplexed, the memory manager can allocate a portion of primary memory to each process for its own use. However, the memory manager must keep track of which processes are running in which memory locations, and it must also determine how to allocate and de-allocate available memory when new processes are created and when old processes complete execution. While various strategies are used to allocate space to processes competing for memory, three of the most popular are Best fit, Worst fit, and First fit. Each of these strategies is described below:

Best fit: The allocator places a process in the smallest block of unallocated memory in which it will fit. For example, suppose a process requests 12KB of memory and the memory manager currently has a list of unallocated blocks of 6KB, 14KB, 19KB, 11KB, and 13KB. The best-fit strategy will allocate 12KB of the 13KB block to the process.
Worst fit: The memory manager places a process in the largest block of unallocated memory available. The idea is that this placement will create the largest hole after the allocation, thus increasing the possibility that, compared to best fit, another process can use the remaining space. Using the same example as above, worst fit will allocate 12KB of the 19KB block to the process, leaving a 7KB block for future use.
First fit: There may be many holes in the memory, so the operating system, to reduce the amount of time it spends analyzing the available spaces, begins at the start of primary memory and allocates memory from the first hole it encounters that is large enough to satisfy the request. Using the same example as above, first fit will allocate 12KB of the 14KB block to the process.
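The first-fit scan can be sketched in a few lines of C over the same list of free blocks used in the example above; the fixed array of block sizes and the bare linear search are simplifications for illustration, not a real allocator.

/* first_fit.c - toy first-fit placement over a list of free blocks (sizes in KB) */
#include <stdio.h>

#define NBLOCKS 5

int first_fit(const int free_kb[], int n, int request_kb)
{
    for (int i = 0; i < n; i++)            /* scan from the start of memory   */
        if (free_kb[i] >= request_kb)      /* first hole large enough wins    */
            return i;
    return -1;                             /* no hole can satisfy the request */
}

int main(void)
{
    int free_kb[NBLOCKS] = { 6, 14, 19, 11, 13 };
    int i = first_fit(free_kb, NBLOCKS, 12);
    if (i >= 0)
        printf("first fit: allocate from the %dKB block (index %d)\n",
               free_kb[i], i);             /* picks the 14KB block */
    return 0;
}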

Notice in the diagram above that the Best fit and First fit strategies both leave a tiny segment of memory unallocated just beyond the new process. Since the amount of memory is small, it is not likely that any new processes can be loaded here. This condition of splitting primary memory into segments as the memory is allocated and deallocated is known as fragmentation. The Worst fit strategy attempts to reduce the problem of fragmentation by allocating the largest fragments to new processes. Thus, a larger amount of space will be left as seen in the diagram above.

Another way in which the memory manager enhances the ability of the operating system to support multiple processes running simultaneously is by the use of virtual memory. According to Nutt, virtual memory strategies allow a process to use the CPU when only part of its address space is loaded in primary memory. In this approach, each process's address space is partitioned into parts that can be loaded into primary memory when they are needed and written back to secondary memory otherwise. Another consequence of this approach is that the system can run programs which are actually larger than the primary memory of the system, hence the idea of virtual memory. Brookshear explains how this is accomplished: Suppose, for example, that a main memory of 64 megabytes is required but only 32 megabytes is actually available. To create the illusion of the larger memory space, the memory manager would divide the required space into units called pages and store the contents of these pages in mass storage. A typical page size is no more than four kilobytes. As different pages are actually required in main memory, the memory manager would exchange them for pages that are no longer required, and thus the other software units could execute as though there were actually 64 megabytes of main memory in the machine. In order for this system to work, the memory manager must keep track of all the pages that are currently loaded into primary memory. This information is stored in a page table maintained by the memory manager. A page fault occurs whenever a process requests a page that is not currently loaded into primary memory. To handle page faults, the memory manager takes the following steps:
1. The memory manager locates the missing page in secondary memory.
2. The page is loaded into primary memory, usually causing another page to be unloaded.
3. The page table in the memory manager is adjusted to reflect the new state of the memory.
4. The processor re-executes the instruction which caused the page fault.
1. Virtual Memory An Introduction In an operating system, it is possible that a program is too large to be loaded into main memory. In theory, a 32-bit program may have a linear address space of up to 4 gigabytes, which is larger than the main memory of most computers. Thus we need some mechanism that allows the execution of a process that is not completely in main memory. Overlaying is one choice. With it, the programmers have to deal with swapping in and out themselves, to make sure that at any moment the instruction to be executed next is physically in main memory. Obviously this places a heavy burden on the programmers. In this unit, we introduce another solution called virtual memory, which has been adopted by almost all modern operating systems. Virtual memory refers to the technology in which some space on hard disk is used as an extension of main memory, so that a user program need not worry if its size exceeds the

size of the main memory. If that does happen, at any time only a part of the program will reside in main memory, and the other parts will remain on hard disk and may be brought into memory later if needed. This mechanism is similar to the two-level memory hierarchy we discussed before, consisting of cache and main memory, because the principle of locality is also the basis here. With virtual memory, if a piece of a process that is needed is not in an already full main memory, then another piece will be swapped out and the former brought in. If, unfortunately, the swapped-out piece is needed immediately, it will have to be loaded back into main memory right away. As we know, access to the hard disk is time-consuming compared to access to main memory, so references to the virtual memory space on hard disk could deteriorate system performance significantly. Fortunately, the principle of locality holds: the instruction and data references during a short period tend to be confined to one piece of the process, so accesses to the hard disk will not be requested and performed frequently. Thus the same principle, on the one hand, enables the caching mechanism to increase system performance, and on the other hand avoids the deterioration of performance with virtual memory. With virtual memory, there must be some facility to separate a process into several pieces so that they may reside separately either on hard disk or in main memory. Paging and/or segmentation are two methods that are usually used to achieve this goal. Paging For paging memory management, each process is associated with a page table. Each entry in the table contains the frame number of the corresponding page in the virtual address space of the process. This same page table is also the central data structure for the virtual memory mechanism based on paging, although more facilities are needed. Control bits Since only some pages of a process may be in main memory, a bit in the page table entry, P in Figure 1(a), is used to indicate whether the corresponding page is present in main memory or not. Another control bit needed in the page table entry is a modified bit, M, indicating whether the contents of the corresponding page have been altered or not since the page was last loaded into main memory. We often say swapping in and swapping out, suggesting that a process is typically separated into two parts, one residing in main memory and the other in secondary memory, and some pages may be removed from one part and join the other; together they make up the whole process image. Actually, the secondary memory contains the whole image of the process, and part of it may have been loaded into main memory. When swapping out is to be performed, typically the page to be swapped out may simply be overwritten by the new page, since the corresponding page is already in secondary memory. However, sometimes the contents of a page may have been altered at run time, say a page containing data. In this case, the alteration should be reflected in secondary memory. So when the M bit is 1, the page to be swapped out should be written out. Other bits may also be used for sharing or protection.
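A hypothetical C sketch of a page table entry with the P and M control bits is given below; the 20-bit frame field and the helper functions load_page_from_disk() and write_page_to_disk() are invented for illustration, and the access routine only mimics when each bit matters.

/* pte_bits.c - toy page table entry with present (P) and modified (M) bits */
#include <stdio.h>

struct pte {
    unsigned int frame    : 20;  /* frame number, valid only if present = 1  */
    unsigned int present  : 1;   /* P: 1 if the page is in main memory       */
    unsigned int modified : 1;   /* M: 1 if altered since it was last loaded */
    unsigned int other    : 10;  /* protection / sharing bits, etc.          */
};

static void load_page_from_disk(struct pte *e)      { e->frame = 7; e->present = 1; }
static void write_page_to_disk(const struct pte *e) { (void)e; /* write back */ }

/* Access one page; is_write models a store instruction touching the page. */
void access(struct pte *e, int is_write)
{
    if (!e->present) {               /* page fault: the page must be brought in */
        printf("page fault\n");
        load_page_from_disk(e);
    }
    if (is_write)
        e->modified = 1;             /* the copy on disk is now out of date */
}

/* Evict the page: only a modified page needs to be written back to disk. */
void evict(struct pte *e)
{
    if (e->modified)
        write_page_to_disk(e);
    e->present = 0;
    e->modified = 0;
}

int main(void)
{
    struct pte e = { 0 };
    access(&e, 0);                   /* read: faults, then P = 1, M = 0 */
    access(&e, 1);                   /* write: sets M = 1               */
    printf("P=%u M=%u frame=%u\n", e.present, e.modified, e.frame);
    evict(&e);                       /* M was set, so the page is written back */
    return 0;
}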

Multi-level page table

Typically there is one page table for each process, which is completely loaded into main memory during the execution of the process. However, some processes may be so large that even the page table cannot be held fully in main memory. For example, in the 32-bit x86 architecture each process may have up to 2^32 = 4 Gbytes of virtual memory. With pages of 2^9 = 512 bytes, as many as 2^23 pages are needed, and hence a page table of 2^23 entries. If each entry requires 4 bytes, that is 2^25 bytes = 32 Mbytes. Thus some mechanism is needed to allow only part of a page table to be loaded in main memory. Naturally we use paging for this: page tables are subject to paging just as other pages are, which is called multi-level paging. Figure 2 shows an example of a two-level scheme with a 32-bit address. If we assume 4-Kbyte pages, then the 4-Gbyte virtual address space is composed of 2^20 pages. If each page table entry requires 4 bytes, then a user page table of 2^20 entries requires 4 Mbytes. This huge page table itself occupies 2^10 pages. For paging it, a root page table of 2^10 entries is needed, requiring 4 Kbytes.
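Under the assumptions above (4-Kbyte pages and 4-byte entries), a 32-bit virtual address splits into a 10-bit root index, a 10-bit user-page-table index and a 12-bit offset. The C sketch below shows that decomposition; the names and the bit boundaries are illustrative only and follow the example sizes, not a specific hardware definition.

#include <stdint.h>

#define OFFSET_BITS 12u            /* 4-Kbyte pages                     */
#define INDEX_BITS  10u            /* 1024 entries per page-table page  */

/* Decompose a 32-bit virtual address for a two-level page table. */
static inline void split_vaddr(uint32_t vaddr,
                               uint32_t *root_index,
                               uint32_t *user_index,
                               uint32_t *offset)
{
    *offset     = vaddr & ((1u << OFFSET_BITS) - 1u);                 /* bits 0-11  */
    *user_index = (vaddr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u); /* bits 12-21 */
    *root_index = vaddr >> (OFFSET_BITS + INDEX_BITS);                /* bits 22-31 */
}

With these indices, the root page table is consulted first; if the addressed page of the user page table is not present, a page fault is raised exactly as for an ordinary page.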

Fig. 1: Typical memory management formats

With this two-level paging scheme, the root page table always remains in main memory. The first 10 bits of a virtual address are used to index into the root page table to find an entry for a page of the user page table. If that page is not in main memory, a page fault occurs and the operating system is asked to load that page. If it is in main memory, then the next 10 bits of the virtual address index into the user page table to find the entry for the page that is referenced by the virtual address. This whole process is illustrated in Figure 3.

Fig. 2: A two-level hierarchical page table

Fig. 3: Address translation in a two-level paging system

Translation lookaside buffer


As we discussed before, a translation lookaside buffer (TLB) may be used to speed up paging and avoid frequent accesses to main memory, as shown in Figure 4. With a multi-level paging scheme, the benefit of the TLB is even more significant.

Fig. 4: Use of a translation lookaside buffer

It should be noted that the TLB is a cache for the page table, while the regular cache mentioned before is for main memory, and these facilities must work together when both are present in a system. As Figure 5 illustrates, for a virtual address consisting of a page number and an offset, the memory system consults the TLB first to see if the matching page table entry is present. If it is, the real address is generated by combining the frame number with the offset. If not, the entry is fetched from the page table. Once the real address is generated, the cache is consulted to see if the block containing that word is present. If so, the word is returned to the CPU; if not, it is retrieved from main memory.

Self Assessment Questions

1. Discuss the page table with a suitable example.
2. Explain the significance of the control bits in the paging mechanism.
3. What strategy would you follow in paging if a process demands such a large memory space that its page table cannot be held in memory?

Cleaning policy

A cleaning policy is the opposite of a fetch policy. It deals with when a modified page should be written out to secondary memory. There are two common choices:

Demand cleaning: a page is written out only when it has been selected for replacement.

Pre-cleaning: modified pages are written to secondary memory before their page frames are needed, so that pages can be written out in batches.

Pre-cleaning has an advantage over demand cleaning, but it should not be performed too frequently, because some pages are modified so often that repeatedly writing them out turns out to be unnecessary.

Frame locking

One point worth mentioning is that some of the frames in main memory may not be replaced; they are said to be locked. For example, the frames occupied by the kernel of the operating system, those used for I/O buffers and other time-critical areas should always be available in main memory for the operating system to operate properly. This requirement can be satisfied by adding an additional bit to the page table.

Load control

Another related question is how many processes may be started and allowed to reside in main memory simultaneously, which is called load control. Load control is critical in memory management: if too few processes are in main memory at any one time, it is very likely that all of them will be blocked at once, and thus much time will be spent in swapping. On the other hand, if too many processes exist, each individual process is allocated only a small number of frames, and frequent page faulting will occur. Figure 10 shows that, if all other aspects are fixed, there is a specific degree of multiprogramming that achieves the highest utilization.

Fig. 10: Multiprogramming effects

Cache Memory

Basic Cache Structure


Processors are generally able to perform operations on operands faster than the access time of large capacity main memory. Though semiconductor memory which can operate at speeds comparable with the operation of the processor exists, it is not economical to provide all the main memory with very high speed semiconductor memory. The problem can be alleviated by introducing a small block of high speed memory called a cache between the main memory and the processor.

The idea of cache memory is similar to virtual memory in that some active portion of a low-speed memory is stored in duplicate in a higher-speed cache memory. When a memory request is generated, the request is first presented to the cache memory, and if the cache cannot respond, the request is then presented to main memory. The difference between cache and virtual memory is a matter of implementation; the two notions are conceptually the same because they both rely on the correlation properties observed in sequences of address references. Cache implementations are, however, totally different from virtual memory implementations because of the speed requirements of the cache. We define a cache miss to be a reference to an item that is not resident in the cache but is resident in main memory. The corresponding concept for virtual memory is the page fault, which is defined to be a reference to a page that is not resident in main memory. For cache misses, the fast memory is the cache and the slow memory is main memory. For page faults, the fast memory is main memory and the slow memory is auxiliary memory.

Fig. 11: A cache-memory reference. The tag 0117X matches address 01173, so the cache returns the item in the position X=3 of the matched block

The address of a cell in memory is presented to the cache. The cache searches its directory of address tags, shown in the figure, to see if the item is in the cache. If the item is not in the cache, a miss occurs. For READ operations that cause a cache miss, the item is retrieved from main memory and copied into the cache. During the short period available before the main-memory operation is complete, some other item in the cache is removed from the cache to make room for the new item. The cache-replacement decision is critical; a good replacement algorithm can yield somewhat higher performance than a bad replacement algorithm. The effective cycle time of a cache memory (teff) is the average of the cache-memory cycle time (tcache) and the main-memory cycle time (tmain), where the probabilities in the averaging process are the probabilities of hits and misses. If we consider only READ operations, then a formula for the average cycle time is:

teff = tcache + (1 - h) tmain

where h is the probability of a cache hit (sometimes called the hit rate); the quantity (1 - h), which is the probability of a miss, is known as the miss rate. In Fig. 11 we show an item in the cache surrounded by nearby items, all of which are moved into and out of the cache together. We call such a group of data a block of the cache.

Cache Memory Organizations

Fig. 12: The logical organization of a four-way set-associate cache

Fig. 12 shows a conceptual implementation of a cache memory. This system is called set associative because the cache is partitioned into distinct sets of blocks, and each set contains a small fixed number of blocks. The sets are represented by the rows in the figure. In this case, the cache has N sets, and each set contains four blocks. When an access occurs to this cache, the cache controller does not search the entire cache looking for a match. Instead, the controller maps the address to a particular set of the cache and searches only that set for a match. If the block is in the cache, it is guaranteed to be in the set that is searched. Hence, if the block is not in that set, the block is not present in the cache, and the cache controller searches no further. Because the search is conducted over four blocks, the cache is said to be four-way set associative or, equivalently, to have an associativity of four. Fig. 12 is only one example; there are various ways that a cache can be arranged internally to store the cached data. In all cases, the processor references the cache with the main memory address of the data it wants. Hence each cache organization must use this address to find the data in the cache if it is stored there, or to indicate to the processor that a miss has occurred. The problem of mapping the information held in the main memory into the cache must be totally implemented in hardware to achieve improvements in the system operation. Various strategies are possible.

Fully associative mapping

Perhaps the most obvious way of relating cached data to the main memory address is to store both the memory address and the data together in the cache. This is the fully associative mapping approach. A fully associative cache requires the cache to be composed of associative memory holding both the memory address and the data for each cached line. The incoming memory address is simultaneously compared with all stored addresses using the internal logic of the associative memory, as shown in Fig. 13. If a match is found, the corresponding data is read out. Single words from anywhere within the main memory can be held in the cache, provided the associative part of the cache is capable of holding a full address.

Fig. 13: Cache with fully associative mapping

In all organizations, the data can be more than one word, i.e., a block of consecutive locations, to take advantage of spatial locality. In Fig. 14 a line constitutes four words, each word being 4 bytes. The least significant part of the address selects the particular byte, the next part selects the word, and the remaining bits form the address compared to the address in the cache. The whole line can be transferred to and from the cache in one transaction if there are sufficient data paths between the main memory and the cache. With only one data word path, the words of the line have to be transferred in separate transactions.

Fig. 14: Fully associative mapped cache with multi-word lines

The fully associative mapped cache gives the greatest flexibility in holding combinations of blocks in the cache and the minimum conflict for a given size of cache, but it is also the most expensive, due to the cost of the associative memory. It requires a replacement algorithm to select a block to remove upon a miss, and the algorithm must be implemented in hardware to maintain a high speed of operation. The fully associative cache can only be formed economically with a moderate capacity. Microprocessors with small internal caches often employ the fully associative mechanism.

Direct mapping

The fully associative cache is expensive to implement because it requires a comparator with each cache location, effectively a special type of memory. In direct mapping, the cache consists of normal high speed random access memory, and each location in the cache holds the data at an address in the cache given by the lower significant bits of the main memory address. This enables the block to be selected directly from the lower significant bits of the memory address. The remaining higher significant bits of the address are stored in the cache with the data to complete the identification of the cached data. Consider the example shown in Fig. 15. The address from the processor is divided into two fields, a tag and an index. The tag consists of the higher significant bits of the address, which are stored with the data. The index consists of the lower significant bits of the address, which are used to address the cache.

Figure 15: Direct Mapping

When the memory is referenced, the index is first used to access a word in the cache. Then the tag stored in the accessed word is read and compared with the tag in the address. If the two tags are the same, indicating that the word is the one required, access is made to the addressed cache word. However, if the tags are not the same, indicating that the required word is not in the cache, reference is made to the main memory to find it. For a memory read operation, the word is then transferred into the cache where it is accessed. It is possible to pass the information to the cache and the processor simultaneously, i.e., to read through the cache, on a miss. The cache location is altered for a write operation. The main memory may be altered at the same time (write-through) or later. Fig. 15 shows the direct mapped cache with a line consisting of more than one word. The main memory address is composed of a tag, an index, and a word within a line. All the words within a line in the cache have the same stored tag. The index part of the address is used to access the cache, and the stored tag is compared with the required tag. For a read operation, if the tags are the same the word within the block is selected for transfer to the processor. If the tags are not the same, the block containing the required word is first transferred to the cache. In direct mapping, blocks with the same index in the main memory will map into the same block in the cache, and hence only blocks with different indices can be in the cache at the same time. A replacement algorithm is unnecessary, since there is only one allowable location for each incoming block. Efficient operation relies on the low probability of lines with the same index being required at the same time. However there are such occurrences, for example, when two data vectors are stored starting at the same index and pairs of elements need to be processed together. To gain the greatest performance, data arrays and vectors need to be stored in a manner which minimizes the conflicts in processing pairs of elements. Fig. 15 also shows the lower bits of the processor address used to address the cache location directly. It is possible to introduce a mapping function between the address index and the cache index so that they are not the same.
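The tag/index/offset decomposition described above can be sketched in C as below. The particular sizes (a 64-line cache with 16-byte lines) and the structure names are assumptions chosen only to make the example concrete.

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES   16u                      /* 4 words of 4 bytes            */
#define NUM_LINES    64u                      /* direct-mapped: 64 cache lines */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Return true on a hit and copy the addressed byte into *out. */
bool direct_mapped_read(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr % LINE_BYTES;               /* byte within the line */
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES; /* selects the line     */
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES); /* identifies the block */

    cache_line_t *line = &cache[index];
    if (line->valid && line->tag == tag) {             /* tag match: cache hit */
        *out = line->data[offset];
        return true;
    }
    return false;   /* miss: the block would be fetched from main memory */
}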

Set-associative mapping

In the direct scheme, all words stored in the cache must have different indices. The tags may be the same or different. In the fully associative scheme, blocks can displace any other block and can be placed anywhere, but fully associative memories are costly and operate relatively slowly. Set-associative mapping allows a limited number of blocks, with the same index and different tags, in the cache and can therefore be considered a compromise between a fully associative cache and a direct mapped cache. The cache is divided into sets of blocks. A four-way set associative cache would have four blocks in each set. The number of blocks in a set is known as the associativity or set size. Each block in each set has a stored tag which, together with the index, completes the identification of the block. First, the index of the address from the processor is used to access the set. Then, comparators are used to compare all tags of the selected set with the incoming tag. If a match is found, the corresponding location is accessed; otherwise, as before, an access to the main memory is made.

Figure 16: Cache with set-associative mapping

The tag address bits are always chosen to be the most significant bits of the full address, the block address bits are the next most significant, and the word/byte address bits form the least significant bits, as this spreads consecutive main memory blocks throughout consecutive sets in the cache. This addressing format is known as bit selection and is used by all known systems. In a set-associative cache it would be possible to have the set address bits as the most significant bits of the address and the block address bits as the next most significant, with the word within the block as the least significant bits, or with the block address bits as the least significant bits and the word within the block as the middle bits. Notice that the comparison between the stored tags and the incoming tag is done using comparators that can be shared for each associative search, and all the information, tags and data, can be stored in ordinary random access memory. The number of comparators required in the set-associative cache is given by the number of blocks in a set, not the total number of blocks, as in a fully associative memory. The set can be selected quickly and all the blocks of the set can be read out simultaneously with the tags before waiting for the tag comparisons to be made. After a tag has been identified, the corresponding block can be selected. The replacement algorithm for set-associative mapping need only consider the lines in one set, as the choice of set is predetermined by the index in the address. Hence, with two blocks in each set, for example, only one additional bit is necessary in each set to identify the block to replace.
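Extending the previous sketch, a four-way set-associative lookup only has to compare the tags of the blocks in the selected set, as sketched below. Again the sizes and names are illustrative assumptions, not a description of any particular cache.

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 16u
#define NUM_SETS   64u
#define WAYS       4u                          /* four-way set associative */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_SETS][WAYS];

/* Return true on a hit and copy the addressed byte into *out. */
bool set_assoc_read(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr % LINE_BYTES;
    uint32_t set    = (addr / LINE_BYTES) % NUM_SETS;  /* index selects a set */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS);

    for (uint32_t way = 0; way < WAYS; way++) {        /* only WAYS comparisons */
        cache_line_t *line = &cache[set][way];
        if (line->valid && line->tag == tag) {
            *out = line->data[offset];
            return true;
        }
    }
    return false;  /* miss: a block in this set must be chosen for replacement */
}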
Sector mapping

In sector mapping, the main memory and the cache are both divided into sectors; each sector is composed of a number of blocks. Any sector in the main memory can map into any sector in the cache, and a tag is stored with each sector in the cache to identify the main memory sector address. However, a complete sector is not transferred to the cache or back to the main memory as one unit; instead, individual blocks are transferred as required. On a cache sector miss, the required block of the sector is transferred into a specific location within one cache sector. That sector location in the cache is selected, and the other blocks already present in that cache sector belong to the previously held sector and are therefore no longer valid. Sector mapping might be regarded as a fully associative mapping scheme with valid bits, as in some microprocessor caches. Each block in the fully associative mapped cache corresponds to a sector, and each byte corresponds to a sector block.

Self Assessment Questions

1. Discuss the basic idea of using cache memory.
2. Write notes on: (a) cache memory organization, (b) direct mapping.
3. Explain the cache with set-associative mapping with a neat diagram.

Cache Performance The performance of a cache can be quantified in terms of the hit and miss rates, the cost of a hit, and the miss penalty, where a cache hit is a memory access that finds data in the cache and a cache miss is one that does not.

When reading, the cost of a cache hit is roughly the time to access an entry in the cache. The miss penalty is the additional cost of replacing a cache line with one containing the desired data.

(Access time) = (hit cost) + (miss rate) * (miss penalty) = (fast memory access time) + (miss rate) * (slow memory access time)

Note that the approximation is an underestimate: control costs have been left out. Also note that only one word is being loaded from the faster memory while a whole cache block's worth of data is being loaded from the slower memory. Since the speeds of the actual memories used improve independently, most effort in cache design is spent on fast control and on decreasing the miss rate. We can classify misses into three categories: compulsory misses, capacity misses and conflict misses. Compulsory misses occur when data is loaded into the cache for the first time (e.g. at program startup) and are unavoidable. Capacity misses occur when data is reloaded because the cache is not large enough to hold all the data, no matter how we organize the data (i.e. even if we changed the hash function and made it omniscient). All other misses are conflict misses: there is theoretically enough space in the cache to avoid the miss, but our fast hash function caused a miss anyway.

Fetch and write mechanism

Fetch policy

We can identify three strategies for fetching bytes or blocks from the main memory into the cache, namely:

1. Demand fetch, which is fetching a block when it is needed and it is not already in the cache, i.e. fetching the required block on a miss. This strategy is the simplest and requires no additional hardware or tags in the cache recording the references, except to identify the block in the cache to be replaced.

2. Pre-fetch, which is fetching blocks before they are requested. A simple prefetch strategy is to prefetch the (i+1)th block when the ith block is initially referenced, on the expectation that it is likely to be needed if the ith block is needed. With this simple prefetch strategy, not all first references will induce a miss, as some will be to prefetched blocks.

3. Selective fetch, which is the policy of not always fetching blocks, dependent upon some defined criterion, and in these cases using the main memory rather than the cache to hold the information. For example, shared writable data might be easier to maintain if it is always kept in the main memory and not passed to a cache for access, especially in multi-processor systems. Cache systems need to be designed so that the processor can access the main memory directly and bypass the cache. Individual locations could be tagged as non-cacheable.

Instruction and data caches

The basic stored program computer provides one main memory for holding both program instructions and program data. The cache can be organized in the same fashion, with the cache holding both program instructions and data. This is called a unified cache. We can also separate the cache into two parts: a data cache and an instruction (code) cache. The general arrangement of separate caches is shown in Fig. 17. Often the cache will be integrated inside the processor chip.

Figure 17: Separate instruction and data caches

Write operations

As reading the required word in the cache does not affect the cache contents, there can be no discrepancy between the cache word and the copy held in the main memory after a memory read instruction. However, in general, writing can occur to cache words and it is possible that the cache word and the copy held in the main memory may differ. It is necessary to keep the cache and the main memory copy identical if input/output transfers operate on the main memory contents, or if multiple processors operate on the main memory, as in a shared memory multiprocessor system. If we ignore the overhead of maintaining consistency and the time for writing data back to the main memory, then the average access time is given by the previous equation, i.e. teff = tcache + (1 - h) tmain, assuming that all accesses are first made to the cache. The average access time including write operations will add additional time to this equation that depends upon the mechanism used to maintain data consistency. There are two principal alternative mechanisms to update the main memory, namely the write-through mechanism and the write-back mechanism.

Write-through mechanism

In the write-through mechanism, every write operation to the cache is repeated to the main memory, normally at the same time. The additional write operation to the main memory will, of course, take much longer than the one to the cache and will dominate the access time for write operations. The average access time of write-through with transfers from main memory to the cache on all misses (read and write) is given by:

ta = tcache + (1 - h) ttrans + w (tmain - tcache) = (1 - w) tcache + (1 - h) ttrans + w tmain

where ttrans = time to transfer a block to the cache, assuming the whole block must be transferred together, and w = fraction of write references.

The term (tmain - tcache) is the additional time to write the word to main memory whether a hit or a miss has occurred, given that the cache and main memory write operations occur simultaneously but the main memory write operation must complete before any subsequent cache read/write operation can proceed. If the size of the block matches the external data path size, a whole block can be transferred in one transaction and ttrans = tmain. On a cache miss, a block could be transferred from the main memory to the cache whether the miss was caused by a write or by a read operation. The term allocate on write describes a policy of bringing a word/block from the main memory into the cache for a write operation. In write-through, fetch-on-write transfers are often not done on a miss, i.e., a non-allocate-on-write policy: the information is written to the main memory but not kept in the cache. The write-through scheme can be enhanced by incorporating buffers, as shown in Fig. 18, to hold information to be written back to the main memory, freeing the cache for subsequent accesses.

Figure 18: Cache with write buffer

For write-through, each item to be written back to the main memory is held in a buffer together with the corresponding main memory address if the transfer cannot be made immediately. Immediate writing to main memory when new values are generated ensures that the most recent values are held in the main memory, and hence that any device or processor accessing the main memory obtains the most recent values immediately, avoiding the need for complicated consistency mechanisms. There will be some latency before the main memory has been updated, and the cache and main memory values are not consistent during this period.

Write-back mechanism

In the write-back mechanism, the write operation to the main memory is only done at block replacement time. At this time, the block displaced by the incoming block might be written back to the main memory irrespective of whether the block has been altered. This policy is known as simple write-back, and leads to an average access time of:

ta = tcache + (1 - h) ttrans + (1 - h) ttrans

where one (1 - h) ttrans term is due to fetching a block from memory and the other (1 - h) ttrans term is due to writing back a block. Write-back normally handles write misses as allocate on write, as opposed to write-through, which often handles write misses as non-allocate on write. The write-back mechanism usually only writes back lines that have been altered. To implement this policy, a 1-bit tag is associated with each cache line and is set whenever the block is altered. At replacement time, the tags are examined to determine whether it is necessary to write the block back to the main memory. The average access time now becomes:

ta = tcache + (1 - h) ttrans + wb (1 - h) ttrans

where wb is the probability that a block has been altered (the fraction of blocks altered). This probability could be as high as the probability of write references, w, but is likely to be much less, as more than one write reference to the same block is likely, and some references to the same byte/word within a block are likely. However, under this policy the complete block is written back, even if only one word in the block has been altered, and thus the policy results in more traffic than is necessary, especially for memory data paths narrower than a line; still, there is usually less memory traffic than with write-through, which causes every alteration to be recorded in the main memory. The write-back scheme can also be enhanced by incorporating buffers to hold information to be written back to the main memory, just as is possible and normally done with write-through.

Self Assessment Questions

1. List and explain the various activities involved in the fetch and write mechanisms.
2. When is the write-back mechanism used, and what is its average access time?

Replacement policy

When the required word of a block is not held in the cache, we have seen that it is necessary to transfer the block from the main memory into the cache, displacing an existing block if the cache is full. Except for direct mapping, which does not allow a replacement algorithm, the existing block in the cache is chosen by a replacement algorithm. The replacement mechanism must be implemented totally in hardware, preferably such that the selection can be made completely during the main memory cycle for fetching the new block. Ideally, the block replaced will not be needed again in the future. However, such future events cannot be known, and the decision has to be made based upon facts that are known at the time.

1. Random replacement algorithm

Perhaps the easiest replacement algorithm to implement is a pseudo-random replacement algorithm. A true random replacement algorithm would select a block to replace in a totally random order, with no regard to memory references or previous selections; practical random replacement algorithms can approximate this in one of several ways. For example, one counter for the whole cache could be incremented at intervals (for example after each clock cycle, or after each reference, irrespective of whether it is a hit or a miss). The value held in the counter identifies the block in the cache (if fully associative) or the block in the set if it is a set-associative cache. The counter should have sufficient bits to identify any block. For a fully associative cache, an n-bit counter is necessary if there are 2^n blocks in the cache. For a four-way set-associative cache, one 2-bit counter would be sufficient, together with logic to increment the counter.

2. First-in first-out replacement algorithm

The first-in first-out replacement algorithm removes the block that has been in the cache for the longest time. The first-in first-out algorithm would naturally be implemented with a first-in first-out queue of block addresses, but can be more easily implemented with counters: only one counter for a fully associative cache, or one counter for each set in a set-associative cache, each with a sufficient number of bits to identify the block.

3. Least recently used algorithm for a cache

In the least recently used (LRU) algorithm, the block which has not been referenced for the longest time is removed from the cache. Only those blocks in the cache are considered. The word recently comes about because the block is not the least used overall, as the least used block overall is likely to be back in main memory already; it is the least used of those blocks currently in the cache, all of which are likely to have been used recently, otherwise they would not be in the cache. The least recently used (LRU) algorithm is popular for cache systems and can be implemented fully when the number of blocks involved is small. There are several ways the algorithm can be implemented in hardware for a cache; these include:

1) Counters

In the counter implementation, a counter is associated with each block. A simple implementation would be to increment each counter at regular intervals and to reset a counter when the associated line has been referenced. Hence the value in each counter indicates the age of a block since it was last referenced. The block with the largest age would be replaced at replacement time.

2) Register stack

In the register stack implementation, a set of n-bit registers is formed, one for each block in the set to be considered. The most recently used block is recorded at the top of the stack and the least recently used block at the bottom. Actually, the set of registers does not form a conventional stack, as both ends and internal values are accessible. The value held in one register is passed to the next register under certain conditions. When a block is referenced, starting at the top of the stack, the values held in the registers are shifted one place towards the bottom of the stack until a register is found to hold the same value as the incoming block identification. Subsequent registers are not shifted. The top register is then loaded with the incoming block identification. This has the effect of moving the contents of the register holding the incoming block number to the top of the stack. This logic is fairly substantial and slow, and not really a practical solution.

Fig. 19

3) Reference matrix

The reference matrix method centers around a matrix of status bits. There is more than one version of the method. In one version (Smith, 1982), the upper triangular part of a B x B matrix is formed, without the diagonal, if there are B blocks to consider. The triangular matrix has B(B - 1)/2 bits. When the ith block is referenced, all the bits in the ith row of the matrix are set to 1 and then all the bits in the ith column are set to 0. The least recently used block is the one which has all 0s in its row and all 1s in its column, which can be detected easily by logic. The method is demonstrated in Fig. 19 for B = 4 and the reference sequence 2, 1, 3, 0, 3, 2, 1, ..., together with the values that would be obtained using a register stack.

4) Approximate methods

When the number of blocks to consider increases above about four to eight, approximate methods are necessary for the LRU algorithm. Fig. 20 shows a two-stage approximation method with eight blocks, which is applicable to any replacement algorithm. The eight blocks in Fig. 20 are divided into four pairs, and each pair has one status bit to indicate the most/least recently used block in the pair (simply set or reset by a reference to each block). The least recently used replacement algorithm now only considers the four pairs. Six status bits are necessary (using the reference matrix) to identify the least recently used pair which, together with the status bit of that pair, identifies the least recently used block.

Figure 20: Two-stage replacement algorithm

The method can be extended to further levels. For example, sixteen blocks can be divided into four groups, each group having two pairs. One status bit can be associated with each pair, identifying the block in the pair, and another with each group, identifying the group in a pair of groups. A true least recently used algorithm is applied to the groups. In fact, the scheme could be taken to its logical conclusion by extending it to a full binary tree. Fig. 21 gives an example. Here, there are four blocks in a set. One status bit, B0, specifies which half of the blocks is most/least recently used. Two more bits, B1 and B2, specify which block of each pair is most/least recently used. Every time a cache block is referenced (or loaded on a miss), the status bits are updated. For example, if block L2 is referenced, B2 is set to 0 to indicate that L2 is the most recently used of the pair L2 and L3, and B0 is set to 1 to indicate that L2/L3 is the most recently used half of the four blocks L0, L1, L2 and L3. To identify the line to replace on a miss, the status bits are examined. If B0 = 0, then the block is either L0 or L1; if then B1 = 0, it is L0.

Figure 21: Replacement algorithm using a tree selection
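A minimal C sketch of this tree-selection (pseudo-LRU) scheme for a four-block set is given below. The polarity chosen here follows the replacement rule in the text (B0 = 0 selects the L0/L1 half as the victim); textbooks differ on the exact encoding, so treat the bit convention as an illustrative assumption.

#include <stdint.h>

/* Pseudo-LRU status bits for one four-block set, as in Figure 21.
 * Each bit points towards the less recently used side. */
typedef struct {
    uint8_t b0;   /* 0: L0/L1 side is older, 1: L2/L3 side is older */
    uint8_t b1;   /* within L0/L1: 0 means L0 is older               */
    uint8_t b2;   /* within L2/L3: 0 means L2 is older               */
} plru_t;

/* Update the bits when block 'way' (0..3) is referenced or loaded. */
void plru_touch(plru_t *s, unsigned way)
{
    if (way < 2) {             /* L0 or L1 used: the L2/L3 side is now older */
        s->b0 = 1;
        s->b1 = (way == 0) ? 1 : 0;  /* point at the other block of the pair */
    } else {                   /* L2 or L3 used: the L0/L1 side is now older */
        s->b0 = 0;
        s->b2 = (way == 2) ? 1 : 0;
    }
}

/* Choose a victim by following the bits towards the older side. */
unsigned plru_victim(const plru_t *s)
{
    if (s->b0 == 0)                      /* L0/L1 side is older */
        return (s->b1 == 0) ? 0 : 1;
    else                                 /* L2/L3 side is older */
        return (s->b2 == 0) ? 2 : 3;
}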

Self Assessment Questions

1. Discuss the various types of memory replacement algorithms in brief. 2. Write a note on: a. Register Stack method b. Reference matrix method

Second-level caches When the cache is integrated into the processor, it will be impossible to increase its size should the performance not be sufficient. In any case, increasing the size of the cache may create a slower cache. As an alternative, which has become very popular, a second larger cache can be introduced between the first cache and the main memory as shown in Fig. 22. This second-level cache is sometimes called a secondary cache.

Figure 22: Two-level caches

On a memory reference, the processor will access the first-level cache. If the information is not found there (a first-level cache miss occurs), the second-level cache will be accessed. If it is not in the second-level cache either (a second-level cache miss occurs), then the main memory must be accessed. Memory locations will be transferred to the second-level cache and then to the first-level cache, so that two copies of a memory location will exist in the cache system, at least initially, i.e., locations cached in the second-level cache also exist in the first-level cache. This is known as the Principle of Inclusion. (Of course the copies in the second-level cache will not normally be needed as long as the locations are still found in the first-level cache.) Whether this continues will depend upon the replacement and write policies. The replacement policy practiced in both caches would normally be the least recently used algorithm. Normally write-through will be practiced between the caches, which will maintain the duplicate copies. The block size of the second-level cache will be at least the same as, if not larger than, the block size of the first-level cache, because otherwise on a first-level cache miss more than one second-level cache line would need to be transferred into the first-level cache block.

Optimizing the data cache performance

When we deal with multiple arrays, with some arrays accessed by rows and some by columns, storing the arrays row-by-row or column-by-column does not solve the problem, because both rows and columns are used in each iteration of the loop. We must bring the same data into the cache again and again if the cache is not large enough to hold all the data, which is a waste. We will use matrix multiplication (C = A.B, where A, B, and C are respectively m x p, p x n, and m x n matrices) as an example to show how to exploit locality to improve cache performance.

Principle of Locality

Since code is generally executed sequentially, virtually all programs repeat sections of code and repeatedly access the same or nearby data. This characteristic is embodied in the Principle of Locality, which has been found empirically to be obeyed by most programs. It applies to both instruction references and data references, though it is more pronounced for instruction references. It has two main aspects:

1. Temporal locality (locality in time): individual locations, once referenced, are likely to be referenced again in the near future.
2. Spatial locality (locality in space): references, including the next location, are likely to be near the last reference.

Temporal locality is found in instruction loops, data stacks and variable accesses. Spatial locality describes the characteristic that programs access a number of distinct, clustered regions. Sequential locality describes sequential locations being referenced and is a main attribute of program construction; it can also be seen in data accesses, as data items are often stored in sequential locations.

Taking advantage of temporal locality

When instructions are formed into loops which are executed many times, the length of a loop is usually quite small. Therefore, once a cache is loaded with a loop of instructions from the main memory, the instructions are used more than once before new instructions are required from the main memory. The same situation applies to data: data is often accessed repeatedly. Suppose a reference is repeated n times in all during a program loop and, after the first reference, the location is always found in the cache; then the average access time would be:

ta = (n * tcache + tmain) / n = tcache + tmain / n

where n = number of references. As n increases, the average access time decreases. The increase in speed will, of course, depend upon the program. Some programs might have a large amount of temporal locality, while others have less. We can perform some optimizations to exploit this.
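For illustration, with assumed values tcache = 10 ns, tmain = 100 ns and n = 10 repeated references, the formula above gives ta = 10 + 100/10 = 20 ns, already close to the cache access time; with n = 100 it falls to 11 ns. (The figures are arbitrary and chosen only to show how quickly the main-memory cost is amortized.)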

Taking advantage of spatial locality

To take advantage of spatial locality, we transfer not just one byte or word between the main memory and the cache but a series of sequential locations called a block. We have assumed that it is necessary to reference the cache before a reference is made to the main memory to fetch a word, and it is usual to look into the cache first to see if the information is held there.

Data Blocking

For the matrix multiplication C = A.B, a straightforward version of the code is:

for (I = 0; I < m; I++)
    for (J = 0; J < n; J++) {
        R = 0;
        for (K = 0; K < p; K++)
            R = R + A[I][K] * B[K][J];
        C[I][J] = R;
    }

The two inner loops read all p by n elements of B, access the same p elements in a row of A repeatedly, and write one row of n elements of C. The number of capacity misses clearly depends on the dimension parameters m, n, p and on the size of the cache. If the cache can hold all three matrices, then all is well, provided there are no cache conflicts. In the worst case, there would be (2*m*n*p + m*n) words read from memory for m*n*p operations. To enhance the cache performance when the cache is not big enough, we use an optimization technique: blocking. The block method for this matrix product consists of:

Split the result matrix C into blocks CI,J of size Nb x Nb; each block is computed in a contiguous array Cb which is then copied back into the right CI,J. Matrices A and B are split into panels AI and BJ of size (Nb x p) and (p x Nb); each panel is copied into contiguous arrays Ab and Bb. The choice of Nb must ensure that Cb, Ab and Bb fit together into one level of cache, usually the L2 cache.

Then we rewrite the code as:

for (I = 0; I < m/Nb; I++) {
    Ab = AI;                       /* copy panel AI into the contiguous array Ab */
    for (J = 0; J < n/Nb; J++) {
        Bb = BJ;                   /* copy panel BJ into Bb */
        Cb = 0;
        for (K = 0; K < p/Nb; K++)
            Cb = Cb + AbK * BKb;   /* product of the K-th blocks of the panels */
        CI,J = Cb;                 /* copy the finished block back */
    }
}

Here "=" means assignment of a whole matrix block. We suppose for simplicity that Nb divides m, n and p. Figure 23 below may help in understanding the operations performed on the blocks. With this algorithm, matrix A is loaded into the cache only once, compared with the n accesses of the original version, while matrix B is still accessed m times. This simple block method greatly reduces memory accesses, and real codes may choose, by looking at the matrix sizes, which loop structure (ijk vs. jik) is most appropriate and whether some matrix operand fits totally into the cache.

Figure 23
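For concreteness, a plain C sketch of the blocked loop structure is given below. It assumes square blocking with block size NB, that NB divides the matrix dimensions, and statically sized arrays; it is meant only to illustrate the access pattern, not to be a tuned implementation.

#define M  256
#define N  256
#define P  256
#define NB 32              /* block size; assumed to divide M, N and P */

/* Static arrays are zero-initialized, so C starts at 0 and += accumulates. */
static double A[M][P], B[P][N], C[M][N];

/* Blocked matrix multiplication C = A * B. Each (ib, jb, kb) iteration works on
 * NB x NB sub-blocks, so the three sub-blocks touched can stay resident in cache. */
void matmul_blocked(void)
{
    for (int ib = 0; ib < M; ib += NB)
        for (int jb = 0; jb < N; jb += NB)
            for (int kb = 0; kb < P; kb += NB)
                /* multiply the NB x NB sub-blocks */
                for (int i = ib; i < ib + NB; i++)
                    for (int k = kb; k < kb + NB; k++) {
                        double a = A[i][k];          /* reused across the j loop */
                        for (int j = jb; j < jb + NB; j++)
                            C[i][j] += a * B[k][j];
                    }
}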

In the previous discussion we did not talk about the use of the L1 cache. In fact L1 will generally be too small to hold a CI,J block together with one panel each of A and B, but note that the operation performed in Cb = Cb + AbK*BKb is itself a matrix-matrix product, so each operand AbK and BKb is accessed Nb times: this part could also use a block method. Since Nb is relatively small, the implementation may load only one of Cb, AbK, BKb into the L1 cache and work with the others from L2.

Summary

The operating system handles the responsibility of managing memory. This unit dealt with memory management, covering the memory hierarchy, paging and page handling, segmentation with its policies and algorithms, cache memory, cache memory organization and associative mapping, and cache performance. Managing the sharing of primary and secondary memory and minimizing memory access time are the vital goals of memory management. The unit also covered the memory fetch and write mechanisms, replacement policies, and related topics.

Terminal Questions 1. Memory management is important in operating systems. Discuss the main problems that can occur if memory is managed poorly. 2. Explain the difference between logical and physical addresses.

3. Consider a paging system with the page table stored in memory. If a memory reference takes 200 nanoseconds, how long does a paged memory reference take? If we add associative registers, and 75 percent of all page table references are found in the associative registers, what is the effective memory reference time? (Assume that looking for (and maybe finding) a page-table entry in the associative memory takes zero time.)
4. Consider a demand-paging system with a paging disk that has an average access and transfer time of 20 milliseconds. Addresses are translated through a page table in main memory, with an access time of 1 microsecond per memory access. Thus, each memory reference through the page table takes two accesses. To improve this time we have added an associative memory that reduces access time to one memory reference if the page table entry is in the associative memory. Assume that 80 percent of the accesses are in the associative memory and that, of the remaining, 10 percent (or 2 percent of the total) cause page faults. What is the effective memory access time?
5. We have discussed LRU as an attempt to predict future memory access patterns based on previous access patterns (i.e. if we haven't accessed a particular page in a while, we are not likely to reference it again soon). Another idea that some researchers have explored is to record the memory reference pattern from the last time the program was run and use it to predict what it will access next time. Discuss the positive and negative aspects of this idea.

Unit 5: CPU Scheduling

This unit covers a brief introduction to CPU scheduling, scheduling criteria and various types of scheduling algorithms, multiple-processor scheduling and thread scheduling.

Introduction

Almost all programs have some alternating cycle of CPU number crunching and waiting for I/O of some kind. (Even a simple fetch from memory takes a long time relative to CPU speeds.) In a simple system running a single process, the time spent waiting for I/O is wasted, and those CPU cycles are lost forever. A scheduling system allows one process to use the CPU while another is waiting for I/O, thereby making full use of otherwise lost CPU cycles. The challenge is to make the overall system as efficient and fair as possible, subject to varying and often dynamic conditions, where efficient and fair are somewhat subjective terms, often subject to shifting priority policies.

Objectives: At the end of this unit, you will be able to understand:

CPU Scheduler
Scheduling Algorithms
Multiple-Processor Scheduling
CPU-I/O Burst Cycle

Symmetric Multithreading
Thread Scheduling
Algorithm Evaluation

CPU-I/O Burst Cycle


Almost all processes alternate between two states in a continuing cycle, as shown in Figure 5.1 below:

A CPU burst of performing calculations, and
An I/O burst, waiting for data transfer in or out of the system.

Fig. 5.1: Alternating sequence of CPU and I/O Bursts

CPU bursts vary from process to process, and from program to program, but an extensive study shows frequency patterns similar to that shown in Figure 5.2:

Fig. 5.2: Histogram of CPU-burst durations

Self Assessment Questions

1. Discuss the two states between which a process alternates in a continuing cycle.
2. Explain preemptive scheduling and non-preemptive scheduling.
3. What is a dispatcher?

CPU Scheduler

Whenever the CPU becomes idle, it is the job of the CPU scheduler (a.k.a. the short-term scheduler) to select another process from the ready queue to run next. The storage structure for the ready queue and the algorithm used to select the next process are not necessarily a FIFO queue. There are several alternatives to choose from, as well as numerous adjustable parameters for each algorithm, which is the basic subject of this entire unit.

Preemptive Scheduling

CPU scheduling decisions take place under one of four conditions:

1. When a process switches from the running state to the waiting state, such as for an I/O request or an invocation of the wait( ) system call.
2. When a process switches from the running state to the ready state, for example in response to an interrupt.
3. When a process switches from the waiting state to the ready state, say at completion of I/O or a return from wait( ).
4. When a process terminates.

For conditions 1 and 4 there is no choice: a new process must be selected. For conditions 2 and 3 there is a choice: either continue running the current process, or select a different one. If scheduling takes place only under conditions 1 and 4, the system is said to be non-preemptive, or cooperative. Under these conditions, once a process starts running it keeps running until it either voluntarily blocks or finishes. Otherwise the system is said to be preemptive. Windows used non-preemptive scheduling up to Windows 3.x, and started using preemptive scheduling with Windows 95. Macs used non-preemptive scheduling prior to OSX, and preemptive scheduling since then. Note that preemptive scheduling is only possible on hardware that supports a timer interrupt. It should also be noted that preemptive scheduling can cause problems when two processes share data, because one process may get interrupted in the middle of updating shared data structures. Preemption can also be a problem if the kernel is busy executing a system call (e.g. updating critical kernel data structures) when the preemption occurs. Most modern UNIXes deal with this problem by making the process wait until the system call has either completed or blocked before allowing the preemption. Unfortunately this solution is problematic for real-time systems, as real-time response can no longer be guaranteed. Some critical sections of code protect themselves from concurrency problems by disabling interrupts before entering the critical section and re-enabling interrupts on exiting the section. Needless to say, this should only be done in rare situations, and only on very short pieces of code that will finish quickly (usually just a few machine instructions).

Dispatcher

The dispatcher is the module that gives control of the CPU to the process selected by the scheduler. This function involves:

Switching context.
Switching to user mode.
Jumping to the proper location in the newly loaded program.

The dispatcher needs to be as fast as possible, as it is run on every context switch. The time consumed by the dispatcher is known as dispatch latency.

Scheduling Criteria

There are several different criteria to consider when trying to select the best scheduling algorithm for a particular situation and environment, including:

CPU utilization: ideally the CPU would be busy 100% of the time, so as to waste zero CPU cycles. On a real system CPU usage should range from 40% (lightly loaded) to 90% (heavily loaded).
Throughput: the number of processes completed per unit time. This may range from 10 per second to 1 per hour, depending on the specific processes.
Turnaround time: the time required for a particular process to complete, from submission time to completion (wall clock time).
Waiting time: how much time processes spend in the ready queue waiting their turn to get on the CPU.
(Load average: the average number of processes sitting in the ready queue waiting their turn to get onto the CPU. Reported as 1-minute, 5-minute, and 15-minute averages by uptime and who.)
Response time: the time taken in an interactive program from the issuance of a command to the commencement of a response to that command.

In general one wants to optimize the average value of a criterion (maximize CPU utilization and throughput, and minimize all the others). However, sometimes one wants to do something different, such as minimize the maximum response time. Sometimes it is more desirable to minimize the variance of a criterion than its actual value, i.e. users are more accepting of a consistent, predictable system than an inconsistent one, even if it is a little bit slower.

Scheduling Algorithms

The following subsections explain several common scheduling strategies, looking at only a single CPU burst each for a small number of processes. Obviously real systems have to deal with many more simultaneous processes executing their CPU-I/O burst cycles.

First-Come First-Serve Scheduling, FCFS

FCFS is very simple: just a FIFO queue, like customers waiting in line at the bank or the post office or at a copying machine. Unfortunately, however, FCFS can yield some very long average wait times, particularly if the first process to get there takes a long time. For example, consider the following three processes:

Process   Burst Time
P1        24
P2        3
P3        3

In the first Gantt chart below, process P1 arrives first. The average waiting time for the three processes is (0 + 24 + 27) / 3 = 17.0 ms. In the second Gantt chart below, the same three processes have an average wait time of (0 + 3 + 6) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the second case two of the three finish much more quickly, and the other process is only delayed by a short amount.
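A small C sketch of the FCFS waiting-time calculation above is given below; the burst times are taken from the table, and the helper function is hypothetical, written only to make the arithmetic explicit.

#include <stdio.h>

/* Compute the average waiting time for FCFS, given burst times in arrival order.
 * Each process waits for the sum of the bursts of all processes ahead of it. */
double fcfs_average_wait(const int burst[], int n)
{
    int elapsed = 0, total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += elapsed;   /* waiting time of process i */
        elapsed    += burst[i];  /* it then runs to completion */
    }
    return (double)total_wait / n;
}

int main(void)
{
    int order1[] = {24, 3, 3};   /* P1 first: (0 + 24 + 27) / 3 = 17.0 ms */
    int order2[] = {3, 3, 24};   /* P1 last:  (0 + 3 + 6)  / 3 =  3.0 ms */
    printf("%.1f ms\n", fcfs_average_wait(order1, 3));
    printf("%.1f ms\n", fcfs_average_wait(order2, 3));
    return 0;
}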

FCFS can also block the system in a busy dynamic system in another way, known as the convoy effect. When one CPU-intensive process blocks the CPU, a number of I/O-intensive processes can get backed up behind it, leaving the I/O devices idle. When the CPU hog finally relinquishes the CPU, the I/O processes pass through the CPU quickly, leaving the CPU idle while everyone queues up for I/O, and then the cycle repeats itself when the CPU-intensive process gets back to the ready queue.

Shortest-Job-First Scheduling, SJF

The idea behind the SJF algorithm is to pick the quickest, smallest job that needs to be done, get it out of the way first, and then pick the next smallest, fastest job to do next. (Technically this algorithm picks a process based on the next shortest CPU burst, not the overall process time.) For example, the Gantt chart below is based upon the following CPU burst times (and the assumption that all jobs arrive at the same time):

Process   Burst Time
P1        6
P2        8
P3        7
P4        3

In the case above the average wait time is (0 + 3 + 9 + 16) / 4 = 7.0 ms (as opposed to 10.25 ms for FCFS for the same processes). SJF can be proven to be the fastest scheduling algorithm, but it suffers from one important problem: how do you know how long the next CPU burst is going to be? For long-term batch jobs this can be done based upon the limits that users set for their jobs when they submit them, which encourages them to set low limits but risks their having to re-submit the job if they set the limit too low. However, that does not work for short-term CPU scheduling on an interactive system. Another option would be to statistically measure the run-time characteristics of jobs, particularly if the same tasks are run repeatedly and predictably; but once again that really isn't a viable option for short-term CPU scheduling in the real world. A more practical approach is to predict the length of the next burst, based on some historical measurement of recent burst times for this process. One simple, fast, and relatively accurate method is the exponential average, which can be defined as follows.

estimate[i + 1] = alpha * burst[i] + (1.0 - alpha) * estimate[i]

In this scheme the previous estimate contains the history of all previous times, and alpha serves as a weighting factor for the relative importance of recent data versus past history. If alpha is 1.0, then past history is ignored, and we assume the next burst will be the same length as the last burst. If alpha is 0.0, then all measured burst times are ignored, and we just assume a constant burst time. Most commonly alpha is set at 0.5, as illustrated in Figure 5.3:

Fig. 5.3: Prediction of the length of the next CPU burst
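The exponential average above translates directly into code. The sketch below is a minimal illustration; the sample burst values and the starting estimate of 10 ms are assumptions picked only to show the update rule.

#include <stdio.h>

/* One step of the exponential average: predict the next CPU burst
 * from the last measured burst and the previous prediction. */
double next_estimate(double alpha, double last_burst, double last_estimate)
{
    return alpha * last_burst + (1.0 - alpha) * last_estimate;
}

int main(void)
{
    double bursts[] = {6.0, 4.0, 6.0, 4.0, 13.0, 13.0, 13.0}; /* sample data   */
    double estimate = 10.0;                                   /* initial guess */
    double alpha    = 0.5;

    for (int i = 0; i < 7; i++) {
        printf("burst %.0f  predicted %.2f\n", bursts[i], estimate);
        estimate = next_estimate(alpha, bursts[i], estimate);
    }
    return 0;
}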

SJF can be either preemptive or non-preemptive. Preemption occurs when a new process arrives in the ready queue that has a predicted burst time shorter than the time remaining in the process whose burst is currently on the CPU. Preemptive SJF is sometimes referred to as shortest remaining time first scheduling. For example, the following Gantt chart is based upon the following data:
Process   Arrival Time   Burst Time
P1        0              8
P2        1              4
P3        2              9
P4        3              5

The average wait time in this case is ((10 - 1) + (1 - 1) + (17 - 2) + (5 - 3)) / 4 = 26 / 4 = 6.5 ms (as opposed to 7.75 ms for non-preemptive SJF or 8.75 ms for FCFS).

Priority Scheduling

Priority scheduling is a more general case of SJF, in which each job is assigned a priority and the job with the highest priority gets scheduled first. (SJF uses the inverse of the next expected burst time as its priority: the smaller the expected burst, the higher the priority.) Note that in practice, priorities are implemented using integers within a fixed range, but there is no agreed-upon convention as to whether high priorities use large numbers or small numbers. This book uses low numbers for high priorities, with 0 being the highest possible priority. For example, the following Gantt chart is based upon these process burst times and priorities, and yields an average waiting time of 8.2 ms:
Process   Burst Time   Priority
P1        10           3
P2        1            1
P3        2            4
P4        1            5
P5        5            2
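To check the 8.2 ms figure: with low numbers meaning high priority, the jobs run in the order P2, P5, P1, P3, P4, giving waiting times of 0, 1, 6, 16 and 18 ms respectively, and (0 + 1 + 6 + 16 + 18) / 5 = 41 / 5 = 8.2 ms.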

Priorities can be assigned either internally or externally. Internal priorities are assigned by the OS using criteria such as average burst time, ratio of CPU to I/O activity, system resource use, and other factors available to the kernel. External priorities are assigned by users, based on the importance of the job, fees paid, politics, etc. Priority scheduling can be either preemptive or non-preemptive. Priority scheduling can suffer from a major problem known as indefinite blocking, or starvation, in which a low-priority task can wait forever because there are always some other jobs around that have higher priority. If this problem is allowed to occur, then processes will either run eventually when the system load lightens (at say 2:00 a.m.), or will eventually get lost when the system is shut down or crashes. (There are rumors of jobs that have been stuck for years.) One common solution to this problem is aging, in which priorities of jobs increase the longer they wait. Under this scheme a low-priority job will eventually get its priority raised high enough that it gets run.

Round Robin Scheduling Round robin scheduling is similar to FCFS scheduling, except that CPU bursts are assigned with limits called time quantum. When a process is given the CPU, a timer is set for whatever value has been set for a time quantum. If the process finishes its burst before the time quantum timer expires, then it is swapped out of the CPU just like the

normal FCFS algorithm. If the timer goes off first, then the process is swapped out of the CPU and moved to the back end of the ready queue.

The ready queue is maintained as a circular queue, so when all processes have had a turn, the scheduler gives the first process another turn, and so on. RR scheduling can give the effect of all processes sharing the CPU equally, although the average wait time can be longer than with other scheduling algorithms. In the following example the average wait time is 5.66 ms.
Process   Burst Time
P1        24
P2        3
P3        3
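The quantum used in this example is not stated above; a 4 ms quantum is consistent with the quoted 5.66 ms average (17/3 ms). A minimal C sketch of the computation (illustrative only; it assumes all processes arrive at time 0):

#include <stdio.h>

int main(void)
{
    int burst[]     = {24, 3, 3};        /* P1, P2, P3 */
    int remaining[] = {24, 3, 3};
    int n = 3, quantum = 4, time = 0, done = 0;
    int finish[3];

    while (done < n) {                   /* cycle through the ready queue */
        for (int i = 0; i < n; i++) {
            if (remaining[i] == 0) continue;
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            time += slice;
            remaining[i] -= slice;
            if (remaining[i] == 0) { finish[i] = time; done++; }
        }
    }

    double total_wait = 0;
    for (int i = 0; i < n; i++)          /* wait = turnaround - burst */
        total_wait += finish[i] - burst[i];
    printf("average wait = %.2f ms\n", total_wait / n);
    return 0;
}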

The performance of RR is sensitive to the time quantum selected. If the quantum is large enough, then RR reduces to the FCFS algorithm; if it is very small, then each process gets 1/nth of the processor time and they share the CPU equally. But a real system incurs overhead for every context switch, and the smaller the time quantum the more context switches there are. (See Figure 5.4 below.) Most modern systems use time quanta between 10 and 100 milliseconds, and context switch times on the order of 10 microseconds, so the overhead is small relative to the time quantum.

Fig. 5.4: The way in which a smaller time quantum increases context switches

Turn around time also varies with quantum time, in a non-apparent manner. Consider, for example the processes shown in Figure 5.5:

Fig. 5.5: The way in which turnaround time varies with the time quantum

In general, turnaround time is minimized if most processes finish their next cpu burst within one time quantum. For example, with three processes of 10 ms bursts each, the average turnaround time for 1 ms quantum is 29, and for 10 ms quantum it reduces to 20. However, if it is made too large, then RR just degenerates to FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time quantum.

Multilevel Queue Scheduling When processes can be readily categorized, then multiple separate queues can be established, each implementing whatever scheduling algorithm is most appropriate for that type of job, and/or with different parametric adjustments. Scheduling must also be done between queues, that is scheduling one queue to get time relative to other queues. Two common options are strict priority (no job in a lower priority queue runs until all higher priority queues are empty) and round-robin (each queue gets a time slice in turn, possibly of different sizes.) Note that under this algorithm jobs cannot switch from queue to queue Once they are assigned a queue, that is their queue until they finish.

Fig. 5.6: Multilevel queue scheduling

Multilevel Feedback-Queue Scheduling Multilevel feedback queue scheduling is similar to the ordinary multilevel queue scheduling described above, except jobs may be moved from one queue to another for a variety of reasons: If the characteristics of a job change between CPU-intensive and I/O intensive, then it may be appropriate to switch a job from one queue to another. Aging can also be incorporated, so that a job that has waited for a long time can get bumped up into a higher priority queue for a while. Multilevel feedback queue scheduling is the most flexible, because it can be tuned for any situation. But it is also the most complex to implement because of all the adjustable parameters. Some of the parameters which define one of these systems include: The number of queues. The scheduling algorithm for each queue. The methods used to upgrade or demote processes from one queue to another. ( Which may be different. ) The method used to determine which queue a process enters initially.

Fig. 5.7: Multilevel feedback queues

Self Assessment Questions 1. Explain the several common scheduling strategies in brief. 2. Explain the FCFS scheduling with a suitable example. 3. Write note on:

a. Priority Scheduling

b. RR Scheduling

Multiple-Processor Scheduling When multiple processors are available, then the scheduling gets more complicated, because now there is more than one CPU which must be kept busy and in effective use at all times. Load sharing revolves around balancing the load between multiple processors. Multi-processor systems may be heterogeneous, (different kinds of CPUs), or homogenous, (all the same kind of CPU). Even in the latter case there may be special scheduling constraints, such as devices which are connected via a private bus to only one of the CPUs. This book will restrict its discussion to homogenous systems.

Approaches to Multiple-Processor Scheduling


One approach to multi-processor scheduling is asymmetric multiprocessing, in which one processor is the master, controlling all activities and running all kernel code, while the others run only user code. This approach is relatively simple, as there is no need to share critical system data. Another approach is symmetric multiprocessing, SMP, where each processor schedules its own jobs, either from a common ready queue or from separate ready queues for each processor. Virtually all modern OSes support SMP, including XP, Win 2000, Solaris, Linux, and Mac OSX.

Processor Affinity
Processors contain cache memory, which speeds up repeated accesses to the same memory locations. If a process were to switch from one processor to another each time it got a time slice, the data in the cache (for that process) would have to be invalidated and re-loaded from main memory, thereby obviating the benefit of the cache. Therefore SMP systems attempt to keep processes on the same processor, via processor affinity. Soft affinity occurs when the system attempts to keep processes on the same processor but makes no guarantees. Linux and some other OSes support hard affinity, in which a process specifies that it is not to be moved between processors.
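As a hedged, Linux-specific illustration (not part of the text above), a process can request hard affinity with the sched_setaffinity() system call; other operating systems expose different interfaces:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                        /* allow CPU 0 only */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)   /* 0 = calling process */
        perror("sched_setaffinity");
    return 0;
}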

Load Balancing
Obviously an important goal in a multiprocessor system is to balance the load between processors, so that one processor wont be sitting idle while another is overloaded. Systems using a common ready queue are naturally self-balancing, and do not need any special handling. Most systems, however, maintain separate ready queues for each processor. Balancing can be achieved through either push migration or pull migration:

Push migration involves a separate process that runs periodically, (e.g. every 200 milliseconds), and moves processes from heavily loaded processors onto less loaded ones. Pull migration involves idle processors taking processes from the ready queues of other processors. Push and pull migration are not mutually exclusive.

Note that moving processes from processor to processor to achieve load balancing works against the principle of processor affinity, and if not carefully managed, the savings gained by balancing the system can be lost in rebuilding caches. One option is to only allow migration when imbalance surpasses a given threshold.

Symmetric Multithreading
An alternative strategy to SMP is SMT, Symmetric Multi-Threading, in which multiple virtual (logical) CPUs are used instead of (or in combination with) multiple physical CPUs. SMT must be supported in hardware, as each logical CPU has its own registers and handles its own interrupts. (Intel refers to SMT as hyperthreading technology.) To some extent the OS does not need to know if the processors it is managing are real or virtual. On the other hand, some scheduling decisions can be optimized if the scheduler knows the mapping of virtual processors to real CPUs. (Consider the scheduling of two CPU-intensive processes on the architecture shown below.)

Fig. 5.8: A typical SMT architecture

Thread Scheduling

The process scheduler schedules only the kernel threads. User threads are mapped to kernel threads by the thread library; the OS (and in particular the scheduler) is unaware of them.

Contention Scope
Contention scope refers to the scope in which threads compete for the use of physical CPUs. On systems implementing many-to-one and many-to-many threads, Process Contention Scope, PCS, occurs, because competition occurs between threads that are part of the same process. (This is the management / scheduling of multiple user threads on a single kernel thread, and is managed by the thread library.) System Contention Scope, SCS, involves the system scheduler scheduling kernel threads to run on one or more CPUs. Systems implementing one-to-one threads (XP, Solaris 9, Linux), use only SCS. PCS scheduling is typically done with priority, where the programmer can set and/or change the priority of threads created by his or her programs. Even time slicing is not guaranteed among threads of equal priority.

Pthread Scheduling
The Pthread library provides for specifying scope contention:

PTHREAD_SCOPE_PROCESS schedules threads using PCS, by scheduling user threads onto available LWPs using the many-to-many model. PTHREAD_SCOPE_SYSTEM schedules threads using SCS, by binding user threads to particular LWPs, effectively implementing a one-to-one model.

The pthread_attr_getscope and pthread_attr_setscope functions provide for determining and setting the contention scope respectively:

Fig. 5.9: Pthread Scheduling API
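The API listing of Fig. 5.9 is not reproduced here; the following is a minimal sketch (assuming a POSIX threads implementation that supports both scopes; the runner function is a placeholder) of how the contention scope can be queried and set through a thread attribute object:

#include <pthread.h>
#include <stdio.h>

void *runner(void *arg)                 /* placeholder thread body */
{
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;
    int scope;

    pthread_attr_init(&attr);
    if (pthread_attr_getscope(&attr, &scope) == 0)
        printf("default scope = %s\n",
               scope == PTHREAD_SCOPE_PROCESS ? "PROCESS" : "SYSTEM");

    /* Request system contention scope (one-to-one mapping).
       Some implementations accept only one of the two scopes. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    pthread_create(&tid, &attr, runner, NULL);
    pthread_join(tid, NULL);
    return 0;
}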

Operating System Examples

Example: Solaris Scheduling


Priority-based kernel thread scheduling: four classes (real-time, system, interactive, and time-sharing), and multiple queues / algorithms within each class. The default class is time-sharing.

Process priorities and time slices are adjusted dynamically in a multilevel-feedback priority queue system.
Time slices are inversely proportional to priority: higher-priority jobs get smaller time slices.
Interactive jobs have higher priority than CPU-bound ones.
See the table below for some of the 60 priority levels and how they shift. The "time quantum expired" and "return from sleep" columns indicate the new priority when those events occur.

Fig. 5.10: Solaris scheduling

Fig. 5.11: Solaris dispatch table for interactive and time-sharing threads

Solaris 9 introduced two new scheduling classes: Fixed priority and fair share.

Fixed priority is similar to time sharing, but not adjusted dynamically. Fair share uses shares of CPU time rather than priorities to schedule jobs. A certain share of the available CPU time is allocated to a project, which is a set of processes.

System class is reserved for kernel use. (User programs running in kernel mode are NOT considered in the system scheduling class.)

Fig. 5.13: Windows XP priorities

Fig. 5.14: List of tasks indexed according to priority

Algorithm Evaluation The first step in determining which algorithm (and what parameter settings within that algorithm) is optimal for a particular operating environment is to determine what criteria are to be used, what goals are to be targeted, and what constraints if any must be applied. For example, one might want to maximize CPU utilization, subject to a maximum response time of 1 second. Once criteria have been established, then different algorithms can be analyzed and a best choice determined. The following sections outline some different methods for determining the best choice.

Deterministic Modeling
If a specific workload is known, then the exact values for major criteria can be fairly easily calculated, and the best determined. For example, consider the following workload (with all processes arriving at time 0), and the resulting schedules determined by three different algorithms:
Process   Burst Time
P1        10
P2        29
P3        3
P4        7
P5        12

The average waiting times for FCFS, SJF, and RR are 28ms, 13ms, and 23ms respectively. Deterministic modeling is fast and easy, but it requires specific known input, and the results only apply for that particular set of input. However by examining multiple similar cases, certain trends can be observed. (Like the fact that for processes arriving at the same time, SJF will always yield the shortest average wait time.)
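As a check on these figures (worked here for illustration): under FCFS, run in the order listed, the waits are 0, 10, 39, 42 and 49 ms, giving (0 + 10 + 39 + 42 + 49) / 5 = 28 ms. Under SJF the order is P3, P4, P1, P5, P2, with waits 0, 3, 10, 20 and 32 ms, giving 65 / 5 = 13 ms. The 23 ms RR figure assumes a time quantum of 10 ms, which is not stated above but is consistent with that result.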

Queuing Models
Specific process data is often not available, particularly for future times. However, a study of historical performance can often produce statistical descriptions of certain important parameters, such as the rate at which new processes arrive, the ratio of CPU bursts to I/O times, the distribution of CPU burst times and I/O burst times, etc. Armed with those probability distributions and some mathematical formulas, it is possible to calculate certain performance characteristics of individual waiting queues. For example, Little's Formula says that for an average queue length of N, an average waiting time in the queue of W, and an average arrival rate of new jobs of Lambda, the three quantities are related by: N = Lambda * W. Queuing models treat the computer as a network of interconnected queues, each of which is described by its probability distribution statistics and formulas such as Little's formula. Unfortunately real systems and modern scheduling algorithms are so complex as to make the mathematics intractable in many cases with real systems.
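For example, if new jobs arrive at an average rate of Lambda = 7 processes per second and each spends an average of W = 2 seconds in the queue, then the queue holds on average N = Lambda * W = 7 * 2 = 14 processes; knowing any two of the three quantities gives the third (e.g. W = N / Lambda).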

Simulations
Another approach is to run computer simulations of the different proposed algorithms (and adjustment parameters) under different load conditions, and to analyze the results to determine the best choice of operation for a particular load pattern. Operating conditions for simulations are often randomly generated using distribution functions similar to those described above. A better alternative when possible is to generate trace tapes, by monitoring and logging the performance of a real system under typical expected work loads. These are better because they provide a more accurate picture of system loads, and also because they allow multiple simulations to be run with the identical process load, and not just statistically equivalent loads. A compromise is to randomly determine system loads and then save the results into a file, so that all simulations can be run against identical randomly determined system loads. Although trace tapes provide more accurate input information, they can be difficult and expensive to collect and store, and their use increases the complexity of the simulations significantly. There is also some question as to whether the future performance of the new system will really match the past performance of the old system. (If the system runs faster, users may take fewer coffee breaks, and submit more processes per hour than under the old system. Conversely if the turnaround time for jobs is longer, intelligent users may think more carefully about the jobs they submit rather than randomly submitting jobs and hoping that one of them works out.)

Fig. 5.15: Evaluation of CPU schedulers by simulation

Implementation
The only real way to determine how a proposed scheduling algorithm is going to operate is to implement it on a real system. For experimental algorithms and those under development, this can cause difficulties and resistance among users who don't care about developing OSs and are only trying to get their daily work done. Even in this case, the measured results may not be definitive, for at least two major reasons: (1) System work loads are not static, but change over time as new programs are installed, new users are added to the system, new hardware becomes available, new work projects get started, and society itself changes. (For example, the explosion of the Internet has drastically changed the amount of network traffic that a system sees and the importance of handling it with rapid response times.) (2) As mentioned above, changing the scheduling system may have an impact on the work load and the ways in which users use the system. Most modern systems provide some capability for the system administrator to adjust scheduling parameters, either on the fly or as the result of a reboot or a kernel rebuild.

Summary

This unit covered the alternating sequence of CPU and I/O bursts and the CPU scheduler, for which there are several alternatives to choose from, as well as numerous adjustable parameters for each scheduling algorithm. It discussed various common scheduling strategies, such as FCFS scheduling, shortest-job-first scheduling,

priority scheduling, RR scheduling, multilevel queue scheduling and multiple-processor scheduling. Finally, we also discussed load balancing, thread scheduling and various algorithm evaluation models.

Terminal Questions
1. What do you understand by the scheduling process? What are the conditions that guide CPU scheduling decisions?
2. What is the significance of the dispatcher module in the scheduling process? Explain dispatch latency.
3. What are the various scheduling algorithms? Discuss the advantages of one over the other.
4. When is it advisable to follow the priority scheduling approach? What is the suggested solution to the starvation problem in this approach?
5. What is load balancing? How is load balancing achieved in multiprocessor systems?

Unit 6 : Deadlocks: This unit covers deadlock principles, deadlock detection and recovery, deadlock avoidance, deadlock prevention, and pipes.

Introduction Recall that one definition of an operating system is a resource allocator. There are many resources that can be allocated to only one process at a time, and we have seen several operating system features that allow this, such as mutexes, semaphores or file locks. Sometimes a process has to reserve more than one resource. For example, a process which copies files from one tape to another generally requires two tape drives. A process which deals with databases may need to lock multiple records in a database. A deadlock is a situation in which two computer programs sharing the same resource are effectively preventing each other from accessing the resource, resulting in both programs ceasing to function. The earliest computer operating systems ran only one program at a time. All of the resources of the system were available to this one program. Later, operating systems ran multiple programs at once, interleaving them. Programs were required to specify in

advance what resources they needed so that they could avoid conflicts with other programs running at the same time. Eventually some operating systems offered dynamic allocation of resources. Programs could request further allocations of resources after they had begun running. This led to the problem of the deadlock. Here is the simplest example:
Program 1 requests resource A and receives it. Program 2 requests resource B and receives it. Program 1 requests resource B and is queued up, pending the release of B. Program 2 requests resource A and is queued up, pending the release of A.

Now neither program can proceed until the other program releases a resource. The operating system cannot know what action to take. At this point the only alternative is to abort (stop) one of the programs. Learning to deal with deadlocks had a major impact on the development of operating systems and the structure of databases. Data was structured and the order of requests was constrained in order to avoid creating deadlocks. In general, resources allocated to a process are not preemptable; this means that once a resource has been allocated to a process, there is no simple mechanism by which the system can take the resource back from the process unless the process voluntarily gives it up or the system administrator kills the process. This can lead to a situation called deadlock. A set of processes or threads is deadlocked when each process or thread is waiting for a resource to be freed which is controlled by another process. Here is an example of a situation where deadlock can occur.
Mutex M1, M2;

/* Thread 1 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M1);
    Mutex_lock(&M2);
    CriticalSection();
    Mutex_unlock(&M2);
    Mutex_unlock(&M1);
}

/* Thread 2 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M2);
    Mutex_lock(&M1);    /* locks taken in the opposite order from thread 1 */
    CriticalSection();
    Mutex_unlock(&M1);
    Mutex_unlock(&M2);
}

Suppose thread 1 is running and locks M1, but before it can lock M2, it is interrupted. Thread 2 starts running; it locks M2, when it tries to obtain and lock M1, it is blocked because M1 is already locked (by thread 1). Eventually thread 1 starts running again, and it tries to obtain and lock M2, but it is blocked because M2 is already locked by thread 2. Both threads are blocked; each is waiting for an event which will never occur. Traffic gridlock is an everyday example of a deadlock situation.

In order for deadlock to occur, four conditions must be true.

Mutual exclusion: Each resource is either currently allocated to exactly one process or it is available. (Two processes cannot simultaneously control the same resource or be in their critical section.)
Hold and wait: Processes currently holding resources can request new resources.
No preemption: Once a process holds a resource, it cannot be taken away by another process or the kernel.
Circular wait: Each process is waiting to obtain a resource which is held by another process.

The dining philosophers problem discussed in an earlier section is a classic example of deadlock. Each philosopher picks up his or her left fork and waits for the right fork to become available, but it never does.

Deadlock can be modeled with a directed graph. In a deadlock graph, vertices represent either processes (circles) or resources (squares). A process which has acquired a resource is shown with an arrow (edge) from the resource to the process. A process which has requested a resource which has not yet been assigned to it is modeled with an arrow from the process to the resource. If these edges create a cycle, there is deadlock. The deadlock situation in the above code can be modeled like this.

This graph shows an extremely simple deadlock situation, but it is also possible for a more complex situation to create deadlock. Here is an example of deadlock with four processes and four resources.

There are a number of ways that deadlock can occur in an operating situation. We have seen some examples, here are two more.

Two processes need to lock two files; the first process locks one file, the second process locks the other, and each waits for the other to free up the locked file. Two processes want to write a file to a print spool area at the same time and both start writing. However, the print spool area is of fixed size, and it fills up before either process finishes writing its file, so both wait for more space to become available.

Objective: At the end of this unit, you will be able to understand the:

Solutions to deadlock
Deadlock detection and recovery
Deadlock avoidance
Deadlock prevention
Pipes

Solutions to deadlock There are several ways to address the problem of deadlock in an operating system.

Just ignore it and hope it doesn't happen.
Detection and recovery: if it happens, take action.
Dynamic avoidance by careful resource allocation: check to see if a resource can be granted, and if granting it will cause deadlock, don't grant it.
Prevention: change the rules.

Ignore deadlock

The text refers to this as the Ostrich Algorithm: just hope that deadlock doesn't happen. In general, this is a reasonable strategy. Deadlock is unlikely to occur very often; a system can run for years without deadlock occurring. If the operating system has a deadlock prevention or detection system in place, this will have a negative impact on performance (slow the system down), because whenever a process or thread requests a resource, the system will have to check whether granting this request could cause a potential deadlock situation. If deadlock does occur, it may be necessary to bring the system down, or at least manually kill a number of processes, but even that is not an extreme solution in most situations.

Deadlock detection and recovery

As we saw above, if there is only one instance of each resource, it is possible to detect deadlock by constructing a resource allocation/request graph and checking for cycles. Graph theorists have developed a number of algorithms to detect cycles in a graph. The book discusses one of these. It uses only one data structure, L, a list of nodes.

A cycle detection algorithm: for each node N in the graph,
1. Initialize L to the empty list and designate all edges as unmarked.
2. Add the current node to L and check to see if it appears twice. If it does, there is a cycle in the graph.
3. From the given node, check to see if there are any unmarked outgoing edges. If yes, go to the next step; if no, skip the next step.

4. Pick an unmarked edge, mark it, then follow it to the new current node and go to step 3.
5. We have reached a dead end. Go back to the previous node and make that the current node. If the current node is the starting node and there are no unmarked edges, there are no cycles in the graph. Otherwise, go to step 3.

Let's work through an example with five processes and five resources. Here is the resource request/allocation graph.

The algorithm needs to search each node; lets start at node P1. We add P1 to L and follow the only edge to R1, marking that edge. R1 is now the current node so we add that to L, checking to confirm that it is not already in L. We then follow the unmarked edge to P2, marking the edge, and making P2 the current node. We add P2 to L, checking to make sure that it is not already in L, and follow the edge to R2. This makes R2 the current node, so we add it to L, checking to make sure that it is not already there. We are now at a dead end so we back up, making P2 the current node again. There are no more unmarked edges from P2 so we back up yet again, making R1 the current node. There are no more unmarked edges from R1 so we back up yet again, making P1 the current node. Since there are no more unmarked edges from P1 and since this was our starting point, we are through with this node (and all of the nodes visited so far). We move to the next unvisited node P3, and initialize L to empty. We first follow the unmarked edge to R1, putting R1 on L. Continuing, we make P2 the current node and then R2. Since we are at a dead end, we repeatedly back up until P3 becomes the current node again. L now contains P3, R1, P2, and R2. P3 is the current node, and it has another unmarked edge to R3. We make R3 the current node, add it to L, follow its edge to P4. We repeat this process, visiting R4, then P5, then R5, then P3. When we visit P3 again we note that it is already on L, so we have detected a cycle, meaning that there is a deadlock situation.
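As an illustrative sketch (not from the text), the same idea can be coded as a depth-first search over an adjacency matrix. It differs slightly from the marking algorithm above in that it also remembers nodes already proven cycle-free, but it detects exactly the same cycles. The MAXN limit and array names are arbitrary:

#include <stdbool.h>

#define MAXN 16                       /* illustrative upper bound on nodes */

/* edge[u][v] is true if the graph has an edge u -> v
   (process -> requested resource, or resource -> holding process). */
bool edge[MAXN][MAXN];

static bool dfs(int u, int n, bool on_path[], bool done[])
{
    on_path[u] = true;                /* u is on the current search path */
    for (int v = 0; v < n; v++) {
        if (!edge[u][v]) continue;
        if (on_path[v]) return true;  /* back edge: a cycle, hence deadlock */
        if (!done[v] && dfs(v, n, on_path, done)) return true;
    }
    on_path[u] = false;
    done[u] = true;                   /* no cycle reachable from u */
    return false;
}

/* Returns true if the graph with n nodes contains a cycle. */
bool has_deadlock(int n)
{
    bool on_path[MAXN] = {false}, done[MAXN] = {false};
    for (int u = 0; u < n; u++)
        if (!done[u] && dfs(u, n, on_path, done))
            return true;
    return false;
}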

Once deadlock has been detected, it is not clear what the system should do to correct the situation. There are three strategies.

Preemption: we can take an already allocated resource away from a process and give it to another process. This can present problems. Suppose the resource is a printer and a print job is half completed. It is often difficult to restart such a job without completely starting over.
Rollback: in situations where deadlock is a real possibility, the system can periodically make a record of the state of each process and, when deadlock occurs, roll everything back to the last checkpoint and restart, but allocating resources differently so that deadlock does not occur. This means that all work done after the checkpoint is lost and will have to be redone.
Kill one or more processes: this is the simplest and crudest approach, but it works.

Deadlock avoidance The above solution allowed deadlock to happen, then detected that deadlock had occurred and tried to fix the problem after the fact. Another solution is to avoid deadlock by only granting resources if granting them cannot result in a deadlock situation later. However, this works only if the system knows what requests for resources a process will be making in the future, and this is an unrealistic assumption. The text describes the bankers algorithm but then points out that it is essentially impossible to implement because of this assumption. Deadlock Prevention The difference between deadlock avoidance and deadlock prevention is a little subtle. Deadlock avoidance refers to a strategy where whenever a resource is requested, it is only granted if it cannot result in deadlock. Deadlock prevention strategies involve changing the rules so that processes will not make requests that could result in deadlock. Here is a simple example of such a strategy. Suppose every possible resource is numbered (easy enough in theory, but often hard in practice), and processes must make their requests in order; that is, they cannot request a resource with a number lower than any of the resources that they have been granted so far. Deadlock cannot occur in this situation. As an example, consider the dining philosophers problem. Suppose each chopstick is numbered, and philosophers always have to pick up the lower numbered chopstick before the higher numbered chopstick. Philosopher five picks up chopstick 4, philosopher 4 picks up chopstick 3, philosopher 3 picks up chopstick 2, philosopher 2 picks up chopstick 1. Philosopher 1 is hungry, and without this assumption, would pick up chopstick 5, thus causing deadlock. However, if the lower number rule is in effect, he/she has to pick up chopstick 1 first, and it is already in use, so he/she is blocked. Philosopher 5 picks up chopstick 5, eats, and puts both down, allows philosopher 4 to eat. Eventually everyone gets to eat.
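A minimal sketch (not from the text) of this lower-number-first rule for the dining philosophers, using Pthread mutexes as the chopsticks; initialization of the mutexes is assumed to happen elsewhere:

#include <pthread.h>

#define N 5
pthread_mutex_t chopstick[N];   /* chopstick i is "resource number" i;
                                   assume pthread_mutex_init() was called */

/* Each philosopher always locks the lower-numbered chopstick first,
   so a circular wait can never form. */
void philosopher_eat(int i)
{
    int left = i, right = (i + 1) % N;
    int first  = left < right ? left  : right;
    int second = left < right ? right : left;

    pthread_mutex_lock(&chopstick[first]);
    pthread_mutex_lock(&chopstick[second]);
    /* eat */
    pthread_mutex_unlock(&chopstick[second]);
    pthread_mutex_unlock(&chopstick[first]);
}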

An alternative strategy is to require all processes to request all of their resources at once, and either all are granted or none are granted. Like the above strategy, this is conceptually easy but often hard to implement in practice because it assumes that a process knows what resources it will need in advance. Livelock There is a variant of deadlock called livelock. This is a situation in which two or more processes continuously change their state in response to changes in the other process(es) without doing any useful work. This is similar to deadlock in that no progress is made but differs in that neither process is blocked or waiting for anything. A human example of livelock would be two people who meet face-to-face in a corridor and each moves aside to let the other pass, but they end up swaying from side to side without making any progress because they always move the same way at the same time. Addressing deadlock in real systems Deadlock is a terrific theoretical problem for graduate students, but none of the solutions discussed above can be implemented in a real world, general purpose operating system. It would be difficult to require a user program to make requests for resources in a certain way or in a certain order. As a result, most operating systems use the ostrich algorithm. Some specialized systems have deadlock avoidance/prevention mechanisms. For example, many database operations involve locking several records, and this can result in deadlock, so database software often has a deadlock prevention algorithm. The Unix file locking system lockf has a deadlock detection mechanism built into it. Whenever a process attempts to lock a file or a record of a file, the operating system checks to see if that process has locked other files or records, and if it has, it uses a graph algorithm similar to the one discussed above to see if granting that request will cause deadlock, and if it does, the request for the lock will fail, and the lockf system call will return and errno will be set to EDEADLK. Killing Zombies Recall that if a child dies before its parent calls wait, the child becomes a zombie. In some applications, a web server for example, the parent forks off lots of children but doesnt care whether the child is dead or alive. For example, a web server might fork a new process to handle each connection, and each child dies when the client breaks the connection. Such an application is at risk of producing many zombies, and zombies can clog up the process table. When a child dies, it sends a SIGCHLD signal to its parent. The parent process can prevent zombies from being created by creating a signal handler routine for SIGCHLD which calls wait whenever it receives a SIGCHLD signal. There is no danger that this

will cause the parent to block because it would only call wait when it knows that a child has just died. There are several versions of wait on a Unix system. The system call waitpid has this prototype
#include <sys/types.h> #include <sys/wait.h> pid_t waitpid(pid_t pid, int *stat_loc, int options)

This will function like wait in that it waits for a child to terminate, but this function allows the process to wait for a particular child by setting its first argument to the pid that we want to wait for. However, that is not our interest here. If the first argument is set to -1, it will wait for any child to terminate, just like wait (a value of 0 waits for any child in the caller's process group). However, the third argument can be set to WNOHANG. This will cause the function to return immediately if there are no dead children. It is customary to use this function rather than wait in the signal handler. Here is some sample code
#include <sys/types.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

/* SIGCHLD handler: reap any dead children without blocking. */
void zombiekiller(int n)
{
    int status;
    while (waitpid(-1, &status, WNOHANG) > 0)
        ;                            /* reap every child that has exited */
    signal(SIGCHLD, zombiekiller);   /* re-install the handler */
}

int main()
{
    signal(SIGCHLD, zombiekiller);
    ....
}

Pipes A second form of redirection is a pipe. A pipe is a connection between two processes in which one process writes data to the pipe and the other reads from the pipe. Thus, it allows one process to pass data to another process. The Unix system call to create a pipe is int pipe(int fd[2])

This function takes an array of two ints (file descriptors) as an argument. It creates a pipe with fd[0] at one end and fd[1] at the other. Reading from the pipe and writing to the pipe are done with the read and write calls that you have seen and used before. Although both ends are opened for both reading and writing, by convention a process writes to fd[1] and reads from fd[0]. Pipes only make sense if the process calls fork after creating the pipe. Each process should close the end of the pipe that it is not using. Here is a simple example in which a child sends a message to its parent through a pipe.
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    pid_t pid;
    int retval;
    int fd[2];
    int n;

    retval = pipe(fd);
    if (retval < 0) {
        printf("Pipe failed\n");     /* pipe is unlikely to fail */
        exit(0);
    }
    pid = fork();
    if (pid == 0) {                  /* child */
        close(fd[0]);
        n = write(fd[1], "Hello from the child", 20);
        exit(0);
    } else if (pid > 0) {            /* parent */
        char buffer[64];
        close(fd[1]);
        n = read(fd[0], buffer, 64);
        buffer[n] = '\0';
        printf("I got your message: %s\n", buffer);
    }
    return 0;
}

There is no need for the parent to wait for the child to finish because reading from a pipe will block until there is something in the pipe to read. If the parent runs first, it will try to execute the read statement, and will immediately block because there is nothing in the pipe. After the child writes a message to the pipe, the parent will wake up. Pipes have a fixed size (often 4096 bytes) and if a process tries to write to a pipe which is full, the write will block until a process reads some data from the pipe. Here is a program which combines dup2 and pipe to redirect the output of the ls process to the input of the more process as would be the case if the user typed ls | more at the Unix command line.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void error(char *msg) { perror(msg); exit(1); }

int main() { int p[2], retval; retval = pipe(p); if (retval < 0) error("pipe"); retval=fork(); if (retval < 0) error("forking"); if (retval==0) { /* child */ dup2(p[1],1); /* redirect stdout to pipe */ close(p[0]); /* don't permit this process to read from pipe */ execl("/bin/ls","ls","-l",NULL); error("Exec of ls"); } /* if we get here, we are the parent */ dup2(p[0],0); /* redirect stdin to pipe */ close(p[1]); /* don't permit this process to write to pipe */ execl("/bin/more","more",NULL); error("Exec of more"); return 0;

}
Summary

A deadlock is a situation which, whenever it occurs, prevents the normal flow of execution of an application, and thus needs to be understood well. To cater to this need, the unit began with a detailed discussion of the fundamental concepts of deadlock, followed by the various situations that force deadlock to occur. Finally, the unit provided detailed coverage of methods for avoiding deadlocks and, in case of their occurrence, mechanisms to detect them so that corrective measures can be taken.

Terminal Questions
1. What do you mean by a deadlock? Explain.
2. Discuss the various conditions that must be true for deadlock to occur.
3. Discuss various approaches to overcome the problem of deadlock.
4. What do you mean by a zombie? Discuss in brief.
5. Explain the concept of pipes.

Unit 7 : Concurrency Control : This unit deals with concurrency, race conditions, critical sections, mutual exclusion and semaphores.

Introduction

Concurrency is a property of systems which execute processes overlapped in time on single or multiple processors, and which may permit the sharing of common resources between those overlapped processes. Concurrent use of shared resources is the source of many difficulties, such as race conditions. Concurrency control is a method used to ensure that processes are executed in a safe manner without affecting each other and that correct results are generated, while getting those results as quickly as possible. Mutual exclusion is a way of making sure that if one process is using shared modifiable data, the other processes will be excluded from doing the same thing. Most mutual exclusion techniques have a basic problem of busy waiting: if a process is unable to enter its critical section, it repeatedly executes the loop testing the shared global variable, wasting CPU time as well as resources. Semaphores avoid this wastage of time and resources by blocking the process if it cannot enter its critical section; the blocked process will be woken up by the currently running process after it comes out of its critical section. The following sections cover various aspects and issues related to concurrent processes.

Objectives: At the end of this unit you will be able to understand the:

Brief introduction to concurrency control
Conditions for deadlocks
Semaphores

What is concurrency?

"Concurrency occurs when two or more execution flows are able to run simultaneously." – Edsger Dijkstra.

Concurrency is a property of systems which execute processes overlapped in time on single or multiple processors, and which may permit the sharing of common resources between those overlapped processes. Concurrent use of shared resources is the source of many difficulties, such as race conditions (as explained below). The introduction of mutual exclusion can prevent race conditions, but can lead to problems such as deadlock and starvation. In a single-processor multiprogramming system, processes must be interleaved in time to yield the appearance of simultaneous execution. In a multiple-processor system, it is possible not only to interleave the execution of multiple processes but also to overlap them. Interleaving and overlapping techniques can be viewed as examples of concurrent processing. Concurrency control is a method used to ensure that processes are executed in a safe manner (i.e., without affecting each other) and correct results are generated, while getting those results as quickly as possible.

Race Conditions

A race condition occurs when multiple processes or threads read and write data items so that the final result depends on the order of execution of instructions in the multiple processes. Suppose that two processes, P1 and P2, share the global variable A. At some point in its execution, P1 updates variable A to the value 1, and at some point in its execution, P2 updates variable A to the value 2. Thus, the two processes are in a race to write variable A. In this example the loser of the race (the process that updates last) determines the final value of A.

Critical Section

A critical section is a part of a program that accesses a shared resource (data structure or device) that must not be concurrently accessed by more than one process at a time. The key to preventing trouble involving shared storage is to find some way to prohibit more than one process from reading and writing the shared data simultaneously. To avoid race conditions and flawed results, one must identify the critical sections in the code of each process.

Mutual Exclusion

Mutual exclusion is a way of making sure that if one process is using shared modifiable data, the other processes will be excluded from doing the same thing. That is, while one process accesses the shared variable, all other processes desiring to do so at the same moment should be kept waiting; when that process has finished using the shared variable, one of the processes waiting to do so should be allowed to proceed. In this fashion, each process using the shared data (variables) excludes all others from doing so simultaneously. This is called mutual exclusion. Mutual exclusion needs to be enforced only when processes access shared modifiable data; when processes are performing operations that do not conflict with one another, they should be allowed to proceed concurrently.

Requirements for mutual exclusion

Following are the six requirements for mutual exclusion.
1. Mutual exclusion must be enforced: Only one process at a time is allowed into its critical section, among all processes that have critical sections for the same resource or shared object.
2. A process that halts in its non-critical section must do so without interfering with other processes.
3. It must not be possible for a process requiring access to a critical section to be delayed indefinitely.
4. When no process is in a critical section, any process that requests entry to its critical section must be permitted to enter without delay.
5. No assumptions are made about relative process speeds or the number of processors.
6. A process remains inside its critical section for a finite time only.

Following are some of the methods for achieving mutual exclusion.

Mutual exclusion by disabling interrupts: In an interrupt-driven system, context switches from one process to another can only occur on interrupts (timer, I/O device, etc.). If a process disables all interrupts then it cannot be switched out. On entry to the critical section the process can disable all interrupts, and on exit it can enable them again, as shown below.

while (true) {
    /* disable interrupts */;
    /* critical section */;
    /* enable interrupts */;
    /* remainder */;
}

Figure 7.1: Mutual exclusion by disabling interrupts

Because the critical section cannot be interrupted, mutual exclusion is guaranteed. But as the processor cannot interleave processes, system performance is degraded. Also, this solution does not work for a multiprocessor system, where more than one process runs concurrently.

Mutual exclusion by using a lock variable: In this method, we consider a single, shared (lock) variable, initially 0. When a process wants to enter its critical section, it first tests the lock value. If the lock is 0, the process first sets it to 1 and then enters the critical section. If the lock is already 1, the process just waits until the (lock) variable becomes 0. Thus, a 0 means that no process is in its critical section and a 1 means some process is in its critical section.

process (i) {
    while (lock != 0)
        /* no operation */;
    lock = 1;
    /* critical section */;
    lock = 0;
    /* remainder */;
}

Figure 7.2: Mutual exclusion using lock variable

The flaw in this proposal can best be explained by example. Suppose process A sees that the lock is 0. Before it can set the lock to 1, another process B is scheduled, runs, and sets the lock to 1. When process A runs again, it will also set the lock to 1, and two

processes will be in their critical section simultaneously. Thus this method does not guarantee mutual exclusion. Mutual exclusion by Strict Alternation: In this method, the integer variable turn keeps track of whose turn is to enter the critical section. Initially, process 0 inspect turn, finds it to be 0, and enters in its critical section. Process 1 also finds it to be 0 and sits in a loop continually testing turn to see when it becomes 1. Process 0, after coming out of critical section, sets turn to 1, to allow process 1 to enter in its critical section, as shown bellow. /* Process 0 */ while (true) { while(turn != 0) /* no operation */; /* critical section */; turn = 1; /* remainder */; }

/* Process 1 */

while (true) { while(turn != 1) /* no operation */; /* critical section */; turn = 0;

/* remainder */; } Figure 7.3: Mutual exclusion by strict alternation Taking turns is not a good idea when one of the processes is much slower than the other. Suppose process 0 finishes its critical section quickly, and again wants to enter in its critical section, but it can not do so, as the turn value is set to 1. It has to wait for process 1 to finish its critical section part. Here both processes are in their non-critical section. This situation violates above mutual exclusion requirement condition no. 4. Mutual exclusion by Petersons Method: The algorithm uses two variables, flag, a boolean array and turn, an integer. A true flag value indicates that the process wants to enter the critical section. The variable turn holds the id of the process whose turn it is. Entrance to the critical section is granted for process P0 if P1 does not want to enter its critical section or if P1 has given priority to P0 by setting turn to 0. flag[0]=false; flag[1]=false; turn = 0; /* Process 0 */ while (true) { flag[0] = true; turn = 1; while(flag[1] && turn == 1) /* no operation */; /* critical section */; flag[0] = false; /* remainder */;

} /* Process 1 */ while (true) { flag[1] = true; turn = 0; while(flag[0] && turn == 0) /* no operation */; /* critical section */; flag[1] = false; /* remainder */; }

Figure 7.4: Peterson's algorithm

Mutual exclusion by using Special Machine Instructions: In a multiprocessor environment, the processors share access to a common main memory, and at the hardware level only one access to a memory location is permitted at a time. With this as a foundation, some processors provide machine instructions that carry out two actions, such as reading and writing, on a single memory location. Since processes interleave at the instruction level, such special instructions are atomic and are not subject to interference from other processes. Two such instructions are discussed in the following parts.

Test and Set Instruction: The test and set instruction can be defined as follows (here i stands for the shared memory location being tested):

boolean testset (int i) {
    if (i == 0) {
        i = 1;
        return true;
    }
    else {
        return false;
    }
}

Figure 7.5: Test and Set Instruction

The variable i is used like a traffic light. If it is 0, meaning green, then the instruction sets it to 1, i.e. red, and returns true; thus the current process is permitted to pass but the others are told to stop. On the other hand, if the light is already red, then the running process will receive false and realize it is not supposed to proceed.

Exchange Instruction: The exchange instruction can be defined as follows:

void exchange (int register, int memory) {
    int temp;
    temp = memory;
    memory = register;
    register = temp;
}

Figure 7.6: Exchange Instruction

The instruction exchanges the contents of a register with that of a memory location. A shared variable bolt is initialized to 0. Each process uses a local variable key that is initialized to 1, and executes the instruction as exchange(key, bolt). The only process that may enter its critical section is one that finds bolt equal to 0. It excludes all other processes from the critical section by setting bolt to 1. When a process leaves its critical section, it resets bolt to 0, allowing another process to gain access to its critical section.
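As a hedged usage sketch (not from the text) for the test and set instruction above: a process could guard its critical section as follows. It assumes a variant of testset that takes a pointer to the shared flag, since the shared location must actually be modified.

int lock = 0;                          /* shared flag: 0 = green, 1 = red */

extern int testset(int *i);            /* assumed pointer-taking atomic variant */

void critical_work(void)
{
    while (!testset(&lock))
        ;                              /* busy wait until we flip 0 -> 1 */
    /* critical section */
    lock = 0;                          /* release: back to green */
}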

Semaphores All the above methods of mutual exclusion have a basic problem of busy waiting. If a process is unable to enter in to its critical section; it tightly executes the loop of testing the shared global variable, wasting CPU time, as well as resources. Semaphores avoid this wastage of time and resources by blocking the process if it can not enter into its critical section. This process will be wake up by the currently running process after coming out of critical section. What are Semaphores? A semaphore is a mechanism that prevents two or more processes from accessing a shared resource simultaneously. On the railroads a semaphore prevents two trains from crashing on a shared section of track. On railroads and computers, semaphores are advisory: if a train engineer doesnt observe and obey it, the semaphore wont prevent a crash, and if a process doesnt check a semaphore before accessing a shared resource, chaos might result. Semaphores can be thought of as flags (hence their name, semaphores). They are either on or off. A process can turn on the flag or turn it off. If the flag is already on, processes that try to turn on the flag will sleep until the flag is off. Upon awakening, the process will reattempt to turn the flag on, possibly succeeding or possibly sleeping again. Such behavior allows semaphores to be used in implementing a post-wait driver a system where processes can wait for events (i.e., wait on turning on a semaphore) and post events (i.e. turning off of a semaphore). Dijkstra in 1965 proposed semaphores as a solution to the problems of concurrent processes. The fundamental principle is: That two or more processes can cooperate by means of simple signals, such that a process can be forced to stop at a specified place until it has received a specific signal. For signaling, special variables called semaphores are used. Primitive signal (s) is used to transmit a signal Primitive wait (s) is used to receive a signal Semaphore Implementation: To achieve desired effect, view semaphores as variables that have an integer value upon which three operations are defined:

1. A semaphore may be initialized to a non-negative value.
2. The wait operation decrements the semaphore value. If the value becomes negative, the process executing the wait is blocked.
3. The signal operation increments the semaphore value. If the value is not positive, then a process blocked by a wait operation is unblocked.

There is no other way to manipulate semaphores.

wait (S) {
    while (S <= 0)
        ;            /* no operation */
    S--;
}

signal (S) {
    S++;
}

Figure 7.7: Semaphore operations

Mutual Exclusion using Semaphore: The following example illustrates mutual exclusion using a semaphore. A process performs the wait(mutex) operation before entering its critical section and the signal(mutex) operation after coming out of the critical section, thus achieving mutual exclusion.

Shared data: semaphore mutex;    /* initially mutex = 1 */

Process Pi:
do {
    wait(mutex);
    /* critical section */
    signal(mutex);
    /* remainder section */
} while (1);

Figure 7.8: Mutual exclusion using semaphore

The following code gives the detailed implementation of the wait and signal procedures for the above example. The structure definition holds the semaphore value and a list of waiting processes. The wait operation decrements the semaphore value, and if the result is less than 0 it adds the process to the waiting list and blocks it.

Declaration:
typedef struct {
    int value;
    struct process *L;
} semaphore;

wait(S):
{
    S.value--;
    if (S.value < 0) {
        add this process to S.L;
        block;
    }
}

signal(S):
{
    S.value++;
    if (S.value <= 0) {
        remove a process P from S.L;
        wakeup(P);
    }
}

Figure 7.9: wait() and signal() for mutual exclusion

The process which is currently in its critical section, after coming out, increments the semaphore value and checks whether it is less than or equal to 0. If so, it removes a process from the waiting list and wakes it up.

Summary

Concurrency is a property of systems which execute processes overlapped in time on single or multiple processors, and which may permit the sharing of common resources between those overlapped processes. Concurrency control is a method used to ensure that processes are executed in a safe manner (i.e., without affecting each other) and correct results are generated, while getting those results as quickly as possible. A race condition occurs when multiple processes or threads read and write data items so that the final result depends on the order of execution of instructions in the multiple processes. Mutual exclusion is a way of making sure that if one process is using shared modifiable data, the other processes will be excluded from doing the same thing. Mutual exclusion can be achieved in various ways, such as using a lock variable, strict alternation, disabling interrupts, Peterson's method, special machine instructions, and semaphores.

Terminal Questions
1. What is concurrency?
2. Discuss the problems caused by concurrent execution of processes.
3. What is a race condition?
4. Describe a critical section.
5. What is mutual exclusion? What are its requirements?
6. Explain any one method for achieving mutual exclusion.
7. Explain Peterson's solution for mutual exclusion.

8. What are special machine instructions? How do they support mutual exclusion?
9. What are semaphores? How can we achieve mutual exclusion using semaphores?

Unit 8 : File Systems and Space Management : This unit covers file management, including file structure, implementing file systems and space management, block size and extents, free space, reliability, bad blocks and back-up dumps. Consistency checking, transactions and performance are discussed in brief.

Introduction

Most operating systems provide a file system, as a file system is an integral part of any modern operating system. Early microcomputer operating systems only real task was file management a fact reflected in their names. Some early operating systems had a separate component for handling file systems which was called a disk operating system. On some microcomputers, the disk operating system was loaded separately from the rest of the operating system. On early operating systems, there was usually support for only one, native, unnamed file system; for example, CP/M supports only its own file system, which might be called CP/M file system if needed, but which didnt bear any official name at all. Because of this, there needs to be an interface provided by the operating system software between the user and the file system. This interface can be textual (such as provided by a command line interface, such as the UNIX shell, or OpenVMS DCL) or graphical (such as provided by a graphical user interface, such as file browsers). If graphical, the metaphor of the folder, containing documents, other files, and nested folders is often used. This unit covers various issues related to Files.

Objectives: At the end of this unit you will be able to understand the:


Brief introduction to file systems and structures and their implementation
Storage and space management, with consistency checking, performance evaluation and transaction-related issues
Fundamental understanding of access methods

File Systems

Just as the process abstraction beautifies the hardware by making a single CPU (or a small number of CPUs) appear to be many CPUs, one per user, the file system beautifies the hardware disk, making it appear to be a large number of disk-like objects called files. Like a disk, a file is capable of storing a large amount of data cheaply, reliably, and persistently. The fact that there are lots of files is one form of beautification: Each file is individually protected, so each user can have his own files, without the expense of requiring each user to buy his own disk. Each user can have lots of files, which makes it easier to organize persistent data. The file system also makes each individual file more beautiful than a real disk. At the very least, it erases block boundaries, so a file can be any length (not just a multiple of the block size) and programs can read and write arbitrary regions of the file without worrying about whether they cross block boundaries. Some systems (not Unix) also provide assistance in organizing the contents of a file. Systems use the same sort of device (a disk drive) to support both virtual memory and files. The question arises why these have to be distinct facilities, with vastly different user interfaces. The answer is that they dont. In Multics, there was no difference whatsoever. Everything in Multics was a segment. The address space of each running process consisted of a set of segments (each with its own segment number), and the file system was simply a set of named segments. To access a segment from the file system, a process would pass its name to a system call that assigned a segment number to it. From then on, the process could read and write the segment simply by executing ordinary loads and stores. For example, if the segment was an array of integers, the program could access the ith number with a notation like a[i] rather than having to seek to the appropriate offset and then execute a read system call. If the block of the file containing this value wasnt in memory, the array access would cause a page fault, which was serviced. This user-interface idea, sometimes called single-level store, is a great idea. So why is it not common in current operating systems? In other words, why are virtual memory and files presented as very different kinds of objects? There are possible explanations one might propose: The address space of a process is small compared to the size of a file system. There is no reason why this has to be so. In Multics, a process could have up to 256K segments, but each segment was limited to 64K words. Multics allowed for lots of segments because every file in the file system was a segment. The upper bound of 64K words per segment was considered large by the standards of the time; The hardware actually allowed segments of up to 256K words (over one megabyte). Most new processors introduced in the last few years allow 64-bit virtual addresses. In a few years, such processors will dominate. So there is no reason why the virtual address space of a process cannot be large enough to include the entire file system.

The virtual memory of a process is transient (it goes away when the process terminates) while files must be persistent. Multics showed that this doesn't have to be true. A segment can be designated as permanent, meaning that it should be preserved after the process that created it terminates. Permanent segments do raise a need for one file-system-like facility: the ability to give names to segments so that new processes can find them. Files are shared by multiple processes, while the virtual address space of a process is associated with only that process. Most modern operating systems (including most variants of Unix) provide some way for processes to share portions of their address spaces anyhow, so this is a particularly weak argument for a distinction between files and segments. The real reason single-level store is not ubiquitous is probably a concern for efficiency. The usual file-system interface encourages a particular style of access: open a file, go through it sequentially, copying big chunks of it to or from main memory, and then close it. While it is possible to access a file like an array of bytes, jumping around and accessing the data in tiny pieces, it is awkward. Operating system designers have found ways to implement files that make this common style of access very efficient. While there appears to be no reason in principle why memory-mapped files cannot be made to give similar performance when they are accessed in this way, in practice the added functionality of mapped files always seems to carry a price in performance. Besides, if it is easy to jump around in a file, application programmers will take advantage of it, overall performance will suffer, and the file system will be blamed.
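As an aside, a flavour of single-level store survives in the memory-mapped files offered by modern Unix systems. Below is a minimal sketch in C; the file name data.bin and the assumption that the file holds binary integers are made up for the illustration, and real code would check sizes more carefully.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* hypothetical file of ints */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* Map the whole file; it now looks like an ordinary array. */
        int *a = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }

        /* a[i] instead of lseek() + read(); a page fault fetches the block. */
        size_t n = st.st_size / sizeof(int);
        if (n > 5)
            printf("a[5] = %d\n", a[5]);

        munmap(a, st.st_size);
        close(fd);
        return 0;
    }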

Naming
Every file system provides some way to give a name to each file. We will consider only names for individual files here, and talk about directories later. The name of a file is (at least sometimes) meant to be used by human beings, so it should be easy for humans to use. Different operating systems put different restrictions on names:
Size
Some systems put severe restrictions on the length of names. For example, DOS restricts names to 11 characters, while early versions of Unix (and some still in use today) restrict names to 14 characters. The Macintosh operating system, Windows 95, and most modern versions of Unix allow names to be essentially arbitrarily long. I say essentially since names are meant to be used by humans, so they don't really need to be all that long. A name that is 100 characters long is just as difficult to use as one that is forced to be under 11 characters long (but for different reasons). Most modern versions of Unix, for example, restrict names to a limit of 255 characters.
Case

Are upper and lower case letters considered different? The Unix tradition is to consider the names FILE1 and file1 to be completely different and unrelated names. In DOS and its descendants, however, they are considered the same. Some systems translate names to one case (usually upper case) for storage. Others retain the original case, but consider it simply a matter of decoration. For example, if you create a file named File1, you could open it as FILE1 or file1, but if you list the directory, you would still see the file listed as File1.
Character Set
Different systems put different restrictions on what characters can appear in file names. The Unix directory structure supports names containing any character other than NUL (the byte consisting of all zero bits), but many utility programs (such as the shell) would have trouble with names that contain spaces, control characters or certain punctuation characters (particularly /). MacOS allows all of these (e.g., it is not uncommon to see a file name with the copyright symbol in it). With the world-wide spread of computer technology, it is becoming increasingly important to support languages other than English, and in fact alphabets other than Latin. There is a move to support character strings (and in particular file names) in the Unicode character set, which devotes 16 bits to each character rather than 8 and can represent the alphabets of all major modern languages from Arabic to Devanagari to Telugu to Khmer.
Format
It is common to divide a file name into a base name and an extension that indicates the type of the file. DOS requires that each name be composed of a base name of eight or fewer characters and an extension of three or fewer characters. When the name is displayed, it is represented as base.extension. Unix internally makes no such distinction, but it is a common convention to include exactly one period in a file name (e.g. file.c for a C source file).
File Structure
Unix hides the chunkiness of tracks, sectors, etc. and presents each file as a smooth array of bytes with no internal structure. Application programs can, if they wish, use the bytes in the file to represent structures. For example, a widespread convention in Unix is to use the newline character (the character with bit pattern 00001010) to break text files into lines. Some other systems provide a variety of other types of files. The most common are files that consist of an array of fixed or variable size records and files that form an index mapping keys to values. Indexed files are usually implemented as B-trees.

File Types
Most systems divide files into various types. The concept of type is a confusing one, partially because the term type can mean different things in different contexts. Unix initially supported only four types of files: directories, two kinds of special files, and regular files. Just about any type of file is considered a regular file by Unix. Within this category, however, it is useful to distinguish text files from binary files; within binary files there are executable files (which contain machine-language code) and data files; text files might be source files in a particular programming language (e.g. C or Java) or they may be human-readable text in some mark-up language such as HTML (hypertext markup language). Data files may be classified according to the program that created them or is able to interpret them; e.g., a file may be a Microsoft Word document or an Excel spreadsheet or the output of TeX. The possibilities are endless. In general (not just in Unix) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately from the file, but associated with it. Unix only provides enough meta-data to distinguish a regular file from a directory (or special file), but other systems support more types.
2. The type of a file may be indicated by part of its contents, such as a header made up of the first few bytes of the file. In Unix, files that store executable programs start with a two-byte magic number that identifies them as executable and selects one of a variety of executable formats. In the original Unix executable format, called the a.out format, the magic number is the octal number 0407, which happens to be the machine code for a branch instruction on the PDP-11 computer, one of the first computers to implement Unix. The operating system could run a file by loading it into memory and jumping to the beginning of it. The 0407 code, interpreted as an instruction, jumps to the word following the 16-byte header, which is the beginning of the executable code in this format. The PDP-11 computer is extinct by now, but it lives on through the 0407 code!
3. The type of a file may be indicated by its name. Sometimes this is just a convention, and sometimes it's enforced by the OS or by certain programs. For example, the Unix Java compiler refuses to believe that a file contains Java source unless its name ends with .java.
Some systems enforce the types of files more vigorously than others. File types may be enforced

Not at all,
Only by convention,
By certain programs (e.g. the Java compiler), or
By the operating system itself.

Unix tends to be very lax in enforcing types.
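To illustrate the second technique (recording the type in the first few bytes of the file), here is a rough sketch that inspects a file's leading bytes. The particular magic numbers checked, and the assumption that the 0407 word appears in little-endian byte order, are illustrative rather than a complete or authoritative list.

    #include <stdio.h>

    /* Minimal sketch: guess a file's type from its leading "magic" bytes. */
    const char *guess_type(const char *path)
    {
        unsigned char hdr[4] = {0};
        FILE *f = fopen(path, "rb");
        if (!f) return "unreadable";
        size_t n = fread(hdr, 1, sizeof hdr, f);
        fclose(f);

        /* 0407 octal = 0x0107; byte order assumed little-endian here. */
        if (n >= 2 && hdr[0] == 0x07 && hdr[1] == 0x01)
            return "a.out executable (0407 magic)";
        if (n >= 4 && hdr[0] == 0x7f && hdr[1] == 'E' && hdr[2] == 'L' && hdr[3] == 'F')
            return "ELF executable";
        if (n >= 2 && hdr[0] == '#' && hdr[1] == '!')
            return "script with an interpreter line";
        return "unknown";
    }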

Access Modes
Systems support various access modes for operations on a file.

Sequential. Read or write the next record or next n bytes of the file. Usually, sequential access also allows a rewind operation.
Random. Read or write the nth record or bytes i through j. Unix provides an equivalent facility by adding a seek operation to the sequential operations listed above. This packaging of operations allows random access but encourages sequential access.
Indexed. Read or write the record with a given key. In some cases, the key need not be unique: there can be more than one record with the same key. In this case, programs use a combination of indexed and sequential operations: get the first record with a given key, then get other records with the same key by doing sequential reads.

Note that access modes are distinct from file structure (e.g., a record-structured file can be accessed either sequentially or randomly), but the two concepts are not entirely unrelated. For example, indexed access mode only makes sense for indexed files.
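The Unix packaging of sequential plus seek operations can be seen directly in the read() and lseek() system calls. A small sketch follows; the file name records.dat and the 64-byte record size are invented for the example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define RECSIZE 64   /* hypothetical fixed record size imposed by the application */

    int main(void)
    {
        char rec[RECSIZE];
        int fd = open("records.dat", O_RDONLY);   /* hypothetical data file */
        if (fd < 0) { perror("open"); return 1; }

        /* Sequential access: each read() picks up where the last one left off. */
        while (read(fd, rec, RECSIZE) == RECSIZE)
            ;   /* process rec ... */

        /* Random access: seek to the nth record, then do an ordinary read(). */
        off_t n = 100;
        lseek(fd, n * RECSIZE, SEEK_SET);
        if (read(fd, rec, RECSIZE) == RECSIZE)
            printf("read record %lld\n", (long long)n);

        close(fd);
        return 0;
    }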

File Attributes
This is the area where there is the most variation among file systems. Attributes can also be grouped by general category.
Name
Ownership and Protection
Owner, owner's group, creator, access-control list (information about who can do what to this file; for example, perhaps the owner can read or modify it, other members of his group can only read it, and others have no access).
Time Stamps

Time created, time last modified, time last accessed, time the attributes were last changed, etc. Unix maintains the last three of these. Some systems record not only when the file was last modified, but by whom.
Sizes
Current size, size limit, high-water mark, space consumed (which may be larger than the size because of internal fragmentation or smaller because of various compression techniques).
Type Information
As described above: the file is ASCII, is executable, is a system file, is an Excel spreadsheet, etc.

Misc
Some systems have attributes describing how the file should be displayed when a directory is listed. For example, MacOS records an icon to represent the file and the screen coordinates where it was last displayed. DOS has a hidden attribute meaning that the file is not normally shown. Unix achieves a similar effect by convention: the ls program that is usually used to list files does not show files with names that start with a period unless you explicitly request it to (with the -a option).
Unix records a fixed set of attributes in the meta-data associated with a file. If you want to record some fact about the file that is not included among the supported attributes, you have to use one of the tricks listed above for recording type information: encode it in the name of the file, put it into the body of the file itself, or store it in a file with a related name. Other systems (notably MacOS and Windows NT) allow new attributes to be invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-name, attribute-value) pairs. The attribute name can be any four-character string, and the attribute value can be anything at all. Indeed, some kinds of files put the entire contents of the file in an attribute and leave the body of the file (called the data fork) empty.
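On Unix, most of the per-file attributes discussed above can be retrieved with the stat() system call. A short sketch:

    #include <stdio.h>
    #include <sys/stat.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("size:      %lld bytes\n", (long long)st.st_size);
        printf("owner uid: %d  group gid: %d\n", (int)st.st_uid, (int)st.st_gid);
        printf("mode:      %o\n", (unsigned)st.st_mode & 07777);   /* permission bits */
        printf("modified:  %s", ctime(&st.st_mtime));              /* time last modified */
        printf("type:      %s\n", S_ISDIR(st.st_mode) ? "directory" : "regular or other");
        return 0;
    }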

Self Assessment Questions

1. Discuss the three ways of indicating the type of files.
2. Explain the various types of file access modes.
3. Explain the file system attributes in brief.
Implementing File Systems

Files
We will assume that all the blocks of the disk are given block numbers starting at zero and running through consecutive integers up to some maximum. We will further assume that blocks with numbers that are near each other are located physically near each other on the disk (e.g., on the same cylinder), so that the arithmetic difference between the numbers of two blocks gives a good estimate of how long it takes to get from one to the other. First let's consider how to represent an individual file. There are (at least!) four possibilities:
Contiguous
The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent any file with a pair of numbers: the block number of the first block and the length of the file (in blocks). The advantages of this approach are:

It's simple.
The blocks of the file are all physically near each other on the disk and in order, so that a sequential scan through the file will be fast.

The problem with this organization is that you can only grow a file if the block following the last block in the file happens to be free. Otherwise, you would have to find a long enough run of free blocks to accommodate the new length of the file and copy it. As a practical matter, operating systems that use this organization require the maximum size of the file to be declared when it is created and pre-allocate space for the whole file. Even then, storage allocation has all the problems we considered when studying main-memory allocation, including external fragmentation.
Linked List
A file is represented by the block number of its first block, and each block contains the block number of the next block of the file. This representation avoids the problems of the contiguous representation: we can grow a file by linking any disk block onto the end of the list, and there is no external fragmentation. However, it introduces a new problem: random access is effectively impossible. To find the 100th block of a file, we have to read the first 99 blocks just to follow the list. We also lose the advantage of very fast sequential access to the file, since its blocks may be scattered all over the disk. However, if we are careful when choosing blocks to add to a file, we can retain pretty good sequential access performance.

Both the space overhead (the percentage of the space taken up by pointers) and the time overhead (the percentage of the time spent seeking from one place to another) can be decreased by using larger blocks. The hardware designer fixes the block size (which is usually quite small), but the software can get around this problem by using virtual blocks, sometimes called clusters. The OS simply treats each group of (say) four contiguous physical disk sectors as one cluster. Large clusters, particularly if they can be of variable size, are sometimes called extents. Extents can be thought of as a compromise between linked and contiguous allocation.
Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of the blocks and gather them all together in one place. This approach is used in the FAT file system of DOS, OS/2 and older versions of Windows. At some fixed place on disk, allocate an array I with one element for each block on the disk, and move the link field from block n to I[n]. The whole array of links, called a file allocation table (FAT), is now small enough that it can be read into main memory when the system starts up. Accessing the 100th block of a file still requires walking through 99 links of a linked list, but now the entire list is in memory, so the time to traverse it is negligible (recall that a single disk access takes as long as 10s or even 100s of thousands of instructions). This representation has the added advantage of getting the operating system stuff (the links) out of the pages of user data. The pages of user data are now full-size disk blocks, and lots of algorithms work better with chunks that are a power of two bytes long. Also, it means that the OS can prevent users (who are notorious for screwing things up) from getting their grubby hands on the system data.
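To see why keeping the whole table in memory makes the linked representation workable, here is a hedged sketch of finding the nth block of a file by walking an in-memory table. The array name fat and the end-of-chain marker are invented for the illustration and do not reflect the exact on-disk format of any particular FAT version.

    #include <stdint.h>

    #define FAT_EOF 0xFFFFFFFFu   /* hypothetical end-of-chain marker */

    /* fat[b] holds the number of the block that follows block b in its file.
       Walking n links is cheap because the whole table is cached in memory. */
    uint32_t nth_block(const uint32_t *fat, uint32_t first_block, uint32_t n)
    {
        uint32_t b = first_block;
        while (n-- > 0 && b != FAT_EOF)
            b = fat[b];            /* follow one link; no disk access needed */
        return b;                  /* FAT_EOF means the file is shorter than n+1 blocks */
    }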

The main problem with this approach is that the index array I can get quite large with modern disks. For example, consider a 2 GB disk with 2K blocks. There are a million blocks, so a block number must be at least 20 bits. Rounded up to an even number of bytes, that's 3 bytes (4 bytes if we round up to a word boundary), so the array I is three or four megabytes. While that's not an excessive amount of memory given today's RAM prices, if we can get along with less, there are better uses for the memory.
File Index
Although a typical disk may contain tens of thousands of files, only a few of them are open at any one time, and it is only necessary to keep index information about open files in memory to get good performance. Unfortunately, the whole-disk index described in the previous paragraph mixes index information about all files for the whole disk together, making it difficult to cache only information about open files. The inode structure introduced by Unix groups together index information about each file individually. The basic idea is to represent each file as a tree of blocks, with the data blocks as leaves. Each internal block (called an indirect block in Unix jargon) is an array of block numbers, listing its children in order. If a disk block is 2K bytes and a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a single root node pointing directly to the leaves) can accommodate files up to 512 blocks, or one megabyte, in size. If the root node is cached in memory, the address (block number) of any block of the file can be found without any disk accesses. A two-level tree, with 513 total indirect blocks, can handle files 512 times as large (up to one-half gigabyte). The only problem with this idea is that it wastes space for small files. Any file with more than one block needs at least one indirect block to store its block numbers. A 4K file would require three 2K blocks, wasting up to one third of its space. Since many files are quite small, this is a serious problem. The Unix solution is to use a different kind of block for the root of the tree.

An index node (or inode for short) contains almost all the meta-data about a file listed above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are small enough that several of them can be packed into one disk block. In addition to the meta-data, an inode contains the block numbers of the first few blocks of the file. What if the file is too big to fit all its block numbers into the inode? The earliest version of Unix had a bit in the meta-data to indicate whether the file was small or big. For a big file, the inode contained the block numbers of indirect blocks rather than data blocks. More recent versions of Unix contain pointers to indirect blocks in addition to the pointers to the first few data blocks. The inode contains pointers to (i.e., block numbers of) the first few blocks of the file, a pointer to an indirect block containing pointers to the next several blocks of the file, a pointer to a doubly indirect block, which is the root of a two-level tree whose leaves are the next blocks of the file, and a pointer to a triply indirect block. A large file is thus a lop-sided tree.
A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are four bytes, and the size of a block is a parameter stored in the file system itself, typically 8K (8192 bytes), so 2048 pointers fit in one block. An inode has direct pointers to the first 12 blocks of the file, as well as pointers to singly, doubly, and triply indirect blocks. A file of up to 12 + 2048 + 2048*2048 = 4,196,364 blocks, or 34,376,613,888 bytes (about 32 GB), can be represented without using triply indirect blocks, and with the triply indirect block, the maximum file size is (12 + 2048 + 2048*2048 + 2048*2048*2048) * 8192 = 70,403,120,791,552 bytes (slightly more than 2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of the file cannot be represented as a 32-bit integer. Modern versions of Unix store the file length as a 64-bit integer, called a long integer in Java. An inode is 128 bytes long, allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in one disk block. Since the inode for a file is kept in memory while the file is open, locating an arbitrary block of any file requires at most three I/O operations, not counting the operation to read or write the data block itself.
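The block-mapping arithmetic implied by this layout can be sketched as follows. The constants match the Solaris-style example above (12 direct pointers, 2048 pointers per indirect block); the function only reports which pointers would be followed, not how the blocks are actually read from disk.

    #include <stdio.h>

    #define NDIRECT 12
    #define NPTRS   2048u   /* 8K block / 4-byte block numbers, as in the example above */

    /* Report how a logical block index would be reached from the inode. */
    void classify_block(unsigned long n)
    {
        unsigned long i = n;
        if (i < NDIRECT) {
            printf("logical block %lu: direct pointer %lu in the inode\n", n, i);
            return;
        }
        i -= NDIRECT;
        if (i < NPTRS) {
            printf("logical block %lu: singly indirect block, slot %lu\n", n, i);
            return;
        }
        i -= NPTRS;
        if (i < (unsigned long)NPTRS * NPTRS) {
            printf("logical block %lu: doubly indirect, slots %lu,%lu\n",
                   n, i / NPTRS, i % NPTRS);
            return;
        }
        i -= (unsigned long)NPTRS * NPTRS;
        printf("logical block %lu: triply indirect, slots %lu,%lu,%lu\n",
               n, i / ((unsigned long)NPTRS * NPTRS), (i / NPTRS) % NPTRS, i % NPTRS);
    }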

Directories
A directory is simply a table mapping character-string human-readable names to information about files. The early PC operating system CP/M shows how simple a directory can be. Each entry contains the name of one file, its owner, size (in blocks) and the block numbers of 16 blocks of the file. To represent files with more than 16 blocks, CP/M used multiple directory entries with the same name and different values in a field called the extent number. CP/M had only one directory for the entire system. DOS uses a similar directory entry format, but stores only the first block number of the file in the directory entry. The entire file is represented as a linked list of blocks using the disk index scheme described above. All but the earliest version of DOS provide hierarchical directories using a scheme similar to the one used in Unix. Unix has an even simpler directory format. A directory entry contains only two fields: a character-string name (up to 14 characters) and a two-byte integer called an inumber,

which is interpreted as an index into an array of inodes at a fixed, known location on disk. All the remaining information about the file (size, ownership, time stamps, permissions, and an index to the blocks of the file) is stored in the inode rather than the directory entry. A directory is represented like any other file (there's a bit in the inode to indicate that the file is a directory). Thus the inumber in a directory entry may designate a regular file or another directory, allowing arbitrary graphs of nodes. However, Unix carefully limits the set of operating system calls to ensure that the set of directories is always a tree. The root of the tree is the file with inumber 1 (some versions of Unix use other conventions for designating the root directory). The entries in each directory point to its children in the tree. For convenience, each directory also has two special entries: an entry with the name .., which points to the parent of the directory in the tree, and an entry with the name ., which points to the directory itself. Inumber 0 is not used, so an entry is marked unused by setting its inumber field to 0.
Self Assessment Questions
1. What is a block? Write its advantages.
2. Explain the disk index and its advantages.
3. Explain the Unix directory format with a suitable example.
Space Management
Block Size and Extents
All of the file organizations I've mentioned store the contents of a file in a set of disk blocks. How big should a block be? The problem with small blocks is I/O overhead. There is a certain overhead to read or write a block beyond the time to actually transfer the bytes. If we double the block size, a typical file will have half as many blocks. Reading or writing the whole file will transfer the same amount of data, but it will involve half as many disk I/O operations. The overhead for an I/O operation includes a variable amount of latency (seek time and rotational delay) that depends on how close the blocks are to each other, as well as a fixed overhead to start each operation and respond to the interrupt when it completes.
Many years ago, researchers at the University of California at Berkeley studied the original Unix file system. They found that when they tried reading or writing a single very large file sequentially, they were getting only about 2% of the potential speed of the disk. In other words, it took about 50 times as long to read the whole file as it would if they simply read that many sequential blocks directly from the raw disk (with no file system software). They tried doubling the block size (from 512 bytes to 1K) and the performance more than doubled. The reason the speed more than doubled was that it took less than half as many I/O operations to read the file. Because the blocks were twice as large, twice as much of the file's data was in blocks pointed to directly by the inode. Indirect blocks were twice as large as well, so they could hold twice as many pointers. Thus four times as much data could be accessed through the singly indirect block without resorting to the doubly indirect block.

If doubling the block size more than doubled performance, why stop there? Why didn't the Berkeley folks make the blocks even bigger? The problem with big blocks is internal fragmentation. A file can only grow in increments of whole blocks. If the sizes of files are random, we would expect on the average that half of the last block of a file is wasted. If most files are many blocks long, the relative amount of waste is small, but if the block size is large compared to the size of a typical file, half a block per file is significant. In fact, if files are very small (compared to the block size), the problem is even worse. If, for example, we choose a block size of 8K and the average file is only 1K bytes long, we would be wasting about 7/8 of the disk.
Most files in a typical Unix system are very small. The Berkeley researchers made a list of the sizes of all files on a typical disk and did some calculations of how much space would be wasted by various block sizes. Simply rounding the size of each file up to a multiple of 512 bytes resulted in wasting 4.2% of the space. Including overhead for inodes and indirect blocks, the original 512-byte file system had a total space overhead of 6.9%. Changing to 1K blocks raised the overhead to 11.8%. With 2K blocks, the overhead would be 22.4% and with 4K blocks it would be 45.6%. Would 4K blocks be worthwhile? The answer depends on economics. In those days disks were very expensive, and wasting half the disk seemed extreme. These days, disks are cheap, and for many applications people would be happy to pay twice as much per byte of disk space to get a disk that was twice as fast.
But there's more to the story. The Berkeley researchers came up with the idea of breaking up the disk into blocks and fragments. For example, they might use a block size of 2K and a fragment size of 512 bytes. Each file is stored in some number of whole blocks plus 0 to 3 fragments at the end. The fragments at the end of one file can share a block with fragments of other files. The problem is that when we want to append to a file, there may not be any space left in the block that holds its last fragment. In that case, the Berkeley file system copies the fragments to a new (empty) block. A file that grows a little at a time may require each of its fragments to be copied many times. They got around this problem by modifying application programs to buffer their data internally and add it to a file a whole block's worth at a time. In fact, most programs already used library routines to buffer their output (to cut down on the number of system calls), so all they had to do was to modify those library routines to use a larger buffer size. This approach has been adopted by many modern variants of Unix. The Solaris system, for example, uses 8K blocks and 1K fragments.
As disks get cheaper and CPUs get faster, wasted space is less of a problem and the speed mismatch between the CPU and the disk gets worse. Thus the trend is towards larger and larger disk blocks. At first glance it would appear that the OS designer has no say in how big a block is. Any particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use larger blocks. For example, if we think it would be a good idea to use 2K blocks, we can group together each run of four consecutive sectors and call it a block. In fact, it would even be possible to use variable-sized blocks, so long as each one is a multiple

of the sector size. A variable-sized block is called an extent. When extents are used, they are usually used in addition to multi-sector blocks. For example, a system may use 2K blocks, each consisting of 4 consecutive sectors, and then group them into extents of 1 to 10 blocks. When a file is opened for writing, it grows by adding an extent at a time. When it is closed, the unused blocks at the end of the last extent are returned to the system. The problem with extents is that they introduce all the problems of external fragmentation that we saw in the context of main memory allocation. Extents are generally only used in systems such as databases, where high-speed access to very large files is important.
Free Space
We have seen how to keep track of the blocks in each file. How do we keep track of the free blocks (blocks that are not in any file)? There are two basic approaches.

Use a bit vector. That is, simply an array of bits with one bit for each block on the disk. A 1 bit indicates that the corresponding block is allocated (in some file) and a 0 bit says that it is free. To allocate a block, search the bit vector for a zero bit, and set it to one.
Use a free list. The simplest approach is simply to link together the free blocks by storing the block number of each free block in the previous free block. The problem with this approach is that when a block on the free list is allocated, you have to read it into memory to get the block number of the next block in the list. This problem can be solved by storing the block numbers of additional free blocks in each block on the list. In other words, the free blocks are stored in a sort of lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of the free blocks would be linked into a list. Each block on the list would contain a pointer to the next block on the list, as well as pointers to 127 additional free blocks. When the first block of the list is allocated to a file, it has to be read into memory to get the block numbers stored in it, but then we can allocate 127 more blocks without reading any of them from disk. Freeing blocks is done by running this algorithm in reverse: keep a cache of 127 block numbers in memory. When a block is freed, add its block number to this cache. If the cache is full when a block is freed, use the block being freed to hold all the block numbers in the cache and link it to the head of the free list by adding to it the block number of the previous head of the list.
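A minimal sketch of the bit-vector approach is given below, assuming the bitmap has already been read into memory; a real file system would also have to write the modified bitmap block back to disk.

    #include <stdint.h>

    /* bitmap[i/8] bit (i%8) is 1 if block i is allocated, 0 if it is free. */
    long alloc_block(uint8_t *bitmap, long nblocks)
    {
        for (long i = 0; i < nblocks; i++) {
            if (!(bitmap[i / 8] & (1u << (i % 8)))) {   /* found a free block */
                bitmap[i / 8] |= (1u << (i % 8));       /* mark it allocated  */
                return i;
            }
        }
        return -1;   /* disk full */
    }

    void free_block(uint8_t *bitmap, long i)
    {
        bitmap[i / 8] &= ~(1u << (i % 8));
    }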

How do these methods compare? Neither requires significant space overhead on disk. The bitmap approach needs one bit for each block. Even for a tiny block size of 512 bytes, each bit of the bitmap describes 512*8 = 4096 bits of free space, so the overhead is less than 1/40 of 1%. The free list is even better. All the pointers are stored in blocks that are free anyhow, so there is no space overhead (except for one pointer to the head of the list). Another way of looking at this is that when the disk is full (which is the only time we should be worried about space overhead!) the free list is empty, so it takes up no space. The real advantage of bitmaps over free lists is that they give the space allocator more control over which block is allocated to which file. Since the blocks of a file are

generally accessed together, we would like them to be near each other on disk. To ensure this clustering, when we add a block to a file we would like to choose a free block that is near the other blocks of a file. With a bitmap, we can search the bitmap for an appropriate block. With a free list, we would have to search the free list on disk, which is clearly impractical. Of course, to search the bitmap, we have to have it all in memory, but since the bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the entire bitmap in memory all the time. To do the comparable operation with a free list, we would need to keep the block numbers of all free blocks in memory. If a block number is four bytes (32 bits), that means that 32 times as much memory would be needed for the free list as for a bitmap. For a concrete example, consider a 2 gigabyte disk with 8K blocks and 4-byte block numbers. The disk contains 2^31/2^13 = 2^18 = 262,144 blocks. If they are all free, the free list has 262,144 entries, so it would take one megabyte of memory to keep them all in memory at once. By contrast, a bitmap requires 2^18 bits, or 2^15 = 32K bytes (just four blocks). (On the other hand, the bitmap takes the same amount of memory regardless of the number of blocks that are free.)

Reliability
Disks fail, disk sectors get corrupted, and systems crash, losing the contents of volatile memory. There are several techniques that can be used to mitigate the effects of these failures. We only have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of additional bits whose value is some function of the user data in the block. When the block is read back in, the checksum is also read and compared with the data. If either the data or the checksum was corrupted, it is extremely unlikely that the checksum comparison will succeed. Thus the disk drive itself has a way of discovering bad blocks with extremely high probability. The hardware is also responsible for recovering from bad blocks. Modern disk drives do automatic bad-block forwarding. The disk drive or controller is responsible for mapping block numbers to absolute locations on the disk (cylinder, track, and sector). It holds a little bit of space in reserve, not mapping any block numbers to this space. When a bad block is discovered, the disk allocates one of these reserved blocks and maps the block number of the bad block to the replacement block. All references to this block number access the replacement block instead of the bad block.
There are two problems with this scheme. First, when a block goes bad, the data in it is lost. In practice, blocks tend to be bad from the beginning, because of small defects in the surface coating of the disk platters. There is usually a stand-alone formatting program that tests all the blocks on the disk and sets up forwarding entries for those that fail. Thus the bad blocks never get used in the first place. The main reason for the forwarding is that it is just too hard (expensive)

to create a disk with no defects. It is much more economical to manufacture a pretty good disk and then use bad-block forwarding to work around the few bad blocks. The other problem is that forwarding interferes with the OS's attempts to lay out files optimally. The OS may think it is doing a good job by assigning consecutive blocks of a file to consecutive block numbers, but if one of those blocks is forwarded, it may be very far away from the others. In practice, this is not much of a problem, since a disk typically has only a handful of forwarded sectors out of millions. The software can also help avoid bad blocks by simply leaving them out of the free list (or marking them as allocated in the allocation bitmap).
Back-up Dumps
There are a variety of storage media that are much cheaper than (hard) disks but are also much slower. An example is 8 millimeter video tape. A two-hour tape costs just a few dollars and can hold two gigabytes of data. By contrast, a 2 GB hard drive currently costs several hundred dollars. On the other hand, while the worst-case access time to a hard drive is a few tens of milliseconds, rewinding or fast-forwarding a tape to a desired location can take several minutes. One way to use tapes is to make periodic back-up dumps. Dumps are really used for two different purposes:

To recover lost files. Files can be lost or damaged by hardware failures, but far more often they are lost through software bugs or human error (accidentally deleting the wrong file). If the file is saved on tape, it can be restored.
To recover from catastrophic failures. An entire disk drive can fail, or the whole computer can be stolen, or the building can burn down. If the contents of the disk have been saved to tape, the data can be restored (to a repaired or replacement disk). All that is lost is the work that was done since the information was dumped.

Corresponding to these two ways of using dumps, there are two ways of doing dumps. A physical dump simply copies all of the blocks of the disk, in order, to tape. It's very fast, both for doing the dump and for recovering a whole disk, but it makes it extremely slow to recover any one file. The blocks of the file are likely to be scattered all over the tape, and while seeks on disk can take tens of milliseconds, seeks on tape can take tens or hundreds of seconds. The other approach is a logical dump, which copies each file sequentially. A logical dump makes it easy to restore individual files. It is even easier to restore files if the directories are dumped separately at the beginning of the tape, or if the name(s) of each file are written to the tape along with the file. The problem with logical dumping is that it is very slow. Dumps are usually done much more frequently than restores. For example, you might dump your disk every night for three years before something goes wrong and you need to do a restore. An important trick that can be used with logical dumps is to dump only files that have changed recently. An incremental dump saves only those files that have been modified since a particular date and time. Fortunately, most file systems record the time each file was last modified. If you do a backup each night, you can save only

those files that have changed since the last backup. Every once in a while (say once a month), you can do a full backup of all files. In Unix jargon, a full backup is called an epoch (pronounced "ee-pock") dump, because it dumps everything that has changed since the epoch, January 1, 1970, which is the earliest possible date in Unix. As an example, one university Computer Sciences department currently does backup dumps on about 260 GB of disk space. Epoch dumps are done once every 14 days, with the timing on different file systems staggered so that about 1/14 of the data is dumped each night. Daily incremental dumps save about 6-10% of the data on each file system. Incremental dumps go fast because they dump only a small fraction of the files, and they don't take up a lot of tape. However, they introduce new problems:

If you want to restore a particular file, you need to know when it was last modified so that you know which dump tape to look at.
If you want to restore the whole disk (to recover from a catastrophic failure), you have to restore from the last epoch dump, and then from every incremental dump since then, in order. A file that is modified every day will appear on every tape. Each restore will overwrite the file with a newer version. When you're done, everything will be up-to-date as of the last dump, but the whole process can be extremely slow (and labor-intensive).
You have to keep around all the incremental tapes since the last epoch. Tapes are cheap, but they're not free, and storing them can be a hassle.

The first problem can be solved by keeping a directory of what was dumped when. A group of UW alumni (including the person who invented NFS) have made themselves millionaires by marketing software to do this. The other problems can be solved by a clever trick. Each dump is assigned a positive integer level. A level n dump is an incremental dump that dumps all files that have changed since the most recent previous dump with a level greater than or equal to n. An epoch dump is considered to have an infinitely high level. Levels are assigned to dumps as follows:
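One common assignment, consistent with the restore procedure described below, is the ruler pattern: the level of dump number n is one more than the number of trailing zero bits in the binary representation of n, giving the sequence 1, 2, 1, 3, 1, 2, 1, 4, ..., with the epoch dump treated as having an infinitely high level. A hedged sketch:

    #include <stdio.h>

    /* Level of dump number n in the "ruler" schedule: one more than the number
       of trailing zero bits of n.  This reproduces the pattern 1,2,1,3,1,2,1,4,... */
    int dump_level(unsigned n)
    {
        int level = 1;
        while (n > 0 && n % 2 == 0) {
            n /= 2;
            level++;
        }
        return level;
    }

    /* Which dumps are needed to restore after dump number n: the epoch dump plus
       one dump for each 1 bit of n, restored in order of decreasing level. */
    void tapes_to_restore(unsigned n)
    {
        printf("after dump %u: epoch dump", n);
        for (int bit = 31; bit >= 0; bit--)
            if (n & (1u << bit))
                printf(", then the most recent level-%d dump", bit + 1);
        printf("\n");
    }

For example, dump_level(16) is 5 and dump_level(20) is 3, and tapes_to_restore(20) lists the epoch dump, then the most recent level-5 dump, then the most recent level-3 dump, which matches the worked example given below.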

This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps only save files that have changed in the previous day. Level-2 dumps save files that have changed in the last two days, level-3 dumps cover four days, level-4 dumps cover 8 days, etc. Higher-level dumps will thus include more files (so they will take longer to do), but they are done infrequently. The nice thing about this scheme is that you only need to save one tape from each level, and the number of levels is the logarithm of the interval

between epoch dumps. Thus even if you did a dump each night and an epoch dump only once a year, you would need only nine levels (hence nine tapes). That also means that a full restore needs at worst one restore from each of nine tapes (rather than 365 tapes!). To figure out which tapes you need to restore from if your disk is destroyed after dump number n, express n in binary, and number the bits from right to left, starting with 1. The 1 bits tell you which dump tapes to use. Restore them in order of decreasing level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th dump, you only need to restore from the epoch dump and from the most recent dumps at levels 5 and 3.
Self Assessment Questions
1. Explain how the block size affects the I/O operations needed to read a file.
2. Explain how you can keep track of the free blocks (those that are not in any file).
3. Explain the techniques that can be used to mitigate the effects of disk failures, system crashes and loss of the contents of volatile memory.

Consistency Checking
Some of the information in a file system is redundant. For example, the free list could be reconstructed by checking which blocks are not in any file. Redundancy arises because the same information is represented in different forms to make different operations faster. If you want to know which blocks are in a given file, look at the inode. If you want to know which blocks are not in any inode, use the free list. Unfortunately, various hardware and software errors can cause the data to become inconsistent. File systems often include a utility that checks for consistency and optionally attempts to repair inconsistencies. These programs are particularly handy for cleaning up the disks after a crash.
Unix has a utility called fsck. It has two principal tasks. First, it checks that blocks are properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list is supposed to be a tree of blocks, and each block is supposed to appear in exactly one of these trees. Fsck runs through all the inodes, checking each allocated inode for reasonable values, and walking through the tree of blocks rooted at the inode. It maintains a bit vector to record which blocks have been encountered. If a block is encountered that has already been seen, there is a problem: either it occurred twice in the same file (in which case it isn't a tree), or it occurred in two different files. A reasonable recovery would be to allocate a new block, copy the contents of the problem block into it, and substitute the copy for the problem block in one of the two places where it occurs. It would also be a good idea to log an error message so that a human being can check up later to see what's wrong. After all the files are scanned, any block that hasn't been found should be on the free list. It would be possible to scan the free list in a similar manner, but it's probably easier just to rebuild the free list from the set of blocks that were not found in any file. If a bitmap instead of a free list is used, this step is even easier: simply overwrite the file system's bitmap with the bitmap constructed during the scan.
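The block-allocation pass of such a checker can be sketched as follows. The seen[] array plays the role of the bit vector mentioned above, and mark_block() would be called once for every block reached from any inode or from the free list; the array size is chosen to match the 2^18-block disk of the earlier example.

    #include <stdint.h>
    #include <stdio.h>

    /* One bit per disk block: has this block been reached already during the scan? */
    static uint8_t seen[1 << 15];   /* 32K bytes = 2^18 bits, as in the earlier example */

    /* Returns 0 if the block was new, 1 if it had already been encountered
       (i.e. it appears twice in one file, in two files, or in a file and the free list). */
    int mark_block(unsigned long b)
    {
        uint8_t bit = 1u << (b % 8);
        if (seen[b / 8] & bit) {
            fprintf(stderr, "consistency error: block %lu referenced twice\n", b);
            return 1;    /* caller should copy the block and log the problem */
        }
        seen[b / 8] |= bit;
        return 0;
    }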

The other main consistency requirement concerns the directory structure. The set of directories is supposed to be a tree, and each inode is supposed to have a link count that indicates how many times it appears in directories. The tree structure could be checked by a recursive walk through the directories, but it is more efficient to combine this check with the walk through the inodes that checks the disk blocks, recording, for each directory inode encountered, the inumber of its parent. The set of directories is a tree if and only if every directory other than the root has a unique parent. This pass can also rebuild the link count for each inode by maintaining in memory an array with one slot for each inumber. Each time an inumber is found in a directory, increment the corresponding element of the array. The resulting counts should match the link counts in the inodes. If not, correct the counts in the inodes.
This illustrates a very important principle that pops up throughout operating system implementation (indeed, throughout any large software system): the doctrine of hints and absolutes. Whenever the same fact is recorded in two different ways, one of them should be considered the absolute truth, and the other should be considered a hint. Hints are handy because they allow some operations to be done much more quickly than they could be if only the absolute information were available. But if the hint and the absolute do not agree, the hint can be rebuilt from the absolutes. In a well-engineered system, there should be some way to verify a hint whenever it is used. Unix is a bit lax about this. The link count is a hint (the absolute information is a count of the number of times the inumber appears in directories), but Unix treats it like an absolute during normal operation. As a result, a small error can snowball into completely trashing the file system.
For another example of hints, each allocated block could have a header containing the inumber of the file containing it and its offset in the file. There are systems that do this (Unix isn't one of them). The tree of blocks rooted at an inode then becomes a hint, providing an efficient way of finding a block, but when the block is found, its header can be checked. Any inconsistency would then be caught immediately, and the inode structures could be rebuilt from the information in the block headers. By the way, if the link count calculated by the scan is zero (i.e., the inode, although marked as allocated, does not appear in any directory), it would not be prudent to delete the file. A better recovery is to add an entry to a special lost+found directory pointing to the orphan inode, in case it contains something really valuable.

Transactions
The previous section talks about how to recover from situations that can't happen. How do these problems arise in the first place? Wouldn't it be better to prevent these problems rather than recover from them after the fact? Many of these problems arise, particularly after a crash, because some operation was half-completed. For example, suppose the system was in the middle of executing an unlink system call when the lights went out. An unlink operation involves several distinct steps:

remove an entry from a directory,

decrement a link count, and if the count goes to zero, move all the blocks of the file to the free list, and free the inode.
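In code, the same sequence might look roughly like the sketch below; every type and helper function here (dir_lookup, dir_remove, free_file_blocks, free_inode) is a made-up name standing in for whatever the real file system provides.

    struct dir;                           /* hypothetical directory object      */
    struct inode { int link_count; };     /* only the field this sketch needs   */

    /* All of these helpers are invented names for the purpose of the sketch. */
    struct inode *dir_lookup(struct dir *d, const char *name);
    void dir_remove(struct dir *d, const char *name);
    void free_file_blocks(struct inode *ip);
    void free_inode(struct inode *ip);

    int do_unlink(struct dir *d, const char *name)
    {
        struct inode *ip = dir_lookup(d, name);
        if (ip == NULL)
            return -1;                    /* no such file */
        dir_remove(d, name);              /* step 1: remove the directory entry */
        if (--ip->link_count == 0) {      /* step 2: decrement the link count   */
            free_file_blocks(ip);         /* step 3: free the data blocks ...   */
            free_inode(ip);               /*         ... and the inode          */
        }
        return 0;
    }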

If the crash occurs between the first and second steps, the link count will be wrong. If it occurs during the third step, a block may be linked both into the file and into the free list, or into neither, depending on the details of how the code is written. And so on. To deal with this kind of problem in a general way, transactions were invented. Transactions were first developed in the context of database management systems, and are used heavily there, so there is a tradition of thinking of them as database stuff and teaching about them only in database courses and textbooks. But they really are an operating system concept. Here's a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic: the critical section. Critical sections have a property that is sometimes called synchronization atomicity. It is also called serializability, because if two processes try to execute their critical sections at about the same time, the net effect will be as if they occurred in some serial order. If systems can crash (and they can!), synchronization atomicity isn't enough. We need another property, called failure atomicity, which means an all-or-nothing property: either all of the modifications of nonvolatile storage complete or none of them do.
There are basically two ways to implement failure atomicity. They both depend on the fact that writing a single block to disk is an atomic operation. The first approach is called logging. An append-only file called a log is maintained on disk. Each time a transaction does something to file-system data, it creates a log record describing the operation and appends it to the log. The log record contains enough information to undo the operation. For example, if the operation made a change to a disk block, the log record might contain the block number, the length and offset of the modified part of the block, and the original content of that region. The transaction also writes a begin record when it starts, and a commit record when it is done. After a crash, a recovery process scans the log looking for transactions that started (wrote a begin record) but never finished (never wrote a commit record). If such a transaction is found, its partially completed operations are undone (in reverse order) using the undo information in the log records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the cached copy and only written back out to disk from time to time. If the system crashes before the changes are written to disk, the data structures on disk may be inconsistent. Logging can also be used to avoid this problem by putting into each log record redo information as well as undo information. For example, the log record for a modification of a disk block should contain both the old and new values. After a crash, if the recovery process discovers a transaction that has completed, it uses the redo information to make sure the effects of all of its operations are reflected on disk. Full recovery is always possible provided:

The log records are written to disk in order,
The commit record is written to disk when the transaction completes, and
The log record describing a modification is written to disk before any of the changes made by that operation are written to disk.

This algorithm is called write-ahead logging. The other way of implementing transactions is called shadow blocks. Suppose the data structure on disk is a tree. The basic idea is never to change any block (disk block) of the data structure in place. Whenever you want to modify a block, make a copy of it (called a shadow of it) instead, and modify the parent to point to the shadow. Of course, to make the parent point to the shadow you have to modify it, so instead you make a shadow of the parent and modify that instead. In this way, you shadow not only each block you really wanted to modify, but also all the blocks on the path from it to the root. You keep the shadow of the root block in memory. At the end of the transaction, you make sure the shadow blocks are all safely written to disk and then write the shadow of the root directly onto the root block. If the system crashes before you overwrite the root block, there will be no permanent change to the tree on disk. Overwriting the root block has the effect of linking all the modified (shadow) blocks into the tree and removing all the old blocks. Crash recovery is simply a matter of garbage collection. If the crash occurs before the root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the garbage blocks (they are the blocks that aren't in the tree).
Database systems almost universally use logging, and shadowing is mentioned only in passing in database texts. But the shadowing technique is used in a variant of the Unix file system called (somewhat misleadingly) the Log-structured File System (LFS). The entire file system is made into a tree by replacing the array of inodes with a tree of inodes. LFS has the added advantage (beyond reliability) that all blocks are written sequentially, so write operations are very fast. It has the disadvantage that files that are modified here and there by random access tend to have their blocks scattered about, but that pattern of access is comparatively rare, and there are techniques to cope with it when it occurs. The main source of complexity in LFS is figuring out when and how to do the garbage collection.
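To make the write-ahead logging rules above concrete, here is a rough sketch of a transaction that updates a single block. All of the helper functions and the log record format are invented for the illustration, and each helper is assumed not to return until its write has actually reached the disk.

    #include <stdint.h>

    /* Hypothetical helpers: each is assumed to write one block (or log record)
       to disk and not return until that write is durable. */
    void log_append_begin(int tid);
    void log_append_update(int tid, uint32_t blockno,
                           const void *old_data, const void *new_data);
    void log_append_commit(int tid);
    void write_block(uint32_t blockno, const void *data);

    /* Write-ahead logging for a transaction that modifies a single block:
       the log record (carrying both undo and redo information) must reach the
       disk before the block itself is overwritten, and the commit record is
       written only after the operation is complete. */
    void logged_block_update(int tid, uint32_t blockno,
                             const void *old_data, const void *new_data)
    {
        log_append_begin(tid);
        log_append_update(tid, blockno, old_data, new_data);  /* undo + redo info   */
        write_block(blockno, new_data);                       /* now safe to update */
        log_append_commit(tid);
    }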

Performance
The main trick used to improve file system performance (like almost anything else in computer science) is caching. The system keeps a disk cache (sometimes also called a buffer pool) of recently used disk blocks. In contrast with the page frames of virtual memory, where all sorts of algorithms have been proposed for managing the cache, management of the disk cache is pretty simple. On the whole, it is simply managed LRU (least recently used). Why is it that for paging we went to great lengths trying to come up with an algorithm that is almost as good as LRU, while here we can simply use true LRU? The problem with implementing LRU is that some information has to be updated on every single reference. In the case of paging, references can be as frequent as every instruction,

so we have to make do with whatever information the hardware is willing to give us. The best we can hope for is that the paging hardware will set a bit in a page-table entry. In the case of file system disk blocks, however, each reference is the result of a system call, and adding a few extra instructions to a system call for cache maintenance is not unreasonable.
Summary
File systems and space management are an integral part of an operating system. This unit covers the file management and space management systems, including file structure, file types and the different file access modes, and also deals with implementing file systems. Under space management, it covers block sizes and extents and the basic approaches to keeping track of free space. It also covers disk reliability techniques.

Terminal Questions
1. What do you mean by a file? Explain its significance.
2. Explain why virtual memory and files are different kinds of objects.
3. Discuss the file structure.
4. Explain the various access modes.
5. Discuss the various file organization methods.
6. What do you mean by a block and an extent? Discuss the concept of space management.
7. What do you mean by consistency checking? Discuss how it affects the file system.

Unit 9 : Input-Output Architecture
This unit covers the I/O structure, I/O control strategies, program-controlled I/O, interrupt-controlled I/O and direct memory access, and also covers the I/O address space.
Introduction
In our discussion of the memory hierarchy (in Unit 4), it was implicitly assumed that memory in the computer system would be fast enough to match the speed of the processor (at least for the highest elements in the memory hierarchy) and that no special consideration need be given to how long it would take for a word to be transferred from memory to the processor: an address would be generated by the processor, and after some fixed time interval, the memory system would provide the required information. (In the case of a cache miss, the time interval would be longer, but generally

still fixed. For a page fault, the processor would be interrupted and the page fault handling software invoked.) Although input-output devices are mapped to appear like memory devices in many computer systems, I/O devices have characteristics quite different from memory devices, and often pose special problems for computer systems. This is principally for two reasons:

I/O devices span a wide range of speeds (e.g. terminals accepting input at a few characters per second; disks reading data at over 10 million characters per second).
Unlike memory operations, I/O operations and the CPU are not generally synchronized with each other.

Objectives
At the end of this unit, you will be able to understand:

The fundamentals and significance of I/O operations
The I/O structure for a medium-scale processor system
I/O control strategies
Various mechanisms for I/O operations

I/O structure
Figure 1 shows the general I/O structure associated with many medium-scale processors. Note that the I/O controllers and main memory are connected to the main system bus. The cache memory (usually found on-chip with the CPU) has a direct connection to the processor, as well as to the system bus.

Figure 1: A general I/O structure for a medium-scale processor system
Note that the I/O devices shown here are not connected directly to the system bus; they interface with another device called an I/O controller. In simpler systems, the CPU may also serve as the I/O controller, but in systems where throughput and performance are important, I/O operations are generally handled outside the processor. Until relatively recently, the I/O performance of a system was somewhat of an afterthought for system designers. The reduced cost of high-performance disks, permitting the proliferation of virtual memory systems, and the dramatic reduction in the cost of high-quality video display devices, have meant that designers must pay much more attention to this aspect to ensure adequate performance in the overall system. Because of the different speeds and data requirements of I/O devices, different I/O strategies may be useful, depending on the type of I/O device which is connected to the computer. Because the I/O devices are not synchronized with the CPU, some information must be exchanged between the CPU and the device to ensure that the data is received reliably. This interaction between the CPU and an I/O device is usually referred to as handshaking. For a complete handshake, four events are important:

1. The device providing the data (the talker) must indicate that valid data is now available.
2. The device accepting the data (the listener) must indicate that it has accepted the data. This signal informs the talker that it need not maintain this data word on the data bus any longer.
3. The talker indicates that the data on the bus is no longer valid, and removes the data from the bus. The talker may then set up new data on the data bus.
4. The listener indicates that it is not now accepting any data on the data bus. The listener may use data previously accepted during this time, while it is waiting for more data to become valid on the bus.

Note that the talker and the listener each supply two signals. The talker supplies a signal (say, data valid, or DAV) at step (1). It supplies another signal (say, data not valid, or the complement of DAV) at step (3). Both these signals can be coded as a single binary value (DAV) which takes the value 1 at step (1) and 0 at step (3). The listener supplies a signal (say, data accepted, or DAC) at step (2). It supplies a signal (say, data not now accepted, or the complement of DAC) at step (4). It, too, can be coded as a single binary variable, DAC. Because only two binary variables are required, the handshaking information can be communicated over two wires, and the form of handshaking described above is called a two-wire handshake. Other forms of handshaking are used in more complex situations; for example, where there may be more than one controller on the bus, or where the communication is among several devices. Figure 2 shows a timing diagram for the signals DAV and DAC which identifies the timing of the four events described previously.

Figure 2: Timing diagram for two-wire handshake

Either the CPU or the I/O device can act as the talker or the listener. In fact, the CPU may act as a talker at one time and a listener at another. For example, when communicating with a terminal screen (an output device) the CPU acts as a talker, but when communicating with a terminal keyboard (an input device) the CPU acts as a listener.
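The sequence of the four events can also be traced in software. The following minimal sketch (not part of the original text) walks through one transfer, using two ordinary variables to stand in for the DAV and DAC wires; in real hardware these are bus signals, so the program is purely illustrative.

/* Single-threaded walk-through of the four two-wire handshake events.
   The variables dav and dac stand in for the DAV and DAC signal wires. */
#include <stdio.h>

static int bus, dav, dac;

int main(void) {
    int word = 0x41;                 /* an arbitrary data word */

    bus = word; dav = 1;             /* event 1: talker places data, asserts DAV */
    printf("1: DAV=%d, talker placed 0x%X on the bus\n", dav, bus);

    int received = bus; dac = 1;     /* event 2: listener latches data, asserts DAC */
    printf("2: DAC=%d, listener latched 0x%X\n", dac, received);

    dav = 0;                         /* event 3: talker drops DAV, releases the bus */
    printf("3: DAV=%d, data no longer valid\n", dav);

    dac = 0;                         /* event 4: listener drops DAC, ready for more */
    printf("4: DAC=%d, listener ready for the next word\n", dac);
    return 0;
}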

Self Assessment Questions
1. Explain the general I/O structure for a medium-scale processor system with a neat diagram.
2. What do you mean by handshaking? Write the four important events in this context.

I/O Control Strategies

Several I/O strategies are used between the computer system and I/O devices, depending on the relative speeds of the computer system and the I/O devices.

The simplest strategy is to use the processor itself as the I/O controller, and to require that the device follow a strict order of events under direct program control, with the processor waiting for the I/O device at each step.

Another strategy is to allow the processor to be interrupted by the I/O devices, and to have a (possibly different) interrupt handling routine for each device. This allows for more flexible scheduling of I/O events, as well as more efficient use of the processor. (Interrupt handling is an important component of the operating system.)

A third general I/O strategy is to allow the I/O device, or the controller for the device, access to the main memory. The device would write a block of information in main memory, without intervention from the CPU, and then inform the CPU in some way that that block of memory had been overwritten or read. This might be done by leaving a message in memory, or by interrupting the processor. (This is generally the I/O strategy used by the highest-speed devices: hard disks and the video controller.)

Program-controlled I/O

One common I/O strategy is program-controlled I/O (often called polled I/O). Here all I/O is performed under control of an I/O handling procedure, and input or output is initiated by this procedure. The I/O handling procedure will require some status information (handshaking information) from the I/O device (e.g., whether the device is ready to receive data). This information is usually obtained through a second input from the device; a single bit is usually sufficient, so one input port can be used to collect status, or handshake, information from several I/O devices. (A port is the name given to a connection to an I/O device; e.g., to the memory location into which an I/O device is mapped). An I/O port is usually implemented as a register (possibly a set of D flip-flops) which also acts as a buffer between the CPU and the actual I/O device. The word port is often used to refer to the buffer itself.

Typically, there will be several I/O devices connected to the processor; the processor checks the status input port periodically, under program control by the I/O handling procedure. If an I/O device requires service, it will signal this need by altering its input to the status port. When the I/O control program detects that this has occurred (by reading the status port), the appropriate operation will be performed on the I/O device which requested the service. A typical configuration might look somewhat as shown in Figure 3. The outputs labeled handshake out would be connected to bits in the status port. The input labeled handshake in would typically be generated by the appropriate decode logic when the I/O port corresponding to the device was addressed.

Figure 3: Program-controlled I/O

Program-controlled I/O has a number of advantages:

All control is directly under the control of the program, so changes can be readily implemented.
The order in which devices are serviced is determined by the program; this order is not necessarily fixed but can be altered by the program, as necessary. This means that the priority of a device can be varied under program control. (The priority of a device determines which of a set of devices that are simultaneously ready for servicing will actually be serviced first.)
It is relatively easy to add or delete devices.

Perhaps the chief disadvantage of program-controlled I/O is that a great deal of time may be spent testing the status inputs of the I/O devices, when the devices do not need servicing. This busy wait or wait loop during which the I/O devices are polled but no I/O operations are performed is really time wasted by the processor, if there is other work which could be done at that time. Also, if a particular device has its data available for only a short time, the data may be missed because the input was not tested at the appropriate time. Program controlled I/O is often used for simple operations which must be performed sequentially. For example, the following may be used to control the temperature in a room:
DO forever
   INPUT temperature
   IF (temperature < setpoint) THEN
      turn heat ON
   ELSE
      turn heat OFF
   END IF
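A C rendering of the same busy-wait loop is sketched below. It assumes a hypothetical memory-mapped temperature sensor and heater, so the register addresses and bit masks are invented for illustration and would, on real hardware, come from the device's data sheet.

/* Polled (program-controlled) I/O sketch for a hypothetical memory-mapped
   temperature sensor and heater.  Register addresses and bit masks are
   invented for illustration only. */
#include <stdint.h>

#define SENSOR_STATUS ((volatile uint32_t *)0x40000000u)  /* hypothetical address */
#define SENSOR_DATA   ((volatile uint32_t *)0x40000004u)  /* hypothetical address */
#define HEATER_CTRL   ((volatile uint32_t *)0x40000008u)  /* hypothetical address */

#define SENSOR_READY  0x1u   /* status bit: a new reading is available */
#define HEATER_ON     0x1u

void temperature_control_loop(uint32_t setpoint) {
    for (;;) {                                    /* DO forever          */
        while ((*SENSOR_STATUS & SENSOR_READY) == 0)
            ;                                     /* busy wait (polling) */
        uint32_t temperature = *SENSOR_DATA;      /* INPUT temperature   */
        if (temperature < setpoint)
            *HEATER_CTRL = HEATER_ON;             /* turn heat ON        */
        else
            *HEATER_CTRL = 0;                     /* turn heat OFF       */
    }
}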

Note here that the order of events is fixed in time, and that the program loops forever. (It is really waiting for a change in the temperature, but it is a busy wait.)

Self Assessment Questions
1. Write the advantages of program-controlled I/O.

Interrupt-controlled I/O

Interrupt-controlled I/O reduces the severity of the two problems mentioned for program-controlled I/O by allowing the I/O device itself to initiate the device service routine in the processor. This is accomplished by having the I/O device generate an interrupt signal which is tested directly by the hardware of the CPU. When the interrupt input to the CPU is found to be active, the CPU itself initiates a subprogram call to somewhere in the memory of the processor; the particular address to which the processor branches on an interrupt depends on the interrupt facilities available in the processor.

The simplest type of interrupt facility is one in which the processor executes a subprogram branch to some specific address whenever an interrupt input is detected by the CPU. The return address (the location of the next instruction in the program that was interrupted) is saved by the processor as part of the interrupt process. If there are several devices which are capable of interrupting the processor, then with this simple interrupt scheme the interrupt handling routine must examine each device to determine which one caused the interrupt. Also, since only one interrupt can be handled at a time, there is usually a hardware priority encoder which allows the device with the highest priority to interrupt the processor, if several devices attempt to interrupt the processor simultaneously. In Figure 3, the handshake out outputs would be connected to a priority encoder to implement this type of I/O; the other connections remain the same. (Some systems use a daisy chain priority system to determine which of the interrupting devices is serviced first. Daisy chain priority resolution is discussed later.)

In most modern processors, interrupt return points are saved on a stack in memory, in the same way as return addresses for subprogram calls are saved. In fact, an interrupt can often be thought of as a subprogram which is invoked by an external device. If a stack is used to save the return address for interrupts, it is then possible to allow one interrupt to interrupt the interrupt handling routine of another interrupt. In modern computer systems, there are often several priority levels of interrupts, each of which can be disabled, or masked. There is usually one type of interrupt input which cannot be disabled (a non-maskable interrupt) which has priority over all other interrupts. This interrupt input is used for warning the processor of potentially catastrophic events, such as an imminent power failure, to allow the processor to shut down in an orderly way and to save as much information as possible.
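As an illustration of the simple (non-vectored) scheme described above, the handler below scans a table of devices in priority order to find the one that raised the interrupt. This is a sketch only: the device table, the IRQ_PENDING status bit and the handler names are hypothetical, and the actual interrupt entry and return-from-interrupt mechanics are provided by the processor hardware.

/* Sketch of a non-vectored interrupt scheme: one shared entry point whose
   handler polls each device's status register to find the interrupter.
   The device table and the IRQ_PENDING bit are hypothetical. */
#include <stdint.h>

#define IRQ_PENDING 0x1u

struct device {
    volatile uint32_t *status;   /* memory-mapped status register      */
    void (*service)(void);       /* routine that services this device  */
};

/* Highest-priority device first; filled in at system start-up. */
extern struct device device_table[];
extern int           device_count;

/* Entered (via a fixed branch address) whenever the interrupt input is
   active; the return address has already been saved by the hardware. */
void interrupt_entry(void) {
    for (int i = 0; i < device_count; i++) {
        if (*device_table[i].status & IRQ_PENDING) {
            device_table[i].service();   /* service the highest-priority requester */
            return;                      /* return from interrupt restores the saved address */
        }
    }
}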

Most modern computers make use of vectored interrupts. With vectored interrupts, it is the responsibility of the interrupting device to provide the address in main memory of the interrupt servicing routine for that device. This means, of course, that the I/O device itself must have sufficient intelligence to provide this address when requested by the CPU, and also to be initially programmed with this address information by the processor. Although somewhat more complex than the simple interrupt system described earlier, vectored interrupts provide such a significant advantage in interrupt handling speed and ease of implementation (i.e., a separate routine for each device) that this method is almost universally used on modern computer systems. Some processors have a number of special inputs for vectored interrupts (each acting much like the simple interrupt described earlier). Others require that the interrupting device itself provide the interrupt address as part of the process of interrupting the processor.

Direct Memory Access

In most mini- and mainframe computer systems, a great deal of input and output occurs between the disk system and the processor. It would be very inefficient to perform these operations directly through the processor; it is much more efficient if such devices, which can transfer data at a very high rate, place the data directly into the memory, or take the data directly from the memory, without direct intervention from the processor. I/O performed in this way is usually called direct memory access, or DMA. The controller for a device employing DMA must have the capability of generating address signals for the memory, as well as all of the memory control signals. The processor informs the DMA controller that data is to be read from (or placed into) a block of memory locations starting at a certain address in memory. The controller is also informed of the length of the data block.
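At the register level, this set-up step might look like the following sketch. The DMA controller's register layout, its base address, the control bits and the completion interrupt are all hypothetical, invented only to make the sequence concrete.

/* Sketch: programming a hypothetical DMA controller.  The CPU supplies the
   starting physical address, the block length and the direction, then lets
   the controller move the data and signal completion with an interrupt. */
#include <stdint.h>

struct dma_controller {            /* hypothetical register layout */
    volatile uint32_t address;     /* physical start address       */
    volatile uint32_t count;       /* number of bytes to transfer  */
    volatile uint32_t control;     /* direction and start bits     */
};

#define DMA_DIR_TO_MEMORY 0x1u     /* device to memory             */
#define DMA_START         0x2u

#define DMA0 ((struct dma_controller *)0x40001000u)   /* hypothetical address */

static volatile int transfer_done; /* set by the completion interrupt */

void dma_read_block(uint32_t phys_addr, uint32_t nbytes) {
    transfer_done = 0;
    DMA0->address = phys_addr;                    /* where the block goes */
    DMA0->count   = nbytes;                       /* length of the block  */
    DMA0->control = DMA_DIR_TO_MEMORY | DMA_START;
    /* The CPU is now free to do other work; the controller steals memory
       cycles (or briefly halts the CPU) to move the data. */
    while (!transfer_done)
        ;                                         /* or sleep until the interrupt */
}

/* Invoked when the controller raises its completion interrupt. */
void dma_complete_isr(void) {
    transfer_done = 1;
}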

There are two possibilities for the timing of the data transfer from the DMA controller to memory:

1. The controller can cause the processor to halt if it attempts to access data in the same bank of memory into which the controller is writing. This is the fastest option for the I/O device, but may cause the processor to run more slowly because the processor may have to wait until a full block of data is transferred.
2. The controller can access memory in memory cycles which are not used by the particular bank of memory into which the DMA controller is writing data. This approach, called cycle stealing, is perhaps the most commonly used approach. (In a processor with a cache that has a high hit rate, this approach may not slow the I/O transfer significantly.)

DMA is a sensible approach for devices which have the capability of transferring blocks of data at a very high data rate, in short bursts. It is not worthwhile for slow devices, or for devices which do not provide the processor with large quantities of data. Because the controller for a DMA device is quite sophisticated, the DMA devices themselves are usually quite sophisticated (and expensive) compared to other types of I/O devices. One problem that systems employing several DMA devices have to address is the contention for the single system bus. There must be some method of selecting which device controls the bus (acts as bus master) at any given time. There are many ways of addressing the bus arbitration problem; three techniques which are often implemented in processor systems are the following (these are also often used to determine the priorities of other events which may occur simultaneously, like interrupts). They rely on

the use of at least two signals (bus_request and bus_grant), used in a manner similar to the two-wire handshake:

Daisy chain arbitration

Here, the requesting device or devices assert the signal bus_request. The bus arbiter returns the bus_grant signal, which passes through each of the devices which can have access to the bus, as shown in Figure 4. Here, the priority of a device depends solely on its position in the daisy chain. If two or more devices request the bus at the same time, the highest priority device is granted the bus first, then the bus_grant signal is passed further down the chain. Generally a third signal (bus_release) is used to indicate to the bus arbiter that the first device has finished its use of the bus. Holding bus_request asserted indicates that another device wants to use the bus.

Figure 4: Daisy chain bus arbitration

Priority encoded arbitration

Here, each device has a request line connected to a centralized arbiter that determines which device will be granted access to the bus. The order may be fixed by the order of connection (priority encoded), or it may be determined by some algorithm preloaded into the arbiter. Figure 5 shows this type of system. Note that each device has a separate line to the bus arbiter. (The bus_grant signals have been omitted for clarity.)

Figure 5: Priority encoded bus arbitration

Distributed arbitration by self-selection

Here, the devices themselves determine which of them has the highest priority. Each device has a bus_request line or lines on which it places a code identifying itself. Each device examines the codes for all the requesting devices, and determines whether or not it is the highest priority requesting device.

These arbitration schemes may also be used in conjunction with each other. For example, a set of similar devices may be daisy chained together, and this set may be an input to a priority encoded scheme.

Using interrupt-driven device drivers to transfer data to or from hardware devices works well when the amount of data is reasonably low. For example, a 9600 baud modem can transfer approximately one character every millisecond (1/1000th of a second).

Figure 6

If the interrupt latency, the amount of time that it takes between the hardware device raising the interrupt and the device driver's interrupt handling routine being called, is low (say 2 milliseconds), then the overall system impact of the data transfer is very low. The 9600 baud modem data transfer would only take 0.002% of the CPU's processing time. For high-speed devices, such as hard disk controllers or Ethernet devices, the data transfer rate is a lot higher. A SCSI device can transfer up to 40 Mbytes of information per second. Direct Memory Access, or DMA, was invented to solve this problem.

A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. A PC's ISA DMA controller has 8 DMA channels, of which 7 are available for use by the device drivers. Each DMA channel has associated with it a 16-bit address register and a 16-bit count register. To initiate a data transfer the device driver sets up the DMA channel's address and count registers together with the direction of the data transfer, read or write. It then tells the device that it may start the DMA when it wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is taking place the CPU is free to do other things.

Device drivers have to be careful when using DMA. First of all, the DMA controller knows nothing of virtual memory; it only has access to the physical memory in the system. Therefore the memory that is being DMA'd to or from must be a contiguous block of physical memory. This means that you cannot DMA directly into the virtual address space of a process. You can, however, lock the process's physical pages into memory, preventing them from being swapped out to the swap device during a DMA operation. Secondly, the DMA controller cannot access the whole of physical memory. The DMA channel's address register represents the first 16 bits of the DMA address; the next 8 bits come from the page register. This means that DMA requests are limited to the bottom 16 Mbytes of memory.

DMA channels are scarce resources; there are only 7 of them, and they cannot be shared between device drivers. Just like interrupts, the device driver must be able to work out which DMA channel it should use. Like interrupts, some devices have a fixed DMA channel. The floppy device, for example, always uses DMA channel 2. Sometimes the DMA channel for a device can be set by jumpers; a number of Ethernet devices use this technique. The more flexible devices can be told (via their CSRs) which DMA channels to use and, in this case, the device driver can simply pick a free DMA channel to use.

Self Assessment Questions
1. What do you mean by direct memory access?
2. Explain the two possibilities for the timing of the data transfer from the DMA controller to memory.

The I/O address space

Some processors map I/O devices in their own, separate, address space; others use memory addresses as addresses of I/O ports. Both approaches have advantages and disadvantages. The advantages of a separate address space for I/O devices are, primarily, that the I/O operations would then be performed by separate I/O instructions, and that all the memory address space could be dedicated to memory. Typically, however, I/O is only a small fraction of the operations performed by a computer system; generally less than 1 percent of all instructions are I/O instructions in a program. It may not be worthwhile to support such infrequent operations with a rich instruction set, so I/O instructions are often rather restricted. In processors with memory-mapped I/O, any of the instructions which reference memory directly can also be used to reference I/O ports, including instructions which modify the contents of the I/O port (e.g., arithmetic instructions).
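The contrast between the two schemes can be sketched in C as follows. The memory-mapped device address used here is hypothetical, and the port-mapped variant uses x86 inline assembly (GCC/Clang syntax) because a separate I/O address space requires dedicated I/O instructions such as in and out; port 0x3F8 is the conventional first serial port on a PC, used purely as an example.

/* Sketch contrasting memory-mapped and port-mapped (separate address space)
   I/O.  The MMIO address is hypothetical; the port example is x86-specific. */
#include <stdint.h>

#define UART_TX ((volatile uint8_t *)0x40002000u)   /* hypothetical MMIO register */

/* Memory-mapped I/O: any instruction that references memory will do. */
void mmio_putc(char c) {
    *UART_TX = (uint8_t)c;        /* an ordinary store reaches the device */
}

/* Port-mapped I/O: a dedicated instruction is required to reach the
   separate I/O address space (here, the x86 "outb" instruction). */
void pio_putc(char c) {
    __asm__ volatile ("outb %0, %1"
                      :
                      : "a"((uint8_t)c), "Nd"((uint16_t)0x3F8));
}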

Some problems can arise with memory-mapped I/O in systems which use cache memory or virtual memory. If a processor uses a virtual memory mapping, and the I/O ports are allowed to be in a virtual address space, the mapping to the physical device may not be consistent if there is a context switch. Moreover, the device would have to be capable of performing the virtual-to-physical mapping. If physical addressing is used, mapping across page boundaries may be problematic. If the memory locations are cached, then the value in cache may not be consistent with the new value loaded in memory. Generally, either there is some method for invalidating cache that may be mapped to I/O addresses, or the I/O addresses are not cached at all. We will look at the general problem of maintaining cache in a consistent state (the cache coherency problem) in more detail when we discuss multi-processor systems.

Terminal Questions
1. What is the significance of I/O operations?
2. Draw a block diagram of an I/O structure and discuss the working principle.
3. What are the various I/O control strategies? Discuss in brief.
4. Explain programmed I/O and interrupt I/O. How do they differ?
5. Discuss the concept of Direct Memory Access. What are its advantages over other methods?

Unit 10 : Case Study on Windows Operating Systems : This unit covers the architecture of the Windows NT operating system and Windows 2000, the functionality that is common across the family, the additional functionality of the Server family, and the different versions of the operating system.

Introduction

Windows 2000, Windows XP and Windows Server 2003 are all part of the Windows NT family of Microsoft operating systems. They are all preemptive, reentrant operating systems, which have been designed to work with either uniprocessor or symmetric multiprocessor (SMP) Intel x86 computers. To process input/output (I/O) requests, they use packet-driven I/O, which utilises I/O request packets (IRPs) and asynchronous I/O. Starting with Windows XP, Microsoft began building 64-bit support into their operating systems; before this, their operating systems were based on a 32-bit model.

The architecture of the Windows NT operating system line is highly modular, and consists of two main layers: a user mode and a kernel mode. Programs and subsystems in user mode are limited in terms of what system resources they have access to, while the kernel mode has unrestricted access to the system memory and external devices. The kernels of the operating systems in this line are all known as hybrid kernels, as their microkernel is essentially the kernel, while higher-level services are implemented by the executive, which exists in kernel mode.

Objectives: At the end of this unit, you will be able to understand:

Architectural details of Windows NT
Functionality and operations of Windows NT
Services and functionality of Windows NT operating systems
Deployment-related issues in Windows NT

Architecture of the Windows NT operating system line

The Windows NT operating system family's architecture consists of two layers (user mode and kernel mode), with many different modules within both of these layers. User mode in the Windows NT line is made of subsystems capable of passing I/O requests to the appropriate kernel mode software drivers by using the I/O manager. Two subsystems make up the user mode layer of Windows 2000: the Environment subsystem (runs applications written for many different types of operating systems), and the Integral subsystem (operates system specific functions on behalf of the environment subsystem).

Kernel mode in Windows 2000 has full access to the hardware and system resources of the computer. The kernel mode stops user mode services and applications from accessing critical areas of the operating system that they should not have access to. The Executive interfaces with all the user mode subsystems. It deals with I/O, object management, security and process management. The hybrid kernel sits between the Hardware Abstraction Layer and the Executive to provide multiprocessor synchronization, thread and interrupt scheduling and dispatching, and trap handling and exception dispatching. The microkernel is also responsible for initializing device drivers at bootup.

Kernel mode drivers exist in three levels: highest level drivers, intermediate drivers and low level drivers. Windows Driver Model (WDM) exists in the intermediate layer and was mainly designed to be binary and source compatible between Windows 98 and Windows 2000. The lowest level drivers are either legacy Windows NT device drivers that control a device directly or can be a PnP hardware bus.

User mode
The user mode is made up of subsystems which can pass I/O requests to the appropriate kernel mode drivers via the I/O manager (which exists in kernel mode). Two subsystems make up the user mode layer of Windows 2000: the Environment subsystem and the Integral subsystem. The environment subsystem was designed to run applications written for many different types of operating systems. None of the environment subsystems can directly access hardware, and must request access to memory resources through the Virtual Memory

Manager that runs in kernel mode. Also, applications run at a lower priority than kernel mode processes. Currently, there are three main environment subsystems: the Win32 subsystem, an OS/2 subsystem and a POSIX subsystem. The Win32 environment subsystem can run 32-bit Windows applications. It contains the console as well as text window support, shutdown and hard-error handling for all other environment subsystems. It also supports Virtual DOS Machines (VDMs), which allow MS-DOS and 16-bit Windows 3.x (Win16) applications to be run on Windows. There is a specific MS-DOS VDM which runs in its own address space and which emulates an Intel 80486 running MS-DOS 5. Win16 programs, however, run in a Win16 VDM. Each program, by default, runs in the same process, thus using the same address space, and the Win16 VDM gives each program its own thread to run on. However, Windows 2000 does allow users to run a Win16 program in a separate Win16 VDM, which allows the program to be preemptively multitasked as Windows 2000 will pre-empt the whole VDM process, which only contains one running application. The OS/2 environment subsystem supports 16-bit character-based OS/2 applications and emulates OS/2 1.x, but not 2.x or later OS/2 applications. The POSIX environment subsystem supports applications that are strictly written to either the POSIX.1 standard or the related ISO/IEC standards. The integral subsystem looks after operating system specific functions on behalf of the environment subsystem. It consists of a security subsystem, a workstation service and a server service. The security subsystem deals with security tokens, grants or denies access to user accounts based on resource permissions, handles logon requests and initiates logon authentication, and determines which system resources need to be audited by Windows 2000. It also looks after Active Directory. The workstation service is an API to the network redirector, which provides the computer access to the network. The server service is an API that allows the computer to provide network services.

Kernel mode
Windows 2000 kernel mode has full access to the hardware and system resources of the computer and runs code in a protected memory area. It controls access to scheduling, thread prioritization, memory management and the interaction with hardware. The kernel mode stops user mode services and applications from accessing critical areas of the operating system that they should not have access to; instead, user mode processes ask the kernel mode to perform such operations on their behalf. Kernel mode consists of executive services, which are themselves made up of many modules that perform specific tasks, kernel drivers, a microkernel and a Hardware Abstraction Layer, or HAL.

Executive

The Executive interfaces with all the user mode subsystems. It deals with I/O, object management, security and process management. It contains various components, including the I/O Manager, the Security Reference Monitor, the Object Manager, the IPC

Manager, the Virtual Memory Manager (VMM), a PnP Manager and Power Manager, as well as a Window Manager which works in conjunction with the Windows Graphics Device Interface (GDI). Each of these components exports a kernel-only support routine that allows other components to communicate with one another. Grouped together, the components can be called executive services. No executive component has access to the internal routines of any other executive component. (Each object in Windows 2000 exists in its own namespace; the namespace can be browsed with a tool such as Sysinternals WinObj.)

The object manager is a special executive subsystem that all other executive subsystems must pass through to gain access to Windows 2000 resources, essentially making it a resource management infrastructure service. The object manager is used to reduce the duplication of object resource management functionality in other executive subsystems, which could potentially lead to bugs and make development of Windows 2000 harder. To the object manager, each resource is an object, whether that resource is a physical resource (such as a file system or peripheral) or a logical resource (such as a file). Each object has a structure or object type that the object manager must know about. When another executive subsystem requests the creation of an object, it sends that request to the object manager, which creates an empty object structure that the requesting executive subsystem then fills in. Object types define the object procedures and any data specific to the object. In this way, the object manager allows Windows 2000 to be an object oriented operating system, as object types can be thought of as classes that define objects. Each instance of an object that is created stores its name, parameters that are passed to the object creation function, security attributes and a pointer to its object type. The object also contains an object close procedure and a reference count, which tells the object manager how many other objects in the system reference that object and thereby determines whether the object can be destroyed when a close request is sent to it. Every object exists in a hierarchical object namespace.

Further executive subsystems are the following:

(i) I/O Manager: allows devices to communicate with user-mode subsystems. It translates user-mode read and write commands into read or write IRPs which it passes to device drivers. It accepts file system I/O requests and translates them into device specific calls, and can incorporate low-level device drivers that directly manipulate hardware to either read input or write output. It also includes a cache manager to improve disk performance by caching read requests and writing to the disk in the background.

(ii) Security Reference Monitor (SRM): the primary authority for enforcing the security rules of the security integral subsystem. It determines whether an object or resource can be accessed, via the use of access control lists (ACLs), which are themselves made up of access control entries (ACEs). ACEs contain a security identifier (SID) and a list of

operations that the ACE gives a select group of trustees (a user account, group account, or logon session) permission (allow, deny, or audit) to that resource.

(iii) IPC Manager: short for Interprocess Communication Manager, this manages the communication between clients (the environment subsystem) and servers (components of the Executive). It can use two facilities: the Local Procedure Call (LPC) facility (clients and servers on the one computer) and the Remote Procedure Call (RPC) facility (where clients and servers are situated on different computers). Microsoft has had significant security issues with the RPC facility.

(iv) Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to use the hard disk as a primary storage device (although strictly speaking it is secondary storage). It controls the paging of memory in and out of physical memory to disk storage.

(v) Process Manager: handles process and thread creation and termination.

(vi) PnP Manager: handles Plug and Play and supports device detection and installation at boot time. It also has the responsibility to stop and start devices on demand; sometimes this happens when a bus gains a new device and needs to have a device driver loaded to support that device. Both FireWire and USB are hot-swappable and require the services of the PnP Manager to load, stop and start devices. The PnP manager interfaces with the HAL, the rest of the executive (as necessary) and with device drivers.

(vii) Power Manager: the power manager deals with power events and generates power IRPs. It coordinates these power events; when several devices send a request to be turned off, it determines the best way of doing this.

The display system has been moved from user mode into the kernel mode as a device driver contained in the file Win32k.sys. There are two components in this device driver: the Window Manager and the GDI.

(viii) Window Manager: responsible for drawing windows and menus. It controls the way that output is painted to the screen and handles input events (such as from the keyboard and mouse), then passes messages to the applications that need to receive this input.

(ix) GDI: the Graphics Device Interface is responsible for tasks such as drawing lines and curves, rendering fonts and handling palettes. Windows 2000 introduced native alpha blending into the GDI.

(x) Microkernel & kernel-mode drivers


The Microkernel sits between the HAL and the Executive and provides multiprocessor synchronization, thread and interrupt scheduling and dispatching, and trap handling and exception dispatching. The Microkernel often interfaces with the process manager. The

microkernel is also responsible for initializing device drivers at bootup that are necessary to get the operating system up and running. Windows 2000 uses kernel-mode device drivers to enable it to interact with hardware devices. Each of the drivers has well defined system routines and internal routines that it exports to the rest of the operating system. All devices are seen by user mode code as a file object in the I/O manager, though to the I/O manager itself the devices are seen as device objects, which it defines as either file, device or driver objects. Kernel mode drivers exist in three levels: highest level drivers, intermediate drivers and low level drivers. The highest level drivers, such as file system drivers for FAT and NTFS, rely on intermediate drivers. Intermediate drivers consist of function drivers or main driver for a device that are optionally sandwiched between lower and higher level filter drivers. The function driver then relies on a bus driver or a driver that services a bus controller, adapter, or bridge which can have an optional bus filter driver that sits between itself and the function driver. Intermediate drivers rely on the lowest level drivers to function. The Windows Driver Model (WDM) exists in the intermediate layer. The lowest level drivers are either legacy Windows NT device drivers that control a device directly or can be a PnP hardware bus. These lower level drivers directly control hardware and do not rely on any other drivers.

(xi) Hardware abstraction layer


The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical hardware of the computer and the rest of the operating system. It was designed to hide differences in hardware and therefore provide a consistent platform on which applications may run. The HAL includes hardware specific code that controls I/O interfaces, interrupt controllers and multiple processors.

Windows 2000 was designed to support the 64-bit DEC Alpha. After Compaq announced they would discontinue support of the processor, Microsoft stopped releasing test builds of Windows 2000 for AXP to the public, stopping with beta 3. Development of Windows on the Alpha continued internally in order to keep a 64-bit architecture development model ready until the wider availability of the Intel Itanium IA-64 architecture. The HAL now only supports hardware that is compatible with the Intel x86 architecture.

Microsoft has had numerous security issues caused by vulnerabilities in its RPC mechanisms. A list follows of the security bulletins that Microsoft has issued in regard to RPC vulnerabilities:

Microsoft Security Bulletin MS03-026: issue with a vulnerability in the part of RPC that deals with message exchange over TCP/IP. The failure results because of incorrect handling of malformed messages. This particular vulnerability affects a Distributed Component Object Model (DCOM) interface with RPC, which listens on RPC enabled ports.

Microsoft Security Bulletin MS03-001: A security vulnerability results from an unchecked buffer in the Locator service. By sending a specially malformed request to the Locator service, an attacker could cause the Locator service to fail, or to run code of the attacker's choice on the system.

Microsoft Security Bulletin MS03-026: Buffer overrun in RPC may allow code execution.

Microsoft Security Bulletin MS03-010: This particular vulnerability affects the RPC Endpoint Mapper process, which listens on TCP/IP port 135. The RPC endpoint mapper allows RPC clients to determine the port number currently assigned to a particular RPC service. To exploit this vulnerability, an attacker would need to establish a TCP/IP connection to the Endpoint Mapper process on a remote machine. Once the connection was established, the attacker would begin the RPC connection negotiation before transmitting a malformed message. At this point, the process on the remote machine would fail. The RPC Endpoint Mapper process is responsible for maintaining the connection information for all of the processes on that machine using RPC. Because the Endpoint Mapper runs within the RPC service itself, exploiting this vulnerability would cause the RPC service to fail, with the attendant loss of any RPC-based services the server offers, as well as potential loss of some COM functions.

Microsoft Security Bulletin MS04-029: This RPC Runtime library vulnerability was addressed in CAN-2004-0569; its title is Vulnerability in RPC Runtime Library Could Allow Information Disclosure and Denial of Service.

Microsoft Security Bulletin MS00-066: A remote denial of service vulnerability in RPC is found. Blocking ports 135-139 and 445 can stop attacks.

Microsoft Security Bulletin MS03-039: There are three newly identified vulnerabilities in the part of the RPCSS Service that deals with RPC messages for DCOM activation: two that could allow arbitrary code execution and one that could result in a denial of service. The flaws result from incorrect handling of malformed messages. These particular vulnerabilities affect the Distributed Component Object Model (DCOM) interface within the RPCSS Service. This interface handles DCOM object activation requests that are sent from one machine to another. An attacker who successfully exploited these vulnerabilities could be able to run code with Local System privileges on an affected system, or could cause the RPCSS Service to fail. The attacker could then be able to take any action on the system, including installing programs, viewing, changing or deleting data, or creating new accounts with full privileges. To exploit these vulnerabilities, an attacker could create a program to send a malformed RPC message to a vulnerable system targeting the RPCSS Service.

Microsoft Security Bulletin MS01-041: Several of the RPC servers associated with system services in Microsoft Exchange Server, SQL Server, Windows NT 4.0 and Windows 2000 do not adequately validate inputs, and in some cases will accept invalid inputs that prevent normal processing. The specific input values at issue here vary from RPC server to RPC server. An attacker who sent such inputs to an affected RPC server

could disrupt its service. The precise type of disruption would depend on the specific service, but could range in effect from minor (e.g., the service temporarily hanging) to major (e.g., the service failing in a way that would require the entire system to be restarted).
Windows 2000

Windows 2000 (also referred to as Win2K or W2K) is a preemptible and interruptible, graphical, business-oriented operating system that was designed to work with either uniprocessor or symmetric multi-processor (SMP) 32-bit Intel x86 computers. It is part of the Microsoft Windows NT line of operating systems and was released on February 17, 2000. Windows 2000 comes in four versions: Professional, Server, Advanced Server, and Datacenter Server. Additionally, Microsoft offers Windows 2000 Advanced Server Limited Edition, which was released in 2001 and runs on 64-bit Intel Itanium microprocessors. Windows 2000 is classified as a hybrid-kernel operating system, and its architecture is divided into two modes: user mode and kernel mode. The kernel mode provides unrestricted access to system resources and facilitates the user mode, which is heavily restricted and designed for most applications.

All versions of Windows 2000 have common functionality, including many system utilities such as the Microsoft Management Console (MMC) and standard system management applications such as a disk defragmentation utility. Support for people with disabilities has also been improved by Microsoft across their Windows 2000 line, and they have included increased support for different languages and locale information. All versions of the operating system support the Windows NT file system, NTFS 5, the Encrypting File System (EFS), as well as basic and dynamic disk storage. Dynamic disk storage allows different types of volumes to be used. The Windows 2000 Server family has enhanced functionality, including the ability to provide Active Directory services (a hierarchical framework of resources), Distributed File System (a file system that supports sharing of files) and fault-redundant storage volumes.

Windows 2000 can be installed and deployed to an enterprise through either an attended or unattended installation. Unattended installations rely on the use of answer files to fill in installation information, and can be performed through a bootable CD, using Microsoft Systems Management Server (SMS), or by the System Preparation Tool (Sysprep).

History
Windows 2000 originally descended from the Microsoft Windows NT operating system product line. Originally called Windows NT 5, Microsoft changed the name to Windows 2000 on October 27, 1998. It was also the first Windows version that was released without a code name, though Windows 2000 Service Pack 1 was codenamed Asteroid and Windows 2000 64-bit was codenamed Janus (not to be confused with Windows 3.1, which had the same codename). The first beta for Windows 2000 was released on September 27, 1997 and several further betas were released until Beta 3 which was released on April 29, 1999. From here, Microsoft issued three release candidates between

July and November 1999, and finally released the operating system to partners on December 12, 1999. The public received the full version of Windows 2000 on February 17, 2000, and the press immediately hailed it as the most stable operating system Microsoft had ever released. Novell, however, was not so impressed with Microsoft's new directory service architecture, as they found it to be less scalable or reliable than their own Novell Directory Services (NDS) technology. On September 29, 2000, Microsoft released Windows 2000 Datacenter.

Microsoft released Service Pack 1 (SP1) on August 15, 2000, Service Pack 2 (SP2) on May 16, 2001, Service Pack 3 (SP3) on August 29, 2002 and its last Service Pack (SP4) on June 26, 2003. Microsoft has stated that they will not release a Service Pack 5, but instead have offered an Update Rollup for Service Pack 4. Microsoft phased out all development of their Java Virtual Machine (JVM) from Windows 2000 in Service Pack 3. Windows 2000 has since been superseded by newer Microsoft operating systems. Microsoft has replaced Windows 2000 Server products with Windows Server 2003, and Windows 2000 Professional with Windows XP Professional.

Windows Neptune started development in 1999, and was supposed to be the home-user edition of Windows 2000. However, the project lagged in production time and only one alpha release was built. Windows Me was released as a substitute, and the Neptune project was forwarded to the production of Whistler (Windows XP). The only elements of the Neptune project which were included in Windows 2000 were the ability to upgrade from Windows 95 or Windows 98, and support for the FAT32 file system.

Several notable security flaws have been found in Windows 2000. Code Red and Code Red II were famous (and highly visible to the worldwide press) computer worms that exploited vulnerabilities of the indexing service of Windows 2000's Internet Information Services (IIS). In August 2003, two major worms named the Sobig worm and the Blaster worm began to attack millions of Microsoft Windows computers, resulting in the largest down-time and clean-up cost ever.

Architecture

The Windows 2000 operating system architecture consists of two layers (user mode and kernel mode), with many different modules within both of these layers. Windows 2000 is a highly modular system that consists of two main layers: a user mode and a kernel mode. The user mode refers to the mode in which user programs are run. Such programs are limited in terms of what system resources they have access to, while the kernel mode has unrestricted access to the system memory and external devices. All user mode applications access system resources through the executive, which runs in kernel mode.

User mode
User mode in Windows 2000 is made of subsystems capable of passing I/O requests to the appropriate kernel mode drivers by using the I/O manager. Two subsystems make up

the user mode layer of Windows 2000: the environment subsystem and the integral subsystem. The environment subsystem was designed to run applications written for many different types of operating systems. These applications, however, run at a lower priority than kernel mode processes. There are three main environment subsystems:

The Win32 subsystem runs 32-bit Windows applications and also supports Virtual DOS Machines (VDMs), which allow MS-DOS and 16-bit Windows 3.x (Win16) applications to run on Windows.

The OS/2 environment subsystem supports 16-bit character-based OS/2 applications and emulates OS/2 1.3 and 1.x, but not 2.x or later OS/2 applications.

The POSIX environment subsystem supports applications that are strictly written to either the POSIX.1 standard or the related ISO/IEC standards.

The integral subsystem looks after operating system specific functions on behalf of the environment subsystem. It consists of a security subsystem (grants/denies access and handles logons), a workstation service (helps the computer gain network access) and a server service (lets the computer provide network services).

Kernel mode
Kernel mode in Windows 2000 has full access to the hardware and system resources of the computer. The kernel mode stops user mode services and applications from accessing critical areas of the operating system that they should not have access to. (Each object in Windows 2000 exists in its own namespace, which can be browsed with a tool such as Sysinternals WinObj.)

The executive interfaces with all the user mode subsystems. It deals with I/O, object management, security and process management. It contains various components, including:

Object Manager: a special executive subsystem that all other executive subsystems must pass through to gain access to Windows 2000 resources. This essentially is a resource management infrastructure service that allows Windows 2000 to be an object oriented operating system.

I/O Manager: allows devices to communicate with user-mode subsystems by translating user-mode read and write commands and passing them to device drivers.

Security Reference Monitor (SRM): the primary authority for enforcing the security rules of the security integral subsystem.

IPC Manager: short for Interprocess Communication Manager, manages the communication between clients (the environment subsystem) and servers (components of the executive).

Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to use the hard disk as a primary storage device (although strictly speaking it is secondary storage).

Process Manager: handles process and thread creation and termination.

PnP Manager: handles Plug and Play and supports device detection and installation at boot time.

Power Manager: coordinates power events and generates power IRPs.

The display system is handled by a device driver contained in Win32k.sys. The Window Manager component of this driver is responsible for drawing windows and menus, while the GDI (graphical device interface) component is responsible for tasks such as drawing lines and curves, rendering fonts and handling palettes.

The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical hardware of the computer and the rest of the operating system. It was designed to hide differences in hardware and therefore provide a consistent platform to run applications on. The HAL includes hardware specific code that controls I/O interfaces, interrupt controllers and multiple processors.

The microkernel sits between the HAL and the executive and provides multiprocessor synchronization, thread and interrupt scheduling and dispatching, trap handling and exception dispatching. The microkernel often interfaces with the process manager. The microkernel is also responsible for initializing device drivers at bootup that are necessary to get the operating system up and running.

Common functionality
Certain features are common across all versions of Windows 2000 (both Professional and the Server versions), among them being NTFS 5, the Microsoft Management Console (MMC), the Encrypting File System (EFS), dynamic and basic disk storage, usability enhancements and multi-language and locale support. Windows 2000 also includes several standard system utilities. As well as these features, Microsoft introduced a new feature to protect critical system files, called Windows File Protection (WFP). This prevents programs (with the exception of Microsoft's update programs) from replacing critical Windows system files and thus making the system inoperable.

Microsoft recognised that the infamous Blue Screen of Death (or stop error) could cause serious problems for servers that needed to be constantly running, and so provided a system setting that would allow the server to automatically reboot when a stop error occurred. Users have the option of dumping the first 64 KB of memory to disk (the smallest amount of memory that is useful for debugging purposes, also known as a minidump), a dump of only the kernel's memory, or a dump of the entire contents of memory to disk, as well as writing a record of the event to the Windows 2000 event log. In order to improve performance on computers running Windows 2000 as a server operating system, Microsoft gave administrators the choice of optimising the operating system for background services or for applications.

NTFS 5
Windows 2000 supports disk quotas, which can be set via the Quotas tab found in the hard disk properties dialog box. Microsoft released the third version of the NT File System (NTFS) also known as version 5.0 in Windows 2000; this introduced quotas, file-system-level encryption (called EFS), sparse files and reparse points. Sparse files allow for the efficient storage of data sets that are very large yet contain many areas that only have zeroes. Reparse points allow the object manager to reset a file namespace lookup and let file system drivers implement changed functionality in a transparent manner. Reparse points are used to implement Volume Mount Points, Directory Junctions, Hierarchical Storage Management, Native Structured Storage and Single Instance Storage. Volume mount points and directory junctions allow for a file to be transparently referred from one file or directory location to another.
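To make the sparse-file idea concrete, the following small Win32 program (not part of the original text) marks a new file as sparse and gives it a 1 GiB logical size without allocating the zero-filled regions; the file name is arbitrary and error handling is kept minimal.

/* Sketch: creating an NTFS sparse file with the Win32 API.  The file is
   marked sparse with FSCTL_SET_SPARSE, then extended to 1 GiB; regions that
   are never written consume no disk space. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void) {
    HANDLE h = CreateFileA("sparse_demo.bin", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    DWORD bytes = 0;
    /* Ask NTFS to treat the file as sparse. */
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

    /* Set the logical size to 1 GiB without allocating the zero-filled runs. */
    LARGE_INTEGER size;
    size.QuadPart = 1024LL * 1024 * 1024;
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    printf("sparse_demo.bin now has a 1 GiB logical size\n");
    CloseHandle(h);
    return 0;
}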

Encrypting File System


The Encrypting File System (EFS) introduced strong encryption into the Windows file world. It allowed any folder or drive on an NTFS volume to be encrypted transparently to the end user. EFS works in conjunction with the EFS service, Microsoft's CryptoAPI and the EFS File System Run-Time Library (FSRTL). As of February 2004, its encryption has not been compromised.

EFS works by encrypting a file with a bulk symmetric key (also known as the File Encryption Key, or FEK), which is used because it takes relatively less time to encrypt and decrypt large amounts of data than if an asymmetric key cipher is used. The symmetric key that is used to encrypt the file is then encrypted with a public key that is associated with the user who encrypted the file, and this encrypted data is stored in the header of the encrypted file. To decrypt the file, the file system uses the private key of the user to decrypt the symmetric key that is stored in the file header. It then uses the symmetric key to decrypt the file. Because this is done at the file system level, it is transparent to the user. Also, in case a user loses access to their key, support for recovery agents that can decrypt files has been built into the EFS system.
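The hybrid (FEK-based) scheme can be illustrated with the toy program below. It is purely conceptual and is not the actual EFS implementation: XOR stands in for both the symmetric cipher and the key-wrapping step simply to keep the sketch short, and the key values are invented.

/* Toy illustration of the hybrid scheme EFS uses: bulk data is encrypted
   with a random symmetric File Encryption Key (FEK), and only the small FEK
   is then wrapped with the user's key and stored in the file header.
   XOR stands in for both ciphers purely for brevity; it is NOT encryption. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static void toy_cipher(unsigned char *buf, size_t len, unsigned char key) {
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;                       /* stand-in for a real cipher */
}

int main(void) {
    unsigned char file_data[] = "confidential report";
    size_t len = strlen((char *)file_data);

    srand((unsigned)time(NULL));
    unsigned char fek = (unsigned char)(rand() & 0xFF);   /* bulk symmetric key            */
    unsigned char user_key = 0x5A;                        /* stand-in for the user's key pair */

    toy_cipher(file_data, len, fek);            /* 1. encrypt the file data with the FEK       */
    unsigned char wrapped_fek = fek ^ user_key; /* 2. wrap the FEK; stored in the file header  */

    /* Decryption reverses the steps: unwrap the FEK, then decrypt the data. */
    unsigned char recovered_fek = wrapped_fek ^ user_key;
    toy_cipher(file_data, len, recovered_fek);
    printf("recovered: %s\n", file_data);
    return 0;
}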

Basic and dynamic disk storage


Windows 2000 introduced the Logical Disk Manager for dynamic storage. All versions of Windows 2000 support three types of dynamic disk volumes (along with basic storage): simple volumes, spanned volumes and striped volumes.

Simple volume: a volume with disk space from one disk.

Spanned volume: a volume spanning multiple disks, up to 32 disks. If one disk fails, all data in the volume is lost.

Striped volume: also known as RAID-0, a striped volume stores all its data across several disks in stripes. This allows better performance because disk reads and writes are balanced across multiple disks.

Windows 2000 also added support for the iSCSI protocol.

Accessibility support

The Windows 2000 on-screen keyboard allows users who have problems with using the keyboard to use a mouse to input text. Microsoft made an effort to increase the usability of Windows 2000 for people with visual and auditory impairments and other disabilities. They included several utilities designed to make the system more accessible:

FilterKeys: a group of keyboard-related supports for people with typing issues, which include:
SlowKeys: Windows is told to disregard keystrokes that are not held down for a certain time period.
BounceKeys: causes multiple keystrokes of one key within a certain timeframe to be ignored.
RepeatKeys: allows users to slow down the rate at which keys are repeated via the keyboard's key-repeat feature.
ToggleKeys: when turned on, Windows will play a sound when either the CAPS LOCK, NUM LOCK or SCROLL LOCK keys are pressed.

MouseKeys: allows the cursor to be moved around the screen via the numeric keypad instead of the mouse.

On-screen keyboard: assists those who are not familiar with a given keyboard by allowing them to use a mouse to enter characters to the screen.

SerialKeys: gives Windows 2000 the ability to support speech augmentation devices.

StickyKeys: makes modifier keys (ALT, CTRL and SHIFT) become "sticky"; in other words, a user can press the modifier key, release that key and then press the combination key. Normally the modifier key must remain pressed down to activate the sequence.

On-screen magnifier: assists users with visual impairments by magnifying the part of the screen they place their mouse over.

Narrator: Microsoft Narrator assists users with visual impairments with system messages; when these appear, the Narrator reads them out via the sound system.

High contrast theme: to assist users with visual impairments.

SoundSentry: designed to help users with auditory impairments; Windows 2000 will show a visual effect when a sound is played through the sound system.

Language & locale support


Windows 2000 has support for many languages other than English. It supports Arabic, Armenian, Baltic, Central European, Cyrillic, Georgian, Greek, Hebrew, Indic, Japanese, Korean, Simplified Chinese, Thai, Traditional Chinese, Turkic, Vietnamese and Western European languages. It also has support for many different locales, a list of which can be found on Microsoft's website.

System utilities
Windows 2000 introduced the Microsoft Management Console (MMC), which is used to create, save, and open administrative tools and to administer Windows 2000 computers. Each of the tools is called a console, and most consoles allow an administrator to administer other Windows 2000 computers from one centralised computer. Each console can contain one or many specific administrative tools, called snap-ins. Snap-ins can be either standalone (performing one function) or extensions (adding functionality to an existing snap-in).

In order to provide the ability to control what snap-ins can be seen in a console, the MMC allows consoles to be created in author mode or in user mode. Author mode allows snap-ins to be added, new windows to be created, all portions of the console tree to be displayed and consoles to be saved. User mode allows consoles to be distributed with restrictions applied. User mode consoles can be granted full access, so that users can make whatever changes they desire; limited access, so that users cannot add to the console but can view multiple windows in a console; or limited access such that users cannot add to the console and also cannot view multiple windows in a console. The Windows 2000 Computer Management console is capable of performing many system tasks, such as starting a disk defragmentation.

The main tools that come with Windows 2000 can be found in the Computer Management console (under Administrative Tools in the Control Panel). It contains the Event Viewer (a means of viewing events, the Windows equivalent of a log file), a system information viewer, the ability to view open shared folders and shared folder sessions, a device manager, and a tool to view all the local users and groups on the Windows 2000 computer. It also contains a disk management snap-in, which includes a disk defragmenter as well as other disk management utilities, and a services viewer, which allows users to view all installed services, stop and start them on demand, and configure what those services should do when the computer starts.

Registry editors: Windows 2000 comes bundled with two utilities for editing the Windows registry. One behaves like the Windows 9x REGEDIT.EXE program, and the other can edit registry permissions in the same manner as Windows NT's REGEDT32.EXE program. REGEDIT.EXE has a left-side tree view that begins at My Computer and lists all loaded hives; REGEDT32.EXE also has a left-side tree view, but each hive has its own window, so the tree displays only keys. REGEDIT.EXE represents the three components of a value (its name, type and data) as separate columns of a table, whereas REGEDT32.EXE represents them as a list of strings. REGEDIT.EXE was written for the Win32 API and supports right-clicking entries in the tree view to adjust properties and other settings; REGEDT32.EXE was also written for the Win32 API but requires all actions to be performed from the top menu bar. Because REGEDIT.EXE was directly ported from Windows 98, it does not support permission editing (permissions do not exist in Windows 9x). Therefore, the only way to access the full functionality of an NT registry was with REGEDT32.EXE, which uses the older multiple document interface (MDI) that newer versions of regedit no longer use. Windows XP was the first system to integrate these two programs into one, adopting the REGEDIT.EXE behaviour and adding the NT-specific functionality.

The System File Checker (SFC) also comes bundled with Windows 2000. It is a command-line utility that scans system files, verifies whether they were signed by Microsoft, and works in conjunction with the Windows File Protection mechanism. It can also repopulate and repair all the files in the Dllcache folder.
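To illustrate the name/type/data structure that the registry editors described above display, the short sketch below uses Python's standard winreg module (available only when the interpreter runs on Windows) to read one well-known value; treat it as an illustration of the registry's hive/key/value layout rather than as an administration tool.

    import winreg  # standard-library access to the Windows registry (Windows only)

    # Open a well-known key under the HKEY_LOCAL_MACHINE hive, read-only.
    key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                         r"SOFTWARE\Microsoft\Windows NT\CurrentVersion")

    # Every registry value has a name, a type and data (the three columns
    # that REGEDIT.EXE shows as a table).
    data, value_type = winreg.QueryValueEx(key, "ProductName")
    print("name: ProductName  type:", value_type, " data:", data)

    winreg.CloseKey(key)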

Recovery Console

The Recovery Console is an application that runs from outside the installed copy of Windows and enables a user to perform maintenance tasks that cannot be run from inside the installed copy, or cannot feasibly be run from another computer or copy of Windows 2000. It is usually used to recover the system from errors that cause booting to fail, which would render other tools useless.

The Recovery Console presents itself as a simple command-line interface. Its commands are limited to ones for checking and repairing the hard drive(s), repairing boot information (including NTLDR), replacing corrupted system files with fresh copies from the CD, and enabling or disabling services and drivers for the next boot. The console can be accessed in one of two ways: by starting from the Windows 2000 CD and choosing to enter the Recovery Console, or by installing the Recovery Console via Winnt32.exe with the /cmdcons switch. In the latter case, however, the console can only be used if the system still boots to the point where NTLDR can start it.

Server family functionality


The Windows 2000 server family consists of Windows 2000 Server, Windows 2000 Advanced Server and Windows 2000 Datacenter Server. All editions of Windows 2000 Server have the following services and functionality built in:

Routing and Remote Access Service (RRAS) support, facilitating dial-up and VPN connections, RADIUS authentication, network connection sharing, Network Address Translation, and unicast and multicast routing.
A DNS server, including support for Dynamic DNS; Active Directory relies heavily on DNS.
The Microsoft Connection Manager Administration Kit and Connection Point Services.
Support for the Distributed File System (DFS).
Hierarchical Storage Management support, a service that works in conjunction with NTFS to automatically move files that have not been used for some period of time to less expensive storage media.
Fault-tolerant volumes, namely mirrored and RAID-5 volumes.
Group Policy (part of Active Directory).

Distributed File System


The Distributed File System (DFS) allows shares in multiple different locations to be logically grouped under one folder, the DFS root. When users try to access a share that exists under the DFS root, they are really looking at a DFS link, and the DFS server transparently redirects them to the correct file server and share. A DFS root can only exist on a Windows 2000 version that is part of the server family, and only one DFS root can exist on a given server.

There are two ways of implementing DFS on Windows 2000: stand-alone DFS and domain-based DFS. Stand-alone DFS allows only DFS roots that exist on the local computer, and thus does not use Active Directory. Domain-based DFS roots exist within Active Directory and can have their information distributed to other domain controllers within the domain, which provides fault tolerance to DFS. DFS roots that exist on a domain must be hosted on a domain controller or on a domain member server. The file and root information is replicated via the Microsoft File Replication Service (FRS).
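The redirection performed by a DFS root can be pictured as a lookup table from logical link names to physical shares. The sketch below models that idea in Python purely as a conceptual aid; the server and share names are invented, and real DFS referrals are handled by the server and client software, not by application code.

    # Conceptual model of a DFS root: logical links under the root map to
    # physical shares, and the client is transparently redirected to them.
    dfs_links = {
        r"\\example.com\public\reports": r"\\fileserver1\reports",
        r"\\example.com\public\software": r"\\fileserver2\software",
    }

    def resolve(path):
        """Return the physical share a DFS link points to, or the path unchanged."""
        return dfs_links.get(path, path)

    print(resolve(r"\\example.com\public\reports"))  # prints \\fileserver1\reports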

Active Directory
Active Directory allows administrators to assign enterprise-wide policies, deploy programs to many computers and apply critical updates to an entire organisation, and is one of the main reasons why many corporations moved to Windows 2000. Active Directory stores information about its users and can act in a similar manner to a phone book, allowing all of the information and computer settings of an organisation to be stored in a central, organised database. Active Directory networks can vary from a small installation with a few hundred objects to a large installation with millions of objects. Active Directory can organise groups of resources into a single domain and can link domains that share a contiguous namespace into trees; groups of trees that do not share the same namespace can be linked together to form forests. Active Directory can only be installed on a Windows 2000 Server, Advanced Server or Datacenter Server computer, and cannot be installed on a Windows 2000 Professional computer. It requires that a DNS service supporting SRV resource records be installed, or that an existing DNS infrastructure be upgraded to support this functionality, and it requires one or more domain controllers to hold the Active Directory database and provide directory services.
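Clients locate domain controllers through DNS SRV records of the form _ldap._tcp.<domain>. The sketch below shows roughly what such a lookup looks like using the third-party dnspython package (an assumption made here for illustration; Windows 2000 clients perform the equivalent query internally), with a placeholder domain name.

    import dns.resolver  # third-party dnspython package, assumed to be installed

    # Ask DNS which hosts offer LDAP, i.e. which domain controllers serve the
    # hypothetical Active Directory domain example.com.
    answers = dns.resolver.resolve("_ldap._tcp.example.com", "SRV")

    for record in answers:
        # Each SRV record carries a priority, a weight, a port and a target host.
        print(record.priority, record.weight, record.port, record.target)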

Volume fault tolerance


Along with support for simple, spanned and striped volumes, the server family of Windows 2000 also supports fault-tolerant volume types. The types supported are mirrored volumes and RAID-5 volumes.

Mirrored volumes: data written to one disk is mirrored to a second disk, so if one disk fails the data can be recovered in full from the other. Mirrored volumes are also known as RAID-1.
RAID-5 volumes: a RAID-5 volume consists of multiple disks and uses block-level striping with parity data distributed across all member disks. Should a disk in the array fail, the parity blocks on the surviving disks are combined mathematically with the data blocks on the surviving disks to reconstruct the data of the failed drive on the fly, although performance is reduced while the array is degraded.
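The parity mathematics behind RAID-5 is plain bitwise XOR: the parity block of a stripe is the XOR of its data blocks, and any single lost block equals the XOR of all surviving blocks. The sketch below demonstrates the principle with made-up byte strings; it is a conceptual illustration, not a description of how the Windows 2000 volume manager is implemented.

    def xor_blocks(*blocks):
        """Bytewise XOR of equally sized blocks."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    # Three data blocks in one stripe and their parity block.
    data1, data2, data3 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks(data1, data2, data3)

    # Simulate losing the disk that held data2 and rebuild it from the rest.
    rebuilt = xor_blocks(data1, data3, parity)
    assert rebuilt == data2
    print("recovered block:", rebuilt)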

Versions

Windows 2000 Professional was designed as the desktop operating system for businesses and power users. It is the basic edition of Windows 2000 and the most common. It offers greater security and stability than many of the previous Windows desktop operating systems, supports up to two processors and can address up to 4 GB of RAM.

The Windows 2000 Server products share the same user interface as Windows 2000 Professional but contain additional components for running infrastructure and application software. A significant component of the server products is Active Directory, an enterprise-wide directory service based on LDAP. Additionally, Microsoft integrated Kerberos network authentication, replacing the often-criticised NTLM authentication system used in previous versions. This also provided purely transitive trust relationships between Windows 2000 domains in a forest (a collection of one or more Windows 2000 domains that share a common schema, configuration and global catalogue and are linked with two-way transitive trusts). Furthermore, Windows 2000 introduced a DNS server which allows dynamic registration of IP addresses.

Windows 2000 Advanced Server is a variant of the Windows 2000 Server operating system designed for medium-to-large businesses. It offers a clustering infrastructure for high availability and scalability of applications and services, including main memory support of up to 8 gigabytes (GB) on Physical Address Extension (PAE) systems and the ability to do 8-way SMP. It supports TCP/IP load balancing and enhanced two-node server clusters based on the Microsoft Cluster Server (MSCS) from Windows NT Server 4.0 Enterprise Edition, along with failover and load balancing. A limited-edition 64-bit version of Windows 2000 Advanced Server was made available through the OEM channel.

Windows 2000 Datacenter Server is a variant of Windows 2000 Server designed for large businesses that move large quantities of confidential or sensitive data frequently via a central server. As with Advanced Server, it supports clustering, failover and load balancing. Its minimum system requirements are modest, but the edition scales to very large configurations: a Pentium-class CPU at 400 MHz or higher (up to 32 processors are supported in one machine), 256 MB of RAM (up to 64 GB is supported in one machine) and approximately 1 GB of available disk space.

Deployment
Windows 2000 can be deployed to a site via various methods. It can be installed onto servers from traditional media (such as CD) or from distribution folders that reside on a shared network folder. Installations can be attended or unattended. An attended installation requires an operator to manually choose options while the operating system is being installed. An unattended installation is scripted via an answer file, a predefined script in the form of an INI file in which all the options have been filled in already; the Winnt.exe or Winnt32.exe program then uses that answer file to automate the installation. Unattended installations can be performed from a bootable CD, using Microsoft Systems Management Server (SMS), via the System Preparation Tool (Sysprep), by running the Winnt32.exe program with the /syspart switch, or via the Remote Installation Service (RIS).
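As a rough picture of what an answer file contains, the sketch below uses Python's configparser module to write a small INI-style file; the section and key names are simplified, illustrative examples chosen for this sketch and should not be taken as the exact schema Windows 2000 Setup expects.

    import configparser

    answer = configparser.ConfigParser()
    answer.optionxform = str  # keep the capitalisation of key names as written

    # Simplified, illustrative sections and keys (not the exact Setup schema).
    answer["Unattended"] = {"UnattendMode": "FullUnattended"}
    answer["UserData"] = {
        "FullName": "Example User",
        "OrgName": "Example Org",
        "ComputerName": "WIN2K-01",
    }

    with open("unattend.txt", "w") as handle:
        answer.write(handle)

    # Setup (Winnt.exe or Winnt32.exe) would then be pointed at a file like this,
    # for example via the /unattend switch mentioned later in the text.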

The Syspart method is started on a standardised reference computer (though the hardware need not be similar to the target's) and copies the required installation files from the reference computer's hard drive to the target computer's hard drive. The hard drive does not need to be in the target computer during this step and may be moved to it later, with hardware configuration still taking place afterwards. The Winnt32.exe program must also be passed a /unattend switch that points to a valid answer file and a /s switch that points to the location of one or more valid installation sources.

Sysprep allows a disk image of an existing Windows 2000 Server installation to be duplicated to multiple servers. This means that all applications and system configuration settings are copied across to the new Windows 2000 installations, but it also means that the reference and target computers must have the same HALs, ACPI support and mass storage devices, though Windows 2000 automatically detects plug-and-play devices. The primary reason for using Sysprep is to deploy Windows 2000 to a site that has standard hardware and needs a fast way of installing Windows 2000 on those computers. If systems have different HALs, mass storage devices or ACPI support, multiple images need to be maintained.

Systems Management Server (SMS) can be used to upgrade multiple systems to Windows 2000. The systems upgraded in this way must be running a version of Windows that can be upgraded (Windows NT 3.51, Windows NT 4, Windows 98 or Windows 95 OSR2.x) and must be running the SMS client agent that can receive software installation operations. Using SMS allows installations to happen over a wide geographical area and provides centralised control over upgrades to systems.

Remote Installation Services (RIS) provides a means of automatically installing Windows 2000 Professional (not Windows 2000 Server) on a local computer over the network from a central server. Images do not have to support specific hardware configurations, and the security settings can be configured after the computer reboots, as the service generates a new unique security identifier (SID) for the machine. This is required so that local accounts are given the right identifier and do not clash with those of other Windows 2000 Professional computers on the network. RIS requires that client computers be able to boot over the network, either via a network interface card that has a Pre-Boot Execution Environment (PXE) boot ROM or via a network card supported by the remote boot disk generator. The remote computer must also meet the Net PC specification. The server that RIS runs on must be a Windows 2000 Server, and it must be able to access a DNS service, a DHCP service and Active Directory.

Novell's NDS eDirectory is a cross-platform directory solution that runs on Windows NT 4, Windows 2000 (when available), Solaris and NetWare 5, whereas Active Directory supports only the Windows 2000 environment. Novell promoted eDirectory as a more trusted, reliable and mature directory service for managing e-business relationships than the first (1.0) release of Active Directory.
