
Fundamentals of Database Layout

Version 2.0, August 2000

Copyright 2000 SAP AG. All rights reserved. No part of this brochure may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice. Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. Microsoft, WINDOWS, NT, EXCEL and SQL Server are registered trademarks of Microsoft Corporation. IBM, DB2, OS/2, DB2/6000, Parallel Sysplex, MVS/ESA, RS/6000, AIX, S/390, AS/400, OS/390, and OS/400 are registered trademarks of IBM Corporation. OSF/Motif is a registered trademark of the Open Software Foundation. ORACLE is a registered trademark of ORACLE Corporation, California, USA. INFORMIX-OnLine for SAP is a registered trademark of Informix Software Incorporated. UNIX and X/Open are registered trademarks of SCO Santa Cruz Operation. SAP, R/2, R/3, RIVA, ABAP, SAPoffice, SAPmail, SAPaccess, SAP-EDI, SAP ArchiveLink, SAP EarlyWatch, SAP Business Workflow, SAP Retail, ALE/WEB, SAPTRONIC, SAPDB and mySAP.com are registered trademarks of SAP AG.

SAP AG · Neurottstraße 16 · 69190 Walldorf · Germany

Table of Contents

1. Motivation
   1.1 Target Group
   1.2 Further Material
   1.3 Overview
2. Introduction
   2.1 Architecture of DBMS
   2.2 R/3 Architecture
3. Hardware Layout
   3.1 General Hardware Architecture
   3.2 Data Transfer Mechanisms
      3.2.1 SCSI
      3.2.2 Fibre Channel
   3.3 Disk Subsystem
      3.3.1 General Overview
      3.3.2 Caching
      3.3.3 Striping
      3.3.4 Disk Mirroring
      3.3.5 RAID
4. Operating System
   4.1 Device System
   4.2 OS Striping
5. Database System
   5.1 Common Characteristics of Database Systems
   5.2 Distributing the Storage
      5.2.1 General Considerations
      5.2.2 Striping
      5.2.3 Analyzing I/O Requirements
   5.3 Runtime Considerations
      5.3.1 Logfile
      5.3.2 Avoiding Dynamic Space Management
      5.3.3 I/O Access
      5.3.4 Parallelism
6. DB2 Universal Database (Unix and NT)
   6.1 Physical and Logical DB Components
      6.1.1 DB2 UDB Enterprise Edition (EE) Concepts
      6.1.2 DB2 UDB Enterprise-Extended Edition (EEE) Concepts
   6.2 Disk Layout
   6.3 I/O Access
   6.4 Parallelism
   6.5 Specific Features
7. DB2 UDB for OS/390
   7.1 Physical and Logical DB Components
   7.2 Disk Layout
   7.3 I/O Access
   7.4 Parallelism
   7.5 Specific Features
8. Informix
   8.1 Physical and Logical DB Components
   8.2 Disk Layout
   8.3 I/O Access
   8.4 Parallelism
   8.5 Specific Features
9. Oracle
   9.1 Physical and Logical DB Components
   9.2 Disk Layout
   9.3 I/O Access
   9.4 Parallelism
   9.5 Specific Features
10. SAP DB
   10.1 Physical and Logical DB Components
   10.2 Disk Layout
   10.3 I/O Access
   10.4 Parallelism
   10.5 Specific Features
11. SQL Server
   11.1 Physical and Logical DB Components
   11.2 Disk Layout
   11.3 I/O Access
   11.4 Parallelism
   11.5 Specific Features
Appendix
A. Terminology in DB Systems
B. SSA
C. References
D. Introduction to Modeling Diagrams
Index

Table of Figures

Figure 1   General DBMS architecture
Figure 2   R/3 overview
Figure 3   General layout for a host with connected disks
Figure 4   SCSI bus architecture
Figure 5   System hardware layout using SCSI storage devices
Figure 6   FC-AL loop architecture
Figure 7   Disk pack
Figure 8   2 stripe sets striped across 3 disks
Figure 9   Overview of Unix devices, example HP-UX
Figure 10  Example for a host-based striping
Figure 11  Sample DB layout
Figure 12  SMP (left) and cluster/MPP configuration (right)
Figure 13  DB2 EE database components
Figure 14  Typical DB2 configuration for R/3
Figure 15  DB2 EEE database components
Figure 16  Typical DB2 configuration: BW
Figure 17  DB2/390 database components
Figure 18  Structure of an Informix database
Figure 19  Structure of an Oracle database
Figure 20  Using partitioning with striping
Figure 21  Structure of a SAP DB database
Figure 22  Structure of an SQL Server database
Figure 23  SSA loop architecture

Table of Tables

Table 1  SCSI types and their technical properties
Table 2  RAID properties
Table 3  Consequences of the RAID properties
Table 4  Terminology in DB systems


1. Motivation

1.1 Target Group


This paper aims to provide background material for the following groups of persons:

Basis Consultants
- SAP basis consultants who wish to gain a deeper understanding of the larger context of database layout. This wish could result from evaluating different system choices, e.g. for the hardware, where it helps to understand how the hardware is accessed by the software components. A recommended reading path is to start with the Introduction (Chapter 2), study the Disk Subsystem (section 3.3), and then read the corresponding DBMS specifics (in alphabetical order: Chapter 6 (DB2 Universal Database (Unix and NT)), Chapter 7 (DB2 UDB for OS/390), Chapter 8 (Informix), Chapter 9 (Oracle), Chapter 10 (SAP DB), or Chapter 11 (SQL Server)).

Partners
- SAP (hardware and software) partners who need to implement a database layout at the customer site. For them, the paper provides additional or complementary information alongside other sources such as training, training materials, documentation, implementation guides or consultation. As opposed to the very detailed information found e.g. in implementation guides, this paper focuses on the underlying hardware and software concepts.

SAP Engineers
- SAP engineers working on implementations of mission-critical applications with high data volume, e.g. for the SAP Retail and CP/AFS solutions. The information can provide a better understanding of possible bottlenecks in the installed system at the customer site, and it can show where software architecture on the database level can help to improve system performance (e.g. by configuring parallel I/O handling).

This paper tries to analyze system requirements from an independent point of view. It points out alternatives concerning the system hardware layout, and summarizes important functional features of all database management systems in use for SAP software systems (e.g. R/3 or the SAP Business Information Warehouse).

1.2 Further Material


After having worked through this paper, the reader should be able to tackle more specialized information sources. Here is an overview:

Additional Material
- The white paper Optimizing SAP R/3 on Symmetrix: A Best Practices Guide (available on SAPNet, use the SAPNet search).
- The database implementation guides (available in SAPNet, following the links under Services [Online Services], Installation/Upgrade [Installation/Upgrade Guides]).
- The database training materials (courses BC505, BC510, BC515, BC520, BC525, BC535).
- The material enumerated in appendix C, References.

1.3 Overview
DB Layout

Why is it sensible to care about DB (database) layout? The reason is that many applications are inherently limited by disk input/output (I/O): CPU activity must often be suspended while I/O activity completes. Such an application is said to be I/O bound. R/3 depends heavily on database processing and therefore is an I/O bound application. A good physical layout is a prerequisite for good I/O performance, availability, manageability and maintainable data growth.

System Platforms
The paper focuses on the issues connected with the physical layout of a database, especially from the perspective of high performance. In order to select a performant database platform, a combination of three platform decisions has to be made:

Hardware
1) The hardware platform. On the one hand, the database server needs processing power; on the other hand, a good I/O subsystem that can handle a high volume of data is essential. The connection between the two systems has to be considered carefully, taking into account where potential benefits can arise from parallelizing the data transfer. Besides traditional SCSI technology (section 3.2.1), newer technologies (FC-AL, section 3.2.2, and SSA, appendix B) are considered.

Operating System
2) The operating system (OS). The paper focuses less on the OS, but an overview of HP-UX device handling is provided. Similar designs apply to other UNIX derivatives.

Database Management System
3) The database management system (DBMS). This paper examines all R/3 DBMS, namely (in alphabetical order) DB2 Universal Database (Unix and NT; Chapter 6), DB2 UDB for OS/390 (Chapter 7), Informix (Chapter 8), Oracle (Chapter 9), SAP DB (Chapter 10) and SQL Server (Chapter 11). The general concepts of the physical layout are explained, and the terminology and the most striking differences in conception are compared.

The paper is structured along these three factors. It is not intended to provide all details for platform considerations, but should be regarded as an introduction that shows how the different issues are interrelated.


Special Topics

The topics are not confined to laying out the database, because a good layout requires some knowledge of the basic principles of how the database works. For example, some databases can only parallelize queries efficiently if the data is already partitioned according to some specified rule. Therefore, architectural issues (for all platforms) are also considered in this paper.


2. Introduction

DBMS Architecture

In this paper, a basic understanding of the architecture of database management systems (DBMS) is assumed; a rough overview of the concepts of a DBMS is nevertheless provided in section 2.1. Although most statements in this paper are independent of any specific application, some of the recommendations in Chapter 4 are specific to SAP R/3, and the DBMS described here are those supported by R/3.

R/3 Architecture
Therefore, a basic overview of the R/3 architecture is helpful; it is given in section 2.2.

2.1 Architecture of DBMS


DBMS Architecture

A general overview of the principal building blocks of a DBMS is shown in Figure 1.


[Figure: DB applications connect to DB server processes from the DB server pool; the processes share the database cache (parsed SQL statements, table buffers, log buffer) and the shared DB services (DB writer, transaction log writer, backup writer), which access the database files, the config file, the transaction logfile and the backup files.]

Figure 1  General DBMS architecture

Executing a Client Request

Applications using the DBMS for central data storage are shown at the top of Figure 1. When connecting to the DBMS, they are assigned to a DB server process. DB server processes are responsible for parsing the client's SQL request and executing it. For reuse, the parsing information is stored in the parsed SQL statements area of the DB cache (or, in the case of SQL Server, as stored procedures).

Read Requests
When requested data is read from a database file, where the table and index data is stored, the data is written to the database cache (into the table buffers area). Subsequent requests for that data can then be fulfilled directly from the cache.
Write Requests

For write requests (e.g. updates, inserts), too, the changes are first written to the cache. Only after specific intervals, or if space is needed in the cache, are these changes written to the database files by the DB writer. The DB writer is an example of a shared DB service, which means that all DB server processes share the processes running a DB service. A DBMS can provide many DB services; only three are shown here.

Log

The DBMS also administers a log that records all changes to the data. While a DB transaction is not yet completed, this data is written to the log buffer. On commit, the transaction log writer writes the relevant transaction data to the logfiles on disk. With the help of the logfiles, a recovery (e.g. in case of a system crash) can be performed by the DBMS.

Backup
Another typical DB service is the writing of the backup files, performed by the backup writer. As the name suggests, backup files store the contents of the logfiles permanently; this, too, is essential for restoring the database after system crashes or hardware failures. The backup files storing logs (i.e. copies of logfiles) are sometimes called archives or archive files.
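To make the interplay of cache, transaction log writer and DB writer concrete, the following Python fragment models it in a strongly simplified way. It is a minimal sketch for illustration only; all class and method names are invented and do not correspond to any actual DBMS implementation.

```python
# Minimal, hypothetical model of the cache / log writer / DB writer interplay.
# All names are invented for illustration only.

class MiniDBMS:
    def __init__(self):
        self.cache = {}        # table buffers: page id -> current contents
        self.dirty = set()     # pages changed in cache but not yet on disk
        self.log_buffer = []   # change records of the running transaction
        self.logfile = []      # persistent log (written on commit)
        self.datafile = {}     # database file: page id -> contents on disk

    def write(self, page, value):
        """A change is first applied to the cache only (no data-file I/O yet)."""
        self.cache[page] = value
        self.dirty.add(page)
        self.log_buffer.append((page, value))

    def commit(self):
        """On commit, the transaction log writer persists the log records."""
        self.logfile.extend(self.log_buffer)
        self.log_buffer.clear()

    def flush(self):
        """The DB writer later copies dirty pages to the database files,
        e.g. at specific intervals or when cache space is needed."""
        for page in self.dirty:
            self.datafile[page] = self.cache[page]
        self.dirty.clear()

db = MiniDBMS()
db.write("page42", "new row")
db.commit()    # the change is now recoverable from the log
db.flush()     # the data page reaches the database file later
```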

2.2 R/3 Architecture


This section presents a rough overview of the architecture of the R/3 basis system.
Three-Tier Architecture
Figure 2 shows an example of an R/3 system. Each R/3 system can be divided into at least three layers¹: the presentation, application and database layer. On the database layer, there is one database system. The application layer is made up of several application servers, one message server and one enqueue server. Figure 2 shows only one application server; the message server is not shown. On the presentation layer, there are several SAP GUIs.

Database System
The database system consists of the database management system and the database itself. The database is the central storage place for different types of information. It does not only contain all data from the business applications; for example, it also contains the application programs written in ABAP.

SAP GUI
The presentation layer is the interface between the R/3 system and its users. The entire dialog processing between a user logon and logoff is called a logon session. For each logon session, one SAP GUI is created. The SAP GUI is the graphical user interface for entering and displaying data within the corresponding logon session. A logon session consists of up to six main modes, each of which appears as an R/3 window. Initially, when a user logs on, the logon session consists of only one main mode, and the user can open up to five further main modes. Within each main mode, the user can run an application program. The different main modes allow the user to run different applications in parallel, independently of one another.

¹ The standard R/3 system has 3 layers, and 4 or more if ITS is part of the system.

[Figure: presentation layer with several SAP GUIs; application layer with an application server consisting of gateway, dispatcher, work processes (taskhandler, screen processor, ABAP processor, database interface, local memory) and shared memory (program buffer, table buffer, communication areas, user contexts); database layer with the database management system and the database.]

Figure 2  R/3 overview

Dialog Step

When a user runs an application program that requires user interaction, he navigates through a sequence of screens. The program logic in an application program that occurs between two screens is known as a dialog step. The dialog steps are executed within the application layer. Actually, when a user runs a program, control of the program is continually passed backwards and forwards between the layers. When a screen is ready for user input, the SAP GUI is active, and the application server is inactive with regard to that particular program, but free for other tasks. Once the user has entered data on the screen, program control passes back to the application server. Now the SAP GUI is inactive: it is still displaying the screen, but it cannot accept user input within this screen. The SAP GUI does not become active again until the application program has prepared a new screen and sent it to the SAP GUI.
Application Server

Each application server consists of the gateway, the dispatcher and the work processes. The gateway is the interface for the communication with other application servers in the same R/3 system and with other SAP and non-SAP systems. The dispatcher is the link between the work processes and the SAP GUIs.

Types of Work Processes
There are several types of work processes, namely dialog, background, update, enqueue and spool work processes. All work processes have the same components; the type of a work process only determines the kind of tasks for which it is responsible in the application server.

Dialog Work Processes
The dialog work processes execute the dialog steps of the application programs that run within the logon sessions. The dispatcher has the important task of distributing all dialog steps among the dialog work processes on the application server. It is important to note that the individual dialog steps of a program can be executed on different dialog work processes. The dialog work processes do not only execute the programs started by a user within a logon session; they also execute the functions that are called from other R/3 systems by a remote function call (RFC). The other types of work processes are not considered here in detail.

Work Process Architecture
The main components of a dialog work process are the taskhandler, the screen processor, the ABAP processor and the database interface. All requests sent to a work process are first handled by the taskhandler. A request to process a dialog step is forwarded to the screen processor. ABAP programs consist of two parts, the screens and the ABAP modules. As well as the actual input mask, a screen also contains flow logic. The screen processor executes the screen flow logic and calls ABAP modules; it tells the ABAP processor which ABAP module should be processed next. The ABAP processor executes the ABAP modules and communicates with the database interface.
Memory Management

Figure 2 shows two memory areas, namely the work process local memory and the shared memory. While the work process local memory can be accessed only by the work process itself, the shared memory can be accessed by all processes of an application server. For example, the shared memory contains buffers for the application programs, which are stored in the database, and for the application data in the database tables. Buffering data in the shared memory of the application server reduces the number of database calls required. This reduces access times for application programs and the load on the DBMS considerably. Moreover, the shared memory contains areas which are used for the communication within the application server, in particular for the communication between the dispatcher and the work processes.
User Context

Another important part of the shared memory is the user contexts. A user context is created for each logon session. It contains, for example, the authorizations of the user who has logged on and the current data of the application programs that run within that logon session. This includes the screens, the values of the ABAP variables and the contents of the internal tables and extracts. Since the individual dialog steps of a program can be executed on different dialog work processes, the user context is normally placed in shared memory (as shown in Figure 2). Each RFC session also has its own user context; like the user context of a logon session, it contains the user authorizations and the current data of the programs that run within that RFC session.


3. Hardware Layout

3.1 General Hardware Architecture


General Hardware Layout

Figure 3 gives a general overview of the hardware components involved in processing applications which use I/O requests to store data. Examples of such applications are OLTP (Online Transaction Processing) applications like R/3.
[Figure: the main processing unit contains a memory unit (CPU, memory controller, main memory), connected via an internal bus (e.g. PCI bus) to a bus controller unit (DMA controller, bus controller); the bus connects to disk controllers with their disk devices.]

Figure 3  General layout for a host with connected disks

Processing I/O Requests

While the application logic (i.e. the DBMS program) is processed by the CPU, data has to be transferred from disk to main memory and vice versa. This is normally done by forwarding the I/O request to a DMA controller that uses a bus system to access the underlying disk devices where the data is stored. The DMA controller requests the bus controller to send the request via the bus to the specified disk device. There, a disk controller communicates with the bus controller to receive the request parameters. The request can then be processed by the disk controller by writing to or reading from disk. When the I/O is completed, the results are returned to the bus controller, and from there to the DMA controller, which transfers the data to main memory.
Direct Memory Access

DMA stands for direct memory access and is performed by the DMA controller. The DMA controller can either be a basic part of the host system (e.g. a PC has a DMA controller in the chipset located on the motherboard) or part of an extended unit (e.g. the SCSI controller card).

Access Models
There are different models of how disk data is transferred into main memory:
1) Programmed I/O: the CPU reads/writes the I/O data and waits for the I/O device operation to finish (CPU busy-waiting).
2) Interrupt-driven I/O: the CPU reads/writes the I/O data and is interrupted once the I/O device operation is completed. Using interrupts, the CPU can process different tasks while the I/O is processed independently.
3) The I/O access is delegated to a separate system component, the DMA, which is able to transfer whole data blocks between memory and the device. After the I/O operation is completed, the DMA notifies the CPU (by interrupt).
4) The I/O access is delegated to an I/O processor (a.k.a. I/O channel), which is a processor in its own right and can therefore perform a series of stored I/O operations and acknowledge once all operations are completed. There are two variants: either the I/O operations are stored in main memory, or the I/O processor has a memory of its own and can process them locally.
As a rule, most systems use DMA for data transfer, but larger systems may use I/O processors.

Cycle Stealing
As the DMA and the CPU have to use the same bus to access main memory, conflicts occur. When the DMA has to transfer data to memory, it can temporarily disable the CPU's access to memory (called cycle stealing).

DMA Assignment
As a rule, in order to fully utilize the throughput of the bus controllers receiving their data, DMAs have to be assigned to the bus controllers. The following sections describe two essential system components: the bus controller unit (section 3.2, Data Transfer Mechanisms) and the disk subsystems themselves (section 3.3, Disk Subsystem).

3.2 Data Transfer Mechanisms


Interface

To understand the different technologies for data transmission, an understanding of the concept of an interface is needed. In general, an interface is defined as a boundary across which two systems (hardware or software) communicate. In the following, an interface is a hardware or software data transmission mechanism that manages the exchange of data between certain devices and a computer (i.e. the interface determines the rules of data transmission between the computer on the one hand and the devices on the other hand). For data storage, the device will be a disk drive or a disk subsystem. Physically, the interface is implemented by microprocessor chips on the motherboard, on adapter cards or on the disk drive itself. Standards committees drive the adoption of interfaces such that any peripheral device following the standard can be used interchangeably.

Standard Interfaces
Standard interfaces include SCSI (section 3.2.1) and Fibre Channel Arbitrated Loop (FC-AL, section 3.2.2)². A detailed description of EIDE (Enhanced Integrated Drive Electronics) is not provided, as EIDE is targeted more at smaller systems.

Controller
A controller is (usually) a hardware part of a computer, typically a separate circuit board, which allows the computer to use certain kinds of peripheral devices. For example, a disk controller is used to connect hard disks and floppy disks.

² These two standards are also compared to IBM's SSA architecture, which is presented in appendix B.

3.2.1 SCSI
Put simply, SCSI (Small Computer Systems Interface) is a way of connecting multiple devices to the computer via an external bus. This already sets SCSI apart from EIDE, a specification for two devices only. While EIDE is often used for PCs, SCSI is typically used in low-end to midrange server environments.

SCSI Compatibility
The SCSI protocol allows for command enhancements while at the same time remaining backward-compatible. This is because SCSI devices (devices supporting the SCSI interface) only react to commands they recognize and let other commands pass. SCSI devices are connected to the bus in a row, with each device being connected to the previous one. Logically, all devices are connected to the same bus, i.e. they share the bus (see Figure 4).
[Figure: a SCSI controller and several SCSI devices attached in a row to one shared SCSI bus.]

Figure 4  SCSI bus architecture

SCSI Disk Subsystem

Figure 5 gives an overview of a host system with one SCSI controller connected to 3 SCSI devices. One of the devices is a RAID system (see section 3.3.5 for a description of RAID), which is built from a controller (the RAID controller) that is in turn connected to an array of disk controllers with disks.


[Figure: the main processing unit (memory unit with CPU, memory controller and main memory; SCSI controller unit with DMA controller and SCSI controller on the internal bus) connects via the SCSI bus to two SCSI disks and to a RAID system, in which a RAID controller drives an array of disk controllers with disks.]

Figure 5  System hardware layout using SCSI storage devices

Multiple Devices

The ability to connect multiple devices to one bus is important, because this way the devices use fewer host resources (interrupts, controllers etc.). The number of connectable devices is even larger with Fibre Channel (see section 3.2.2). SCSI allows either 7 or 15 devices to be connected per SCSI controller (see Table 1).

Asynchronous Processing
An important feature of SCSI is that commands can be sent from the SCSI controller to a SCSI device and then handled by the device asynchronously, which means that other devices connected to the same bus can receive further commands immediately. Thus, devices can work in parallel. This is in contrast to other technologies (e.g. EIDE, where the EIDE controller waits until the device has completed its task).

Command Reordering
Another feature of most SCSI devices that improves access performance is command reordering. If commands to read disk sectors in random order arrive at the device, the order will be rearranged so that the disk can be accessed in linear order, minimizing the head movement for the whole sequence.


SCSI Types

Table 1 shows the different SCSI types and their technical properties: the bus rate, the number of data lines (bus width), the throughput (in megabytes per second), and the maximum number of devices that can be attached to one SCSI controller.

SCSI type                  Bus rate (MHz)   Bus width (bits)   Throughput (MB/s)   Max. devices³
SCSI-1                     5                8                  5                   7
Fast SCSI                  10               8                  10                  7
Fast-Wide SCSI             10               16                 20                  15
Ultra SCSI                 20               8                  20                  7
Ultra-Wide SCSI            20               16                 40                  15
Ultra2 SCSI (LVD)          40               8                  40                  7
Ultra2-Wide SCSI (LVD)     40               16                 80                  15
Ultra3-Wide SCSI (LVD)     80               16                 160                 15

Table 1  SCSI types and their technical properties

³ This number excludes the SCSI controller.
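The throughput column of Table 1 follows directly from the bus rate and the bus width. The following Python snippet (names chosen for illustration) recomputes it:

```python
# Throughput (MB/s) = bus rate (MHz) x bus width (bits) / 8 bits per byte.
scsi_types = [
    ("SCSI-1",                  5,  8),
    ("Fast SCSI",              10,  8),
    ("Fast-Wide SCSI",         10, 16),
    ("Ultra SCSI",             20,  8),
    ("Ultra-Wide SCSI",        20, 16),
    ("Ultra2 SCSI (LVD)",      40,  8),
    ("Ultra2-Wide SCSI (LVD)", 40, 16),
    ("Ultra3-Wide SCSI (LVD)", 80, 16),
]
for name, bus_rate_mhz, bus_width_bits in scsi_types:
    throughput = bus_rate_mhz * bus_width_bits // 8
    print(f"{name:24s} {throughput:4d} MB/s")   # matches Table 1
```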

Ultra3 SCSI
Apart from the increased speed (see the data above), Ultra3 SCSI also introduces new features compared to Ultra2 SCSI. One of these is the cyclic redundancy check (CRC), which offers improved reliability of data transmission and error correction.

Single-Ended SCSI, Differential SCSI
Apart from the technical properties described in Table 1, the requirements of the storage systems regarding their location (room, air-conditioning) are also important. SCSI comes in two variants, single-ended and differential. Because of enhanced noise reduction, differential SCSI allows longer cables to be used. There are two versions of differential SCSI: HVD (high voltage differential, for high-end servers) and LVD (low voltage differential, for PCs). At least Ultra2 SCSI and faster systems are based on differential SCSI.

Arbitration
An important aspect which differentiates interfaces is the use of arbitration. Arbitration means that all devices are connected to a common bus or link and only one communication is allowed at any one time. In a bus architecture, there is always only one communication between two devices connected to the bus, i.e. it is an arbitrated architecture. On a SCSI bus, for example, only two devices can communicate at a time, and a defined protocol⁴ determines which device may execute a request.

Conflict Resolution
This is called conflict resolution. The so-called initiator (usually the host adapter) starts the communication to talk to a disk, and once the conversation starts, a specific signal line is raised that prevents all other devices from interrupting. With arbitration, only one conversation may take place at any given time; therefore, a negotiation must occur at the start of each conversation before it can proceed. This reduces the efficiency of the transfers, and consequently the overall data rate.

⁴ The protocol is roughly defined as follows: Each device on the bus has a unique so-called SCSI id, which is in the range from 0 to the number of devices − 1. All devices that want to execute a request raise a signal on a specified line, followed by raising a signal on the data line corresponding to their SCSI id. If more than one device has raised the signal, the device with the highest id value takes precedence over the other devices that want to execute the request. This procedure may be termed resolution for the next bus request.
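The priority rule sketched in footnote 4 — the requesting device with the highest SCSI id wins the bus — can be stated in a few lines of Python. This is an illustrative sketch of the selection rule only, not of the actual bus signaling:

```python
def arbitrate(requesting_ids):
    """Return the SCSI id that wins arbitration.

    Per the protocol sketched in footnote 4, every requesting device
    raises the data line for its id, and the highest id wins.
    """
    if not requesting_ids:
        return None
    return max(requesting_ids)

# Devices 2, 5 and 7 (often the host adapter) request the bus:
assert arbitrate({2, 5, 7}) == 7
```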

3.2.2 Fibre Channel


Fibre Channel is an industry-standard serial interface that was originally used to connect two systems via optical cabling, or a system to a subsystem. It has evolved to include the ability to connect many devices, including disk drives.

Fibre Channel-Arbitrated Loop
This addition to the Fibre Channel specifications is called Fibre Channel-Arbitrated Loop (FC-AL). FC-AL can be operated in two different modes, an arbitrated mode and a point-to-point mode. The term FC-AL is derived from the arbitrated mode, where a loop with arbitration is used, which means that only one communication is going on across the whole loop at a time; this is in contrast to SSA (see appendix B). Another way to state this property is to say that FC-AL has a common loop for all devices. When operated in point-to-point mode, every two devices can start a communication simultaneously. However, because either device could be the sending or the receiving device, some arbitration remains.

Fibre Channel Loop
FC-AL is a fast serial bus interface standard intended to replace SCSI on high-end servers. FC-AL I/O throughput is 100 MB/s (base speed). It has to be considered that the base speed only takes effect after the request conflict resolution, which is more complicated and therefore more time-consuming than SCSI's conflict resolution. Likewise, the block size used by the DBMS to transfer data blocks to the devices influences the actual throughput: the larger the block size, the better the throughput will be. Many devices are dual-ported, i.e. they can be accessed via two independent ports. This doubles speed and increases fault tolerance (see Figure 6). For a dual-loop architecture, this provides a backup if one loop fails.

Loop Architecture
The Fibre Channel interface is a loop architecture, as opposed to a bus architecture like SCSI. The loop structure enables a rapid data exchange with a maximum transfer rate of 100 MB/s. As can be seen in Figure 6, communication between two devices always uses the whole loop as transport infrastructure, and every device participates in the transmission (i.e. the devices other than the two communicating devices function as a kind of repeater). This also means that, if no precautions are taken⁵, the whole loop is unavailable if any device fails.
[Figure: an FC-AL controller with ports A and B and several FC-AL devices connected in loop A; an optional second loop B provides the dual-loop configuration.]

Figure 6  FC-AL loop architecture

Storage Area Network

The Fibre Channel loop can have any combination of hosts and devices (max. 127 devices). This means it provides a technological infrastructure for clustering multiple servers around a single pool of storage (this only recently emerging concept is called a storage area network (SAN)). However, the software has to support server clusters as well (e.g. NT clustering software, which supports 2 servers to date, or Novell's clustering software Orion).

Data Throughput
Care should be taken with how FC-AL is incorporated into the whole system. The connection to the host (the host bus adapter) and the storage devices (e.g. RAID systems, see section 3.3.5) can decrease the throughput. Existing RAID systems may be designed for older SCSI throughputs of up to 40 MB/s. The RAID system may have to incorporate more than one RAID controller and fast disks to achieve the throughput, especially if the dual-loop throughput of 200 MB/s is desired.

SCSI Support
Fibre Channel is designed to support SCSI devices. This allows for a smooth upgrade to this technology while re-using existing devices.

Comparison: FC-AL vs. SCSI
For high availability, FC-AL allows storage devices to be placed much farther apart than SCSI devices could be. In the case of copper wires, the possible distance is up to 30 meters, and even 10 kilometers in the case of fiber optic cable. This is important for disaster recovery.

Throughput
While SCSI controllers issue commands at an asynchronous transfer rate of about 2 MB/s, Fibre Channel can issue commands at full speed (i.e. at the maximal throughput rate). Another advantage of Fibre Channel over SCSI does not concern speed but cabling, as it allows distances of up to 10 kilometers for the physical placement of servers and storage; Ultra SCSI, for instance, has a cabling distance limitation of 25 meters.

⁵ So-called Port Bypass Circuits (PBCs) can be installed for a device on the loop. The PBC is basically an electronic switch that allows a node to be bypassed and electronically removed from the loop. The PBC allows a device to be powered down and removed without interrupting traffic or data integrity on the loop.


Technically, Fibre Channel may surpass SCSI. However, SCSI is the less expensive technology, so it may be the logical choice for low-end servers and workstations. FC-AL may be the preferred option when building server clusters (SAN). However, not all technological pieces of that architecture are available yet (e.g. software support is lacking).

3.3 Disk Subsystem


3.3.1 General Overview
Disk Drive

A disk drive is a device which can access one or more disks. Disks come in different sizes (e.g. 3.5 inches and 5.25 inches are standard disk sizes). For each disk, a read/write head exists to access the disk. The axis around which the disks rotate is called the spindle. Disks are divided into tracks, and tracks are divided into sectors or blocks. If the disk drive includes multiple disks, all read/write heads are moved simultaneously, i.e. they are all positioned on the same track. As opposed to floppy disks, hard disks rotate all the time. This enables faster data access, as the head can immediately be positioned onto the right track. Typical disk speeds are 5400 rpm (revolutions per minute), 7200 rpm or 10000 rpm. To increase storage capacity, disks are assembled into a disk pack (see Figure 7). A cylinder is the union of a certain track (e.g. track no. 2) across all disks in the disk drive. The concept of a cylinder is important because data stored on the same cylinder can be retrieved much faster than if it were distributed among different cylinders.
[Figure: a disk pack — several disks on a common spindle, one read/write head per surface on a common arm; the same track across all disks forms an (imaginary) cylinder.]

Figure 7  Disk pack

To access data, 3 actions are required:

Access Time
1) The head has to be positioned on the right track. The time used for this action is called the seek time.
2) The drive has to rotate until the correct sector appears under the head. The time required for this is called the rotational delay or latency.
3) The drive has to read the data from disk or write it to disk. This time is called the block transfer time (whole blocks are always read/written).
Typically, the seek time and latency are much longer than the block transfer time. The total time to get a block from disk is known as the access time: the sum of the seek time, the rotational delay and the block transfer time. It comes down to as little as about 10 ms for modern disks. The access time is an important measure for random accesses (i.e. selective accesses) to the disk.
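Putting sample numbers to the three components shows why seek time and latency dominate. The values below are assumptions for illustration:

```python
# Access time = seek time + rotational delay + block transfer time.
# Sample values are assumptions for illustration.
seek_ms = 5.0                     # average seek time
rpm = 10000                       # disk rotational speed
latency_ms = 60_000 / rpm / 2     # average rotational delay: half a turn
block_kb, rate_mb_s = 8, 20       # block size and media transfer rate
transfer_ms = block_kb / 1024 / rate_mb_s * 1000

access_ms = seek_ms + latency_ms + transfer_ms
print(f"{access_ms:.1f} ms")      # about 8.4 ms: seek and latency dominate
```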
Media Transfer Rate / Sustained Transfer Rate

One can also ask for the amount of data which can be continuously transferred from the disk to the host. One measure, called the media transfer rate, is the theoretically possible throughput when considering the drive's rotational speed and the media density (i.e. the density with which the data is packed onto the disk). The faster the disk rotates, and the denser the media, the higher the media transfer rate will be. In addition, there is a practical measure called the sustained transfer rate, which depends on the media transfer rate and on the caching of the hard disk (e.g. the size of the cache); it is normally measured by benchmarks. The sustained transfer rate is an important measure when accessing large contiguous areas on the disk (e.g. for a large table scan).

Hot Spot

A hot spot refers to data stored on some disk which is accessed very often and concurrently by some system component or user. Because of the frequency of access, it can become a bottleneck.

Typical Access Rates
The following gives an idea of typical disk performance values:
- typical high-speed disk access time: up to 10 ms
- typical low-speed disk access time: 20 ms or more
- fastest hard disk media transfer rate (the speed with which data can be read by the storage device): around 20 MB/s (for read and write); the sustained transfer rate is similar.
Many small disks can be preferable to a few large disks if the latter have only the same or a slightly better access time. This means that a disk subsystem has to be planned in terms of total disk capacity as well as the combined throughput of all disks together.
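These values explain why the access pattern matters so much: with random accesses, each block costs a full access time, while sequential accesses run at the sustained transfer rate. A back-of-the-envelope comparison (assumed values as above):

```python
# Random access: every 8 KB block costs a full access time (~10 ms).
access_time_s = 0.010
block_mb = 8 / 1024
random_mb_s = block_mb / access_time_s      # ~0.8 MB/s

# Sequential access: the disk streams at the sustained transfer rate.
sequential_mb_s = 20.0                      # ~20 MB/s

print(f"random: {random_mb_s:.1f} MB/s, sequential: {sequential_mb_s:.0f} MB/s")
# The same disk delivers roughly 25x more data when read sequentially.
```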

Disk Contention

When disks are heavily accessed, the interval between two accesses decreases until the disk is continuously busy, i.e. the disk operates at the maximum throughput rate it can provide. In this situation, requests for this disk cannot be satisfied immediately and are queued in an I/O queue. This situation is called disk contention. The general strategy of DB layout is to avoid disk contention. Some conclusions can be drawn from the previous considerations: If logically sequential data pages are scattered across one or more disks, accessing these pages in logical order will incur an overhead of seek time and latency for each physical access of one page. Even if the physical ordering corresponds to the logical ordering of the data blocks, two different data requests can still interfere with each other if they are processed concurrently. Here, a good strategy can be to access more than one page at once, i.e. to access the following pages without any further seek procedure (for read access, see read-ahead caching in section 3.3.2). Any database will use this strategy to some extent.
Relationship to Controller

When planning the disk subsystem layout (e.g. for a SCSI system), the relationship between the SCSI controller and the connected disks must be taken into consideration. A simple calculation for the minimum number of disks connected to an Ultra2-SCSI controller is the following: when using 4 fast SCSI disks with 20 MB/s each, the optimal I/O throughput is 4 x 20 MB/s = 80 MB/s, which is the maximal throughput of an Ultra2-SCSI controller. If slower disks are used, the number of disks should be increased accordingly. SCSI disks attached to the same SCSI controller can work in parallel, because the SCSI protocol issues an asynchronous request to the device, and the device sends a notification on job completion.
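This sizing rule generalizes easily; a small sketch using the numbers from the example above:

```python
import math

def min_disks(controller_mb_s, disk_mb_s):
    """Smallest number of disks needed to saturate one controller."""
    return math.ceil(controller_mb_s / disk_mb_s)

print(min_disks(80, 20))   # Ultra2 SCSI controller, fast 20 MB/s disks -> 4
print(min_disks(80, 10))   # slower 10 MB/s disks -> 8
```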

3.3.2 Caching
Different Cache Levels

Caching can be done on different system levels:
1) Software caches are kept on the processing host.
2) Hardware caches are kept either on the bus controller or on the disk device itself.
When using hardware caches in combination with software caching, it is advisable to verify that the cache managements harmonize with each other. For read caches, it does not make sense to cache the same information on different levels (example: if the OS has a cache of 100 KB and there is a hardware cache of 100 KB as well, and the cached data is identical, then the hardware cache is useless).

Software Caching

A software cache will always perform better on read hits (i.e. when the requested data can be delivered by reading from the cache) than a hardware cache, since a software cache is closer to the application. Software caches holding dirty data (cached data which has been modified) work best if flushing (writing to disk) the dirty data can be postponed until there is less I/O activity. Hardware caches which receive data to be written to disk will try to flush the data to disk as soon as possible. Software caches thus have two qualities: while write accesses can only be postponed, read accesses can be satisfied completely and never reach the I/O subsystem.

Write-back vs. Write-through Caching

A cache can be operated as a write-back or a write-through cache. Write-back caching means that the write operation is considered complete as soon as the cache is updated, regardless of when the data is actually written back to disk. The data may be changed several times before it is flushed to disk. In order to preserve the cache contents in case of a power failure, some safeguard (like a battery backup) is needed. As opposed to this, write-through caching implies that the data is written immediately through the cache to disk.
Effects of High System Load

A scenario of growing system load (e.g. an increasing number of concurrent user requests) can have two negative effects: Because there are more requests, there is less system idle time in which to flush dirty data to disk. And because data flushing is done less often, the software cache (e.g. the DBMS cache) fills up with dirty data, so that the pages intended for reading are paged out, which in turn causes the read hit rate to decline. In such a situation, the cache size should be increased. If the cache no longer fits in main memory (i.e. the OS starts swapping), the physical memory size has to be increased as well.

Controller-based Caching
One way of providing hardware caching is to place a hardware cache on the I/O controller. When data is to be written to disk, the controller can place the data in the cache and acknowledge the request. While processing continues, the controller transfers the cached data, concurrently to the normal I/O activities, over a dedicated I/O bus to the specified disk. The controller may include block reordering, i.e. flush the data to disk in an order which minimizes head movement. This process is sometimes called elevator-sorted write-back.

Read-ahead Caching
Another way of using the cache is read-ahead caching, i.e. prefetching data from contiguous disk space. This is sometimes referred to as the principle of locality of reference. It does not work well for striped data (e.g. with a stripe size of 8 KB, prefetching 64 KB means reading the wrong data). When using read-ahead caching, care has to be taken to size the cache large enough for the expected workload, considering the number of concurrent users.

Consequence of Software Caching
There is a relationship between the size of the software cache and the ratio of read to write requests received by the I/O subsystem. The more read requests are satisfied by the software cache, the higher the ratio of write to read requests passed through to the I/O subsystem will be (e.g. if there are 4 read and 4 write requests, and the software cache can satisfy 3 of the read requests, the ratio will be 4:1). As a consequence, the hardware system should be optimized for performing write operations.
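The worked example from the previous paragraph, expressed as a small Python helper (the function name is invented for illustration):

```python
def io_ratio_after_cache(reads, writes, read_hits):
    """Write:read ratio as seen by the I/O subsystem once the software
    cache has satisfied `read_hits` of the read requests."""
    reads_passed = reads - read_hits
    return writes, reads_passed

writes, reads = io_ratio_after_cache(reads=4, writes=4, read_hits=3)
print(f"{writes}:{reads}")   # 4:1 -> optimize the hardware for writes
```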

3.3.3 Striping
By striping, the disk(s) are partitioned into so-called stripes of equal size, with typical sizes ranging between 512 bytes and several megabytes. A set of stripes belonging together (i.e. having been created for the same purpose) forms a so-called stripe set, which is treated as one logical storage unit. Striping can thus be used as a method of combining multiple disks into one logical storage unit (using only one stripe set). An example of 3 disks with 2 stripe sets is shown in Figure 8.
[Figure 8: Two stripe sets striped across 3 disks. Each disk holds blocks 1 to 8; the stripe size is one block. Blocks colored light gray belong to stripe set 1, blocks colored dark gray belong to stripe set 2.]

Writing to and reading from a stripe set is done in a round-robin manner. Example for a stripe set on n disks: the first write, to stripe 1, goes to disk 1; the second write, to stripe 2, goes to disk 2; and so on. After the nth write, to stripe n on disk n, the (n+1)th write goes to the (n+1)th stripe, which is back on disk 1. In Figure 8, a stripe corresponds to a block and n is 3, therefore stripe 1 is on disk 1, stripe 2 on disk 2, stripe 3 on disk 3, and stripe 4 again on disk 1.
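This round-robin address mapping can be made explicit with a minimal sketch (a hypothetical helper, not from the original text; note that the code uses 0-based indices, while the text counts from 1):

```python
def map_stripe(offset_bytes, stripe_size, n_disks):
    """Map a logical byte offset in a stripe set to (disk, stripe, disk offset).

    Round-robin placement: stripe i lives on disk (i % n_disks).
    """
    stripe = offset_bytes // stripe_size           # global stripe index
    disk = stripe % n_disks                        # round-robin disk choice
    stripe_on_disk = stripe // n_disks             # stripes preceding it on that disk
    disk_offset = stripe_on_disk * stripe_size + offset_bytes % stripe_size
    return disk, stripe, disk_offset

# Example matching Figure 8 (3 disks, stripe size = one 8 KB block):
# stripes 0, 1, 2 land on disks 0, 1, 2; stripe 3 lands on disk 0 again.
for i in range(4):
    print(map_stripe(i * 8192, 8192, 3))
```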
Striping on Different Levels

Striping can be performed by different system components:
1) Hardware striping: the disk subsystem is configured to write stripewise (a.k.a. controller-based striping).
2) OS striping: the OS I/O layer directs write/read requests to stripes defined on the OS level. The hardware does not notice that the I/O requests implement a striping. The trade-off of this solution is the computation and administration overhead in the OS layer. Within the OS, it is often the task of the logical volume manager (LVM, see also section 4.1) to perform the striping.
3) DBMS striping: the DBMS directs write/read requests to stripes defined on the DB level. Neither the OS nor the hardware notices that these I/O requests implement a striping. As with 2), this solution causes some computation and administration overhead, here in the DBMS layer.

Combined Striping

Striping on different levels can be combined, e.g. hardware striping can be combined with OS striping using a different stripe size. A scenario where this combination is sensible: when different (independent) disk subsystems have built-in striping capabilities, OS striping can be used to distribute a stripe set across all these subsystems.

Stripe Sizing

Of course, stripe sizes affect application performance. If applications often request large amounts of data, bigger stripe sizes should be used so that the record size falls within one stripe; this works well if many I/O requests are submitted in parallel. However, if there are few but data-intensive I/O requests, a smaller stripe size allows I/O reads to be distributed better across the available disks, so the data can be read in parallel.
Host-based vs. Controller-based Striping

Host-based and controller-based striping offer different benefits. Controller-based striping has the advantage that it places stripe-set management at a level below the OS and device drivers, which may reduce the load on the CPU and system. Host-based striping, on the other hand, has the advantage that stripe-set members can be distributed across different I/O subsystems, potentially minimizing I/O subsystem bottlenecks.

Pros and Cons

Whether striping is useful for an application depends on many factors:
- (+) The striping policy aims at distributing I/O load and possible hot spots across the participating disks, i.e. distributing the workload at the back end.
- (+) If the data is distributed more evenly across disks by striping, this can lead to shorter seek times (compared to disks holding much more data, which therefore execute many head movements).
- (-) Prefetching of data from one disk can be obstructed, as the following data blocks are stored on other disks.
- (-) One disk failure can affect the functioning of the whole stripe set.
- (-) For small amounts of requested data which can be held in the cache, striping is probably not very effective. Striping works best when data has to be fetched from disk.

Striping Tips & Tricks

The following provides some cautions on the correct use of striping:
- The application profile (large scans vs. selective access) has to be considered. Large scans can be fed by prefetched data, which does not work well when striping is used. On the other hand, selective access requests can be satisfied quickly when accessing stripes on different disks. Many R/3 database requests are selective.
- The stripe size has to harmonize with the block size (i.e. a stripe size should be a multiple of the block size), otherwise the disk subsystem may be slowed down.
- Recovery methods must be considered carefully, as recovery of striped data may take longer.
- Configuring the right stripe sets can become essential. The disks taking part in a stripe set should be distributed evenly across the available hardware components, e.g. SCSI controllers and disk controllers.

3.3.4 Disk Mirroring


Mirroring

Mirroring is the principle of writing the same data onto two different storage locations. As write requests are handled on different levels (DBMS level, OS level, hardware level), mirroring can be implemented on each of these levels. RAID provides a method to mirror data on the hardware level.
Mirroring on Host Level

Host-level mirroring (either on the DBMS level or on the OS level) has advantages and disadvantages:
- (+) If a device fails, the data can still be written to a mirror device. If hardware mirroring is enabled on some device and this device fails, the mirror may be unavailable or lost, too; therefore, additional hardware failover facilities have to be established.
- (+) If the layout distributes data so that the primary data is located on the first half of a disk and mirror data on the second half (of a different disk, in order to avoid contention), then the head movements for these data are reduced (the maximal head movement spans half of the disk).
- (-) For each mirrored device, separate I/O operations are needed.
- (-) Depending on the device type, different connections to the mirror device may be needed.
- (-) There may be a negative impact on hardware caching.

3.3.5 RAID

RAID means redundant array of inexpensive disks: an array of disks is used to provide storage to the system which appears as one larger disk. This means that from the OS perspective the RAID system can be treated as one disk.

Availability

All RAID systems except RAID-0 provide some level of availability: after a disk crash, or after a disk has been pulled out of the array and replaced by a new one (called hot swapping), most RAID systems can restore the contents of the affected disk with the help of the contents of all remaining disks and the so-called parity information.

Parity Information

Parity information can be implemented as follows: 1) Reserve one disk of the array to store the parity information. 2) For each bit position, compute the XOR of the corresponding bits of all disks except the parity disk, and store the result at the corresponding location on the parity disk. This procedure ensures that, if one of the disks crashes, the parity information and the XOR of the corresponding bits on the remaining disks allow to determine whether the corresponding bit on the failed disk was set or not.

RAID Properties

Table 2 provides an overview of different properties of RAID systems.

RAID-0

RAID-0 provides no protection against disk crashes, and the mean time between failures (MTBF) of the whole array is n times lower than that of a single disk (i.e. the failure rate is n times higher), where n is the number of disks in the array.
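The XOR parity scheme can be illustrated with a minimal sketch (hypothetical, operating byte-wise rather than bit-wise, with toy disk contents):

```python
from functools import reduce

def parity(blocks):
    """XOR equally sized blocks byte-wise; used for both computing and restoring."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

disks = [b'\x0f\xf0', b'\x33\x33', b'\x55\xaa']   # toy contents of 3 data disks
p = parity(disks)                                  # stored on the parity disk

# Disk 1 fails: XORing the parity block with the surviving disks restores it.
restored = parity([p, disks[0], disks[2]])
assert restored == disks[1]
```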


Name   | Striping used | Stripe size    | Redundant information | Parity information | Mirror information
RAID-0 | X             | large          | -                     | -                  | -
RAID-1 | -             | -              | X                     | -                  | X
RAID-2 | X             | sector (small) | X                     | -                  | -
RAID-3 | X             | sector (small) | X                     | X                  | -
RAID-4 | X             | large          | X                     | X                  | -
RAID-5 | X             | large          | X                     | X                  | -

Table 2: RAID properties

RAID-1

RAID-1 is an array consisting of pairs of disks, where the second disk is a mirror of the first. The disk pair appears to the host as one disk. For write access, the data is written twice in parallel, to the source disk and its mirror, so that write performance remains unchanged. Read access can perform up to twice as fast, as data can be read from the source disk and the mirror in parallel. However, twice the disk storage is needed to store the data, which makes this the most expensive of all RAID systems.

RAID-2

RAID-2 is outdated. It reserves some of the disks of the array to store so-called ECC information (ECC means error correction code, i.e. information to verify and/or correct the stored data). However, modern disks can embed ECC information within the data sectors, so additional ECC information should not be necessary. In addition, RAID-2 does not store enough information to restore a crashed disk of the array.

RAID-3

RAID-3 reserves one disk in the array to store parity information. This parity information suffices to restore any one of the disks in the array. Unlike RAID-2, no additional ECC information is stored.

RAID-4

RAID-4 is like RAID-3, but it uses large stripe sizes. Read operations can be overlapped, which means that data can be read from several disks in the array in parallel. However, write operations cannot overlap, because every write operation has to update the parity information stored on the single parity disk, so two write operations interfere with one another.

RAID-5

RAID-5 is like RAID-4, but the parity information is not stored on one dedicated disk; each disk stores the parity information for some definite set of stripes. Therefore, not only read but also write operations can potentially overlap. Table 3 shows the consequences of these properties.

RAID Evaluation

According to the list above, RAID-1 and RAID-5 appear to be the best candidates for mission-critical systems. RAID-1 stresses performance, RAID-5 stresses storage efficiency (and therefore lowers the cost), while providing equal availability.

Characterization in comparison to other RAID systems:

Name   | Performance                               | Storage efficiency                              | Availability
RAID-0 | best performance                          | best storage efficiency                         | worst availability
RAID-1 | best performance except RAID-0            | worst storage efficiency                        | good availability
RAID-2 | middle performance                        | second worst storage efficiency (after RAID-1)  | bad availability
RAID-3 | middle performance                        | good storage efficiency                         | good availability
RAID-4 | middle performance                        | good storage efficiency                         | good availability
RAID-5 | better performance than RAID-2 to RAID-4  | good storage efficiency                         | good availability

Table 3: Consequences of the RAID properties

The performance difference between RAID-1 and RAID-5 can be assessed by considering the additional overhead of RAID-5 write operations. To update the parity information, the old bit mask (contents of the disk before writing) is read and XORed with the new bit mask (overhead: 1 disk read, 1 XOR operation); the result is the bit mask that describes the changed bits (changed bits have value 1). The old parity information is then read and XORed with this change mask (overhead: 1 disk read, 1 XOR operation). Finally, the new parity information is written back to disk (overhead: 1 disk write). Altogether, RAID-5 incurs 5 additional operations per write: 2 disk reads, 2 XOR operations and 1 disk write (RAID-1 can perform its two write operations in parallel, while RAID-5 needs the XOR calculations before the second write). RAID-5 writes are therefore considered to run at roughly 3/5 to 1/3 the speed of RAID-1 writes.
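A minimal sketch of this read-modify-write parity update, under the simplifying assumption that a "block" is a bytes object (helper names are hypothetical):

```python
def xor(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(old_data, new_data, old_parity):
    """Return the new parity block for a RAID-5 small write.

    The overhead mirrors the text: the two reads (old data, old parity)
    happen before this call; the parity write happens after it.
    """
    change_mask = xor(old_data, new_data)    # 1 XOR: which bits changed
    return xor(old_parity, change_mask)      # 1 XOR: flip those bits in the parity
```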


4. Operating System

4.1 Device System



Figure 9 shows the entities for device handling as defined by UNIX (example: HP-UX).

[Figure 9: Overview of UNIX devices, example HP-UX. An entity diagram relating files, logical volumes, volume groups, disks/hypervolumes, partitions and physical extents to their devices (logical device, group device, disk device, partition device), with each device classified as a raw device, block device or other device.]

Logical Volumes

Disks can be grouped into volume groups. If disks are to be split into smaller pieces, there are two ways to do this: creating logical volumes via operating system command utilities, or creating disk partitions. Figure 9 shows both options. Choosing logical volumes allows disks to be split into so-called physical extents of a fixed size (e.g. the current default is 4 MB). Multiple physical extents on several disks (assigned to the same volume group) can then be combined into one logical volume. Sometimes one speaks of logical extents instead of physical extents, but there is a 1:1 mapping between these two notions: while the physical extent is the location on disk, the logical extent is the unit treated as part of a logical volume.

Devices

For each logical volume, volume group or partition, the operating system creates a device (an OS entity representing the physical hardware) somewhere under the /dev directory. Raw devices are accessed bytewise, whereas the access unit of block devices is the block. There are also other devices used for special purposes (e.g. the logical null device). Apart from files, databases may use devices to store and retrieve data. Because of the overhead of file administration in a file system, performance may be better when using raw devices for tablespaces.
Logical Volume Manager

Logical volumes can only be used if the system includes a so-called LVM (logical volume manager), the OS unit that handles logical volumes. If no LVM is installed, partitions have to be used.

Raw Device vs. File

An advantage of raw devices over files is that logically contiguous blocks are also physically contiguous. This does not apply to files, because the file system may distribute the blocks according to an allocation map. When using files, the OS reads the data from the file, saves it to a kernel buffer and then copies it to the memory space designated by the application. As opposed to this, a DBMS using raw devices can handle its own shared buffer that receives the raw device contents, so the goal of data sharing can be reached as well. Only when using raw devices can the DBMS itself guarantee that committed data is really saved to disk; when using the file system, the OS may buffer write data and store it at a later time (footnote 6). Therefore, several vendors require or recommend the usage of raw devices rather than files. Furthermore, the usage of files can be restricted to data not critical to performance, and the file system can be placed on slow disks or on tracks near the center of the disk, so that the faster disk partitions can be accessed as raw devices.

Easier Administration

Using an LVM is normally easier than using OS tools directly, because the administration is more flexible. For example, logical volumes can be changed later on, while this is normally not possible with OS tools.

4.2 OS Striping
When creating a logical volume (HP-UX: command lvcreate), options allow setting the number of disks across which the logical volume is created, the total size of the volume and the number of physical extents allocated to it. An additional option determines a stripe size. Stripes are then located within the physical extents allocated to the logical volume.
Example Striping

Figure 10 gives an example of OS striping with 3 disks and 2 logical volumes. The physical extents of both logical volumes are distributed across all 3 disks, and the stripes are located within the physical extents. An example stripe size would be 128 KB, which means that one physical extent with a default size of 4 MB contains 32 stripes.
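The sizing arithmetic of this example, condensed into a small sketch (values taken from the text above):

```python
extent_size_kb = 4 * 1024      # default physical extent size: 4 MB
stripe_size_kb = 128           # example stripe size from the text

stripes_per_extent = extent_size_kb // stripe_size_kb
print(stripes_per_extent)      # -> 32 stripes per physical extent
```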

Footnote 6: This holds especially for older OS releases. Modern OSs mostly support a file system API through which the application can force the OS to write the data to disk immediately.

[Figure 10: Example of host-based striping. Two logical volumes whose physical extents are distributed across disks 1 to 3, with the stripes located inside the physical extents.]


5. Database System

5.1 Common Characteristics of Database Systems


Several things are common to most database systems. This introduction presents those which are useful for understanding the present paper. For the differing terminology used by the individual DBMS, see also appendix A.
Data Files

Databases use data files or raw devices to store the table data and index data.

Data Buffering

Data read from disk is usually stored in data buffers. Therefore, the same data requested twice can be served from the buffer for the second request; put differently, two logical reads then correspond to one physical read (the disk read). Data buffers are simply memory pages of the database server. In order to share these buffers across processes, shared memory is used. However, if more buffers are used than physical memory exists, the OS starts swapping buffers to and from the OS swap file; the number of buffers therefore has to be taken into consideration in order not to degrade performance. It is common to all DBMS that I/O access is avoided as long as possible. The strategy therefore is to utilize the cache as well as possible, and cache tuning should precede I/O tuning.
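The distinction between logical and physical reads can be illustrated with a toy sketch (hypothetical, far simpler than a real DBMS buffer pool):

```python
class BufferPool:
    """Toy data buffer: logical reads hit the pool, misses become physical reads."""
    def __init__(self):
        self.pages, self.logical, self.physical = {}, 0, 0

    def read(self, page_no):
        self.logical += 1
        if page_no not in self.pages:
            self.physical += 1                     # disk read only on a miss
            self.pages[page_no] = f"page {page_no} from disk"
        return self.pages[page_no]

pool = BufferPool()
pool.read(42)
pool.read(42)                                      # same page requested twice
print(pool.logical, pool.physical)                 # 2 logical reads, 1 physical read
```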

Logging

Any database management system (DBMS) uses so-called logfiles to record changes made to data. These logfiles are used in case of a rollback, or for recovery in case of a hardware or DBMS crash. To improve access to the logfiles, a logfile buffer is used. However, all log data of a transaction has to be written from the buffer to the logfile when this transaction commits its work.

Log Group

A log group is the set of all logfiles together. For mission-critical systems, the logfiles can be mirrored; each write access to a logfile is then also performed on all mirrors. A mirror consisting of all logfiles together is also called a log group.

Rollback Information

Some DBMS use additional information especially designed for faster rollback of transactions, and store this rollback information in different formats (each DBMS has its own term for the rollback information, see appendix A).

Optimizing Process Architecture

When using OS processes for data processing or I/O access, OS overhead is produced by process scheduling. This overhead mainly results from switching between the current process and the next process to be executed (an expensive context switch). A first improvement was the introduction of OS threads, light-weight processes scheduled within the context of an OS process. Threads are now supported by most major OSs (NT, UNIX etc.). However, even threads produce some overhead. Therefore, some DBMS have implemented their own scheduling within threads. These sub-threads are named differently in each DBMS (e.g. in SQL Server they are called fibers, in SAP DB they are called tasks).
Asynchronous I/O

Another important issue is that modern operating systems support the concept of asynchronous I/O, i.e. requests to the I/O subsystem submitted by some thread do not block this thread. While the thread's processing continues, the OS makes sure that the request is processed independently. Therefore, to parallelize I/O requests, the DBMS can use a single thread, so that the OS has to manage and schedule fewer threads.

Choice of Raw Device and File System

Many DBMS offer the choice between using files or raw devices as storage media. These options differ in performance and availability. Especially write performance can be much better for raw devices, because a raw device relies on data administration by the DBMS itself, while a file is managed by the file system, which is part of the OS. For the file system, the OS keeps kernel buffers that can speed up reading file data. However, care has to be taken when writing to a file, because the file system may keep the data in the kernel buffer for some time before it is flushed to disk (usage as a write-back cache). Modern OSs provide a synchronization flag in the open command for a file; when the flag is set, write requests are transferred to disk immediately/synchronously, while reading data is still faster when done via the cache (usage as a write-through cache). A minimal sketch of this flag is shown after this subsection.

Cache Coordination

Using host-based caching (software or hardware) in a cluster environment requires coordination across the cluster nodes, to ensure that data duplicated in multiple caches stays identical. Here, hardware caches have the advantage that all cluster nodes access the same cache, so no coordination is required.

Statistics

A relevant issue for DBMS is the statistical data (called statistics). This table-based information is used by the DBMS to decide how to retrieve the data requested by users. An example: if the statistics say that a table contains only few different values, a select condition for one value would induce the DBMS to make a full table scan; if many different values exist, however, the result set can be retrieved more efficiently using the index. Because statistics can become obsolete, the DBMS has to update them from time to time. These updates can affect the DBMS transaction throughput and should therefore be optimized as well as possible.

Backup Strategy

The overall strategy of how, what and how often to back up can have strong implications on the database layout. One reason is that frequently changing data has to be backed up more often than other data; if possible, it is therefore useful to separate these two kinds of data in order to back them up on different schedules. Another reason is that the maximal downtime after a database crash has to be minimized. To recover quickly, the amount of log information accumulated after the last database image written to disk should be minimal (applying log information is a lengthy operation). Frequent data flushes (also see section 5.3.3 about checkpoints) therefore reduce recovery time, but need to be optimized in order not to slow down other DBMS operations. Additional information on the specific DBMS backup strategies is given in chapters 6 to 11.
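Returning to the synchronization flag mentioned under "Choice of Raw Device and File System": a minimal sketch using the POSIX open flag via Python (assuming a Unix system; the filename is hypothetical):

```python
import os

# O_SYNC makes each write return only after the data has reached the disk,
# turning the kernel buffer into a write-through cache for this descriptor.
fd = os.open("logfile1", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
os.write(fd, b"commit record")   # returns only once the data is on disk
os.close(fd)
```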

5.2 Distributing the Storage


5.2.1 General Considerations
File Placing Strategy

The decision where to place files (or raw devices) is based on both safety and performance considerations. The following are reasonable guidelines:
- As the transaction log writer process (the process that writes the log information to the logfiles) and the DB file writer process should not block each other, the logfiles and the database files should be placed on different disks. Also for reasons of data safety, DB files and logfiles should be placed on different disks, because the logfiles are needed for recovery if a disk with DB files crashes.
- For DBMS allowing the distribution of rollback information: as rollback information is created for many users, it is advisable to distribute it across different disks (some DBMS store this information separately, others place it within a tablespace).
- For DBMS allowing separate placement of data and indexes: in order to access DB tables and their indexes in parallel, indexes can be placed in different tablespaces, and these tablespaces should be placed on different disks.
- For DBMS that use archives: if the DB runs in archive mode (i.e. logfiles are copied to an archive by the DBMS automatically), the logfiles should be distributed over at least 2 disks, so that the archive writer process can back up the non-active logfile while the other logfile is accessed in parallel. There are normally several choices for the archive medium, e.g. tape or an external disk. The logfile and the archive file should in any case be placed on different disks to avoid disk contention.
- If several log groups are used, they should be placed on separate disks, so that all transaction log information is still available even if one of the disks crashes.
- For DBMS using tablespaces: some tablespaces (for a definition see appendix A) contain data which is rarely accessed, or contain rarely changing data that can be cached in buffers. These tablespaces can also be placed on disks with very active I/O, e.g. where the logfiles are located. Example candidate tablespaces for an R/3 database are: user1, ddic, docu, psapload, psapelxxx (with xxx giving the R/3 release, e.g. xxx=40b).
- OS swap space should be put on its own disk or disk set if possible, so that memory swapping does not affect the overall performance.

If the overall I/O system performance is bad, one method is to verify whether other system processes are generating I/O requests to disks used by the DBMS as well. This can be especially serious if a logfile is affected.
Hot Spots

A recommendation for hot spots (i.e. heavily accessed data concentrated in single disk locations, thus causing a bottleneck) is to place them on their own, possibly fast, disk(s). This is especially useful if these tables are known to be accessed simultaneously (either by different queries or by joins). Placing tables on their own disks can be done by putting each such table in its own tablespace with separate disks. Moreover, if a disk is partitioned, hot spots should be placed in partitions near the middle of the disk, where head movements and therefore access times are minimal. The least frequently used data can be placed on the outermost or innermost partitions.

Roaming Hot Spots

The previously explained method works well if several critical hot spots are concentrated in one tablespace. However, if the one table taken out of the tablespace was alone responsible for the hot spot, the hot spot simply moves to the new tablespace (a so-called roaming hot spot). It may be necessary to use an additional method to distribute the hot spot (e.g. striping, see section 5.2.2, or partitioning).

Sample DB Layout

For a sample database layout, see Figure 11. It consists of 3x7+8=29 disks: 4 disks are reserved for logfiles and their mirrors, one disk for the swap space, one disk for archiving, 2 disks for backup and rarely accessed tablespaces, 14 disks for data and index tablespaces, and 7 disks for custom tablespaces with critical tables (hot spots). Database configuration files are rarely accessed and can therefore be put together with other uncritical data, but their mirror should be placed on a different disk for safety reasons. Striping can additionally be used here, e.g. defining 3 hardware stripe sets across 7 disks each (assuming 7 disks are assigned to each controller).


[Figure 11: Sample DB layout. 29 disks: logfile1 and logfile2 plus their mirrors (4 disks); swap file and configuration file (1 disk); archive file and mirrored configuration file (1 disk); backup files together with rarely accessed tablespaces (2 disks); data and index tablespaces, possibly striped (14 disks); custom tablespaces for hot spots, possibly striped (7 disks).]

5.2.2 Striping
Disk Contention and Striping

There is a relationship between disk contention and striping: striping data across multiple disks can reduce disk contention on disks which were application hot spots before. The bigger the stripe size, the more likely the hot spot remains on a single disk or a few disks; the smaller the stripe size, the more likely the hot spot is distributed across more or all disks.

Distributing Hot Spots

Striping disks which were not striped before is only effective if there is a hot spot on one of the disks, or, put another way, if there was an I/O queue with pending requests for one or some of the disks. In this case, load distribution across the available disks is a natural tuning step.

Mirroring with Striping

When using mirroring (logfile or data) together with striping, care should be taken to stripe the source data and all mirrors in an analogous way. If they are striped differently (e.g. using fewer disks, or different stripe sizes), the processing times of a write request (which can be executed in parallel for the source data and all mirrors) will differ, and the whole request has to wait until all writes have completed.

File Placement Control

The trade-off when using striping is as follows: while the writing of a file may be sped up if one considers only the I/O for that file, the control over which files are stored on which disks is lost. So, if a table is known to represent a hot spot, putting that table into the same tablespace with many other tables would spoil performance. Therefore, a different tablespace (with different assigned disks, not participating in the striping of the other tables) should be used to store that table.

The technical term for the situation that several requestors want to access data stored at one location, so that they are put into a queue and their requests are processed serially, is data contention. This term may be understood as a generalization of the previously mentioned term disk contention.

5.2.3 Analyzing I/O Requirements


Estimate Disk Capacity

The goal of this section is to provide practical hints on how to estimate the number of disks needed for an SAP R/3 system. If R/3 caching is properly configured (footnote 7), there are basically two scenarios to consider: 1) if the number of dirty data blocks in the DBMS cache exceeds a certain limit, the flushing of dirty data blocks to disk is triggered; 2) for certain business transactions, large volumes of data have to be read from the database and/or modified.

Example

In the following, one example for each scenario is given, and an overall result is determined.

Scenario 1: Flushing dirty data blocks to disk. Assume the DBMS cache is sized at 100,000 blocks (with 8 KB per block). Also assume that the DBMS flushes all dirty data to disk when the number of dirty blocks exceeds 10% of the total number of blocks; 10,000 blocks then have to be written to disk. Writing one block to disk corresponds to one disk I/O. If we assume that one disk has a throughput of 50 I/Os per second (which corresponds to an access time of 20 ms), one disk would need 200 seconds to write the 10,000 blocks (10,000 I/Os divided by 50 I/Os per second). With 10 disks, this time reduces to 20 seconds; with 40 disks, to 5 seconds.

Scenario 2: Reading large volumes of data, e.g. financial calculations for the closing of the financial period. Assume the business transaction has to read 20 million records from the database, with an average record size of 1 KB. This adds up to 20 GB of data. Because an 8 KB block can store up to 8 records, 20 million records correspond to 2.5 million I/Os. Reading that amount of data from a single disk would take 50,000 seconds (2.5 million I/Os divided by 50 I/Os per second). For the transaction to finish within a few hours, we therefore need at least 10 disks, which would take approximately 5,000 seconds (footnote 8).

Footnote 7: The goal is to reach a cache hit ratio of about 97%.
Footnote 8: Because we calculated with 8 records per 8 KB block, an average rate of 4 records per 8 KB block would result in a total time of 10,000 seconds, or less than 3 hours.


Overall result: comparing the results of scenario 1 and scenario 2 with the needed total disk capacity. If we consider 5 seconds to be a reasonable value for disk flushing, a minimum of 40 disks is needed. If we assume a capacity of 9 GB per disk, the total database size may not exceed 360 GB (9 GB multiplied by 40). If the database size exceeds that value, the total database size directly determines the number of disks.
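The arithmetic of both scenarios and the overall result can be condensed into a small sketch (same assumed figures as in the text):

```python
IOS_PER_DISK = 50                 # assumed disk throughput: 50 I/Os per second

# Scenario 1: flush 10% of a 100,000-block cache
dirty_blocks = 100_000 * 10 // 100
flush_secs = lambda disks: dirty_blocks / (disks * IOS_PER_DISK)
print(flush_secs(1), flush_secs(10), flush_secs(40))   # 200.0 20.0 5.0

# Scenario 2: read 20 million 1 KB records, 8 records per 8 KB block
ios = 20_000_000 // 8
read_secs = lambda disks: ios / (disks * IOS_PER_DISK)
print(read_secs(1), read_secs(10))                     # 50000.0 5000.0

# Overall: 40 disks of 9 GB each bound the database size
print(40 * 9, "GB")                                    # 360 GB
```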

5.3 Runtime Considerations


5.3.1 Logfile
Logfiles

Why are logfiles highly critical to performance? Logfile writing is triggered by commit (this is important for database recovery after a system crash). Because a commit can be requested by any DBMS server process serving a client request, and because logfile writing must be done synchronously, i.e. before the commit is acknowledged to the client, logfile writing is most critical to the DB transaction throughput.

Logfile Placement

The logfile should reside on its own disk, for at least 2 reasons: 1) The logfile is probably the most actively accessed part of the DB files, so the throughput of that particular disk should be maximized, as it will be the first apparent bottleneck. 2) If the read/write head of the disk only writes the logfile, the head does not have to change its position and make a costly seek operation to return to the current write position.

Log Groups

Log groups are mirrors of the log information, i.e. with 2 log groups, the second group is a mirror of the first group, and the log information exists twice. Furthermore, each log group consists of multiple logfiles (at least 2).

Logfile Mirroring

If logfiles are mirrored, writes to the logfiles happen in parallel, and the logfile write is finished only when the last write request has finished. It therefore has to be taken into consideration that a slow disk used for logfile mirroring can have a great impact on logfile writing performance. Hardware mirroring alone probably does not guarantee the same safety as logfile mirroring by the host or the DBMS, so it is not recommended (footnote 9).

Footnote 9: The reason is that the hardware component that controls the mirroring is a single point of failure.


5.3.2 Avoiding Dynamic Space Management


Reasonable Storage Parameters

For each database object, space is initially allocated on disk. Only if the data volume exceeds the initial space will the DBMS dynamically extend this space. Some DBMS allow tuning of the initial extents and of subsequent space extensions. Because space extension is a comparatively expensive operation, care should be taken to find reasonable values: if the initial setting is too large, space is wasted; if it is too small, the extension operation will impact performance. In addition, the DBMS may limit the number of extents allocated per tablespace.

Contiguous Device Space

In general, the allocation of space by the DBMS reserves contiguous space on disk, consisting of a number of blocks (unless striping is used). Such contiguity can enhance performance when reading or writing subsequent data blocks of a database object, because the disk head does not have to be repositioned. Prefetching can then be performed for the database object (e.g. using Oracle multiblock reads).

Datafile Extension

On a larger scale, some DBMS even allow extension of the data files forming the backbone of the database. It is better to allocate enough space in advance than to let the DBMS extend the data files. Automatic shrinking of such files is even more dangerous to performance, because it can lead to increasing disk fragmentation.

Rollback Information

An additional aspect is the configuration of rollback information (for DBMS using it). If there are large transactions writing a lot of data used for rolling back the transaction, the rollback information will be used extensively; having enough space available avoids dynamic allocation of more space. Additional care should be taken with parameters concerning the automatic deallocation of rollback space. For long queries or transactions, large rollback storage should be assigned if the DBMS allows this. For OLTP applications with many concurrent transactions, rollback page contention can be avoided if each transaction is assigned its own (small) rollback space. Small rollback spaces also fit more easily into the database cache.

Temporary Space Management

Care is also needed for temporary space management. The temporary space is often used for large sorts. Handling temporary space in memory is much more efficient; if temporary data has to be paged out to disk, the disk I/O slows down the sort considerably. If disk is used, what was said about rollback segments regarding space allocation and deallocation also holds here. If the DBMS allows tablespaces to be declared as temporary, this can improve performance, because the temporary space may be reused by the DBMS for multiple transactions. Furthermore, the DBMS may offer special buffers for direct I/O which can be used for large sorts, bypassing the normal buffer handling. Apart from this, a temporary space differs from normal data spaces in that operations on it are not logged (temporary space never needs recovery); likewise, it does not make sense to mirror a temporary space. A possible configuration is to assign the temporary storage area to one or more RAID-0 devices. This is useful if the DBMS is expected to perform complex operations on the database (e.g. large sorts). RAID-0 provides no safety against data loss, but temporary data only exists for the duration of a database transaction and is not needed after the end of the transaction; this is a good example for the use of RAID-0. Separating temporary data from other data also eliminates data contention.

5.3.3 I/O Access


Detecting I/O Bottlenecks

One method to detect I/O problems is watching the DB buffer utilization. If buffers overflow regularly, there is an obvious bottleneck in the I/O subsystem (i.e. the buffers cannot be written to disk by the DBMS in time).

Checkpoints

Each DBMS has its own policy for determining checkpoints, at which all buffered data that does not correspond to the data on disk is written to disk. As a precondition for writing the data buffers to disk, the log entries have to be written to disk first, to guarantee that all transactions which must be rolled back in case of a system crash can actually be rolled back. A checkpoint can be triggered e.g. because some logfile is full and the DBMS switches to another logfile, or by a predefined parameter that specifies a timeout. As a database recovery is based on the last checkpoint and the log entries written thereafter, a database administrator can use the timeout value to decrease the system downtime in case of a system crash.

5.3.4 Parallelism
Query Parallelism under Heavy Load

A consideration for the usefulness of parallelizing queries is that it works best if the number of users is small. If there are fewer users than CPUs available for parallel processing, there is a clear profit from parallel CPU usage. If the number of users outweighs the number of CPUs, the workload has to be considered: if all users need much CPU time, parallel processing may even be unfavorable.

R/3 Application Profile

In principle, the R/3 application profile is OLTP-based: there are fewer table scans than selective requests. Parallel queries work best for large amounts of data, e.g. in the context of OLAP (footnote 10) processing. If the DBMS does not support the parallel processing of large table scans on multiple CPUs, the task of parallelizing can be transferred to the application side. The profit of parallel processing is high as long as temporary data does not have to be stored on disk; in that case, most of the processing time is spent on disk I/O rather than on the CPU. Parallel DBMS operations important for R/3 OLTP are parallel index creation and parallel update of statistics. Details are given in chapters 6 to 11.

Footnote 10: OLAP means Online Analytical Processing, i.e. the analysis of (business) data with the help of database technology. As an OLAP system, SAP offers the SAP Business Information Warehouse.


Parallel Query

In general, a parallel query is a single data request from a user or application that is executed by the database server on more than a single CPU. The CPUs can be either in one machine (an SMP machine) or in different machines. To do this, the DBMS has to split the request into several parts, so that each sub-request can be executed independently.

Relationship between Layout and Parallel Queries

Database layout may improve the quality of parallel queries by a fair distribution of the query data over the available disks. One way to achieve this is table partitioning, where the data can be distributed evenly across the available disk space.

Splitting the Query

When splitting a query into parts, there is a distinction according to the hardware architecture: when splitting a query to run on different CPUs within one SMP (symmetric multiprocessing) machine, this is an intra-parallel query. When splitting a query to run on different nodes of an MPP (massively parallel processing) machine or cluster (either a cluster of single-processor machines or an SMP cluster), it is an inter-parallel query. The difference between an SMP and a cluster/MPP configuration is shown in Figure 12.
[Figure 12: SMP (left) and cluster/MPP configuration (right). Left: one SMP machine with CPUs 1 to 4 sharing an I/O bus and disks. Right: machines 1 to 3, each with its own CPU, I/O bus and disk, connected by a network.]

Result Merging

While parts of the query can be processed in parallel, returning the results to the client requires a single node to merge the result data retrieved from the different CPUs/nodes.

Relationship to R/3

Efforts to apply parallel queries to the OLTP operations of R/3 have not yet led to considerable performance improvements. The SAP Business Information Warehouse has a completely different application profile (namely few, long-lasting queries); here, parallel queries may be used favorably.


6. DB2 Universal Database (Unix and NT)



Universal Database

The following explanations focus on the Unix and NT versions of DB2 UDB (Universal Database). The other DB2 variants, for AS/400 and OS/390, are not treated in detail. A detailed description of the UDB concepts can be found in [2], [5] and [6].

6.1 Physical and Logical DB Components


6.1.1 DB2 UDB Enterprise Edition (EE) Concepts
The DB2 UDB Enterprise Edition (in the following called DB2 EE) is used for SAP's OLTP workload environments, like R/3.
Database Structure

Figure 13 shows the components of a DB2 EE database. At the top level, an instance may contain multiple databases; several global settings of an instance are valid for all of its databases. A database consists of multiple tablespaces, each logically consisting of a collection of database objects, such as tables or indexes.
[Figure 13: DB2 EE database components. An instance contains one or more databases; a database consists of tablespaces and one DB catalog; each tablespace records tables and is stored in one or more containers; a container's storage is a file (SMS), a file (DMS) or a raw device (DMS).]

Container

Data from a tablespace is physically stored in DB2 containers, which are either files or raw devices. Space management within DB2 tablespaces is done either via database means (database-managed space, DMS) or by the operating system (system-managed space, SMS). If the tablespace is based on DMS, space is reserved in advance (as determined by the database administrator during tablespace creation or extension); the tablespace cannot grow beyond the predefined limit. The space is preallocated in files, or taken from raw devices which have to be set up in advance. In the SMS case, DB2 allocates space on demand using files, which are created per database object.
System Catalog

For the whole database there is exactly one database system catalog, where information about database objects (tables, indexes etc.) is stored. A typical DB2 configuration is depicted in Figure 14: a single database residing on multiple disks of one database server.
[Figure 14: Typical DB2 configuration for R/3. DB2 clients connect to a single database server; its DBMS processes/threads access tablespaces 1 and 2, each stored in several containers spread over multiple disks.]

6.1.2 DB2 UDB Enterprise-Extended Edition (EEE) Concepts


The DB2 UDB Enterprise-Extended Edition (in the following called DB2 EEE) is used for SAP's OLAP workload environments, like BW. DB2 EEE is based on the concepts of DB2 EE, extended for massively parallel computing environments. It is an implementation of the shared-nothing computing approach, which utilizes totally independent computing units.
Database Structure

Figure 15 shows the components of a DB2 EEE database. Besides the previously mentioned database components, the following components can be identified.

Database Partition

A database consists of database partitions, which are independent units with their own transaction log, each storing a part of the database. Partitions are entities which can reside on the same or on different physical systems. A nodegroup is a group of database partitions; a database partition can be a member of multiple nodegroups. In contrast to DB2 EE, a tablespace is assigned to a nodegroup. This means that the nodegroup specifies the partitions used to store the data of a specific tablespace. Distribution of data is done by hash partitioning: for every table within a tablespace (distributed over several partitions), a partitioning key has to be specified at creation time.

Partitioning Map

The information to which specific DB partition a record is written is provided by a partitioning map, which is a hash table over the partitioning key. The same applies to reading the data. The partitioning key cannot be changed.

Database System Catalog

The database system catalog is located on a predefined database partition called the catalog partition. If a database partition server (i.e. the database processes associated with and responsible for a database partition) needs catalog information, it submits a request to the partition server that controls the catalog partition. This information is cached for reuse.
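The idea behind a partitioning map can be illustrated with a toy sketch (hypothetical and much simplified compared to DB2's actual map; Python's hash() is only stable within one process run):

```python
N_PARTITIONS = 4
# The partitioning map translates a hash bucket to a database partition.
partitioning_map = [i % N_PARTITIONS for i in range(4096)]   # 4096 hash buckets

def target_partition(partitioning_key):
    """Decide which DB partition stores the record with this key."""
    bucket = hash(partitioning_key) % len(partitioning_map)
    return partitioning_map[bucket]

# The same key always maps to the same partition, for writing and reading.
print(target_partition(("MATNR", "000017")))
```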
[Figure 15: DB2 EEE database components. In addition to the DB2 EE components, a database consists of 2 or more database partitions; nodegroups group database partitions; a tablespace is assigned to a nodegroup; the DB catalog resides on one partition.]

BW Configuration

In contrast to the R/3 configuration, an SAP BW (SAP Business Information Warehouse) configuration can use several database partitions (see Figure 16). Because DB2 uses the function-shipping model (see section 6.4) for the coordination of the database partition servers, this configuration provides good database scaling: the different partition servers do not compete for access to the disks. The speed of the network between the database servers has a great impact on query response times, therefore high-speed networks are highly recommended.


[Figure 16: Typical DB2 configuration for BW. Several DB2 clients connect to multiple database servers; each database server process/thread serves one database partition with its own tablespace containers and disks; one partition is the catalog partition holding the catalog tablespace.]

6.2 Disk Layout


Configuring Containers

Determining container location and size is critical for the DB2 DB layout, because once containers fill up with data, the current container implementation does not allow the container size to be extended (footnote 11). Instead, one or more additional containers have to be added to the tablespace. However, when containers are added, the database automatically starts a rebalancing process that distributes the data evenly across all available containers. This is because the DB2 database engine always requires all containers of a tablespace to be filled with data to an equal level (which is normally the case by virtue of the round-robin data writing); data from the existing containers therefore has to be carried over to the new containers. As this is resource-consuming work, it is better to estimate the container sizes of a tablespace for a longer period.

Page Size

Another parameter to be considered for tablespaces is the page size, because DB2 tablespaces have a maximum number of pages. For a page size of 4 KB, a tablespace can grow to 64 GB; with a page size of 8 KB, to 128 GB, etc. For DB2 UDB V6.1, the possible page sizes are 4 KB, 8 KB, 16 KB and 32 KB. These limits are valid for systems with a single DB partition; with multiple partitions, each partition can hold data up to the mentioned limit. For performance reasons, SAP BW uses tablespaces with an 8 KB page size.

Container Sizing

When considering container size, it does not matter that the containers with master data are filled to a larger extent, while the containers with transaction data are almost empty at the start of the production phase (e.g. filled to only 2%).

Footnote 11: With DB2 release 7.1, this restriction is removed, and containers can be extended.


Calculating Data Growth

Estimating the size of the transaction data tablespace is important. Example: if the weekly data growth is estimated at 1 GB/week and a page size of 4 KB is chosen, it can be calculated how long data can accumulate in the database before the tablespace is full (here: 64 weeks, because the maximum size of a tablespace with a 4 KB page size is 64 GB). As this pertains to single DB partition systems, using multiple partitions extends the accumulation time by the number of partitions.
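The accumulation arithmetic as a small sketch (the 16 KB and 32 KB limits are extrapolated from the text's "etc." and are an assumption):

```python
max_tablespace_gb = {4: 64, 8: 128, 16: 256, 32: 512}   # KB page size -> GB limit

def weeks_until_full(page_size_kb, growth_gb_per_week, n_partitions=1):
    """How long steady data growth fits into a DB2 tablespace."""
    return max_tablespace_gb[page_size_kb] * n_partitions / growth_gb_per_week

print(weeks_until_full(4, 1.0))       # 64 weeks on a single partition
print(weeks_until_full(4, 1.0, 4))    # 4 partitions -> 256 weeks
```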

Container Placement

At the same time, one decides on the number of containers, which determines the I/O load distribution. Note that a maximum of 255 containers per tablespace can be used. From the performance perspective, the best layout uses one disk per container (to parallelize access to the containers). Separate disks for the swap file, the R/3 paging file and the database logfile are also favorable.

R/3 Layout

The following are some simple considerations for a DB2 configuration for R/3: if we assume 26 tablespaces, half of these are reserved for indexes. Among the remaining 13 tablespaces there are 3 important ones:
- PSAPDDICD, holding the R/3 dictionary tables
- PSAPSTABD, holding the master data (Stammdaten)
- PSAPBTABD, holding the transaction data (Bewegungsdaten)
3 containers per tablespace are often used as a basic default value.
Logfiles

A critical part of a database are the logfiles. The sum of all active logfiles must not exceed 4 GB (footnote 12), a DB2 limit which causes the DBMS to roll back the transaction that hit the limit. A logfile is active as long as there is at least one open transaction within that logfile. DB2 can be operated in 2 modes: circular logging (logfiles are written in a circular fashion and are not archived) and log retention logging (after a logfile is full, it is archived; sometimes called LOGRETAIN mode).

Log Retention Logging

The following applies to log retention logging, which is mandatory for R/3. Once all transactions (updates etc. to the database tables) stored in a logfile have completed, the corresponding logfile can be archived by the user exit; its space is reused by the database if no longer needed. If the user exit fails to archive the logfiles, the file system may fill up, as the database cannot delete a non-archived logfile. This would also cause the DBMS to stop processing transactions, because no more logfiles could be allocated.

Primary and Secondary Logfiles

There are primary and secondary logfiles. The latter are allocated only if more logfiles are needed than the (restricted) number of primary logfiles.

Footnote 12: This value is extended to 32 GB for DB2 release 7.1.


Temporary Tablespaces

DB2 uses temporary tablespaces for temporarily storing data, e.g. during a sort which spills (i.e. whose space requirements cannot be satisfied by main memory). While a standard R/3 installation uses only one temporary tablespace, some R/3 or BW installations use two. The reason is that there is an own temporary tablespace for each page size used: besides the default temporary tablespace with a 4 KB page size, there is a second temporary tablespace with an 8 KB page size for large BW tables. Until R/3 release 4.5, the temporary tablespace is DMS-based; from release 4.6 on, it is SMS-based.

Table Reorganization

If a table record no longer fits into its original storage location due to a record update, it has to be put into another data page, and a reference to the new page is inserted into the original page (a so-called overflow record). Because access to overflow records requires two I/Os, it may be advantageous to reorganize a table with many overflow records. Such a reorganization can be performed online, and the records are sorted based on the primary key so as to enable efficient prefetching (see also section 6.3).

Backup Strategy

DB2 offers full online, full offline and online tablespace backup. A full offline backup writes a consistent database state. To reduce recovery time in cases where only a few tablespaces are corrupt, online tablespace recovery is most efficient. A full online backup needs log information to restore the database consistently; therefore, after the database backup, a backup of the logfiles is necessary.

6.3 I/O Access


Parallel I/O

The basic idea behind containers is to provide for parallel processing, just as with database partitions. The best physical layout is one disk per container; in this case, containers can be accessed completely in parallel. Moreover, in order to keep each access process/thread equally busy, the containers are filled with data to the same degree. This is achieved by writing to all available containers of a tablespace in a round-robin manner.

Prefetching

DB2 performs I/O either through the agents themselves (i.e. the database server processes or tasks that accept and process user requests), or through separate I/O servers (processes on UNIX, threads on NT) if prefetching is used. In the latter case, one I/O server is started per container. Prefetching is configurable per tablespace; the prefetch parameter is set in units of extents. In order to speed up prefetching for table selections via primary key, table reorganizations sort records according to the primary key.
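The sketch below shows how containers, extent size and prefetch size interact in a tablespace definition; paths and values are invented. With a prefetch size of two extents spread over two containers, one prefetch request can drive both containers in parallel.

    -- Illustrative DMS tablespace with two containers (container sizes in 4 KB pages):
    CREATE TABLESPACE psapbtabd
        MANAGED BY DATABASE USING (FILE '/db2/C11/sapdata1/btabd.dat1' 256000,
                                   FILE '/db2/C11/sapdata2/btabd.dat2' 256000)
        EXTENTSIZE 16      -- allocation unit per container
        PREFETCHSIZE 32    -- read two extents ahead, one per container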

Soft Checkpoint

DB2 has a unique concept for checkpoints, called the soft checkpoint. When a checkpoint occurs, modified buffer pages are written to disk asynchronously. On the one hand, transactions therefore do not have to wait for checkpoint completion; on the other hand, a recovery may need to read logfile information older than the time of the checkpoint.


For flushing the data, soft checkpoints use I/O cleaners that have the general task of writing dirty buffer pages to disk.

6.4 Parallelism
Types of Parallelism

There are two types of parallelism for a database query:

a) Intra-partition parallelism works such that the cost-based optimizer generates a (parallel) execution plan defining tasks (tagged by parallelism operators) that can be executed in parallel. The tasks are then processed in parallel by different agents, producing intermediate results stored in so-called table queues (i.e. the results can be consumed directly). The number of agents can be specified, or the decision about the number of parallel agents can be left to the optimizer. This type of parallelism is supported by both DB2 EE and DB2 EEE.

b) Inter-partition parallelism serves to distribute an SQL statement (e.g. a query) across database partition servers belonging to different database partitions. A prerequisite is that the query can be split into parts according to the partitioning described by the partitioning map. Here, too, table queues are used to store intermediate results. This parallelism works only in environments with multiple database partitions (DB2 EEE).

Intra-partition parallelism and inter-partition parallelism can be mixed to increase the degree of parallelism.
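A minimal sketch of how intra-partition parallelism is typically enabled, assuming the DB2 command line processor; the degree setting is an example only:

    -- Enable intra-partition parallelism for the instance:
    UPDATE DATABASE MANAGER CONFIGURATION USING INTRA_PARALLEL YES
    -- Let the optimizer choose the number of parallel agents for this session:
    SET CURRENT DEGREE 'ANY'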

Models of Processing in Distributed Systems

For distributed systems, two models of processing requests are distinguished:

1) The data-shipping model (also called the shared-everything model) assumes that a node fulfils a data request itself and synchronizes access to the data with the other nodes. Typically, the requested data is part of data pages, which are either retrieved from disk or cached in a buffer. The synchronization is necessary because each node may have its own memory and data buffer. OPS (Oracle Parallel Server, see section 9.4) and DB2 for OS/390 (see section 7.4) work on this principle, but the two DBMS follow different concepts: while OPS uses local buffers that are locked and owned by some node for the duration of data access, DB2 for OS/390 keeps a global buffer for all nodes and uses a hardware component for the synchronization.

2) The function-shipping model (also called the shared-nothing model) assumes that a node requests the other nodes hosting some data to perform the data access. UDB with multiple partitions works on this principle: the so-called coordinator node, which is directly connected with the client that requested data, distributes the requests to all nodes hosting part of the requested data. The results returned by the other nodes are then processed (e.g. merged, sorted etc.) by the coordinator node and returned to the client. This model offers good scalability.


Currently, R/3 configurations with UDB do not use multiple partitions.


Index Creation

DB2 supports parallel index creation.

6.5 Specific Features


Buffer Pool Assignment

In connection with the usage of multiple buffer pools (i.e. collections of buffer pages), DB2 offers the option of assigning a buffer pool exclusively to a tablespace. This is useful in scenarios where data is known to be accessed continuously over time, e.g. BW dimension data or R/3 master data. Because this buffer pool is separated from the other buffer pools, its data pages cannot be paged out even if, for example, a large query demands many pages in another buffer pool.
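A sketch of such an exclusive assignment; bufferpool, tablespace and path names are invented:

    -- Dedicated buffer pool for continuously accessed data (names are examples):
    CREATE BUFFERPOOL bp_dim SIZE 5000 PAGESIZE 4K
    CREATE TABLESPACE dim_ts
        MANAGED BY DATABASE USING (FILE '/db2/C11/sapdata3/dim.dat1' 128000)
        BUFFERPOOL bp_dim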


7. DB2 UDB for OS/390



Universal Database

The following section deals with the OS/390 version of DB2 UDB (Universal Database), abbreviated as DB2/390 in the following. For details, please refer to IBM's DB2/390 documentation and to the SAP manual Database Administration Guide: DB2 for OS/390.

7.1 Physical and Logical DB Components


Database Structure

Figure 17 shows all components of a DB2/390 subsystem13.


[Class diagram omitted: a subsystem contains one DB catalog and multiple databases; databases contain tables and indexes, which are stored in tablespaces/indexspaces; tablespaces/indexspaces are assigned to stogroups, which consist of volumes.]

Figure 17 DB2/390 database components

Tables are created in tablespaces, which carry all physical storage attributes. A tablespace holds the data of one or more tables. Indexes are kept separate from the table; each index is physically stored in an indexspace of its own. Tablespaces and indexspaces are grouped into databases, and the multitude of databases then forms the full DB2/390 subsystem. Each tablespace and indexspace is associated with at least one stogroup, a set of volumes on direct access storage devices (DASD)14 that holds all data stored in tablespaces and indexspaces.

13 In this context, a subsystem is an instance of a relational database management system.


System Catalog

Within a DB2/390 subsystem there is exactly one database system catalog, which keeps the information about all database objects (tables, indexes, stogroups, tablespaces, indexspaces, etc.).

Buffer Pools

Buffer pools are areas of virtual storage in which DB2/390 temporarily stores pages of tablespaces or indexes for caching purposes. When an application program accesses a row of a table, DB2/390 retrieves the page containing that row and places the page into a buffer. If the needed data is already in the buffer, there is no need to access DASD, which significantly reduces the cost of retrieving the data. For details on buffer pool tuning, see section 7.5.

Active and Archive Logs

DB2/390 records all data changes and significant events in a log as they occur; in the case of failure, this data is used for recovery. Each log record is written to a DASD for archiving. When the active log is full, DB2/390 copies its contents to a data set called the archive log. An inventory of all active and archive log data sets is kept in the so-called bootstrap data set (BSDS).

7.2 Disk Layout


R/3 Layout

During an R/3 installation, the mapping of R/3 tables to DB2/390 is governed by the following basic rules:

= A table that is not R/3 buffered on the application server is placed into a dedicated, single-table tablespace.
= A table that is R/3 buffered is placed into a multi-table tablespace, which should not hold more than 100 tables.
= For each tablespace there is only one database and vice versa (i.e. an R/3 system includes multiple databases).
= A table and its indexes (i.e. their tablespaces/indexspaces) belong to separate stogroups that correspond to R/3's data classes (i.e. master data, transaction data).
= During R/3 runtime, no additional stogroups are created. If needed, R/3 creates additional databases and tablespaces at runtime, but without creating new stogroups.
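As an illustration of how stogroups tie tablespaces to volumes, a sketch with invented volume serials, catalog and object names:

    -- Illustrative DB2/390 DDL (all names and quantities are examples):
    CREATE STOGROUP SAPSG01 VOLUMES (VOL001, VOL002) VCAT DSNC11;
    CREATE TABLESPACE ZTABTS IN ZDB
        USING STOGROUP SAPSG01
        PRIQTY 7200    -- primary space allocation in KB
        SECQTY 720;    -- secondary space allocation in KB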

Hot Spots

To neutralize hot spots, tables should be moved to separate tablespaces with dedicated buffer pools15. Additionally, DB2/390's partitioning via key ranges can be applied to extremely large or heavily used tables. See section 7.4 for more details on partitioning.

14 A DASD is a device in which access time is independent of the location of data.
15 However, the assignment of buffers to tablespaces can be modified via SQL statements.


Backup Strategy

There are two types of backup, online and offline, each in three flavors: full, incremental, and with the option CHANGELIMIT. The latter leaves the decision whether to perform an incremental or a full backup to DB2/390. Backups work on the level of tablespaces and indexspaces. Due to this granularity, the backup strategy can be optimized considerably by taking into account how often the data held in a tablespace changes. In the event of a subsystem failure, backup and log information is used to recover. To keep recovery times short, online backups should be run every 1-2 days. Occasional offline backups of heavily updated and critical tablespaces, if possible of the entire subsystem, are also recommended.16 DB2/390 also supports backups based on volume copies, which can be done very fast.
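A hedged sketch of the COPY utility statement behind the CHANGELIMIT option; the tablespace name and percentages are invented. With CHANGELIMIT, DB2/390 takes no copy below the first percentage of changed pages, an incremental copy between the two, and a full copy above the second:

    -- Online image copy (SHRLEVEL CHANGE), copy type decided by DB2/390:
    COPY TABLESPACE ZDB.ZTABTS
         COPYDDN(SYSCOPY)
         SHRLEVEL CHANGE
         CHANGELIMIT(5,30)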

7.3 I/O Access


Read Mechanisms

DB2/390 uses three read mechanisms: normal read, sequential prefetch, and list sequential prefetch.

Normal Read

Normal read is used when just one or a few consecutive pages are retrieved. The unit of transfer for a normal read is one page. Normal reads are synchronous operations.

Sequential Prefetch

Sequential prefetch is performed concurrently with other operations of the originating application program. It brings pages into a virtual buffer pool before they are required (i.e. pages succeeding the requested page) and reads several pages with a single I/O operation. Sequential prefetch can be used to read data pages via tablespace scans or index scans with clustered data reference17, and it can also be used to read index pages in an index scan. Sequential prefetch allows CPU and I/O operations to be overlapped.

List Sequential Prefetch

List sequential prefetch is used to prefetch data pages that are not contiguous (such as through non-clustered indexes).

Query I/O Parallelism

Query I/O parallelism manages concurrent I/O requests (asynchronous I/O) for a single query, fetching pages into the buffer pool in parallel. This can significantly improve the performance of I/O-bound queries. I/O parallelism is used only when none of the other parallelism modes can be used.

Write Operations

Depending on the row length, a tablespace page can be 4 KB, 8 KB, 16 KB or 32 KB. Write operations are usually performed concurrently with user requests. Updated pages are queued by data set (i.e. tablespace, partition or indexspace) and are written when:

16 The advantage of an offline backup is that the database is archived in a consistent state.
17 For a clustered data reference, the index is used when modifying rows, but associated pages are read by prefetches.


= a checkpoint is taken,
= the percentage of updated pages in a virtual buffer pool for a single data set exceeds a preset limit called the vertical deferred write queue threshold (VDWQT), or
= the percentage of unavailable pages in a virtual buffer pool exceeds a preset limit called the deferred write queue threshold (DWQT).
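These thresholds are buffer pool attributes. A sketch, assuming the DB2/390 ALTER BUFFERPOOL command with example values:

    -- Set the write queue thresholds for buffer pool BP2 (values are examples):
    -ALTER BUFFERPOOL(BP2) VDWQT(5) DWQT(30)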

7.4 Parallelism
Partitioning

When DB2/390 plans to access data from a table or index in a partitioned tablespace (here, partitions correspond to stogroups), it can initiate multiple parallel operations, which can significantly reduce the response time for data- or processor-intensive queries.
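A minimal sketch of a partitioned tablespace definition; the names and the number of partitions are invented:

    -- Partitioned tablespace with four partitions, enabling parallel access:
    CREATE TABLESPACE ZSALESTS IN ZDB
        USING STOGROUP SAPSG01
        NUMPARTS 4;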

Query CPU Parallelism

Query CPU parallelism enables true multi-tasking within a query: a large query is broken into multiple smaller queries, which run simultaneously on multiple processors, accessing data in parallel. This reduces the elapsed time of a query.

Sysplex Query Parallelism

To expand the processing capacity available for processor-intensive queries even further, DB2/390 can split a large query across different DB2/390 members of a data sharing group (here, a member is a host). This is known as Sysplex query parallelism.

7.5 Specific Features


ASCII Processing

Even though DB2/390 processes SQL statements in EBCDIC format, all user data is stored and processed in ASCII format.

Data Sharing

It is possible to combine several DB2/390 subsystem instances, which all access the same data, into a data sharing group. Data consistency is ensured by group buffer pools and a lock manager within a coupling facility. Data sharing allows to:

= significantly improve performance (e.g. in combination with partitioning),
= scale the processing capacity,
= extend availability (7x24h; one member of the data sharing group runs independently of the other members), and
= configure a DB2/390 environment with great flexibility.


Buffer Pool Tuning

Within DB2/390, you can use up to 50 buffer pools with 4 KB buffers and up to 10 buffer pools each for 8 KB, 16 KB, and 32 KB buffers. The size of each buffer pool is set separately when installing DB2/390; sizes and other characteristics of a buffer pool can be changed at any time while DB2/390 is running. Multiple buffer pools allow a better match with the access characteristics of tablespaces and indexes.

Data Caching

In addition to the buffer pools, you can cache data in hiperpools, dataspaces and coupling facility group buffer pools18, all of which can significantly improve system performance.

Compression

Using the COMPRESS clause of the CREATE TABLESPACE and ALTER TABLESPACE SQL statements allows you to compress data in a tablespace or in a partition of a partitioned tablespace by exploiting a hardware feature. In many cases, the COMPRESS clause can significantly reduce the amount of DASD space needed to store data, but the compression ratio achieved depends on the characteristics of the data. You can use the DSN1COMP utility to determine how well your data will compress. With compressed data, you might see some of the following performance benefits, depending on the SQL workload and the amount of compression:

= higher buffer pool hit ratios (compression is retained in the buffer pool),
= fewer I/Os, and
= fewer getpage operations.
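A sketch of enabling compression for an existing tablespace (names invented); the DSN1COMP utility can be run beforehand to estimate the achievable ratio:

    -- Enable hardware-assisted compression; newly written rows are compressed
    -- (a reorganization compresses the already existing data):
    ALTER TABLESPACE ZDB.ZTABTS COMPRESS YES;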

18 Hiperpools, dataspaces and coupling facility group buffer pools represent specific types of buffers, with a different hardware implementation.



8. Informix

8.1 Physical and Logical DB Components


Database Structure

The largest storage space is called a chunk (see Figure 18 below), which is either a file19 or a raw device. Chunks provide the storage for dbspaces. Raw devices can also be divided into several chunks by using offsets; e.g., a 2 GB raw device can be divided into two chunks of equal size by using one offset of 0 KB and another offset of 1 GB. Note that a chunk has an upper limit of 2 GB, i.e. devices larger than 2 GB have to be split into several partitions. A chunk, in turn, consists of pages. For the allocation of data pages to database objects, extents are used; there is a first extent size, and a next extent size for all subsequent allocations. In Informix, an extent is entirely contained in a chunk (it cannot cross chunk boundaries).
[Class diagram omitted: a database consists of dbspaces, which are physically stored in chunks (files or parts of raw devices); tables are recorded in tblspaces and described in the DB catalog.]

Figure 18 Structure of an Informix database
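A sketch of splitting a 2 GB raw device into two chunks with offsets, using the Informix onspaces utility; device path and dbspace name are invented, and sizes and offsets are given in KB:

    # Create a dbspace with a first chunk of 1 GB at offset 0:
    onspaces -c -d datadbs1 -p /dev/rdsk/c0t1d0s4 -o 0 -s 1048576
    # Add a second 1 GB chunk on the same 2 GB device, starting at offset 1 GB:
    onspaces -a datadbs1 -p /dev/rdsk/c0t1d0s4 -o 1048576 -s 1048576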

Logical DB Components

As logical database components, Informix offers the database, tables, and dbspaces, in which tables are stored. Furthermore, at database creation time, dbspaces are specified as the storage medium for the database itself. The default dbspace which always exists is the root dbspace, which holds

19 Files are also sometimes called cooked file space, as the file system has prepared the disk for storage.


the database catalog and is the default dbspace for temporary tables; it is also the default dbspace for the creation of a database. Another logical component is the tblspace. Each table is assigned a tblspace. For easier disk space tracking by a database administrator, the tblspace records (contains) exactly those extents that are used by the table in question; this is useful because a table can be distributed across several chunks of the table's dbspace.
Storing Large Objects

Apart from normal dbspaces and pages, Informix offers the concept of blobspaces and blobpages for the storage of byte or text information. This is mainly a concept for large data objects and less relevant for OLTP applications, which mainly store small, structured records (blobspaces are not used for SAP systems). Therefore, this storage concept is not illustrated further here.

DbSpace Mirroring

Dbspaces can be mirrored on the database level, which means that for each chunk in the original dbspace a mirror counterpart exists. Mirroring is needed for rootdbs, logdbs and physdbs, but it can also be done on the hardware level (e.g. RAID-1).

Logical Log

A difference between Informix and other databases is that not only the physical log (i.e. the rollback information) is part of a dbspace, but the logical log is part of some dbspace as well. This unifies administration to some degree; e.g., the concept of mirroring a dbspace for availability can also be applied to the logical log (which must be mirrored).

8.2 Disk Layout


Relationship Chunk - Disk

Even though the use of offsets allows a physical disk/partition to be separated into multiple chunks, this is not recommended and is best avoided20; tracking disk usage is easier with a 1:1 relationship.

Separating the Physical and Logical Log

For high-volume OLTP, it is recommended to define dedicated dbspaces for the physical log (physdbs) and the logical log (logdbs), taking them out of the root dbspace where they are stored by default. R/3 installations do this automatically.

Hot Spots

The recommendation for preventing hot spots in Informix is to use fragmentation (see section 8.5 for a detailed description).

Data Availability

Grouping of tables in dbspaces should also consider the unavailability in case an assigned device fails; other dbspaces then remain operable.

Temporary Tablespace

When setting up temporary dbspaces (tempdbs), care should be taken to distribute them across separate disks. Using several temporary dbspaces can be advantageous.

20 As mentioned before, the 2 GB limit on a chunk may make the use of offsets necessary.


Mirroring Recommendations

Mirroring is done for the root dbspace and for the dbspaces of the physical and the logical log. It is important to place mirrors on different disks, and ideally to access them via different controllers.

Extent Allocation

Extent allocation works in such a way that the DBMS searches the chunks assigned to a dbspace sequentially until it finds enough free contiguous data pages21. If no sufficiently large gap is found, the extent is created with a smaller size. Because there is a maximum number of extents (depending on the page size), the DBMS has 2 methods to manage extent allocation efficiently:

= After 16 extents, the extent size is doubled every further 8 extents (i.e. after 24 extents, after 32 extents, etc.).
= The DBMS tries to merge extents (i.e. adjacent extents belonging to the same database object can be combined into one extent).

Setting Storage Parameters

A general recommendation is to avoid extent interleaving where possible. Extent interleaving happens when two tables in the same dbspace grow and the DBMS assigns new extents to the two tables in alternating order. Reading the table data then cannot use prefetching effectively, because a table is not stored physically contiguously; a second effect is an increase in the device's seek time. Possible solutions are to change the table storage parameters (using large extents, setting a large next extent size22) or to assign the tables to different dbspaces.

Detached Indexes

To separate data and indexes, in Informix one has to use detached indexes. If the index is attached, table and index together are restricted to 32 GB (in the case of 2 KB pages). When a table is put into its own dbspace and its detached index into another dbspace of its own, each can grow to 32 GB (in the case of 2 KB pages). If this is too restrictive, the data has to be fragmented (see section 8.5).
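A sketch of a detached index with invented table, column and dbspace names, placing table and index in different dbspaces and setting explicit extent sizes (in KB):

    -- Table in its own dbspace, index detached into another dbspace:
    CREATE TABLE zorders (
        order_id INTEGER,
        amount   DECIMAL(15,2)
    ) IN zorderdbs EXTENT SIZE 8000 NEXT SIZE 2000;
    CREATE INDEX zorders_ix ON zorders (order_id) IN zorderidbs;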

Backup Strategy

When planning the backup strategy, it should be considered that the unit of backup is at least a dbspace (the actual restriction depends on the tool used). The disk layout should take this into account (e.g., tables to be backed up together can be placed in a common dbspace).

8.3 I/O Access


Page-Cleaning

In Informix, the process of flushing modified pages from shared memory to disk is called page cleaning. The task of the page-cleaner threads is to write buffers to disk. This occurs in 3 situations:

21 Note that only space in a chunk on a raw disk is physically contiguous (this does not hold for file space, which is spread by the OS according to its free space management).
22 However, it has to be considered that if no sufficiently large gap is found, the DBMS will create an extent with a smaller than planned size.


1) A page needs to be read into the buffers for a user request, but no space is available and all buffer pages are modified; a page then has to be paged out.

2) The parameter LRU_MAX_DIRTY specifies a limit in percent (percentage of dirty pages compared with all buffer pages) at which page cleaning is triggered. Once running, page cleaning stops when the lower limit (parameter LRU_MIN_DIRTY) is reached.

3) A checkpoint occurs.
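For illustration, the corresponding entries in the ONCONFIG file might look as follows; the values are examples, not recommendations:

    # Page-cleaning thresholds in the ONCONFIG file:
    CLEANERS      8    # number of page-cleaner threads
    LRU_MAX_DIRTY 60   # start cleaning at 60% dirty pages per LRU queue
    LRU_MIN_DIRTY 50   # stop cleaning at 50%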
Checkpoint

When a checkpoint occurs, page-cleaner threads are activated to perform a so-called chunk write, with one thread assigned to one chunk. For each chunk containing dirty pages, a (chunk) map of the pages to be written is created, and the pages are then written to disk in the order in which they appear physically, minimizing disk access time.

8.4 Parallelism
Dynamic Scalable Architecture

DSA (Dynamic Scalable Architecture) is Informix's support for SMP machines. It enables the DBMS to process a request of a single client on multiple CPUs in parallel.

Virtual Processors

The central concept in DSA is the virtual processor. Informix uses this term for a single OS process that processes client requests. Such a process can handle multiple (concurrent) threads, each serving one client session; threads are also used for data I/O, logging I/O, and other tasks. Virtual processors are assigned to a specific class, which characterizes the processing goal (e.g. data processing, disk I/O, etc.). Threads within virtual processors are shared, i.e. the processing of a thread can migrate from one virtual processor to another virtual processor of the same class. Furthermore, several virtual processors of the same class can process a single task in parallel. Using the concept of processor affinity, specific virtual processors can be fixed to some CPU, which thereafter processes the virtual processor exclusively.

Asynchronous I/O

With the virtual processor concept, disk I/O is performed either by using the OS asynchronous I/O facilities or by Informix's own asynchronous I/O implementation (virtual processor class AIO). For writing the logical and physical log, different virtual processor classes are used (LIO, PIO). In order to achieve good I/O throughput, each I/O task is assigned a priority, so that important I/O tasks can be processed before low-priority ones. If the OS supports kernel asynchronous I/O (kernel AIO), it should be used for all I/O, especially logical and physical log writing (supported by the Informix kio thread). Otherwise, log writing is done by virtual processors (classes LIO, PIO). While the (logical/physical) log is written by one virtual processor, a mirror of the log is written by a separate (mirror) virtual processor.
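A sketch of how virtual processor classes and processor affinity might be configured in the ONCONFIG file; the VPCLASS notation is available as of Informix Dynamic Server 7.3, and the numbers are examples:

    # Four CPU virtual processors, pinned to CPUs 0-3:
    VPCLASS cpu,num=4,aff=0-3
    # Eight AIO virtual processors for disk I/O:
    VPCLASS aio,num=8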


Parallel Query

Informix supports parallel queries (PDQ). Access to a partitioned table uses the parallel query technique to read the partitions in parallel.

Parallel Sorting, Index Creation, Update Statistics

Other important DBMS operations that run in parallel are sorting, parallel index creation and the update of statistics. For parallel sorting, multiple processes may be specified, each with its own temporary dbspace. In order to use parallel index creation, the corresponding table has to be partitioned (see section 8.5); the same applies to the parallel update statistics operation.

Cluster Support

With the latest DBMS releases, Informix also supports MPP and clustered systems (including clustered SMP systems) with the Extended Parallel Option. Informix parallel processing follows the paradigm of function shipping: there is no central lock or buffer management; instead, the database data is partitioned (see section 8.5) across several nodes, each of which owns a partition of the data. If a node needs data kept on another node, it submits a request to the corresponding node. With this option, the Informix DBMS supports both intra-parallel processing (parallelism within an SMP node) and inter-parallel processing (parallelism across loosely-coupled systems).

8.5 Specific Features


Data Partitioning

The Informix DBMS allows data partitioning (called fragmentation), both for tables and for indexes.

Partitioning Strategies

For fragmentation, there are 2 strategies. When fragmenting a table, round-robin writing can be used to achieve equal data distribution across all fragments; note that this partitioning method should not be applied to indexes. The second method is fragmentation by expression (i.e. formulating conditions on table fields that determine to which fragment a certain row belongs), which can be used for both tables and indexes.

Maximum Size Limitations

If table data and index data exceed the limit of 32 GB (for a page size of 2 KB), they have to be fragmented. Distribution across a maximum of 256 fragments with a limit of 32 GB each leads to a total limit of 8 TB of data for a single table. Apart from this reason, fragmentation also mostly leads to a performance gain.
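Sketches of both strategies, with invented table, column and dbspace names:

    -- Round-robin fragmentation for even data distribution (tables only):
    CREATE TABLE zfacts (
        doc_id INTEGER,
        value  DECIMAL(15,2)
    ) FRAGMENT BY ROUND ROBIN IN dbs1, dbs2, dbs3;

    -- Fragmentation by expression, usable for tables and indexes:
    CREATE TABLE zhist (
        year_no INTEGER,
        value   DECIMAL(15,2)
    ) FRAGMENT BY EXPRESSION
        year_no <  1999 IN dbs_old,
        year_no >= 1999 IN dbs_new;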



9. Oracle

For detailed information about the physical structure of Oracle databases, see [3]. A detailed description of the server architecture and the Oracle Parallel Server (OPS)23 is given in [4].

9.1 Physical and Logical DB Components


Database Structure

In principle, Oracle databases are structured simply, as Figure 19 below shows. Leaving out some storage parameters and storage units that only serve to organize tablespace growth, the following describes an Oracle database.
[Class diagram omitted: a database consists of tablespaces, which are physically stored in storage units (files or raw devices) and record tables; the DB catalog describes the content.]

Figure 19 Structure of an Oracle database

Tablespaces

An Oracle database consists of a number of tablespaces. Tablespaces are logically composed of tables (or other database objects, like indexes), which are physically stored in a storage unit that is either a file or a raw device. A description of all tablespaces and tables is provided in the DB catalog.

Data Blocks

Like other DBMS, Oracle accesses data on disk in units of data blocks, the smallest unit of the database used for I/O. The data block size is configured when creating the database; the recommendation is to use a multiple of the OS block size, which reduces the I/O overhead of transferring unnecessary data to or from disk. This is a general consideration; for R/3, the block size is fixed at 8 KB.

Extents

While tablespace data is stored within blocks, there are larger storage units: extents consist of multiple data blocks but have to be physically stored in one place (file or disk); segments, which consist of extents, store

23 The OPS is not treated in detail here because it is not much used for the R/3 system.


whole database objects and can cross physical borders (i.e. they can be distributed across files or disks). For a detailed description, see [3] and [4]. Here it suffices to say that it is essential to set the storage parameters for all tablespaces correctly, so that tables can grow on the physical medium in a reasonable way. The tablespace storage parameters are set when creating the tablespace.
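A sketch of such a tablespace definition with explicit storage parameters; the tablespace name, file path and sizes are invented:

    -- Tablespace with default storage parameters for the tables created in it:
    CREATE TABLESPACE psapbtabd
        DATAFILE '/oracle/C11/sapdata1/btabd_1/btabd.data1' SIZE 2000M
        DEFAULT STORAGE (INITIAL 10M NEXT 10M PCTINCREASE 0);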
R/3 Tablespaces

In principle, all types of segments (table, index, rollback, temporary) can be located in any tablespace. R/3 databases, however, use different tablespaces for each segment type (PSAPTEMP for temporary segments, PSAPROLL for rollback segments, and for each data tablespace a corresponding index tablespace).

Defragmentation

The Oracle DBMS attempts to create and extend storage areas contiguously if the OS supports this method of space utilization, so disk defragmentation should normally not be necessary. It is advantageous to defragment prior to creating storage areas and after extensive file extensions. If storage areas extend frequently, larger initial allocations as well as larger extensions should be considered.

9.2 Disk Layout


Striping

There are 2 options for striping:

1) using the OS or the LVM, respectively, or

2) striping data manually.

Manual Striping

For manual striping, a table is created with extents that span the whole disk space (several disks). When writing into the extents, the data is distributed across the extents and therefore across the disks allocated to them. However, this option requires the administrator to estimate table sizes precisely. If they are not estimated correctly, the allocated space is either wasted, or the table data is written to some unknown disk space (which obscures the layout and may degrade performance).

Temporary Tablespaces

Oracle can use temporary tablespaces for sorting. If a temporary tablespace cannot be used and a sort exceeds the available memory, Oracle will allocate, sort in and finally deallocate temporary segments in other tablespaces. For temporary tablespaces, a different allocation procedure is used that provides a shared area usable by all user processes performing sort operations.

Tablespace Sizing

Until Oracle version 8, the DBMS administrator had to monitor tablespace growth, because a filled tablespace would bring the DBMS to a halt. Version 8 offers an auto-extension feature for database files.

Table Reorganization

Because there are database states in which the internal data structures (B* trees) are not well balanced (e.g. due to deletes of data stored in the leaf nodes), an index reorganization may be useful to re-balance the tree. For indexes, the REBUILD option can be used in connection with an ALTER INDEX statement. This is also useful when the storage characteristics


(e.g. extent growth) are changed. To speed up index creation, the option NOLOGGING can be used if index recovery is not necessary.
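Sketches of the statements mentioned above; the file path and index name are invented:

    -- Auto-extension for a data file (as of Oracle version 8):
    ALTER DATABASE DATAFILE '/oracle/C11/sapdata1/btabd_1/btabd.data1'
        AUTOEXTEND ON NEXT 100M MAXSIZE 4000M;

    -- Re-balance an index without writing redo information:
    ALTER INDEX zorders_ix REBUILD NOLOGGING;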
Backup Strategy

Depending on whether the DBMS runs in archive mode or not, backups can be made online or offline. Only an offline backup guarantees a consistent database image, while an online backup requires the archived redo logfiles for a consistent recovery. In any case, the archived redo logfiles (also called offline redo logfiles) are needed to recover the last consistent database state. Because Oracle allows tablespace backups, the backup strategy for large databases can restrict daily backups to certain tablespaces, which can also be recovered separately. The layout of tablespaces therefore also interrelates with the backup strategy.

9.3 I/O Access


Asynchronous I/O Processing

Normally, Oracle provides DB writer processes (technical name: DBWn processes) to write changed buffer data to the DB. If there is only one writer (technical name: DBW0) and the OS provides asynchronous I/O, this process can use I/O slaves, which are kernel processes shared among all processes running on the system. If the OS does not provide asynchronous I/O24, a process requesting I/O blocks; in this case Oracle directs I/O requests to different Oracle processes, which guarantees that I/O requests can be issued in parallel. Oracle provides the parameter DISK_ASYNCH_IO to activate or deactivate the OS asynchronous I/O facility; the number of I/O slave processes can be configured with the parameter DBWR_IO_SLAVES.

Platform Dependency

Asynchronous I/O does not improve I/O performance on all platforms. In the R/3 environment, Solaris benefits from asynchronous I/O, while HP-UX shows no advantage in using it.

I/O Access Unit

In general, the I/O access unit is the page size, which is 8 KB for R/3 databases.

Prefetching

For large table scans, it is effective to read multiple data blocks in serial order in one step. The Oracle optimizer can activate this prefetching (so-called multiblock read) in these situations. The parameter DB_FILE_MULTIBLOCK_READ_COUNT gives the number of data blocks requested in one step. As several OS restrict the data volume of one request to 64 KB, the value has to be set accordingly in these cases.

Striping and Block Size

When using Oracle with striping, the stripe size should be a multiple of the block size, at least 2. This holds for applications with many random reads and writes. If there are many sequential reads, however, the stripe size should be a multiple (recommended: double) of DB_FILE_MULTIBLOCK_READ_COUNT * block size.
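For illustration, the corresponding init.ora entries might look as follows; the values are examples only (8 blocks * 8 KB matches a 64 KB OS request limit):

    # init<SID>.ora excerpt (values are examples):
    disk_asynch_io                = true   # use the OS asynchronous I/O facility
    dbwr_io_slaves                = 4      # I/O slaves if only DBW0 is used
    db_block_size                 = 8192   # fixed to 8 KB for R/3
    db_file_multiblock_read_count = 8      # blocks per multiblock read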

24 Some OS support asynchronous I/O only for raw devices.


9.4 Parallelism
Parallel Index Creation

Oracle offers parallel index creation, either with the number of parallel processes specified (using the PARALLEL clause) or with the number left unspecified, in which case all available CPUs are used. If recoverability of the index is not needed, logging can be disabled (option NOLOGGING), which further speeds up index creation. Another consideration is that, up to version 8.0, Oracle locks the table to be indexed. In contrast, Oracle version 8.1 does not lock the table during index creation but stores table manipulations as delta information, which is applied to the index at the end of index creation (the table is locked only during this delta update).

Statistics Update

When statistics are updated, Oracle can either estimate or compute the statistical values. It should be considered that computation involves a table or index scan and a sort, which may require temporary (disk) space.

Oracle DBMS Types

In principle, there are two types of Oracle DBMS:

= The normal DBMS. It can be used on systems with one processor and on systems with multiple processors (SMP). Beginning with Oracle version 7.1, the Oracle Parallel Query Option (OPQ) is available as part of the standard DBMS. In spite of the name, this includes parallelizing of queries, index creation and load operations; other SQL operations followed later.

= The Oracle Parallel Server (OPS). It must be used on loosely-coupled systems (clusters) or massively parallel systems (MPP).
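Sketches of the statements discussed above; the table and index names are invented, and the two ANALYZE variants contrast estimation with computation:

    -- Parallel index creation without redo logging:
    CREATE INDEX zorders_ix ON zorders (order_id)
        PARALLEL (DEGREE 4) NOLOGGING;

    -- Statistics by estimation (samples the data) or by computation (full scan):
    ANALYZE TABLE zorders ESTIMATE STATISTICS SAMPLE 10 PERCENT;
    ANALYZE TABLE zorders COMPUTE STATISTICS;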

Oracle Parallel Server

The second type has to be clearly distinguished from a distributed database. Oracle provides a sort of distributed database via database links and uses a two-phase commit protocol (2PC protocol) to coordinate distributed transactions; a description can be found in [4]. Each part of the distributed database maintains its own catalog and performs all database operations independently, i.e. it is actually a database in its own right. As opposed to this, OPS works on a shared database; special techniques are needed to synchronize the shared access to data, especially the data buffers that exist on each node. As OPS is not much used as an R/3 DBMS, no detailed description is given here (see [4] for a detailed description).

SMP Parallelism

For normal SMP parallelism, Oracle can use one of the two following methods:

= Parallelizing by block range (dynamic method). Oracle divides the table into ranges of data blocks and executes the SQL operation on each range. The decision how to divide the table is made dynamically.

= Parallelizing by partition (static method). This uses the definition of the table partitioning and is therefore a static division of the table.

The SQL operations available for the 2 methods differ. Therefore, depending on the application profile, one of the two methods may be favorable for specific tables.


9.5 Specific Features


Operation Modes

Oracle provides two modes in which the DBMS can be operated.

Shadow Processes

For heavy load on the database server processes, Oracle can use so-called shadow processes to serve a client exclusively (in a 1:1 relationship). This is the configuration used for R/3.

Multi-Threaded Server

An alternative is the use of MTS (Multi-Threaded Server). In this configuration, server processes can serve multiple clients; a DBMS dispatcher schedules client requests to server processes that are free for processing. This configuration does not make sense for R/3, because R/3 itself schedules R/3 client requests, so an additional scheduling layer does not improve performance (R/3 work process connections produce heavy database load).

Data Distribution by Partitioning

Partitioning25 of a table, which is specified at table creation time, can control how data is spread across physical devices. To balance I/O utilization, it can be specified where the partitions of a table or index are stored. For a known application profile, this location control can reduce disk contention.

Partitioning vs. Striping

Reducing disk contention is also the goal of striping, but the means differ. With partitioning, historical or rarely used data can be placed on slower disks; striping would make no difference to any disk taking part in the striping set.

Partitioning and Striping

Partitioning and striping can be used at the same time; Figure 20 below shows two different configurations. The configuration on the left shows 3 partitions that are each striped across all 6 available disks; the configuration on the right shows each partition striped across its own set of disks. While striping each partition across all available disks maximizes performance (the chance to read in parallel is maximal), availability suffers: if any of the disks crashes, all partitions become unavailable. For mission-critical data, there are (Oracle) recommendations to emphasize availability over performance; such data should therefore be stored redundantly (e.g. using RAID-1).

Increased Administration

From an administrative point of view, similar to the tuning of storage parameters, the partitioning specification requires profound knowledge, including a basic understanding of what the data distribution within the partitioned table looks like.

25 Because partitioning is a new feature of Oracle 8, it has not yet been used for R/3, except for the SAP Business Information Warehouse (SAP BW). The following considerations are therefore not directly relevant to current database layout.


[Diagram omitted: left configuration - partitions 1 to 3 each striped across all disks 1 to 6; right configuration - each partition striped across its own set of disks.]

Figure 20 Using partitioning with striping
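A sketch of the range partitioning discussed above, with per-partition tablespace placement (Oracle 8 syntax; all names and boundary values are invented):

    -- Historical data on a tablespace backed by slower disks:
    CREATE TABLE zsales (
        period NUMBER(6),
        amount NUMBER(15,2)
    )
    PARTITION BY RANGE (period) (
        PARTITION p1998 VALUES LESS THAN (199901) TABLESPACE ts_hist,
        PARTITION p1999 VALUES LESS THAN (200001) TABLESPACE ts_cur
    );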


10. SAP DB

Easy Administration

On the whole, SAP DB (previously ADABAS for R/3) is a database that tries to minimize administration overhead; configurations that force the DB administrator to tune database parameters are avoided.

10.1 Physical and Logical DB Components


Database Structure

The structure of the database closely resembles the Oracle model (see Figure 21). However, SAP DB has no equivalent of a tablespace: both the tables and the storage units (called devspaces in SAP DB) are directly part of the database, and a table can be spread across all available storage units (disks).
Database 1 Content Description

1..* Storage Unit (devspace) * 1..* Storage

1..* Table 1 1..* Recording

1 DB Catalog

File

Raw Device

Figure 21 Structure of a SAP DB database

DBMS Striping

The general policy of SAP DB is to distribute the data of any table across all available disks; this kind of distribution can be called database striping. The access unit is the data page (equivalent to the stripe size here), which is 8 KB in SAP DB version 7 and 4 KB in version 6.

Tablespaces

Because there is no tablespace-like construct to hold collections of tables, data cannot be separated: hot spots cannot be treated by placing the critical data on disks of their own. On the other hand, because there are no tablespaces, there is no administration work for monitoring tablespace fill levels. Since all tables can use the whole available disk space, this is the only limit to table growth.

Devspace Mirroring

When data is to be mirrored for availability reasons, SAP DB can write data to mirrored devspaces.


Extending the Storage

If the available disk space has to be extended, a new devspace can be added. Only in this case is a database reorganization needed, to redistribute data from the already filled devspaces to the newly added, fresh ones. This reorganization is triggered by the DBMS itself in the background, without user intervention. Reorganization is not needed in any other case, because the data is always distributed evenly across the disks.

10.2 Disk Layout


Amount of Disks and Disk Sizes

Because SAP DB always distributes data across all available disks, the more disks there are, the better. It is therefore advantageous to use many smaller disks whose combined capacity equals that of fewer larger disks. However, larger disks as a rule have better access times than smaller ones, so the effect of using smaller disks is compensated to a certain extent.

Raw Device vs. File System

For UNIX platforms, it is recommended to use raw devices as devspaces, for performance as well as availability reasons. Availability is better because some filesystems still use a write-back cache: if the system crashes, data might not have been flushed, and the database may be left in an inconsistent state. To prevent this, raw devices, which provide direct disk access, can be used. For SAP DB, this means that raw devices should be used at least for the log and for the data devspaces. Modern OS alleviate this problem somewhat (if the DBMS uses their features correctly).

Using RAID

RAID systems are recommended for SAP DB not for performance, but only for availability. The performance advantage of striping with RAID-5 is not effective, because SAP DB stripes the data itself.

Logfile Placement

Logfiles have to be mirrored. This can be done either by host-side mirroring (i.e. by SAP DB or by the OS) or by using RAID-1. RAID-1, rather than RAID-5, should be used because it is safer and performs better, which is important for the critical logfile writing. Several logfiles can be used, but they are written sequentially.

Assignment Disks Devspaces

As a rule, each devspace should be assigned its own disk. This holds even for RAID-5, where the disks are not visible to the OS: e.g., a RAID-5 array with 6 disks should have 6 devspaces assigned. For each devspace, at least one separate SAP DB I/O queue exists, which handles the requests for that devspace (the number of I/O queues is configurable). If only a single devspace exists for the RAID-5 device, multiple I/O requests are queued there, even though they would soon be forwarded by the RAID controller to some RAID disk and processed in parallel. With multiple devspaces, multiple queues exist, so requests can potentially be forwarded to the RAID system faster.

Sort Data and Temporary Data

SAP DB uses the normal data area to store temporary and sort data. The sizing of the data devspaces therefore has to take these additional storage requirements into account.


Backup Strategy

SAP DB offers either a full (offline or online) backup or an incremental backup; in addition, the logfiles can be backed up automatically. A backup can be made in such a way that the database image is transaction consistent. Using the AUTO-SAVE-LOG tool, the log can be backed up to an external disk or a tape. This can be configured such that the previous logfile is backed up while the current logfile records the current data changes, so that there is no contention between these two disk writes. A recovery of the database recovers the complete database from the backup information.

10.3 I/O Access


Asynchronous I/O

SAP DB has its own mechanism for asynchronous I/O, which it uses on platforms where the OS-provided asynchronous I/O is slower; on NT, the OS mechanisms are used. This allows SAP DB to parallelize I/O requests. The SAP DB implementation uses OS processes or OS threads (which of the two depends on the OS) to request I/O asynchronously; this is analogous to the Oracle implementation.

I/O Access Unit

There is no additional I/O unit or unit for table expansion such as other DBMS provide (often called an extent in this context). However, writing is normally done in 32 KB units (version 6) or 64 KB units (version 7).

Prefetching

SAP DB does not use prefetching of data. This is reasonable because a logically contiguous data area can be physically distributed across many disks; the physically following data block of a disk may belong to a completely different database object.

Savepoint, Checkpoint

SAP DB distinguishes between savepoints and checkpoints. When a savepoint is triggered, the current buffer pages are written directly to disk. A checkpoint is a special savepoint: when a checkpoint occurs, the DBMS waits until all active write transactions have terminated and then flushes the dirty pages to disk, producing a consistent and persistent database image. While the database image of a checkpoint can be recovered on its own, a savepoint database image is only recoverable together with the log information.

10.4 Parallelism
Parallel Operations

Generally, disk I/O requests can be handled in parallel. An important consideration for OLTP is that indices can be created in parallel.

10.5 Specific Features


The basic strategy for writing to the devspaces is the round robin method.


Automatic Table and Index Reorganization

In line with SAP DB's principle of easy administration, manual reorganization of database tables and indexes is unnecessary. In particular, deletion of leaf nodes in a B* tree triggers an automatic re-balancing process.


11. SQL Server



Easy Administration

A DBMS implements general goals determined by its designers and implementors. In the case of SQL Server 7.0, the overall goal has been to provide easy DB administration, so knowledge of tuning parameters becomes less important.

11.1 Physical and Logical DB Components


Database Structure

The overall structure of SQL Server is shown in Figure 22. The top-level component is the instance26, which controls multiple databases as well as some resources used by all databases, e.g. the data cache and the temporary storage. Four databases are used by the DBMS itself and are created during system installation: the master database, tempdb, the model database and msdb.
[Class diagram omitted: an instance controls four or more databases; each database consists of filegroups containing files, stores tables, and records them in its DB catalog.]

Figure 22 Structure of an SQL Server database

= The master database stores configuration data of the instance.
= tempdb provides temporary storage, e.g. for sorting.
= The model database represents a template for the creation of new databases.
26 SQL Server 7.0 has at most one instance per database server. SQL Server 2000 enables the use of multiple instances, which are completely independent of each other, i.e. there are independent executable files, buffers and system databases (master, tempdb, model and msdb). This also allows running different DBMS release levels on one database server.


= msdb contains information about jobs and backups.

Apart from the system databases, multiple user databases can be created. Configuration can be done either on instance or on database level. The filegroup has a role similar to an Oracle tablespace: it contains a number of files, which are used to store the table data.
DB Catalogs

DB catalogs are stored per database, i.e. the system tables containing database object information are stored in the same database as the objects they describe. System databases store additional information in their catalogs, e.g. configuration and user information in the master database.

R/3 Configuration

A typical R/3 configuration always consists of one instance, one user database and one filegroup.

Extents as Allocation Unit

For storage allocation, SQL Server uses the storage unit of one extent. An extent corresponds to 8 data pages; a data page is 8 KB, so the size of an extent is 64 KB. SQL Server extents differ from Oracle extents: for SQL Server, the unit of freelist management is the extent, while Oracle maintains block freelists as part of the segment administration data.
11.2 Disk Layout


Parameter Configuration

SQL Server 7.0 implements features like autogrowing and autoshrinking of files. This mechanism improves the availability of the database, because the data files can grow as long as disk space remains. Memory configuration is also done only once per instance; SQL Server then continually adjusts the sizes of its internal memory pools (e.g. for the data cache, connections etc.) itself. By restricting the available memory, resource contention with other concurrent processes can be avoided. As of SQL Server 2000, the reservable virtual memory per process has been extended to 64 GB (using a mapping method called Address Windowing Extensions (AWE)).
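A sketch of the one-time memory restriction in Transact-SQL; the value is an example:

    -- Cap SQL Server memory to avoid contention with other processes:
    EXEC sp_configure 'max server memory (MB)', 2048
    RECONFIGURE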

Proportional Fill

The basic policy by which SQL Server distributes data on disk is called proportional fill: data is written to all files of a database such that all files are filled to the same degree. With data distributed equally across files, disk activity is also likely to be distributed across devices (if the files are stored on different devices). If the data files are each placed on a separate RAID device, a distribution across two dimensions is achieved: the RAID controller stripes the data vertically across the RAID disks, while SQL Server distributes the data horizontally across the available RAID devices. SAP recommends using at least RAID-5.

Number of Data Files

In order to prepare for data growth, it is useful to utilize multiple data files. If they are stored on a single (RAID) device, there is no difference between using one or more files, but if additional devices are added to the configuration,


the existent files can be placed on the new devices and are then already of equal size. Files can always be added or removed on demand.
Log Files

For safety reasons, SAP recommends placing log files on a RAID-1 device; for performance reasons, no other files should reside on this device. Normally, the R/3 transaction log consists of one logfile. The autogrowing feature can also be applied to logfiles. In order to free space within the logfile, a regular log backup is necessary.

Temporary Space

An important consideration for the layout of an SQL Server database is the tempdb database, which holds all temporary data; an important application of tempdb is sorting. Placing tempdb on one disk could lead to disk queuing, therefore the following recommendation is given: the tempdb files should reside on devices with low activity, and neither database log files, database data files nor the NT page file should reside on the same device.
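A sketch of moving the tempdb data file to a dedicated device; the drive letter and path are invented, and the change takes effect after a restart:

    -- Relocate tempdb's primary data file to a low-activity device:
    ALTER DATABASE tempdb
        MODIFY FILE (NAME = tempdev, FILENAME = 'T:\mssql\data\tempdb.mdf')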

Backup Strategy

SQL Server supports full online backup, differential backup, backup of separate files or filegroups, and log backup. The backup is performed per database; a full backup also contains the logfiles. For R/3, daily full or differential backups and more frequent log backups (e.g. once per hour) are recommended. Differential backups can be mixed with full backups (e.g. in a cycle of one full and four differential backups).
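Sketches of such a cycle in Transact-SQL; the database name and backup destinations are invented:

    -- Full backup, differential backup, and a more frequent log backup:
    BACKUP DATABASE C11 TO DISK = 'E:\backup\c11_full.bak'
    BACKUP DATABASE C11 TO DISK = 'E:\backup\c11_diff.bak' WITH DIFFERENTIAL
    BACKUP LOG C11 TO DISK = 'E:\backup\c11_log.trn'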

11.3 I/O Access


Asynchronous I/O

SQL Server uses Windows NT's asynchronous I/O facilities, with which multiple I/O requests for each database file can be processed asynchronously. The parameter max async io specifies the maximum number of asynchronous I/O requests per file; as of SQL Server 2000, asynchronous I/O is configured automatically.

Scatter-Gather I/O

Checkpoint


11.4 Parallelism
Process Architecture

Each SQL Server instance consists of a single NT process. The process maintains a pool of either threads or fibres27 for user connections. There is no dedicated shadow process/thread per connection; instead, on a user request, a free thread/fibre is allocated from the pool.

Parallel Queries

SQL Server can execute queries in parallel. When SQL Server parses a query request and prepares an execution plan, it can decide to execute the query in parallel.

Determining the Degree of Parallelism

The decision about how many CPUs to use for parallel query execution is made by the DBMS itself, based on several considerations:

= A parallel query execution plan is only generated if the cost-based optimizer calculates that the estimated execution time surpasses a certain threshold (determined by the configuration option cost threshold for parallelism).

= How many CPUs may be used by SQL Server, and to what degree are they free to handle requests? The number of CPUs can be limited using the configuration option affinity mask, which allows configuring a host such that "affiliated" CPUs (i.e. CPUs sharing a common memory) are assigned to an SQL Server instance for processing; this strongly improves the performance of resource access28. The configuration option max degree of parallelism determines the maximum number of threads used for parallel plan execution; if set to 0, the maximum number of available CPUs (as determined by the other settings) is used.

= Parallel queries need more memory; whether enough memory is available is checked by SQL Server prior to parallel execution.
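A sketch of these instance-level options in Transact-SQL; the values are examples:

    -- Limit parallel plans to 4 CPUs and raise the cost threshold:
    EXEC sp_configure 'max degree of parallelism', 4
    EXEC sp_configure 'cost threshold for parallelism', 10
    RECONFIGURE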
Distributing the Query Parts Creating a Parallel Query Execution Plan

When executing parallel queries, for each part of the query a separate OS thread is used. If the decision was made to execute a query in parallel, so-called exchange operators are inserted into the execution plan, which makes it a parallel execution plan. The exchange operators provide process management, data distribution and flow control. As of SQL Server 2000, multiple threads can be used in parallel to create an index. Also, SQL Server 2000 allows to perform database consistency checks (DBCCs) in parallel. These checks allow to verify the physical consistency of the database, i.e. to ensure that e.g. data structures are stored consistently on disk.

Parallel Index Creation

As of SQL Server 2000, multiple threads can be used in parallel to create an index.

Database Consistency Checks

SQL Server 2000 also allows database consistency checks (DBCCs) to be performed in parallel. These checks verify the physical consistency of the database, i.e. they ensure, for example, that data structures are stored consistently on disk.

²⁷ In Windows NT, fibres are lightweight threads. For R/3, SQL Server is normally configured to use threads instead of fibres.
²⁸ Setting the affinity mask is only useful for SMP machines with more than 4 CPUs.


11.5 Specific Features


Automatic Creation of Statistics

Useful features of SQL Server 7.0 are the auto create statistics and auto update statistics functions: the DBMS itself determines which statistics are needed and updates them automatically, without the intervention of a database administrator.

"Merry-go-round" Scan

SQL Server 2000 provides a feature that enables multiple queries performing the same table scan to re-use the data already read, so that pages are read only once. Of course, this works only if the query worker threads run at the same time; if a worker thread has paged out data that is afterwards needed by another worker thread, this data has to be read from disk again.

Appendix

A. Terminology in DB Systems
Explanation | Oracle | Informix | SQL Server | DB2 | SAP DB
Logical database unit that holds a collection of tables | tablespace | dbspace | filegroup | tablespace | [unavailable]
Physical database unit that provides storage for tables (e.g. file) | file²⁹ | file²⁹ | file²⁹ | container | devspace
Logfile recording the data changes done by transactions | redo log | logical logfile (logdbs) | transaction log | logfile | log
Storage that helps rollback a transaction | rollback segment | physical logfile (physdbs) | transaction log | logfile³⁰ | log³⁰
Storage that keeps database system catalog information | system tablespace | rootdbs | system tables in each database | system catalog | catalog
Storage that keeps temporary data | temporary tablespace | tempdbs | tempdb | tempspace (temp. tablespace) | [unavailable]³¹
Operating system process or thread that services a client request | server process | virtual processor | database server | agent | user kernel thread (UKT)³²

Table 4: Terminology in DB systems

²⁹ "File" is just the name for the physical space, not a file in the sense of a data container within a file system. Therefore, a file can be either a normal data file, or it can be a raw device.
³⁰ The log/logfile is also used for rollback.
³¹ Of course, temporary tables exist. However, there is no specific devspace to carry temporary data; they are distributed across the devspaces like any other data.
³² From SAP DB version 7.0 onwards, all OS platforms use OS threads to handle user requests. Previously, on UNIX platforms user requests were assigned to OS processes, which were then called user kernel processes (UKP).


B. SSA
Serial Storage Architecture

SSA (Serial Storage Architecture) is an approved ANSI standard for an interface to disk clusters and arrays, intended to work with high-end computers ranging from mainframes down to LAN servers. SSA allows full-duplex (data transmission in both directions) serial data transfer at rates of 80 MB/s for single ports, or 160 MB/s with dual porting. Data transfer is packet-multiplexed, i.e. data is transmitted in packets. SSA provides a loop architecture (see also section 3.2.2) and allows a maximum of 48 devices per loop, where each loop is controlled by one controller, or by one controller port if the controller offers several ports that can operate in parallel.

Fibre Channel Loop

SSA is mainly an IBM-only technology. However, several vendors are working to combine SSA and FC-AL technology. The combination of the two is called Fibre Channel Loop (FCL); it is intended to adopt most of the FC-AL architecture, with full backward compatibility to FC-AL, while supporting non-arbitration, which is the main contribution from SSA.

Arbitration vs. Non-Arbitration

An important difference between SSA on the one hand and SCSI and FC-AL on the other is arbitration: SSA is the only one of these technologies that provides non-arbitrated loops. To avoid the need for arbitration, every device is on a dedicated link with an adjacent device, and no permission is required to talk on a dedicated point-to-point link. If multiple devices are put together, each with its dedicated link, the result is a daisy chain in which multiple conversations can occur simultaneously (see Figure 23 for the example of a single-ported SSA controller). Data is transferred via relative addressing: when transferring data, the number of devices that have to pass on the data as conveyors is communicated. E.g. if data is targeted at the 3rd device relative to some device, the communication is initiated with the address 3, passed on by the next device as address 2, and so on until the data reaches the target device.

[Figure 23: SSA loop architecture: a single-ported SSA controller daisy-chained to several SSA devices over dedicated point-to-point links]
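The hop-count scheme can be illustrated with a small conceptual sketch (the frame format and the exact delivery convention are assumptions for illustration, not the SSA wire protocol): each device either consumes a frame addressed to it or forwards the frame with the address decremented.

```c
/* Conceptual sketch of SSA relative addressing in a daisy chain:
 * each device forwards a frame after decrementing its hop address;
 * the device that receives address 1 is the target (an assumed,
 * illustrative convention). */
#include <stdio.h>

#define DEVICES 4

static void receive(int device, int hops, const char *payload)
{
    if (hops == 1) {                       /* we are the addressed device */
        printf("device %d: consumed \"%s\"\n", device, payload);
    } else if (device + 1 < DEVICES) {     /* pass on, address decremented */
        printf("device %d: forwarding, address %d -> %d\n",
               device, hops, hops - 1);
        receive(device + 1, hops - 1, payload);
    }
}

int main(void)
{
    /* The controller targets the 3rd device relative to itself,
     * so it initiates the transfer with address 3. */
    receive(0, 3, "data block");
    return 0;
}
```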


Comparison: SSA vs. SCSI

In spite of SSA being an interesting technology, there are drawbacks as well. An SSA card with 80 MB/s throughput, for example, requires 4 DMA channels with 20 MB/s throughput each for data transmission to main memory. In comparison, 4 Ultra-SCSI controllers with 4 DMA channels (at 40 MB/s each) can provide an I/O throughput of 160 MB/s. Moreover, the 4 SCSI controllers are less expensive than the single SSA card.

Comparison: SSA vs. FC-AL

There is a market tendency to prefer serial over parallel data transfer technology. Most hardware vendors seem to be choosing FC-AL, and possibly FCL in the future, rather than SSA. However, it remains to be seen whether the technical challenges (e.g. the construction of SANs) can be met by the chosen technology.

C. References
[1] Optimizing SAP R/3 on Symmetrix: A Best Practices Guide [available in SAPNet, use search]
[2] IBM DB2 Universal Database Administration Guide, Version 5.2, Document Number S10J-8157-01
[3] Oracle online documentation for Oracle8 Enterprise Edition, release 8.0.5
[4] Stürner, Günther: Oracle 7. Die verteilte semantische Datenbank. dbms publishing, 1993
[5] IBM DB2 Universal Database Administration Guide: Design and Implementation, Version 6, Document Number SC09-2839-00
[6] IBM DB2 Universal Database Administration Guide: Performance, Version 6, Document Number SC09-2840-00
[7] Liebschutz, Harry D.: The Oracle Cookbook. For Design, Administration, and Implementation. MIS:Press, Inc., 1996
[8] R/3 Advanced Technology Group: Database Layout for R/3 Installations under Oracle [available in SAPNet, use alias ATG]
[9] R/3 Advanced Technology Group: Database Layout for SAP Installations with Informix [available in SAPNet, use alias ATG]


D. Introduction to Modeling Diagrams


This appendix gives an introduction to the modeling diagrams used by the Basis Modeling Team. It shows only those diagram types used in this document.


Block Diagram

[Figure: example block diagram illustrating the notation: human agents (Customer), agents (Order Processor, Availability Checker, Order Entry Agent, Persistency Controller), storage places (Warehouse, pending and delivered Orders, temp. data), communication channels with direction of request, write/read/modifying access, protocol boundary, and structure variance]
Block diagrams show the structure of a system, i.e. its components, partitioned into active and passive parts, and the connections between them at a certain point in time (for the exception, see below). The connections show the paths for the data flow in a system.

An agent is an active system component. An agent can be refined by any complex block diagram (i.e. it can contain agents and storage places). If there are multiple agents with similar features, these can be drawn on top of each other. Human beings that are part of the system are depicted as human agents.

A storage place is a passive system component which can store data. It can be refined by a number of sub-storage places or by any complex block diagram. Like agents, storage places can be drawn on top of each other.

If an agent can only write data to a storage place, it has write access. Similarly, if an agent can only read data from a storage place, it has read access. If an agent can both read and write, this is called modifying access.

A communication channel is a non-storing, passive component. It can temporarily hold data and events for the duration of one communication step. Agents can only communicate via communication channels (or using storage places). If the communication follows the pattern of one agent requesting something and the other agent responding to it, this is shown with an arrow indicating the direction of request. A protocol boundary shows the possible machine boundaries or protocols by which agents communicate with each other.

Entities (agents, storage places, communication channels) can be created or destroyed during the lifetime of a system. This is called structure variance. When structure variance is employed, the block diagram shows the structures of more than one point in time. The modified entities are like data for the agent responsible for the structure variance; therefore, the symbol indicating structure variance is similar to the symbol for a storage place.


Basics of UML Class Diagrams (E/R Diagram)


[Figure: example UML class diagram illustrating the notation: classes/entity types (Party, Contact, CustomerOrder, ProductType, LegalParty, Person, LineItem) with typed attributes, a named association "represents" with roles and multiplicities, an association class, composition, generalization/specialization, and the constraint {ProductType.amountOnStock > 0}]

Class diagrams show classes and their relations. This description covers the basic features of UML class diagrams, which can be used both for abstract class diagrams and for entity relationship (E/R) diagrams; the corresponding E/R concepts are given in parentheses.

A class describes the features of structurally and semantically related objects. A class has a name and attributes; attributes can be typed. (E/R diagrams contain entity types, which describe sets of structurally and semantically related entities.)

A specialization connects similar classes. The specialized class can have attributes in addition to those defined by the more general class; the triangle marks the more general class. Object-oriented languages often support the implementation of specialized classes with an inheritance mechanism.

Associations (relationship types) denote a (semantic) relation on the objects of the connected classes. Associations can be named; an arrow near the name can indicate the read direction (e.g. for a name like "is parent of"). The role an object plays in an association can be noted near the connection point of the association symbol and the class symbol.

Multiplicity specifies how many instances of one class can be linked in an association to one instance of the other class. Multiplicity specifications can consist of a number constant (e.g. 5), a number range (e.g. 3..5), or a comma-separated list of such specifications. A single * means any non-negative number; a * as the upper bound of a range means any number larger than the lower bound.

If the elements of an association have features of their own, or have associations with other objects, this is shown by an association class (relationship entity type). It is denoted by an association symbol with a connected class symbol.

A special kind of association is the composition, which establishes a whole-part relationship; the diamond marks the whole class. The part objects are existentially dependent on the corresponding whole objects, i.e. if the whole ceases to exist, the parts are also destroyed.

A constraint notes, formally or informally, a condition which must be met. The braces are mandatory; if they are left out, the symbol denotes just a comment.
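To make the notation concrete, here is one possible mapping of the example classes to C (a sketch only; the names LegalParty, Person, CustomerOrder, and LineItem come from the figure, while the fields and mapping choices are illustrative): specialization becomes struct embedding, and composition becomes ownership of the part objects.

```c
/* One possible C mapping of the example diagram (names and fields are
 * illustrative): specialization becomes struct embedding, composition
 * becomes ownership of the part objects. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {            /* general class */
    char name[32];
} LegalParty;

typedef struct {            /* specialization: a Person is a LegalParty */
    LegalParty base;        /* embedded general part */
    int birthYear;
} Person;

typedef struct {            /* part class of the composition */
    char productType[32];
    int quantity;
} LineItem;

typedef struct {            /* whole class: owns its LineItems */
    int orderNumber;
    int itemCount;          /* multiplicity: one order, many line items */
    LineItem *items;
} CustomerOrder;

static void destroyOrder(CustomerOrder *o)
{
    free(o->items);         /* composition: the parts die with the whole */
    free(o);
}

int main(void)
{
    Person customer = { .base = { "Smith" }, .birthYear = 1970 };

    CustomerOrder *o = malloc(sizeof *o);
    o->orderNumber = 42;
    o->itemCount = 1;
    o->items = malloc(sizeof(LineItem));
    o->items[0] = (LineItem){ .productType = "bolt M8", .quantity = 100 };

    printf("%s orders %d x %s (order %d)\n", customer.base.name,
           o->items[0].quantity, o->items[0].productType, o->orderNumber);
    destroyOrder(o);
    return 0;
}
```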

Index

/
/dev, 27

A
ABAP, 6; ABAP module, 8; ABAP processor, 8; access time, 19; AIO, 58; ALTER INDEX, 62; application server, 8; arbitrated mode, 16; arbitration, 15; archive mode, 63; archiving, 45; AS/400, 41; asynchronous I/O, 32, 58, 63, 69, 73; asynchronous processing, 14; auto create statistics, 75; auto update statistics, 75; auto-extension, 62; autogrowing, 72; AUTO-SAVE-LOG, 69; autoshrinking, 72

B
background work process, 8; backup file, 6; backup schedule, 32; backup strategy, 32, 46, 57, 63; backup writer, 6; balancing process, 70; basis consultants, 1; blobpage, 56; blobspace, 56; block, 18; block device, 27; block size, 61; block transfer time, 19; buffer pool, 48; bus controller, 11, 12; bus rate, 15; business scenario, 36

C
cable distance, 17; caching, 20; catalog partition, 43, 53; checkpoint, 39, 58, 69, 73; chunk, 55; circular logging, 45; command reordering, 14; conflict resolution, 16; container, 41, 44, 46; container size, 44; context switch, 31; contiguous space, 38; controller, 13; controller-based striping, 23; coordinator node, 47; CP/AFS, 1; CRC, 15; cycle stealing, 12; Cyclic Redundancy Check, 15; cylinder, 18

D
data block, 61; data buffer, 31; data contention, 36; data file, 31; data file extension, 38; data file shrinking, 38; data growth, 45; database, 41, 55; database cache, 5; database catalog, 56; database file, 5; database interface, 8; database layout, 2; database management system, see DBMS; database partition, 42, 46; database partition server, 43, 53; database server, 42, 46; database system catalog, 42, 50; data-shipping model, 47; DB cache, 5; DB file writer process, 33; DB server process, 5; DB service, 6; DB transaction, 6; DB writer, 6; DB_FILE_MULTIBLOCK_READ_COUNT, 63; DB2, 2, 41, 49; DB2 EE, 41; DB2 EEE, 42, 47; DB2 UDB, 41, 49; DB2 UDB Enterprise Edition, see DB2 EE; DB2 UDB Enterprise-Extended Edition, see DB2 EEE; DB2 Universal Database, see DB2 UDB; DB2 Universal Server, 2; DBMS, 2, 5, 31; DBMS architecture, 5; DBMS crash, 31; DBMS striping, 22; dbspace, 55; dbspace backup, 57; DBW0, 63; DBWn, 63; DBWR_IO_SLAVES, 63; detached index, 57; device, 27; device handling, 27; devspace, 67; dialog step, 7; dialog work process, 8

differential SCSI, 15; direct memory access, 12; dirty data, 20; dirty pages, 36; disk contention, 19, 35; disk controller, 11; disk defragmentation, 62; disk drive, 18; disk fragmentation, 38; disk pack, 18; disk partition, 27; DISK_ASYNCH_IO, 63; DMA, 11, 82; DMA controller, 11; DMS, 41, 46; DSA, 58; Dynamic Scalable Architecture, see DSA

E
ECC information, 25; EIDE, 13; elevator sorted write-back, 21; enqueue work process, 8; Error Correction Code, 25; exchange operator, 74; execution plan, 74; Extended Parallel Option, 59; extent, 55, 61, 72; extent allocation, 57; extent interleaving, 57

F
FC-AL, 2, 13, 16, 81; FCL, 81; fibre, 32; Fibre Channel, 13, 16; Fibre Channel-Arbitrated Loop, see FC-AL; file, 41, 61; file placing, 33; filegroup, 72; flow logic, 8; flushing, 20; flushing data, 36; fragmentation, 56, 59; freelist management, 72; function-shipping model, 47

H
hardware cache, 20; hardware mirroring, 37; hardware platform, 2; hardware striping, 22; host-based caching, 32; host-based striping, 23; host-level mirroring, 24; hot spot, 19, 35; hot swapping, 24; HP-UX, 2, 27; HVD SCSI, 15

I
I/O, 1, 2, 11, 39, 46, 63; I/O access, 31; I/O activity, 20; I/O cleaner, 47; I/O controller, 21; I/O load distribution, 45; I/O performance, 34; I/O processor, 12; I/O server, 46; I/O subsystem, 32; I/O throughput, 16; I/O tuning, 31; index, 33; indexes, 41; Informix, 2; initial extent, 38; interface, 12; inter-parallel query, 40; inter-partition parallelism, 47; interrupt-driven I/O, 12; intra-parallel query, 40; intra-partition parallelism, 47

K
kernel buffer, 32

L
latency, 18; light-weight process, 31; LIO, 58; local memory, 8; locality of reference, 21; LOCKRETAIN, 45; log, 6; log buffer, 6; log group, 31; log retention logging, 45; logdbs, 56; logfile, 31, 37, 39, 45, 50, 51, 68; logical extent, 27; logical log, 56; logical volume, 27; logical volume manager, see LVM; logon session, 6; loop, 16, 81; LRU_MAX_DIRTY, 58; LRU_MIN_DIRTY, 58; lvcreate, 28; LVD SCSI, 15; LVM, 28, 62

M
massively parallel processing, see MPP; master data, 45; max_async_io, 73; mean time between failure, see MTBF; media transfer rate, 19; mirror, 37; mirror device, 24; mirrored devspace, 67; mirroring, 23, 35, 56; MPP, 40, 64; MTBF, 24; MTS, 65; multiblock read, 38, 63; Multi-Threaded Server, see MTS

N
next extent size, 57; nodegroup, 42; NOLOGGING, 63, 64; non-arbitrated loop, 81

O
offline backup, 46; offset, 55; OLAP, 39; OLTP, 11; OPQ, 64; OPS, 61, 64; Oracle, 2; Oracle Parallel Query, 64; Oracle Parallel Server, see OPS; OS process, 32; OS striping, 22, 28; OS/390, 41; overflow record, 46; overlapped read operations, 25

P
packet-multiplexed data transfer, 81; page, 55; page cleaning, 57; page size, 44, 46; PARALLEL, 64; Parallel DBMS operations, 39; parallel index creation, 39, 64; parallel processing, 39; parallel query, 40, 59; parallel update of statistics, 39; parallelism, 47, 64, 74; parallelizing queries, 39; parity information, 24; parsing, 5; partitioning, 34, 65; partitioning key, 43; partitioning map, 43; PDQ, 59; physdbs, 56; physical extent, 27; physical log, 56; PIO, 58; point-to-point mode, 16; prefetching, 21, 38, 46, 57, 63; process scheduling, 31; programmed I/O, 12

R
R/3, 5; R/3 application profile, 39; R/3 basis system, 6; R/3 system, 6; RAID, 24, 68; RAID controller, 13; RAID system, 17; RAID-0, 24, 38; RAID-1, 25; RAID-2, 25; RAID-3, 25; RAID-4, 25; RAID-5, 25, 68; raw device, 27, 32, 41, 55, 61, 68; read/write head, 18, 37; read-ahead caching, 21; rebalancing, 44; REBUILD, 62; recovery, 6, 31; redo logfile, 63; Redundant Array of Inexpensive Disks, see RAID; relative addressing, 81; remote function call, see RFC; RFC, 8; RFC session, 9; roaming hot spot, 34; rollback, 31, 62; rollback information, 31, 33, 38; root dbspace, 55, 57; rootdbs, 56; rotational delay, 18; rotations per minute, 18; round-robin, 22, 46

S
sample database layout, 34; SAN, 17; SAP Business Information Warehouse, 1, see SAP BW; SAP BW, 40, 43; SAP DB, 2; SAP GUI, 6, 7; SAP partners, 1; SAP Retail, 1; savepoint, 69; scatter-gather I/O, 73; screen, 8; screen processor, 8; SCSI, 13, 81; SCSI controller, 13; SCSI controller card, 12; SCSI device, 13; SCSI id, 15; SCSI technology, 2; SCSI types, 15; secondary logfile, 45; sector, 18; seek time, 18; segment, 61; Serial Storage Architecture, see SSA; shadow process, 65; shared memory, 8; single-ended SCSI, 15; SMP, 40, 58, 59, 64; SMS, 41, 46; soft checkpoint, 46; software cache, 20; sort, 38; sorting, 62, 68; spindle, 18; splitting a query, 40; spool work process, 8; SQL request, 5; SQL-Server, 2; SSA, 16, 81; statistics, 32, 64; Storage Area Network, see SAN; stored procedure, 5; stripe, 21; stripe set, 21; stripe size, 22, 28, 35; striping, 21, 34, 35, 62, 63, 65, 67; striping policy, 23; sustained transfer rate, 19

swap space, 34; symmetric multi processing, see SMP; system crash, 39; system load, 21

T
table, 41, 55; table buffer, 6; table reorganization, 62, 70; table scan, 32; tablespace, 28, 38, 41, 61; tablespace backup, 46; tablespace growth, 62; tablespace recovery, 46; taskhandler, 8; tblspace, 56; tempdb, 73; tempdbs, 56; temporary space management, 38; temporary tablespace, 46, 62; thread, 32; throughput, 15; track, 18; transaction commit, 31; transaction data, 45; transaction log, 42; transaction log writer, 6; transaction log writer process, 33; transaction throughput, 37; two-phase commit protocol, 64; typical disk access time, 19

U
Ultra3 SCSI, 15; update work process, 8; user context, 9

V
virtual processor, 58; volume group, 27

W
work process, 8; write-back cache, 20, 32; write-through cache, 20

X
XOR, 24
