
Database Concepts

What is a database?
A database system is basically a computer-based record keeping system, or the collection of data usually referred to as the database. A database contains information about one particular enterprise. A database may also be defined as a collection of interrelated data stored together to serve multiple applications. The data is stored so that it is independent of the programs which use it. The intention (aim) of a database is that the same collection of data should serve as many applications as possible.

Purpose of database: The database should be a repository (stock) of the data needed for an organization's data processing. Data should be accurate, private and protected from damage. It should be organized so that diverse applications with different data requirements can employ the data. The ways in which end users want to utilize data will constantly change, and in some cases demands for new uses of the data will arise rapidly and urgently. The extent (scope) to which these demands can be satisfied determines the overall value of the database system. A database management system provides centralized control of the data, so it is the answer to problems such as data redundancy (duplication of data), data inconsistency, unshareable data, and insecure, incorrect and unstandardized data.

Advantages of a database system:
1. Reduced data redundancy: Duplication of data is known as data redundancy. Non-database systems maintain a separate copy of data for each application. For example, in a college, student records are maintained and the hostel also maintains the same records of all those students who live in the hostel; thus one record gets stored in two files, which leads to duplicated or incorrect data. One problem of redundancy is unnecessary wastage of storage space. In a database system all the data is stored centrally at one place and all the applications that require data refer to the database. Now if any change is to be made, it is made at just one place and the same changed information is available to all the applications referring to it. Thus redundancy is controlled.
2. Control of inconsistency of data: When redundancy is not controlled, there may be occasions on which two entries about the same data do not agree (that is, one of them stores the updated information and the other does not). At such a point the database is said to be inconsistent (incompatible), and this inconsistent data will provide incorrect or conflicting information. By controlling redundancy, inconsistency can be controlled.
3. Database facilitates sharing of data: Sharing of data means that individual pieces of data in the database may be shared among several different users, in the sense that each of those users may have access to the same piece of data and each of them may use it for different purposes.

4. Database enforces standards: All the standards of the organization can be applied to the centrally stored data in the database. There may also be certain industry standards that must be satisfied by the data.
5. Database can ensure data security: The information stored inside a database must be kept secure and private. Data security refers to protection of data against accidental or intentional disclosure to unauthorized persons, or unauthorized modification or destruction of data. Privacy of data refers to the rights of individuals and organizations to determine for themselves when, how and to what extent information about them is to be transmitted to others.
6. Integrity can be maintained through database: By an integrated database we mean the unification of several otherwise distinct data files, with any redundancy among these files partially or wholly eliminated. The database management system defines certain integrity checks to ensure that data values conform to certain specified rules. For example, a date cannot be 25/25/06; it is an invalid date. Therefore a database management system defines many integrity checks to verify that values lie within certain ranges. A small sketch of such checks in SQL is given below.
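To make the idea of integrity checks concrete, here is a minimal SQL sketch. The table and column names are hypothetical and the exact syntax varies slightly between database products; the point is only that the DBMS itself rejects out-of-range values:

-- Hypothetical table; the constraints are enforced by the DBMS, not by application code.
CREATE TABLE student (
    roll_no   INTEGER PRIMARY KEY,
    name      VARCHAR(50) NOT NULL,
    marks     INTEGER CHECK (marks BETWEEN 0 AND 100),  -- value must lie in a range
    birth_dt  DATE                                      -- an invalid date such as 25/25/06 is rejected
);

-- This insert fails the integrity check because 250 is outside the 0-100 range.
INSERT INTO student (roll_no, name, marks, birth_dt)
VALUES (1, 'Amit', 250, DATE '2006-01-25');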

Role of the Database Administrator: There are three types of users for a DBMS. They are:
1. The END USER who uses the application. This is the user who actually puts the data in the system to use in business. This user need not know anything about the organization of data at the physical or logical level. He needs access to and knowledge of only the data he is using.
2. The APPLICATION PROGRAMMER who develops the application programs. He has more knowledge about the data and its structure since he has to manipulate the data using his programs. He also need not have access to or knowledge of the complete data in the system.
3. The DATABASE ADMINISTRATOR (DBA) who is like the super user of the system. The role of the DBA is very important and is defined by the following functions:
I. Defining the schema: The DBA defines the schema, which contains the structure of the data in the application. The DBA determines what data needs to be present in the system and how this data has to be represented and organized.
II. Interaction with users: The DBA needs to interact continuously with the users to understand the data in the system and its use.
III. Defining security and integrity checks: The DBA finds out about the access restrictions to be defined and defines security checks accordingly. Data integrity checks are also defined by the DBA.
IV. Defining backup and recovery procedures: The DBA also defines procedures for backup and recovery. Defining backup procedures includes specifying what data is to be backed up, the periodicity (time period) of taking backups, and also the medium and storage place for the backup data.
V. Monitoring performance: The DBA has to continuously monitor the performance of the queries and take measures to optimize all the queries in the application.

Basic Concepts of Database


Data Repository: All data in the database reside in a data repository. This is the data storage unit where physical data files are kept. The data repository contains the physical data. Mostly, it is a central place of storage for the data content.

Data Dictionary: The data repository contains the actual data. Let us say that you want to keep data about the customers of your company in your database. The structure of the customer data could include fields such as customer name, customer address, city, state, zip code, phone number, and so on. Data about a particular customer could be as follows in the respective fields: Jane Smith / 1234 Main Street / Piscataway / NJ / 08820. There are two aspects of the data about customers. One aspect is the structure of the data, consisting of the field names, field sizes, data types, and so on. This part is the structure of the data for customers. The other part is the actual data for each customer, consisting of the actual data values in the various fields. The first part, relating to the structure, resides separately in storage, and this is called the data dictionary or data catalogue. A data dictionary contains the structures of the various data elements in the database. It also contains the relationships among data elements. The other part, relating to the actual data about individual customers, resides in the data repository. The data dictionary and the data repository work together to provide information to users.

Database Software: Are Oracle and Informix databases? Oracle and Informix are really the software that manages data. These are database software, or database management systems. Database software supports the storing, retrieving, and updating of data in a database. Database software is not the database itself. The software helps you store, manage, and protect the data in a database.
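Returning to the customer example above, a minimal SQL sketch (hypothetical names; syntax may vary between products) separates the two aspects: the CREATE TABLE statement describes the structure that is recorded in the data dictionary, while the INSERT supplies the actual values that end up in the data repository.

-- Structure of the customer data: recorded in the data dictionary / catalogue.
CREATE TABLE customer (
    cust_name    VARCHAR(50),
    cust_address VARCHAR(80),
    city         VARCHAR(30),
    state        CHAR(2),
    zip_code     CHAR(5),
    phone        VARCHAR(15)
);

-- Actual data values: stored in the data repository.
INSERT INTO customer
VALUES ('Jane Smith', '1234 Main Street', 'Piscataway', 'NJ', '08820', NULL);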

Database abstraction: Abstraction is the process of providing the users only as much
information as is required by them. That means the system does not disclose all the details of the data; rather it hides certain details of how the data is stored and maintained. So abstraction is the process by which we show some information about an entity to the user while the unnecessary details are not disclosed to him. A good database system organizes its data structures in such a way that every type of database user (end users, user applications, system analysts and physical storage analysts) is able to access the desired information easily and efficiently.

Data Access:

The database approach includes the fundamental operations that can be applied to data. Every database management system provides for the following basic operations:
READ data contained in the database
ADD data to the database
UPDATE individual parts of the data in the database
DELETE portions of the data in the database

Database practitioners refer to these operations by the acronym CRUD:
C - Create or add data
R - Read data
U - Update data
D - Delete data
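A minimal SQL sketch of the four CRUD operations on a hypothetical student table (names are illustrative only; syntax may vary slightly between products):

INSERT INTO student (roll_no, name) VALUES (1, 'Amit');   -- Create / add
SELECT roll_no, name FROM student;                        -- Read
UPDATE student SET name = 'Rohit' WHERE roll_no = 1;      -- Update
DELETE FROM student WHERE roll_no = 1;                    -- Delete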

Transaction Support: When a transaction is initiated it should complete all its tasks and leave the data in the database in a consistent state. That is, if the initial stock is 1000 units and the order is for 25 units, the stock value stored in the database after the transaction is completed must be 975 units. How can this be a problem? See what can happen in the execution of the transaction. First, the transaction may not be able to perform all its tasks because of some malfunction preventing its completion. Second, numerous transactions from different order entry clerks may be simultaneously looking for inventory of the same product. Database technology enables a transaction either to complete its task in its entirety or to back out intermediate data updates in case a malfunction prevents completion.

Various levels of database implementation: A database is implemented through three general levels: internal, conceptual and external. Most systems are designed around a 3-level/3-tier architecture. Under this scheme the database is assumed to be made up of 3 layers or levels and each level is developed accordingly. Each level implements abstraction in some manner.
1. Internal level (physical level): This is the lowest level of abstraction. It describes how the data are actually stored on the storage medium. At this level complex low-level data structures are described in detail. It is closest to physical storage and so is termed the physical level.
2. Conceptual level: This level of abstraction describes what data are actually stored in the database; it also describes the relationships existing among the data. The users of this level are not concerned with how these logical data structures will be implemented at the physical level. Rather they are concerned only with what information is to be kept in the database.
3. External level (view level): This level is closest to the users and is concerned with the way in which the data are viewed by individual users. Most users of the database are not concerned with all the information contained in the database. Instead they need only the part of the database relevant to them. For example, an account holder is interested only in his account details, not in the rest of the information stored in the database. To simplify such users' interaction with the system, this level of abstraction is defined; a sketch of such a view is shown after this list.
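At the external level, each user or application can be given its own view of the database. A minimal SQL sketch (the account table, its columns and the account holder's name are hypothetical; syntax may vary):

-- An account holder's external view: only his own account details,
-- not the rest of the database.
CREATE VIEW my_account AS
SELECT acct_no, holder_name, balance
FROM   account
WHERE  holder_name = 'Jane Smith';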

Database schema and instance: A schema is an outline or a plan that describes the records and relationships existing in a view. The overall set of such relationships for the entire database is known as the database schema; it includes all tables, their constraints and relationships. In other words, the overall design of the database is called the database schema. By contrast, the collection of information stored in the database at a particular moment is called an instance of the database. There are three types of schema:
1. Physical schema
2. Conceptual schema
3. External schema

1. Physical Schema: The physical schema describes the database design at the physical level. It specifies additional storage details. The physical schema summarizes how the relations described in the conceptual schema are actually stored on secondary storage devices such as disks and tapes.

2. Conceptual Schema: The conceptual schema is also known as the logical schema. It describes the stored data in terms of the data model of the DBMS. In a relational DBMS, the conceptual schema describes all relations that are stored in the database.

3. External Schema: External schemas are also known as view schemas or subschemas. This schema allows data access to be customized at the level of individual users or groups of users.

Concept of data independence: The ability to modify a schema definition at one level without affecting the schema definition at the next higher level is called data independence. There are two levels of data independence: physical and logical.

Physical data independence: It refers to the ability to modify the schema followed at the physical level without affecting the schema followed at the conceptual level. That is, the application programs remain the same even though the schema at the physical level gets modified. Modifications at the physical level are occasionally necessary in order to improve the performance of the system. Example: A change to the internal schema, such as using a different file organization, storage structure, storage device, or indexing strategy, should be possible without having to change the conceptual or external schemas.

Logical data independence: It refers to the ability to modify the conceptual schema without causing any changes in the schemas followed at the view (external) level. Logical data independence ensures that the application programs remain the same. Modifications at the conceptual level are necessary whenever the logical structure of the database gets altered because of some unavoidable reason. It is more difficult to achieve logical data independence than physical data independence, the reason being that application programs are heavily dependent on the logical structure of the database. Example: Addition or removal of entities, attributes, or relationships in the conceptual schema should be possible without having to change existing external schemas or having to rewrite existing application programs.

Physical Storage structure of Database


Introduction: The DBMS views (considers) the database as a collection of records. The file manager of the underlying operating system views the database as a set of pages, and the disk manager views the database as a collection of physical locations on the disk. When the DBMS makes a request for a specific record to the file manager, the file manager maps the record to a page and requests that specific page from the disk manager. The disk manager determines the physical location on the disk and retrieves the required page.

Database access steps:
DBMS  <-->  File Manager   (request stored record / stored record returned)
File Manager  <-->  Disk Manager   (request stored page / stored page returned)

Clustering: The method of storing logically related records physically together is called clustering. This technique improves the performance of the database. For example, consider a query retrieving customers with consecutive cust_id values. If clustering is based on cust_id, it will help in improving the performance of such queries. Assume that the customer record size is 128 bytes and the typical size of a page retrieved by the file manager is 1 KB (1024 bytes). If there is no clustering, the records will be stored at random physical locations. In the worst case, each record may be placed in a different page. Hence a query to retrieve 100 records with consecutive cust_id values (from 101 to 200) will require 100 pages to be accessed, which in turn translates to 100 disk accesses. But if the records are clustered, a page can contain 8 records. Hence the number of pages to be accessed for retrieving the 100 consecutive records will be only 13, so only 13 disk accesses will be required to obtain the query results.

Indexing: Indexing is another common method for making retrievals faster. If we use a sequential search on the CUSTOMER table to retrieve all records with the value Delhi in the City column, the time taken for this operation depends on the number of pages to be accessed. If the records are randomly stored, the page accesses depend on the volume of data. If the records are stored physically together, the number of pages depends on the size of each record.

Index file (each entry holds a City value and a pointer to the corresponding CUSTOMER record):
City: Bombay, Bombay, Calcutta, Delhi, Madras, Madras, ...

CUSTOMER table:
CustNo  Name       City
001     Shah       Bombay
002     Shrinivas  Madras
003     Gupta      Delhi
004     Banerjee   Calcutta
005     Apte       Bombay
006     Kumar      Madras
...
Above, a new index file is created. The number of records in the index file is the same as that of the data file. The index file has two fields in each record: one field contains the value of the City field and the second contains a pointer to the actual data record in the CUSTOMER table. Whenever a query based on the City field occurs, a search is carried out on the index file. This search is much faster than a sequential search in the CUSTOMER table because of the much smaller size of each index record, due to which each page can hold a larger number of index entries. Thus the access involves a sequential access on the index file and a direct access on the actual data file.

Hashing: Hashing is another method to make retrieval faster. This method provides direct access to a record on the basis of the value of a specific field called the hash field. When a new record is inserted, it is physically stored at an address which is computed by applying a mathematical function (the hash function) to the value of the hash field. When a record is to be retrieved, the same hash function is used to compute the address where the record is stored. Retrieval is faster since direct access is provided and there is no search involved in the process.

Collision: It is possible that two records hash to the same address. Such a situation is called a collision. Collisions can be handled by:

Linear search: While inserting a new record, if it is found that the location at the hash address is already occupied by a previously inserted record, search for the next free location available on the disk and store the new record at this location. A pointer from the record at the original hash address to the new record is also stored. During retrieval, the hash address is computed to locate the record. If the record is not found at the hash address, the pointer from the record at that address is followed to locate the required record.

Collision chain: In this technique the hash address location contains the head of a list of pointers linking together all records which hash to that address. In this method an overflow area needs to be used if the number of records mapping onto the same hash address exceeds the number of locations linked to it.
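Relating the earlier indexing discussion back to SQL: in practice an index such as the one sketched above is requested declaratively, and the DBMS maintains the index structure and pointers itself. A minimal sketch (the table, column and index names are hypothetical):

-- Ask the DBMS to maintain an index on the City column of CUSTOMER;
-- queries that filter on City can then use the index instead of a full scan.
CREATE INDEX idx_customer_city ON customer (city);

SELECT custno, name FROM customer WHERE city = 'Delhi';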

Data models
The three data models that are used for database management are:
(i) Relational data model
(ii) Hierarchical data model
(iii) Network data model

The Network data model: In this model the data is represented by collections of records, and relationships among data are represented by links. In a network database the collections of records are connected to one another by means of links. A record is a collection of fields (attributes), each of which contains only one data value. In the network data model, while mapping to files, links are implemented by adding pointer fields to records that are associated via a link. Each record must have one pointer field for each link with which it is associated. The operations on a network database are performed through a data manipulation language for the network model. The operations that can be performed on a network database include find, insert, delete, modify, etc. Inserting or removing records involves connect, disconnect, and reconnect operations.

Hierarchical Data model: It is very similar to the network model except that in this model records are organized as trees. This model represents relationships among its records through parent-child relationships that can easily be represented through tree-like structures. A hierarchical database is a collection of records connected to one another through links. The record type at the top of the tree is called the root. The root can have any number of dependents, and each of these dependents can have any number of lower-level dependents, and so on. The operations on a hierarchical database are performed through a data manipulation language of the hierarchical data model. The operations that can be performed on a hierarchical database include retrieval, insertion, deletion and modification of records.

Relational data model: In this model data is organized into tables (i.e. rows and columns). These tables are called relations. A row in a table represents a relationship among a set of values; since a table is a collection of such relationships, it is generally referred to as a relation. Rows of relations are generally referred to as tuples and the columns are usually referred to as attributes.

The relational model is based on a collection of tables (relations). The user of a (relational) database system may query these tables, insert new tuples, delete tuples, and modify tuples. There are several languages for expressing these operations.

Properties of the relational model:
1. In any column of a table, all items are of the same kind, whereas items in different columns may not be of the same kind.
2. For a row, each column must have an atomic value, and a column cannot have more than one value for a row.
3. All rows of a relation are distinct. That is, a relation does not contain two rows which are identical in every column; each row of a relation can be uniquely identified by its contents.
4. The ordering of rows within a relation is immaterial. That is, we cannot retrieve anything by saying that row number 5, column Name is to be accessed. There is no order maintained for rows inside a relation.
5. The columns of a relation are assigned distinct names and the ordering of these columns is immaterial.

Properties of Relations:
1. No duplicate tuples: A relation cannot contain two or more tuples which have the same values for all the attributes, i.e. in any relation every row is unique.
2. Tuples are unordered: The order of rows in a relation is immaterial.
3. Attributes are unordered: The order of columns in a relation is immaterial.
4. Attribute values are atomic: Each tuple contains exactly one value for each attribute.

Tuple: The rows of tables (relations) are generally referred to as tuples.


Attributes: The columns of tables (relations) are generally referred to as attributes.

Degree: The number of attributes in a relation determines the degree of the relation. A relation having 3 attributes is said to be a relation of degree 3.

Cardinality: The number of tuples (rows) in a relation is called the cardinality of the relation.

Domain: A domain is the data type and size of an attribute; or, a domain is a pool of values from which the actual values appearing in a given column are drawn. A domain is said to be atomic if elements of the domain are considered to be indivisible units. For example, the set of integers is an atomic domain, but the set of all sets of integers is a non-atomic domain.

Key: It is a set of one or more columns whose combined values are unique among all
occurrences in a given table. A key is the relational means of specifying uniqueness.

Candidate key: An attribute (column), or a set of attributes, of a table that could be selected as the primary key is called a candidate key. It must satisfy the following conditions:
1. The attribute or the set of attributes uniquely identifies each tuple in the relation.
2. If the key is a set of attributes, then no subset of these attributes has the uniqueness property.

Primary key: It is an attribute (column), or set of attributes, that uniquely identifies each tuple in a relation.

Foreign Key: An attribute in a relation R1 which indicates the relationship of R1 with another relation R2. The foreign key attribute in R1 must contain only values that match values of the referenced key in R2.
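A minimal SQL sketch of a primary key and a foreign key (table and column names are hypothetical; syntax may vary slightly between products):

CREATE TABLE dept (
    deptno    INTEGER PRIMARY KEY,             -- primary key of the referenced relation (R2)
    deptname  VARCHAR(30)
);

CREATE TABLE employee (
    empno    INTEGER PRIMARY KEY,              -- primary key of this relation (R1)
    empname  VARCHAR(50),
    deptno   INTEGER REFERENCES dept(deptno)   -- foreign key: values must match dept.deptno
);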

Super key: A super key is a column or set of columns that uniquely identifies a row within a table. A primary key is also a super key.

Alternate key: An alternate key is any candidate key which is not selected to be the primary
key.

Compound Key: A compound key is a key that consists of 2 or more columns. It is also known as a composite key or concatenated key.

Secondary key: A secondary key is a data field that is used for data searches and retrieval; that is, a secondary key is a column which is used for searching.

Example: select * from student where rollno=1;


In this query rollno is the secondary key.

Relational integrity: We know that the relational model has three main components: data structure, data integrity and data manipulation. The aim of data integrity is to specify rules that implicitly or explicitly define a consistent database state. The integrity of an RDBMS is based on certain rules proposed by E. F. Codd and a few constraints which were also proposed by Codd.

Integrity Rules: The following are the integrity rules to be satisfied by any relation.
1. No component of the primary key can be null.
2. The database must not contain any unmatched foreign key values. This is called the referential integrity rule.

Relational data manipulation (Relational Algebra): Relational algebra is a basic set of operations for manipulating relational data. These operations enable the user to perform basic retrieval operations. The result of a retrieval operation on a table is another relation. Thus the relational algebraic operations produce new relations, which can be further manipulated using other relational algebraic operations. For example, a select operation on a relation produces another relation, and we can perform another select operation on that result too. Relational algebra is a procedural query language. It specifies the operations to be performed on existing relations to derive result relations. These operations can be divided into two parts:
1. Basic set-oriented operations (e.g. Union, Intersection, Set-difference, Cartesian product).
2. Relation-oriented operations (e.g. Select, Project, Rename, Join, Division).

Set theoretic operations: consider the two relations R and S below.

R:
First    Last     Age
Bill     Smith    22
Sally    Green    28
Mary     Keen     23
Tony     Jones    32

S:
First    Last     Age
Forrest  Gump     36
Sally    Green    28
DonJuan  Demarco  27

Union (R ∪ S): This operation results in a relation with the tuples from R and S, with duplicates removed.
First    Last     Age
Bill     Smith    22
Sally    Green    28
Mary     Keen     23
Tony     Jones    32
Forrest  Gump     36
DonJuan  Demarco  27

Difference (R − S): This operation results in a relation with the tuples that appear in R but not in S.
First  Last   Age
Bill   Smith  22
Mary   Keen   23
Tony   Jones  32

Intersection (R ∩ S): This operation results in a relation with the tuples that appear in both R and S.
First  Last   Age
Sally  Green  28

Cartesian product (R × S): This operation produces all combinations of tuples from two relations.
R:
First  Last   Age
Bill   Smith  22
Mary   Keen   23
Tony   Jones  32

S:
Dinner   Dessert
Steak    Ice cream
Lobster  Cheesecake

R × S:
First  Last   Age  Dinner   Dessert
Bill   Smith  22   Steak    Ice cream
Bill   Smith  22   Lobster  Cheesecake
Mary   Keen   23   Steak    Ice cream
Mary   Keen   23   Lobster  Cheesecake
Tony   Jones  32   Steak    Ice cream
Tony   Jones  32   Lobster  Cheesecake

Union Compatible Relations: Two relations R and S are union compatible if and only if they have the same degree and the domains of the corresponding attributes are the same. Union, intersection and difference operators may only be applied to union compatible relations. Union and intersection are commutative operations, i.e.
R ∪ S = S ∪ R
R ∩ S = S ∩ R
The difference operation is not commutative: R − S is not equal to S − R. The resulting relations may not have meaningful names for the attributes; the convention is to use the attribute names from the first relation.

Exercise: consider the relation T below.
T:
First    Last      Score
William  Smith     44
Sally    Green     28
Mary     Kontrary  27

Compute R ∪ T
Compute R ∩ T
Show that R − T is not equal to T − R

Selection Operator
1. Selection and projection are unary operators.
2. The selection operator is sigma (σ).
3. The selection operation acts like a filter on a relation by returning only a certain number of tuples.
4. The resulting relation will have the same degree as the original relation.
5. The resulting relation will have at most as many tuples as the original relation.
6. The tuples to be returned depend on a condition that is part of the selection operator.
7. This operation returns only the rows that satisfy the condition.
8. A condition can be made by a combination of comparison or logical operators that operate on the attributes of the relation.
9. Use the truth table for logical operations.
Example:

EMP:
Name   Office  Dept  Rank
Smith  400     CS    Assistant
Jones  220     Econ  Adjunct
Green  160     Econ  Assistant
Brown  420     CS    Associate
Smith  500     Fin   Associate

1. Select only those employees in the CS department: σ Dept='CS' (EMP)
Name   Office  Dept  Rank
Smith  400     CS    Assistant
Brown  420     CS    Associate

2. Select only those employees with last name Smith who are assistant professors: σ Name='Smith' AND Rank='Assistant' (EMP)
Name   Office  Dept  Rank
Smith  400     CS    Assistant

Project Operator: Projection is also a unary operator. The projection operator is pi (π). Projection limits the attributes that will be returned from the original relation. The general syntax is π attributes (R), where attributes is the list of attributes to be displayed and R is the relation. The resulting relation will have the same number of tuples as the original relation. The degree of the resulting relation may be equal to or less than that of the original relation.

Example: Project only the names and departments of the employees: π name, dept (EMP)
Name   Dept
Smith  CS
Jones  Econ
Green  Econ
Brown  CS
Smith  Fin

Combining selection and projection: The selection and projection operators can be combined to perform both operations.

Show the names of all employees working in the CS department: π name (σ Dept='CS' (EMP))
Result:
Name
Smith
Brown

Show the name and rank of those employees who are not in the CS department and are not adjuncts: π name, rank (σ NOT (Rank='Adjunct' OR Dept='CS') (EMP))
Result:
Name   Rank
Green  Assistant
Smith  Associate
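For comparison, hedged SQL equivalents of the last two relational algebra expressions, assuming a table named emp with these columns (note that rank is a reserved word in some SQL dialects and may need quoting):

-- π name (σ Dept='CS' (EMP))
SELECT name FROM emp WHERE dept = 'CS';

-- π name, rank (σ NOT (Rank='Adjunct' OR Dept='CS') (EMP))
SELECT name, rank FROM emp WHERE NOT (rank = 'Adjunct' OR dept = 'CS');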

Aggregate functions: We can apply aggregate functions to attributes and tuples. Aggregate functions include: SUM, MINIMUM, MAXIMUM, AVERAGE, MEAN, MEDIAN and COUNT.
Example (EMP relation with salaries):
Name   Office  Dept  Salary
Smith  400     CS    45000
Jones  220     Econ  35000
Green  160     Econ  50000
Brown  420     CS    65000
Smith  500     Fin   60000

1. Find the minimum salary: MIN(salary) (EMP)
Result:
MIN(salary)
35000

2. Find the average salary: AVG(salary) (EMP)
Result:
AVG(salary)
51000

3. Find the total payroll for the Economics department: SUM(salary) (σ Dept='Econ' (EMP))
Result:
SUM(salary)
85000
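The same three aggregate queries, written as a hedged SQL sketch against a hypothetical emp table:

SELECT MIN(salary) FROM emp;                      -- minimum salary: 35000
SELECT AVG(salary) FROM emp;                      -- average salary: 51000
SELECT SUM(salary) FROM emp WHERE dept = 'Econ';  -- Economics payroll: 85000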

Join Operation: The join operation brings together two relations and combines their attributes and tuples in a specific fashion. It takes the join attributes from the two relations that are to be joined. The join condition can use =, > or <; when the join condition operator is =, the join is called an equijoin. Attributes in common are repeated in the join operation.

DEPART relation:
Dept  Main office  Phone
CS    404          555-1212
Econ  200          555-1234
Fin   501          555-4321
Hist  100          555-9876

EMP relation:
Name   Office  Dept  Salary
Smith  400     CS    45000
Jones  220     Econ  35000
Green  160     Econ  50000
Brown  420     CS    65000
Smith  500     Fin   60000

Result of joining EMP and DEPART on Dept (equijoin):
Name   Office  EMP.Dept  Salary  Depart.Dept  Main office  Phone
Smith  400     CS        45000   CS           404          555-1212
Jones  220     Econ      35000   Econ         200          555-1234
Green  160     Econ      50000   Econ         200          555-1234
Brown  420     CS        65000   CS           404          555-1212
Smith  500     Fin       60000   Fin          501          555-4321
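A hedged SQL sketch of the same equijoin (hypothetical table names emp and depart; column names follow the tables above):

SELECT e.name, e.office, e.dept, e.salary, d.main_office, d.phone
FROM   emp e
JOIN   depart d ON e.dept = d.dept;   -- equijoin: the join condition uses =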

Natural Join: The natural join operation removes the duplicate attributes. The natural join operator is *. The natural join of EMP and DEPART is:
Name   Office  Dept  Salary  Main office  Phone
Smith  400     CS    45000   404          555-1212
Jones  220     Econ  35000   200          555-1234
Green  160     Econ  50000   200          555-1234
Brown  420     CS    65000   404          555-1212
Smith  500     Fin   60000   501          555-4321

Outer Join: In the join operations above, only those tuples from both relations that satisfy the join condition are included in the output relation. The outer join includes other tuples as well, according to a few rules. There are three types of outer join:
1. Left outer join includes all tuples in the left-hand relation and only the matching tuples from the right-hand relation.
2. Right outer join includes all tuples in the right-hand relation and only the matching tuples from the left-hand relation.
3. Full outer join includes all tuples from the left-hand relation and from the right-hand relation.

Example: Assume we have two relations, PEOPLE and MENU.
PEOPLE:
Name   Age  Food
Alice  21   Hamburger
Bill   24   Pizza
Carl   23   Beer
Dina   19   Shrimp

MENU:
Food       Day
Pizza      Monday
Hamburger  Tuesday
Chicken    Wednesday
Pasta      Thursday
Tacos      Friday

PEOPLE (people.food = menu.food) MENU (left outer join):
Name   Age  People.Food  Menu.Food  Day
Alice  21   Hamburger    Hamburger  Tuesday
Bill   24   Pizza        Pizza      Monday
Carl   23   Beer         Null       Null
Dina   19   Shrimp       Null       Null

PEOPLE (people.food = menu.food) MENU (right outer join):
Name   Age   People.Food  Menu.Food  Day
Bill   24    Pizza        Pizza      Monday
Alice  21    Hamburger    Hamburger  Tuesday
Null   Null  Null         Chicken    Wednesday
Null   Null  Null         Pasta      Thursday
Null   Null  Null         Tacos      Friday

PEOPLE (people.food = menu.food) MENU (full outer join):
Name   Age   People.Food  Menu.Food  Day
Alice  21    Hamburger    Hamburger  Tuesday
Bill   24    Pizza        Pizza      Monday
Carl   23    Beer         Null       Null
Dina   19    Shrimp       Null       Null
Null   Null  Null         Chicken    Wednesday
Null   Null  Null         Pasta      Thursday
Null   Null  Null         Tacos      Friday
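A hedged SQL sketch of the left outer join above (hypothetical table names people and menu; some products spell the outer joins slightly differently):

SELECT p.name, p.age, p.food AS people_food, m.food AS menu_food, m.day
FROM   people p
LEFT OUTER JOIN menu m ON p.food = m.food;   -- keeps Carl and Dina, with NULLs on the MENU side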

Outer union: The outer union operation is applied to partially union compatible relations.
Example: the outer union of PEOPLE and MENU:
Name   Age   Food       Day
Alice  21    Hamburger  Null
Bill   24    Pizza      Null
Carl   23    Beer       Null
Dina   19    Shrimp     Null
Null   Null  Pizza      Monday
Null   Null  Hamburger  Tuesday
Null   Null  Chicken    Wednesday
Null   Null  Pasta      Thursday
Null   Null  Tacos      Friday

Tuple Relational Calculus: The tuple relational calculus is a non-procedural language, while the relational algebra is a procedural language. We must provide a formal description of the information desired. A query in the tuple relational calculus is expressed as
{ t | P(t) }
i.e. the set of tuples t for which the predicate P is true. We also use the notation t[a] to indicate the value of tuple t on attribute a, and t ∈ r to show that tuple t is in relation r.

Domain Relational Calculus: Like the tuple relational calculus, the domain relational calculus is also a relational calculus. In the domain calculus the variables range over single values from domains of attributes rather than ranging over tuples. To form a relation of degree n for a query result, we must have n of these domain variables, one for each attribute. An expression of the domain calculus is of the following form:
{ <x1, x2, ..., xn> | COND(x1, x2, ..., xn, xn+1, xn+2, ..., xn+m) }
where x1, x2, ..., xn, xn+1, ..., xn+m are domain variables that range over domains of attributes and COND is a condition or formula of the domain relational calculus. A formula is made up of atoms.

Codd's Rules
Dr. E. F. Codd in 1970 defined 13 rules (numbered 0 to 12), commonly referred to as Codd's 12 Rules, for the relational model. Dr. E. F. Codd, the founder of the relational database system, places the relational model's characteristics in three broad categories:
1. Structural features that support the view of data, for example relations, views and queries.
2. Integrity features, for example entity and referential integrity.
3. Data manipulation features, for example data insertion, retrieval, deletion and update.

To qualify as a relational database management system, a DBMS must satisfy Codd's rules, which are as follows:
0. Foundation rule: Any DBMS that is said to be a relational database management system must be able to manage databases entirely through its relational capabilities.
1. Information rule: All information in a relational database is represented explicitly at the logical level and in exactly one way, by values in tables, i.e. everything in the database exists in tables and is accessed via table access routines.
2. Guaranteed access rule: Every value in a relational database is guaranteed to be accessible by using a combination of the table name, primary key value, and column name.
3. Systematic treatment of null values: Null values, which are distinct from blanks and zero, are supported in a fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type. If data does not exist or does not apply then a value of NULL is used, which is understood by the RDBMS as meaning non-applicable data.
4. Active, online relational catalog (data dictionary): The description of the database and its contents is represented at the logical level as tables and can therefore be queried using the database language. The data dictionary is held within the RDBMS, so there is no need for off-line volumes to tell you the structure of the database.
5. Comprehensive data sublanguage: Every RDBMS should provide a language to allow the user to query the contents of the RDBMS and also to manipulate the contents of the RDBMS. This language must have a well-defined syntax and be comprehensive. It must support data definition, manipulation, integrity rules, authorization, and transactions.
6. View updating rule: All views that are theoretically updatable should be updatable through the system, i.e. an update made through a view must be propagated by the system to the underlying base tables.
7. High-level insertion, update, and deletion: The DBMS supports not only high-level retrievals but also high-level inserts, updates, and deletes. The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data.
8. Physical data independence: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods. Changes to the physical level do not require a change to an application based on the structure. The user should not be aware of where or upon which media data files are stored.
9. Logical data independence: Application programs are logically unaffected when changes are made to the table structures. Logical changes in the tables and views, such as adding or deleting columns or changing field lengths, need not require modification of the programs. The database can change to reflect such changes without user intervention.
10. Integrity independence: The database language must be capable of defining integrity rules. They must be stored in the online catalog, and they cannot be bypassed. If a column only accepts certain values, then it is the RDBMS which enforces these constraints and not the user program; this means that an invalid value can never be entered into this column, whereas if the constraints were enforced via programs there is always a chance that a buggy program might allow incorrect values into the system.
11. Distribution independence: Application programs are not affected by changes in the distribution of physical data. This improves system reliability, since application programs will work even if the programs and data are moved to different sites.
12. Nonsubversion rule: It must not be possible to bypass the integrity rules defined through the database language by using lower-level languages.

Normalization
Normalization is a design technique that is widely used as a guide in designing relational databases. Normalization is a two-step process that puts data into tabular form by removing repeating groups and then removes duplicated data from the relational tables. There are currently five normal forms that have been defined. The goal of normalization is to create a set of relational tables that are free of redundant data and that can be consistently and correctly modified. This means that all tables in a relational database should be in third normal form (3NF).

Need for Normalization: The main goal of normalization is to create a database that is free from redundancy and that can be consistently and correctly modified. Normalization is basically based on functional dependence and decomposition of tables.

Functional Dependence: When a column of a table depends on another column of the same table in such a way that each value of the second column determines precisely one value of the first, the first column is said to be functionally dependent on the second.
Example: In a table listing Employee number (EmpNo) and Employee name (EmpName), it can be said that EmpName is functionally dependent upon EmpNo (EmpNo → EmpName), because an employee's name can be uniquely determined from the employee number. The reverse statement (EmpName → EmpNo) is not true, because more than one employee can have the same name but different EmpNo values.
EmpNumber  EmpName
1          Amit
2          Rohit
3          Amit

Dependence Preservation: The decomposition is said to be dependence preserving if the original set of constraints or dependencies can be derived from the decomposed relations without joining the relations.
Attribute Preservation: All the attributes of the original relation must be present in the decomposed relations.
Lack of Redundancy: Redundancy must be controlled in the decomposed relations, and if present it should be much less than in the original composed relation.

Problems of an un-normalized table (database anomalies): These are the problems that occur due to redundancy in the relations. These anomalies affect the process of inserting, deleting and modifying data in the relations. Some important data may be lost if a relation that contains database anomalies is updated. So elimination of these anomalies is necessary in order to perform different processing on the relations without any problem. A database basically contains four types of anomalies:
1. Redundancy anomalies
2. Updation anomalies
3. Insertion anomalies
4. Deletion anomalies

S.No.  StudName  Address  EnrollNo  Cname          Instructor  Office
1      Smith     Ddun     Cp302     Database       Gupta       102
1      Smith     Ddun     Cp303     Communication  Wilson      102
1      Smith     Ddun     Cp304     Software eng.  Williams    1024
5      Jones     Agra     Cp302     Database       Gupta       102

This table is in 1NF but it contains the following anomalies:
1. Redundancy anomalies: A lot of information is being repeated. Student name, address, course name, instructor name and office number are repeated often. Every time we wish to insert a student enrollment number, we must insert the name of the course as well as the name and office number of its instructor, and we also have to repeat the student's name and address. Repetition of information results in wastage of storage as well as other problems.
2. Updation anomalies: Redundant information not only wastes storage but makes updates more difficult. For example, changing the name of the instructor of CP302 would require that all tuples containing CP302 enrolment information be updated. If, for some reason, all tuples are not updated, we might have a database that gives two names of instructor for subject CP302. This difficulty is called the update anomaly.
3. Insertion anomalies (inability to represent certain information): Let the primary key of the above relation be (S.No, EnrollNo). Any new tuple to be inserted in the relation must have a value for the primary key, since entity integrity requires that a key may not be null. So if we want to insert the number and name of a new course in the database, it would not be possible until a student enrolls in the course and we are able to insert values of S.No and EnrollNo. Similarly, information about a new student cannot be inserted in the database until the student enrolls in a subject. These difficulties are called insertion anomalies.
4. Deletion anomalies (loss of useful information): In some cases useful information may be lost when a tuple is deleted. For example, if we delete the tuple corresponding to student 1 doing CP304, we will lose relevant information about course CP304 (like course name, instructor and office number) if student 1 was the only student enrolled in that course. Similarly, deletion of course CP302 from the database may remove all information about the student named Jones. These are called deletion anomalies.
Decomposition: A relation scheme can be decomposed into a collection of relation schemes to eliminate some of the anomalies contained in the original relation scheme. However, any such decomposition requires that the information contained in the original relation be maintained. This in turn requires that the decomposition be such that a join of the decomposed relations gives the same set of tuples as the original relation and that the dependencies of the original relation be preserved. Desirable properties of a decomposition are:
1. Content preservation (lossless join decomposition)
2. Dependency preservation
3. Attribute preservation
4. Lack of redundancy
Content Preservation (Lossless join decomposition): The decomposition is said to be lossless if the original relation's tuples can be derived from the decomposed relations by a join operation.

Normal Forms

First Normal Form: A relation which contains a repeating group is called an un-normalized relation. Removal of repeating groups is the most important step in putting a relation into first normal form. So a relation is said to be in first normal form (1NF) if it does not contain any repeating group.

Un-normalized table (contains multiple city names in a single column):
Sales Man  City
Amit       Bombay, Delhi, Agra
Sumit      Roorkee, Delhi

Relation in first normal form (no repeating groups):
Sales Man  City
Amit       Bombay
Amit       Delhi
Amit       Agra
Sumit      Roorkee
Sumit      Delhi
Second Normal Form: A relation is said to be in second normal form if it is in first normal form and no non-key attribute is functionally dependent on only a portion of the primary key; in other words, every non-key attribute is fully dependent on the primary key. An attribute is said to be a non-key attribute if it is not part of the primary key.
Example: The given table is in first normal form:
Emp-No  Project-No  Hours  Emp-Name  Proj-Location
Here the primary key is Emp-No + Project-No, and the functional dependencies are:
Emp-No + Project-No → Hours
Emp-No → Emp-Name
Project-No → Proj-Location
This relation can be converted to second normal form by decomposing it into three tables:

Emp-Project-Hours:
Emp-No  Project-No  Hours

Emps:
Emp-No  Emp-Name

Project:
Project-No  Proj-Location
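A hedged SQL sketch of this 2NF decomposition (the names and data types are assumptions for illustration):

CREATE TABLE emps (
    emp_no    INTEGER PRIMARY KEY,
    emp_name  VARCHAR(50)
);

CREATE TABLE project (
    project_no     INTEGER PRIMARY KEY,
    proj_location  VARCHAR(50)
);

CREATE TABLE emp_project_hours (
    emp_no      INTEGER REFERENCES emps(emp_no),
    project_no  INTEGER REFERENCES project(project_no),
    hours       INTEGER,
    PRIMARY KEY (emp_no, project_no)   -- composite key; Hours depends on the whole key
);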

Third Normal Form: A relation is said to be in third normal form (3NF) if it is in 2NF and it contains no transitive dependencies. A determinant is an attribute on which other attributes depend. Transitive dependency means a non-key attribute is dependent upon another attribute which in turn is dependent upon the primary key. So a relation is said to be in third normal form if it is in 2NF and every non-key attribute is non-transitively dependent upon the primary key.
Example: This table is in 2NF but not in 3NF. In this table DeptName and DeptMgr are dependent upon DeptNo, which is not a candidate key, while DeptNo is dependent on EmpNo, which is the primary key; so a transitive dependency exists in this table.
EmpNo  Ename  Bdate  Address  DeptNo  DeptName  DeptMgr

To put this table into third normal form, it is broken into two sub-tables which are fully in 3NF. In both these tables the non-key attributes are directly dependent on the primary key, so no transitive dependency exists. Hence both tables are in 3NF.
EmpNo  Ename  Bdate  Address  DeptNo

DeptNo  DeptName  DeptMgr

Boyce-Codd Normal Form (BCNF): A relation is said to be in BCNF if it is in 3NF and all of its determinants are candidate keys. So, to convert a table into BCNF, decompose it such that every determinant becomes a candidate key.
Example: This relation is not in BCNF because Teach# depends upon Class-Code, which is not a candidate key.
Stud-ID  Teach#  Class-Code  Grade

To convert it into BCNF we decompose it into two tables:
Stud-ID  Class-Code  Grade

Class-Code  Teach#
Now every determinant in these tables is a candidate key, so these tables are in BCNF as well as in 3NF.

Multi-valued dependencies and fourth normal form: A relation is said to be in 4NF if it is in BCNF and the relation has no multi-valued dependencies between attributes. To convert a relation to 4NF, we split the relation into separate relations, each containing the attribute which multi-determines the others.

Database design: Database design provides a means to represent real-world entities in a form that can be processed by the computer. Database models present a process of abstracting real-world entities into computer representations. They give us methods to capture the static and dynamic attributes of real entities. For evolving a good database design it is important to use a database design model. Database design models have the following benefits:
1. They provide a means to represent real-world objects in computer-usable form.
2. They capture and represent associations and relationships among the real-world objects.
3. They define how the objects in the application interact in logical terms.
4. They allow the database designer to capture the static and dynamic organization and flow of information within the modeled enterprise.
5. They help in improving the reliability and maintainability of the system.

Requirement Analysis: This technique is used to define the scope of the requirements of an application domain. It examines the entire scope of the problem domain and includes:
1. Defining the application's functionality.
2. Defining all the information managed and used by the application.
3. Identifying all resource requirements, including hardware, software and other physical resources.
4. Deciding on the security requirements and mechanisms.
5. Defining the reliability, quality and performance requirements of the application.
For all these requirements the analyst must examine the information collected with regard to the following:
1. Correctness: The information should correctly represent the real-world system that it is modeling.
2. Consistency: The data should adequately capture the constraints of the real-world entity.
3. Completeness: The represented real-world entity should not have any missing attributes. All the relevant components of the real-world entity should be present in the model.
4. Realistic representation: The representation of the real-world entity should make sense.
5. Need: The information that is being captured should actually be required by the application.

One-to-Many: A one-to-many relationship is implemented by including the primary key of the "one" relation as a foreign key in the "many" relation. For example, one department is related to several employees. Hence we include the primary key of the DEPT relation as a foreign key in the EMPLOYEE relation. The relations look as follows:
DEPT (Deptno, Deptname, DeptLocation)
EMPLOYEE (Empno, Empname, Empaddress, Salary, Deptno)

Many-to-Many: A many-to-many relationship can be implemented by creating a new relation whose key is the combination of the keys of the original relations. Suppose that every employee can be assigned to several departments and that every department can have several employees. In this case, we create a new relation whose primary key is the combination of Empno and Deptno. This new relation represents the fact that an employee works in a department. Hence we will give it the name WORKS IN and obtain the following collection of relations:
DEPT (Deptno, Deptname, DeptLocation)
EMPLOYEE (Empno, Empname, Empaddress, Salary)
WORKS IN (Empno, Deptno)
Other attributes in the WORKS IN relation are those attributes which depend on both the employee and the department, for example the date when the employee was first assigned to the department.

One-to-One: This relationship between employees and departments would exist if every employee were assigned to a single department and every department consisted of only one employee. This kind of relationship is the most difficult to implement.

Database Architecture: Client-server computing architectures are commonly described as having two or more tiers, according to how application logic is distributed between client and server. Client-server architecture generally has a client tier and a server tier, but there can be more tiers as well.

One-Tier Architecture: Using a single physical resource to access and process information is known as one-tier architecture. Examples:
1. Suppose a person uses Microsoft Access to load a list of personal addresses and phone numbers and saves this file in the My Documents folder of the computer. This is an example of one-tier architecture because Microsoft Access runs on the user's local machine and references a file that is stored on that machine's hard drive.
2. File server architecture is another example of one-tier architecture. Suppose a workgroup database is stored in a shared location on a single machine. Workgroup members use a software package such as Microsoft Access to load the data and then process it on their local machines. In this case, the data may be shared among different users, but all of the processing occurs on the local machine. Essentially, the file server is just an extra hard drive from which to retrieve files.

Two-Tier Architecture: Two-tier architecture is one that is familiar to many of today's computer users. A common implementation of this type of system is that of a Microsoft Windows based client program that accesses a server database such as Oracle or SQL Server. Users interact through a GUI (Graphical User Interface) to communicate with the database server across a network via SQL (Structured Query Language).

Client-server database architecture (N-tier architecture): The client-server model is basic to distributed systems. It is a response to the limitations of the client-host model. The client-server architecture allows clients to make requests that are routed to the appropriate server. These requests are made in the form of transactions. This model consists of three parts:
Client: The client is the machine running the front-end application. Client also refers to the client process that runs on the client machine. The client has no direct data access responsibilities. It simply requests processing from the server and displays data managed by the server.
Server: The server is the machine that runs the DBMS software and handles the functions required for concurrent, shared data access. It is often referred to as the back end. Server also refers to the server process that runs on the server machine. The server receives and processes SQL and other query statements originating from client applications.
Network: The network enables remote data access through client-to-server and server-to-server communication.

Transaction
A transaction is a logical unit of work (LUW) that must succeed or fail in its entirety. This means that a transaction may involve many sub-steps, which should either all be carried out successfully or all be ignored if some failure occurs. Each transaction generally involves one or more data manipulation language (DML) statements (like insert, delete and update) and ends with either a COMMIT to make the changes permanent or a ROLLBACK to undo the changes. Database transactions must be handled in a way that maintains their integrity.
Example: Let us assume that a database system has to execute a transaction that transfers Rs. 1000/- from account X to account Y.
Begin transaction
  Get balance of account X
  Calculate new balance as X - 1000
  Store new balance into the database file
  Get balance of account Y
  Calculate new balance as Y + 1000
  Store new balance into the database file
End transaction
While executing this transaction, the DBMS will make sure that either all six steps of this transaction are carried out successfully or none of them are. Let us now consider a case of failure that arises during the transaction execution. Assume that a system failure occurs after the first three steps have been carried out successfully; because of the failure, the rest of the steps cannot be carried out. At this point, Rs. 1000/- has been withdrawn from account X but not deposited to account Y. If things are left as they are, the database will be in an inconsistent state. Therefore the DBMS will:
1. Either undo all the changes made by the first three steps (i.e. the amount withdrawn will be deposited back to account X). Such processing of undoing things is known as a ROLLBACK operation.
2. Or complete the remaining three steps, if it can recover from the system failure and successfully finish the transaction. Such successful completion is known as a
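A hedged SQL sketch of the transfer transaction above (the account table and its columns are hypothetical, and transaction syntax varies slightly between products):

BEGIN;                                                         -- start the transaction
UPDATE account SET balance = balance - 1000 WHERE acct = 'X';  -- withdraw from X
UPDATE account SET balance = balance + 1000 WHERE acct = 'Y';  -- deposit to Y
COMMIT;                                                        -- make both updates permanent
-- If any step fails before COMMIT, issue ROLLBACK instead, and the withdrawal
-- from X is undone along with everything else.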

COMMIT operation. In the case of a COMMIT operation, changes are permanently reflected in the database. Therefore we can say that a transaction will either be COMMITTED or be ROLLED BACK. The DBMS maintains special logs (redo log and undo log) to perform redo or undo operations if required.

Commit: Committing a transaction means all the steps of the transaction were carried out successfully and all data changes are made permanent in the database.
Rollback: It means the transaction has not finished completely, and hence all data changes made by the transaction in the database, if any, are undone and the database returns to the state it was in before this transaction.

Transaction Handling Issues: At any point of time, there may be more than one transaction that is either being executed or waiting to be executed. Let us assume there are two transactions T1 and T2 that are to be executed by the DBMS. Multiple transactions can be executed in one of the following two ways:
1. Serially, i.e. serial execution of transactions.
2. Concurrently, i.e. concurrent (simultaneous) execution of transactions.

Transaction Properties: A database system is responsible for ensuring proper execution of transactions despite failures, i.e. either the entire transaction executes or none of it does. To ensure this data integrity, database systems maintain the following properties of transactions, termed the ACID properties.
1. Atomicity (all or none concept): This property ensures that either all operations of the transaction are reflected properly in the database, or none are. It specifies that either no changes will be made to the database or the database will be changed in a consistent manner. This property has two states:
a. DONE: the transaction must complete successfully and its effect should be visible in the database.
b. NEVER STARTED: if a transaction fails during execution then all its modifications must be undone to bring the database back to the last consistent state, i.e. the effect of the failed transaction is removed.
2. Consistency: A transaction, while executing in isolation (i.e. with no other transaction executing concurrently), transforms a consistent state of the database into another consistent state, without necessarily preserving consistency at all intermediate steps.
3. Isolation: Transactions are isolated from one another. Even though many transactions may be running concurrently, the updates of a transaction are hidden from others until the transaction commits. Each transaction is unaware of other transactions executing concurrently in the system.
4. Durability: Once a transaction commits, its updates survive, even if the system crashes before the updates are physically written into the database.

Transaction States
1. Active: The initial state. The transaction remains in this state during its execution.
2. Partially committed: A transaction enters this state from the active state after its final statement has been executed.
3. Aborted: A transaction enters this state after it has been rolled back and the database has been restored to the consistent state that existed prior to the start of the transaction.
4. Committed: A transaction enters this state from the partially committed state after its successful completion.
A transaction is said to be terminated if it has either committed or aborted. When a transaction is aborted, the system has two options:
A. It can restart the transaction, but only if the transaction was aborted due to some hardware or software failure and not due to an error in the internal logic of the transaction. The restarted transaction is treated as a new transaction.
B. It can kill the transaction if the failure was due to an internal logic error that must be corrected by rewriting the application program, or because the needed data was bad or not found in the database.

Concurrency and related problems


When more than one transaction is executing at the same time, this is known as concurrency. Concurrency is common in client-server architectures, where many users execute their transactions on the same database. If locking or another control technique is not implemented and several users access a database concurrently, problems may occur when their transactions use the same data at the same time. Concurrency problems include:
1. Lost updates
2. Uncommitted dependency (dirty read)
3. Inconsistent analysis (non-repeatable read)
4. Phantom phenomenon
1. Lost updates: A lost update occurs when two or more transactions select the same row and then update the row based on the value originally selected. Each transaction is unaware of the others, so the last update overwrites the updates made by the other transactions, and data is lost. Example:

Time   Transaction   Step                            Stored Value
1      T1            Read Balance                    500
2      T2            Read Balance                    500
3      T1            Balance = 500 + 200             700
4      T2            Balance = 500 - 100             400
5      T1            Write Balance (lost update)     700
6      T2            Write Balance                   400
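The same interleaving can be pictured as SQL issued by two sessions against a hypothetical account table (the table and column names are assumptions, not part of the original example):

-- T1: reads the balance and sees 500
SELECT balance FROM account WHERE acc_no = 'A1';
-- T2: reads the same row and also sees 500
SELECT balance FROM account WHERE acc_no = 'A1';
-- T1: writes 500 + 200
UPDATE account SET balance = 700 WHERE acc_no = 'A1';
-- T2: writes 500 - 100, overwriting T1's update (the lost update)
UPDATE account SET balance = 400 WHERE acc_no = 'A1';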

Uncommitted dependency (dirty read): An uncommitted dependency occurs when a second transaction selects a row that is being updated by another transaction. The second transaction is reading data that has not been committed yet and may still be changed by the transaction updating the row.
Inconsistent analysis (non-repeatable read): An inconsistent analysis occurs when a second transaction accesses the same row several times and reads different data each time. Inconsistent analysis is similar to an uncommitted dependency in that another transaction is changing the data that the second transaction is reading. However, in inconsistent analysis the data read by the second transaction was committed by the transaction that made the change. Also, inconsistent analysis involves multiple reads (two or more) of the same row, and each time the information has been changed by another transaction; hence the term non-repeatable read.

For example, an editor reads the same document twice, but between the two readings the writer rewrites the document. When the editor reads the document for the second time it has changed, so the original read was not repeatable. This problem could be avoided if the editor could read the document only after the writer has finished writing it.
Phantom read: A phantom read occurs when an insert or delete action is performed against a row that belongs to a range of rows being read by a transaction. The transaction's first read of the range shows a row that no longer exists in the second or succeeding read, as a result of a deletion by a different transaction. Similarly, as a result of an insert by a different transaction, the transaction's second or succeeding read shows a row that did not exist in the original read.
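In practice these problems are usually addressed by choosing a transaction isolation level. The statements below use the standard SQL syntax; which levels are supported, and exactly which anomalies each one prevents, varies from one DBMS to another, so treat this only as an indicative sketch.

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;   -- prevents dirty reads
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;  -- also prevents non-repeatable reads
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;     -- also prevents phantom reads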

Serializability:

When two or more transactions are executed concurrently on a database, their effect should be the same as if they had executed serially, with one completing before the other starts. This is the concept of serializability: even though two or more processes may be running in parallel against the same database, the outcome must be as if each had been allotted its own time slot and executed only after the earlier process had completed.

To achieve serializability we need some method of allowing transactions to proceed in parallel while at the same time avoiding the interference problems described earlier. For this we can use a technique such as locking.

Locking: The idea of locking is that interference between concurrent transactions can be controlled by enabling a transaction to lock the part of the database it is using, preventing other transactions from reading the same data while that part is in use. When the work on that data is complete the lock is released, enabling other transactions to continue. With this technique we can overcome almost all concurrency problems.

Two-Phase Locking: The simple locking scheme described above does not actually guarantee complete protection from concurrency problems. The lost update and the other problems described above can still arise when multiple locks are acquired and released during a transaction: although each individual update is protected, the fact that updates from two or more transactions can interleave in time can still cause data corruption.

Example:
1. Anne starts a transaction and reads a row of a table after applying a lock.
2. Bill tries to access the same row but is prevented by the lock.
3. Anne updates the row, writes it to disk and releases the lock.
4. Bill now succeeds in reading the row, locks it and updates it.
5. Bill now writes the row to disk and releases the lock.
6. Anne's transaction, for whatever reason, fails and is rolled back, restoring the database to its state before Anne's transaction started. The update made by Bill's transaction has now been lost.
The problem here is that a lock has been released too early, allowing the possibility of interference from another transaction. A slightly more elaborate locking scheme called two-phase locking avoids this situation. In the two-phase technique each transaction has two phases: a growing phase, during which locks are acquired, and a shrinking phase, during which locks are released.
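Under two-phase locking the scenario above cannot occur, because a transaction keeps its locks until it commits or rolls back. A minimal sketch of Anne's transaction, assuming a DBMS that supports SELECT ... FOR UPDATE and the same hypothetical account table used earlier:

BEGIN TRANSACTION;
-- growing phase: the selected row is locked and stays locked
SELECT balance FROM account WHERE acc_no = 'A1' FOR UPDATE;
UPDATE account SET balance = balance + 200 WHERE acc_no = 'A1';
-- shrinking phase: all locks are released only here, so Bill cannot
-- touch the row while Anne's transaction might still be rolled back
COMMIT;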

Granularity of Locking (level of locking): A lock can be applied at a variety of levels according to requirement. Locking can be applied at the database level, table level, page level or row level; choosing the level of locking according to the requirement is known as the granularity of locking. The easiest form is locking the entire database, preventing all other activity. It is simple to implement but wasteful and reduces the performance of the database, so this type of locking is suitable only when performing global operations on the database such as compacting or re-indexing. At the second level we can apply table locking. It is also easily applied: the table being used by a transaction is locked, while the other tables of the database remain unlocked and available to other transactions. However it is still restrictive, because a transaction sometimes requires access to several tables, and with table locking such multi-table transactions may be blocked. The most logical level is row (record) level locking, which locks only one row of a table at a time. In practice row-level locking is difficult to implement, so page-level locking is often implemented instead. Page locking applies locks in units of a whole physical block or page of the database storage medium, such that the row being accessed is contained within the locked zone. If the size of the row is less than or equal to the page size, a single page lock will suffice; if the row spans many pages, all pages holding parts of the row are locked. A drawback of page locking is that a locked page will often contain other rows not involved in the transaction, so a single row lock will in effect lock other rows and prevent access to them by other users, even though those rows are not being used by the transaction that locked them.
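The difference between coarse and fine granularity can be sketched with Oracle/PostgreSQL-style statements (the table name is an assumption, and syntax differs between products):

-- table-level lock: the whole table is unavailable to other writers
LOCK TABLE account IN EXCLUSIVE MODE;
-- row-level lock: only the selected row is locked for the duration of the transaction
SELECT * FROM account WHERE acc_no = 'A1' FOR UPDATE;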

Other locking variants (Techniques): Besides granularity there are other locking techniques.
1. Optimistic and pessimistic locking: These terms refer to the manner in which locks are applied for reading, writing and updating activities. Optimistic locking assumes that a transaction will not conflict with another transaction, so updating proceeds without a lock being applied. At the point of commit, a check is made to ascertain whether any other transaction has in fact accessed the same data; if so, the transaction is rolled back and must be restarted. This technique tries to optimize system performance by minimizing the duration of locks. A transaction controlled by optimistic locking may not succeed immediately and may require re-running. Also, with this method a user may read a record that is currently being updated. If the application system is not very busy with a large number of transactions, we can apply the optimistic locking technique.
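Optimistic locking is often implemented with a version (or timestamp) column, as in the hedged sketch below; the version column and the account table are assumptions used only for illustration.

-- read without locking, remembering the version seen (say it was 7)
SELECT balance, version FROM account WHERE acc_no = 'A1';
-- at commit time, update only if nobody else changed the row in the meantime
UPDATE account
SET balance = 700, version = version + 1
WHERE acc_no = 'A1' AND version = 7;
-- if 0 rows were updated, another transaction got there first:
-- roll back and re-run the transaction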

2. Pessimistic Locking Technique: Pessimistic locking assumes that a conflict will occur, so the data is locked at the start of the transaction. Once the data is locked, the transaction can run to completion without any disturbance. This technique holds locks for a much longer period. The main advantage of pessimistic locking is that a read will always see the most up-to-date data. We apply this technique on very busy application systems, where a large number of transactions occur per hour, because conflicts between transactions are frequent in such situations.

3. Shared and exclusive locks: There are two forms of lock. The first is the shared lock, or s-lock, which is applied when a transaction only needs to read the data. The second is the exclusive lock, or x-lock, which is applied when the data is to be updated. Several transactions can hold an s-lock on the same data simultaneously, enabling each of them to read it, but an x-lock cannot be applied to that data until all the s-locks have been released. If an x-lock is applied to a unit of data, no other lock of any kind can be applied to it. So at any time a unit of data carries either s-locks, one x-lock, or no lock at all.
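Some DBMSs let a transaction request these lock modes explicitly. The following Oracle/PostgreSQL-style statements are only an illustration of the s-lock/x-lock idea, not a required part of the technique:

LOCK TABLE catalog IN SHARE MODE;      -- s-lock: many readers may hold it together
LOCK TABLE catalog IN EXCLUSIVE MODE;  -- x-lock: one writer, no other lock allowed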

Deadlock:

Locking is the technique we apply to solve concurrency problems, but sometimes locking itself becomes a problem, known as deadlock. It occurs when concurrent users are each able to apply locks on two or more items at the same time. This can result in a circular wait situation that brings the activity of both users to a halt. So locking, which is applied to avoid corruption of the database arising from concurrent access, unfortunately becomes the source of the deadlock problem itself.
Example: Suppose we have an airline seat reservation system using concurrent access to a database. At some time two customers, Anne and Bill, are being served by online operators. Both Anne and Bill want to book a seat on flight AB123 and on a later flight AB456. Anne's assistant first accesses the AB123 flight information, thereby placing a lock on it. At the same time Bill's assistant accesses flight AB456 and locks it. Before committing the booking, each assistant now tries to access the other flight and, of course, finds that it is locked, but expects it to be unlocked soon. The two booking operations are now deadlocked: Anne is waiting to book AB456 while holding AB123, and Bill is waiting for AB123 while holding AB456. A circular wait situation now exists and will persist until one user rolls back their transaction.
Dealing with deadlocks: A number of techniques have been developed to deal with deadlock.
Detection: It is possible for the DBMS to detect the presence of deadlock by checking for circular waits among the locks and lock requests. When detected, the deadlock can be resolved by rolling back one of the member transactions; this removes all locks held by that transaction and hence breaks the deadlock. The online user responsible for the aborted transaction is informed that the transaction did not succeed and must be restarted. The DBMS can use a matrix technique to detect circular waits in the current resource-locking situation.
Deadlock prevention: This protocol ensures that the system will never enter a deadlock state. Its rules are:
1. Each transaction locks all the data items it needs before it begins execution.
2. Impose a partial ordering of all data items and require that a transaction lock data items only in the order specified by the partial order.
3. Use the wound-wait and wait-die strategies, which use timestamps to determine transaction age and to decide whether a transaction should wait or be rolled back on a lock conflict.
What is the wait-die strategy: An older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back (die) instead. A transaction may die several times before acquiring the needed data item. Example: if Ti is older than Tj, Ti is allowed to wait; if Ti is younger than Tj, Ti is aborted (dies) and is restarted later with the same timestamp.
What is wound-wait: According to it, an older transaction wounds (forces the rollback of) a younger transaction instead of waiting for it, while younger transactions may wait for older ones. This may cause fewer rollbacks than the wait-die scheme. Example: if Ti is older than Tj, it aborts Tj, and Tj is restarted later with the same timestamp (Ti wounds Tj); but if Ti is younger than Tj, Ti is allowed to wait.
In both of the above schemes a rolled-back transaction is restarted with its original timestamp, so older transactions have precedence over newer ones and starvation is avoided.
Timeout-based schemes: A transaction waits for a lock only for a specified amount of time; after that the transaction times out and is rolled back, so deadlocks are not possible. This scheme is simple to implement, but starvation is possible and determining a good value for the timeout interval is difficult.
Deadlock recovery: When a deadlock is detected there are three factors to consider:
1. Victim selection: Some transaction has to be rolled back (made the victim) to break the deadlock. Select the victim transaction that will incur the minimum cost (computation time, data items used, etc.), for example the transaction that started last.
2. Rollback: Abort the transaction and then restart it. It is more effective to roll the transaction back only as far as necessary to break the deadlock.

3. Starvation: This happens if the same transaction is always chosen as the victim. Include the number of rollbacks in the cost factor to avoid starvation.
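The flight example can be reproduced with two database sessions; the flight table, its columns and the BEGIN TRANSACTION syntax below are assumptions used only to make the circular wait concrete.

-- Session 1 (Anne's assistant)
BEGIN TRANSACTION;
UPDATE flight SET seats = seats - 1 WHERE flight_no = 'AB123';  -- locks AB123
-- Session 2 (Bill's assistant)
BEGIN TRANSACTION;
UPDATE flight SET seats = seats - 1 WHERE flight_no = 'AB456';  -- locks AB456
-- Session 1 now asks for AB456 and blocks, because session 2 holds it
UPDATE flight SET seats = seats - 1 WHERE flight_no = 'AB456';
-- Session 2 now asks for AB123 and blocks as well: a circular wait (deadlock);
-- the DBMS must detect it and roll one of the transactions back
UPDATE flight SET seats = seats - 1 WHERE flight_no = 'AB123';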

Recovery Techniques
Recovery implies restoring the database to a state that is assumed to be correct after some failure has rendered the current state incorrect. A database is recoverable if every piece of information it contains can be reconstructed from other information stored redundantly somewhere else in the system. The main goal of recovery is to ensure the atomicity property of a transaction: if a transaction fails before completing its execution, the recovery mechanism has to make sure that the transaction has no lasting effect on the database.
1. Recovery from transaction failures usually means that the database is recovered to the most recent consistent state just before the time of failure. To do this the system must keep information about the changes that were applied to data items by the various transactions.
2. If the database is physically damaged, for example by a disk crash, the recovery method restores a past copy of the database.
3. Recovery can accordingly be classified into:
I. Recovery from catastrophic failures: if a wide portion of the database is damaged, for example due to a disk crash, the recovery method restores a past copy of the database that was backed up to archival storage and reconstructs a more current state by redoing the operations of committed transactions from the backed-up log, up to the time of failure.
II. Recovery from non-catastrophic failures: when the database is not physically damaged but has become inconsistent due to a non-catastrophic transaction failure, it is necessary to undo or redo some operations in order to restore a consistent state of the database.

Recovery technique based on deferred update


The idea behind deferred update techniques is to defer or postpone any actual updates to the database until the transaction completes its execution successfully and reaches its commit point.
1. During transaction execution the updates are recorded only in the log and in the cache buffers. After the transaction reaches its commit point and the log is force-written to disk, the updates are applied to the database.
2. If a transaction fails before reaching its commit point, there is no need to undo any operations.
3. Some transactions do not affect the database at all, for example those that only generate and print reports from information retrieved from the database.

Recovery Techniques based on immediate update


When a transaction issues an update command,

1. The database can be updated immediately, without waiting for the transaction to reach its commit point, but the update operation must be recorded in the log (on disk) before it is applied to the database.
2. In a single-user system, if a failure occurs, the effect of all operations of the failed transaction must be undone.
3. When concurrent execution is permitted, the recovery process depends on the protocols used for concurrency control. For example, a strict two-phase locking protocol does not allow a transaction to read or write an item unless the transaction that last wrote the item has committed (or aborted and been rolled back).
Commit transaction: This operation signals the successful end of a transaction: the transaction has completed successfully, the database is in a consistent state again, and all updates made by the transaction can now be made permanent.
Rollback transaction: This operation signals the unsuccessful end of a transaction: it indicates to the transaction manager that the transaction has failed and could not complete due to some failure, that the database may be in an inconsistent state, and that all updates made by the transaction must be undone.

How to undo the updates in case of rollback


The system maintains a log on disk or tape; this log file contains details of all update operations. The pre-update and post-update values of the updated objects are recorded in the log. In the case of a rollback, the system uses the log to restore the values to their pre-update state. The log is maintained in two portions:
1. An active or online log, which is maintained on disk and is used for minor recovery during normal system operation.
2. An archive or offline log, which is maintained on tape. When the size of the online log on disk reaches some preset limit, it is transferred to the offline log on tape. All updates performed after the last backup (or dump) are saved on the offline log, and these are used to restore the system from the last backup in case of major failures.
Transaction Recovery: A transaction begins with the successful execution of a BEGIN TRANSACTION statement and ends with the successful execution of a COMMIT or ROLLBACK statement. A commit establishes a commit point, at which the database is in a state of consistency; a rollback rolls the database back to the previous commit point, at which the database was in a state of consistency. A transaction is used as the unit of recovery because, if a transaction successfully commits, the system guarantees that all its updates will be permanently installed in the database, even if they had not all been physically written at the time of a crash. Writing such updates after a crash is done from the log file during the restart that follows the crash. Restart will recover any transaction that completed successfully but did not manage to get its updates physically written prior to the crash.
System Recovery: Failures may be of two types: 1. Local failure. 2. Global failure.

1. Local failure: It affects only the transaction in which the failure occurred. Recovery from such a failure is handled by the transaction recovery technique.
2. Global failure: It affects all the transactions in progress at the time of failure. Such failures can be divided into two kinds:
I. System failure (e.g. power failure): This affects all transactions currently in progress but does not physically damage the database. Such a failure is called a soft crash.
II. Media failure (e.g. disk head crash): This causes damage to the database, or to some portion of it, and affects those transactions currently using that portion of the database. A media failure is called a hard crash.
Recovery from system failure: During a system failure the contents of main memory, i.e. the database buffers, are lost. The precise state of any transaction that was in progress at the time of failure is no longer known; such a transaction must be rolled back when the system restarts. There may also be transactions that committed before the system failure but did not manage to get their updates transferred from the database buffers to the physical database; such transactions will need to be redone. How does the system know at restart which transactions to UNDO and which to REDO? At certain intervals the system automatically takes a checkpoint. Taking a checkpoint mainly involves two functions:
1. Physically writing the contents of the database buffers out to the physical database.
2. Physically writing a special CHECKPOINT RECORD out to the physical log. This CHECKPOINT RECORD gives a list of the transactions that were in progress at the time the checkpoint was taken.
On the basis of the checkpoint, transactions can be of the following types:
1. Transactions which began and committed before the checkpoint. These need no action during restart after a failure.
2. Transactions which began before or after the checkpoint and committed after the checkpoint but prior to the failure. These need a REDO operation at the time of restart after the failure.
3. Transactions which began before or after the checkpoint and were still not committed at the time of failure. These need an UNDO operation at the time of restart after the failure.
Recovery procedure: At restart time, the system goes through the following procedure:
1. Start with two lists, the UNDO list and the REDO list. Initialize the UNDO list to the list of transactions given in the most recent CHECKPOINT RECORD. Initialize the REDO list to empty.
2. Search forward through the log, starting from the most recent CHECKPOINT RECORD.
3. If a BEGIN TRANSACTION log entry is found for transaction T, add T to the UNDO list.
4. If a COMMIT log entry is found for transaction T, move T from the UNDO list to the REDO list.
5. When the end of the log is reached, the UNDO and REDO lists are final.

6. The system now works backward through the log, undoing the transactions in the UNDO list. This is called backward recovery.
7. Then the system works forward through the log, redoing the transactions in the REDO list. This is called forward recovery.
Recovery from media failure: A media failure is a failure such as a disk head crash or a disk controller failure, in which some portion of the database is physically destroyed. Recovery from such a failure involves reloading the database from a backup copy (dump) and then using the log (both its active and archive portions) to REDO all transactions that completed since the backup copy was taken. There is no need to UNDO the transactions that were in progress at the time of failure, since their updates have been lost from the database buffers anyway.

Database Security and Authorization


A DBMS stores critical, important, confidential and vital data of organizations, institutes, offices, etc., so the security of this data is an important issue. Database security is an important component of a DBA's job; without a comprehensive database security plan and implementation, the integrity of the organization's database will be compromised. Each DBA should learn the security mechanisms at his disposal to ensure that only authorized users are accessing and changing data in the company's databases.
Reasons for database security:
1. Protection of the database from unauthorized users.
2. Protection of data against internal and external threats.
3. Ensuring that users are allowed or disallowed to do the things they are trying to do, based on security policies.
Types of security: Database security is a very broad area that addresses many issues, including the following:
1. Legal and ethical issues regarding the right to access certain information. Some information may be kept private and cannot legally be accessed by unauthorized persons.
2. Policy issues at the government, institutional or corporate level as to what kinds of information should not be made publicly available, for example credit ratings and personal medical records.
3. System-related issues, such as the system levels at which various security functions should be enforced, for example whether a security function should be handled at the physical hardware level, the operating system level or the DBMS level.
4. The need in some organizations to identify multiple security levels and to categorize the data and users based on these classifications, for example top secret, secret, confidential and unclassified. The security policy of the organization with respect to permitting access to the various classifications of data must be enforced.

Threats to databases: Threats to a database result in the loss or degradation of some or all of the following security goals: integrity, availability and confidentiality.
1. Loss of integrity: Database integrity refers to the requirement that information be protected from improper modification. Modification of data includes creation, insertion, updating, changing the status of data, and deletion. Integrity is lost if unauthorized changes are made to the data by either intentional or accidental acts, the continuation of which leads to corrupted data.
2. Loss of availability: Database availability refers to making objects available to the human users or programs that have a legitimate right to them.
3. Loss of confidentiality: Database confidentiality refers to the protection of data from unauthorized disclosure. The impact of unauthorized disclosure of confidential information can range from violation of data privacy laws to the jeopardizing of national security. Unauthorized disclosure of data could result in loss of public confidence or legal action against the organization.
To counter these threats, two kinds of security mechanisms are commonly used:
4. Discretionary security mechanisms: These are used to grant privileges to users, including the capability to access specific data files, records or fields in a specified mode, such as read, insert, delete or update.
5. Mandatory security mechanisms: These are used to enforce multilevel security by classifying the data and users into various security classes or levels and then implementing the appropriate security policy of the organization. For example, a typical security policy is to permit users at a certain classification level to see only the data items classified at the user's own classification level. An extension of this is role-based security, which enforces policies and privileges based on the concept of roles.
Database security and the DBA: The DBA is the central authority for managing a database system. The DBA's responsibilities include granting privileges to users who need to use the system and classifying users and data in accordance with the policy of the organization. The DBA has a DBA account in the DBMS, sometimes called a system or superuser account, which provides powerful capabilities that are not made available to regular database accounts and users. DBA-privileged commands include commands for granting and revoking privileges to individual accounts, users or user groups (roles) and for performing the following types of actions:
1. Account creation: This action creates a new account and password for a user or a group of users to enable access to the DBMS.
2. Privilege granting: This action permits the DBA to grant certain privileges to certain accounts.
3. Privilege revocation: This action permits the DBA to revoke (cancel) certain privileges that were previously given to certain accounts.
4. Security level assignment: This action consists of assigning user accounts to the appropriate security classification level.
Protection of data within the database: To protect data within the database there are mainly two techniques:
1. Authorization to access data.
2. Providing roles to database users.

Authorization: Authorization is permission given to a user, program or process to access an object or set of objects. The type of access granted to a user can be read-only, or read and write. The two methods by which access is provided to users are privileges and roles.
Privilege: A privilege is permission to access a named object in a prescribed manner, for example permission to query a table. A privilege is also what enables a user to connect to the database.
Role: A role is a mechanism that can be used to provide authorization. A single person or a group of people can be granted a role or a group of roles. By defining roles, the DBA can manage access privileges much more easily.

Database privileges include:
1. The right to connect to the database.
2. The right to create a table.
3. The right to select rows from another user's table.
4. The right to execute another user's stored procedure.
A user can receive a privilege in two different ways:
1. We can grant privileges to a user explicitly (directly to the user). For example, we can grant a user the privilege to insert records into a table.
2. We can grant a privilege to a role rather than to a single user (implicitly). All users under that role then receive the privilege automatically. For example, you can grant the privilege to select, insert, update and delete records from the EMPLOYEE table to the role named ACCOUNTANT, which in turn can be granted to Ram and Shyam.
Categories of privileges: There are two categories of privileges.
1. System privileges: A system privilege is the right to perform a particular action on a particular type of object. For example, the privileges to create tables and to delete the rows of any table in a database are system privileges. We can grant or revoke system privileges to users and roles; if system privileges are granted to roles, the advantages of roles can be used to manage them. System privileges are granted to, or revoked from, users and roles using the SQL commands GRANT and REVOKE.
2. Object privileges: An object privilege is the right to perform a particular action on a specific table, view, sequence, procedure, function or package. For example, the privilege to delete rows from the table DEPT is an object privilege. Object privileges can be granted to, or revoked from, users and roles using the SQL commands GRANT and REVOKE, respectively.
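Using the document's own examples, the distinction can be shown with two GRANT statements and a REVOKE (a sketch; the exact privilege names available depend on the DBMS):

GRANT CREATE TABLE TO Ajay;               -- system privilege: an action on a type of object
GRANT SELECT, UPDATE ON catalog TO Ajay;  -- object privilege: an action on one named object
REVOKE UPDATE ON catalog FROM Ajay;       -- object privileges are revoked the same way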
Roles: Roles are named groups of related privileges that you grant to users or to other roles. Several properties of roles allow for easier privilege management within a database:
1. Reduced privilege administration.
2. Dynamic privilege management.
3. Selective availability of privileges.
4. Application awareness.
5. Application-specific security.
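A typical role workflow, sketched in Oracle-style SQL and reusing the ACCOUNTANT example from above (the EMPLOYEE table and the user names are illustrative):

CREATE ROLE accountant;
GRANT SELECT, INSERT, UPDATE, DELETE ON employee TO accountant;  -- privileges go to the role
GRANT accountant TO Ram;    -- users then receive the role,
GRANT accountant TO Shyam;  -- and with it all of its privileges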

Uses for roles: Generally a role is created for two main purposes:
1. To manage the privileges for a database application.
2. To manage the privileges for a user group.
Roles can accordingly be categorized into two kinds.
Application roles: You create an application role by granting it all the privileges necessary to run a given database application. Then you grant the application role to other roles or to specific users. An application can have several different roles, with each role assigned a different set of privileges that allow for greater or lesser data access while using the application.
User roles: You create a user role for a group of database users with common privilege requirements. You manage user privileges by granting application roles and privileges to the user role and then granting the user role to the appropriate users.
The functionality of database roles includes the following:
1. A role can be granted system or object privileges.
2. A role can be granted to other roles, but a role cannot be granted to itself, nor can roles be granted circularly (A->B->C->A).
3. Any role can be granted to any database user.
4. Each role granted to a user is, at a given time, either enabled or disabled.
5. An indirectly granted role (a role granted to a role) can be explicitly enabled or disabled for a user.
The GRANT command: The permission to perform an operation in the database is given to a user with the GRANT command. Syntax:
1. GRANT {ALL | privilege-list} ON {table-name [(column-comma-list)] | view-name [(column-comma-list)]} TO {PUBLIC | user-list} [WITH GRANT OPTION]
2. GRANT {ALL | privilege-list [(column-comma-list)]} ON {table-name | view-name} TO {PUBLIC | user-list} [WITH GRANT OPTION]
If ALL is specified, then all the privileges on the object for which the user issuing the GRANT has grant authority will be granted. If a privilege list is specified, then only the listed privileges will be granted. The ON clause specifies the object on which the privileges are granted; it can be a table or a view. If the optional column-comma-list is specified, the privileges are restricted to those columns; if the column list is not specified, the grant applies to the entire table or view. The TO clause identifies the users to whom the privileges are granted. The keyword PUBLIC means that the privileges are granted to all known users of the system; if a user-list is specified, the privileges are granted to the users in that list. If WITH GRANT OPTION is specified, the recipient has the authority to grant the privileges that were granted to him to other users.
Examples:
1. Grant select authority on the BOOK table to all users.
Grant select on book to public.

2. Grant the select, delete and update authority on the CATALOG table to user Ajay.
Grant select, delete, update on catalog to Ajay.
3. Grant select, delete and update authority on the CATALOG table to user Ajay, with the capability to grant those privileges to other users.
Grant select, delete, update on catalog to Ajay with grant option.
4. Grant all privileges on the BOOK table to user Vijay.
Grant all on book to Vijay.
5. Give the system privileges for creating tables and views to Amar.
Grant create table, create view to Amar.
6. Grant the update authority on the price column of the CATALOG table to user Amit.
Grant update (price) on catalog to Amit.
The REVOKE command: The REVOKE command is used to remove a privilege granted to a user. It is the opposite of the GRANT command. Syntax:
1. REVOKE {ALL | privilege-list} ON {table-name [(column-comma-list)] | view-name [(column-comma-list)]} FROM {PUBLIC | user-list}
2. REVOKE {ALL | privilege-list [(column-comma-list)]} ON {table-name | view-name} FROM {PUBLIC | user-list}
If ALL is specified, then all the privileges on the specified object are revoked. If a privilege list is specified, then only those privileges are revoked. The ON clause specifies the object from which the privileges are removed. If the optional column-comma-list is specified, the revocation is restricted to those columns; if the column list is not specified, the revoke applies to the entire table. The FROM clause identifies the users from whom the privileges are taken away. The keyword PUBLIC means that the privileges are revoked from all known users of the system; if a user-list is specified, the privileges are revoked from the users in that list. The user issuing the REVOKE command should be the user who granted the privileges in the first place. Dropping a domain, table or view causes an automatic revoke of all privileges on the dropped object for all users.
Examples:
1. Revoke the system privilege for creating tables from Ajay.
Revoke create table from Ajay.
2. Revoke the select privilege on the CATALOG table from Ajay.
Revoke select on catalog from Ajay.
3. Revoke the update privilege on the CATALOG table from all users.
Revoke update on catalog from public.
4. Remove all privileges on the CATALOG table from user Rohan.
Revoke all on catalog from Rohan.
5. Remove the delete and update authority on the price and year columns of the CATALOG table from user Amar.
Revoke delete, update (price, year) on catalog from Amar.
Data encryption: Encryption is a technique of encoding data so that only authorized users can understand it. Protecting data in a database includes access control, data integrity, encryption and auditing.

Although encryption alone is not sufficient for data security, we can encrypt sensitive data before it is stored in the database, for example credit card numbers, user names and passwords, and industrial formulas. There are two standard encryption techniques:
1. DES (Data Encryption Standard): It provides standards-based encryption for data privacy.
2. 3DES (Triple DES): It encrypts message data with three passes of the DES algorithm.
Database integrity: Database integrity ensures that the data in the database is correct and consistent. Database integrity mechanisms can be divided into those that support system integrity and those that enforce relational database integrity properties (such as entity integrity, referential integrity, transaction integrity and business rules). Traditional system integrity involves ensuring that the data inserted into the system is the same as the contents of the data when it is retrieved. Further, data must not be altered or deleted by a user who is not authorized to do so. For example, a business rule may say that no employee in the employee table can receive an increment greater than 20% of the value in the salary column; if an insert or update statement attempts to violate this integrity rule, the statement must fail.
Flow control: Flow control regulates the distribution, or flow, of information among accessible objects. A flow between object X and object Y occurs when a program reads values from X and writes values into Y. Flow control checks that information contained in some objects does not flow, explicitly or implicitly, into less protected objects. Thus a user cannot get indirectly from Y what he or she cannot get directly from X. Most flow controls employ some concept of security class: the transfer of information from a sender to a receiver is allowed only if the receiver's security class is at least as privileged as the sender's. An example of flow control is preventing a service program from leaking a customer's confidential data. The flow policy specifies the channels along which information is allowed to move. The simplest flow policy specifies just two classes of information, confidential (C) and non-confidential (N), and allows all flows except those from class C to class N. This policy can solve the confinement problem that arises when a service program handles data, such as customer information, some of which may be confidential.
Covert channels: A covert channel allows a transfer of information that violates the security policy. Specifically, a covert channel allows information to pass from a higher classification level to a lower classification level through improper means. Covert channels can be classified into two broad categories: storage channels and timing channels. In a storage channel, information is conveyed by accessing system information that is otherwise inaccessible to the user, while in a timing channel the information is conveyed by the timing of events or processes.
Digital signatures: A digital signature is an example of using encryption techniques to provide authentication services in electronic commerce applications. A digital signature is a means of associating a mark unique to an individual with a body of text. The mark should be unforgeable, meaning that others should be able to check that the signature really does come from its originator.

A digital signature consists of a string of symbols. If a person's digital signature were always the same for each message, then one could easily counterfeit it by simply copying the string of symbols; thus signatures must be different for each use.
Audit and control: An audit is an analysis of an organization's computer and information systems in order to evaluate the efficiency, correctness and integrity of its database systems, as well as to uncover potential security gaps. Auditing is done to verify that DBMS operations are properly implemented and executed. It is usually done by an external auditor so that the audit process is fair and unbiased. An audit trail tracks all the transactions executed concurrently to find any leakage or breach in security and to find any possibility of fraud.
DBMS audit: An information system is not just a DBMS or a computer. The major elements of a DBMS (or IS, information system) audit can be broadly classified as:
1. Physical and environmental review: This includes physical security, power supply, air conditioning, humidity control and other environmental factors.
2. System administration review: This includes a security review of the operating systems, database management systems, all system administration procedures and compliance.
3. Application software review: This includes access control and authorizations, validations, error and exception handling, business process flows within the application software, and complementary manual controls and procedures.
4. Network security review: A review of internal and external connections to the system, perimeter security, firewall review, router access control lists, and port scanning and detection are some typical areas of coverage.
5. Business continuity review: This includes the existence and maintenance of fault-tolerant and redundant hardware, backup procedures and storage, and a documented and tested disaster recovery and business continuity plan.
6. Data integrity review: The purpose of this is scrutiny of live data to verify the adequacy of controls and the impact of weaknesses. This testing can be done using generalized audit software.
Control: Control is a process established by management to provide reasonable assurance that DBMS or IS objectives will be achieved. Control is done to provide assurance to management about:
a. Effectiveness of operations.
b. Economical and efficient use of resources.
c. Compliance with policies, procedures, laws and regulations.
d. Safeguarding of assets and interests from losses of all kinds, including those arising from fraud, irregularity or corruption.
e. Integrity and reliability of information, accounts and data.
Simply put, controls are the defined set of rules and procedures established to ensure that the DBMS or IS performs as per the desired objectives. Control starts with the design of the database and the application programs.
Recent trends in database security: Threats to database security can be grouped into two different categories, physical and logical. Physical threats consist of forced disclosure of

passwords, destruction of storage devices, power failures, and theft. The most common way to prevent this type of threat is to limit access to the storage devices and to put backup and recovery procedures in place. Logical threats can result in denial of service, disclosure of information, and modification of data.
1. Insider threat: One of the largest threats to a database is a corrupt authorized user. Such a user can access confidential information, which can then be leaked electronically or by some other means such as a printout or word of mouth. There is very little that can be done to prevent this from within the database management system. Mandatory access controls can help a little by not allowing a user logged in with classified access to save or copy the data to a location with unclassified access. This type of threat is usually handled by limiting the number of users with that level of access and by other procedural controls.
2. Login attacks: Another way to compromise a database is to successfully log in as a legitimate user. This can be done by physically stealing the login information or by monitoring network traffic for it. Another attack could involve accessing password lists stored in an operating system. Of course, login information can only be as secure as the password used; if it is easy to crack, there is not much that can be done. Restrictions on the type and form of password can help but do not solve the problem. The database must employ authentication and encryption to make this type of attack less likely.
3. Network attacks: There are a multitude of possible attacks on a database if it is accessible over a network, even more so if that network is the internet. A number of precautions can be put in place, such as a firewall to protect the database and possibly the web server. The data sent over the network can be secured by a number of means; a common method on the internet is the secure sockets layer, which prevents an attacker from gathering information simply by watching network traffic. A good method for authentication with the database will also be necessary; certificates can be used in conjunction with databases to ensure authentication. An especially common attack has been the denial-of-service attack. This type of attack relates more to the web server allowing access to the database, but it can also be mounted against the database itself.
4. Trojan horses: Trojan horses are corrupt software applications that leak confidential information. These applications are part of the normal use of a system but have been modified to copy or send sensitive information to unauthorized locations or users. An application containing a Trojan horse must first be installed on the system; this could be done by the attacker or by an administrator who did not realize that there was a Trojan horse in the application. The corrupted application will operate as expected for all practical purposes, but it will be performing some additional illegal function as well.
5. Inference control: A user of a database can use information that they have access to, and possibly some supplementary (external) information, to infer information that they do not have access to. Data at a high security level can be inferred from data at a lower security level. This can be a very difficult threat to prevent. It is usually associated with statistical databases: information about an individual can be inferred from

answers to allowed statistical queries on the database. A naive approach would be to move the lower-level data to a higher level; however, only the minimum amount of lower-level data needed to prevent the inference should be moved to the higher security level.

Data Mining and warehouse


Data warehouse: A data warehouse is a collection of data that is designed to support management decision making. The term generally refers to combining many different databases across an entire enterprise. Development of a data warehouse includes development of systems to extract data from operational systems plus installation of a warehouse database system that gives managers flexible access to the data. However, setting up a data warehouse is very costly. The primary goals of a data warehouse are:
1. Provide access to the data of an organization.
2. Data consistency.
3. Capacity to separate and combine data.
4. Inclusion of a tool set to query, analyze and present information.
5. Publish used data.
6. Drive business re-engineering.
Characteristics of data in a data warehouse:
1. Subject oriented: The data warehouse should be oriented towards the major subject areas of the organization, which have been defined in the data model.
2. Integrated: The data warehouse receives data from a number of sources, each of which has its own application designer and hence its own encoding, naming conventions, physical attributes and measurements of attributes. The filtering and translation necessary to transform the many sources into one consistent database is known as integration.
3. Non-volatile: Data can be loaded into and accessed from the warehouse but cannot be changed, so the warehouse is non-volatile.
4. Time variant: The data warehouse always contains some element of time and series of snapshots; the time horizon is typically 5-10 years.
Construction of the data warehouse: The hardware, software and data resources needed to construct the data warehouse are organizationally dependent. Decisions made by the organization based upon its needs and the resources available will determine the architecture of a particular data warehouse. Beyond this, there are some phases which are common to all data warehouses:
1. Acquisition of data: Every data warehouse has sources from which it acquires data. The data is extracted from the organization's operational data: the desired data is extracted, filtered, translated and integrated into the storage environment.
2. Storage: Vast amounts of organizational data are indexed and partitioned to allow economical and efficient access.

3. Data access: The organizational ability to access data is fundamental to the concept of the data warehouse.
Data warehouse components
1. Summarized data: Data may be summarized in two ways:
a. Lightly summarized: Lightly summarized data is a hallmark of a data warehouse. Not all enterprise elements have the same information requirements, so effective data warehouse design provides customized, lightly summarized data for every enterprise element.
b. Highly summarized: This is for enterprise executives. Highly summarized data can come either from the lightly summarized data or from the current detail. Data volume at this level is very low.
2. Current detail: Current detail contains the bulk of the data. It comes directly from operational systems and may be stored as raw data. Current detail is the lowest level of data granularity in the data warehouse; it is typically two to five years old. Current detail refreshment occurs as frequently as necessary to support enterprise requirements.
3. System of record: This is the source of the data that feeds the data warehouse. Because data in the warehouse cannot be modified, the data fed into it should be of high quality: complete, accurate and well structured.
4. Integration and transformation programs: These programs perform functions such as:
i. Reformatting, recalculating or modifying key structures.
ii. Adding time elements.
iii. Identifying default values.
iv. Supplying logic to choose between multiple data sources.
5. Archives: These contain old data (nearly two years old or more). A massive amount of data is stored in the warehouse archives, with a low incidence of access. The archives also contain metadata that describes the old data's characteristics.
6. Metadata: Metadata is also known as data about data. It is a data definition which provides a meaningful description of the information contents. Metadata is used by data warehouse developers to manage and control data warehouse creation and maintenance, and it resides outside the data warehouse. Metadata is also used for creating reports and graphs in front-end data access tools.
Uses of a data warehouse
1. Producing standard reports and queries periodically.
2. Data mining.
3. Interfacing with other data warehouses.
4. Queries against summarized data.
Advantages of a data warehouse
1. A data warehouse provides a common data model for all data of interest, regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.

2. Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
3. Information in the data warehouse is under the control of data warehouse users, so that even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
4. Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
5. Data warehouses can work in conjunction with, and so enhance the value of, operational business applications such as customer relationship management systems.
6. Data warehouses facilitate decision support system applications such as trend reports, exception reports, and reports that show actual performance versus goals.
Disadvantages of a data warehouse
1. Data warehouses have high costs. They are not static, and their maintenance cost is also high.
2. Data warehouses can become outdated relatively quickly, and there is a cost to delivering sub-optimal information to the organization. Newer data warehouses address this by using a technology called change data capture.
3. There is often a fine line between data warehouses and operational systems, so duplicate, expensive functionality may be developed.
Problems of data warehousing:
1. Hidden problems with source systems: Hidden problems with the source systems feeding the warehouse may be identified, possibly after years of being undetected.
2. Required data is not captured: Warehouse projects often highlight a requirement for data that is not being captured by the existing source systems.
3. Increased end-user demands: After end users receive query and reporting tools, requests for support from IS staff may increase rather than decrease.
4. Data homogenization: Large-scale warehousing can become an exercise in data homogenization that lessens the value of the data. For example, in producing a consolidated and integrated view of the organization's data, the warehouse designer may be tempted to emphasize similarities rather than differences in the data used by different application areas, such as product sales and product inventory.
5. High demand for resources: The warehouse can use huge amounts of disk space.
6. Data ownership: Warehousing may change the attitude of end users to the ownership of data. Sensitive data that was originally viewed and used only by a particular department or business area, such as sales or marketing, may now be made accessible to others in the organization.
7. High maintenance: Warehouses are high-maintenance systems. Any reorganization of the business processes or the source systems may affect the warehouse.
8. Long-duration projects: Building a warehouse can take up to three years, which is why some organizations build data marts instead. Data marts support only the requirements of a particular department or functional area and can therefore be built much more rapidly.
9. Complexity of integration: The most important area for the management of a data warehouse is its integration capability. This can be a very difficult task, as there are a number of tools for every operation of the warehouse, and these must integrate well in order for the warehouse to work to the organization's benefit.

10. The requirement for a data warehouse DBMS: A data warehouse DBMS must address load performance, load processing, data quality management, query performance, terabyte scalability, mass user scalability, networked data warehouses, warehouse administration, integrated dimensional analysis and advanced query functionality.
Data warehouse versus operational systems: Operational systems are optimized for preservation of data integrity and speed of recording of business transactions, through the use of database normalization and an entity-relationship model. Data warehouses are optimized for speed of data retrieval. Frequently, data in a data warehouse is de-normalized via a dimension-based model. Also, to speed up data retrieval, data warehouse data is often stored multiple times, in its most granular form and in summarized forms called aggregates. Data warehouse data is gathered from the operational systems and held in the data warehouse even after it has been purged from the operational systems.
Data mart terminology: A data warehouse is a central aggregation (collection) of data, which can be distributed physically. A data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. There can be multiple data marts inside a single corporation, each one relevant to one or more business units for which it was designed. Data marts may or may not be dependent on or related to other data marts in a single corporation. If the data marts are designed using conformed facts and dimensions, then they will be related.
Reasons for creating a data mart
1. Easy access to frequently needed data.
2. Creates a collective view for a group of users.
3. Improves end-user response time.
4. Ease of creation.
5. Lower cost than implementing a full data warehouse.
6. Potential users are more clearly defined than in a full data warehouse.
Dependent data mart: A dependent data mart is a logical subset (view) or physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:
1. Performance: To offload the data mart to a separate computer for greater efficiency or to obviate the need to manage that workload on the centralized data warehouse.
2. Security: To separate out an authorized data subset selectively.
3. Expediency: To bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse.
4. Proving ground: To demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse.
5. Politics: A coping strategy for IT (Information Technology) in situations where a user group has more influence than funding, or is not a good citizen on the centralized data warehouse.
Design Schemas

Star Schema: It is the simplest style of data warehouse schema. The star schema is a way to implement multi-dimensional database functionality using a mainstream relational database. Given the typical commitment of most organizations to relational databases, a specialized multidimensional DBMS is likely to be both expensive and inconvenient. Another reason for using a star schema is its simplicity from the user's point of view. Queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies to other tables that are possible in a better normalized snowflake schema.
Snowflake schema: A snowflake schema is a logical arrangement of tables in a relational database such that the entity relationship diagram resembles a snowflake in shape. Closely related to the star schema, the snowflake schema is also represented by centralized fact tables which are connected to multiple dimensions. In the snowflake schema, however, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are de-normalized, with each dimension represented by a single table.
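To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names (dim_date, dim_product, fact_sales, etc.) are illustrative assumptions, not taken from the text above.

```python
import sqlite3

# A minimal star schema: one fact table referencing two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales  (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    revenue    REAL
);
""")
con.execute("INSERT INTO dim_date VALUES (1, '01', 'Jan', 2024)")
con.execute("INSERT INTO dim_product VALUES (10, 'Backpack', 'Outdoor')")
con.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 250.0)")

# Queries stay simple: the only joins are fact -> one level of dimensions.
rows = con.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_id = d.date_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY d.year, p.category
""").fetchall()
print(rows)   # e.g. [(2024, 'Outdoor', 250.0)]
```

In a snowflake schema the same fact table would remain, but a dimension such as dim_product would itself be normalized further, for example into separate product and category tables joined by a key.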

Data Mining
Data mining is also known as data or knowledge discovery. It is a process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Data mining and warehousing: The data that is to be mined is usually first extracted from an enterprise's data warehouse. The data mining database may be a logical rather than a physical subset of the data warehouse. A data warehouse is not a requirement for data mining. Setting up a large data warehouse that consolidates data from multiple sources, resolves data integrity problems, and loads the data into the query database is the main task of data warehousing. We can also mine data from one or more operational or transactional databases simply by extracting it into a read-only database.
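A minimal sketch of that extraction step, assuming a hypothetical operational `orders` table in SQLite; the mining copy is then opened read-only so analysis cannot disturb the transactional data.

```python
import sqlite3

# Hypothetical operational database with a single orders table (assumed names).
op = sqlite3.connect("operational.db")
op.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)")
op.execute("INSERT INTO orders VALUES (1, 'A', 120.0)")
op.commit()

# Extract the rows to be mined into a separate database file used only for analysis.
op.execute("ATTACH DATABASE 'mining.db' AS mining")
op.execute("CREATE TABLE IF NOT EXISTS mining.orders AS SELECT * FROM main.orders")
op.commit()
op.close()

# Analysts open the extract in read-only mode, leaving the operational data untouched.
ro = sqlite3.connect("file:mining.db?mode=ro", uri=True)
print(ro.execute("SELECT COUNT(*) FROM orders").fetchone())
```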

Data mining applications


1. It can be used to control costs and to increase revenue.
2. Data mining is very useful for industry; telecommunication and credit card companies are two of the leaders in applying data mining to detect fraudulent use of their services.
3. Insurance companies and stock exchanges are also interested in applying these technologies to reduce fraud.
4. Medical applications are another fruitful area. Data mining can be used to predict the effectiveness of surgical procedures, medical tests, or medications.
5. Companies active in the financial markets use data mining to determine market and industry characteristics as well as to predict individual company and stock performance.
6. Retailers are making more use of data mining to decide which products to stock in particular stores, as well as to assess the effectiveness of promotions and coupons.
7. Pharmaceutical firms are mining large databases of chemical compounds and of genetic material to discover substances that might be candidates for development as agents for the treatment of disease.

Data mining goals


1. Data mining is an innovative way to gain new and valuable business insight by analyzing the information in the company's databases. These insights can enable you to take good business decisions.
2. Data mining uncovers this in-depth business intelligence by using advanced analytical and modeling techniques.
3. The information that data mining provides can lead to an improvement in the quality and dependability of business decision making.
4. For example, data mining enables a bank to create profiles of the customers who already hold a particular type of account.
5. The bank can then use data mining to find other customers who match that profile, so that it can accurately target a marketing plan.
6. Data mining can identify the characteristics of a known group of customers, for example those who have a proven record as credit risks.
7. Data mining tools automate the process of discovering information from large stores of data.

Data mining, machine learning and statistics: Data mining takes advantage of advances in the fields of artificial intelligence and statistics. Both disciplines have been working on problems of pattern recognition and classification. Data mining does not replace traditional statistical techniques. Rather, it is an extension of statistical methods that is in part the result of a major change in the statistics community. The key point is that data mining is the application of these and other AI and statistical techniques to common business problems in a fashion that makes these techniques available to the skilled knowledge worker as well as the trained statistics professional. Data mining is a tool for increasing the productivity of people trying to build predictive models.
How does data mining work? Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Generally, four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer and diapers example is an example of associative mining (a small pair-counting sketch appears after the tools list below).
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining elements
1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
Data mining process: The data mining process is a step towards turning data into knowledge, also called knowledge discovery in databases (KDD). KDD refers to the overall process of discovering useful knowledge from data.
Using data mining tools
1. Data mining tools. 2. Data mining applications.
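As promised above, here is a tiny, self-contained sketch of association mining in the spirit of the beer and diapers example. The transactions are made-up illustrative data, and the pair-counting approach is only a toy stand-in for a real association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Toy association mining: count how often pairs of items occur together
# in the same transaction (the classic "beer and diapers" pattern).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of transactions containing the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
# ('beer', 'diapers') support = 0.75  -- the strongest association here
```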


Entity Relationship Model (Database Design)


The E-R Model is a technique for building a logical model of an enterprise. Logical models are high-level abstract views of an enterprise's data. The E-R model is the most common way to express the analytical result of an early stage in the construction of a new database. E-R diagrams are a way to represent the structure and layout of a database and are used frequently to describe the database schema. ER diagrams are very useful because they provide a good conceptual view of any database, regardless of the underlying hardware and software. An ERD is a model that identifies the concepts or entities that exist in a system and the relationships between those entities. An ERD is often used as a way to visualize a relational database: each entity represents a database table, and the relationship lines represent the keys in one table that point to specific records in related tables. ERDs may also be more abstract, not necessarily capturing every table needed within a database, but serving to diagram the major concepts and relationships. This model was proposed by Peter Chen in 1976 as a way to unify the network and relational database views. For database work, the utility of the ER model is:
1. It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables (see the sketch after this list).

2. It is simple and easy to understand with a minimum of training. Therefore, the model can be used by the database designer to communicate the design to the end user. 3. In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.
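To make item 1 concrete, the sketch below (Python with sqlite3; the DEPARTMENT/EMPLOYEE names and columns are illustrative assumptions) turns two entities and a one-to-many relationship into relational tables, with the relationship carried by a foreign key.

```python
import sqlite3

# ER constructs mapped to relational tables:
#   entity  -> table, attribute -> column, identifier -> primary key,
#   1:N relationship "department has employees" -> foreign key in employee.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,   -- identifier attribute
    dept_name TEXT                   -- descriptor attribute
);
CREATE TABLE employee (
    emp_id   INTEGER PRIMARY KEY,
    emp_name TEXT,
    dept_id  INTEGER REFERENCES department(dept_id)   -- the 1:N relationship
);
""")
con.execute("INSERT INTO department VALUES (1, 'Research')")
con.execute("INSERT INTO employee VALUES (100, 'Ajay', 1)")
print(con.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
""").fetchall())
```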

Basic constructs of E-R Modeling
The E-R model views the real world as a construct of entities and associations between entities.
Entities: Entities are the principal data objects about which information is to be collected. Entities are usually recognizable concepts, either concrete or abstract, such as persons, places, things, or events which have relevance to the database. Some specific examples of entities are EMPLOYEES, PROJECTS, and INVOICES. An entity is analogous to a table in the relational model. Entities are classified as independent or dependent. An independent entity is one that does not rely on another for identification. A dependent entity is one that does rely on another for identification. An entity occurrence is an individual occurrence of an entity. An occurrence is analogous to a row in a relational table.
Special entity types: Associative entities are entities used to associate two or more entities in order to reconcile a many-to-many relationship. Subtype entities are used in generalization hierarchies to represent a subset of instances of their parent entity, called the supertype, but which have attributes or relationships that apply only to the subset.
Relationships: A relationship represents an association between two or more entities. Examples of relationships would be: employees are assigned to projects; projects have subtasks; departments manage one or more projects. Relationships are classified in terms of degree, connectivity, cardinality, and existence. These concepts are discussed below.
Attributes: Attributes describe the entity with which they are associated. A particular instance of an attribute is a value. For example, Ajay is one value of the attribute name. The domain of an attribute is the collection of all possible values an attribute can have. The domain of name is a character string. Attributes can be classified as identifiers or descriptors. Identifiers, more commonly called keys, uniquely identify an instance of an entity. A descriptor describes a non-unique characteristic of an entity instance.
Classifying relationships: Relationships are classified by their degree, connectivity, cardinality, direction, type, and existence. Not all modeling methodologies use all these classifications.
Degree of relationship: The degree of a relationship is the number of entities associated with the relationship. The n-ary relationship is the general form for degree n; special cases are the binary and ternary relationships, where the degree is 2 and 3, respectively. The binary relationship, an association between two entities, is the most common type in the real world. A recursive binary relationship occurs when an entity is related to itself; an example might be: some employees are married to other employees.

A ternary relationship involves three entities and is used when a binary relationship is inadequate. Many modeling approaches recognize only binary relationships; ternary or unary relationships are decomposed into two or more binary relationships.
Connectivity and cardinality: The connectivity of a relationship describes the mapping of associated entity instances in the relationship. The values of connectivity are one or many. The cardinality of a relationship is the actual number of related occurrences for each of the two entities. The basic types of connectivity for relationships are one-to-one, one-to-many, and many-to-many.
A one-to-one (1:1) relationship is when at most one instance of entity A is associated with one instance of entity B. For example, employees in the company are each assigned their own office: for each employee there exists a unique office, and for each office there exists a unique employee.
A one-to-many (1:N) relationship is when, for one instance of entity A, there are zero, one, or many instances of entity B, but for one instance of entity B, there is only one instance of entity A. An example of a 1:N relationship is: a department has many employees; each employee is assigned to one department.
A many-to-many (M:N) relationship, sometimes called non-specific, is when, for one instance of entity A, there are zero, one, or many instances of entity B, and for one instance of entity B, there are zero, one, or many instances of entity A. An example is: an employee can be associated with no more than two projects at the same time; projects must have at least three employees assigned. A single employee can be assigned to many projects; conversely, a single project can have many employees assigned to it (a sketch after the ER notation list below shows how such a relationship is resolved with an associative table).
Direction: The direction of a relationship indicates the originating entity of a binary relationship. The entity from which a relationship originates is the parent entity; the entity where the relationship terminates is the child entity. The direction of a relationship is determined by its connectivity. In a one-to-one relationship the direction is from the independent entity to a dependent entity. If both entities are independent, the direction is arbitrary. With one-to-many relationships, the entity occurring once is the parent. The direction of many-to-many relationships is arbitrary.
Type: An identifying relationship is one in which one of the child entities is also a dependent entity. A non-identifying relationship is one in which both entities are independent.
Existence: Existence denotes whether the existence of an entity instance is dependent upon the existence of another, related entity instance. The existence of an entity in a relationship is defined as either mandatory or optional. If an instance of an entity must always occur for the entity to be included in a relationship, then it is mandatory; for example, every project must be managed by a single department. If the instance of the entity is not required, it is optional; for example, employees may be assigned to work on projects.
Generalization hierarchies

A generalization hierarchy is a form of abstraction that specifies that two or more entities that share common attributes can be generalized into a higher-level entity type called a supertype or generic type. The lower-level entity types, called subtypes or categories, are related to the supertype. Subtypes are dependent entities. Generalization occurs when two or more entities represent categories of the same real-world object. For example, wages-employee and classified-employee represent categories of the same entity, Employee. In this example Employee would be the supertype; wages-employee and classified-employee would be the subtypes.
ER Notation: The symbols used for basic ER constructs are:
1. Entities are represented by labeled rectangles. The label is the name of the entity; entity names should be singular nouns.
2. Relationships are represented by a solid line connecting two entities. The name of the relationship is written above the line. Relationship names should be verbs.
3. Attributes, when included, are listed inside the entity rectangle. Attributes which are identifiers are underlined. Attribute names should be singular nouns.
4. Cardinality of many is represented by a line ending in a crow's foot. If the crow's foot is omitted, the cardinality is one.
5. Existence is represented by placing a circle or a perpendicular bar on the line. Mandatory existence is shown by the bar next to the entity for which an instance is required. Optional existence is shown by placing a circle next to the entity that is optional.
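Returning to the many-to-many example above, here is a minimal sketch of how an associative entity resolves an M:N relationship into two 1:N relationships; the table names (employee, project, assignment) are assumptions made for illustration.

```python
import sqlite3

# An M:N relationship (employees <-> projects) resolved through an
# associative entity: assignment holds one row per employee/project pair.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (emp_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE project  (proj_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE assignment (
    emp_id  INTEGER REFERENCES employee(emp_id),
    proj_id INTEGER REFERENCES project(proj_id),
    PRIMARY KEY (emp_id, proj_id)          -- composite identifier
);
""")
con.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ajay"), (2, "Rita")])
con.executemany("INSERT INTO project  VALUES (?, ?)", [(10, "Payroll"), (11, "Website")])
con.executemany("INSERT INTO assignment VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10)])   # many employees per project and vice versa

print(con.execute("""
    SELECT p.title, COUNT(*) FROM assignment a
    JOIN project p ON a.proj_id = p.proj_id GROUP BY p.title
""").fetchall())
```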

File Organization and Indexes


Computer Storage Media: The collection of data that makes up a computerized database must be stored physically on some computer storage medium. The DBMS software can then retrieve, update, and process this data as needed. Computer storage media form a storage hierarchy that includes two main categories:
1. Primary storage: This category includes storage media that can be operated on directly by the computer's central processing unit (CPU), such as the computer's main memory and smaller but faster cache memories. Primary storage usually provides fast access to data but is of limited storage capacity. 2. Secondary storage: This category includes magnetic disks, optical disks, and tapes. These devices usually have a larger capacity, cost less, and provide slower access to data than do primary storage devices. Data in secondary storage cannot be processed directly by the CPU; it must first be copied into primary storage.

Memory hierarchy and storage devices In a modern computer system data resides and is transported throughout a hierarchy of storage media. The highest speed memory is the most expensive and is therefore available
with the least capacity. The lowest speed memory is offline tape storage, which is essentially available in indefinite storage capacity.
At the primary storage level, the memory hierarchy includes, at the most expensive end, cache memory, which is static RAM. Cache memory is used by the CPU to speed up execution of programs. The next level of primary storage is DRAM (Dynamic Random Access Memory), which provides the main work area for the CPU for keeping programs and data and is popularly called main memory. The advantage of DRAM is its low cost, which continues to decrease; the drawback is its volatility and lower speed compared with static RAM.
At the secondary storage level, the hierarchy includes magnetic disks, as well as mass storage in the form of CD-ROM (Compact Disk Read Only Memory) devices, and finally tapes at the least expensive end of the hierarchy. Storage capacity is measured in kilobytes, megabytes, gigabytes, terabytes, and so on.
Programs reside and execute in DRAM. Generally, large permanent databases reside on secondary storage, and portions of the database are read into and written from buffers in main memory as needed. Personal computers and workstations now have hundreds of megabytes of DRAM. Between DRAM and magnetic disk storage, another form of memory, flash memory, is becoming common, particularly because it is nonvolatile. Flash memories (pen drives) are high density, high performance memories using EEPROM (Electrically Erasable Programmable Read Only Memory) technology. The advantage of flash memory is its fast access speed; the disadvantage is that an entire block must be erased and written over at a time. Flash memory cards are appearing as the data storage medium in appliances, with capacities ranging from a few megabytes to a few gigabytes. They appear in cameras, MP3 players, USB storage accessories, etc.
CD-ROM disks store data optically and are read by a laser. CD-ROMs contain prerecorded data that cannot be overwritten. WORM (Write Once Read Many) disks are a form of optical storage used for archiving data. They allow data to be written once and read any number of times without the possibility of erasing. They hold about half a gigabyte of data per disk. The DVD (Digital Video Disk) is a more recent standard for optical disks allowing 4-5 GB of storage per disk.
Storage of databases: Databases typically store large amounts of data that must persist over long periods of time. The data is accessed and processed repeatedly during this period. This contrasts with the notion of transient data structures that persist for only a limited time during program execution. Most databases are stored permanently on magnetic disk secondary storage, for the following reasons:
1. Generally, databases are too large to fit entirely in main memory.
2. The circumstances that cause permanent loss of stored data arise less frequently for disk secondary storage than for primary storage. Hence we refer to disk and other secondary storage devices as nonvolatile storage, whereas main memory is often called volatile storage.
3. The cost of storage per unit of data is an order of magnitude less for disk than for primary storage.

Hardware description of disk devices: Magnetic disks are used for storing large amounts of data. These are random access secondary storage devices. The most basic unit of data on the disk is a single bit of information. By magnetizing an area on the disk in certain ways, one can make it represent a bit value of either 0 or 1.

To code information, bits are grouped into bytes or characters. Byte sizes are typically 4 to 8 bits, depending on the computer and the device. We assume that one character is stored in a single byte, and we use the terms byte and character interchangeably. The capacity of a disk is the number of bytes it can store, which is usually very large. A floppy disk holds from 400 KB to 1.5 MB, hard disks hold hundreds of MB to GB, and large disks used with servers can store tens to hundreds of GB. This capacity is growing as technology improves.
All disks are made from magnetic material shaped as a thin circular disk and protected by a plastic or acrylic cover. A disk is single sided if it stores information on only one of its surfaces and double sided if both surfaces are used. To increase storage capacity, disks are assembled into a disk pack, which may include many disks and hence many surfaces. Information is stored on a disk surface in concentric circles of small width, each having a distinct diameter. Each circle is called a track. For disk packs, the tracks with the same diameter on the various surfaces are called a cylinder because of the shape they would form if connected in space. Data stored on one cylinder can be retrieved much faster than if it were distributed among different cylinders. The number of tracks on a disk ranges from a few hundred to a few thousand, and the capacity of each track typically ranges from tens of KB to 150 KB. A track is divided into small blocks known as sectors. One type of sector organization calls a portion of a track that subtends a fixed angle at the center a sector. The division of a track into equal sized disk blocks or pages is set by the operating system during disk formatting or initialization. Block size is fixed during initialization and cannot be changed dynamically. Typical disk block sizes range from 512 to 4096 bytes. A disk with hard coded sectors often has the sectors subdivided into blocks during initialization. Blocks are separated by fixed size inter-block gaps, which include specially coded control information written during disk initialization. This information is used to determine which block on the track follows each inter-block gap (a capacity sketch follows the unit table below).
Magnetic tape storage devices: Magnetic tapes are sequential access devices; to access the nth block on tape, we must first scan over the preceding n-1 blocks. Data is stored on reels of high capacity magnetic tape, somewhat similar to audio and video tapes. A tape drive is required to read the data from or write the data to a tape reel. Usually each group of bits that forms a byte is stored across the tape, and the bytes themselves are stored consecutively on the tape. A read/write head is used to read or write data on tape. Data records on tape are also stored in blocks. The main characteristic of a tape is its requirement that we access the data blocks in sequential order. To get a block from the middle of a reel of tape, the tape is mounted and then scanned until the required block gets under the read/write head. For this reason tape access can be slow, and tapes are not used to store online data, except for some specialized applications. However, tapes serve a very important function, that of backing up the database. One reason for backup is to keep copies of disk files in case the data is lost because of a disk crash, which can happen if the disk read/write head touches the disk surface because of a mechanical malfunction. Tapes can also be used to store excessively large database files.
They are also used for storing images and system libraries. Backing up enterprise databases so that no transaction information is lost is a major undertaking.
Storage units:
4 bits = 1 nibble
8 bits = 1 byte
1024 bytes = 1 KB (kilobyte)
1024 KB = 1 MB (megabyte)
1024 MB = 1 GB (gigabyte)
1024 GB = 1 TB (terabyte)
1024 TB = 1 PB (petabyte)
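As promised above, a quick back-of-the-envelope sketch of how the track, sector, and block figures compose into total capacity; the particular geometry numbers are assumptions chosen only for illustration.

```python
# Illustrative disk geometry (assumed values, not from the text).
surfaces           = 16        # double-sided platters in a disk pack
tracks_per_surface = 2_000
sectors_per_track  = 50
block_size_bytes   = 4_096     # within the 512-4096 byte range mentioned above

blocks = surfaces * tracks_per_surface * sectors_per_track
capacity_bytes = blocks * block_size_bytes
print(f"{blocks:,} blocks, about {capacity_bytes / 1024**3:.1f} GB")
# 1,600,000 blocks, about 6.1 GB
```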

Buffering of blocks: When several blocks need to be transferred from disk to main memory and all the block addresses are known, several buffers can be reserved in main memory to speed up the transfer. While one buffer is being read or written, the CPU can process data in the other buffer. This is possible because an independent disk I/O processor (controller) exists that, once started, can proceed to transfer a data block between memory and disk independently of and in parallel with CPU processing.
Records and record types: Data is usually stored in the form of records. Each record is a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record. Records usually describe entities and their attributes. For example, an EMPLOYEE record represents an employee entity, and each field value in the record specifies some attribute of that employee, such as name, age, and DOB. A collection of field names and their corresponding data types constitutes a record type or record format definition. A data type, associated with each field, specifies the types of values a field can take. The data type of a field is usually one of the standard data types used in programming. These include numeric (integer, long integer, or floating point), string of characters (fixed length or varying), Boolean (having 0 and 1 or true and false values only), and sometimes specially coded date and time data types. The number of bytes required for each data type is fixed for a given computer system. An integer may require 4 bytes, a long integer 8 bytes, a real number 4 bytes, a Boolean 1 byte, a date 10 bytes, and a fixed-length string of k characters k bytes. Variable-length strings may require as many bytes as there are characters in each field value.
Files, fixed-length records, and variable-length records: A file is a sequence of records. In many cases, all records in a file are of the same record type. If every record in the file has the same size, the file is said to be made up of fixed-length records. If different records in the file have different sizes, the file is said to be made up of variable-length records. For example, the NAME field of EMPLOYEE can be a variable-length field. A file may have variable-length records for several reasons:
1. The file records are of the same record type, but one or more of the fields may have multiple values for individual records. Such a field is called a repeating field, and a group of values for the field is often called a repeating group.
2. The file records are of the same record type, but one or more of the fields are optional; that is, they have values for some but not all of the file records.
3. The file contains records of different record types and hence of varying size. This would occur if related records of different types were clustered (placed together) on disk blocks. For example, the GRAD_REPORT records of a particular student may be placed following that student's record.
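For contrast with the variable-length cases above, here is a small sketch of a fixed-length record using Python's struct module; the field layout (4-byte integer id, 20-byte name, 4-byte float salary) is an assumption chosen for the example, not a layout from the text.

```python
import struct

# A fixed-length EMPLOYEE record: every record occupies exactly the same
# number of bytes, so record i starts at byte i * RECORD_SIZE in the file.
RECORD_FORMAT = "<i20sf"                           # id (int), name (20 chars), salary (float)
RECORD_SIZE   = struct.calcsize(RECORD_FORMAT)     # 28 bytes

def pack_employee(emp_id: int, name: str, salary: float) -> bytes:
    # Short names are padded with NUL bytes so every record stays the same length.
    return struct.pack(RECORD_FORMAT, emp_id, name.encode().ljust(20, b"\0"), salary)

def unpack_employee(raw: bytes):
    emp_id, name, salary = struct.unpack(RECORD_FORMAT, raw)
    return emp_id, name.rstrip(b"\0").decode(), salary

record = pack_employee(101, "Ajay", 35000.0)
print(len(record), unpack_employee(record))   # 28 (101, 'Ajay', 35000.0)
```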

Record blocking and spanned versus unspanned records: The records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory. When the block size is larger than the record size, each block will contain numerous records. But some files have unusually large records which cannot fit in a single block. Even when the block size is larger than a record, some space in each block may remain empty or unused after records are stored. To utilize this unused space, we can store part of a record on one block and the rest on another. A pointer at the end of the first block points to the block containing the remainder of the record, in case it is not the next consecutive block on disk. This organization is called spanned, because records can span more than one block. Whenever a record is larger than a block, we must use a spanned organization. If records are not allowed to cross block boundaries, the organization is called unspanned. This is used with fixed-length records.
Allocating file blocks on disk: There are several standard techniques for allocating the blocks of a file on disk. In contiguous allocation the file blocks are allocated to consecutive disk blocks. This makes reading the whole file very fast using double buffering, but it makes expanding the file difficult. In linked allocation each file block contains a pointer to the next file block. This makes it easy to expand the file but makes it slow to read the whole file. A combination of the two allocates clusters of consecutive disk blocks, and the clusters are linked. Clusters are sometimes called segments or extents. Another possibility is indexed allocation, where one or more index blocks contain pointers to the actual file blocks. It is also common to use combinations of these techniques.
File headers: A file header or file descriptor contains information about a file that is needed by the system programs that access the file records. The header includes information to determine the disk addresses of the file blocks, as well as record format descriptions, which may include field lengths and the order of fields within a record for fixed-length unspanned records, and field type codes, separator characters, and record type codes for variable-length records.
Operations on files: Basically, we can perform retrieval and update operations on a file.
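A minimal sketch of the usual blocking-factor arithmetic, using assumed block and record sizes: with an unspanned organization the blocking factor is floor(B/R) and the leftover bytes in each block are wasted, while a spanned organization lets records cross block boundaries.

```python
import math

B = 4096      # block size in bytes (assumed)
R = 300       # fixed record size in bytes (assumed)
r = 10_000    # number of records in the file (assumed)

bfr = B // R                              # blocking factor: records per block (unspanned)
blocks_unspanned = math.ceil(r / bfr)     # blocks needed, unspanned
wasted_per_block = B - bfr * R            # unused bytes left in each block

blocks_spanned = math.ceil(r * R / B)     # records may cross block boundaries

print(bfr, blocks_unspanned, wasted_per_block, blocks_spanned)
# 13 770 196 733
```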
