
THE DESIGN AND IMPLEMENTATION OF CONTENT BASED FILE SYSTEM

PROJECT REPORT

Submitted by

SANKARAN S.
SHANKAR S.
KARTHICK KUMAR C.S.

In partial fulfillment for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING

ST. PETER'S ENGINEERING COLLEGE, CHENNAI
ANNA UNIVERSITY, CHENNAI - 600025

APRIL 2007

ABSTRACT

File systems abstract raw disk data into files and directories, thereby providing an easy-to-use interface for user applications to store and retrieve persistent information. Most common file systems that exist today treat file data as opaque entities, and they do not utilize information about their contents to perform useful optimizations. For example, today's file systems do not detect files with the same data. In this project, we design and implement a new file system that understands the contents of the data it stores, to enable interesting functionality. Specifically, we focus on detecting and eliminating duplicate data items across files. Eliminating duplicate data has two key advantages: first, the disk stores just a single copy of a data item even if multiple files share it, thereby saving storage space. Second, the disk I/O required for reading and writing copies of the same data is eliminated, thereby improving performance. To implement this, we work on the existing Linux Ext2 file system and reuse part of its source code. Through our implementation we demonstrate the utility of eliminating duplicate file data both in terms of space savings and performance improvements.

TABLE OF CONTENTS

1. INTRODUCTION
2. MOTIVATION
3. BACKGROUND
   3.1 File System Background
   3.2 Overview of Linux VFS
   3.3 Layout of the Ext2 File System
   3.4 Content Based File System
4. DESIGN
   4.1 Overall Structure
   4.2 Detecting Duplicate Data
   4.3 Tracking Content Information
   4.4 Online Duplicate Elimination
   4.5 Discussion
       i. Concept in LBFS
       ii. File-level content checking
5. IMPLEMENTATION
   5.1 Data Structures
   5.2 Hash Table Operations
       5.2.1 Initializing the Hash Table
       5.2.2 Compute Hash
       5.2.3 Add Entry
       5.2.4 Check Duplicate
       5.2.5 Remove Entry
       5.2.6 Free Hash Table
   5.3 Read Data Flow
   5.4 Write Data Flow
       5.4.1 Overall Write Flow
       5.4.2 Unique Data (New Block)
       5.4.3 Unique Data (Overwrite Existing Block)
       5.4.4 Duplicate Data (New Block)
       5.4.5 Duplicate Data (Overwrite)
       5.4.6 Delete Data
   5.5 Duplicate Eliminated Cache
   5.6 Code Snippets
6. EVALUATION
   6.1 Correctness Check
   6.2 Performance Check
   6.3 Overall Analysis
7. RELATED WORK
8. FUTURE WORK
9. REFERENCES

INTRODUCTION:

A file system is a method for storing and organizing files and the data they contain to make it easy to find and access them. The present file systems in Linux are content-oblivious: they do not know about the nature of the data present in the disk blocks. Therefore, even if two or more blocks of data are identical, they are stored as redundant copies on the disk. This is the case dealt with in this project; the outcome is duplicate elimination in disks at the block level. Content hashing is the method used to determine whether two blocks are redundant. The MD5 algorithm is used to compute the hash value of each of the blocks in the disk. Before each write to disk, the hash value is calculated for the new write and compared with the already existing hash values. If the same hash value is already present in the hash table, the block is a duplicate block. This project is implemented in the Linux 2.6 kernel, wherein the existing ext2 file system functions are modified accordingly to accomplish the goal of duplicate elimination. The evaluation of this project is done in two phases, namely the correctness check and the performance check. A testing program which performs exhaustive file system operations is executed; the file system is found to be stable and is seen to produce the expected results. The performance check is done by evaluating the Postmark results. This report details the design and implementation of the content-based file system. Since the main change is made only in the write phase, each write scenario is explained in detail, with the status of the hash table illustrated by diagrams both before and after the corresponding code segment executes.

MOTIVATION:

The existing ext2 file system in Linux is devoid of any information regarding the data in the disk. It cannot distinguish whether two blocks of data are unique or identical. This means that two identical files will be stored in two separate locations on the disk and thus require double the space of a single file. This not only results in wastage of disk space but also in increased I/O operations when reading both files separately, holding two identical pages of data in the cache, and so on. Let us consider some example cases wherein this disadvantage proves to be a bigger problem. Consider the case wherein virtual machines are used for testing purposes. When Linux is installed in the virtual machine and the host OS is also Linux, all the packages in Linux will first be stored on the disk, and when Linux is installed again in the virtual machine, the same set of packages will be stored separately on the disk. The average size of the whole set of packages in a Linux distribution will be around 5 Gigabytes of storage. Thus, disk space of about 10 Gigabytes will be needed to have the Linux virtual machine with the host OS as Linux. When the virtual machine runs, there will again be more I/O operations, resulting in degraded performance. But if the duplicate blocks are eliminated, considerable disk space of about 5 Gigabytes can be saved and there will be a considerable reduction in the number of I/O operations, thereby increasing performance to a good extent. Therefore, if by some means the file system knows the content of the blocks in the disk before it writes a new block, this disadvantage can very well be eliminated. Hashing is a common technique to generate a fixed-size unique hash for any arbitrary-sized input. Thus, when the content of each data block in the disk is hashed, the blocks can easily be compared with one another and the file system can control the read and write operations accordingly. Hashing in this project is done using the Message Digest Algorithm (MD5).

BACKGROUND:

1) File System Background:

A file system is a method for storing and organizing files and the data they contain to make it easy to find and access them. More formally, it is a set of abstract data types that are implemented for the storage, hierarchical organization, manipulation, navigation, access, and retrieval of data. A disk is just a group of sectors and tracks, so it can only perform operations at the granularity of sectors, e.g., read a sector, write to a sector. But we need a hierarchical structure for maintaining files. With just a disk this cannot be done, because the hard disk is only a linear collection of bits (0s and 1s) arranged into tracks and sectors. For this purpose file systems are used. With a file system, we have an interface between the user programs and the hard disk. We tell the file system to write a file to the disk, and it is the file system that knows the disk structure and copies the blocks of data to the disk. A file system treats the data from the disk as just fixed-sized blocks that contain information, but it has no semantic information pertaining to the data. It treats all data, whether it belongs to a file, a dentry, or an inode, in a single notion as a block of data.

2) Overview of Linux VFS:

Linux comprises the Virtual Filesystem Switch (VFS) layer, which lies between the applications and the various file systems. Every request to the disk, before passing to the file system, goes through the VFS. It acts as a generalization layer over all the underlying file systems. Its function is to locate the file system for a particular file from its file object and then map the requested operations to the file-system-specific functions. Some of the common functions like read and write work in the same pattern for most of the file systems; therefore, generic functions are available for these types of operations and the VFS layer maps the request to these generic functions. There are four basic objects in the VFS, namely:

Super Block Object
Inode Object
File Object
Dentry Object

Consider a process P1 that makes a read request for a file F1 stored in a disk partition formatted with the Ext2 file system. Similarly, process P2 requests file F2 in an Ext3 file system. Both these requests get transferred to the corresponding system call (sys_read() in this case). The system call handling routine transfers it to the VFS layer. The VFS layer in turn transfers the read request to the corresponding file system's read function. The VFS knows the file system associated with any file from the file object that is passed to it. Some of the file systems in turn map the basic operation requests to the generic functions, which ultimately carry out the request and return the result to the layer above.

3) Page Cache:

The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.

[Figure: user processes issue read/write requests on file objects, which pass through the VFS layer (file, dentry, inode, and superblock objects) to the underlying file systems (Ext2, Ext3, NFS) and finally to the disk controller and the disk.]
Fig 3.1: Overview of Linux file systems

Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written to disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds, thus giving the processes a chance to further modify the data to be written (in other words, the kernel implements deferred write operations).

Kernel code and kernel data structures don't need to be read from or written to disk. Kernel designers have implemented the page cache to fulfill two main requirements:

- Quickly locate a specific page containing data relative to a given owner. To take maximum advantage of the page cache, searching it should be a very fast operation.

- Keep track of how every page in the cache should be handled when reading or writing its content. For instance, reading a page from a regular file, a block device file, or a swap area must be performed in different ways, thus the kernel must select the proper operation depending on the page's owner.

The unit of information kept in the page cache is, of course, a whole page of data. A page does not necessarily contain physically adjacent disk blocks, so it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by an owner and by an index within the owner's data (usually, an inode and an offset inside the corresponding file).

4) Buffer Pages:

In old versions of the Linux kernel, there were two different main disk caches: the page cache, which stored whole pages of disk data resulting from accesses to the contents of the disk files, and the buffer cache, which was used to keep in memory the contents of the blocks accessed by the VFS to manage the disk-based file systems. Starting from stable version 2.4.10, the buffer cache does not really exist anymore. In fact, for reasons of efficiency, block buffers are no longer allocated individually; instead, they are stored in dedicated pages called "buffer pages," which are kept in the page cache. Formally, a buffer page is a page of data associated with additional descriptors called "buffer heads," whose main purpose is to quickly locate the disk address of each individual block in the page. In fact, the chunks of data stored in a page belonging to the page cache are not necessarily adjacent on disk. Whenever the kernel must individually address a block, it refers to the buffer page that holds the block buffer and checks the corresponding buffer head. Here are two common cases in which the kernel creates buffer pages:

- When reading or writing pages of a file that are not stored in contiguous disk blocks. This happens either because the file system has allocated noncontiguous blocks to the file, or because the file contains "holes".

- When accessing a single disk block (for instance, when reading a superblock or an inode block).

In the first case, the buffer page's descriptor is inserted in the radix tree of a regular file. The buffer heads are preserved because they store precious information: the block device and the logical block number that specify the position of the data on the disk. In the second case, the buffer page's descriptor is inserted in the radix tree rooted at the address_space object of the inode in the bdev special file system associated with the block device. This kind of buffer page must satisfy a strong constraint: all the block buffers must refer to adjacent blocks of the underlying block device. An instance of where this is useful is when the VFS wants to read the 1,024-byte inode block containing the inode of a given file. Instead of allocating a single buffer, the kernel must allocate a whole page storing four buffers; these buffers will contain the data of a group of four adjacent blocks on the block device, including the requested inode block. All the block buffers within a single buffer page must have the same size; hence, on the 80x86 architecture, a buffer page can include from one to eight buffers, depending on the block size. When a page acts as a buffer page, all buffer heads associated with its block buffers are collected in a singly linked circular list. The private field of the descriptor of the buffer page points to the buffer head of the first block in the page; every buffer head stores in the b_this_page field a pointer to the next buffer head in the list. Moreover, every buffer head stores the address of the buffer page's descriptor in the b_page field. Figure 3.2 shows a buffer page containing four block buffers and the corresponding buffer heads. Because the private field contains valid data, the PG_private flag of the page is also set; hence, if the page contains disk data and the PG_private flag is set, then the page is a buffer page. Notice, however, that other kernel components not related to the block I/O subsystem use the private and PG_private fields for other purposes.

5) Writing Dirty Pages to Disk:

The kernel keeps filling the page cache with pages containing data of block devices. Whenever a process modifies some data, the corresponding page is marked as dirty; that is, its PG_dirty flag is set.

Figure 3.2: A buffer page including four buffers and their buffer heads

Unix systems allow the deferred writing of dirty pages into block devices, because this noticeably improves system performance. Several write operations on a page in the cache could be satisfied by just one slow physical update of the corresponding disk sectors. Moreover, write operations are less critical than read operations, because a process is usually not suspended due to delayed writes, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device will service, on average, many more read requests than write ones. A dirty page might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:

- If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates that were made since the system was booted are lost.

- The size of the page cache, and hence of the RAM required to contain it, would have to be huge, at least as big as the size of the accessed block devices.

Therefore, dirty pages are flushed (written) to disk under the following conditions:

- The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.

- Too much time has elapsed since a page has stayed dirty.

- A process requests all pending changes of a block device or of a particular file to be flushed; it does this by invoking a sync(), fsync(), or fdatasync() system call.

Buffer pages introduce a further complication. The buffer heads associated with each buffer page allow the kernel to keep track of the status of each individual block buffer. The PG_dirty flag of the buffer page should be set if at least one of the associated buffer heads has the BH_Dirty flag set. When the kernel selects a dirty buffer page for flushing, it scans the associated buffer heads and effectively writes to disk only the contents of the dirty blocks. As soon as the kernel flushes all dirty blocks in a buffer page to disk, it clears the PG_dirty flag of the page.

6) Layout of the Ext2 File System:

The first block in each Ext2 partition is never managed by the Ext2 file system, because it is reserved for the partition boot sector. The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 3.3. As you will notice from the figure, some data structures must fit in exactly one block, while others may require more than one block. All the block groups in the file system have the same size and are stored sequentially, thus the kernel can derive the location of a block group on a disk simply from its integer index.

Figure 3.3: Layouts of an Ext2 partition and of an Ext2 block group

Block groups reduce file fragmentation, because the kernel tries to keep the data blocks belonging to a file in the same block group, if possible. Each block in a block group contains one of the following pieces of information:

- A copy of the file system's superblock
- A copy of the group of block group descriptors
- A data block bitmap
- An inode bitmap
- A table of inodes
- A chunk of data that belongs to a file; i.e., data blocks

If a block does not contain any meaningful information, it is said to be free. As seen from Figure 3.3, both the superblock and the group descriptors are duplicated in each block group. Only the superblock and the group descriptors included in block group 0 are used by the kernel, while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel doesn't even look at them. When the e2fsck program executes a consistency check on the file system status, it refers to the superblock and the group descriptors stored in block group 0, and then copies them into all other block groups. If data corruption occurs and the main superblock or the main group descriptors in block group 0 become invalid, the system administrator can instruct e2fsck to refer to the old copies of the superblock and the group descriptors stored in a block group other than the first. Usually, the redundant copies store enough information to allow e2fsck to bring the Ext2 partition back to a consistent state. Figure 3.4 shows the actual mapping of the inode to the corresponding data blocks in a single group.

|--Inode table---| |---Indirect blocks pointing to data blks---| |---Data Blks----|

Fig 3.4: Inode pointers in the Ext2 file system

As shown in the figure above, each entry in the inode table points to a specific data block, and the contents of the data blocks are never taken into account. Therefore, there can exist multiple copies of the same information in many data blocks on the disk, and space is wasted because of this.

7) Data Block Addressing in Ext2:

Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (their logical block number). Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:

1. Derive from the offset f the file block number, the index of the block that contains the character at offset f.
2. Translate the file block number to the corresponding logical block number.

Because Unix files do not include any control characters, it is quite easy to derive the file block number containing the f-th character of a file: simply take the quotient of f and the file system's block size and round down to the nearest integer. For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.

This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, because the data blocks of an Ext2 file are not necessarily adjacent on disk. The Ext2 file system must therefore provide a method to store the connection between each file block number and the corresponding logical block number on disk. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized blocks that contain extra pointers, which are an inode extension used to handle large files. The i_block field in the disk inode is an array of EXT2_N_BLOCKS components that contain logical block numbers. In the following discussion, we assume that EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a larger data structure, which is illustrated in Figure 3.5. As can be seen in the figure, the 15 components of the array are of 4 different types:

- The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the file, that is, to the blocks that have file block numbers from 0 to 11.

- The component at index 12 contains the logical block number of a block, called an indirect block, that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to b/4 + 11, where b is the file system's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, and then look in that block for another pointer to the ultimate block that contains the file contents.

- The component at index 13 contains the logical block number of an indirect block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from b/4 + 12 to (b/4)² + (b/4) + 11.

- Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from (b/4)² + (b/4) + 12 to (b/4)³ + (b/4)² + (b/4) + 11.

Figure 3.5: Data structures used to address the file's data blocks

In Figure 3.5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way through indirect blocks to reach the block that contains the actual contents of the file. Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, every data block can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed to access the required block. In practice, this is a worst-case estimate, because dentry, inode, and page caches contribute significantly to reducing the number of real disk accesses. Notice also how the block size of the file system affects the addressing mechanism, because a larger block size allows Ext2 to store more logical block numbers inside a single block. Table 1 shows the upper limit placed on a file's size for each block size and each addressing mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of data, the first 12 KB of the file can be accessed through direct mapping and the remaining 13-268 KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-bit architectures by specifying the O_LARGEFILE opening flag.

Table 1. File-size upper limits for data block addressing

Block size   Direct    1-Indirect    2-Indirect     3-Indirect
1,024        12 KB     268 KB        64.26 MB       16.06 GB
2,048        24 KB     1.02 MB       513.02 MB      256.5 GB
4,096        48 KB     4.04 MB       4 GB           ~4 TB

THE CONTENT-BASED FILE SYSTEM:

The content-based file system employs a technique called content hashing to compare the content of blocks and find whether they are duplicates. The Message Digest Algorithm (MD5) is used to compute the hash value of each block. For every data block in the disk, the corresponding hash value is calculated. At any given instant, the hash table will have one entry for every valid data block in the disk. There are two hash table structures: the checksum hash table, which is indexed by the checksum field, and the block hash table, which is indexed by the block number field. Both these structures point to a single copy of the hash node. In the context of the file system, there is no change in any of the inode data structures. The only difference is that the inode table may have multiple entries pointing to the same block number on the disk. For example, if a 50 MB file is created in a partition having the content-based file system, and the file contains just 40 MB of unique information while the remaining 10 MB of data are duplicates, then the inode table for that file will have the same number of pointers as a fully-unique 50 MB file. But the inode table will have duplicate pointers for the remaining 10 MB, and therefore the actual space occupied by the file is just 40 MB. In this way, disk space can be saved to a good extent. When the file is read, the inode table is accessed in the normal way and the corresponding data blocks are read. When a data block has already been read from the disk and is present in the buffer cache, then, if the same block is needed again, there is no need to invoke a disk read again; the copy of the data in the buffer cache can be used instead. In this way, through a better cache hit rate, performance can be enhanced during read operations. Therefore, for any operation like file create, file overwrite, file append, or truncation, before the disk write operation is invoked, the checksum values are compared and only if there are no duplicate blocks is the new block written; otherwise, no new data blocks are written and the disk usage remains the same. As a result of this, at any given instant, the disk will hold only a single copy of any data and there is no place for redundancy. In addition to efficient utilization of disk space, content hashing also helps in maintaining data integrity. In certain cases, this technique can save a considerable amount of disk space and also increase the performance of operations.

4. DESIGN

4.1 OVERALL STRUCTURE:

[Figure: overall structure. The hash table holds one entry (checksum, reference count, block number) for every unique disk block, the inode table may contain several pointers that refer to the same block, and the disk itself stores a single copy of each block.]

Illustrated above is the overall structure of the mapping between the inodes and the disk blocks. As shown above, the hash table bears one entry for every unique disk data block, while the inode table may have redundant entries pointing to the same disk block.

4.2 DETECTING DUPLICATE DATA:

This section describes the technique adopted to detect the duplicate data blocks in the disk.

Alternatives: Detecting duplicate data blocks involves comparing the contents of the blocks. The most straightforward method to do this is to compare the block contents bit-by-bit and then detect duplication. But this method is very inefficient and would add much overhead to the system. Another alternative is to use error-detecting codes. These codes add little overhead and are thus not too inefficient, but they are not collision resistant. This means there may be duplicate blocks that are easily missed by the error-detecting codes. Therefore, this technique cannot be relied upon to detect duplicate blocks. Considering these cases, collision-resistant hashing proves to be a more viable method for accomplishing the goal of detecting duplicate blocks.

Collision Resistant Hashing: Collision-resistant hashing is a technique by which a unique hash value is generated for each unique block content. The Message Digest Algorithm 5 (MD5) is used for this purpose. The MD5 algorithm is widely used in many cryptographic applications and is found to be more collision resistant than its predecessors. It takes an input of arbitrary length and produces an MD5 hash of 128 bits. For any unique input, a unique MD5 hash is produced.
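The following user-space sketch illustrates the idea of comparing blocks by their digests rather than by their raw bytes. It assumes OpenSSL's MD5() routine is available and is not the code used inside the kernel, where the checksum is computed by the file system itself (see Section 5.2).

#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

#define BLOCK_SIZE 4096

/* Return 1 if the two blocks have identical MD5 digests (and hence, with
 * overwhelming probability, identical contents), 0 otherwise. */
static int blocks_are_duplicates(const unsigned char *a, const unsigned char *b)
{
    unsigned char da[MD5_DIGEST_LENGTH], db[MD5_DIGEST_LENGTH];

    MD5(a, BLOCK_SIZE, da);   /* 128-bit digest of the first block  */
    MD5(b, BLOCK_SIZE, db);   /* 128-bit digest of the second block */
    return memcmp(da, db, MD5_DIGEST_LENGTH) == 0;
}

int main(void)
{
    static unsigned char x[BLOCK_SIZE], y[BLOCK_SIZE];

    memset(x, 'A', BLOCK_SIZE);
    memset(y, 'A', BLOCK_SIZE);
    printf("duplicate? %d\n", blocks_are_duplicates(x, y));  /* prints 1 */
    y[0] = 'B';
    printf("duplicate? %d\n", blocks_are_duplicates(x, y));  /* prints 0 */
    return 0;
}

Compiled with -lcrypto, the first comparison reports a duplicate and the second does not; in the file system, the digest is looked up in a hash table rather than compared pairwise.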

This is a one-way hashing technique: from given data, its corresponding hash value can easily be calculated, but given an MD5 hash, it is not possible to derive the input data from it. We employ this technique to compare the contents of blocks. As the MD5 hashing algorithm gives a unique hash for every unique input, it can be said that no two non-identical blocks of data will give rise to the same hash value. Therefore, if the hash value is calculated for every data block, comparing the hash values amounts to comparing the actual blocks of data.

4.3 TRACKING CONTENT INFORMATION:

The hash table is used to track the content information pertaining to the various disk data blocks. For every unique content in the disk, there is an entry in the hash table. In the hash table, a checksum-to-block-number mapping is available. This mapping is used to locate the duplicate blocks (if any) before any write operation takes place on the disk. The overall structure of the hash table is illustrated by the following figure.

Figure 4.3: Hash table structure

[The checksum hash table (indexed by checksum) and the block hash table (indexed by block number) both point to the same set of entries, each recording a checksum, a disk block number, and a reference count for one disk block.]

4.4 ONLINE DUPLICATE ELIMINATION:

WRITE SCENARIOS:

1) Unique Data (New Block): This happens in the case of file creation or a file append. Basically, a new block is allocated and the data is copied to it. Before writing the data to the block, the hash table is checked and the data is found to be new. Therefore, the new block entry, its checksum value, and a reference count of 1 are added to the hash table, and the normal write operation continues to execute.

2) Unique Data (Existing Block): This happens in the case of file modification. When a file is modified, one or more blocks corresponding to the file get modified, and before writing them to the disk, the routine redundancy check is performed. In this case, the new modified checksum does not find an entry in the hash table, but there is already an entry in the hash table for that block with a different checksum. Therefore, the reference count of that block is decremented, a new block is allocated, the new contents are copied to it, and the hash table is updated for that new block.

3) Duplicate Data (New block): This happens in the case of file creation or a file append wherein the efficiency of the content-based file system is utilized. In this case, before writing to disk, the hash table is checked for a duplicate block, and that existing block is mapped to the corresponding inode pointer. The reference count of that block is incremented.

4) Duplicate Data (Existing Block): This is the case where an existing block is modified and the now modified contents are found to be duplicates. In this case, the reference count of the old block is decremented and the reference count of the matching block is incremented. Then, the corresponding inode pointers are updated with the new block.

5) Delete Data: This happens when a file is modified by deleting a part of its contents, or during a file remove. In this case, the reference count alone is decremented, and when the reference count reaches 0, the block gets freed and the hash entry is removed.

DISCUSSION:

Comparison with LBFS: LBFS is a network file system which conserves communication bandwidth between clients and servers. It takes advantage of cross-file similarities. When transferring a file between the client and server, LBFS identifies chunks of data that the recipient already has in other files and avoids transmitting the redundant data over the network. Here, there is no block-level redundancy check performed; instead, chunks of data are compared to check for duplication. Chunks are variable-sized blocks. When a modification is made to shared data, the block (chunk) size is made to increase, and the necessary changes are made to the kernel to handle that. This is more complex to implement when compared to the block-level redundancy check that is performed in this project.

File-Level Redundancy Check: Another technique related to block-level hashing is file-level redundancy checking. In this case, whole files are compared and the redundant files are eliminated. The usage of this technique is limited by the availability of redundant files in a file system: only if there is more than one copy of a file in the disk partition can the advantage be felt. But in the case of a block-level redundancy check, the duplicate blocks are eliminated in an inter-file environment. Even when two files are different as a whole, duplicate elimination can still be done for the redundant blocks shared by the two files. The advantage of this file system can therefore be observed much more frequently, since there may be many redundant blocks spread over different files. Therefore, block-level content hashing and redundancy elimination is a beneficial method that brings about efficient disk space usage and also better performance, by reducing the number of disk accesses and by attaining a better cache hit rate.

5. IMPLEMENTATION:

Figure 5.0: Control Logic of CBFS

[The flowchart shows the per-block write path: the checksum table is checked for a duplicate. If one is found, the page mapping and inode pointer are changed, the block hash table is consulted, and the hash table is updated (removing or adding entries as needed). If no duplicate is found, a hash entry is added and a new block is allocated. Both paths exit with success.]

5.1 DATA STRUCTURES:

1) Hash Table Structure: There are two hash table structures, called the checksum hash table (indexed by the checksum field) and the block hash table (indexed by the block number field). Both these hash table structures point to a single copy of the hash node. The following figure illustrates the overall structure of the hash tables.

Figure 5.1: Structure of hash tables

[The checksum hash table pointer is indexed by checksum and the block hash table pointer is indexed by block number; both reference the same hash nodes, each holding a checksum, a block number, and a reference count.]

2) Components of the Hash Table: The hash table node comprises three fields, namely a checksum field of 128 bits, a block number field (the logical block number on the disk), and the reference count of the block. The block number field helps in locating the physical block on the disk. Therefore, for every valid data block in the disk, there will be a hash entry in the hash table.
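The code in Section 5.2 manipulates these nodes through fields such as checksum, block_number, ref_count, checksum_ptr, and block_ptr. The exact declarations are not reproduced in this report; the following is a plausible sketch, consistent with those snippets, of what the structures might look like (the field layout here is our assumption).

#include <linux/list.h>
#include <linux/spinlock.h>

/* One node per unique data block; linked into both hash tables. */
struct cbfs_hash_node {
    char             *checksum;      /* 128-bit MD5 digest of the block        */
    long              block_number;  /* logical block number on disk           */
    int               ref_count;     /* number of inode pointers sharing it    */
    struct list_head  checksum_ptr;  /* link in the checksum hash table        */
    struct list_head  block_ptr;     /* link in the block hash table           */
};

/* A bucket is simply the head of a list of nodes. */
struct cbfs_hash_bucket {
    struct list_head  node_list;
};

/* Table descriptor shared by the checksum table and the block table. */
struct cbfs_hash_table {
    struct cbfs_hash_bucket *buckets;  /* array of buckets                     */
    int   (*hash)(void *);             /* maps a key to a bucket index         */
    spinlock_t lock;                   /* protects lookups and updates         */
    long  len;                         /* number of nodes currently stored     */
};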

5.2 HASH TABLE OPERATIONS:

There are two hash table pointers, namely checksum_htable and block_htable. The operations associated with the hash tables are the following.

1) Initializing the hash table: Initializing the hash table involves creation of the hash table structure and the hash buckets. This is done at the time of mounting the content-based file system, in the kernel function cbfs_fill_super(). The code snippets for initializing the two hash tables are given below:

struct cbfs_hash_table* cbfs_init_checksum_hash_table(int (*hash)(void *))
{
    int i;
    struct cbfs_hash_table *newtable;

    newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
    BUG_ON(!newtable);
    memset(newtable, 0, sizeof(struct cbfs_hash_table));

    newtable->buckets = (struct cbfs_hash_bucket *)
        __get_free_pages(GFP_KERNEL,
                         get_order(NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket)));
    BUG_ON(!newtable->buckets);
    memset(newtable->buckets, 0,
           NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket));

    for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++)
        INIT_LIST_HEAD(&newtable->buckets[i].node_list);

    newtable->hash = hash;
    spin_lock_init(&newtable->lock);
    return newtable;
}

struct cbfs_hash_table* cbfs_init_block_hash_table(int (*hash)(void *))
{
    int i;
    struct cbfs_hash_table *newtable;

    newtable = kmalloc(sizeof(struct cbfs_hash_table), GFP_KERNEL);
    BUG_ON(!newtable);
    memset(newtable, 0, sizeof(struct cbfs_hash_table));

    /* The block table's bucket array is sized by NUM_BLOCK_BUCKETS. */
    newtable->buckets = (struct cbfs_hash_bucket *)
        __get_free_pages(GFP_KERNEL,
                         get_order(NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket)));
    BUG_ON(!newtable->buckets);
    memset(newtable->buckets, 0,
           NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket));

    for (i = 0; i < NUM_BLOCK_BUCKETS; i++)
        INIT_LIST_HEAD(&newtable->buckets[i].node_list);

    newtable->hash = hash;
    spin_lock_init(&newtable->lock);
    return newtable;
}
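The init functions above take a bucket-index callback, but the report does not show those callbacks or the call site in cbfs_fill_super(). A minimal sketch of what they could look like follows; the function names, folding scheme, and call site shown here are assumptions, not the project's actual code.

/* Fold a 128-bit checksum into a bucket index for the checksum table.
 * CHECKSUM_SIZE is 16 bytes for MD5. */
static int cbfs_checksum_hash(void *key)
{
    unsigned char *csum = key;
    unsigned int sum = 0;
    int i;

    for (i = 0; i < CHECKSUM_SIZE; i++)
        sum = sum * 31 + csum[i];
    return sum % NUM_CHECKSUM_BUCKETS;
}

/* Map a logical block number to a bucket index for the block table. */
static int cbfs_block_hash(void *key)
{
    long blk = *(long *)key;
    return blk % NUM_BLOCK_BUCKETS;
}

/* At mount time, cbfs_fill_super() could then create both tables:
 *
 *     c_htable = cbfs_init_checksum_hash_table(cbfs_checksum_hash);
 *     b_htable = cbfs_init_block_hash_table(cbfs_block_hash);
 */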

2) Check duplicate: This function takes in the contents of the block and returns whether another block with the same contents already exists. It internally invokes the cbfs_compute_checksum() function to calculate the checksum field and then compares it with the checksum fields in the hash table. Given below is the code snippet for the cbfs_check_duplicate_block() function:

long cbfs_check_duplicate_block(struct cbfs_hash_table *checksum_htable,
                                struct cbfs_hash_table *block_htable,
                                char *data, long new_blk_no)
{
    long err = 0;
    char *checksum;
    long blk_no;
    struct cbfs_hash_node *node_c, *node_b;

    checksum = cbfs_compute_checksum(data);
    spin_lock(&checksum_htable->lock);
    node_c = cbfs_node_lookup_by_checksum(checksum_htable, checksum);
    node_b = cbfs_node_lookup_by_block(block_htable, new_blk_no);

    if (!node_c && !node_b) {
        /* New content on a new block: record it and keep the block. */
        cbfs_add_hash_entry(checksum_htable, block_htable, checksum, new_blk_no);
        err = new_blk_no;
        goto out;
    } else if (!node_c && node_b) {
        /* Existing block overwritten with new (unique) content. */
        err = -1;
    } else {
        /* Content already present: share the existing block. */
        node_c->ref_count++;
        blk_no = node_c->block_number;
        err = blk_no;
        goto out;
    }
out:
    spin_unlock(&checksum_htable->lock);
    return err;
}

char* cbfs_compute_checksum(char *data)
{
    char *err = NULL;
    char *checksum;

    checksum = kmalloc(CHECKSUM_SIZE, GFP_KERNEL);
    hmac(data, DATA_SIZE, KEY, KEY_SIZE, (void *)checksum);

    err = checksum;
    return err;
}

3) Add hash entry: This function takes in the checksum value and block number, adds the entry to both hash tables, and initializes the reference count to 1. It is invoked when a new block is allocated and is not found to be a duplicate. The code snippet for the cbfs_add_hash_entry() function is given below:

void cbfs_add_hash_entry(struct cbfs_hash_table *checksum_htable,
                         struct cbfs_hash_table *block_htable,
                         char *checksum, long blk_no)
{
    struct cbfs_hash_node *newnode;
    int checksum_bucket;
    int block_bucket;
    long *blk = &blk_no;

    newnode = kmem_cache_alloc(cacheptr, GFP_KERNEL);
    newnode->checksum = checksum;
    newnode->block_number = blk_no;
    newnode->ref_count = 1;
    INIT_LIST_HEAD(&newnode->block_ptr);
    INIT_LIST_HEAD(&newnode->checksum_ptr);

    checksum_bucket = checksum_htable->hash((void *)checksum);
    block_bucket = block_htable->hash((void *)blk);

    list_add_tail(&newnode->checksum_ptr,
                  &checksum_htable->buckets[checksum_bucket].node_list);
    checksum_htable->len++;
    list_add_tail(&newnode->block_ptr,
                  &block_htable->buckets[block_bucket].node_list);
    block_htable->len++;
}

4) Remove hash entry: This function removes the hash entry if its reference count is 1, and decrements the reference count otherwise. It is called inside the cbfs_free_block() function; therefore, for every block freed in the file system, this function is invoked and the hash table is kept up to date at every instant. The function cbfs_remove_hash_entry() is given below:

int cbfs_remove_hash_entry(struct cbfs_hash_table *checksum_htable,
                           struct cbfs_hash_table *block_htable, long blk_no)
{
    int err = 0;
    int checksum_bucket;
    int block_bucket;
    char *checksum = NULL;
    long *blk = &blk_no;
    struct list_head *pos1, *pos2;
    struct cbfs_hash_node *node;

    block_bucket = block_htable->hash((void *)blk);
    spin_lock(&block_htable->lock);
    list_for_each(pos1, &block_htable->buckets[block_bucket].node_list) {
        node = list_entry(pos1, struct cbfs_hash_node, block_ptr);
        if (node == NULL) {
            printk("\nNo node to free !!");
            err = 0;
            goto out;
        }
        if (node->block_number == blk_no) {
            if (node->ref_count == 1) {
                /* Last reference: unlink from the block table and remember
                 * the checksum so the checksum table can be cleaned as well. */
                checksum = node->checksum;
                list_del(pos1);
                block_htable->len--;
                goto cs;
            } else {
                /* Block is still shared: just drop one reference. */
                node->ref_count--;
                err = -1;
                goto out;
            }
        }
    }
    goto out;

cs:
    checksum_bucket = checksum_htable->hash((void *)checksum);
    list_for_each(pos2, &checksum_htable->buckets[checksum_bucket].node_list) {
        node = list_entry(pos2, struct cbfs_hash_node, checksum_ptr);
        if (memcmp((void *)node->checksum, (void *)checksum, CHECKSUM_SIZE) == 0) {
            list_del(pos2);
            kmem_cache_free(cacheptr, node);
            checksum_htable->len--;
            err = 0;
            goto out;
        }
    }
out:
    spin_unlock(&block_htable->lock);
    return err;
}

5) Freeing the hash table: This removes the entire hash table from memory. It is done when the CBFS module is removed from the kernel, and is invoked in the cbfs_module_exit() function, which is called at module removal time. Given below is the code snippet for the cbfs_hash_free() function:

void cbfs_hash_free(struct cbfs_hash_table *checksum_htable,
                    struct cbfs_hash_table *block_htable)
{
    int i;
    struct cbfs_hash_node *node;
    struct list_head *pos1, *pos2, *n;

    for (i = 0; i < NUM_CHECKSUM_BUCKETS; i++) {
        list_for_each_safe(pos1, n, &checksum_htable->buckets[i].node_list) {
            node = list_entry(pos1, struct cbfs_hash_node, checksum_ptr);
            list_del(pos1);
        }
    }
    for (i = 0; i < NUM_BLOCK_BUCKETS; i++) {
        list_for_each_safe(pos2, n, &block_htable->buckets[i].node_list) {
            node = list_entry(pos2, struct cbfs_hash_node, block_ptr);
            list_del(pos2);
            /* Nodes are shared by both tables and are freed only once, here. */
            kmem_cache_free(cacheptr, node);
        }
    }
    free_pages((unsigned long)checksum_htable->buckets,
               get_order(NUM_CHECKSUM_BUCKETS * sizeof(struct cbfs_hash_bucket)));
    free_pages((unsigned long)block_htable->buckets,
               get_order(NUM_BLOCK_BUCKETS * sizeof(struct cbfs_hash_bucket)));
    printk("\nBuckets freed");

    kfree(checksum_htable);
    kfree(block_htable);
    printk("\nHash tables freed");
}

5.3 READ DATA FLOW :

Figure 5.3: Read data flow

sys_read() -> VFS layer (maps to the file-system-specific read() function or the generic read() function) -> generic_file_read() -> checks the inode pointers to locate the blocks -> reads the corresponding blocks from disk

As illustrated in the above flow diagram, the read request travels through a series of layers and functions and finally ends up searching the inode pointers for the disk blocks to read. The inode pointers will be identical for the redundant blocks, and for those blocks the disk read is made only once; every subsequent read of such a block is served from the cache. Therefore, performance is improved by this content-based file system. Also, no extra complexity is added with regard to the read function or the inode structure. This makes the content-based file system a more viable option.

5.4 WRITE DATA FLOW:

1) Unique Data (New block): This is the most common write scenario in a system, wherein a file is either created or appended. Here, a new block is allocated, the block address is added into the appropriate place in the inode pointers, and the content to be written is copied from the user's address space

to the page that is mapped to that block. At this stage, the hash value for the new content that is available in the page is calculated.

sys_write() -> VFS layer -> generic_file_write() -> generic_file_buffered_write() -> cbfs_commit_write() (here the hash table is checked for duplicate blocks; if a duplicate block is found the write is not performed, else the normal write is done) -> exit with success

This hash value is looked up in the checksum hash table. Since this is unique data, the hash table lookup returns a miss. Therefore, the already-allocated new block's number, the hash value of the newly written content, and a reference count of 1 are added to the hash table. After this, the buffer is marked as dirty and the disk write is carried out. These steps are performed by invoking the cbfs_check_duplicate_block() function.

[Before/After hash table and inode state: the hash table gains a new entry (the checksum of the written content, the new block number, reference count 1) and the inode gains a pointer to the new block; the existing entries are unchanged.]
2) Unique Data (Overwrite existing block): One of the modes of file update is an in-place overwrite, where the same block is overwritten with new contents. This may or may not cause new blocks to be allocated. Because the blocks are modified with the new overwritten data, the VFS layer would write those blocks to disk. At this point, the content-based FS looks at the blocks written and determines whether this is a case of an overwrite or a new allocation. The way it does this is by consulting the block hash table. If there is already an entry for this block in the hash table, it means that the block was already allocated.

[Before/After hash table and inode state: if the old block was shared, its reference count is decremented and the new contents go to a freshly allocated block pointed to by the inode; otherwise the existing entry is updated with the new checksum.]

Once it finds this, the next step is to find whether the old version of the block had the same contents as the current version. To find this, it computes the checksum of the current contents and compares it with the checksum stored in the hash table against that block number. If the checksums don't match, it means that the block now contains a different piece of data, and thus can no longer be identified by the old checksum.

3) Duplicate Data (New block): Another type of write scenario is where a new block is to be written and the content of the to-be-written block is already present on the disk. Here, a new block is allocated, its address is added to the inode pointers, and the new content is copied to the page in memory that is mapped to the newly-allocated block. At this stage, the hash value of the new content is calculated and a hash lookup is made in the checksum hash table. Since the content is already present on the disk, the checksum hash table will find the block holding the content and return its block number. This block number is the mapped_block. After this is done, the cbfs_free_branches() function is called, which removes the old block number from the inode pointers and then frees the already allocated block. Then, the mapped block number is added to the inode pointers and the reference count of the mapped block is incremented in the hash table.

[Before/After hash table and inode state: the freshly allocated block disappears again, the inode instead points to the existing block that already holds the same content, and that block's reference count is incremented.]

4) Duplicate Data (Overwrite existing block): This is comparatively a rare and complex case of write. This is the case where an already existing block is to be modified and the modified content happens to be already present on the disk. Here the overwrite may or may not require allocation of a new block; this can be found by checking for the block in the block hash table. Again, if it is an already existing block, its content cannot be changed straightforwardly because it might be a shared block. So the hash table is looked up and the reference count is checked. If the reference count is 1, then the new checksum value is computed and the hash table is updated with the new checksum value for the same block. If the reference count is greater than 1, then the duplicate block check is made. This is done by calculating the new checksum and looking up the checksum hash table. Since the content is already present on the disk, the checksum hash table will find the block holding the content and return its block number. This block number is the mapped_block. Then, the old block number is removed from the inode pointers and the mapped block is spliced in. The reference count of the mapped block is also incremented.

[Before/After hash table and inode state: the overwritten block's reference count is decremented, the inode pointer is switched to the existing block that already holds the new content, and that block's reference count is incremented.]

5) Delete Data:

This scenario is handled in the function cbfs_free_block(), which is responsible for freeing blocks of data and is called at the time of file truncation or a block remove. The cbfs_free_block() function invokes the cbfs_remove_hash_entry() function. This function takes the block number as an argument and looks up the block hash table for the entry. It finds the entry and checks the reference count. If the reference count is 1, the entry is removed from the hash table and 0 is returned to the cbfs_free_block() function. If the reference count is greater than 1, it is decremented and a nonzero value is returned to the cbfs_free_block() function. On receiving 0, cbfs_free_block() proceeds with actually freeing the block by clearing the bit in the block bitmap. Otherwise, execution stops and the block is not actually freed.

5.5 DUPLICATE ELIMINATED CACHE:

In the block-level duplicate elimination done here, the hash table maintained is a global table, and therefore the blocks belonging to all the files in that disk partition will be present in the hash table. This means that two processes accessing two different files having one or more shared blocks will have only a single copy of each shared block in the page cache. This is because, when the first file accesses the shared block, it will be read from the disk and made available in the page cache. Subsequently, when another process accesses that shared block, it will first check for the block in the page cache. Since there will already be a page corresponding to the shared block in the page cache, a disk read is saved. In this way, duplicate elimination is carried out in the page cache as well, which helps in increasing performance by reducing disk accesses.

6) EVALUATION:

6.1) Correctness:

The correctness of the project is checked with the help of a testing program, which exhaustively exercises the transactions a file system can operate on. The testing program checks the file system state in different situations, such as copying a duplicate file, copying a file with duplicate content, modifying a file which has shared blocks, etc. It is found that the testing program returned the expected results and the file system remained stable. Given below is the testing program and its output:

tester.c :

#include <linux/errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>

#include <malloc.h> #include <string.h> #include <stdlib.h>

#define NUM_DATA 256 #define NUM_FILES 20 #define STAGE_SIZE 4096 #define FILE_SIZE 256

int main(int argc, char **argv)
{
    char **data;
    int i=0, j=0, err=0, fp;
    char filename[32];

data = (char **)malloc(sizeof(char*) * NUM_DATA); for (i=0; i<NUM_DATA; i++) { data[i] = malloc(STAGE_SIZE); memset(data[i], i, STAGE_SIZE); }

printf("\nCreating files ..."); fflush(stdout);

for (i=0; i<NUM_FILES; i++) { sprintf(filename, "testfile-%d.dat", i);

fp = open(filename, O_CREAT | O_WRONLY, 0644); if (fp < 0) {

perror("tester"); printf("\nERROR: Cannot open file"); exit(1); }

for (j=0; j<FILE_SIZE; j++) {
    err = write(fp, data[i], STAGE_SIZE);
    if (err != STAGE_SIZE) {
        perror("tester");
        printf("\nERROR: Write failed");
        exit(1);
    }
}

close(fp); }

fp = open("nondup.dat", O_CREAT|O_WRONLY, 0644); if (fp < 0) {

perror("tester"); printf("\nERROR: Cannot open file"); exit(1); }

for (i=0; i<32; i++) {
    err = write(fp, data[NUM_DATA-(i+1)], STAGE_SIZE);
    if (err != STAGE_SIZE) {
        perror("tester");
        printf("\nERROR: Write failed");
        exit(1);
    }
}

close(fp);

sync();

printf("\nPerforming duplicate appends ... (dup->dup)"); fflush(stdout);

for (i=0; i<NUM_FILES; i++) { sprintf(filename, "testfile-%d.dat", i);

fp = open(filename, O_WRONLY | O_APPEND);  /* appending requires write access */
if (fp < 0) {

perror("tester");

printf("\nERROR: Cannot open file"); exit(1); }

for (j=0; j<FILE_SIZE; j++) {
    err = write(fp, data[NUM_FILES-i], STAGE_SIZE);
    if (err != STAGE_SIZE) {
        perror("tester");
        printf("\nERROR: Write failed");
        exit(1);
    }
}

close(fp); } sync();

printf("\nPerforming duplicate overwrites ... (dup>dup)"); fflush(stdout);

for (i=0; i<NUM_FILES; i++) { sprintf(filename, "testfile-%d.dat", i);

fp = open(filename, O_RDWR); if (fp < 0) {

perror("tester"); printf("\nERROR: Cannot open file"); exit(1); }

lseek(fp, ((FILE_SIZE/2) - 8) * STAGE_SIZE, SEEK_SET);
err = write(fp, data[NUM_FILES-i], STAGE_SIZE);
if (err != STAGE_SIZE) {
    perror("tester");
    printf("\nERROR: Write failed");
    exit(1);
}

close(fp); }

sync();

printf("\nPerforming non-duplicate overwrites ... (dup>nondup)"); fflush(stdout);

for (i=0; i<NUM_FILES; i++) { sprintf(filename, "testfile-%d.dat", i);

fp = open(filename, O_RDWR); if (fp < 0) {

perror("tester"); printf("\nERROR: Cannot open file"); exit(1); }

lseek(fp, ((FILE_SIZE/2) - 16) * STAGE_SIZE, SEEK_SET);
err = write(fp, data[NUM_FILES+i], STAGE_SIZE);
if (err != STAGE_SIZE) {
    perror("tester");
    printf("\nERROR: Write failed");
    exit(1);
}

close(fp); }

sync();

printf("\nPerforming duplicate overwrites ... (nondup>dup)"); fflush(stdout);

fp = open("nondup.dat", O_CREAT|O_RDWR); if (fp < 0) {

perror("tester"); printf("\nERROR: Cannot open file"); exit(1); }

for (i=0; i<32; i++) {
    err = write(fp, data[i], STAGE_SIZE);
    if (err != STAGE_SIZE) {
        perror("tester");
        printf("\nERROR: Write failed");
        exit(1);
    }
}

close(fp); sync();

printf("\n\nTester completed.\n");

}

OUTPUT :

6.2) Performance:

We conducted all tests on a virtual machine running on a 2.8 GHz Celeron processor with 1 GB of RAM and an 80 GB Western Digital Caviar IDE disk. The operating system was Fedora Core 4 running a 2.6.15 kernel. We tested the content-based file system using the Postmark benchmark.

Postmark: As an I/O-intensive benchmark that tests the worst-case I/O performance of the file system, we ran Postmark [12]. Postmark stresses the file system by performing a series of operations such as directory lookups, creations, and deletions on small files. Postmark has three phases: the file creation phase, which creates a working set of files; the transactions phase, which involves creations, deletions, appends, and reads; and the file deletion phase, which removes all files in the working set. We configured Postmark to create 20,000 files (between 512 bytes and 10 KB) and perform 200,000 transactions. Figure 6.1 shows the results of Postmark on Ext2 and CBFS.

Postmark results:

[Figure 6.1 : Postmark results for Ext2 and CBFS, showing total time and transaction time]

CODE SNIPPETS :

1) __CBFS_COMMIT_WRITE() :

static int __cbfs_commit_write(struct inode *inode, struct page *page,
                               unsigned from, unsigned to)
{
    unsigned block_start, block_end;
    int part = 0;
    unsigned blocksize;
    struct buffer_head *bh, *head;
    sector_t iblock, block;
    unsigned bbits;
    long int allocated_block, mapped_block;
    void *paddr;
    int err = -EIO;
    unsigned long goal;
    int offsets[4];
    Indirect chain[4];
    Indirect *partial;
    int boundary = 0;
    int depth = 0;

    blocksize = 1 << inode->i_blkbits;
    bbits = inode->i_blkbits;
    iblock = (sector_t) page->index << (PAGE_CACHE_SHIFT - bbits);

    if (page && (page->index != iblock)) {
        printk(KERN_WARNING "\nCalculation wrong!!");
        BUG();
    }

    paddr = page_address(page);
    for (bh = head = page_buffers(page), block_start = 0;
         bh != head || !block_start;
         iblock++, block_start = block_end, bh = bh->b_this_page) {
        block_end = block_start + blocksize;
        if (block_end <= from || block_start >= to) {
            if (!buffer_uptodate(bh))
                part = 1;
        } else if (S_ISREG(inode->i_mode)) {
            if (buffer_new(bh))
                clear_buffer_new(bh);
recheck:
            /* Locate the block currently allocated for this page and check
             * its new contents against the checksum hash table. */
            depth = cbfs_block_to_path(inode, iblock, offsets, &boundary);
            cbfs_get_branch(inode, depth, offsets, chain, &err);
            allocated_block = (long) chain[depth - 1].key;
            mapped_block = cbfs_check_duplicate_block(c_htable, b_htable,
                                    (char *) paddr, allocated_block);

            if (mapped_block == allocated_block) {
                /* The contents map to the block already allocated;
                 * nothing needs to be remapped. */
                goto out;
            } else if (mapped_block < 0) {
                /* No usable mapping returned: free the current branch,
                 * allocate a new block through cbfs_get_block() and
                 * recheck the contents. */
                cbfs_free_branches(inode, chain[depth - 1].p,
                                   chain[depth - 1].p + 1, 0);
                cbfs_get_block(inode, iblock, bh, 1);
                goto recheck;
            } else {
                /* A block with identical contents already exists: remap
                 * this logical block to it via cbfs_get_block_direct(). */
                goal = mapped_block;
                cbfs_free_branches(inode, chain[depth - 1].p,
                                   chain[depth - 1].p + 1, 0);
                cbfs_get_block_direct(inode, iblock, bh, 1, goal);
            }
        } else {
out:
            set_buffer_uptodate(bh);
            mark_buffer_dirty(bh);
        }
    }

    if (bh->b_blocknr != pno_to_blockno(inode, page->index)) {
        printk("\nBUG: page mapping is screwed up! %ld, and %ld",
               (long int) bh->b_blocknr,
               (long int) pno_to_blockno(inode, page->index));
    }

    /*
     * If this is a partial write which happened to make all buffers
     * uptodate then we can optimize away a bogus readpage() for
     * the next read(). Here we 'discover' whether the page went
     * uptodate as a result of this (potentially partial) write.
     */
    if (!part)
        SetPageUptodate(page);
    return 0;
}

2) CBFS_GET_BLOCK_DIRECT() :

int cbfs_get_block_direct(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh, int create,
                          unsigned long goal)
{
    int err = -EIO;
    int offsets[4];
    Indirect chain[4];
    Indirect *partial;
    int boundary = 0;
    int depth = 0;
    int left;

    depth = cbfs_block_to_path(inode, iblock, offsets, &boundary);
    if (depth == 0)
        goto out;

reread:
    partial = cbfs_get_branch(inode, depth, offsets, chain, &err);

    /* Simplest case - block found, no allocation needed */
    if (!partial) {
got_it:
        map_bh(bh, inode->i_sb, le32_to_cpu(chain[depth - 1].key));
        if (boundary)
            set_buffer_boundary(bh);
        /* Clean up and exit */
        partial = chain + depth - 1;    /* the whole chain */
        goto cleanup;
    }

    /* Next simple case - plain lookup or failed read of indirect block */
    if (err == -EIO) {
cleanup:
        while (partial > chain) {
            brelse(partial->bh);
            partial--;
        }
out:
        return 0;
    }

    /*
     * Indirect block might be removed by truncate while we were
     * reading it. Handling of that case (forget what we've got and
     * reread) is taken out of the main path.
     */
    if (err == -EAGAIN)
        goto changed;

    left = (chain + depth) - partial;
    err = cbfs_alloc_branch_direct(inode, left, goal,
                                   offsets + (partial - chain), partial);
    if (err)
        goto cleanup;

    if (cbfs_use_xip(inode->i_sb)) {
        /*
         * we need to clear the block
         */
        err = cbfs_clear_xip_target(inode,
                                    le32_to_cpu(chain[depth - 1].key));
        if (err)
            goto cleanup;
    }

    if (cbfs_splice_branch_direct(inode, iblock, chain, partial, left) < 0)
        goto changed;

    set_buffer_new(bh);
    goto got_it;

changed:
    while (partial > chain) {
        brelse(partial->bh);
        partial--;
    }
    goto reread;
}

3) CBFS_ALLOC_BRANCH_DIRECT() :

static int cbfs_alloc_branch_direct(struct inode *inode, int num,
                                    unsigned long goal, int *offsets,
                                    Indirect *branch)
{
    int blocksize = inode->i_sb->s_blocksize;
    int n = 0;
    int err = 0;
    int i;
    int parent;

    parent = goal;
    branch[0].key = cpu_to_le32(parent);

    if (parent)
        for (n = 1; n < num; n++) {
            struct buffer_head *bh;
            /* Allocate the next block */
            int nr = cbfs_alloc_block(inode, parent, &err);
            if (!nr)
                break;
            branch[n].key = cpu_to_le32(nr);
            /*
             * Get buffer_head for parent block, zero it out and set
             * the pointer to new one, then send parent to disk.
             */
            bh = sb_getblk(inode->i_sb, parent);
            if (!bh) {
                err = -EIO;
                break;
            }
            lock_buffer(bh);
            branch[n].bh = bh;
            branch[n].p = (__le32 *) bh->b_data + offsets[n];
            *branch[n].p = branch[n].key;
            set_buffer_uptodate(bh);
            unlock_buffer(bh);
            mark_buffer_dirty_inode(bh, inode);
            /* We used to sync bh here if IS_SYNC(inode).
             * But we now rely upon generic_osync_inode()
             * and b_inode_buffers. But not for directories.
             */
            if (S_ISDIR(inode->i_mode) && IS_DIRSYNC(inode))
                sync_dirty_buffer(bh);
            parent = nr;
        }
    if (n == num)
        return 0;

    /* Allocation failed, free what we already allocated */
    for (i = 1; i < n; i++)
        bforget(branch[i].bh);
    return err;
}

4) CBFS_SPLICE_BRANCH_DIRECT() :

static inline int cbfs_splice_branch_direct(struct inode *inode, long block,
                                            Indirect chain[4],
                                            Indirect *where, int num)
{
    struct cbfs_inode_info *ei = CBFS_I(inode);
    int i;

    /* Verify that place we are splicing to is still there and vacant */
    write_lock(&ei->i_meta_lock);
    // if (!verify_chain(chain, where - 1) || *where->p)
    //     goto changed;

    /* That's it */
    *where->p = where->key;
    ei->i_next_alloc_goal = le32_to_cpu(where[0].key);

    write_unlock(&ei->i_meta_lock);

    /* We are done with atomic stuff, now do the rest of housekeeping */
    inode->i_ctime = CURRENT_TIME_SEC;

    /* had we spliced it onto indirect block? */
    if (where->bh)
        mark_buffer_dirty_inode(where->bh, inode);

    mark_inode_dirty(inode);
    return 0;

changed:
    write_unlock(&ei->i_meta_lock);
    for (i = 1; i < num; i++)
        bforget(where[i].bh);
    for (i = 0; i < num; i++)
        cbfs_free_blocks(inode, le32_to_cpu(where[i].key), 1);
    return -EAGAIN;
}

5) CBFS_FREE_BLOCKS() :

void cbfs_free_blocks(struct inode *inode, unsigned long block,
                      unsigned long count)
{
    int i;

    /* Free the blocks one at a time so that each block is individually
     * removed from the hash tables by cbfs_free_block(). */
    for (i = 0; i < count; i++) {
        cbfs_free_block(inode, block, 1);
        block++;
    }
}

6) CBFS_FREE_BLOCK() :

void cbfs_free_block(struct inode *inode, unsigned long block,
                     unsigned long count)
{
    struct buffer_head *bitmap_bh = NULL;
    struct buffer_head *bh2;
    unsigned long block_group;
    unsigned long bit;
    unsigned long i;
    unsigned long overflow;
    struct super_block *sb = inode->i_sb;
    struct cbfs_sb_info *sbi = CBFS_SB(sb);
    struct cbfs_group_desc *desc;
    struct cbfs_super_block *es = sbi->s_es;
    struct cbfs_hash_node *node;
    unsigned freed = 0, group_freed = 0;
    int err = 0;

    err = cbfs_remove_hash_entry(c_htable, b_htable, (long) block);
    if (err < 0) {
        err = 1;
        goto error_return;
    }

    if (block < le32_to_cpu(es->s_first_data_block) ||
        block + count < block ||
        block + count > le32_to_cpu(es->s_blocks_count)) {
        cbfs_error(sb, "cbfs_free_blocks",
                   "Freeing blocks not in datazone - "
                   "block = %lu, count = %lu", block, count);
        goto error_return;
    }

    cbfs_debug("freeing block(s) %lu-%lu\n", block, block + count - 1);

do_more:
    overflow = 0;
    block_group = (block - le32_to_cpu(es->s_first_data_block)) /
                  CBFS_BLOCKS_PER_GROUP(sb);
    bit = (block - le32_to_cpu(es->s_first_data_block)) %
          CBFS_BLOCKS_PER_GROUP(sb);
    /*
     * Check to see if we are freeing blocks across a group
     * boundary.
     */
    if (bit + count > CBFS_BLOCKS_PER_GROUP(sb)) {
        overflow = bit + count - CBFS_BLOCKS_PER_GROUP(sb);
        count -= overflow;
    }
    brelse(bitmap_bh);
    bitmap_bh = read_block_bitmap(sb, block_group);
    if (!bitmap_bh)
        goto error_return;

    desc = cbfs_get_group_desc(sb, block_group, &bh2);
    if (!desc)
        goto error_return;

    if (in_range(le32_to_cpu(desc->bg_block_bitmap), block, count) ||
        in_range(le32_to_cpu(desc->bg_inode_bitmap), block, count) ||
        in_range(block, le32_to_cpu(desc->bg_inode_table),
                 sbi->s_itb_per_group) ||
        in_range(block + count - 1, le32_to_cpu(desc->bg_inode_table),
                 sbi->s_itb_per_group))
        cbfs_error(sb, "cbfs_free_blocks",
                   "Freeing blocks in system zones - "
                   "Block = %lu, count = %lu", block, count);

    for (i = 0, group_freed = 0; i < count; i++) {
        if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
                                   bit + i, bitmap_bh->b_data)) {
            cbfs_error(sb, __FUNCTION__,
                      "bit already cleared for block %lu", block + i);
        } else {
            group_freed++;
        }
        block++;
    }

    mark_buffer_dirty(bitmap_bh);
    if (sb->s_flags & MS_SYNCHRONOUS)
        sync_dirty_buffer(bitmap_bh);

    group_release_blocks(sb, block_group, desc, bh2, group_freed);
    freed += group_freed;

    if (overflow) {
        block += count;
        count = overflow;
        printk("\nOverflow");
        goto do_more;
    }

error_return:
    brelse(bitmap_bh);
    release_blocks(sb, freed);
    DQUOT_FREE_BLOCK(inode, freed);
}

7) INIT_CBFS_FS() :

static int __init init_cbfs_fs(void)
{
    int err = init_cbfs_xattr();

    /* Initialise the checksum hash table and the block hash table used
     * for duplicate detection. */
    c_htable = cbfs_init_checksum_hash_table(myhash_cs);
    b_htable = cbfs_init_block_hash_table(myhash_b);

    /* Slab cache for hash table nodes. */
    cacheptr = kmem_cache_create("nodespace", sizeof(struct cbfs_hash_node),
                                 0, 0, NULL, NULL);

    if (err)
        return err;
    err = init_inodecache();
    if (err)
        goto out1;
    err = register_filesystem(&cbfs_fs_type);
    if (err)
        goto out;
    return 0;
out:
    destroy_inodecache();
out1:
    exit_cbfs_xattr();
    return err;
}

8) EXIT_CBFS_FS() :

static void __exit exit_cbfs_fs(void)
{
    int result;

    cbfs_hash_free(c_htable, b_htable);
    unregister_filesystem(&cbfs_fs_type);
    destroy_inodecache();
    kmem_cache_destroy(cacheptr);
    exit_cbfs_xattr();
}

SCREENSHOTS :
