
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE CCNC 2010 proceedings

Relation-Based File Management for Portable Device


Junghwan Kim, junghwani.kim@samsung.com, Samsung Electronics
Hyunju Ahn, hyunju.ahn@samsung.com, Samsung Electronics
Chanho Park, chanho61.park@samsung.com, Samsung Electronics

Abstract: As storage capacity in CE devices has increased, the number and variety of files stored on them have increased accordingly. Traditional file systems, however, generally organize files in a hierarchical directory/file structure. To store and retrieve a file, the user must know its exact name and location, and only one access path is available. Such file systems are not adequate for managing files on portable storage. We present WebFS to manage files stored in mobile storage effectively and to give users a convenient way to retrieve them. WebFS represents complex information about files by using extended file metadata and provides varied and effective ways to access files by using inter-file relationships.

I. INTRODUCTION

Generally, a file system is a way to store and organize raw data as files, and it provides users with an abstraction of the physical storage device. FAT, NTFS and EXT2/3 are widely used general-purpose file systems. Most of these legacy file systems have a hierarchical directory/file structure. When storage capacity and the number of stored files were small, legacy hierarchical file systems were sufficient to retrieve and manage files easily and effectively. But classifying and accessing files under legacy file systems becomes a complicated and annoying task once an enormous number of files are stored in a portable device.

In the past, the major problem was to decide which files were irrelevant to the user's preferences so that storage space could be reclaimed by deleting them. However, as the capacity of storage in mobile devices has increased to several tens of gigabytes, the problem has shifted to retrieving the files that are desired or preferred. This has become a more critical issue than classifying files and deciding where to store them. Over time, users' preferences change, files are classified into smaller groups and more directories are created, so it becomes difficult to remember the exact name and path of the desired file. A file can belong to one or more classes according to its semantics and usage, but it is difficult to manipulate multiple paths in traditional file systems.

Some applications have been proposed to address these difficulties of legacy file systems: Google's Desktop Search [1], Apple's Spotlight [2], Beagle [3], and Microsoft's WinFS [4]. But they focus only on searching files based on indexed file data. As the number of stored files increases, it becomes hard to create and manage the index for those files. Furthermore, these applications are designed for the desktop environment. They require more computational resources and are not well suited to CE devices.

In order to solve these problems, we need an effective mass file management system to complement legacy file systems. Generally, files accessed at around the same time are related to the user's interest, and those files might have similar contents. From this, we reasoned that the user's file access sequence and the features of the accessed files should be considered in order to manage large numbers of files effectively. Also, if those inter-file relationships can be preserved, a user will be able to access related files easily by using the semantic relationship. By using inter-file relationships, we can navigate files in a way similar to Web-style browsing, so our system is named WebFS.

The rest of this paper is organized as follows. In the next section, we discuss previous work related to semantic file systems. In Section 3, the architecture of WebFS is described. Next, we define what a semantic relationship is and describe how to extract the semantic relationships among files. Finally, we show the results of the WebFS prototype in Section 5, and state our concluding remarks and future improvements to WebFS in Section 6.

II. RELATED WORKS AND BACKGROUNDS

In this section, we survey historical approaches to the semantic file system [5] and analyze how they differ from our proposed file system. WinFS (Windows Future Storage [4]) is a data management system based on relational databases, developed by Microsoft. WinFS recognizes different types of files such as pictures, email, text documents, audio, video, calendars, and contacts. Each file has different properties, and inter-file relationships are also exposed as properties. If the specific application is not available in a given file system or operating system, WinFS cannot use the relationships among files or provide the additional information to users. In WebFS, by contrast, those metadata and inter-file relationships are stored and managed at the file system level.
Therefore, the useful information can be managed without special-purpose applications. Another semantic approach uses attributes and links among files. LiFS (Linking File System [6][7]) provides application-defined file attributes and attributed links between files to provide a rich, efficient, shared file system metadata infrastructure. Such links and application-defined attributes can efficiently express file semantics and inter-file relationships, but they must be manually defined by the user or

978-1-4244-5176-0/10/$26.00 2010 IEEE


TABLE I
NEW FILE SYSTEM CALLS

System call    Description
getrelation    Get files which have semantic relationship
setrelation    Set relation between files
rmrelation     Remove relation between files

TABLE II
FILE SYSTEM OPERATIONS

getattr        fgetattr       access         readlink
opendir        readdir        releasedir     mknod
mkdir          symlink        unlink         rmdir
rename         link           chmod          chown
truncate       ftruncate      utimes         create
open           read           write          statfs
flush          release        fsync          getrelation
setrelation    rmrelation     init           destroy

Figure 1. WebFS system architecture
[Diagram. Regions: User Space, Kernel Space, Virtual Kernel Space. Components: Office Programs, Image Viewer Editor, WebFS Browser, Movie Player, Music Player, Data Transfer Service, VFS, WebFS, Ext2/Ext3, File System Service Provider, Content Analyzer, Access Tracer, Subject Analyzer, Relation Manager.]

applications. In other words, the semantics within LiFS cannot reflect how files are actually used and contain only the information that the user explicitly supplies. WebFS, on the other hand, automatically extracts the extended metadata and manages the file metadata. Bloehdorn et al. [8] propose a virtual file system with tag semantics on top of a hierarchical file system. It allows users to manage files through their existing desktop file management applications by overloading common file system operations with tagging semantics. In that system, the directory hierarchy is replaced by a tag hierarchy. Thus, if the tags related to the files of interest are provided, the system can find the correct file; otherwise, the files of interest may not be retrieved. In WebFS, by contrast, the tags related to the given files are automatically extracted and managed, so it is easy to search for the files of interest.

III. ARCHITECTURE

In this section, we describe the architecture of WebFS and its key components. WebFS manages files based on semantic relations among files, and those relationships are managed at the file system level, not at the application level. To satisfy that requirement, the essential components of WebFS are implemented using FUSE (File System in User Space [9]). The WebFS framework consists of a kernel module (FUSE) and a user library (libfuse). FUSE directs VFS calls to a file system in user space. Users and applications can use the functionality of a file system like WebFS via kernel system calls.

A. File System Operations

In order to utilize the new features of WebFS, we define several new kernel system calls, as described in Table I. The new system calls have syntax similar to that of the legacy system calls. To support the new system calls, new VFS operations are added. The FUSE kernel module and FUSE user library are also modified to fit the VFS change. In Table II, operations in white-colored cells are executed identically to the legacy operations without change.
Operations in gray-colored cells, however, are modified or added to provide the new functionality of WebFS. The init operation initializes resources for WebFS at mount time. Conversely, the destroy operation releases those resources when a WebFS volume is unmounted. WebFS manages WebFS-specific file metadata such as a unique file ID, file location, file access sequence, keywords as file features, and inter-file relationships. The create, open, unlink, rename, truncate, ftruncate and release operations execute the legacy operations and, in addition, create, modify or destroy the WebFS metadata of files. For create and unlink, WebFS metadata needs to be initialized and destroyed, respectively. For rename, truncate and ftruncate, WebFS metadata needs to be modified. When a file is released and its content has changed, the features of the file are extracted and the inter-file relationships are updated for that file. For create and open, we also need to trace the file access sequence. The getrelation, setrelation and rmrelation operations are newly added to support the new features of WebFS.

B. Architecture of WebFS

Fig. 1 shows the system architecture of WebFS. WebFS modules can be divided into user space and kernel space modules. User space modules (Data Transfer Service and WebFS Browser) use the WebFS core and take part in file browsing and saving. The main sub-modules of WebFS are implemented in virtual kernel space using FUSE. The main sub-modules are as follows:

File System Service Provider: manages VFS operations forwarded by FUSE.
Access Tracer: maintains frequently accessed files and recently added files, and in particular traces file access sequences.
Content Analyzer: extracts file-type-dependent metadata and keywords in addition to basic file metadata.
Subject Analyzer: constructs the keyword graph based on the file metadata and keywords extracted by Content Analyzer.
Relation Manager: determines inter-file relationships based on the semantics of files extracted by Content Analyzer, Access Tracer and Subject Analyzer.

IV. RELATION-BASED FILE SYSTEM (WEBFS)

A. Semantic Relation

Basically, WebFS does not manage the files themselves but manipulates the inter-file relationships, or semantic relations. Semantic relations are categorized into temporal locality and feature similarity.
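
A minimal sketch of how such relationship bookkeeping might look, assuming an in-memory store keyed by file ID. The method names echo the getrelation, setrelation and rmrelation calls of Table I, but their signatures, the RelationStore class and the two kind labels are hypothetical; the paper does not specify them. Relations are treated as directed here, which is also an assumption.

```python
from collections import defaultdict

# Hypothetical labels for the two categories of semantic relation.
TEMPORAL = "temporal_locality"
FEATURE = "feature_similarity"


class RelationStore:
    """In-memory map: file ID -> related file ID -> set of relation kinds."""

    def __init__(self):
        self._relations = defaultdict(lambda: defaultdict(set))

    def setrelation(self, src, dst, kind):
        """Record that dst is related to src by the given kind."""
        if kind not in (TEMPORAL, FEATURE):
            raise ValueError(f"unknown relation kind: {kind}")
        self._relations[src][dst].add(kind)

    def getrelation(self, src, kind=None):
        """Return related file IDs in sorted order, optionally filtered by kind."""
        related = self._relations.get(src, {})
        return sorted(
            dst for dst, kinds in related.items()
            if kind is None or kind in kinds
        )

    def rmrelation(self, src, dst):
        """Remove every relation from src to dst; report whether one existed."""
        related = self._relations.get(src)
        if related is None or dst not in related:
            return False
        del related[dst]
        return True


store = RelationStore()
store.setrelation("F_A", "F_B", TEMPORAL)
store.setrelation("F_A", "F_C", FEATURE)
store.setrelation("F_A", "F_B", FEATURE)
print(store.getrelation("F_A"))            # all files related to F_A
print(store.getrelation("F_A", TEMPORAL))  # only temporally related files
```

Keeping a set of kinds per pair lets the same two files be related by temporal locality and by feature similarity at once, which matches the two categories named in the text.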

Figure 2. Access sequence list of the current file
[Diagram: the current file Fi with a list of successors (entries Fj with counts Cij) and a list of predecessors (entries Fk with counts Cki). Each entry holds a File ID and a Count, and each list is ordered from most recently accessed to least recently accessed.]

Figure 3. Access list management
[Diagram: successor and predecessor lists, with counts, for files FA to FF.]

These semantic relations are the key to analyzing the relationship between user behavior and the user's preferred files.

1) Temporal Locality: Files tend to be accessed simultaneously or sequentially during a specific time period if a user is interested in those files or if the user's tasks and applications need those files. Over time, the user's interests may change, and the user might then be interested in different files. Because there is temporal locality in the file access sequence and it reflects the user's interests, we can obtain some semantic information from a user's file access pattern. If we can cluster files with temporal locality in the file access sequence, we can find files that are semantically related to each other from the user's point of view.

2) Feature Similarity: Feature similarity represents content similarities among files. If keywords that represent the contents of files are extracted and relationships among these keywords are constructed, feature similarity among files can also be retrieved based on the extracted keywords.

B. Relation extraction based on Temporal Locality

1) Identifying File Access Pattern: WebFS traces file access operations to find temporal locality in the user's file access sequence. There are many operations that a user can perform on a file, but we trace only the create and open operations as file access operations. After a file is opened or created, a user can read, append or overwrite data many times, and those operations are mostly sequential. If we traced all file operations, meaningless file access sequences would be collected. A user tends to access related files during the same time period, but files are not always accessed in exactly the same sequence. Considering that, WebFS traces both previously accessed files and successively accessed files. WebFS maintains a file access sequence for each file in a WebFS volume. Whenever a file is accessed (created or opened), the Access Tracer of WebFS updates the access sequence list of that file. As shown in Fig. 2, it maintains two lists, of m successors and m predecessors, for each file. In this example, m is set to 4. Each entry in the lists has a file ID and a reference count that indicates how many times the sequence has been observed. In Fig. 2, Fi, Fj and Fk represent file IDs, and Cij indicates the number of times that Fj was accessed right after Fi. The maximum number of entries in a list is fixed. If the list is full, the LRU policy is used to select the victim. If an access sequence has not been observed recently, we can infer that the file

Figure 4. The highly related file list management
[Diagram: successor and predecessor lists for files FA to FF; the selected related files are shown in gray-colored blocks.]

is irrelevant to the user's preferences and that the temporal locality between the files has become relatively weak. We now explain the procedure of the Access Tracer with an example. Assume that the given access order is as follows:
ABCBDEAEFDBABCEDACBABEFCABDEABCDEBCACFBA
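
The Access Tracer bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions, not the WebFS implementation: the AccessTracer class, its method names and the use of an OrderedDict as the LRU-ordered list are hypothetical. The related() helper ranks entries by count and breaks ties by recency, following the selection rule the paper gives for temporal semantic relationships.

```python
from collections import OrderedDict


class AccessTracer:
    """Per file, keeps at most m successors and m predecessors with counts.

    Each list is an OrderedDict ordered from least to most recently observed,
    so the LRU victim is always the first entry.
    """

    def __init__(self, m):
        self.m = m
        self.successors = {}    # file ID -> OrderedDict(next file ID -> count)
        self.predecessors = {}  # file ID -> OrderedDict(previous file ID -> count)
        self.last = None        # most recently accessed file ID

    def _bump(self, table, key, neighbor):
        entries = table.setdefault(key, OrderedDict())
        if neighbor in entries:
            entries[neighbor] += 1
            entries.move_to_end(neighbor)    # mark as most recently observed
        else:
            if len(entries) >= self.m:
                entries.popitem(last=False)  # evict LRU entry, whatever its count
            entries[neighbor] = 1

    def access(self, file_id):
        """Record a create/open of file_id."""
        if self.last is not None:
            self._bump(self.successors, self.last, file_id)
            self._bump(self.predecessors, file_id, self.last)
        self.last = file_id

    def related(self, file_id, n):
        """Return (top-n successors, top-n predecessors) of file_id."""
        def top(entries):
            # Sort by (count, recency position), highest first.
            ranked = sorted(
                enumerate(entries),
                key=lambda pair: (entries[pair[1]], pair[0]),
                reverse=True,
            )
            return [fid for _, fid in ranked[:n]]

        return (
            top(self.successors.get(file_id, {})),
            top(self.predecessors.get(file_id, {})),
        )


# Replay the access order from the text with m = 3 and n = 2.
tracer = AccessTracer(m=3)
for fid in "ABCBDEAEFDBABCEDACBABEFCABDEABCDEBCACFBA":
    tracer.access(fid)
print(tracer.related("A", 2))
```

Evicting the least recently observed entry regardless of its count keeps each list bounded while letting stale sequences age out, which is the behavior the text attributes to the LRU policy.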

When m is set to 3, the maintained lists are as shown in Fig. 3. If the number of successors or predecessors for a file exceeds m, the least recently accessed entry is removed, irrespective of its observed count, and the new one is added to the list.

2) Temporal Semantic Relationship: In WebFS, files in frequently observed access sequences are considered to have a semantic relationship based on temporal locality, and we regard more frequently accessed files as more tightly related. So the n most frequently accessed successors and predecessors are selected as related files. If n is set to 2, the files in gray-colored blocks in Fig. 4 are selected as semantically related files. Files with the highest reference count are selected first. If reference counts are equal, recency is used as the tiebreaker; that is, the most recently accessed files have priority for being selected as related files.

C. Relation extraction based on Feature Similarity

1) Feature (Keyword) Extraction: In order to analyze the content of a given file, we need to extract the main keywords


Figure 5. Example of Keyword Graph
[Diagram: D1 = {(a,3), (b,2), (c,1)}, D2 = {(b,1), (c,2), (d,3)}, D3 = {(c,1), (d,2), (e,1)}. Left: conceptual keyword graph over topics a to e with weighted edges. Right: per-topic data structure holding an edge list and a file list.]

TABLE III
TEST BENCHMARK SET

Test target            Description
EXT3                   Represents legacy file systems
EXT3 + FUSE            File system operations with FUSE only
EXT3 + FUSE + TagFS    Represents another file system using FUSE
EXT3 + FUSE + WebFS    WebFS using FUSE and legacy file system

or tags. In this work, we use a state-of-the-art approach for each file type. For example, entropy-based feature extraction is used for text-based documents such as PDF, PPT, DOC and TXT. For image files such as JPG, GIF and BMP, essential keywords are selected from keywords extracted from the EXIF tag and from statistical information about the RGB values in the image. For audio files, we use the properties in the ID3 tag. For documents, we separate the text into tokens and list each extracted token as a keyword candidate. The more often a word appears in the text, the more likely it is to be a keyword, so we use word frequency to extract keywords from document file formats.

2) Constructing Feature Graph: A text-based document has several keywords, and meaningful keywords for an audio or image file can be extracted from its metadata. However, it is difficult to identify relationships among these keywords. We reasoned that if keyword relationships are classified or identified, then file relationships can be identified as well. We assume that the keywords of a document have a semantic relationship among them. We define keywords that occur together in several documents as having a strong co-relationship. WebFS uses a keyword graph to represent this co-relationship. The basic idea of the keyword graph is that file relationships (used for file content search, content similarity and categorization) are identified by maintaining keywords with high co-occurrence. When the handle of a modified file is closed in WebFS, new keywords are extracted by the keyword extractor. After keywords are extracted from the file, the file has a list of elements, each consisting of a keyword and its frequency. The keyword graph is constructed by adding the newly extracted keywords. To describe the data structure of the keyword graph and how it is managed, we use the example keyword graph in Fig. 5. The keyword graph is constructed by adding the keywords of files D1 to D3.
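
As a rough illustration of the construction just described, the sketch below builds a keyword graph from the three example files of Fig. 5. It assumes that an edge weight counts the number of files in which two keywords co-occur, which is one reading of the text; the KeywordGraph class, its method names and the frequency-based extract_keywords helper are hypothetical and are not taken from WebFS.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def extract_keywords(text, k):
    """Return the k most frequent word tokens of text with their frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens).most_common(k))


class KeywordGraph:
    """Topics (keywords) linked by co-occurrence counts, plus per-topic file lists."""

    def __init__(self):
        self.edges = defaultdict(dict)  # topic -> {neighbor topic: co-occurrence count}
        self.files = defaultdict(list)  # topic -> IDs of files containing the topic

    def add_file(self, file_id, keywords):
        """Add one file; keywords maps each keyword to its in-file frequency."""
        topics = sorted(keywords)
        for topic in topics:
            self.files[topic].append(file_id)
        # Every pair of keywords in the same file co-occurs once more.
        for a, b in combinations(topics, 2):
            self.edges[a][b] = self.edges[a].get(b, 0) + 1
            self.edges[b][a] = self.edges[b].get(a, 0) + 1

    def best_neighbor(self, topic):
        """Return the most strongly co-related topic (ties broken alphabetically)."""
        neighbors = self.edges.get(topic)
        if not neighbors:
            return None
        return min(neighbors, key=lambda t: (-neighbors[t], t))


graph = KeywordGraph()
graph.add_file("D1", {"a": 3, "b": 2, "c": 1})
graph.add_file("D2", {"b": 1, "c": 2, "d": 3})
graph.add_file("D3", {"c": 1, "d": 2, "e": 1})
print(graph.best_neighbor("b"))  # c
```

In this sketch the per-file keyword frequencies are stored with the input but do not affect edge weights; a weighting scheme that uses them would be a separate design choice.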
The left side of the figure is a conceptual view of the keyword graph. A circle represents a topic, which is created when a keyword is added. An edge between topics represents a relationship between them. The number attached to an edge represents how many times the two topics co-occur in documents. The topic most closely co-related to topic b is c, because the edge between b and c has the highest value. The right side of the figure shows the data structure converted from the conceptual view of the keyword graph. Each topic has two lists: a list of edges, each indicating a corresponding topic, and a list of links to the files that contain the topic. In WebFS, the keyword graph is constructed not from all files but from recently created or modified files.

3) Extracting Feature Similarity: In this work, we use the LSA (Latent Semantic Analysis [10]) method to retrieve document similarity. If a group of similar documents is first retrieved by using the keyword graph, the amount of computation and resources required is reduced. The steps for retrieving a document group using the LSA method are as follows:

i. Find the topics matching the index terms of the given document and retrieve documents from the file lists of those topics.
ii. Traverse edges from the found topics and retrieve documents from the file lists of the target topics.
iii. Make a term-document matrix from the retrieved document group. It is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. Each value of the term-document matrix is the frequency with which the term occurs in the document.
iv. Apply SVD (Singular Value Decomposition) to the term-document matrix, which is decomposed into $U S V^T$. $U$ and $V^T$ are orthogonal matrices and $S$ is a diagonal matrix.
v. Select k dimensions from the $S$ and $V^T$ matrices and multiply $S_k$ and $V_k^T$:

$B_k = S_k V_k^T$

Given $B_k$, the similarity of documents is obtained from the cosine similarities.

V. EXPERIMENTAL RESULTS

A. Standard File Operations

In order to evaluate the WebFS prototype, we used the Postmark benchmark [11]. WebFS maintains its own metadata to provide new features, so it takes more time to handle file operations than legacy file systems. In addition, because WebFS runs in user space by means of FUSE, it is no wonder that WebFS is slower than legacy file systems running in kernel space, such as EXT2/EXT3.
To compare the performance of the WebFS prototype fairly with legacy file systems and with other file systems using FUSE, we set up the benchmark test configuration as in Table III. A disk volume formatted with EXT3 is used to represent legacy file systems. To measure the overhead of FUSE, we implemented a file system in user space that simply executes the legacy file system operations through basic kernel system calls.
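
To make the idea of such a baseline concrete, here is a minimal sketch of a passthrough layer that forwards each operation to an underlying directory through ordinary OS calls. It deliberately omits the FUSE binding itself, so it is not a mountable file system, and it is not the authors' code; the Passthrough class and its method signatures are assumptions chosen to resemble a few of the operation names in Table II.

```python
import os


class Passthrough:
    """Forwards file operations unchanged to a backing directory."""

    def __init__(self, root):
        self.root = os.path.realpath(root)

    def _full(self, path):
        # Map a path inside the passthrough volume to the backing directory.
        return os.path.join(self.root, path.lstrip("/"))

    def create(self, path, mode=0o644):
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        return os.open(self._full(path), flags, mode)

    def open(self, path, flags=os.O_RDONLY):
        return os.open(self._full(path), flags)

    def read(self, fd, size, offset):
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, size)

    def write(self, fd, data, offset):
        os.lseek(fd, offset, os.SEEK_SET)
        return os.write(fd, data)

    def release(self, fd):
        os.close(fd)

    def unlink(self, path):
        os.unlink(self._full(path))

    def readdir(self, path):
        return sorted(os.listdir(self._full(path)))
```

Because every method does nothing beyond the underlying call, any time difference between this kind of layer and direct access to the backing volume can be attributed to the forwarding path rather than to extra metadata work.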

TABLE IV
POSTMARK TEST CONFIGURATION

Case    Number of initial files    Number of transactions    Range of file size
1       1000                       10000                     3MB
2       1000                       10000                     100KB to 1MB
3       1000                       10000                     100KB to 3MB

TagFS [8] is a database file system that uses FUSE and uses SQLite to store files and metadata. WebFS currently maintains file metadata in system memory. The Postmark benchmark is configured in three ways for each test benchmark set; Table IV shows the configurations. The system test results are shown in Fig. 6. Each result shows transactions per second. Create and read operations are shown in the upper results, respectively. As shown in the results, for configurations 1 and 3, the performance degradation of WebFS is not significant. For configuration 2, the performance of WebFS decreases a little, but it is still competitive with EXT3 with FUSE, and such degradation is tolerable considering the smart features of WebFS. In most cases, WebFS even performs better than TagFS, which uses a database.

Figure 6. System test results
[Plots: create, read and total transactions per second for cases 1 to 3, comparing EXT3, EXT3+FUSE, EXT3+FUSE+TagFS and EXT3+FUSE+WebFS.]

B. Feature Handling

We select 35 topics from the TREC Blog Track [12] and extract 30 documents for each topic for the experiments. 2000 documents that do not relate to these topics are also included, so the total number of documents for the experiments is 3050. We measure the number of relevant documents among the retrieved documents. Our evaluation compares the LRU algorithm and the ARC algorithm for managing the topic list. The cutoff indicates how many of the retrieved files are shown. When the cutoff is 30, precision and recall have the same value, because each topic has 30 documents. As the cutoff point increases, recall also increases but precision decreases, because the retrieved documents include not only relevant but also irrelevant documents. Fig. 7 shows the precision and recall levels of the two algorithms at various cutoff points. The precision and recall of the ARC method are about 10% higher than those of the LRU method. The LRU method removes a topic that has not been used recently even if it was frequently used in the past, whereas the ARC method is less likely to do so. The ARC method is a more efficient algorithm than LRU for utilizing resources and retrieving more relevant documents.

Figure 7. TREC data results
[Plots: precision and recall versus cutoff (10, 20, 30) for ARC and LRU.]

VI. CONCLUSION AND FUTURE WORKS

Although storage capacity in CE devices is increasing rapidly, solutions for managing files stored in portable devices remain at the application level. In this paper, we present a file-system-based solution for file management on portable devices. WebFS extends the common file metadata of traditional file systems and extracts extended file metadata, such as important keywords, related subjects and file access order. Based on the extended file metadata, WebFS determines various inter-file relationships and lets users retrieve files in a convenient way. The feasibility of WebFS is verified at the file system level by means of the FUSE framework. In the future, WebFS will play an important role in presenting a direction for defining and developing a new file system for mobile devices.

REFERENCES

[1] Google, "Desktop search." http://desktop.google.com.
[2] Apple, "Spotlight." http://en.wikipedia.org/wiki/Spotlight_(software).
[3] Mediawiki, "Beagle project." http://beagle-project.org/Main_Page.
[4] Microsoft, "Windows future storage." http://msdn.microsoft.com/en-us/library/cc836634.aspx.
[5] D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O. Jr., "Semantic file systems," ACM Symposium on Operating System Principles, vol. 25, no. 5, pp. 16-26, 1991.
[6] S. Ames, N. Bobb, S. A. Brandt, A. Hiatt, C. Maltzahn, E. L. Miller, A. Neeman, and D. Tuteja, "Richer file system metadata using links and attributes," in Proceedings of the IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[7] S. Ames, N. Bobb, K. Greenan, O. Hofmann, M. W. Storer, C. Maltzahn, E. L. Miller, and S. A. Brandt, "LiFS: An attribute-rich file system for storage class memories," in Proceedings of the IEEE/NASA Goddard Conference on Mass Storage Systems and Technologies, 2006.
[8] S. Bloehdorn, O. Görlitz, S. Schenk, and M. Völkel, "TagFS - tag semantics for hierarchical file systems," in Proceedings of the International Conference on Knowledge Management, pp. 6-8, 2006.
[9] M. Szeredi, "Filesystems in userspace." http://fuse.sourceforge.net.
[10] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Processes, vol. 25, pp. 259-284, 1998.
[11] J. Katcher, "Postmark filesystem benchmark." http://www.netapp.com/tech_library/3022.html.
[12] TREC, "Text retrieval conference." http://trec.nist.gov.
