(Chapter 4)
Signature Files
Characteristics
Word-oriented index structures based on hashing.
Low overhead (10%~20% over the text size), at the cost of forcing a sequential search over the index.
Suitable for texts that are not very large.
Inverted files outperform signature files for most applications.
Structure
Use superimposed coding to create the signature.
Each text is divided into logical blocks.
A block contains n distinct non-common words.
Each word yields a word signature.
A word signature is a B-bit pattern with m bits set to 1.
Each word is divided into successive, overlapping triplets, e.g. free --> " fr", "fre", "ree", "ee " (the first and last triplets include the surrounding blank).
Each such triplet is hashed to a bit position.
The word signatures are ORed to form the block signature.
Block signatures are concatenated to form the document signature.
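The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signature length B = 16, the use of SHA-256 as the triplet hash, and the function names are all hypothetical choices, not taken from the slides. Note also that hashing triplets sets at most one bit per triplet, so the number of 1s per word is not fixed at m here.

```python
import hashlib

B = 16  # signature length in bits (assumed value, for illustration only)

def triplets(word):
    """Successive overlapping triplets, with one blank padded on each side."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def bit_position(triplet, nbits=B):
    """Hash a triplet to a bit position in [0, nbits). SHA-256 is an assumption."""
    digest = hashlib.sha256(triplet.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % nbits

def word_signature(word, nbits=B):
    """Set one bit per triplet; collisions may leave fewer 1s than triplets."""
    sig = 0
    for t in triplets(word):
        sig |= 1 << bit_position(t, nbits)
    return sig

def block_signature(words, nbits=B):
    """OR (superimpose) the word signatures of one logical block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, nbits)
    return sig

if __name__ == "__main__":
    block = ["free", "text", "search"]
    print(triplets("free"))
    print(format(block_signature(block), f"0{B}b"))
```

Because the block signature is a bitwise OR, every bit set in a word signature is also set in the signature of any block containing that word.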
Example
Search
Use the hash function to determine the m 1-bit positions of the search word.
Examine each block signature: a block qualifies if it has a 1 in every bit position where the signature of the search word has a 1.
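A minimal sketch of this sequential search follows. The hash function (m positions taken from a SHA-256 digest), the values B = 64 and m = 3, and all names are assumptions made for the example. The blocks returned are only candidates, since a qualifying signature does not guarantee that the word is really present.

```python
import hashlib

def word_signature(word, B=64, m=3):
    """Hypothetical hash: derive m bit positions from a SHA-256 digest."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    sig = 0
    for i in range(m):
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % B
        sig |= 1 << pos  # two positions may coincide, giving fewer than m 1s
    return sig

def block_signature(words, B=64, m=3):
    """Superimpose the word signatures of one block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, B, m)
    return sig

def search(block_signatures, word, B=64, m=3):
    """Scan every block signature; keep blocks that cover all query 1s."""
    q = word_signature(word, B, m)
    return [i for i, s in enumerate(block_signatures) if (s & q) == q]

if __name__ == "__main__":
    blocks = [["free", "text"], ["signature", "file"], ["free", "index"]]
    sigs = [block_signature(b) for b in blocks]
    print(search(sigs, "free"))  # candidate block numbers, to be verified against the text
```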
False Drop
For a given value of B, the value of m that minimizes the false drop probability is such that each row of the matrix contains 1s with probability 0.5.
$F_d = 2^{-m}$, where $m = B \ln 2 / n$
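To get a feel for these two formulas, the short sketch below evaluates them for sample parameters. The values B = 512 and n = 40 are made up for illustration, and in practice m would be rounded to an integer.

```python
import math

def optimal_m(B, n):
    """Number of 1 bits per word that minimizes false drops: m = B ln 2 / n."""
    return B * math.log(2) / n

def false_drop_probability(m):
    """False drop probability at the optimum: Fd = 2^(-m)."""
    return 2.0 ** (-m)

if __name__ == "__main__":
    B, n = 512, 40  # assumed block signature size and words per block
    m = optimal_m(B, n)
    print(f"m = {m:.2f}, Fd = {false_drop_probability(m):.5f}")
```

Doubling B for the same n doubles the optimal m and therefore squares the false drop probability.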
Vertical partitioning
Storing the signature matrix column-wise improves the response time at the expense of insertion time.
Horizontal partitioning
Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
with compression
bit-block compression (BC)
variable bit-block compression (VBC)
with compression
compressed bit slices (CBS)
doubly compressed bit slices (DCBS)
no-false-drop method (NFD)
Horizontal partitioning
data independent partitioning: Gustafson's method, partitioned signature files
data dependent partitioning: 2-level signature files, S-trees
Criteria
the storage overhead
the response time on single word queries
the performance on insertion, as well as whether the insertion maintains the append-only property
Compression
idea
Create sparse document signatures on purpose. Compress them before storing them sequentially.
Method
Use a B-bit vector, where B is large.
Hash each word into one (or k) bit position(s).
Use run-length encoding (McIlroy 1982).
The signature is stored as [L1] [L2] [L3] [L4] [L5], where Li is the length of the i-th interval (run of 0s) and [x] is the encoded value of x.
search: decode the encoded lengths of all the preceding intervals
example: search for "data"
(1) data ==> 0000 0000 0000 0010 0000
(2) decode [L1]=0000, decode [L2]=00, decode [L3]=000000
disadvantage: search becomes slow
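The search procedure can be sketched as follows. As a simplifying assumption, the interval lengths are kept as plain integers rather than in a variable-length code [x], and the sample document signature is made up so that its first three intervals match the example (4, 2 and 6 zeros).

```python
def encode_runs(bits):
    """Return the lengths of the runs of 0s that precede each 1.

    Trailing 0s after the last 1 are dropped; the signature length
    would be needed to restore them.
    """
    runs, count = [], 0
    for b in bits.replace(" ", ""):
        if b == "1":
            runs.append(count)
            count = 0
        else:
            count += 1
    return runs

def has_bit(runs, target):
    """Decode the intervals left to right until position `target` (0-based)."""
    pos = -1
    for length in runs:
        pos += length + 1      # skip the 0s, land on the next 1
        if pos == target:
            return True
        if pos > target:       # passed the target: that bit is 0
            return False
    return False

if __name__ == "__main__":
    doc = "0000 1001 0000 0010 0000"    # hypothetical document signature
    query = "0000 0000 0000 0010 0000"  # signature of "data" from the example
    runs = encode_runs(doc)
    target = query.replace(" ", "").index("1")
    print(runs)                  # [4, 2, 6]
    print(has_bit(runs, target)) # True
```

Every lookup walks the intervals from the start of the signature, which is why search becomes slow.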
Vertical Partitioning
idea: avoid bringing useless portions of the document signature into main memory
methods
store the signature file in a bit-sliced form or in a frame-sliced form
store the signature matrix column-wise to improve the response time at the expense of insertion time
[Figure: the signature matrix, one row per document, is transposed and represented as F bit-files.]
search:
(1) retrieve m bit-files. e.g., the word signature of free is 001 000 110 010; in a document that contains free, the 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) retrieve the text file through the pointer file.
insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
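The bit-sliced search can be illustrated with a small in-memory sketch. Each bit-file is modelled as a Python integer whose d-th bit belongs to document d; the three document signatures are invented for the example, and only the word signature of free comes from the text above.

```python
def to_bit_slices(doc_signatures, F):
    """Transpose N signatures (strings of F bits) into F slices of N bits each."""
    slices = [0] * F
    for doc_id, sig in enumerate(doc_signatures):
        for j, bit in enumerate(sig):
            if bit == "1":
                slices[j] |= 1 << doc_id
    return slices

def search(slices, word_sig, n_docs):
    """AND only the slices where the word signature has a 1."""
    result = (1 << n_docs) - 1          # start with every document qualifying
    for j, bit in enumerate(word_sig):
        if bit == "1":
            result &= slices[j]
    return [d for d in range(n_docs) if (result >> d) & 1]

if __name__ == "__main__":
    docs = [
        "001000110010",  # has bits 3, 7, 8, 11 set (1-based)
        "000000110010",  # bit 3 is missing
        "101010110011",  # superset of the query bits
    ]
    free = "001000110010"  # word signature of "free" from the example
    slices = to_bit_slices(docs, F=12)
    print(search(slices, free, n_docs=len(docs)))  # [0, 2]
```

Only 4 of the 12 slices are read for this query, whereas inserting a document touches all 12.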
Ideas
random disk accesses are more expensive than sequential ones
force each word to hash into bit positions that are closer to each other in the document signature
these bit files are stored together and can be retrieved with a few random accesses
Procedures
The document signature (F bits long) is divided into k frames of s consecutive bits each. For each word in the document, one of the k frames will be chosen by a hash function. Using another hash function, the word sets m bits in that frame.
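A minimal sketch of this procedure, assuming two hash functions derived from SHA-256 with different salts; the parameter values (k = 4 frames, s = 8 bits per frame, m = 2) and the function names are hypothetical.

```python
import hashlib

def _h(word, salt):
    """Hypothetical salted hash, used to obtain independent hash functions."""
    data = f"{salt}:{word}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def frame_of(word, k):
    """First hash function: choose one of the k frames."""
    return _h(word, "frame") % k

def word_signature(word, k, s, m):
    """Return (frame index, s-bit pattern with up to m 1s inside that frame)."""
    pattern = 0
    for i in range(m):
        pattern |= 1 << (_h(word, f"bit{i}") % s)
    return frame_of(word, k), pattern

def document_signature(words, k, s, m):
    """A document signature is a list of k frames, each an s-bit integer."""
    frames = [0] * k
    for w in words:
        f, p = word_signature(w, k, s, m)
        frames[f] |= p
    return frames

def may_contain(frames, word, k, s, m):
    """Single word query: only the word's own frame needs to be examined."""
    f, p = word_signature(word, k, s, m)
    return (frames[f] & p) == p

if __name__ == "__main__":
    k, s, m = 4, 8, 2
    doc = document_signature(["free", "text", "retrieval"], k, s, m)
    print(frame_of("free", k), may_contain(doc, "free", k, s, m))
```

Since all bits of a word fall in one frame, a frame-sliced layout can answer a single word query by reading just that frame's slices.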
FSSF (Continued)
Search
Only one frame has to be retrieved for a single word query, i.e., only one random disk access is required.
e.g., search for documents that contain the word free: because the word signature of free is placed in the 2nd frame, only the 2nd frame has to be examined.
At most k frames have to be scanned for a k-word query.
Insertion
Only k frames have to be accessed instead of F bit-slices.
idea
create a very sparse signature matrix
store it in a bit-sliced form
compress each bit slice by storing the positions of the 1s in the slice
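The idea can be sketched by keeping, for every bit slice, only the list of document numbers whose signature has a 1 in that slice. In this illustration each sparse document signature is given directly as a set of bit positions; the data and the function names are assumptions.

```python
def compress_slices(doc_signatures, F):
    """For each of the F slices, store the document ids that have a 1 there."""
    slices = [[] for _ in range(F)]
    for doc_id, positions in enumerate(doc_signatures):
        for j in positions:
            slices[j].append(doc_id)  # ids arrive in increasing order
    return slices

def candidates(slices, query_positions):
    """Intersect the compressed slices selected by the query's 1 positions."""
    result = None
    for j in query_positions:
        ids = set(slices[j])
        result = ids if result is None else result & ids
    return sorted(result) if result is not None else []

if __name__ == "__main__":
    # Hypothetical sparse signatures: the set bit positions of each document.
    docs = [{2, 9}, {2, 5}, {9}]
    slices = compress_slices(docs, F=12)
    print(slices[2], slices[9])        # [0, 1] [0, 2]
    print(candidates(slices, [2, 9]))  # [0]
```

Empty slices cost almost nothing in this form, which is what makes a very sparse matrix affordable.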
Insertion
Requires too many disk accesses (equal to F, which is typically 600-1000).
Let m=1. To maintain the same false drop probability, F has to be increased.
To compress each bit file, we store only the positions of the 1s.
Because the number of 1s is unpredictable, we store them in buckets of size Bp.
Hash a word to obtain the bucket address, e.g. h(base)=30.
Obtain the pointers to the relevant documents from the buckets.
A second hash function distinguishes synonyms partially, e.g. h1(base)=30, h2(base)=011.
Follow the pointers of the posting buckets to retrieve the qualifying documents.
Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
Use a portion of a document signature as a signature key to partition the signature file.
All signatures with the same key are grouped into a so-called module.
When a query signature arrives:
examine its signature key and look for the corresponding modules
scan all the signatures within the selected modules
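A small sketch of this partitioning follows, under two assumptions that the slides do not state: the signature key is taken to be the leading bits of the signature, and a module is taken to correspond to a query when its key has a 1 wherever the query's key has a 1. Signatures are written as bit strings, and all names and data are made up.

```python
from collections import defaultdict

def build_modules(signatures, key_bits):
    """Group signatures by their key (assumed: the first key_bits bits)."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        modules[sig[:key_bits]].append((doc_id, sig))
    return modules

def covers(sig, query):
    """True if sig has a 1 in every position where query has a 1."""
    return all(s == "1" for s, q in zip(sig, query) if q == "1")

def search(modules, query, key_bits):
    """Select modules by key, then scan only the signatures inside them."""
    query_key = query[:key_bits]
    hits = []
    for key, members in modules.items():
        if covers(key, query_key):  # other modules cannot contain a match
            hits.extend(d for d, sig in members if covers(sig, query))
    return sorted(hits)

if __name__ == "__main__":
    sigs = ["1010", "1001", "0110", "1110"]
    modules = build_modules(sigs, key_bits=2)
    print(search(modules, "1000", key_bits=2))  # [0, 1, 3]
```

In this run the module with key 01 is skipped entirely, which is where the better-than-linear behaviour comes from.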