Submitted By
Ch V. V. D Prasad
2010-14
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
FOR WOMEN
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that the project report “TO MAINTAIN PRIVACY IN
PUBLISHING SEARCH LOGS BY IMPLEMENTING ZEALOUS ALGORITHM”
is a bona fide work of the following IV/IV B.Tech students in the Department of Information
Technology of JNT University, Kakinada during the academic year 2013-14, in partial
fulfillment of the requirement for the award of the degree of Bachelor of Technology of this
university.
External Examiner
ACKNOWLEDGEMENT
No volume of words is enough to express our deep gratitude towards our guide
Mr Ch. V. V. Prasad, Assistant Professor, Computer Science and Engineering, Gayatri
Vidya Parishad College of Engineering for Women, who has been very supportive and has
provided us with all the materials essential for the project work and the preparation of this
project report. Our guide helped us explore this vast topic in an organized manner and
provided us with ideas on how to approach this project.
We would also like to thank other faculty and staff members who were always there
in the hour of need and provided us with all the help and facilities which we required for the
completion of this project.
Most importantly we would like to thank our parents and friends for their support.
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SCREENS
SYMBOLS & ABBREVIATIONS
1. INTRODUCTION
1.1 Motivation
1.2 Problem Definition
1.3 Objective of Project
1.4 Limitations of Project
1.5 Organization of Documentation
2. LITERATURE SURVEY
2.1 Introduction
2.2 Existing System
2.3 Disadvantages of Existing System
2.4 Proposed System
2.5 Conclusion
3. ANALYSIS
3.1 Introduction
3.2 Software Requirement Specification
3.2.1 User requirement
3.2.2 Software requirement
3.2.3 Hardware requirement
3.3 Content diagram of Project
3.4 Algorithms
3.4.1 K-anonymity Algorithm
3.4.2 Zealous Algorithm
3.5 Conclusion
4. DESIGN
4.1 Introduction
4.2 UML diagrams
4.2.1 Use Case Diagram
4.2.2 Class Diagram
4.2.3 Sequence Diagram
4.2.4 Activity Diagram
4.3 Module design and organization
4.4 Conclusion
7. CONCLUSION
7.1 Project Conclusion
7.2 Future Enhancement
8. REFERENCES
ABSTRACT
Many search engines maintain server-side databases that record the history of their users'
search queries. These search logs serve as gold mines for researchers and statistical analysts.
Many companies use search log information for vital purposes, and care must be taken to
avoid disclosing the sensitive parts of users' search queries. In this project, the Zealous
algorithm is implemented and an experimental study is conducted to show how the algorithm
can be used to publish frequently searched queries and keywords. Traditional methods such as
k-anonymity, which are vulnerable to attacks, are compared with the Zealous algorithm,
which provides a stronger guarantee for this problem. The report concludes with a larger
experimental study on generalized real-time applications comparing Zealous with previous
work, showing that Zealous yields comparable utility while achieving stronger privacy.
LIST OF FIGURES
1.1 MOTIVATION
A basic requirement today is to provide strong security to users while they use a
search engine. It is a tedious task for the administrator to provide security without
revealing users' personal information. With traditional algorithms this has become difficult
because of the lack of proper security measures. To overcome these drawbacks, this project
focuses on implementing newer algorithms with enhanced security features.
Data mining is the conceptual framework that has helped to define our project in a
more generalized way. Traditional methods deal with micro-aggregation, focusing on
clustering and grouping the required data sets based on the search query. To provide greater
security for user-related search logs, various algorithms have been compared and
implemented.
A search log mainly consists of user-related information such as IP address,
timestamp, keyword, and number of clicks. All data related to the user needs to be protected
and kept in a secure manner. There is a strong need for new algorithms that protect the
personal information contained in users' queries when the search log is published for
research and analysis.
Only an authenticated user can view the results of the query that was entered.
The results of a query are displayed only if the details of the corresponding query
are present in the database.
A concise overview of the rest of the documentation is given below. The
purpose of this document is to present the final report of our project “To Maintain privacy in
publishing search logs by implementing Zealous Algorithm” in Data Mining. This
document is divided into various chapters.
Chapter 2: Literature survey describes the primary terms involved in the development of this
application. It also gives an overview of the existing system, the disadvantages of the existing
system and the features of the proposed system.
Chapter 3: Analysis deals with the detailed analysis of the project. It covers the Software
Requirement Specification, which contains the user requirement analysis, software requirement
analysis, and hardware requirement analysis. It also includes the content diagram of the
project, the algorithms and the flowcharts.
Chapter 4: Design includes the UML diagrams along with an explanation of the module design
and organization.
Chapter 5: Contains the implementation of the system along with an explanation of key
functions and screenshots of the outputs.
Chapter 6: Gives the testing and validation details, with the design of test cases and scenarios
along with validation screenshots.
2.1 INTRODUCTION
Search engines play a crucial role in navigating the vastness of the web.
Today’s search engines mine information about their users. They store queries, clicks,
IP addresses, and other information about their interactions with users in what is called a
search log. Because search logs contain valuable information, they can be used in the
development and testing of new algorithms to improve performance and quality. Scientists
and marketing companies all around the world would like to tap this gold mine for their own
research; search engine companies, however, do not release search logs because they contain
sensitive information about their users.
Search engines such as Bing, Google, and Yahoo log interactions with their users.
When a user submits a query and clicks on one or more results, a new entry is added to the
search log. Without loss of generality, we assume that a search log has the following schema:
{USER-ID, QUERY, TIME, CLICKS}, where USER-ID identifies a user, QUERY is a
set of keywords, TIME is the timestamp and CLICKS is a list of URLs that the user clicked
on.
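The schema above can be sketched as a small record type. This is only an illustration: the class name `SearchLogEntry`, the helper `make_entry`, the field names and the choice to split a query on whitespace are assumptions made for the example and do not come from the project code.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SearchLogEntry:
    """One search log row following {USER-ID, QUERY, TIME, CLICKS}."""
    user_id: str          # USER-ID: identifies the user
    query: frozenset      # QUERY: set of keywords
    time: datetime        # TIME: timestamp of the query
    clicks: tuple = ()    # CLICKS: URLs the user clicked on


def make_entry(user_id, query_text, time, clicks=()):
    """Build an entry, splitting the query text into lower-case keywords."""
    keywords = frozenset(query_text.lower().split())
    return SearchLogEntry(user_id, keywords, time, tuple(clicks))


entry = make_entry(
    "user-1",
    "Privacy search logs",
    datetime(2014, 1, 1, 10, 0),
    ["http://example.com/a"],
)
```

Storing the query as a set of keywords mirrors the definition of QUERY in the schema; a real system might keep the raw query string as well.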
The existing k-anonymity algorithm provides some privacy for search logs, but it is
insufficient against attackers who can actively influence the search log.
2.5 CONCLUSION
This chapter explains the literature survey and its importance for documentation
purposes, along with the motivation that drove the design of this project. It portrays the
features of communication mechanisms and their security. It also discusses the limitations
of the existing system and the features that are to be added to it to obtain a more efficient
system, i.e., the proposed system.
3. ANALYSIS
3.1 INTRODUCTION
Processor : Pentium IV
Hard Disk : 40 GB
RAM : 3.86 GB
The user enters a query through the user interface of the search engine. The search
engine fetches the results of the query from the database and displays them to the user.
Whenever a user enters a query, the search engine updates the search log that it maintains
for each user. The search log contains the username, the query searched, the clicks
(number of clicks for a particular query) and the timestamp. This search log is maintained in
the database, and only the owners (admin) have the right to access the database. Our project
describes how to maintain privacy for this search log, which contains users' personal
information.
3.4 ALGORITHMS
In this project, we implement two algorithms. They are:
1. k-anonymity and
2. Zealous
3.4.1 k-anonymity Algorithm:
Input:
k: integer (generated randomly).
begin
end
Output:
Protected data set.
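The steps of the k-anonymity algorithm are not listed above, so the following is only a generic illustration of the k-anonymity idea applied to queries: a query is published only if at least k distinct users issued it. The function name `k_anonymous_queries` and the (user, query) input format are assumptions made for this sketch; it is not the implementation used in the project.

```python
from collections import defaultdict


def k_anonymous_queries(log, k):
    """Return {query: number of distinct users} for queries with >= k users.

    `log` is an iterable of (user_id, query) pairs (assumed format).
    """
    users_per_query = defaultdict(set)
    for user_id, query in log:
        users_per_query[query].add(user_id)
    # Suppress every query that fewer than k distinct users have issued.
    return {
        query: len(users)
        for query, users in users_per_query.items()
        if len(users) >= k
    }


sample_log = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u1", "a")]
published = k_anonymous_queries(sample_log, k=2)
```

A rule of this kind illustrates the weakness discussed in the literature survey: an attacker who controls k accounts can submit a rare query from each of them and push it over the threshold.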
3.4.2 Zealous Algorithm
Input:
m: maximum no. of items from each user,
τ: first threshold value (generated randomly) and
τʹ: second threshold value (generated randomly)
Step1: For each user u, select a set su of up to m distinct items from the search history of u.
Step2: Based on the selected items, create a histogram consisting of pairs (k, ck), where k
denotes an item (keyword or query) and ck denotes the number of users u that have k
in their selected set su. We call this histogram the Original histogram.
Step3: Delete from the histogram the pairs (k, ck) with count ck smaller than τ (first
threshold value).
Step4: For each pair (k, ck) in the histogram, sample a random number nk and add nk to the
count ck, resulting in a noisy count ckʹ ← ck + nk.
Step5: Delete from the histogram the pairs (k, ckʹ) with noisy counts ckʹ ≤ τʹ.
Step6: Publish the remaining items and their noisy counts (Sanitized histogram).
Output:
Protected Histogram.
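The six steps above can be sketched in Python as follows. Several details are assumptions made for this example, not statements about the project code: the function names `zealous` and `laplace_noise`, the input format (a dict from user to a list of items), taking the first m distinct items in Step 1, and drawing the random number nk from a Laplace distribution with a caller-supplied `noise_scale` (the steps only say "a random number").

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def zealous(search_histories, m, tau, tau_prime, noise_scale, rng=None):
    """Return the sanitized histogram {item: noisy count}."""
    if rng is None:
        rng = random.Random()
    # Step 1: keep up to m distinct items per user (first m is an assumption).
    selected = {
        user: list(dict.fromkeys(items))[:m]
        for user, items in search_histories.items()
    }
    # Step 2: original histogram, counting users per item.
    histogram = Counter(item for items in selected.values() for item in items)
    # Step 3: drop items whose count is smaller than tau.
    kept = {k: c for k, c in histogram.items() if c >= tau}
    # Step 4: add a random number to each remaining count.
    noisy = {k: c + laplace_noise(noise_scale, rng) for k, c in kept.items()}
    # Steps 5 and 6: publish only items whose noisy count exceeds tau_prime.
    return {k: c for k, c in noisy.items() if c > tau_prime}


histories = {"u1": ["a", "b", "a"], "u2": ["a", "c"], "u3": ["a", "b"]}
sanitized = zealous(histories, m=2, tau=2, tau_prime=2, noise_scale=1.0)
```

Because each user contributes at most m items, a single user can change at most m counts by one each. This is what allows a bounded amount of noise to hide any individual's contribution.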
3.5 CONCLUSION
This chapter starts with an introduction to analysis in general, followed by the Software
Requirement Specification, which contains the user requirements, i.e. what a user expects
from the system, the software requirements, i.e. the software required to implement
and run the system, and the hardware requirements for setting up and running the
project.
A content diagram briefly explaining the basic contents of the project and their
functionality is also included. The algorithms used in the development of the project, their
implementation details and the flowcharts explaining their working have also been
included.
4. DESIGN
4.1 INTRODUCTION
The design process for a software system has two levels. At the first level, the focus is
on which modules are needed for the system, the specification of these modules and how the
modules should be interconnected. This is called system design or top-level
design. At the second level, the internal design of the modules, that is, how the specification
of each module can be satisfied, is decided.
The first level produces the system design, which defines the components needed for the
system and how the components interact with each other.
UML stands for Unified Modeling Language. UML is a standard language for
specifying, visualising, constructing and documenting a software system and its
components.
UML is one of the most widely used tools in system development today.
UML enables system builders to create blueprints that capture their vision in a standard,
easy-to-understand way and to communicate it to others. The UML consists of a number of
graphical elements that combine to form diagrams.
The use case diagram models the interactions between the system’s external clients
and the use cases of the system. Each use case represents a different capability that the
system provides to the client. A stick figure represents an actor. Use case diagrams identify
the functionality provided by the system, the users who interact with the system (actors), and
the association between the users and the functionality. Use cases are used in the analysis
phase of software development to articulate the high-level requirements of the system.
[Use case diagram. Actors: User, Admin. Use cases: Login, Validation, Enters a query, Clicks on the results, Log is created, Provide privacy.]
A class diagram identifies the class structure of a system, including the properties and
methods of each class. It depicts the relationships that can exist between classes, such as an
inheritance relationship. The class diagram is one of the most widely used diagrams from
the UML specification.
Names: Every class should have a name that distinguishes it from other classes.
Attributes: An attribute is a named property of a class that describes a range of values that
instances of the property may hold.
Operations: An operation is the implementation of a service that can be requested from any
object of the class to affect its behavior.
[Sequence diagram. Messages: 1: Login(), 2: Authentication(), 3: Enters a query(), 4: Creates a log().]
[Activity diagram. Activities: Login, Validation, Enters a query, Creates a log, Log Updated.]
1. Query Substitution
2. Index Caching
3. Item Set Generation and Ranking
4.4 CONCLUSION
This chapter describes the design phase of the work. It includes the UML diagrams,
which represent how the work has to be coded. It also illustrates the procedure followed in
the implementation phase.
5. IMPLEMENTATION AND RESULTS
5.1 INTRODUCTION
Implementation is the part of the process where software engineers actually program
the code for the project.
The iterative models of the Software Development Life Cycle (SDLC) handle this stage
differently. The implementation stage in iterative models is under less pressure than in the
waterfall model. An iterative model focuses on creating prototypes right from the start, which
means that implementation takes place throughout the iterative model. However, these are
just stages in software development.
The advantage of this approach is that, when the software is implemented, it is more likely
to work according to the preferences of the users, since they have helped in the creation of
the software.
5.3.2.4 The search engine displays the results of the related query. The user clicks on any
one of the results.
5.3.2.5 Whenever the user clicks on a result, the database updates the count of the specific
query.
5.3.2.6 Both algorithms, i.e., k-anonymity and Zealous, are run by clicking on the button.
5.3.2.7 Histogram of the keywords against the number of users who have each keyword in
their search history.
5.3.2.10 Histogram after adding a sampled random number nk to each count ck.
5.3.2.11 Histogram after eliminating from the above histogram the pairs whose noisy counts
do not exceed τʹ (the second, randomly generated threshold value).
5.4 CONCLUSION
This chapter illustrates the implementation of the designed modules along with the results.
It describes what implementation means in the software development life cycle in general,
and then in the iterative model.
Key functions in the implementation phase are discussed. In the method of implementation,
forms of input such as browsing for a file present in the system and typing of a file are
explained with screenshots, and the outputs are displayed.
6. TESTING AND VALIDATION
6.1 INTRODUCTION
6.1.1 Testing
A strategy for software testing must accommodate low-level tests that are necessary to
verify that all small source code segments have been correctly implemented as well as high-
level tests that validate major system functions against customer requirements.
Testing is the process of executing a program with the intent of finding errors. A good test
case is one that has a high probability of finding an undiscovered error. If testing is conducted
successfully, it uncovers the errors in the software. Testing cannot show the absence of
defects; it can only show that software defects are present.
Black-Box Testing
White-Box Testing
Alpha Testing
Beta Testing
Unit Testing
Integration Testing
System Testing
Acceptance Testing
6.1.2.1 Black-Box Testing
Alpha testing is the software prototype stage, when the software is first able to run.
It will not have all of the intended functionality, but it will have the core functions and will
be able to accept inputs and generate outputs. An alpha test usually takes place in the
developer's offices on a separate system.
Unit testing is also called module testing. It is essentially used for verification of
the code produced by individual programmers, and is typically done by the programmer of
the module. Generally, a module is offered by a programmer for integration and use by others
only after it has been unit tested satisfactorily.
6.1.2.6 Integration Testing
In integration testing, many unit tested modules are combined into subsystems, which
are then tested. The goal here is to see if the modules can be integrated properly. Hence, the
emphasis is on testing interfaces between modules. This testing activity can be considered
testing the design.
6.1.2.7 System Testing
Testing and debugging programs is one of the most critical aspects of computer
programming; without programs that work, the system would never produce the
output for which it was designed. Testing is best performed when users are asked
to assist in identifying errors and bugs. It is the quality, not the quantity, of the test data
that matters.
6.1.2.8 Acceptance Testing
Acceptance testing is often performed with realistic data of the client to demonstrate
that the software is working satisfactorily. It may be done in the setting in which the software
is to eventually function. It essentially tests if the system satisfactorily solves the problems
for which it was commissioned.
The test plan focuses on how the testing for the project will proceed, which units will
be tested, and what approaches are to be used during the various stages of testing. Test case
specifications have to be done separately for each unit. Based on the approach specified in the
test plan, first the features to be tested for this unit must be determined. The overall approach
stated in the plan is refined into specific test techniques that should be followed and into the
criteria to be used for evaluation.
Test case specification gives, for each unit to be tested, all test cases, inputs to be used
in the test cases, conditions being tested by the test case, and outputs expected for those test
cases. Test case specification is a major activity in the testing process. Careful selection of
test cases that satisfy the criteria and approach specified is essential for proper testing.
6.3 CONCLUSION
This chapter begins with an explanation of testing and validation in the software
development life cycle in general. The design of test cases and different scenarios is given in
detail along with screenshots, followed by the validation of the generated test cases and
scenarios.
7. CONCLUSION
The project can be used in the development of algorithms that release useful
information about infrequent keywords, queries, and clicks in a search log while preserving
user privacy. Thus, searched user queries cannot be matched to a particular user.
8. REFERENCES