
TO MAINTAIN PRIVACY IN PUBLISHING SEARCH LOGS

BY IMPLEMENTING ZEALOUS ALGORITHM

A PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE


REQUIREMENTS FOR THE AWARD OF DEGREE OF BACHELOR OF
TECHNOLOGY IN INFORMATION TECHNOLOGY

Submitted By

S. Sanghee Sneha (10JG1A1245)

R. Soundarya Bhargavi (10JG1A1240)

R. Rajyalakshmi Srinija (10JG1A1238)

P. Mounika (10JG1A1255)

Under the Esteemed Guidance of

Ch V. V. D Prasad

DEPARTMENT OF INFORMATION TECHNOLOGY


GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
FOR WOMEN
(Affiliated to Jawaharlal Nehru Technological University Kakinada)

2010-14
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
FOR WOMEN
DEPARTMENT OF INFORMATION TECHNOLOGY

CERTIFICATE
This is to certify that the project report “TO MAINTAIN PRIVACY IN
PUBLISHING SEARCH LOGS BY IMPLEMENTING ZEALOUS ALGORITHM”
is the bonafide work of the following IV/IV B.Tech students in the Department of Information
Technology of JNT University, Kakinada during the academic year 2013-14, in partial
fulfillment of the requirement for the award of the degree of Bachelor of Technology of this
university.

Ms S. Sanghee Sneha(10JG1A1245) Ms R. Soundarya Bhargavi(10JG1A1240)

Ms R. Rajyalakshmi Srinija(10JG1A1238) Ms P. Mounika(10JG1A1255)

Internal Guide Head of the Department


CH. V. V. D PRASAD C. SRINIVAS

External Examiner
ACKNOWLEDGEMENT

No volume of words is enough to express our deep gratitude towards our guide
Mr Ch. V. V. D Prasad, Assistant Professor, Computer Science and Engineering, Gayatri
Vidya Parishad College of Engineering for Women, who has been very supportive and has
aided us with all the materials essential for the project work and the preparation of this project
report. He helped us in exploring this vast topic in an organized manner and provided us
with all the ideas on how to work towards this project.

We are also thankful to Prof. Dr. K. A. Gopalarao, Principal;
Prof. Dr. C. Bhaskara Sharma, Director; Prof. Dr. G. Sudheer, Vice Principal; and
Mr. C. Srinivas, Head of the Department, CSE & IT, for the motivation and inspiration that
they rendered for this work.

We would also like to thank the other faculty and staff members who were always there
in our hour of need and provided us with all the help and facilities that we required for the
completion of this project.

Most importantly we would like to thank our parents and friends for their support.
CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SCREENS
SYMBOLS & ABBREVIATIONS

1. INTRODUCTION 1
1.1 Motivation 1
1.2 Problem Definition 1
1.3 Objective of Project 1
1.4 Limitations of Project 2
1.5 Organization of Documentation 2

2. LITERATURE SURVEY 3
2.1 Introduction 3
2.2 Existing System 3
2.3 Disadvantages of Existing system 3
2.4 Proposed System 3
2.5 Conclusion 4

3. ANALYSIS 5
3.1 Introduction 5
3.2 Software Requirement Specification 5
3.2.1 User requirement 5
3.2.2 Software requirement 5
3.2.3 Hardware requirement 6
3.3 Content diagram of Project 6
3.4 Algorithms 7
3.4.1 K-anonymity Algorithm 7
3.4.2 Zealous Algorithm 8
3.5 Conclusion 8
4. DESIGN 9
4.1 Introduction 9
4.2 UML diagrams 9
4.2.1 Use Case Diagram 10
4.2.2 Class Diagram 10
4.2.3 Sequence Diagram 11
4.2.4 Activity Diagram 12
4.3 Module design and organization 13
4.4 Conclusion 14

5. IMPLEMENTATION & RESULTS 15


5.1 Introduction 15
5.2 Explanation of Key functions 15
5.3 Method of Implementation 16
5.3.1 Sample Code 16
5.3.2 Output Screens 25
5.3.3 Result Analysis 30
5.4 Conclusion 30

6. TESTING & VALIDATION 31


6.1 Introduction 31
6.1.1 Testing 31
6.1.1.1 Strategies of Testing 31
6.1.1.2 Fundamentals of Testing 31
6.1.2 Types of Testing 31
6.1.2.1 Black-Box Testing 32
6.1.2.2 White-Box Testing 32
6.1.2.3 Alpha Testing 32
6.1.2.4 Beta Testing 32
6.1.2.5 Unit Testing 32
6.1.2.6 Integration Testing 32
6.1.2.7 System Testing 33
6.1.2.8 Acceptance Testing 33
6.2 Design of test cases and scenarios 33
6.3 Conclusion 36

7. CONCLUSION 37
7.1 Project Conclusion 37
7.2 Future Enhancement 37

8. REFERENCES 38
ABSTRACT

Many search engines maintain server databases that record the history of their users'
search queries. These search logs serve as gold mines to researchers and statistical analysts.
Many companies use this search log information for vital purposes, and care must be taken to
avoid disclosing the sensitive parts of the users' search queries. In this project a Zealous
algorithm is implemented and an experimental study is conducted to show how this algorithm
can be used to publish frequently searched queries and keywords. The traditional k-anonymity
method, which is vulnerable to attacks, is compared with the more assured Zealous algorithm,
which provides stronger guarantees for this problem. The report concludes with a larger
experimental study on generalized real-time applications comparing Zealous with previous
work, showing that Zealous yields comparable utility while achieving stronger privacy.
LIST OF FIGURES

Figure 3.3 Content Diagram of Project 6

Figure 4.2.1 Use Case Diagram 10

Figure 4.2.2 Class Diagram 11

Figure 4.2.3 Sequence Diagram 12

Figure 4.2.4 Activity Diagram 13


LIST OF TABLES

Table 6.2.1 Test Case Login 33


Table 6.2.2 Test Case Authentication 34
LIST OF SCREENS

Screen 5.3.2.1 Login Page 25

Screen 5.3.2.2 Message for authenticated users 25

Screen 5.3.2.3 Entering a query 26

Screen 5.3.2.4 Displaying the results of related query 26

Screen 5.3.2.5 Updates the query 27

Screen 5.3.2.6 Generating of algorithms 27

Screen 5.3.2.7 Original Histogram 28

Screen 5.3.2.8 Eliminating Duplicates 28

Screen 5.3.2.9 Eliminating histogram below the threshold value 29

Screen 5.3.2.10 Adding Noise 29

Screen 5.3.2.11 Sanitized Histogram 30


SYMBOLS & ABBREVIATIONS

SDLC: Software Development Life Cycle or Systems Development Life Cycle

UML: Unified Modeling Language


1. INTRODUCTION

1.1 MOTIVATION

A basic scenario in the present day is providing strong security to users while they use
a search engine. It is quite a tedious task for the administrator to provide security without
revealing the users' personal information. With traditional algorithms this has become quite
difficult, due to the lack of proper security measures. In order to overcome these drawbacks,
this project mainly focuses on implementing new algorithms with enhanced security features.

Data mining is the conceptual theory that has helped to define our project in a more
generalized way. Traditional methods deal with microaggregation, focusing more on
clustering and grouping the required data sets based on the search query. To enhance and
provide greater security for the user-related search log, various algorithms have been
compared and implemented.

1.2 PROBLEM DEFINITION

A search log mainly consists of user-related information such as the IP address,
timestamp, keywords and number of clicks. Every piece of data related to the user needs to be
protected and kept in a secure manner. There is a huge necessity and demand for new
algorithms that protect the personal information in users' queries while the search log is
published for research and analysis.

1.3 OBJECTIVE OF PROJECT


The main objective of this project is to provide security while publishing a search log
to analysts so that search quality can be improved. The first part of the project shows how the
traditional k-anonymity algorithm, though it provides some privacy, has a major drawback
that must be overcome. The next phase of the project deals mainly with the Zealous
algorithm, which provides a stronger security guarantee. The overall objective is to overcome
the drawback of k-anonymity and give users an assured level of privacy, which is made
possible by implementing Zealous.
1.4 LIMITATIONS OF THE PROJECT

 Only authenticated users can view the results of the query that was entered.
 The results of a query are displayed only if the details of the corresponding query
are present in the database.

1.5 ORGANISATION OF DOCUMENTATION

A concise overview of the rest of the documentation is given below. The purpose of this
document is to present the final report of our project “To Maintain Privacy in Publishing
Search Logs by Implementing Zealous Algorithm” in Data Mining. This document is
divided into the following chapters.

Chapter 2: Literature survey describes the primary terms involved in the development of this
application. It also gives an overview of the existing system, the disadvantages of the existing
system and the features of the proposed system.

Chapter 3: Analysis deals with a detailed analysis of the project. It presents the Software
Requirement Specification, which contains the user requirement, software requirement and
hardware requirement analyses. It also includes the content diagram of the project, the
algorithms and flowcharts.

Chapter4: Design includes UML diagrams along with explanation of Module design and
organization.

Chapter 5: Contains the implementation of the system along with an explanation of the key
functions and screenshots of the outputs.

Chapter 6: Gives the testing and validation details with the design of test cases and scenarios,
along with validation screenshots.

Chapter7: Contains project conclusion and future enhancements.


2. LITERATURE SURVEY

2.1 INTRODUCTION

Search engines play a crucial role in navigating the vastness of the web.
Today's search engines mine information about their users: they store the queries, clicks,
IP addresses and other information about their interactions with users, and this record is called
a search log. As search logs contain valuable information, they can be used in the development
and testing of new algorithms to improve search performance and quality. Scientists and
marketing companies all around the world would like to tap this gold mine for their own
research; search engine companies, however, do not release search logs because they contain
sensitive information about their users.

Search engines such as Bing, Google and Yahoo log their interactions with their users.
When a user submits a query and clicks on one or more results, a new entry is added to the
search log. Without loss of generality, we assume that a search log has the following schema
{USER-ID, QUERY, TIME, CLICKS}, where USER-ID identifies a user, QUERY is a
set of keywords, TIME is the timestamp and CLICKS is a list of URLs that the user clicked
on.
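
For illustration only, the schema above can be modeled as a small Java class; the class and field names below are our own and are not taken from the project code.

import java.util.List;

// Illustrative model of a single search log entry following the schema
// {USER-ID, QUERY, TIME, CLICKS}; all names here are hypothetical.
public class SearchLogEntry {
    private final String userId;        // identifies the user
    private final String query;         // the set of keywords submitted by the user
    private final long timestamp;       // time at which the query was submitted
    private final List<String> clicks;  // URLs the user clicked on for this query

    public SearchLogEntry(String userId, String query, long timestamp, List<String> clicks) {
        this.userId = userId;
        this.query = query;
        this.timestamp = timestamp;
        this.clicks = clicks;
    }

    public String getUserId() { return userId; }
    public String getQuery() { return query; }
    public long getTimestamp() { return timestamp; }
    public List<String> getClicks() { return clicks; }
}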

2.2 EXISTING SYSTEM


In the present system, the k-anonymity privacy algorithm is used to provide privacy
for the logs generated by search engines.

2.3 DISADVANTAGES OF EXISTING SYSTEM

The existing k-anonymity algorithm provides privacy for the search logs, but it is
insufficient against attackers who can actively influence the search log.

2.4 PROPOSED SYSTEM

In our project we implement the Zealous algorithm, which overcomes the drawbacks of
the existing k-anonymity algorithm. The Zealous algorithm ensures strong privacy with good
utility by following a two-phase framework. In the first phase, Zealous generates a
histogram of items from the input search log and then removes from the histogram the items
with frequencies below a threshold. In the second phase, Zealous adds noise to the histogram
counts and eliminates the items whose noisy frequencies are smaller than another threshold.
The resulting histogram (the sanitized histogram) is returned as the output of the Zealous
algorithm.

2.5 CONCLUSION

This chapter explains the meaning of the literature survey and its importance for
documentation purposes, along with the motivation that drove the design of this project. It
portrays the features of the mechanisms involved and their security. It also discusses the
limitations of the present system and the features that are to be added to the existing system
to obtain a more efficient system, i.e., the proposed system.
3. ANALYSIS

3.1 INTRODUCTION

Requirement analysis is the first phase in the software development process.

The main objective of this phase is to identify the problem and the system to be developed.
The later phases are strictly dependent on this phase, and hence the system analyst needs to be
clear and precise about the requirements gathered here. Any inconsistency in this phase will
lead to a lot of problems in the phases that follow. Hence there are several reviews before the
final copy of the analysis of the system to be developed is made. After the analysis is
completed, the system analyst submits the details of the system to be developed in the
form of a document called the requirement specification. Conceptually, requirements analysis
includes three types of activity:

 Eliciting requirements: The task of communicating with customers and users to


determine what their requirements are. This is sometimes also called requirements
gathering.
 Analyzing requirements: Determining whether the stated requirements are unclear,
incomplete, ambiguous, or contradictory, and then resolving these issues.
 Recording requirements: Requirements might be documented in various forms, such as
natural-language documents, use cases, user stories, or process specifications.

3.2 SOFTWARE REQUIREMENT SPECIFICATION

3.2.1 User Requirement


 The programmer needs to know the number of queries present in the back-end
beforehand for the easy generation of k-anonymity.
3.2.2 Software Requirement

 Operating System : Windows 7


 Front-end : HTML, JDK 1.7
 IDE : MyEclipse 8.6
 Back-end : Oracle
 Web server : Apache Tomcat
3.2.3 Hardware Requirement

 Processor : Pentium IV
 Hard Disk : 40 GB
 RAM : 3.86 GB

3.3 CONTENT DIAGRAM OF PROJECT

Fig 3.3 Content Diagram of Project

The user enters a query through the user interface of the search engine. The search
engine fetches the results of the related query from the database and displays them to the user.
Whenever a user enters a query in the search engine, the search engine maintains a search log
for that user. The search log contains the username, the query searched, the clicks
(number of clicks on a particular result) and the timestamp. This search log is maintained in the
database and only the owners (admin) have the rights to access it. Our project
describes how to maintain privacy for this search log, which contains the user's personal
information.
3.4 ALGORITHMS
In this project, we implement two algorithms:
1. k-anonymity and
2. Zealous
3.4.1 k-anonymity Algorithm:
Input:

X: total number of data sets and

k: integer(generated randomly).

begin

while (|X| ≥ 3*k) do{


Compute average record x of all records in X;
Consider the most distant record xr to the average record x;
Form a cluster around xr. The cluster contains xr together with the k-1 closest
records to xr;
Remove these records from data set X;
Find the most distant record xs from record xr;
Form a cluster around xs. The cluster contains xs together with the k-1 closest
records to xs;
Remove these records from data set X;
}
if (|X| ≥ 2*k) do{
Compute the average record x of all records in X;
Consider the most distant record xr to the average record x;
Form a cluster around xr. The cluster contains xr together with the k-1 closest
records to xr;
Remove these records from data set X;
}
Form a cluster with the remaining records;

end

Output:
Protected data set.
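
For readability, the clustering loop described above can be sketched in Java for one-dimensional numeric records. This sketch is only an illustration of the pseudocode under our own naming; it is not the sample code listed in Chapter 5.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Compact sketch of the k-anonymity clustering pseudocode for 1-D numeric records.
public class KAnonymityClusteringSketch {

    // Groups the records of X into clusters of size >= k (the last cluster may be larger).
    public static List<List<Double>> cluster(List<Double> X, int k) {
        List<Double> data = new ArrayList<>(X);
        List<List<Double>> clusters = new ArrayList<>();

        while (data.size() >= 3 * k) {
            double avg = average(data);
            double xr = farthestFrom(data, avg);        // most distant record from the mean
            clusters.add(extractCluster(data, xr, k));   // xr plus its k-1 closest records
            double xs = farthestFrom(data, xr);          // most distant record from xr
            clusters.add(extractCluster(data, xs, k));   // xs plus its k-1 closest records
        }
        if (data.size() >= 2 * k) {
            double avg = average(data);
            double xr = farthestFrom(data, avg);
            clusters.add(extractCluster(data, xr, k));
        }
        clusters.add(new ArrayList<>(data));             // remaining records form one cluster
        return clusters;
    }

    private static double average(List<Double> data) {
        double sum = 0;
        for (double v : data) sum += v;
        return sum / data.size();
    }

    private static double farthestFrom(List<Double> data, double ref) {
        double best = data.get(0);
        for (double v : data)
            if (Math.abs(v - ref) > Math.abs(best - ref)) best = v;
        return best;
    }

    // Removes the k records closest to 'center' (including center itself) and returns them.
    private static List<Double> extractCluster(List<Double> data, double center, int k) {
        data.sort(Comparator.comparingDouble((Double v) -> Math.abs(v - center)));
        List<Double> cluster = new ArrayList<>(data.subList(0, k));
        data.subList(0, k).clear();
        return cluster;
    }
}
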
3.4.2 Zealous Algorithm
Input:
m: maximum number of items per user,
τ: first threshold value (generated randomly) and
τʹ: second threshold value (generated randomly)

Step1: For each user u select a set su of up to m distinct items from search history.

Step2: Based on the selected items, create a histogram consisting of pairs (k, ck), where k
denotes an item (keyword or query) and ck denotes the number of users u that have k
in their search history su. We call this histogram the original histogram.

Step3: Delete from the histogram the pairs (k, ck) with count ck smaller than τ (the first
threshold value).

Step4: For each pair (k, ck) in the histogram, sample a random number nk and add it to the
count ck, resulting in a noisy count ckʹ ← ck + nk.

Step5: Delete from the histogram the pairs (k, ck) with noisy counts ckʹ ≤ τʹ.

Step6: Publish the remaining items and their noisy counts (Sanitized histogram).
Output:
Protected Histogram.
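
The six steps above can be summarised in a short Java sketch. All names are our own, the two thresholds are supplied by the caller, and a Gaussian sample merely stands in for the noise added in Step 4; this is not the project's sample code from Chapter 5.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Minimal sketch of Steps 1-6 of the Zealous algorithm described above.
public class ZealousSketch {

    private final Random random = new Random();

    // histories: one search history (list of items) per user; m: max items kept per user.
    public Map<String, Double> sanitize(List<List<String>> histories, int m,
                                        double tau, double tauPrime) {
        // Steps 1-2: build the original histogram of (item, number of users).
        Map<String, Integer> original = new HashMap<>();
        for (List<String> history : histories) {
            Set<String> selected = new HashSet<>();
            for (String item : history) {              // keep up to m distinct items per user
                if (selected.size() >= m) break;
                selected.add(item);
            }
            for (String item : selected) {
                original.merge(item, 1, Integer::sum);
            }
        }

        // Steps 3-5: drop low counts, add noise, drop low noisy counts.
        Map<String, Double> sanitized = new HashMap<>();
        for (Map.Entry<String, Integer> e : original.entrySet()) {
            int count = e.getValue();
            if (count < tau) continue;                       // Step 3: below first threshold
            double noisy = count + random.nextGaussian();    // Step 4: illustrative noise
            if (noisy <= tauPrime) continue;                 // Step 5: below second threshold
            sanitized.put(e.getKey(), noisy);                // Step 6: publish the noisy count
        }
        return sanitized;
    }
}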

3.5 CONCLUSION
This chapter starts with an introduction to analysis in general, followed by the Software
Requirement Specification, which contains the user requirements, i.e., what a user expects
from the system, the software requirements, i.e., what software is required to implement
and run the system, and the hardware requirements for establishing and running the
project.

The content diagram, briefly explaining the basic components of the project and their
functionality, is also included. The algorithms and flow charts specify the different algorithms
used in the development of the project, their implementation details and their working.
4. DESIGN

4.1 INTRODUCTION

Software Design is a process of planning the new or modified system.


Analysis specifies what a new or modified system does. Design specifies how to accomplish
the same. Design is essentially a bridge between requirement specification and the final
solution satisfying the requirements. Design of a system is essentially a blue print or a
solution for the system.

The design process for a software system has two levels. At the first level the focus is
on which modules are needed for the system, the specification of these modules and how the
modules should be interconnected. This is what is called system design or top-level
design. At the second level, the internal design of the modules, i.e., how the specification of
each module can be satisfied, is decided.

The first level produces the system design, which defines the components needed for the
system and how the components interact with each other. Its focus is on which modules
are needed for the system, the specification of these modules and how the modules should be
interconnected.

4.2 UML DIAGRAMS

UML stands for Unified Modeling Language. UML is a standard language for
specifying, visualising, constructing and documenting a software system and its
components.

UML is one of the most exciting tools in the world of system development today.
UML enables system builders to create blueprints that capture their visions in a standard,
easy-to-understand form and communicate them to others. UML consists of a number of
graphical elements that combine to form diagrams.

The goals of UML are:

 To model systems using object-oriented concepts.


 To establish an explicit coupling between conceptual models and executable artifacts.
 To address the issues of scale inherent in complex, mission-critical systems.
 To create a modeling language usable by both humans and machines.
4.2.1 Use Case Diagram

The use case diagram models the interactions between the system's external clients
and the use cases of the system. Each use case represents a different capability that the
system provides to the client. A stick figure represents an actor. Use case diagrams identify
the functionality provided by the system, the users who interact with the system (actors), and
the associations between the users and the functionality. Use cases are used in the analysis
phase of software development to articulate the high-level requirements of the system.

Fig 4.2.1 Use Case Diagram: the actors User and Admin interact with the use cases Login,
Validation, Enters a query, Log is created, Displays the results, Clicks on the results, Updates
the log and Provide privacy.

4.2.2 Class Diagram

Class diagrams identify the class structure of a system, including the properties and
methods of each class. They depict the relationships that can exist between classes, such as
inheritance relationships. The class diagram is one of the most widely used diagrams from
the UML specification.

There are some important constraints to be noted:

Names: Every class should have a name that distinguishes it from other classes.
Attributes: An attribute is a named property of a class that describes a range of values that
instances of the property may hold.

Operations: An operation is the implementation of a service that can be requested from any
object of the class to affect its behavior.

Fig 4.2.2 Class Diagram: the class User (+UserName, +Password; +Login(), +Search(),
+Select()) contains (1..* to 1) the class Log (+UserName, +Query, +Clicks, +Timestamp;
+Update()), which is maintained (1 to 1) by the class Admin (+UserName, +Password;
+Login(), +Update()). Users are further classified as authenticated and unauthenticated users.

4.2.3 Sequence Diagram

A sequence diagram is a kind of interaction diagram that shows how processes
operate with one another and in what order. It mostly emphasizes the time ordering of
messages. A sequence diagram shows, as parallel vertical lines (lifelines), the different
processes or objects that live simultaneously, and, as horizontal arrows, the messages
exchanged between them in the order in which they occur.
Fig 4.2.3 Sequence Diagram: the lifelines User, Search Engine, Database and Admin exchange
the messages 1: Login(), 2: Authentication(), 3: Enters a query(), 4: Creates a log(),
5: Searches the results of the related query(), 6: Displays the results of the related query(),
7: Clicks on the results(), 8: Updates the log() and 9: Provides privacy for the user log().

4.2.4 Activity Diagram


An activity diagram is a special kind of statechart diagram that shows the flow from
one activity to another activity within a system. It addresses the dynamic view of the
system. Activity diagrams are especially important in modeling the functions of the system
and emphasize the flow of control among the objects.
Fig 4.2.4 Activity Diagram: across the User, Search Engine and Admin swimlanes, the flow is
Login, Validation (authenticated users proceed, unauthenticated users are stopped), Enters a
query, Creates a log, Searches the query, Displays the query, Clicks on the results of the
query, and Log updated.

4.3 MODULE DESIGN AND ORGANIZATION

The system can be broadly divided into the following three modules:

1. Query Substitution
2. Index Caching
3. Item Set Generation and Ranking

4.3.1 Query Substitution

Query substitutions are suggestions to rephrase a user query to match it to documents


or advertisements that do not contain the actual keywords of the query. Query substitutions
can be applied in query refinement, sponsored search, and spelling error correction.
We use query substitution as a representative application for search quality. First, the query is
partitioned into subsets of keywords called phrases, based on their mutual information. Next,
for each phrase candidate, query substitutions are determined based on the distribution of
queries.
4.3.2 Index Caching

Index caching is a representative application for search performance. The index
caching application does not require high coverage because of its storage restriction.
However, high precision of the top-j most frequent items is necessary to determine which of
them to keep in memory. On the other hand, in order to generate many query substitutions, a
larger number of distinct queries and query pairs is required. The corresponding parameter
should therefore be set to a large value for index caching and to a small value for query
substitution. In our experiments we fixed the memory size to 1 GB. Our inverted index stores
the document posting list for each keyword sorted according to relevance, which allows
retrieving the documents in the order of their relevance.
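
As an illustration of the relevance-sorted posting lists mentioned above, the sketch below shows one way such an in-memory inverted index could be organised; the class is hypothetical and is not part of the project code.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory inverted index whose posting lists are kept sorted by
// descending relevance, so documents can be retrieved in relevance order.
public class InvertedIndexSketch {

    public static class Posting {
        final String documentId;
        final double relevance;
        public Posting(String documentId, double relevance) {
            this.documentId = documentId;
            this.relevance = relevance;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Adds a document under a keyword and keeps the posting list sorted by relevance.
    public void add(String keyword, String documentId, double relevance) {
        List<Posting> list = postings.computeIfAbsent(keyword, k -> new ArrayList<>());
        list.add(new Posting(documentId, relevance));
        list.sort(Comparator.comparingDouble((Posting p) -> p.relevance).reversed());
    }

    // Returns the top-j most relevant documents for a keyword.
    public List<Posting> top(String keyword, int j) {
        List<Posting> list = postings.getOrDefault(keyword, new ArrayList<>());
        return new ArrayList<>(list.subList(0, Math.min(j, list.size())));
    }
}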

4.3.3 Item Set Generation and Ranking


All of our results apply to the more general problem of publishing frequent items,
item sets or consecutive item sets. Our results (positive as well as negative) can therefore be
applied more generally to the problem of publishing frequent items or item sets. We then
compare these rankings with the rankings produced by the original search log, which serve as
the ground truth. To measure the quality of the query substitutions, we do not only compare
the ranks of a substitution in the two rankings, but also penalize highly relevant substitutions
in the original ranking [q0, . . . , qj−1] that have a very low rank in the ranking
[q'0, . . . , q'j−1] produced from the sanitized log.

4.4 CONCLUSION
This chapter portrays the designing phase of work. It includes the UML diagrams
which represent how the work has to be coded. It also illustrates the procedure followed in
the implementation phase.
5. IMPLEMENTATION AND RESULTS

5.1 INTRODUCTION

Implementation is the part of the process where software engineers actually program
the code for the project.

The iterative models of the Software Development Life Cycle (SDLC) are different: the
implementation stage in iterative models is less pressured compared to the waterfall model.
The iterative model focuses on creating prototypes right from the start, which means there is
always some implementation taking place. However, these are just stages in software
development.

The advantage of implementation in the iterative models of the SDLC is the ability to
change easily. Developers often use this model to determine what they have done so far for
the software. Most of the iterative models need the help of potential end users.

The good thing about this is that when the software is implemented, it is guaranteed to
work based on the preferences of the users, since they have helped in the creation of the
software.

5.2 EXPLANATION OF KEY FUNCTIONS


JFreeChart:
JFreeChart is an open-source framework for the programming language Java, which allows
the creation of a wide variety of both interactive and non-interactive charts. JFreeChart
supports X-Y charts, Gantt charts and Bar charts. It also has built-in histogram plotting.
JFreeChart also works with GNU Classpath, a free software implementation of the standard
class library for the Java programming language.
JFreeChart automatically draws the axis scales and legends. Charts in GUI
automatically get the capability to zoom in with mouse and change some settings through
local menu. The existing charts can be easily updated through the listeners that the library has
on its data collections.
JFreeChart has a consistent and well-documented API and a flexible design that is easy to
extend, and it targets both server-side and client-side applications. It supports Swing
components, image files and vector graphics file formats.
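
A minimal example of the JFreeChart calls described above follows; it uses the same ChartFactory, DefaultCategoryDataset and ChartFrame classes as the ValuesBarGraph class in Section 5.3.1, and the keyword counts are invented purely for illustration.

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartFrame;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

// Plots a small histogram-style bar chart of keyword counts with JFreeChart.
public class HistogramChartExample {
    public static void main(String[] args) {
        DefaultCategoryDataset dataset = new DefaultCategoryDataset();
        dataset.setValue(12, "Values", "java");      // illustrative counts
        dataset.setValue(7, "Values", "privacy");
        dataset.setValue(4, "Values", "search");

        JFreeChart chart = ChartFactory.createBarChart(
                "Histogram Chart", "Keywords", "Values", dataset,
                PlotOrientation.VERTICAL, false, true, false);

        ChartFrame frame = new ChartFrame("Values", chart);
        frame.setSize(400, 350);
        frame.setVisible(true);
    }
}
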
5.3 METHOD OF IMPLEMENTATION
5.3.1 Sample Code
k-anonymity Algorithm
package org.goldmine.log;
import java.util.*;
import java.io.*;
public class K_AnominityAlg {
public void Algo(ArrayList dataset){
int datasetLength = 15;
Random rObj = new Random();
int rNum = rObj.nextInt(7);
System.out.println("Random Number:" + rNum);
if (rNum <= 5) {
int kVal = 3 * rNum;
System.out.println(kVal);
int addVal = 0;
for (int i = 0; i < dataset.size(); i++) {
addVal = addVal + (Integer.parseInt(dataset.get(i).toString()));
}
System.out.println("Added value: " + addVal);
double avgVal = (double) addVal / dataset.size(); // cast to double to avoid integer division
System.out.println("Average value: " + avgVal);
ArrayList resVal = euclideanMethodKLessThan6(dataset, avgVal);
}
else {
int kVal = 2 * rNum;
System.out.println(kVal);
int addVal = 0;
for (int i = 0; i < dataset.size(); i++) {
addVal = addVal + (Integer.parseInt(dataset.get(i).toString()));
}
System.out.println("Added value: " + addVal);
double avgVal = (double) addVal / dataset.size(); // cast to double to avoid integer division
System.out.println("Average value: " + avgVal);
ArrayList resVal = euclideanMethod(dataset, avgVal);
}
}
public static ArrayList euclideanMethodKLessThan6(ArrayList XVal, double avgVal) {
ArrayList resVal = new ArrayList();
for (int i = 0; i < XVal.size(); i++) {
double doubleVal = Double.parseDouble(XVal.get(i).toString());
double diffVal = Math.pow(Math.abs((avgVal - doubleVal)), 2);
double sqrtVal = Math.sqrt(diffVal);
resVal.add(sqrtVal);
}
double cluster[] = main(resVal);
System.out.println("Resultant cluster: ");
ArrayList retList = new ArrayList();
for (int i = 0; i < cluster.length; i++) {
System.out.println(cluster[i]);
retList.add(cluster[i]);
}
System.out.println("Removed items list: " + retList.get(retList.size() - 1) + " " +
retList.get(retList.size() - 2) + " " + retList.get(retList.size() - 3));
retList.remove(retList.get(retList.size() - 2));
retList.remove(retList.get(retList.size() - 3));
retList.remove(retList.get(retList.size() - 4));
System.out.println("Xr dataset: " + retList);
retList.remove(retList.get(0));
retList.remove(retList.get(1));
retList.remove(retList.get(2));
System.out.println("Xs dataset: " + retList);
return retList;
}
public static ArrayList euclideanMethod(ArrayList XVal, double avgVal) {
ArrayList resVal = new ArrayList();
for (int i = 0; i < XVal.size(); i++) {
double doubleVal = Double.parseDouble(XVal.get(i).toString());
double diffVal = Math.pow(Math.abs((avgVal - doubleVal)), 2);
double sqrtVal = Math.sqrt(diffVal);
resVal.add(sqrtVal);
}
double cluster[] = main(resVal);
System.out.println("Resultant cluster: ");
ArrayList retList = new ArrayList();
for (int i = 0; i < cluster.length; i++) {
System.out.println(cluster[i]);
retList.add(cluster[i]);
}
System.out.println("Removed items list: " + retList.get(retList.size() - 1) + " " +
retList.get(retList.size() - 2) + " " + retList.get(retList.size() - 3));
retList.remove(retList.get(retList.size() - 2));
retList.remove(retList.get(retList.size() - 3));
retList.remove(retList.get(retList.size() - 4));
System.out.println("Xr dataset: " + retList);
return retList;
}
public static double[] main(ArrayList initialData) {
int N = 9;
double arr[] = new double[initialData.size()]; // initial data
for (int i = 0; i < arr.length; i++) {
arr[i] = Double.parseDouble(initialData.get(i).toString());
}
double m1, m2, a, b, n = 0;
int i;
boolean flag = true;
double sum1 = 0, sum2 = 0;
a = arr[0];
b = arr[1];
m1 = a;
m2 = b;
double cluster1[] = new double[9], cluster2[] = new double[9];
for (i = 0; i < 9; i++) {
System.out.print(arr[i] + "\t");
}
System.out.println();
do {
n++;
sum1 = 0;
sum2 = 0;
int k = 0, j = 0;
for (i = 0; i < 9; i++) {
if (Math.abs(arr[i] - m1) <= Math.abs(arr[i] - m2)) {
cluster1[k] = arr[i];
k++;
}
else {
cluster2[j] = arr[i];
j++;
}
}
System.out.println();
for (i = 0; i < 9; i++) {
sum1 = sum1 + cluster1[i];
}
for (i = 0; i < 9; i++) {
sum2 = sum2 + cluster2[i];
}
a = m1;
b = m2;
m1 = Math.round(sum1 / k);
m2 = Math.round(sum2 / j);
if (m1 == a && m2 == b) {
flag = false;
}
else {
flag = true;
}
System.out.println("After iteration " + n + " , cluster 1 :\n");
for (i = 0; i < 9; i++) {
System.out.print(cluster1[i] + "\t");
}
System.out.println("\n");
System.out.println("After iteration " + n + " , cluster 2 :\n");
for (i = 0; i < 9; i++) {
System.out.print(cluster2[i] + "\t");
}
} while (flag);
System.out.println("Final cluster 1 :\n"); // final clusters
for (i = 0; i < 9; i++) {
System.out.print(cluster1[i] + "\t");
}
System.out.println();
System.out.println("Final cluster 2 :\n");
for (i = 0; i < 9; i++) {
System.out.print(cluster2[i] + "\t");
}
return cluster2;
}
}
BarGraph
package org.goldmine.log;
import org.jfree.chart.*;
import org.jfree.data.category.*;
import org.jfree.chart.plot.*;
import java.awt.*;
import java.util.*;
public class ValuesBarGraph{
public ValuesBarGraph(ArrayList<Integer> al,ArrayList<String> keys){
System.out.println("Values Generated by Histogram: "+al);
DefaultCategoryDataset dataset = new DefaultCategoryDataset();
for(int i = 0; i < al.size(); i++){
dataset.setValue(Integer.parseInt(al.get(i).toString()), "Values", keys.get(i).toString());
}
// dataset.setValue(1, "Values", "Precision");
// dataset.setValue(4, "Values", "Recall");
// dataset.setValue(10, "Values", "F-Measure");
JFreeChart chart = ChartFactory.createBarChart("Histogram Chart", "Keywords", "Values",
dataset, PlotOrientation.VERTICAL, false, true, false);
chart.setBackgroundPaint(Color.cyan);
chart.getTitle().setPaint(Color.black);
CategoryPlot p = chart.getCategoryPlot();
p.setRangeGridlinePaint(Color.blue);
ChartFrame frame1=new ChartFrame("Values",chart);
frame1.setVisible(true);
frame1.setSize(400,350);
}
}
Zealous Algorithm
package org.goldmine.log;
import org.jfree.chart.*;
import org.jfree.data.category.*;
import org.jfree.chart.plot.*;
import java.awt.*;
import java.util.*;
public class ZealousAlg{
// Note: the statements below were extracted without their enclosing method. A constructor
// taking the histogram counts, the noise values and the keyword names (as used further down)
// has been assumed here so that the listing compiles.
public ZealousAlg(ArrayList<Integer> values, ArrayList<Integer> noiseValues, ArrayList<String> names){
ArrayList<Integer> nn = new ArrayList<Integer>();
int add = 0;
for(int i=0;i<values.size();i++){
add=(Integer)values.get(i)+(Integer)noiseValues.get(i);
nn.add(add);
}
System.out.println("Zealous Algorithm:");
ValuesBarGraph v1 = new ValuesBarGraph(values, names); //First Histogram
HashSet newList = new HashSet();
ArrayList namesList = new ArrayList();
HashSet newList1 = new HashSet();
ArrayList namesList1 = new ArrayList();
HashSet duplicateRValues = new HashSet();
HashSet duplicateRNames = new HashSet();
HashSet noiseValuesDup = new HashSet();
HashSet noiseNamesDup = new HashSet();
Collections.sort(values, Collections.reverseOrder()); // Arranging the values in descending order
int minVal = values.get(0); // after the descending sort this holds the largest count
int maxVal = (Integer) values.get(values.size() - 1); // and this holds the smallest count
try {
for (int i = 0; i < values.size(); i++) {
duplicateRValues.add(values.get(i));
duplicateRNames.add(names.get(i));
}
}
catch (Exception e) {}
Iterator itrObj2 = duplicateRValues.iterator();
ArrayList dupValList = new ArrayList();
while (itrObj2.hasNext()) {
dupValList.add(itrObj2.next());
}
itrObj2 = duplicateRNames.iterator();
ArrayList dupNameList = new ArrayList();
while (itrObj2.hasNext()) {
dupNameList.add(itrObj2.next());
}
System.out.println("After Eliminating Duplicates:");
System.out.println("Remaining Values after eliminating Duplicates:" +
duplicateRValues);
System.out.println("Remaining Keywords after eliminating Duplicates:" +
duplicateRNames);
ValuesBarGraph v2 = new ValuesBarGraph(dupValList, dupNameList); // Second Histogram
System.out.println("Maximum Value: " + minVal);
System.out.println("Minimum Value: " + maxVal);
Random rObj1 = new Random();
int rNum = rObj1.nextInt(minVal) + maxVal;
System.out.println("First Threshold Value:" + rNum);
for (int i = 0; i < values.size(); i++) {
if (Integer.parseInt(values.get(i).toString()) >= rNum) {
newList.add(values.get(i));
namesList.add(names.get(i));
}
}
System.out.println("Values with respect to First Threshold: " + newList);
System.out.println("Keywords with respect to First Threshold: " + namesList);
Iterator itrObj3 = newList.iterator();
ArrayList newListAL = new ArrayList();
while (itrObj3.hasNext()) {
newListAL.add(itrObj3.next());
}
ValuesBarGraph v3 = new ValuesBarGraph(newListAL, namesList); // Third Histogram
System.out.println("Noise Values: " +noiseValues);
ValuesBarGraph v4 = new ValuesBarGraph(nn, names); //Fourth Histogram
System.out.print("Total number of clicks i.e., normal and 404 links: " + nn);
System.out.println("\n");
System.out.println("Maximum Value: " + nn.get(0));
System.out.println("Minimum Value: " + nn.get(nn.size() - 1));
int minVal1 = (Integer) nn.get(0);
int maxVal1 = (Integer) nn.get(values.size() - 1);
for(int i = 0; i < nn.size()-1; i++){
noiseValuesDup.add(nn.get(i));
}
ArrayList addNoiseValues = new ArrayList();
Iterator itrObj4 = noiseValuesDup.iterator();
while(itrObj4.hasNext()){
addNoiseValues.add(itrObj4.next());
}
ArrayList addNoiseNames = new ArrayList();
itrObj4 = noiseNamesDup.iterator();
while(itrObj4.hasNext()){
addNoiseNames.add(itrObj4.next());
}
Random rObj2 = new Random();
int rNum1 = rObj2.nextInt(minVal1) + maxVal1;
System.out.println("Second Threshold Value: " + rNum1);
for (int i = 0; i < addNoiseValues.size(); i++) {
if (rNum1 <= Integer.parseInt(addNoiseValues.get(i).toString())) {
newList1.add(addNoiseValues.get(i));
namesList1.add(names.get(i));
}
}
System.out.println("Values with respect to Second Threshold: " + newList1);
System.out.println("Keywords with respect to Second Threshold: " + namesList1);
Iterator itrObj5 = newList1.iterator();
ArrayList newListAL1 = new ArrayList();
while (itrObj5.hasNext()) {
newListAL1.add(itrObj5.next());
}
ValuesBarGraph v5 = new ValuesBarGraph(newListAL1, namesList1); // Fifth Histogram
}
}
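
The listings above are driven by the project's web layer, which is not reproduced here. The short driver below only illustrates how the two classes could be exercised with hand-made data, assuming the constructor signature reconstructed for ZealousAlg above.

package org.goldmine.log;

import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical driver that feeds hand-made counts into the classes above; in the
// project the inputs come from the search log stored in the database.
public class AlgorithmDriver {
    public static void main(String[] args) {
        ArrayList<Integer> counts = new ArrayList<Integer>(Arrays.asList(9, 7, 7, 5, 3, 3, 2, 1, 1));
        ArrayList<Integer> noise = new ArrayList<Integer>(Arrays.asList(1, 2, 1, 3, 2, 1, 2, 1, 1));
        ArrayList<String> keywords = new ArrayList<String>(Arrays.asList(
                "java", "privacy", "search", "log", "mining", "query", "click", "noise", "zealous"));

        new K_AnominityAlg().Algo(counts);       // prints the protected clusters to the console
        new ZealousAlg(counts, noise, keywords); // pops up the sequence of histograms
    }
}
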
5.3.2 Output Screens

5.3.2.1 The user enters a username and password on the Login Page.

5.3.2.2 Displays the message “Unauthenticated User” when an unauthenticated user logs in.


5.3.2.3 Only authenticated users have access to search for a query. The user enters a query in
the search engine.

5.3.2.4 The search engine displays the results of the related query. The user clicks on any one
of the results.
5.3.2.5 Whenever the user clicks on a result, the database updates the count of the specific
query.

5.3.2.6 Generates both algorithms, i.e., k-anonymity and Zealous, when the user clicks on the button.
5.3.2.7 Histogram between the keywords and the number of users who have the keywords in
their search history.

5.3.2.8 Histogram after eliminating the duplicates in the above histogram.


5.3.2.9 Histogram after eliminating the pairs whose values are less than τ (the first threshold
value), a random number, with respect to the above histogram.

5.3.2.10 Histogram after adding a sampled random number nk to each count ck.
5.3.2.11 Histogram after eliminating the pairs whose noisy values are less than τʹ (the second
threshold value), a random number, with respect to the above histogram.

5.3.3 Result Analysis


Only authenticated users have access to search for a query. The search log is
updated whenever a user logs in, searches for a query, or clicks on the results of a query.
The output is displayed in the form of histograms: the original histogram is generated first and
the duplicate queries are then eliminated. Histogram pairs are eliminated whenever their values
are less than the threshold value (a random number), and finally a sanitized histogram
consisting of the original values together with the noise counts is formed.

5.4 CONCLUSION
This chapter illustrates the implementation of the designed modules along with the results. It
describes what implementation means in the general software development life cycle and
then in the iterative model.
The key functions used in the implementation phase are discussed. In the method of
implementation, the sample code is presented and the outputs are shown as screenshots.
6. TESTING AND VALIDATION
6.1 INTRODUCTION

6.1.1 Testing

Testing is a process of executing a program with the intent of finding errors. Testing
presents an interesting anomaly for software engineering. The goal of software testing
is to convince the system developers and customers that the software is good enough for
operational use. Testing is a process intended to build confidence in the software. It is a
set of activities that can be planned in advance and conducted systematically. Software
testing is often referred to as verification and validation.

6.1.1.1 Strategies of Testing

A strategy for software testing must accommodate low-level tests that are necessary to
verify that all small source code segments have been correctly implemented as well as high-
level tests that validate major system functions against customer requirements.

6.1.1.2 Fundamentals of Testing

Testing is a process of executing a program with the intent of finding errors. A good test
case is one that has a high probability of finding an undiscovered error. If testing is conducted
successfully, it uncovers errors in the software. Testing cannot show the absence of
defects; it can only show that software defects are present.

6.1.2 Types of Testing


The various types of testing are

 Black-Box Testing
 White-Box Testing
 Alpha Testing
 Beta Testing
 Unit Testing
 Integration Testing
 System Testing
 Acceptance Testing
6.1.2.1 Black-Box Testing

Black-box testing is also called functional or behavioral testing. In black-box
testing the structure of the program is not considered; the tester only knows the inputs that
can be given to the system and what output the system should give.

6.1.2.2 White-Box Testing

White-box testing is also called glass-box or structural testing. It is a test-case
design method that uses the control structure of the procedural design to derive test cases.
Using white-box testing methods, the software engineer can derive test cases that guarantee
that all independent paths within a module have been exercised at least once and that exercise
all logical decisions on their true and false sides.

6.1.2.3 Alpha Testing

Alpha testing takes place at the software prototype stage, when the software is first able to run.
It will not have all the intended functionality, but it will have the core functions and will be
able to accept inputs and generate outputs. An alpha test usually takes place in the developer's
offices on a separate system.

6.1.2.4 Beta Testing

Beta testing is a “live application” of the software in an environment that cannot be


controlled by the developer. The beta test is conducted at one or more customer sites by the
end user of the software.

6.1.2.5 Unit Testing

Unit testing is also called module testing. It is essentially used for verification of
the code produced by individual programmers, and is typically done by the programmer of
the module. Generally, a module is offered by a programmer for integration and use by others
only after it has been unit tested satisfactorily.
6.1.2.6 Integration Testing
In integration testing, many unit tested modules are combined into subsystems, which
are then tested. The goal here is to see if the modules can be integrated properly. Hence, the
emphasis is on testing interfaces between modules. This testing activity can be considered
testing the design.
6.1.2.7 System Testing
Testing of the debugged programs is one of the most critical aspects of computer
programming; without programs that work, the system would never produce the
output for which it was designed. Testing is best performed when users and developers are
asked to assist in identifying all errors and bugs. It is not the quantity but the quality of the
data used that matters in testing.
6.1.2.8 Acceptance Testing
Acceptance testing is often performed with realistic data of the client to demonstrate
that the software is working satisfactorily. It may be done in the setting in which the software
is to eventually function. It essentially tests if the system satisfactorily solves the problems
for which it was commissioned.

6.2 DESIGN OF TEST CASES AND SCENARIOS

The test plan focuses on how the testing for the project will proceed, which units will
be tested, and what approaches are to be used during the various stages of testing. Test case
specifications have to be done separately for each unit. Based on the approach specified in the
test plan, first the features to be tested for this unit must be determined. The overall approach
stated in the plan is refined into specific test techniques that should be followed and into the
criteria to be used for evaluation.

Test case specification gives, for each unit to be tested, all test cases, inputs to be used
in the test cases, conditions being tested by the test case, and outputs expected for those test
cases. Test case specification is a major activity in the testing process. Careful selection of
test cases that satisfy the criteria and approach specified is essential for proper testing.

Table1: Test Case 1

Name of the Test Case: Login

Testing Technology: Acceptance Testing

Input Data: Not Applicable


 The user cannot sign in to the Login Page without entering the Username and Password in the
respective fields.
Output:

Test Result: Successful with no defects

Table 6.2.1 Test Case Login

Table2: Test Case 2

Name of the Test Case: Authentication

Testing Technology: Acceptance Testing

Input Data: Applicable


 Only authenticated users have the access (rights) to search for a query.
Output:
Test Result: Successful with no defects

Table 6.2.2 Test Case Authentication

6.3 CONCLUSION
This chapter began with an explanation of testing and validation in the software
development life cycle in general. The design of test cases and different scenarios was then
given in detail along with screenshots, followed by the validation of the generated test cases
and scenarios.
7. CONCLUSION

7.1 PROJECT CONCLUSION

The experimental results show the importance of maintaining privacy when
publishing search logs. Search logs need privacy protection as they contain users'
information. In our project we implement both the k-anonymity and Zealous algorithms.

As the k-anonymity algorithm does not provide good utility and strong privacy, we
implemented the Zealous algorithm, which provides good utility and stronger privacy for
publishing search logs.

We analyzed the Zealous algorithm for publishing the frequent keywords, queries and clicks
of a search log. Our project concludes with a comparison between the k-anonymity and
Zealous algorithms.

7.2 FUTURE ENHANCEMENT

This project can be extended to the development of algorithms that release useful
information about infrequent keywords, queries and clicks in a search log while preserving
user privacy, so that searched user queries cannot be matched to a particular user.
8. REFERENCES

[1] Publishing Search Logs – A Comparative Study of Privacy Guarantees by Michaela
Gotz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao and Johannes Gehrke.
[2] Anonymizing Query Logs by E. Adar, World Wide Web (WWW) Workshop on Query Log
Analysis, 2007.
[3] Analysis of Google Logs Retention Policies by Vincent Toubiana and Helen
Nissenbaum.
[4] Anonymous Search Histories Featuring Personalized Advertisement – Balancing Privacy
with Economic Interests by Thorben Burghardt, Klemens Bohm, Achim Guttmann and Chris
Clifton.
[5] User k-anonymity for privacy preserving data mining of query logs by Guillermo
Navarro-Arribas, Vicenç Torra, Arnau Erola and Jordi Castellà-Roca; Elsevier Publications.
[6] Releasing Search Queries and Clicks Privately by Aleksandra Korolova, Krishnaram
Kenthapadi, Nina Mishra and Alexandros Ntoulas.
