
Elysium Technologies Private Limited

ISO 9001:2008 A leading Research and Development Division


Madurai | Chennai | Kollam | Ramnad | Tuticorin | Singapore

Abstracts: Data Mining 2010 - 2011

01 Zipf's Trust Discovery in Structured P2P Network

The use of peer-to-peer (P2P) applications is growing dramatically. To finish transactions successfully, a trust
mechanism plays an important role, affecting not only the communication traffic but also data discovery. In
this paper we address the trust discovery mechanism in the structured P2P network we proposed before. The
main contribution of this paper is applying Zipf's law to trust discovery. Finally, we experimentally evaluate the
effectiveness of uniform and Zipf trust distributions; the results show that the Zipf distribution outperforms the
uniform distribution.
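The contrast between the two distributions can be illustrated with a small sketch. This is a generic Zipf popularity model, not the paper's exact trust model; the function names are hypothetical.

```python
def zipf_weights(n, s=1.0):
    """Zipf popularity: the k-th most queried peer gets weight ~ 1/k^s,
    normalized so the weights sum to 1."""
    raw = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def uniform_weights(n):
    """Uniform popularity: every peer is queried equally often."""
    return [1.0 / n] * n

# With 100 peers, Zipf concentrates over half of all trust queries on
# the 10 most popular peers, while uniform gives those 10 peers only 10%.
share_zipf = sum(zipf_weights(100)[:10])
share_uniform = sum(uniform_weights(100)[:10])
```

A trust discovery mechanism that exploits this skew can cache the few popular peers' trust values and satisfy most lookups cheaply, which is the intuition behind the reported advantage over uniform distribution.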

02 Personalized Web Search with Location Preferences

As the amount of Web information grows rapidly, search engines must be able to retrieve information according to the
user's preference. In this paper, we propose a new web search personalization approach that captures the user's
interests and preferences in the form of concepts by mining search results and their clickthroughs. Due to the
important role location information plays in mobile search, we separate concepts into content concepts and location
concepts, and organize them into ontologies to create an ontology-based, multi-facet (OMF) profile to precisely capture
the user's content and location interests and hence improve the search accuracy. Moreover, recognizing the fact that
different users and queries may have different emphases on content and location information, we introduce the notion
of content and location entropies to measure the amount of content and location information associated with a query,
and click content and location entropies to measure how much the user is interested in the content and location
information in the results. Accordingly, we propose to define personalization effectiveness based on the entropies and
use it to balance the weights between the content and location facets. Finally, based on the derived ontologies and
personalization effectiveness, we train an SVM to adapt a personalized ranking function for re-ranking of future search
results. We conduct extensive experiments to compare the precision produced by our OMF profiles and that of a
baseline method. Experimental results show that OMF improves the precision significantly compared to the baseline.

03 Web Objects Clustering Using Transaction Log

In this paper, we present a novel method for clustering web objects. Most existing methods are not sufficient to explore
similar objects, because the basic data, which include attributes of objects, click-through data, and link data, are often
sparse, scarce, or difficult to obtain. In contrast, the information we exploit is the transaction log, which is more common
and denser, but also noisier. To reduce the influence of the noise, we calculate the similarity in two steps. First, we use
a basic similarity to discover objects' neighbors. The objects are represented by vectors consisting of their neighbors.
Second, the cosine similarity of the object vectors is calculated for clustering. Experiments on synthetic data show
that our method is robust against noise. Using noisy data, we increase the precision by 10%. Finally, we show real
clustering results based on a movie dataset and achieve a coverage of 76% and a precision of 60%.
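The two-step similarity can be sketched as follows. The basic similarity values and object names are hypothetical toy data; the paper's actual basic similarity is derived from the transaction log.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def two_step_similarity(basic_sim, a, b, k=2):
    """Step 1: represent each object by its top-k neighbors under the
    (noisy) basic similarity. Step 2: compare the neighbor vectors by
    cosine, which smooths out noise in any single basic value."""
    def neighbor_vec(o):
        top = sorted(basic_sim[o].items(), key=lambda kv: kv[1], reverse=True)[:k]
        return dict(top)
    return cosine(neighbor_vec(a), neighbor_vec(b))

# Toy basic similarities: objects 'x' and 'y' share neighbors n1 and n2.
basic = {
    'x': {'n1': 0.9, 'n2': 0.8, 'n3': 0.1},
    'y': {'n1': 0.7, 'n2': 0.6, 'n4': 0.1},
}
sim = two_step_similarity(basic, 'x', 'y')  # close to 1: neighbors overlap
```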

#230, Church Road, Anna Nagar, Madurai 625 020, Tamil Nadu, India
Phone: +91 452-4390702, 4392702, 4390651
Website: www.elysiumtechnologies.com, www.elysiumtechnologies.info
Email: info@elysiumtechnologies.com

04 Using Domain Top-page Similarity Feature in Machine Learning-based Web Phishing Detection

This paper presents a study on using a concept feature to detect web phishing. Following the features
introduced in the Carnegie Mellon Anti-phishing and Network Analysis Tool (CANTINA), we applied an additional domain
top-page similarity feature to a machine learning-based phishing detection system. We preliminarily experimented with a
small set of 200 web pages, consisting of 100 phishing pages and 100 non-phishing pages. The evaluation result
in terms of f-measure was up to 0.9250, with a 7.50% error rate.

05 Transit Vehicle Dispatching Based on Genetic Algorithm-RBF Neural Network

Reasonable transit vehicle dispatching is very important for relieving traffic congestion. Artificial neural networks are
a common dispatching method; among them, the RBF neural network is a feed-forward neural network with one hidden
layer, which can uniformly approximate any continuous function to a prescribed accuracy. In an RBF neural network, the
choice of the widths and centers of the Gaussian functions and the output weights affects the accuracy of the
model. In this paper, a genetic algorithm is employed to determine the RBF neural network's parameters. The
genetic algorithm-RBF neural network is studied and applied to transit vehicle dispatching. The experimental results
show that the calculation results of the GA-RBF neural network are consistent with actual results.
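The one-hidden-layer structure described above can be sketched directly. The centers, widths, and weights below are fixed by hand for illustration; in the paper they are the parameters the genetic algorithm searches for.

```python
import math

def rbf_output(x, centers, widths, weights):
    """One-hidden-layer RBF network: Gaussian hidden units followed by
    a weighted linear sum at the output layer."""
    hidden = [math.exp(-((x - c) ** 2) / (2 * w ** 2))
              for c, w in zip(centers, widths)]
    return sum(h * wt for h, wt in zip(hidden, weights))

# Two Gaussian units with hand-picked parameters; a GA would instead
# evolve (centers, widths, weights) to minimize dispatching error.
y = rbf_output(0.0, centers=[0.0, 2.0], widths=[1.0, 1.0], weights=[1.0, 0.5])
```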

06 Research on the Semantic Similarity Computation Method Based on EUO

This paper studies semantic similarity in semantic information search and combines the characteristics
of the e-commerce information field to construct the EUO method. The paper further proposes carrying out
semantic similarity computation by combining two semantic similarity computation methods, namely the information
content method and the link distance method, on the basis of EUO. This method utilizes an upper-level ontology and a
relevance feedback search method to extract the semantic feature vectors in web page information. Experiments have also
proved that the semantic similarity computation method based on EUO can better reflect the semantic similarity relations
among words in the field of e-commerce.

07 Ranking DMUs in the DEA context using super and cross efficiency

This work proposes a revised cross-evaluation matrix which can be utilized to obtain a full ranking of DMUs. The revised
matrix contains super-efficiency values for diagonal elements and cross-efficiency values for non-diagonal elements
of the matrix. This matrix reveals the differences in efficiency or performance of DMUs better than the original cross-
evaluation matrix.


08 Progressive Result Generation for Multi-Criteria Decision Support Queries

Multi-criteria decision support (MCDS) is crucial in many business and web applications such as web searches, B2B
portals and on-line commerce. Such MCDS applications need to report results early, as soon as they are
generated, so that users can react and formulate competitive decisions in near real-time. The ease of expressing user
preferences in web-based applications has made Pareto-optimal (skyline) queries a popular class of MCDS queries.
However, state-of-the-art techniques either focus on handling skylines on single input sets (i.e., no joins) or do not
tackle the challenge of producing progressive early output results. In this work, we propose a progressive query
evaluation framework, ProgXe, that transforms the execution of queries involving skylines over joins to be non-blocking,
i.e., to progressively generate results early and often. In ProgXe, the query processing (join, mapping and skyline) is
conducted at multiple levels of abstraction, thereby exploiting the knowledge gained from both the input and the
mapped output spaces. This knowledge enables us to identify and reason about abstract-level relationships to
guarantee correctness of early output. It also provides optimization opportunities missed by current
techniques. To further optimize ProgXe, we incorporate an ordering technique that optimizes the rate at which
results are reported by translating the optimization of tuple-level processing into a job-sequencing problem. Our
experimental study over a wide variety of data sets demonstrates the superiority of our approach over state-of-the-art
techniques.

09 Probabilistic Top-k Query Processing in Distributed Sensor Networks

In this paper, we propose the notion of a sufficient set for distributed processing of probabilistic top-k queries in
cluster-based wireless sensor networks. Through the derivation of a sufficient boundary, we show that data items
ranked lower than the sufficient boundary are not required for answering probabilistic top-k queries and thus are
subject to local pruning. Accordingly, we develop the sufficient set-based (SSB) algorithm for inter-cluster query
processing. Experimental results show that the proposed algorithm reduces data transmissions significantly.
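The pruning idea can be sketched with a deliberately simplified boundary. The paper derives a probabilistic sufficient boundary; here we stand in the k-th highest local score as an assumed placeholder, just to show why lower-ranked items need not be transmitted.

```python
def sufficient_boundary(local_scores, k):
    """Simplified boundary: the k-th highest score seen in the cluster.
    (The paper's boundary is derived probabilistically; this is only a
    stand-in to illustrate the pruning mechanics.)"""
    ranked = sorted(local_scores, reverse=True)
    return ranked[k - 1] if len(ranked) >= k else float('-inf')

def local_prune(items, boundary):
    """Items scoring below the boundary cannot enter the global top-k,
    so a cluster head can avoid transmitting them."""
    return [it for it in items if it[1] >= boundary]

cluster = [('s1', 0.9), ('s2', 0.7), ('s3', 0.4), ('s4', 0.2)]
b = sufficient_boundary([s for _, s in cluster], k=2)   # 0.7
sent = local_prune(cluster, b)                          # only s1 and s2 transmitted
```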

10 Preference Queries in Large Multi-Cost Transportation Networks

Research on spatial network databases has so far assumed that there is a single cost value associated with each
road segment of the network. In most real-world situations, however, there may exist multiple cost types involved in
transportation decision making. For example, the different costs of a road segment could be its Euclidean length, the
driving time, the walking time, possible toll fees, etc. The relative significance of these cost types may vary from user to
user. In this paper we consider such multi-cost transportation networks (MCN), where each edge (road segment) is
associated with multiple cost values. We formulate skyline and top-k queries in MCNs and design algorithms for their
efficient processing. Our solutions have two important properties for preference-based querying: the skyline methods
are progressive and the top-k ones are incremental. The performance of our techniques is evaluated with experiments
on a real road network.
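The skyline formulation over multi-cost edges rests on Pareto dominance between cost vectors, which can be sketched as follows. The cost dimensions and route values are illustrative assumptions, not data from the paper.

```python
def dominates(a, b):
    """Cost vector a dominates b if a is no worse in every cost
    dimension and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(paths):
    """Keep only the paths not dominated by any other path: no single
    user preference over the cost types can prefer a dropped path."""
    return [p for p in paths if not any(dominates(q, p) for q in paths if q is not p)]

# Hypothetical cost vectors: (length_km, drive_min, toll).
routes = [(10, 15, 2.0), (12, 12, 0.0), (11, 16, 2.5)]
best = skyline(routes)   # the third route is dominated by the first
```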


11 Policy- Aware Sender Anonymity in Location Based Services

Sender anonymity in location-based services (LBS) attempts to hide the identity of a mobile device user who sends
requests to the LBS provider for services in her proximity (e.g. “find the nearest gas station” etc.). The goal is to keep
the requester’s interests private even from attackers who (via hacking or subpoenas) gain access to the request and to
the locations of the mobile user and other nearby users at the time of the request. In an LBS context, the best-studied
privacy guarantee is known as sender k-anonymity. We show that state-of-the-art solutions for sender k-anonymity
defend only against naive attackers who have no knowledge of the anonymization policy that is in use. We strengthen
the privacy guarantee to defend against more realistic “policy-aware” attackers. We describe a polynomial algorithm to
obtain an optimal anonymization policy. Our implementation and experiments show that policy-aware sender k-
anonymity has potential for practical impact, being efficiently enforceable, with limited reduction in utility when
compared to policy-unaware guarantees.

12 Optimized Query Evaluation Using Cooperative Sorts

Many applications require sorting a table over multiple sort orders: generation of multiple reports from a table,
evaluation of a complex query that involves multiple instances of a relation, and batch processing of a set of queries.
In this paper, we study how multiple sortings of a table can be performed efficiently. We introduce a new evaluation
technique, called cooperative sort, that exploits the relationships among the input set of sort orders to minimize I/O
operations for the collection of sort operations. To demonstrate the efficiency of the proposed scheme, we
implemented it in PostgreSQL and evaluated its performance using both the TPC-DS benchmark and synthetic data. Our
experimental results show significant performance improvement over the traditional non-cooperative sorting scheme.

13 Multilevel Trust Management Framework for Pervasive Computing

A multilevel trust management framework for pervasive environments is proposed in this paper. In the framework,
trust, reputation, reciprocity and related concepts are newly defined. The multilevel discrete trust metric offers a
good way to address two problems: the granularity of binary trust is too coarse, and the calculation of continuous trust
is complex. The design of a storage counter in the framework compensates for the limited storage capacity of pervasive
devices. The experimental results indicate preliminarily that the multilevel trust metric performs well for
trusted authentication in pervasive environments.

14 Multi-Guarded Safe Zone: An Effective Technique To Monitor Moving Circular Range Queries

Given a positive value r, a circular range query returns the objects that lie within the distance r of the query location. In
this paper, we study the circular range queries that continuously change their locations. We present an efficient and
effective technique to monitor such moving range queries by utilizing the concept of a safe zone. The safe zone of a
query is the area with a property that while the query remains inside it, the results of the query remain unchanged.
Hence, the query does not need to be re-evaluated unless it leaves the safe zone. The shape of the safe zone is defined
by the so-called guard objects. Checking whether a query lies in the safe zone takes k distance
computations, where k is the number of guard objects. Our contributions are as follows. 1) We propose a technique
based on powerful pruning rules and a unique access order which efficiently computes the safe zone and minimizes
the I/O cost. 2) To show the effectiveness of the safe zone, we theoretically evaluate the probability that a query leaves
the safe zone within one time unit and the expected distance a query moves before it leaves the safe zone.
Additionally, for queries whose safe zone diameter is less than its expected value multiplied by a constant,
we also give an upper bound on the expected number of guard objects. This upper bound turns out to be a constant;
that is, it does not depend on either the radius r of the query or the density of the objects. The theoretical analysis is
verified by extensive experiments. 3) Our thorough experimental study demonstrates that our proposed approach is
close to optimal and is an order of magnitude faster than a naive algorithm.
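The k-distance-computation check described above can be sketched as follows. The guard locations and the rule that answer guards must stay within radius r while non-answer guards stay outside it are illustrative assumptions about how such a safe zone can be encoded.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def inside_safe_zone(query, guards, r):
    """The query result is unchanged while every guard keeps its status:
    guards in the answer stay within distance r of the query, guards
    outside the answer stay beyond r. Exactly k distance computations
    are performed for k guards."""
    return all((dist(query, g) <= r) == is_answer for g, is_answer in guards)

# Hypothetical guards: (location, whether the guard is in the result).
guards = [((0.0, 3.0), True), ((0.0, 9.0), False)]
ok = inside_safe_zone((0.0, 0.0), guards, r=5.0)      # query still safe
moved = inside_safe_zone((0.0, 5.0), guards, r=5.0)   # far guard now in range: re-evaluate
```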

15 Mining periodic-frequent item sets with approximate periodicity using interval transaction-ids list tree

Temporal periodicity of item set appearance can be regarded as an important criterion for measuring the
interestingness of item sets in several applications. A frequent item set is said to be periodic-frequent in a database if it
appears at a regular interval given by the user. In this paper, we propose a concept of the approximate periodicity of
each item set. Moreover, a new tree-based data structure, called the ITL-tree (Interval Transaction-ids List tree), is
proposed. Our tree structure maintains an approximation of the occurrence information in a highly compact manner
for periodic-frequent item set mining. Pattern-growth mining is used to generate all periodic-frequent item
sets by a bottom-up traversal of the ITL-tree for user-given periodicity and support thresholds. The performance study
shows that our data structure is very efficient for mining periodic-frequent item sets with approximate periodicity
results.
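The periodic-frequent criterion itself can be sketched from a transaction-ids list. This uses exact periods for clarity; the paper's ITL-tree stores an approximation of this occurrence information.

```python
def periods(tid_list):
    """Gaps between consecutive transaction ids in which the item set
    appears, with the first gap measured from the start of the database."""
    prev, out = 0, []
    for tid in tid_list:
        out.append(tid - prev)
        prev = tid
    return out

def is_periodic_frequent(tid_list, max_period, min_support):
    """An item set is periodic-frequent if it is frequent enough and
    every appearance gap is within the user-given period threshold."""
    return len(tid_list) >= min_support and max(periods(tid_list)) <= max_period

appearances = [2, 4, 6, 8, 10]   # item set seen every 2 transactions
print(is_periodic_frequent(appearances, max_period=3, min_support=4))  # True
print(is_periodic_frequent([1, 9, 10], max_period=3, min_support=2))   # False: one gap of 8
```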

16 Managing Uncertainty of XML Schema Matching

Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g.,
those derived from COMA++) is likely to be imprecise. In particular, numerous instances of “possible mappings”
between the schemas may be derived from the matching result. In this paper, we study the problem of managing
possible mappings between two heterogeneous XML schemas. We observe that for XML schemas, their possible
mappings have a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the
commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact
manner, and can be generated efficiently. Moreover, it supports the evaluation of the probabilistic twig query (PTQ), which
returns the probability of portions of an XML document matching the query pattern. For users who are interested only
in answers with the k highest probabilities, we also propose the top-k PTQ, and present an efficient solution for it. The
second challenge we have tackled is to efficiently generate possible mappings for a given schema matching. While this
problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a
divide-and-conquer approach. An extensive evaluation on realistic datasets shows that our approaches significantly
improve the efficiency of generating, storing, and querying possible mappings.

17 Top Cells: Keyword-Based Search of Top-k Aggregated Documents in Text Cube

Previous studies on supporting keyword queries in RDBMSs provide users with a ranked list of relevant linked
structures (e.g. joined tuples) or individual tuples. In this paper, we aim to support keyword search in a data cube with
text-rich dimension(s) (so-called text cube). Each document is associated with structural dimensions. A cell in the text
cube aggregates a set of documents with matching dimension values on a subset of dimensions. Given a keyword
query, our goal is to find the top-k most relevant cells in the text cube. We propose a relevance scoring model and
efficient ranking algorithms. Experiments are conducted to verify their efficiency.

18 Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database

In this paper, we describe a scheme for tolerating and recovering from mid-query faults in a distributed shared-nothing
database. Rather than aborting and restarting queries, our system, Osprey, divides running queries into subqueries,
and replicates data such that each subquery can be rerun on a different node if the node initially responsible fails or
returns too slowly. Our approach is inspired by the fault tolerance properties of MapReduce, in which map or reduce
jobs are greedily assigned to workers, and failed jobs are rerun on other workers. Osprey is implemented using a
middleware approach, with only a small amount of custom code to handle cluster coordination. Each node in the
system is a discrete database system running on a separate machine. Data, in the form of tables, is partitioned
amongst database nodes and each partition is replicated on several nodes, using a technique called chained
declustering [1]. A coordinator machine acts as a standard SQL interface to users; it transforms an input SQL query
into a set of subqueries that are then executed on the nodes. Each subquery represents only a small fraction of the
total execution of the query; worker nodes are assigned a new subquery as they finish their current one. In this
greedy approach, the amount of work lost due to node failure is small (at most one subquery's work), and the system
is automatically load balanced, because slow nodes will be assigned fewer subqueries. We demonstrate Osprey's
viability as a distributed system for a small data warehouse data set and workload. Our experiments show that the
overhead introduced by the middleware is small compared to the workload, and that the system shows promising load
balancing and fault tolerance properties.

19 Efficient Processing of Substring Match Queries With Inverted q-gram Indexes

With the widespread use of the Internet, text-based data sources have become ubiquitous and the demand for effective
support of string matching queries is ever increasing. The relational query language SQL supports the LIKE
clause over string data to handle substring matching queries. Due to the popularity of such substring matching queries,
there have been many studies on designing efficient indexes to support the LIKE clause in SQL. Among them, q-gram-
based indexes have been studied extensively. However, how to process substring matching queries efficiently with
such indexes has received very little attention until recently. In this paper, we show that the plan for
intersecting posting lists of q-grams for substring matching queries should be decided judiciously. Then we present
optimal and approximate algorithms based on cost estimation for substring matching queries. A performance study
confirms that our techniques improve query execution time with q-gram indexes significantly compared to the
traditional algorithms.
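The basic mechanism being optimized can be sketched as follows. The inverted index contents are hypothetical; intersecting from the shortest posting list first is one simple ordering heuristic, whereas the paper chooses the plan by cost estimation.

```python
def qgrams(s, q=2):
    """All overlapping q-grams of a string."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def candidate_ids(pattern, index, q=2):
    """Intersect the posting lists of the pattern's q-grams, starting
    from the shortest list so intermediate results stay small. Choosing
    this intersection order judiciously is what the paper studies."""
    lists = sorted((set(index.get(g, ())) for g in qgrams(pattern, q)), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result &= plist
        if not result:
            break   # no string can contain all q-grams of the pattern
    return result

# Hypothetical inverted 2-gram index: gram -> ids of strings containing it.
index = {'ab': {1, 2, 5}, 'bc': {1, 5}, 'cd': {1, 3, 5, 7}}
print(sorted(candidate_ids('abcd', index)))  # [1, 5]
```

The surviving ids are only candidates; a final verification against the stored strings is still needed, since sharing all q-grams does not guarantee a substring match.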

20 Efficient Identification of Coupled Entities in Document Collections

The relentless pace at which textual data are generated on-line necessitates novel paradigms for their understanding
and exploration. To this end, we introduce a methodology for discovering strong entity associations in all the slices
(metadata value restrictions) of a document collection. Since related documents mention approximately the same
group of core entities (people, locations, etc.), the groups of coupled entities discovered can be used to expose
themes in the document collection. We devise and evaluate algorithms capable of addressing two flavors of our core
problem: algorithm THR-ENT for computing all sufficiently strong entity associations and algorithm TOP-ENT for
computing the top-k strongest entity associations, for each slice of the document collection.


21 Discover Information and Knowledge from Websites using an Integrated Summarization and Visualization Framework

The number of Web sites has noticeably increased to roughly 225 million in the last ten years. This means there is a
rapid growth of knowledge and information on the Internet. Although search engines can help users to filter their
desired information based on keywords, the search result is normally presented in the form of a list, and users have
to visit each Web page in order to determine the appropriateness of the result. A considerable amount of time therefore
has to be spent on finding the required information. To address this issue, this paper proposes a knowledge discovery
approach on the Web that provides an overview of the information on a Website using an integration of summarization
and visualization techniques. This includes text summarization, tag clouds, Document Type View, and interactive
features such as drill-down and thumbnails. This approach is capable of reducing the time required to identify and
search for information or knowledge from the Web.

22 Design of Time-Way for “H” Configuration of Electroplating machine

Designing a Time-Way for an electroplating machine is a complicated job, especially for an “H” configuration machine.
Experienced engineers usually design these schedules; however, the results are often inaccurate and cause longer
setup times. This paper describes techniques to design a Time-Way for cyclic hoist scheduling (CHS) of
electroplating machines which have an “H” configuration layout. A tree search algorithm has been used to generate a
machine sequence. An expert system based on engineering knowledge is used with the tree search algorithm in order to
generate a Time-Way. Although these techniques cannot guarantee the minimal cycle time for hoist scheduling problems,
the results can be used very well in real industrial problems. These techniques give more accuracy, more efficiency and
less setup time when compared to Time-Ways designed by engineers.

23 Design of Fast Multiple String Searching Based on Improved Prefix Tree

Multi-string matching is one of the most important components in data mining tasks. New applications in many
technology fields require high-performance string matching algorithms. This paper first presents a new string
searching approach based on a data structure called the prefix tree. The innovative algorithm eliminates the functional
overlap of the hash table and the prefix function. Then we make a small improvement to the prefix tree and present a
second algorithm that is faster and more space-saving. It is demonstrated analytically that the two algorithms inherit
the optimality and are very competitive in practice. On tests of both real-life and synthetic data, our algorithms are also
efficient and especially effective for various string patterns and large alphabets.

24 Cross-Document Coreference Resolution based on Automatic Text Summary

Cross-document coreference resolution plays an important part in the field of natural language processing (NLP). It
captures the ability to gather documents for information about a certain entity. Most previous algorithms identify
the underlying entity of a given document depending on the original text, which is unreliable if the original text
contains multiple parts with different themes. In this paper, we propose a cross-document coreference resolution
algorithm based on automatic text summary instead of the original text. In our approach, we extract a query-specific,
informative-indicative summary from the original text by using the Hobbs algorithm and measure the similarity between
two summaries. This automatic text summary-based cross-document coreference resolution (ATSCDCR) system is
effective in disambiguating different entities with the same mention name and identifying the same entity under
different mention names. The results from our experiments show that the macro average of the ATSCDCR system is up
to 73.16% and the micro average is 67.34%.

25 C3: Concurrency Control on Continuous Queries over Moving Objects

Moving object management approaches, especially continuous query processing techniques, have attracted
significant research effort due to the broad usage of location-aware devices. However, little attention has been given to
designing concurrency control protocols for continuous query processing. Existing concurrency control protocols for
spatial indices are based on a single indexing tree, while popular continuous query processing approaches require
multiple indices. In addition, continuous monitoring combined with frequent location updates challenges the
development of serializable isolation for concurrent index operations. This paper proposes an efficient concurrent
continuous query processing approach, C3, which fuses scalable continuous query processing methods with lazy
update techniques on R-trees. The proposed concurrency control protocol, equipped with intra- and inter-index
protection, assures serializable isolation, consistency, and deadlock-freedom. The correctness of the proposed
protocol is theoretically proven, and the experimental results demonstrate its scalability and efficiency.

26 Clustering Data on Manifold with Local and Global Consistency

Data clustering aims at finding the hidden patterns in a large collection of data, and a large body of effective algorithms
has been proposed to partition data over the past three decades. However, most of these algorithms fail to handle data
that expose a manifold structure, which is common in many data-driven applications such as the interpretation and
recognition of video, handwritten character, and image data. In this paper, we study the problem of clustering on a
manifold, which aims to partition a set of input data into several clusters, each of which contains data points from a
simple low-dimensional manifold. We apply the basic assumption of local and global consistency on the manifold. A
novel algorithm named CMLGC is proposed to find the proper clusters on the manifold. Our research can also be seen
as an instance of manifold learning. Encouraging results on several synthetic and real-world data sets are
obtained, which validate our proposed algorithm.

27 Q-Cop: Avoiding Bad Query Mixes to Minimize Client Timeouts under Heavy Loads

In three-tiered web applications, some form of admission control is required to ensure that throughput and response
times are not significantly harmed during periods of heavy load. We propose Q-Cop, a prototype system for improving
admission control decisions that considers a combination of the load on the system, the number of simultaneous
queries being executed, the actual mix of queries being executed, and the expected time a user may wait for a reply
before they or their browser give up (i.e., time out). Using TPC-W queries, we show that the response times of different
types of queries can vary significantly depending not just on the number of queries being processed but on the mix of
other queries that are running simultaneously. We develop a model of expected query execution times that accounts
for the mix of queries being executed and integrate this model into a three-tiered system to make admission control
decisions. Our results show that this approach makes more informed decisions about which queries to reject and, as a
result, significantly reduces the number of requests that time out. Across the range of workloads examined, an average
of 47% fewer requests are unsuccessful than with the next best approach.

28 An Improved Ant-Colony Clustering Algorithm Based On the Innovational Distance Calculation Formula

Addressing the disadvantages of the classical Euclidean distance in data clustering analysis, we propose an improved
distance calculation formula that describes both the local compactness and the global connectivity between data points.
Furthermore, we improve the ant-colony clustering algorithm by using the improved distance calculation formula.
Theoretical analysis and experiments show that this method is more efficient and is able to identify complex
non-convex clusters.

29 A Unified Record Linkage Strategy for Web Service Data

Record linkage, also known as duplicate detection, is a key process for ensuring the quality of Web
service data. Given two lists of records, record linkage consists of determining all pairs that are similar to each other,
where the overall similarity between two records is defined by domain-specific similarities over the individual
attributes constituting the record. In this paper, we present a unified framework for recognizing clusters of near-
duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-
processing multi-language data using Chinese word segmentation and Chinese named-entity recognition techniques;
(2) a pair-wise comparison method based on domain-specific similarities, in particular the string kernel method; and (3) a
priority queue of duplicate clusters together with a representative-records strategy to adapt to the data scale.
Experiments on real databases show that the proposed record linkage strategy is both efficient and effective.
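The pair-wise comparison step can be sketched as below. The attribute names, weights, and the 0.85 threshold are assumed values, and `difflib.SequenceMatcher` is a simple stand-in for the string kernel method the paper uses:

```python
# Weighted attribute-similarity record linkage: pairs scoring above a
# threshold are declared near-duplicates.
from difflib import SequenceMatcher

def attr_sim(a, b):
    """Simple string similarity; the paper uses string kernels instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

WEIGHTS = {"name": 0.6, "city": 0.4}  # assumed per-attribute weights

def record_sim(r1, r2):
    return sum(w * attr_sim(r1[f], r2[f]) for f, w in WEIGHTS.items())

def link(records, threshold=0.85):
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_sim(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = [
    {"name": "Elysium Technologies", "city": "Madurai"},
    {"name": "Elysium Technologies Pvt Ltd", "city": "Madurai"},
    {"name": "Acme Corp", "city": "Chennai"},
]
print(link(records))  # [(0, 1)]
```

The priority-queue-of-clusters strategy in the paper replaces this quadratic all-pairs loop when the data scale grows.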

30 A Scalable, Accurate Hybrid Recommender System

Recommender systems apply machine learning techniques for filtering unseen information and can predict whether a
user would like a given resource. There are three main types of recommender systems: collaborative filtering, content-
based filtering, and demographic recommender systems. Collaborative filtering recommender systems recommend
items by taking into account the taste (in terms of preferences for items) of users, under the assumption that users will
be interested in items that users similar to them have rated highly. Content-based filtering recommender systems
recommend items based on the textual information of an item, under the assumption that users will like items similar
to the ones they liked before. Demographic recommender systems categorize users or items based on their personal
attributes and make recommendations based on demographic categorizations. These systems suffer from scalability,
data sparsity, and cold-start problems, resulting in poor-quality recommendations and reduced coverage. In this paper,
we propose a unique cascading hybrid recommendation approach combining the rating, feature, and demographic
information about items. We empirically show that our approach outperforms state-of-the-art recommender system
algorithms and mitigates the scalability, data sparsity, and cold-start problems described above.


31 A Practical Approach to E-government Performance Evaluation Based on Web Usage Mining

We propose a practical approach to e-government evaluation that combines objective indexes with
subjective indexes. Traditional methods are limited in that a single source of data makes the evaluation
too subjective. To avoid this, we obtain the subjective indexes from user surveys and expert reviews, and obtain the
objective indexes through web usage mining and financial statistics. We use these indexes to give an e-
government website a comprehensive evaluation.
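The combination step can be sketched as a weighted average of normalized index scores. The index names, scores, and equal weights below are illustrative assumptions, not values from the paper:

```python
# Combine subjective indexes (surveys, expert review) with objective indexes
# (web usage mining, financial statistics) into one comprehensive score.
def comprehensive_score(subjective, objective, w_subj=0.5, w_obj=0.5):
    """Each argument maps an index name to a normalized score in [0, 1]."""
    subj = sum(subjective.values()) / len(subjective)
    obj = sum(objective.values()) / len(objective)
    return w_subj * subj + w_obj * obj

subjective = {"user_survey": 0.8, "expert_review": 0.6}
objective = {"page_views": 0.9, "session_depth": 0.5, "cost_efficiency": 0.7}
print(comprehensive_score(subjective, objective))  # ≈ 0.7
```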

32 A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets

When dealing with imbalanced datasets (IDS), the hyperplane of a support vector machine (SVM) skews toward the minority
class (positive class), which causes low classification accuracy. To address this problem, we propose a novel
differential evolution-clustering hybrid resampling SVM algorithm (DEC-SVM). The algorithm uses mutation
and crossover operators similar to those of Differential Evolution (DE) for over-sampling, enlarging the ratio of positive
samples, and then applies clustering to the over-sampled training dataset as a data-cleaning step for both classes,
removing redundant or noisy samples. Experimental results on ten UCI benchmark datasets show that DEC-SVM
outperforms standard SVM, SMOTE-SVM, and DE-SVM under the F-measure and ROC area (AUC) criteria.

33 A Novel Approach for High Dimensional Data Clustering

Clustering is considered the most important unsupervised learning problem. It aims to find structure in a
collection of unlabeled data. Dealing with a large quantity of data items can be problematic because of time
complexity. On the other hand, high-dimensional data, e.g. time series data, is a further challenge in data clustering.
Novel algorithms need to be robust, scalable, efficient, and accurate to cluster these kinds of data. In this
study we propose a two-stage algorithm based on K-Means to achieve this objective.
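The abstract does not spell out the two stages. One common scalable pattern, assumed here for illustration only, is to compress the data into many K-Means micro-cluster centroids in stage one and then run K-Means again on those centroids in stage two:

```python
# Two-stage K-Means sketch: stage 1 compresses the data into micro-cluster
# centroids, stage 2 clusters the centroids into the final clusters.
import math, random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[idx].append(p)
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def two_stage_kmeans(points, micro_k, final_k):
    micro_centers = kmeans(points, micro_k)  # stage 1: compress
    return kmeans(micro_centers, final_k)    # stage 2: cluster the centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = two_stage_kmeans(pts, micro_k=4, final_k=2)
print(sorted(centers))
```

Stage two operates on micro_k points instead of the full dataset, so the expensive iterations run over a much smaller input; this is the standard answer to the time-complexity concern the abstract raises.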

34 A Central Sub-image Based Global Motion Estimation Method for In-Car Video Stabilization

This paper presents a novel global motion estimation method based on the phase correlation of a central sub-image. In
this study, we consider the case in which In-Car videos are captured by cameras placed at the front of a car. The
backgrounds of these In-Car videos usually vary as the car moves, which results in inaccuracy for classical image
stabilization methods. We therefore present a central sub-image based image stabilization method. The
simulation results show that the proposed method is effective in improving the accuracy of detected global motion
vectors for In-Car videos.

#230, Church Road, Anna Nagar, Madurai 625 020, Tamil Nadu, India
(: +91 452-4390702, 4392702, 4390651
Website: www.elysiumtechnologies.com,www.elysiumtechnologies.info
Email: info@elysiumtechnologies.com
