
Newsstand through RSS Feeds

INTRODUCTION
Many people like to view and analyze news from various sources and are therefore likely to subscribe to RSS feeds from many news outlets. Often, they are interested only in the top stories of their categories of interest, yet they have to scan through all the top stories from every feed in order to reach the ones they care about. For example, a user interested in sports-related top news has to go through the top stories of several channels, which increases the time spent analyzing news from multiple sources.
We therefore identified the need to bring together news from various sources, categorize it, and present it to users as a single news feed. The user can then subscribe to this one feed instead of subscribing to multiple feeds. The RSS documents of the different news sources act as the input to the system.
Web feeds provide a way for websites, especially those that are frequently updated, to deliver up-to-date information to their users. Feeds are provided in either RSS or Atom format.
A web aggregation site is a website that gathers content from various feeds in one place. This makes it easier for users to view content from various websites at once, and it removes the overhead of users having to build a feed aggregator themselves. Popular aggregation websites include newsnow.com and kicknews.com.
When aggregators have to categorize the content consumed from feeds, they either use a predefined category registered for the source of the feed, or try to get the category from the meta-data supplied with the feed content (see the sketch below). Using the predefined category of the source can lead to scenarios in which the category does not match the actual content being consumed. In other cases, the category supplied in the meta-data does not match any of the categories set up in the aggregator.
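As a rough illustration of how a feed item and its optional category meta-data can be read, the following Java sketch parses the <title>, <description> and <category> elements of an RSS 2.0 feed with a standard XML parser. It is a minimal sketch only; the feed URL and class name are placeholders, not part of the actual system.

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Minimal sketch: read <title>, <description> and the optional <category>
// element from each <item> of an RSS 2.0 feed. The feed URL is a placeholder.
public class RssItemReader {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL("http://example.com/news/rss.xml").openStream();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = text(item, "title");
            String description = text(item, "description");
            String category = text(item, "category"); // may be empty if the source supplies none
            System.out.println(category + " | " + title + " | " + description);
        }
    }

    // Returns the text of the first child element with the given tag, or "" if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }
}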

Problem Definition
People usually want to collect more information about a news story, and gathering this news helps users stay aware of current events. Web blogs are full of un-indexed and unprocessed text that reflects this heterogeneity, and it is not easy to walk through a large amount of news and read it carefully.
Sometimes the news talks about a product directly, and sometimes reviews are mentioned explicitly. There is therefore a need to collect and process different news sources so that they can be used in decision-making processes.
In this proposed system, we propose an app that collects, parses, processes, annotates and analyzes news from various RSS feed channels, identifying through crawling whether an item expresses positive or negative information.

Background of the Study


1. Feed Reader or News Aggregator software allows you to grab RSS feeds from various sites and display them for you to read and use.
2. Users who are interested in consuming the content of feeds use aggregator software called a feed reader. Aggregator software can be either a Windows application or a web application, and it collects feed contents from various sources into one view. With a feed reader, a user can have the latest content of his/her favourite websites in one place, thereby reducing the time spent checking different websites. A spin-off of feed readers is web aggregation sites.
3. RSS solves a problem for people who regularly use the web. It allows you to easily stay informed by retrieving the latest content from the sites you are interested in. You save time by not needing to visit each site individually, and you preserve your privacy by not needing to join each site's email newsletter.

Purpose and Objective


In order to stay updated on the latest news articles, many users subscribe to various RSS feeds. However, this information is often scattered across various news sources and spans more than one domain. Our system provides a single RSS feed that presents the news items from the various news sources and groups them into categories. This saves users a lot of time that they would otherwise spend visiting various news sites to find the top news in their categories of interest.
Our project aims at processing four RSS feeds (representing four news channels) and obtaining a single, well-categorized output feed. In this report, we discuss the implementation-specific details, the algorithms, the advantages and the limitations of our project, as well as the challenges that remain to be resolved. We also present the user evaluation method and the test results of our system.


LITERATURE SURVEY

There are two main approaches to building text classifiers: the Knowledge Engineering (KE) approach and the Machine Learning (ML) approach. Knowledge Engineering used to be very popular. It involves manually defining a set of rules that encode knowledge from experts in order to place texts in specified categories. KE gradually lost its popularity in the 1990s to the Machine Learning approach, which involves building an automatic text classifier by learning the characteristics of the categories of interest from a set of pre-classified texts.
In deciding whether to use the Machine Learning or the Knowledge Engineering approach to text classification, sentences in Dutch law were classified using both a Machine Learning technique and a Knowledge Engineering approach. SVM and pattern-based KE were implemented, and it was found that SVM attained an accuracy of up to 90%.
A scientific news aggregator that gathered news from both the Atom and RSS feeds of about 1000 web journals was developed. A Naive Bayes (NB) classifier was used to classify the news coming from the different sources into stipulated categories of interest. Since a relatively large part of the RSS/Atom feed was already manually classified at the originating news source, the key idea was to use the classifier in a mixed mode: as soon as a scientific news item already classified by its source was seen, the classifier switched to training mode; the remaining unclassified scientific news was categorized with the classifier in categorizing mode.
Multi-label classification was also implemented. A ranking function was used to compute the relevance of all predefined categories to a news item. The contents of the <title>, <description> and <link> elements were retrieved and used as features, and a normalized term frequency method was used to determine the weight of each individual feature in the vector space.
SVM was used to classify news articles into three categories: Sports, Business and Entertainment. The vector representation of the features serves as the entry point into the SVM classifier. The SVM classifier was implemented using LIBSVM, an integrated software package for support vector classification, regression and distribution estimation (one-class SVM) with support for multiclass classification.
Categorization of news text using SVM and ANN has also been carried out. In the overall comparison of the SVM and ANN algorithms for the data set that was used, the results for both recall and precision over all conditions indicate significant differences in performance in favour of the SVM algorithm, and since SVM is a less (computationally) complex algorithm than the ANN, the authors concluded that SVM is preferable, at least for the type of data examined, i.e., many short text documents in relatively few well-populated categories.
A method of text categorization for web documents using text mining and information extraction based on classical summarization techniques has also been proposed. First, the web documents are pre-processed by removing the HTML tags, meta-data, comment information, images, bullets, buttons, graphics, links and all other hyper data in order to establish an organized data file, and feature terms are recognized using the term frequency count and the weight percentage of each term. Experimental results showed that this approach to text categorization is more suitable for web content based on informal English, where a vast amount of data is written in informal terms. The method significantly reduced the query response time and improved the accuracy and degree of relevancy.
In [16], rough set theory was used to automatically classify text documents. After pre-processing the text documents and stemming the features, specific thresholds of 10%, 8%, 6% and 4% were used to reduce the size of the feature space based on the frequency of each feature in the text document. Thereafter, the model used a pair of precise concepts from rough set theory, called the lower and upper approximations, to classify any test text document into one or more of the main categories and sub-categories of interest. The rough set approach produced accuracy of up to 96%.
Text categorization was also used to detect intrusion in [9]. A KNN classifier was used for the classification: system processes were taken as the documents to be classified and system calls were taken as the distinct words. The tf-idf weighting technique was adopted to transform each process into a vector. Preliminary results showed that the text categorization approach is effective in detecting intrusive program behaviour.
SVM was chosen as the classification algorithm in this paper because it handles a high-dimensional input space, copes well when few features are irrelevant and tries to use as many features as possible, the document vectors are sparse, and most text categorization problems are linearly separable. Gmeans is an algorithm for clustering the text of a given dataset; it uses four different similarity measures, six different initialization methods and a powerful local search strategy called first variation. A frequent-itemset-based approach [9] provides a natural way of reducing the dimensionality of the vector space model; the frequent item sets are discovered using association rule mining.

Methodology

Steps:

1) Send the query term to the web server using a web service.
(Flow: the user enters a term → the request goes to the web service → the web service searches the database.)
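A minimal sketch of this step follows, assuming the web service is reachable over plain HTTP; the endpoint URL and the parameter name are assumptions, not the actual service interface.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch of step 1: send the user's query term to the web service as an HTTP GET
// request and read the response. The endpoint URL and parameter name are assumptions.
public class QuerySender {
    public static String sendQuery(String term) throws Exception {
        String url = "http://example.com/NewsService/search?term="
                + URLEncoder.encode(term, "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder response = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line);
            }
        }
        return response.toString(); // e.g. the URL of the top-ranked page
    }
}

On Android, such a call would be made off the UI thread (for example in an AsyncTask or a background thread).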

2) Fetch every web page's content stored in the database.
(Flow: the web server fetches the stored web pages from the database.)
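A minimal sketch of this step on the server side, assuming the pages are stored in an MS SQL Server table; the connection string, table name and column name are assumptions, and the Microsoft JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of step 2: fetch the stored web page contents from the database.
// The connection string, table name and column name are assumptions.
public class PageFetcher {
    public static List<String> fetchAllPages() throws Exception {
        String connectionUrl =
                "jdbc:sqlserver://localhost;databaseName=NewsDB;user=sa;password=secret";
        List<String> pages = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(connectionUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT content FROM WebPages")) {
            while (rs.next()) {
                pages.add(rs.getString("content"));
            }
        }
        return pages;
    }
}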

3) Calculate the term frequency of every word for each individual page stored in the database.
What does tf-idf mean?
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is often used in information retrieval and text mining.
This weight is a statistical measure used to evaluate how important a word
is to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the document but
is offset by the frequency of the word in the corpus. Variations of the tf-idf
weighting scheme are often used by search engines as a central tool in
scoring and ranking a document's relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf
for each query term; many more sophisticated ranking functions are
variants of this simple model.

Tf-idf can also be used successfully for stop-word filtering in various subject fields, including text summarization and classification.
How to Compute:
Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
TF: Term Frequency measures how frequently a term occurs in a document. Since every document differs in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms such as "is", "of" and "that" may appear many times but have little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
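The two formulas above translate directly into code. The following Java sketch is illustrative only; it assumes the pages have already been tokenized into lists of words, and it ignores stemming and stop-word removal.

import java.util.Arrays;
import java.util.List;

// Sketch of the TF and IDF formulas given above:
//   TF(t, d) = (occurrences of t in d) / (total terms in d)
//   IDF(t)   = log_e(total documents / documents containing t)
public class TfIdf {
    public static double tf(String term, List<String> docTerms) {
        long count = docTerms.stream().filter(term::equals).count();
        return (double) count / docTerms.size();
    }

    public static double idf(String term, List<List<String>> corpus) {
        long docsWithTerm = corpus.stream()
                .filter(doc -> doc.contains(term)).count();
        // Note: a term appearing in no document would divide by zero; callers
        // here only pass terms that occur in the corpus.
        return Math.log((double) corpus.size() / docsWithTerm);
    }

    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("india", "wins", "cricket", "match"),
                Arrays.asList("stock", "market", "hits", "record"),
                Arrays.asList("cricket", "world", "cup", "schedule"));
        // "cricket" appears once among 4 terms and in 2 of 3 documents: 0.25 * ln(3/2)
        System.out.println(tfIdf("cricket", corpus.get(0), corpus));
    }
}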

4) Apply the word weight-age technique to fetch the weight of every document/webpage in relation to the search keyword.
Given a query q composed of a set of words w_i, we calculate w_{i,d} for each w_i for every document d ∈ D. In the simplest way, this can be done by running through the document collection and keeping running sums of f_{w,d} and f_{w,D}.
Once done, we can easily calculate w_{i,d} according to the mathematical framework presented before.
Once all the w_{i,d} values are found, we return a set D* containing the documents d that maximize the following equation:
Σ_i w_{i,d}    (3)
Either the user or the system can arbitrarily determine the size of D* prior to initiating the query. Documents are returned in decreasing order according to equation (3).
This is the traditional method of implementing TF-IDF.
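A minimal sketch of this ranking step, reusing the TfIdf helper sketched above; the class and variable names are illustrative, and the real system may compute the running sums incrementally as described.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of step 4: score every document by summing the tf-idf weights w_{i,d}
// of the query words (equation (3)) and return documents in decreasing order.
public class Ranker {
    public static List<List<String>> rank(List<String> queryWords,
                                          List<List<String>> corpus) {
        List<List<String>> ranked = new ArrayList<>(corpus);
        ranked.sort(Comparator.comparingDouble(
                (List<String> d) -> score(queryWords, d, corpus)).reversed());
        return ranked;
    }

    // Sum of w_{i,d} over the query words; words absent from d contribute 0.
    static double score(List<String> queryWords, List<String> doc,
                        List<List<String>> corpus) {
        double sum = 0.0;
        for (String w : queryWords) {
            double tf = TfIdf.tf(w, doc);
            if (tf > 0) {
                sum += tf * TfIdf.idf(w, corpus);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("india", "wins", "cricket", "match"),
                Arrays.asList("stock", "market", "hits", "record"));
        System.out.println(rank(Arrays.asList("cricket"), corpus).get(0));
    }
}

The first element of the returned list corresponds to the top page selected in step 5.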

5) Select the top page with the highest weight.

6) Send the result back to the mobile app.
(Flow: the top-ranked page from the set of web pages is loaded by the Android app.)
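On the client side, the returned top page might simply be displayed in a WebView. A minimal Android sketch follows; the activity name and the intent extra carrying the page URL are assumptions, and the app would additionally need the INTERNET permission.

import android.app.Activity;
import android.os.Bundle;
import android.webkit.WebView;

// Sketch of step 6: the Android app receives the URL of the top-ranked page
// and loads it in a WebView. The activity name and extra key are assumptions.
public class ResultActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        WebView webView = new WebView(this);
        setContentView(webView);
        String topPageUrl = getIntent().getStringExtra("topPageUrl");
        webView.loadUrl(topPageUrl);
    }
}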

1.6 Hardware and Software Requirements

HARDWARE REQUIREMENTS:

1 GB RAM
200 GB HDD
Intel Pentium 4 processor, 1.66 GHz

SOFTWARE REQUIREMENTS:
Windows operating system (Windows XP, 7 or 8)
Visual Studio 2010
MS SQL Server 2008
Eclipse for Android
