INTRODUCTION
Many people like to view and analyze news from multiple news sources, and are
therefore likely to subscribe to RSS feeds from many of them. Often, users are
interested only in the top stories within their categories of interest, yet they
must scan through all the top stories in order to reach the ones they care
about. For example, a user interested in sports-related top stories has to go
through all the top stories from various channels, increasing the time spent
analyzing news from multiple sources. We therefore identified the need to bring
together news from various sources, categorize it, and present it to users as a
single news feed. A user can then subscribe to this one feed instead of
subscribing to multiple feeds. The RSS documents of the different news sources
act as the input to the system.
Web feeds provide a way for websites, especially those that are updated
frequently, to deliver up-to-date information to their users. Feeds are
provided in either RSS or Atom format.
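To make the feed format concrete, the sketch below parses a minimal RSS 2.0 document with Python's standard `xml.etree.ElementTree` module. The sample feed content and the function name are illustrative assumptions, not part of any real news source.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RSS 2.0 document used purely for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>Team wins final</title><category>Sports</category></item>
    <item><title>Markets rally</title><category>Business</category></item>
  </channel>
</rss>"""

def parse_items(rss_xml):
    """Return (title, category) pairs for every <item> in the feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        category = item.findtext("category", default="uncategorized")
        items.append((title, category))
    return items

print(parse_items(SAMPLE_RSS))
```

An Atom feed would be parsed the same way, with `<entry>` elements (and their XML namespace) in place of `<item>`.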
A web aggregation site is a website that gathers content from various feeds in
one place. This makes it easier for users to view content from multiple
websites at once, and it removes the overhead of users having to assemble the
content of a feed aggregator themselves. Popular aggregation websites include
newsnow.com and kicknews.com.
When aggregators have to categorize the content consumed from feeds, they
either use a predefined category registered for the feed's source or try to
obtain the category from the metadata supplied with the feed content. Using the
source's predefined category leads to cases in which the category does not
match the actual content being consumed. In other cases, the category supplied
in the metadata does not match any of the categories set up in the aggregator.
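The fallback logic described above can be sketched as follows. All names here (the category set, the source-to-category table, the catch-all bucket) are illustrative assumptions, not the API of any real aggregator.

```python
# Hypothetical category resolution for an aggregator: prefer the category
# supplied in the feed item's metadata, fall back to the category registered
# for the source, and finally file the item under a catch-all bucket.
KNOWN_CATEGORIES = {"sports", "business", "politics", "technology"}
SOURCE_DEFAULTS = {"kicknews.com": "sports", "newsnow.com": "politics"}

def resolve_category(source, metadata_category=None):
    # 1) Trust the item metadata only if it matches a category we know.
    if metadata_category and metadata_category.lower() in KNOWN_CATEGORIES:
        return metadata_category.lower()
    # 2) Otherwise use the predefined category registered for the source.
    if source in SOURCE_DEFAULTS:
        return SOURCE_DEFAULTS[source]
    # 3) Neither matched: use the catch-all category.
    return "general"
```

This ordering handles both failure modes mentioned above: metadata that names an unknown category falls back to the source default, and an unregistered source falls back to the catch-all.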
Problem Definition
People usually want to collect more information about a news story, and
gathering all this news helps users stay aware of current events. Web blogs are
full of unindexed and unprocessed text that reflects this heterogeneity, and it
is not easy to walk through a large volume of news and read it all carefully.
Sometimes a news item talks about a product directly, and sometimes reviews
mention it explicitly. There is therefore a need to collect and process
different news sources so that they can be used in decision-making processes.
The proposed system is an app that collects, parses, processes, annotates, and
analyzes news from various RSS-fed channels, crawling content that expresses
either positive or negative information.
LITERATURE SURVEY
Prior work on document clustering explores various initialization methods and a
powerful local search strategy called first variation. A frequent-itemset-based
approach [9] provides a natural way of reducing the dimensionality of the
vector space model; these frequent item sets are discovered using association
rule mining.
Methodology
Steps:
1) Send the query term to the web server using a web service. (Flow: the user
enters a search term, the request is passed to the web service, and the web
service searches the database.)
3) Calculate the term frequency of every word on each page stored in the
database.
What does tf-idf mean?
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining.
This weight is a statistical measure used to evaluate how important a word
is to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the document but
is offset by the frequency of the word in the corpus. Variations of the tf-idf
weighting scheme are often used by search engines as a central tool in
scoring and ranking a document's relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf
for each query term; many more sophisticated ranking functions are
variants of this simple model.
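The tf-idf weight described above can be sketched as a small helper. This is illustrative only: it uses raw counts for tf and log(N/df) for idf, one of several common variants of the scheme, and assumes documents are already tokenized into lowercase word lists.

```python
import math

def tf_idf(term, doc, docs):
    """tf-idf weight of `term` in `doc` relative to the collection `docs`."""
    tf = doc.count(term)                      # occurrences in this document
    df = sum(1 for d in docs if term in d)    # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Toy collection of three tokenized "pages" (illustrative data).
docs = [["sports", "news", "match", "sports"],
        ["business", "news"],
        ["politics", "news"]]
```

With this collection, "news" appears in every document, so its idf term log(3/3) is zero and its weight vanishes, while "sports" appears twice in the first document and in only one document overall, giving it a high weight there; this is exactly the offset behavior described above.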
4) Apply the term-weighting technique to compute the weight of every
document/web page with respect to the search keyword.
Given a query q composed of a set of words w_i, we calculate w_{i,d} for each
w_i for every document d in D. In the simplest implementation, this is done by
running through the document collection while keeping a running count of
f_{w,d} (occurrences of w in document d) and f_{w,D} (documents in D containing
w). Once done, we can easily calculate each w_{i,d} according to the
mathematical framework presented before. Once all the w_{i,d} are found, we
return a set D* containing the documents d that maximize

    Σ_i w_{i,d}    (3)

Either the user or the system can arbitrarily determine the size of D* prior to
initiating the query, and documents are returned in decreasing order of
equation (3). This is the traditional method of implementing TF-IDF.
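The ranking step, summing the tf-idf weights w_{i,d} over the query terms as in equation (3), might look like the following in outline. The function name and the toy documents are illustrative assumptions; documents are again tokenized lowercase word lists.

```python
import math

def rank_documents(query_terms, docs, k=2):
    """Return the indices of the top-k documents by summed tf-idf score."""
    n = len(docs)
    # Document frequency f_{w,D} for each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    def score(doc):
        # Equation (3): sum of w_{i,d} = tf * log(N / df) over the query terms.
        return sum(doc.count(t) * math.log(n / df[t])
                   for t in query_terms if df[t] > 0)
    # D*: the k highest-scoring documents, in decreasing order of score.
    order = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return order[:k]

# Toy collection (illustrative data).
docs = [["sports", "goal", "sports"],
        ["business", "market"],
        ["sports", "news"]]
```

For the query ["sports"], the first page scores highest (two occurrences of a term that appears in two of three documents), the third page second, and the business page scores zero, matching the decreasing order required by equation (3).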
5) Select the top page, i.e. the one with the highest weight.
6) The Android app loads the top-ranked page from the set of retrieved web
pages.
HARDWARE REQUIREMENTS:
1 GB RAM
200 GB HDD
Intel Pentium 4 processor, 1.66 GHz
SOFTWARE REQUIREMENTS:
Windows operating system (Windows XP / 7 / 8)
Visual Studio 2010
MS SQL Server 2008
Eclipse for Android development