Welcome to Scribd!

Etola: A News Headline Clustering Tool

Uploaded by

0% found this document useful (0 votes)

48 views15 pages

Nibir Bora & Bhabani Mishra. In 1st Project Innovation Contest (PIC 2012), poster session of 8th International Conference on Distributed Computing and Internet Technology (ICDCIT 2012), Bhubaneswar, 1-4 February, 2012.

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Attribution Non-Commercial (BY-NC)

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

48 views15 pages

Etola: A News Headline Clustering Tool

Uploaded by

Nibir Bora

Copyright:

Attribution Non-Commercial (BY-NC)

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 15

Search inside document

ICDCIT-2012

PROJECT INNOVATION CONTEST

Etola: A News Headline Clustering Tool

(PIC No: 11)
Nibir Nayan Bora Project Guide: Dr. Bhabani Shankar Prasad Mishra KIIT University, Bhubaneswar, Odisha

Outline
Perform Cluster Analysis on news headlines.
Create a dataset Propose a new technique Compare with traditional method Evaluate quality

Cluster Analysis
Assigning objects to groups (called clusters) in accordance with the general clustering rule, high intra-cluster similarity and low inter-cluster similarity

Document clustering / Text categorization Unsupervised learning

Why News Headlines?

How are news headlines different? Small size Less dimensions Grammatical inconsistent Benefit: Traditional clustering methods can be modified accordingly.

The Dataset
Extracted from the Reuters corpus (90 category split) No. of documents: 343 No. of classes (topics): 26 e.g. gold, coffee, jobs, retail, housing. Each headline relates to a single topic.

The Dataset (examples)

Headline U.S. cotton certificate expiration date extended China trying to increase cotton output, paper says January housing sales drop, realty group says Quebec February housing starts fall Pakistan not seen as major wheat exporter
Weather hurting Yugoslav wheat - USDA Report

Topic Cotton Cotton Housing Housing Wheat

wheat

Preprocessing
Unwanted characters Case normalization Tokenization Stopwords Stemming

Vector Space Model

k-means
Randomly select k initial centroids repeat
Assign each document to the closest centroid. Recalculate cluster centroids.

until clusters do not change

Frequent term
Create a list of all terms in corpus and arrange them in order of their frequencies. for each term
if the term is present more often in unclustered documents, then Form a cluster by collecting all documents containing the term.

Apply refinement

Refinement
For clusters with single document, mark document as unclustered. Assign each unclustered document to its closest cluster. Apply k-means clustering to these clusters.

Frequent term plus

Assumption: The lesser the number of terms in a document, the easier it is to identify the topic term. A document contains only one topic term.
What do we do? Arrange the documents in reverse order of number of terms. Try to find a topic term in each.

Frequent term plus (algorithm)

Create a list of all terms in corpus and their frequencies. Arrange the documents in ascending order of number of terms. for each document
Starting with term with highest frequency in corpus, try to find a topic term. if the term is present more often in unclustered documents, then Form a cluster by collecting all documents containing the term.

Apply refinement

Frequent nouns
Similar to Frequent term clustering. In preprocessing, all non-noun features are removed.

Frequent nouns plus

Similar to Frequent term plus clustering. In preprocessing, all non-noun features are removed.

Results
Legend: FF Frequent feature FN Frequent nouns
k-means FF FF plus FN FN plus

No. of clusters generated

Correct clusters

26
17

29
19

34
22

28
20

33
22

Incorrect clusters
No. of documents correctly clustered

9
238

10
226

12
255

8
241

11
272

Applications
Create news categories by clustering news from various sources.
Like Google News

Future Work
Cluster descriptor Use of synonyms

The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
From Everand
The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life
Mark Manson
Rating: 4 out of 5 stars
4/5 (5794)
The Little Book of Hygge: Danish Secrets to Happy Living
From Everand
The Little Book of Hygge: Danish Secrets to Happy Living
Meik Wiking
Rating: 3.5 out of 5 stars
3.5/5 (399)
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
From Everand
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Dave Eggers
Rating: 3.5 out of 5 stars
3.5/5 (231)
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
From Everand
Hidden Figures: The American Dream and the Untold Story of the Black Women Mathematicians Who Helped Win the Space Race
Margot Lee Shetterly
Rating: 4 out of 5 stars
4/5 (894)
The Yellow House: A Memoir (2019 National Book Award Winner)
From Everand
The Yellow House: A Memoir (2019 National Book Award Winner)
Sarah M. Broom
Rating: 4 out of 5 stars
4/5 (98)
Shoe Dog: A Memoir by the Creator of Nike
From Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
Rating: 4.5 out of 5 stars
4.5/5 (537)
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
From Everand
Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
Ashlee Vance
Rating: 4.5 out of 5 stars
4.5/5 (474)
Never Split the Difference: Negotiating As If Your Life Depended On It
From Everand
Never Split the Difference: Negotiating As If Your Life Depended On It
Chris Voss
Rating: 4.5 out of 5 stars
4.5/5 (838)
Grit: The Power of Passion and Perseverance
From Everand
Grit: The Power of Passion and Perseverance
Angela Duckworth
Rating: 4 out of 5 stars
4/5 (587)
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
From Everand
Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
Gilbert King
Rating: 4.5 out of 5 stars
4.5/5 (265)
Yes Please
From Everand
Yes Please
Amy Poehler
Rating: 4 out of 5 stars
4/5 (1891)
Angela's Ashes: A Memoir
From Everand
Angela's Ashes: A Memoir
Frank McCourt
Rating: 4.5 out of 5 stars
4.5/5 (440)
The Emperor of All Maladies: A Biography of Cancer
From Everand
The Emperor of All Maladies: A Biography of Cancer
Siddhartha Mukherjee
Rating: 4.5 out of 5 stars
4.5/5 (271)
On Fire: The (Burning) Case for a Green New Deal
From Everand
On Fire: The (Burning) Case for a Green New Deal
Naomi Klein
Rating: 4 out of 5 stars
4/5 (73)
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
From Everand
The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
Ben Horowitz
Rating: 4.5 out of 5 stars
4.5/5 (344)
Team of Rivals: The Political Genius of Abraham Lincoln
From Everand
Team of Rivals: The Political Genius of Abraham Lincoln
Doris Kearns Goodwin
Rating: 4.5 out of 5 stars
4.5/5 (234)
Fear: Trump in the White House
From Everand
Fear: Trump in the White House
Bob Woodward
Rating: 3.5 out of 5 stars
3.5/5 (738)
The Glass Castle: A Memoir
From Everand
The Glass Castle: A Memoir
Jeannette Walls
Rating: 4.5 out of 5 stars
4.5/5 (1712)
Rise of ISIS: A Threat We Can't Ignore
From Everand
Rise of ISIS: A Threat We Can't Ignore
Jay Sekulow
Rating: 3.5 out of 5 stars
3.5/5 (137)
Principles: Life and Work
From Everand
Principles: Life and Work
Ray Dalio
Rating: 4 out of 5 stars
4/5 (599)
The Unwinding: An Inner History of the New America
From Everand
The Unwinding: An Inner History of the New America
George Packer
Rating: 4 out of 5 stars
4/5 (45)
The World Is Flat 3.0: A Brief History of the Twenty-first Century
From Everand
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Thomas L. Friedman
Rating: 3.5 out of 5 stars
3.5/5 (2219)
Steve Jobs
From Everand
Steve Jobs
Walter Isaacson
Rating: 4.5 out of 5 stars
4.5/5 (806)
John Adams
From Everand
John Adams
David McCullough
Rating: 4.5 out of 5 stars
4.5/5 (2409)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
From Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brené Brown
Rating: 4 out of 5 stars
4/5 (1090)
Bad Feminist: Essays
From Everand
Bad Feminist: Essays
Roxane Gay
Rating: 4 out of 5 stars
4/5 (1015)
The Outsider: A Novel
From Everand
The Outsider: A Novel
Stephen King
Rating: 4 out of 5 stars
4/5 (1839)
Brooklyn: A Novel
From Everand
Brooklyn: A Novel
Colm Tóibín
Rating: 3.5 out of 5 stars
3.5/5 (1937)
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
From Everand
The Sympathizer: A Novel (Pulitzer Prize for Fiction)
Viet Thanh Nguyen
Rating: 4.5 out of 5 stars
4.5/5 (119)
A Man Called Ove: A Novel
From Everand
A Man Called Ove: A Novel
Fredrik Backman
Rating: 4.5 out of 5 stars
4.5/5 (4609)
The Light Between Oceans: A Novel
From Everand
The Light Between Oceans: A Novel
M.L. Stedman
Rating: 4.5 out of 5 stars
4.5/5 (789)
The Woman in Cabin 10
From Everand
The Woman in Cabin 10
Ruth Ware
Rating: 3.5 out of 5 stars
3.5/5 (2322)
Manhattan Beach: A Novel
From Everand
Manhattan Beach: A Novel
Jennifer Egan
Rating: 3.5 out of 5 stars
3.5/5 (792)
The Perks of Being a Wallflower
From Everand
The Perks of Being a Wallflower
Stephen Chbosky
Rating: 4.5 out of 5 stars
4.5/5 (2099)
Wolf Hall: A Novel
From Everand
Wolf Hall: A Novel
Hilary Mantel
Rating: 4 out of 5 stars
4/5 (3811)
Little Women
From Everand
Little Women
Louisa May Alcott
Rating: 4 out of 5 stars
4/5 (104)
The Art of Racing in the Rain: A Novel
From Everand
The Art of Racing in the Rain: A Novel
Garth Stein
Rating: 4 out of 5 stars
4/5 (4200)
Sing, Unburied, Sing: A Novel
From Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
Rating: 4 out of 5 stars
4/5 (1103)
A Tree Grows in Brooklyn
From Everand
A Tree Grows in Brooklyn
Betty Smith
Rating: 4.5 out of 5 stars
4.5/5 (1929)
The Constant Gardener: A Novel
From Everand
The Constant Gardener: A Novel
John le Carre
Rating: 3.5 out of 5 stars
3.5/5 (104)
Her Body and Other Parties: Stories
From Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
Rating: 4 out of 5 stars
4/5 (821)
Numerical Method - Accuracy of Numbers
Document9 pages
Numerical Method - Accuracy of Numbers
proud
No ratings yet
Least Square Methods
Document17 pages
Least Square Methods
dsdsdsdsds
No ratings yet
Stationary Stochastic Processes
Document11 pages
Stationary Stochastic Processes
Minal Chalakh
No ratings yet
Automation MCQ
Document49 pages
Automation MCQ
Amer Al-khorasani
No ratings yet
2-CS701 Solved 25 MID Term Most Important
Document15 pages
2-CS701 Solved 25 MID Term Most Important
4khurra
No ratings yet
Go, Rust Cheat Sheet
Document99 pages
Go, Rust Cheat Sheet
David
No ratings yet
Assignm 3
Document3 pages
Assignm 3
Luis Bellido C.
No ratings yet
PE VI-MEG347A-CFD Syllabus 2020
Document5 pages
PE VI-MEG347A-CFD Syllabus 2020
Vaibhav Anand
No ratings yet
High Impedance Fault Detection Using Multi Domain Feature With Artificial Neural Network
Document15 pages
High Impedance Fault Detection Using Multi Domain Feature With Artificial Neural Network
ANKAN CHANDRA
No ratings yet
Cryptography and Network Security Feb 2020
Document1 page
Cryptography and Network Security Feb 2020
Akshita dammu
No ratings yet
Thesis Lu Dai
Document85 pages
Thesis Lu Dai
Anurag Bansal
No ratings yet
Value at Risk Final
Document27 pages
Value at Risk Final
Anu Bumra
No ratings yet
Ackermann Function
Document8 pages
Ackermann Function
bit_richard
No ratings yet
Fuzzy Logic - Manafeddin Namazov
Document5 pages
Fuzzy Logic - Manafeddin Namazov
Arif Rizkianto
No ratings yet
AIML Online
Document16 pages
AIML Online
MAHESHUMANATH GOPALAKRISHNAN
No ratings yet
IES3006: Control Systems: R G H + - y H +
Document5 pages
IES3006: Control Systems: R G H + - y H +
ntuan_181
No ratings yet
Tm4112 - 6 Solution of Linear Sets of Equations
Document27 pages
Tm4112 - 6 Solution of Linear Sets of Equations
Ray Yuda
No ratings yet
Artificial Intelligence Artificial Neural Networks - : Introduction
Document43 pages
Artificial Intelligence Artificial Neural Networks - : Introduction
Rawaz Aziz
No ratings yet
Or Objectives
Document19 pages
Or Objectives
Karthik Saraa
No ratings yet
Worksheet 5 Testing
Document6 pages
Worksheet 5 Testing
Ahmed
No ratings yet
Chapter One Discrete-Time Signals and Systems: Lecture #1
Document6 pages
Chapter One Discrete-Time Signals and Systems: Lecture #1
DANIEL ABERA
No ratings yet
Important questions on time series analysis and forecasting
Document8 pages
Important questions on time series analysis and forecasting
Gajendra Sagar
No ratings yet
Weather Trends Sales Predictor
Document6 pages
Weather Trends Sales Predictor
Wananda Muhammad Rifki
No ratings yet
Critical Path Method
Document53 pages
Critical Path Method
Frst-Ngwazi Prince Christopher Manda
100% (1)
EE362 Ch5
Document70 pages
EE362 Ch5
SerdarKaraman
No ratings yet
Most Expected 2 Marks Most Expected Questions Stack Computer Science Class 12
Document28 pages
Most Expected 2 Marks Most Expected Questions Stack Computer Science Class 12
sridharan.sri7766
No ratings yet
Eng M 540
Document60 pages
Eng M 540
Gursimar Singh
No ratings yet
Basic Stata Commands for Data Analysis
Document10 pages
Basic Stata Commands for Data Analysis
samsul rosadi
No ratings yet
Z Transform
Document10 pages
Z Transform
Suvra Pattanayak
No ratings yet
Sequence Detector Finite State Machine Design
Document2 pages
Sequence Detector Finite State Machine Design
asic_master
No ratings yet