
Extract insight from texts using SAP Text Analysis

Tomer Steinberg
SAP Israel Public
Agenda

Why use text analysis functionality?


Background: SAP’s text analysis technology
Search: Full-text search and fuzzy search
Text analysis: Entity and fact extraction
Text mining
Wrap-up

© 2015 SAP SE or an SAP affiliate company. All rights reserved. 2


Why use text analytics
Why Text Analytics

Enterprise Challenges: Massive amounts of data locked

Companies are struggling to:
▪ Search on unstructured text related content
▪ Extract meaningful, structured information from unstructured text
▪ Combine unstructured with structured data
▪ Leverage data in real-time to gauge and guide their business strategy and solve critical problems



Potential use cases

Law enforcement
Intelligence
Social Media Analytics
Precision Marketing
Predictive Maintenance

Investment trade
Credit Scoring
Patents



CONTEXTUAL MARKETING
HOW IT WORKS

REAL TIME INTENT SIGNALS SHOW THAT:
▪ MARY: Mary’s interest is leggings, so Mary’s free shipping offer is focused on leggings
▪ JANE: Jane’s interest is jackets, so Jane’s offer is on jackets
▪ SUE: Sue’s looking for a fleece, so Sue’s offer is on fleece

Single template with dynamic content:
Winter Ready?
Free Shipping on %Category%
Get yours now, with Free shipping!
Free shipping on %category% and more thru 11/28
BAGS | FLEECE | JACKETS | ACCESSORIES
Pouch Inc, 2345 Madison Avenue, New York. Unsubscribe


SAP’s Text Analysis Technology
Background: SAP’s text analysis technology

1997: Inxight spun off from PARC, a Xerox Company. Finite-State technology for modeling natural language.
2007: Inxight acquired by Business Objects. Integration of text analysis technology into BI applications.
2008: Business Objects acquired by SAP. Text analysis technology continues to focus on BI applications.
2012: First integration into SAP HANA. Foundation for full-text search, BI and sentiment analysis applications.
Today: Text analysis in SAP HANA. Foundation for virtually any type of unstructured textual data processing in the platform.



Background: SAP’s text analysis technology

The development team is one of the SAP HANA core teams

SAP Labs in Boston, MA
▪ Located near Kendall Square in Cambridge, close to MIT

14 engineers

7 computational linguists



SAP HANA Platform – More than just a database

[Diagram: SAP HANA Platform architecture]
Clients: Any Apps (Any App Server) | SAP Business Suite and BW (ABAP App Server) | Supports any Device
Open Connectivity: SQL | MDX | R | JSON
Extended Application Services: App Server | UI Integration Services | Web Server
OLTP | OLAP | Search | Text Analysis | Predictive | Events | Spatial | Rules | Planning | Calculators
Application Function Libraries & Data Models: Predictive Analysis Libraries | Business Function Libraries | Data Models & Stored Procedures
Integration Services: Data Virtualization | Replication | ETL/ELT | Mobile Synch | Streaming
Other labels in the diagram: Application Development | Processing Engine | Life-cycle Management | Process Orchestration | Unified Administration | Security | Database Services
Deployment: On-Premise | Hybrid | On-Demand

SAP HANA platform converges Database, Data Processing and Application Platform capabilities & provides Libraries for predictive, planning, text, spatial, and business analytics so businesses can operate in real-time.
Why does SAP HANA provide text analysis functionality?

Capabilities
Native full-text and fuzzy search
In-database text analysis
Graphical modeling of search models
Info Access – HTML5 UI toolkit and API for JavaScript

Benefits
Less data duplication and movement – leverage one infrastructure for analytical and search workloads
Extract salient information from unstructured textual data
Easy-to-use modeling tools – HANA Studio
Build search applications quickly – Info Access



What types of text processing capabilities are supported?

Search
In addition to string matching, HANA features full-text search which works on content stored in tables or exposed via views. Just like searching on the Internet, full-text search finds terms irrespective of the sequence of characters and words.

Text analysis
Capabilities range from basic tokenization and stemming to more complex semantic analysis in the form of entity and fact extraction. Text analysis applies within individual documents and is the foundation for both full-text search and text mining.

Text mining
Text mining makes semantic determinations about the overall content of documents relative to other documents. Capabilities include key term identification and document categorization. Text mining is complementary to text analysis.
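
Fuzzy search, listed among the search capabilities, tolerates misspelled query terms. The sketch below only illustrates the idea with Python's standard `difflib`; the vocabulary, the `fuzzy_lookup` helper and the 0.8 cutoff are assumptions made for this example and say nothing about how SAP HANA scores fuzzy matches.

```python
import difflib

def fuzzy_lookup(term, vocabulary, cutoff=0.8):
    """Return indexed terms whose similarity ratio to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

# Hypothetical vocabulary of indexed terms.
vocabulary = ["vodafone", "broadband", "ballet", "semperoper"]

print(fuzzy_lookup("Vodafon", vocabulary))  # misspelled query still finds a term
print(fuzzy_lookup("balet", vocabulary))
```

A lower cutoff accepts more distant spellings at the cost of more false matches.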



What types of text processing capabilities are supported?

Text analysis example

Nicole Kidman, Aaron Eckhart and ‘Rabbit Hole’
By MEKADO MURPHY
[Photo: Dan Steinberg/Associated Press. Aaron Eckhart and Nicole Kidman at the Toronto International Film Festival]

TORONTO — Nicole Kidman returns to Toronto, this time in the role of both actor and producer for her latest project, “Rabbit Hole.” The film, in which she co-stars with Aaron Eckhart, looks at a suburban married couple who experience a tremendous loss.
“Rabbit Hole” is based on the play by David Lindsay-Abaire, who also adapted it for the screen. The play received a positive review when it premiered at Manhattan Theater Club in 2006 and caught the attention of Ms. Kidman and her producing partner, Per Saari, who decided to option it.
Ms. Kidman and Mr. Eckhart shared some thoughts about the new film and the process of working with their director, John Cameron Mitchell.

Extracted entities:
Nicole Kidman           PERSON
Aaron Eckhart           PERSON
MEKADO MURPHY           PERSON
Dan Steinberg           PERSON
Associated Press        ORGANIZATION
TORONTO                 CITY
Nicole Kidman           PERSON
Toronto                 CITY
David Lindsay-Abaire    PERSON
Manhattan Theater Club  PLACE
2006                    YEAR
Ms. Kidman              PERSON
Per Saari               PERSON
Ms. Kidman              PERSON
Mr. Eckhart             PERSON
John Cameron Mitchell   PERSON
…                       …



What types of text processing capabilities are supported?
Text mining example

At Dresden Semperoper, a New Take on ‘Tristan and Isolde’
By ROSLYN SULCAS, February 17, 2015

DRESDEN, Germany — David Dawson’s new “Tristan and Isolde” for the Dresden Semperoper Ballett raises interesting questions about the full-length story ballet, a genre much-loved by audiences and seldom tackled by choreographers today.
It’s surprising that the Tristan and Isolde story, a medieval Celtic tale that has long figured in literature, film and in Wagner’s opera of the same name, has been so infrequently used by ballet. Like “Romeo and Juliet,” it has instant attraction and union between lovers from opposing camps, with society and history against them, and tragic death at its end. You can imagine what John Cranko or Kenneth MacMillan, who brought the big, all-guns-blazing story ballets like “Manon” and “Eugene Onegin” to the world in the 1960s and 1970s (ballet box offices are still thanking them), might have done with it.

Category: Classical_Music
Key terms: Semperoper, Wagner, ballet, John Cranko, Royal Ballet School …

Vodafone Turns Focus to Broadband, Seeking to Catch Up to Rivals
By MARK SCOTT, February 16, 2015

As consumers change the way they use their smartphones, surf the web and watch television, Vodafone is finding itself in need of a face-lift. After years of focusing heavily on its cellphone business, Vodafone, based in Britain and the world’s second-largest mobile operator behind China Mobile based on subscribers, is concentrating on high-speed broadband.
Once, Europeans were happy to pay for separate cellphone, cable and pay-TV services. Now, they prefer them bundled into a single package that streams content to any device — a smartphone, tablet or Internet-connected television.
Regional rivals like Orange of France and Deutsche Telekom of Germany have moved quickly to offer …

Category: Telecommunications
Key terms: Vodafone, broadband, cellphone business, Orange, Deutsche Telekom, …



Search
Search
Full-text indexing

A full-text index – required for Google-like searches – is defined on a table column
The table column is ‘aware’ of its index – insert, update, delete is handled automatically
Fast delta indexing
Broad language identification & processing

Available Languages
Arabic                  Indonesian
Catalan                 Japanese
Chinese (Simplified)    Korean
Chinese (Traditional)   Norwegian (Bokmal)
Croatian                Norwegian (Nynorsk)
Czech                   Polish
Danish                  Portuguese
Dutch                   Romanian
English                 Russian
Farsi                   Serbian
French                  Slovak
German                  Slovenian
Greek                   Spanish
Hebrew                  Swedish
Hungarian               Thai
Italian                 Turkish



Search
Full-text indexing
The following steps are executed on unstructured text:

File format filtering → Converts any binary document format to text/HTML
Language detection → Identifies language to apply appropriate tokenization and stemming
Tokenization → Decomposes word sequences
  E.g. “card-based payment systems” → “card” “based” “payment” “systems”
Stemming → Normalizes tokens to linguistic base form
  E.g. houses → house; ran → run
Full-text index → ‘Attaches’ to the table column
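
The tokenization, stemming and indexing steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SAP HANA's implementation: the regular-expression tokenizer, the one-entry irregular-form table, the plural-stripping rule and all function names are made up for the example, and file format filtering and language detection are skipped.

```python
import re
from collections import defaultdict

# Toy irregular-form table; a real stemmer is language-specific and far larger.
IRREGULAR = {"ran": "run"}

def tokenize(text):
    """Split text into lowercase word tokens, breaking on hyphens and spaces."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(token):
    """Very rough English stemming: irregular lookup, then strip a plural 's'."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every stemmed query term, in any order."""
    terms = [stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "Card-based payment systems",
    2: "She ran past the houses",
}
index = build_index(docs)
print(search(index, "house run"))  # matches doc 2 despite different word forms
```

Because both documents and queries pass through the same tokenizer and stemmer, a query for "house run" finds a document that only contains "houses" and "ran".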



Text Analysis
Text analysis
An option to the full-text index
The following steps may be executed on unstructured text to augment full-text indexing:

Part-of-Speech → Tags word categories
  Examples: quick: Adj; houses: Nn-Pl
Noun groups → Identifies concepts
  Examples: text data; global piracy
Entity extraction → Classifies pre-defined entity types
  Examples: Winston Churchill: PERSON; U.K.: COUNTRY
Fact extraction → Relates entities – e.g., classifies sentiments with topics
  Example: I love SAP HANA:
  [Sentiment] I [StrongPositiveSentiment] love [/StrongPositiveSentiment] [Topic] SAP HANA [/Topic].[/Sentiment]



Text analysis
Entity and fact extraction

Text analysis gives ‘structure’ to two sorts of elements from unstructured text:

Entities:
John Lennon was one of the Beatles.
<PERSON>John Lennon</PERSON> was one of the <ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>.

Facts:
I love your product.
I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> <TOPIC>your product</TOPIC>.
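
To show what "giving structure" means in practice, the sketch below turns inline-tagged sentences like the two above into rows of (type, text, offset). It assumes the angle-bracket notation used on this slide and handles only non-nested tags; the `parse_annotations` helper is hypothetical and is not an SAP HANA API.

```python
import re

# Matches <TYPE>text</TYPE>, where TYPE may contain letters, '_' and '@'.
TAG = re.compile(r"<([A-Z_@]+)>(.*?)</\1>")

def parse_annotations(tagged):
    """Return ([(type, text, offset), ...], plain_text) for a tagged sentence.

    Offsets refer to positions in the untagged plain text.
    """
    annotations = []
    plain = []
    pos = 0        # current position in the tagged string
    plain_len = 0  # length of the plain text emitted so far
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        plain.append(before)
        plain_len += len(before)
        annotations.append((m.group(1), m.group(2), plain_len))
        plain.append(m.group(2))
        plain_len += len(m.group(2))
        pos = m.end()
    plain.append(tagged[pos:])
    return annotations, "".join(plain)

entities, text = parse_annotations(
    "<PERSON>John Lennon</PERSON> was one of the "
    "<ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>."
)
for ann_type, ann_text, offset in entities:
    print(ann_type, ann_text, offset)
```

Rows in this shape can be joined with structured data, for example by counting PERSON mentions per document.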

Text analysis
Supported types for entity extraction

Who: People, job title, and national identification numbers
What: Companies, organizations, financial indexes, and products
When: Dates, days, holidays, months, years, times, and time periods
Where: Addresses, cities, states, countries, facilities, internet addresses, and phone numbers
How much: Currencies and units of measure
Generic concepts: text data, global piracy, and so on

Languages:
Arabic, English, Dutch, Farsi, French, German, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese

Text analysis
Supported fact extraction (1/2)

Voice of customer
Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems
Requests: general and contact info
Emoticons: strong positive, weak positive, weak negative, strong negative
Profanity: ambiguous and unambiguous

Languages:
English, Dutch*, French, German, Italian, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese
*Emoticons and profanity only

Text analysis
Supported fact extraction (2/2)

Enterprise
Membership information
Management changes
Product releases
Mergers & acquisitions
Organizational information
Language: English

Public Sector
Action & travel events
Military units
Person-alias, -appearance, -attributes, -relationships
Spatial references
Domain-specific entities
Language: English



Text analysis
How entity extraction works

Built-in entity extraction is not keyword search. Text analysis applies full linguistic and statistical techniques (i.e., natural language processing) to make sure the entities which get returned are correct.

Grammatical Parsing:
▪ Can we bill you?
▪ Bill Smith was the president.

Semantic Disambiguation:
▪ I talked to Bill yesterday.
▪ The bill was signed into law
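
The four "bill" sentences on this slide make the point concrete: a keyword search matches all of them, although only two mention a person. The sketch below contrasts that with a deliberately crude context check. The capitalization heuristic and both function names are assumptions made for this example; the linguistic and statistical techniques the slide refers to are far more involved.

```python
import re

SENTENCES = [
    "Can we bill you?",
    "Bill Smith was the president.",
    "I talked to Bill yesterday.",
    "The bill was signed into law",
]

def keyword_hits(sentences, keyword="bill"):
    """Case-insensitive keyword search: returns every sentence containing the word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

def looks_like_person(sentence, name="Bill"):
    """Toy context check: capitalized 'Bill' that is either mid-sentence
    or followed by another capitalized word (a likely surname)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    for i, word in enumerate(words):
        if word != name:
            continue
        followed_by_capital = i + 1 < len(words) and words[i + 1][0].isupper()
        if i > 0 or followed_by_capital:
            return True
    return False

print(len(keyword_hits(SENTENCES)))                      # keyword search hits
print([s for s in SENTENCES if looks_like_person(s)])    # context-aware hits
```

The heuristic happens to work on these four sentences and breaks easily elsewhere (for example on "Bill me later."), which is why real extraction relies on grammatical parsing and disambiguation.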

Text analysis
Quality of extraction

Significant investment in gold corpus development across languages to achieve objective, repeatable assessments of entity and fact extraction (i.e., blind testing).



Language Support
Language | LINGANALYSIS_BASIC/STEMS | LINGANALYSIS_FULL | EXTRACTION_CORE | EXTRACTION_CORE_VOICEOFCUSTOMER
Arabic ✓  ✓ ✓
Catalan ✓ ✓
Chinese (Simplified) ✓ ✓  ✓ ✓
Chinese (Traditional) ✓ ✓  ✓ ✓
Croatian ✓ ✓
Czech ✓ ✓
Danish ✓ ✓
Dutch ✓ ✓  ✓
English ✓ ✓  ✓ ✓
Farsi ✓  ✓ ✓
French ✓ ✓  ✓ ✓
German ✓ ✓  ✓ ✓
Greek ✓
✓ ✓
Hebrew
Hungarian ✓
Indonesian ✓ ✓
Italian ✓ ✓  ✓ ✓
Japanese ✓ ✓  ✓
Korean ✓  ✓ ✓
Norwegian (Bokmal) ✓ ✓
Norwegian (Nynorsk) ✓ ✓
Polish ✓
Portuguese ✓ ✓  ✓ ✓
Romanian ✓
Russian ✓ ✓  ✓ ✓
Serbian ✓ ✓
Slovak ✓ ✓
Slovenian ✓ ✓
Spanish ✓ ✓  ✓ ✓
Swedish ✓ ✓
Thai ✓ ✓
Turkish ✓ ✓



Demo
Step 1: Load Documents
Step 2: Create Text Index
Step 3: Analyze Results


Text mining
Text mining

Text mining works at the document level

Determines the overall content of documents relative to other documents.

Used to:
▪ Identify similar documents
▪ Identify key terms of a document
▪ Identify related terms
▪ Categorize new documents based on a training corpus
Scenarios:
▪ Highlight the key terms when viewing a patent document
▪ Identify similar incidents for faster problem solving
▪ Categorize new scientific papers along a hierarchy of topics
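
Two of the uses listed above, key term identification and finding similar documents, can be sketched with plain TF-IDF weighting and cosine similarity. These are standard textbook techniques swapped in for illustration; the deck does not say which algorithms SAP HANA text mining uses, and the sample documents and function names are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def key_terms(vector, k=3):
    """Top-k terms by weight (ties broken alphabetically)."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "the ballet company staged a new ballet at the opera house",
    "the opera house announced a new opera season",
    "the mobile operator sells broadband and mobile plans",
]
vectors = tfidf_vectors(docs)
print(key_terms(vectors[0], k=1))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Terms that occur in every document (here "the") get weight zero, so they neither become key terms nor make unrelated documents look similar. Categorizing a new document could reuse the same vectors by copying the category of its most similar training documents.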



Text Mining Demo
Wrap Up

Structure massive amounts of unstructured data
▪ Search on unstructured text related content
▪ Extract meaningful, structured information from unstructured text
▪ Combine unstructured with structured data
▪ Leverage data in real-time to gauge and guide their business strategy and solve critical problems

Business Benefits
▪ Understand your Customer/Process



Wrap-Up

Tomer Steinberg
Tomer.Steinberg@sap.com
