Extract insight from texts using SAP Text Analysis
Tomer Steinberg
SAP Israel Public
© 2015 SAP SE or an SAP affiliate company. All rights reserved.
Agenda
Why use text analysis functionality?
Background: SAP’s text analysis technology
Search: Full-text search and fuzzy search
Text analysis: Entity and fact extraction
Text mining
Wrap-up
Why use text analytics
Why Text Analytics
Enterprise challenge: massive amounts of data are locked in unstructured text
Companies are struggling to:
• Search unstructured text content
• Extract meaningful, structured information from unstructured text
• Combine unstructured with structured data
• Leverage data in real time to gauge and guide their business strategy and solve critical problems
Potential use cases
Law enforcement
Intelligence
Social Media Analytics
Precision Marketing
Predictive Maintenance
Investment trade
Credit Scoring
Patents
Contextual marketing: how it works
Real-time intent signals show that:
• Mary’s interest is leggings
• Jane’s interest is jackets
• Sue’s looking for a fleece
Single template with dynamic content:
• Mary’s free shipping offer is focused on leggings
• Jane’s is on jackets
• Sue’s is on fleece
Email template: “Winter Ready? Free Shipping on %Category%. Get yours now, with free shipping! Free shipping on %category% and more thru 11/28”
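The "single template with dynamic content" idea can be sketched in a few lines of Python. This is only an illustration: the function name `render_offer` and the `interests` dictionary are hypothetical, and the slide does not say how the placeholders are actually filled.

```python
def render_offer(template: str, category: str) -> str:
    """Fill the %Category% and %category% placeholders used in the slide's template."""
    return (template
            .replace("%Category%", category.capitalize())
            .replace("%category%", category.lower()))

TEMPLATE = ("Winter Ready? Free Shipping on %Category%. "
            "Free shipping on %category% and more thru 11/28")

# Hypothetical intent signals per recipient, taken from the slide's example.
interests = {"Mary": "leggings", "Jane": "jackets", "Sue": "fleece"}

offers = {name: render_offer(TEMPLATE, category) for name, category in interests.items()}
for name, text in offers.items():
    print(name, "->", text)
```

In a real scenario the category per recipient would come from the text analysis results rather than from a hard-coded dictionary.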
SAP’s Text Analysis Technology
Background: SAP’s text analysis technology
• 1997: Inxight spun off from PARC, a Xerox company. Finite-state technology for modeling natural language.
• 2007: Inxight acquired by Business Objects. Integration of text analysis technology into BI applications.
• 2008: Business Objects acquired by SAP. Text analysis technology continues to focus on BI applications.
• 2012: First integration into SAP HANA. Foundation for full-text search, BI and sentiment analysis applications.
• Today: Text analysis in SAP HANA. Foundation for virtually any type of unstructured textual data processing in the platform.
Background: SAP’s text analysis technology
The development team is one of the SAP HANA core teams
SAP Labs in Boston, MA
• Located near Kendall Square in Cambridge, close to MIT
14 engineers
7 computational linguists
SAP HANA Platform – More than just a database
SAP HANA platform converges database, data processing and application platform capabilities and provides libraries for predictive, planning, text, spatial, and business analytics so businesses can operate in real time.
Clients: Any apps | Any app server | SAP Business Suite and BW (ABAP app server)
Connectivity: JSON | R | Open Connectivity | MDX | SQL
SAP HANA Platform
• Extended Application Services: App Server | UI Integration Services | Web Server
• Database Services, Processing Engine: OLTP | OLAP | Search | Text Analysis | Predictive | Events | Spatial | Rules | Planning | Calculators
• Database Services, Application Function Libraries & Data Models: Predictive Analysis Libraries | Business Function Libraries | Data Models & Stored Procedures
• Integration Services: Data Virtualization | Replication | ETL/ELT | Mobile Synch | Streaming
• Platform-wide: Unified Administration | Life-cycle Management | Security | Application Development | Process Orchestration
• Deployment: On-Premise | Hybrid | On-Demand
Supports any device
Why does SAP HANA provide text analysis functionality?
Capabilities
Native full-text and fuzzy search
In-database text analysis
Graphical modeling of search models
Info Access – HTML5 UI toolkit and API for JavaScript
Benefits
Less data duplication and movement – leverage one infrastructure for analytical and search workloads
Extract salient information from unstructured textual data
Easy-to-use modeling tools – HANA Studio
Build search applications quickly – Info Access
What types of text processing capabilities are supported?
Search
In addition to string matching, HANA features full-text search, which works on content stored in tables or exposed via views. Just like searching on the Internet, full-text search finds terms irrespective of the sequence of characters and words.
Text mining
Text mining makes semantic determinations about the overall content of documents relative to other documents. Capabilities include key term identification and document categorization. Text mining is complementary to text analysis.
Text analysis
Capabilities range from basic tokenization and stemming to more complex semantic analysis in the form of entity and fact extraction. Text analysis applies within individual documents and is the foundation for both full-text search and text mining.
Example: entity extraction
Nicole Kidman, Aaron Eckhart and ‘Rabbit Hole’
By MEKADO MURPHY
Dan Steinberg/Associated Press
Aaron Eckhart and Nicole Kidman at the Toronto International Film Festival
TORONTO — Nicole Kidman returns to Toronto, this time in the role of both actor and producer for her latest project, “Rabbit Hole.” The film, in which she co-stars with Aaron Eckhart, looks at a suburban married couple who experience a tremendous loss.
“Rabbit Hole” is based on the play by David Lindsay-Abaire, who also adapted it for the screen. The play received a positive review when it premiered at Manhattan Theater Club in 2006 and caught the attention of Ms. Kidman and her producing partner, Per Saari, who decided to option it.
Ms. Kidman and Mr. Eckhart shared some thoughts about the new film and the process of working with their director, John Cameron Mitchell.
Extracted entities:
Nicole Kidman: PERSON
Aaron Eckhart: PERSON
MEKADO MURPHY: PERSON
Dan Steinberg: PERSON
Associated Press: ORGANIZATION
TORONTO: CITY
Nicole Kidman: PERSON
Toronto: CITY
David Lindsay-Abaire: PERSON
Manhattan Theater Club: PLACE
2006: YEAR
Ms. Kidman: PERSON
Per Saari: PERSON
Ms. Kidman: PERSON
Mr. Eckhart: PERSON
John Cameron Mitchell: PERSON
…
Example: text mining (category and key terms)
At Dresden Semperoper, a New Take on ‘Tristan and Isolde’
By ROSLYN SULCAS February 17, 2015
DRESDEN, Germany — David Dawson’s new “Tristan and Isolde” for the Dresden Semperoper Ballett raises interesting questions about the full-length story ballet, a genre much-loved by audiences and seldom tackled by choreographers today.
It’s surprising that the Tristan and Isolde story, a medieval Celtic tale that has long figured in literature, film and in Wagner’s opera of the same name, has been so infrequently used by ballet. Like “Romeo and Juliet,” it has instant attraction and union between lovers from opposing camps, with society and history against them, and tragic death at its end. You can imagine what John Cranko or Kenneth MacMillan, who brought the big, all-guns-blazing story ballets like “Manon” and “Eugene Onegin” to the world in the 1960s and 1970s (ballet box offices are still thanking them), might have done with it.
Category: Classical_Music
Key terms: Semperoper, Wagner, ballet, John Cranko, Royal Ballet School …
Vodafone Turns Focus to Broadband, Seeking to Catch Up to Rivals
By MARK SCOTT February 16, 2015
As consumers change the way they use their smartphones, surf the web and watch television, Vodafone is finding itself in need of a face-lift. After years of focusing heavily on its cellphone business, Vodafone, based in Britain and the world’s second-largest mobile operator behind China Mobile based on subscribers, is concentrating on high-speed broadband.
Once, Europeans were happy to pay for separate cellphone, cable and pay-TV services. Now, they prefer them bundled into a single package that streams content to any device — a smartphone, tablet or Internet-connected television.
Regional rivals like Orange of France and Deutsche Telekom of Germany have moved quickly to offer …
Category: Telecommunications
Key terms: Vodafone, broadband, cellphone business, Orange, Deutsche Telekom, …
Search
Search
Full-text indexing
• A full-text index – required for Google-like searches – is defined on a table column
• The table column is ‘aware’ of its index – inserts, updates and deletes are handled automatically
• Fast delta indexing
• Broad language identification & processing
Available languages:
Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmål), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish
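To make the idea of a full-text index with fuzzy matching more concrete, here is a minimal Python sketch. It is not how SAP HANA implements search: the inverted index, the `search` function, the `fuzziness` threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query, fuzziness=0.8):
    """Return ids of documents that contain every query term, in any order.

    A query term matches an indexed term when their similarity ratio
    is at least `fuzziness` (1.0 would mean exact matches only).
    """
    result = None
    for q in tokenize(query):
        hits = set()
        for term, doc_ids in index.items():
            if SequenceMatcher(None, q, term).ratio() >= fuzziness:
                hits |= doc_ids
        result = hits if result is None else result & hits
    return result or set()

docs = {
    1: "Vodafone turns focus to broadband",
    2: "Broadband rivals move quickly",
    3: "A new take on Tristan and Isolde",
}
index = build_index(docs)
print(search(index, "broadband vodafone"))  # word order does not matter
print(search(index, "vodaphone"))           # a misspelled query term
```

The sketch scans every indexed term per query word, which is fine for a toy corpus; a production engine would use dedicated data structures for approximate matching.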
Search
Full-text indexing
The following steps are executed on unstructured text:
• File format filtering: converts any binary document format to text/HTML
• Language detection: identifies the language to apply appropriate tokenization and stemming
• Tokenization: decomposes word sequences. E.g. “card-based payment systems” → “card” “based” “payment” “systems”
• Stemming: normalizes tokens to linguistic base form. E.g. houses → house; ran → run
• Full-text index: ‘attaches’ to the table column
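The tokenization and stemming steps can be illustrated with a toy Python sketch. It only reproduces the slide's examples: the `IRREGULAR` lookup and the plural-stripping rule are simplifying assumptions, whereas a real linguistic stemmer relies on per-language dictionaries and rules.

```python
import re

# Toy lookup for irregular forms (assumption: real stemmers use full dictionaries).
IRREGULAR = {"ran": "run"}

def tokenize(text):
    """Split on anything that is not a letter or digit, so hyphenated words decompose."""
    return re.findall(r"[A-Za-z0-9]+", text)

def stem(token):
    """Very rough English normalization: irregular lookup, then strip a plural 's'."""
    t = token.lower()
    if t in IRREGULAR:
        return IRREGULAR[t]
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t

tokens = tokenize("card-based payment systems")
print(tokens)
print([stem(t) for t in ["houses", "ran"]])
```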
Text Analysis
Text analysis
An option to the full-text index
The following steps may be executed on unstructured text to augment full-text indexing:
• Part-of-speech tagging: tags word categories. Examples: quick: Adj; houses: Nn-Pl
• Noun groups: identifies concepts. Examples: text data; global piracy
• Entity extraction: classifies pre-defined entity types. Examples: Winston Churchill: PERSON; U.K.: COUNTRY
• Fact extraction: relates entities – e.g., classifies sentiments with topics. Example: I love SAP HANA:
[Sentiment] I [StrongPositiveSentiment] love [/StrongPositiveSentiment] [Topic] SAP HANA [/Topic].[/Sentiment]
Text analysis
Entity and fact extraction
Text analysis gives ‘structure’ to two sorts of elements from unstructured text:
Entities:
John Lennon was one of the Beatles.
<PERSON>John Lennon</PERSON> was one of the <ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>.
Facts:
I love your product.
I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> <TOPIC>your product</TOPIC>.
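To show what "giving structure" means in practice, the inline tag notation used on this slide can be turned into (text, type) pairs with a small Python sketch. The tag notation is taken from the slide; the regular expression and the `extract` helper are illustrative assumptions, not part of any SAP API.

```python
import re

# Matches <TYPE>text</TYPE>, where TYPE is upper-case and may contain '@' or '_'.
TAG = re.compile(r"<([A-Z][A-Z@_]*)>(.*?)</\1>")

def extract(tagged):
    """Return (text, type) pairs for every tagged span in the string."""
    return [(m.group(2), m.group(1)) for m in TAG.finditer(tagged)]

entities = extract(
    "<PERSON>John Lennon</PERSON> was one of the "
    "<ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>."
)
facts = extract(
    "I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> "
    "<TOPIC>your product</TOPIC>."
)
print(entities)
print(facts)
```

Once the spans are rows of text and type, they can be filtered, counted and joined with structured data like any other table.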
Text analysis
Supported types for entity extraction
• Who: People, job titles, and national identification numbers
• What: Companies, organizations, financial indexes, and products
• When: Dates, days, holidays, months, years, times, and time periods
• Where: Addresses, cities, states, countries, facilities, internet addresses, and phone numbers
• How much: Currencies and units of measure
• Generic concepts: text data, global piracy, and so on
Languages: Arabic, English, Dutch, Farsi, French, German, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese
Text analysis
Supported fact extraction (1/2)
Voice of customer
• Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems
• Requests: general and contact info
• Emoticons: strong positive, weak positive, weak negative, strong negative
• Profanity: ambiguous and unambiguous
Languages: English, Dutch*, French, German, Italian, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese
*Emoticons and profanity only
Text analysis
Supported fact extraction (2/2)
Enterprise (language: English)
• Membership information
• Management changes
• Product releases
• Mergers & acquisitions
• Organizational information
Public sector (language: English)
• Action & travel events
• Military units
• Person-alias, -appearance, -attributes, -relationships
• Spatial references
• Domain-specific entities
Text analysis
How entity extraction works
Built-in entity extraction is not keyword search. Text analysis applies full linguistic and statistical techniques (i.e., natural language processing) to make sure the entities returned are correct.
Grammatical parsing:
• Can we bill you?
• Bill Smith was the president.
Semantic disambiguation:
• I talked to Bill yesterday.
• The bill was signed into law.
Text analysis
Quality of extraction
Significant investment in gold corpus development across languages to achieve objective, repeatable assessments of entity and fact extraction (i.e., blind testing).
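One common way to assess extraction quality against a gold corpus is to compute precision, recall and F1 over the extracted (text, type) pairs. The slide does not say which metrics SAP uses, so the metrics, the `score` function and the sample data below are assumptions for illustration only.

```python
def score(gold, predicted):
    """Precision, recall and F1 of predicted (text, type) pairs against a gold set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hand-annotated gold entities (sample taken from the news example in this deck).
gold = {
    ("Nicole Kidman", "PERSON"),
    ("Toronto", "CITY"),
    ("Associated Press", "ORGANIZATION"),
    ("2006", "YEAR"),
}
# Hypothetical extractor output: two correct entities and one invented mistake.
predicted = {
    ("Nicole Kidman", "PERSON"),
    ("Toronto", "CITY"),
    ("Rabbit Hole", "PLACE"),
}

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Blind testing means the gold documents are kept out of development, so the numbers stay comparable from one release to the next.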
Language Support
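Language | LINGANALYSIS_BASIC/STEMS | LINGANALYSIS_FULL | EXTRACTION_CORE | EXTRACTION_CORE_VOICEOFCUSTOMER
Arabic | yes | yes | yes | -
Catalan | yes | yes | - | -
Chinese (Simplified) | yes | yes | yes | yes
Chinese (Traditional) | yes | yes | yes | yes
Croatian | yes | yes | - | -
Czech | yes | yes | - | -
Danish | yes | yes | - | -
Dutch | yes | yes | yes | -
English | yes | yes | yes | yes
Farsi | yes | yes | yes | -
French | yes | yes | yes | yes
German | yes | yes | yes | yes
Greek | yes | - | - | -
Hebrew | yes | yes | - | -
Hungarian | yes | - | - | -
Indonesian | yes | yes | - | -
Italian | yes | yes | yes | yes
Japanese | yes | yes | yes | -
Korean | yes | yes | yes | -
Norwegian (Bokmål) | yes | yes | - | -
Norwegian (Nynorsk) | yes | yes | - | -
Polish | yes | - | - | -
Portuguese | yes | yes | yes | yes
Romanian | yes | - | - | -
Russian | yes | yes | yes | yes
Serbian | yes | yes | - | -
Slovak | yes | yes | - | -
Slovenian | yes | yes | - | -
Spanish | yes | yes | yes | yes
Swedish | yes | yes | - | -
Thai | yes | yes | - | -
Turkish | yes | yes | - | -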
Demo
Demo steps:
Step 1: Load documents
Step 2: Create text index
Step 3: Analyze results
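The three demo steps can be sketched as SQL statements, generated here from Python so the pieces stay parameterized. The table, column and index names (`DOCS`, `ID`, `CONTENT`, `DOCS_TA`) are hypothetical, nothing is executed against a database, and the statement syntax reflects my understanding of SAP HANA SQL (a full-text index with `TEXT ANALYSIS ON` writing its results to a `$TA_<index name>` table); verify it against the SQL reference for your HANA revision.

```python
def create_table_sql(table):
    """Step 1 (load documents): a column table with a key and a text column."""
    return f'CREATE COLUMN TABLE "{table}" ("ID" INTEGER PRIMARY KEY, "CONTENT" NCLOB)'

def create_index_sql(index, table, column, configuration="EXTRACTION_CORE"):
    """Step 2 (create text index): full-text index with text analysis switched on."""
    return (f'CREATE FULLTEXT INDEX "{index}" ON "{table}"("{column}") '
            f"CONFIGURATION '{configuration}' TEXT ANALYSIS ON")

def results_sql(index):
    """Step 3 (analyze results): count extracted tokens per entity type."""
    return ('SELECT TA_TOKEN, TA_TYPE, COUNT(*) AS OCCURRENCES '
            f'FROM "$TA_{index}" '
            'GROUP BY TA_TOKEN, TA_TYPE ORDER BY OCCURRENCES DESC')

statements = [
    create_table_sql("DOCS"),
    'INSERT INTO "DOCS" VALUES (?, ?)',  # run once per document, with bound parameters
    create_index_sql("DOCS_TA", "DOCS", "CONTENT"),
    results_sql("DOCS_TA"),
]
for statement in statements:
    print(statement + ";")
```

Swapping the configuration for `EXTRACTION_CORE_VOICEOFCUSTOMER` would request sentiment facts instead of the core entity types, for the languages that support it.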
Text mining
Text mining
Text mining works at the document level
Determines the overall content of documents relative to other documents.
Used to:
• Identify similar documents
• Identify key terms of a document
• Identify related terms
• Categorize new documents based on a training corpus
Scenarios
• Highlight the key terms when viewing a patent document
• Identify similar incidents for faster problem solving
• Categorize new scientific papers along a hierarchy of topics
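Key term identification and document similarity can be illustrated with a small TF-IDF and cosine-similarity sketch in Python. TF-IDF is used here as a familiar stand-in; the slides do not say which weighting or similarity measure SAP HANA text mining applies, and the three sample documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Weight each term by its frequency in the document and its rarity in the corpus."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        doc_id: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for doc_id, c in counts.items()
    }

def key_terms(vector, k=3):
    """Top-k terms by weight; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = {
    "ballet": "new ballet at the opera raises questions about story ballet",
    "opera": "the opera stages a new ballet",
    "telecom": "the mobile operator turns to broadband",
}
vectors = tfidf_vectors(docs)
print(key_terms(vectors["telecom"]))
print(cosine(vectors["ballet"], vectors["opera"]))
print(cosine(vectors["ballet"], vectors["telecom"]))
```

Categorizing a new document against a training corpus could reuse the same vectors, for example by assigning the category of the most similar training documents.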
Text Mining Demo
Wrap Up
Structure massive amounts of unstructured data
• Search unstructured text content
• Extract meaningful, structured information from unstructured text
• Combine unstructured with structured data
• Leverage data in real time to gauge and guide business strategy and solve critical problems
Business benefits
• Understand your customers and processes
Wrap-Up
Tomer Steinberg
Tomer.Steinberg@sap.com

More Related Content

PDF
Vespa, A Tour
PDF
Text Analytics Online Knowledge Base / Database
PPT
Implementing Semantic Search
PPT
Predictive Text Analytics
PPT
Understanding Seo At A Glance
PPT
Semantic search
PPT
Text Analytics: Yesterday, Today and Tomorrow
PDF
Haystack- Learning to rank in an hourly job market
Vespa, A Tour
Text Analytics Online Knowledge Base / Database
Implementing Semantic Search
Predictive Text Analytics
Understanding Seo At A Glance
Semantic search
Text Analytics: Yesterday, Today and Tomorrow
Haystack- Learning to rank in an hourly job market

Similar to Text analysis matrix event 2015 (20)

ODP
Key Phrases for Better Search
PDF
12 Things the Semantic Web Should Know about Content Analytics
PDF
Dmm117 – SAP HANA Processing Services Text Spatial Graph Series and Predictive
PPT
Irmac presentation for website
PPTX
CSHALS 2010 W3C Semanic Web Tutorial
PDF
From Linked Data to Semantic Applications
PPT
Introduction of Semantic Web using NLP techniques.
PDF
Transform unstructured e&p information
PPT
Text Analytics Market Insights: What's Working and What's Next
PDF
X api chinese cop monthly meeting feb.2016
PPTX
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
PPTX
Sem tech2013 tutorial
PPTX
Recent Trends in Semantic Search Technologies
PDF
Why Semantics Matter? Adding the semantic edge to your content, right from au...
PDF
Data Science - Part XI - Text Analytics
PPTX
Jim Hendler's Presentation at SSSW 2011
PPT
Linking library data
PPT
Semantic Search using RDF Metadata (SemTech 2005)
PDF
How Graph Databases used in Police Department?
PDF
Serverless Text Analytics with Amazon Comprehend
Key Phrases for Better Search
12 Things the Semantic Web Should Know about Content Analytics
Dmm117 – SAP HANA Processing Services Text Spatial Graph Series and Predictive
Irmac presentation for website
CSHALS 2010 W3C Semanic Web Tutorial
From Linked Data to Semantic Applications
Introduction of Semantic Web using NLP techniques.
Transform unstructured e&p information
Text Analytics Market Insights: What's Working and What's Next
X api chinese cop monthly meeting feb.2016
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Sem tech2013 tutorial
Recent Trends in Semantic Search Technologies
Why Semantics Matter? Adding the semantic edge to your content, right from au...
Data Science - Part XI - Text Analytics
Jim Hendler's Presentation at SSSW 2011
Linking library data
Semantic Search using RDF Metadata (SemTech 2005)
How Graph Databases used in Police Department?
Serverless Text Analytics with Amazon Comprehend
Ad

More from Luc Vanrobays (20)

PPTX
Introduction to extracting data from sap s 4 hana with abap cds views
PDF
How to use abap cds for data provisioning in bw
PDF
Sap bw4 hana architecture archetypes
PDF
BW Adjusting settings and monitoring data loads
PDF
Abap Objects for BW
PDF
Build and run an sql data warehouse on sap hana
PDF
Bi05 fontes de_dados_hana_para_relatorios_presentação_conceitual_2
PDF
Dmm203 – new approaches for data modelingwith sap hana
PDF
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
PDF
Dmm300 – mixed scenarios for sap hana data warehousing and BW: overview and e...
PDF
DMM270 – Spatial Analytics with Sap Hana
PDF
Dmm212 – Sap Hana Graph Processing
PDF
What is mmd - Multi Markdown ?
PDF
Dmm300 - Mixed Scenarios/Architecture HANA Models / BW
PDF
PDF
DMM161_2015_Exercises
PDF
DMM161 HANA_MODELING_2015
PDF
EA261_2015_Exercises
PDF
EA261_2015
PDF
Tech ed 2012 eim260 modeling in sap hana-exercise
Introduction to extracting data from sap s 4 hana with abap cds views
How to use abap cds for data provisioning in bw
Sap bw4 hana architecture archetypes
BW Adjusting settings and monitoring data loads
Abap Objects for BW
Build and run an sql data warehouse on sap hana
Bi05 fontes de_dados_hana_para_relatorios_presentação_conceitual_2
Dmm203 – new approaches for data modelingwith sap hana
Dmm302 - Sap Hana Data Warehousing: Models for Sap Bw and SQL DW on SAP HANA
Dmm300 – mixed scenarios for sap hana data warehousing and BW: overview and e...
DMM270 – Spatial Analytics with Sap Hana
Dmm212 – Sap Hana Graph Processing
What is mmd - Multi Markdown ?
Dmm300 - Mixed Scenarios/Architecture HANA Models / BW
DMM161_2015_Exercises
DMM161 HANA_MODELING_2015
EA261_2015_Exercises
EA261_2015
Tech ed 2012 eim260 modeling in sap hana-exercise
Ad

Recently uploaded (20)

PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
AI in Product Development-omnex systems
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PPTX
ai tools demonstartion for schools and inter college
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PPTX
CHAPTER 2 - PM Management and IT Context
PPTX
Odoo POS Development Services by CandidRoot Solutions
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
medical staffing services at VALiNTRY
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PPTX
Essential Infomation Tech presentation.pptx
VVF-Customer-Presentation2025-Ver1.9.pptx
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
AI in Product Development-omnex systems
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
ai tools demonstartion for schools and inter college
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Upgrade and Innovation Strategies for SAP ERP Customers
Odoo Companies in India – Driving Business Transformation.pdf
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PTS Company Brochure 2025 (1).pdf.......
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
CHAPTER 2 - PM Management and IT Context
Odoo POS Development Services by CandidRoot Solutions
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
How Creative Agencies Leverage Project Management Software.pdf
medical staffing services at VALiNTRY
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Essential Infomation Tech presentation.pptx

Text analysis matrix event 2015

  • 1. Extract insight from texts using SAP Text Analysis Tomer Steinberg SAP Israel Public
  • 2. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 2 Agenda Why use text analysis functionality? Background: SAP’s text analysis technology Search: Full-text search and fuzzy search Text analysis: Entity and fact extraction Text mining Wrap-up
  • 3. Why use text analytics
  • 4. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 4 Why Text Analytics Enterprise Challenges Massive amounts of data locked Companies are struggling to:  Search on unstructured text related content  Extract meaningful, structured information from unstructured text  Combine unstructured with structured data  Leverage data in real-time to gauge and guide their business strategy and solve critical problems
  • 5. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 5 Potential use cases Law enforcement Intelligence Social Media Analytics Precision Marketing Predictive Maintenance Investment trade Credit Scoring Patents
  • 6. Mary’s interest is leggings Jane’s interest is jackets Sue’s looking for a fleece REAL TIME INTENT SIGNALS SHOW THAT: Mary’s free shipping offer is focused on leggings Jane’s is on jackets Sue’s is on fleece Single template with dynamic content CONTEXTUAL MARKETING HOW IT WORKS MARY JANE SUE Get yours now, with Free shipping! Pouch Inc, 2345 Madison Avenue, New York. Unsubscribe BAGS FLEECE JACKETS ACCESSORIES Winter Ready? Free Shipping on %Category% Free shipping on %category% and more thru 11/28
  • 8. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 8 Background: SAP’s text analysis technology 1997 Inxight spun off from PARC, a Xerox Company Finite-State technology for modeling natural language 2007 Inxight acquired by Business Objects Integration of text analysis technology into BI applications 2008 Business Objects acquired by SAP Text analysis technology continues to focus on BI applications 2012 First integration into SAP HANA Foundation for full-text search, BI and sentiment analysis applications Today Text analysis in SAP HANA Foundation for virtually any type of unstructured textual data processing in the platform
  • 9. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 9 Background: SAP’s text analysis technology The development team is one of the SAP HANA core teams SAP Labs in Boston, MA  Located near Kendall Square in Cambridge, close to MIT 14 engineers 7 computational linguists
  • 10. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 10 Any Apps Any App Server SAP Business Suite and BW ABAP App Server JSONR Open ConnectivityMDXSQL SAP HANA Platform – More than just a database SAP HANA platform converges Database, Data Processing and Application Platform capabilities & provides Libraries for predictive, planning, text, spatial, and business analytics so businesses can operate in real-time. SAP HANA Platform UnifiedAdministration Life-cycleManagement Security Extended Application Services Integration Services Deployment: Database Services ApplicationDevelopment ProcessOrchestration OLTP | OLAP | Search | Text Analysis |Predictive | Events | Spatial | Rules | Planning | Calculators Processing Engine Application Function Libraries & Data Models Predictive Analysis Libraries | Business Function Libraries | Data Models & Stored Procedures Data Virtualization | Replication | ETL/ELT | Mobile Synch | Streaming App Server| UI Integration Services | Web Server On-Premise | Hybrid | On-Demand Supports any Device
  • 11. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 11 Why does SAP HANA provide text analysis functionality? Capabilities Native full-text and fuzzy search In-database text analysis Graphical modeling of search models Info Access – HTML5 UI toolkit and API for JavaScript Benefits Less data duplication and movement – leverage one infrastructure for analytical and search workloads Extract salient information from unstructured textual data Easy-to-use modeling tools – HANA Studio Build search applications quickly – Info Access
  • 12. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 12 What types of text processing capabilities are supported? Search In addition to string matching, HANA features full-text search which works on content stored in tables or exposed via views. Just like searching on the Internet, full-text search finds terms irrespective of the sequence of characters and words. Text mining Text mining makes semantic determinations about the overall content of documents relative to other documents. Capabilities include key term identification and document categorization. Text mining is complementary to text analysis. Text analysis Capabilities range from basic tokenization and stemming to more complex semantic analysis in the form of entity and fact extraction. Text analysis applies within individual documents and is the foundation for both full-text search and text mining.
  • 13. © 2015 SAP SE or an SAP affiliate company. All rights reserved. 13 What types of text processing capabilities are supported? Search In addition to string matching, HANA features full-text search which works on content stored in tables or exposed via views. Just like searching on the Internet, full-text search finds terms irrespective of the sequence of characters and words. Text mining Text mining makes semantic determinations about the overall content of documents relative to other documents. Capabilities include key term identification and document categorization. Text mining is complementary to text analysis. Text analysis Capabilities range from basic tokenization and stemming to more complex semantic analysis in the form of entity and fact extraction. Text analysis applies within individual documents and is the foundation for both full-text search and text mining.
  • 14. What types of text processing capabilities are supported? Example of text analysis (entity extraction):
    Source text: Nicole Kidman, Aaron Eckhart and ‘Rabbit Hole’. By MEKADO MURPHY. Dan Steinberg/Associated Press. Aaron Eckhart and Nicole Kidman at the Toronto International Film Festival. TORONTO – Nicole Kidman returns to Toronto, this time in the role of both actor and producer for her latest project, “Rabbit Hole.” The film, in which she co-stars with Aaron Eckhart, looks at a suburban married couple who experience a tremendous loss. “Rabbit Hole” is based on the play by David Lindsay-Abaire, who also adapted it for the screen. The play received a positive review when it premiered at Manhattan Theater Club in 2006 and caught the attention of Ms. Kidman and her producing partner, Per Saari, who decided to option it. Ms. Kidman and Mr. Eckhart shared some thoughts about the new film and the process of working with their director, John Cameron Mitchell.
    Extracted entities: Nicole Kidman: PERSON; Aaron Eckhart: PERSON; MEKADO MURPHY: PERSON; Dan Steinberg: PERSON; Associated Press: ORGANIZATION; TORONTO: CITY; Nicole Kidman: PERSON; Toronto: CITY; David Lindsay-Abaire: PERSON; Manhattan Theater Club: PLACE; 2006: YEAR; Ms. Kidman: PERSON; Per Saari: PERSON; Ms. Kidman: PERSON; Mr. Eckhart: PERSON; John Cameron Mitchell: PERSON; …
  • 15. What types of text processing capabilities are supported? Examples of text mining (categorization and key terms):
    Document 1: At Dresden Semperoper, a New Take on ‘Tristan and Isolde’. By ROSLYN SULCAS, February 17, 2015. DRESDEN, Germany – David Dawson’s new “Tristan and Isolde” for the Dresden Semperoper Ballett raises interesting questions about the full-length story ballet, a genre much-loved by audiences and seldom tackled by choreographers today. It’s surprising that the Tristan and Isolde story, a medieval Celtic tale that has long figured in literature, film and in Wagner’s opera of the same name, has been so infrequently used by ballet. Like “Romeo and Juliet,” it has instant attraction and union between lovers from opposing camps, with society and history against them, and tragic death at its end. You can imagine what John Cranko or Kenneth MacMillan, who brought the big, all-guns-blazing story ballets like “Manon” and “Eugene Onegin” to the world in the 1960s and 1970s (ballet box offices are still thanking them), might have done with it.
    Category: Classical_Music. Key terms: Semperoper, Wagner, ballet, John Cranko, Royal Ballet School, …
    Document 2: Vodafone Turns Focus to Broadband, Seeking to Catch Up to Rivals. By MARK SCOTT, February 16, 2015. As consumers change the way they use their smartphones, surf the web and watch television, Vodafone is finding itself in need of a face-lift. After years of focusing heavily on its cellphone business, Vodafone, based in Britain and the world’s second-largest mobile operator behind China Mobile based on subscribers, is concentrating on high-speed broadband. Once, Europeans were happy to pay for separate cellphone, cable and pay-TV services. Now, they prefer them bundled into a single package that streams content to any device – a smartphone, tablet or Internet-connected television. Regional rivals like Orange of France and Deutsche Telekom of Germany have moved quickly to offer …
    Category: Telecommunications. Key terms: Vodafone, broadband, cellphone business, Orange, Deutsche Telekom, …
  • 17. Search: Full-text indexing
    A full-text index, required for Google-like searches, is defined on a table column.
    The table column is ‘aware’ of its index: inserts, updates, and deletes are handled automatically.
    Fast delta indexing.
    Broad language identification and processing.
    Available languages: Arabic, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Farsi, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian (Bokmal), Norwegian (Nynorsk), Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish
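    The idea of a column that keeps its own index current can be illustrated with a small sketch. This is a toy inverted index in Python, not HANA's implementation: the class name `IndexedColumn` and its methods are hypothetical, and terms are simply lowercased, whitespace-separated words.

```python
from collections import defaultdict


class IndexedColumn:
    """Toy text column that keeps an inverted index in sync with its rows.

    Hypothetical illustration only; not how HANA stores or indexes data.
    """

    def __init__(self):
        self.rows = {}                   # row id -> text
        self.index = defaultdict(set)    # term -> ids of rows containing it

    def _terms(self, text):
        # Simplistic term extraction: lowercase, split on whitespace.
        return set(text.lower().split())

    def insert(self, row_id, text):
        if row_id in self.rows:
            self.delete(row_id)          # an update is a delete plus an insert
        self.rows[row_id] = text
        for term in self._terms(text):
            self.index[term].add(row_id)

    def delete(self, row_id):
        text = self.rows.pop(row_id)
        for term in self._terms(text):
            self.index[term].discard(row_id)
            if not self.index[term]:
                del self.index[term]     # drop terms no row uses any more

    def search(self, query):
        """Return ids of rows containing every query term, in any order."""
        terms = self._terms(query)
        if not terms:
            return set()
        hits = [self.index.get(term, set()) for term in terms]
        return set.intersection(*hits)
```

    Because every write goes through `insert` or `delete`, the index never needs a separate rebuild step, and `search("payment card")` finds the same rows as `search("card payment")`.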
  • 18. Search: Full-text indexing
    The following steps are executed on unstructured text:
    File format filtering: converts any binary document format to text/HTML
    Language detection: identifies the language so that the appropriate tokenization and stemming are applied
    Tokenization: decomposes word sequences, e.g. “card-based payment systems” → “card” “based” “payment” “systems”
    Stemming: normalizes tokens to their linguistic base form, e.g. houses → house; ran → run
    Full-text index: ‘attaches’ to the table column
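    The tokenization and stemming steps can be sketched in a few lines of Python. This is a deliberately crude illustration under stated assumptions: the `IRREGULAR` lookup and the single plural rule are hypothetical stand-ins for the language-specific dictionaries and rules a real linguistic engine would use.

```python
import re

# Hypothetical lookup for irregular forms; a real stemmer relies on
# full language-specific dictionaries.
IRREGULAR = {"ran": "run"}


def tokenize(text):
    """Split text into lowercase word tokens, breaking on hyphens and spaces."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Reduce a token to a base form: irregular lookup, then one plural rule."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    # Crude rule: drop a trailing "s" from longer words that do not end in "ss".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


tokens = tokenize("card-based payment systems")
stems = [stem(t) for t in tokens]
```

    Here `tokens` holds the four tokens from the slide's example, and `stem` maps "houses" to "house" and "ran" to "run". The plural rule will mis-stem many real words, which is why production systems use per-language linguistic resources.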
  • 20. Text analysis: An option to the full-text index
    The following steps may be executed on unstructured text to augment full-text indexing:
    Part-of-speech tagging: tags word categories. Examples: quick: Adj; houses: Nn-Pl
    Noun groups: identifies concepts. Examples: text data; global piracy
    Entity extraction: classifies pre-defined entity types. Examples: Winston Churchill: PERSON; U.K.: COUNTRY
    Fact extraction: relates entities, e.g., classifies sentiments with topics. Example: I love SAP HANA: [Sentiment] I [StrongPositiveSentiment] love [/StrongPositiveSentiment] [Topic] SAP HANA [/Topic].[/Sentiment]
  • 21. Text analysis: Entity and fact extraction
    Text analysis gives ‘structure’ to two sorts of elements from unstructured text:
    Entities: John Lennon was one of the Beatles. <PERSON>John Lennon</PERSON> was one of the <ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>.
    Facts: I love your product. I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> <TOPIC>your product</TOPIC>.
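    To show what ‘giving structure’ means in practice, the sketch below turns the slide's inline markup into (type, text) records that could be stored as table rows. It assumes the angle-bracket notation is only the slide's way of displaying results; the function name `extract_annotations` is hypothetical, and nested tags are not handled.

```python
import re

# Matches <TYPE>text</TYPE>, where TYPE uses the uppercase names shown
# on the slide (letters, underscores, and "@" for subtypes).
TAG = re.compile(r"<([A-Z_@]+)>(.*?)</\1>")


def extract_annotations(marked_up):
    """Return (type, text) pairs for each tagged span, in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(marked_up)]


entities = extract_annotations(
    "<PERSON>John Lennon</PERSON> was one of the "
    "<ORGANIZATION@ENTERTAINMENT>Beatles</ORGANIZATION@ENTERTAINMENT>."
)
facts = extract_annotations(
    "I <STRONGPOSITIVESENTIMENT>love</STRONGPOSITIVESENTIMENT> "
    "<TOPIC>your product</TOPIC>."
)
```

    Once the spans are in this tabular form, they can be filtered, counted, or joined with structured data like any other rows.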
  • 22. Text analysis: Supported types for entity extraction
    Who: people, job titles, and national identification numbers
    What: companies, organizations, financial indexes, and products
    When: dates, days, holidays, months, years, times, and time periods
    Where: addresses, cities, states, countries, facilities, internet addresses, and phone numbers
    How much: currencies and units of measure
    Generic concepts: text data, global piracy, and so on
    Languages: Arabic, English, Dutch, Farsi, French, German, Italian, Japanese, Korean, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese
  • 23. Text analysis: Supported fact extraction (1/2)
    Voice of customer
    Sentiments: strong positive, weak positive, neutral, weak negative, strong negative, and problems
    Requests: general and contact info
    Emoticons: strong positive, weak positive, weak negative, strong negative
    Profanity: ambiguous and unambiguous
    Languages: English, Dutch*, French, German, Italian, Portuguese, Russian, Simplified Chinese, Spanish, Traditional Chinese
    *Emoticons and profanity only
  • 24. Text analysis: Supported fact extraction (2/2)
    Enterprise (language: English): membership information; management changes; product releases; mergers & acquisitions; organizational information
    Public sector (language: English): action & travel events; military units; person alias, appearance, attributes, and relationships; spatial references; domain-specific entities
  • 25. Text analysis: How entity extraction works
    Built-in entity extraction is not keyword search. Text analysis applies full linguistic and statistical techniques (i.e., natural language processing) to make sure that the entities returned are correct.
    Grammatical parsing: “Can we bill you?” (a verb) versus “Bill Smith was the president.” (part of a person's name)
    Semantic disambiguation: “I talked to Bill yesterday.” (a person) versus “The bill was signed into law.” (a piece of legislation)
  • 26. Text analysis: Quality of extraction
    Significant investment in gold corpus development across languages makes it possible to assess entity and fact extraction objectively and repeatably (i.e., through blind testing).
  • 27. Language support, by text analysis configuration
    LINGANALYSIS_BASIC/STEMS, LINGANALYSIS_FULL, EXTRACTION_CORE, and EXTRACTION_CORE_VOICEOFCUSTOMER: Chinese (Simplified), Chinese (Traditional), English, French, German, Italian, Portuguese, Russian, Spanish
    LINGANALYSIS_BASIC/STEMS, LINGANALYSIS_FULL, and EXTRACTION_CORE: Arabic, Dutch, Farsi, Japanese, Korean
    LINGANALYSIS_BASIC/STEMS and LINGANALYSIS_FULL: Catalan, Croatian, Czech, Danish, Hebrew, Indonesian, Norwegian (Bokmal), Norwegian (Nynorsk), Serbian, Slovak, Slovenian, Swedish, Thai, Turkish
    LINGANALYSIS_BASIC/STEMS only: Greek, Hungarian, Polish, Romanian
  • 28. Demo
  • 29. Load Documents → Create Text Index → Analyze Results (Step 1)
  • 30. Load Documents → Create Text Index → Analyze Results (Step 2)
  • 31. Load Documents → Create Text Index → Analyze Results (Step 3)
  • 33. Text mining
    Text mining works at the document level: it determines the overall content of documents relative to other documents.
    Used to: identify similar documents; identify key terms of a document; identify related terms; categorize new documents based on a training corpus
    Scenarios: highlight the key terms when viewing a patent document; identify similar incidents for faster problem solving; categorize new scientific papers along a hierarchy of topics
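    The slides do not say which algorithms HANA's text mining uses. As a stand-in, the sketch below uses a common textbook technique, TF-IDF weighting with cosine similarity, to illustrate two of the listed uses: finding key terms and finding similar documents. All function names and the sample documents are hypothetical, and each document is assumed to be a non-empty list of tokens.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document (a non-empty list of tokens) to a {term: weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def key_terms(vector, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "ballet opera ballet dresden".split(),
    "opera wagner ballet".split(),
    "broadband vodafone mobile broadband".split(),
]
vectors = tfidf_vectors(docs)
```

    With these sample documents, the two arts documents share weighted terms and so score a higher similarity with each other than either does with the telecommunications document, which shares no terms with them. Categorizing a new document against a training corpus could follow the same pattern by assigning the category of its most similar training documents.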
  • 35. Wrap Up
    Structure massive amounts of unstructured data: search unstructured text content; extract meaningful, structured information from unstructured text; combine unstructured with structured data; leverage data in real time to gauge and guide business strategy and solve critical problems
    Business benefits: understand your customers and processes
  • 36. Wrap-Up: Tomer Steinberg, Tomer.Steinberg@sap.com