.consulting .solutions .partnership
Text Analysis with SAP HANA
© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg
1. Motivation - Big Data
2. Text Analysis with SAP HANA
3. Enhancement Options
1. Motivation - Big Data
Big Data - taking a closer look
• Big Data is a hot topic today, but what is hidden in the “Big Data”?
• According to Merrill Lynch, 80-90% of all potentially usable business information may originate in unstructured form
(Structure, Models and Meaning: Is "unstructured" data merely unmodeled?, Intelligent Enterprise, March 1, 2005.)
• According to Computer World, unstructured information might account for more than 70%–80% of all data in organizations
(Holzinger, Andreas; et al. (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery - Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field" in Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24)
• This data is expected to grow to 40 zettabytes by 2020
• The data might originate from:
− Social networks
− Call centers
− “Letters” from customers
− ...
What is the Problem with Unstructured Data?
• It is unstructured!
− Not organized
− No pre-defined data model
− No metadata, or a mix of data and metadata
Limited or no access to the data via classical programs
• But the data contains valuable information
We have a lot of information that is relevant for the business, but we cannot access it
How can we solve that issue?
• Text Analysis: Extracting high-quality information from texts
• Typical process of a text analysis:
− Parsing the text
− Adding features such as linguistic information
− Inserting the results into the database in a structured manner
• Examples of typical text analysis tasks:
− Entity recognition: Is it an organization, a person or a place? This includes domain facts like requests.
− Sentiment analysis: What attitudinal information is “hidden” in the text?
− Relationship, fact and event extraction
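The typical process named above (parse, add features, insert in a structured manner) can be sketched in a few lines. This is a toy illustration in Python with SQLite, not SAP HANA functionality; the table name `ta_result`, its columns and the regex tokenizer are assumptions made only for this example.

```python
import re
import sqlite3


def analyze(doc_id, text):
    """Parse the text into tokens and attach simple features to each token."""
    rows = []
    for counter, match in enumerate(re.finditer(r"\w+|[^\w\s]", text), start=1):
        token = match.group()
        # Toy "linguistic" feature: classify the token by its surface form.
        if token.isdigit():
            kind = "number"
        elif token[0].isalnum():
            kind = "word"
        else:
            kind = "punctuation"
        rows.append((doc_id, counter, token, token.lower(), kind, match.start()))
    return rows


def store(conn, rows):
    """Insert the analysis result into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ta_result ("
        "doc_id INTEGER, counter INTEGER, token TEXT, "
        "normalized TEXT, kind TEXT, start_pos INTEGER)"
    )
    conn.executemany("INSERT INTO ta_result VALUES (?, ?, ?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
store(conn, analyze(1, "Order 4711 arrived late!"))
words = conn.execute(
    "SELECT token FROM ta_result WHERE kind = 'word' ORDER BY counter"
).fetchall()
print(words)
```

Once the text is stored as rows, it can be filtered and joined like any other structured data, which is the point of the third process step.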
2. Text Analysis with SAP HANA
What has this to do with SAP HANA?
(Figure © SAP SE)
Text Analysis with HANA - Basics
• Starting point: database table containing the text
• Supported data types are:
− TEXT
− BINTEXT
− NVARCHAR
− VARCHAR
− NCLOB
− CLOB
− BLOB
• Fulltext index incl. options (see system view SYS.FULLTEXT_INDEXES)
• Index properties on the table
• Fulltext index table $TA_*
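A fulltext index with text analysis is created with a `CREATE FULLTEXT INDEX` statement, and the results land in a table named `$TA_<index name>`. The sketch below only assembles the SQL strings in Python and does not connect to a database. The schema, table, column and index names are hypothetical, the identifier check is deliberately strict (custom configuration names with a package path would need a looser check), and the available options should be verified against the SAP HANA documentation for your revision.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Z_][A-Z0-9_]*$")


def _quote(name):
    """Quote a simple upper-case identifier; reject anything else."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f'"{name}"'


def create_fulltext_index_sql(schema, table, column, index, configuration):
    """Assemble a CREATE FULLTEXT INDEX statement with text analysis enabled."""
    if not _IDENTIFIER.match(configuration):
        raise ValueError(f"unsupported configuration: {configuration!r}")
    return (
        f"CREATE FULLTEXT INDEX {_quote(index)} "
        f"ON {_quote(schema)}.{_quote(table)} ({_quote(column)}) "
        f"CONFIGURATION '{configuration}' "
        "TEXT ANALYSIS ON"
    )


def select_results_sql(schema, index):
    """Assemble a query on the generated $TA_<index> result table."""
    _quote(index)  # validation only
    return (
        "SELECT TA_TOKEN, TA_TYPE, TA_NORMALIZED "
        f'FROM {_quote(schema)}."$TA_{index}" '
        "ORDER BY TA_COUNTER"
    )


print(create_fulltext_index_sql("DEMO", "TWEETS", "CONTENT", "TWEETS_TA", "EXTRACTION_CORE"))
print(select_results_sql("DEMO", "TWEETS_TA"))
```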
Text Analysis with HANA – Linguistic Analysis
• LINGANALYSIS_BASIC = Tokenization
• LINGANALYSIS_STEMS = Tokenization + Stems
• LINGANALYSIS_FULL = Tokenization + Stems + Tagging
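To make the difference between the three configurations concrete, here is a toy Python model of what each level adds per token. The stem and tag lookup tables are hand-written for one sample sentence and are assumptions of this example; SAP HANA derives this information from its own linguistic resources.

```python
import re

# Hand-written lookup tables for the demo sentence only (hypothetical data).
STEMS = {"customers": "customer", "activated": "activate", "accounts": "account"}
TAGS = {"customers": "noun", "activated": "verb", "their": "pronoun", "accounts": "noun"}


def analyze(text, configuration):
    """Return one dict per token; higher levels add more attributes."""
    rows = []
    for token in re.findall(r"\w+", text):
        row = {"token": token}
        key = token.lower()
        if configuration in ("LINGANALYSIS_STEMS", "LINGANALYSIS_FULL"):
            row["stem"] = STEMS.get(key, key)
        if configuration == "LINGANALYSIS_FULL":
            row["tag"] = TAGS.get(key, "unknown")
        rows.append(row)
    return rows


for level in ("LINGANALYSIS_BASIC", "LINGANALYSIS_STEMS", "LINGANALYSIS_FULL"):
    print(level, analyze("Customers activated their accounts", level)[1])
```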
Text Analysis with HANA – Entity Extraction
• To get more information out of the data, SAP delivers several configurations
• These configurations focus on entity and fact extraction under specific aspects
• Types of extraction:
− EXTRACTION_CORE = Basic entity extraction (people, organizations, places)
− EXTRACTION_CORE_ENTERPRISE
− EXTRACTION_CORE_PUBLIC_SECTOR
− EXTRACTION_CORE_VOICEOFCUSTOMER = Basic entity extraction + sentiments
3. Enhancement Options
Text Analysis with HANA – Custom Dictionary
• In several use cases you might need to enhance the dictionary to fit your business domain
• Structure of a dictionary (figure © SAP SE)
Text Analysis with HANA – Workflow of Enhancement
1. Find the extraction configuration that fits your use case best
2. Copy the configuration into the target folder
   Important: File suffix *.hdbtextconfig
3. Create a new custom dictionary
   Important: File suffix *.hdbtextdict
4. Reference the dictionary in your configuration copy
   Important: You have to specify the full path
5. Recreate the fulltext index using your custom configuration
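Step 3 can be illustrated with a small generator for the dictionary content. This is a sketch under assumptions: the XML layout (`dictionary`, `entity_category`, `entity_name`, `variant`) and the namespace are given as commonly shown for SAP text analysis dictionaries and should be verified against the Extraction Customization Guide; the category name and the entries are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the dictionary format; verify against the SAP guide.
NAMESPACE = "http://www.sap.com/ta/4.0"


def build_dictionary(category, entries):
    """Build dictionary XML: one entity category with standard forms and variants."""
    root = ET.Element("dictionary", {"xmlns": NAMESPACE})
    cat = ET.SubElement(root, "entity_category", {"name": category})
    for standard_form, variants in entries.items():
        entity = ET.SubElement(cat, "entity_name", {"standard_form": standard_form})
        for variant in variants:
            ET.SubElement(entity, "variant", {"name": variant})
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary(
    "CROSSFIT_TERM",  # hypothetical entity category
    {"Workout of the Day": ["WOD", "workout of the day"]},
)
print(xml_text)
# The result would be saved with the suffix .hdbtextdict (step 3) and then
# referenced by its full path in the copied .hdbtextconfig file (step 4).
```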
Text Analysis with HANA – Enhancement of Sentiment Analysis
• Special Case: Enhancement of sentiments
• You can directly enhance/tailor the files delivered by SAP
Text Analysis with HANA – What’s next?
• Assume that we are in an “industry”-specific context or mining for “slang”-like facts and entities
• Sports are a good example of this!
• We use the example of CrossFit® … as there are some funny facts to extract
• Question: How can we extract complex entities from a text?
• Examples:
− Did somebody attend a CrossFit training?
− Does somebody want to join a CrossFit box?
Text Analysis with HANA – Text Analysis Extraction Rules
Setup and Status Quo
• Extraction rules (CGUL rules): a pattern-based language for pattern matching that uses character- or token-based regular expressions combined with linguistic attributes to define custom entity types
• Goals of the rule sets:
− Extract complex facts based on relations between entities and predicates
− Define entity-to-entity relations to associate entities such as times, dates, and locations with other entities
− Identify entities in domain-specific language
− Capture facts expressed in new, popular “slang”
Extraction Rule = Tokens + Regular Expressions + Dictionaries + Luck ☺
Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives
• Tokens define the syntactic units of the text analysis:
<string, STEM: <stem>, POS: <postag>>
• Example: <activat.*, STEM: activat.*, POS: V>
• Several operators are available for matching:
− Standard operators, e.g. character wildcard “.”, alternation “|”
− Iteration operators, e.g. zero or one occurrence of the preceding item “?”; zero or more occurrences of the preceding item “*”
− Grouping and containment operators, e.g. item group “( )”, range group “[ ]”
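The operators behave much like their counterparts in ordinary regular expressions. The following Python snippet is an analogy using the standard `re` module on plain strings, not CGUL itself; the sample words are made up for the example.

```python
import re

# "." matches any single character, "*" repeats the preceding item zero or more times.
stem_like = re.compile(r"activat.*")
# "|" separates alternatives, "( )" groups them, "?" makes the preceding item optional.
plural = re.compile(r"(box|gym)(es|s)?")
# "[ ]" defines a range of characters.
capitalized = re.compile(r"[A-Z][a-z]*")

tokens = ["activate", "activation", "active", "boxes", "gyms", "Box"]

print([t for t in tokens if stem_like.fullmatch(t)])
print([t for t in tokens if plural.fullmatch(t)])
print([t for t in tokens if capitalized.fullmatch(t)])
```

In CGUL the same operators are applied inside and between tokens, so a pattern can additionally constrain the stem and the part of speech, which a plain string regex cannot.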
• Expression markers define the delimiters of the searched terms
• Several markers are available:
− Paragraph marker: specifies the beginning and end of a paragraph – [P] [/P]
− Entity marker: limits an expression to one or several entity types – [TE <name>] <expr> [/TE]
− Sentence marker: specifies the beginning and end of a sentence – [SN] [/SN]
− Clause container: matches the entire clause if the expression is matched somewhere in the clause – [CC] <expr> [/CC]
• Directives allow the definition of character classes, groups of tokens and relation types
• #define (character class): denotes character expressions
Example: #define ALPHA: [A-Za-z]
• #subgroup (group of tokens): defines a group of one or more tokens
Example: #subgroup Cloud: <HCP>|<AWS>|<Azure>
• #group (relation type): defines custom facts and entity types consisting of one or more tokens
Example:
#group HANA: <HANA>
#group HANANATIVE: %(HANA) <native>
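How a named group of tokens is reused inside a larger pattern can be emulated on a token list. The Python sketch below is a toy matcher written for this example, not the CGUL engine; the rule `CLOUDNATIVE` (a Cloud token followed by the token "native") is hypothetical and merely mirrors the shape of the examples above.

```python
# Emulates: #subgroup Cloud: <HCP>|<AWS>|<Azure>
CLOUD = {"HCP", "AWS", "Azure"}


def match_cloud_native(tokens):
    """Emulate a hypothetical '#group CLOUDNATIVE: %(Cloud) <native>'."""
    matches = []
    # Look at each pair of adjacent tokens, as the rule spans two tokens.
    for first, second in zip(tokens, tokens[1:]):
        if first in CLOUD and second == "native":
            matches.append(f"{first} {second}")
    return matches


sentence = "We build AWS native and HCP native apps , not Azure hosted ones"
print(match_cloud_native(sentence.split()))
```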
Text Analysis with HANA – Text Analysis Extraction Rules
Step 1 – Create a dictionary (it is all about entities)
Step 2 – Create a custom configuration
Recreate the fulltext index with the custom configuration
Next step: Create a simple plain-text rule (*.hdbtextrule) and adapt the configuration
Result of the plain rule
Refactor and enhance the rule
Reduce the extracted entities using the PreProcessor configuration
Text Analysis with HANA – Summary
• SAP HANA contains a lot of functionality
• One very powerful feature is text analysis
• Besides the delivered content, you have a lot of options to adapt the text analysis to extract the entities and facts that you need
• Since SP09, rules are compiled upon activation (no separate compilation necessary)
• Creating custom dictionaries and text rules is cumbersome
− No support in the IDE
• The results of the text analysis form the basis of predictive analytics (also part of SAP HANA ☺)
Q&A
Dr. Christian Lechner
Principal IT Consultant
+49 (0) 171 7617190
christian.lechner@msg-systems.com
msg systems ag (Headquarters)
Robert-Buerkle-Str. 1, 85737 Ismaning
Germany
www.msg-systems.com
Text Analysis with HANA – Resources
• SAP HANA Search Developer Guide (Fulltext Index Options):
help.sap.com -> Search Developer Guide
• SAP HANA Text Analysis Developer Guide:
help.sap.com -> TA Developer Guide
• SAP HANA Text Analysis Language Reference Guide:
help.sap.com -> TA Language Reference Guide
• SAP HANA Text Analysis Extraction Customization Guide:
help.sap.com -> TA Extraction Customization Guide
• YouTube playlist of SAP HANA Academy:
Text Analysis and Search


Text Analysis with SAP HANA

  • 2. Text Analysis with SAP HANA 2© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg Motivation - Big Data1 3 Text Analysis with SAP HANA2 7 Enhancement Options3 21
  • 3. Text Analysis with SAP HANA 3© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg Motivation - Big Data1 3 Text Analysis with SAP HANA2 7 Enhancement Options3 21
  • 4. Big Data - taking a closer look © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 4 • Big Data is hot topic today, but what is hidden in the “Big Data”? • According to Merril Lynch 80-90% of all potentially usable business information may originate in unstructured form (Structure, Models and Meaning: Is "unstructured" data merely unmodeled?, Intelligent Enterprise, March 1, 2005.) • According to Computer World unstructured information might account for more than 70%–80% of all data in organizations (Holzinger, Andreas; et al. (2013). "Combining HCI, Natural Language Processing, and Knowledge Discovery - Potential of IBM Content Analytics as an Assistive Technology in the Biomedical Field" in Human-Computer Interaction and Knowledge Discovery in Complex, Unstructured, Big Data. Lecture Notes in Computer Science. Springer. pp. 13–24) • This data will grow up to 40 zettabytes by 2020 • The data might origin from: − Social Networks − Call Centers − “Letters” from Customer − ...
  • 5. What is the Problem with Unstructured Data? © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 5 • It is unstructured! − Not organized − No pre-defined data model − No metadata or mix of data and metadata Limited/No access to the data via classical programs • But the data contains valuable information We have a lot of information that is relevant for the business but we cannot access it
  • 6. How can we solve that issue? © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 6 • Text Analysis: Extracting high quality information from texts • Typical process of a text analysis: − Parsing of the text − Adding features like linguistic information − Insertion to database in structured manner • Examples for typical text analysis tasks: − Entity recognition: Is it an organization or a person or a place including domain facts like requests? − Sentiment analysis: What attitudinal information is “hidden” in the text? − Relationship, fact and event extraction
  • 7. Text Analysis with SAP HANA 7© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg Motivation - Big Data1 3 Text Analysis with SAP HANA2 7 Enhancement Options3 21
  • 8. What has this to do with SAP HANA? © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 8 © SAP SE
  • 9. Text Analysis with HANA - Basics © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 9 • Starting point: database table containing the text • Supported data types are: − TEXT − BINTEXT − NVARCHAR − VARCHAR − NCLOB, − CLOB − BLOB
  • 10. Text Analysis with HANA - Basics © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 10 Fulltext index incl. options (see system view SYS.FULLTEXT_INDEXES)
  • 11. © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 11
  • 12. Text Analysis with HANA - Basics © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 12 Index properties on the table
  • 13. Text Analysis with HANA - Basics © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 13 Fulltext index table $TA_*
  • 14. Text Analysis with HANA – Linguistic Analysis © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 14 LINGANALYSIS_BASIC = Tokenization
  • 15. Text Analysis with HANA – Linguistic Analysis © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 15 LINGANALYSIS_STEMS = Tokeniziation + Stems
  • 16. Text Analysis with HANA – Linguistic Analysis © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 16 LINGANALYSIS_FULL = Tokeniziation + Stems + Tagging
  • 17. Text Analysis with HANA – Entity Extraction © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 17 • In order to get more information out of the data SAP delivers several configurations • These configurations focus on entity and fact extraction under specific aspects • Types of Extraction: − EXTRACTION_CORE − EXTRACTION_CORE_ENTERPRISE − EXTRACTION_CORE_PUBLIC_SECTOR − EXTRACTION_CORE_VOICEOFCUSTOMER
  • 18. © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 18
  • 19. Text Analysis with HANA – Entity Extraction © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 19 EXTRACTION_CORE = Basic Entity Extraction (People, Organizations, Places)
  • 20. Text Analysis with HANA – Entity Extraction © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 20 EXTRACTION_CORE_VOICEOFCUSTOMER = Basic Entity Extraction + Sentiments
  • 21. Text Analysis with SAP HANA 21© msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg Motivation - Big Data1 3 Text Analysis with SAP HANA2 7 Enhancement Options3 21
  • 22. Text Analysis with HANA – Custom Dictionary © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 22 • In several use cases you might need to enhance the dictionary due to your business domain • Structure of a dictionary © SAP SE
  • 23. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 23 1. Find an extraction configuration that is most fitting for you 2. Copy the configuration into the target folder 3. Create a new custom dictionary 4. Reference the dictionary in your configuration copy 5. Recreate the fulltext index using your custom configuration
  • 24. © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 24
  • 25. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 25 1. Find an extraction configuration that is most fitting for you
  • 26. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 26 2. Copy the configuration into the target folder Important: File suffix *.hdbtextconfig
  • 27. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 27 3. Create a new custom dictionary Important: File suffix *.hdbtextdict
  • 28. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 28 4. Reference the dictionary in your configuration copy Important: You have to specify the full path
  • 29. Text Analysis with HANA – Workflow of Enhancement © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 29 5. Recreate the fulltext index using your custom configuration
  • 30. Text Analysis with HANA – Enhancement of Sentiment Analysis © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 30 • Special Case: Enhancement of sentiments • You can directly enhance/tailor the files delivered by SAP
  • 31. Text Analysis with HANA – What’s next? © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 31 • Assume that we are in an “industry”-specific context or mining for “slang”-like facts and entities • Good example for this are sports! • We use the example of CrossFit® … as there are some funny facts to extract • Question: How can we extract complex entities from a text? • Examples: − Did somebody attend a CrossFit training? − Does somebody want to join a CrossFit box?
  • 32. Text Analysis with HANA – Text Analysis Extraction Rules © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 32 Setup and Status Quo
  • 33. Text Analysis with HANA – Text Analysis Extraction Rules © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 33 • Extraction rules (CGUL rules): pattern-based language for pattern matching using character or token-based regular expressions combined with linguistic attributes to define custom entity types. • Goal of the rule sets: − Extract complex facts based on relations between entities and predicates. − Entity-to-Entity relations to associate entities such as times, dates, and locations, with other entities − Identify entities in domain-specific language. − Capture facts expressed in new, popular “slang”
  • 34. Text Analysis with HANA – Text Analysis Extraction Rules © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 34 Extraction Rule Regular ExpressionsTokens Luck ☺Dictionaries
  • 35. Text Analysis with HANA Tokens, Operators, Expression Markers and Directives © msg | September 2015 | Text Analysis with SAP HANA - IT Conference on SAP Technologies by msg 35 • Tokens define the syntactic units of the text analysis <string, STEM: <stem>, POS: <postag>> • Example: <activat.*, STEM: activat.*, POS: V> • Several operators are possible to enable the matching: − Standard operators e. g. character wildcard “.”, alternations “|” − Iteration operators e.g. zero or one occurrence of preceding item “?” ; zero or many occurrence of preceding item “*” − Grouping and containment operators, e. g. item group “( )”, range groups “[ ]”
  • 36. Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives • Expression markers delimit the scope in which the searched terms are matched • Several markers are available: − Paragraph marker: specifies the beginning and end of a paragraph – [P] [/P] − Entity marker: limits an expression to one or several entity types – [TE] <expr> [/TE] − Sentence marker: specifies the beginning and end of a sentence – [SN] [/SN] − Clause container: matches the entire clause if the expression is matched somewhere in the clause – [CC] <expr> [/CC]
  • 37. Text Analysis with HANA – Tokens, Operators, Expression Markers and Directives • Directives allow the definition of character classes, groups of tokens and relation types • #define (character class): denotes character expressions − Example: #define ALPHA: [A-Za-z] • #subgroup (group of tokens): defines a group of one or more tokens − Example: #subgroup Cloud: <HCP>|<AWS>|<Azure> • #group (relation type): definition of custom facts and entity types consisting of one or more tokens − Example: #group HANA: <HANA> − Example: #group HANANATIVE: %(HANA) <native>
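How the three directives build on each other can be sketched in plain Python. This is an imitation under stated assumptions, not CGUL itself: a "group" is modelled as a list of per-token regular expressions, `find` is a hypothetical helper, and the sample sentence is invented.

```python
import re

# Analogue of: #define ALPHA: [A-Za-z]
ALPHA = "[A-Za-z]"
WORD = [ALPHA + "+"]  # a single token made only of ALPHA characters

# Analogue of: #subgroup Cloud: <HCP>|<AWS>|<Azure>
CLOUD = ["HCP|AWS|Azure"]  # one token, three alternatives

# Analogue of: #group HANA: <HANA>
HANA = ["HANA"]

# Analogue of: #group HANANATIVE: %(HANA) <native>
# %(HANA) reuses the earlier definition, followed by the token "native".
HANANATIVE = HANA + ["native"]


def find(group, tokens):
    """Return every run of consecutive tokens that matches the group token by token."""
    n = len(group)
    return [
        " ".join(tokens[i:i + n])
        for i in range(len(tokens) - n + 1)
        if all(re.fullmatch(p, t) for p, t in zip(group, tokens[i:i + n]))
    ]


tokens = "We build HANA native apps and deploy them on HCP".split()
print(find(HANA, tokens))
print(find(HANANATIVE, tokens))
print(find(CLOUD, tokens))
print(find(WORD, ["SP09", "HANA"]))
```

The sketch treats tokens as plain strings; a real rule would also be able to constrain stems and part-of-speech tags per token.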
  • 39. Text Analysis with HANA – Text Analysis Extraction Rules: Step 1 – Create a dictionary (It is all about entities)
  • 40. Text Analysis with HANA – Text Analysis Extraction Rules: Step 2 – Create a custom configuration
  • 41. Text Analysis with HANA – Text Analysis Extraction Rules: Recreate the fulltext index with the custom configuration
  • 42. Text Analysis with HANA – Text Analysis Extraction Rules: Next step – Create a simple plain-text rule (*.hdbtextrule) and adapt the configuration
  • 43. Text Analysis with HANA – Text Analysis Extraction Rules: Result of the plain rule
  • 44. Text Analysis with HANA – Text Analysis Extraction Rules: Refactor and enhance the rule
  • 45. Text Analysis with HANA – Text Analysis Extraction Rules: Reduce the extracted entities using the PreProcessor Configuration
  • 46. Text Analysis with HANA – Summary • SAP HANA contains a lot of functionality • One very powerful feature is text analysis • Besides the delivered content, you have many options to adapt the text analysis so that it extracts the entities and facts you need • Since SP09, rules are compiled upon activation (no separate compilation necessary) • Creating custom dictionaries and text rules is cumbersome: there is no support in the IDE • The results of the text analysis form the basis of predictive analytics (also part of SAP HANA ☺)
  • 47. Q&A
  • 48. .consulting .solutions .partnership Dr. Christian Lechner Principal IT Consultant +49 (0) 171 7617190 christian.lechner@msg-systems.com msg systems ag (Headquarters) Robert-Buerkle-Str. 1, 85737 Ismaning Germany www.msg-systems.com
  • 49. Text Analysis with HANA – Resources • SAP HANA Search Developer Guide (Fulltext Index Options): help.sap.com -> Search Developer Guide • SAP HANA Text Analysis Developer Guide: help.sap.com -> TA Developer Guide • SAP HANA Text Analysis Language Reference Guide: help.sap.com -> TA Language Reference Guide • SAP HANA Text Analysis Extraction Customization Guide: help.sap.com -> TA Extraction Customization Guide • YouTube Playlist of SAP HANA Academy: Text Analysis and Search