Text-Mining:
Big Data Analytics for Unstructured Data
Prof dr ir Jan C. Scholtes
https://textmining.nu
Exploratory Search
Text Mining
Text Mining: The next step in Search Technology
Finding without knowing exactly what you’re looking for, or finding what apparently isn’t there (or those who do not want to be found …).
Search
•Inverted file index
•Relevance ranking
•Relevance feedback
•Faceted search
•Incomplete matching
•Index compression
•Precision & Recall
Information Extraction
•Entity extraction
•Fact, Event & Concept extraction
•Negations, co-reference resolution
•Grammars
•Statistical methods: Hidden Markov Models, Maximum Entropy Models, Conditional Random Fields, …
•Data normalization (Ontology matching)
Link Analysis & Data Visualization
•Social network analysis
•Community Detection
•Different types of visualization for temporal, geographical, semantic or relational mappings.
Machine Learning
•Anomaly Detection
•Decision Tree
•Bayes Classifiers
•Rocchio
•k-NN
•Support Vector Machines
•Clustering
•CNN
•LSTM
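To make the Search items in the overview above concrete, here is a minimal Python sketch of an inverted file index, a simple relevance ranking, and precision and recall. The TF-IDF weighting, the whitespace tokenizer, the function names, and the three sample documents are assumptions made for illustration; the slides do not say which ranking formula is used.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; real systems use proper tokenizers.
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def rank(query, index, n_docs):
    """Order documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up sample documents.
docs = {
    "d1": "bribery case against the company",
    "d2": "the company sells medical devices",
    "d3": "kickbacks and bribery in medical devices",
}
index = build_inverted_index(docs)
results = rank("bribery kickbacks", index, len(docs))
```

Here `results` lists `d3` before `d1`, because `d3` matches both query terms and the rarer term `kickbacks` carries more weight. If only `d3` is considered relevant, `precision_recall(results, {"d3"})` gives a precision of 0.5 and a recall of 1.0.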
Language_Name: English
CITY: New Brunswick, WASHINGTON
COMPANY: J&J, Johnson & Johnson
COUNTRY: Greece, Poland, Romania, United Kingdom
CURRENCY: .02 USD, 21400000 USD, 48600000 USD, 59.47 USD, 70000000 USD
DATE: 04-08
DAY: Fri, Friday
NOUN_GROUP: biotech drugs, bribery case, denying guilt, final growth frontier, foreign countries, giving gifts, holding corporations, intense revenue pressure, meaningful credit, medical device kickbacks, medical devices, multiple businesses, next several days, non-U.S. markets, only way, orthopedic hips, other countries, over-the-counter medicines, paid kickbacks, past year, paying kickbacks, same time, several new positions, similar violations, travel gifts
ORGANIZATION: Department of Justice, Justice Department, SEC, Securities and Exchange Commission, University of Michigan
PEOPLES: Iraqi
PERSON: Erik Gordon, Mythili Raman, William Weldon
PLACE_REGION: Europe
PRODUCT: Benadryl, Tylenol
PROP_MISC: Band-Aids, Food Program, Foreign Corrupt Practices Act, United Nations Oil
STATE: N.J.
TIME: 1:32 pm ET
TIME_PERIOD: 13 years, five years, six months, three years
YEAR: 2007
Problem:
  "We went to the government to report improper payments and have taken full responsibility for these actions," said William Weldon, Chairman and CEO of J&J.
  Last month federal health regulators took legal control of the plant where millions of bottles of defective medication were produced.
  The charges against J&J were brought under the Foreign Corrupt Practices Act, which bars publicly traded companies from bribing officials in other countries to get or retain business.
  The company will pay $21.4 million in criminal penalties for improper payments and return $48.6 million in illegal profits, according to the government.
  The SEC says J&J agents used fake contracts and sham companies to deliver the bribes.
Sentiment:
  giving meaningful credit to companies that self-report
  We are committed to holding corporations accountable for bribing foreign officials
  what is honest
Request: make sure it complies with anti-bribery laws across its businesses
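As a rough illustration of how an extraction result like the one above can be produced, the following Python sketch applies a few hand-written patterns to a sentence and groups the matches by entity type. The patterns, the `extract_entities` name, and the normalization of amounts to whole USD are assumptions for this example only; it is not the extraction engine behind the slide, which may rely on grammars and statistical models.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration; production extractors combine
# grammars, dictionaries, and statistical models.
PATTERNS = {
    "CURRENCY": re.compile(r"\$(\d+(?:\.\d+)?) (million|billion)"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
    "ORGANIZATION": re.compile(
        r"\b(?:SEC|Justice Department|Department of Justice)\b"
    ),
}
SCALE = {"million": 1_000_000, "billion": 1_000_000_000}

def extract_entities(text):
    """Return {entity type: sorted list of distinct values found in text}."""
    found = defaultdict(set)
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if label == "CURRENCY":
                # Normalize "$21.4 million" to "21400000 USD".
                amount = round(float(m.group(1)) * SCALE[m.group(2)])
                found[label].add(f"{amount} USD")
            else:
                found[label].add(m.group(0))
    return {label: sorted(values) for label, values in found.items()}

text = ("The company will pay $21.4 million in criminal penalties and return "
        "$48.6 million in illegal profits, the SEC said.")
entities = extract_entities(text)
```

For this sentence, `entities` holds the two normalized amounts under `CURRENCY` and `SEC` under `ORGANIZATION`. No `YEAR` key is present because the sentence contains no four-digit year.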
WHAT happened?
WHO
WHAT-WHEN: Topic Rivers
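A topic river plots how strongly each topic is present in each time period. The Python sketch below only computes the counts that such a chart would draw; the keyword-based topic definitions, the `topic_counts_by_period` name, and the dated sample texts are hypothetical, and a real system would more likely derive topics from clustering or topic models.

```python
from collections import Counter, defaultdict

# Hypothetical topics defined by keyword sets, for illustration only.
TOPICS = {
    "bribery": {"bribes", "kickbacks"},
    "product_recall": {"recall", "defective"},
}

def topic_counts_by_period(dated_docs):
    """Return {period: {topic: keyword hits}}, ordered by period."""
    river = defaultdict(Counter)
    for period, text in dated_docs:
        tokens = text.lower().split()
        for topic, keywords in TOPICS.items():
            river[period][topic] += sum(1 for t in tokens if t in keywords)
    return {period: dict(counts) for period, counts in sorted(river.items())}

# Made-up sample data: (period, text) pairs.
dated_docs = [
    ("2010", "defective bottles trigger recall"),
    ("2011", "agents paid bribes and kickbacks"),
    ("2011", "more kickbacks reported"),
]
river = topic_counts_by_period(dated_docs)
```

Each period's counts correspond to the width of each topic's band at that point on the time axis.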
WHY & WHO: Emotion Detection
Anomaly Detection
Text Mining the Lord of the Rings
• Automatic identification of key players (custodians)
• Automatic identification of locations
• Automatic identification of travel patterns of key players
• Visualization over time
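The steps above can be sketched in Python as follows: walk through the passages in reading order, note which known characters and locations occur together, and record each character's sequence of locations. The two name lists, the `travel_patterns` function, and the sample passages are assumptions for illustration; in practice the names would come from entity extraction rather than hand-made lists, and co-occurrence in one passage is only a crude proxy for "was at".

```python
from collections import defaultdict

# Hypothetical name lists; in practice these come from entity extraction.
CHARACTERS = {"Frodo", "Gandalf", "Aragorn"}
LOCATIONS = {"Shire", "Bree", "Rivendell", "Moria"}

def travel_patterns(passages):
    """Return {character: [(passage index, location), ...]} in reading order."""
    routes = defaultdict(list)
    for time_step, text in enumerate(passages):
        tokens = set(text.replace(",", " ").replace(".", " ").split())
        people = tokens & CHARACTERS
        places = tokens & LOCATIONS
        for person in people:
            for place in sorted(places):
                # Record a stop only when the location changes.
                if not routes[person] or routes[person][-1][1] != place:
                    routes[person].append((time_step, place))
    return dict(routes)

# Made-up passages, not quotations from the book.
passages = [
    "Frodo and Gandalf talk in the Shire.",
    "Frodo meets Aragorn in Bree.",
    "Frodo, Gandalf and Aragorn gather in Rivendell.",
]
routes = travel_patterns(passages)
```

The passage index serves as the time axis, so `routes["Frodo"]` can be plotted directly as a path from the Shire via Bree to Rivendell.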
TEXT-MINING: BIG DATA ANALYTICS FOR UNSTRUCTURED DATA - Big Data Expo 2019
Big Data Analytics and the Law
• Memory Consistency
• 24/7
• Speed & Scalability
• Search
• M&A and Restructuring
• Data Collection
• Analytics
• eDiscovery, Regulatory Requests, Investigations, Fact-Finding Missions
• Reporting
• Archiving
• Knowledge Management
• Production
ZyLAB is used as the e-Discovery & e-Disclosure standard for all United Nations-backed War Crime Tribunals and ongoing UN courts
eDiscovery
• FOIA (WOB)
• Audits & Internal Investigations
• Litigation
• Arbitration
• Answering Regulatory Requests
• Subject Access Requests
• Right to be Forgotten
Teach the computer what to look for …
• 3x more relevant documents than Boolean search
• No complex queries, just review documents
• Only 2x the total number of relevant documents needs to be reviewed
• Accurately estimate the percentage of all relevant documents found at the end
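A minimal Python sketch of the idea, under stated assumptions: the reviewer labels one document at a time, the system adjusts term weights from each label, and the highest-scoring unreviewed document is shown next. The additive term-weight model, the function names, and the sample documents are hypothetical, and the slides do not name the learning algorithm actually used. The second function shows one common way to estimate the share of relevant documents found, by labeling a random sample of the unreviewed documents. The 3x and 2x figures from the slide are not reproduced here.

```python
from collections import Counter

def terms(text):
    return set(text.lower().split())

def review_loop(docs, is_relevant, seed_id, budget):
    """Review the top-ranked unreviewed document, learn from the label, repeat."""
    weights = Counter()
    reviewed, found = [], []
    next_id = seed_id
    while next_id is not None and len(reviewed) < budget:
        label = is_relevant(next_id)          # the human reviewer's decision
        reviewed.append(next_id)
        if label:
            found.append(next_id)
        for term in terms(docs[next_id]):     # simple additive term weights
            weights[term] += 1 if label else -1
        remaining = [d for d in docs if d not in reviewed]
        ranked = sorted(
            remaining,
            key=lambda d: (-sum(weights[t] for t in terms(docs[d])), d),
        )
        next_id = ranked[0] if ranked else None
    return reviewed, found

def estimate_recall(n_found, n_unreviewed, sample_labels):
    """Project the relevant rate of a random unreviewed sample onto the rest."""
    missed = sum(sample_labels) / len(sample_labels) * n_unreviewed
    total = n_found + missed
    return n_found / total if total else 1.0

# Made-up documents; d1, d3 and d5 are the relevant ones.
docs = {
    "d1": "bribes paid through sham contracts",
    "d2": "quarterly sales of medical devices",
    "d3": "fake contracts used to deliver bribes",
    "d4": "lunch menu for friday",
    "d5": "agents paid bribes to officials",
    "d6": "sales meeting on friday",
}
relevant = {"d1", "d3", "d5"}
reviewed, found = review_loop(docs, lambda d: d in relevant, "d1", budget=4)
```

In this toy run the three relevant documents are reviewed first, without any query being written. With 90 relevant documents found, 1000 documents left unreviewed, and 1 relevant document in a random sample of 100 unreviewed ones, `estimate_recall` returns roughly 0.9.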
CCPA
GDPR & AVG: Redaction, anonymization, …
How does it work?
Search, Pattern Recognition, Text-Mining
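As one small illustration of the pattern recognition step, the Python sketch below finds personal data with regular expressions and replaces each match with a label. The two patterns and the `redact` function are assumptions made for this example; they are far from complete, and real redaction also relies on search, entity extraction, and human review.

```python
import re

# Hypothetical patterns for two kinds of personal data, for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def redact(text):
    """Replace every match with a bracketed label and count the replacements."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

# Made-up input sentence.
redacted, counts = redact("Mail jan@example.com or call +31 20 123 4567.")
```

The counts give a simple audit trail of how many items of each type were redacted.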
Thank you!
Time for Q&A
Prof dr ir Jan C. Scholtes
https://www.linkedin.com/in/jscholtes/
https://textmining.nu
