Text-Mining:
Big Data Analytics for Unstructured Data
Prof dr ir Jan C. Scholtes
Exploratory Search
Text-Mining: structuring the unstructured
Text Mining
Text Mining: The next step in Search Technology
Finding without knowing exactly what you’re looking for, or finding what apparently isn’t there.
Search
• Inverted file index
• Relevance ranking
• Relevance feedback
• Faceted search
• Incomplete matching
• Index compression
• Precision & Recall

Information Extraction
• Entity extraction
• Fact, Event & Concept extraction
• Negations, co-reference resolution
• Grammars
• Statistical methods: Hidden Markov Models, Maximum Entropy Models, Conditional Random Fields, …
• Data normalization (Ontology matching)

Link Analysis & Data Visualization
• Social network analysis
• Community Detection
• Different types of visualization for temporal, geographical, semantic or relational mappings.

Machine Learning
• Decision Tree
• Bayes Classifiers
• Rocchio
• k-NN
• Support Vector Machines
• Clustering
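As an illustration of the first item under Search, an inverted file index can be sketched in a few lines of Python. This is a minimal sketch, not the implementation behind the slides: the function names `build_inverted_index` and `search_all`, the whitespace tokenizer, and the three sample documents are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive whitespace tokenization, no stemming or stop words.
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search_all(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, loosely based on phrases from the slides.
docs = {
    1: "bribery case against the company",
    2: "the company sells medical devices",
    3: "medical device kickbacks and bribery",
}
index = build_inverted_index(docs)
print(search_all(index, "bribery"))          # [1, 3]
print(search_all(index, "medical company"))  # [2]
```

The index is built once and every query then touches only the postings of its own terms, which is why this structure underlies most search engines. Relevance ranking and index compression would be layered on top of it.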
Language_Name: English
CITY: New Brunswick, WASHINGTON
COMPANY: J&J, Johnson & Johnson
COUNTRY: Greece, Poland, Romania, United Kingdom
CURRENCY: .02 USD, 21400000 USD, 48600000 USD, 59.47 USD, 70000000 USD
DATE: 04-08
DAY: Fri, Friday
NOUN_GROUP: biotech drugs, bribery case, denying guilt, final growth frontier, foreign countries, giving gifts, holding corporations, intense revenue pressure, meaningful credit, medical device kickbacks, medical devices, multiple businesses, next several days, non-U.S. markets, only way, orthopedic hips, other countries, over-the-counter medicines, paid kickbacks, past year, paying kickbacks, same time, several new positions, similar violations, travel gifts
ORGANIZATION: Department of Justice, Justice Department, SEC, Securities and Exchange Commission, University of Michigan
PEOPLES: Iraqi
PERSON: Erik Gordon, Mythili Raman, William Weldon
PLACE_REGION: Europe
PRODUCT: Benadryl, Tylenol
PROP_MISC: Band-Aids, Food Program, Foreign Corrupt Practices Act, United Nations Oil
STATE: N.J.
TIME: 1:32 pm ET
TIME_PERIOD: 13 years, five years, six months, three years
YEAR: 2007

Problem
• "We went to the government to report improper payments and have taken full responsibility for these actions," said William Weldon, Chairman and CEO of J&J.
• Last month federal health regulators took legal control of the plant where millions of bottles of defective medication were produced.
• The charges against J&J were brought under the Foreign Corrupt Practices Act, which bars publicly traded companies from bribing officials in other countries to get or retain business.
• The company will pay $21.4 million in criminal penalties for improper payments and return $48.6 million in illegal profits, according to the government.
• The SEC says J&J agents used fake contracts and sham companies to deliver the bribes.

Sentiment
• giving meaningful credit to companies that self-report
• We are committed to holding corporations accountable for bribing foreign officials
• what is honest

Request
• make sure it complies with anti-bribery laws across its businesses
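The CURRENCY row above shows amounts normalized to plain numbers ("21400000 USD"). A rule-based extractor can reproduce that normalization for one of the quoted sentences. This is only a sketch under stated assumptions: the slides do not say how the extraction was done (they list grammars and statistical models), and the pattern `AMOUNT` and the function `extract_currency` are hypothetical names that handle just the "$N million" form.

```python
import re
from decimal import Decimal

# Assumption: only amounts written as "$<number> million" are handled.
AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)\s+million")


def extract_currency(text):
    """Return dollar amounts in the text, normalized to whole USD."""
    # Decimal avoids binary floating-point rounding when scaling by a million.
    return [f"{int(Decimal(m) * 1_000_000)} USD" for m in AMOUNT.findall(text)]


sentence = (
    "The company will pay $21.4 million in criminal penalties for improper "
    "payments and return $48.6 million in illegal profits, according to the "
    "government."
)
print(extract_currency(sentence))  # ['21400000 USD', '48600000 USD']
```

Hand-written patterns like this tend to be precise but brittle. The statistical methods named on the earlier slide are the usual choice once entities such as PERSON or ORGANIZATION have to be recognized.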
WHAT happened?
WHO: Community Detection
WHAT-WHEN: Topic Rivers
WHY & HOW: Emotion Detection
Anomaly Detection
Σ(Φ)
Text Mining the Lord of the Rings
• Automatic identification of key players (custodians).
• Automatic identification of locations.
• Automatic identification of travel patterns of key players.
• Visualize in time.
Maastricht University - Text-Mining: Big Data Analytics for Unstructured Data
Big Data Analytics and the Law
Memory Consistency
24/7
Speed & Scalability
Search
M&A and Restructuring
Data Collection
Analytics
eDiscovery, Regulatory Requests, Investigations, Fact Finding Missions
Reporting
Archiving
Knowledge Management
Production
Source: Comparing the Performance of Artificial Intelligence to Human Lawyers in the Review of Standard Business Contracts, February 2018, LawGeex.
• Lack of precision leads to noise: too many false hits and too much work to review, which yields a high cost of review.
• Lack of recall leads to missing relevant documents, which yields risk.
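The two measures above can be made concrete with a short calculation. This is a minimal sketch using the standard definitions (precision = relevant hits / all hits, recall = relevant hits / all relevant documents). The function name `precision_recall` and the document ids are assumptions made for this example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical review: 5 documents returned, 8 documents actually relevant.
retrieved = [1, 2, 3, 4, 5]
relevant = [4, 5, 6, 7, 8, 9, 10, 11]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.4 0.25
```

In this example three of the five hits are noise that a reviewer must still read (the cost side), while six of the eight relevant documents are never seen (the risk side).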
Human Performance
• When both precision and recall are over 80%, human performance is approached.
• This applies to the best humans.
• It can be argued that values over 80% are often subject to different interpretations and discussions.
eDiscovery, Fact Finding Missions (waarheidsvinding), Investigations (regulatory and internal), Evidence Seizure (bewijsbeslag), …
Teaching the computer what you are looking for
Results
• Find 2-3x more relevant documents
• In a fraction of the time needed to review the entire data set
• You know exactly what percentage of relevant documents you found
• No need to understand complex search tools or queries: just reviewing
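The idea of teaching the computer through review decisions alone can be sketched as a Rocchio-style nearest-centroid ranker, one of the techniques named on the earlier overview slide. The slides do not state which classifier is actually used, so this is an illustration only. All names (`vectorize`, `centroid`, `cosine`, `rank_unreviewed`) and the sample texts are assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (assumption: whitespace tokens, no weighting)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of term-count vectors; an empty list gives an empty centroid."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_unreviewed(reviewed, unreviewed):
    """Order unreviewed texts by closeness to relevant vs. irrelevant examples."""
    relevant = centroid([vectorize(t) for t, is_rel in reviewed if is_rel])
    irrelevant = centroid([vectorize(t) for t, is_rel in reviewed if not is_rel])

    def score(text):
        vector = vectorize(text)
        return cosine(vector, relevant) - cosine(vector, irrelevant)

    return sorted(unreviewed, key=score, reverse=True)


# Hypothetical review decisions: (document text, marked relevant by reviewer).
reviewed = [
    ("improper payments to foreign officials", True),
    ("sham companies used to deliver bribes", True),
    ("quarterly cafeteria menu update", False),
]
unreviewed = [
    "office cafeteria menu for friday",
    "fake contracts to deliver improper payments",
]
print(rank_unreviewed(reviewed, unreviewed)[0])
```

The reviewer only marks documents as relevant or not and never writes a query. Each new decision shifts the centroids, so the documents most likely to be relevant move to the top of the review queue.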
Thank you!
Time for Q&A
Prof dr ir Jan C. Scholtes
https://www.linkedin.com/in/jscholtes/
