NLP Structured Data Investigation on Non-Text
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) NLP Structured Data Investigation on Non-Text 2016
Table of Contents
Preliminaries
Borrowing from NLP
Demo
Questions
Introduction
Hi, I’m Casey Stella!
Domain Challenges in Data Science
A data scientist has to merge analytical skills with domain expertise.
• Often we’re thrown into places where we have insufficient domain experience.
• Gaining this expertise can be challenging and time-consuming.
• Unsupervised machine learning techniques can be very useful for understanding
complex data relationships.
We’ll use an unsupervised structure learning algorithm borrowed from NLP to look at
medical data.
Word2Vec
Word2Vec is a vectorization model created by Google [1] that attempts to learn
relationships between words automatically given a large corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in the vector space
with cosine similarity.
• Uses a neural network to learn vector representations.
• Work by Pennington, Socher, and Manning [2] shows that the word2vec model is
roughly equivalent to weighting a word co-occurrence matrix by window distance and
lowering its dimension by matrix factorization.
Takeaway: The technique boils down, intuitively, to a riff on word co-occurrence. See
footnote 1 for more.
1 http://radimrehurek.com/2014/12/making-sense-of-word2vec/
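The co-occurrence intuition can be sketched in a few lines of plain Python. This is an illustrative toy, not word2vec itself: it builds co-occurrence counts weighted by window distance (a pair d positions apart contributes 1/d, an assumed weighting) and ranks neighbors by cosine similarity. The matrix factorization step is omitted, and the sentences and function names are made up for the example.

```python
import math
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Distance-weighted co-occurrence counts: a pair d positions apart adds 1/d."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][sentence[j]] += 1.0 / abs(i - j)
    # Return plain dicts so later lookups cannot silently create new entries.
    return {word: dict(row) for word, row in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(counts, word, k=3):
    """The k words whose co-occurrence rows are most similar to `word`'s row."""
    scores = [(other, cosine(counts[word], counts[other]))
              for other in counts if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up toy corpus: "cat" and "dog" appear in similar contexts.
sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "cat", "chases", "mice"],
    ["the", "dog", "chases", "cats"],
]
counts = cooccurrence(sentences)
print(nearest(counts, "cat"))
```

On this toy corpus "dog" comes out as the nearest neighbor of "cat" because the two words share most of their contexts. A real model would additionally compress each row into a short dense vector.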
Clinical Data as Sentences
Clinical encounters form a sort of sentence over time. For a given encounter:
• Vitals are measured (e.g. height, weight, BMI).
• Labs are performed and results are recorded (e.g. blood tests).
• Procedures are performed.
• Diagnoses are made (e.g. Diabetes).
• Drugs are prescribed.
Each of these can be considered a clinical “word”, and the encounter forms a clinical
“sentence”.
Idea: We can use word2vec to investigate connections between these clinical concepts.
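As a minimal sketch of the encoding, assume the records have already been flattened into (encounter id, timestamp, kind, code) tuples. That layout, the token format, and every value below are hypothetical and chosen only for illustration; the real dataset and the talk's preprocessing may look different.

```python
from collections import defaultdict

# Hypothetical flattened event records: (encounter_id, timestamp, kind, code).
events = [
    ("enc-1", 3, "diagnosis", "diabetes"),
    ("enc-1", 1, "vital", "bmi_high"),
    ("enc-1", 2, "lab", "a1c_elevated"),
    ("enc-2", 1, "vital", "bmi_normal"),
    ("enc-1", 4, "drug", "metformin"),
    ("enc-2", 2, "procedure", "flu_shot"),
]


def to_sentences(events):
    """Group events by encounter and order them by time to form token lists."""
    by_encounter = defaultdict(list)
    for encounter_id, timestamp, kind, code in events:
        # Prefix each code with its kind so that, for example, a lab result
        # and a diagnosis with the same name remain distinct "words".
        by_encounter[encounter_id].append((timestamp, f"{kind}:{code}"))
    return {
        encounter_id: [token for _, token in sorted(items)]
        for encounter_id, items in by_encounter.items()
    }


for encounter_id, sentence in sorted(to_sentences(events).items()):
    print(encounter_id, " ".join(sentence))
```

Each output line is one clinical "sentence" that can be fed to any word2vec implementation in place of a tokenized line of text.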
Demo
As part of a Kaggle competition (see footnote 2), Practice Fusion, a digital electronic
medical records provider, released depersonalized clinical records of 10,000 patients.
I ingested and preprocessed these records into 197,340 clinical “sentences” using Pig
and Hive.
MLlib from Spark now contains an implementation of word2vec, so let’s use PySpark
and IPython Notebook to explore this dataset on Hadoop.
2 https://www.kaggle.com/c/pf2012-diabetes
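A rough sketch of what such an exploration could look like with the RDD-based MLlib API is shown below. It is not the code from the talk: the input layout (one sentence per line, tokens separated by whitespace), the helper names, and the parameter values are assumptions made for illustration.

```python
def parse_sentence(line):
    """Split one line of text into a list of clinical tokens."""
    return line.strip().split()


def train_word2vec(sc, path, vector_size=100, seed=42):
    """Fit MLlib's Word2Vec on a text file holding one sentence per line."""
    # Imported lazily so parse_sentence can be used without Spark installed.
    from pyspark.mllib.feature import Word2Vec

    sentences = (
        sc.textFile(path)
        .map(parse_sentence)
        .filter(lambda tokens: len(tokens) > 0)  # drop blank lines
    )
    return Word2Vec().setVectorSize(vector_size).setSeed(seed).fit(sentences)


def similar_concepts(model, token, k=10):
    """Return up to k (token, cosine similarity) pairs closest to `token`."""
    return list(model.findSynonyms(token, k))
```

In a notebook with a live SparkContext `sc`, this might be used as `model = train_word2vec(sc, path)` followed by `similar_concepts(model, token)`, where `path` points at the preprocessed sentences file and `token` is one of the clinical "words" in it.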
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk are available on my GitHub presentations page (footnote 3).
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
3 http://github.com/cestella/presentations/
Bibliography
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. CoRR, abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global
vectors for word representation. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association
for Computational Linguistics, 2014.