NLP Structured Data Investigation on Non-Text
Casey Stella
@casey_stella
2015
Casey Stella@casey_stella (Hortonworks)NLP Structured Data Investigation on Non-Text 2015
Table of Contents
Preliminaries
Borrowing from NLP
Demo
Questions
Introduction
• I’m a Principal Architect at Hortonworks
• I work primarily doing Data Science in the Hadoop Ecosystem
• Prior to this, I spent my time (and had a lot of fun)
◦ Doing data mining on medical data at Explorys using the Hadoop
ecosystem
◦ Doing signal processing on seismic data at Ion Geophysical using
MapReduce
◦ Being a graduate student in the Math department at Texas A&M in
algorithmic complexity theory
Domain Challenges in Data Science
A data scientist has to merge analytical skills with domain expertise.
• Often we’re thrown into places where we have insufficient domain
experience.
• Gaining this expertise can be challenging and time-consuming.
• Unsupervised machine learning techniques can be very useful to
understand complex data relationships.
We’ll use an unsupervised structure learning algorithm borrowed from
NLP to look at medical data.
Word2Vec
Word2Vec is a vectorization model created by Google [1] that
attempts to learn relationships between words automatically given a
large corpus of sentences.
• Gives us a way to find similar words by finding near neighbors in the
vector space with cosine similarity.
• Uses a neural network to learn vector representations.
• Work by Pennington, Socher, and Manning [2] shows that the
word2vec model is essentially equivalent to building a word
co-occurrence matrix, weighting it by window distance, and lowering
its dimension by matrix factorization.
Takeaway: The technique boils down, intuitively, to a riff on word
co-occurrence. See here¹ for more.
¹ http://radimrehurek.com/2014/12/making-sense-of-word2vec/
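To make the co-occurrence intuition concrete, here is a minimal sketch in plain Python. It is not word2vec itself: it only builds distance-weighted co-occurrence counts and ranks neighbors by cosine similarity, and it leaves out the dimension-reduction step entirely. The function names, the 1/distance weighting, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of distance-weighted co-occurrence counts."""
    vectors = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Assumed weighting: closer neighbors count more (1/distance).
                    vectors[word][sentence[j]] += 1.0 / abs(i - j)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=3):
    """Return the k words whose co-occurrence vectors are closest to `word`."""
    target = vectors.get(word, {})
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus (made up): two words that share contexts end up close together.
sentences = [
    ["insulin", "prescribed", "diabetes"],
    ["metformin", "prescribed", "diabetes"],
    ["ibuprofen", "taken", "headache"],
]
vectors = cooccurrence_vectors(sentences, window=2)
print(nearest("insulin", vectors, k=2))
```

In this toy corpus "insulin" and "metformin" appear in identical contexts, so "metformin" ranks first, while "ibuprofen" shares no context with either and scores zero.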
Clinical Data as Sentences
Clinical encounters form a sort of sentence over time. For a given
encounter:
• Vitals are measured (e.g. height, weight, BMI).
• Labs are performed and results are recorded (e.g. blood tests).
• Procedures are performed.
• Diagnoses are made (e.g. Diabetes).
• Drugs are prescribed.
Each of these can be considered clinical “words” and the encounter
forms a clinical “sentence”.
Idea: We can use word2vec to investigate connections between these
clinical concepts.
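As a rough sketch of what a clinical "sentence" could look like in code, the snippet below flattens one encounter into a list of tokens. The record layout (a dict keyed by category), the `category:value` token format, and the sample values are hypothetical; the actual preprocessing for this talk was done in Pig and Hive and may look quite different.

```python
# Hypothetical category names, in the order they are emitted into the sentence.
CATEGORIES = ("vitals", "labs", "procedures", "diagnoses", "drugs")


def normalize(value):
    """Lower-case a value and join its parts so it forms a single token."""
    return "_".join(str(value).lower().split())


def encounter_to_sentence(encounter):
    """Turn one encounter (dict of category -> list of values) into clinical words."""
    sentence = []
    for category in CATEGORIES:
        for value in encounter.get(category, []):
            # Prefix with the category so "diabetes" the diagnosis stays distinct
            # from any other token that happens to share the same text.
            sentence.append(category + ":" + normalize(value))
    return sentence


# Made-up encounter for illustration only.
encounter = {
    "vitals": ["BMI high"],
    "labs": ["HbA1c elevated"],
    "diagnoses": ["Diabetes"],
    "drugs": ["Metformin"],
}
sentence = encounter_to_sentence(encounter)
print(" ".join(sentence))
```

Numeric measurements would need to be binned into labels (as with "BMI high" here) before they can act as words; how to bin them is a modeling choice this sketch does not address.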
Demo
As part of a Kaggle competition², Practice Fusion, a digital electronic
medical records provider, released depersonalized clinical records of
10,000 patients. I ingested and preprocessed these records into
197,340 clinical “sentences” using Pig and Hive.
MLlib from Spark now contains an implementation of word2vec, so
let’s use PySpark and IPython Notebook to explore this dataset on
Hadoop.
² https://www.kaggle.com/c/pf2012-diabetes
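For readers without the notebook at hand, a minimal sketch of the kind of PySpark code the demo relies on might look like the following. It assumes the sentences were written out as a text file with one whitespace-delimited sentence per line, which is a guess about the preprocessing output, and the helper names, path, and parameter values are hypothetical rather than taken from the demo.

```python
def tokenize(line):
    """Split one whitespace-delimited clinical "sentence" into its "words"."""
    return line.strip().split()


def train_clinical_word2vec(sc, path, vector_size=100, seed=42):
    """Fit MLlib word2vec on a text file holding one clinical sentence per line."""
    # Imported here so tokenize() stays usable on a machine without Spark.
    from pyspark.mllib.feature import Word2Vec

    sentences = (
        sc.textFile(path)
        .map(tokenize)
        .filter(lambda words: len(words) > 0)  # drop blank lines
    )
    return Word2Vec().setVectorSize(vector_size).setSeed(seed).fit(sentences)


def similar_concepts(model, concept, count=10):
    """Return (word, cosine similarity) pairs for the nearest clinical words."""
    return list(model.findSynonyms(concept, count))


# Usage inside a notebook with an existing SparkContext `sc`
# (the path and the query token are placeholders):
#   model = train_clinical_word2vec(sc, "hdfs:///path/to/clinical_sentences.txt")
#   similar_concepts(model, "diagnoses:diabetes")
```

Note that MLlib's word2vec drops rare words by default (its minimum count setting), so a concept that appears only a handful of times will not be in the model's vocabulary.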
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk are available on my GitHub presentations
page.³
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
³ http://github.com/cestella/presentations/
Bibliography
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space. CoRR,
abs/1301.3781, 2013.
[2] Jeffrey Pennington, Richard Socher, and Christopher Manning.
GloVe: Global vectors for word representation. In Proceedings of
the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543. Association for
Computational Linguistics, 2014.