May 9, 2018
Sharon Garewal, Taxonomy Manager (Sharon.Garewal@ithaka.org)
Harnessing the power of a
semantic index at JSTOR
ITHAKA is a not-for-profit organization that helps the academic
community use digital technologies to preserve the scholarly record
and to advance research and teaching in sustainable ways.
JSTOR is a not-for-profit
digital library of academic
journals, books, and
primary sources.
Ithaka S+R is a not-for-profit
research and consulting
service that helps academic,
cultural, and publishing
communities thrive in the
digital environment.
Portico is a not-for-profit
preservation service for
digital publications, including
electronic journals, books,
and historical collections.
Artstor provides 2+ million
high-quality images and
digital asset management
software to enhance
scholarship and teaching.
Presentation Outline
JSTOR Thesaurus
MAIstro Rulebase
JSTOR Labs Text Analyzer
LDA topic models
Thesaurus on JSTOR Platform
Topics appear on the search results page.
Topics appear at the document level (e.g. journal
article, book chapter).
Each topic at the document level links to a topic
page.
The topic page provides a description/image (if
available and correct) from Wikipedia.
Topic cards highlight the four most frequently occurring
thesaurus terms within the document.
Garewal Harnessing the Power of a Semantic Index at JSTOR
User feedback
Users can provide feedback on
the topics.
They can flag individual topics as
inaccurate, or give a thumbs up/thumbs down
to all topics.
A report is exported each week, and
terms flagged as inaccurate are
reviewed.
MAIstro
MAIstro is the thesaurus software developed
by Access Innovations. MAI = Machine Aided
Indexing. The software has three
components:
• Thesaurus Master: where terms
and the hierarchy are maintained.
• Rule Builder: automated and manual
entry of rules for terms, which allows for
customization.
• Test MAI: an area to test documents against
the thesaurus and the rule base for
suggested terms.
Indexing and display
Content requirements
Terms will not be applied to foreign language documents.
Terms will not be applied to documents that are coded as miscellaneous or
book reviews.
Some documents will not have terms applied because they are
too short.
A term must appear 3 times (default) before it triggers a match.
Suggested terms are given for each document; by default, 10 terms are
given.
• Ordered by number of times the term appears within the document.
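The two defaults above (a minimum of 3 occurrences, at most 10 suggested terms, ordered by frequency) can be sketched as follows. This is only an illustration, not MAIstro's rule base: the function name `suggest_terms`, its parameters, and the plain whole-word matching are assumptions made for the example.

```python
import re
from collections import Counter


def suggest_terms(text, thesaurus_terms, min_count=3, max_terms=10):
    """Return (term, count) pairs for terms seen at least min_count times."""
    lowered = text.lower()
    counts = Counter()
    for term in thesaurus_terms:
        # Assumption: case-insensitive whole-word matching stands in for real rules.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        occurrences = len(re.findall(pattern, lowered))
        if occurrences >= min_count:
            counts[term] = occurrences
    # Most frequent terms first, capped at max_terms.
    return counts.most_common(max_terms)


sample = ("Climate models project warming. Climate records show warming. "
          "The climate is changing, and warming continues. Glaciers retreat.")
suggestions = suggest_terms(sample, ["Climate", "Warming", "Glaciers"])
print(suggestions)
```

Here "Glaciers" occurs only once, so it falls below the threshold and is not suggested.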
JSTOR Labs works with partner publishers,
libraries and labs to create tools for
researchers, teachers and students that are
immediately useful – and a little bit magical.
Text Analyzer - beta
Text Analyzer
Analyzes arbitrary text to
find related content in
JSTOR archive
Drag-n-drop
File select
Text Analyzer
• Text Analyzer extracts topics and
named entities from submitted
text to find related/similar
documents in the JSTOR archive
• Topics are based on the terms in
the JSTOR Thesaurus
Text Analyzer
Text is submitted via:
• Direct input
• Copy/paste
• Local file
• Drag and drop from local
computer filesystem or a web URL
• Photo of text, via phone camera
A variety of document types
are supported:
• PDF
• MS-Word
• HTML
• RTF
• Plain text, PowerPoint, and Excel
• Images (on-the-fly OCR is
performed)
Another example
My bookshelf at work
Topics inferred
Using a smartphone photo as input…
Image analysis
My bookshelf at work
Topics inferred
Text Analyzer recommendations
Recommendations are based on a “best fit” of all prioritized
terms (topics and entities) and weights
• The selection of documents in results represents an ‘OR’ of documents
containing one or more terms
• Results ordering is based on a score representing the number of terms
matched, the importance of the term(s) to the document (based on LDA
weight) and the user-specified importance
• A user is able to quickly refine the terms and weights to tailor the results
to a specific need
Values used in relevancy and ranking calculations are available
for inspection
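The selection and ordering described above can be sketched in a few lines. The slides do not give the actual formula, so the score used here (the sum, over matched terms, of user importance times the term's weight in the document) is an assumption, as are the function name `rank_documents` and the sample data.

```python
def rank_documents(query_weights, doc_term_weights):
    """Rank documents that match at least one query term.

    query_weights: {term: user-specified importance}
    doc_term_weights: {doc_id: {term: weight of the term in that document}}
    """
    scored = []
    for doc_id, term_weights in doc_term_weights.items():
        matched = [term for term in query_weights if term in term_weights]
        if not matched:
            continue  # 'OR' selection: a document needs one or more matching terms
        # Assumed score: more matches and heavier terms both raise it.
        score = sum(query_weights[t] * term_weights[t] for t in matched)
        scored.append((doc_id, round(score, 3), sorted(matched)))
    return sorted(scored, key=lambda row: row[1], reverse=True)


docs = {
    "doc-a": {"Climatology": 0.6, "Glaciers": 0.2},
    "doc-b": {"Glaciers": 0.7},
    "doc-c": {"Economics": 0.9},
}
query = {"Climatology": 1.0, "Glaciers": 0.5}
ranking = rank_documents(query, docs)
for row in ranking:
    print(row)
```

Lowering the importance of "Climatology" in `query` and re-running changes the order, which mirrors how a user refines weights to tailor the results.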
Latent Dirichlet allocation (LDA)
• LDA is one of the most common algorithms for topic modeling
The Latent part of LDA comes into play because in statistics, a variable we
have to infer rather than directly observing is called a "latent variable". We're
only directly observing the words and not the topics, so the topics
themselves are latent variables (along with the distributions themselves).*
• LDA is based on the concept that:
• Every document is a mixture of topics
• Every topic is a mixture of words
• LDA is a mathematical method for estimating both of these at the same
time:
• Finding the mixture of words that is associated with each topic,
• while also determining the mixture of topics that describes each
document
* https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation
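The two mixtures can be written down compactly. The notation below is the standard textbook form of (smoothed) LDA rather than anything taken from the slides: the symbols θ, φ, z, w, α, β and K are introduced here for illustration.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word mixture of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{latent topic of the $n$-th word in } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word} \\
p(w \mid d) &= \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w}
\end{align*}
```

Only the words w are observed; fitting the model means estimating θ and φ together, which is the "both at the same time" step above.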
Latent Dirichlet allocation (LDA)
• Model training
• Can be supervised or unsupervised
• When performed unsupervised, a predefined number of topics is
identified, each represented by word probabilities
• Supervised training involves the use of a tagged corpus where the tags
will be used as topic labels
• For topics we use a subset of the JSTOR Thesaurus
• For the model training documents we’re now using Wikipedia
articles associated with each topic
• Topic inferencing
• Using a trained topic model, ‘latent’ topics in a document can be
inferred using the words in the text
• Topics are expressed as a probability distribution
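To make the train-then-infer flow concrete, here is a deliberately simplified stand-in. It is not LDA itself and not the JSTOR Labs implementation: word probabilities per topic are estimated from one labeled text per topic (playing the role of the Wikipedia article), and the topic mixture of a new text is then estimated by expectation-maximization with those word probabilities held fixed. The function names, the smoothing value, and the toy texts are all assumptions.

```python
from collections import Counter


def train_topics(labeled_docs, smoothing=0.01):
    """Estimate P(word | topic) from one labeled text per topic."""
    counts = {label: Counter(text.lower().split())
              for label, text in labeled_docs.items()}
    vocab = sorted(set().union(*counts.values()))
    model = {}
    for label, word_counts in counts.items():
        # Additive smoothing keeps every vocabulary word possible in every topic.
        total = sum(word_counts.values()) + smoothing * len(vocab)
        model[label] = {w: (word_counts[w] + smoothing) / total for w in vocab}
    return model


def infer_topics(model, text, iterations=50):
    """Estimate a topic probability distribution for a new text."""
    topics = list(model)
    words = [w for w in text.lower().split() if w in model[topics[0]]]
    mixture = {t: 1.0 / len(topics) for t in topics}
    if not words:
        return mixture  # nothing recognizable: stay with the uniform guess
    for _ in range(iterations):
        expected = dict.fromkeys(topics, 0.0)
        for w in words:
            # E-step: how responsible is each topic for this word?
            joint = {t: mixture[t] * model[t][w] for t in topics}
            norm = sum(joint.values())
            for t in topics:
                expected[t] += joint[t] / norm
        # M-step: the new mixture is the average responsibility per topic.
        mixture = {t: expected[t] / len(words) for t in topics}
    return mixture


labeled = {
    "Climatology": "climate temperature ice warming ocean climate",
    "Economics": "market price trade inflation market demand",
}
model = train_topics(labeled)
mix = infer_topics(model, "warming ocean temperature and market")
print(mix)
```

The result is a distribution over the labeled topics that sums to 1, with most of the mass on "Climatology" for this input.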
What is an LDA topic?
It’s simple…
OK, the math isn’t so simple but conceptually
a topic is just a set of “word” relationships
climate temperature earth ice warming global weather atmosphere climate_change ocean cycle oscillation carbon greenhouse model scientist
water age ice_age atmospheric event tropical gas pacific heat dioxide carbon_dioxide pattern average region wave surface extreme variation air
classification cold precipitation latitude cooling global_warming land radiation solar science greenhouse_gas rainfall determine condition enso
report emission current dry theory variability force summer polar infrared annual future north range mass climatic feedback atlantic northern
record sea rise scientific natural evidence scale factor planet winter cold_wave glacial air_mass warm climate_model regional cfc
extreme_weather climatology niño interglacial oceanic assessment pollution phase absorption location published america ancient result energy
arrhenius surface_temperature climate_oscillation vegetation seasonal trend moon south shift activity sun assessment_report climate_pattern
humid hurricane fluctuation anomaly methane conclude decadal maritime tree arctic concentration month short infrared_radiation glacier
future_climate monsoon forecasting global_temperature continental milankovitch orbital water_vapor vapor james estimate normal observe
maximum variable pdo heat_wave pacific_ocean climatologist tree_ring arid convince forcings holocene ice_sheet cloud fourier
climate_sensitivity icehouse weather_event wmo southern_oscillation climate_climate solar_variation climate_cycle north_america croll
human_emission icehouse_climate agassiz enso_event climate_index global_climate climate_variability mid_latitude paleoclimatology
thornthwaite köppen current_climate indian_ocean niño_southern_oscillation niño_southern interglacial_period climate_classification
electromagnetic_radiation term_climate ocean_atmosphere wind_shear fossil_fuel ramanathan wetherald manabe keeling absorbing_infrared
james_croll charpentier scientific_opinion buckland level_pressure sea_level_pressure inter_decadal tropical_pacific decadal_oscillation mjo
climate_science ozone_depletion nao extreme_weather_event change_climate climate_proxy east_pacific annual_basis ice_cap
subarctic_climate oceanic_climate humid_subtropical modern_climate climate_zone bergeron polar_ice regular_cycle scientific_literature
lake_bed current_interglacial temperature_fluctuation warm_period shorter_term classification_include climate_force milankovitch_cycle
projected_increase excessive_heat heatwave bioclimatology cfc_focused dioxide_molecule carbon_dioxide_molecule
absorbing_infrared_radiation lovelock_speculate james_lovelock_speculate scientist_james_lovelock_speculate scientist_james_lovelock
british_scientist_james_lovelock british_scientist_james core_drilled ice_core_drilled particulate_pollution aerosol_pollution sea_core
deep_sea_core david_keeling charles_david_keeling charles_david callendar varves högbom infrared_absorption measure_infrared
cycle_lasting venetz perraudin change_climate_change climate_change_climate_change climate_change_climate change_science
climate_change_science century_scientist background_climate bake_crust british_scientist langley james_lovelock cfc_molecule
chlorofluorocarbon_cfc tyndall john_tyndall extreme_event scientist_james hothouse energy_budget teleconnections sst_anomaly
For example, the top “words” associated with the topic Climatology
Named Entity Recognition (NER)
• Entities in a submitted text are identified and available for
document selection
• Persons
• Locations
• Organizations
• Results from multiple entity recognition engines are merged
during analysis
• IBM Alchemy
• OpenCalais (Thomson Reuters)
• OpenNLP (Apache)
• Stanford NER
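The slides do not say how the engine results are merged. One plausible approach, sketched here under that caveat, is to normalize each entity's text, group by text and type, and count how many engines agree. The engine labels `engine_a` and `engine_b`, the type names, and the function `merge_entities` are hypothetical.

```python
from collections import defaultdict


def merge_entities(engine_results):
    """Merge (text, type) pairs from several engines into (text, type, votes)."""
    votes = defaultdict(set)
    surface = {}
    for engine, entities in engine_results.items():
        for text, entity_type in entities:
            # Normalize case and whitespace so variants of one name group together.
            key = (" ".join(text.lower().split()), entity_type)
            votes[key].add(engine)
            surface.setdefault(key, text.strip())  # keep the first spelling seen
    merged = [(surface[key], key[1], len(engines))
              for key, engines in votes.items()]
    # Entities found by more engines come first; ties are sorted by name.
    return sorted(merged, key=lambda row: (-row[2], row[0]))


results = {
    "engine_a": [("James Lovelock", "PERSON"), ("Pacific Ocean", "LOCATION")],
    "engine_b": [("james  lovelock", "PERSON"), ("WMO", "ORGANIZATION")],
}
merged = merge_entities(results)
for row in merged:
    print(row)
```

The vote count gives a rough confidence signal: an entity reported by several engines is less likely to be a false positive from any single one.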
Future opportunities
• Multilanguage topic inferencing (being tested in Text Analyzer)
• Expanding the LDA topic model training set
• Integration of LDA topic modeling and MAIstro Rulebase
• Named entities on all documents
Thank You
