Using Open Source Tools for Visualization
 and Semantic Mapping in a Large Scale
         Article Digital Library

                  Glen Newton
               glen.newton@gmail.com
            Biology Dept, Carleton University
           http://zzzoot.blogspot.com/

                    Code4Lib-North
           Queen's University, Kingston, Ontario
                   Friday May 7 2010

             Based on VLDL2009 Workshop
               Presentation at ECDL2009
Outline

•   Maps of Science
•   Broad Research Interests
•   Research Goals
•   Process
•   Scalability issues
•   Open Source Tools
•   Environment
•   Results
•   Conclusions
•   Future Work
Maps of Science

• Figure: from Bollen et al. 2009, PLoS ONE
• Two figures: from Leydesdorff & Rafols 2006
Broad Research Interests
• Search results visualization & refinement
• Domain-specific discovery, with a particular interest in genomics
    and drug discovery
• Improved discovery in STM domains through results visualization
    and contextualization, browse/explore/refine
• Use of Open Source tools in complex research problem spaces
Research Goals

• Use Open Source tools to support large scale semantic text analysis and
     visualization
• Find a way to extract a journal (& article) semantic vector space (semantic
     representations of natural language work much better than keyword- or
     tf-idf-based ones)
• Latent Semantic Analysis (LSA) works for small/medium sized corpora, but
     does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors & avoids
     expensive singular value decomposition (SVD)
• Can SV scale & generate sensible semantic vector space of journals on
     corpus of this size?
• Can the visualization produced be useful for results query visualization,
     refinement, discovery?
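As a rough illustration of the "random vectors instead of SVD" idea, here is a minimal random-indexing sketch in Python. It is not the Semantic Vectors package itself (which is a Java library built on Lucene); the function names, the toy count matrix and the choice of 10 non-zero entries per term vector are assumptions made for this example.

```python
import numpy as np

def random_index_vectors(n_terms, dim, nonzeros, rng):
    """One sparse ternary (+1 / -1 / 0) random vector per term."""
    vectors = np.zeros((n_terms, dim))
    for t in range(n_terms):
        positions = rng.choice(dim, size=nonzeros, replace=False)
        signs = rng.choice([-1.0, 1.0], size=nonzeros)
        vectors[t, positions] = signs
    return vectors

def document_vectors(term_doc_counts, term_vectors):
    """Each document vector is the count-weighted sum of its terms' random vectors."""
    docs = term_doc_counts.T @ term_vectors          # shape: (n_docs, dim)
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                          # avoid dividing by zero
    return docs / norms

rng = np.random.default_rng(0)

# Toy term-by-document count matrix: 6 terms x 3 documents (made-up data).
# Documents 0 and 1 share most terms; document 2 uses different ones.
counts = np.array([
    [3, 2, 0],
    [2, 3, 0],
    [1, 1, 0],
    [0, 0, 4],
    [0, 0, 2],
    [0, 1, 3],
])

term_vecs = random_index_vectors(n_terms=6, dim=512, nonzeros=10, rng=rng)
doc_vecs = document_vectors(counts, term_vecs)
similarity = doc_vecs @ doc_vecs.T                   # cosine similarities
```

No matrix decomposition is needed: building the document vectors is a single pass of additions, which is why this style of method can handle many more terms and items than an SVD-based approach.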
Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
     etc
• ~4100 journal titles, classified into 23 categories (by publishers)
• ~8.4m journal articles
• Selection of articles/journals:
       – Only those with authors, abstract (no notices, obituaries, etc)
       – Only English language articles
       – Only journals with >50 articles in corpus
       – Resulting corpus: 5,733,721 articles from 2231 journals
       – Categories overlapping: 1.53 categories per journal
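The selection rules above can be sketched as a simple filter. The `Article` record and its field names are hypothetical, and the sketch assumes the ">50 articles" threshold is applied after the per-article filters; the slides do not say which order was used.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Article:
    # Hypothetical record layout, for illustration only.
    journal: str
    authors: list
    abstract: str
    language: str

def select_articles(articles, min_per_journal=50):
    """Keep English articles with authors and an abstract,
    then keep only journals with more than min_per_journal such articles."""
    eligible = [
        a for a in articles
        if a.authors and a.abstract.strip() and a.language == "en"
    ]
    per_journal = Counter(a.journal for a in eligible)
    return [a for a in eligible if per_journal[a.journal] > min_per_journal]

# Toy data: journal "A" has 51 eligible articles, journal "B" exactly 50.
toy = (
    [Article("A", ["X"], "abstract", "en") for _ in range(51)]
    + [Article("B", ["Y"], "abstract", "en") for _ in range(50)]
    + [Article("A", [], "notice without authors", "en")]
    + [Article("A", ["Z"], "résumé", "fr")]
)
selected = select_articles(toy)
```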
Corpus: journals per category

 Category                                       # Journals
 Agriculture & Biological Sciences              358
 Arts and Humanities                            70
 Biochemistry, Genetics and Molecular Biology   240
 Business, Management and Accounting            106
 Chemical Engineering                           126
 Chemistry                                      226
 Civil Engineering                              64
 Computer Science                               218
 Decision Science                               50
 Earth and Planetary Science                    146
 Economics, Econometrics and Finance            112
 Energy and Power                               73
 Engineering and Technology                     328
 Environmental Science                          138
 Immunology and Microbiology                    104
 Materials Science                              160
 Mathematics                                    205
 Medicine                                       671
 Neuroscience                                   103
 Pharmacology, Toxicology and Pharmaceutics     73
 Physics and Astronomy                          210
 Psychology                                     126
 Social Science                                 222
Process

• Index full-text (only) with Lucene 2.4, aggressive stopword list,
     Porter stemming using LuSql tool
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
     index, with 512 semantic dimensions
• Find item x item distance matrix from SV index of 512-dimensional
     vectors
• Using R, use multidimensional scaling (MDS) to reduce from 512-D
     to 2-D
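The last two steps (distance matrix, then MDS down to 2-D) can be sketched in Python with NumPy. This is an illustration only: the work described here used R, and the slides do not state which distance measure or MDS routine was used. The sketch assumes Euclidean distances and classical (metric) MDS, and the random input vectors stand in for real journal vectors.

```python
import numpy as np

def euclidean_distance_matrix(vectors):
    """Pairwise Euclidean distances between the rows of `vectors`."""
    sq = np.sum(vectors ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    return np.sqrt(np.clip(d2, 0.0, None))   # clip tiny negatives from rounding

def classical_mds(dist, k=2):
    """Classical MDS: double-centre the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

rng = np.random.default_rng(42)
journal_vectors = rng.normal(size=(20, 512))  # stand-in for 512-D SV vectors
dist = euclidean_distance_matrix(journal_vectors)
coords_2d = classical_mds(dist, k=2)          # one (x, y) pair per journal
```

The distance matrix is items x items, so its size grows with the square of the number of items; the 2-D coordinates are what a plotting tool would then draw.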
Scalability Issues

•  #items, #unique terms
        – #unique terms: SV handles very well
        – #items: SV handles fairly well
        – #items: impacts size of distance matrix (#items x #items)
        – R cannot handle huge article distance matrix in MDS (i.e.
             millions of articles vs. thousands of journals)
• Instead of using articles for items, use journals for items
• Make single large full-text document from concatenation of all
      articles of particular journal & index these
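A back-of-the-envelope calculation shows why the switch from articles to journals matters. The sketch below assumes a dense matrix of 8-byte floating point values (an assumption; the slides do not give the storage format) and uses the corpus counts from the Corpus slide. The `journal_documents` helper and its input format are hypothetical.

```python
from collections import defaultdict

def dense_matrix_bytes(n_items, bytes_per_value=8):
    """Size of a dense n_items x n_items matrix."""
    return n_items * n_items * bytes_per_value

def journal_documents(articles):
    """Concatenate article full text per journal.

    `articles` is an iterable of (journal_title, full_text) pairs.
    """
    parts = defaultdict(list)
    for journal, text in articles:
        parts[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in parts.items()}

n_articles = 5_733_721
n_journals = 2_231

article_matrix_tb = dense_matrix_bytes(n_articles) / 1e12   # terabytes
journal_matrix_mb = dense_matrix_bytes(n_journals) / 1e6    # megabytes

docs = journal_documents([("A", "first"), ("B", "other"), ("A", "second")])
```

Under these assumptions the article-level matrix would need roughly 263 TB, while the journal-level matrix needs about 40 MB, which fits comfortably in memory.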
Open Source Tools

•   Lucene
•   LuSql (High performance Lucene index building tool)
•   Semantic Vectors
•   R
•   Processing
•   Linux
Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
    processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM,
    attached to a Dell EMC AX150 storage array via SilkWorm
    200E Series 16-Port Capable 4Gb Fabric Switch.
• Operating system: Linux openSUSE 10.2 (64-bit X86-64), kernel
    2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit
  Server VM (build 10.0-b23, mixed mode).
• Processing 1.0 (processing.org)
Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
      – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
      – Distance matrix: 6 minutes
Results: Visualization

• Using Processing environment, built simple
    validation/visualization tool
Harder sciences and engineering categories

• Chemistry
• Material Science
• Physics and Astronomy
• Engineering and Technology
• Mathematics
• Computer Science
• Civil Engineering
• Chemical Engineering
Agriculture and biomedical categories

• Agriculture and Biological Sciences
• Biochemistry, Genetics and Molecular Biology
• Immunology and Microbiology
• Pharmacology
• Neuroscience
• Medicine
• Psychology
Interdisciplinary and non-science categories

• Environmental Science
• Earth and Planetary Science
• Energy and Power
• Decision Science
• Economics, Econometrics and Finance
• Social Sciences
• Business, Management and Accounting
• Arts and Humanities
Examination of outliers, extrema and cataloging errors

• Environmental Science: Ecotoxicology and Environmental Safety; Organic
     Geochemistry; Corporate Environmental Strategy
• Medicine: Journal of Biomolecular NMR; Journal of X-Ray Science and
     Technology
• Medicine: Colloidal and Polymer Science; Annales Henri Poincare
• Medicine: French language Medical & Psychology Journals
• Mathematics: Bulletin of Mathematical Biology; Journal of Medical
     Ultrasonics
Conclusions

•   Reasonable mapping results
•   Full-text only (no citations, metadata) gives good results
•   Scalable to significant size
•   Open Source tools supported a complex research process and
      were easy to modify to deal with scalability issues
Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
    discovery interface & evaluate
       – Index journal 'documents' and journal articles
       – SV on all
       – Distance matrix only on journals
       – Do MDS
       – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
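The planned "use eigenvectors to transform N-d article vector to 2-D" step could look roughly like the sketch below. It assumes Euclidean distances between journal vectors, in which case classical MDS on the journals gives the same layout (up to sign) as projecting the centred vectors onto their top principal axes, so new article vectors can be placed with the same axes. The function names and the random stand-in data are hypothetical.

```python
import numpy as np

def fit_journal_projection(journal_vectors, k=2):
    """Return the journal mean and the top-k principal axes (as columns)."""
    mean = journal_vectors.mean(axis=0)
    centered = journal_vectors - mean
    # Rows of vt are eigenvectors of the journal covariance matrix,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k].T

def project(vectors, mean, axes):
    """Map N-d vectors into the k-d journal space."""
    return (vectors - mean) @ axes

rng = np.random.default_rng(7)
journal_vectors = rng.normal(size=(30, 512))   # stand-in journal vectors
article_vectors = rng.normal(size=(5, 512))    # stand-in article vectors

mean, axes = fit_journal_projection(journal_vectors, k=2)
journal_xy = project(journal_vectors, mean, axes)
article_xy = project(article_vectors, mean, axes)
```

The decomposition here runs only on the journals-by-dimensions matrix (thousands of rows), so the article count does not affect its cost; each article is then placed with one small matrix product.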
Acknowledgements

• Collaborators: Michel Dumontier, Alison Callahan @Carleton
• Support: Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo

• Link to project demo page
License

Creative Commons Attribution-Noncommercial-No Derivative Works 2.
