Biodiversity Informatics: Mining Untapped Resources. February 8, 2010. Marine Biological Laboratory and Woods Hole Oceanographic Institution Library. P. Bryan Heidorn, Director, University of Arizona School of Information Resources and Library Science
Biodiversity Information Diversity: Wrongly perceived as bioinformatics and two sets of base-pairs. Biodiversity = Data Complexity. Requires new information theory and cyberinfrastructure. Largely unrecognized as an interesting problem area in computer science.
The problem: Information is not accessible. Computer Science, Information Science and Technology have not addressed the problem.
Dark data is the data that we know is/was there but we can’t see it.   Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17
Naive View of Science Data (GenBank, PDB) and Power Law of Science Data: $f(x) = a x^{k} + o(x^{k})$, with the power-law panel marked $X < .20$. [Figure: Data Volume plotted against Science Projects and Initiatives.]
Does NSF’s Data Follow the Power Law? I do not know but if  $1 = X bytes…..
20-80 Rule: The small are big!
Total Grants: 9347, $2,137,636,716
                   20%                      80%
Number Grants      1869                     7478
Total Dollars      $1,199,088,125           $938,548,595
Range              $6,892,810-$350,000      $350,000-$831
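The percentages behind the 20-80 slide can be recomputed directly from its figures. The following is a minimal sketch, not part of the original talk; the variable and function names are illustrative, and only the numbers are taken from the slide.

```python
# Figures copied from the 20-80 Rule slide; all names here are illustrative.
TOTAL_GRANTS = 9347
TOTAL_DOLLARS = 2_137_636_716

groups = {
    "20%": {"grants": 1869, "dollars": 1_199_088_125},
    "80%": {"grants": 7478, "dollars": 938_548_595},
}


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for label, group in groups.items():
    grant_share = share(group["grants"], TOTAL_GRANTS)
    dollar_share = share(group["dollars"], TOTAL_DOLLARS)
    print(f"{label} column: {grant_share:.1f}% of grants, {dollar_share:.1f}% of dollars")
```

Under these figures the many smaller grants in the 80% column still account for roughly 44% of the dollars, which is the sense in which "the small are big".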
Related Ideas: John Porter: Deep versus Wide databases. Swanson: Undiscovered Public Knowledge. Science Commons: Big versus Small science.
Where to find dark data Literature/Biodiversity Heritage Library Museum Specimens Field notes (Un)Experimental data sets Citizen Observations
What is dark data good for? Ecological Niche Modeling Climate Change niche change prediction Taxonomic Name Resolution Literature Search Support Taxonomic intelligence Key-like – character searching Phenology and Phenology change Food-web / trophic level
Global Biodiversity Information Facility has 100s of millions of specimen records Animalia
Global Earth Observation System of Systems (GEOSS) and Historical Data: Unpublished observations of flowering time in Concord by Alfred Hosmer from 1888 to 1902. Photographs of flowers. Blue Hill Observatory meteorological data. Richard B. Primack, Abraham J. Miller-Rushing, Daniel Primack, and Sharda Mukunda (2007). Using Photographs to Show the Effects of Climate Change on Flowering Time. Arnoldia 65(1), p2-9. Historical and current data need to be in a form that allows for use and reuse.
Willis CG, Ruhfel BR, Primack RB, Miller-Rushing AJ, Losos JB, et al. 2010 Favorable Climate Change Response Explains Non-Native Species' Success in Thoreau's Woods. PLoS ONE 5(1): e8878. doi:10.1371/journal.pone.0008878  Favorable Climate Change Response Explains Non-Native Species’ Success in Thoreau’s Woods
The problem with Museum Specimens: >1 billion natural history specimens. Collected over 250 years, in many languages. No publishing standards. Near infinite classes. Your high school teacher lied. 6 min / label * 1B labels = 100M hours. Saving 1 min per label = 16.7 million hours; at $10/hr that is $167,000,000, or 1/4790 of the U.S. deregulation financial bailout.
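The labor estimate on this slide is plain arithmetic and can be checked in a few lines. This is a sketch using only the slide's own inputs (1 billion labels, 6 minutes per label, $10 per hour, the 1/4790 ratio); the constant and function names are illustrative.

```python
# Inputs taken from the slide; names are illustrative.
LABELS = 1_000_000_000
MINUTES_PER_LABEL = 6
HOURLY_WAGE_USD = 10
BAILOUT_RATIO = 4790  # the slide's "1/4790"


def hours_for(minutes_per_label, labels=LABELS):
    """Total person-hours needed at a given number of minutes per label."""
    return minutes_per_label * labels / 60


total_hours = hours_for(MINUTES_PER_LABEL)   # time to transcribe every label
saved_hours = hours_for(1)                   # effect of shaving one minute per label
saved_dollars = saved_hours * HOURLY_WAGE_USD

# The comparison figure implied by the slide's rounded $167,000,000 and 1/4790.
implied_bailout = 167_000_000 * BAILOUT_RATIO

print(f"total: {total_hours / 1e6:.0f}M hours")
print(f"saved: {saved_hours / 1e6:.1f}M hours, ${saved_dollars / 1e6:.0f}M")
print(f"implied comparison figure: ${implied_bailout / 1e9:.0f}B")
```

The ratio on the slide therefore corresponds to a comparison figure of about $800 billion.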
Natural History Specimens
Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels … <co> Curtis,  </co><hdlc>  North American Pl </hdlc><cnl> No.</cnl><cn> 503*</cn> <gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>… With Qin Wei, Univ of Illinois
Sample records
Sample OCR Output: Yale University Herbarium ~r-^""" r-n------- YU.001300 Curtisb,  North American Pl C^o.nr r^-n ANTS, No. 503* "^ Polygala ambigna, Nntt., var. Coral soil, Cudjoe Key, South Florida. Legit A. H. Curtiss.
Label Labels: bc - barcode; bt - barcode text; cm - common/colloquial name; cn - collection number; co - collector; cd - collection date; fm - family name; ft - footer info
Label Labels (continued): gn - genus name; hd - header info; in - infra name; ina - infra name author; lc - location; pd - plant description; sa - scientific name author; sp - species name
Example Training Record
<?xml version="1.0" encoding="UTF-8"?> <?oxygen RNGSchema="http://www3.isrl.uiuc.edu/~TeleNature/Herbis/semanticrelax.rng" type="xml"?> <labeldata> <bt>Yale University Herbarium </bt><ns> ~r-^""" r-n------</ns><bc> YU.001300 </bc><co  cc="Curtiss" > Curtisb,  </co><hdlc  cc="North American Plants" >  North American Pl </hdlc><ns>C^o.nr r^-n ANTS,</ns> <cnl> No.</cnl><cn> 503*</cn><ns> "^</ns> <gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val> <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co> </labeldata>
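A training record in this shape can be read into (tag, OCR text, correction) rows with Python's standard XML parser. The sketch below is an illustration rather than the Herbis project's own code: the record is abridged from the slide (the `<ns>` noise elements and the processing instructions are left out), and treating the `cc` attribute as the corrected form of the OCR text is an assumption based on how it appears in the example.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's record; <ns> noise elements are omitted.
RECORD = '''<labeldata>
<bt>Yale University Herbarium </bt><bc> YU.001300 </bc>
<co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl </hdlc>
<cnl> No.</cnl><cn> 503*</cn>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>'''


def read_record(xml_text):
    """Return one (tag, ocr_text, correction) tuple per labeled segment."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root:
        ocr_text = (element.text or "").strip()
        correction = element.get("cc")  # None when no correction is recorded
        rows.append((element.tag, ocr_text, correction))
    return rows


rows = read_record(RECORD)
for tag, ocr_text, correction in rows:
    print(f"{tag:5} {ocr_text!r} -> {correction!r}")
```

Keeping the raw OCR string next to the correction lets a learner see the errors it will meet in unlabeled data, such as "ambigna" and "Nntt." in this record.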
Supervised Learning Framework [Diagram. Training Phase: Gold Classified Labels, Machine Learner, Trained Model. Application Phase: Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels, Human Editing.]
Herbis Experimental Data 295  marked up records 74 label states 5-fold cross-validation
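For readers unfamiliar with the evaluation setup, 5-fold cross-validation over 295 records means five rounds, each holding out 59 records for testing and training on the other 236. The sketch below shows only the splitting step; the function name and the fixed seed are illustrative choices, and the learner itself is not shown.

```python
import random


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(295))
for round_number, (train, test) in enumerate(splits, start=1):
    print(f"round {round_number}: train on {len(train)}, test on {len(test)}")
```

Every record is used for testing exactly once, so the reported scores are averaged over all 295 records rather than a single held-out set.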
Performances of NB and HMM
Element Identifiers
Improved Performance With Field Element Identifiers
 
Learning w/ pre categorization [Diagram. Training: Gold Labels, Categorization, Class 1 Labels / Class 2 Labels / Class n Labels, one Machine Learner per class, Model 1 / Model 2 / Model n. Application: Unclassified Labels, Categorization, Class 1 Labels / Class 2 Labels / Class n Labels, one Machine Classification step per class, Classified Labels.]
FIG. 5. Improved Performance of Specialist Model. [Plot: Specialist (100 Curtiss) vs. 100 General; axis label: Iterations; series: Specialist, Random.]
Machine Learning in BioGeomancer's Locality Specification. P. Bryan Heidorn (1), Hong Zhang (1), Eugene Chung (2) and BGWG. (1) Graduate School of Library and Information Science, (2) Linguistics, University of Illinois. SPNHC & NSCA 2006
BioGeomancer Working Group (BGWG)  http://203.202.1.217/bgwebsite/index.html Worldwide collaboration of natural history and geospatial data experts Maximize the quality and quantity of biodiversity data that can be mapped  Support of scientific research, planning, conservation, and management Promotes discussion, manages geospatial data and data standards, and develops software tools in support of this mission
Participants
Example Locality Types
Record #   Specification of Location                               Locality Type
43         dario 7 mi wnw of; RIO VIEJO                            FOH; F
86         near Aleutian Islands; S of Amukta Pass                 NF; FH
100        INDIAN CREEK, 11 MI. W HWY 160                          P; POH
109        TIESMA RD, 1.5 MI NW EDGEWATER; OFF LAKE MICHIGAN R     P; FOH; NP
160        WALTMAN, 9 MI N, 2.5 MI W OF                            FOO
181        0.4 mi N Collinston on LA 138                           FPOH
204        Seward Peninsula; vic. Bluff, S coast                   F; NF; FS
 
JOH: offset from a junction at heading, e.g. 0.5 mi. W Sandhill and Hagadorn Roads. [FEATURE [CITY = Sandhill]] [FEATURE [ROAD = Hagadorn Roads]]. FRAME: OFFSET [VALUE = 0.5, DIRECTION = W, UNIT = mile]; JUNCTION [FEATURE [CITY = Sandhill]] [FEATURE [ROAD = Hagadorn Roads]]
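To make the JOH frame concrete, here is a small rule-based sketch that fills the OFFSET and JUNCTION slots for strings shaped like the slide's example. It is not BioGeomancer's method (the talk describes a machine-learning approach); the regular expression, the `parse_joh` function, and the unit table are all hypothetical, and the sketch does not assign feature types such as CITY or ROAD.

```python
import re

# Hypothetical pattern: "<distance> <unit> <heading> <feature> and <feature>".
JOH_PATTERN = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mi|km)\.?\s+"
    r"(?P<direction>[NSEW]{1,3})\s+"
    r"(?P<first>.+?)\s+and\s+(?P<second>.+?)\s*$",
    re.IGNORECASE,
)

UNIT_NAMES = {"mi": "mile", "km": "kilometer"}


def parse_joh(text):
    """Return a frame dict for a junction-offset-heading string, or None."""
    match = JOH_PATTERN.match(text)
    if match is None:
        return None
    return {
        "OFFSET": {
            "VALUE": float(match.group("value")),
            "DIRECTION": match.group("direction").upper(),
            "UNIT": UNIT_NAMES[match.group("unit").lower()],
        },
        # Untyped features; the slide's frame also labels them CITY or ROAD.
        "JUNCTION": [match.group("first"), match.group("second")],
    }


frame = parse_joh("0.5 mi. W Sandhill and Hagadorn Roads")
print(frame)
```

A single hand-written rule like this covers only one phrasing, which is part of the motivation for learning locality types from annotated examples instead.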
Xiaoya Tang and P. Bryan Heidorn. Different vocabularies in queries and documents. User query: Long leaves. Description of leaf length in texts: ... Leaves 20–75, many-ranked, spreading and recurved, not twisted, gray-green (rarely variegated with linear cream stripes), to 1 m × 1.5–3.5 cm, ... Inflorescences: ... spikes very laxly 6–11-flowered, erect to spreading, 2–3-pinnate, ...
Information Extraction From FNA
Original documents: ... Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, base obtuse to broadly cuneate, margins flat, coarsely and often irregularly doubly serrate to nearly dentate, ...
User log analysis. Templates for useful information: Leaf_Shape, Leaf_Margin, Leaf_Apex, Leaf_Base, Blade_Dimension, ...
Knowledge bases: ... PartBlade: Leaf blade, Blades, blade ...
Extraction Rules:
Pattern:: * <PartBlade> ' ' <leafShape> * ( <leafShape> ) ',' *  Output:: leaf {leafShape $1}
Pattern:: * <PartBlade> * ', ' ( <Range> ' ' * <LengUnit> ) * <PartBase>  Output:: leaf {bladeDimension $1}
Structured information: Leaf_Shape obovate; Leaf_Shape orbiculate; Blade_Dimension 3--9 x 3--8 cm; ...
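The extraction rules on this slide can be approximated with ordinary regular expressions. The sketch below is an illustration of the idea, not the system's rule engine: the shape vocabulary is a tiny hypothetical stand-in for the knowledge base, and the function name and regexes are assumptions. It is run on the slide's own sentence.

```python
import re

# Tiny stand-ins for the slide's knowledge bases (hypothetical contents).
LEAF_SHAPES = {"obovate", "orbiculate", "ovate", "lanceolate"}
PART_BLADE = r"(?:Leaf blade|Blades|blade)"

RANGE = r"\d+(?:\.\d+)?\s*(?:--|–)\s*\d+(?:\.\d+)?"
DIMENSION = re.compile(RANGE + r"\s*[×x]\s*" + RANGE + r"\s*(?:cm|mm|m)\b")


def extract_leaf_facts(text):
    """Return (template_slot, value) pairs found in a leaf description."""
    facts = []
    # Shape words appear between the blade term and the first comma.
    shape_clause = re.search(PART_BLADE + r"\s+([^,]+),", text)
    if shape_clause:
        for word in re.findall(r"[a-z]+", shape_clause.group(1)):
            if word in LEAF_SHAPES:
                facts.append(("Leaf_Shape", word))
    dimension = DIMENSION.search(text)
    if dimension:
        facts.append(("Blade_Dimension", dimension.group(0)))
    return facts


SENTENCE = (
    "Leaf blade obovate to nearly orbiculate, 3--9 × 3--8 cm, leathery, "
    "base obtuse to broadly cuneate, margins flat"
)
facts = extract_leaf_facts(SENTENCE)
print(facts)
```

Once descriptions are reduced to slots like these, a query for "long leaves" can be matched against Blade_Dimension values instead of relying on the query words appearing in the text.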
Results – System Performance
NT: number of tasks accomplished in total
NTH: number of tasks accomplished per hour
TSR: task success rate
SSR: search success rate
NSST: number of searches to accomplish a task
TST: time spent to accomplish a task
NDVST: number of documents viewed to accomplish a task
Group          NT      NTH     TSR     SSR     NSST    TST     NDVST
SEARFA         6.75    8.078   0.860   0.210   4.779   338.8   11.16
SEARF          4.50    3.598   0.568   0.053   9.584   435.2   14.75
Sig.(ANOVA)    0.005   0.005   0.000   0.011   0.000   0.72    0.162
Education Programs Biological Information Specialist Concentration in Data Curation (MSLIS) Certificate of Advanced Study in Data Curation Information and professional education in biodiversity informatics
Biological Information Specialists At present: Biologists at all degree levels self-trained in information technology Information technologists at all degree levels self-trained in biology  (both with gaps in knowledge for many months, years) Differing roles of BIS in large and small
Master of Science in Biological Informatics Degree Program began September 2007  Part of campus-wide bioinformatics masters program NSF/CISE/IIS, Education Research and Curriculum Development, 0534567  (Palmer, PI) Combines Biology, Bioinformatics, Computer Science core with LIS courses
What does a BIS need to know? Biological training   and interest in solving biological research problems Information skills   Evaluation and implementation of information systems:  user based assessment and continual quality improvement for the development of tools that work and are used. Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools. Information organization and integration:  ontology development, structuring information for optimal use and sharing, and standards development.
UIUC bioinformatics core coursework Cross-disciplinary course distribution requirement Bioinformatics:  Computing in Molecular Biology Algorithms in Bioinformatics Principles of Systematics Computer Science:  Algorithms Database Systems Biology: Human Genetics Introductory Biochemistry Macromolecular Modeling
Sample of existing LIS courses
Information Organization and Knowledge Representation: LIS 551 Interfaces to Information Systems; LIS 590DM Document Modeling; LIS 590RO Representing and Organizing Information Resources; LIS 590ON Ontologies in Natural Science
Information Resources, Uses and Users: LIS 503 Use and Users of Information; LIS 522 Information Sources in the Sciences; LIS 590TR Information Transfer and Collaboration in Science
Information Systems: LIS 456 Information Storage and Retrieval; LIS 509 Building Digital Libraries; LIS 566 Architecture of Network Information Systems; LIS 590EP Electronic Publishing
Disciplinary Focus: LIS 530B Health Sciences Information Services and Resources; LIS 590HI Healthcare Informatics (Healthcare Infrastructure); LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)
MSLIS Data Curation Concentration: Data Curation Educational Program (DCEP). IMLS – Laura Bush 21st Century Librarian Program, RE-05-06-0036-06 (Heidorn, PI). Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations.
New research directions IGERT Interactive Keys? Focus on integration and scale  Informatics infrastructure as competitive edge Sample areas of development Landinformatics Group Atmospheric science, hydrology, nutrient balance, carbon cycle,  ecology, agronomy BREC Focus on data integration problems across larger range of sciences
Example Service JRS Biodiversity Foundation National Science Foundation Taxonomic Database Working Group
JRS Biodiversity Foundation. History: The J.R.S. Biodiversity Foundation was created in January 2004 when the nonprofit publishing company BIOSIS was sold to Thomson Scientific. The proceeds from that sale were applied to fund an endowment and create a new grant-making foundation. Mission: The Foundation defined a mission within the field of biodiversity: To enhance knowledge and promote the understanding of biological diversity for the benefit and sustainability of life on earth.
JRS Biodiversity Foundation Scope: To further advance the Foundation’s mission a scope was developed as:  Interdisciplinary activities primarily carried out via collaborations in developing countries and economies in transition.  The Foundation Board of Trustees has expressed a particular interest in focusing its grant-making in Africa. Strategic Interest: Within those bounds a considered course has been chosen to:  Advance projects, or parts of biodiversity projects that focus on: (1) collecting data, (2) aggregating, synthesizing, publishing data, and making it more widely available to potential end users, and (3) interpreting and gaining insight from data to inform policy-makers
Grant Making: about $2M/yr. Animal Tracking in South Africa; Specimen Digitization in Ghana; Social Value of Conservation in Peru; Species Pages and BD Education in Costa Rica; Niche Modeling in Brazil; Travel Grants; Lake Victoria Data Library Project in Tanzania, Uganda and Kenya; e-Biosphere ‘09
National Science Foundation Advances in Biological Informatics Data Working Group Plant Science Cyberinfrastructure Center (iPlant) Cyber-enabled Discovery and Innovation Hiring Committees Division of Biological Infrastructure Planning
 
 
ALISE and AMISE: Historical Analysis of 100 Years back. Use Library and Museum Resources. Prepare for Blitz. K-12 / Graduate / Hobbyist BioDiversity Blitz. Build time capsule for Bicentennial in Cultural Heritage Institutions: Libraries and Museums



Editor's Notes

  • #14: Figure 1. Bar graphs depicting phylogenetically corrected mean differences between species groups for two climate change response traits: the correlation coefficient between first flowering day and annual spring temperature for the time period of 1888–1902 (A; i.e., flowering time tracking ), and the shift in mean first flowering day during the period exhibiting the most dramatic increase in mean annual temperature, from 1900–2006 (B; i.e., flowering time shift ).
  • #18: Not handwriting
  • #49: Insert lake victoria overlay
  • #50: Insert lake victoria overlay
  • #51: Insert lake victoria overlay