Bots & spiders

 Bio-informatica II
    19/04/2012

        Maté Ongenaert

   Center for Medical Genetics
Ghent University Hospital, Belgium
 Part 1: Bots & spiders
  Background

 Part 2: Real-life case studies
  The use of bots and spiders in bio-informatics
 About the presenter
   Bio-engineer in cell and gene biotechnology (2005)
    •   Master thesis: identification of cancer-specific methylated genes

   PhD applied biological sciences: cell and gene
    biotechnology (2009)
    •   PhD thesis: cellular reprogramming

   Industrial experience
    •   Research scientist (methylation biomarkers)

   Currently: postdoc at CMGG
    •   Prognostic methylation biomarkers in neuroblastoma
Part 1
Bots & spiders: background
Overview

 Bots and spiders
     Introduction
     Bots
     Spiders
     The Google case
 Bots/spiders and bio-informatics
     Automated querying
     APIs
     NCBI E-Utils (PubMed/GenBank)
     Ensembl
Bots and spiders

 Bots and spiders
    The web history
       •   In 1989, while working at CERN, Tim Berners-Lee
           invented a network-based implementation of the
           hypertext concept
       •   Since then, information can be retrieved by
           ‘following links’ without having to know the
           exact location in advance
       •   Information does not sit at a single location: it is
           dynamic and spread across machines
Bots and spiders

 Bots
   Webbots
      •   Also called web robots, WWW robots or bots: software
          applications that run automated tasks over the
          Internet

   Bots perform tasks that:
      •   Are simple
      •   Are structurally repetitive
      •   Must run at a much higher rate than would be
          possible for a human
      •   Example: an automated script fetches, analyses and
          files information from web servers at many times
          the speed of a human

   Other uses:
      •   Chatbots / IM / Skype / Wiki bots
      •   Malicious bots and bot networks (Zombies)
Bots and spiders

 Bots
   A spam bot, called the ‘Zunker Bot’
      •   Is installed on unpatched Windows machines
      •   Controls the clients through a neat application
      •   Can install additional software and execute commands
Bots and spiders

 Spiders
   Webspiders
      •   Webspiders / crawlers are programs or
          automated scripts that browse the World
          Wide Web in a methodical, automated
          manner. A spider is one type of bot

   The spider starts with a list of
    URLs to visit, called the seeds
      • As the crawler visits these URLs, it identifies
        all the hyperlinks in the page
      • It adds them to the list of URLs to visit, called
        the crawl frontier
      • URLs from the frontier are recursively visited
        according to a set of policies
      • This process is called web crawling: in most
        cases a means of collecting up-to-date data
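
The seed / frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the miniature "web" in TOY_WEB, the function names and the breadth-first, page-budget policy are all assumptions made for the example, and no real pages are fetched.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def toy_links(url):
    """Stand-in for 'download the page and extract its hyperlinks'."""
    return TOY_WEB.get(url, [])


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []
    while frontier and len(visited) < max_pages:   # assumed policy: page budget
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):                # hyperlinks found on this page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.org/"], toy_links)
print(order)
```

Replacing toy_links with a function that downloads and parses a real page turns the sketch into a working spider; the policies (politeness delays, robots.txt, depth limits) would then go into the loop condition.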
Bots and spiders

 Spiders
   Use of webcrawlers:
      •   Mainly used to create a copy of all the visited pages for later processing by a
          search engine that will index the downloaded pages to provide fast searches
      •   Automating maintenance tasks on a website, such as checking links or
          validating HTML code
      •   Can be used to gather specific types of information from Web pages, such as
          harvesting e-mail addresses

   The most commonly used crawler is probably
    Googlebot
      •   Crawls
      •   Indexes (content + key content tags and attributes, such as Title tags and ALT
          attributes)
      •   Serves results: PageRank Technology
Bots and spiders

 PageRank
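
As an illustration of the idea behind PageRank, the sketch below runs the textbook power iteration on a four-page toy graph. It is a simplified version under stated assumptions, not Google's implementation: the damping factor of 0.85, the fixed number of iterations and the toy graph are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; 'links' maps every page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}          # start from a uniform rank
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:               # pass rank along each link
                    new_rank[target] += share
            else:
                for target in pages:                  # dangling page: spread evenly
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: C is linked to by A, B and D.
TOY_GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(TOY_GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Pages that receive many links, or links from highly ranked pages, end up with a higher score; a page without incoming links keeps only the baseline share.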
Bots and spiders

 Google
   Hardware
      •   Standard server hardware (2009): 16 GB RAM / 2 TB storage per server
      •   2009 estimate: 450 000 servers – 2 million $/month electricity cost

   Software
      •   Web server (not Apache-based)
      •   Storage (Google File System / BigTable): distributed storage, mostly in
          memory
      •   Borg: job scheduling and monitoring
      •   Indexing services: Caffeine / Percolator
      •   MapReduce: a cluster system that splits a complex problem into ‘jobs’ and
          sends them to worker nodes (Map); the answers are gathered and combined
          to solve the original question (Reduce)
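
The Map / Reduce split can be illustrated with a word count run in a single process. This is only a sketch of the programming model: the function names and the three text chunks are assumptions, and a real cluster would run each map_phase call on a different worker node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: one worker turns its chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the values of all workers by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the partial answers for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already split into one chunk per worker.
chunks = ["gene methylation cancer", "cancer gene", "methylation marker cancer"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```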
Bots and spiders

 Bots/spiders and bio-informatics
   Automated querying
      •   Collecting information nowadays requires the ability to query data
          sources automatically (databases, websites, Google, Ensembl or NCBI databases)
      •   In web terms, a query is an HTTP GET or POST request
      •   Web queries using Perl: the LWP library

   LWP: a set of Perl modules that provides a simple and
    consistent application programming interface (API) to
    the World-Wide Web
      •   Free LWP e-book: http://lwp.interglacial.com/

   LWP for newbies
      •   LWP::Simple (demo1)
      •   Go to a URL, fetch data, ready to parse
      •   Caution: HTML tags get in the way when parsing with regular expressions
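
The same fetch-then-parse pattern can be sketched with the Python standard library (the demos in the course use Perl's LWP::Simple; this is an analogue, not a translation of demo1). The sample page and the class and function names are assumptions for the example. An HTML parser is used here instead of regular expressions, which sidesteps the tag-matching problem.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextAndLinks(HTMLParser):
    """Collects the visible text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def fetch(url):
    """Download a page as text (performs a real HTTP GET when called)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html):
    """Return (plain text, list of links) for an HTML document."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page content, so the example runs without network access.
SAMPLE = '<html><body><h1>BRCA1</h1><p>See <a href="/gene/672">the gene page</a>.</p></body></html>'
text, links = parse(SAMPLE)
print(text)
print(links)
```

With network access, parse(fetch(url)) does the same for a live page.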
Bots and spiders

 Bots/spiders and bio-informatics
   Some more advanced features
      •   LWP::UserAgent (demo2 – show server access logs)
      •   Fill in forms and parse results
      •   Depending on content: follow hyperlinks to other pages and parse these
          again,…
      •   Mechanize package: follow links; fill in forms,…

   Bioinformatics examples
      • Use genome browser data (demo3) and sequences
      • Get gene aliases and symbols from GeneCards (demo4)
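
A rough Python analogue of what LWP::UserAgent offers: building a GET query, filling in a form as a POST request, and identifying the bot with a User-Agent header (this is what shows up in the server access logs). The URL, the form fields and the bot name are hypothetical; nothing is sent unless urlopen is called on the request.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get_url(base_url, params):
    """GET: the query travels in the URL itself."""
    return base_url + "?" + urlencode(params)


def build_form_request(url, fields, user_agent):
    """POST: the form fields travel in the request body."""
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, headers={"User-Agent": user_agent}, method="POST")


# Hypothetical search form and bot name.
fields = {"gene": "TP53", "species": "human"}
get_url = build_get_url("http://example.org/search", fields)
request = build_form_request("http://example.org/search", fields, "course-demo-bot/0.1")

print(get_url)
print(request.get_method(), request.full_url)
print(request.get_header("User-agent"))
print(request.data)
# urllib.request.urlopen(request) would submit the form; it is not called here.
```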
Bots and spiders

 Bots/spiders and bio-informatics
   Why not make use of the crawling, indexing and serving
    technologies of others (e.g. Google)?
      • Google allows automated queries: 1000 queries a day per account
      • Google uses snippets: the short pieces of text you get in the main search
        results
      • These are the result of its indexing and parsing algorithms
      • Demo5: LWP and Google APIs combined, with parsing of the results

   API: Application Programming Interface
      •   Hides complexity by sharing ‘libraries’ with functions that can be applied within
          another programming language
      •   Bridges programming languages – crosses abstraction layers
      •   Example: displaying on a screen; printing; querying Google or NCBI from within
          a programming language
Bots and spiders

 Bots/spiders and bio-informatics APIs
   The Google example used the Google API
   NCBI API
      • The NCBI Web service is a web program that enables developers to access
        Entrez Utilities via the Simple Object Access Protocol (SOAP)
      • Programmers may write software applications that access the E-Utilities using
        any SOAP development tool
      • Main tools (demo6):
         – E-Search: Searches and retrieves primary IDs and term translations and
             optionally retains results for future use in the user's environment
         – E-Fetch: Retrieves records in the requested format from a list of one or
             more primary IDs
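
The E-Search then E-Fetch chain can be sketched as follows. Note the swap: instead of the SOAP service described above, this sketch uses the URL-based interface of the E-utilities (esearch.fcgi and efetch.fcgi). The search term, the sample result and its IDs are made up for the example, and no request is sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL that asks E-Search for the primary IDs matching a query."""
    return EUTILS_BASE + "/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})


def efetch_url(db, ids, retmode="xml"):
    """URL that asks E-Fetch for the records behind a list of IDs."""
    return EUTILS_BASE + "/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "retmode": retmode})


def parse_id_list(esearch_xml):
    """Extract the primary IDs from an E-Search XML answer."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


# Made-up E-Search answer with placeholder IDs, so no network access is needed.
SAMPLE_ESEARCH_RESULT = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>11111111</Id><Id>22222222</Id></IdList>
</eSearchResult>"""

search = esearch_url("pubmed", "neuroblastoma AND methylation", retmax=5)
ids = parse_id_list(SAMPLE_ESEARCH_RESULT)
fetch = efetch_url("pubmed", ids)
print(search)
print(fetch)
```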

   Ensembl API (demo7)
      •   Uses ‘Slices’ and adaptors
      •   You have to know the ‘application’ or database (Compare/Core/…)
Bots and spiders

 Bots/spiders and bio-informatics APIs
   NCBI API
   A frequently used NCBI database is PubMed
      •   PubMed can be queried using E-Utils
      •   Uses the same query syntax as the regular PubMed website
      •   Returns the data in the same formats as the website (XML, plain text)
      •   Parse XML results and apply more advanced Text-mining techniques
      •   Demo8
      •   Parse results and present them in an interface
           – Methylated genes in cancer:
           – http://matrix.ugent.be/mate/methylome/result1.html
           – miRNAs in cancer:
           – http://matrix.ugent.be/mate/textmining/preprocess/
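
Parsing the XML returned for PubMed records could look like the sketch below. The element names follow the PubMed XML layout (PubmedArticle, MedlineCitation, ArticleTitle, AbstractText, Author); the two records themselves are invented placeholders, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented records in the PubMed XML layout; PMIDs, titles and authors are placeholders.
SAMPLE_PUBMED_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>00000001</PMID>
    <Article>
      <ArticleTitle>Placeholder title on MYCN methylation</ArticleTitle>
      <Abstract><AbstractText>Placeholder abstract text.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        <Author><LastName>Roe</LastName><Initials>A</Initials></Author>
      </AuthorList>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>00000002</PMID>
    <Article><ArticleTitle>Second placeholder title</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def parse_pubmed(xml_text):
    """Return one dict (pmid, title, abstract, authors) per PubmedArticle."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        authors = [
            (a.findtext("LastName", default="") + " " + a.findtext("Initials", default="")).strip()
            for a in article.iter("Author")
        ]
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
            "abstract": " ".join(t.text or "" for t in article.iter("AbstractText")),
            "authors": authors,
        })
    return records


records = parse_pubmed(SAMPLE_PUBMED_XML)
for record in records:
    print(record["pmid"], record["title"], record["authors"])
```

Records without an abstract or author list simply yield an empty string or an empty list, so the text-mining step can skip them.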
Part 2
Real-life case studies: the use of bots and
         spiders in bio-informatics
Bots and spiders

 TextMining
   Create and translate query
      •   User query -> query suited for PubMed

   Query is executed, results are returned
      •   Result formats: XML, TXT, MEDLINE, ASN.1,…
      •   Human-readable vs. machine-parsable (XML parsers)

   Parse results
      • Extract information: authors, title, abstract
      •   Store results

   Analyse results
      •   Identify gene names, keywords, GO-terms,… -> score
      •   Semantic analysis / NLP processing / …

   Visualise results
      •   Highlighting, hierarchy, filters, searches, graphics
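
The "identify gene names -> score" step can be sketched as a simple count of whole-word matches per gene symbol. This is a deliberately naive scoring scheme chosen for illustration (no alias handling, no NLP); the abstracts are invented and the function name is an assumption.

```python
import re
from collections import Counter


def count_mentions(abstracts, gene_symbols):
    """Count whole-word occurrences of each gene symbol over all abstracts."""
    counts = Counter()
    for text in abstracts:
        for symbol in gene_symbols:
            pattern = r"\b" + re.escape(symbol) + r"\b"
            counts[symbol] += len(re.findall(pattern, text))
    return counts


# Invented abstracts, for illustration only.
abstracts = [
    "MYCN amplification and MYCN methylation in neuroblastoma.",
    "RASSF1A is methylated; MYCN is not.",
    "No gene is named here.",
]
scores = count_mentions(abstracts, ["MYCN", "RASSF1A", "TP53"])
ranking = scores.most_common()   # highest score first
print(ranking)
```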
Bots and spiders

 TextMining
   Demonstration: GoldMine
   Web-application
   Translate query – find aliases for genes or miRNAs
    and incorporate them in the search
   Query NCBI PubMed using E-fetch
   Get the results and process them
         Count
         Highlight
         Rank
         Visualization
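
Two of the steps listed above, translating the query with aliases and highlighting the hits, can be sketched as follows. This is not GoldMine's code: the alias table, the gene names and the use of `**` as highlight marker are hypothetical. The `[Title/Abstract]` field tag is regular PubMed query syntax.

```python
import re

# Hypothetical alias table; a real application would look these up in a gene database.
ALIASES = {"GENE1": ["G1", "GENE1A"]}


def expand_query(symbol, context, aliases):
    """Build a PubMed query that searches for the symbol and all its aliases."""
    names = [symbol] + aliases.get(symbol, [])
    gene_part = " OR ".join('"' + name + '"[Title/Abstract]' for name in names)
    return "(" + gene_part + ") AND " + context


def highlight(text, names, marker="**"):
    """Wrap every whole-word occurrence of a name in the marker."""
    longest_first = sorted(names, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(name) for name in longest_first) + r")\b"
    return re.sub(pattern, lambda match: marker + match.group(1) + marker, text)


query = expand_query("GENE1", "cancer", ALIASES)
marked = highlight("GENE1A and G1 are aliases of GENE1.", ["GENE1"] + ALIASES["GENE1"])
print(query)
print(marked)
```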
Bots and spiders

 Data analysis
     NCBI GEO – Gene Expression Omnibus
     Raw expression data on the FTP server
     Annotation: can be queried using NCBI E-Utils
     Annotation: in Excel files on the FTP server
     For specific experimental conditions, get all raw data
      and annotations and perform an automated analysis
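
For automated retrieval, the location of a series on the GEO FTP server can be derived from its accession number. The sketch below assumes the directory layout in which series are grouped per thousand (the last three digits replaced by "nnn"); the accession GSE12345 is only a placeholder, and series that span several platforms use different matrix file names, which this sketch does not handle.

```python
GEO_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_dir(accession):
    """FTP directory of a GEO series, e.g. GSE12345 -> .../GSE12nnn/GSE12345/."""
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError("expected an accession like GSE12345, got " + repr(accession))
    stub = "GSE" + accession[3:-3] + "nnn"   # last three digits replaced by 'nnn'
    return GEO_FTP_ROOT + "/" + stub + "/" + accession + "/"


def series_matrix_url(accession):
    """Assumed location of the series matrix file for a single-platform series."""
    return geo_series_dir(accession) + "matrix/" + accession + "_series_matrix.txt.gz"


print(geo_series_dir("GSE12345"))
print(series_matrix_url("GSE12345"))
```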
 Create a scheme of how you would proceed for the
  biological question: superficial vs.
  infiltrating bladder cancer
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
     Find experiments on GEO
     Annotation of samples: up to the submitters
     ‘Uniform’ sample sheet available (Matrix-file)
     Current update of GEO: view ‘factors’ in graphical
      overview
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
   Use this to link sample annotation features (stage,
    age, risk, sex) to the unique sample ID (GSMxxxxxxx)
   Get the raw data for each sample in the dataset
   Either txt files (uniform) or raw data files (such as Affy
    CEL files)
   Depends on the platform used: GPLxxxx
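
Linking annotation features to sample IDs can be sketched by reading the header of a series matrix file. The header line names (`!Sample_geo_accession`, `!Sample_characteristics_ch1`, `!series_matrix_table_begin`) follow the matrix file format; the sample IDs and annotation values below are invented, and the assumption that characteristics are written as "key: value" does not hold for every submission.

```python
import csv
import io

# Invented matrix header; fields are tab-separated and quoted, as in the real files.
SAMPLE_ROWS = [
    ["!Sample_title", '"tumour 1"', '"tumour 2"'],
    ["!Sample_geo_accession", '"GSM0000001"', '"GSM0000002"'],
    ["!Sample_characteristics_ch1", '"stage: Ta"', '"stage: T2"'],
    ["!Sample_characteristics_ch1", '"sex: male"', '"sex: female"'],
    ["!series_matrix_table_begin"],
    ['"ID_REF"', '"GSM0000001"', '"GSM0000002"'],
]
SAMPLE_MATRIX = "\n".join("\t".join(row) for row in SAMPLE_ROWS)


def sample_annotations(matrix_text):
    """Map each GSM sample ID to a dict of its 'key: value' characteristics."""
    sample_ids, characteristic_rows = [], []
    for fields in csv.reader(io.StringIO(matrix_text), delimiter="\t"):
        if not fields:
            continue
        if fields[0] == "!Sample_geo_accession":
            sample_ids = fields[1:]
        elif fields[0] == "!Sample_characteristics_ch1":
            characteristic_rows.append(fields[1:])
        elif fields[0] == "!series_matrix_table_begin":
            break                      # the expression values start here
    annotations = {gsm: {} for gsm in sample_ids}
    for row in characteristic_rows:
        for gsm, value in zip(sample_ids, row):
            if ": " in value:          # assumed 'key: value' convention
                key, val = value.split(": ", 1)
                annotations[gsm][key] = val
    return annotations


annotations = sample_annotations(SAMPLE_MATRIX)
print(annotations)
```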
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
   Platform / data files / samples / sample annotation
    relationship
   Set up standardised analysis strategy
   Make use of sample annotations
   Combine studies or keep them separate?
   Normalisation
   RankProd analysis
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
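     library(affy)         # just.rma(); also attaches Biobase for exprs()
     library(RankProd)     # RP()
     # SAMPLES + NORMALISATION: read the CEL files and normalise with RMA
     data.justrma <- just.rma("GSM90305.CEL", …)
     expression <- exprs(data.justrma)
     results[,2:103] <- expression
     # PLATFORM: annotation package for the array platform
     library(hgu95av2.db)
     # ANNOTATION: one class label per sample
     cl <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, …)
     # ANALYSIS STRATEGY: rank product analysis
     RP.out.stage <- RP(results[,3:104], cl, num.perm = 100, logged = TRUE,
                        na.rm = FALSE, plot = TRUE, rand = 123)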
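
To show what the rank product statistic measures, the sketch below computes it in plain Python for a handful of invented genes. It is a simplified illustration, not the RankProd package: it only covers up-regulation, takes ready-made log ratios as input, and skips the permutation step (num.perm above) that the package uses to estimate significance.

```python
import math


def rank_products(log_ratios):
    """Geometric mean of each gene's rank over the replicates.

    'log_ratios' maps a gene to one log ratio per replicate; rank 1 is the
    most up-regulated gene in a replicate, so a small rank product means
    the gene is consistently near the top.
    """
    genes = list(log_ratios)
    replicates = len(log_ratios[genes[0]])
    log_rank_sum = {gene: 0.0 for gene in genes}
    for j in range(replicates):
        ordered = sorted(genes, key=lambda gene: log_ratios[gene][j], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_rank_sum[gene] += math.log(rank)
    return {gene: math.exp(log_rank_sum[gene] / replicates) for gene in genes}


# Invented log ratios (infiltrating vs. superficial), three replicates per gene.
toy_ratios = {
    "geneA": [2.0, 1.8, 2.2],
    "geneB": [0.1, 0.3, -0.2],
    "geneC": [-1.5, -1.0, -1.2],
    "geneD": [1.0, 2.0, 0.5],
}
rp = rank_products(toy_ratios)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```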
Bots and spiders

 Case study: superficial vs. Infiltrating
  bladder cancer
     Combine results across studies
     Biological question vs. data analysis
     Scoring scheme, prioritization
     Superficial vs. Infiltrating
     Metastasis vs. Primary cancer
     High stage vs. Low stage
     Normal vs. Cancer
Bots and spiders

 OncoMine
Bots and spiders

 Integrated analysis
Rank     Meth Pca   Lit   Meth other Expression Pca   Progression   Rank1    2       3          4      5        6       7       8

                                                                                         EXPRESSION           RE-EXP   CpG      Pc

 1          1                               x                                       0,95        1     0,993   0,997    0,84     1

 2                                                                  0,998           0,995       1     0,958   0,091            0,994

 3                  1         x             x              x         1              0,993       1     0,996                    0,312

 4          1                 x             x              x        0,995   0,767   0,96               1      0,931    0,998   0,635

 5                  1                       x                       0,997           0,968       1      1      0,364    0,746   0,199

 6                                                         x                0,711   0,948             0,994   0,559    0,991   0,993

 7                                                                                  0,998             0,993    0,83    0,936   0,996

 8                                                                  0,997           0,99              0,998   0,759    0,726   0,575

 9                  1                       x              x        0,886           0,995             0,997     1               0,7

 10                 1                                               0,998           0,409             0,99     0,88    0,998   0,779

 11                 1                       x              x                        0,995             0,999   0,995            0,687

 12                 1                       x              x                        0,997             0,999   0,999            0,257

 13         1                 x             x              x        0,799   0,996   0,969             0,994   0,848    0,981   0,887

 14         1                 x             x                       0,916   0,568   0,99              0,993   0,994    0,988   0,558

 15                                                                                 0,986             0,995   0,956    0,983   0,998

 16         1                 x                                                     0,157       1     0,925   0,989    0,984   0,993
Acknowledgments


   CMGG
       Anneleen Decock
       Frank Speleman
       Jo Vandesompele


   BioBix
       Leander Van Neste
       Tim De Meyer
       Gerben Menschaert
       Geert Trooskens
       Wim Van Criekinge

