Contact Us
info@easygenomics.com
http://www.easygenomics.com
Next Generation Bioinformatics
on the Cloud
Xing Xu, Ph.D
Director of Cloud Computing Product
Topics for Today
• Behind the cloud product
  • BGI
  • The team
• The product: EasyGenomics
  • Why are we building this product?
  • What can this product do?
• Future direction and open questions
BGI
• The world's largest genome sequencing center
• Started with the Human Genome Project in 1999 with only a few sequencers.
• Now more than 150 sequencers, with 6 TB/day sequencing throughput.

MODEL                  INSTALLATION
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135
BGI
• The largest computing and storage center for genomics in China
  - 20,000+ CPU cores
  - 19 NVIDIA GPUs
  - 220+ TFLOPS peak performance
  - 17 PB data storage
  - Storage and computation capability has increased 10,000-fold!
  - Still increasing …
BGI
• One of the world's leading research institutes in genomics
Since 2007:
  - 253 papers in high-impact journals
  - Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 as first and/or corresponding authors
  - 369 patent applications
  - 254 software authorships
BGI
BGI has the sequencing capacity, hardware resources, and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis, and data interpretation.
Team for the Cloud Platform
• Run like a software company
• Managers are from leading software companies, such as HP, Microsoft, and Lenovo.
• Team members are young, energetic, and ambitious.
• Fully supported by BGI in-house algorithm development teams.
Product
Development
Testing
Operation
BGI Support
Team for the Cloud Platform
• Development Team
  • Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
  • Flex Lab: Yan Li, Shengchang Gu, etc.
  • GPU Lab: Bingqiang Wang, etc.
  • Pipeline: Liang Wang, etc.
• Test & QA Team
  • Xin Guan, Jingjuan Liu, etc.
• PMO & IT Operation
  • Wenjun Zeng, Litong Lai, Jing Tian, etc.
• Product Team
  • Xing Xu, Jing Guo, Fang Fang, etc.
• Other BGI Teams
Trend of Volume and Cost
Geographical side of the problem
Sequencing happens EVERYWHERE.
[Maps of sequencing centers around the world, with BGI marked. Images from omicsmaps.com]
Difficulties of Analysis
• Primary analysis: base calling (data throughput, data storage)
• Secondary analysis: mapping (computation intensive, data storage)
• Tertiary analysis: variant calling (complicated algorithms, computation intensive)
• Post-tertiary analysis: in-depth annotation (lack of knowledge)
Problems and Solutions
Problems:
• Big genomic data
• Geographical distribution
• Algorithm integration
• Computational demand
Solutions:
• Cloud
• High Speed Data Exchange
• Pipelines
• Distributed Workloads
EasyGenomics™
EasyGenomics is a Software as a Service (SaaS)
bioinformatics platform for research and applications.
Algorithms, Workflows, Reports
Computational Resources
Database, Data management
Web portal, Simple UI
High speed connection
Key Features
• Bioinformatics Workflows
• Data Management
• High Speed Connection
Bioinformatics Workflow
• Four steps: Upload, Create a Sample, Perform Analyses, Download Results
• Algorithms: Carefully chosen, tested, and optimized
• Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
Homepage
Four task portals
Status of recent work
Warning and Logging
Navigation Tabs
Bioinformatics Workflow
--- Pipelines
Exome Resequencing, RNA-Seq Transcriptome
Bioinformatics Workflow
--- Comprehensive Reports
Data Management
• “Sample”, “Analysis”, “Project”
• Mimicking the real research procedure
• Automatic management of the underlying data structure
Raw Data → Sample A, Sample B → Analysis I, Analysis II, … Analysis X → Project I
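The Sample / Analysis / Project hierarchy can be sketched as a small data model. This is only an illustration of the relationships described on the slide; the class fields, file names, and the `sample_names` helper are assumptions, not the actual EasyGenomics schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """A sample aggregates one or more raw data files (for example, per-lane reads)."""
    name: str
    raw_files: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """An analysis runs one workflow on one or more samples."""
    name: str
    workflow: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Project:
    """A project groups related analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def sample_names(self) -> List[str]:
        """Distinct sample names used in the project, in first-seen order."""
        seen: List[str] = []
        for analysis in self.analyses:
            for sample in analysis.samples:
                if sample.name not in seen:
                    seen.append(sample.name)
        return seen


# Hypothetical file names, for illustration only.
sample_a = Sample("Sample A", ["lane1.fq.gz", "lane2.fq.gz"])
sample_b = Sample("Sample B", ["lane3.fq.gz"])
project = Project("Project I", [
    Analysis("Analysis I", "Exome Resequencing", [sample_a]),
    Analysis("Analysis II", "Exome Resequencing", [sample_a, sample_b]),
])
print(project.sample_names())
```

Because an analysis only references samples, the same sample can feed several analyses without copying the raw data, which is the reuse the platform advertises.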
Create a Sample
Add read groups
Sample Page
Individual report for each lane
Summarized report for all lanes
Data management
--- Security
Access, Multi-tenancy, Isolation, Compliance
• Username/Password
• Biometric access
• HTTPS, Aspera fasp™
• Trusted database connection
• ACL, data encryption
• Physical isolation
• Virtual isolation
• ISO 27000
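For username/password access, the speaker notes state that passwords are never stored in plain text. One common way to achieve that is salted key stretching; the sketch below uses PBKDF2 from the Python standard library. The function names and the iteration count are assumptions for illustration, and nothing here is confirmed to be what EasyGenomics actually does.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor; tune for the deployment


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); only these two values are stored, never the password."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong guess", salt, stored))
```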
High Speed Data Exchange
• Aspera’s patented fasp™ high-speed file transfer technology
• 10~100X faster than FTP
Transfer 24GB in 30 Seconds
• Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.
• A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
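The average rate implied by the headline can be checked with a few lines of arithmetic. This sketch assumes decimal gigabytes (1 GB = 8 Gbit) and a transfer time of exactly 30 seconds. Under those assumptions the plain average comes out lower than the rounded figure quoted on the slide, which may have been measured on a different basis.

```python
def average_throughput_gbps(size_gb: float, seconds: float) -> float:
    """Average rate in Gbit/s for size_gb decimal gigabytes sent in `seconds`."""
    return size_gb * 8 / seconds


rate = average_throughput_gbps(24, 30)
print(f"{rate:.1f} Gbit/s")                      # 6.4 Gbit/s
print(f"{rate / 10:.0%} of a 10 Gbit/s link")    # 64% of a 10 Gbit/s link
```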
Amount of Data that can be transferred in 24hr
[Chart] The amount of data transferred in 24 hours at different data transfer bandwidths. (Assuming the input read size is 10 GB, the total results are about 50 GB, the clean reads are about 10 GB, and the aligned reads (BAM) are about 20 GB.)
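The chart itself is not recoverable here, but the underlying calculation is simple. The sketch below assumes a fully utilized link and decimal units. The list of bandwidths is an illustrative assumption, while the 10 GB input and 50 GB result sizes are the ones stated in the caption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

INPUT_GB = 10     # input read size per sample (from the caption)
RESULTS_GB = 50   # total results per sample (from the caption)


def gb_per_day(bandwidth_mbps: float) -> float:
    """Decimal gigabytes that fit through a link of bandwidth_mbps in 24 hours."""
    return bandwidth_mbps * 1e6 * SECONDS_PER_DAY / 8 / 1e9


# Assumed bandwidths, chosen only to span a typical range.
for mbps in (10, 100, 1000, 10000):
    volume = gb_per_day(mbps)
    print(f"{mbps:>6} Mbit/s: {volume:>8.0f} GB/day, "
          f"{volume / INPUT_GB:.0f} sample uploads or "
          f"{volume / RESULTS_GB:.0f} full result downloads")
```

At 10 Mbit/s a day of transfer moves roughly ten input samples, while a 10 Gbit/s link moves about a thousand times more, which is the gap the high-speed exchange is meant to close.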
Easy-to-Use UI
• Reusability
  • Reuse the same sample for different analyses (different parameters)
  • Reuse all parameter settings for different analyses
• Simple UI and interactive features
  • As easy as online shopping
  • Shortcuts for predefined settings, while remaining fully customizable for advanced users
  • Handle batch analyses in one setting
Create an Analysis
Selected sample(s)
• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
Selectable modules
Predefined settings shortcut
Customizable
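The single-versus-batch rule above can be sketched as follows. The assumption here is that a batch expands into one job per selected sample, all sharing the same settings. The function name and the job dictionary layout are hypothetical.

```python
from typing import Dict, List


def plan_analyses(selected_samples: List[str], workflow: str,
                  settings: Dict[str, str]) -> Dict[str, object]:
    """Decide between a single analysis and a batch from the sample selection."""
    if not selected_samples:
        raise ValueError("select at least one sample")
    mode = "single" if len(selected_samples) == 1 else "batch"
    # One job per sample; every job reuses the same workflow and settings.
    jobs = [{"sample": s, "workflow": workflow, "settings": dict(settings)}
            for s in selected_samples]
    return {"mode": mode, "jobs": jobs}


plan = plan_analyses(["Sample A", "Sample B"], "Exome Resequencing",
                     {"preset": "default"})
print(plan["mode"], len(plan["jobs"]))
```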
Project Table
Add/Remove Project
Operation shortcuts
Project list table
Filter and search box
Analysis Table
Sample Table
A typical use case
[Diagram] A typical use case combining EasyGenomics and the customer's local computational resources. Double-line items are the customer's data or resources. Single-line items are results and data within BGI and the EasyGenomics platform. The widths of the arrows represent the sizes of the data flows (not to scale).
Future directions
• What is the market?
• Which direction to go?
  • Cloud on public infrastructure vs. cloud on private infrastructure
  • SaaS vs. PaaS
• Data analysis is only one step of the whole process.
• What will be the sustainable model for the cloud service?
Market Position
Cloud Service Providers
Annotation Providers
Sequencing Service Providers
Instrument Manufacturers
Personal Genetic Testing Providers
Software Providers
illumina
NOW
Challenge and Solution
Platform: DNANexus | Basespace (Illumina) | GenomeSpace | EasyGenomics | Ingenuity/NextBio
Cloud: Public | Public | Public | Private | Private
Reasoning: Great demand on space and computation resources | Security and privacy issues
Positioning: Infrastructure (PaaS) | App Store | Platform for accessing available tools | SaaS Solution | Information (they work with the results from NGS, not the raw reads)
Advantage: Funding; Advance in the field; Sequencing service; Community of partners; Strong connection to academia; Sequencing service; Development capability; Experience
Public vs Private Cloud
Public Cloud
Pros:
− “Limitless” resources
− Share data with a wide range of people
− Offers a nice platform
Cons:
− Security and reliability
− Short-term cost saving vs. long-term cost nightmare
Private Cloud
Pros:
− Flexibility
− Security and privacy control
− Long-term cost saving
Cons:
− Big initial investment
− Maintaining the infrastructure and software on the cloud
But the line between public and private cloud is blurring.
A sustainable model for cloud service?
• Key components of cost
  • Storage
  • Computational resources
  • Data transfer
  • Software usage
• App store or cell phone plan
• Long-term cost vs. short-term cost
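The two pricing styles named on the slide can be compared with a toy cost model built from the four cost components. Every rate, fee, and usage figure below is a made-up placeholder rather than EasyGenomics pricing, and the flat plan's compute allowance is an assumed design.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    storage_tb_months: float  # TB stored, times months kept
    core_hours: float         # CPU core-hours consumed
    transfer_tb: float        # TB moved in or out
    software_runs: int        # number of billed pipeline runs


@dataclass
class Rates:
    per_tb_month: float
    per_core_hour: float
    per_tb_transfer: float
    per_software_run: float


def pay_per_use(usage: Usage, rates: Rates) -> float:
    """'App store' style: every component is billed by consumption."""
    return (usage.storage_tb_months * rates.per_tb_month
            + usage.core_hours * rates.per_core_hour
            + usage.transfer_tb * rates.per_tb_transfer
            + usage.software_runs * rates.per_software_run)


def flat_plan(usage: Usage, rates: Rates, monthly_fee: float,
              included_core_hours: float) -> float:
    """'Cell phone plan' style: a fixed fee, plus compute beyond the allowance."""
    overage = max(0.0, usage.core_hours - included_core_hours)
    return monthly_fee + overage * rates.per_core_hour


# Placeholder prices and workloads, for illustration only.
rates = Rates(per_tb_month=50.0, per_core_hour=0.10,
              per_tb_transfer=20.0, per_software_run=5.0)
light = Usage(storage_tb_months=1, core_hours=500, transfer_tb=0.5, software_runs=2)
heavy = Usage(storage_tb_months=10, core_hours=20000, transfer_tb=5, software_runs=40)

for name, usage in (("light", light), ("heavy", heavy)):
    per_use = pay_per_use(usage, rates)
    plan = flat_plan(usage, rates, monthly_fee=1000.0, included_core_hours=15000)
    print(f"{name}: pay-per-use {per_use:.2f}, flat plan {plan:.2f}")
```

With these placeholder numbers the light user is better off paying per use and the heavy user is better off on the plan, which is the short-term versus long-term trade-off the slide raises.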
Data analysis is NOT ALL!
EPM
Project
Management
Sample Center
Wet Lab
Operation
Bioinformatics
Data Analysis
EPM
Management System
Budgeting
Tasking
Receipt/Storage
Handover
Sample QC
Sample prep
Workflow
Sequencing
Data analysis
Data QC
Sales
Billing
Web-based Interface
Management Interfacing Query Statistics
Roadmap of EasyGenomics
EG1.1 (Jun 2012)
• New result reports
• Fully integrated data exchange interface
EG1.2 (Aug 2012)
• New read filtering step, 20x speed-up
EG1.3 (Sep 2012)
• Data import from BGI sequencing service
EG1.5 (est. Dec 2012)
• QC indicator, QC module
• New sample report
• Transcriptome workflows
• Reference management
EG2.0 (est. Apr 2013)
• iRODS data management
• Data sharing, collaboration
• Users' own applications
• Comparison and filtering tools
• Visualization
www.EasyGenomics.com
The free beta trial is ongoing!
Interpretation is the KEY
• Analysis and Interpretation is the KEY
Enabling Technology
Best Practice Award for IT Infrastructure
Hadoop-based Flexible Computing

Human Genome       SOAPdenovo     EasyGenomics™ (192 cores)
Genome Coverage    86%            86%
Assembly Time      70h            55h
No. of Servers     1              15
Memory Size        500GB x 1      24GB x 15
Mode               Centralized    Distributed
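The comparison above can be reduced to two derived numbers: aggregate memory and relative assembly time. The sketch only restates the figures from the table. The dictionary layout is an assumption made for the example.

```python
centralized = {"name": "SOAPdenovo", "servers": 1,
               "memory_gb_per_server": 500, "hours": 70}
distributed = {"name": "EasyGenomics (Hadoop)", "servers": 15,
               "memory_gb_per_server": 24, "hours": 55}


def total_memory_gb(cfg: dict) -> int:
    """Aggregate memory across all servers in a configuration."""
    return cfg["servers"] * cfg["memory_gb_per_server"]


speedup = centralized["hours"] / distributed["hours"]
print(total_memory_gb(centralized), total_memory_gb(distributed))  # 500 360
print(f"{speedup:.2f}x faster")                                    # 1.27x faster
```

The distributed run uses less total memory and, more importantly, no single machine needs more than 24 GB, which is the "without high memory constraint" point made in the speaker notes.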
Enabling Technology
• SOAP Hadoop (Gaea)
• GPU
Easygenomics ISCB Cloud section 2012


Editor's Notes

  • #2: Good afternoon, ladies and gentlemen. First of all, I would like to thank everyone for coming to the session. My name is Sifei He, Director of the BGI Cloud Initiative, driving BGI’s cloud-based omics effort. I am glad to introduce my colleague Dr. Xing Xu, Senior Product Manager at BGI, responsible for all aspects of EasyGenomics, BGI’s latest SaaS-based bioinformatics product.
  • #11: This morning I was reading Monday’s USA Today. One of the cover stories was about an 18-year-old girl whose family history includes Huntington’s disease and who has decided to take a genetic test to see whether she carries the fatal gene. What impressed me is how close genetic testing and disease now are to our daily lives. Imagine that by 2030 the UN President candidates all publish their complete genomes: who would you vote for? A few years ago it was science fiction, but look at the trend today. The cost of sequencing 1 Mb of DNA has gone down dramatically, and thanks to these great instruments, the total number of human genomes sequenced has gone from 1 in 2003, when the Human Genome Project released its data, to a few thousand today. The numbers may vary, but the trend won’t change. If the trend shown against the red-dotted Moore’s law line continues as it has, we may well see the $1000 genome in 2012 or 2013, and the price will continue to drop toward $0. In contrast, we will be able to sequence many more genomes than today, and I’d like to quote Martin Leach’s “Humanity Genome” or “Hunome”.
  • #12: Over the past few years, we have been thinking about the $1000 genome and, of course, have done tons of great work to achieve that. Go big: getting just 0.1% of the world’s human population sequenced would cost $7 billion and generate around 700 petabytes of raw ATGC, the equivalent of 85 billion copies of The Complete Harry Potter Collection eBook. And that’s not the end of the story. The omicsmaps.com team created this nice map to illustrate sequencing capacity around the world. As the price of sequencing drops, there is reason to believe the map will look like this in a few years! The point is, sequencing is a commodity and it happens everywhere. Key takeaway 1.
  • #13: When I chat with collaborators and partners, everyone talks about the opportunities and possibilities introduced by NGS. Unfortunately, not all of them possess the knowledge and skills needed to handle the tremendous amount of data generated by NGS, which has become one of the biggest obstacles to fully utilizing this technology. On top of that, scientists often have to deal with numerous difficulties, such as data delivery on hard drives, management of computing and storage resources, installation and integration of multiple algorithms, and optimization of many parameters, to get reliable and meaningful results. If you wonder how BGI solved this, you are in the right session. If you want to access BGI’s bioinformatics solution, the next 20 slides are just for you.
  • #14: Again, just to summarize what we have learned: big genomics data, geographical distribution, algorithm integration, computational demand. Wherever there are problems, there are solutions: the cloud, with unlimited storage, computation, and access from anywhere; high-speed data exchange; well-tested and optimized algorithms; fine-tuned resource management. Together, these make up EasyGenomics.
  • #15: If we look at EasyGenomics from a feature perspective, the web portal, algorithms, workflows, resources, database, and high-speed data exchange are all packed into a simple solution on the cloud.
  • #16: The core technologies in building the cloud platform include bioinformatics workflows, data management, and high-speed data exchange.
  • #17: At the heart of EasyGenomics is our Bioinformatics Core: 5 workflows with carefully chosen algorithms, tested and optimized, with filtering, QC reports, and alignment along with other supporting features.
  • #22: When a user starts a new analysis project, there are three atomic objects he or she needs to look at: Sample, which is created by aggregating raw data; Analysis, which takes Samples as input; and Project, which encloses multiple analyses. Filtering, QC reports, and alignment are built in so that users don’t have to worry about them. Different pipelines may have different handles, but the basics remain the same. In this way, EasyGenomics provides a unified underlying data structure, mimicking your real research procedures.
  • #25: At EasyGenomics, we are serious about information security and have designed a secure multi-tenancy architecture from the ground up. Critical user data is protected with 256-bit encryption to make sure everyone is in stealth mode. Sample and project data are stored in the user’s designated virtual partition so that no one, not even the EasyGenomics operation team, can see them. As in many online applications, a secure login mechanism is provided, and every interaction you make with the system is encrypted using HTTPS. When it comes to data transfer security, EasyGenomics has partnered with Aspera to send and receive your data quickly and securely. Last but not least, we will never store your password in plain text!
  • #26: Today we announced a partnership with Aspera to deploy fasp technology with EasyGenomics. Some of you may be familiar with Aspera: it is the same technology used at the European Bioinformatics Institute and the National Center for Biotechnology Information for delivering large chunks of data over the open Internet. Speed? You know it! 10~100 times faster than FTP.
  • #49: That 700 PB does freak a lot of people out, but if anyone in this room asks me what matters most in today’s Big Genomics Data era, I would say information. Raw ATGC does NOT make any sense on its own. When you look into the so-called sex chromosomes, 200 million bp decide our gender and more… To this day, we know only a very limited part of the knowledge hidden in our genes. While sequencing continues to be a thrilling race, discovering the information behind Big Genomics Data presents a huge challenge to the community. And turning those scientific discoveries into consumable applications is the silver bullet. Key takeaway 2: analysis and interpretation of genome data is the KEY, and applying sequencing information to applications is the silver bullet.
  • #50: EasyGenomics has integrated the de novo assembly application on the Hadoop framework, with comparable performance and without the high memory constraint. A resequencing workflow based on the Hadoop framework is under development.
  • #51: EasyGenomics has integrated the de novo assembly application on the Hadoop framework, with comparable performance and without the high memory constraint. A resequencing workflow based on the Hadoop framework is under development.
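The back-of-envelope figures in note #12 can be checked with a short calculation. The world population of about 7 billion is an assumption implied by the quoted numbers. The cost per genome and the 700 PB total come from the notes, and decimal units are assumed.

```python
WORLD_POPULATION = 7_000_000_000   # assumption: about 7 billion people (2012)
COST_PER_GENOME_USD = 1_000        # the "$1000 genome" target from the notes
RAW_DATA_PB = 700                  # raw data volume quoted in the notes

genomes = WORLD_POPULATION // 1_000                # 0.1% of the population
total_cost_usd = genomes * COST_PER_GENOME_USD
gb_per_genome = RAW_DATA_PB * 1_000_000 / genomes  # 1 PB = 1,000,000 GB (decimal)

print(f"{genomes:,} genomes")                  # 7,000,000 genomes
print(f"${total_cost_usd / 1e9:.0f} billion")  # $7 billion
print(f"{gb_per_genome:.0f} GB per genome")    # 100 GB per genome
```

So the 700 PB estimate corresponds to roughly 100 GB of raw data per sequenced person.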