Taipei | www.atgenomix.com | info@atgenomix.com
Big Data Solution for NGS Data Analysis
Steven Li, Sr. Data Scientist 

yunlung_li@atgenomix.com
Outline
•Software Containerization
•NGS genome analysis
•Distributed, RESTful search and analytics
Software Containerization
•Docker Introduction
•Docker In Action
•Docker in Atgenomix SeqsLab
What is Docker
Docker Container Image
• container image: a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.
https://www.docker.com/what-container
Docker Container
https://www.docker.com/what-container
• Behaves like a VM, but is as light as a process
Software Container
https://www.slideshare.net/jonasrosland/docker-and-containers-for-boston-docker-meetup-workshop-in-march-2015
Cargo Container
https://www.slideshare.net/jonasrosland/docker-and-containers-for-boston-docker-meetup-workshop-in-march-2015
Run Your System Everywhere
https://www.wired.com/2016/07/windows-10-free-upgrade-ends/
Architecture
https://docs.docker.com/engine/docker-overview/#docker-architecture
Container Warehouse
• https://hub.docker.com/
• https://store.docker.com/
• https://aws.amazon.com/ecr/
• private docker registry
• …
Container Warehouse
https://www.ibm.com/blogs/bluemix/2015/08/c-ports-docker-containers-across-multiple-clouds-datacenters/
Docker Client
https://docs.docker.com/engine/docker-overview/#docker-architecture
https://github.com/Haufe-Lexware/docker-style-guide/blob/master/DockerImage.md
Workshop Environment
• docker cloud account registration
• https://cloud.docker.com/
• docker training course and browser-based terminal labs
• https://training.play-with-docker.com/ops-s1-hello/
Workshop 1-1: Hello World
https://training.play-with-docker.com/ops-s1-hello/
Workshop 1-2: Image Created from a Container
• generate an image from a fresh ubuntu container
https://training.play-with-docker.com/ops-s1-images/
Workshop 1-3: Dockerfile
https://training.play-with-docker.com/ops-s1-images/
Workshop 1-4: Filesystem Mount
• mount a host OS file path onto a container file path
• paths shown on the slide: /root/mount_test on the host OS, /mnt in the container
Workshop 1-5: Network, Container to Host
• map a host OS port to a container port
• ports shown on the slide: :8080 on the host OS, :80 in the container
Workshop 1-6: Network, Inter-container
• containers es-docker and test are attached to the user-defined bridge network my-net
• https://docs.docker.com/network/bridge/#connect-a-container-to-a-user-defined-bridge
Workshop 1-7: Network, Run Container as Host
• run a container using the host network stack
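The mount and network options from Workshops 1-4 to 1-7 can be sketched as `docker run` command lines. The Python helper below only builds and prints the commands and runs nothing. The helper name `docker_run` and the image names `ubuntu` and `nginx` are assumptions for illustration; the paths, ports, and the network name my-net come from the slides. The my-net example further assumes the network was created beforehand with `docker network create my-net`.

```python
import shlex


def docker_run(image, *, volumes=(), ports=(), network=None, name=None, command=()):
    """Build the argv list for a `docker run` call (nothing is executed)."""
    argv = ["docker", "run", "--rm"]
    if name:
        argv += ["--name", name]
    for host_path, container_path in volumes:
        # -v HOST:CONTAINER bind-mounts a host path into the container
        argv += ["-v", f"{host_path}:{container_path}"]
    for host_port, container_port in ports:
        # -p HOST:CONTAINER publishes a container port on the host
        argv += ["-p", f"{host_port}:{container_port}"]
    if network:
        # "host" shares the host network stack; any other value names a network
        argv += ["--network", network]
    argv.append(image)
    argv += list(command)
    return argv


# Image names below are illustrative assumptions, not taken from the slides.
examples = {
    "1-4 filesystem mount": docker_run(
        "ubuntu", volumes=[("/root/mount_test", "/mnt")], command=["ls", "/mnt"]
    ),
    "1-5 port mapping": docker_run("nginx", ports=[(8080, 80)]),
    "1-6 user-defined bridge": docker_run("ubuntu", network="my-net", name="test"),
    "1-7 host network": docker_run("nginx", network="host"),
}

for label, argv in examples.items():
    print(f"{label}: {shlex.join(argv)}")
```

Building the argument list separately from execution keeps the sketch safe to run on a machine without Docker; passing the list to `subprocess.run` would execute it.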
Atgenomix SeqsLab
[Figure: Atgenomix SeqsLab cluster with a portal node, a master node, and two slave nodes]
Docker in Atgenomix SeqsLab
• We use docker to simplify cluster environment management
[Figure: portal, master, and two slave nodes, each labeled Atgenomix SeqsLab]
Docker with IT Automation
[Figure: three Windows hosts, each labeled Atgenomix SeqsLab]
Demo
•SeqsLab ansible deployment
•SeqsLab docker environment
Outline
•Software Containerization
•NGS genome analysis
•Distributed, RESTful search and analytics
NGS Genome Analysis
•NGS Bioinformatics introduction
•Workshop: containerized DNA-Seq Pipeline
•DNA-Seq Pipeline in Atgenomix SeqsLab
Bioinformatics
• Science of biological sequences started in the 1970s
• 1980s to 1990s: protein era
• 2000s: Human Genome Project
• 2007: Next Generation Sequencing
Margaret Dayhoff
http://www.whatisbiotechnology.org/index.php/people/summary/Dayhoff
Next Generation Sequencing
[Figure: cost per genome over time, annotated with "NGS Debuted" and "HiSeq X10 Debuted"]
https://www.genome.gov/images/content/costpergenome2015_4.jpg
• 2017 Illumina NovaSeq 6000: $100 per genome
• HiSeq X10
cost: $1K per genome
throughput: 50 WGS per day, ~5TB per day
• data deluge
• precision medicine
Sequencing by Synthesis Technology
https://www.youtube.com/watch?v=77r5p8IBwJk
https://nanohub.org/resources/17704/watch?resid=17818
[Figure: cell, DNA, DNA fragment, sequencing template (300-500bp)]
Sequencing Type
•Whole Genome Sequencing: 3 billion bp, ~100GB per sample (30x)
•Whole Exome Sequencing: exon regions (1.1% of genome), ~10GB per sample (200x)
•Target Panel Sequencing: full gene regions of selected genes
Coverage
•How many times the genome is covered by reads
C = N*L / G
•C: coverage
•N: number of reads
•L: read length
•G: genome length
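As a worked instance of C = N*L / G, the sketch below computes coverage and, inverting the formula, the number of reads needed for a target coverage. The 150 bp read length is an assumed example value; the 3 billion bp genome length and the 30x target are the figures used elsewhere in the slides.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """C = N * L / G: average number of times each base is covered."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Invert the formula: N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


GENOME_LENGTH = 3_000_000_000  # human genome, ~3 billion bp (from the slides)
READ_LENGTH = 150              # assumed read length, for illustration only

n = reads_needed(30, READ_LENGTH, GENOME_LENGTH)
print(f"reads needed for 30x: {n:,}")
print(f"coverage with {n:,} reads: {coverage(n, READ_LENGTH, GENOME_LENGTH):.1f}x")
```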
Sequencing Error
Resequencing Approach
https://www.broadinstitute.org/gatk/img/cartoon-blackbox-workflow-web-blackblue.png
[Figure: workflow split into wet lab and dry lab stages]
Fastq
Fastq format (four lines per read):
@SEQ_ID
Sequenced DNA bases
+
Sequencing quality
• size: 100GB gz (WGS, 30x)
• read counts: billions
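The four-line record layout can be read with a few lines of Python. This is a minimal sketch: it assumes each record occupies exactly four lines (no wrapped sequences) and that quality characters use the Phred+33 encoding, which the slides do not state. The function name `read_fastq` and the sample read are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (seq_id, bases, phred_scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        yield header[1:], bases, [ord(ch) - 33 for ch in quality]


sample = io.StringIO("@SEQ_ID\nGATTACA\n+\nIIIII#I\n")
for seq_id, bases, scores in read_fastq(sample):
    print(seq_id, bases, scores)
```

For the gzipped files mentioned on the slide, the same generator works on a handle opened with `gzip.open(path, "rt")`.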
Genome Puzzle
https://www.flickr.com/photos/doctorow/1443403594/in/photostream/
https://www.libertypuzzles.com/wooden-jigsaw-puzzles/starry-night-large-piece
Resequencing Approach
[Figure: sequenced reads are aligned to the reference genome; the sequenced genome is then compared with the reference genome]
• fastq → read mapping → bam → variant calling → vcf
Reads to Genome Mapping
• suffix array / FM-index data structure
SAM/BAM
Variant Type
•Single Nucleotide Variant (SNV)
•short insertion or deletion (INDEL): <50bp
•structural variant (SV): >50bp
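The size-based categories above can be expressed as a small classifier over REF and ALT allele strings. This is a sketch under stated assumptions: the slide leaves a length difference of exactly 50 bp unassigned, so the code treats 50 bp and above as SV; symbolic alleles such as `<DEL>` are not handled; and the "OTHER" label for equal-length multi-base substitutions is an addition not found in the slides.

```python
def classify_variant(ref, alt, sv_threshold=50):
    """Classify a variant by allele lengths: SNV, INDEL, SV, or OTHER."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(ref) - len(alt))
    if size == 0:
        return "OTHER"  # e.g. multi-nucleotide substitution, not on the slide
    # Assumption: a length difference of exactly sv_threshold counts as SV
    return "INDEL" if size < sv_threshold else "SV"


print(classify_variant("A", "G"))             # single-base substitution
print(classify_variant("A", "ATG"))           # 2 bp insertion
print(classify_variant("A" + "T" * 60, "A"))  # 60 bp deletion
```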
Variant Type
http://stream.dcasf.com/webinar/jonas-korlach-pacbio-applications-updates-future-roadmap-ashg-2017/
Single Nucleotide Variant
http://www.clcsupport.com/clcgenomicsworkbench/754/SNP-example.png
Insertion / Deletion
Structural Variant
•short-read challenge
•repeat challenge
http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html
Variant Call Format
• header (##)
• chromosome (CHROM), position (POS), rsID (ID)
• ref allele (REF) / alt allele (ALT)
• quality (QUAL) / filter (FILTER)
• info (INFO)
• sample info (FORMAT plus one column per sample)
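To make the field layout concrete, the sketch below splits one tab-separated VCF data line into the columns listed above. It is a simplified illustration, not a full parser: header lines (`##` and `#CHROM`) are assumed to have been skipped already, INFO values are kept as strings, and the function names and the sample line are made up for the example.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # keys without '=' are flags
    return result


def parse_vcf_line(line):
    """Parse one VCF data line (not a header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alt alleles are allowed
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["INFO"] = parse_info(record["INFO"])
    record["FORMAT"] = fields[8] if len(fields) > 8 else None
    record["SAMPLES"] = fields[9:]  # one column per sample
    return record


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ\t0|0:48\n"
print(parse_vcf_line(line))
```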
Workshop 2: FASTQ to VCF Pipeline
1. SNP-indel pipeline: bwa-samtools
2. structural variant pipeline: bwa-delly2
https://training.play-with-docker.com/ops-s1-hello/
Workshop 2-1: bwa-samtools Pipeline
read mapping
• bwa index: ref.fa → ref.fa.amb, ref.fa.ann, ref.fa.bwt, ref.fa.pac, ref.fa.sa
• bwa mem: ref.fa, in.fq → t.sam
bam process
• samtools sort: t.sam → t.bam
• samtools index: t.bam → t.bai
variant calling
• samtools mpileup: t.bam → r.bcf
• bcftools call: r.bcf → snp.vcf.gz
1. https://github.com/obigbando/demo_case
2. https://github.com/lh3/bwa-docker
3. https://hub.docker.com/r/obigbando/samtools/
4. https://hub.docker.com/r/obigbando/bcftools/
5. OS filesystem mount
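The pipeline stages above can be written out as an ordered list of commands. The sketch below only assembles and prints them. The flags are assumptions based on common usage, not taken from the workshop material: `bwa mem -o` needs a reasonably recent bwa, and `samtools mpileup -g` (BCF output) exists only in older samtools 1.x releases, with `bcftools mpileup` taking over that role in newer ones. Note that `samtools index` names its output t.bam.bai by default, while the slide shows t.bai.

```python
import shlex


def bwa_samtools_pipeline(ref="ref.fa", fastq="in.fq", prefix="t"):
    """Return the SNP-indel pipeline as a list of argv lists (nothing runs)."""
    sam, bam = f"{prefix}.sam", f"{prefix}.bam"
    return [
        # read mapping
        ["bwa", "index", ref],
        ["bwa", "mem", "-o", sam, ref, fastq],
        # bam process
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # variant calling (flags assume an older samtools 1.x, see lead-in)
        ["samtools", "mpileup", "-g", "-f", ref, "-o", "r.bcf", bam],
        ["bcftools", "call", "-m", "-v", "-O", "z", "-o", "snp.vcf.gz", "r.bcf"],
    ]


for step, argv in enumerate(bwa_samtools_pipeline(), start=1):
    print(f"{step}. {shlex.join(argv)}")
```

In the workshop setting each command would run inside its own container with the data directory mounted from the host, which is why item 5 of the slide's list points back to the filesystem mount exercise.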
Workshop 2-2: bwa-delly Pipeline
• Read Mapping and bam process: produce t.bam and t.bai
• samtools faidx: ref.fa → ref.fa.fai
variant calling
• delly2 call: t.bam, t.bai, ref.fa.fai → r.bcf
• bcftools view: r.bcf → sv.vcf.gz
1. https://github.com/dellytools/delly/tree/master/docker
2. https://hub.docker.com/r/obigbando/samtools/
3. https://hub.docker.com/r/obigbando/bcftools/
Atgenomix SeqsLab
• Bioinformatics on Spark / Hadoop
• Data parallelization
Hadoop / Spark
https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/ch04.html
http://vijaytraining.com/product/big-data/
• Hadoop Distributed File System (HDFS)
• Yet Another Resource Negotiator (YARN)
• Spark
Atgenomix SeqsLab
• Genome reference: HG19 (primary assembly), HG38 (primary assembly+alt+decoy+HLA)
• Filtering with population databases: 1000G indels, dbSNP, COSMIC
• Short-read mapping: BWA-MEM (alt-aware)
• Germline SNP/INDEL: GATK3/GATK4 HaplotypeCaller
• Somatic SNP/INDEL: MuTect2, VarScan2
• Germline / Somatic SV: Delly2 (DEL, DUP, INS, INV, TRA)
• Somatic CNV: GATK4 CNV
• Joint Genotyping: GATK4 GenotypeGVCFs
• Haplotype phasing: WhatsHap
Data Parallelization: FastQ to Bam
Data Parallelization: Bam to VCF
• The human genome is naturally separated by chromosome, so we can partition DNA data by chromosome without any information loss.
• Several partitioning strategies yield more partitions for better parallelization:
- Centromere
- Long ambiguous regions
- Any customized regions
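The partitioning idea can be sketched as a key function: every aligned read is assigned to a (chromosome, segment) partition, where segments come from optional split positions within a chromosome. This is an illustration only, not SeqsLab's implementation. The function names are hypothetical, and the chr1 split position is a rough placeholder standing in for a centromere or customized region boundary.

```python
import bisect
from collections import defaultdict


def partition_key(chrom, pos, breakpoints):
    """Map a position to (chromosome, segment index).

    breakpoints: {chromosome: sorted list of split positions}; a chromosome
    without an entry forms a single partition.
    """
    cuts = breakpoints.get(chrom, [])
    return (chrom, bisect.bisect_right(cuts, pos))


def partition_reads(reads, breakpoints):
    """Group (chrom, pos, read_name) tuples by partition key."""
    partitions = defaultdict(list)
    for chrom, pos, name in reads:
        partitions[partition_key(chrom, pos, breakpoints)].append(name)
    return dict(partitions)


# Placeholder split position, for illustration only
breakpoints = {"chr1": [125_000_000]}
reads = [
    ("chr1", 1_000, "read_a"),
    ("chr1", 200_000_000, "read_b"),
    ("chr2", 5_000, "read_c"),
]
for key, names in sorted(partition_reads(reads, breakpoints).items()):
    print(key, names)
```

Each partition can then be handed to an independent variant-calling task, which is where the parallel speed-up comes from.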
SeqsLab Demo
Outline
•Software Containerization
•NGS genome analysis
•Distributed, RESTful search and analytics
Distributed, RESTful Search and Analytics
•Elasticsearch Introduction
•Workshop: from variant data to search
•Elasticsearch in Atgenomix SeqsLab
https://www.slideshare.net/clintongormley/down-and-dirty-with-elasticsearch?qid=d5dd27a4-2c66-43b9-9ae4-6e9668ca8978&v=&b=&from_search=12
Elasticsearch
• NoSQL with full-text search
• Lucene Search Engine
• Distributed Architecture
• RESTful API
NoSQL
• NoSQL category:
• key-value
• document-oriented
• terms comparison to RDB
• example
RDB        Elasticsearch
database   index
table      type
row        document
column     field
schema     mapping
SQL        Query DSL
Lucene Engine
https://www.slideshare.net/otisg/lucene-introduction
• all fields indexed
Lucene Engine
Input text: The quick brown fox jumped over the lazy dogs
• Whitespace analysis: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• Simple analysis: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
• Stop analysis: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
• Standard analysis: [quick] [brown] [fox] [jump] [over] [lazy] [dog]
https://www.slideshare.net/otisg/lucene-introduction
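The first three analysis styles are simple enough to imitate in a few lines of Python. This is a rough imitation for intuition only, not Lucene code: letters are approximated as ASCII letters, and the stop-word set is a tiny assumed sample rather than Lucene's actual list.

```python
import re

# Tiny illustrative stop-word set (assumption, not Lucene's list)
STOP_WORDS = {"a", "an", "and", "of", "the"}


def whitespace_analyze(text):
    """Split on whitespace only; case is preserved."""
    return text.split()


def simple_analyze(text):
    """Split on non-letters and lowercase every token."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def stop_analyze(text):
    """Simple analysis followed by stop-word removal."""
    return [token for token in simple_analyze(text) if token not in STOP_WORDS]


sentence = "The quick brown fox jumped over the lazy dogs"
for analyze in (whitespace_analyze, simple_analyze, stop_analyze):
    print(f"{analyze.__name__}: {analyze(sentence)}")
```

The tokens produced at index time are what the inverted index stores, so the same analysis has to be applied to query text for a match to occur.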
Elasticsearch Glossary
• node: an Elasticsearch instance (server node)
• cluster: a group of nodes
• index: the Elasticsearch counterpart of an RDB database
• shard: a component of an index, used for horizontal scaling and operation parallelization
• primary / replica shard
https://www.slideshare.net/dadoonet/elasticsearch-devoxx-france-2012-english-version?qid=d5dd27a4-2c66-43b9-9ae4-6e9668ca8978&v=&b=&from_search=22
Distributed Architecture
[Figure: node0 holds shard0 of the index tour_sites, which contains the two documents below]
{
  "_index": "tour_sites",
  "_type": "Taipei",
  "_id": "0000000000",
  "_source": {
    "title": "National Palace Museum"
  }
}
{
  "_index": "tour_sites",
  "_type": "Taipei",
  "_id": "0000000001",
  "_source": {
    "title": "Taipei 101"
  }
}
https://www.slideshare.net/ssuser6bb62e/elasticsearch-97018162?qid=09f88107-4bf9-4caa-be51-e2d9286819e7&v=&b=&from_search=11
Distributed Architecture
[Figure: the index tour_sites, with the same two documents, split into primary shards (P0, P1, P2) and replica shards (R0, R1) distributed across node1, node2, and node3]
https://www.slideshare.net/ssuser6bb62e/elasticsearch-97018162?qid=09f88107-4bf9-4caa-be51-e2d9286819e7&v=&b=&from_search=11
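How a document ends up on a particular shard can be sketched as a hash of its routing value modulo the number of primary shards. The code below is a simplified illustration: Elasticsearch uses a Murmur3 hash and, in recent versions, an extra routing factor, whereas `zlib.crc32` here is only a stand-in. The shard count of 3 follows the P0 to P2 labels on the slide, and the function name is hypothetical.

```python
import zlib


def route_to_shard(doc_id, num_primary_shards):
    """Pick a primary shard for a document from a hash of its id.

    Stand-in hash: zlib.crc32 (Elasticsearch itself uses Murmur3).
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


NUM_PRIMARY_SHARDS = 3  # P0, P1, P2 as on the slide
for doc_id in ("0000000000", "0000000001"):
    shard = route_to_shard(doc_id, NUM_PRIMARY_SHARDS)
    print(f"document {doc_id} -> primary shard P{shard}")
```

Because the shard number depends on the shard count, the number of primary shards is fixed when an index is created, while replicas can be added later.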
RESTful API
• Extensive REST API for everything about Elasticsearch
• cluster API
• cat API
• document API
• query DSL
• Kibana UI
Cluster API
• get cluster-wide information
• sample command
Document API
DML      ES endpoint        description
create   _bulk              Elasticsearch document import
read     _search            Elasticsearch document retrieval
update   _update_by_query   Elasticsearch data update
delete   _delete_by_query   Elasticsearch data deletion
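For the create row, the `_bulk` endpoint expects newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The sketch below builds such a body for the two tour_sites documents shown earlier. The helper name is illustrative, and the `_type` field from the slides is left out of the action line because recent Elasticsearch versions no longer accept mapping types.

```python
import json


def bulk_index_body(index, docs):
    """Build an NDJSON body for the _bulk endpoint.

    docs: iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = bulk_index_body(
    "tour_sites",
    [
        ("0000000000", {"title": "National Palace Museum"}),
        ("0000000001", {"title": "Taipei 101"}),
    ],
)
print(body, end="")
```

The body would be sent with a POST to `_bulk` using the `application/x-ndjson` content type.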
CREATE
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
• index API
• bulk API
READ
• index API
• search API
UPDATE
• _update API: partial document merge and update
• index API: overwrite existing document
DELETE
• index API
• _delete_by_query API
Query DSL
• Elasticsearch supports a wide range of query types
• compound query clauses
• term query for structured data
• curl with shell:
curl -XPOST --header "content-type: application/json" \
  es-docker:9200/demo_table/variant/_search \
  -d '{ "query": { "bool": { "filter": { "bool": { "must": [], "must_not": [] } } } } }'
Compound Query
• term query
• range query
• terms query => OR condition
• term query and range query
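The four query shapes listed above can be assembled as plain dictionaries and serialized to the JSON body that the curl example sends to `_search`. The field names `chrom`, `qual`, and `svtype` are hypothetical, since the mapping of demo_table is not shown in the slides; the helper names are illustrative too.

```python
import json


def term(field, value):
    """Exact match on a single value."""
    return {"term": {field: value}}


def terms(field, values):
    """Match any of several values (an OR condition)."""
    return {"terms": {field: list(values)}}


def range_(field, **bounds):
    """Range match, e.g. range_('qual', gte=30)."""
    return {"range": {field: bounds}}


def bool_filter(must=(), must_not=()):
    """Wrap clauses in the same bool/filter shape as the curl example."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "bool": {"must": list(must), "must_not": list(must_not)}
                }
            }
        }
    }


# Hypothetical field names: chrom, qual, svtype
body = bool_filter(
    must=[term("chrom", "chrX"), range_("qual", gte=30)],
    must_not=[terms("svtype", ["INV", "TRA"])],
)
print(json.dumps(body, indent=2))
```

Clauses placed under `filter` are matched without relevance scoring, which suits structured data such as variant records.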
Aggregation
• Elasticsearch supports a wide range of aggregation types
Query x Aggregation
• aggregate over the entries returned by a query
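Combining a query with an aggregation means putting both in one search body: the aggregation then runs only over the documents the query matches. A minimal sketch follows, using the `terms` and `histogram` aggregation types. The field names `chrom`, `svtype`, and `qual` and the helper name are hypothetical.

```python
import json


def search_with_aggs(query, aggs):
    """Build a search body that aggregates over the query's matches.

    size=0 suppresses the matching documents so that only the
    aggregation results are returned.
    """
    return {"size": 0, "query": query, "aggs": aggs}


# Hypothetical field names: chrom, svtype, qual
body = search_with_aggs(
    query={"term": {"chrom": "chr1"}},
    aggs={
        "per_sv_type": {"terms": {"field": "svtype"}},
        "qual_histogram": {"histogram": {"field": "qual", "interval": 10}},
    },
)
print(json.dumps(body, indent=2))
```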
Workshop 3: VCF2ES
• containers es-kibana and vcf2es are attached to the user-defined bridge network my-net
• https://hub.docker.com/r/obigbando/es-kibana
• https://hub.docker.com/r/obigbando/vcf2es/
• https://docs.docker.com/network/bridge/#connect-a-container-to-a-user-defined-bridge
Workshop 3: VCF2ES
• Answer the following questions:
• how many variants are in the demo VCF
• how many transition / transversion variants
• how many deletion variants in chrX
• how many variants in the chromosome 22 p-arm
• quality distribution (qual) of all variants
• how many INS, DEL, INV, DUP, TRA in chromosome 1
https://www.biorxiv.org/content/early/2017/12/27/239962
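For the transition / transversion question, the classification rule itself is short: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any other single-base substitution is a transversion. The sketch below applies it to a made-up list of REF/ALT pairs; in the workshop the same rule would be expressed as an Elasticsearch query over the indexed VCF fields.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def snv_class(ref, alt):
    """Return 'transition', 'transversion', or None if not a simple SNV."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        return None  # indels, symbolic alleles, or no change
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Made-up REF/ALT pairs, for illustration only
variants = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("AT", "A")]
counts = Counter(snv_class(ref, alt) for ref, alt in variants)
print(counts["transition"], "transitions,", counts["transversion"], "transversions")
```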
SeqsLab Demo
Discussion
