Detecting Somatic Mutations in Impure Cancer Sample - Ensemble
Approach
CB Hong*, KJ Kim
4-5 February 2015
Contents
1 TCGA Benchmark 4 Data Set
1.1 Downloading TCGA data with GeneTorrent
1.2 Preparing the Sample Data Set
1.3 Checking the extracted regions
1.4 Checking the practice data
1.5 Summary
2 Somatic Mutation Prediction
2.1 Running SomaticSniper and applying the preliminary filter (16 min 4 s)
2.2 Running VarScan2 and applying the preliminary filter (10 min)
2.3 Running MuTect and applying the preliminary filter (18 min)
2.4 Summary
3 Obtaining Full Consensus / Partial Consensus sSNVs
3.1 Extracting only bi-allelic SNPs
3.2 Obtaining the Full Consensus / Partial Consensus
3.3 Counting Full Consensus / Partial Consensus calls
3.4 Summary
4 Applying Additional Filters
4.1 Calling normal and tumor variants with UnifiedGenotyper (8 min)
4.2 Filtering SNVs - full consensus (optional)
4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)
4.4 Counting Full Consensus / Partial Consensus calls after the GATK filter
4.5 Summary
5 Validation
5.1 Preparing the COSMIC and CCLE data
5.2 Performing validation - consensus / partial consensus
5.3 Summary
6 Other Somatic Mutation Callers - Strelka, Virmid
6.1 Strelka (1 min 38 s)
6.2 Virmid (33 min)
* KT GenomeCloud, hongiiv@gmail.com
7 Linux for Bio Research
7.1 Practice Linux server
7.2 Connecting to the practice Linux server - Windows users
7.3 Connecting to the practice Linux server - Mac or Linux users
7.4 Finding out Linux system information
7.5 The Linux file system
7.6 Adding a hard disk in Linux
7.7 File-related commands
7.8 Linux network information
7.9 Understanding compression in Linux
7.10 Installing software on Linux
7.10.1 Installing software with APT
7.10.2 Installing software from source code by compiling
1 TCGA Benchmark 4 Data Set
This hands-on session uses the TCGA mutation calling benchmark 4 datasets to show how somatic mutations are detected. The genome sequencing benchmark dataset was generated by artificially mixing a normal sample into a tumor sample at fixed proportions (5%-95%). Of these, we use n40t60 (mixed with 60% of the tumor and 40% of the normal) and the corresponding normal sample. The data can be downloaded in BAM format from the TCGA Benchmark web page.
1.1 Downloading TCGA data with GeneTorrent
• Install the download software, download the key/UUID files, then download the sample.
• Example: download the public key for the TCGA Benchmark data set.
• https://cghub.ucsc.edu/datasets/benchmark_download.html
$ cd
$ wget https://cghub.ucsc.edu/software/downloads/cghub_public.key
• The UUID (universally unique identifier) file contains the download information for a specific file.
• TCGA Benchmark cell line: HCC1143 tumor 50x
$ curl "https://cghub.ucsc.edu/cghub/metadata/analysisAttributes?analysis_id=ad3d4757-f358-40a3-9d92-742463a95e88" \
    -o uuid.txt
$ more uuid.txt
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<center_name>UCSC</center_name>
<study>TCGA_MUT_BENCHMARK_4</study>
<files>
<file>
<filename>G15511.HCC1143.1.bam</filename>
<filesize>255795959440</filesize>
</file>
• Download the data with gtdownload.
$ cd
$ gtdownload -c cghub_public.key -vv -d uuid.txt
1.2 Preparing the Sample Data Set
• Extract part of a BAM file, then sort and index it.
Extraction by chromosome (-b: output in BAM format)
$ cd
$ samtools view -b in.bam 1 > chr1.bam
$ samtools sort chr1.bam chr1_sorted
$ samtools index chr1_sorted.bam
• Extract specific regions (using a BED file).
$ cd
$ cat chr17.bed
17:5967-6207
17:11197-11389
17:11806-12018
17:13897-14017
17:22307-22427
17:30843-30963
17:31151-31279
17:63618-63738
17:65398-65638
17:69410-69530
17:96838-97108
17:131511-131661
17:169155-169395
17:170984-171254
17:177205-177355
17:260100-260308
17:262897-263257
17:263317-263947
$ cat chr17.bed | xargs samtools view -b in.bam \
    > exome.bam
$ samtools sort exome.bam exome_sorted
$ samtools index exome_sorted.bam
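The chr17.bed file above holds one samtools region string per line (chr:start-end) rather than tab-separated BED columns. If the targets are only available as a standard three-column BED file, they could be converted first. The following is a minimal sketch, assuming 0-based half-open BED coordinates and 1-based inclusive region strings; the function name bed_to_regions and the sample intervals are illustrative and not part of the original workflow.

```shell
# bed_to_regions: read 3-column BED (chrom, 0-based start, end) on stdin and
# print samtools-style region strings (chrom:start-end, 1-based inclusive).
# Header-like lines (#, track, browser) are skipped.
bed_to_regions() {
  awk 'BEGIN { FS = "\t" }
       !/^(#|track|browser)/ && NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }'
}

# Demo with two made-up intervals.
printf '17\t5966\t6207\n17\t11196\t11389\n' | bed_to_regions
```

The converted list could then be passed to samtools in the same way as above, for example `bed_to_regions < targets.bed | xargs samtools view -b in.bam > exome.bam`, where targets.bed is a hypothetical file name.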
1.3 Checking the extracted regions
• Output the position of each read in BED format. The result can simply be added as a custom track in the UCSC Genome Browser to inspect the aligned reads.
$ cd
$ bamToBed -i exome_sorted.bam > cov_1.bed
• Output the coverage of the BAM file as a BED file; the read-depth information can be used to draw a histogram.
$ cd
$ samtools view -b exome_sorted.bam | \
    genomeCoverageBed -ibam stdin > cov_2.bed
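As a quick sanity check, the coverage histogram can be reduced to a single mean depth. This is a rough sketch, assuming the default histogram output of genomeCoverageBed (columns: chromosome, depth, number of bases at that depth, chromosome size, fraction) with genome-wide rows labelled "genome"; the function name mean_depth and the demo numbers are made up.

```shell
# mean_depth: weighted mean of depth over the genome-wide rows of a
# genomeCoverageBed histogram file (assumed 5 tab-separated columns).
mean_depth() {
  awk 'BEGIN { FS = "\t" }
       $1 == "genome" { sum += $2 * $3; size = $4 }
       END { if (size > 0) printf "%.2f\n", sum / size }' "$1"
}

# Demo on a tiny made-up histogram (not real data).
tmp=$(mktemp)
printf 'genome\t0\t50\t100\t0.5\ngenome\t1\t30\t100\t0.3\ngenome\t2\t20\t100\t0.2\n' > "$tmp"
mean_depth "$tmp"
rm -f "$tmp"
```

On the real data this would be called as `mean_depth cov_2.bed`.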
1.4 Checking the practice data
• Directory listing of the sample data, programs, and related data
$ cd /somatic_bench
$ pwd
/somatic_bench
$ ls -al
total 176
drwxr-xr-x 7 root root 4096 Jan 21 15:25 .
drwxr-xr-x 25 root root 4096 Jan 20 08:53 ..
drwxr-xr-x 9 root root 4096 Jan 21 08:15 app
drwxr-xr-x 2 root root 4096 Jan 21 14:38 bam
drwxr-xr-x 2 root root 4096 Jan 19 11:43 reference
drwxr-xr-x 2 root root 4096 Jan 21 15:24 script
drwxr-xr-x 2 root root 151552 Jan 21 12:59 tmp
$ more /somatic_bench/script/somatic_call_bench.sh
input_bam1="/somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam"
input_bam2="/somatic_bench/bam/hcc1143.ccle.b.sorted.bam"
gatk_b37="/somatic_bench/reference/human_g1k_v37_decoy.fasta"
temp_dir="/somatic_bench/tmp/"
$ cd
$ ln -s /somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam tumor.bam
$ ln -s /somatic_bench/bam/hcc1143.ccle.b.sorted.bam normal.bam
1.5 Summary
• Programs used: wget, curl, gtdownload, samtools, bedtools (bamToBed, genomeCoverageBed)
• Outputs: a .bam containing only the regions of interest, and a .bed showing the coverage of that .bam
2 Somatic Mutation Prediction
SomaticSniper, VarScan2 and MuTect are used to find somatic mutations in the sample data set (the tumor BAM and its matched normal BAM).
• Analysis commands: https://gist.github.com/hongiiv/06611f189f4c8158edb0
• SAMtools: v0.1.19
• GATK: v2.8.1
• MuTect: v1.1.4
• SomaticSniper: v1.0.4
• Strelka: v1.0.14
• Virmid: v1.1.1
2.1 Running SomaticSniper and applying the preliminary filter (16 min 4 s)
SomaticSniper was developed in 2011 by Li Ding of Washington University, who also built VarScan2. It uses Bayesian probability with posterior filtering. Its main feature is high computational efficiency.
• -J: joint genotyping mode with default prior probability of a somatic mutation (0.01)
• -n, -t: normal/tumor sample id (for VCF header)
• -F: output format (classic, vcf, bed)
• -f: path to the ref.fasta file
$ cd
$ bam-somaticsniper \
    -J \
    -F vcf \
    -n HCC1143_Normal \
    -t HCC1143_Tumor \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    tumor.bam normal.bam \
    HCC1143_somaticsniper.vcf
• (Filter options) Reads with a mapping quality of 0 were filtered out prior to somatic mutation identification. Predictions with a 'somatic score' of 40 or greater were considered for subsequent downstream validation and analysis steps.
• GATK SelectVariants can be used to extract only the variants of interest.
• The filter uses the SSC (somatic score) and MQ (mapping quality) values in the FORMAT field of the VCF file.
$ cd
$ ln -s /somatic_bench/app/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper.vcf \
    -o HCC1143_somaticsniper_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    -select 'vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("SSC") >= 40 && (vc.getGenotype("HCC1143_Tumor").getExtendedAttribute("MQ") > 0 || vc.getGenotype("HCC1143_Normal").getExtendedAttribute("MQ") > 0)'
• Compare the number of mutations before and after filtering.
$ cd
$ grep -v "#" HCC1143_somaticsniper.vcf | wc -l
583
$ grep -v "#" HCC1143_somaticsniper_filter.vcf | wc -l
161
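The somatic score threshold can also be cross-checked without GATK. The following is a rough sketch, assuming that SSC is stored per sample in the FORMAT column as described above; the function name ssc_filter and the two demo records are made up, and the MQ condition of the SelectVariants expression is not reproduced here.

```shell
# ssc_filter MIN SAMPLE: read a VCF on stdin, keep all header lines and only
# those records whose SSC value for column SAMPLE is >= MIN.
ssc_filter() {
  awk -v min="$1" -v sample="$2" 'BEGIN { FS = OFS = "\t" }
    /^##/     { print; next }
    /^#CHROM/ { for (i = 10; i <= NF; i++) if ($i == sample) col = i; print; next }
    col {
      # Locate SSC in the FORMAT keys (column 9) and read the matching value.
      n = split($9, keys, ":"); split($col, vals, ":")
      for (k = 1; k <= n; k++)
        if (keys[k] == "SSC" && vals[k] + 0 >= min) { print; break }
    }'
}

# Demo on two made-up records: only the first passes SSC >= 40.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHCC1143_Normal\tHCC1143_Tumor\n17\t100\t.\tA\tG\t.\t.\t.\tGT:SSC\t0/0:.\t0/1:55\n17\t200\t.\tC\tT\t.\t.\t.\tGT:SSC\t0/0:.\t0/1:12\n' |
  ssc_filter 40 HCC1143_Tumor
```

Because the MQ condition is left out, the count from this sketch is not guaranteed to equal the 161 records reported above.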
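2.2 Running VarScan2 and applying the preliminary filter (10 min)
VarScan2 was developed by Li Ding of Washington University in 2012, a year after SomaticSniper. Unlike the other tools, it uses Fisher's exact test together with filtering and FDR correction. Its main feature is sensitive detection of high-quality sSNVs. Also unlike the other tools, it takes pileup or mpileup files as input rather than .bam files.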
• Use samtools mpileup to generate pileup/mpileup-format files for the normal and the tumor sample.
• At the mpileup stage, filtering is done with the options -q 1 (skip alignments with mapQ smaller than INT) and -B (disable BAQ computation).
• When VarScan is given mpileup-format [1] input, pass the '--mpileup 1' option.
$ cd
$ samtools mpileup \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -q 1 -B normal.bam > HCC1143_n.pileup
$ samtools mpileup \
    -f /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -q 1 -B tumor.bam > HCC1143_t.pileup
$ ln -s /somatic_bench/app/VarScan/VarScan.v2.3.3.jar ./
$ java -jar VarScan.v2.3.3.jar \
    somatic HCC1143_n.pileup HCC1143_t.pileup \
    HCC1143_varscan \
    --output-vcf 1
14617150 positions in tumor
14616970 positions shared in normal
13721478 had sufficient coverage for comparison
13700958 were called Reference
0 were mixed SNP-indel calls and filtered
18427 were called Germline
1562 were called LOH
450 were called Somatic
81 were called Unknown
0 were called Variant
[1] Existing documentation describes samtools pileup as the default input, but pileup was removed in later samtools releases and replaced by mpileup. Running mpileup on a single sample still yields pileup-style output. VarScan also supports an mpileup file that contains both the normal and the tumor sample.
• As shown below, VarScan2 writes its INDEL and SNP results in VCF format (HCC1143_varscan.indel.vcf, HCC1143_varscan.snp.vcf).
drwxr-xr-x 2 root root 4096 Jan 30 09:52 ./
drwxr-xr-x 5 root root 8192 Jan 30 09:35 ../
-rw-r--r-- 1 root root 402354 Jan 30 09:47 HCC1143_varscan.indel.vcf
-rw-r--r-- 1 root root 2691462 Jan 30 09:47 HCC1143_varscan.snp.vcf
• Of the VarScan2 results, the HCC1143_varscan.snp.vcf file is filtered using processSomatic and somaticFilter.
• processSomatic: separates high-confidence [2] and low-confidence somatic mutations.
• somaticFilter: lets you apply filter options of your choice, such as --min-coverage, --p-value and --indel-file.
$ cd
$ java -jar VarScan.v2.3.3.jar processSomatic -help
USAGE: java -jar VarScan.jar process [status-file] OPTIONS
status-file - The VarScan output file for SNPs or Indels
OPTIONS
--min-tumor-freq - Minimum variant allele frequency in tumor [0.10]
--max-normal-freq - Maximum variant allele frequency in normal [0.05]
--p-value - P-value for high-confidence calling [0.07]
$ java -jar VarScan.v2.3.3.jar processSomatic HCC1143_varscan.snp.vcf
Reading input from HCC1143_varscan.snp.vcf
Opening output files:
17914 VarScan calls processed
382 were Somatic (102 high confidence)
16048 were Germline (15431 high confidence)
1451 were LOH (1447 high confidence)
• processSomatic produces result files listing the high-confidence and low-confidence calls for Germline, LOH and Somatic.
$ ls
-rw-r--r-- 1 2413169 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline
-rw-r--r-- 1 2320566 Jan 30 09:52 HCC1143_varscan.snp.vcf.Germline.hc
-rw-r--r-- 1 216574 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH
-rw-r--r-- 1 215997 Jan 30 09:52 HCC1143_varscan.snp.vcf.LOH.hc
-rw-r--r-- 1 59990 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic
-rw-r--r-- 1 17055 Jan 30 09:52 HCC1143_varscan.snp.vcf.Somatic.hc
[2] Minimum variant allele frequency of 0.10 in the tumor and maximum variant allele frequency of 0.05 in the normal.
• In the VarScan2 output VCF, multiple ALT alleles are written as 'G/T', which causes errors in later analysis steps. They are therefore changed to the 'G,T' form.
$ cd
$ perl -pe 's/\tA\//\tA,/' HCC1143_varscan.snp.vcf.Somatic.hc | \
    perl -pe 's/\tT\//\tT,/' | \
    perl -pe 's/\tG\//\tG,/' | \
    perl -pe 's/\tC\//\tC,/' > HCC1143_varscan_filter.vcf
• Number of mutations after filtering
$ cd
$ grep -v "#" HCC1143_varscan_filter.vcf | wc -l
102
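The four perl substitutions rewrite a slash that follows a single-base allele at the start of a tab-separated field. An alternative that touches only the ALT column is sketched below, assuming ALT is the fifth tab-separated column as in standard VCF; the function name fix_alt and the demo record are illustrative.

```shell
# fix_alt: read a VCF on stdin, copy header lines unchanged, and replace
# every "/" in the ALT column (column 5) of data lines with ",".
fix_alt() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       { gsub("/", ",", $5); print }'
}

# Demo on one made-up record; the slash in the INFO column is left alone.
printf '17\t100\t.\tA\tG/T\t.\tPASS\tX=1/2\n' | fix_alt
```

With this approach the ALT rewrite cannot accidentally modify other columns that happen to contain a tab followed by a base and a slash.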
2.3 Running MuTect and applying the preliminary filter (18 min)
MuTect is a tool developed at the Broad Institute that applies Bayesian probability with pre- and post-filtering; in particular, it provides sensitive detection of sSNVs at low allelic fractions.
• MuTect runs only on Java 1.6, so check the current Java version and, if necessary, switch versions with update-alternatives.
$ cd
$ ln -s /somatic_bench/app/mutect/muTect-1.1.4.jar ./
$ samtools index normal.bam
$ samtools index tumor.bam
$ cp /somatic_bench/reference/ccle.gatk.bed ./
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 1
update-alternatives: using /usr/lib/jvm/java-6-oracle/jre/bin/java
$ java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
$ java -jar muTect-1.1.4.jar --analysis_type MuTect \
    --reference_sequence /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --cosmic /somatic_bench/reference/b37_cosmic_v54_120711.vcf \
    --dbsnp /somatic_bench/reference/dbsnp_132_b37.leftAligned.vcf \
    --input_file:normal normal.bam \
    --input_file:tumor tumor.bam \
    --out HCC1143_mutect.out \
    --vcf HCC1143_mutect.vcf \
    --coverage_file HCC1143.mutect.cov.wig.txt \
    --normal_sample_name HCC1143_Normal \
    --tumor_sample_name HCC1143_Tumor \
    -L ccle.gatk.bed
• (Filter options) Predictions not labeled as 'REJECT' were accepted as confident somatic mutation predictions and used in subsequent downstream validation and analysis steps.
• The GATK used for filtering requires Java 1.7, so switch the Java version with update-alternatives.
• Use GATK SelectVariants to keep only the variants whose VCF FILTER field is PASS (that is, excluding REJECT).
$ cd
$ update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority
------------------------------------------------------------
0 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
1 /usr/lib/jvm/java-6-oracle/jre/bin/java 1
* 2 /usr/lib/jvm/java-7-oracle/jre/bin/java 2
Press enter to keep the current choice [*], or type selection number: 2
update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
$ java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect.vcf \
    -o HCC1143_mutect_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    -select 'vc.isNotFiltered()'
• The same selection of PASS variants can also be made with the --excludeFiltered option of SelectVariants.
$ cd
$ java -jar GenomeAnalysisTK.jar -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect.vcf \
    -o HCC1143_mutect_filter.vcf \
    -sn HCC1143_Tumor -sn HCC1143_Normal \
    --excludeFiltered
• Number of mutations after filtering
$ cd
$ grep -v "#" HCC1143_mutect_filter.vcf | wc -l
109
2.4 Summary
• Programs used: VarScan2, SomaticSniper, MuTect, GATK
• Outputs: somatic mutations after the preliminary filter (161, 102, 112)
3 Obtaining Full Consensus / Partial Consensus sSNVs
We look for the full consensus calls of the three SNV detection tools SomaticSniper, VarScan2 and MuTect. Multi-allelic sites and indels are removed first.
3.1 Extracting only bi-allelic SNPs
• From the preliminary filter results, remove multi-allelic sites and extract SNPs only.
• With GATK SelectVariants, set -selectType to SNP (INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) and -restrictAllelesTo to BIALLELIC (MULTIALLELIC or BIALLELIC).
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect_filter.vcf \
    -o HCC1143_mutect_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper_filter.vcf \
    -o HCC1143_somaticsniper_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_varscan_filter.vcf \
    -o HCC1143_varscan_1.vcf \
    -selectType SNP \
    -restrictAllelesTo BIALLELIC
3.2 Obtaining the Full Consensus / Partial Consensus
• Compute the partial consensus for each pair of callers (SomaticSniper/MuTect, MuTect/VarScan2, VarScan2/SomaticSniper) and the full consensus across all three somatic callers.
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_somaticsniper_1.vcf \
    --concordance HCC1143_mutect_1.vcf \
    -o HCC1143_SM.vcf
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_mutect_1.vcf \
    --concordance HCC1143_varscan_1.vcf \
    -o HCC1143_MV.vcf
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_varscan_1.vcf \
    --concordance HCC1143_somaticsniper_1.vcf \
    -o HCC1143_VS.vcf
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    --concordance HCC1143_varscan_1.vcf \
    -o HCC1143_SMV.vcf
3.3 Counting Full Consensus / Partial Consensus calls
• Count the full consensus and partial consensus calls.
$ cd
$ grep -v "#" HCC1143_SM.vcf |wc -l
45
$ grep -v "#" HCC1143_MV.vcf |wc -l
38
$ grep -v "#" HCC1143_VS.vcf |wc -l
42
$ grep -v "#" HCC1143_SMV.vcf |wc -l
32
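For intuition, a consensus between two call sets can be thought of as an intersection on site and alleles. The sketch below is not how GATK's --concordance is implemented and may differ from it in detail (for example in how genotypes are handled); it simply keeps the records of the first file whose CHROM, POS, REF and ALT also occur in the second. The function name vcf_intersect and the demo files are hypothetical.

```shell
# vcf_intersect A.vcf B.vcf: print data lines of A whose
# (CHROM, POS, REF, ALT) key is also present in B. Headers are dropped.
vcf_intersect() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }
       { key = $1 FS $2 FS $4 FS $5 }
       FILENAME == ARGV[1] { seen[key] = 1; next }   # first argument is B
       (key in seen)' "$2" "$1"
}

# Demo with two made-up call sets sharing one site.
a=$(mktemp); b=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\n17\t100\t.\tA\tG\n17\t200\t.\tC\tT\n' > "$a"
printf '#CHROM\tPOS\tID\tREF\tALT\n17\t200\t.\tC\tT\n17\t300\t.\tG\tA\n' > "$b"
vcf_intersect "$a" "$b"
rm -f "$a" "$b"
```

A three-way consensus would correspond to intersecting the result of one pair with the third call set, which mirrors how HCC1143_SMV.vcf is derived from HCC1143_SM.vcf above.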
3.4 Summary
• Programs used: GATK
• Outputs: consensus / partial consensus data
4 Applying Additional Filters
GATK UnifiedGenotyper can be used to increase specificity.
4.1 Calling normal and tumor variants with UnifiedGenotyper (8 min)
• Use GATK UnifiedGenotyper to call SNPs in the normal and tumor samples.
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -o HCC1143_gatk.tumor.vcf \
    -I tumor.bam \
    --genotype_likelihoods_model SNP \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -L ccle.gatk.bed
$ java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -o HCC1143_gatk.normal.vcf \
    -I normal.bam \
    --genotype_likelihoods_model SNP \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -L ccle.gatk.bed
4.2 Filtering SNVs - full consensus (optional)
• Using the normal and tumor variants generated with GATK UnifiedGenotyper, apply a filter that keeps SNVs predicted in the tumor but not in the germline.
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV.vcf \
    --discordance HCC1143_gatk.normal.vcf \
    -o HCC1143_SMV_discordance_normal.vcf
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV_discordance_normal.vcf \
    --concordance HCC1143_gatk.tumor.vcf \
    -o HCC1143_final_filter_concordance.vcf
4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    --discordance HCC1143_gatk.normal.vcf \
    -o HCC1143_SM_discordance_normal.vcf
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM_discordance_normal.vcf \
    --concordance HCC1143_gatk.tumor.vcf \
    -o HCC1143_SM_final_filter_concordance.vcf
4.4 Counting Full Consensus / Partial Consensus calls after the GATK filter
• Count the consensus and partial consensus calls that remain after the GATK filter.
$ cd
$ grep -v "#" HCC1143_final_filter_concordance.vcf | wc -l
32
$ grep -v "#" HCC1143_SM_final_filter_concordance.vcf | wc -l
45
4.5 Summary
• Programs used: GATK
• Outputs: consensus / partial consensus data after applying the GATK filter
5 Validation
We check how well the calls agree with the list of known variants for the HCC1143 sample in COSMIC and CCLE. The validation.list file can be taken from the copy stored on the server or downloaded (https://gist.github.com/hongiiv/42194181ce6402d8b629).
5.1 Preparing the COSMIC and CCLE data
• Copy the list of known variants (103 in total) for the HCC1143 sample from COSMIC and CCLE.
$ cd
$ cp /somatic_bench/reference/validation.list ./
$ cat validation.list | wc -l
103
5.2 Performing validation - consensus / partial consensus
• Check how many of the final filtered consensus and partial consensus (SomaticSniper/MuTect) calls match the validation list.
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_final_filter_concordance.vcf \
    -o all.val.filter.vcf \
    -L validation.list
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM_final_filter_concordance.vcf \
    -o sm.val.filter.vcf \
    -L validation.list
$ grep -v "#" all.val.filter.vcf | wc -l
6
$ grep -v "#" sm.val.filter.vcf | wc -l
9
• Check how many of the consensus variants match before the additional GATK filter.
$ cd
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SMV.vcf \
    -o all.val.vcf \
    -L validation.list
$ java -jar GenomeAnalysisTK.jar \
    -T SelectVariants \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --variant HCC1143_SM.vcf \
    -o sm.val.vcf \
    -L validation.list
$ grep -v "#" all.val.vcf | wc -l
6
$ grep -v "#" sm.val.vcf | wc -l
9
• consensus: before GATK filter (32/6) - after GATK filter (32/6)
• partial consensus-SM: before GATK filter (45/9) - after GATK filter (45/9)
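The pairs above read as "calls / calls found in the validation list". The fraction of calls that overlap the list can be worked out with a one-line helper; this is only arithmetic on the counts reported above, and the function name pct is made up.

```shell
# pct A B: print A as a percentage of B with two decimals.
pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { if (b > 0) printf "%.2f\n", 100 * a / b }'
}

pct 6 32   # full consensus: 6 of 32 calls are in validation.list
pct 9 45   # partial consensus (SomaticSniper/MuTect): 9 of 45 calls
```

These are overlap fractions only; calls that are absent from validation.list are not necessarily false positives.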
5.3 Summary
• Programs used: GATK
• Outputs: number of final consensus / partial consensus calls that match COSMIC and CCLE
6 Other Somatic Mutation Callers - Strelka, Virmid
6.1 Strelka (1 min 38 s)
Strelka is a somatic mutation caller built by Illumina in 2012 that uses Bayesian probability with posterior filtering. It supports not only Illumina's own aligners, isaac and eland, but also bwa. It is run somewhat differently from typical programs, in the same way as Illumina's isaac: the analysis is set up as a workflow described in a Makefile and executed with the make tool.
• Running Strelka requires a file that stores its options; default option files are provided for three aligners: bwa, eland and isaac.
• For exome or targeted sequencing, change the default option to isSkipDepthFilters = 1.
$ ll /somatic_bench/app/strelka-1.0.14/etc/
total 20
drwxrwxr-x 2 viz viz 4096 Jul 10 2014 ./
drwxr-xr-x 7 root root 4096 Jan 30 11:06 ../
-rw-rw-r-- 1 viz viz 3658 Jul 10 2014 strelka_config_bwa_default.ini
-rw-rw-r-- 1 viz viz 3683 Jul 10 2014 strelka_config_eland_default.ini
-rw-rw-r-- 1 viz viz 3821 Jul 10 2014 strelka_config_isaac_default.ini
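• Set variables for the Strelka installation directory and the directory where the analysis results are stored.
• Copy the default option file and generate the analysis commands with configureStrelkaWorkflow.pl.
• Run the generated analysis with make; the -j option sets the number of threads (CPUs) used for the analysis.
• VCF files are produced separately for INDELs and SNPs, as passed calls and as raw somatic calls, giving four result files.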
$ STRELKA_INSTALL_DIR=/somatic_bench/app/strelka-1.0.14/
$ echo $STRELKA_INSTALL_DIR
/somatic_bench/app/strelka-1.0.14/
$ WORK_DIR=/root/myWork
$ cp $STRELKA_INSTALL_DIR/etc/strelka_config_isaac_default.ini config.ini
$ $STRELKA_INSTALL_DIR/bin/configureStrelkaWorkflow.pl \
    --normal=/root/normal.bam \
    --tumor=/root/tumor.bam \
    --ref=/somatic_bench/reference/human_g1k_v37_decoy.fasta \
    --config=config.ini --output-dir=./myAnalysis
$ cd ./myAnalysis
$ make -j 8
$ ll myAnalysis/results/
total 88
drwxr-xr-x 2 root root 4096 Jan 30 11:39 ./
drwxr-xr-x 5 root root 4096 Jan 30 11:37 ../
-rw-r--r-- 1 root root 13452 Jan 30 11:37 all.somatic.indels.vcf
-rw-r--r-- 1 root root 36736 Jan 30 11:37 all.somatic.snvs.vcf
-rw-r--r-- 1 root root 7098 Jan 30 11:37 passed.somatic.indels.vcf
-rw-r--r-- 1 root root 16070 Jan 30 11:37 passed.somatic.snvs.vcf
• Check the number of final passed somatic SNPs.
$ cd myAnalysis/results/
$ grep -v "#" passed.somatic.snvs.vcf|wc -l
62
6.2 Virmid (33 min)
Virmid is software created in 2013 by Professor Sangwoo Kim of Yonsei University. It estimates the proportion of normal sample in the tumor (α).
• Run Virmid and check the number of final passed somatic SNPs.
$ java -jar /somatic_bench/app/Virmid-1.1.1/Virmid.jar \
    -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
    -D /root/tumor.bam \
    -N /root/normal.bam \
    -t 8 \
    -w /root/virmid
$ cd /root/virmid
$ ls -al
total 98024
drwxr-xr-x 2 root 4096 Jan 30 16:00 ./
drwxr-xr-x 8 root 8192 Jan 30 15:32 ../
-rw-r--r-- 1 root 1252161 Jan 30 16:03 tumor.bam.virmid.germ.all.vcf
-rw-r--r-- 1 root 955213 Jan 30 16:03 tumor.bam.virmid.germ.passed.vcf
-rw-r--r-- 1 root 262 Jan 30 16:00 tumor.bam.virmid.gm
-rw-r--r-- 1 root 36564 Jan 30 16:03 tumor.bam.virmid.loh.all.vcf
-rw-r--r-- 1 root 2233 Jan 30 16:01 tumor.bam.virmid.loh.passed.vcf
-rw-r--r-- 1 root 992 Jan 30 16:03 tumor.bam.virmid.report
-rw-r--r-- 1 root 1364144 Jan 30 15:29 tumor.bam.virmid.sample.control.bai
-rw-r--r-- 1 root 53107377 Jan 30 15:29 tumor.bam.virmid.sample.control.bam
-rw-r--r-- 1 root 1364104 Jan 30 15:29 tumor.bam.virmid.sample.disease.bai
-rw-r--r-- 1 root 41746178 Jan 30 15:29 tumor.bam.virmid.sample.disease.bam
-rw-r--r-- 1 root 84053 Jan 30 16:03 tumor.bam.virmid.som.all.vcf
-rw-r--r-- 1 root 6883 Jan 30 16:03 tumor.bam.virmid.som.passed.vcf
$ grep -v "#" tumor.bam.virmid.som.passed.vcf|wc -l
78
7 Linux for Bio Research
7.1 Practice Linux server
• Server address: xxx.xxx.xxx.xxx
• User IDs: edu01, edu02
• Password: kogo2015
• Web access: http://xxx.xxx.xxx.xxx:8787
7.2 Connecting to the practice Linux server - Windows users
• Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
• Download putty.exe for Intel x86.
• Host Name: xxx.xxx.xxx.xxx / Port: xx
• When the Security Alert window appears, choose 'Yes (Y)'.
• Login ID: use the ID and password assigned to you.
7.3 Connecting to the practice Linux server - Mac or Linux users
• On Mac (OS X), launch the Terminal app under 'Applications, Utilities'. On Linux, launch 'Terminal', or the terminal from the program menu of your distribution.
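$ ssh user_id@host_name
$ ssh root@127.0.0.1
• Connect to the practice Linux server with the ssh command. On the first connection, answer yes; a prompt asking for the password then appears, and you log in by entering the password you were given.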
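7.4 Finding out Linux system information
This document is based on 'Ubuntu', one of the Linux distributions [3]. Unless otherwise noted, the commands can be used on any kind of Linux. Linux is an operating system that runs on a wide variety of distributions and hardware. You need to know what environment your Linux runs in so that, when installing software, you can install the build that suits your system.
• This is how to identify which Linux distribution you are currently using. Ubuntu is a freely distributed Linux operating system; the latest version at the time of writing is 14.04 LTS (Long Term Support) [4].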
$ cat /etc/issue.net
Ubuntu 12.04.1 LTS
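• Linux runs on a variety of hardware, and software that supports Linux ships separate executables for different hardware. You therefore need to know your current hardware to download and use the software that fits it. The hardware architecture of a Linux server can be identified with the '-m' (machine) option of uname. 'x86' denotes an Intel-based CPU and '64' denotes 64-bit hardware [5].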
$ uname -m
x86_64
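[3] Linux distributions fall broadly into the Red Hat family and the Debian family, and each family in turn has many distributions.
[4] Its code name is Trusty Tahr.
[5] It is also written as x64.
• The kernel is the core of the Linux operating system; it carries out the user's commands on the actual hardware. Different distributions use different kernel versions. The latest Linux kernel at the time of writing is 3.14.3, released on May 6, 2014. Linux distributions are built on kernels released in this way. The kernel version can be identified as follows.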
$ uname -r
3.2.0-32-virtual
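• The shell is the environment in which Linux commands are entered and run. 'PATH' is one of the environment variables, which are values that affect how the operating system behaves, and export is the command that sets the value of environment variables. When you enter a command, Linux first searches the directories listed in PATH to see whether the command exists there and then runs it. So if you install your own software and want to run it on Linux, you must add its directory to PATH to be able to run it from anywhere; otherwise it can only be run from inside the directory where it is installed. Shell environment variables can be inspected with the 'env' command, and PATH is set with 'export'.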
$ env | grep PATH
MANPATH=/usr/local/texlive/2013/texmf/doc/man:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
INFOPATH=/usr/local/texlive/2013/texmf/doc/info:
$ export PATH=/BIO/app/bwa-0.7.5a/:$PATH
$ env | grep PATH
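Running the export line repeatedly, for instance from a login script, adds the same directory to PATH again each time. A small guard avoids that. This is a minimal sketch; the function name path_prepend is hypothetical and the directory is the one used above.

```shell
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present: leave PATH unchanged
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

path_prepend /BIO/app/bwa-0.7.5a
path_prepend /BIO/app/bwa-0.7.5a   # second call is a no-op
echo "$PATH"
```

The wrapping colons in the case pattern make sure only whole PATH entries are matched, not substrings of longer directory names.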
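7.5 The Linux file system
Linux manages a single physical disk by dividing it logically into several regions, and creates a file system on them to manage files and directories.
• A Linux system is shared by several users, each of whom has a home directory as their own area. Inside the home directory you can create and delete files. The 'cd' command moves to the home directory, and the 'pwd' command shows the path of the current directory.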
$ cd
$ pwd
/home/hongiiv
• Creating a directory and moving into it
$ cd
$ mkdir sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x 3 root root 4096 May 7 13:14 ..
-rw------- 1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r-- 1 hongiiv hongiiv 220 May 7 13:14 .bash_logout
-rw-r--r-- 1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
drwxr-xr-x 2 root root 4096 May 29 10:34 sample_data
$ cd sample_data
$ pwd
/home/hongiiv/sample_data
• Deleting directories and files
$ cd
$ rm -rf sample_data
$ ls -la
total 2203488
drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
drwxr-xr-x 3 root root 4096 May 7 13:14 ..
-rw------- 1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
-rw-r--r-- 1 hongiiv hongiiv 220 May 7 13:14 .bash_logout
-rw-r--r-- 1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
$
• Viewing the Linux file systems
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 19G 14G 4.8G 74% /
udev 3.9G 4.0K 3.9G 1% /dev
tmpfs 1.6G 188K 1.6G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 3.9G 0 3.9G 0% /run/shm
/dev/xvdb1 79G 38G 38G 50% /home/hongiiv/test
• Viewing physical hard disk information: the 21.5 GB physical disk /dev/xvda consists of two partitions, xvda1 and xvda2, whose file system types are Linux and Linux swap.
$ fdisk -l
Disk /dev/xvda: 21.5 GB, 21474836480 bytes
255 heads, 63 sectors/track, 2610 cylinders, total 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00034212
Device Boot Start End Blocks Id System
/dev/xvda1 2048 40038399 20018176 83 Linux
/dev/xvda2 40038400 41940991 951296 82 Linux swap / Solaris
Disk /dev/xvdb: 300.6 GB, 300647710720 bytes
171 heads, 35 sectors/track, 98112 cylinders, total 587202560 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x3459a991
Device Boot Start End Blocks Id System
/dev/xvdb1 2048 587202559 293600256 8e Linux LVM
• Checking file system mount information
$ cat /etc/fstab
proc /proc proc nodev,noexec,nosuid 0 0
/dev/xvda1 / ext3 errors=remount-ro 0 1
/dev/xvda2 none swap sw 0 0
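7.6 Adding a hard disk in Linux
• After confirming the added hard disk with fdisk, the disk becomes usable after three steps: partitioning, creating a file system, and mounting. To make Linux recognize a USB storage device, only the mount step is needed.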
$ fdisk /dev/xvdb
$ mkfs.ext3 /dev/xvdb1
$ mkdir /new_hdd
$ mount /dev/xvdb1 /new_hdd
$ cd /new_hdd
$ df -h
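7.7 File-related commands
• touch - creates an empty file of size 0 bytes, or changes the timestamp of a file that already exists. It is occasionally needed when installing analysis-related software, so it is worth remembering.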
$ touch a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:04 a
$ date
Wed Jun 18 10:05:10 KST 2014
$ touch -c a
$ ls -al
-rw-r--r-- 1 root root 0 Jun 18 10:05 a
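The two behaviours of touch shown above, creating an empty file and updating a timestamp, can be separated with its options. This is a small sketch in a scratch directory; the file name missing and the chosen timestamp are assumptions for illustration.

```shell
cd "$(mktemp -d)"          # scratch directory (assumption, for a safe demo)

touch a                    # creates an empty file because a does not exist yet
ls -l a                    # the size column shows 0

touch -c missing           # -c: never create; a missing file stays missing
ls missing 2>/dev/null || echo "missing was not created"

# -t sets an explicit timestamp in [[CC]YY]MMDDhhmm format instead of "now".
touch -t 201406181004 a
ls -l a                    # the date column now shows Jun 18 2014
```

With -c, touch only ever updates timestamps, which is why the transcript above changes the time of a from 10:04 to 10:05 without altering its size.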
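• cat - used to view the contents of a file or to write a short script. The 'cat > test' command creates a file named test and writes whatever is typed into it. When you have finished typing, press 'ctrl+D' to end the input.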
$ cat > test
hi there
my name is hong
$ cat test
hi there
my name is hong
$ ls -al
-rw-r--r-- 1 root root 25 Jun 18 10:09 test
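Inside a script there is no keyboard to press ctrl+D on, so the same file is usually written with a here-document. The sketch below reproduces the file from the transcript above in a scratch directory (the scratch directory is an assumption).

```shell
cd "$(mktemp -d)"          # scratch directory (assumption)

# Same content as typed interactively above, but supplied by a here-document,
# so no ctrl+D is needed and the step can live inside a script.
cat > test <<'EOF'
hi there
my name is hong
EOF

cat test                   # print the file back
wc -c < test               # byte count; matches the size column of ls -al
```

The two lines are 9 and 16 bytes including their newlines, which is where the size of 25 in the listing comes from.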
• Counting the files in a specific directory
$ ls -l . | grep ^- | wc -l
50
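The pipeline above works because ls -l marks regular files with a leading "-". A hedged sketch of the same count, next to an equivalent using find, is shown here on a scratch directory with made-up file names.

```shell
dir=$(mktemp -d)           # scratch directory (assumption)
touch "$dir/a.vcf" "$dir/b.vcf" "$dir/c.bam"
mkdir "$dir/subdir"

# ls -l prints one line per entry; regular files start with "-" and
# directories with "d", so grep '^-' keeps only the regular files.
n_ls=$(ls -l "$dir" | grep -c '^-')

# find gives the same count without parsing ls output; unlike plain ls it
# also counts hidden files. -maxdepth 1 keeps it from descending into subdir.
n_find=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')

echo "ls: $n_ls, find: $n_find"
```

Quoting the pattern as '^-' keeps the shell from interpreting it, and grep -c replaces the separate wc -l step.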
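• Printing a file without the lines that start with a specific character. In a file such as a VCF, lines that start with '#' are comments, so the first command counts only the actual variant records with the comments excluded, and the second, conversely, counts only the comment lines.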
$ cd /BIO/data/gatk
$ grep -v "#" dbsnp_138.hg19.vcf| wc -l
8087914
$ grep -F "#" dbsnp_138.hg19.vcf |wc -l
165
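Note that the pattern "#" matches the character anywhere in a line, not only at the start. Anchoring the pattern with ^ is stricter; the difference is sketched below on a made-up three-record VCF (the file name tiny.vcf and the identifiers in it are hypothetical).

```shell
cd "$(mktemp -d)"          # scratch directory (assumption)

# A made-up VCF, just large enough to show the counts.
printf '##fileformat=VCFv4.1\n'        >  tiny.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n'   >> tiny.vcf
printf 'chrM\t64\trs1\tC\tT\n'         >> tiny.vcf
printf 'chr1\t100\trs2\tA\tG\n'        >> tiny.vcf
printf 'chr1\t200\tnote#3\tG\tA\n'     >> tiny.vcf

grep -vc '^#' tiny.vcf     # records: lines that do not START with #  -> 3
grep -c  '^#' tiny.vcf     # header lines                              -> 2
grep -vc '#'  tiny.vcf     # unanchored: also drops the note#3 record -> 2
```

For a dbSNP file the two forms most likely agree, but the anchored form does not depend on what the INFO or ID columns happen to contain.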
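• Printing only a specific column (the chromosome). The chromosome names can be sorted in dictionary order with 'sort -d', and 'uniq -c' counts how many times each name occurs on consecutive lines.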
$ grep -v "#" dbsnp_138.hg19.vcf |awk '{print $1}'| more
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
chrM
$ grep -v "#" dbsnp_138.hg19.vcf |awk '{print $1}'| sort -d
chr1
chr2
$ grep -v "#" dbsnp_138.hg19.vcf |awk '{print $1}'| uniq -c
475 chrM
4723878 chr1
3363561 chr2
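uniq -c merges only adjacent identical lines, so the counts above are per chromosome only because the VCF is already grouped by chromosome. For input in arbitrary order, sort has to come first. A minimal sketch on a made-up column:

```shell
# Made-up chromosome column in unsorted order (assumption, for illustration).
chroms=$(printf 'chr1\nchrM\nchr1\nchr2\nchr1\n')

# Without sort, chr1 is reported three separate times with a count of 1.
echo "$chroms" | uniq -c

# sort groups identical names together, then uniq -c counts each group.
echo "$chroms" | sort | uniq -c
```

The second pipeline reports chr1 three times in total, and chr2 and chrM once each.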
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}'
chrM is: 16390
chrM is: 16391
chrM is: 16429
chrM is: 16445
chrM is: 16499
• Saving what is printed to standard output in a file
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}' > ~/chr_pos.txt
$ grep -v "#" dbsnp_138.hg19.vcf |
awk '{if ($1 == "chr1") printf "chr1 is: %s\n", $2}' >> ~/chr_pos.txt
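The two commands above differ only in the redirection operator: > starts the file from scratch, while >> adds to it. A short sketch on a scratch file (the scratch file stands in for ~/chr_pos.txt and is an assumption):

```shell
out=$(mktemp)              # scratch file (assumption; the text uses ~/chr_pos.txt)

echo "first"  >  "$out"    # > creates the file or truncates an existing one
echo "second" >> "$out"    # >> appends to the end
echo "third"  >  "$out"    # > again: the two earlier lines are gone

cat "$out"                 # prints only: third
```

Using > where >> was intended silently discards earlier results, so it is worth checking the operator before running a long extraction.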
7.8 Linux network information
• Information about the network interfaces. The inet addr of eth0 is the address at which this Linux server can be reached from outside. The server address is 172.27.252.234 in this example; it will be shown differently depending on each participant's practice environment.
$ ifconfig
eth0 Link encap:Ethernet HWaddr 02:00:5b:73:00:33
inet addr:172.27.252.234 Bcast:172.27.255.255
inet6 addr: fe80::5bff:fe73:33/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:501386 errors:0 dropped:0 overruns:0 frame:0
TX packets:346879 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:19357734604 (1 GB) TX bytes:2720265191 (2 GB)
Interrupt:68
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:4337 errors:0 dropped:0 overruns:0 frame:0
TX packets:4337 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2203478 (2.2 MB) TX bytes:2203478 (2.2 MB)
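If only the address is needed, it can be filtered out of this output with awk. The sketch below runs on text copied from the eth0 block above so that it is reproducible; it assumes the older "inet addr:" layout shown here, and newer ifconfig versions print the address in a different form.

```shell
# Sample text copied from the eth0 block above; on a live system the same
# filter could read from `ifconfig eth0` (the interface name is an assumption).
sample='eth0 Link encap:Ethernet HWaddr 02:00:5b:73:00:33
inet addr:172.27.252.234 Bcast:172.27.255.255
inet6 addr: fe80::5bff:fe73:33/64 Scope:Link'

# Keep the line that starts with "inet addr:", then strip the label from field 2.
ip=$(printf '%s\n' "$sample" |
     awk '/^ *inet addr:/ { sub(/^addr:/, "", $2); print $2 }')
echo "$ip"
```

The pattern requires a space after "inet", so the inet6 line is skipped.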
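7.9 Understanding compression in Linux
Linux supports a variety of compression formats, and software and data are usually distributed as compressed files.
• How to decompress the various compression formats used in Linux. The files to be decompressed contain a quiz, and a prize goes to the first person to solve it.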
$ cd
$ cp -R /BIO/data/compress ./compress
$ cd compress
$ gzip -d compress01.gz
$ tar xvfz compress02.tar.gz
$ unzip compress03.zip
$ bzip2 -d compress04.bz2
$ tar xvfz compress05.tar.gz
$ tar xvf compress06.tar.bz2
• gzip: Recommended for fast network connections
• bzip2: Recommended for slower network connections (smaller size but takes longer to compress)
• zip: Not recommended but is provided as an option for those who cannot open the above formats
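To see what the decompression commands undo, it helps to create an archive first. The following round trip is a minimal sketch using gzip and tar only; the file and directory names are made up for illustration.

```shell
cd "$(mktemp -d)"          # scratch directory (assumption)
printf 'line1\nline2\n' > reads.txt

# gzip replaces reads.txt with reads.txt.gz; -d reverses that.
gzip reads.txt
gzip -d reads.txt.gz

# tar bundles files into one archive: c = create, x = extract,
# z = filter through gzip, f = the archive file name follows.
mkdir project && mv reads.txt project/
tar czf project.tar.gz project
rm -r project
tar xzf project.tar.gz

cat project/reads.txt      # the original two lines are back
```

gzip and bzip2 compress a single file, while tar is what groups many files into one archive, which is why the two are so often combined as .tar.gz or .tar.bz2.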
• How to check the contents of a large compressed file, such as genome data, without decompressing it on disk. This is useful for checking FASTQ files and similar data.
$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz | more
$ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.tar.gz | tar -tvf -
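The same idea can be tried on a small file. This sketch compresses a made-up one-read FASTQ (the file name tiny.fastq.gz and its content are hypothetical) and then reads it without ever writing an uncompressed copy.

```shell
cd "$(mktemp -d)"          # scratch directory (assumption)

# A made-up FASTQ record, compressed, standing in for a large data file.
printf '@read1\nACGT\n+\nIIII\n' | gzip > tiny.fastq.gz

gzip -dc tiny.fastq.gz | head -n 4   # -c writes to stdout; the .gz stays intact
gzip -dc tiny.fastq.gz | wc -l       # line count; FASTQ uses 4 lines per read

ls tiny.fastq.gz                     # still compressed on disk
```

Dividing the line count by four gives the number of reads, again without using any extra disk space.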
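7.10 Installing software on Linux
In general, there are three ways to install software on Linux. The first is software that is distributed as a compressed binary (executable) file, which can be used simply by decompressing it. The second uses the packages provided by the Linux distribution; on Ubuntu this is done with the package management program APT. The third is to install from source files.
7.10.1 Installing software with APT
• Updating the package lists and installing a package with APT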
$ apt-get update
$ apt-get install bwa
Reading package lists... Done
Building dependency tree
Reading state information... Done
Use 'apt-get autoremove' to remove them.
Suggested packages:
samtools
The following NEW packages will be installed:
bwa
0 upgraded, 1 newly installed, 0 to remove and 153 not upgraded.
Need to get 135 kB of archives.
After this operation, 286 kB of additional disk space will be used.
Fetched 135 kB in 3s (40.1 kB/s)
Selecting previously unselected package bwa.
(Reading database ...17 files and directories currently installed.)
Unpacking bwa (from .../archives/bwa_0.6.1-1_amd64.deb) ...
Processing triggers for man-db ...
Setting up bwa (0.6.1-1) ...
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.1-r104
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fastmap       identify super-maximal exact matches
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment
• A list of packages that should be installed first as the basis for installing NGS-related software.
$ apt-get update -y
$ apt-get install gcc -y
$ apt-get install make -y
$ apt-get install zlib1g-dev -y
$ apt-get install libncurses5-dev -y
$ apt-get install g++ -y
$ apt-get install tcl tk -y
$ apt-get install tcl-dev -y
$ apt-get install unzip -y
$ apt-get install curl -y
$ apt-get install screen -y
$ apt-get install python-dev -y
$ apt-get install python-software-properties -y
$ add-apt-repository ppa:webupd8team/java
$ apt-get update -y
$ apt-get install oracle-java7-installer -y
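Before running the whole list, it can be useful to see which tools are already present. The sketch below checks a few command names with the shell builtin command -v; the selection of names is an assumption, and the check covers executables only, so library packages such as zlib1g-dev would need a package query like dpkg -s instead.

```shell
# Report which of the listed commands are already on PATH.
# The tool list is an assumption; adjust it to the packages you need.
missing=""
for tool in gcc make g++ unzip curl screen; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing="$missing $tool"
    fi
done
echo "to install:${missing:- nothing}"
```

Nothing is installed by this loop; it only prints the names that an apt-get install step would still have to cover.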
7.10.2 Installing software by compiling source code
• Installing from source
$ cd
$ cp /BIO/app/bwa-0.7.4.tar.bz2 ./
$ tar xvf bwa-0.7.4.tar.bz2
$ cd bwa-0.7.4
$ make
$ ./bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.4-r385
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ
$ bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.6.2-r126
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
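The two version numbers above differ because ./bwa runs the binary that was just compiled in the current directory, while a bare bwa is looked up in PATH and presumably resolves to a copy installed earlier. The lookup rule can be sketched with a stand-in program; the name mytool and the temporary directory are hypothetical.

```shell
bindir=$(mktemp -d)        # stand-in for a build directory such as ~/bwa-0.7.4

# A hypothetical tool named mytool that only reports where it lives.
printf '#!/bin/sh\necho "mytool from %s"\n' "$bindir" > "$bindir/mytool"
chmod +x "$bindir/mytool"

"$bindir/mytool"           # an explicit path always runs exactly that file

# A bare name is searched for in PATH from left to right, so prepending
# the build directory makes the freshly built copy win.
PATH="$bindir:$PATH"
command -v mytool          # prints the path that a bare "mytool" resolves to
```

Running command -v bwa in the same way shows which installation a script will pick up, which matters when results depend on the aligner version.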

Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach

  • 1. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach CB Hong ⇤ , KJ Kim 4-5 February 2015 Contents 1 TCGA Benchmark 4 Data Set 3 1.1 GenomeTorrent| t© TCGA pt0 ‰¥‹ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Sample Data Set DX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 îú⌧ Ì Ù Ux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 ‰µ` pt0 Ux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.5 ¨X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Somatic Mutation Prediction 6 2.1 SomaticSniper ‰â ✏ ¨⌅ D0 ©X0 (164 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 VarScan2 ‰â ✏ ¨⌅ D0 ©X0 (10Ñ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 MuTect ‰â ✏ ¨⌅ D0 ©X0 (18Ñ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 ¨X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 Full Consensus / Partial Consensus sSNV lX0 11 3.1 Bi-allelic SNPà îúX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Full Consensus / Partial Consensus lX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3 Full Consensus / Partial Consensus /⇠ lX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.4 ¨X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 î D0 ©X0 13 4.1 Unifed Genotyper| t© normal, tumor variants call (8Ñ) . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Filtering SNVs - full consensus (›µ •) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect) . . . . . . . . . . . . . . . . . . . . . . . . 
13 4.4 GATK D0| © ƒ Full Consensus / Partial Consensus /⇠ lX0 . . . . . . . . . . . . . . . . . . 14 4.5 ¨X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 5 Validation 15 5.1 COSMIC, CCLE pt0 DX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.2 Validation ⇠â - consensus / parital consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.3 ¨X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 6 0¿ Somatic Mutation Callers - Strelka, Virmid 17 6.1 Strelka (1Ñ38 ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 6.2 Virmid (33Ñ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 ⇤KT GenomeCloud hongiiv@gmail.com 1
  • 2. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 2 7 ⌅¥ l| ⌅ ¨⇧§ 19 7.1 ‰µ© ¨⇧§ ⌧Ñ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7.2 ‰µ© ¨⇧§ ⌧Ñ ⌘çX0 - ƒ∞਩ê . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7.3 ‰µ© ¨⇧§ ⌧Ñ ⌘çX0 -  ⇣î ¨⇧§ ¨©ê . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7.4 ¨⇧§ ‹§ Ù LD¥0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7.5 ¨⇧§ | ‹§ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 7.6 ¨⇧§ X‹§l î X0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7.7 | ( Ö9¥ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7.8 ¨⇧§ $∏Ãl Ù . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 7.9 ¨⇧§ Uï ttX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.10 ¨⇧§ å⌅∏Ë¥ $XX0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.10.1 APT| t© å⌅∏Ë¥ $X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 7.10.2 å§ T‹ Ù |D µ å⌅∏Ë¥ $X . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
  • 3. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 3 1 TCGA Benchmark 4 Data Set ¯ ‰µ–⌧î TCGA mutation calling benchmark4 datasetsD t©XÏ ¥ªå somatic mutationD >D¿– t⌧ LD ¸ ÉÖ»‰. Genome sequencing benchmakr dataset@ x⌅ < tumor ÿ – | D((5%-95%)X Normal ÿ D <iXÏ ›1 pt0Ö»‰. t ⌘–⌧ ∞¨î n40t60 (mixed with 60% of the tumor and 40% of the normal)¸ t– QXî normal sampleD ¨©` ÉÖ»‰. t˘ pt0î BAM Ϙ< TCGA Benchmark Hò t¿–⌧ ‰¥‹ •i»‰. 1.1 GenomeTorrent| t© TCGA pt0 ‰¥‹ • ‰¥‹ S/W $X - Key/UUID | ‰¥‹ - ÿ ‰¥‹ • ‹)TCGA Benchmark Data SetD ⌅ Public Key ‰¥‹ • https://cghub.ucsc.edu/datasets/benchmark download.html $ cd $ wget https:// cghub.ucsc.edu/software/downloads/cghub_public.key • π |X ‰¥‹ Ù| ÏhXî UUID(universally unique identifier, ›ƒê) | • TCGA Benchmark cell line: HCC1143 tumor 50x $ curl https:// cghub.ucsc.edu/cghub/metadata/ analysisAttributes ? analysis_id=ad3d4757 -f358 -40a3 -9d92 -742463 a95e88 -o uuid.txt $ more uuid.txt <?xml version="1.0" encoding="utf -8" standalone="yes"?> <center_name >UCSC </ center_name > <study >TCGA_MUT_BENCHMARK_4 </study > <files > <file > <filename >G15511.HCC1143 .1.bam </ filename > <filesize >255795959440 </ filesize > </file > • gtdownload| t© pt0 ‰¥‹ $ cd $ gtdownload -c cghub_public.key -vv -d uuid.txt 1.2 Sample Data Set DX0 • BAMX |Ä Ì îú - ,(sort) - xqÒ (index) ¸…¥ Ë⌅ îú (-b: bam Ϙ< ú%) $ cd $ samtools view -b in.bam 1 > chr1.bam $ samtools sort chr1.bam chr1_sorted $ samtools index chr1_sorted.bam • π ÌX îú (BED | t©)
  • 4. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 4 $ cd $ cat chr17.bed 17:5967 -6207 17:11197 -11389 17:11806 -12018 17:13897 -14017 17:22307 -22427 17:30843 -30963 17:31151 -31279 17:63618 -63738 17:65398 -65638 17:69410 -69530 17:96838 -97108 17:131511 -131661 17:169155 -169395 17:170984 -171254 17:177205 -177355 17:260100 -260308 17:262897 -263257 17:263317 -263947 $ cat chr17.bed |xargs samtools view -b in.bam > exome.bam $ samtools sort exome.bam exome_sorted $ samtools index exome_sorted.bam 1.3 îú⌧ Ì Ù Ux • readƒ ⌅X Ù| bed Ϙ< ú%‰. ⌅Ëà ucsc genome browserX custom track< î XÏ align ⌧ read Ù| Ux` ⇠ à‰. $ cd $ bamToBed -i exome_sorted.bam > cov_1.bed • BAM |X ‰Ñ¨¿| BED | ú%Xp, read depth Ù| ৆¯®< ¯¨0 ⌅ Ù © ⇠ à‰. $ cd $ samtools view -b exome_sorted.bam | genomeCoverageBed -ibam stdin > cov_2.bed 1.4 ‰µ` pt0 Ux • ÿ , ⌅¯®, |§ pt0 ©] $ cd /somatic_bench $ pwd /somatic_bench $ ls -al total 176 drwxr -xr -x 7 root root 4096 Jan 21 15:25 . drwxr -xr -x 25 root root 4096 Jan 20 08:53 ..
  • 5. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 5 drwxr -xr -x 9 root root 4096 Jan 21 08:15 app drwxr -xr -x 2 root root 4096 Jan 21 14:38 bam drwxr -xr -x 2 root root 4096 Jan 19 11:43 reference drwxr -xr -x 2 root root 4096 Jan 21 15:24 script drwxr -xr -x 2 root root 151552 Jan 21 12:59 tmp $ more /somatic_bench/script/ somatic_call_bench .sh input_bam1="/somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam" input_bam2="/somatic_bench/bam/hcc1143.ccle.b.sorted.bam" gatk_b37="/somatic_bench/reference/ human_g1k_v37_decoy .fasta" temp_dir="/somatic_bench/tmp/" $ cd $ ln -s /somatic_bench/bam/hcc1143.ccle.n40t60.sorted.bam tumor.bam $ ln -s /somatic_bench/bam/hcc1143.ccle.b.sorted.bam normal.bam 1.5 ¨X0 • ⌅¯® ©]: wget, curl, gtdownload, samtools, bedtools(bamToBed, genomeCoverageBed) • ∞¸<: –Xî ÌÃt t¨Xî .bam, t˘ .bamX coverage| Ùϸî .bed
  • 6. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 6 2 Somatic Mutation Prediction SomaticSniper, VarScan2, MuTectD t©XÏ ÿ pt0K< Ä0 (tumor@ matched normal bam) somatic mu- tationD >D≈»‰. • Ñ Ö9: https://gist.github.com/hongiiv/06611f189f4c8158edb0 • SAMtools: v0.1.19 • GATK: v2.8.1 • MuTect: v1.1.4 • SomaticSniper: v1.0.4 • Strelka: v1.0.14 • Virmid: v1.1.1 2.1 SomaticSniper ‰â ✏ ¨⌅ D0 ©X0 (164 ) SomaticSniperî Varscan2| Ç ÃÒ4 YX Li Ding– Xt 2011D ⌧⌧⇠»<p, Bayesian probability@ poste- rior filteringD t©‰. ¸î π’<î High computational e ciency| Ùx‰. • -J: joint genotyping mode with default prior probability of a somatic mutation (0.01) • -n, -t: normal/tumor sample id (for VCF header) • -F: output Ϙ (classic, vcf, bed) • -f: ref.fasta |X Ω $ cd $ bam - somaticsniper -J -F vcf -n HCC1143_Normal -t HCC1143_Tumor -f /somatic_bench/reference/ human_g1k_v37_decoy .fasta tumor.bam normal.bam HCC1143_somaticsniper .vcf • (D05X) Reads with a mapping quality of 0 were filtered prior to somatic mutation identification. Predictions with ’somatic score’ of 40 or greater were considered for subsequent downstaream validation and analysis step. • GATKXSelectVariants| t©XÏ –Xî variantsÃD îú` ⇠ à‰. • VCF |X FORMAT D‹X SSC (somatic score), MQ (mapping quality) Ù| t© $ cd $ ln -s /somatic_bench/app/GenomeAnalysisTK -2.8 -1/ GenomeAnalysisTK .jar ./ $ update -alternatives --config java There are 2 choices for the alternative java (providing /usr/bin/java ). Selection Path Priority ------------------------------------------------------------ 0 /usr/lib/jvm/java -7- oracle/jre/bin/java 2 1 /usr/lib/jvm/java -6- oracle/jre/bin/java 1 * 2 /usr/lib/jvm/java -7- oracle/jre/bin/java 2 Press enter to keep the current choice [*], or type selection number: 2 update -alternatives : using /usr/lib/jvm/java -6- oracle/jre/bin/java
  • 7. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 7 $ java -version java version "1.7.0 _72" Java(TM) SE Runtime Environment (build 1.7.0_72 -b14) Java HotSpot(TM) 64-Bit Server VM (build 24.72 -b04 , mixed mode) $ java -jar GenomeAnalysisTK .jar -T SelectVariants -R /somatic_bench/reference/ human_g1k_v37_decoy .fasta --variant HCC1143_somaticsniper .vcf -o HCC1143_somaticsniper_filter .vcf -sn HCC1143_Tumor -sn HCC1143_Normal -select 'vc.getGenotype(" HCC1143_Tumor"). getExtendedAttribute ("SSC") >= 40 && (vc.getGenotype(" HCC1143_Tumor"). getExtendedAttribute ("MQ") > 0 || vc.getGenotype(" HCC1143_Normal "). getExtendedAttribute ("MQ") > 0)' • D0 ⌅/ƒX mutation /⇠ DPX0 $ cd $ grep -v "#" HCC1143_somaticsniper .vcf |wc -l 583 $ grep -v "#" HCC1143_somaticsniper_filter .vcf |wc -l 161 2.2 VarScan2 ‰â ✏ ¨⌅ D0 ©X0 (10Ñ) VarScan2î ÃÒ4 YX Li Ding– Xt SomaticSniperÙ‰ 1D ¶@ 2012D ⌧⌧⇠»‰. ‰x 4‰¸î Ϩ Fisher exact test@ filtering and FDR correctionD ¨©‰. ¸î π’< high-quality sSNVs– t⌧ sensitive detectionD ⇠â‰. ‰x 4‰¸ Ϩ Ö% |D .bam |t Dà pileup ⇣î mpileup |D Ö% î‰. • samtoolsX mpileupD t©XÏ normal, tumor– t⌧ pileup/mpileup ϘD ›1‰. • mpileup ˃–⌧ -q 1 (skip alignments with mapQ smaller than INT), -B (disable BAQ computation) 5XD µt filter| ⇠â‰. • VarScan–⌧ mpileup1 ϘD Ö%< ¨©Xî Ω∞ ’–mpileup 1’ 5XD ‰. $ cd $ samtools mpileup -f /somatic_bench/reference/ human_g1k_v37_decoy .fasta -q 1 -B normal.bam > HCC1143_n.pileup $ samtools mpileup -f /somatic_bench/reference/ human_g1k_v37_decoy .fasta -q 1 -B tumor.bam > HCC1143_t.pileup $ ln -s /somatic_bench/app/VarScan/VarScan.v2 .3.3. jar ./ $ java -jar VarScan.v2 .3.7. jar somatic HCC1143_n.pileup HCC1143_t.pileup HCC1143_varscan --output -vcf 1 14617150 positions in tumor 14616970 positions shared in normal 13721478 had sufficient coverage for comparison 10tX 8⌧‰@ samtoolsX pileupD ¨©Xî ÉD 0 < $Ö⇠¥ à¿Ã, samtools ≈pt∏ ⇠t⌧ pileup@ ¨|¿‡ mpileup < ¥ ⇠»‰. X¿Ã mpileup<ƒ XòX ÿ à pileupt •X‰. 
<` varscan–⌧î N/T ®P Ïh⌧ mpileup |D ¿–‰.
  • 8. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 8 13700958 were called Reference 0 were mixed SNP -indel calls and filtered 18427 were called Germline 1562 were called LOH 450 were called Somatic 81 were called Unknown 0 were called Variant • VarScan2X ⇠â∞¸ Dò@ ⇡t INDEL¸ SNP Ïh⌧ ∞¸| VCF ‹ ›1⌧‰ (HCC1143 varscan.indel.vcf, HCC1143 varscan.snp.vcf). drwxr -xr -x 2 root root 4096 Jan 30 09:52 ./ drwxr -xr -x 5 root root 8192 Jan 30 09:35 ../ -rw -r--r-- 1 root root 402354 Jan 30 09:47 HCC1143_varscan .indel.vcf -rw -r--r-- 1 root root 2691462 Jan 30 09:47 HCC1143_varscan .snp.vcf • VarScan2X ∞¸ ⌘, HCC1143varscan.snp.vcf XprocessSomaticısomaticFilter|tXD0|¸. • processSomatic: high-confidence2 /low-confidence Somatic mutationsD Ѩt ‰. • somaticFilter: ê‡t –Xî D0 5X –min-coverage, –p-value, –indel-file Ò © •X‰. $ cd $ java -jar VarScan.v2 .3.3. jar processSomatic -help USAGE: java -jar VarScan.jar process [status -file] OPTIONS status -file - The VarScan output file for SNPs or Indels OPTIONS --min -tumor -freq - Minimum variant allele frequency in tumor [0.10] --max -normal -freq - Maximum variant allele frequency in normal [0.05] --p-value - P-value for high -confidence calling [0.07] $ java -jar VarScan.v2 .3.3. jar processSomatic HCC1143_varscan .snp.vcf Reading input from HCC1143_varscan .snp.vcf Opening output files: 17914 VarScan calls processed 382 were Somatic (102 high confidence) 16048 were Germline (15431 high confidence) 1451 were LOH (1447 high confidence) • processSomaticX ∞¸ Germline, LOH, Somatic– t⌧ high confidence, low confidenceX ©]t Ïh ⌧ ∞¸| ›1‰. 
$ ls -rw -r--r-- 1 2413169 Jan 30 09:52 HCC1143_varscan .snp.vcf.Germline -rw -r--r-- 1 2320566 Jan 30 09:52 HCC1143_varscan .snp.vcf.Germline.hc -rw -r--r-- 1 216574 Jan 30 09:52 HCC1143_varscan .snp.vcf.LOH -rw -r--r-- 1 215997 Jan 30 09:52 HCC1143_varscan .snp.vcf.LOH.hc -rw -r--r-- 1 59990 Jan 30 09:52 HCC1143_varscan .snp.vcf.Somatic -rw -r--r-- 1 17055 Jan 30 09:52 HCC1143_varscan .snp.vcf.Somatic.hc • VarScan2X ∞¸ VCFX Ω∞ ALT allele– ’G/T’ Ò< 0Xîp tî îƒ Ñ – –Ï| ⌧›‰. 0| ⌧ ’G,T’X ⌅ )›< ¿Ω‰. 2tumor–⌧ minimum variant allele frequency 0.1, normal–⌧ maximum variant allele frequency 0.05
  • 9. Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach 9 $ cd $ perl -pe 's/tA //tA ,/' HCC1143_varscan .snp.vcf.Somatic.hc | perl -pe 's/tT //tT ,/'| perl -pe 's/tG //tG ,/'| perl -pe 's/tC //tC ,/' > HCC1143_varscan_filter .vcf • D0 ƒX mutation /⇠ $ cd $ grep -v "#" HCC1143_varscan_filter .vcf |wc -l 102 2.3 MuTect ‰â ✏ ¨⌅ D0 ©X0 (18Ñ) MuTect@ Broad–⌧ ⌧⌧⌧ 4 Bayesian probability with pre- and post- filteringD ⇠âXp, πà low allelic-fraction –⌧ sSNVs– t⌧ sensitive detectionD ⇠â‰. • MuTectî ê 1.6 Ñ⌅–⌧à ŸëX0 L8– ⌅¨ Java Ñ⌅D Ux ƒ– Dî‹ update-alternatives| t ©XÏ Ñ⌅D ¿Ω‰. $ cd $ ln -s /somatic_bench/app/mutect/muTect -1.1.4. jar ./ $ samtools index normal.bam $ samtools index tumor.bam $ cp /somatic_bench/reference/ccle.gatk.bed ./ $ update -alternatives --config java There are 2 choices for the alternative java (providing /usr/bin/java ). Selection Path Priority ------------------------------------------------------------ 0 /usr/lib/jvm/java -7- oracle/jre/bin/java 2 1 /usr/lib/jvm/java -6- oracle/jre/bin/java 1 * 2 /usr/lib/jvm/java -7- oracle/jre/bin/java 2 Press enter to keep the current choice [*], or type selection number: 1 update -alternatives : using /usr/lib/jvm/java -6- oracle/jre/bin/java $ java -version java version "1.6.0 _45" Java(TM) SE Runtime Environment (build 1.6.0_45 -b06) Java HotSpot(TM) 64-Bit Server VM (build 20.45 -b01 , mixed mode) $ java -jar muTect -1.1.4. jar --analysis_type MuTect --reference_sequence /somatic_bench/reference/ human_g1k_v37_decoy .fasta --cosmic /somatic_bench/reference/ b37_cosmic_v54_120711 .vcf --dbsnp /somatic_bench/reference/dbsnp_132_b37.leftAligned.vcf --input_file:normal normal.bam --input_file:tumor tumor.bam --out HCC1143_mutect .out --vcf HCC1143_mutect .vcf --coverage_file HCC1143.mutect.cov.wig.txt --normal_sample_name HCC1143_Normal --tumor_sample_name HCC1143_Tumor -L ccle.gatk.bed
• (Filter options) Predictions not labeled as 'REJECT' were accepted as confident somatic mutation predictions and passed on to the subsequent downstream validation and analysis steps.
• The GATK used for filtering requires Java 1.7, so switch the Java version with update-alternatives.
• Use GATK SelectVariants to keep only the variants whose FILTER field in the VCF is PASS (that is, excluding REJECT).

    $ cd
    $ update-alternatives --config java
    There are 2 choices for the alternative java (providing /usr/bin/java).

      Selection    Path                                      Priority
    ------------------------------------------------------------
      0            /usr/lib/jvm/java-7-oracle/jre/bin/java   2
      1            /usr/lib/jvm/java-6-oracle/jre/bin/java   1
    * 2            /usr/lib/jvm/java-7-oracle/jre/bin/java   2

    Press enter to keep the current choice[*], or type selection number: 2
    update-alternatives: using /usr/lib/jvm/java-7-oracle/jre/bin/java
    $ java -version
    java version "1.7.0_72"
    Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_mutect.vcf \
        -o HCC1143_mutect_filter.vcf \
        -sn HCC1143_Tumor -sn HCC1143_Normal \
        -select 'vc.isNotFiltered()'

• The same selection of PASS variants (excluding REJECT) can also be made with the --excludeFiltered option of SelectVariants.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_mutect.vcf \
        -o HCC1143_mutect_filter.vcf \
        -sn HCC1143_Tumor -sn HCC1143_Normal \
        --excludeFiltered

• Number of mutations after filtering

    $ cd
    $ grep -v "#" HCC1143_mutect_filter.vcf | wc -l
    109

2.4 Summary

• Programs used: VarScan2, SomaticSniper, MuTect, GATK
• Output: somatic mutations remaining after the first round of filtering (161, 102, 112)
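As a quick cross-check on the SelectVariants step in 2.3, the number of unfiltered records can also be counted directly from the VCF text. The following is only a minimal sketch: the function name `count_unfiltered` is made up for illustration, and it assumes that "not filtered" means a FILTER column (field 7) equal to PASS or to the missing value ".".

```shell
# Count VCF data records whose FILTER column (7th tab-separated field)
# is PASS or "." (no filter applied). Header lines starting with '#' are skipped.
# count_unfiltered is a hypothetical helper name, not part of GATK or MuTect.
count_unfiltered() {
    awk -F'\t' '!/^#/ && ($7 == "PASS" || $7 == ".") { n++ } END { print n + 0 }' "$1"
}
```

Running `count_unfiltered HCC1143_mutect.vcf` should give a number comparable to the `grep -v "#" ... | wc -l` count on the filtered file; a mismatch would be a reason to look at how the FILTER column is populated.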
3 Finding Full Consensus / Partial Consensus sSNVs

We look for the full consensus calls of the three SNV detection tools SomaticSniper, VarScan2, and MuTect. Multi-allelic sites and indels are removed first.

3.1 Extracting only bi-allelic SNPs

• From the pre-filtered results, remove multi-allelic sites and keep only SNPs.
• With GATK SelectVariants, set -selectType to SNP (possible values: INDEL, SNP, MIXED, MNP, SYMBOLIC, NO_VARIATION) and -restrictAllelesTo to BIALLELIC (MULTIALLELIC or BIALLELIC).

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_mutect_filter.vcf \
        -o HCC1143_mutect_1.vcf \
        -selectType SNP -restrictAllelesTo BIALLELIC
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_somaticsniper_filter.vcf \
        -o HCC1143_somaticsniper_1.vcf \
        -selectType SNP -restrictAllelesTo BIALLELIC
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_varscan_filter.vcf \
        -o HCC1143_varscan_1.vcf \
        -selectType SNP -restrictAllelesTo BIALLELIC

3.2 Finding Full Consensus / Partial Consensus

• A partial consensus (SomaticSniper/MuTect, MuTect/VarScan2, VarScan2/SomaticSniper) is the consensus between two of the three somatic callers.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_somaticsniper_1.vcf \
        --concordance HCC1143_mutect_1.vcf \
        -o HCC1143_SM.vcf
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_mutect_1.vcf \
        --concordance HCC1143_varscan_1.vcf \
        -o HCC1143_MV.vcf
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_varscan_1.vcf \
        --concordance HCC1143_somaticsniper_1.vcf \
        -o HCC1143_VS.vcf
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SM.vcf \
        --concordance HCC1143_varscan_1.vcf \
        -o HCC1143_SMV.vcf

3.3 Counting Full Consensus / Partial Consensus calls

• Count the full consensus and partial consensus calls.

    $ cd
    $ grep -v "#" HCC1143_SM.vcf | wc -l
    45
    $ grep -v "#" HCC1143_MV.vcf | wc -l
    38
    $ grep -v "#" HCC1143_VS.vcf | wc -l
    42
    $ grep -v "#" HCC1143_SMV.vcf | wc -l
    32

3.4 Summary

• Programs used: GATK
• Output: consensus / partial consensus data
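The pairwise consensus counts in 3.3 can be sanity-checked without GATK by intersecting the call sets as plain text. This is a rough sketch under stated assumptions: two calls are treated as the same variant when CHROM, POS, REF, and ALT all match, which may not be exactly how SelectVariants --concordance decides, and the helper names `vcf_keys` and `count_shared` are invented for this example.

```shell
# Print one sorted, unique CHROM:POS:REF:ALT key per data record of a VCF.
# (hypothetical helper; matching on these four columns is an assumption)
vcf_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | sort -u
}

# Count the variant keys that occur in both VCF files.
count_shared() {
    a=$(mktemp)
    b=$(mktemp)
    vcf_keys "$1" > "$a"
    vcf_keys "$2" > "$b"
    # comm -12 prints only the lines common to both sorted inputs
    comm -12 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}
```

For example, `count_shared HCC1143_somaticsniper_1.vcf HCC1143_mutect_1.vcf` would be expected to land near the SomaticSniper/MuTect count reported above; any difference points to a difference in matching rules rather than necessarily to an error.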
4 Applying Additional Filters

Specificity can be increased by using the GATK UnifiedGenotyper.

4.1 Calling normal and tumor variants with UnifiedGenotyper (8 min)

• Use GATK UnifiedGenotyper to call SNPs in the normal and tumor samples.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -o HCC1143_gatk.tumor.vcf \
        -I tumor.bam \
        --genotype_likelihoods_model SNP \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        -L ccle.gatk.bed
    $ java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
        -o HCC1143_gatk.normal.vcf \
        -I normal.bam \
        --genotype_likelihoods_model SNP \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        -L ccle.gatk.bed

4.2 Filtering SNVs - full consensus (may be skipped)

• Using the normal and tumor variants generated by GATK UnifiedGenotyper, keep only the SNVs predicted in the tumor but not in the germline.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SMV.vcf \
        --discordance HCC1143_gatk.normal.vcf \
        -o HCC1143_SMV_discordance_normal.vcf
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SMV_discordance_normal.vcf \
        --concordance HCC1143_gatk.tumor.vcf \
        -o HCC1143_final_filter_concordance.vcf

4.3 Filtering SNVs - partial consensus (SomaticSniper/MuTect)

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SM.vcf \
        --discordance HCC1143_gatk.normal.vcf \
        -o HCC1143_SM_discordance_normal.vcf
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SM_discordance_normal.vcf \
        --concordance HCC1143_gatk.tumor.vcf \
        -o HCC1143_SM_final_filter_concordance.vcf

4.4 Counting Full Consensus / Partial Consensus calls after the GATK filter

• Count the consensus and partial consensus calls that remain after the GATK filter.

    $ cd
    $ grep -v "#" HCC1143_final_filter_concordance.vcf | wc -l
    32
    $ grep -v "#" HCC1143_SM_final_filter_concordance.vcf | wc -l
    45

4.5 Summary

• Programs used: GATK
• Output: consensus / partial consensus data with the GATK filter applied
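The two-step filter in 4.2 and 4.3 (discordant with the normal calls, then concordant with the tumor calls) amounts to a set difference followed by a set intersection. The sketch below illustrates that logic with awk only; it assumes sites are matched on CHROM and POS alone and that the three input files are distinct, and `tumor_only` is a hypothetical name, so it should be read as an approximation of the SelectVariants behaviour rather than a replacement for it.

```shell
# Print the data records of a consensus VCF that are absent from the
# normal call set and present in the tumor call set.
# Usage: tumor_only consensus.vcf normal.vcf tumor.vcf
# (hypothetical helper; sites are matched on CHROM:POS only)
tumor_only() {
    awk -F'\t' '
        /^#/ { next }                                         # skip header lines
        FILENAME == ARGV[1] { normal[$1 ":" $2] = 1; next }   # first file: normal sites
        FILENAME == ARGV[2] { tumor[$1 ":" $2] = 1; next }    # second file: tumor sites
        !(($1 ":" $2) in normal) && (($1 ":" $2) in tumor)    # third file: consensus
    ' "$2" "$3" "$1"
}
```

Piping the result through `wc -l`, for instance `tumor_only HCC1143_SMV.vcf HCC1143_gatk.normal.vcf HCC1143_gatk.tumor.vcf | wc -l`, gives a count that can be compared with the 4.4 figures.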
5 Validation

Using the list of mutations reported for the HCC1143 sample in COSMIC and CCLE, we check how well the predictions agree with it. The validation.list file is either taken from the copy stored on the server or downloaded from https://gist.github.com/hongiiv/42194181ce6402d8b629.

5.1 Preparing the COSMIC and CCLE data

• Copy the list of mutations reported for the HCC1143 sample in COSMIC and CCLE (103 in total).

    $ cd
    $ cp /somatic_bench/reference/validation.list ./
    $ cat validation.list | wc -l
    103

5.2 Running the validation - consensus / partial consensus

• Check how many of the final filtered consensus / partial consensus (SomaticSniper/MuTect) calls match the list.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_final_filter_concordance.vcf \
        -o all.val.filter.vcf \
        -L validation.list
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SM_final_filter_concordance.vcf \
        -o sm.val.filter.vcf \
        -L validation.list
    $ grep -v "#" all.val.filter.vcf | wc -l
    6
    $ grep -v "#" sm.val.filter.vcf | wc -l
    9

• Check how many of the consensus calls made before the additional GATK filter match the list.

    $ cd
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SMV.vcf \
        -o all.val.vcf \
        -L validation.list
    $ java -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --variant HCC1143_SM.vcf \
        -o sm.val.vcf \
        -L validation.list
    $ grep -v "#" all.val.vcf | wc -l
    6
    $ grep -v "#" sm.val.vcf | wc -l
    9

Each pair below is (number of calls / number of calls matching validation.list).

• consensus: before GATK filter (32/6) - after GATK filter (32/6)
• partial consensus-SM: before GATK filter (45/9) - after GATK filter (45/9)

5.3 Summary

• Programs used: GATK
• Output: number of final consensus / partial consensus calls that match COSMIC and CCLE
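The validation pairs above are easier to compare as percentages. The snippet below is a small illustrative helper (the name `overlap_rate` is not from the original material) that turns a matched count and a total into a percentage; reading the 5.2 pairs as "matched calls out of total calls" is an interpretation of the numbers shown, not something the handout states explicitly.

```shell
# Print MATCHED / TOTAL as a percentage with two decimals.
# Usage: overlap_rate MATCHED TOTAL   (hypothetical helper)
overlap_rate() {
    awk -v m="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "NA" }            # avoid division by zero
        else        { printf "%.2f\n", 100 * m / t }
    }'
}

overlap_rate 6 32   # full consensus: prints 18.75
overlap_rate 9 45   # SomaticSniper/MuTect partial consensus: prints 20.00
```

The same helper can be used against the size of validation.list (103 entries) to see what share of the listed mutations each call set recovers.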
6 Other Somatic Mutation Callers - Strelka, Virmid

6.1 Strelka (1 min 38 s)

Strelka is a somatic mutation caller that uses Bayesian probability with posterior filtering, released by Illumina in 2012. It supports bwa as well as Illumina's own aligners isaac and eland. It is run somewhat differently from most programs, in the same way as Illumina's isaac aligner: it relies on make, a tool that uses a Makefile to manage a project as a set of modules and to process them consistently.

• Running Strelka requires a file that stores its options. Default option files are provided for the three aligners bwa, eland, and isaac.
• For exome or targeted sequencing, set isSkipDepthFilters = 1 in the default options.

    $ ll /somatic_bench/app/strelka-1.0.14/etc/
    total 20
    drwxrwxr-x 2 viz  viz  4096 Jul 10  2014 ./
    drwxr-xr-x 7 root root 4096 Jan 30 11:06 ../
    -rw-rw-r-- 1 viz  viz  3658 Jul 10  2014 strelka_config_bwa_default.ini
    -rw-rw-r-- 1 viz  viz  3683 Jul 10  2014 strelka_config_eland_default.ini
    -rw-rw-r-- 1 viz  viz  3821 Jul 10  2014 strelka_config_isaac_default.ini

• Set variables for the directory where Strelka is installed and for the directory where the analysis results are stored.
• Copy the default option file and generate the analysis commands with configureStrelkaWorkflow.pl.
• Run the generated analysis commands through make; the -j option sets the number of threads (CPUs) used for the analysis.
• Separate VCF files are produced for indels and SNVs, each as a passed set and a raw (all) somatic set, giving four result files in total.
    $ STRELKA_INSTALL_DIR=/somatic_bench/app/strelka-1.0.14/
    $ echo $STRELKA_INSTALL_DIR
    /somatic_bench/app/strelka-1.0.14/
    $ WORK_DIR=/root/myWork
    $ cp $STRELKA_INSTALL_DIR/etc/strelka_config_isaac_default.ini config.ini
    $ $STRELKA_INSTALL_DIR/bin/configureStrelkaWorkflow.pl \
        --normal=/root/normal.bam \
        --tumor=/root/tumor.bam \
        --ref=/somatic_bench/reference/human_g1k_v37_decoy.fasta \
        --config=config.ini \
        --output-dir=./myAnalysis
    $ cd ./myAnalysis
    $ make -j 8
    $ ll myAnalysis/results/
    total 88
    drwxr-xr-x 2 root root  4096 Jan 30 11:39 ./
    drwxr-xr-x 5 root root  4096 Jan 30 11:37 ../
    -rw-r--r-- 1 root root 13452 Jan 30 11:37 all.somatic.indels.vcf
    -rw-r--r-- 1 root root 36736 Jan 30 11:37 all.somatic.snvs.vcf
    -rw-r--r-- 1 root root  7098 Jan 30 11:37 passed.somatic.indels.vcf
    -rw-r--r-- 1 root root 16070 Jan 30 11:37 passed.somatic.snvs.vcf

• Check the number of final passed somatic SNVs.

    $ cd myAnalysis/results/
    $ grep -v "#" passed.somatic.snvs.vcf | wc -l
    62
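Section 6.1 notes that exome or targeted data should be run with isSkipDepthFilters = 1, but the listing above copies the default file unchanged. One way to apply the setting to the copied config.ini before running configureStrelkaWorkflow.pl is sketched here. It assumes the option sits on its own line in the form `isSkipDepthFilters = <value>`, and `set_skip_depth_filters` is a hypothetical helper name.

```shell
# Set isSkipDepthFilters to 1 in a copied Strelka option file.
# A backup of the original is kept next to it with a .bak suffix.
# (hypothetical helper; assumes a line of the form "isSkipDepthFilters = <value>")
set_skip_depth_filters() {
    sed -i.bak 's/^isSkipDepthFilters[[:space:]]*=.*/isSkipDepthFilters = 1/' "$1"
}
```

After `set_skip_depth_filters config.ini`, a quick `grep isSkipDepthFilters config.ini` confirms whether the line was found and changed; if grep prints nothing, the assumption about the file layout does not hold for that Strelka version.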
6.2 Virmid (33 min)

Virmid is software released in 2013 by Prof. Sangwoo Kim. It estimates the proportion of normal sample contained in the tumor (α).

• Check the number of final passed somatic SNVs.

    $ java -jar /somatic_bench/app/Virmid-1.1.1/Virmid.jar \
        -R /somatic_bench/reference/human_g1k_v37_decoy.fasta \
        -D /root/tumor.bam \
        -N /root/normal.bam \
        -t 8 \
        -w /root/virmid
    $ cd /root/virmid
    $ ls -al
    total 98024
    drwxr-xr-x 2 root     4096 Jan 30 16:00 ./
    drwxr-xr-x 8 root     8192 Jan 30 15:32 ../
    -rw-r--r-- 1 root  1252161 Jan 30 16:03 tumor.bam.virmid.germ.all.vcf
    -rw-r--r-- 1 root   955213 Jan 30 16:03 tumor.bam.virmid.germ.passed.vcf
    -rw-r--r-- 1 root      262 Jan 30 16:00 tumor.bam.virmid.gm
    -rw-r--r-- 1 root    36564 Jan 30 16:03 tumor.bam.virmid.loh.all.vcf
    -rw-r--r-- 1 root     2233 Jan 30 16:01 tumor.bam.virmid.loh.passed.vcf
    -rw-r--r-- 1 root      992 Jan 30 16:03 tumor.bam.virmid.report
    -rw-r--r-- 1 root  1364144 Jan 30 15:29 tumor.bam.virmid.sample.control.bai
    -rw-r--r-- 1 root 53107377 Jan 30 15:29 tumor.bam.virmid.sample.control.bam
    -rw-r--r-- 1 root  1364104 Jan 30 15:29 tumor.bam.virmid.sample.disease.bai
    -rw-r--r-- 1 root 41746178 Jan 30 15:29 tumor.bam.virmid.sample.disease.bam
    -rw-r--r-- 1 root    84053 Jan 30 16:03 tumor.bam.virmid.som.all.vcf
    -rw-r--r-- 1 root     6883 Jan 30 16:03 tumor.bam.virmid.som.passed.vcf
    $ grep -v "#" tumor.bam.virmid.som.passed.vcf | wc -l
    78
7 Linux for Genome Research

7.1 Practice Linux server

• Server address: xxx.xxx.xxx.xxx
• ID: edu01, edu02
• Password: kogo2015
• Web access: http://xxx.xxx.xxx.xxx:8787

7.2 Connecting to the practice Linux server - Windows users

• Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
• Download putty.exe for Intel x86.
• Host Name: xxx.xxx.xxx.xxx / Port: xx
• When the Security Alert window appears, choose 'Yes (Y)'.
• Login: use the ID and password assigned to you.

7.3 Connecting to the practice Linux server - Mac or Linux users

• On a Mac (OS X), open 'Applications, Utilities, Terminal app'. On Linux, start a terminal from the menu or from your distribution's program menu.

    $ ssh user_id@host_name
    $ ssh root@127.0.0.1

• Use the ssh command to connect to the practice Linux server. On the first connection, answer yes; a password prompt then appears, where you enter the password assigned to you.

7.4 Finding out Linux system information

This document is written for Ubuntu, one of the Linux distributions (footnote 3: Linux distributions fall broadly into the Red Hat family and the Debian family, and each family contains many distributions). Unless noted otherwise, everything applies to any kind of Linux. Linux is an operating system that runs in many distributions and on many kinds of hardware. You need to know which environment your Linux runs in so that you can install software that matches it.

• How to identify which Linux distribution you are using. Ubuntu is a freely distributed Linux operating system; the latest version at the time of writing is 14.04 LTS (Long Term Support) (footnote 4: code name Trusty Tahr).

    $ cat /etc/issue.net
    Ubuntu 12.04.1 LTS

• Linux runs on many hardware platforms, and software that supports Linux is provided as separate executables for each platform. Knowing your hardware therefore lets you download the software that fits it. The hardware type of a Linux server can be shown with the '-m' (machine) option. 'x86' denotes an Intel-based CPU and '64' denotes 64-bit hardware (footnote 5: often abbreviated to x64).

    $ uname -m
    x86_64
• The kernel is the core of the Linux operating system; it carries out the user's commands on the actual hardware. The kernel version differs between distributions. The latest Linux kernel at the time of writing is 3.14.3, released on May 6, 2014. Linux distributions are built on top of these kernel releases. Let us look up the kernel information of our Linux system.

    $ uname -r
    3.2.0-32-virtual

• The shell is the environment that reads Linux commands and runs them. 'PATH' is one of the environment variables, which are values that influence how processes behave. export is the command that sets the value of such a variable. When a command is typed, the shell first searches the directories listed in PATH to see whether the command exists there and then runs it. So if you install software yourself and want to run it from anywhere, you must add it to PATH; otherwise it can only be run from inside its installation directory. The shell's environment variables can be listed with the 'env' command, and PATH is set with 'export'.

    $ env | grep PATH
    MANPATH=/usr/local/texlive/2013/texmf/doc/man:
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    INFOPATH=/usr/local/texlive/2013/texmf/doc/info:
    $ export PATH=/BIO/app/bwa-0.7.5a/:$PATH
    $ env | grep PATH

7.5 Linux file system

Linux manages a single physical disk by dividing it logically into several areas (partitions); a file system is created on each partition to manage files and directories.

• Linux is a multi-user system, and every user has an area of their own, the home directory. Inside the home directory you can create and delete files. The 'cd' command moves to the home directory, and 'pwd' shows the path of the current directory.

    $ cd
    $ pwd
    /home/hongiiv

• Creating a directory and moving into it

    $ cd
    $ mkdir sample_data
    $ ls -la
    total 2203488
    drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
    drwxr-xr-x  3 root    root    4096 May  7 13:14 ..
    -rw-------  1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
    -rw-r--r--  1 hongiiv hongiiv  220 May  7 13:14 .bash_logout
    -rw-r--r--  1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
    drwxr-xr-x  2 root    root    4096 May 29 10:34 sample_data
    $ cd sample_data
    $ pwd
    /home/hongiiv/sample_data

• Deleting directories and files
    $ cd
    $ rm -rf sample_data
    $ ls -la
    total 2203488
    drwxr-xr-x 16 hongiiv hongiiv 4096 May 29 10:34 .
    drwxr-xr-x  3 root    root    4096 May  7 13:14 ..
    -rw-------  1 hongiiv hongiiv 1908 May 10 11:59 .bash_history
    -rw-r--r--  1 hongiiv hongiiv  220 May  7 13:14 .bash_logout
    -rw-r--r--  1 hongiiv hongiiv 3763 May 10 17:06 .bashrc
    $

• Viewing the Linux file systems

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1       19G   14G  4.8G  74% /
    udev            3.9G  4.0K  3.9G   1% /dev
    tmpfs           1.6G  188K  1.6G   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            3.9G     0  3.9G   0% /run/shm
    /dev/xvdb1       79G   38G   38G  50% /home/hongiiv/test

• Viewing the partition information of the physical hard disks - the 21.5 GB physical disk /dev/xvda consists of two partitions, xvda1 and xvda2, which carry the Linux and Linux swap file system types.

    $ fdisk -l

    Disk /dev/xvda: 21.5 GB, 21474836480 bytes
    255 heads, 63 sectors/track, 2610 cylinders, total 41943040 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00034212

        Device Boot      Start         End      Blocks   Id  System
    /dev/xvda1            2048    40038399    20018176   83  Linux
    /dev/xvda2        40038400    41940991      951296   82  Linux swap / Solaris

    Disk /dev/xvdb: 300.6 GB, 300647710720 bytes
    171 heads, 35 sectors/track, 98112 cylinders, total 587202560 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x3459a991

        Device Boot      Start         End      Blocks   Id  System
    /dev/xvdb1            2048   587202559   293600256   8e  Linux LVM

• Checking the file system mount information

    $ cat /etc/fstab
    proc        /proc  proc  nodev,noexec,nosuid  0 0
    /dev/xvda1  /      ext3  errors=remount-ro    0 1
    /dev/xvda2  none   swap  sw                   0 0
7.6 Adding a hard disk to Linux

• After confirming the newly added hard disk with fdisk, the disk becomes usable in three steps: partitioning, creating a file system, and mounting. To make a USB device available in Linux, only the mount step is needed.

    $ fdisk /dev/xvdb
    $ mkfs.ext3 /dev/xvdb1
    $ mkdir /new_hdd
    $ mount /dev/xvdb1 /new_hdd
    $ cd /new_hdd
    $ df -h

7.7 File-related commands

• touch - creates an empty file of size 0 bytes, or changes the timestamp of an existing file. It is a command you will occasionally meet when installing genome-related software or during training.

    $ touch a
    $ ls -al
    -rw-r--r-- 1 root root 0 Jun 18 10:04 a
    $ date
    Wed Jun 18 10:05:10 KST 2014
    $ touch -c a
    $ ls -al
    -rw-r--r-- 1 root root 0 Jun 18 10:05 a

• cat - shows the contents of a file, and is also used to write short scripts. The command 'cat > test' creates a file named test and writes what you type into it. When you have finished typing, press 'ctrl+D' to leave.

    $ cat > test
    hi there my name is hong
    $ cat test
    hi there my name is hong
    $ ls -al
    -rw-r--r-- 1 root root 25 Jun 18 10:09 test

• Counting the files in a given directory

    $ ls -l . | grep ^- | wc -l
    50

• Printing a file without the lines that start with a given string. In files such as VCF, lines that start with '#' are comments; here we print the actual list of variants without the comments or, conversely, only the comment lines.

    $ cd /BIO/data/gatk
    $ grep -v "#" dbsnp_138.hg19.vcf | wc -l
    8087914
    $ grep -F "#" dbsnp_138.hg19.vcf | wc -l
    165
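One caveat about the two grep commands above: `grep -v "#"` drops every line that contains a '#' anywhere, not only the lines that start with one, so a data record with '#' inside an INFO or ID value would be lost from the count. Anchoring the pattern with '^' restricts the match to the start of the line. The helper names below are illustrative only.

```shell
# Count VCF data records: lines that do NOT start with '#'.
count_records() {
    grep -vc '^#' "$1"
}

# Count VCF header/comment lines: lines that DO start with '#'.
count_header() {
    grep -c '^#' "$1"
}
```

`grep -c` prints the number of selected lines directly, so the extra `wc -l` is not needed. On a well-formed dbSNP file both approaches are likely to agree, but the anchored form is the safer habit.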
• Printing only a specific column. The column can then be sorted in dictionary order with 'sort -d', and the occurrences of each value can be counted with 'uniq -c'.

    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | more
    chrM
    chrM
    chrM
    chrM
    chrM
    chrM
    chrM
    chrM
    chrM
    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | sort -d
    chr1
    chr2
    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{print $1}' | uniq -c
        475 chrM
    4723878 chr1
    3363561 chr2
    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}'
    chrM is: 16390
    chrM is: 16391
    chrM is: 16429
    chrM is: 16445
    chrM is: 16499

• Saving the printed output to a file ('>' overwrites, '>>' appends)

    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{if ($1 == "chrM") printf "chrM is: %s\n", $2}' > ~/chr_pos.txt
    $ grep -v "#" dbsnp_138.hg19.vcf | awk '{if ($1 == "chr1") printf "chr1 is: %s\n", $2}' >> ~/chr_pos.txt

7.8 Linux network information

• Network interface information: the inet addr of eth0 is the address through which the Linux server can be reached from outside (footnote 6: the server address 172.27.252.234 shown here may differ depending on your own practice environment).

    $ ifconfig
    eth0      Link encap:Ethernet  HWaddr 02:00:5b:73:00:33
              inet addr:172.27.252.234  Bcast:172.27.255.255
              inet6 addr: fe80::5bff:fe73:33/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:501386 errors:0 dropped:0 overruns:0 frame:0
              TX packets:346879 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:19357734604 (1 GB)  TX bytes:2720265191 (2 GB)
              Interrupt:68

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:4337 errors:0 dropped:0 overruns:0 frame:0
              TX packets:4337 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:2203478 (2.2 MB)  TX bytes:2203478 (2.2 MB)

7.9 Understanding compression on Linux

Linux supports many compression formats, and software and data are usually distributed as compressed files.

• How to unpack the various compression formats used on Linux. Each practice archive contains a document; a prize goes to the first person who opens it.

    $ cd
    $ cp -R /BIO/data/compress ./compress
    $ cd compress
    $ gzip -d compress01.gz
    $ tar xvfz compress02.tar.gz
    $ unzip compress03.zip
    $ bzip2 -d compress04.bz2
    $ tar xvfz compress05.tar.gz
    $ tar xvf compress06.tar.bz2

• gzip: Recommended for fast network connections
• bzip2: Recommended for slower network connections (smaller size but takes longer to compress)
• zip: Not recommended but is provided as an option for those who cannot open the above formats
• For large compressed genome data, the contents of a file can be inspected without unpacking it. This is useful for checking FASTQ files and the like.

    $ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf.gz | more
    $ gzip -dc CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.tar.gz | tar -tvf -

7.10 Installing software on Linux

In general there are three ways to install software on Linux. The first is a binary (executable) file distributed in compressed form, which can be used as soon as it is unpacked. The second is a package provided by the Linux distribution; Ubuntu uses the package manager APT for this. The third is to install from source files.

7.10.1 Installing software with APT

• Updating the package list and installing a package with APT

    $ apt-get update
    $ apt-get install bwa
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    Use 'apt-get autoremove' to remove them.
    Suggested packages:
      samtools
    The following NEW packages will be installed:
      bwa
    0 upgraded, 1 newly installed, 0 to remove and 153 not upgraded.
    Need to get 135 kB of archives.
    After this operation, 286 kB of additional disk space will be used.
    Fetched 135 kB in 3s (40.1 kB/s)
    Selecting previously unselected package bwa.
    (Reading database ...17 files and directories currently installed.)
    Unpacking bwa (from .../archives/bwa_0.6.1-1_amd64.deb) ...
    Processing triggers for man-db ...
    Setting up bwa (0.6.1-1) ...
    $ bwa

    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.6.1-r104
    Contact: Heng Li <lh3@sanger.ac.uk>

    Usage:   bwa <command> [options]

    Command: index       index sequences in the FASTA format
             aln         gapped/ungapped alignment
             samse       generate alignment (single ended)
             sampe       generate alignment (paired ended)
             bwasw       BWA-SW for long queries
             fastmap     identify super-maximal exact matches
             fa2pac      convert FASTA to PAC format
             pac2bwt     generate BWT from PAC
             pac2bwtgen  alternative algorithm for generating BWT
             bwtupdate   update .bwt to the new format
             bwt2sa      generate SA from BWT and Occ
             pac2cspac   convert PAC to color-space PAC
             stdsw       standard SW/NW alignment

• The following packages should be installed in advance before installing NGS-related software.

    $ apt-get update -y
    $ apt-get install gcc -y
    $ apt-get install make -y
    $ apt-get install zlib1g-dev -y
    $ apt-get install libncurses5-dev -y
    $ apt-get install g++ -y
    $ apt-get install tcl tk -y
    $ apt-get install tcl-dev -y
    $ apt-get install unzip -y
    $ apt-get install curl -y
    $ apt-get install screen -y
    $ apt-get install python-dev -y
    $ apt-get install python-software-properties -y
    $ add-apt-repository ppa:webupd8team/java
    $ apt-get update -y
    $ apt-get install oracle-java7-installer -y
7.10.2 Installing software by compiling source code

• Installing from source

    $ cd
    $ cp /BIO/app/bwa-0.7.4.tar.bz2 ./
    $ tar xvf bwa-0.7.4.tar.bz2
    $ cd bwa-0.7.4
    $ make
    $ ./bwa

    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.7.4-r385
    Contact: Heng Li <lh3@sanger.ac.uk>

    Usage:   bwa <command> [options]

    Command: index       index sequences in the FASTA format
             mem         BWA-MEM algorithm
             fastmap     identify super-maximal exact matches
             pemerge     merge overlapping paired ends (EXPERIMENTAL)
             aln         gapped/ungapped alignment
             samse       generate alignment (single ended)
             sampe       generate alignment (paired ended)
             bwasw       BWA-SW for long queries
             fa2pac      convert FASTA to PAC format
             pac2bwt     generate BWT from PAC
             pac2bwtgen  alternative algorithm for generating BWT
             bwtupdate   update .bwt to the new format
             bwt2sa      generate SA from BWT and Occ
    $ bwa

    Program: bwa (alignment via Burrows-Wheeler transformation)
    Version: 0.6.2-r126
    Contact: Heng Li <lh3@sanger.ac.uk>

    Usage:   bwa <command> [options]
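The last two invocations above report different versions (0.7.4 through `./bwa`, 0.6.2 through the bare name `bwa`), which is what one would expect when the bare name is resolved through PATH to an older installation. The sketch below shows one way to check which file a command name resolves to and to put a freshly built directory first in the search order. The helper names are made up for this example, and the change only lasts for the current shell session.

```shell
# Print the path that a bare command name resolves to in the current shell.
resolve_tool() {
    command -v "$1"
}

# Put a directory at the front of PATH so its executables are found first.
# The change applies to the current shell session only.
prepend_path() {
    PATH="$1:$PATH"
    export PATH
}
```

For instance, running `resolve_tool bwa` before and after `prepend_path "$HOME/bwa-0.7.4"` should show the resolved path switching to the newly compiled binary, assuming the build directory from the listing above.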