Sai Teja Vissamsetti (700645566)
Sarika Batte (700647682)
Chandana Sripathi (700641627)
Krishna Chaitanya Koti (700648083)
Krishna Chaitanya Gollavilli (700638821)
Sree Navya Kovvuri (700645739)
Sai Priyanka Reddy Addaboina (700648561)
BIG DATA: ANALYSING GENOMICS AND
THE BDG PROJECT
- Dr. Bo Li
Next-generation DNA sequencing is rapidly transforming the life
sciences into a data-driven field.
• Traditional computational methods are difficult to use
• More digitalised versions are being developed
INTRODUCTION
• We show the experienced bioinformatician how to perform typical genomics tasks in
the context of Spark.
• ADAM comprises a set of genomics-specific Avro schemas, Spark-based APIs, and
command-line tools for large-scale genomics analysis.
• We introduce the general Spark user to a new set of Hadoop-friendly serialization and
file formats.
OVERVIEW of the Project
• Free, Java-based programming framework
• Runs on clusters of thousands of nodes, handling thousands of terabytes
• Rapid data transfer
• Continues operating uninterrupted in case of node failure
• This framework is used by:
  • Google
  • Yahoo
  • IBM
• Scalable, cost-effective, flexible, fast, resilient to failure
HADOOP
• A software framework for writing applications that reliably process vast
amounts of data on large clusters
• Basic concept:
  • Divide: the input dataset is split into chunks, which are processed by map
tasks in parallel.
  • Sort: the framework sorts the outputs of the map tasks.
  • Conquer: the sorted outputs are merged and given as input to the reduce tasks.
• The framework handles:
  • Scheduling
  • Data distribution
  • Synchronization
  • Errors and faults
Map Reduce
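The divide, sort and conquer steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's API: the function names (`map_task`, `shuffle_and_sort`, `reduce_task`, `map_reduce`) and the toy sequence chunks are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_task(chunk):
    """Map: emit a (base, 1) pair for every base in one chunk of sequence."""
    return [(base, 1) for base in chunk]


def shuffle_and_sort(pairs):
    """Sort: group the emitted values by key, with the keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce: merge all values for one key into a single count."""
    return key, sum(values)


def map_reduce(chunks):
    # On a real cluster each chunk would go to a separate map task in parallel;
    # here the map tasks simply run one after another.
    mapped = chain.from_iterable(map_task(chunk) for chunk in chunks)
    return dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))


# Toy input, already divided into chunks (made-up sequence, not real data).
chunks = ["ACGT", "AACC", "GGTA"]
base_counts = map_reduce(chunks)
print(base_counts)
```

Scheduling, data distribution, synchronization and fault handling are exactly the parts this sketch leaves out; in MapReduce the framework takes care of them.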
• Also called a sequence-specific DNA-binding factor
• Controls the rate of transcription of genetic information from DNA to messenger RNA
• Larger genomes tend to have more transcription factors
TRANSCRIPTION FACTOR
GM12878 - Genetic variation studies
K562 - Erythropoiesis
HepG2 - Metabolism disorders
HEK293 - Embryonic kidney
H54 - Glioblastoma
BJ - Skin fibroblast
Data Types
• Bioinformaticians have their own specific file formats
Example:
  • .fasta
  • .sam
  • .gtf
  • .narrowPeak
  • .vcf etc.
• Accessing similar data stored in different file formats is difficult
• These formats are ASCII encoded
• ASCII encoding is inefficient
DECOUPLING STORAGE
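To make the point about ASCII concrete: a DNA base has only four possible values, so two bits are enough, while ASCII spends a full byte on each base. The sketch below packs four bases per byte and compares sizes. It is an illustration only, not how Avro or Parquet actually encode data, and the `pack` and `unpack` helpers and the sample sequence are made up for this example.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        group = sequence[start:start + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        packed.append(byte)
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


sequence = "ACGTACGTAC"  # made-up sequence, not real data
ascii_bytes = sequence.encode("ascii")
packed_bytes = pack(sequence)
print(len(ascii_bytes), "bytes as ASCII,", len(packed_bytes), "bytes packed")
```

The original length has to be stored alongside the packed bytes, because the last byte may be padded. Real formats also need to represent ambiguous bases such as N, which this sketch ignores.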
• An open-source, high-performance, distributed platform for genomic
analysis
• ADAM defines:
  • A data schema and layout on disk
  • A Scala API
  • A command-line interface
What is ADAM?
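As a rough illustration of what a data schema means here: Avro schemas are written in JSON, with a record type that lists named, typed fields. The sketch below declares a hypothetical `ToyGenotype` record and checks Python dictionaries against it. The record name, its fields and the `conforms` and `validate` helpers are assumptions for this example; this is not ADAM's actual schema, and a real project would use an Avro library rather than a hand-written check.

```python
import json

# Hypothetical Avro-style record schema (not taken from ADAM).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "ToyGenotype",
  "fields": [
    {"name": "sampleId", "type": "string"},
    {"name": "contig", "type": "string"},
    {"name": "position", "type": "long"},
    {"name": "alleles", "type": {"type": "array", "items": "string"}}
  ]
}
"""

# Only the two primitive types used above are supported in this sketch.
PYTHON_TYPES = {"string": str, "long": int}


def conforms(value, avro_type):
    """Check one value against a primitive type or an array type."""
    if isinstance(avro_type, dict) and avro_type.get("type") == "array":
        return isinstance(value, list) and all(
            conforms(item, avro_type["items"]) for item in value
        )
    expected = PYTHON_TYPES[avro_type]
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected) and not isinstance(value, bool)


def validate(record, schema):
    """Return True if the record has exactly the schema's fields, correctly typed."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(conforms(record[field["name"]], field["type"]) for field in fields)


schema = json.loads(SCHEMA_JSON)
good = {"sampleId": "S1", "contig": "1", "position": 100, "alleles": ["A", "G"]}
bad = {"sampleId": "S1", "contig": "1", "position": "100", "alleles": ["A", "G"]}
print(validate(good, schema), validate(bad, schema))
```

Because the schema is declared once and separately from any one file format, every tool that reads the data can agree on the same fields and types.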
• VMware version 5.5 – Cloudera
• Java version 1.8
• Tool: ADAM
• Apache Avro
• Spark
SOFTWARE USED
• An in-memory, data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data is kept in memory unless it has to move between nodes
• Presents a functional programming API, along with support for
iterative programming.
• Used at scale on clusters with more than 2,000 nodes and 4 TB datasets
• Current leading MapReduce-style framework:
  • First in-memory MapReduce-style platform
  • Used at scale in industry, supported in major distributions:
    • Cloudera
    • Hortonworks
    • MapR
• The API:
  • Fully functional API
  • Main API in Scala; also supports Java, Python, R
  • Manages failures
WHY SPARK?
SPARK
• Open source
• In memory, on disk
• Can be written in Scala
• API: Scala, Java, Python
• Easy to program
• Doesn’t need abstractions
• Fewer security features than MapReduce
MAP REDUCE
• Open source
• On disk
• Can be written in Java
• API: Java, Python, Scala
• Difficult to program
• Needs abstractions
• More security features
MAP REDUCE vs SPARK
Ingesting the full 1000 Genomes genotype data set:
• Download the raw data directly into HDFS
• Unzip it in flight
• Run an ADAM job to convert the data to Parquet
Querying Genotypes from the 1000
Genomes Project
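The ingest steps above can be imitated on a small scale with Python's standard library. In this sketch an in-memory `io.BytesIO` stands in for the download stream, `gzip.open` decompresses it line by line as it is read (the "in flight" part), and each VCF line is turned into one row per sample that can then be queried. It does not use HDFS, ADAM or Parquet; the `genotype_rows` helper, the row field names and the two-variant toy VCF are assumptions made up for this example.

```python
import gzip
import io

# Made-up two-variant VCF with two samples (not real 1000 Genomes data).
TOY_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1\n"
)


def genotype_rows(compressed_stream):
    """Decompress line by line and yield one row per (variant, sample)."""
    samples = []
    with gzip.open(compressed_stream, mode="rt", encoding="ascii") as lines:
        for line in lines:
            if line.startswith("##"):
                continue  # skip meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the FORMAT column
                continue
            for sample, genotype in zip(samples, fields[9:]):
                yield {
                    "contig": fields[0],
                    "position": int(fields[1]),
                    "ref": fields[3],
                    "alt": fields[4],
                    "sample": sample,
                    "genotype": genotype,
                }


def carries_alt(genotype):
    """True if any allele in a GT string such as '0|1' is the first ALT allele."""
    return "1" in genotype.replace("/", "|").split("|")


# Stand-in for the compressed download stream.
stream = io.BytesIO(gzip.compress(TOY_VCF.encode("ascii")))
rows = list(genotype_rows(stream))

# Query: which samples carry the alternate allele at position 200?
carriers = [
    row["sample"]
    for row in rows
    if row["position"] == 200 and carries_alt(row["genotype"])
]
print(carriers)
```

At full scale the same shape of work is split across a cluster, and the rows are written to a columnar format such as Parquet, so that a query like this one reads only the columns it needs.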
Building ADAM
Building Spark