Spark Meetup, December 2015
Noam Barkai
noamb@nrgene.com
Overview
● Food shortage: new problems, new solutions
● Intermezzo: how DNA works
● Tach’les (down to business): what we do with Apache Spark
The planet has gotten very populous
And it’s the only one we’ve got
World Population
Annual Growth Rate:
Peak - 2.1% (1962)
Current - 1.1% (2009)
source: https://en.wikipedia.org/wiki/World_population#/media/File:World-Population-1800-2100.svg
Food intake
source: http://www.coolgeography.co.uk/A-level/AQA/Year%2012/Food%20supply/Patterns%20and%20intro/Food_consumption.gif
Upscale: Same area, more crops
Plant breeding
● An ancient art
● Incremental changes
● Slow but considerable
source: https://en.wikipedia.org/wiki/Zea_%28genus%29#/media/File:Maize-teosinte.jpg
How long does it take today?
Maize: 10-15 years
source: http://www.cropj.com/shimelis_6_11_2012_1542_1549.pdf
How breeding works
Computational genomics
⬇ Prices of DNA sequencing
⬆ Number of samples per crop sequenced and analyzed
⬆ Amount and quality of genomic data
⬇ Prices of computation
⬇ Prices of storage
We’re entering a new era
BIG DATA Genomics
Food security - a computational problem?
● The plant’s potential lies in its DNA.
● We analyze and compare sequences from many plants.
● Resulting in better predictions for breeding.
● Faster rate of crop improvement.
Intermezzo: DNA - how does it work?
● Four “letters”: cytosine (C), guanine (G), adenine (A), thymine (T)
● Encode 20 amino acids
● Combine to make: 100K+ proteins
Conceptually, we can think of this as a “pipeline”: “The Central Dogma”
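The “pipeline” view can be sketched in a few lines of Scala. This is a toy illustration rather than anything shown in the talk: the object name `CentralDogma` is hypothetical, the codon table lists only six of the 64 codons, and transcription is simplified to replacing T with U on the coding strand.

```scala
object CentralDogma {
  // Partial codon table (RNA codon -> one-letter amino acid, '*' = stop).
  // Only a handful of the 64 codons are listed, to keep the sketch short.
  val codonTable: Map[String, Char] = Map(
    "AUG" -> 'M', // methionine, also the usual start codon
    "UUU" -> 'F', // phenylalanine
    "GGC" -> 'G', // glycine
    "AAA" -> 'K', // lysine
    "UGG" -> 'W', // tryptophan
    "UAA" -> '*'  // stop
  )

  // Stage 1, transcription (simplified): copy the coding strand, T becomes U.
  def transcribe(dna: String): String = dna.toUpperCase.replace('T', 'U')

  // Stage 2, translation: read three letters at a time and stop at a stop codon.
  // Codons missing from the partial table (or incomplete ones) are treated as stops.
  def translate(rna: String): String =
    rna.grouped(3)
      .map(codon => codonTable.getOrElse(codon, '*'))
      .takeWhile(_ != '*')
      .mkString

  def main(args: Array[String]): Unit = {
    val dna = "ATGTTTGGCAAATGGTAA"
    println(translate(transcribe(dna)))
  }
}
```

Each stage is a pure function from one string to another, which is what makes the analogy to a data pipeline natural.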
DNA as storage
● Durable
● Supports random access
● Efficient sequential reads
● Easily replicated
● Contains error correction mechanisms
● Maximally “data local”
Part 2: What we do with Apache Spark
● Analyze lots of genome sequences.
● Apply similarity algorithms, find where they match.
● Finally, assist the breeding program.
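The slides do not say which similarity algorithms are used, so the following is only a generic illustration of “find where they match”: it indexes every substring of length k (a k-mer) and reports the offsets at which two sequences share one. The names `KmerSimilarity`, `kmerIndex`, `sharedPositions` and `jaccard` are hypothetical, and plain Scala collections stand in for distributed data.

```scala
object KmerSimilarity {
  // All substrings of length k, mapped to the offsets at which they occur.
  def kmerIndex(seq: String, k: Int): Map[String, List[Int]] =
    seq.sliding(k).zipWithIndex
      .filter { case (kmer, _) => kmer.length == k } // drop a short trailing window
      .toList
      .groupBy(_._1)
      .map { case (kmer, hits) => kmer -> hits.map(_._2) }

  // Offsets (posInA, posInB) where the two sequences share an identical k-mer.
  def sharedPositions(a: String, b: String, k: Int): List[(Int, Int)] = {
    val indexB = kmerIndex(b, k)
    kmerIndex(a, k).toList.flatMap { case (kmer, positionsA) =>
      for {
        pa <- positionsA
        pb <- indexB.getOrElse(kmer, Nil)
      } yield (pa, pb)
    }.sorted
  }

  // Jaccard similarity of the two k-mer sets: |A intersect B| / |A union B|.
  def jaccard(a: String, b: String, k: Int): Double = {
    val ka = kmerIndex(a, k).keySet
    val kb = kmerIndex(b, k).keySet
    if (ka.isEmpty && kb.isEmpty) 1.0
    else (ka intersect kb).size.toDouble / (ka union kb).size
  }

  def main(args: Array[String]): Unit = {
    println(sharedPositions("ACGTAC", "GTACGG", 3))
    println(jaccard("ACGTAC", "GTACGG", 3))
  }
}
```

Exact k-mer matching like this tolerates fragmentation but not sequencing errors inside a k-mer, which is one reason real pipelines need more elaborate methods.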
Input data is “noisy”
● Contains errors and gaps.
● Is fragmented.
● All due to sequencing technology.
Our setup
● Hadoop clusters on both private cloud and AWS
● Text data, stored as Parquet.
● MapR 5 Hadoop distro
● Spark 1.4.1
● SparkSQL and Hive (JDBC)
● Instances: ~150GB RAM, 40 cores.
● Provisioning: Ansible
Our data
● A dozen or so different crops, going for hundreds.
● Each crop: potentially ~1K fully sequenced samples
● ~100K “markers”.
● Each sequence: 1-10 Gbp long (giga base-pairs, i.e. billions of characters)
● Current: several terabytes, aiming at petabytes
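A back-of-the-envelope calculation shows how these figures lead to terabytes per crop. The sketch assumes one byte per base for plain text and, as an alternative, two bits per base for a packed A/C/G/T encoding. Both are simplifying assumptions: they ignore quality scores, ambiguity codes, metadata and compression, and the slides do not describe the actual encoding. With 1,000 samples of 1 Gbp each, plain text comes to 10^12 bytes, i.e. 1 TB; at 10 Gbp per sample it is 10 TB.

```scala
object GenomeStorage {
  val Giga: Long = 1000000000L    // 10^9
  val Tera: Long = 1000000000000L // 10^12

  // Bytes needed when every base is stored as one 8-bit character.
  def textBytes(samples: Long, basesPerSample: Long): Long =
    samples * basesPerSample

  // Bytes needed at 2 bits per base (A, C, G, T only), rounded up to whole bytes.
  def packedBytes(samples: Long, basesPerSample: Long): Long =
    (samples * basesPerSample + 3) / 4

  def main(args: Array[String]): Unit = {
    val samples = 1000L // "~1K fully sequenced samples" per crop
    for (gbp <- Seq(1L, 10L)) {
      val text = textBytes(samples, gbp * Giga)
      val packed = packedBytes(samples, gbp * Giga)
      println(s"$gbp Gbp x $samples samples: ${text.toDouble / Tera} TB as text, " +
        s"${packed.toDouble / Tera} TB packed")
    }
  }
}
```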
Working with Spark and Scala
● Scala’s type system is your friend
● Thinking functional takes time - and can be “overdone”
● Remember to add @tailrec when needed
● Scala case classes - great
● Nested structures keep you DRY, but can be sluggish.
● Scala has its pitfalls - profile.
● Spark as the “ultimate Scala collection” - Martin Odersky.
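Two of the tips above, `@tailrec` and case classes, are easy to show in isolation. The `Sample` type and its fields are hypothetical and chosen only for illustration; the GC-counting function is likewise just a small stand-in for per-sequence work.

```scala
import scala.annotation.tailrec

// Hypothetical record type; the field names are illustrative, not from the talk.
// A case class provides structural equality, hashCode, toString and copy.
final case class Sample(crop: String, id: String, sequence: String)

object ScalaTips {
  // Counts G and C bases. With @tailrec the compiler rejects the method if the
  // recursive call is not in tail position, so the loop cannot overflow the
  // stack even on very long sequences.
  def gcCount(seq: String): Int = {
    @tailrec
    def loop(i: Int, acc: Int): Int =
      if (i >= seq.length) acc
      else {
        val c = seq.charAt(i)
        loop(i + 1, if (c == 'G' || c == 'C') acc + 1 else acc)
      }
    loop(0, 0)
  }

  def main(args: Array[String]): Unit = {
    val s1 = Sample("maize", "s1", "GATTACA")
    val s2 = s1.copy(id = "s2") // copy changes one field and keeps the rest
    println(gcCount(s1.sequence))
    println(s1 == s2)
  }
}
```

The annotation does not make a function tail recursive; it only turns a silent performance or stack problem into a compile error.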
Integrations with Spark
● Complex unmanaged framework - the usual 20/80 rule: 20% fun algorithmic stuff, 80% integration/devops/tuning/black-voodoo
● Integration with Hive - doable but cumbersome
● DataFrames API - very clean
● Parquet in Spark 1.4 - seamless; Parquet with SparkSQL before 1.3 - rather painful
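For readers who have not used the DataFrames API with Parquet, here is a minimal round trip. It is a sketch under several assumptions: Spark 1.4 or later is on the classpath, the job runs in local mode, and the `Marker` row type, its columns and the temporary output path are all hypothetical. `DataFrame.write` and `SQLContext.read` are the reader/writer entry points introduced in Spark 1.4.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row type; column names are illustrative only.
// It is declared at top level so that toDF() can derive the schema.
case class Marker(crop: String, sampleId: String, position: Long)

object ParquetRoundTrip {
  // Writes a tiny DataFrame to Parquet, reads it back and counts rows per crop.
  def run(): Map[String, Long] = {
    val conf = new SparkConf().setAppName("parquet-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._ // enables rdd.toDF()

      // A path that does not exist yet, inside a fresh temporary directory.
      val outPath = Files.createTempDirectory("markers").resolve("markers.parquet").toString

      val markers = sc.parallelize(Seq(
        Marker("maize", "s1", 100L),
        Marker("maize", "s2", 250L),
        Marker("wheat", "s1", 75L)
      )).toDF()

      markers.write.parquet(outPath)                // schema is stored with the data
      val loaded = sqlContext.read.parquet(outPath) // schema is recovered on read

      loaded.groupBy("crop").count().collect()
        .map(row => row.getString(0) -> row.getLong(1))
        .toMap
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```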
Performance tuning with Spark
● If RDD objects need a lot of RAM → memory gets tricky.
● Spark UI in 1.4.1 - very nice
● PairRDD - you need to be your own “query optimizer”
● repartition / coalesce - very useful, but gets tricky if data variability is high (a dynamic real-time optimizer would be great).
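A common example of being your own “query optimizer” with pair RDDs is choosing `reduceByKey` over `groupByKey`, and `coalesce` over `repartition` when only shrinking the partition count. The sketch below assumes Spark 1.3 or later on the classpath (so the pair-RDD implicits need no extra import) and local mode; the object name, the function `sumByKey` and the sample records are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddTuning {
  // Sums the values of each key and returns the result to the driver.
  def sumByKey(sc: SparkContext,
               records: Seq[(String, Int)],
               inPartitions: Int,
               outPartitions: Int): Map[String, Int] = {
    val pairs = sc.parallelize(records, inPartitions)

    // reduceByKey pre-aggregates inside each partition before the shuffle.
    // groupByKey followed by a sum gives the same answer but ships every
    // record across the network first.
    val sums = pairs.reduceByKey(_ + _)

    // coalesce(n) merges partitions without a shuffle by default, which is cheap
    // but keeps any skew; repartition(n) always shuffles and rebalances.
    sums.coalesce(outPartitions).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[2]"))
    try {
      val records = Seq(("maize", 1), ("wheat", 2), ("maize", 3))
      println(sumByKey(sc, records, inPartitions = 8, outPartitions = 2))
    } finally {
      sc.stop()
    }
  }
}
```

Neither choice is made automatically at the RDD level, which is the point of the bullet: the trade-off between shuffle cost and balanced partitions has to be decided by hand.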
Testing, packaging and extending Spark
● Testing: “local” mode is great, but means no real unit tests :-(
● sbt-pack - good alternative to sbt-assembly.
● Spark packages: spark-csv, spark-notebook and more.
● Speaking of open-source packages...
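One way to live with the testing limitation is to keep the logic in plain functions that need no Spark at all, and to wrap the remaining local-mode checks in a small helper that always stops the context. This is a suggestion rather than something described in the talk; it assumes Spark on the classpath, and `withLocalSpark` and `isValidBase` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalSparkTesting {
  // Loan pattern: create a local-mode context, hand it to the test body,
  // and always stop it so that consecutive tests do not leak contexts.
  def withLocalSpark[T](threads: Int)(body: SparkContext => T): T = {
    val conf = new SparkConf()
      .setMaster(s"local[$threads]")
      .setAppName("local-test")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  // Logic under test, kept as a plain function so it can be unit-tested without Spark.
  def isValidBase(c: Char): Boolean = "ACGT".indexOf(c) >= 0

  def main(args: Array[String]): Unit = {
    // Pure unit check: no cluster, no context.
    assert(isValidBase('G') && !isValidBase('N'))

    // Local-mode check: closer to an integration test than to a unit test.
    val valid = withLocalSpark(2) { sc =>
      sc.parallelize(Seq("ACGT", "ACNT", "GGCC"))
        .filter(_.forall(isValidBase))
        .count()
    }
    assert(valid == 2L)
  }
}
```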
ADAM Project - Genomics using Spark
● Fully open sourced from
● Similarity algorithms
● Population clustering
● Predictive analysis using Deep Learning
● And more
Spark Meetup, December 2015
Noam Barkai
noamb@nrgene.com
Thank you

Using apache spark to fight world hunger - Israel spark meetup at taboola
