Terabyte-scale image similarity search: experience and best practice
Diana Moise², Denis Shestakov¹,², Gylfi Gudmundsson², Laurent Amsaleg³
¹ Department of Media Technology, School of Science, Aalto University, Finland
² Inria Rennes – Bretagne Atlantique, France
³ IRISA - CNRS, France
Denis Shestakov
denis.shestakov at aalto.fi
linkedin: linkedin.com/in/dshestakov
mendeley: mendeley.com/profiles/denis-shestakov
Terabyte-scale image search in Europe?
Overview
1. Background: image retrieval, our focus, environment, etc.
2. Applying Hadoop to multimedia retrieval tasks
3. Addressing the Hadoop cluster heterogeneity issue
4. Studying workloads that require a large auxiliary data structure for processing
5. Experimenting with a very large image dataset
Image search?
Content-based image search:
● Find matches with similar content
Image search applications?
● regular image search
● object recognition
○ face, logo, etc.
● for systems like Google Goggles
● augmented reality applications
● medical imaging
● analysis of astrophysics data
Our use case
● Copyright violation detection
● Our scenario:
○ Searching for a batch of images
■ Querying for thousands of images in one run
■ Focus on throughput, not on response time for an individual image
● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary
Image search with Hadoop
● Index & search a huge image collection using the MapReduce-based eCP algorithm
○ See our work at ICMR'13: Indexing and searching 100M images with MapReduce [18]
○ See Section III for a quick overview
● Use the Grid5000 platform
○ Distributed infrastructure available to French researchers & their partners
● Use the Hadoop framework
Experimental setup: cluster
● Grid5000 platform:
○ Nodes in the Rennes site of Grid5000
■ Up to 110 nodes available
■ Node capacity/performance varied
● Heterogeneous, coming from three clusters
● From 8 cores to 24 cores per node
● From 24GB to 48GB RAM per node
Experimental setup: framework
● Standard Apache Hadoop distribution, ver.1.0.1
○ (!) No changes in Hadoop internals
■ Pros: easy for others to migrate, try and compare
■ Cons: not top performance

○ Tools provided by the Hadoop framework
■ Hadoop SequenceFiles (see the sketch below)
■ DistributedCache
■ multithreaded mappers
■ MapFiles
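A minimal sketch of the first of these tools, Hadoop SequenceFiles, in the Java Hadoop 1.x API. The record layout (Text image id, BytesWritable SIFT descriptor) and the paths are illustrative assumptions, not necessarily the exact format used in our experiments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs records into one large, splittable SequenceFile instead of millions of
// small files (easier on the NameNode, better map-task granularity).
public class PackIntoSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);   // e.g. an HDFS path such as /data/sift/part-00000.seq

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // Placeholder record: one image id and one 128-dimensional SIFT descriptor.
      Text key = new Text("image-000001");
      byte[] descriptor = new byte[128];
      writer.append(key, new BytesWritable(descriptor));
      // ... append the remaining descriptors here ...
    } finally {
      writer.close();
    }
  }
}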
Experimental setup: dataset
● 110 mln images (~30 billion SIFT descriptors)
○ Collected from the Web and provided by one of the partners in the Quaero project
■ Largest reported in the literature
○ Images resized to 150px on the largest side
○ Worked with
■ The whole set (~4TB)
■ A subset of 20 mln images (~1TB)
○ Used as a distracting dataset
Experimental setup: querying
● For evaluation of indexing quality:
○ Added to the distracting datasets:
■ INRIA Copydays (127 images)
○ Queried for
■ Copydays batch (~3000 images = 127 original images and their associated variants, incl. strong distortions, e.g. print-crumple-scan)
■ 12k batch (~12000 images = 245 random images from the dataset and their variants)
■ 25k batch
○ Checked whether the original images were returned as the top-voted search results
Image search with Hadoop
Distributed index creation
● Clustering images into a large set of clusters (max cluster size = 5000)
● Mapper input:
○ unsorted SIFT descriptors
○ index tree (loaded by every mapper)
● Mapper output (see the sketch below):
○ (cluster_id, SIFT)
● Reducer output:
○ SIFTs sorted by cluster_id
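A minimal sketch of such an indexing mapper (Java, Hadoop 1.x new API) is given below; it is an illustration, not our actual code. The (Text, BytesWritable) input types follow the SequenceFile sketch earlier, and IndexTree is a hypothetical class standing in for the in-memory eCP index tree.

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexingMapper
    extends Mapper<Text, BytesWritable, IntWritable, BytesWritable> {

  private IndexTree tree;                          // hypothetical index-tree class
  private final IntWritable clusterId = new IntWritable();

  @Override
  protected void setup(Context context) throws IOException {
    // The index tree (~1.8GB for the 4TB dataset) is shipped to every node
    // via DistributedCache and loaded once per map task.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    tree = IndexTree.load(cached[0]);
  }

  @Override
  protected void map(Text imageId, BytesWritable sift, Context context)
      throws IOException, InterruptedException {
    // Assign the descriptor to its closest cluster in the index tree and emit
    // (cluster_id, SIFT); reducers then group and sort descriptors by cluster_id.
    clusterId.set(tree.assign(sift.getBytes(), sift.getLength()));
    context.write(clusterId, sift);
  }
}

On the reduce side the descriptors arrive grouped by cluster_id, so the reducer essentially writes them out in that order.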
Image search with Hadoop
Indexing workload characteristics
● computationally-intensive (map phase)
● data-intensive (at map & reduce phases)
● large auxiliary data structure (i.e., index tree)
○ grows as dataset grows
○ e.g., 1.8GB for 110M images (4TB)

● map input < map output
● network is heavily utilized during shuffling
Image search with Hadoop
Image search with Hadoop
Searching workflow
● large aux. data structure (e.g., lookup table)
Index search with Hadoop: results
● Basic settings:
○ 512MB chunk size
○ 3 replicas
○ 8 map slots
○ 2 reduce slots
● 4TB dataset:
○ 4 map slots
Hadoop on heterogeneous clusters
Capacity/performance of nodes in our cluster varied:
○ Nodes come from three clusters
○ From 8 cores to 24 cores per node
○ From 24GB to 48GB RAM per node
○ Different CPU speeds

● Hadoop assumes one configuration (#mappers, #reducers, max. map/reduce memory, ...) for all nodes
● Not good for Hadoop clusters like ours
Hadoop on heterogeneous clusters
● Our solution (hack):
○ deploy Hadoop on all nodes with settings sized for the least-equipped nodes
○ create sub-cluster configuration files adjusted to the better-equipped nodes
○ restart tasktrackers on the better-equipped nodes with the new configuration files

● We call it ‘smart deployment’
● Considerations:
○ Perhaps the rack-awareness feature of Hadoop should be complemented with smart-deployment functionality
Hadoop on heterogeneous clusters
● Results

○ indexing 1TB on 106 nodes: 75min → 65min
Large auxiliary data structure
● Some workloads require all mappers to load a large data structure
○ E.g., both the image indexing and searching workloads

● Spreading the data file to all nodes:
○ Hadoop DistributedCache (see the sketch below)

● Not efficient if the structure is gigabytes in size
● Partial solution: increase HDFS block size → decrease #mappers
● Another solution: multithreaded mappers provided by Hadoop
○ Poorly documented feature!
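A minimal sketch of shipping the auxiliary structure with DistributedCache (Hadoop 1.x API; the HDFS path and job name are placeholders). The receiving side is the setup() method in the indexing-mapper sketch earlier.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithCachedIndexTree {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ship the index tree to every node; each tasktracker copies it from HDFS
    // to local disk once per job, which is already costly for a multi-GB file,
    // and every map task still has to load it into memory.
    DistributedCache.addCacheFile(new URI("/aux/index-tree.bin"), conf);

    Job job = new Job(conf, "ecp-indexing");
    // ... configure mapper/reducer, input/output formats and paths ...
    job.waitForCompletion(true);
  }
}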
Large auxiliary data structure
● A multithreaded mapper spawns a configured number of threads; each thread executes a map task (see the configuration sketch below)
● Mapper threads share the RAM
● Downsides:
○ synchronization when reading input
○ synchronization when writing output
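Below is a minimal sketch of enabling this with Hadoop's MultithreadedMapper (new API, Hadoop 1.x). IndexingMapper is the hypothetical mapper from the earlier sketch, and the thread count matches the 4TB run on the next slide; everything else is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedIndexingJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "ecp-indexing-mt");

    // MultithreadedMapper is the mapper the framework actually runs; it spawns
    // the threads and runs one IndexingMapper instance per thread, all inside
    // one task JVM (so a single in-memory copy of the index tree can be shared,
    // e.g. via a static field).
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, IndexingMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 2);   // 2 threads per slot, as in the 4TB run

    // ... configure reducer, input/output formats and paths ...
    job.waitForCompletion(true);
  }
}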
Large auxiliary data structure
● Let’s test it!

● Indexing 4TB with 4 mapper slots, each running 2 threads
○ index tree size: 1.8GB
● Indexing time: 8h27min → 6h8min
Large auxiliary data structure
● In some applications, mappers need only the part of the auxiliary data structure relevant to the data block being processed
● Solution: Hadoop MapFile (see the sketch below)
● See Section 5.C.2
○ Searching with 3k-25k image batches
○ Though the results are rather inconclusive
● Stay tuned!
○ A proper study of MapFile is now in progress
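As an illustration only (our MapFile study is still in progress), the sketch below shows how a task could fetch just one entry of a lookup table stored as a Hadoop MapFile keyed and sorted by cluster id; the path and the Text value type are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class PartialLookup {
  // Fetches only the lookup-table entry for one cluster id. A MapFile is a
  // sorted SequenceFile plus an index, so get() seeks to the requested key
  // instead of loading the whole table into memory.
  public static Text lookup(int clusterId) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    MapFile.Reader reader = new MapFile.Reader(fs, "/aux/lookup-table", conf);
    try {
      Text value = new Text();    // placeholder value type
      return (Text) reader.get(new IntWritable(clusterId), value);
    } finally {
      reader.close();
    }
  }
}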
Open questions
● A practical one:
○ What are the best practices for analyzing Hadoop job execution logs?
● Analysis of Hadoop job logs turned out to be very useful in our project
○ Done with our own Python/Perl scripts
● It is extremely useful for understanding and then tuning Hadoop jobs on large Hadoop clusters
● Any good existing libraries/tools?
○ E.g., Starfish Hadoop Log Analyzer (Duke Univ.)
Open questions
E.g., search (12k batch over 1TB) job execution on 100 nodes
Observations & implications
● HDFS block size limits scalability
○ 1TB dataset => 1186 blocks of 1024MB size
○ Assuming 8-core nodes and the reported searching method: no scaling beyond 149 nodes (i.e. 8x149=1192)
○ Solutions:
■ Smaller HDFS blocks, e.g., scaling up to 280 nodes with 512MB blocks (see the sketch below)
■ Re-visit the search process: e.g., partial loading of the lookup table

● Big data is here, but not the resources to process it
○ E.g., indexing & searching >10TB was not possible given the resources we had
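Since the HDFS block size is a per-file property chosen when the data is written, the first solution above can be applied when (re)writing the indexed dataset. The sketch below shows the relevant Hadoop 1.x client-side settings, with the slide's numbers recalled in comments; the class and method names are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;

public class BlockSizeSettings {
  // Returns a configuration whose output files use 512MB HDFS blocks.
  // With ~1TB of indexed data: 1024MB blocks -> ~1186 blocks -> ~149 8-core
  // nodes fully used; 512MB blocks -> roughly twice as many blocks -> scales
  // to about 280 nodes (numbers from the slide above).
  public static Configuration searchFriendlyConf() {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 512L * 1024 * 1024);  // block size of files this client/job writes
    conf.setInt("dfs.replication", 3);                   // replication factor, as in the basic settings
    return conf;
  }
}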
Things to share
● Our methods/system can be applied to audio datasets
○ No major changes expected
○ Contact me/Diana if interested

● Code for MapReduce-eCP algorithm available on request
○ Should run smoothly on your Hadoop cluster
○ Interested in comparisons

● Hadoop job history logs behind our experiments available on request
○ They describe indexing/searching our dataset, with details on map/reduce task execution
○ Insights on better analysis/visualization are welcome
○ E.g., job logs supporting our CBMI'13 work: http://goo.gl/e06wE
Acknowledgements
● Aalto University http://www.aalto.fi
● Quaero project http://www.quaero.org
● Grid5000 infrastructure & its Rennes maintenance team http://www.grid5000.fr
Supporting publications
[18] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.
[20] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
[this-bigdata13] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.
[submitted] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing and searching with Hadoop.
Thank you!
