Testing storage and metadata backends
Hugo González Labrador, Arno Formella
LIA2, University of Vigo
CS3: Cloud Storage Services for Novel Applications and Workflows
Zürich, January 2016
Outline
• Origin of the project
• Architecture
• Storage backends
• Benchmark results
• Conclusion
• Outlook
Why did we develop ClawIO?
• Cloud Synchronisation Benchmarking Framework
• Curiosity about testing data and metadata backends that are new to synchronisation platforms.
• Flexible: plug in your own implementations in any language and technology.
• Experiment with a new design that:
  • Avoids the synchronisation between the DB (containing the metadata) and the filesystem (containing the data) that most sync platforms using a local filesystem have to perform.
• Experience from being a technical student at CERN working on the CERNBox project.
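Architecture
[Architecture diagram. Labels recovered from the figure:
 clients: ownCloud CLIENTS, 3rd PARTY APPS, TECHNICAL USERS (CLI);
 interfaces: HTTP (REST API), gRPC, Both;
 services: AUTH, SHARE, SYNC (ownCloud Sync Protocol), META, DATA;
 backends: LOCAL FS (XATTR/NOXATTR), EOS, S3/SWIFT/RADOSGW;
 other: MGM]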
What metadata do we keep?
Data
Metadata used by ownCloud
Metadata Key    Metadata Value
CHECKSUM        md5:8c8d357b5e872bbacd45197626bd5759
MTIME           104857600
PATH            /local/users/d/demo/file
FILEID          a8584c90-ae2a-4dd8-84a7-f18ced109cce
ETAG            956494f8-5120-4165-afba-ad5f8d13b8ef
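The five keys on this slide can be pictured as one small record per file. The following is a minimal Python sketch, not ClawIO's actual data model: the class name, the field names and the helper functions are hypothetical, and generating FILEID and ETAG as random UUIDs is an assumption based only on the shape of the sample values. The unit of MTIME is not stated on the slide, so it is kept as a plain integer.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    """One record holding the five keys listed on the slide (names are illustrative)."""
    checksum: str  # "<algorithm>:<hex digest>", for example "md5:..."
    mtime: int     # integer timestamp; the unit is not stated on the slide
    path: str
    file_id: str
    etag: str


def md5_checksum(data: bytes) -> str:
    """Format a checksum the way the sample value is written: 'md5:' plus the hex digest."""
    return "md5:" + hashlib.md5(data).hexdigest()


def new_metadata(path: str, data: bytes, mtime: int) -> Metadata:
    """Build a record for freshly written content."""
    # Assumption: FILEID and ETAG are random UUIDs, as the sample values suggest.
    return Metadata(
        checksum=md5_checksum(data),
        mtime=mtime,
        path=path,
        file_id=str(uuid.uuid4()),
        etag=str(uuid.uuid4()),
    )
```

For example, `new_metadata("/local/users/d/demo/file", b"hello", 104857600)` returns a record whose `checksum` starts with `md5:` and whose `file_id` and `etag` are two different UUID strings.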
SETUP ONE: Local FS with MySQL as the metadata store
[Diagram: the META UNIT holds FILEID, CHECKSUM, MTIME, ETAG and PATH; the DATA UNIT (Ext4 FS) holds the DATA.]
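To make the idea of a relational metadata store concrete, here is a minimal sketch of a table keyed by path. It is only an illustration: Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server, and the table layout, column names and function names are assumptions, not the schema used in the talk.

```python
import sqlite3

# Hypothetical schema: one row per file, holding the five metadata keys.
SCHEMA = """
CREATE TABLE metadata (
    path     TEXT PRIMARY KEY,
    fileid   TEXT NOT NULL UNIQUE,
    checksum TEXT,
    mtime    INTEGER NOT NULL,
    etag     TEXT NOT NULL
)
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory database (a stand-in for the MySQL metadata store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def put(conn, path, fileid, checksum, mtime, etag):
    """Insert or overwrite the metadata row for `path` (INSERT OR REPLACE is SQLite syntax)."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata (path, fileid, checksum, mtime, etag) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, fileid, checksum, mtime, etag),
    )
    conn.commit()


def stat(conn, path):
    """Return (path, fileid, checksum, mtime, etag) for `path`, or None if unknown."""
    cur = conn.execute(
        "SELECT path, fileid, checksum, mtime, etag FROM metadata WHERE path = ?",
        (path,),
    )
    return cur.fetchone()
```

In a layout like this, every write to the filesystem needs a matching write to the table, which is the DB/filesystem synchronisation that the design in this talk tries to avoid.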
SETUP TWO: Local FS with MySQL and XATTRs as metadata stores
[Diagram: the META UNIT holds FILEID, CHECKSUM, MTIME, ETAG and PATH; the DATA UNIT (Ext4 FS) holds the DATA, with FILEID and CHECKSUM also stored as XATTRs.]
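Storing FILEID and CHECKSUM as extended attributes means they travel with the file itself. A minimal sketch of how that could look from Python on Linux follows; the attribute namespace `user.clawio.` and the function names are hypothetical, and `os.setxattr` is only available on Linux and only works on filesystems mounted with user xattr support. The setter is injectable so the logic can be exercised without touching a real filesystem.

```python
import os

# Hypothetical attribute namespace; user-defined xattrs on Linux must start with "user.".
XATTR_PREFIX = "user.clawio."


def to_xattrs(file_id: str, checksum: str) -> dict:
    """Map the two per-file values to xattr names and byte values."""
    return {
        XATTR_PREFIX + "fileid": file_id.encode("utf-8"),
        XATTR_PREFIX + "checksum": checksum.encode("utf-8"),
    }


def write_xattrs(path: str, attrs: dict, setter=None) -> None:
    """Attach every attribute in `attrs` to the file at `path`.

    `setter` defaults to os.setxattr (Linux only); pass another callable
    with the same (path, name, value) signature to test without xattr support.
    """
    setter = setter or os.setxattr
    for name, value in attrs.items():
        setter(path, name, value)
```

On a Linux host with an xattr-capable filesystem, `write_xattrs(path, to_xattrs(file_id, checksum))` writes both values, and `os.getxattr(path, "user.clawio.fileid")` reads one back.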
SETUP THREE: Local FS with REDIS and XATTRs as metadata stores
[Diagram: the META UNIT holds FILEID, CHECKSUM, MTIME, ETAG and PATH; the DATA UNIT (Ext4 FS) holds the DATA, with FILEID and CHECKSUM also stored as XATTRs.]
VM Machine Specs, Deployment scenario and Operations to benchmark
CPU         64 cores Intel(R) Xeon(R) CPU E5-4640 v2 @ 2.20GHz
RAM         64 GB
DISK        SAS-3 12 Gb/s 4 TB Seagate Constellation Enterprise ST4000NM3401 (RAID6)
Deployment  16 services (servers) on the same VM; bench client also on the same VM
Operations  Stat, Upload
VM
infrastructure provided by
STAT benchmark
• The STAT operation is similar to a Unix file stat operation or a WebDAV PROPFIND.
• Its objective is to retrieve the metadata associated with a particular resource (file or folder).
• For each level of concurrency, 10 000 requests are triggered, and the test is repeated 5 times.
• The operation uses gRPC over HTTP/2.
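The measurement procedure described for STAT (a fixed number of requests per concurrency level, repeated several times) can be sketched with a thread pool. This is not the ClawIO bench client: the function names are hypothetical, the operation is any Python callable rather than a gRPC call, and threads in one process are only a rough stand-in for concurrent clients.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_level(operation, concurrency, requests):
    """Run `requests` calls of `operation` on `concurrency` threads.

    Returns the observed throughput in requests per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # list() waits for every call to finish and re-raises any exception.
        list(pool.map(lambda _: operation(), range(requests)))
    elapsed = time.perf_counter() - start
    return requests / max(elapsed, 1e-9)  # guard against a zero-length interval


def benchmark(operation, levels, requests=10_000, repetitions=5):
    """Return {concurrency level: [requests/second for each repetition]}."""
    return {
        level: [run_level(operation, level, requests) for _ in range(repetitions)]
        for level in levels
    }
```

A local stand-in for the STAT call could be `benchmark(lambda: os.stat("/tmp"), levels=[1, 4, 16])` after `import os`; the defaults of 10 000 requests and 5 repetitions mirror the numbers on the slide.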
UPLOAD benchmark
• The upload operation uploads a file randomly chosen from a fixed set of 100 files that follows the distribution of files observed on CERNBox.
• The chosen file is uploaded 5 000 times per concurrency level, to random target destinations to avoid overwrites. The benchmark is repeated 5 times.
• The operation uses HTTP/1.1.
• Upload triggers metadata propagation.
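The upload workload (a random source file from a fixed set, sent to a random destination so that no upload overwrites another) can be sketched as a plan of (source, target) pairs. The function name, the `/bench` prefix and the use of a random UUID as the destination directory are assumptions made for illustration, and this sketch picks a source file for every upload, which is one possible reading of the slide.

```python
import posixpath
import random
import uuid


def plan_uploads(files, count=5000, prefix="/bench", rng=None):
    """Return `count` (source, target) pairs.

    Each source is drawn at random from `files`; each target lives in its own
    randomly named directory so that no upload overwrites a previous one.
    """
    rng = rng or random.Random()
    plan = []
    for _ in range(count):
        source = rng.choice(files)
        target = posixpath.join(prefix, uuid.uuid4().hex, posixpath.basename(source))
        plan.append((source, target))
    return plan
```

Passing `rng=random.Random(0)` makes the choice of source files repeatable across runs, while the targets stay unique because they come from `uuid.uuid4()`.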
CONCLUSIONS
STAT COMPARISON
UPLOAD COMPARISON
What can we extract from these results?
• Retrieving the file hierarchy from the FS favours novel use cases and access to existing data repositories.
• In-memory databases increase performance and can scale to a high number of records (with a 70-byte memory footprint per file, 64 GiB => about 981 million files).
• Use of XATTRs makes the system more consistent.
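A quick check of the arithmetic behind the memory estimate, as a sketch: it divides the available memory by the per-file footprint and ignores any other overhead of the database, so it is an upper bound under the slide's 70-byte assumption. The function name is hypothetical.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def max_files(memory_bytes: int, bytes_per_file: int) -> int:
    """Number of per-file records that fit into the given amount of memory."""
    return memory_bytes // bytes_per_file


print(max_files(64 * GIB, 70))  # 981706810, roughly 9.8e8 files
```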
WE THINK WE ARE ON THE RIGHT TRACK
Piotr’s Analytics System
What comes next?
• Improve performance (more parallelisation)
• Run more benchmarks: upload with checksums, overwrites, remove, move, fuzz
• Perform benchmarks on a cluster (the ClawIO design scales out)
• Implement more backends: EOS, S3/SWIFT/RADOSGW (plug in your backend; suggestions are welcome)
• Implement sharing and run benchmarks on shared folders
• Test other sync protocols: Seafile, StorageSync?
Thank you
Q&A?
Acknowledgements
• Centro de Investigación, Transferencia e Innovación (CITI)
• PhD Jakub T. Mościcki @CERN
• Piotr Mrowczynski @Technical University of Denmark
• LIA2, University of Vigo

Testing data and metadata backends with ClawIO
