Supercharging Data Performance for
Real-Time Data Analysis
Information—the fuel of business—is trapped
in analysis platforms built on 70-year-old
architectures.
Data volume and velocity challenge traditional
computing methods
Traditional Approach:
• Commodity x86-based servers
• Cluster with open source software
• Scale for volume
• Scale for parallelism / performance
Challenges:
• High-level languages can be inefficient
• Data-intensive workloads drive in-memory solutions
• DRAM footprints at commodity prices are small
• Scaling out increases cost and complexity
Ryft delivers huge benefits in a small package.
Highest performance per watt and lowest total cost of ownership (TCO) of
any product on the market.
48 TB in 1U
• Data storage is abstracted as a set of Linux mount points
• Supports native encryption/decryption with no loss in performance (AES-256 encryption)
Simple API
• C library abstracts internal FPGA constructs to simplify programmability, allowing a programmer to invoke operations as simple function calls, returning simple results
• Command line
• Web interface
Linux Front End
• Linux (Ubuntu 14.04 LTS) front end: standard build, unrestricted OS, apt-get
• API calls FPGA fabric backend
• Linux services/protocols can be used
• ssh/scp/rsync/sftp
• Standard monitoring agents
• Web services
• Security configuration
x86 Architecture vs. Systolic Arrays
[Diagram: memory feeding a single PE (x86) versus memory feeding a chain of PEs (FPGA systolic array); each panel is labeled "One Clock Cycle" and "100 ns".]
FPGA Benefits
x86
• General purpose computing
• Sequential in nature
• Non-deterministic performance
• Interrupts
• Memory allocation
• Problems are broken into a sequence of operations and processed serially
• Increasing number of instructions
• Increased overhead
• Increased power/cooling requirements
• Software can break problems down and bring parallelism:
  • Between processors/cores
  • Between servers
  • Output combined over interconnects
FPGA
• Not general purpose
• Purpose-built algorithms
• Can be reprogrammed via firmware
• Parallel in nature
• Can execute many parallel operations in one clock cycle
• More output with less power and clock speed
• ~1000X fewer instructions than x86 to solve the same problem
• 100% deterministic performance
• No memory fetching or management
• No interrupts
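To make "many parallel operations in one clock cycle" concrete, here is a toy software model of a linear chain of PEs. It is a sketch for illustration only and not Ryft's design. Each PE holds one pattern character, and on every tick all PEs compare against the incoming character and pass a partial-match flag to their neighbor. The function name `pipeline_match`, the broadcast of each input character to every PE, and the "?" single-character wildcard are assumptions.

```python
def pipeline_match(text, pattern):
    """Return start offsets where pattern occurs in text.

    Toy model (assumption, not a vendor algorithm): PE j holds pattern[j];
    "?" in the pattern matches any single character.
    """
    m = len(pattern)
    if m == 0:
        return []

    # flags[j] is True when PE j has seen pattern[:j + 1] ending at the last tick.
    flags = [False] * m
    hits = []

    for tick, ch in enumerate(text):
        # In hardware these m comparisons would happen in the same clock cycle.
        # Python evaluates them one by one, but each new flag depends only on
        # the previous tick's state, which is what makes them parallelizable.
        flags = [
            (flags[j - 1] if j > 0 else True)
            and (pattern[j] == "?" or pattern[j] == ch)
            for j in range(m)
        ]
        if flags[-1]:
            hits.append(tick - m + 1)

    return hits


if __name__ == "__main__":
    print(pipeline_match("xxabcabc", "abc"))
```

The work per input character is one step for the whole chain regardless of pattern length, whereas a serial processor would spend one comparison per pattern position.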
Multi-Dimensional Systolic Arrays
[Diagram: PEs arranged in multi-dimensional grids.]
The Ryft ONE is powered by a breakthrough in
Real-time Data Analysis.
The only 1U platform capable of analyzing streaming, historical,
unstructured, and multi-structured data in real-time at 10 GB/second.
Ryft ONE avoids bottlenecks that strangle conventional systems
by combining these two innovations:
The Ryft Analytics Cortex™
Ryft ONE leverages a massively parallel bitwise
computing architecture to deliver unprecedented
performance from the smallest possible form factor.
The Ryft Algorithm Primitives™ Library
Each Ryft ONE comes with a subscription to this
growing collection of pre-built algorithm components,
and an open API to leverage them.
“We see Spark Streaming scales nearly linearly to 100 nodes, and can
process up to 6 GB/s at sub-second latency on 100 nodes for Grep, 2.3
GB/s for the other, more CPU-intensive jobs”
UC Berkeley, Streaming Computation at Scale
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
Ryft transforms datacenter economics.
The Ryft ONE:
• Search = 10 GB/s
• Term Frequency = 2.5 GB/s
Costly & Complex Clusters:
• Search = 6 GB/s
• Term Frequency = 2.3 GB/s
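As a rough check on the comparison above, the cluster size needed to match a target throughput can be extrapolated from the quoted Spark figures (6 GB/s and 2.3 GB/s, each on 100 nodes). The sketch below assumes scaling stays linear beyond 100 nodes, which the quoted paper does not claim; the function name `nodes_to_match` is illustrative.

```python
import math


def nodes_to_match(target_gb_per_s, measured_gb_per_s, measured_nodes):
    """Estimate nodes needed to reach target_gb_per_s.

    Assumption: throughput scales linearly with node count, including
    beyond the cluster size at which it was measured.
    """
    if measured_gb_per_s <= 0 or measured_nodes <= 0:
        raise ValueError("measured throughput and node count must be positive")
    return math.ceil(target_gb_per_s * measured_nodes / measured_gb_per_s)


if __name__ == "__main__":
    # Search: 10 GB/s target vs. 6 GB/s measured on 100 nodes.
    print("search:", nodes_to_match(10, 6, 100))
    # Term frequency: 2.5 GB/s target vs. 2.3 GB/s measured on 100 nodes.
    print("term frequency:", nodes_to_match(2.5, 2.3, 100))
```

Under that assumption the estimate comes to 167 nodes for search and 109 for term frequency.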
Wikipedia Examples
• English XML Dump is offered by Wikipedia
• Total Corpus is 44GB
• Copying the data takes 44 seconds
• Fuzzy search would take 4.4 seconds
• Term Frequency would take 17.6 seconds
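The timings above follow from dividing the corpus size by a sustained throughput. The short sketch below reproduces that arithmetic. The 1 GB/s copy rate is inferred from "44 GB in 44 seconds" and is not stated on the slide; the other rates are the 10 GB/s and 2.5 GB/s figures quoted elsewhere in the deck.

```python
CORPUS_GB = 44.0

# Sustained throughput in GB/s per operation.
RATES_GB_PER_S = {
    "copy": 1.0,            # inferred from 44 GB in 44 s (assumption)
    "fuzzy search": 10.0,   # Ryft ONE search rate quoted in the deck
    "term frequency": 2.5,  # Ryft ONE term frequency rate quoted in the deck
}


def scan_seconds(size_gb, rate_gb_per_s):
    """Time to stream size_gb once at a constant rate, ignoring setup cost."""
    if rate_gb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb / rate_gb_per_s


if __name__ == "__main__":
    for operation, rate in RATES_GB_PER_S.items():
        print(f"{operation}: {scan_seconds(CORPUS_GB, rate):.1f} s")
```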
Data Exploration Use Case
• RDF: understanding of native formats
• Powerful no-index search
• Flexible query format with wildcarding
• Identify relationships between disparate data
Data Triage for Hadoop/Spark Use Case
[Diagram labels: Raw Data, HDFS, M/R, noSQL, Hive, Text Index, Application; time to result: "Hours? Days?"]
Data Triage for Hadoop/Spark Use Case
[Diagram labels: Ingest @ 1-4 GB/s; Search / Minimize @ 10 GB/s; HDFS; time to result: "Seconds!"]
• Social media signal/noise
• Fuzzy searching at line rate
Search: “badguy??”
Example handles: @badguy1, @badguy2, @badguy01, @badboy01
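One common way to express fuzzy matching in software is edit distance: a handle counts as a hit if it can be turned into the target with at most a few single-character insertions, deletions, or substitutions. The sketch below uses Levenshtein distance as a stand-in, since the deck does not say how the Ryft ONE defines fuzziness. The target "badguy01", the threshold of 2, and the noise handle "@goodguy99" are hypothetical choices for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def fuzzy_filter(handles, target, max_edits):
    """Keep handles whose name (without a leading '@') is within max_edits of target."""
    return [h for h in handles if edit_distance(h.lstrip("@"), target) <= max_edits]


if __name__ == "__main__":
    stream = ["@badguy1", "@badguy2", "@badguy01", "@badboy01", "@goodguy99"]
    # Hypothetical target and threshold; the four slide handles pass, the noise one does not.
    print(fuzzy_filter(stream, "badguy01", max_edits=2))
```

A plain wildcard such as "badguy??" read as "exactly two more characters" would keep only @badguy01; the edit-distance version also catches the shorter and misspelled variants.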
Organizations who want real-time insights into all their data
Large data sets (changing, structured & unstructured, Text, Binary, Imaging)
High Velocity Data
• Logging
• Ad Data
• Twitter
Forensics & Legal Discovery
• Host-based forensics
• E-discovery
Scientific Data
• Genomics
• Sensor Data
Financial
• Compliance
• Fraud Detection
Cyber Security
• PCAP
• Full packet capture
• Binary Analysis
Imagery Analysis
• Change Analysis
• High Performance Rendering
Revisiting Performance Results
Ryft ONE closes the industry’s data analytics performance gap
by combining the following into a single architecture:
• Parallel FPGA architectures to accelerate performance
• Dedicated storage/access/RAM
• Elimination of data security performance bottlenecks
• Elimination of operating system and high-level language overhead
• Minimizing the need to move data
Use Case       | Single Ryft ONE Throughput | Spark Cluster to Match Performance
Search         | ~10 GB/sec                 | > 100 nodes [1]
Fuzzy Search   | ~10 GB/sec                 | 100-200 nodes [2]
Term Frequency | ~2.5 GB/sec                | 100 nodes [1]
Accelerate business insights with the only platform purpose-built
to simultaneously analyze any type of data—historical and
streaming, unstructured and multi-structured—
100X faster with 70% lower TCO.
The Ryft ONE: More data. Less center. Faster insights.
info@ryft.com
Questions

Editor's Notes

  • #3:
      • Legacy proprietary platforms are too slow and costly
          • No real-time performance; limited data formats
          • Priced out of the range of all but the largest enterprises
      • Hadoop/Spark running on clusters are slow, complex, and brittle
          • Significant technology, performance, and knowledge gaps remain
          • Slow and complex setup and maintenance; x86 architecture is not sustainable
          • Demand for knowledgeable developers far exceeds supply
      • Need purpose-built solutions that are open, high speed, and sustainable
          • Top ISV/OEMs working to unlock power of new architectures
          • Enterprises developing homegrown servers
          • Hyper-growth emerging markets for applying HPC resources to data analysis
  • #4: x86 servers are used universally across many problem areas:
      • Data analysis
      • Search
      • Simulation
      • Machine learning
      • Genome sequencing
      • Graph processing
      Scale-out x86 clusters have advantages but also many drawbacks:
      • Increased node count to meet DRAM footprint requirements
      • Increased node count for CPU core requirements
      • Inefficient high-level languages
      • Overhead of distributing data and combining results
      • Datacenter sprawl
      • Complex deployments
      • Increased operational cost
      A New Approach is Needed
      Highly distributed memory architectures turn complex analytics problems into I/O problems, because they must frequently move data between physically distributed memory, disk storage, processors, and networked nodes. The rising class of complex analytic workloads demands strong communications and near-real-time turnaround. Trying to partition (slice) these problems into smaller pieces that can run independently is like trying to cut a human into dozens of chunks and expecting each chunk to go on living.
      Commodity hardware clusters using Hadoop/Spark are designed for compute-intensive workloads, not data analytics. Without purpose-built solutions for Big Data analytics challenges, IT has been forced to piece together a solution and scale out to larger and larger commodity hardware clusters that are strangled by I/O performance bottlenecks.
      MapReduce/Hadoop tools were originally designed to run relatively simple, non-real-time tasks on highly distributed architectures such as clusters and clouds; these workloads frequently make the slow journey out to disk and back.
      Spark operates on similar principles but more efficiently: it saves up multiple tasks before going out to disk.
  • #5: JSON: JavaScript Object Notation; ODBC: Open Database Connectivity; OData: Open Data Protocol
  • #7: Footprint Comparison
  • #9: Years in the making, the Ryft ONE combines two proven innovations in hardware and software to optimize compute, storage, and I/O performance.
      Fast Actionable Business Insights
      • Analyze historical and streaming data at an unprecedented 10 gigabytes per second or faster
      • Traditional clustered systems approach Big Data analytics challenges by re-engineering old technologies to try to make them faster
      • Ryft’s revolutionary innovations in hardware and software dramatically reduce mean time to decisions
  • #17:
      High Velocity Data: These are use cases where the data arrives so rapidly that indexing approaches don’t work well without expensive scaling and licensing.
      • Logging (enterprise-level syslog or Flume)
      • Ad data (Admeld)
      • Click stream (web logs)
      • Twitter firehose
      Scientific Data: These are use cases where the data doesn’t format well for tokenizers and indexers.
      • Genomics (sequencing / bowtie and like algorithms)
      • Other sensor data
      Financial: These check multiple data sources to determine the legitimacy of an action. The turnaround time determines if it is a forensic finding or circumvents the incident.
      • Compliance
      • Fraud detection
      Forensics and Legal Discovery: These users get data in a large package that can take vast amounts of time to index, and sometimes indexing isn’t possible due to unfamiliar formats that aren’t parsed and text extracted. Our brute-force comparison methods sidestep many of these issues and allow analysts to find key pieces of data in seconds vs. days.
      • Host-based forensics on disk images
      • E-discovery (e-mail, databases, documents, messaging servers, copier hard drive images)
      Cyber Security
      • PCAP
      • Full packet capture (includes payload analysis)
      • Binary analysis (malware/virus)
      • Configuration file diff checking
      Imagery Analysis
      • Change analysis (military airborne sensors, security cameras, aircraft radar, astronomy)
      • High performance rendering
  • #18: The node/cluster configurations are given in the slide's footnotes and in the notes below. They were taken directly from published literature, which is why they differ between search/fuzzy search/term frequency and sort: the sort figures come from a more recent publication that used higher-end hardware.
      Each node in the Spark cluster for the search, fuzzy search, and term frequency operations was an m1.xlarge EC2 instance with 4 cores, 15 GB RAM, and 1.68 TB storage, as taken from an academic publication by UCB: http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf, and from Amazon EC2 configuration information: http://aws.amazon.com/ec2/previous-generation/
      The Spark cluster configuration for the sort operation was taken from a more recent publication (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html) that used extremely high-end EC2 instances (the highest cost of any EC2 instance, at $6.82/hour). Each i2.8xlarge node has 32 cores, 244 GB RAM, and 6.4 TB of storage. That is a large and costly amount of resources.
      The performance of any sort algorithm is highly dependent on the size of the sort key and the size of its accompanying data record. Ryft ONE’s worst case is on the order of 1 GB/sec, and a typical real-world case can be upwards of 10 GB/sec.
      The equivalent number of Spark nodes for sort is estimated at approximately 65. This estimate stems from an analysis of the latest Spark sort benchmark performance numbers as published in https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html, coupled with an estimated Spark performance degradation of approximately 50% when moving from the non-real-world sort benchmarks to a more realistic real-world sort.
      Even if the assumptions and estimates are off (say, by a factor of 2), it is remarkable that a single 1U Ryft ONE can achieve the sort performance of a large cluster in which each node has 32 cores, 244 GB RAM, and 6.4 TB of storage.
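To put the cluster sizes in these notes in perspective, the per-node figures can be multiplied out into cluster totals. The sketch below does that for the two configurations named above (100 m1.xlarge nodes and an estimated 65 i2.8xlarge nodes). The `NodeSpec` type and `cluster_totals` helper are illustrative names, and 1 TB is taken as 1000 GB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    name: str
    cores: int
    ram_gb: int
    storage_gb: int


def cluster_totals(spec, nodes):
    """Aggregate cores, RAM, and storage for `nodes` identical nodes."""
    return {
        "cores": spec.cores * nodes,
        "ram_gb": spec.ram_gb * nodes,
        "storage_tb": spec.storage_gb * nodes / 1000,  # assumes 1 TB = 1000 GB
    }


# Per-node figures as quoted in the notes above.
M1_XLARGE = NodeSpec("m1.xlarge", cores=4, ram_gb=15, storage_gb=1680)
I2_8XLARGE = NodeSpec("i2.8xlarge", cores=32, ram_gb=244, storage_gb=6400)

if __name__ == "__main__":
    print(M1_XLARGE.name, cluster_totals(M1_XLARGE, 100))
    print(I2_8XLARGE.name, cluster_totals(I2_8XLARGE, 65))
```

On these numbers the 100-node cluster totals 400 cores and 1.5 TB of RAM, and the 65-node sort cluster totals 2,080 cores and roughly 15.9 TB of RAM.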
  • #19: Massively valuable businesses and applications will be built on the rapidly increasing volume, velocity, and variety of data. Today, most enterprise big data initiatives struggle to make it out of the prototype stage, because current tools like Hadoop and Spark are complex to build and maintain, limited in capabilities, and built upon server clusters using von Neumann architectures designed 70 years ago. Today’s x86 architectures, which rely on these legacy designs, are not built for high performance data analysis and cannot answer the questions companies need to ask. Businesses need a new category of high performance, open, and low maintenance platform that supports the volume, velocity, and variety of big data, at a price tag that makes high performance computing capabilities attainable by all businesses. Massively valuable businesses and applications can be built on the Ryft platform to enable companies to do things never before possible while transforming data center economics.