Running Solr at Memory Speed with Alluxio
Timothy Potter
Lucidworks
Agenda
• Overview of Alluxio
• Running Solr on Alluxio
• Interesting Use Cases
• Futures
• Questions?
Cool things I’ve learned about Alluxio …
• Fastest growing open source project in big data
space
• Baidu reported having an Alluxio cluster with
1000 workers and 50TB of RAM … in Feb 2016!
• Brings cloud-storage into the compute layer; data
access at memory speed
• No need to move / migrate data into Alluxio; just
mount the under storage!
• Apache 2.0 licensed but also has a commercial
offering with support if needed
Alluxio Basics
• Hadoop FileSystem API: alluxio://…
• Supports single node up to massive
clusters
• Uses ZK for HA stuff; master/worker
model
• Supports many popular storage
systems: HDFS, S3, Azure Blob store,
GCS, GlusterFS …
• Alluxio FUSE to mount as FS on Linux
[Diagram label: memory-centric virtual distributed storage system]
Configure Solr to use Alluxio
• mkdir or mount Solr root dir in Alluxio
  bin/alluxio fs mkdir /solr
• Set start-up options in bin/solr.in.sh:
  solr.directoryFactory=HdfsDirectoryFactory
  solr.lock.type=hdfs
  solr.hdfs.home=alluxio://master:19998/solr
  solr.hdfs.confdir=/path/hadoop-conf
• Add a core-site.xml to set:
  fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem
  fs.alluxio.impl.disable.cache=true
  alluxio.user.file.writetype.default=CACHE_THROUGH
• Add alluxio client JAR to Solr classpath
  Copy alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar to
  server/solr-webapp/webapp/WEB-INF/lib/
• Upconfig alluxio configset to ZK
  bin/solr zk upconfig -n alluxio -d server/solr/configsets/alluxio/conf
see: http://bit.ly/2y33wQs
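The settings on this slide could be written out roughly as follows. This is a minimal sketch, not the speaker's script: `CONF_DIR` and the `master` hostname are placeholders, and passing the Solr settings as `-D` system properties through `SOLR_OPTS` is the usual solr.in.sh convention, assumed here rather than shown on the slide.

```shell
#!/usr/bin/env bash
# Sketch only: writes a core-site.xml and prints a SOLR_OPTS line for solr.in.sh.
# CONF_DIR and the "master" hostname are placeholders, not values from the talk.
set -e

CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# Hadoop client settings that Solr reads from the directory named by
# solr.hdfs.confdir.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.AbstractFileSystem.alluxio.impl</name>
    <value>alluxio.hadoop.AlluxioFileSystem</value>
  </property>
  <property>
    <name>fs.alluxio.impl.disable.cache</name>
    <value>true</value>
  </property>
  <property>
    <name>alluxio.user.file.writetype.default</name>
    <value>CACHE_THROUGH</value>
  </property>
</configuration>
EOF

# The four start-up options from the slide, expressed as JVM system properties.
SOLR_HDFS_OPTS="-Dsolr.directoryFactory=HdfsDirectoryFactory -Dsolr.lock.type=hdfs -Dsolr.hdfs.home=alluxio://master:19998/solr -Dsolr.hdfs.confdir=$CONF_DIR"

# Print the line to append to bin/solr.in.sh.
echo "SOLR_OPTS=\"\$SOLR_OPTS $SOLR_HDFS_OPTS\""
```

Review the printed line before adding it to bin/solr.in.sh, and point `CONF_DIR` at a permanent directory rather than the temporary default.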
Solr on Alluxio Tips & Tricks
• Run an Alluxio worker on each Solr node
• Write mode should be CACHE_THROUGH to ensure Solr files get
persisted to the under storage, e.g. S3
• Admin can “pin” an index directory to ensure it stays cached in
memory
• Set TTL on index directories that can be freed from memory after a
given timeframe
• Load command moves data from the under storage into Alluxio, such
as after restoring an index from backup
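The pin and load tips map onto Alluxio's `fs pin`, `fs unpin`, and `fs load` shell commands. A minimal sketch, assuming a hypothetical index path (`INDEX_DIR`) and a dry-run wrapper (`ALLUXIO`) that only prints the commands unless you override it:

```shell
#!/usr/bin/env bash
# Sketch only. By default this prints the Alluxio commands instead of running
# them; set ALLUXIO="bin/alluxio" to execute against a real cluster.
# INDEX_DIR is a hypothetical index path, not one taken from the talk.
ALLUXIO="${ALLUXIO:-echo bin/alluxio}"
INDEX_DIR="${INDEX_DIR:-/solr/mycollection/core_node1/data/index}"

# Pin the index directory so its files are not evicted from Alluxio storage.
pin_index() { $ALLUXIO fs pin "$1"; }

# Undo a pin so the files become eligible for eviction again.
unpin_index() { $ALLUXIO fs unpin "$1"; }

# Read data from the under storage into Alluxio, e.g. after restoring a backup.
load_index() { $ALLUXIO fs load "$1"; }

# Typical order after a restore: load the index, then pin it.
load_index "$INDEX_DIR"
pin_index "$INDEX_DIR"
```

Pinning trades flexibility for predictability: pinned indexes hold on to worker memory, so it is probably best reserved for the few collections with strict latency needs.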
Use Case 1: Replace the OS cache with Local under FS
• Index performance
~5M docs, ~4K docs/sec, <1% difference from local FS, 8 GB index on disk
• Query performance (9 GB index, 5M docs, r4.xlarge)
* NOTE: ymmv! Utterly un-scientific experiments to get a feel for the technology
Metric          Alluxio    MMap/SSD   HDFS
QPS             36         42         20
Max QTime       2212 ms    1789 ms    5612 ms
Stddev QTime    335 ms     353 ms     609 ms
Median QTime    70 ms      9 ms       187 ms
75%             372 ms     383 ms     754 ms
95%             972 ms     996 ms     1723 ms
99%             1426 ms    1349 ms    2599 ms
Use Case 2: Use cloud storage as under FS (S3, GCS, Azure)
• Indexing rate: ~3,650 docs/sec to S3 vs. ~4,000 docs/sec on local
• As expected, query perf metrics nearly identical
• Mount the cloud storage system to a directory in Alluxio
  bin/alluxio fs mount alluxio://ec2-34-196-176-70.compute-1.amazonaws.com:19998/s3 s3a://sstk-dev/alluxio
• Deploy cloud instances with lots of memory, e.g. r4’s in EC2
• Use tiered storage to take advantage of the ephemeral disks
(fast SSDs)
• “pin” specific indexes for better performance guarantees
[Diagram: S3 or GCS and Alluxio (memory); link speeds labeled 10 to 100 Gbps and 100 Mbps to 10 Gbps]
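Tiered storage is configured on each worker through the `alluxio.worker.tieredstore.*` properties. The following is a minimal sketch of a memory plus SSD layout. The mount points and quotas are placeholders chosen for illustration, not values from the talk.

```shell
#!/usr/bin/env bash
# Sketch only: writes a two-tier (memory + SSD) worker configuration fragment.
# The mount points and quotas are placeholders; size them for your instances.
set -e

OUT="${OUT:-$(mktemp)}"

cat > "$OUT" <<'EOF'
# Tier 0: RAM disk for the hottest index blocks
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.quota=16GB
# Tier 1: ephemeral SSDs attached to the instance
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/ssd1,/mnt/ssd2
alluxio.worker.tieredstore.level1.dirs.quota=200GB,200GB
EOF

echo "Wrote tiered storage settings to $OUT; merge them into conf/alluxio-site.properties"
```

Workers need a restart to pick up the new tiers. Data on ephemeral disks disappears with the instance, which is one more reason to keep the CACHE_THROUGH write mode so the under storage always has a copy.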
Use Case 3: Time-based Partitioning
• Fits nicely with write-once indexes: signals, logs
• Use Alluxio’s TTL feature to “free” indexes on
aged out partitions
• Tiered storage also allows you to have hot
(memory), warm (SSD), cool (HDD), and cold
(S3) partitions
• Allocators and evictors re-arrange blocks
between tiers; easy to plug in advanced
strategies
[Diagram: Solr partitions 9-15, 9-14, and 9-13 spread across Alluxio (memory), Alluxio (SSD), and S3 or GCS]
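Aging out a partition could look roughly like this. It is a sketch under assumptions: the partition paths are hypothetical, `ALLUXIO` defaults to a dry-run wrapper that only prints commands, and the `--action free` option of `fs setTtl` should be checked against the help output of your Alluxio version, since the default TTL action may delete the files instead of freeing them from memory.

```shell
#!/usr/bin/env bash
# Sketch only. Prints the commands by default; set ALLUXIO="bin/alluxio" to run
# them. Partition paths below are hypothetical examples.
ALLUXIO="${ALLUXIO:-echo bin/alluxio}"

# Convert a retention period in days to the milliseconds that setTtl expects.
days_to_ms() { echo $(( $1 * 24 * 60 * 60 * 1000 )); }

# Ask Alluxio to free (not delete) a partition's files once the TTL expires.
# ASSUMPTION: "--action free" is available in the installed Alluxio version.
expire_partition() {
  local path="$1" days="$2"
  $ALLUXIO fs setTtl --action free "$path" "$(days_to_ms "$days")"
}

# Example: let two older daily partitions leave memory after 2 days.
for p in /solr/logs_9_13 /solr/logs_9_14; do
  expire_partition "$p" 2
done
```

With CACHE_THROUGH writes, a freed partition still exists in the under storage, so a later query can read it back at the cost of slower first access.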
Use Case 4: Cloud-based Recovery
• Solr auto-add replica (have to use
the HdfsUpdateLog)
<updateLog class="solr.HdfsUpdateLog"> …
• Alluxio will pull the files from memory
on another worker if they’re available
or go back to under FS storage
• Wise to have some auto-warming
queries / caches configured so that
replicas don’t get marked as active in
the cluster until they are warmed up
… thanks Shalin! SOLR-6086
[Diagram: Solr overseer issues Add Replica; Solr replicas with Alluxio (memory) on Node 1 (us-east-1d) and Node 2 (us-east-1c); S3 or GCS as under storage]
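Auto-add replica is enabled per collection through the `autoAddReplicas=true` parameter of the Collections API CREATE action. A minimal sketch that builds the request URL follows. The Solr address, shard count, and replication factor are assumptions for illustration; the collection name `alluxio1` and configset name `alluxio` follow the names used elsewhere in this deck.

```shell
#!/usr/bin/env bash
# Sketch only: builds a Collections API CREATE request with autoAddReplicas on.
# SOLR_URL, the shard count, and the replication factor are assumptions.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Build the CREATE URL for a collection that uses the "alluxio" configset.
create_collection_url() {
  local name="$1" shards="$2" replicas="$3"
  echo "$SOLR_URL/admin/collections?action=CREATE&name=$name&numShards=$shards&replicationFactor=$replicas&collection.configName=alluxio&autoAddReplicas=true"
}

# Print the curl command rather than running it, so it can be reviewed first.
echo "curl \"$(create_collection_url alluxio1 1 2)\""
```

Quote the URL when running the printed command, otherwise the shell treats each `&` as a background operator.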
Synergy with Analytics & Machine Learning
• Solr streaming expressions power analytics jobs that may
require massive result sets at once
• Hybrid solutions that mix Solr with compute frameworks
like Spark and Flink
• Alluxio speeds up SparkSQL and ML jobs
• Fusion SQL ~ Keeping expensive views in Alluxio for
analytics dashboards (complex queries against data
loaded from Solr)
Work in progress …
• ALLUXIO-2995: Perf issue (fixed in 1.6.0)
Work-around is: alluxio.user.file.cache.partially.read.block=false
• Orphaned write.lock prevents core initialization after crash, SOLR-8335 and SOLR-8169
bin/alluxio fs rm /solr/alluxio1/core_node1/data/index/write.lock
• SOLR-11335: Closing FileSystem object retrieved from get()
fs.alluxio.impl.disable.cache = true (in core-site.xml)
• SOLR-6237: Shared replicas
• SOLR-9515: Couldn’t get Solr running with s3a w/o Alluxio;
classpath issues
• Test ASYNC_THROUGH write mode with Solr
FAQ
• Does Alluxio support running in HA mode?
• How does data locality work with Solr & Alluxio?
• What block size do you recommend for Solr?
• What’s the overhead of CACHE_THROUGH
during indexing?
• What about Solr’s block cache?
• Does Alluxio work with Solr 7?
Thank You

Editor's Notes

  • #3: In this talk, I introduce Alluxio, the fastest growing open source project in the big data ecosystem, and show how to leverage it for optimizing Solr performance. I'll begin with a brief introduction about how Alluxio works and why it's interesting for the Solr community. Next, I describe how to run Solr on Alluxio and cover basic integration scenarios. Lastly, I provide some performance comparisons between running Solr on Alluxio vs. a local FS and HDFS. Attendees will come away with a new toolset to help them use Solr to tackle a wide array of big data problems.
  • #4: Apache Zeppelin interpreter to execute FS shell commands, e.g. ls /mnt/solr. Another benefit is you can try this out quickly on EC2.
  • #6: See: http://lucene.apache.org/solr/guide/6_6/running-solr-on-hdfs.html#running-solr-on-hdfs
  • #8: r4.xlarge with 4 CPUs, 5M docs, 10K random queries, 16 concurrent users (JMeter). It still might be useful to “pin” specific indexes to help ensure performance. Overall, using Alluxio was slower for queries, which is expected, as MMap is faster than reading from Alluxio even though the files are in memory. However, Alluxio beat HDFS. I probably could have done some BlockCache tuning, but that seems complicated.
  • #9: Accelerate remote storage I/O. Since indexes are in S3, you could run Spark jobs that read the full index w/o impacting search performance. Avoid cloud vendor lock-in, as Solr doesn’t know anything about the underlying cloud FS. Important: I could not get Solr to work against S3 w/o Alluxio due to Hadoop classpath issues and an issue with HttpClient 4.3; this is documented at: https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004. However, this is another example of using Alluxio to hide under FS issues from Solr!
  • #10: What happens when an old partition is queried? Does Alluxio pull that into cache and evict other data, or ??? How to control this?
  • #13: Solr on S3A w/o Alluxio issues: https://community.plm.automation.siemens.com/t5/The-Big-Data-Blog/Running-Solr-on-S3/ba-p/388004
  • #14: Data locality: you’ll want an Alluxio worker on every node where you plan to run Solr replicas. Be careful with smaller block sizes and merging / optimize. CACHE_THROUGH didn’t show much overhead, <1% diff.