Scaling data retrieval with S3-like storage and caching

Kunal Singhal

Builder | ex Grammarly, Meta | 2x ICPC world finalist

Our team is building retrieval over 10+ TB of multimodal data: text, images, PDFs, Excel, and more. More and more, I'm convinced that an S3-like, storage-first architecture is the only scalable path forward here. I didn't think we'd have to tackle this scale this early in our journey, but here we are!

Using object storage as the source of truth, stateless compute layers, and intelligent caching lets you deliver near real-time results without the enormous cost of keeping everything in memory. Ecosystem-wise, Turbopuffer is the only option I found that aligns closely with this approach; other popular choices, such as Pinecone, Milvus, or Weaviate, lean more heavily on in-memory or block storage. OpenSearch wasn't even on my radar this time around! Curious how others have handled this challenge.
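To make the pattern concrete, here's a rough sketch of what "object storage as source of truth + caching" can look like. The bucket name, shard layout (.npy matrices of normalized vectors), and helper names are hypothetical illustrations, not our production code: a stateless worker pulls only the shards a query routes to, and an LRU cache keeps hot shards local so repeat queries never touch S3.

```python
# Sketch only: object storage as source of truth, local cache in front of it.
# Bucket name, key layout, and shard format are assumptions for illustration.
import io
from functools import lru_cache

import boto3
import numpy as np

s3 = boto3.client("s3")
BUCKET = "retrieval-shards"  # hypothetical bucket


@lru_cache(maxsize=256)  # hot shards stay local after the first fetch
def load_shard(shard_key: str) -> np.ndarray:
    """Fetch one vector shard (N x D float32 .npy matrix) from object storage."""
    body = s3.get_object(Bucket=BUCKET, Key=shard_key)["Body"].read()
    return np.load(io.BytesIO(body))


def search(query_vec: np.ndarray, shard_keys: list[str], top_k: int = 10):
    """Stateless scoring: pull only the shards this query routes to."""
    scores, ids = [], []
    for key in shard_keys:
        shard = load_shard(key)           # cache hit on repeat queries
        sims = shard @ query_vec          # cosine similarity, vectors pre-normalized
        top = np.argsort(-sims)[:top_k]
        scores.extend(sims[top])
        ids.extend((key, int(i)) for i in top)
    order = np.argsort(-np.asarray(scores))[:top_k]
    return [(ids[i], float(scores[i])) for i in order]
```

Because all state lives in the object store, workers like this can scale out (or be replaced) freely; the cache is purely an optimization layer on top.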

Jason Sperske

Building LMS integrations for Grammarly Authorship.

1mo

I've been building a lot with S3 as a database. I believe databases have their place, but for many problems they're an unnecessary architectural bottleneck.


We've been using Turbopuffer as well. The cost and API have been great, but I wish their dev console were more polished.

Robert Schultz

Senior TPM, AI @ Meta

1mo

Decentralized storage via blockchain 🙂

Vivek Nayyar

Senior Engineering Manager at Qoala

1mo

Genuinely curious to understand the power of Turbopuffer. I've heard a lot of folks recommend it.


