𝐇𝐨𝐰 𝐂𝐨𝐧𝐟𝐥𝐢𝐜𝐭-𝐅𝐫𝐞𝐞 𝐑𝐞𝐩𝐥𝐢𝐜𝐚𝐭𝐞𝐝 𝐃𝐚𝐭𝐚 𝐓𝐲𝐩𝐞𝐬 (𝐂𝐑𝐃𝐓𝐬) 𝐏𝐨𝐰𝐞𝐫 𝐑𝐞𝐝𝐢𝐬 𝐟𝐨𝐫 𝐒𝐜𝐚𝐥𝐚𝐛𝐥𝐞, 𝐀𝐥𝐰𝐚𝐲𝐬-𝐀𝐯𝐚𝐢𝐥𝐚𝐛𝐥𝐞 𝐃𝐚𝐭𝐚 🧠

In today’s distributed systems, ensuring eventual consistency without coordination bottlenecks is essential. That’s where CRDTs shine—and Redis has embraced them to deliver highly available, low-latency data replication.

𝑾𝒉𝒂𝒕 𝑨𝒓𝒆 𝑪𝑹𝑫𝑻𝒔? 🤔
Conflict-Free Replicated Data Types are specially designed data structures that support:
➡️ Concurrent updates across multiple replicas
➡️ Automatic conflict resolution without locks or leader election
➡️ Deterministic merges ensuring every replica converges to the same state

At their core, CRDTs track causality so operations commute—whether you’re adding, removing, or incrementing, updates can be applied in any order and still yield the same final result.

𝑯𝒐𝒘 𝑪𝑹𝑫𝑻𝒔 𝑾𝒐𝒓𝒌
𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧-𝐛𝐚𝐬𝐞𝐝 𝐂𝐑𝐃𝐓𝐬: Each change (e.g., “add 5”) is broadcast to all replicas. Operations carry enough metadata to be applied safely in any order.
𝐒𝐭𝐚𝐭𝐞-𝐛𝐚𝐬𝐞𝐝 𝐂𝐑𝐃𝐓𝐬: Entire state snapshots are merged. Each replica periodically exchanges its state and uses a merge function that is associative, commutative, and idempotent.
𝐇𝐲𝐛𝐫𝐢𝐝 𝐚𝐩𝐩𝐫𝐨𝐚𝐜𝐡𝐞𝐬: Combine both to optimize bandwidth and convergence speed.

𝐖𝐡𝐲 𝐑𝐞𝐝𝐢𝐬?
Redis Enterprise integrates CRDTs through its Active-Active geo-distribution (conflict-free replicated databases, or CRDBs), enabling:
Geo-distributed clusters with multi-master writes
Zero downtime under network partitions
Automatic reconciliation when partitions heal

𝑪𝑹𝑫𝑻𝒔 𝒊𝒏 𝑹𝒆𝒅𝒊𝒔 𝑬𝒏𝒂𝒃𝒍𝒆:
Scalable counters (G-Counters, PN-Counters) for real-time analytics
Grow-only sets (G-Sets) for distributed feature flags
Observed-Remove Sets (OR-Sets) for collaborative applications
Flags and registers for coordination-free feature toggles and configuration

Real-World Use Cases
Gaming leaderboards: Consistent ranking updates from players around the globe without locking
IoT telemetry: Edge devices record sensor readings locally and sync seamlessly when back online
Collaborative editing: Multiple users update shared documents or whiteboards concurrently

𝐊𝐞𝐲 𝐓𝐚𝐤𝐞𝐚𝐰𝐚𝐲
𝑩𝒚 𝒄𝒐𝒎𝒃𝒊𝒏𝒊𝒏𝒈 𝒕𝒉𝒆 𝒔𝒊𝒎𝒑𝒍𝒊𝒄𝒊𝒕𝒚 𝒂𝒏𝒅 𝒓𝒊𝒄𝒉 𝒆𝒄𝒐𝒔𝒚𝒔𝒕𝒆𝒎 𝒐𝒇 𝑹𝒆𝒅𝒊𝒔 𝒘𝒊𝒕𝒉 𝒕𝒉𝒆 𝒓𝒐𝒃𝒖𝒔𝒕 𝒄𝒐𝒏𝒇𝒍𝒊𝒄𝒕 𝒓𝒆𝒔𝒐𝒍𝒖𝒕𝒊𝒐𝒏 𝒑𝒓𝒐𝒑𝒆𝒓𝒕𝒊𝒆𝒔 𝒐𝒇 𝑪𝑹𝑫𝑻𝒔, 𝒕𝒆𝒂𝒎𝒔 𝒄𝒂𝒏 𝒃𝒖𝒊𝒍𝒅 𝒕𝒓𝒖𝒍𝒚 𝒂𝒍𝒘𝒂𝒚𝒔-𝒐𝒏, 𝒉𝒊𝒈𝒉𝒍𝒚 𝒄𝒐𝒏𝒄𝒖𝒓𝒓𝒆𝒏𝒕 𝒅𝒊𝒔𝒕𝒓𝒊𝒃𝒖𝒕𝒆𝒅 𝒂𝒑𝒑𝒍𝒊𝒄𝒂𝒕𝒊𝒐𝒏𝒔—𝒘𝒊𝒕𝒉𝒐𝒖𝒕 𝒕𝒉𝒆 𝒄𝒐𝒎𝒑𝒍𝒆𝒙𝒊𝒕𝒚 𝒐𝒇 𝒎𝒂𝒏𝒖𝒂𝒍 𝒄𝒐𝒏𝒇𝒍𝒊𝒄𝒕 𝒉𝒂𝒏𝒅𝒍𝒊𝒏𝒈 𝒐𝒓 𝒄𝒆𝒏𝒕𝒓𝒂𝒍 𝒄𝒐𝒐𝒓𝒅𝒊𝒏𝒂𝒕𝒊𝒐𝒏.

— Empower your next project with Redis CRDTs and see consistency and availability coexist.

#Redis #CRDT #DistributedSystems #EventualConsistency #RealTimeData
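To make the convergence property concrete, here is a minimal sketch of a state-based PN-Counter in Python. It illustrates the general CRDT technique, not Redis Enterprise's internal implementation; the replica names are made up for the example.

```python
# Minimal state-based PN-Counter sketch: each replica tracks per-replica
# increment/decrement totals, and merge takes the element-wise max.
# The merge is associative, commutative, and idempotent, so replicas converge.

class PNCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.incs: dict[str, int] = {}   # per-replica increment totals
        self.decs: dict[str, int] = {}   # per-replica decrement totals

    def increment(self, n: int = 1) -> None:
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def decrement(self, n: int = 1) -> None:
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other: "PNCounter") -> None:
        # Element-wise max over both maps; applying a merge twice changes nothing.
        for rid, n in other.incs.items():
            self.incs[rid] = max(self.incs.get(rid, 0), n)
        for rid, n in other.decs.items():
            self.decs[rid] = max(self.decs.get(rid, 0), n)


# Two replicas update concurrently, then merge in either order -> same value.
us, eu = PNCounter("us-east"), PNCounter("eu-west")
us.increment(5)
eu.increment(3)
eu.decrement(1)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 7
```

The same shape scales to the other types mentioned above: a G-Counter drops the decrement map, and an OR-Set replaces the integer maps with tagged add/remove sets.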
-
Distributed Databases Empowering Agentic AI Worldwide

Oracle has unveiled the Globally Distributed Exadata Database on Exascale, a unified, "always-on" distributed database that mirrors and harmonizes data across various regions. This innovation ensures minimal latency, exceptional availability (swift failover with no data loss), and compliance with data residency regulations. It serves as the optimal groundwork for agentic AI applications necessitating global operations, incorporating vector search and complete SQL functionality without the need for application rewrites.

Key Points to Note:
- Unprecedented Availability: Leveraging Raft-based replication and rapid failover capabilities spanning multiple data centers and regions.
- Scalability and Cost Efficiency: Embracing a serverless Exascale framework to manage unforeseen spikes in agent workloads efficiently.
- Global AI Implementation: Proximity of data and vectors to users while upholding data sovereignty and performance, enhancing responsiveness for users worldwide.

Explore more: https://lnkd.in/dP9YpS63

#AgenticAI #OracleDatabase #DistributedSystems #DataResidency #Exadata #OCI
-
𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐢𝐧𝐠 𝐟𝐨𝐫 𝐑𝐞𝐚𝐥-𝐓𝐢𝐦𝐞: 𝐀 𝐃𝐞𝐞𝐩 𝐃𝐢𝐯𝐞 𝐢𝐧𝐭𝐨 𝐭𝐡𝐞 𝐊𝐚𝐟𝐤𝐚 & 𝐃𝐚𝐭𝐚𝐛𝐫𝐢𝐜𝐤𝐬 𝐏𝐨𝐰𝐞𝐫 𝐃𝐮𝐨!

Having successfully architected the Apache Kafka and Databricks integration for a scalable data platform, I wanted to share why this combination is so powerful for modern data architectures. It’s more than just connecting two technologies; it’s about creating a seamless flow from real-time ingestion to actionable intelligence.

𝐇𝐞𝐫𝐞’𝐬 𝐚 𝐥𝐨𝐨𝐤 𝐮𝐧𝐝𝐞𝐫 𝐭𝐡𝐞 𝐡𝐨𝐨𝐝:

𝟏. 𝐓𝐡𝐞 𝐒𝐞𝐚𝐦𝐥𝐞𝐬𝐬 𝐃𝐚𝐭𝐚 𝐅𝐥𝐨𝐰:
• 𝐈𝐧𝐠𝐞𝐬𝐭 & 𝐁𝐮𝐟𝐟𝐞𝐫: Apache Kafka acts as the resilient, high-throughput central nervous system. It durably ingests streams from every source (microservices, DBs, IoT sensors), handling back-pressure and decoupling producers from consumers.
• 𝐏𝐫𝐨𝐜𝐞𝐬𝐬 & 𝐄𝐧𝐫𝐢𝐜𝐡: Databricks, with its 𝐒𝐩𝐚𝐫𝐤 𝐒𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝 𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 engine, consumes from Kafka topics. This is where the magic happens: data validation, transformation, enrichment with slowly changing dimensions (SCDs), and event-time aggregation using watermarks.
• 𝐒𝐭𝐨𝐫𝐞 & 𝐒𝐞𝐫𝐯𝐞: Processed data is written continuously into 𝐃𝐞𝐥𝐭𝐚 𝐋𝐚𝐤𝐞 on cloud storage (S3/ADLS/GCS). This provides ACID transactions, schema enforcement, and ultra-efficient upserts/merges (e.g., for CDC). The Delta Lake table becomes your single source of truth for both batch and streaming.

𝟐. 𝐖𝐡𝐲 𝐓𝐡𝐢𝐬 𝐂𝐨𝐦𝐛𝐢𝐧𝐚𝐭𝐢𝐨𝐧 𝐖𝐢𝐧𝐬:
• 𝐌𝐚𝐬𝐬𝐢𝐯𝐞 𝐒𝐜𝐚𝐥𝐚𝐛𝐢𝐥𝐢𝐭𝐲: Kafka handles millions of events per second. Databricks Spark clusters scale elastically to process it.
• 𝐑𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲 & 𝐅𝐚𝐮𝐥𝐭 𝐓𝐨𝐥𝐞𝐫𝐚𝐧𝐜𝐞: Kafka ensures no data loss. Spark Streaming’s checkpointing and Delta Lake’s transactions guarantee exactly-once processing semantics.
• 𝐎𝐩𝐞𝐧 & 𝐕𝐞𝐧𝐝𝐨𝐫-𝐅𝐫𝐢𝐞𝐧𝐝𝐥𝐲: This is an open-source core (Kafka, Spark, Delta). You avoid complete vendor lock-in while leveraging Databricks' optimized performance and management.
• 𝐓𝐡𝐞 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 𝐏𝐚𝐫𝐚𝐝𝐢𝐠𝐦: It enables the core tenet of the Lakehouse: combining the best of data lakes (scale, flexibility) and data warehouses (performance, reliability) on a single platform.

𝐈𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐚𝐭𝐢𝐨𝐧 𝐂𝐨𝐧𝐬𝐢𝐝𝐞𝐫𝐚𝐭𝐢𝐨𝐧: We used the 𝐊𝐚𝐟𝐤𝐚 𝐒𝐨𝐮𝐫𝐜𝐞 𝐟𝐨𝐫 𝐒𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞𝐝 𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 for maximum control over offsets and processing semantics. The alternative, 𝐀𝐮𝐭𝐨 𝐋𝐨𝐚𝐝𝐞𝐫, is a fantastic option for efficiently streaming files from cloud storage as they arrive, which itself is often populated by Kafka Connect sinks.

This architecture has been a game-changer for enabling real-time dashboards, live feature engineering, and immediate anomaly detection.

𝐀𝐫𝐞 𝐲𝐨𝐮 𝐭𝐞𝐚𝐦 𝐊𝐚𝐟𝐤𝐚 𝐒𝐨𝐮𝐫𝐜𝐞 𝐨𝐫 𝐭𝐞𝐚𝐦 𝐀𝐮𝐭𝐨 𝐋𝐨𝐚𝐝𝐞𝐫 𝐟𝐨𝐫 𝐬𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 𝐢𝐧𝐭𝐨 𝐃𝐚𝐭𝐚𝐛𝐫𝐢𝐜𝐤𝐬?

#DataArchitecture #ApacheKafka #Databricks #DeltaLake #DataEngineering #Spark #Lakehouse
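To ground the ingest → process → store flow above, here is a hedged PySpark sketch of a Kafka → Structured Streaming → Delta pipeline. The broker address, topic name, event schema, and storage paths are illustrative assumptions, not details of the platform described in the post, and the Kafka connector package is assumed to be on the cluster.

```python
# Sketch of Kafka -> Spark Structured Streaming -> Delta Lake (assumed names/paths).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

event_schema = (StructType()
    .add("device_id", StringType())
    .add("reading", DoubleType())
    .add("event_time", TimestampType()))

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker address
    .option("subscribe", "telemetry")                   # assumed topic name
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers bytes: parse the JSON payload and bound late data with a watermark.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), event_schema).alias("e"))
    .select("e.*")
    .withWatermark("event_time", "10 minutes"))

# Continuous append into Delta; the checkpoint plus Delta's transactions give
# exactly-once sink semantics.
query = (events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/telemetry")  # assumed path
    .outputMode("append")
    .start("s3://bucket/delta/telemetry"))                              # assumed path
```

Validation, SCD enrichment, and aggregations would slot in between the parse step and the Delta write.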
-
In today’s data-driven world, real-time processing has become the need of the hour. Businesses can no longer rely only on batch systems—they need insights as events happen. That’s where Apache Kafka comes in. Kafka has evolved from being just a “pub-sub messaging system” into the central nervous system of modern data platforms.

Let’s dive deep into the 𝐓𝐨𝐩 5 𝐊𝐚𝐟𝐤𝐚 𝐔𝐬𝐞 𝐂𝐚𝐬𝐞𝐬:

1. Data Streaming
↬ Kafka ingests continuous streams of events from platforms like social media, IoT devices, sensors, or user applications.
↬ These events are stored in Kafka topics and consumed by downstream systems like Spark Streaming, Flink, or ML models.
↬ This enables use cases like fraud detection, recommendation engines, and live dashboards where milliseconds matter.

2. Log & Activity Tracking
↬ Large-scale applications generate billions of logs. Collecting, storing, and analyzing them in real time is a massive challenge.
↬ Kafka centralizes logs from different systems and makes them available for monitoring, alerting, and analytics.
↬ Tools like Spark or Elasticsearch can then consume these logs for building insights, detecting anomalies, or visualizing system behavior.

3. Message Queuing
↬ Traditional message queues struggle with scale and durability. Kafka solves this by decoupling producers and consumers.
↬ Multiple producers can write to Kafka topics, and multiple consumers can read the same messages independently.
↬ This ensures fault tolerance, guaranteed delivery, and parallel processing—critical for microservices communication, payment processing, and notification systems.

4. Change Data Capture (CDC)
↬ Databases are constantly evolving—new inserts, updates, deletes.
↬ Using tools like Debezium, Kafka can capture these row-level changes in real time.
↬ Downstream systems (like Spark, Elastic, or data warehouses) consume these events to keep everything in sync, powering real-time analytics, auditing, and event-driven architectures.

5. Data Replication
↬ Organizations often need to replicate data across regions, environments, or cloud providers.
↬ Kafka connectors make it possible to stream data from one database into another in real time.
↬ This supports disaster recovery, high availability, and global-scale applications where data consistency is non-negotiable.

𝐖𝐡𝐲 𝐊𝐚𝐟𝐤𝐚?
↬ High throughput – Handles millions of events per second.
↬ Durability & fault tolerance – Data is replicated across brokers.
↬ Scalability – Horizontal scaling for both storage and processing.
↬ Flexibility – Works with streaming, batch, and hybrid pipelines.

𝐅𝐨𝐥𝐥𝐨𝐰 𝐦𝐲 𝐌𝐞𝐝𝐢𝐮𝐦 𝐇𝐚𝐧𝐝𝐥𝐞 𝐭𝐨 𝐬𝐭𝐚𝐲 𝐮𝐩𝐝𝐚𝐭𝐞𝐝 - https://lnkd.in/dHhPyud2
𝗙𝗼𝗿 𝗠𝗲𝗻𝘁𝗼𝗿𝘀𝗵𝗶𝗽 - https://lnkd.in/gYn8Q39u
𝗙𝗼𝗿 𝗚𝘂𝗶𝗱𝗮𝗻𝗰𝗲 - https://lnkd.in/gfrPMQSj
Riya Khandelwal
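As a concrete illustration of the producer/consumer decoupling in use case 3, here is a minimal sketch using the confluent-kafka Python client. The broker address, topic, consumer group, and payload are assumptions for the example, not part of the original post.

```python
# Decoupled producer and consumer against the same topic (assumed names).
from confluent_kafka import Producer, Consumer

BROKERS = "localhost:9092"   # assumed broker address
TOPIC = "payments"           # assumed topic name

# Producer: keys events by account so related events land in the same partition.
producer = Producer({"bootstrap.servers": BROKERS})
producer.produce(TOPIC, key="acct-42", value=b'{"amount": 19.99}')
producer.flush()

# Consumer: each consumer group reads the same messages independently,
# so a "fraud-detector" group and a "notifier" group never interfere.
consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "fraud-detector",        # assumed group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```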
-
Today’s reality: most #generativeAI tools still require shipping data to third-party clouds. For enterprises, that’s a compliance and security nightmare. The EDB Postgres AI Factory keeps AI-ready data sovereign with innovations like off-prompt access controls that prevent proprietary data from leaking into public LLMs while reducing costs. The Register highlights how enterprises are protecting their most valuable asset--data--with sovereign Postgres: https://bit.ly/45U8wYd #EDBPostgresAI #SovereignAI #DataSovereignty #AgenticAI #PostgreSQL #DataSecurity #AIFactory
-
What is under the covers of Komprise? Managing #metadata at scale with a fast query engine, to drive better decision-making for #unstructureddata.

➡️ We chose the open-source indexing engine Elasticsearch which allows users to find specific data sets over multiple data centers and hybrid cloud--for planning, #compliance and #AI #workflows.
➡️ This approach keeps the Komprise architecture simple. Using a dedicated metadata solution means you can store as much metadata as needed (creating additional tags for example) without the concern that Komprise’s performance will be impacted.

😬 Other solutions try to do everything in a single central database that restricts scale and suffers performance penalties as the metadata load grows.

Read more about it! And check comments for a link to schedule a demo.
https://lnkd.in/gaX26fEQ
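As a rough illustration of the general pattern (a dedicated, queryable metadata index, separate from the data itself), here is a sketch using the Elasticsearch Python client with the 8.x-style API. The index name, field names, and tag values are assumptions, not Komprise's actual schema or API.

```python
# Query a hypothetical file-metadata index for candidate AI-training data.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed endpoint

resp = es.search(
    index="file-metadata",                    # assumed index name
    query={
        "bool": {
            "filter": [
                {"term": {"tags": "ai-training"}},                      # assumed tag
                {"range": {"size_bytes": {"gte": 1_073_741_824}}},      # files >= 1 GiB
            ]
        }
    },
    size=20,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["path"], hit["_source"]["site"])   # assumed fields
```

Because the index holds only metadata, tags can grow freely without touching the primary storage path.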
-
Modernizing the Data Warehouse: Build Around, Not Away

The fastest route to a modern data platform isn’t a rewrite, it’s augmentation. Pair the existing DW with a governed data lake, selective data virtualization, and scalable MPP/cloud to unlock near–real-time analytics and ML without sacrificing trust.

Why this matters
- Consolidate multistructured data (IoT, logs, social, text) with curated warehouse facts to deepen context and accelerate insights.
- Scale via MPP and cloud elasticity; combine streaming + batch (Lambda) for low-latency decisions with historical depth.
- Support all user types—BI, self-service, and data science sandboxes—under governance, security, and MDM.

What “modern” looks like
- Multi-platform by design: DW + Data Lake + optional Hadoop/NoSQL, unified logically with data virtualization where appropriate.
- Mix integration styles: schema-on-read for exploration, schema-on-write for delivery, virtualization to minimize movement and latency.
- Operate with agility: automation/APIs, data cataloging, promotable sandboxes, and hybrid cloud patterns.

Start small, win fast
- Land DW staging in the data lake to cut storage costs and speed ingestion of raw formats.
- Offload cold data to the lake as an active archive; keep query access across current + historical via federated queries.
- Pilot a new data type (clickstream, sensors) to prove ingestion, governance, and access patterns before scaling.

Design patterns to adopt
- Lambda architecture: speed layer for streaming KPIs + batch layer for full-fidelity history; serve through curated stores.
- Lake zones: transient (checks), raw (immutable history), curated (governed delivery), sandbox (experiment → promote).
- Organize for retrieval, not just storage—use subject/security/time/purpose and lean on business metadata to avoid a swamp.

Watch-outs
- Governance and data quality can slip as schema-on-read and self-service grow; design promotion paths and controls early.
- Virtualization brings speed, but mind performance, lineage, and historical-analysis limitations.
- Expect change: file/layout drift, schema evolution, and lifecycle management across platforms.

Call to action
Pick one move this quarter: stage in the lake, stand up an active archive, or ingest one new high-value data source. Prove value with a POC, define promotion paths, and scale what works. Agility plus trust is the blueprint for a durable modern platform.

#ModernDataPlatform #DataLake #DataWarehouse #DataArchitecture #DataVirtualization #RealTimeAnalytics
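A small PySpark sketch of the "active archive" move above, under stated assumptions: the warehouse table, JDBC URL, lake path, and shared columns (order_year, amount) are all illustrative. Cold history sits in the lake, hot data stays in the warehouse, and one logical view federates both for downstream queries.

```python
# Federate hot warehouse data with cold lake history (assumed names and paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("active-archive").getOrCreate()

# Hot/current data read from the existing warehouse over JDBC.
current = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/edw")   # assumed warehouse
    .option("dbtable", "sales.orders_current")             # assumed table
    .option("user", "reader").option("password", "REDACTED")
    .load())

# Cold history offloaded to the lake as Parquet (the active archive).
history = spark.read.parquet("s3://lake/raw/sales/orders_history/")  # assumed path

# One logical view over both tiers; allowMissingColumns tolerates minor drift.
current.unionByName(history, allowMissingColumns=True) \
    .createOrReplaceTempView("orders_all")

spark.sql("""
    SELECT order_year, SUM(amount) AS total_amount
    FROM orders_all
    GROUP BY order_year
""").show()
```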
-
AI might be the engine behind today’s biggest breakthroughs, but we know that without the right data in the right place, that engine stalls. Most organizations are sitting on massive volumes of data buried in legacy file systems, scattered across cloud storage, or trapped inside platforms like SharePoint and Salesforce. This data sprawl makes it hard to move fast.

VAST Data calls this the “last mile” problem. Even with advanced models and plenty of compute, teams struggle to get their data into a pipeline where AI can actually use it.

VAST has introduced SyncEngine as a possible solution to this problem. It is designed to act as a universal data router, automatically discovering, cataloging, and moving unstructured data across fragmented systems and SaaS platforms. By collapsing migration, indexing, and transformation into a single workflow, SyncEngine helps organizations feed their AI pipelines without relying on brittle scripts or a patchwork of third-party tools.

Please follow Hardial Singh for such content.

#linkedIn #Cybersecurity #informationsecurity #cloudsecurity #datasecurity #cybersecurityawareness #Data #Bigdata #Hadoop #Enterprisedata #Hybridcloud #Cloud #Cloudgovernance #Devops #Devsecops #Secops #cyber #infosec #riskassessment #informationsecurity #auditmanagement #informationprotection #securityaudit #cyberrisks #cybersecurity #security #cloudsecurity #trends #AWS #EC2 #AWSStorage #Cloudstorage
https://lnkd.in/gfkrFUFJ
-
Neo4j has launched a graph database built to unify workloads at 100TB+ scale for generative AI. Good to know and expecting this feature available sooner in Aura Instance
🚀 Super excited to announce availability of Neo4j Infinigraph, our most scalable graph database yet – built to unify real-time transactions and deep analytics at 100TB+ scale and ACID compliance.

With Infinigraph, enterprises no longer need to choose between speed and scale, or stitch together transactional and analytical systems. Now you can:
✅ Run operational + analytical workloads in a single system
✅ 100TB+ horizontal scale with zero application rewrites
✅ Embed billions of vectors directly in the graph
✅ High performance across massive transactional and analytical workloads
✅ High availability across data centers through autonomous clustering, which detects and recovers from failures automatically
✅ No ETL pipelines, sync delays, or duplicated storage
✅ Preserved graph structure for real-time traversal, even at scale
✅ Full ACID compliance for consistent enterprise-grade data integrity
✅ Pricing designed for scale, with compute and storage billed separately, for greater control over cost and deployment flexibility

This is a breakthrough for customers who need to fight global fraud, analyze decades of compliance data, or deliver real-time product recommendations at massive scale. As I shared: “Infinigraph sets a new standard for enterprise graph databases: one system that runs real-time operations and deep analytics together, at full fidelity and massive scale.”

We’re proud to build on our history of innovation and deliver the graph infrastructure that powers intelligent applications for 84 of the Fortune 100. Thanks Ivan Zoratti and Florin Manole for driving this effort over the past 2+ years.

cc: Ivan Zoratti Magnus Vejlstrup Emil Eifrem Mike Asher Mark Woodhams Ajay Singh Dan McGrath Charles Dolan Anurag Tandon Michael Hunger Shradha K. David Fauth Ravi Ramanathan Robert Strange Kris Payne Jesús Barrasa Bryan Evans Dan Broom Kristen Pimpini (KP) Michael Moore, Ph.D.

#GraphDatabase #AI #DataArchitecture #Neo4j #GenAI
https://lnkd.in/gy8mkQAu
-
Partition now, regret later: why tiering from Kafka to Iceberg involves an uncomfortable choice between streaming and batch.

Today there are two ways to surface Kafka data to Iceberg: shared tiering (also called zero copy) or materialization.

𝗠𝗮𝘁𝗲𝗿𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 means copying data from Kafka into Iceberg (ETL). In this scheme there are two copies of the data (Kafka and Iceberg) and clients read from the copy most appropriate to their needs.

𝗦𝗵𝗮𝗿𝗲𝗱 𝗧𝗶𝗲𝗿𝗶𝗻𝗴 means reformatting Kafka data to Iceberg when Kafka’s tiering process kicks in. As data is moved rather than copied, clients must read from what's available whether it is laid out optimally or not.

At first it may seem that there is no difference in partitioning here, but the key is in the client requirements. Shared Tiering inherits Kafka's partitioning scheme, whereas Materialization allows you to re-partition your data in a format more appropriate for analytical queries.

For example, consider a Kafka topic holding credit card transactions. A common real-time use case is to sum these transaction amounts into a balance, and because of this they are keyed (partitioned) by account number. The same topic is tiered into Iceberg, where a reporting case needs to aggregate all transactions by the US state in which the transaction occurred.

With Materialization the Kafka copy remains partitioned by account number and the Iceberg copy is partitioned by state. Real-time and analytical clients can choose the correct copy to match their case.

With Shared Tiering the tiered data must serve real-time and analytical clients. But if both use cases have different access patterns, it is impossible to optimize one without the other suffering. It can be a non-starter if your Iceberg queries take N minutes to compute.

💡 𝗣𝗹𝗮𝗻 𝗖
Using the zero-copy Shared Tiering approach but adding an index gives you the best of both worlds. Both real-time and analytical applications can consume the same dataset with near optimal performance whilst maintaining cheap non-duplicated storage, a single source of truth and ease of governance.

This strategy optimizes:
○ 𝗦𝘁𝗼𝗿𝗮𝗴𝗲 - an index involves some extra storage but much less than the full duplication of materialization 💸
○ <𝟭𝟬𝟬𝗺𝘀 𝗹𝗮𝘁𝗲𝗻𝗰𝘆 - Kafka clients see 1 partitioning scheme and stay fast. Adding an index allows analytical clients to see another partitioning scheme that boosts performance. 🔥
○ 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆 - we still only have one source of truth (the data itself). Indexing simply accelerates access to it. 👌

If we apply indexing to our example, the data would be partitioned by account number, with an additional index created on the state field. This allows both real-time and analytical cases to perform as required.

The future isn’t about choosing streaming or batch, it’s about designing systems that serve both. Shared Tiering + Indexing makes that future possible without the heavy tax of duplication.
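To make the materialization path concrete, here is a hedged PySpark + Apache Iceberg sketch: the Kafka topic stays keyed by account number for the balance use case, while the analytical copy is created with a different partition spec (state). The catalog, table, topic, and broker names are assumptions, the Iceberg runtime and catalog configuration are presumed to be in place, and this is generic Spark/Iceberg rather than any specific vendor's tiering feature.

```python
# Materialization sketch: re-partition Kafka data (keyed by account) into an
# Iceberg table partitioned by state for analytical scans (assumed names).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

# Analytical copy partitioned by the reporting dimension, not the Kafka key.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.cards.transactions (
        account_id STRING, state STRING, amount DOUBLE, event_time TIMESTAMP)
    USING iceberg
    PARTITIONED BY (state)
""")

schema = (StructType()
    .add("account_id", StringType()).add("state", StringType())
    .add("amount", DoubleType()).add("event_time", TimestampType()))

txns = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "card-transactions")            # assumed topic, keyed by account
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("t"))
    .select("t.*"))

(txns.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/checkpoints/card-transactions")  # assumed
    .toTable("lake.cards.transactions"))
```

The cost of this flexibility is exactly the duplication the post describes: the same events now live in Kafka and in Iceberg.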
-
🛠️ 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 𝗳𝗼𝗿 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀 — 𝗖𝗼𝗿𝗲 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝗬𝗼𝘂 𝗖𝗮𝗻’𝘁 𝗜𝗴𝗻𝗼𝗿𝗲

Data Engineering isn’t just about writing SQL or Spark code. If you want to ace interviews and design production-ready platforms, you need to understand 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻 deeply. Here’s a breakdown of the most common concepts you’ll come across 👇

𝟭. 𝗦𝗰𝗮𝗹𝗶𝗻𝗴 (𝗩𝗲𝗿𝘁𝗶𝗰𝗮𝗹 𝘃𝘀 𝗛𝗼𝗿𝗶𝘇𝗼𝗻𝘁𝗮𝗹)
➜ Vertical Scaling: Add more CPU/RAM to a single machine.
➜ Horizontal Scaling: Add more machines to the cluster.
➜ Use case: Horizontal scaling dominates in distributed data systems. Limitation: Vertical scaling hits hardware limits quickly.

𝟮. 𝗟𝗼𝗮𝗱 𝗕𝗮𝗹𝗮𝗻𝗰𝗲𝗿𝘀
➜ Flow: User Requests → Load Balancer → Multiple Servers
➜ How it works: Distributes traffic evenly, improves availability, and prevents server overload.
➜ Use case: Essential for large-scale APIs, microservices, and streaming systems. Limitation: Can become a single point of failure if not designed well.

𝟯. 𝗖𝗮𝗰𝗵𝗶𝗻𝗴 𝗦𝘁𝗿𝗮𝘁𝗲𝗴𝗶𝗲𝘀
➜ Types: Client-side, server-side, and distributed caches (e.g., Redis, Memcached).
➜ How it works: Stores frequently accessed results closer to the user/system.
➜ Use case: Speeds up reads (e.g., analytics dashboards). Limitation: Cache invalidation is tricky.

𝟰. 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲 𝗥𝗲𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 & 𝗦𝗵𝗮𝗿𝗱𝗶𝗻𝗴
➜ Replication: Keeps multiple copies of data for high availability.
➜ Sharding: Splits large datasets across nodes for scalability.
➜ Use case: Global-scale applications, modern data lakes. Limitation: Complex query coordination across shards.

𝟱. 𝗕𝗮𝘁𝗰𝗵 𝘃𝘀 𝗦𝘁𝗿𝗲𝗮𝗺 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴
➜ Batch: Large historical data → Hadoop/Spark.
➜ Stream: Real-time data → Kafka/Flink.
➜ Use case: IoT analytics, fraud detection, log pipelines. Limitation: Stream processing adds complexity in state management.

𝟲. 𝗟𝗮𝗺𝗯𝗱𝗮 & 𝗞𝗮𝗽𝗽𝗮 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲𝘀
➜ Lambda: Batch Layer + Speed Layer + Serving Layer.
➜ Kappa: Everything as a stream, simpler to maintain.
➜ Use case: Unified analytics combining history + real-time events. Limitation: Lambda is powerful but harder to maintain.

📌 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: These aren’t just theory. They appear in interviews and form the backbone of 𝘀𝗰𝗮𝗹𝗮𝗯𝗹𝗲, 𝗳𝗮𝘂𝗹𝘁-𝘁𝗼𝗹𝗲𝗿𝗮𝗻𝘁 𝗱𝗮𝘁𝗮 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 in the real world.

♻️ Save this post for later.
💬 Comment your favorite concept below (mine = caching, because speed thrills 🚀).
✅ Follow Aishwarya Pani for more 𝗿𝗲𝗮𝗹-𝘄𝗼𝗿𝗹𝗱 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀 that actually help you grow.
📌 PDF Credit: Shwetank Singh

#DataEngineering #SystemDesign #BigData #CloudComputing #DistributedSystems #DataArchitecture #TechCareers #InterviewPreparation #SoftwareEngineering
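Since caching (concept 3) gets the shout-out above, here is a tiny cache-aside sketch in Python. In production the cache would be Redis or Memcached; a process-local dict with TTLs stands in here so the pattern stays visible. The function and key names are illustrative.

```python
# Cache-aside: check the cache, fall back to the database on a miss, then populate.
import time

CACHE: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
TTL_SECONDS = 60

def slow_db_query(user_id: str) -> dict:
    time.sleep(0.2)                            # stand-in for a real database call
    return {"user_id": user_id, "plan": "pro"}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():           # cache hit that has not expired
        return hit[1]
    value = slow_db_query(user_id)             # cache miss: read through to the DB
    CACHE[key] = (time.time() + TTL_SECONDS, value)   # populate with a TTL
    return value

get_user("42")   # miss -> hits the database
get_user("42")   # hit  -> served from the cache
```

The TTL is the simplest answer to the "cache invalidation is tricky" limitation: entries expire on their own, at the cost of serving slightly stale reads within the window.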