"Chad on Spark: Not for all data engineering needs"

4w Edited

"Most pipelines in enterprises are not web scale." Can't agree more with Chad. Spark is very powerful, but it is not the answer to all your data engineering requirements. Life would be easier if engineers thought before throwing Spark at non-Spark problems.

Chad Wahlquist

Architect - Palantir

1mo

Long Live Spark, Down with Spark 🙃

Spark has been the de facto processing engine for the last 10 years in the "Big Data" community. It still has its place in data processing. Some have branched off their own versions, spun off custom engines, and developed new approaches in an attempt to overcome the shortcomings of Spark in diverse landscapes. Our focus is to enable the right compute for the problem, with a focus on interoperability with other systems, and to have open source compute engines as first-class citizens in the platform.

Enterprises are highly complex, with high heterogeneity of solutions, tech debt, data gravity, and modalities. I would argue that most pipelines in enterprise are not web scale; they are small and medium sized: a pipeline with a few million or hundreds of millions of rows, sometimes a few billion. Spark is overkill for 90%+ of these workloads and carries a lot of overhead that impacts speed and cost.

With the Palantir Multi-Modal Data Plane (MMDP), we provide the framework to store data anywhere, compute data anywhere, and use any model. If we break this down a bit more, this means I can have data spread across different platforms, file formats, and structures, and use any compute (inside or outside of the platform) to process my data. The permutations of storage locations, formats, and compute engines are massive. It's the right tool for the problem at hand, not one tool you need to shove everything into. Re-platforming should never be the goal.

With new frameworks like Polars, DataFusion, and DuckDB and the availability of larger node sizes, you can run increasingly bigger and more complex workloads in a fraction of the time and cost. You can mix and match these as needed in Foundry and AIP, even having one step run on one engine and another step use a different engine. (You can even bring your own engine.) Spark is always there on the table, but it should no longer be the default for most enterprises to start with.
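To make the "one step on one engine, another step on a different engine" idea concrete, here is a minimal sketch in plain Python. It is not the Foundry or AIP API; `Step`, `run_pipeline`, and the engine labels are hypothetical names, and the two step functions are simple stand-ins for work that a real pipeline would hand to an engine such as Polars or DuckDB.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


def filter_positive(rows: Rows) -> Rows:
    """Stand-in for a cleaning step run on one lightweight engine."""
    return [r for r in rows if r["amount"] > 0]


def total_by_region(rows: Rows) -> Rows:
    """Stand-in for an aggregation step run on a different engine."""
    totals: Dict[str, int] = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return [{"region": k, "total": v} for k, v in sorted(totals.items())]


@dataclass
class Step:
    name: str
    engine: str  # hypothetical label only, e.g. "polars", "duckdb", "spark"
    run: Callable[[Rows], Rows]


def run_pipeline(steps: List[Step], rows: Rows) -> Tuple[Rows, List[str]]:
    """Run each step in order and record which engine label it was bound to."""
    log = []
    for step in steps:
        rows = step.run(rows)
        log.append(f"{step.name}:{step.engine}")
    return rows, log


steps = [
    Step("clean", "polars", filter_positive),
    Step("aggregate", "duckdb", total_by_region),
]
data = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 7},
    {"region": "EU", "amount": 5},
    {"region": "US", "amount": -3},
]
result, log = run_pipeline(steps, data)
```

The point of the design is that the step boundary, not the pipeline as a whole, decides the engine, so a heavy join could stay on Spark while small steps around it do not.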


More Relevant Posts

Ajay Kumar Ojha

Data & AI Architect | Enterprise Solution Architect (Integration) | Enterprise Systems & EDM
3w
Future-Proofing Data Platforms: Spark Trends You Can’t Ignore

Data platforms are changing at lightning speed. What works today might not survive tomorrow. Apache Spark is at the heart of this transformation, and the way we design, operate and scale Spark-based systems will define the future of data-driven business. Here are Spark shifts that will move from “nice-to-have” to absolutely necessary:

1. Instead of reprocessing entire datasets, platforms will focus on updating only what changed: faster, cheaper and smarter.
2. Data bottlenecks caused by uneven distribution will give way to engines that automatically rebalance workloads.
3. Pipeline failures from changing data formats will be solved by automatic checks and agreements between producers and consumers.
4. Unpredictable cloud costs will be tamed by serverless, auto-scaling Spark that adjusts resources on demand.
5. Businesses won’t rely on stale batch reports; real-time and batch will converge, delivering insights instantly.
6. Machine Learning will become more reliable through reproducible snapshots of data that keep training and production in sync.
7. Spark will tap into the power of GPUs and accelerators, boosting both AI and heavy data processing.
8. Debugging will no longer be a guessing game; advanced observability tools will pinpoint problems instantly.
9. Centralized data teams will share responsibility as organizations embrace a self-serve model, empowering domain teams.
10. Security and privacy will be non-negotiable, with fine-grained controls, encryption and compliance baked into platforms.
11. Manual performance tuning will fade away, replaced by intelligent systems that learn and auto-optimize job configurations.
12. Reinventing infrastructure patterns will stop; standard blueprints on Kubernetes will make Spark deployments seamless.

In short: the future of Spark is not just about speed; it’s about trust, efficiency, security and real-time intelligence.

Which of these Spark trends do you see happening in your organization already?
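The first trend in the list, updating only what changed instead of reprocessing everything, can be sketched with a simple watermark. This is plain Python rather than Spark, and the record fields (`id`, `value`, `updated_at`) and the function name `incremental_update` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def incremental_update(
    state: Dict[str, int], records: List[dict], watermark: int
) -> Tuple[int, int]:
    """Apply only records newer than the watermark.

    Returns the new watermark and the number of records processed.
    """
    new_watermark = watermark
    processed = 0
    for rec in records:
        if rec["updated_at"] > watermark:
            state[rec["id"]] = rec["value"]  # upsert the changed row
            processed += 1
            new_watermark = max(new_watermark, rec["updated_at"])
    return new_watermark, processed


state: Dict[str, int] = {}
batch1 = [
    {"id": "a", "value": 1, "updated_at": 1},
    {"id": "b", "value": 2, "updated_at": 2},
]
wm, n1 = incremental_update(state, batch1, 0)

# The second run sees the old rows again plus one change; only the change is applied.
batch2 = batch1 + [{"id": "a", "value": 9, "updated_at": 3}]
wm, n2 = incremental_update(state, batch2, wm)
```

A production system would typically persist the watermark and handle late-arriving records, which this sketch leaves out.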
Sajjad Tariq

Founder @ Exceptional IT Training | Zero-to-Hero AWS DevOps Bootcamp in 12 Weeks | DevOps & Cloud Mentor | Supporting Career Changers & Tech Professionals Worldwide
1mo
MongoDB Stock Rockets After Q2 Beat Trend: MongoDB shares surged 23% after reporting a 24% revenue rise—driven by AI-driven database adoption. Why it matters: Data platforms are fueling AI—increasing demand for reliable, scalable storage. Question: How central is AI to your database strategy? 🔁 Repost if data infrastructure supports innovation 🔔 Follow me for what powers the AI engine 📈 Database innovation = AI potential
Sajjad Tariq

Founder @ Exceptional IT Training | Zero-to-Hero AWS DevOps Bootcamp in 12 Weeks | DevOps & Cloud Mentor | Supporting Career Changers & Tech Professionals Worldwide
1mo
MongoDB Stocks Soar on AI Demand Trend: MongoDB stock jumped over 30% after beating Q2 expectations—subscription revenue rose 23%, notably from AI app users, and Atlas grew 29%. Why it matters: Data platforms that support AI workloads are winning investor confidence. Question: Is your database ready to power the AI wave? 🔁 Repost if storage powers the AI revolution 🔔 Follow me for data strategy that scales 📈 AI drives database demand
Eisha Shah

Data Engineer | Microsoft-certified | Building Scalable Data Solutions with Microsoft Fabric & Spark | Enabling Agentic AI Solutions
4d
𝗦𝗰𝗮𝗹𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴

As data grows in both volume and complexity, one of the biggest challenges in data engineering is ensuring that pipelines remain fast, reliable, and efficient. This is where parallel processing becomes essential. Instead of handling data sequentially, parallel processing breaks down large datasets into smaller, independent chunks and processes them simultaneously. This approach significantly reduces processing time and makes it possible to handle workloads that would otherwise be impractical.

From distributed computing frameworks like Apache Spark to modern cloud-native platforms, parallelism is the backbone of large-scale data systems. It not only speeds up ETL workflows but also improves scalability, enabling data teams to keep pace with ever-growing business demands.

Recently, I’ve been working on a project in Microsoft Fabric that involves processing millions of rows of data on a daily basis. One of the challenges has been figuring out how to implement multithreading and parallel execution to cut down runtime and make the pipelines more efficient. Balancing speed with data consistency has pushed me to test different approaches, and it’s been a hands-on way of seeing how parallel processing concepts apply in practice.

At its core, parallel processing is about doing more in less time, an essential principle for anyone building robust data pipelines.

#DataEngineer #BigData #ETL #DataPipelines #DistributedComputing #MicrosoftFabric #Azure #DataAnalytics #Spark
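The chunk-and-process-in-parallel idea described above can be sketched with Python's standard `concurrent.futures` module. This is not the Microsoft Fabric setup from the post; `chunked`, `transform_chunk`, and `process_parallel` are hypothetical names, and the doubling transform is a placeholder for real per-chunk work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def chunked(rows: List[int], size: int) -> List[List[int]]:
    """Split rows into independent chunks of at most `size` items."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def transform_chunk(chunk: List[int]) -> List[int]:
    """Placeholder transform; each chunk is processed independently."""
    return [x * 2 for x in chunk]


def process_parallel(rows: List[int], chunk_size: int = 1000, workers: int = 4) -> List[int]:
    """Process chunks concurrently and reassemble them in the original order."""
    chunks = chunked(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so output order is stable.
        results = pool.map(transform_chunk, chunks)
        return [x for part in results for x in part]


rows = list(range(10))
doubled = process_parallel(rows, chunk_size=3, workers=2)
```

Threads suit I/O-bound steps such as reading files or calling services; for CPU-bound transforms in CPython, a process pool or a distributed engine is usually the better fit.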
Paula Santos
4w
For those who work in FinOps, we know that efficiency goes far beyond optimizing VMs. It's in the architecture, in the governance, and, most importantly, in how we manage the most expensive asset in the cloud: data. It was with this mindset that I dove into the Databricks Fundamentals course, and the connection to FinOps was immediate. For my colleagues looking to optimize TCO (Total Cost of Ownership) and maximize the value of their data platforms, here are my key insights:

💡 1. Less Redundancy, Lower Costs (The Power of the Lakehouse): The Lakehouse architecture tackles one of the biggest money drains in the cloud: data duplication. By unifying the Data Warehouse and Data Lake, we eliminate the need to maintain separate copies (and redundant ETL pipelines) for BI and AI. Less storage + less processing = direct savings.

💡 2. End the Waste of Idle Compute (Hello, Serverless!): Managing clusters is a classic cost challenge. Databricks' Serverless compute abstracts this complexity away by allocating resources on-demand and eliminating the cost of idle clusters. For FinOps, this means paying for exactly what you use, optimizing price-performance.

💡 3. Freedom is Power (and Savings) (Open Standards): Databricks is built on open standards like Apache Spark and Delta Lake. What does this mean for FinOps? An end to vendor lock-in. Your data remains in an open format (Parquet) in your own storage. This guarantees long-term portability and negotiating power, a strategic pillar for cost control.

💡 4. Data Sharing without Duplication Costs (Delta Sharing): Traditionally, sharing data with partners means creating copies, generating storage and transfer costs. Delta Sharing allows for "in-place" data sharing without moving it. The impact of this on the TCO of collaborative data ecosystems is huge.

💡 5. Centralized Governance = Cost Visibility (Unity Catalog): Good governance with Unity Catalog isn't just about security. It's about understanding data lineage, who uses the data, and for what purpose. This visibility is crucial for accurate chargeback/showback, allowing data costs to be attributed directly to the business units that generate value from them.

In conclusion, understanding Databricks isn't about learning another data tool. It's about understanding an architecture that, by design, tackles the main cost drivers of a modern data ecosystem. Have you analyzed the TCO of a unified data platform?

#FinOps #CloudArchitecture #Databricks #CloudCost #TCO #DataEngineering #Lakehouse #AWS #Azure #GCP
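As a rough illustration of the showback arithmetic mentioned in point 5, the sketch below attributes compute cost to business units from usage records. It does not use Unity Catalog or any Databricks API; the record fields, the `showback` function, and the flat rate of 250 cents per compute hour are all assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def showback(usage: List[dict], rate_cents_per_hour: int) -> Dict[str, int]:
    """Sum the cost in cents per business unit at a flat hourly rate."""
    costs: Dict[str, int] = defaultdict(int)
    for rec in usage:
        costs[rec["business_unit"]] += rec["compute_hours"] * rate_cents_per_hour
    return dict(costs)


# Hypothetical usage records, e.g. derived from lineage and audit data.
usage = [
    {"business_unit": "marketing", "compute_hours": 10},
    {"business_unit": "finance", "compute_hours": 4},
    {"business_unit": "marketing", "compute_hours": 6},
]
report = showback(usage, rate_cents_per_hour=250)
```

Working in integer cents avoids floating-point rounding when the totals are later reconciled against an invoice.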

Academy Accreditation - Databricks Fundamentals • Paula Gonçalves dos Santos • Databricks Academy credentials.databricks.com

Prithvika Babu

Data Analyst @ Compass Group | Analytics with Impact | Power BI • Python • SQL | AI-Driven Insights
4d
Ever wondered how Google, Meta or Amazon handle millions of users refreshing data every second without a single drop in performance? This article by Prasad A. Parit explains the how. From distributed storage (HDFS, S3, NoSQL) to processing engines like MapReduce and Spark, and real-time tools like Kafka and Flink, big tech has built a connected ecosystem that keeps data moving fast. Caching, indexing, CDNs and even AI-driven optimizations go a long way and make your feed refresh instantly. What stood out most is that speed and reliability are prioritized over new features. Because if your app doesn’t load smoothly, nothing else matters. https://lnkd.in/gQzyKtfY #bigdata #dataprocessing #distributedsystems #Hadoop #Mapreduce #dataanalytics #dataengineering #bigtech #tech #ai

How Big Tech Companies Store, Manage, and Process Massive Data with High Speed and Efficiency medium.com
Ronak Jain 🧿

Cloud Data Engineer at Skyhigh Security | GenAI & MLOps Enthusiast | Azure • Databricks • Python • Snowflake • Terraform • Prometheus • Grafana • OpenSearch | Building MLOps Infrastructure
3w
From a messy data infrastructure to a smooth, production-grade pipeline. Here's a quick look at a recent challenge I tackled. When I started on a new project, we were facing a bottleneck in our data processing, with huge datasets and slow ETL workflows. It was costing us time and resources. My approach: I decided to leverage Databricks and PySpark to refactor the entire workflow. We built a scalable and efficient pipeline that not only processed data faster but also improved overall data quality. I even integrated GenAI-ready pipelines using LangChain to support a new LLM inference workflow. The results were clear: a significant improvement in efficiency and a 99.9% uptime for our data infrastructure. It's a reminder that a well-designed data strategy is the backbone of any successful AI/ML project. What's the biggest data or MLOps challenge you've overcome recently? Share your story in the comments! #DataEngineering #MLOps #GenAI #CloudComputing #AWS #Azure #Databricks #PySpark #LinkedIn
TechIntelPro

2,089 followers
4w
EDB showcases Postgres AI at Supermicro OSS 2025, unifying data lakehouses and modernizing apps for AI with 6x performance and 90% better value - https://lnkd.in/gweSTyAc “EDB’s focus on Postgres as a universal data platform for lakehouses and modern applications positions it as a key enabler for organizations navigating the dual challenges of exploding data volumes and the urgent demand for AI-driven innovation,” said Devin Pratt, research director at IDC. #AppModernization #OpenStorageSummit #SovereignAI #TechIntelPro

EDB Postgres AI Powers Lakehouses at Supermicro Summit 2025 techintelpro.com
Bhaskar Sampathkumar

Principal AI Solutions Architect - Data & Insights
3w
Neo4j has launched a graph database built to unify workloads at 100TB+ scale for generative AI. Good to know, and I am hoping this feature becomes available in Aura instances soon.

Sudhir Hasbe

President & Chief Product Officer at Neo4j
3w Edited

🚀 Super excited to announce the availability of Neo4j Infinigraph, our most scalable graph database yet, built to unify real-time transactions and deep analytics at 100TB+ scale with ACID compliance. With Infinigraph, enterprises no longer need to choose between speed and scale, or stitch together transactional and analytical systems. Now you can:

✅ Run operational + analytical workloads in a single system
✅ 100TB+ horizontal scale with zero application rewrites
✅ Embed billions of vectors directly in the graph
✅ High performance across massive transactional and analytical workloads
✅ High availability across data centers through autonomous clustering, which detects and recovers from failures automatically
✅ No ETL pipelines, sync delays, or duplicated storage
✅ Preserved graph structure for real-time traversal, even at scale
✅ Full ACID compliance for consistent enterprise-grade data integrity
✅ Pricing designed for scale, with compute and storage billed separately, for greater control over cost and deployment flexibility.

This is a breakthrough for customers who need to fight global fraud, analyze decades of compliance data, or deliver real-time product recommendations at massive scale. As I shared: “Infinigraph sets a new standard for enterprise graph databases: one system that runs real-time operations and deep analytics together, at full fidelity and massive scale.” We’re proud to build on our history of innovation and deliver the graph infrastructure that powers intelligent applications for 84 of the Fortune 100. Thanks Ivan Zoratti and Florin Manole for driving this effort over the past 2+ years.

cc: Ivan Zoratti Magnus Vejlstrup Emil Eifrem Mike Asher Mark Woodhams Ajay Singh Dan McGrath Charles Dolan Anurag Tandon Michael Hunger Shradha K. David Fauth Ravi Ramanathan Robert Strange Kris Payne Jesús Barrasa Bryan Evans Dan Broom Kristen Pimpini (KP) Michael Moore, Ph.D.

#GraphDatabase #AI #DataArchitecture #Neo4j #GenAI https://lnkd.in/gy8mkQAu

Neo4j unifies real-time transactions and graph analytics at scale - SiliconANGLE siliconangle.com
