"Most pipelines in enterprises are not web scale" Can't agree more with Chad Spark is very powerful but not for all your data engineering requirements. Life could be easier if engineers think before throwing Spark to non Spark problems.
Long Live Spark, Down with Spark 🙃

Spark has been the de facto processing engine in the "Big Data" community for the last 10 years, and it still has its place in data processing. Some have branched off their own versions, spun off custom engines, and developed new approaches in an attempt to overcome Spark's shortcomings in diverse landscapes. Our focus is to enable the right compute for the problem, with an emphasis on interoperability with other systems, and to have open source compute engines as first-class citizens in the platform.

Enterprises are highly complex, with high heterogeneity of solutions, tech debt, data gravity, and modalities. I would argue that most pipelines in enterprises are not web scale; they are small and medium sized: a few million or hundreds of millions of rows, sometimes a few billion. Spark is overkill for 90%+ of these workloads and carries a lot of overhead that impacts speed and cost.

With the Palantir Multi-Modal Data Plane (MMDP), we provide the framework to store data anywhere, compute data anywhere, and use any model. Breaking this down a bit more: I can have data spread across different platforms, file formats, and structures, and use any compute (inside or outside of the platform) to process it. The permutations of storage locations, formats, and compute engines are massive. It's about the right tool for the problem at hand, not one tool you need to shove everything into. Re-platforming should never be the goal.

With new frameworks like Polars, DataFusion, and DuckDB, and the availability of larger node sizes, you can run increasingly bigger and more complex workloads in a fraction of the time and cost. You can mix and match these as needed in Foundry and AIP, even having one step run on one engine and another step use a different engine. (You can even bring your own engine.)

Spark is always there on the table, but it should no longer be the default starting point for most enterprises.