I really don't get a lot of this criticism. For example, who is using Iceberg with hundreds of concurrent committers, especially at the scale mentioned in the article (10k rows per second)? Using Iceberg, or any table format over object storage, would be insane in that case. But your typical Spark application has one main writer (the Spark driver) appending or merging a large number of records in > 1 minute microbatches, plus maybe a handful of maintenance jobs for compaction and retention; Iceberg's concurrency system works fine there.
If you have a use case like the one the author describes, maybe use an in-memory cloud database with tiered storage, or a plain RDBMS. Iceberg (and similar formats) work great for the use cases for which they're designed.
> who is using Iceberg with hundreds of concurrent committers, especially at the scale mentioned in the article (10k rows per second)? Using Iceberg, or any table format over object storage, would be insane in that case
100M database inserts per second were achieved with D4M and Accumulo as far back as 2014, and object storage wasn't needed for that exercise.
Someone needs to come up with a lakehouse system based on D4M; it's long overdue.
D4M is also based on sound mathematics, not unlike the venerable SQL [2].
[1] Achieving 100M database inserts per second using Apache Accumulo and D4M (2017 - 46 comments):
> But for your typical spark application, you have one main writer (the spark driver) appending or merging a large number of records...
A multi-writer architecture isn't proven scalable just because a single writer doesn't make it fall over.
I have caused issues by running 500 concurrent writers on embarrassingly parallel workloads. I have watched people choose sharding schemes to accommodate Iceberg's metadata throughput, NOT the natural/logical sharding of the underlying data.
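To make the contention concrete, here's a toy model of optimistic-concurrency commits (the compare-and-swap on a table's metadata pointer that Iceberg-style catalogs perform). This is an illustrative sketch of the worst case, not Iceberg's actual code: real writers can often rebase a conflicting commit without redoing the data files, but they still contend on the single metadata pointer.

```python
# Toy model: N writers all start from the same table snapshot. Each
# round, every pending writer attempts a compare-and-swap on the
# table's metadata pointer; exactly one wins, and the losers must
# re-read metadata and retry. Worst case, total attempts are
# quadratic in the number of committers.
def simulate_commits(num_writers: int) -> int:
    """Return total commit attempts until all writers succeed."""
    attempts = 0
    pending = num_writers
    while pending > 0:
        attempts += pending  # every pending writer tries this round
        pending -= 1         # one CAS wins; the rest retry
    return attempts

# 1 writer: 1 attempt. 500 writers: 500 * 501 / 2 = 125250 attempts.
```

The single-writer case is trivially cheap, which is exactly why it proves nothing about the 500-writer case.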
Last I half-knew (so check me), Spark may have done some funky stuff to work around Iceberg's shortcomings. That is useless if you're not using Spark. If scalability of the architecture requires a funky client in one language and a cooperative backend, we might as well be sticking HDF5 on Lustre. HDF5 on Lustre never fell over for me in the 1000+ embarrassingly parallel concurrent writer case (massive HPC turbulence restart files with 32K concurrent writers, per https://ieeexplore.ieee.org/abstract/document/6799149).