💊 DATA Pill #169 - Persona vectors, HA Postgres on K8s, streaming lakehouses
Hi,
This week: steer LLMs with persona vectors, see how Airbnb runs Postgres HA on Kubernetes, and learn to build a streaming lakehouse with Flink and Iceberg. Plus, new tools from Google and Databricks, and Flink’s biggest release yet.
ARTICLE
Persona vectors: Monitoring and controlling character traits in language models | 7 min | AI Research | Anthropic
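Anthropic identifies persona vectors: directions in a model's activation space that correspond to character traits such as sycophancy or a tendency to hallucinate. They can be used to monitor and steer those traits without touching the prompt, a practical path toward more controllable AI.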
In MORE LINKS you will read:
Achieving High Availability with distributed database on Kubernetes at Airbnb
5 Ways Dremio Makes Apache Iceberg Lakehouses Easy
Five Python Tips You Won’t Find in Most Curriculums
TUTORIALS
Build a Streaming Lakehouse with Flink, Kafka, Iceberg, and Polaris | 8 min | Data Engineering | Gilles Philippart | Personal Blog
A hands-on guide to setting up a streaming data lakehouse with schema evolution and end-to-end reliability using open-source tools.
NEWS
Apache Flink 2.1.0: Ushers in a New Era of Unified Real-Time Data + AI with Comprehensive Upgrades | 6 min | Streaming & AI | Apache Flink
New AI-native connectors, unified batch and stream processing, improved autoscaling, and hardened production stability make this Flink's most capable release yet.
TOOLS
Introducing LangExtract: A Gemini powered information extraction library | 4 min | NLP | Akshay Goel, Atilla Kiraly | Google for Developers Blog
A lightweight Python library for information extraction with built-in schema validation and few-shot support. Built for fast, type-safe NLP pipelines.
In MORE LINKS you will read:
Databricks Labs LSQL
EVENTS, CONFS, AND MEETUPS
Data Expo 2025 | 10-11 September | Utrecht
The largest data event in the Netherlands returns with 100+ vendors, 150+ sessions, and a packed agenda for engineers, scientists, and data leaders. Free to attend.
PINNACLE PICKS
Your top picks from last week:
Announcing Kedro 1.0 | 6 min | ML | QuantumBlack, AI by McKinsey
Kedro reaches 1.0 with improved modularity, long-term support, and new hooks for ML pipelines.
Stream Kafka Topic to the Iceberg Tables with Zero-ETL | 12 min | Data Streaming | Vu Trinh | Data Engineer Things
Learn how to stream Kafka data into Iceberg tables using Flink for real-time, zero-ETL pipelines.
Why Startups Are Betting Everything on Apache DataFusion | 5 min | Databases | Andrew Lamb | The New Stack Blog
DataFusion is winning over startups with its fast Rust-based query engine and plug-and-play architecture.
____________________
Have any interesting content to share in the DATA Pill newsletter?
➡ Join us on GitHub
➡ Dig into previous editions of DATA Pill
Adam from GetInData, now Xebia