Greenplum + AI: Personalized Blog Recommendations at Scale

I created a demo of how to use AI with Greenplum to build a recommendation engine. The idea is simple: take a large dataset of blog posts, turn them into vectors with embeddings, cluster them into topics, and then generate personalized recommendations based on user preferences. The key lies in how Greenplum stores, processes, and queries everything at scale.

Greenplum is particularly well-suited for this kind of project because it allows me to bring together structured data, unstructured text, and semi-structured formats like JSON or XML in one place. It handles graph, time-series, and geospatial workloads too. With the ability to store both raw data and vectors, Greenplum becomes an engine for high-speed iteration across the data science process. That means I can preprocess, embed, cluster, and search — all in the same system — without constantly exporting data to external pipelines.

From Kaggle Data to Embeddings

I started with the Kaggle Medium Blog Posts dataset, which contains over 100,000 blog titles and descriptions. These were loaded into Greenplum as a simple table.
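Under the hood this is just an ordinary table load. The sketch below shows one way to do it from Python; the table name (blog_posts), column names, CSV filename, and distribution key are illustrative assumptions, not the exact schema from the repo.

    import csv
    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Hypothetical schema; the real table in the repo may differ.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blog_posts (
            id serial PRIMARY KEY,
            title text,
            description text
        ) DISTRIBUTED BY (id);
    """)

    # Assumed local export of the Kaggle CSV with title/description columns.
    with open("medium_posts.csv", newline="") as f:
        for row in csv.DictReader(f):
            cur.execute(
                "INSERT INTO blog_posts (title, description) VALUES (%s, %s)",
                (row["title"], row["description"]),
            )
    conn.commit()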

Next, I used the OpenAI Python SDK to generate embeddings for each post. The vectors were stored directly in Greenplum using the pgvector extension, which adds support for similarity search. This allows me to query articles by semantic meaning, not just keywords.

The process is automated by a script (genvec.py) in my repo ivannovick/blog-recommendations. It loops through blog posts, generates embeddings via API calls, and updates each row with the resulting vector. The repo also includes scripts for clustering (cluster.sql), summarization (summarize.py), and interactive recommendations (rec-based-summaries.py).
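Conceptually, that loop looks like the sketch below. This is not the actual genvec.py: the embedding model, the vector(1536) column, and the simple row-by-row update are assumptions kept minimal for illustration.

    import psycopg2
    from openai import OpenAI

    client = OpenAI()                      # reads OPENAI_API_KEY from the environment
    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Assumed setup: pgvector installed, embeddings stored on the same table.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS embedding vector(1536);")

    cur.execute("""
        SELECT id, coalesce(title, '') || ' ' || coalesce(description, '')
        FROM blog_posts
        WHERE embedding IS NULL;
    """)
    for post_id, text in cur.fetchall():
        emb = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
        cur.execute(
            "UPDATE blog_posts SET embedding = %s::vector WHERE id = %s",
            ("[" + ",".join(str(x) for x in emb) + "]", post_id),
        )
    conn.commit()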

Clustering with PL/Python

Once embeddings were in place, I used PL/Python with scikit-learn to run KMeans clustering directly inside Greenplum. This is a form of unsupervised learning where similar articles are grouped into topical clusters.

The math is straightforward: each cluster has a centroid, which is the average of all the vectors in that cluster. Articles are assigned to whichever centroid they are closest to in vector space. The clustering step revealed natural topical groupings — technology, media, politics, lifestyle, and more.
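In the demo this step runs inside the database as a PL/Python function (see cluster.sql). The client-side sketch below shows the equivalent scikit-learn call; the cluster count and the cluster_id column name are assumptions.

    import json
    import numpy as np
    import psycopg2
    from sklearn.cluster import KMeans

    NUM_CLUSTERS = 10                       # hypothetical; tune to the dataset

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    cur.execute("SELECT id, embedding FROM blog_posts WHERE embedding IS NOT NULL;")
    rows = cur.fetchall()
    ids = [r[0] for r in rows]
    vectors = np.array([json.loads(r[1]) for r in rows])   # pgvector text -> float arrays

    km = KMeans(n_clusters=NUM_CLUSTERS, n_init=10, random_state=0).fit(vectors)
    # km.cluster_centers_ holds one centroid (the average vector) per topical cluster

    cur.execute("ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS cluster_id int;")
    for post_id, label in zip(ids, km.labels_):
        cur.execute("UPDATE blog_posts SET cluster_id = %s WHERE id = %s", (int(label), post_id))
    conn.commit()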

Cluster summaries were then generated using GPT, giving a human-readable description of what each group of articles was about. These summaries were stored in a separate table (blog_cluster_summaries) for use in personalization.
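A rough sketch of that pass is below. It is not the actual summarize.py: the chat model, the prompt, the 20-title sample, and the blog_cluster_summaries columns are all assumptions.

    import psycopg2
    from openai import OpenAI

    client = OpenAI()
    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Assumed two-column layout for the summaries table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blog_cluster_summaries (
            cluster_id int,
            summary text
        ) DISTRIBUTED BY (cluster_id);
    """)

    cur.execute("SELECT DISTINCT cluster_id FROM blog_posts WHERE cluster_id IS NOT NULL;")
    for (cluster_id,) in cur.fetchall():
        cur.execute("SELECT title FROM blog_posts WHERE cluster_id = %s LIMIT 20;", (cluster_id,))
        titles = "\n".join(t for (t,) in cur.fetchall())
        reply = client.chat.completions.create(
            model="gpt-4o-mini",   # assumed model
            messages=[{
                "role": "user",
                "content": "Describe the common topic of these blog titles in one sentence:\n" + titles,
            }],
        )
        cur.execute(
            "INSERT INTO blog_cluster_summaries VALUES (%s, %s)",
            (cluster_id, reply.choices[0].message.content),
        )
    conn.commit()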

From Vectors to Personalization

The recommendation step works by letting a user rate cluster summaries instead of individual articles. For example, if I’m most interested in AI and less in politics, I can give the AI cluster a high rating and the politics cluster a low one.

Here’s where the math comes in. Every article has already been transformed into a vector — a long list of numbers that capture the meaning of the text. When we cluster these vectors, each group of related articles forms a center point in that high-dimensional space. This center point, or centroid, represents the “average” meaning of the cluster.

When I rate clusters, I’m essentially saying how closely my personal interests align with each of those centroids. A high rating puts more emphasis on that centroid, pulling the preference profile closer to it, while a low rating pushes it away. By blending all of these centroids together according to my ratings, the system builds a single composite vector that reflects my unique tastes.
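In code, the blend is just a weighted average of centroids. The toy sketch below uses made-up 3-dimensional centroids and ratings so the arithmetic is visible; the real vectors have far more dimensions, and rec-based-summaries.py may weight ratings differently.

    import numpy as np

    # Toy 3-dimensional centroids standing in for the real high-dimensional ones.
    centroids = {
        0: np.array([0.9, 0.1, 0.0]),   # e.g. the "AI" cluster
        1: np.array([0.0, 0.8, 0.2]),   # e.g. the "politics" cluster
        2: np.array([0.3, 0.3, 0.4]),   # e.g. the "lifestyle" cluster
    }
    ratings = {0: 5, 1: 1, 2: 3}        # hypothetical 1-5 ratings per cluster

    weights = np.array([ratings[c] for c in sorted(centroids)], dtype=float)
    weights /= weights.sum()            # normalize ratings into blend weights
    preference_vector = sum(w * centroids[c] for w, c in zip(weights, sorted(centroids)))
    print(preference_vector)            # a single composite vector of user taste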

This “preference vector” sits in the same space as the article vectors. That means I can directly compare it to every article in the database. Articles that are closest to this vector are treated as the most relevant, because they align strongly with the topics I rated highly. Articles that sit far away in vector space are treated as least relevant, because they resemble the clusters I didn’t care for.
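With pgvector, that comparison is a single ORDER BY over a distance operator. The sketch below uses cosine distance (<=>) and the preference vector from the previous step; the actual script may use a different operator, limit, or scoring.

    import psycopg2

    # In practice this is the full-length blend computed above; its dimension
    # must match the embedding column.
    preference_vector = [0.6, 0.24, 0.16]
    pref = "[" + ",".join(str(x) for x in preference_vector) + "]"

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    cur.execute("""
        SELECT title, embedding <=> %s::vector AS distance
        FROM blog_posts
        ORDER BY distance ASC          -- closest = most relevant
        LIMIT 25;
    """, (pref,))
    most_relevant = cur.fetchall()

    cur.execute("""
        SELECT title, embedding <=> %s::vector AS distance
        FROM blog_posts
        ORDER BY distance DESC         -- farthest = least relevant
        LIMIT 25;
    """, (pref,))
    least_relevant = cur.fetchall()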

The beauty of this approach is that it turns subjective ratings into a precise mathematical object that can be used for fast, large-scale similarity search. The end result feels intuitive — the system recommends content I actually want to read — but under the hood it’s powered by the geometry of vectors and the mathematics of weighted combinations.

The Software Flow

The project runs in a clear sequence:

  • Data loading: blog posts from Kaggle into Greenplum
  • Vector generation: embeddings created in Python, stored in pgvector
  • Clustering: KMeans in PL/Python, grouping articles into topics
  • Summarization: GPT-generated text summaries for each cluster
  • Preference modeling: user ratings on summaries converted into a weighted vector
  • Recommendations: similarity search in SQL to find the most and least interesting content

Example Outcomes

When running the demo, I can pull up the 25 most interesting and 25 least interesting blog posts in under a second, all tailored to my personal preference vector. That weighted vector is matched against every blog’s embedding in the same vector space, and thanks to Greenplum’s high-speed, high-volume vector indexing, the system can instantly retrieve the closest (and farthest) matches.
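The sub-second lookups depend on indexing the embedding column rather than scanning it. One pgvector-style option is an IVFFlat index over cosine distance, sketched below; the index type and parameters are assumptions and depend on the pgvector build available in your Greenplum installation.

    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    # Hypothetical approximate-nearest-neighbor index; lists is a tuning knob.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS blog_posts_embedding_idx
        ON blog_posts
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);
    """)
    conn.commit()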

The result is a powerful example of using Greenplum as the engine for AI-driven personalization—scalable, SQL-native, and tightly integrated with modern AI tools. While this demo uses blog posts, the same pattern extends naturally to retail product recommendations, healthcare content delivery, education course suggestions, and personalized media feeds. Wherever tailored recommendations matter, Greenplum + AI provides the foundation.

Repo

The full project, including schema, scripts, and demo instructions, is here: 👉 https://github.com/ivannovick/blog-recommendations



Ashwin Agrawal

Technology Lead and Distinguished Engineer - Greenplum Database (massively parallel PostgreSQL) [Broadcom]


👏 Plus, Greenplum's row-level and cell-level security and anonymization rules can be defined to further control which vector row embeddings are allowed to influence recommendations or decision-making based on user roles, which makes it a holistic solution in one place.
