Greenplum + AI: Personalized Blog Recommendations at Scale

I created a demo of how to use AI with Greenplum to build a recommendation engine. The idea is simple: take a large dataset of blog posts, turn them into vectors with embeddings, cluster them into topics, and then generate personalized recommendations based on user preferences. The key lies in how Greenplum stores, processes, and queries everything at scale.

Greenplum is particularly well-suited for this kind of project because it allows me to bring together structured data, unstructured text, and semi-structured formats like JSON or XML in one place. It handles graph, time-series, and geospatial workloads too. With the ability to store both raw data and vectors, Greenplum becomes an engine for high-speed iteration across the data science process. That means I can preprocess, embed, cluster, and search — all in the same system — without constantly exporting data to external pipelines.

From Kaggle Data to Embeddings

I started with the Kaggle Medium Blog Posts dataset, which contains over 100,000 blog titles and descriptions. These were loaded into Greenplum as a simple table.
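Under the hood this is just an ordinary table load. The sketch below shows one way to do it from Python; the table name (blog_posts), column names, CSV filename, and distribution key are illustrative assumptions, not the exact schema from the repo.

    import csv
    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Hypothetical schema; the real table in the repo may differ.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blog_posts (
            id serial PRIMARY KEY,
            title text,
            description text
        ) DISTRIBUTED BY (id);
    """)

    # Assumed local export of the Kaggle CSV with title/description columns.
    with open("medium_posts.csv", newline="") as f:
        for row in csv.DictReader(f):
            cur.execute(
                "INSERT INTO blog_posts (title, description) VALUES (%s, %s)",
                (row["title"], row["description"]),
            )
    conn.commit()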

Next, I used the OpenAI Python SDK to generate embeddings for each post. The vectors were stored directly in Greenplum using the pgvector extension, which adds support for similarity search. This allows me to query articles by semantic meaning, not just keywords.

The process is automated by a script (genvec.py) in my repo ivannovick/blog-recommendations. It loops through blog posts, generates embeddings via API calls, and updates each row with the resulting vector. The repo also includes scripts for clustering (cluster.sql), summarization (summarize.py), and interactive recommendations (rec-based-summaries.py).
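Conceptually, that loop looks like the sketch below. This is not the actual genvec.py: the embedding model, the vector(1536) column, and the simple row-by-row update are assumptions kept minimal for illustration.

    import psycopg2
    from openai import OpenAI

    client = OpenAI()                      # reads OPENAI_API_KEY from the environment
    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Assumed setup: pgvector installed, embeddings stored on the same table.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS embedding vector(1536);")

    cur.execute("""
        SELECT id, coalesce(title, '') || ' ' || coalesce(description, '')
        FROM blog_posts
        WHERE embedding IS NULL;
    """)
    for post_id, text in cur.fetchall():
        emb = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
        cur.execute(
            "UPDATE blog_posts SET embedding = %s::vector WHERE id = %s",
            ("[" + ",".join(str(x) for x in emb) + "]", post_id),
        )
    conn.commit()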

Clustering with PL/Python

Once embeddings were in place, I used PL/Python with scikit-learn to run KMeans clustering directly inside Greenplum. This is a form of unsupervised learning where similar articles are grouped into topical clusters.

The math is straightforward: each cluster has a centroid, which is the average of all the vectors in that cluster. Articles are assigned to whichever centroid they are closest to in vector space. The clustering step revealed natural topical groupings — technology, media, politics, lifestyle, and more.
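In the demo this step runs inside the database as a PL/Python function (see cluster.sql). The client-side sketch below shows the equivalent scikit-learn call; the cluster count and the cluster_id column name are assumptions.

    import json
    import numpy as np
    import psycopg2
    from sklearn.cluster import KMeans

    NUM_CLUSTERS = 10                       # hypothetical; tune to the dataset

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    cur.execute("SELECT id, embedding FROM blog_posts WHERE embedding IS NOT NULL;")
    rows = cur.fetchall()
    ids = [r[0] for r in rows]
    vectors = np.array([json.loads(r[1]) for r in rows])   # pgvector text -> float arrays

    km = KMeans(n_clusters=NUM_CLUSTERS, n_init=10, random_state=0).fit(vectors)
    # km.cluster_centers_ holds one centroid (the average vector) per topical cluster

    cur.execute("ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS cluster_id int;")
    for post_id, label in zip(ids, km.labels_):
        cur.execute("UPDATE blog_posts SET cluster_id = %s WHERE id = %s", (int(label), post_id))
    conn.commit()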

Cluster summaries were then generated using GPT, giving a human-readable description of what each group of articles was about. These summaries were stored in a separate table (blog_cluster_summaries) for use in personalization.
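A rough sketch of that pass is below. It is not the actual summarize.py: the chat model, the prompt, the 20-title sample, and the blog_cluster_summaries columns are all assumptions.

    import psycopg2
    from openai import OpenAI

    client = OpenAI()
    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    # Assumed two-column layout for the summaries table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blog_cluster_summaries (
            cluster_id int,
            summary text
        ) DISTRIBUTED BY (cluster_id);
    """)

    cur.execute("SELECT DISTINCT cluster_id FROM blog_posts WHERE cluster_id IS NOT NULL;")
    for (cluster_id,) in cur.fetchall():
        cur.execute("SELECT title FROM blog_posts WHERE cluster_id = %s LIMIT 20;", (cluster_id,))
        titles = "\n".join(t for (t,) in cur.fetchall())
        reply = client.chat.completions.create(
            model="gpt-4o-mini",   # assumed model
            messages=[{
                "role": "user",
                "content": "Describe the common topic of these blog titles in one sentence:\n" + titles,
            }],
        )
        cur.execute(
            "INSERT INTO blog_cluster_summaries VALUES (%s, %s)",
            (cluster_id, reply.choices[0].message.content),
        )
    conn.commit()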

From Vectors to Personalization

The recommendation step works by letting a user rate cluster summaries instead of individual articles. For example, if I’m most interested in AI and less in politics, I can give the AI cluster a high rating and the politics cluster a low one.

Here’s where the math comes in. Every article has already been transformed into a vector — a long list of numbers that capture the meaning of the text. When we cluster these vectors, each group of related articles forms a center point in that high-dimensional space. This center point, or centroid, represents the “average” meaning of the cluster.

When I rate clusters, I’m essentially saying how closely my personal interests align with each of those centroids. A high rating puts more emphasis on that centroid, pulling the preference profile closer to it, while a low rating pushes it away. By blending all of these centroids together according to my ratings, the system builds a single composite vector that reflects my unique tastes.
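In code, the blend is just a weighted average of centroids. The toy sketch below uses made-up 3-dimensional centroids and ratings so the arithmetic is visible; the real vectors have far more dimensions, and rec-based-summaries.py may weight ratings differently.

    import numpy as np

    # Toy 3-dimensional centroids standing in for the real high-dimensional ones.
    centroids = {
        0: np.array([0.9, 0.1, 0.0]),   # e.g. the "AI" cluster
        1: np.array([0.0, 0.8, 0.2]),   # e.g. the "politics" cluster
        2: np.array([0.3, 0.3, 0.4]),   # e.g. the "lifestyle" cluster
    }
    ratings = {0: 5, 1: 1, 2: 3}        # hypothetical 1-5 ratings per cluster

    weights = np.array([ratings[c] for c in sorted(centroids)], dtype=float)
    weights /= weights.sum()            # normalize ratings into blend weights
    preference_vector = sum(w * centroids[c] for w, c in zip(weights, sorted(centroids)))
    print(preference_vector)            # a single composite vector of user taste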

This “preference vector” sits in the same space as the article vectors. That means I can directly compare it to every article in the database. Articles that are closest to this vector are treated as the most relevant, because they align strongly with the topics I rated highly. Articles that sit far away in vector space are treated as least relevant, because they resemble the clusters I didn’t care for.
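With pgvector, that comparison is a single ORDER BY over a distance operator. The sketch below uses cosine distance (<=>) and the preference vector from the previous step; the actual script may use a different operator, limit, or scoring.

    import psycopg2

    # In practice this is the full-length blend computed above; its dimension
    # must match the embedding column.
    preference_vector = [0.6, 0.24, 0.16]
    pref = "[" + ",".join(str(x) for x in preference_vector) + "]"

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()

    cur.execute("""
        SELECT title, embedding <=> %s::vector AS distance
        FROM blog_posts
        ORDER BY distance ASC          -- closest = most relevant
        LIMIT 25;
    """, (pref,))
    most_relevant = cur.fetchall()

    cur.execute("""
        SELECT title, embedding <=> %s::vector AS distance
        FROM blog_posts
        ORDER BY distance DESC         -- farthest = least relevant
        LIMIT 25;
    """, (pref,))
    least_relevant = cur.fetchall()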

The beauty of this approach is that it turns subjective ratings into a precise mathematical object that can be used for fast, large-scale similarity search. The end result feels intuitive — the system recommends content I actually want to read — but under the hood it’s powered by the geometry of vectors and the mathematics of weighted combinations.

The Software Flow

The project runs in a clear sequence:

  • Data loading: blog posts from Kaggle into Greenplum
  • Vector generation: embeddings created in Python, stored in pgvector
  • Clustering: KMeans in PL/Python, grouping articles into topics
  • Summarization: GPT-generated text summaries for each cluster
  • Preference modeling: user ratings on summaries converted into a weighted vector
  • Recommendations: similarity search in SQL to find the most and least interesting content

Example Outcomes

When running the demo, I can pull up the 25 most interesting and 25 least interesting blog posts in under a second, all tailored to my personal preference vector. That weighted vector is matched against every blog’s embedding in the same vector space, and thanks to Greenplum’s high-speed, high-volume vector indexing, the system can instantly retrieve the closest (and farthest) matches.
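The sub-second lookups depend on indexing the embedding column rather than scanning it. One pgvector-style option is an IVFFlat index over cosine distance, sketched below; the index type and parameters are assumptions and depend on the pgvector build available in your Greenplum installation.

    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    # Hypothetical approximate-nearest-neighbor index; lists is a tuning knob.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS blog_posts_embedding_idx
        ON blog_posts
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);
    """)
    conn.commit()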

The result is a powerful example of using Greenplum as the engine for AI-driven personalization—scalable, SQL-native, and tightly integrated with modern AI tools. While this demo uses blog posts, the same pattern extends naturally to retail product recommendations, healthcare content delivery, education course suggestions, and personalized media feeds. Wherever tailored recommendations matter, Greenplum + AI provides the foundation.

Repo

The full project, including schema, scripts, and demo instructions, is here: 👉 https://github.com/ivannovick/blog-recommendations



Ashwin Agrawal

Technology Lead and Distinguished Engineer - Greenplum Database (massively parallel PostgreSQL) [Broadcom]


👏 Plus, Greenplum's row-level and cell-level security and anonymization rules can be defined to further control which vector row embeddings are allowed to influence recommendations or decision-making based on user roles, which makes it a holistic solution in one place.
