Blog Recommendation Engine with Greenplum + AI


A content recommendation system powered by VMware Tanzu Greenplum, pgvector, and AI-generated embeddings. This project demonstrates how to build a complete AI-driven recommendation pipeline with a modern web interface featuring real-time re-clustering and a metallic design.

🌟 Features

🎨 Modern Metallic Web Interface

  • Glass morphism design with metallic gradients and effects
  • Interactive cluster rating system with custom metallic sliders
  • Real-time re-clustering with adjustable parameters (5-25 clusters)
  • Responsive design that works beautifully on all devices
  • Dark theme with gold accents and shimmer animations

🧠 AI-Powered Recommendation Engine

  • Embedding generation using configurable AI models (OpenAI or local)
  • KMeans clustering with PL/Python and scikit-learn
  • Vector similarity search using pgvector for recommendations (see the query sketch after this list)
  • AI-generated cluster summaries for better user understanding
  • Personalized preference modeling based on cluster ratings
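
A minimal sketch of that pgvector similarity query, assuming psycopg2; the table and column names (blog_posts, embedding, title) are illustrative assumptions, not the project's actual schema:

# Minimal sketch: nearest-neighbour lookup with pgvector's cosine-distance operator (<=>).
import psycopg2

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT title
        FROM blog_posts
        ORDER BY embedding <=> %s::vector   -- smallest cosine distance first
        LIMIT 25
        """,
        ("[0.01, -0.02, 0.03]",),  # shortened; a real query passes the full 768-D vector
    )
    print(cur.fetchall())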

βš™οΈ Advanced Capabilities

  • Dynamic re-clustering - Change cluster count and regenerate groupings
  • Configurable embedding dimensions (supports 768D, 1536D, etc.)
  • Local or cloud AI models via OpenAI-compatible APIs
  • Export functionality for recommendations and analytics
  • Comprehensive pre-flight checks for data integrity

🚀 Quick Start

Prerequisites

  • Greenplum Database 7.x with pgvector extension
  • Python 3.8+ with virtual environment support
  • AI Model Access (OpenAI API or local model server)

Installation

  1. Clone and set up:

    git clone <repository-url>
    cd blog-recommendations
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Download the dataset:

    # Easy one-click download
    ./download_data.sh

    Alternative manual download:

    # Manual download and extract
    curl -o medium_data.csv.zip "https://storage.googleapis.com/kaggle-data-sets/748442/1294572/compressed/medium_data.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250915%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250915T195945Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2641e2acbb04a08f983895bca92680cf78ecc5a4cb273a5b78d249e3b8a780e7da9481fd482a56947908916fdaa3f66dca87644ebc4f3d20a93d16ba18ecd865a60db000af100e0c328a7f028dc30700c23a9fccc34027d187402393331624ca3e45fc5b9d64cdf9dbe84f7774e0bdfcbbff28f1dd12ee662ff112d0fc082776d8b5b57cf5f69892887b03a71a742565f8d79c7149cedb9530e96a4c9bab8ec051d9505105b7c56da4ba145bcea0f9df48e80dae9f6c0e241fe741a119f9c01206974f8c36a5c123be324428df89be0f0024ea7d52e374409e15529a75ae4e70da7ee756a18b3eb8da6b18a9abb3b5515c8afffa81c753301a711947a4005107"
    unzip medium_data.csv.zip
    mv medium_data.csv medium_post_titles.csv
    rm medium_data.csv.zip
  3. Configure environment (edit .env):

    # Database Configuration
    GP_HOST=localhost
    GP_PORT=15432
    GP_USER=gpadmin
    DB_NAME=demo
    
    # AI Model Configuration
    USE_LOCAL_MODELS=true
    LOCAL_API_BASE=http://127.0.0.1:1234/v1
    EMBEDDING_MODEL=text-embedding-nomic-embed-text-v2
    EMBEDDING_DIMENSIONS=768
    
    # Web Interface
    WEB_PORT=8081
  4. Run the complete demo:

    ./run_demo.sh
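
For orientation, here is a minimal sketch of how the .env values from step 3 might be consumed in Python; config.py handles this in the repo, and the use of python-dotenv plus these defaults is an assumption:

# Minimal sketch, assuming python-dotenv; the repo's config.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

GP_HOST = os.getenv("GP_HOST", "localhost")
GP_PORT = int(os.getenv("GP_PORT", "15432"))
GP_USER = os.getenv("GP_USER", "gpadmin")
DB_NAME = os.getenv("DB_NAME", "demo")

USE_LOCAL_MODELS = os.getenv("USE_LOCAL_MODELS", "true").lower() == "true"
LOCAL_API_BASE = os.getenv("LOCAL_API_BASE", "http://127.0.0.1:1234/v1")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-nomic-embed-text-v2")
EMBEDDING_DIMENSIONS = int(os.getenv("EMBEDDING_DIMENSIONS", "768"))
WEB_PORT = int(os.getenv("WEB_PORT", "8081"))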

📊 Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Raw CSV Data   │ -> │  Greenplum + AI  │ -> │  Web Interface  │
│  (Blog Posts)   │    │   Processing     │    │  (Metallic UI)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                      ┌──────────────────┐
                      │  Recommendation  │
                      │      Engine      │
                      └──────────────────┘

Data Flow

  1. Load → Blog posts imported into Greenplum
  2. Embed → AI generates vector embeddings for each post
  3. Cluster → KMeans groups similar content (configurable clusters)
  4. Summarize → AI creates thematic summaries for each cluster
  5. Recommend → Vector similarity finds personalized content
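
As a rough illustration of step 2 (Embed), assuming the OpenAI Python client pointed at an OpenAI-compatible endpoint; genvec.py is the actual implementation, and the table and column names here are illustrative:

# Sketch of the Embed step: fetch posts without embeddings, call an
# OpenAI-compatible endpoint, and store the vectors via pgvector.
import psycopg2
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="not-needed-for-local")

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute("SELECT id, title FROM blog_posts WHERE embedding IS NULL")
    for post_id, title in cur.fetchall():
        resp = client.embeddings.create(model="text-embedding-nomic-embed-text-v2", input=title)
        vec = resp.data[0].embedding  # list of 768 floats
        cur.execute("UPDATE blog_posts SET embedding = %s::vector WHERE id = %s", (str(vec), post_id))
conn.commit()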

🎯 Demo Workflow


Step 1: Cluster Analysis

  • View AI-generated cluster summaries with sample articles
  • Adjust cluster count (5-25) using the metallic slider
  • Click "Re-cluster Data" to dynamically reorganize content

Step 2: Preference Rating

  • Rate each cluster from 1-10 using metallic sliders
  • Build your personalized preference vector in real-time
  • Visual feedback with gold accent colors

Step 3: AI Recommendations

  • Get the top 25 articles most relevant to your preferences
  • See the 25 least relevant articles (for comparison)
  • Export results with metadata and system information
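
One plausible way to turn the cluster ratings into a preference vector is a rating-weighted average of cluster centroids; this is an assumption about the implementation, sketched below, and the resulting vector feeds a pgvector query like the one shown earlier:

# Sketch: rating-weighted average of cluster centroids as the preference vector.
# The approach and all names here are assumptions, not the repo's actual code.
import numpy as np

ratings = {0: 9, 1: 2, 2: 7}                            # cluster_id -> 1-10 slider rating
centroids = {c: np.random.rand(768) for c in ratings}   # stand-in for centroids loaded from the DB

weights = np.array([ratings[c] for c in sorted(ratings)], dtype=float)
vectors = np.stack([centroids[c] for c in sorted(ratings)])
preference = (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# `preference` (768-D) is then used in an `embedding <=> %s::vector` query:
# ascending order for the top 25, descending for the 25 least relevant articles.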

🛠 Technical Stack

Component       Technology                 Purpose
Database        Greenplum 7.x              Scalable data warehouse
Vector Search   pgvector                   Similarity search & storage
Clustering      PL/Python + scikit-learn   KMeans clustering
AI Models       OpenAI API / Local LLMs    Embeddings & summaries
Backend         Flask + Python             REST API server
Frontend        HTML/CSS/JS + Tailwind     Metallic web interface
Config          Environment variables      Flexible configuration

πŸ“ Project Structure

blog-recommendations/
├── 🌐 web_app.py              # Flask web server with REST API
├── ⚙️ config.py               # Configuration management
├── 🔧 run_demo.sh             # One-click demo launcher
├── 📊 genvec.py               # Embedding generation
├── 🎯 cluster.sql             # KMeans clustering query
├── 📝 summarize.py            # AI cluster summaries
├── 🗄️ load_data.sh            # Data pipeline setup
├── 🔧 create_kmeans_function.sql  # PL/Python KMeans UDF
├── 📋 generate_schema.py      # Dynamic schema generation
├── 🌐 templates/
│   └── index.html             # Metallic UI template
├── 📱 static/
│   └── js/app.js              # Frontend JavaScript
├── 📜 scripts/                # Utility scripts
└── 📸 screenshots/            # Demo screenshots

⚡ Advanced Usage

Re-clustering Data

# Via the web UI: adjust the slider and click "Re-cluster Data"
# Or via the API:
curl -X POST http://localhost:8081/api/recluster \
  -H "Content-Type: application/json" \
  -d '{"num_clusters": 20}'
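
On the server side, the matching Flask route in web_app.py presumably looks something like the sketch below; the route body and the UDF name run_kmeans_clustering() are assumptions:

# Sketch of a Flask route for /api/recluster; the actual implementation in
# web_app.py and the UDF name run_kmeans_clustering() are assumptions.
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/recluster", methods=["POST"])
def recluster():
    num_clusters = int(request.get_json().get("num_clusters", 10))
    conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
    with conn.cursor() as cur:
        # Re-run the PL/Python KMeans UDF with the requested cluster count.
        cur.execute("SELECT run_kmeans_clustering(%s)", (num_clusters,))
    conn.commit()
    return jsonify({"status": "ok", "num_clusters": num_clusters})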

Custom Embedding Models

# In .env file:
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSIONS=3072
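
The pgvector column must be created with the same dimensionality; a rough sketch of how generate_schema.py could emit a dimension-matched table (the DDL and names here are illustrative assumptions):

# Sketch: build DDL whose vector() size follows EMBEDDING_DIMENSIONS.
# Table and column names are illustrative; generate_schema.py may differ.
import os

dims = int(os.getenv("EMBEDDING_DIMENSIONS", "768"))
ddl = f"""
CREATE TABLE IF NOT EXISTS blog_posts (
    id         BIGINT,
    title      TEXT,
    embedding  vector({dims}),
    cluster_id INT
) DISTRIBUTED BY (id);
"""
print(ddl)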

Batch Processing

# Complete pipeline:
bash load_data.sh              # Load CSV data
python genvec.py               # Generate embeddings
psql -f create_kmeans_function.sql demo  # Create KMeans function
psql -f cluster.sql demo       # Run clustering
python summarize.py            # Generate summaries
python web_app.py              # Start web interface
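
The clustering step (create_kmeans_function.sql plus cluster.sql) runs KMeans inside Greenplum via PL/Python; stripped of the UDF plumbing, the core logic amounts to a scikit-learn call along these lines (the real function's signature may differ):

# Plain-Python sketch of the KMeans logic the PL/Python UDF wraps; the real
# function defined in create_kmeans_function.sql runs inside Greenplum.
import numpy as np
from sklearn.cluster import KMeans

def assign_clusters(embeddings, num_clusters):
    """Return one cluster label per embedding vector."""
    X = np.asarray(embeddings, dtype=np.float32)
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    return model.fit_predict(X).tolist()

# Example: 100 fake 768-D embeddings grouped into 10 clusters
labels = assign_clusters(np.random.rand(100, 768), 10)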

🔧 Configuration Options

Database Settings

  • GP_HOST, GP_PORT, GP_USER - Database connection parameters
  • DB_NAME - Target database name

AI Model Settings

  • USE_LOCAL_MODELS - Toggle between local and cloud AI models
  • EMBEDDING_MODEL - Model for generating embeddings
  • EMBEDDING_DIMENSIONS - Vector dimensions (768, 1536, 3072, etc.)
  • CHAT_MODEL - Model for generating cluster summaries

Web Interface

  • WEB_PORT - Flask server port (default: 8081)
  • CLUSTER_SAMPLE_SIZE - Articles per cluster for summaries

🎨 UI Customization

The metallic theme uses CSS custom properties for easy customization:

/* Metallic color scheme */
.metallic-bg { /* Silver gradients */ }
.metallic-blue { /* Blue metallic accents */ }
.metallic-gold { /* Gold highlights */ }
.glass-panel { /* Glassmorphism effects */ }

📈 Performance & Scaling

  • Vector Search: pgvector provides fast similarity search at scale
  • Clustering: PL/Python enables in-database ML processing
  • Caching: Configuration display caching for faster startups
  • Responsive: UI adapts from mobile to large displays
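
If the installed pgvector build supports approximate indexes, similarity search can be accelerated with an IVFFlat index; this is an assumption about the deployment rather than something the demo requires:

# Optional sketch: IVFFlat index for approximate nearest-neighbour search,
# assuming the pgvector build in this Greenplum installation supports it.
import psycopg2

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS blog_posts_embedding_idx "
        "ON blog_posts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
    )
conn.commit()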

💡 Example Use Cases

  • Media & Publishing: Personalized article feeds and content discovery
  • E-commerce: Product recommendations based on user behavior
  • Education: Adaptive learning content recommendations
  • Research: Academic paper recommendation systems
  • Corporate: Knowledge base content suggestion

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Test with your Greenplum setup
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Submit a pull request

πŸ“ License

This project demonstrates advanced Greenplum + AI capabilities for educational and commercial use.

Built with ❤️ using Greenplum, pgvector, and modern web technologies

Showcasing the power of in-database AI and vector search for next-generation recommendation systems
