Blog Recommendation Engine with Greenplum + AI


A content recommendation system powered by VMware Tanzu Greenplum, pgvector, and AI-generated embeddings. This project demonstrates how to build a complete AI-driven recommendation pipeline with a modern web interface featuring real-time re-clustering and a metallic design.

🌟 Features

🎨 Modern Metallic Web Interface

  • Glass morphism design with metallic gradients and effects
  • Interactive cluster rating system with custom metallic sliders
  • Real-time re-clustering with adjustable parameters (5-25 clusters)
  • Responsive design that works beautifully on all devices
  • Dark theme with gold accents and shimmer animations

🧠 AI-Powered Recommendation Engine

  • Embedding generation using configurable AI models (OpenAI or local)
  • KMeans clustering with PL/Python and scikit-learn
  • Vector similarity search using pgvector for recommendations (see the query sketch after this list)
  • AI-generated cluster summaries for better user understanding
  • Personalized preference modeling based on cluster ratings
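
A minimal sketch of that pgvector similarity query, assuming psycopg2; the table and column names (blog_posts, embedding, title) are illustrative assumptions, not the project's actual schema:

# Minimal sketch: nearest-neighbour lookup with pgvector's cosine-distance operator (<=>).
import psycopg2

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT title
        FROM blog_posts
        ORDER BY embedding <=> %s::vector   -- smallest cosine distance first
        LIMIT 25
        """,
        ("[0.01, -0.02, 0.03]",),  # shortened; a real query passes the full 768-D vector
    )
    print(cur.fetchall())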

βš™οΈ Advanced Capabilities

  • Dynamic re-clustering - Change cluster count and regenerate groupings
  • Configurable embedding dimensions (supports 768D, 1536D, etc.)
  • Local or cloud AI models via OpenAI-compatible APIs
  • Export functionality for recommendations and analytics
  • Comprehensive pre-flight checks for data integrity

🚀 Quick Start

Prerequisites

  • Greenplum Database 7.x with pgvector extension
  • Python 3.8+ with virtual environment support
  • AI Model Access (OpenAI API or local model server)

Installation

  1. Clone and set up:

    git clone <repository-url>
    cd blog-recommendations
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Download the dataset:

    # Easy one-click download
    ./download_data.sh

    Alternative manual download:

    # Manual download and extract
    curl -o medium_data.csv.zip "https://storage.googleapis.com/kaggle-data-sets/748442/1294572/compressed/medium_data.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20250915%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20250915T195945Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2641e2acbb04a08f983895bca92680cf78ecc5a4cb273a5b78d249e3b8a780e7da9481fd482a56947908916fdaa3f66dca87644ebc4f3d20a93d16ba18ecd865a60db000af100e0c328a7f028dc30700c23a9fccc34027d187402393331624ca3e45fc5b9d64cdf9dbe84f7774e0bdfcbbff28f1dd12ee662ff112d0fc082776d8b5b57cf5f69892887b03a71a742565f8d79c7149cedb9530e96a4c9bab8ec051d9505105b7c56da4ba145bcea0f9df48e80dae9f6c0e241fe741a119f9c01206974f8c36a5c123be324428df89be0f0024ea7d52e374409e15529a75ae4e70da7ee756a18b3eb8da6b18a9abb3b5515c8afffa81c753301a711947a4005107"
    unzip medium_data.csv.zip
    mv medium_data.csv medium_post_titles.csv
    rm medium_data.csv.zip
  3. Configure environment (edit .env):

    # Database Configuration
    GP_HOST=localhost
    GP_PORT=15432
    GP_USER=gpadmin
    DB_NAME=demo
    
    # AI Model Configuration
    USE_LOCAL_MODELS=true
    LOCAL_API_BASE=http://127.0.0.1:1234/v1
    EMBEDDING_MODEL=text-embedding-nomic-embed-text-v2
    EMBEDDING_DIMENSIONS=768
    
    # Web Interface
    WEB_PORT=8081
  4. Run the complete demo:

    ./run_demo.sh
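
For orientation, here is a minimal sketch of how the .env values from step 3 might be consumed in Python; config.py handles this in the repo, and the use of python-dotenv plus these defaults is an assumption:

# Minimal sketch, assuming python-dotenv; the repo's config.py may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

GP_HOST = os.getenv("GP_HOST", "localhost")
GP_PORT = int(os.getenv("GP_PORT", "15432"))
GP_USER = os.getenv("GP_USER", "gpadmin")
DB_NAME = os.getenv("DB_NAME", "demo")

USE_LOCAL_MODELS = os.getenv("USE_LOCAL_MODELS", "true").lower() == "true"
LOCAL_API_BASE = os.getenv("LOCAL_API_BASE", "http://127.0.0.1:1234/v1")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-nomic-embed-text-v2")
EMBEDDING_DIMENSIONS = int(os.getenv("EMBEDDING_DIMENSIONS", "768"))
WEB_PORT = int(os.getenv("WEB_PORT", "8081"))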

📊 Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Raw CSV Data   │ -> │  Greenplum + AI  │ -> │  Web Interface  │
│  (Blog Posts)   │    │   Processing     │    │  (Metallic UI)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                      ┌──────────────────┐
                      │  Recommendation  │
                      │      Engine      │
                      └──────────────────┘

Data Flow

  1. Load → Blog posts imported into Greenplum
  2. Embed → AI generates vector embeddings for each post
  3. Cluster → KMeans groups similar content (configurable clusters)
  4. Summarize → AI creates thematic summaries for each cluster
  5. Recommend → Vector similarity finds personalized content
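
As a rough illustration of step 2 (Embed), assuming the OpenAI Python client pointed at an OpenAI-compatible endpoint; genvec.py is the actual implementation, and the table and column names here are illustrative:

# Sketch of the Embed step: fetch posts without embeddings, call an
# OpenAI-compatible endpoint, and store the vectors via pgvector.
import psycopg2
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="not-needed-for-local")

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute("SELECT id, title FROM blog_posts WHERE embedding IS NULL")
    for post_id, title in cur.fetchall():
        resp = client.embeddings.create(model="text-embedding-nomic-embed-text-v2", input=title)
        vec = resp.data[0].embedding  # list of 768 floats
        cur.execute("UPDATE blog_posts SET embedding = %s::vector WHERE id = %s", (str(vec), post_id))
conn.commit()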

🎯 Demo Workflow


Step 1: Cluster Analysis

  • View AI-generated cluster summaries with sample articles
  • Adjust cluster count (5-25) using the metallic slider
  • Click "Re-cluster Data" to dynamically reorganize content

Step 2: Preference Rating

  • Rate each cluster from 1-10 using metallic sliders
  • Build your personalized preference vector in real-time
  • Visual feedback with gold accent colors

Step 3: AI Recommendations

  • Get the top 25 articles most relevant to your preferences
  • See the 25 least relevant articles (for comparison)
  • Export results with metadata and system information
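
One plausible way to turn the cluster ratings into a preference vector is a rating-weighted average of cluster centroids; this is an assumption about the implementation, sketched below, and the resulting vector feeds a pgvector query like the one shown earlier:

# Sketch: rating-weighted average of cluster centroids as the preference vector.
# The approach and all names here are assumptions, not the repo's actual code.
import numpy as np

ratings = {0: 9, 1: 2, 2: 7}                            # cluster_id -> 1-10 slider rating
centroids = {c: np.random.rand(768) for c in ratings}   # stand-in for centroids loaded from the DB

weights = np.array([ratings[c] for c in sorted(ratings)], dtype=float)
vectors = np.stack([centroids[c] for c in sorted(ratings)])
preference = (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# `preference` (768-D) is then used in an `embedding <=> %s::vector` query:
# ascending order for the top 25, descending for the 25 least relevant articles.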

🛠 Technical Stack

Component       Technology                 Purpose
Database        Greenplum 7.x              Scalable data warehouse
Vector Search   pgvector                   Similarity search & storage
Clustering      PL/Python + scikit-learn   KMeans clustering
AI Models       OpenAI API / Local LLMs    Embeddings & summaries
Backend         Flask + Python             REST API server
Frontend        HTML/CSS/JS + Tailwind     Metallic web interface
Config          Environment variables      Flexible configuration

πŸ“ Project Structure

blog-recommendations/
├── 🌐 web_app.py              # Flask web server with REST API
├── ⚙️ config.py               # Configuration management
├── 🔧 run_demo.sh             # One-click demo launcher
├── 📊 genvec.py               # Embedding generation
├── 🎯 cluster.sql             # KMeans clustering query
├── 📝 summarize.py            # AI cluster summaries
├── 🗄️ load_data.sh            # Data pipeline setup
├── 🔧 create_kmeans_function.sql  # PL/Python KMeans UDF
├── 📋 generate_schema.py      # Dynamic schema generation
├── 🌐 templates/
│   └── index.html             # Metallic UI template
├── 📱 static/
│   └── js/app.js              # Frontend JavaScript
├── 📜 scripts/                # Utility scripts
└── 📸 screenshots/            # Demo screenshots

⚡ Advanced Usage

Re-clustering Data

# Via the web UI: adjust the slider and click "Re-cluster Data"
# Or via the API:
curl -X POST http://localhost:8081/api/recluster \
  -H "Content-Type: application/json" \
  -d '{"num_clusters": 20}'
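
On the server side, the matching Flask route in web_app.py presumably looks something like the sketch below; the route body and the UDF name run_kmeans_clustering() are assumptions:

# Sketch of a Flask route for /api/recluster; the actual implementation in
# web_app.py and the UDF name run_kmeans_clustering() are assumptions.
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/recluster", methods=["POST"])
def recluster():
    num_clusters = int(request.get_json().get("num_clusters", 10))
    conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
    with conn.cursor() as cur:
        # Re-run the PL/Python KMeans UDF with the requested cluster count.
        cur.execute("SELECT run_kmeans_clustering(%s)", (num_clusters,))
    conn.commit()
    return jsonify({"status": "ok", "num_clusters": num_clusters})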

Custom Embedding Models

# In .env file:
EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSIONS=3072
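
The pgvector column must be created with the same dimensionality; a rough sketch of how generate_schema.py could emit a dimension-matched table (the DDL and names here are illustrative assumptions):

# Sketch: build DDL whose vector() size follows EMBEDDING_DIMENSIONS.
# Table and column names are illustrative; generate_schema.py may differ.
import os

dims = int(os.getenv("EMBEDDING_DIMENSIONS", "768"))
ddl = f"""
CREATE TABLE IF NOT EXISTS blog_posts (
    id         BIGINT,
    title      TEXT,
    embedding  vector({dims}),
    cluster_id INT
) DISTRIBUTED BY (id);
"""
print(ddl)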

Batch Processing

# Complete pipeline:
bash load_data.sh              # Load CSV data
python genvec.py               # Generate embeddings
psql -f create_kmeans_function.sql demo  # Create KMeans function
psql -f cluster.sql demo       # Run clustering
python summarize.py            # Generate summaries
python web_app.py              # Start web interface
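
The clustering step (create_kmeans_function.sql plus cluster.sql) runs KMeans inside Greenplum via PL/Python; stripped of the UDF plumbing, the core logic amounts to a scikit-learn call along these lines (the real function's signature may differ):

# Plain-Python sketch of the KMeans logic the PL/Python UDF wraps; the real
# function defined in create_kmeans_function.sql runs inside Greenplum.
import numpy as np
from sklearn.cluster import KMeans

def assign_clusters(embeddings, num_clusters):
    """Return one cluster label per embedding vector."""
    X = np.asarray(embeddings, dtype=np.float32)
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
    return model.fit_predict(X).tolist()

# Example: 100 fake 768-D embeddings grouped into 10 clusters
labels = assign_clusters(np.random.rand(100, 768), 10)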

🔧 Configuration Options

Database Settings

  • GP_HOST, GP_PORT, GP_USER - Database connection parameters
  • DB_NAME - Target database name

AI Model Settings

  • USE_LOCAL_MODELS - Toggle between local and cloud AI models
  • EMBEDDING_MODEL - Model for generating embeddings
  • EMBEDDING_DIMENSIONS - Vector dimensions (768, 1536, 3072, etc.)
  • CHAT_MODEL - Model for generating cluster summaries

Web Interface

  • WEB_PORT - Flask server port (default: 8081)
  • CLUSTER_SAMPLE_SIZE - Articles per cluster for summaries

🎨 UI Customization

The metallic theme uses CSS custom properties for easy customization:

/* Metallic color scheme */
.metallic-bg { /* Silver gradients */ }
.metallic-blue { /* Blue metallic accents */ }
.metallic-gold { /* Gold highlights */ }
.glass-panel { /* Glassmorphism effects */ }

📈 Performance & Scaling

  • Vector Search: pgvector provides fast similarity search at scale
  • Clustering: PL/Python enables in-database ML processing
  • Caching: Configuration display caching for faster startups
  • Responsive: UI adapts from mobile to large displays
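
If the installed pgvector build supports approximate indexes, similarity search can be accelerated with an IVFFlat index; this is an assumption about the deployment rather than something the demo requires:

# Optional sketch: IVFFlat index for approximate nearest-neighbour search,
# assuming the pgvector build in this Greenplum installation supports it.
import psycopg2

conn = psycopg2.connect(host="localhost", port=15432, user="gpadmin", dbname="demo")
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS blog_posts_embedding_idx "
        "ON blog_posts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
    )
conn.commit()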

💡 Example Use Cases

  • Media & Publishing: Personalized article feeds and content discovery
  • E-commerce: Product recommendations based on user behavior
  • Education: Adaptive learning content recommendations
  • Research: Academic paper recommendation systems
  • Corporate: Knowledge base content suggestion

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Test with your Greenplum setup
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Submit a pull request

πŸ“ License

This project demonstrates advanced Greenplum + AI capabilities for educational and commercial use.

Built with ❤️ using Greenplum, pgvector, and modern web technologies

Showcasing the power of in-database AI and vector search for next-generation recommendation systems
