Ollama, and the Economics of Hosting a Shared Open-Source Coding Model on a GPU Instance

PART A) OLLAMA EXPERIMENTS

Using my personal RTX 4090, I conducted some experiments with Ollama and, lately, Crush (a fork of OpenCode).

Ollama (https://ollama.ai) lets you run open-source LLMs locally: it bundles model weights, configuration, and data into a single package defined by a Modelfile, and it handles setup and configuration details, including GPU usage. It is the easiest way to run LLMs locally: https://youtube.com/watch?v=MGr1V4LyGFA
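
To give a feel for how little code this takes, here is a minimal sketch using the official ollama Python client (the model tag is just an example; any pulled model works):

```python
# pip install ollama  (the official Ollama Python client)
import ollama

# Assumes the Ollama server is running locally (default: http://localhost:11434)
# and the model has been pulled, e.g. `ollama pull llama3.1`.
response = ollama.chat(
    model="llama3.1",  # example model tag
    messages=[{"role": "user", "content": "Explain GPU VRAM in one sentence."}],
)
print(response["message"]["content"])
```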

OpenWebUI (https://docs.openwebui.com) is a front end for Ollama. Ollama has also released its own GUI: https://ollama.com/blog/new-app

I first tried running Llama 3 locally last year (https://analyticsvidhya.com/blog/2024/04/how-to-run-llama-3-locally/) using Hugging Face / Ollama. Here is a quickstart on using LangChain with Ollama in Python: https://github.com/ollama/ollama/blob/main/docs/tutorials/langchainpy.md - it's simple: just `ollama run llama3.1`
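
A minimal sketch of that LangChain quickstart (assuming the langchain-ollama integration package and a pulled llama3.1 model):

```python
# pip install langchain-ollama
from langchain_ollama import ChatOllama

# Talks to the local Ollama server; temperature=0 for deterministic output.
llm = ChatOllama(model="llama3.1", temperature=0)
answer = llm.invoke("Write a one-line docstring for a binary search function.")
print(answer.content)
```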

I created an AI prototyping environment using the JupyterLab IDE with TypeScript, LangChain.js, and Ollama: https://itnext.io/create-an-ai-prototyping-environment-using-jupyter-lab-ide-with-typescript-langchain-js-7b395dae3f09

Note: if you are going to swim upstream and not use Python for this, then IMHO use Rust - see Rust Ollama By Example - 4 Chapters (Full Tutorial - Rust AI): https://youtu.be/OcH-zT5VNgM?si=9Qq4UiGarw3bm0Fy. Try ollama-rs (https://github.com/pepperoni21/ollama-rs), a simple Rust lib for interacting with the Ollama API.

For dev with LazyVim, there is the ollama.nvim plugin (https://github.com/nomnivore/ollama.nvim) and the parrot.nvim plugin that brings stochastic parrots to Neovim (https://github.com/frankroeder/parrot.nvim?tab=readme-ov-file#faq), including Ollama support.

So last year I tried a range of local RAG and agent setups:

  • Graph RAG with Ollama - Save $$$ with Local LLMs: https://youtu.be/_XOCAVsr3KU?si=WE4eWFo4BIYlDDUx
  • Local LightRAG: a GraphRAG Alternative but Fully Local with Ollama: https://youtube.com/watch?v=g21royNJ4fw
  • Local LangGraph Agents with Llama 3.1 + Ollama: https://youtube.com/watch?v=5a-NuqTaC20
  • Ollama from LangChain: Orchestrate LLM prompt chains locally: https://medium.com/@abonia/ollama-and-langchain-run-llms-locally-900931914a46

Structured outputs work well with Ollama (https://ollama.com/blog/structured-outputs; see also Shaping Ollama's JSON chat response format: https://elegantcode.com/2024/12/13/6998/) - a sketch follows below - as do optimizations such as KV context quantisation: https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/. You can even do local synthetic data generation: https://analyticsvidhya.com/blog/2025/01/local-synthetic-data-generation/
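
Here is a minimal sketch of structured outputs, following the pattern in the Ollama blog post linked above (the Pydantic schema is just an example):

```python
# pip install ollama pydantic
from ollama import chat
from pydantic import BaseModel

# Example schema: constrain the model's reply to JSON matching this shape.
class CodeReview(BaseModel):
    summary: str
    issues: list[str]
    severity: int

response = chat(
    model="llama3.1",  # example model tag
    messages=[{"role": "user", "content": "Review: def div(a, b): return a / b"}],
    format=CodeReview.model_json_schema(),  # JSON-schema-constrained decoding
)
review = CodeReview.model_validate_json(response.message.content)
print(review.summary, review.issues)
```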

When DeepSeek came out this year, I had another play (https://ollama.com/library/deepseek-r1:14b) - see Running Ollama, DeepSeek and Open WebUI Locally: A Step-by-Step Guide: https://itnext.io/running-ollama-deepseek-and-open-webui-locally-a-step-by-step-guide-cf43d27e1d14?source=rss----5b301f10ddcd---4

this time with VSCode - Using DeepSeek in your IDE for free:

  1. Install DeepSeek locally with Ollama https://pub.towardsai.net/stop-paying-for-ai-how-to-run-deepseek-locally-for-free-48c0d3af6346
  2. Configure the VSCode Copilot proxy to point to local Ollama: https://github.com/bernardo-bruning/ollama-copilot :)

I also tried it from the Google Colab terminal, using LangChain with Ollama: https://analyticsvidhya.com/blog/2025/02/run-terminal-in-google-colab/

Ollama has a new engine for multimodal models (https://ollama.com/blog/multimodal-models) and supports running multiple model instances at the same time: https://youtu.be/8r_8CZqt5yk?si=VCFCubDSFj-DtUMz (I need to check whether this is concurrent or parallel, and whether it supports MIG). You can use it for fine-tuning an LLM for parsing and serving: https://pub.towardsai.net/fine-tuning-llm-for-parsing-and-serving-through-ollama-e224a8a5636a?source=rss----98111c9905da---4
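
A quick way to probe the concurrency question yourself - a sketch, assuming the server was started with the documented OLLAMA_NUM_PARALLEL / OLLAMA_MAX_LOADED_MODELS env vars; the timing comparison is the hypothesis being tested, not a guarantee:

```python
# Fire N simultaneous requests at a local Ollama server and time them.
# If the server allows e.g. OLLAMA_NUM_PARALLEL=4, wall-clock time should be
# well under N x the single-request latency; if not, requests are serialized.
import time
from concurrent.futures import ThreadPoolExecutor

import ollama

def ask(prompt: str) -> str:
    return ollama.generate(model="llama3.1", prompt=prompt)["response"]

prompts = [f"Count to {n} briefly." for n in range(3, 7)]
start = time.time()
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(ask, prompts))
print(f"{len(results)} responses in {time.time() - start:.1f}s")
```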

For Crush AI Coding Agent Setup with Ollama (https://reddit.com/r/ollama/comments/1mjtgbi/crush_ai_coding_agent_setup_with_ollama_for/), you need to configure local LM Studio/Ollama custom provider support (https://github.com/charmbracelet/crush/issues/447) in ~/.local/share/crush/crush.json
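
A sketch of what that crush.json can look like, based on the custom-provider (OpenAI-compatible) examples in the Crush docs - the model id, context window, and token limits here are placeholders; verify the field names against your Crush version:

```json
{
  "$schema": "https://charm.land/crush.json",
  "providers": {
    "ollama": {
      "name": "Ollama",
      "type": "openai",
      "base_url": "http://localhost:11434/v1/",
      "models": [
        {
          "name": "Qwen 2.5 Coder 14B",
          "id": "qwen2.5-coder:14b",
          "context_window": 32768,
          "default_max_tokens": 4096
        }
      ]
    }
  }
}
```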

IMPORTANT: Running a model requires enough HBM/VRAM capacity - use this online GPU calculator (https://aleibovici.github.io/ollama-gpu-calculator/) to see what can fit on your rig. You may need to play around: convert a model to GGUF (https://docs.nvidia.com/brev/latest/ollama-brev.html), fine-tune on NVIDIA Brev, quantize with llama.cpp, then push to the Ollama hub.
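
The back-of-envelope arithmetic behind such calculators is simple - a rough sketch, where the 20% overhead factor for KV cache, activations, and buffers is my own assumption, not a spec:

```python
# Rough VRAM estimate (GB) for running a quantized model.
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * overhead  # assumed headroom for KV cache and activations

# e.g. a 14B model at Q4 (~4.5 effective bits/weight):
print(f"{vram_gb(14, 4.5):.1f} GB")  # ~9.4 GB -> fits on a 24 GB RTX 4090
```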

PART B) THE HIGH COST OF PAY-PER-USE CODING MODELS - IS THERE AN ALTERNATIVE?

Like many dev divisions, we tried the build approach to accelerating the SDLC, then switched to buy. We followed the wisdom of the crowd and encouraged devs to hit the ground running, first with Copilot, then with Cursor, and lately Claude Code. However, this keeps getting more expensive: we hire more devs, devs use the tools more, and the providers hike the prices - $$$$!

I started looking into open-source coding model alternatives. Here are 10 LLM coding benchmarks: https://evidentlyai.com/blog/llm-coding-benchmarks ; the best Ollama models of 2025: https://share.google/EAFJVMDFMju5U4m6J ; and the Best Ollama Models for Developers: https://collabnix.com/best-ollama-models-for-developers-complete-2025-guide-with-code-examples/

For moving from local to production, I found a good Ollama-to-vLLM migration guide: https://pub.towardsai.net/from-local-to-production-the-ultimate-ollama-to-vllm-migration-guide-571faa8cbfde?source=rss----98111c9905da---4
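
One practical point along that migration path: both Ollama and vLLM expose OpenAI-compatible endpoints, so the client code can stay the same and only the base URL and model name change. A sketch (the ports shown are the usual defaults and the model tag is an example - assumptions, not guarantees):

```python
# pip install openai
from openai import OpenAI

# Local Ollama:  base_url="http://localhost:11434/v1", api_key="ollama"
# vLLM server:   base_url="http://localhost:8000/v1",  api_key="EMPTY"
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5-coder:14b",  # example model tag
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension."}],
)
print(resp.choices[0].message.content)
```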

Besides the slight dip in quality, the important question is: AT WHAT POINT IS THIS ECONOMICALLY VIABLE?

I asked Grok this question, and here is the answer: To determine when it's more economical to host a shared open-source coding model on a cloud virtual machine (VM) than to use Claude or Cursor for x developers at $y per month per seat, we need to compare the costs of both approaches, considering usage patterns, infrastructure costs, and the specifics of open-source models. Since the query uses placeholders (x and y), I'll provide a general framework using realistic values based on available data for Claude and Cursor pricing, as well as typical costs for hosting open-source models, and include a specific example to illustrate the break-even point.

Step 1: Understand Claude and Cursor Pricing

Based on available information:

Claude Code Pricing:

  • Pro Plan: $17/month per user (with annual commitment) or $20/month month-to-month. Includes ~6,480–6,696 messages/month (45 messages every 5 hours).
  • Max Plan: $100/month or $200/month per user, offering 5x or 20x the Pro plan’s usage limits, respectively.
  • API Pricing: Token-based, e.g., Claude 4 Sonnet: $3/million input tokens, $15/million output tokens; Claude 4 Opus: $15/million input, $75/million output. Optimizations like prompt caching (up to 90% savings on input tokens) and batch processing (50% savings on output tokens) can reduce costs.
  • Average Usage: For moderate usage (e.g., 765,000 input and 765,000 output tokens/month), API costs are ~$13.78/month for Sonnet 4 (unoptimized) or ~$5.97/month with optimizations; ~$68.86/month for Opus 4 or ~$29.84/month with optimizations.

Cursor Pricing:

  • Pro Plan: $20/month per user (500 “fast” requests, unlimited “slow” requests) or $40/month for 1,000 fast requests to Claude 3.5 Sonnet.
  • Business Plan: $40/user/month, includes centralized billing and advanced features.
  • API Usage: Using Claude models through Cursor incurs a 20% markup (e.g., Sonnet 4: $3.60/million input, $18/million output). Using a direct Anthropic API key avoids the markup, aligning with Anthropic’s rates.
  • Average Usage: For the same 765,000 input/output tokens, Cursor costs ~$16.52/month for Sonnet 4 (with markup) or $13.78/month with a direct API key; ~$82.62/month for Opus 4 or $68.86/month with a direct API key.

For x developers at $y per month per seat, the total cost for Claude or Cursor is:

  • Total Cost = x × y per month.
  • If y is the Pro plan cost, then y ≈ $17–$20 for Claude Code or $20–$40 for Cursor. For API usage, y depends on token consumption, which varies by workload.

Step 2: Cost of Hosting a Shared Open-Source Coding Model on a Cloud VM

Hosting an open-source coding model (e.g., Qwen3 Coder, DeepSeek v3) on a cloud VM involves compute, setup, and maintenance costs:

Model Selection: Open-source models like DeepSeek v3 or Qwen3 Coder match or exceed Claude’s performance for coding tasks.

Compute Requirements: For DeepSeek v3, 8 × H200 GPUs at $3.14/hour each = $25.12/hour total.

  • Monthly Cost: $25.12/hour × 24 hours × 30 days = ~$18,086.40/month (assuming 24/7 operation).
  • Per-Token Costs: Input: $0.88/million tokens; Output: $7.03/million tokens (vs. Claude’s $3/$15 for Sonnet 4, $15/$75 for Opus 4). This is ~3.4x cheaper for input and ~2.1x cheaper for output.

Optimizations:

  • Partial Usage: Running the VM only during working hours (e.g., 8 hours/day, 20 days/month) reduces costs to $25.12 × 8 × 20 = ~$4,019.20/month.
  • Smaller Models: Lightweight models like Qwen3 30B or Llama 4 Scout may require fewer GPUs (e.g., 4 × H200 GPUs = $12.56/hour), lowering costs to ~$9,043.20/month (24/7) or ~$2,009.60/month (8 hours/day, 20 days).
  • Spot Instances: Using spot instances or reserved instances can reduce costs by 20–50%.

Additional Costs:

  • Setup: Initial configuration (e.g., Northflank’s one-click deployment) may cost $500–$2,000 one-time.
  • Maintenance: ~$1,000–$2,000/month for DevOps support, depending on complexity.
  • Storage/Networking: minimal, ~$100–$500/month for most setups.

Total Estimated Cost:

  • High-End (DeepSeek v3, 24/7): ~$18,086.40 + $1,000–$2,000 maintenance + $100–$500 storage = ~$19,186.40–$20,586.40/month.
  • Optimized (DeepSeek v3, 8 hours/day, 20 days): ~$4,019.20 + $1,000–$2,000 + $100–$500 = ~$5,119.20–$6,519.20/month.
  • Lightweight Model (Qwen3 30B, 8 hours/day, 20 days): ~$2,009.60 + $1,000–$2,000 + $100–$500 = ~$3,109.60–$4,509.60/month.

Step 3: Break-Even Analysis

The break-even point occurs when the cost of Claude/Cursor for x developers equals the cost of hosting the open-source model. Let’s assume:

  • Claude/Cursor Cost: y = $20/month per seat (Pro plan, common for both tools).
  • Open-Source Hosting Cost: $5,119.20–$6,519.20/month (optimized DeepSeek v3) or $3,109.60–$4,509.60/month (lightweight model).

Break-Even Formula (a small Python calculator for these figures follows the examples below):

  • x × y = Hosting Cost. For y = $20:
  • DeepSeek v3: x × $20 = $5,119.20–$6,519.20 → x = 256–326 developers.
  • Lightweight Model: x × $20 = $3,109.60–$4,509.60 → x = 156–226 developers.

Example with API Usage:

  • Assume y = $13.78/month per developer (Claude Sonnet 4 API, unoptimized, 765,000 input/output tokens).
  • DeepSeek v3: x × $13.78 = $5,119.20–$6,519.20 → x = 372–473 developers.
  • Lightweight Model: x × $13.78 = $3,109.60–$4,509.60 → x = 226–327 developers.
  • With optimizations (Claude API at $5.97/month):
  • DeepSeek v3: x × $5.97 = $5,119.20–$6,519.20 → x = 858–1,092 developers.
  • Lightweight Model: x × $5.97 = $3,109.60–$4,509.60 → x = 521–755 developers.
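
To make the arithmetic reproducible, here is a small sketch that computes these break-even headcounts from the Step 2 cost estimates (rounding up to whole developers may differ by one from the in-text figures, which round to the nearest whole number in places):

```python
import math

# Monthly hosting-cost ranges from Step 2 (USD).
HOSTING = {
    "DeepSeek v3 (8h/day, 20 days)": (5119.20, 6519.20),
    "Lightweight model (8h/day, 20 days)": (3109.60, 4509.60),
}

def break_even(per_seat: float, low: float, high: float) -> tuple[int, int]:
    # Smallest team size x where x * per_seat covers the hosting cost.
    return math.ceil(low / per_seat), math.ceil(high / per_seat)

for seat in (20.00, 13.78, 5.97):  # Pro plan, unoptimized API, optimized API
    for name, (low, high) in HOSTING.items():
        lo, hi = break_even(seat, low, high)
        print(f"${seat:.2f}/seat, {name}: {lo}-{hi} developers")
```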

Step 4: Other Considerations

  • Rate Limits: Claude and Cursor have rate limits (e.g., weekly caps for Claude, premium request quotas for Cursor), which can throttle productivity during sprints or debugging. Self-hosted models eliminate rate limits, providing consistent performance.
  • Performance: Open-source models like DeepSeek v3 match or exceed Claude’s coding performance, with faster task completion (e.g., 15–30 seconds for large file refactoring vs. 45–90 seconds for Claude).
  • Security: Self-hosted models keep code on your infrastructure, enhancing data privacy compared to cloud-based APIs.
  • Scalability: Hosting costs scale linearly with compute, while Claude/Cursor costs scale with users, making self-hosting more economical for large teams.
  • Maintenance Overhead: Self-hosting requires DevOps expertise, which may not be feasible for small teams.

Step 5: General Answer

It becomes more economical to host a shared open-source coding model on a cloud VM when the total cost of Claude or Cursor for x developers exceeds the hosting cost (typically $3,100–$6,500/month for an optimized setup). For example:

  • At $20/month per seat (Pro plan), self-hosting is cheaper with ~156–226 developers (lightweight model) or ~256–326 developers (high-end model).
  • For API usage at $5.97–$13.78/month per developer, the break-even point is ~226–1,092 developers, depending on optimizations and model choice.
  • Smaller teams (e.g., <100 developers) are generally better off with Claude or Cursor Pro plans due to lower upfront costs and no maintenance overhead.
  • Larger teams or those with high usage, frequent rate limit issues, or strict security requirements benefit from self-hosting sooner.

Example:

For a team of 200 developers (x = 200) at $20/month per seat (y = $20):

  • Claude/Cursor Cost: 200 × $20 = $4,000/month.
  • Self-Hosting Cost: ~$3,109.60–$4,509.60/month (lightweight model) or ~$5,119.20–$6,519.20/month (DeepSeek v3).
  • Conclusion: Self-hosting a lightweight model is already economical at the low end of its range (~$3,109.60 vs. $4,000), saving up to $890.40/month, though not at the high end (~$4,509.60). For DeepSeek v3, it's not yet cost-effective unless usage increases or additional savings (e.g., spot instances) are applied.

BOTTOM LINE: Self-hosting an open-source coding model on a cloud VM becomes more economical than Claude or Cursor when the number of developers (x) multiplied by the per-seat cost (y) exceeds ~$3,100–$6,500/month, typically at 156–326 developers for Pro plans ($20/month) or 226–1,092 developers for API usage ($5.97–$13.78/month). The exact break-even point depends on the model, usage patterns, and optimizations. For smaller teams, Claude or Cursor is more cost-effective; for larger teams or high-usage scenarios, self-hosting saves 60–80% on costs and eliminates rate limits.

