Ollama, and Economics of hosting a shared open-source coding model on a GPU instance
PART A) OLLAMA EXPERIMENTS
Using my personal RTX-4090, I conducted some experiments with Ollama and lately Crush (fork of OpenModel).
ollama.ai allows you to run open-source LLMs locally - bundles model weights, configuration, and data into a single package, defined by a Modelfile, optimizes setup and configuration details, including GPU usage. It is the Easiest Way to RUN LLMs Locally: https://youtube.com/watch?v=MGr1V4LyGFA
OpenWebUI is a https://docs.openwebui.com front end for Ollama. They have also released their own GUI https://ollama.com/blog/new-app
I first tried running Llama 3 Locally last year, https://analyticsvidhya.com/blog/2024/04/how-to-run-llama-3-locally/ using HuggingFace / Ollama. Here is a quickstart: Using LangChain with Ollama in Python https://github.com/ollama/ollama/blob/main/docs/tutorials/langchainpy.md - its simple: just `ollama run llama3.1`
I created an AI prototyping environment using Jupyter Lab IDE with Typescript, LangChain.js and Ollama for rapid AI prototyping https://itnext.io/create-an-ai-prototyping-environment-using-jupyter-lab-ide-with-typescript-langchain-js-7b395dae3f09
Note: if you are going to swim upstream and not use python for this, then IMHO, use Rust - see Rust Ollama By Example - 4 Chapters (Full Tutorial - Rust AI) https://youtu.be/OcH-zT5VNgM?si=9Qq4UiGarw3bm0Fy). try ollama-rs https://github.com/pepperoni21/ollama-rs A simple Rust lib for interacting with the Ollama API.
for dev with LazyVim, there is the ollama.nvim plugin https://github.com/nomnivore/ollama.nvim and the parrot.nvim plugin that brings stochastic parrots to Neovim https://github.com/frankroeder/parrot.nvim?tab=readme-ov-file#faq including ollama support.
So last year I tried Graph RAG with Ollama - Save $$$ with Local LLMs https://youtu.be/_XOCAVsr3KU?si=WE4eWFo4BIYlDDUx , Local LightRAG: a GraphRAG Alternative but Fully Local with Ollama https://youtube.com/watch?v=g21royNJ4fw, Local LangGraph Agents with Llama 3.1 + Ollama https://youtube.com/watch?v=5a-NuqTaC20 , and Ollama from LangChain: Orchestrate LLM prompt chains locally https://medium.com/@abonia/ollama-and-langchain-run-llms-locally-900931914a46 . Structured Outputs work well with Ollama https://ollama.com/blog/structured-outputs (Shaping Ollama’s JSON chat response format https://elegantcode.com/2024/12/13/6998/ ) , and so optimizations such as KV context quantisation https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/ . you can even do Local Synthetic Data Generation https://analyticsvidhya.com/blog/2025/01/local-synthetic-data-generation/
When DeepSeek came out this year, I had another play, https://ollama.com/library/deepseek-r1:14b see Running Ollama, DeepSeek and Open WebUI Locally: A Step-by-Step Guide https://itnext.io/running-ollama-deepseek-and-open-webui-locally-a-step-by-step-guide-cf43d27e1d14?source=rss----5b301f10ddcd---4
this time with VSCode - Using DeepSeek in your IDE for free:
I also tried from Google Colab terminal and Used LangChain with Ollama https://analyticsvidhya.com/blog/2025/02/run-terminal-in-google-colab/
Ollama's has a new engine for multimodal models https://ollama.com/blog/multimodal-models and supports running multiple model process instances at the same time https://youtu.be/8r_8CZqt5yk?si=VCFCubDSFj-DtUMz (need to check wether this is concurrent or parallel, and does it support MIG ). You can use it to for Fine Tuning an LLM for Parsing and Serving https://pub.towardsai.net/fine-tuning-llm-for-parsing-and-serving-through-ollama-e224a8a5636a?source=rss----98111c9905da---4
For Crush AI Coding Agent Setup with Ollama https://reddit.com/r/ollama/comments/1mjtgbi/crush_ai_coding_agent_setup_with_ollama_for/ you need to configure Local LM Studio/Ollama Custom Providers Support https://github.com/charmbracelet/crush/issues/447 in ~/.local/share/crush/crush.json
IMPORTANT: Running a model requires HBM capacity - use this online GPU Calculator https://aleibovici.github.io/ollama-gpu-calculator/ to see what can fit on your rig. You may need to play arround: Convert a model to GGUF https://docs.nvidia.com/brev/latest/ollama-brev.html fine tune on NVidia Brev, and quantize with llama.cpp, then push to Ollama hub.
PART B) THE HIGH COST OF PAY PER USE CODING MODELS - IS THERE AN ALTERNATIVE ?
like many dev divisions, we tried the build approach to accelerating the SDLC, then switched to buy. We followed the wisdom of crowds, and encouraged devs to start running first with Copilot, then with Cursor, and lately Claude Code. However this seems to get more and more expensive. We hire more devs, devs start using the tools more and the providers hike the prices - it is getting more and more expensive $$$$ !
I started looking into open source coding model alternatives. here are 10 LLM coding benchmarks https://evidentlyai.com/blog/llm-coding-benchmarks , and here are the best Ollama Models of 2025: https://share.google/EAFJVMDFMju5U4m6J and the Best Ollama Models for Developers https://collabnix.com/best-ollama-models-for-developers-complete-2025-guide-with-code-examples/
From Local to Production, I found a good Ollama to vLLM Migration Guide https://pub.towardsai.net/from-local-to-production-the-ultimate-ollama-to-vllm-migration-guide-571faa8cbfde?source=rss----98111c9905da---4
Besides the slight dip in quality, the important question is: AT WHAT POINT IS THIS ECONOMICALLY VIABLE ?
I asked Grok this question, and here is the answer: - To determine when it’s more economical to host a shared open-source coding model on a cloud virtual machine (VM) compared to using Claude or Cursor for x developers at $y per month per seat, we need to compare the costs of both approaches, considering usage patterns, infrastructure costs, and the specifics of open-source models. Since the query uses placeholders (x and y), I’ll provide a general framework and use realistic values based on available data for Claude and Cursor pricing, as well as typical costs for hosting open-source models. I’ll also include a specific example to illustrate the break-even point.
Step 1: Understand Claude and Cursor Pricing
Based on available information:
Claude Code Pricing:
Cursor Pricing:
For x developers at $y per month per seat, the total cost for Claude or Cursor is:
Step 2: Cost of Hosting a Shared Open-Source Coding Model on a Cloud VM
Hosting an open-source coding model (e.g., Qwen3 Coder, DeepSeek v3) on a cloud VM involves compute costs, setup, and maintenance. Based on:
Model Selection: Open-source models like DeepSeek v3 or Qwen3 Coder match or exceed Claude’s performance for coding tasks.
Compute Requirements: For DeepSeek v3, 8 × H200 GPUs at $3.14/hour each = $25.12/hour total.
Optimizations:
Additional Costs:
Total Estimated Cost:
Step 3: Break-Even Analysis
The break-even point occurs when the cost of Claude/Cursor for x developers equals the cost of hosting the open-source model. Let’s assume:
Break-Even Formula:
Example with API Usage:
Step 4: Other Considerations
Step 5: General Answer
It becomes more economical to host a shared open-source coding model on a cloud VM when the total cost of Claude or Cursor for x developers exceeds the hosting cost (typically $3,100–$6,500/month for an optimized setup). For example:
Example:
For a team of 200 developers (x = 200) at $20/month per seat (y = $20):
BOTTOM LINE: Self-hosting an open-source coding model on a cloud VM becomes more economical than Claude or Cursor when the number of developers (x) multiplied by the per-seat cost (y) exceeds ~$3,100–$6,500/month, typically at 156–326 developers for Pro plans ($20/month) or 226–1,092 developers for API usage ($5.97–$13.78/month). The exact break-even point depends on the model, usage patterns, and optimizations. For smaller teams, Claude or Cursor is more cost-effective; for larger teams or high-usage scenarios, self-hosting saves 60–80% on costs and eliminates rate limits.
Amazing article! An interesting insight drawn from the article - is - the smaller the dev division, it becomes less likely for data privacy to be a priority, since it is less likely the dev division has the resources to apply self hosting. That means, that by externally inspecting a software development company’s size, you could somewhat deduce the probability of their approach towards data privacy in the realm of AI usage. This simple equation could advise CISOs on picking an AI-powered product Thank you Josh!