An experiment with Model Context Protocol (MCP) for Spark Code Optimization
Integrating AI agents with real-world tools like Apache Spark opens up significant potential in the rapidly evolving world of AI and data engineering. Over the past couple of weeks, I’ve been diving into the Model Context Protocol (MCP), and to solidify my understanding I decided to build something practical: a system that submits Spark code through an MCP client-server architecture and returns performance optimization recommendations.
High-Level Architecture
This project is now available on GitHub here
What is the Model Context Protocol (MCP)?
MCP is a protocol that acts as a bridge between AI models and tools/external systems. It helps standardize how AI models interact with tools, execute actions, and manage context. Think of it as an intelligent glue layer that allows language models to control tools in a predictable and context-aware way.
Here’s a Simple Analogy:
Let’s say you have a virtual assistant (like Siri or Alexa), and you say:
“Hey Assistant, can you book me a flight, add it to my calendar, and also email me the itinerary?”
Without MCP:
Your assistant might try to understand each part individually.
It could mess up the context.
It might treat every action as a fresh command with no memory or coordination.
With MCP:
Each action like book_flight, add_to_calendar, and send_email is a registered tool.
Your assistant knows how to call each tool, understands its inputs and outputs, and keeps the context (like dates and destinations).
The assistant handles these as a chain of structured tool calls with consistent formats and reliable behaviour.
Now, imagine replacing "book a flight" with "optimize Spark code" or "analyze logs." That’s what MCP enables for AI agents like Claude or GPT.
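The analogy above can be made concrete. Here is a minimal, illustrative sketch of a chain of structured tool calls sharing context; the tool names and payload shape are hypothetical and not MCP's exact wire format:

```python
import json

# Hypothetical registry of tools the assistant can call,
# each declaring the inputs it expects.
TOOLS = {
    "book_flight": ["destination", "date"],
    "add_to_calendar": ["title", "date"],
    "send_email": ["to", "subject", "body"],
}

def make_tool_call(name, arguments):
    """Build one structured tool call; unknown tools are rejected."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return {"tool": name, "arguments": arguments}

# Shared context survives across calls, so later tools can reuse
# details (date, destination) extracted by earlier ones.
context = {"destination": "Paris", "date": "2025-06-01"}

calls = [
    make_tool_call("book_flight",
                   {"destination": context["destination"], "date": context["date"]}),
    make_tool_call("add_to_calendar",
                   {"title": "Flight to Paris", "date": context["date"]}),
]

print(json.dumps(calls, indent=2))
```

Because every call has the same structure, the assistant can validate, log, and chain them reliably instead of re-parsing free text at each step.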
Spark Code Optimizer via MCP
This project demonstrates how to optimize a Spark job using an AI model (e.g., Claude API) through an MCP-style interface. The system uses a client-server architecture where the client submits Spark code and the server connects with an AI model to return optimized code.
Architecture
Key Components
1. Input Layer
spark_code_input.py: Source PySpark code for optimization
run_client.py: Client startup and configuration
2. MCP Client Layer
SparkMCPClient: Async client implementation
Tools Interface: Protocol-compliant tool invocation
3. MCP Server Layer
run_server.py: Server initialization
SparkMCPServer: Core server implementation
Tool Registry: Optimization and analysis tools
Protocol Handler: MCP request/response management
4. Resource Layer
Claude AI: Code analysis and optimization
PySpark Runtime: Code execution and validation
5. Output Layer
optimized_spark_example.py: Optimized code
performance_analysis.md: Detailed analysis
This workflow illustrates:
Input PySpark code submission
MCP protocol handling and routing
Claude AI analysis and optimization
Code transformation and validation
Performance analysis and reporting
Dive into the spark-mcp directory
Inside the spark-mcp/ directory, you’ll find two important files:
server.py – The Backend Brain
This file is the MCP-style server built with FastAPI. Here's what it does:
Registers a tool called optimize_code, which represents an action the AI can take (i.e., code optimization).
Receives structured tool calls (like optimize_code) from the client.
Passes the code to Claude (via claude_call), which is assumed to be a call to an AI model that returns the optimized code.
Returns a structured response with the optimized code as a tool result.
This simulates a real MCP server where tools can be abstracted and invoked via protocol-compatible calls.
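The dispatch logic such a server implements can be sketched in a few lines. This is a stdlib-only simplification, not the actual server code; `claude_call` is stubbed here, whereas the real server calls the Claude API:

```python
# Stub standing in for the real Claude API call.
def claude_call(prompt: str) -> str:
    return f"# optimized by AI\n{prompt}"

TOOL_REGISTRY = {}

def register_tool(name, fn):
    TOOL_REGISTRY[name] = fn

def optimize_code(code: str) -> dict:
    """Tool implementation: ask the model for an optimized version."""
    return {"tool": "optimize_code", "result": claude_call(code)}

register_tool("optimize_code", optimize_code)

def handle_invoke(request: dict) -> dict:
    """Dispatch a structured tool call, as the /invoke endpoint would."""
    tool = TOOL_REGISTRY.get(request.get("tool"))
    if tool is None:
        return {"error": f"unknown tool: {request.get('tool')}"}
    return tool(request.get("input", ""))

response = handle_invoke({"tool": "optimize_code",
                          "input": "df.groupBy('k').count()"})
print(response["result"])
```

The key idea is the registry: the AI never calls Claude directly, it invokes a named tool, and the server decides how that tool is fulfilled.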
client.py – The Spark Code Optimizer Client
This file represents the client that interacts with the MCP server:
Sends the original Spark code as input to the /invoke endpoint of the server.
Wraps the code in a structured tool call request using the optimize_code tool name.
Receives and prints the optimized Spark code, optionally saving it.
Think of this as a real-world simulation of how an AI agent can invoke tool-based workflows to get back structured results.
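A hedged sketch of the client side, using only the standard library; the endpoint URL and payload shape are assumptions for illustration, not the project's exact implementation:

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8000/invoke"  # assumed endpoint

def build_invoke_request(code: str) -> dict:
    """Wrap raw Spark code in a structured optimize_code tool call."""
    return {"tool": "optimize_code", "input": code}

def optimize_remote(code: str) -> str:
    """POST the tool call to the MCP server and return the optimized code.
    Requires the server (run_server.py) to be running locally."""
    payload = json.dumps(build_invoke_request(code)).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

print(build_invoke_request("df.join(dim, 'id')"))
```

Note that the client knows nothing about Claude; it only knows the tool name and the request format, which is exactly the decoupling MCP is after.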
Why MCP Matters (More Than Just Calling Claude AI)
One might ask: why not just call Claude directly to get Spark code optimizations? Why introduce the extra complexity of an MCP server?
Tool Abstraction & Standardization
MCP allows the AI to interact with tools in a predictable and structured format, not just via ad-hoc API calls. This means you can define tool interfaces like optimize_code, and the AI can call them as capabilities, not just plain prompts.
Context Awareness & State Management
With MCP, your AI agent can maintain context across multiple interactions. Instead of treating each prompt as a fresh start, MCP lets the AI interact as if it's using a programmable interface with memory, tracking code state, feedback, or performance metrics.
Scalable, Multi-Tool Architecture
Imagine adding more tools later: data profilers, logging analyzers, security scanners, etc. MCP lets the AI switch between or combine tools within a single protocol. It’s modular and future-proof.
Loop-Friendly Workflows
MCP makes it easy to build feedback loops: AI suggests optimization → code is benchmarked → result is sent back to AI → repeat. This would be clumsy and error-prone with just direct API calls to Claude.
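Such a loop can be sketched in a few lines. Here `optimize` and `benchmark` are stubs standing in for the MCP tool call and a real Spark run (each tuning pass is pretended to cut runtime by 20%); the structure of the loop is the point:

```python
# Sketch of an optimize -> benchmark -> feed-back loop.
def optimize(code: str, feedback: str = "") -> str:
    """Stub for the MCP optimize_code tool call."""
    return code + "  # tuned" + (f" ({feedback})" if feedback else "")

def benchmark(code: str) -> float:
    """Stub for a real Spark run: each tuning pass shaves 20% off."""
    return 10.0 * (0.8 ** code.count("# tuned"))

def optimize_loop(code: str, rounds: int = 3):
    best_time = benchmark(code)
    for _ in range(rounds):
        candidate = optimize(code, feedback=f"last run took {best_time:.1f}s")
        t = benchmark(candidate)
        if t < best_time:          # keep only genuine improvements
            code, best_time = candidate, t
    return code, best_time

final_code, final_time = optimize_loop("df.groupBy('k').count()")
print(final_time)
```

Because every iteration is a structured tool call, the loop can log, validate, and stop cleanly, which is hard to do robustly with free-form prompt exchanges.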
Structured Communication
Instead of relying on raw text prompts and free-form responses, MCP uses structured JSON payloads, making responses easier to parse, validate, and debug.
Quick Start
Step 1: Add your Spark code to optimize in input/spark_code_input.py
Step 2: Start the MCP server: python run_server.py
Step 3: Run the client to optimize your code: python run_client.py
This will generate two files:
output/optimized_spark_example.py: The optimized Spark code with detailed optimization comments
output/performance_analysis.md: Comprehensive performance analysis
Step 4: Run and compare code versions: python run_optimized.py. This will:
Execute both original and optimized code
Compare execution times and results
Update the performance analysis with execution metrics
Show detailed performance improvement statistics
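The comparison step can be sketched with stdlib timing. The function names below are illustrative stand-ins for executing the original and optimized scripts, not the project's actual run_optimized.py:

```python
import time

def run_original():
    # Stand-in for executing input/spark_code_input.py
    return sum(i * i for i in range(200_000))

def run_optimized():
    # Stand-in for executing output/optimized_spark_example.py
    return sum(i * i for i in range(200_000))

def compare(original, optimized):
    """Time both versions, check the results match, report the speedup."""
    t0 = time.perf_counter(); r1 = original(); t1 = time.perf_counter()
    r2 = optimized(); t2 = time.perf_counter()
    assert r1 == r2, "optimized code changed the result!"
    orig_s, opt_s = t1 - t0, t2 - t1
    return {
        "original_s": orig_s,
        "optimized_s": opt_s,
        "improvement_pct": 100 * (orig_s - opt_s) / orig_s,
    }

print(compare(run_original, run_optimized))
```

The result-equality assertion matters as much as the timing: an "optimization" that changes the output is a bug, not a speedup.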
Output Examples
Input Code to be Optimized (Intentionally written unoptimized Spark code):
Optimized Spark Code (Generated by Claude AI through MCP Server):
Performance Analysis (Generated by Claude AI through MCP Server)
Conclusion
This project was an attempt to merge practical Spark engineering with GenAI-based optimization, all via a clean MCP-style protocol. It shows how AI models can interact with traditional big data tools to suggest performance improvements, something very relevant in real-world workloads.
The MCP pattern makes this approach modular and scalable. Next steps might include:
Supporting more tools (e.g., SQL, Python code).
Enhancing AI prompt engineering.
Visualizing optimization reports and diffs.
🔗 Check out the full code here: GitHub - ai-spark-mcp-server