An experiment with Model Context Protocol (MCP) for Spark Code Optimization
Integrating AI agents with real-world tools like Apache Spark opens up significant potential in the rapidly evolving world of AI and data engineering. Over the past couple of weeks, I’ve been diving into the Model Context Protocol (MCP), and to solidify my understanding I decided to build something practical: a system that submits Spark code through an MCP client-server architecture and returns performance optimization recommendations.
High-Level Architecture
This project is now available on GitHub here
What is the Model Context Protocol (MCP)?
MCP is a protocol that acts as a bridge between AI models and tools/external systems. It helps standardize how AI models interact with tools, execute actions, and manage context. Think of it as an intelligent glue layer that allows language models to control tools in a predictable and context-aware way.
Here’s a Simple Analogy:
Let’s say you have a virtual assistant (like Siri or Alexa), and you say:
“Hey Assistant, can you book me a flight, add it to my calendar, and also email me the itinerary?”
Without MCP:
Your assistant might try to understand each part individually.
It could mess up the context.
It might treat every action as a fresh command with no memory or coordination.
With MCP:
Each action like book_flight, add_to_calendar, and send_email is a registered tool.
Your assistant knows how to call each tool, understands its inputs and outputs, and keeps the context (like dates and destinations).
The assistant handles these as a chain of structured tool calls with consistent formats and reliable behaviour.
Now, imagine replacing "book a flight" with "optimize Spark code" or "analyze logs." That’s what MCP enables for AI agents like Claude or GPT.
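The analogy above can be made concrete. Here is a minimal, illustrative sketch of a chain of structured tool calls sharing context; the tool names and payload shape are hypothetical and not MCP's exact wire format:

```python
import json

# Hypothetical registry of tools the assistant can call,
# each declaring the inputs it expects.
TOOLS = {
    "book_flight": ["destination", "date"],
    "add_to_calendar": ["title", "date"],
    "send_email": ["to", "subject", "body"],
}

def make_tool_call(name, arguments):
    """Build one structured tool call; unknown tools are rejected."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return {"tool": name, "arguments": arguments}

# Shared context survives across calls, so later tools can reuse
# details (date, destination) extracted by earlier ones.
context = {"destination": "Paris", "date": "2025-06-01"}

calls = [
    make_tool_call("book_flight",
                   {"destination": context["destination"], "date": context["date"]}),
    make_tool_call("add_to_calendar",
                   {"title": "Flight to Paris", "date": context["date"]}),
]

print(json.dumps(calls, indent=2))
```

Because every call has the same structure, the assistant can validate, log, and chain them reliably instead of re-parsing free text at each step.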
Spark Code Optimizer via MCP
This project demonstrates how to optimize a Spark job using an AI model (e.g., Claude API) through an MCP-style interface. The system uses a client-server architecture where the client submits Spark code and the server connects with an AI model to return optimized code.
Architecture
Key Components
1. Input Layer
spark_code_input.py: Source PySpark code for optimization
run_client.py: Client startup and configuration
2. MCP Client Layer
SparkMCPClient: Async client implementation
Tools Interface: Protocol-compliant tool invocation
3. MCP Server Layer
run_server.py: Server initialization
SparkMCPServer: Core server implementation
Tool Registry: Optimization and analysis tools
Protocol Handler: MCP request/response management
4. Resource Layer
Claude AI: Code analysis and optimization
PySpark Runtime: Code execution and validation
5. Output Layer
optimized_spark_example.py: Optimized code
performance_analysis.md: Detailed analysis
This workflow illustrates:
Input PySpark code submission
MCP protocol handling and routing
Claude AI analysis and optimization
Code transformation and validation
Performance analysis and reporting
Dive into the spark-mcp directory
Inside the spark-mcp/ directory, you’ll find two important files:
server.py – The Backend Brain
This file is the MCP-style server built with FastAPI. Here's what it does:
Registers a tool called optimize_code, which represents an action the AI can take (i.e., code optimization).
Receives structured tool calls (like optimize_code) from the client.
Passes the code to Claude (via claude_call), which is assumed to be a call to an AI model that returns the optimized code.
Returns a structured response with the optimized code as a tool result.
This simulates a real MCP server where tools can be abstracted and invoked via protocol-compatible calls.
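The dispatch logic such a server implements can be sketched in a few lines. This is a stdlib-only simplification, not the actual server code; `claude_call` is stubbed here, whereas the real server calls the Claude API:

```python
# Stub standing in for the real Claude API call.
def claude_call(prompt: str) -> str:
    return f"# optimized by AI\n{prompt}"

TOOL_REGISTRY = {}

def register_tool(name, fn):
    TOOL_REGISTRY[name] = fn

def optimize_code(code: str) -> dict:
    """Tool implementation: ask the model for an optimized version."""
    return {"tool": "optimize_code", "result": claude_call(code)}

register_tool("optimize_code", optimize_code)

def handle_invoke(request: dict) -> dict:
    """Dispatch a structured tool call, as the /invoke endpoint would."""
    tool = TOOL_REGISTRY.get(request.get("tool"))
    if tool is None:
        return {"error": f"unknown tool: {request.get('tool')}"}
    return tool(request.get("input", ""))

response = handle_invoke({"tool": "optimize_code",
                          "input": "df.groupBy('k').count()"})
print(response["result"])
```

The key idea is the registry: the AI never calls Claude directly, it invokes a named tool, and the server decides how that tool is fulfilled.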
client.py – The Spark Code Optimizer Client
This file represents the client that interacts with the MCP server:
Sends the original Spark code as input to the /invoke endpoint of the server.
Wraps the code in a structured tool call request using the optimize_code tool name.
Receives and prints the optimized Spark code, optionally saving it.
Think of this as a real-world simulation of how an AI agent can invoke tool-based workflows to get back structured results.
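A hedged sketch of the client side, using only the standard library; the endpoint URL and payload shape are assumptions for illustration, not the project's exact implementation:

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8000/invoke"  # assumed endpoint

def build_invoke_request(code: str) -> dict:
    """Wrap raw Spark code in a structured optimize_code tool call."""
    return {"tool": "optimize_code", "input": code}

def optimize_remote(code: str) -> str:
    """POST the tool call to the MCP server and return the optimized code.
    Requires the server (run_server.py) to be running locally."""
    payload = json.dumps(build_invoke_request(code)).encode()
    req = urllib.request.Request(
        SERVER_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

print(build_invoke_request("df.join(dim, 'id')"))
```

Note that the client knows nothing about Claude; it only knows the tool name and the request format, which is exactly the decoupling MCP is after.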
Why MCP Matters (More Than Just Calling Claude AI)
One might ask: why not just call Claude directly to get Spark code optimizations? Why introduce the extra complexity of an MCP server?
Tool Abstraction & Standardization
MCP allows the AI to interact with tools in a predictable and structured format, not just via ad-hoc API calls. This means you can define tool interfaces like optimize_code, and the AI can call them as capabilities, not just plain prompts.
Context Awareness & State Management
With MCP, your AI agent can maintain context across multiple interactions. Instead of treating each prompt as a fresh start, MCP lets the AI interact as if it's using a programmable interface with memory, tracking code state, feedback, or performance metrics.
Scalable, Multi-Tool Architecture
Imagine adding more tools later: data profilers, logging analyzers, security scanners, etc. MCP lets the AI switch between or combine tools within a single protocol. It’s modular and future-proof.
Loop-Friendly Workflows
MCP makes it easy to build feedback loops: AI suggests optimization → code is benchmarked → result is sent back to AI → repeat. This would be clumsy and error-prone with just direct API calls to Claude.
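Such a loop can be sketched in a few lines. Here `optimize` and `benchmark` are stubs standing in for the MCP tool call and a real Spark run (each tuning pass is pretended to cut runtime by 20%); the structure of the loop is the point:

```python
# Sketch of an optimize -> benchmark -> feed-back loop.
def optimize(code: str, feedback: str = "") -> str:
    """Stub for the MCP optimize_code tool call."""
    return code + "  # tuned" + (f" ({feedback})" if feedback else "")

def benchmark(code: str) -> float:
    """Stub for a real Spark run: each tuning pass shaves 20% off."""
    return 10.0 * (0.8 ** code.count("# tuned"))

def optimize_loop(code: str, rounds: int = 3):
    best_time = benchmark(code)
    for _ in range(rounds):
        candidate = optimize(code, feedback=f"last run took {best_time:.1f}s")
        t = benchmark(candidate)
        if t < best_time:          # keep only genuine improvements
            code, best_time = candidate, t
    return code, best_time

final_code, final_time = optimize_loop("df.groupBy('k').count()")
print(final_time)
```

Because every iteration is a structured tool call, the loop can log, validate, and stop cleanly, which is hard to do robustly with free-form prompt exchanges.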
Structured Communication
Instead of relying on raw text prompts and free-form responses, MCP uses structured JSON payloads, making responses easier to parse, validate, and debug.
Quick Start
Step 1: Add your Spark code to optimize in input/spark_code_input.py
Step 2: Start the MCP server: python run_server.py
Step 3: Run the client to optimize your code: python run_client.py
This will generate two files:
output/optimized_spark_example.py: The optimized Spark code with detailed optimization comments
output/performance_analysis.md: Comprehensive performance analysis
Step 4: Run and compare code versions: python run_optimized.py. This will:
Execute both original and optimized code
Compare execution times and results
Update the performance analysis with execution metrics
Show detailed performance improvement statistics
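The comparison step can be sketched with stdlib timing. The function names below are illustrative stand-ins for executing the original and optimized scripts, not the project's actual run_optimized.py:

```python
import time

def run_original():
    # Stand-in for executing input/spark_code_input.py
    return sum(i * i for i in range(200_000))

def run_optimized():
    # Stand-in for executing output/optimized_spark_example.py
    return sum(i * i for i in range(200_000))

def compare(original, optimized):
    """Time both versions, check the results match, report the speedup."""
    t0 = time.perf_counter(); r1 = original(); t1 = time.perf_counter()
    r2 = optimized(); t2 = time.perf_counter()
    assert r1 == r2, "optimized code changed the result!"
    orig_s, opt_s = t1 - t0, t2 - t1
    return {
        "original_s": orig_s,
        "optimized_s": opt_s,
        "improvement_pct": 100 * (orig_s - opt_s) / orig_s,
    }

print(compare(run_original, run_optimized))
```

The result-equality assertion matters as much as the timing: an "optimization" that changes the output is a bug, not a speedup.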
Output Examples
Input Code to be Optimized (Intentionally written unoptimized Spark code):
Optimized Spark Code (Generated by Claude AI through MCP Server):
Performance Analysis (Generated by Claude AI through MCP Server)
Conclusion
This project was an attempt to merge practical Spark engineering with GenAI-based optimization, all via a clean MCP-style protocol. It shows how AI models can interact with traditional big data tools to suggest performance improvements, something very relevant in real-world workloads.
The MCP pattern makes this approach modular and scalable. Next steps might include:
Supporting more tools (e.g., SQL, Python code).
Enhancing AI prompt engineering.
Visualizing optimization reports and diffs.
🔗 Check out the full code here: GitHub - ai-spark-mcp-server