Evaluating the Ambiguous – Measuring Hypervelocity Engineering Success
Following my posts on Becoming an AI Engineering Team, What is Hypervelocity Engineering, and Start Slow and Accelerate, I want to tackle one of the most challenging aspects of adoption: how do you know if it's actually working? Hypervelocity Engineering (HVE) is about leveraging AI and reusable patterns to accelerate the journey from raw idea to production software, freeing teams to focus on meaningful, creative work. Unlike traditional team metrics where success is often discipline-specific, measuring the effectiveness of human-AI collaboration across Design, Engineering, Project Management, Data Science, and Security requires us to think more holistically about value creation.
This is perhaps the most speculative post in the series so far. We're all still figuring out what "good" looks like when multi-disciplinary teams collaborate with AI, and the measurement approaches I'm sharing here are experiments in progress, not proven methodologies. But without some framework for evaluation, teams risk either abandoning promising approaches too early or persisting with practices that aren't delivering value.
What measurement challenge is your team facing as you integrate AI into your workflows? I'd love to hear your thoughts as we explore this together.
Learning from Established Frameworks
GitHub's Engineering System Success Playbook offers valuable themes we can adapt for HVE contexts: focusing on developer experience, measuring both velocity and quality of outcomes, and emphasizing continuous feedback loops. While their framework wasn't designed with AI collaboration in mind, the core principles of measuring what matters to your team's daily experience translate well across disciplines in our hybrid human-AI workflows.
The key insight from their approach is that successful measurement combines quantitative metrics with qualitative feedback, recognizing that the most important outcomes often can't be captured by numbers alone – whether you're talking about code quality, design effectiveness, or business value.
Setting the Stage
Before we get into the metrics themselves, I want to set the stage by talking about the broader context – ways metrics can go wrong, how to solve for that (using AI to help), and how to think about the act of measuring in the more holistic context of growing the AI muscles of your team.
Watch for Perverse Incentives and Anti-Patterns
Perhaps more important than tracking positive metrics is watching for the ways measurement can go wrong in HVE environments across different disciplines.
Teams can begin Gaming the Metrics. Project managers could start inflating velocity estimates to show AI impact. Designers may use AI only for simple asset generation to boost the success rate of AI-generated designs, rather than pursuing the design they believe is best. SDEs may leave prompt-engineering time out of their estimates of the time saved through AI tools.
Keep watch over time, as a simple metric like “time to PR approval” can work well at first and then suddenly drive undesirable behavior if your teams believe they are being judged on that value (think of KLOC and some of the code bloat that metric generated).
You can be vulnerable to Measurement Theater: Spending more time measuring AI effectiveness than actually improving cross-functional workflows, or creating elaborate dashboards that no one acts upon while ignoring qualitative feedback from team members.
Your team may suffer from Professional Insecurity Responses. These show up differently across disciplines; here are some to watch out for. Designers might over-emphasize AI-generated concepts to appear cutting-edge. Data scientists might under-report AI assistance to protect their analytical expertise. Security experts might avoid AI collaboration to maintain their role as the "human firewall."
Keep watch over time, as a simple metric like 'time to PR approval' can work well at first and suddenly drive undesirable behavior if your teams believe they are being judged on that value
Siloed Evaluation, or measuring AI impact within individual disciplines without considering cross-functional effects, can miss key improvements in the overall process. For instance, AI might slow down initial Data Science work, but the improvement in AI-coauthored DS code and tests may make DS-to-SDE handovers much smoother.
The most dangerous anti-pattern is measuring AI performance in isolation rather than evaluating how human-AI collaboration affects the entire product development lifecycle.
How have you seen measurement go wrong in your team's AI adoption journey? What warning signs should other teams watch for?
Course-Correcting When Things Go Sideways
The beauty of treating HVE as an ongoing experiment is that course correction becomes part of the process, not a sign of failure. Here are some tactics for when metrics suggest things aren't working across different functions.
Involve AI in cross-functional problem-solving: Ask your AI tools to analyze patterns across different discipline feedback and suggest alternative approaches. Sometimes AI can spot solutions that span functional boundaries in ways individual team members might miss. For instance, we spotted in some survey results that our data scientists were less confident using AI tools than our SDEs. Digging in, we realized we needed to make more time for the data scientists to benefit from experience sharing, so they could see how quickly the competence of these models was increasing, and we also needed to involve them more heavily in designing our scoring rubrics, as some evaluations were generating false confidence.
Revisit your measurement framework: Poor metrics might indicate you're measuring the wrong things, not that HVE isn't working. Be especially cautious about metrics that seem to improve in one area while degrading in another, as this often signals measurement misalignment rather than actual problems.
Scale back strategically: Rather than abandoning AI assistance entirely, be intentional in your experimentation. Learn what is working for you and your team, and what isn’t, and quickly pivot away from areas that aren’t providing value. Document why you abandon approaches – this field is changing rapidly, and approaches you abandon now as infeasible might be possible within a few months.
Cross-pollinate learnings: Security teams might discover prompting techniques that help design teams, or project management workflows might inform data science evaluation approaches. Make sure insights can flow across disciplinary boundaries. We use cross-discipline Teams meetings to share emerging insights and best practices, document our experiments with Markdown-based experiment templates (with synopses in Loop for easy discoverability), and run Viva Pulse surveys to track qualitative metric feedback over time. One important point we make with all of our teams is that there is no one right way at this point: things are changing fast enough that we're all learning from each other, no matter the discipline or level. We're evolving in our methods, though, so I would love to hear what your teams are using!
There is no “one right way” to use AI in your engineering engagements; we're all learning from each other across disciplines and levels, and the tools are exponentially increasing in capability
Set Realistic Expectations
AI won't solve all your team's challenges overnight. Pretending otherwise sets up everyone for disappointment. Creating measurement rubrics and cadences gives your teams the confidence to experiment with new techniques while providing stakeholders with concrete evidence of where AI delivers value, and where it currently falls short. This transparency builds trust by demonstrating you're approaching AI adoption thoughtfully, not "vibe coding" your way toward production issues.
The Meta-Challenge of Holistic Measurement
Perhaps the most challenging aspect of measuring HVE success is that we're trying to evaluate a moving target using tools that are themselves rapidly evolving, while balancing the needs and perspectives of multiple disciplines that historically measured success very differently.
The teams that will succeed long-term are those building the evaluation muscles that help them adapt
This is why I believe the most valuable measurement focuses on cross-functional team capabilities and processes rather than specific AI tool performance within individual disciplines. Are you getting better at identifying good use cases for AI assistance across different functions? Is your team developing stronger skills in human-AI collaboration that span traditional role boundaries? Are you building institutional knowledge that will transfer as tools evolve and as the lines between disciplines continue to blur?
The teams that will succeed long-term aren't necessarily those with the best metrics today within any single discipline, but those building the evaluation muscles that help them adapt as both AI capabilities and cross-functional collaboration patterns continue shifting.
Cross-Disciplinary Metrics That Matter
Based on the bottleneck identification I discussed in Start Slow and Accelerate, below are concrete metrics teams are considering or experimenting with across different disciplines. For those that seem subjective or ambiguous, we're using Likert scores or other simple judging rubrics that let us gather quantitative data on different facets and move away from “judging on vibes”, and we'll be adapting them over time. Having different teams use different (but potentially overlapping) criteria can be a good way to broaden your experimentation and coalesce more quickly on a set of criteria that works for you.
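To make the judging-rubric idea concrete, here is a minimal sketch in Python of how a team might encode Likert-scored facets and average them across reviewers. The facet names and scores are purely illustrative assumptions, not a rubric I'm prescribing; each team should define its own facets.

```python
from dataclasses import dataclass
from statistics import mean

# Illustrative facets for judging an AI-assisted deliverable, each scored 1-5.
# These facet names are placeholders -- define your own per discipline.
FACETS = ["clarity", "correctness", "maintainability", "time_saved"]

@dataclass
class RubricScore:
    reviewer: str
    scores: dict        # facet name -> 1-5 Likert score
    notes: str = ""     # optional open-ended comment

def aggregate(reviews: list[RubricScore]) -> dict:
    """Average each facet across reviewers so trends are comparable over time."""
    return {
        facet: round(mean(r.scores[facet] for r in reviews), 2)
        for facet in FACETS
    }

# Example usage with made-up data:
reviews = [
    RubricScore("alice", {"clarity": 4, "correctness": 5, "maintainability": 3, "time_saved": 4}),
    RubricScore("bob",   {"clarity": 3, "correctness": 4, "maintainability": 4, "time_saved": 5},
                notes="Prompt iteration took longer than expected."),
]
print(aggregate(reviews))
# -> {'clarity': 3.5, 'correctness': 4.5, 'maintainability': 3.5, 'time_saved': 4.5}
```

Even something this small moves the conversation from "it felt faster" to "time_saved is trending up while maintainability is flat", which is a much better starting point for a retro.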
Broken out by discipline, here are some examples of the metrics our teams are considering or experimenting with. I would love to hear what you and your teams are using.
Design and User Experience:
Project Management and Stakeholder Communication:
Data Science and Analytics:
Security and Compliance:
The goal isn't to prove AI is "better" - it's to understand where human-AI collaboration creates genuine value across your entire team's workflow.
Designing Effective Qualitative Measurement
One of the most practical approaches I've seen teams adopt borrows from data science evaluation practices, but simplified for cross-functional contexts. The key is creating systematic approaches to capture and interpret qualitative feedback that would otherwise be lost in the noise of daily work.
End-User Satisfaction Surveys: Beyond Basic Ratings
Rather than simple "How satisfied are you?" surveys, effective HVE measurement requires more nuanced questioning. Consider structuring surveys around specific collaboration scenarios:
For stakeholders receiving AI-enhanced project communications:
For team members using AI-assisted workflows:
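To keep scenario-based questions consistent across audiences, a lightweight structure helps. Here's a minimal sketch in Python; the question wording is hypothetical and only meant to show the shape (a shared Likert scale, a few scenario-specific statements per audience, and one open-ended follow-up), not the exact questions we use.

```python
# Illustrative only: scenario-based survey questions with a shared scale.
LIKERT_SCALE = "1 = strongly disagree ... 5 = strongly agree"

SURVEYS = {
    "stakeholders_receiving_ai_enhanced_comms": [
        "The AI-assisted status updates gave me the detail I needed to make decisions.",
        "I could tell when information was AI-summarized, and I trusted its accuracy.",
    ],
    "team_members_using_ai_assisted_workflows": [
        "AI assistance reduced the time I spent on rote tasks this sprint.",
        "I felt confident reviewing and correcting AI-generated output in my discipline.",
    ],
}
OPEN_ENDED_FOLLOWUP = "Describe one moment where AI collaboration clearly helped or hurt."

def render_survey(audience: str) -> str:
    """Render a plain-text survey for one audience, pairing each Likert item
    with the shared scale and a single open-ended follow-up."""
    lines = [f"Survey for: {audience}", f"Scale: {LIKERT_SCALE}", ""]
    lines += [f"{i}. {q}" for i, q in enumerate(SURVEYS[audience], start=1)]
    lines += ["", f"Open-ended: {OPEN_ENDED_FOLLOWUP}"]
    return "\n".join(lines)

print(render_survey("team_members_using_ai_assisted_workflows"))
```

Keeping the scale and follow-up identical across audiences is what lets you compare responses later; only the scenario statements change.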
Systematic Qualitative Data Capture
Establish regular "qualitative checkpoints" beyond traditional retrospectives, and use those to refine and improve your HVE workflow:
The key is creating structure around qualitative feedback while keeping the cognitive load manageable. Use consistent language and scales, but allow for open-ended responses that can reveal unexpected insights.
Analyzing and Acting on Qualitative Data
Qualitative feedback can be tough to turn into actionable insights; you can do so through:
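One lightweight option, purely as an illustration: tag each open-ended response with recurring themes and track how often those themes appear at each checkpoint. The sketch below assumes the tagging has already happened (by a reviewer or an AI assistant), and the data shown is made up.

```python
from collections import Counter

# Illustrative sketch: turning tagged open-ended feedback into trend signals.
feedback = [
    {"checkpoint": "2025-05", "themes": ["prompt_fatigue", "faster_reviews"]},
    {"checkpoint": "2025-06", "themes": ["prompt_fatigue", "handover_friction"]},
    {"checkpoint": "2025-06", "themes": ["prompt_fatigue"]},
]

def theme_counts_by_checkpoint(entries):
    """Count how often each theme appears in each checkpoint period."""
    counts = {}
    for entry in entries:
        period = counts.setdefault(entry["checkpoint"], Counter())
        period.update(entry["themes"])
    return counts

for period, themes in sorted(theme_counts_by_checkpoint(feedback).items()):
    print(period, dict(themes))
# A theme that keeps growing (here, "prompt_fatigue") is a prompt to dig deeper
# with the affected discipline, not a verdict on its own.
```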
Tracking Business Value and ROI
Over the medium to long term, you’ll want to track business value and ROI in a quantitative way. GitHub’s ESSP comes in handy again here, as they outline some concrete business value metrics in their playbook. One that I find particularly valuable in the context of HVE is Feature Engineering Expenses vs. Total Engineering Expenses. The goal of HVE, at least in the beginning, as I’ve laid out in my other posts, is to help reduce and remove bottlenecks and to ease the burden of more rote engineering tasks, leaving impactful and interesting features to be co-designed and co-developed by your team and AI in tandem. Understanding how much of your effort is going into feature development, and how that’s changing as AI is adopted on your teams, can be a good indicator of ROI for those AI investments. However, business value metrics are wide and varied, so I would love to hear: what are you using for tracking ROI in this new AI-enabled landscape?
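As a purely illustrative sketch of what tracking that ratio could look like (the numbers are made up, and your actual cost data will come from your own finance tooling), something this simple is enough to start a trend line:

```python
# Illustrative only: share of engineering spend going to feature work over time.
quarters = {
    "2025-Q1": {"feature_engineering_cost": 420_000, "total_engineering_cost": 1_000_000},
    "2025-Q2": {"feature_engineering_cost": 510_000, "total_engineering_cost": 1_000_000},
}

for quarter, spend in quarters.items():
    ratio = spend["feature_engineering_cost"] / spend["total_engineering_cost"]
    print(f"{quarter}: {ratio:.0%} of engineering spend went to feature work")
# If the ratio trends upward as AI adoption matures, that's one signal the rote,
# non-feature work is genuinely shrinking; worth correlating with your other metrics.
```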
Your Turn: Share Your Measurement Journey
I'm particularly curious about your experiences with measuring AI adoption across different disciplines. Some questions for discussion:
The measurement playbook for HVE is still being written, and your experiences could help other teams avoid common pitfalls while identifying promising approaches. Share your thoughts, challenges, and successes in the comments and let's build this knowledge base together.
#HypervelocityEngineering #AIEngineering #TechLeadership #SoftwareDevelopment #EngineeringExcellence #AIProductivity #TechInnovation #EngineeringTeams #TechMetrics #CrossFunctionalTeams