Finding the Spacing Signal in the Noise: How to Debias a Model

Novi Labs

The most accurate and timely oil & gas data, combined with the industry's most powerful machine learning software.

Published Mar 25, 2025

Note: This is the second part of a multi-part series on Novi’s approach to creating forecasting models that respond accurately and sensibly to downspacing and parent-child depletion.

Understanding the Bias in Well Forecasting Models

In Part 1, we identified trends in spacing data that do not agree with physical intuition. These trends, which show production rising with tighter spacing, stem from operators choosing to downspace and increase completion intensity in higher quality rock. As a result, purely associative pattern recognition models struggle to decouple the positive effects of rock quality from the negative effects of tighter spacing.

Now that we have examined the bias in the underlying data, let’s have a look at the results of a purely associative machine learning model. The tree-based associative model treats all input variables equally and has no special knowledge of causation or physics. This model is trained on 11,176 horizontal wells in the Delaware Basin, filtered as shown in Figure 1, and uses the following input variables:

Figure 1. Filters showing composition of wells used to train the Delaware Basin model. Extremely short laterals and missing values for fluid and proppant comprise the bulk of the exclusions. Only wells with 90+ days of production data were considered.

Fluid Per Foot
Proppant Per Foot
Target Formation
Depth of Target Formation
Water Saturation
Total Organic Carbon
Clay Content
Thickness of Target Formation
Distance to Lateral Farther Neighbor
Distance to Stagger Farther Neighbor
Age of Parent Well
Distance to Parent Well

Figure 2 shows the definitions of the lateral farther and stagger farther neighbors that were used as spacing features in this model. Using the second closest neighbors creates a delineation between exterior, single-bounded wells and interior, double-bounded wells on a pad. Other spacing definitions, such as total wells within a radius or average neighbor distance, can be used, but these are the two interwell features we chose for this particular model.

Figure 2. Definition of stagger and lateral neighbor distances. In the model results shown here, we examine the impact of lateral and stagger farther distances on production.
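As a rough illustration of how a farther-neighbor feature could be computed, the Python sketch below takes the second-closest neighbor by straight-line distance in a cross-section view. It assumes, purely for illustration, that a lateral neighbor is a well in the same bench and a stagger neighbor is a well in a different bench; Figure 2 carries the actual definitions. The well coordinates, field names, and the 5,280 ft cap for wells with fewer than two neighbors are all hypothetical.

```python
import math

MAX_DIST_FT = 5280.0  # hypothetical cap for wells lacking a second neighbor


def farther_neighbor_distance(well, wells, same_bench, max_dist=MAX_DIST_FT):
    """Distance (ft) to the 2nd-closest neighbor of the requested type.

    same_bench=True  -> "lateral" neighbors (assumed: same bench)
    same_bench=False -> "stagger" neighbors (assumed: different bench)
    Wells with fewer than two such neighbors within max_dist get max_dist.
    """
    dists = sorted(
        math.hypot(o["x"] - well["x"], o["z"] - well["z"])
        for o in wells
        if o["name"] != well["name"]
        and (o["bench"] == well["bench"]) == same_bench
    )
    dists = [d for d in dists if d <= max_dist]
    return dists[1] if len(dists) >= 2 else max_dist


# Hypothetical pad in cross-section: x is horizontal offset, z is depth offset (ft).
pad = [
    {"name": "A1", "x": 0.0, "z": 0.0, "bench": "A"},
    {"name": "A2", "x": 660.0, "z": 0.0, "bench": "A"},
    {"name": "A3", "x": 1320.0, "z": 0.0, "bench": "A"},
    {"name": "B1", "x": 330.0, "z": -300.0, "bench": "B"},
    {"name": "B2", "x": 990.0, "z": -300.0, "bench": "B"},
]

features = {
    w["name"]: {
        "lateral_farther": farther_neighbor_distance(w, pad, same_bench=True),
        "stagger_farther": farther_neighbor_distance(w, pad, same_bench=False),
    }
    for w in pad
}

for name, feats in features.items():
    print(name, feats)
```

In this toy pad the interior well A2 gets a smaller lateral farther distance than the exterior well A1, which is the interior versus exterior delineation described above. This simple version ignores which side a neighbor sits on.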

Impact of Proppant Loading on Production: SHAP Analysis

Figure 3 shows the SHAP dependence plot for proppant in the Delaware Basin. The SHAP impact isolates the effect of a variable on the model, showing how much the value of an input variable can move a forecast away from the average well in the dataset. In this case, increasing proppant loading moves the forecast up in a mostly linear trend between 1000 and 3000 lbs per foot. As we saw in the raw data, some local variations occur around round numbers with large sample sizes (2000, 2500, etc.), but the trend is mostly uniform, and proppant alone can swing a forecast by 20% in this range. An operator would likely scale fluid and proppant together, creating an even larger impact for completion intensity.

Figure 3. SHAP values in the Delaware Basin for proppant loading. This trend is largely positive and linear, as we would expect. The underlying data sampling reflects physical reality.
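For intuition about what a SHAP value measures, here is a minimal Python sketch that computes exact Shapley values by brute force for a toy two-feature linear model. The model, its coefficients, and the background rows are invented for illustration and are not Novi's model; production SHAP tooling uses much faster algorithms for tree-based models. The point is only that the base value (the average prediction) plus the per-feature impacts adds up to the forecast for a given well.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of model(x); absent features are averaged over background rows."""
    n = len(x)

    def value(subset):
        # Mean prediction with features in `subset` fixed to x and the
        # remaining features taken from each background row in turn.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(len(rest) + 1):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(rest, k):
                contrib += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(contrib)
    return value(set()), phi


def toy_model(features):
    """Hypothetical forecast: features = [proppant lbs/ft, spacing ft]."""
    proppant, spacing = features
    return 50.0 + 0.01 * proppant + 0.005 * spacing


# Hypothetical background wells and one well to explain.
background = [[1000.0, 440.0], [2000.0, 880.0], [3000.0, 1320.0]]
x = [3000.0, 660.0]

base, phi = shapley_values(toy_model, x, background)
print("base value:", base)
print("proppant impact:", phi[0])
print("spacing impact:", phi[1])
print("base + impacts:", base + sum(phi), "forecast:", toy_model(x))
```

In this toy case the high proppant loading pushes the forecast above the average well and the tighter-than-average spacing pulls it slightly below, which is the kind of per-variable movement a SHAP dependence plot displays across many wells.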

Figure 4 shows a different story. Figures 4A and 4C show the same raw data trend from Part 1 of this series, while 4B and 4D show the equivalent SHAP trends for these variables in the model. The model is able to determine that downspacing has a negative impact on production, but the magnitude of change is much smaller than we would expect. Adding together these two features gives a downspacing impact of less than 10% when moving from a purely unbounded well to a sub-440 ft cube development. The purely associative model is able to isolate part of the trend we expect to see, but it is unlikely to be useful for understanding the economic impacts of downspacing.

Figure 4. A) Distance to 2nd closest lateral neighbor and 1-year cumulative oil production per lateral foot in the Delaware Basin. B) SHAP Impact of 2nd closest lateral neighbor feature. C) Distance to 2nd closest stagger neighbor and 1-year cumulative oil production per lateral foot. D) SHAP Impact of 2nd closest stagger neighbor feature.

Now that we have identified the underlying data issue and the results of an associative machine learning model, what tools can we use to create models that respond in a physically sensible way to spacing and depletion? Do data scientists in other industries face similar problems? As it turns out, these types of issues are common in medicine and economics, where fully randomized and controlled trials are impossible due to cost, ethics, or time constraints.

Figure 5, borrowed from Wikimedia, demonstrates the concept we are trying to elicit from our forecasting models for shale wells. In this example dataset, the overall trend is down and to the right, with a correlation coefficient of -0.74. This is analogous to the basinwide trend, created by sampling bias, in which downspacing is correlated with higher production. Within this dataset, there are cohorts showing the expected behavior of a positive correlation. Similarly, within our basinwide dataset, there are cohorts of wells with similar rock quality and completions that show degradation with tighter spacing. As an operator, you have probably observed this by comparing two pads in nearby areas with different spacing strategies.

Figure 5. Conceptual example of stratification. If we have prior knowledge of subpopulations which create our biased dataset, we can create a model within each one to derive the correct signal.

Because of the inherent sampling bias in the data, we need a way to teach our models that geology and spacing will be correlated. One way to do this is to segment the dataset. Figure 5 demonstrates the concept of stratification, manually bucketing the dataset with some prior knowledge of the shape of the bias. For our dataset, this might take the form of bucketing wells by rock quality or creating a model without unbounded or single bounded wells.
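The bucketing idea can be sketched on synthetic data. The Python example below invents a population in which operators space wells tighter in better rock, then compares the pooled spacing trend with the trend measured inside rock-quality buckets. Every number here (the rock-quality index, the coefficients, the noise levels, the 20 buckets) is a made-up assumption, values are not clipped to physical ranges, and this is not Novi's workflow.

```python
import random

TRUE_EFFECT = 0.02  # toy "physics": production gained per extra ft of spacing


def simulate_wells(n=5000, seed=0):
    """Synthetic (rock, spacing, production) rows with rock-driven spacing choices."""
    rng = random.Random(seed)
    wells = []
    for _ in range(n):
        rock = rng.gauss(0.0, 1.0)  # hypothetical rock-quality index
        # Better rock -> tighter spacing: the decision that biases the data.
        spacing = 1000.0 - 250.0 * rock + rng.gauss(0.0, 200.0)
        production = 100.0 + 20.0 * rock + TRUE_EFFECT * spacing + rng.gauss(0.0, 5.0)
        wells.append((rock, spacing, production))
    return wells


def centered_sums(rows):
    """Centered cross-product and sum of squares for (spacing, production)."""
    n = len(rows)
    mean_s = sum(r[1] for r in rows) / n
    mean_p = sum(r[2] for r in rows) / n
    sxy = sum((r[1] - mean_s) * (r[2] - mean_p) for r in rows)
    sxx = sum((r[1] - mean_s) ** 2 for r in rows)
    return sxy, sxx


def pooled_slope(wells):
    """Least-squares slope of production on spacing, ignoring rock quality."""
    sxy, sxx = centered_sums(wells)
    return sxy / sxx


def stratified_slope(wells, n_buckets=20):
    """Slope from within-bucket variation only, with equal-count rock-quality buckets."""
    ranked = sorted(wells, key=lambda w: w[0])
    size = len(ranked) // n_buckets
    sxy_total = 0.0
    sxx_total = 0.0
    for b in range(n_buckets):
        start = b * size
        end = len(ranked) if b == n_buckets - 1 else start + size
        sxy, sxx = centered_sums(ranked[start:end])
        sxy_total += sxy
        sxx_total += sxx
    return sxy_total / sxx_total


wells = simulate_wells()
naive = pooled_slope(wells)
stratified = stratified_slope(wells)
print("pooled slope:", naive)
print("within-bucket slope:", stratified)
print("true effect:", TRUE_EFFECT)
```

With these invented parameters the pooled trend has the wrong sign (wider spacing appears to hurt), while the within-bucket trend is positive and much closer to the effect built into the simulation. Some bias remains because rock quality still varies inside each bucket.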

Moving from the conceptual example (Figure 5) to the true dataset (Figure 6) presents the challenge at hand. The subpopulation groupings exist in the dataset, but it is extremely difficult and time consuming to isolate the subpopulations and then build individualized models. At Novi, we have experimented with subpopulations defined by spacing, rock quality, and location (county or formation), with varying success. Much like manually selecting type curves, this process is subject to human bias, and the dataset contains too much data across too many dimensions for anyone to partition by hand.

Figure 6. Delaware Basin raw data trends for A) Nearest neighbor and rock quality (represented by GEOSHAP) and B) Nearest neighbor and proppant loading. Contained within these datapoints are the subpopulations needed to derive the correct, physically sensible spacing signals for well forecasting.

Conclusion

Creating unbiased forecasting models in the Delaware Basin is challenging, especially when separating the effects of downspacing from rock quality. Traditional associative models often fail due to data biases. By stratifying datasets into subpopulations based on rock quality and completion strategies, we can develop models that better reflect physical realities. This approach enhances the accuracy of well forecasts and provides more reliable data for strategic decisions.

In the next post, we will describe Double Machine Learning, the process of using a model to select the subpopulations that will debias the model.

Written by Kiran Sathaye

