Unlocking the Subsurface: A Multimodal AI Approach to Geological Data Integration

Emmanouil Amygdalas

Seasoned Energy Professional transitioning to Data Analysis and Data Science | Applying Deep Technical Expertise to Data Insights | IBM Certified, MEng, MSc, PMP®

Published Jun 6, 2025

Introduction:

In the complex world of oil & gas exploration, geothermal energy, and carbon storage, a profound understanding of the subsurface is paramount. Geoscientists rely on a myriad of data sources – from written reports and core images to highly technical well logs – each offering a unique piece of the geological puzzle. However, these diverse data types often reside in silos, presenting a significant challenge for holistic analysis and interpretation.

Imagine being able to seamlessly integrate narrative descriptions, visual evidence, and quantitative measurements to build a more complete, accurate, and automated picture of what lies beneath our feet. This is precisely the challenge our recent project aimed to address.

We embarked on a journey to develop an end-to-end workflow, demonstrating how Artificial Intelligence and Machine Learning can bridge these data divides. Our goal was to integrate seemingly disparate geological information – text, images, and well logs – into a unified framework to enhance subsurface characterization, ultimately leading to more informed decision-making.

This article will walk you through our approach, detailing the steps from initial data acquisition and pre-processing to advanced feature engineering, multimodal integration, and finally, the application of machine learning for automated geological interpretation.

Phase 1: Data Acquisition and Preprocessing

Our journey began by tackling the fundamental challenge of acquiring and preparing various data types. This phase is critical, as the quality and structure of your input data directly affect the reliability of your machine learning models and the insights they produce.

1. Text Data Acquisition: Extracting Insights from Reports

Geological operations generate a wealth of unstructured text in daily reports, technical summaries, and historical documents. For this project, we worked at a conceptual level with "Mini Reports" and "NPD PDF files" (NPD refers to the Norwegian Petroleum Directorate, which provides public wellbore information).

Mini Reports: These short, narrative summaries often contain crucial details about well names, formations encountered, lithologies, fluid types, and presence of faults. We designed a strategy to use Named Entity Recognition (NER) techniques to automatically identify and extract these key geological entities. Imagine sifting through thousands of reports manually – NER automates this tedious process, transforming unstructured text into structured, actionable data points.

Norwegian Petroleum Directorate (NPD) PDF Files: These documents often contain both narrative and tabular information about wellbores. Our approach included methods for both text-based NER and specialized techniques to extract structured data from tables, such as drilling parameters and wellbore status.
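As a rough illustration of the extraction idea, the sketch below uses a small hand-written vocabulary and regular expressions instead of a trained NER model. The vocabulary lists, the well-name and formation patterns, and the sample report sentence are all invented for illustration and are not taken from the project files.

```python
import re

# Hypothetical vocabularies; a real project would rely on a trained NER
# model or a much larger curated gazetteer.
VOCAB = {
    "LITHOLOGY": ["sandstone", "shale", "limestone", "claystone"],
    "FLUID": ["oil", "gas", "water"],
    "STRUCTURE": ["fault"],
}

# Assumed well-name pattern of the form "quadrant/block-number [letter]".
WELL_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}-\d{1,2}(?: [A-Z])?\b")
# Assumed convention: a capitalised name followed by "Formation" or "Fm".
FORMATION_PATTERN = re.compile(r"\b([A-Z][a-z]+) (?:Formation|Fm)\b")


def extract_entities(text):
    """Return a dict mapping entity label -> sorted list of unique matches."""
    found = {label: set() for label in VOCAB}
    found["WELL"] = set(WELL_PATTERN.findall(text))
    found["FORMATION"] = set(FORMATION_PATTERN.findall(text))
    for label, terms in VOCAB.items():
        for term in terms:
            # Match the term as a whole word, allowing a plural "s".
            if re.search(rf"\b{term}s?\b", text, flags=re.IGNORECASE):
                found[label].add(term)
    return {label: sorted(values) for label, values in found.items()}


# Invented example sentence in the style of a mini report.
report = (
    "Well 1/2-3 A penetrated the Alpha Formation, a sandstone "
    "with oil shows. A fault was interpreted near the base."
)
entities = extract_entities(report)
print(entities)
```

Each extracted entity can then be stored as a structured field alongside the well it belongs to.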

2. Image Data Preprocessing: Visual Clues from Core Photos

Core photos provide invaluable visual context, offering a direct glimpse into the rock's fabric and characteristics.

Core Photos: These images are critical for understanding rock facies (e.g., sandstone, carbonate, shale), grain sizes, sorting, and sedimentary structures. We envisioned using Convolutional Neural Networks (CNNs) for tasks like automated facies classification. By training a CNN on a dataset of labeled core images, a model could learn to distinguish between different rock types or even identify specific features, providing quantitative and consistent interpretations that complement traditional manual methods.

3. Well Log Data: The Subsurface Fingerprint

Well logs are continuous measurements recorded downhole, providing a detailed profile of the subsurface. They are the backbone of petrophysical analysis.

Loading and Initial Analysis: We started by loading a synthetic well log dataset (synthetic_well_log_1.csv). Our initial steps involved cleaning column names, inspecting data types, and performing basic statistical analyses (min, max, mean, standard deviation) for key logs like Gamma Ray (GR), Resistivity (RES), and Neutron Porosity (NPOR). This ensures data quality and helps us understand the fundamental range and distribution of our measurements.
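A minimal sketch of this loading step with pandas is shown below. To keep it self-contained, an inline CSV string stands in for synthetic_well_log_1.csv; the values and the untidy header names are invented for illustration.

```python
import io

import pandas as pd

# Stand-in for synthetic_well_log_1.csv (invented values and headers).
csv_text = """ Depth , GR ,RES, NPOR
1000.0,45.2,12.5,0.21
1000.5,98.7,3.1,0.33
1001.0,60.4,8.9,0.25
1001.5,110.3,2.4,0.35
"""

logs = pd.read_csv(io.StringIO(csv_text))

# Clean column names: strip surrounding whitespace and lower-case them.
logs.columns = logs.columns.str.strip().str.lower()

# Inspect the data types of each column.
print(logs.dtypes)

# Basic statistics for the key curves.
summary = logs[["gr", "res", "npor"]].agg(["min", "max", "mean", "std"])
print(summary)
```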

Well Log Visualization: Understanding well logs is often best achieved visually. We generated a comprehensive well log plot, displaying GR, RES, and NPOR curves against depth. This visual representation allows geoscientists to quickly identify zones of interest, pick formation tops, and infer lithology changes.

Figure 1: A typical well log plot showing Gamma Ray (GR), Resistivity (RES), and Neutron Porosity (NPOR) curves against depth, revealing subsurface variations.

This concludes our detailed dive into the initial data acquisition and preprocessing steps. Next, we will explore how we extracted and engineered more meaningful features from these processed data streams.

Phase 2: Feature Extraction and Engineering

With our raw data acquired and preprocessed, the next crucial step was to transform these measurements into more meaningful and interpretable features. Feature engineering is the art of deriving new input features from existing ones so that a machine learning model can learn from them more easily, improving both its performance and the insights it yields.

1. Deriving Petrophysical Properties: Volume of Shale (Vshale)

The Gamma Ray (GR) log is a primary indicator of shale content. By normalizing the GR response, we can calculate the Volume of Shale (Vshale). Shale is generally considered an impediment to hydrocarbon flow, so quantifying its presence is vital for reservoir characterization. Our process involved using a simple linear transformation of the GR log to estimate Vshale, providing a continuous profile of shale content along the wellbore.
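A common form of this linear transformation is the gamma ray index, (GR - GR_clean) / (GR_shale - GR_clean), clipped to the range 0 to 1. The sketch below assumes that form. The article does not state how the clean-sand and shale baselines were chosen, so the fallback to the log minimum and maximum is an assumption, as are the function name and the sample values.

```python
import numpy as np


def vshale_linear(gr, gr_clean=None, gr_shale=None):
    """Linear gamma ray index, clipped to [0, 1].

    gr_clean and gr_shale are the clean-sand and shale baselines. When
    they are not given, the log minimum and maximum are used (an
    assumption made for this sketch).
    """
    gr = np.asarray(gr, dtype=float)
    if gr_clean is None:
        gr_clean = np.nanmin(gr)
    if gr_shale is None:
        gr_shale = np.nanmax(gr)
    if gr_shale <= gr_clean:
        raise ValueError("gr_shale must be greater than gr_clean")
    return np.clip((gr - gr_clean) / (gr_shale - gr_clean), 0.0, 1.0)


# Invented GR readings in API units.
gr = np.array([20.0, 50.0, 80.0, 140.0])
vsh = vshale_linear(gr, gr_clean=20.0, gr_shale=120.0)
print(vsh)
```

Clipping keeps readings above the shale baseline from producing a shale volume greater than 1.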

2. Conceptual Lithology Classification

Building on the Vshale calculation and integrating with Neutron Porosity (NPOR), we developed a conceptual rule-based classification for Lithology. This allowed us to categorize the subsurface into distinct rock types: 'Shale', 'Sandstone/Carbonate', 'Tight Formation', and 'Mixed Lithology'. This provides a simplified, yet geologically relevant, target for our machine learning model.
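The sketch below shows one way such a rule-based classification could be written. The four class names come from the text above, but the cut-off values (0.6 and 0.3 for Vshale, 0.15 for NPOR) and the way the rules are ordered are assumptions for illustration, not the project's actual thresholds.

```python
import numpy as np


def classify_lithology(vshale, npor, vsh_shale=0.6, vsh_clean=0.3, npor_cut=0.15):
    """Assign a conceptual lithology label to each sample.

    The cut-offs are illustrative assumptions:
      vshale >= vsh_shale                      -> Shale
      vshale <  vsh_clean and npor >= npor_cut -> Sandstone/Carbonate
      vshale <  vsh_clean and npor <  npor_cut -> Tight Formation
      anything else                            -> Mixed Lithology
    """
    vshale = np.asarray(vshale, dtype=float)
    npor = np.asarray(npor, dtype=float)
    conditions = [
        vshale >= vsh_shale,
        (vshale < vsh_clean) & (npor >= npor_cut),
        (vshale < vsh_clean) & (npor < npor_cut),
    ]
    labels = ["Shale", "Sandstone/Carbonate", "Tight Formation"]
    return np.select(conditions, labels, default="Mixed Lithology")


# Invented samples, one per class.
vshale = [0.80, 0.10, 0.10, 0.45]
npor = [0.30, 0.22, 0.05, 0.18]
lithology = classify_lithology(vshale, npor)
print(lithology.tolist())
```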

Visualizing Engineered Features: To better understand these derived features and their relationships, we generated several plots.

Figure 2: Plot showcasing engineered features like Vshale and derived Conceptual Lithology alongside raw well logs, highlighting their relationship with depth.

Figure 3: Cross-plot of Neutron Porosity vs. Gamma Ray, illustrating how different lithologies tend to cluster based on their log responses.

Figure 4: Box plot of the Neutron Porosity distribution, providing insights into the overall porosity characteristics of the drilled interval.

These engineered features are essential bridges between raw measurements and the geological interpretations that machine learning models can learn from. The next phase will involve integrating these features with information from other data sources and building our predictive model.

Phase 3: Integration and Machine Learning

This phase represents the culmination of our data preparation efforts, bringing together disparate data sources and applying machine learning to derive predictive insights.

1. Multimodal Data Integration: The Unified Dataset

The true power of this project lies in integrating information from all our processed modalities: text, images, and well logs. We created a unified integrated_features_df where each row represents a depth point in the well, populated with attributes from various sources.

How we integrated: The depth-continuous well log data (including our engineered Vshale and conceptual lithology) served as our foundational structure. We conceptually added columns that would contain information extracted from text reports (e.g., well_name, report_overall_lithology, report_fluid_indication, report_fault_presence). While these were conceptual for our synthetic data, in a real scenario, NER and text extraction would populate these fields. Similarly, information derived from image analysis (e.g., core_facies_type classified from core photos) was conceptually added to align with the corresponding depth intervals.
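The alignment described above can be sketched as follows. The column names well_name, report_fluid_indication, report_fault_presence, and core_facies_type follow the text. The sample values, the interval table layout (top, base), the half-open interval convention, and the "no_core" placeholder are assumptions for illustration.

```python
import pandas as pd

# Depth-continuous well log data (invented values).
logs = pd.DataFrame({
    "depth": [1000.0, 1000.5, 1001.0, 1001.5, 1002.0],
    "gr": [45.0, 98.0, 60.0, 110.0, 52.0],
    "vshale": [0.25, 0.78, 0.40, 0.90, 0.32],
})

# Hypothetical output of core image classification: one facies per interval.
core = pd.DataFrame({
    "top": [1000.0, 1001.0],
    "base": [1001.0, 1002.0],
    "core_facies_type": ["sandstone", "shale"],
})

# Hypothetical well-level attributes extracted from a text report.
report = {
    "well_name": "WELL-A",
    "report_fluid_indication": "oil",
    "report_fault_presence": True,
}

integrated_features_df = logs.copy()

# Broadcast report-level attributes to every depth row.
for column, value in report.items():
    integrated_features_df[column] = value

# Give each depth sample the facies of the interval [top, base) containing it.
integrated_features_df["core_facies_type"] = "no_core"
for row in core.itertuples(index=False):
    depth = integrated_features_df["depth"]
    in_interval = (depth >= row.top) & (depth < row.base)
    integrated_features_df.loc[in_interval, "core_facies_type"] = row.core_facies_type

print(integrated_features_df)
```

Depth samples outside any cored interval keep the placeholder value, which makes gaps in core coverage explicit rather than silently dropping rows.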

This integrated dataset forms a rich, multi-dimensional view of the subsurface, far more comprehensive than any single data source could provide, making it ideal for advanced analytics.

2. Machine Learning Model Development: Predicting Lithology

With our integrated dataset ready, we proceeded to the machine learning phase, aiming to predict lithology—a critical parameter for reservoir quality assessment.

Problem Formulation: We framed this as a supervised classification problem. The goal was to train a model to predict the lithology (our target variable, y) based on the various integrated features (our input features, X).
Feature Selection: Our input features (X) included numerical well log attributes (gr, res, npor, vshale) and the conceptual categorical feature (core_facies_type).

Data Preparation for ML:

One-Hot Encoding: Categorical features like core_facies_type were converted into a numerical format using one-hot encoding, making them suitable for machine learning algorithms.
Train-Test Split: The dataset was then split into training and testing sets (70% for training, 30% for testing). This crucial step ensures that we evaluate the model's performance on unseen data, simulating its behavior in a real-world application.
Feature Scaling: Numerical features were scaled using StandardScaler. Tree-based models such as Random Forest are insensitive to feature scale, so this step is not required here, but it keeps the pipeline ready for scale-sensitive models such as distance-based or gradient-based methods.
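The three preparation steps above can be sketched with pandas and scikit-learn as follows. The synthetic data frame, the placeholder target, and the random_state values are assumptions added to keep the example self-contained and repeatable. Fitting the scaler on the training split only is a design choice that avoids leaking test-set statistics into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Invented stand-in for the integrated dataset.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "gr": rng.uniform(20, 150, n),
    "res": rng.uniform(1, 100, n),
    "npor": rng.uniform(0.02, 0.40, n),
    "core_facies_type": rng.choice(["sandstone", "shale", "carbonate"], n),
})
df["vshale"] = (df["gr"] - df["gr"].min()) / (df["gr"].max() - df["gr"].min())
# Placeholder target, used only so that the split has a y to carry along.
df["lithology"] = np.where(df["vshale"] >= 0.6, "Shale", "Other")

numeric_cols = ["gr", "res", "npor", "vshale"]

# One-hot encode the categorical feature.
X = pd.get_dummies(df[numeric_cols + ["core_facies_type"]], columns=["core_facies_type"])
y = df["lithology"]

# 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale numerical features, fitting on the training split only.
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
```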

Figure 5: Comprehensive well log plot illustrating the raw Gamma Ray, Resistivity, and Neutron Porosity curves alongside the derived Conceptual Lithology, a key output of our feature engineering process.

Model Choice & Training: We selected the Random Forest Classifier for our conceptual demonstration. Random Forest is an ensemble learning method known for its robustness, accuracy, and ability to handle various data types. The model was then trained ("fitted") using our prepared training data.
Conceptual Model Evaluation: After training, the model made predictions on the unseen test set. We evaluated its performance using standard classification metrics, chiefly accuracy.

Figure 6: Visualizing Key Feature Relationships and Distributions.

The model reached an accuracy of 1.00 on the test set. This is expected for synthetic data: the lithology labels were deterministically derived from the input well log features, so the model only had to recover the rules used to generate them. The score therefore shows that the workflow runs end to end and illustrates its potential; it does not indicate how the model would perform on real, noisy well data.
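The effect of rule-derived labels can be reproduced with a small sketch. It generates random Vshale and NPOR values, labels them with assumed cut-off rules, and trains a Random Forest on the same two features. The cut-offs, sample count, and hyperparameters are assumptions for illustration rather than the project's settings, and the exact score will depend on them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented synthetic features.
rng = np.random.default_rng(0)
n = 2000
vshale = rng.uniform(0.0, 1.0, n)
npor = rng.uniform(0.02, 0.40, n)

# Labels derived deterministically from the features (assumed cut-offs).
y = np.select(
    [
        vshale >= 0.6,
        (vshale < 0.3) & (npor >= 0.15),
        (vshale < 0.3) & (npor < 0.15),
    ],
    ["Shale", "Sandstone/Carbonate", "Tight Formation"],
    default="Mixed Lithology",
)
X = np.column_stack([vshale, npor])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Because the labels are a deterministic function of the inputs, any remaining errors should sit close to the rule boundaries. Adding noise to the features or labels is a simple way to see how quickly the score falls toward more realistic values.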

Overall Conclusion and Key Insights

Our journey through this end-to-end workflow culminates in a powerful demonstration of how diverse geological data can be harmonized and leveraged for predictive insights.

Overall Conclusion of the ML Model

The Random Forest Classifier model, trained on our integrated features, achieved a perfect accuracy of 1.00 in predicting lithology. This outcome, while remarkable, is a direct reflection of the conceptual and synthetic nature of our dataset. In this exercise, the lithology labels were deterministically derived from the input well log features (like Vshale and NPOR) using predefined rules. Therefore, the machine learning model successfully "learned" these underlying rules with perfect precision.

Insights from the ML Model:

Workflow Validation: Although the data is synthetic, the perfect score shows that the pipeline runs end to end: the preprocessing, feature engineering, and integration steps produced a dataset suitable for machine learning. The text and image branches were conceptual, so this result does not validate them.
Proof-of-Concept: The exercise is a strong proof-of-concept for the feasibility of applying supervised machine learning to complex geological classification problems when data is appropriately prepared. It highlights that models can learn the intricate relationships between various subsurface attributes.
Pattern Recognition: The model recovered the rule-based patterns in the synthetic data exactly. Real-world relationships are often probabilistic and noisy, so perfect accuracy should not be expected there; a similar approach would instead aim to capture the strongest correlations and deliver useful predictive power.

Broader Insights for Geoscience and AI

This project extends beyond a single model's accuracy. It illuminates several critical insights for the intersection of geoscience and artificial intelligence:

Bridging Data Silos is Key: The ability to integrate text, image, and well log data is paramount. Real-world geological understanding is inherently multimodal, and AI solutions must reflect this complexity. Breaking down data silos unlocks a more holistic view of the subsurface.
Feature Engineering is Transformative: Simply using raw data isn't enough. Deriving meaningful, geologically relevant features (like Vshale or conceptual lithologies) significantly enhances a model's ability to learn and interpret. This step often requires domain expertise.
Automation Potential: Repetitive and labor-intensive interpretation tasks, such as lithology logging or facies identification, can be augmented or even partially automated by AI. This frees up geoscientists to focus on higher-level analytical challenges and strategic decision-making.
Enhanced Consistency: Automated interpretation reduces subjectivity, leading to more consistent and reproducible results across different projects or teams.
Accelerated Workflows: By streamlining data processing and interpretation, AI can drastically cut down the time required for subsurface characterization, accelerating exploration, development, and carbon storage projects.
Unlocking Hidden Value: Integrating diverse data sources can reveal subtle correlations or patterns that might be missed when analyzing data in isolation, potentially leading to new discoveries or optimized resource recovery.

This comprehensive project successfully demonstrated the transformative potential of integrating multimodal geological data with AI and machine learning. By systematically processing and combining information from text reports, core images, and well logs, we created a richer and more robust dataset than any single source could provide.

The conceptual ML model, while achieving perfect accuracy due to the synthetic nature of the data, effectively illustrated the workflow and feasibility of applying advanced analytical techniques to automate and enhance complex geological interpretations, such as lithology prediction. This approach paves the way for more efficient, consistent, and data-driven decision-making in the geoscience domain, ultimately leading to a more profound and accurate understanding of Earth's subsurface.

Beyond this conceptual workflow, machine learning has many practical applications in geoscience, particularly in petroleum exploration and subsurface characterization. Here are some of the key areas where ML modeling is making a significant impact:

Automated Lithology and Facies Prediction: Application: Directly from our project, ML models can predict rock types (lithology) and geological facies from well logs, seismic data, and core images. This automates and standardizes a task traditionally done manually, which can be time-consuming and subjective. Benefit: Faster, more consistent, and objective geological interpretation across numerous wells and large datasets.
Reservoir Property Prediction: Application: Estimating critical reservoir properties like porosity, permeability, water saturation, and net-to-gross from indirect measurements (logs, seismic attributes). Benefit: Improves accuracy of reserve estimates, optimizes well placement, and enhances reservoir simulation inputs.
Seismic Interpretation Enhancement: Application: ML algorithms can detect seismic facies, delineate salt bodies, identify faults and fractures, and predict fluid content from complex seismic datasets. Advanced techniques can even aid in seismic data denoising and inversion. Benefit: Accelerates seismic interpretation workflows, reduces human bias, and uncovers subtle features that might be missed by traditional methods.
Drilling Optimization and Hazard Prediction: Application: Predicting drilling parameters (e.g., Rate of Penetration - ROP), identifying potential drilling hazards like overpressure zones, lost circulation, or wellbore instability before they are encountered. Benefit: Reduces non-productive time (NPT), improves drilling safety, and lowers operational costs.
Production Forecasting and Optimization: Application: Predicting future oil, gas, or water production rates for individual wells or entire fields based on historical production data, completion designs, and reservoir properties. Optimizing injection strategies for enhanced oil recovery (EOR). Benefit: Better production planning, improved resource management, and maximizing hydrocarbon recovery.
Unconventional Resource Characterization: Application: Identifying "sweet spots" in shale plays, predicting geomechanical properties crucial for hydraulic fracturing (e.g., brittleness index), and optimizing frac stage placement. Benefit: Maximizes economic returns from unconventional assets by focusing completion efforts on the most productive rock volumes.
Core-to-Log Integration and Upscaling: Application: Using ML to establish robust relationships between high-resolution core measurements and lower-resolution log data, and then upscaling these properties to field-scale models. Benefit: Improves the accuracy of petrophysical models and ensures consistency across different data scales.
Data Quality Control and Imputation: Application: Identifying anomalous data points in logs or other measurements, and intelligently infilling missing data intervals based on correlations with other reliable data. Benefit: Ensures high-quality input data for all subsequent analyses and models.
Geohazard Assessment: Application: Predicting risks associated with geological hazards such as landslides, subsidence, or fault reactivation based on geological, geophysical, and environmental data. Benefit: Enhances safety for infrastructure development and environmental protection.
Carbon Capture, Utilization, and Storage (CCUS): Application: Characterizing subsurface reservoirs for CO2 storage (capacity, injectivity, containment), monitoring CO2 plume migration, and predicting potential leakage pathways. Benefit: Crucial for the safe and effective implementation of CCUS projects, which are vital for decarbonization efforts.

In essence, ML modeling empowers geoscientists to move beyond traditional empirical methods, enabling them to handle vast, complex datasets, identify non-linear relationships, automate routine tasks, and generate more accurate, consistent, and predictive insights into the Earth's subsurface.

You can find the project files on GitHub:

GitHub - Geological Data Integration