Benchmarking ChatGPT-5 for Threat Intelligence Extraction
By B. Nathan Thomason (UALR CORE Center), in collaboration with Bastazo and Spencer Massengale.
Supported by U.S. Department of Energy award CR-0000031. Published August 26, 2025.
Critical-infrastructure defenders don’t just need more threat intelligence; they need machine-readable intel their systems can act on. Bastazo and the University of Arkansas at Little Rock’s CORE Center have partnered to do exactly that: student analysts, practitioners, and AI work together to transform open-source threat reporting into structured records that operators can use to see context, prioritize risk, and choose effective remediations.
At the heart of this pipeline is an LLM extraction step that maps natural-language snippets to explicit fields, profiling threat actors along SKRAM features (Skills, Knowledge, Resources, Authorities, Motivation). Since launch, OpenAI’s o4-mini has been our most reliable model for this translation.
On August 7, 2025, OpenAI released the ChatGPT-5 family (5, 5-mini, 5-nano). Same-day tests suggested no obvious, runaway winner over o4-mini. To move beyond initial impressions, we ran a focused benchmark on August 11-13, 2025, comparing the 5-series models against o4-mini on extraction accuracy and consistency, latency, and cost per correct field.
This report shares our setup and early takeaways so that teams building threat-intelligence workflows can decide if and where ChatGPT-5 belongs in production.
Early Takeaways
Methodology
To compare OpenAI models on equal footing, we embedded each one directly into the threat-analysis workflow and ran both a high-context source and a low-context source through it. The high-context source was the most recent CISA Cybersecurity Advisory at the time, #StopRansomware: Interlock; we treated it as high-context because it explicitly describes many of the SKRAM attributes used in our threat reports. For a representative low-context article, we used Dark Reading’s “Nimble ‘Gunra’ Ransomware Evolves With Linux Variant,” which offers fewer explicitly stated profiling details, especially around TTPs.
The workflow is shown in Figure 1. We first create a threat-report object from each article URL and then use a web scraper to fetch and normalize the HTML. We send the extracted text to an OpenAI agent along with a structured prompt set that aligns the output to the given schema. We use a distinct prompt for each phase of the threat extraction to break up the complex task; our analysis has found that the LLM performs better on smaller, well-defined tasks than on single-shot feature extraction. Accordingly, separate prompts exist for feature extraction, adversarial technique extraction, mitigation technique extraction, and remediation playbook generation.
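To make the phased design concrete, here is a minimal sketch of that loop. This is our illustration, not the actual Krypteia implementation: the phase names mirror the steps described above, the prompt strings are hypothetical, and the model client is injected as a plain callable so the same loop can drive any model under test (in production this callable would wrap a chat-completion request).

```python
from typing import Callable, Dict

# One prompt per extraction phase, mirroring the workflow above.
# Prompt wording here is illustrative only.
PHASES: Dict[str, str] = {
    "features":    "Extract SKRAM profile fields stated in the article text.",
    "techniques":  "List adversarial techniques supported by the text.",
    "mitigations": "List mitigation techniques supported by the text.",
    "detections":  "List detection opportunities supported by the text.",
}

def run_threat_report(article_text: str,
                      call_model: Callable[[str, str], str]) -> Dict[str, str]:
    """Run each extraction phase as its own smaller, well-defined task.

    call_model(prompt, text) wraps one model request; injecting it keeps
    the loop agnostic to which OpenAI model is being benchmarked.
    """
    report = {}
    for phase, prompt in PHASES.items():
        report[phase] = call_model(prompt, article_text)
    return report
```

Keeping the phases independent also means a single failed step (as we later saw with Mitigations and Detections) can be retried on its own rather than re-running the whole report.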
Figure 1 - Krypteia Threat Report Workflow Overview
In each phase, the model extracts specified fields only when the article states them explicitly or when a clearly supported inference can be made. Otherwise, it should leave the field blank.
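One way to encode that "leave it blank" rule is to make every schema field optional. The dataclass below is a hypothetical shape for the SKRAM feature record, not our production schema; it simply shows how unpopulated fields stay distinguishable from populated ones when scoring a model's output.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class SkramProfile:
    """Illustrative per-article SKRAM record; None means the model
    found no explicit statement or clearly supported inference."""
    skills: Optional[str] = None
    knowledge: Optional[str] = None
    resources: Optional[str] = None
    authorities: Optional[str] = None
    motivation: Optional[str] = None

def filled_fields(profile: SkramProfile) -> int:
    """Count how many fields the model actually populated."""
    return sum(v is not None for v in asdict(profile).values())
```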
The model scoring criteria include the following:
This setup lets us measure not just what each model can extract, but how quickly, how reliably, and how often it needs a second pass.
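Those three axes can be rolled up per model as in the sketch below. This is an assumed shape for the scoring rollup, not our exact code; the field names (`latency_s`, `correct_fields`, and so on) are hypothetical.

```python
from statistics import mean

def summarize(runs):
    """Aggregate per-run results for one model.

    Each run is a dict with keys 'latency_s', 'correct_fields',
    'total_fields', 'cost_usd', and 'retries'.
    """
    total_correct = sum(r["correct_fields"] for r in runs)
    return {
        # What the model extracted correctly, over all gradable fields.
        "accuracy": total_correct / sum(r["total_fields"] for r in runs),
        # How quickly it returned, on average.
        "mean_latency_s": mean(r["latency_s"] for r in runs),
        # What each correct field cost.
        "cost_per_correct_field":
            sum(r["cost_usd"] for r in runs) / max(total_correct, 1),
        # How often it needed a second pass.
        "retry_rate": sum(r["retries"] > 0 for r in runs) / len(runs),
    }
```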
Results
Figure 2 - Timing Graph of Model Test Results
Overall, o4-mini outperformed each of the ChatGPT-5 models in processing time, with ChatGPT-5 and ChatGPT-5-nano failing to return some Mitigations and Detections responses entirely. Even in low-context scenarios, o4-mini consistently completed its analysis within a few minutes, whereas the ChatGPT-5 models consistently exhibited latencies of up to 30 minutes. These results suggest that o4-mini not only responds faster but is also more reliable under both high- and low-context scenarios.
Figure 3 - Accuracy Graph of Model Test Results
In terms of accuracy, the picture was more nuanced than the timing results would suggest. Although o4-mini maintained the most consistent accuracy scores across both high- and low-context scenarios, ChatGPT-5 did outperform it in the Techniques step under high-context conditions. This advantage would have been more promising had the model not failed to return responses for the Mitigations and Detections steps, despite repeated attempts. ChatGPT-5-mini demonstrated strong overall consistency in its accuracy but was overshadowed by o4-mini. ChatGPT-5-nano, like ChatGPT-5, also failed to provide Mitigations and Detections responses during the high-context scenario. These results suggest that smaller models such as o4-mini and ChatGPT-5-mini are more reliable overall, but that the larger ChatGPT-5 models can analyze to a higher degree of success when provided with sufficient context.
It’s worth noting that accuracy is not the same as usefulness for technique extraction. Low-context outputs often appear more accurate because the model lists only a few generic techniques (for example, mapping ransomware to T1486: Data Encrypted for Impact). That may be correct but offers little operational value. Higher-context outputs provide richer, actionable detail for operations, even if their technique labels score lower on strict accuracy.
Discussion
The results of these tests indicate that, for the sample reports in this limited timeframe, the ChatGPT-5 models underperformed relative to o4-mini. On average, all ChatGPT-5 models required significantly longer processing time than o4-mini to return successful responses, and in several cases they were unable to provide responses for the Mitigations and Detections steps at all. The ChatGPT-5 models did return notably more contextually detailed rationale justifying their feature responses, but this did not translate to higher feature accuracy, which limits their practical benefit in threat-report creation, where actionable feature responses outweigh narrative depth.
However, ChatGPT-5 performance variability may have been influenced by external factors affecting the OpenAI API during the evaluation period, such as fluctuations in availability or demand. During multiple tests of the ChatGPT-5 models, various threat-report steps experienced unexplained timeouts or substantial latency spikes, often in 10-15 minute intervals scattered unpredictably across the report-creation process. These irregularities were not observed with o4-mini, and they coincided with the immediate post-launch period for ChatGPT-5, when heightened user activity could have strained OpenAI’s API infrastructure.
Conclusion
Our initial evaluation suggests that, despite architectural advances, the ChatGPT-5 models may not currently offer an advantage over o4-mini for time-sensitive, accuracy-critical workloads in which the model must either analyze large amounts of contextual material or make many inferences. The greater depth and clarity of the ChatGPT-5 models’ rationale output is outweighed by longer completion times, higher failure rates in later threat-report steps, and only marginal accuracy improvements in specific contexts. It is plausible that these outcomes were caused by post-launch availability degradation of the OpenAI API, so future work should re-run these tests during a known period of stable model availability to determine whether the observed timing and reliability issues persist. Further investigation into prompt optimization for the ChatGPT-5 models could help determine whether their improved reasoning capabilities can be leveraged without compromising throughput or threat-report step-completion rates.
Works Cited
Cybersecurity & Infrastructure Security Agency. “#StopRansomware: Interlock.” CISA, 2025. https://www.cisa.gov/news-events/cybersecurity-advisories/aa25-203a. Accessed 14 Aug. 2025.
Montalbano, Elizabeth. “Nimble 'Gunra' Ransomware Evolves With Linux Variant.” Dark Reading, 2025. https://www.darkreading.com/threat-intelligence/nimble-gunra-ransomware-Linux-variant. Accessed 14 Aug. 2025.
See the full article with appendix on the Bastazo website.