Fine-tuning LLMs for better FX market sentiment analysis

New SNB Working Paper: Daniele Ballinari and Jessica Maly fine-tune large language models (LLMs) in order to measure sentiment in the FX market. Their findings indicate that LLMs outperform existing methods for FX market sentiment analysis.

Abstract: We enhance sentiment analysis in the foreign exchange (FX) market by fine-tuning large language models (LLMs) to better understand and interpret the complex language specific to FX markets. We build on existing methods by using state-of-the-art open source LLMs, fine-tuning them with labelled FX news articles and then comparing their performance against traditional approaches and alternative models. Furthermore, we tested these fine-tuned LLMs by creating investment strategies based on the sentiment they detect in FX analysis articles with the goal of demonstrating how well these strategies perform in real-world trading scenarios. Our findings indicate that the fine-tuned LLMs outperform the existing methods in terms of both the classification accuracy and trading performance, highlighting their potential for improving FX market sentiment analysis and investment decision-making.

https://lnkd.in/dKD8UXB5

Javier Venegas Contreras

Factoring Operations Analyst | International Finance | GFRI | Data Analytics | World Economic Forum

4d

Daniele Ballinari, Jessica Maly, Schweizerische Nationalbank: This innovative model, which analyses currency market sentiment by applying LLMs to news text, has some aspects that could be improved:

1) Language bias: it ignores news in other languages (German, French, Japanese, Spanish, etc.) that also influences the currency market, especially in pairs not dominated by the dollar. It also ignores institutional research (sell-side, central banks, technical reports), which uses a different linguistic register. Applied to a German news item about the ECB, or to a post on X about the yen, it would probably not capture the sentiment correctly.

2) The alpha is fragile and style-dependent: on DailyFX the LLM performs well, while on FXStreet VADER wins. If the advantage evaporates when the source changes, that points to fitting a writing style rather than to economic understanding. With a Sharpe ratio of ~0.4 and differences that are diluted after multiple corrections, it is not investable without risk/cost overlays.

3) What the model captures is not "predictive sentiment" but analyst consensus, which is often already priced in. Periods with positive results coincide with trending environments (currency momentum). In reversal/shock regimes, the model degrades.
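One way to probe the source dependence raised in point 2 is a leave-one-source-out split: train on every source except one, then evaluate on the held-out source. The sketch below only shows the splitting step. The record layout (dictionaries with a `source` field) and the function name are assumptions for illustration, not the paper's data format or method.

```python
def leave_one_source_out(records, source_key="source"):
    """Yield (held_out_source, train, test) splits, one per distinct source.

    `records` is assumed to be a list of dicts, each carrying the name of
    the outlet it came from under `source_key` (hypothetical layout).
    """
    sources = sorted({r[source_key] for r in records})
    for held_out in sources:
        # Train on everything that did NOT come from the held-out outlet.
        train = [r for r in records if r[source_key] != held_out]
        # Test only on the held-out outlet to expose domain shift.
        test = [r for r in records if r[source_key] == held_out]
        yield held_out, train, test


# Toy records with made-up text, purely to show the shape of the output.
articles = [
    {"source": "DailyFX", "text": "example a", "label": 1},
    {"source": "DailyFX", "text": "example b", "label": -1},
    {"source": "FXStreet", "text": "example c", "label": 0},
]
for name, train, test in leave_one_source_out(articles):
    print(name, len(train), len(test))
```

If accuracy on the held-out source falls well below in-source accuracy, that would support the reading that the model has adapted to a house style.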

Javier Venegas Contreras

Factoring Operations Analyst | International Finance | GFRI | Data Analytics | World Economic Forum

4d

Real-life case study: the "decoupling" of the CHF in 2015. The Swiss National Bank abandoned the EUR/CHF floor in January 2015. It was an abrupt shock with gaps, huge spreads and a break in liquidity. What would happen to a strategy based on the paper's sentiment signal:
• Before the event: articles are likely to be neutral or to carry a narrative of stability; the model tends towards "unchanged" (where it performs worse) or gives erratic signals. The rule of "retaining the previous day's signal" leaves you exposed to the event gap. [Operational risk]
• During/immediately after: textual sentiment shifts to "CHF appreciation", but you would be late; actual execution faces extreme spreads and slippage, eroding any edge.
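The exposure created by carrying the previous day's signal forward can be made concrete with a small sketch. The function below forward-fills a daily signal on days without a new article, and optionally goes flat once the signal is older than `max_age` days. The staleness cap, the function name and the input encoding (`None` for "no article") are assumptions added for illustration; they are not described in the paper.

```python
def carry_forward(signals, max_age=None):
    """Turn a sparse daily signal into a daily position.

    signals: list of -1, 0, +1, or None when no article was published.
    max_age: if set, a signal carried for more than `max_age` days is
             dropped and the position goes flat (hypothetical safeguard).
    """
    positions = []
    last, age = 0, 0  # start flat until the first signal arrives
    for s in signals:
        if s is not None:
            last, age = s, 0
        else:
            age += 1
            if max_age is not None and age > max_age:
                last = 0  # stale signal: stop holding the old view
        positions.append(last)
    return positions


# Without a cap, the old view is held through every quiet day.
print(carry_forward([1, None, None, -1, None]))      # [1, 1, 1, -1, -1]
# With a one-day cap, the position is closed once the signal goes stale.
print(carry_forward([1, None, None, None], max_age=1))  # [1, 1, 0, 0]
```

A cap of this kind limits how long an outdated view is held, but it does not protect against a gap that arrives the day after a fresh signal.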

Javier Venegas Contreras

Factoring Operations Analyst | International Finance | GFRI | Data Analytics | World Economic Forum

4d

Improvements to consider:
• Cost-conscious backtests with volatility- and session-dependent spreads/rolls; report the net Sharpe ratio.
• Holdout by source/style/language (train without a source, test on that source) to measure robustness to domain shift.
• Significance testing with a block bootstrap, plus regime tests (high/low vol; trend/range).
• An ordinal or pair-by-pair model, and reuse of the "unchanged" class as a "no trade" signal.
• Risk gating: vol-targeting, event blackouts, turnover limits.
• Ablation of the label scheme (alternative windows, thresholds other than 30%) to rule out artefacts from the target definition.
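The first and third suggestions above can be sketched together: compute strategy returns net of a trading cost, then resample them in blocks to see how uncertain the net Sharpe ratio is. This is a minimal illustration under simplifying assumptions: a flat cost per unit of position change (rather than volatility- or session-dependent spreads), a circular block bootstrap, 252 periods per year, and made-up numbers in the usage example. All function and parameter names are hypothetical.

```python
import math
import random
import statistics


def net_returns(signal, asset_ret, cost_per_turn):
    """Per-period strategy return minus a flat cost on position changes."""
    out, prev = [], 0.0  # assume the strategy starts flat
    for pos, ret in zip(signal, asset_ret):
        out.append(pos * ret - cost_per_turn * abs(pos - prev))
        prev = pos
    return out


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio with a zero risk-free rate (assumption)."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")
    return math.sqrt(periods_per_year) * statistics.fmean(returns) / sd


def block_bootstrap_sharpe(returns, block_len=10, n_boot=1000, seed=0):
    """Circular block bootstrap: sorted list of resampled Sharpe ratios."""
    rng = random.Random(seed)
    n = len(returns)
    draws = []
    for _ in range(n_boot):
        sample = []
        while len(sample) < n:
            start = rng.randrange(n)
            # Copy a block of consecutive returns, wrapping at the end.
            sample.extend(returns[(start + k) % n] for k in range(block_len))
        draws.append(sharpe(sample[:n]))
    return sorted(draws)


# Made-up inputs, only to show the call pattern.
rng = random.Random(1)
asset = [rng.gauss(0.0, 0.006) for _ in range(500)]
signal = [rng.choice([-1, 0, 1]) for _ in range(500)]
net = net_returns(signal, asset, cost_per_turn=0.0001)
draws = block_bootstrap_sharpe(net, block_len=10, n_boot=1000)
lower, upper = draws[25], draws[974]  # rough 95% interval
print(round(sharpe(net), 2), round(lower, 2), round(upper, 2))
```

Resampling whole blocks rather than single days keeps short-run autocorrelation in the returns, which matters for signals that are carried across days. If the interval comfortably includes zero, the reported edge is hard to distinguish from noise.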
