Optimizing enterprise-grade distributed systems in fintech
Enterprise-grade applications are large, mission-critical software systems designed for use in business or government environments. These systems are characterized by high scalability, high security, extensive and complex integrations, regulatory compliance, built-in analytics, and comprehensive support and maintenance. What also sets them apart is that our technology choices are not always approved by the enterprise. In one of our client's cases, Cassandra was not an approved technology and we had to refine our architecture.
Growth in volumes over time has a significant impact on how such systems scale. Despite following common best practices, the system starts to choke in production and needs deep analysis to get to the root of the problem. We categorize the pain areas under the following major heads: 1) Database 2) Network Latency 3) Application Latency 4) Infrastructure 5) Profiling (analyzing heap dumps, thread dumps) 6) Observability.
A lot of the considerations here will be similar to those for any distributed system at scale, but through this blog we aim to highlight how these challenges get amplified when working within the confines of enterprise financial institutions and banks.
1. Database
Enterprise applications usually consist of multiple databases including both SQL (traditional RDBMS) and NoSQL supporting structured and unstructured data. Tuning can depend on the flavor chosen but there are some common aspects that apply.
The cascading effect of growing data and volume in databases usually leads to a deterioration of throughput in all layers leading up to the APIs. As a recommendation, the database should be the first debugging point when trying to answer “Why is my system responding slowly?”
Another aspect of many enterprise systems is that a large database is often shared across multiple applications and vendors, which makes it even more difficult to isolate problematic queries.
The Automatic Workload Repository (AWR) is a tool in Oracle databases that can help identify performance issues by capturing detailed database activity data. Some common focus areas for debugging database-related performance issues are given below. These are by no means exhaustive, but in our experience many database issues can be debugged by application developers, so we highlight them here:
● Review SQL queries: SQL queries ordered by execution time and CPU time should be examined for performance optimization; the “SQL Statistics” section of the AWR report is very helpful here. As an application developer, always get the execution plan for newly introduced queries from the production database beforehand (a JDBC sketch for pulling the plan follows this list).
● Full table scans: Growing volume in an enterprise can easily render query plans ineffective over time. Review the query execution plan; if a full table scan is being performed on one or more large tables, apply indexes to limit the search time for relevant data. There are scenarios where a full table scan shows up despite proper indexing; we ran into this problem with one of the leading national banks, and this link provides some additional considerations.
● enq: TM - contention: Waits on this event typically indicate unindexed foreign key constraints (a simplified dictionary query to spot them is sketched after this list).
● Blocking and inactive session counts: These stats can be gathered periodically from the database (see the sampling sketch after this list).
● Data archival and purging: Financial enterprise data (especially banking data) is kept for an average of seven years, or longer in some jurisdictions. This includes bank statements, accounting records, deposit slips, taxes, payroll, purchase orders, and employee expense reports. With a growing number of records it becomes essential to keep tabs on database performance, as query response times degrade on large datasets. Partition data by day/month/year and keep moving it out of the transactional database to an archival store. Consider this early in database design and decide on strategies such as soft versus hard deletes and the retention period after which old data can be purged or moved to archival storage.
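To make the execution-plan check concrete, here is a minimal JDBC sketch that asks Oracle for the plan of a candidate query and prints it. The connection URL, credentials, and the sample table/columns are placeholders, and the Oracle JDBC driver is assumed to be on the classpath; run this against the production database (or a copy with production statistics) so the plan reflects real data volumes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; use your environment's URL and credentials.
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             Statement stmt = conn.createStatement()) {

            // Ask the optimizer for the plan of a newly introduced query (hypothetical table/columns).
            stmt.execute("EXPLAIN PLAN FOR "
                    + "SELECT * FROM transactions WHERE account_id = 12345 AND txn_date > SYSDATE - 30");

            // Print the formatted plan; look for TABLE ACCESS FULL on large tables.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```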
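For the enq: TM - contention item, a quick check for foreign keys with no supporting index can also be run from the application side. This is a simplified sketch that only handles single-column foreign keys and assumes the standard USER_CONSTRAINTS, USER_CONS_COLUMNS, and USER_IND_COLUMNS dictionary views; multi-column keys need a more careful comparison.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UnindexedForeignKeys {
    // Simplified dictionary query: single-column foreign keys whose column is not
    // the leading column of any index.
    private static final String UNINDEXED_FK_SQL =
            "SELECT c.table_name, cc.column_name, c.constraint_name "
          + "FROM user_constraints c "
          + "JOIN user_cons_columns cc ON cc.constraint_name = c.constraint_name "
          + "WHERE c.constraint_type = 'R' "
          + "AND NOT EXISTS ("
          + "  SELECT 1 FROM user_ind_columns ic "
          + "  WHERE ic.table_name = cc.table_name "
          + "  AND ic.column_name = cc.column_name "
          + "  AND ic.column_position = 1)";

    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(UNINDEXED_FK_SQL)) {
            while (rs.next()) {
                System.out.printf("Unindexed FK: %s.%s (%s)%n",
                        rs.getString(1), rs.getString(2), rs.getString(3));
            }
        }
    }
}
```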
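The blocking and inactive session counts can be sampled with a small periodic job, as sketched below. This assumes SELECT access to the V$SESSION view and uses placeholder connection details; in real use the numbers would be pushed to your monitoring system rather than printed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionStatsProbe {
    // Counts blocked and inactive user sessions from V$SESSION.
    private static final String SESSION_STATS_SQL =
            "SELECT COUNT(CASE WHEN blocking_session IS NOT NULL THEN 1 END) AS blocked_sessions, "
          + "       COUNT(CASE WHEN status = 'INACTIVE' THEN 1 END) AS inactive_sessions "
          + "FROM v$session WHERE type = 'USER'";

    public static void main(String[] args) {
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"; // placeholder
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Sample once a minute; forward the counts to your monitoring tool in real use.
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(url, "monitor_user", "monitor_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(SESSION_STATS_SQL)) {
                if (rs.next()) {
                    System.out.printf("blocked=%d inactive=%d%n",
                            rs.getInt("blocked_sessions"), rs.getInt("inactive_sessions"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```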
2. Network Latency
Network latency is another important factor that can contribute to delays in data transfer. Enterprise-grade financial systems may communicate with a large number of internal and external APIs; typical examples are NSDL (National Securities Depository Limited), PAN verification over the internet, and a core banking system (CBS) hosted on premise. The challenge is that the front-end application cannot move to the next page until it receives a valid response, as most of these calls tend to be synchronous.
An enterprise platform we had deployed on one bank's premises was integrated with 50+ external APIs. One of the issues we identified was that external API traffic was being routed via the DR (Disaster Recovery) region. Due to the additional network hops, sudden spikes in latency were observed for these APIs during the peak load window. Since enterprise systems normally share network components (e.g. load balancers, network bandwidth) and data centers, it is a good idea to ensure all the required services are deployed in the same region to avoid network latency. While this may seem like a natural requirement for any system, the point here is to not make assumptions about the network routes and deployments of dependent services in an enterprise setup. It is best to test all assumptions and make the required adjustments.
There are several other optimization techniques to overcome these hurdles, including caching and minimizing data size (including compression), but monitoring the network still adds immense value and should be given priority. Synthetic monitoring checks such as network bandwidth saturation and packet drop will help in anticipating and analyzing problems before they hit users. The objective here is not to recommend any particular tool, but New Relic, Datadog, or Dynatrace can do the job. Even command-line Linux tools like traceroute, MTR (delay, drops, and connectivity), and iperf (speed, performance) can give quick and meaningful insights.
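As a complement to the network-level tools, a lightweight synthetic check can also live inside the platform itself. The sketch below (the endpoint URLs and the 500 ms threshold are assumptions) times a simple request to each dependency and flags anything above the threshold; the same numbers can be exported to whichever monitoring tool the enterprise has approved.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class LatencyProbe {
    public static void main(String[] args) {
        // Hypothetical health endpoints of dependent services.
        List<String> endpoints = List.of(
                "https://cbs.internal.example/health",
                "https://pan-verify.example/health");

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        for (String endpoint : endpoints) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            long start = System.nanoTime();
            try {
                // Discard the body; we only care about status and round-trip time.
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                long millis = (System.nanoTime() - start) / 1_000_000;
                String flag = millis > 500 ? "  <-- above threshold" : "";
                System.out.printf("%s -> %d in %d ms%s%n",
                        endpoint, response.statusCode(), millis, flag);
            } catch (Exception e) {
                System.out.printf("%s -> FAILED (%s)%n", endpoint, e.getMessage());
            }
        }
    }
}
```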
Similarly, keep close track of call amplification and avoid overly frequent API calls or excessive database round trips.
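To make the caching and call-amplification point concrete, here is a minimal sketch of a TTL cache placed in front of an external lookup. The `fetchFromRemote` call, the PAN example, and the 5-minute TTL are assumptions for illustration; in production you would typically use the caching library already approved in the enterprise (Caffeine, Ehcache, etc.) rather than rolling your own.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Requires Java 16+ (records). Caches loader results for a fixed TTL so repeated
// lookups within the window do not hit the external API again.
public class TtlCache<K, V> {
    private record Entry<T>(T value, Instant expiresAt) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Function<K, V> loader; // e.g. a call to an external API

    public TtlCache(Duration ttl, Function<K, V> loader) {
        this.ttl = ttl;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> entry = cache.get(key);
        if (entry != null && Instant.now().isBefore(entry.expiresAt())) {
            return entry.value(); // served from cache, no external call
        }
        V value = loader.apply(key); // only hit the remote API on a miss or expiry
        cache.put(key, new Entry<>(value, Instant.now().plus(ttl)));
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical external lookup wrapped in a 5-minute cache.
        TtlCache<String, String> panStatusCache =
                new TtlCache<>(Duration.ofMinutes(5), pan -> fetchFromRemote(pan));
        System.out.println(panStatusCache.get("ABCDE1234F")); // remote call
        System.out.println(panStatusCache.get("ABCDE1234F")); // served from cache
    }

    private static String fetchFromRemote(String pan) {
        return "VALID"; // placeholder for the real external API call
    }
}
```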