Optimizing enterprise-grade distributed systems in fintech
Enterprise-grade applications are large, mission-critical software systems designed for use in business or government environments. These systems are characterized by high scalability, high security, extensive and complex integrations, regulatory compliance, built-in analytics, and comprehensive support and maintenance. What also sets them apart is that our technology choices are not always approved by the enterprise. In one of our client's cases, Cassandra was not an approved technology and we had to refine our architecture.
Growth in volumes over time has a significant impact on how such systems scale. Despite following common best practices, the system starts to choke in production and needs deep analysis to get to the root of the problem. We categorize the pain areas under the following major heads: 1) Database 2) Network Latency 3) Application Latency 4) Infrastructure 5) Profiling (analyzing heap dumps, thread dumps) 6) Observability.
A lot of the considerations here will be similar to those for any distributed system at scale, but through this blog we aim to highlight how these challenges get amplified when working within the confines of enterprise financial institutions and banks.
1. Database
Enterprise applications usually consist of multiple databases including both SQL (traditional RDBMS) and NoSQL supporting structured and unstructured data. Tuning can depend on the flavor chosen but there are some common aspects that apply.
The cascading effect of growing data and volume in databases usually leads to a deterioration of throughput in all layers leading up to the APIs. As a recommendation, the database should be the first debugging point when trying to answer “Why is my system responding slowly?”
Another aspect of many enterprise systems is that a large database is often shared across multiple applications and vendors, which makes it even more difficult to isolate problematic queries.
The Automatic Workload Repository (AWR) is a tool in Oracle databases that can help identify performance issues by capturing detailed database activity data. Some common focus areas for debugging database-related performance issues are given below. These are by no means exhaustive, but in our experience many database issues can be debugged by application developers, so we highlight them here:
● Review SQL queries: SQL queries ordered by execution time and CPU time should be examined for performance optimization; the “SQL Statistics” section of the AWR report is very helpful here. As an application developer, always get the execution plan for newly introduced queries from the production database beforehand (a JDBC sketch for pulling the plan follows this list).
● Full table scans: Growing volume in an enterprise can easily render query plans ineffective over time. Review the query execution plan; if a full table scan is being performed on one or more large tables, apply indexes to limit the search time for relevant data. There are scenarios where a full table scan shows up despite proper indexing; we ran into this problem with one of the leading national banks, and this link provides some additional considerations.
● enq: TM - contention: Waits on this event typically indicate unindexed foreign key constraints (a simplified dictionary query to spot them is sketched after this list).
● Blocking and inactive session counts: These stats can be gathered periodically from the database (see the sampling sketch after this list).
● Data archival and purging: Financial enterprise data (especially banking data) is kept for an average of seven years, or longer in some jurisdictions. This includes bank statements, accounting records, deposit slips, taxes, payroll, purchase orders, and employee expense reports. With a growing number of records it becomes essential to keep tabs on database performance, as query response times degrade on large datasets. Partition data by day/month/year and keep moving it out of the transactional database to an archival store. Consider this early in database design and decide on strategies such as soft versus hard deletes and the retention period after which old data can be purged or moved to archival storage.
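To make the execution-plan check concrete, here is a minimal JDBC sketch that asks Oracle for the plan of a candidate query and prints it. The connection URL, credentials, and the sample table/columns are placeholders, and the Oracle JDBC driver is assumed to be on the classpath; run this against the production database (or a copy with production statistics) so the plan reflects real data volumes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; use your environment's URL and credentials.
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             Statement stmt = conn.createStatement()) {

            // Ask the optimizer for the plan of a newly introduced query (hypothetical table/columns).
            stmt.execute("EXPLAIN PLAN FOR "
                    + "SELECT * FROM transactions WHERE account_id = 12345 AND txn_date > SYSDATE - 30");

            // Print the formatted plan; look for TABLE ACCESS FULL on large tables.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```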
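For the enq: TM - contention item, a quick check for foreign keys with no supporting index can also be run from the application side. This is a simplified sketch that only handles single-column foreign keys and assumes the standard USER_CONSTRAINTS, USER_CONS_COLUMNS, and USER_IND_COLUMNS dictionary views; multi-column keys need a more careful comparison.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UnindexedForeignKeys {
    // Simplified dictionary query: single-column foreign keys whose column is not
    // the leading column of any index.
    private static final String UNINDEXED_FK_SQL =
            "SELECT c.table_name, cc.column_name, c.constraint_name "
          + "FROM user_constraints c "
          + "JOIN user_cons_columns cc ON cc.constraint_name = c.constraint_name "
          + "WHERE c.constraint_type = 'R' "
          + "AND NOT EXISTS ("
          + "  SELECT 1 FROM user_ind_columns ic "
          + "  WHERE ic.table_name = cc.table_name "
          + "  AND ic.column_name = cc.column_name "
          + "  AND ic.column_position = 1)";

    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(UNINDEXED_FK_SQL)) {
            while (rs.next()) {
                System.out.printf("Unindexed FK: %s.%s (%s)%n",
                        rs.getString(1), rs.getString(2), rs.getString(3));
            }
        }
    }
}
```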
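The blocking and inactive session counts can be sampled with a small periodic job, as sketched below. This assumes SELECT access to the V$SESSION view and uses placeholder connection details; in real use the numbers would be pushed to your monitoring system rather than printed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionStatsProbe {
    // Counts blocked and inactive user sessions from V$SESSION.
    private static final String SESSION_STATS_SQL =
            "SELECT COUNT(CASE WHEN blocking_session IS NOT NULL THEN 1 END) AS blocked_sessions, "
          + "       COUNT(CASE WHEN status = 'INACTIVE' THEN 1 END) AS inactive_sessions "
          + "FROM v$session WHERE type = 'USER'";

    public static void main(String[] args) {
        String url = "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1"; // placeholder
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Sample once a minute; forward the counts to your monitoring tool in real use.
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection(url, "monitor_user", "monitor_password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(SESSION_STATS_SQL)) {
                if (rs.next()) {
                    System.out.printf("blocked=%d inactive=%d%n",
                            rs.getInt("blocked_sessions"), rs.getInt("inactive_sessions"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```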
2. Network Latency
Network latency is another important factor that can contribute to delays in data transfer. Enterprise-grade financial systems may communicate with a large number of internal and external APIs; typical examples are NSDL (National Securities Depository Limited), PAN verification over the internet, and a core banking system (CBS) hosted on premise. The challenge is that the front-end application cannot move to the next page until it receives a valid response, as most of these calls tend to be synchronous.
An enterprise platform we had deployed on one bank's premises was integrated with 50+ external APIs. One of the issues we identified was that external API traffic was being routed via the DR (Disaster Recovery) region. Due to the additional network hops, sudden spikes in latency were observed for these APIs during the peak load window. Since enterprise systems normally share network components (e.g. load balancers, network bandwidth) and data centers, it is a good idea to ensure all the required services are deployed in the same region to avoid network latency. While this may seem like a natural requirement for any system, the point here is to not make assumptions about the network routes and deployments of dependent services in an enterprise setup. It is best to test all assumptions and make the required adjustments.
There are several other optimization techniques to overcome these hurdles, including caching and minimizing data size (including compression), but monitoring the network still adds immense value and should be given priority. Synthetic monitoring checks such as network bandwidth saturation and packet drop will help in anticipating and analyzing problems before they hit users. The objective here is not to recommend any particular tool, but New Relic, Datadog, or Dynatrace can do the job. Even command-line Linux tools like traceroute, MTR (delay, drops, and connectivity), and iperf (speed, performance) can give quick and meaningful insights.
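As a complement to the network-level tools, a lightweight synthetic check can also live inside the platform itself. The sketch below (the endpoint URLs and the 500 ms threshold are assumptions) times a simple request to each dependency and flags anything above the threshold; the same numbers can be exported to whichever monitoring tool the enterprise has approved.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class LatencyProbe {
    public static void main(String[] args) {
        // Hypothetical health endpoints of dependent services.
        List<String> endpoints = List.of(
                "https://cbs.internal.example/health",
                "https://pan-verify.example/health");

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        for (String endpoint : endpoints) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                    .timeout(Duration.ofSeconds(5))
                    .GET()
                    .build();
            long start = System.nanoTime();
            try {
                // Discard the body; we only care about status and round-trip time.
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                long millis = (System.nanoTime() - start) / 1_000_000;
                String flag = millis > 500 ? "  <-- above threshold" : "";
                System.out.printf("%s -> %d in %d ms%s%n",
                        endpoint, response.statusCode(), millis, flag);
            } catch (Exception e) {
                System.out.printf("%s -> FAILED (%s)%n", endpoint, e.getMessage());
            }
        }
    }
}
```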
Similarly, keep close track of call amplification and avoid overly frequent API calls or excessive database round trips.
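To make the caching and call-amplification point concrete, here is a minimal sketch of a TTL cache placed in front of an external lookup. The `fetchFromRemote` call, the PAN example, and the 5-minute TTL are assumptions for illustration; in production you would typically use the caching library already approved in the enterprise (Caffeine, Ehcache, etc.) rather than rolling your own.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Requires Java 16+ (records). Caches loader results for a fixed TTL so repeated
// lookups within the window do not hit the external API again.
public class TtlCache<K, V> {
    private record Entry<T>(T value, Instant expiresAt) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Function<K, V> loader; // e.g. a call to an external API

    public TtlCache(Duration ttl, Function<K, V> loader) {
        this.ttl = ttl;
        this.loader = loader;
    }

    public V get(K key) {
        Entry<V> entry = cache.get(key);
        if (entry != null && Instant.now().isBefore(entry.expiresAt())) {
            return entry.value(); // served from cache, no external call
        }
        V value = loader.apply(key); // only hit the remote API on a miss or expiry
        cache.put(key, new Entry<>(value, Instant.now().plus(ttl)));
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical external lookup wrapped in a 5-minute cache.
        TtlCache<String, String> panStatusCache =
                new TtlCache<>(Duration.ofMinutes(5), pan -> fetchFromRemote(pan));
        System.out.println(panStatusCache.get("ABCDE1234F")); // remote call
        System.out.println(panStatusCache.get("ABCDE1234F")); // served from cache
    }

    private static String fetchFromRemote(String pan) {
        return "VALID"; // placeholder for the real external API call
    }
}
```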