Optimizing enterprise-grade distributed systems in fintech
Enterprise-grade applications are large, mission-critical software systems designed for use in business or government environments. These systems are characterised by high scalability, strong security, extensive and complex integrations, regulatory compliance, built-in analytics, and comprehensive support and maintenance. What also sets them apart is that our technology choices are not always approved by the enterprise. In one of our client’s cases, Cassandra was not an approved technology and we had to refine our architecture.
Growth in volumes over time has a significant impact on how such systems scale. Despite following common best practices, the system starts to choke in production and needs deep analysis to get to the root of the problem. We will categorize the pain areas under the following major heads: 1) Database 2) Network latency 3) Application latency 4) Infrastructure 5) Profiling (analyzing heap dumps, thread dumps) 6) Observability
Many of these considerations apply to any distributed system at scale, but through this blog we aim to highlight how the challenges get amplified when working within the enterprise confines of financial institutions and banks.
1. Database
Enterprise applications usually consist of multiple databases including both SQL (traditional RDBMS) and NoSQL supporting structured and unstructured data. Tuning can depend on the flavor chosen but there are some common aspects that apply.
The cascading effect of growing data and volume in databases usually leads to deteriorating throughput in all layers leading up to the APIs. As a recommendation, the database should be the first debugging point when trying to answer “Why is my system responding slowly?”
Another aspect of many enterprise systems is that a large database is usually shared across multiple applications and vendors, which makes it even more difficult to isolate problematic queries.
The Automatic Workload Repository (AWR) is a tool in Oracle databases that can help identify performance issues by capturing detailed database activity data. Some common focus areas for debugging database-related performance issues are given below. These are by no means exhaustive, but in our experience a lot of database issues can be debugged by application developers, and we highlight them here:
● Review SQL queries: SQL queries ordered by execution time and CPU time should be examined for performance optimization; the “SQL Statistics” section of the AWR report is very helpful. As an application developer, always get the query execution plan for newly introduced queries from the production database beforehand (a minimal sketch of fetching a plan through JDBC follows after this list).
● Full table scans: Growing volume in an enterprise can easily render query plans ineffective over time. Look at the query execution plan; if a full table scan is being performed on one or more tables, apply indexes to limit the time spent searching for relevant data. There are scenarios where a full table scan shows up despite proper indexing. We ran into this problem with one of the leading national banks, and this link provides some additional considerations.
● enq: TM - contention: waits on this event indicate unindexed foreign key constraints.
● Blocking sessions and inactive session count: these statistics can be gathered periodically from the database.
● Data archival and purging: financial enterprise data (especially banking data) is kept for an average of seven years, or longer in some places. This includes bank statements, accounting records, deposit slips, taxes, payroll, purchase orders, and employee expense reports. With a growing number of records it becomes essential to keep a tab on database performance, as query response times will be impacted when working on large datasets. It is essential to partition data by day/month/year etc. and keep moving it out of the transactional database to another backup store. Consider this early in database design and think of strategies like soft or hard delete, and the time duration after which old data can be purged or moved to archival storage.
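As an illustration of the first point, here is a minimal sketch of pulling an execution plan for a new query through JDBC, assuming an Oracle database and its JDBC driver on the classpath; the connection details, table name and query are placeholders, not the actual system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainPlanCheck {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders for illustration only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCLPDB", "app_user", "secret");
             Statement stmt = conn.createStatement()) {

            // Ask Oracle to compute the plan for the new query without executing it.
            stmt.execute("EXPLAIN PLAN FOR "
                    + "SELECT * FROM transactions WHERE account_id = 42 AND txn_date > SYSDATE - 30");

            // Read the formatted plan; a 'TABLE ACCESS FULL' row is a hint to revisit indexes.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY())")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}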
2. Network Latency
Network latency is another very important factor that can contribute to delays in data transfer. Enterprise-grade financial systems may communicate with a lot of internal and external APIs; typical examples are NSDL (National Securities Depository Limited), PAN verification over the internet, and a core banking system (CBS) hosted on premises. The challenge the system faces is that the front-end application cannot move to the next page until it gets a valid response, as most of these calls are synchronous.
An enterprise platform we had deployed on one bank’s premises was integrated with 50+ external APIs. One of the issues we identified was that external API traffic was being routed via the DR (Disaster Recovery) region. Due to the additional network hops, sudden spikes in latency were observed for these APIs during the peak load window. As enterprise systems normally share network components (e.g. load balancers, network bandwidth) and data centers, it is a good idea to ensure all the required services are deployed in the same region to avoid network latencies. While this may seem like a natural requirement for any system, the point here is to not make assumptions about the network routes and deployments of dependent services in an enterprise setup. It is best to test all the assumptions and make the required adjustments.
There are several other optimization techniques to overcome these hurdles, including caching and minimizing data size (including compression), but monitoring the network itself adds immense value and should be given priority. Synthetic monitoring checks such as network bandwidth saturation and packet drops will help in averting and analyzing problems beforehand. The objective is not to recommend any particular tool, but New Relic, Datadog or Dynatrace can do the job. Even command-line Linux tools like traceroute, MTR (delay, drops and connectivity) and iperf (speed, performance) can give quick and meaningful insights.
Similarly, keep close track of call amplification and avoid overly frequent API calls or too many database calls; a small caching sketch follows below.
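One common mitigation is a short-lived cache in front of slow, mostly static external lookups. The following is a minimal sketch rather than production code; the PAN verification usage in the comment is a hypothetical example of such a lookup.

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal time-based cache to avoid hitting a slow external API for every request.
public class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final Instant expiry;
        Entry(T value, Instant expiry) { this.value = value; this.expiry = expiry; }
    }

    private final ConcurrentHashMap<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    public TtlCache(Duration ttl) { this.ttl = ttl; }

    // Best-effort: concurrent misses for the same key may load twice, which is acceptable here.
    public V get(K key, Function<K, V> loader) {
        Entry<V> e = cache.get(key);
        if (e == null || Instant.now().isAfter(e.expiry)) {
            V value = loader.apply(key);   // the real external call, e.g. a PAN status lookup
            cache.put(key, new Entry<>(value, Instant.now().plus(ttl)));
            return value;
        }
        return e.value;
    }
}

// Usage (hypothetical client): panCache.get(panNumber, pan -> panStatusClient.verify(pan));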
3. Application Latency
There can be many areas internal to the enterprise application that impact its performance. Problems can be unique (we would like to hear more from our readers), but the areas below might help in debugging the root cause. Some of the steps we took:
Database optimization: database-related issues have already been discussed in section 1.
API optimization: We started with the top 10 heavily used and top 10 slow APIs. We studied their overall response latency, throughput, code complexity and response payload sizes, and repeated this over a period of time. We were able to convert a lot of synchronous calls in a transaction to asynchronous ones, e.g. sending customer notifications, updating state in the database, etc. (see the sketch below). APIs used to upload or download documents were moved to separate services with controls on thread pools, etc. Proper timeouts were added on all external-facing APIs after studying average/max/min latencies. We were also able to remove a lot of redundant calls. The question that arises is why these were not planned earlier and why the changes were done retrospectively. The point is that in-house systems at large enterprises evolve over a period of time and communication from the enterprise may not be very proactive, which can render some of the initial design considerations wrong. Deterioration in the core banking system might be gradual, making it even more difficult to identify the root cause later.
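A minimal sketch of two of these changes, assuming Spring’s RestTemplate; the NotificationClient interface and the timeout values are illustrative, not the actual implementation.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class ApiOptimizationSketch {
    // Bounded pool so async fire-and-forget work cannot exhaust system threads.
    private static final ExecutorService NOTIFY_POOL = Executors.newFixedThreadPool(8);

    // Explicit timeouts instead of waiting indefinitely on a slow external API.
    public static RestTemplate restTemplateWithTimeouts() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(2_000);   // ms, derived from observed min/avg/max latencies
        factory.setReadTimeout(5_000);      // ms
        return new RestTemplate(factory);
    }

    // The customer notification no longer blocks the main transaction path.
    public static void notifyCustomerAsync(NotificationClient client, String customerId) {
        CompletableFuture.runAsync(() -> client.send(customerId), NOTIFY_POOL);
    }

    // Hypothetical interface standing in for the real notification service.
    public interface NotificationClient { void send(String customerId); }
}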
Configuration review: the overall system configuration was revisited for all microservices: thread pools, database connection pools, API gateway routing pools, JVM Xms and Xmx values (running multiple nodes or processes with a smaller JVM footprint each), Redis configuration, etc.
One important learning was to avoid relying on default values and to fine-tune every required parameter as per the need, as in the connection pool sketch below.
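For example, a database connection pool sketch assuming HikariCP; the URL, credentials and sizes are placeholders to be derived from measured load, not recommended values.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ConnectionPoolConfig {
    public static HikariDataSource dataSource() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:oracle:thin:@//db-host:1521/ORCLPDB"); // placeholder
        cfg.setUsername("app_user");
        cfg.setPassword("secret");
        cfg.setMaximumPoolSize(40);          // sized from measured concurrency, not the default of 10
        cfg.setMinimumIdle(10);
        cfg.setConnectionTimeout(3_000);     // fail fast instead of queueing forever (ms)
        cfg.setMaxLifetime(30 * 60 * 1000);  // recycle connections periodically (ms)
        return new HikariDataSource(cfg);
    }
}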
Mocking enterprise dependencies: Enterprise systems interact with external system APIs. Building a test lab with service virtualization of these external APIs, i.e. returning mock responses, can be a life saver. These mocked endpoints can add some latency, drop connections abruptly, etc. A lot of API-level issues that occur under high concurrency or volume can be debugged and tackled beforehand by running performance tests with tools like JMeter, ab (Apache Benchmark), etc. One can mock core banking systems, external systems such as government systems, etc., which helps in rolling out changes quickly. A small stub sketch is shown below.
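One way to build such a stub, assuming WireMock is available; the endpoint path, payload and delay are made up for illustration.

import com.github.tomakehurst.wiremock.WireMockServer;
import static com.github.tomakehurst.wiremock.client.WireMock.*;

public class CbsStubLab {
    public static void main(String[] args) {
        WireMockServer cbsStub = new WireMockServer(8089);
        cbsStub.start();

        // Mocked core-banking endpoint with an artificial 250 ms delay to mimic production latency.
        cbsStub.stubFor(get(urlPathEqualTo("/cbs/accounts/balance"))
                .willReturn(aResponse()
                        .withStatus(200)
                        .withHeader("Content-Type", "application/json")
                        .withBody("{\"balance\": 1024.50}")
                        .withFixedDelay(250)));

        // Point the application at http://localhost:8089 and run JMeter/ab against it.
    }
}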
Logging cleanup: At times it is just overlooked, but a 4% to 5% performance improvement was seen by looking closely at what is logged, e.g. stack traces, logging inside the most frequently used methods, information that was not meaningful, etc. Again, this requires careful reasoning with teams within the enterprise, who can be paranoid about saving every log and every metric till eternity to handle some unforeseen compliance or audit issue. A separate persistence mechanism (such as a NoSQL store) can instead be used to save rolled-up data rather than the entire application logs.
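A small illustration of keeping logging cheap in hot paths, assuming SLF4J; the class and method names are hypothetical.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentProcessor {
    private static final Logger log = LoggerFactory.getLogger(PaymentProcessor.class);

    void process(String txnId, Object payload) {
        // Parameterized logging: the message is only formatted if the level is enabled.
        log.debug("processing txn {}", txnId);

        // Guard expensive serialization in frequently used methods instead of always paying for it.
        if (log.isTraceEnabled()) {
            log.trace("full payload for {}: {}", txnId, payload);
        }
        // ... business logic ...
    }
}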
4. Infrastructure right sizing
Let’s start with the problem we had at hand. Our deployed distributed platform was running multiple microservices on multiple VMs, a cluster of SQL and NoSQL databases and caching solutions, and was handling the throughput and volume decently. We were informed by one customer (a big financial organization) that in the next few months incoming traffic would increase by 3x, and that if there was a need to upsize the infrastructure it had to be called out to them in advance. Such unplanned tasks can happen in any mission-critical financial system. What approach would you take? The simple option is 3x the CPU and memory of what you have, but there are better ways to calculate it. One of them is benchmarking in a lab environment by simulating the expected traffic, but at times that is not easy.
One very handy mathematical formula is Little’s law (L = λW, i.e. average concurrency = arrival rate × average time in system), which we also used for sizing the infrastructure based on current traffic.
An example: if 1000 requests are received each second on average and each takes 0.25 seconds to process on average, then the concurrency a service has to handle is ~1000 × 0.25 = 250 concurrent requests. This exercise needs to be done for all services, and a good place to start is the ones interacting with external API endpoints; a small sizing sketch follows below.
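A minimal sketch of applying this to the 3x growth scenario; the rates, latency and headroom factor are illustrative numbers, not measurements from the actual system.

public class LittlesLawSizing {
    // L = lambda * W : concurrency = arrival rate * average time in system.
    static double requiredConcurrency(double arrivalRatePerSec, double avgLatencySec) {
        return arrivalRatePerSec * avgLatencySec;
    }

    public static void main(String[] args) {
        double current = requiredConcurrency(1000, 0.25);      // ~250 in-flight requests today
        double after3x = requiredConcurrency(3 * 1000, 0.25);  // ~750 with 3x traffic
        // Add headroom for bursts and retries before fixing thread pool and instance counts.
        System.out.printf("current=%.0f, after 3x growth=%.0f, with 30%% headroom=%.0f%n",
                current, after3x, after3x * 1.3);
    }
}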
This approach is not a silver bullet; other areas also need to be analyzed. For example: if concurrency is going to increase, the service that is the first point of contact has to be sized accordingly; the disk size of storage solutions should be estimated from the expected increase in data size; IOPS needs to be looked at closely; and the factor by which database reads and writes will increase must be estimated.
5. Profiling
To put it simply, code profiling makes the life of developers easier, as most of the time the focus is on business logic and non-functional aspects are given the least priority. A high memory footprint and leaks have a domino effect on large-scale enterprise systems: even if a single piece starts showing an issue, it can bring down the entire system if not properly handled or designed. Consider the case of an enterprise system written in Java. A lot of tools are available to help profile and debug different problems. Let us talk about two key tools.
5.1 Thread Dump
A thread dump is a snapshot that gives a picture of everything an application is doing at the instant it is captured. It is a collection of stack traces, one for each thread running in the instance. Every high-level language has its own way of taking a thread snapshot; in our case it was Java, so the thread dump gives a snapshot of the current state of the running Java process.
Utility: helpful in analyzing CPU spikes.
Command to take thread dump:
First check which process (and which threads within it) is taking up the most CPU using:
top -n1 -b -H
Use the PID identified above to take the dump:
jstack -l {pid} > {file-path-for-dump}
Java threads have clearly defined states.
Case 1: When CPU utilization is very high, focus on threads in the RUNNABLE state.
Case 2: When CPU usage is very low, focus on threads in the BLOCKED state.
Usually 2–3 thread dumps should be taken with a gap of 10–15 seconds; this helps in identifying long-running threads.
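A small shell loop can capture them, assuming the PID found earlier is stored in the PID variable:
for i in 1 2 3; do jstack -l $PID > /tmp/threaddump_$i.txt; sleep 15; done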
One interesting finding during our analysis of the thread dumps was that there were several threads in the BLOCKED state. The thread dump pointed to the RestTemplate, whose underlying HTTP connection pool allowed only 2 concurrent connections per route by default; the other threads kept waiting in a blocked state. Once we increased the connection pool configuration, the new thread dump showed a very small number of blocked threads, as expected.
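A sketch of that kind of fix, assuming RestTemplate backed by Apache HttpClient 4.x; the pool sizes shown are illustrative and should be derived from measured concurrency.

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class PooledRestTemplate {
    public static RestTemplate build() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);            // across all routes (HttpClient default is 20)
        cm.setDefaultMaxPerRoute(50);   // per target host (default is 2, the source of the blocked threads)
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        return new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));
    }
}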
There are tools, e.g. VisualVM, that can analyze these dumps and plot thread dependency graphs to detect deadlocks.
5.2 Heap Dump
A Heap dump is a snapshot of the objects and threads occupying the memory of JVM.
Utility: helpful in analyzing memory leaks that lead to Out Of Memory (OOM) crashes.
Command to take heap dump:
One way is to have the JVM take a heap dump automatically on OOM by configuring additional JVM arguments:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=</path/to/dumpfile>
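A heap dump can also be taken on demand from a running process with jmap, which ships with the JDK:
jmap -dump:live,format=b,file={file-path-for-dump} {pid}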
Once a dump file is created, it should be analyzed for memory leaks. There are free and paid tools like Eclipse Memory Analyzer (MAT), VisualVM, YourKit, etc.
The snapshot below, from VisualVM, shows all the objects present in the heap when the dump was taken. For a few classes, like the highlighted one (the Agent class), around 3.4 million objects were present; this was the cause of the memory leak. We fixed the query that was loading these objects.
6. Observability
Observability helps in answering very important and meaningful questions like “How much time is spent by a transaction end to end as the request flows through multiple microservices, the database layer, the caching layer, external endpoints, etc.?”, “What is the p99 or p95 latency of each microservice?”, “Which are the critical time windows, or patterns of burst load, that throttle the system?”, “Are there dependent services causing a cascading impact on others?”, “Have any third-party APIs started timing out or intermittently introducing latency?”, and so on.
To gauge not only these but many other parameters in near real time, observability metrics are a lifeline for debugging any system and getting to the root cause. One can use an open-source option like OpenTelemetry or a licensed one like ELK-APM. We have been using ELK-APM to answer some of these weighty and important questions.
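For teams starting with OpenTelemetry, the following is a minimal manual-instrumentation sketch in Java; it assumes an OpenTelemetry SDK and exporter are already configured, and the tracer, span and attribute names as well as the CbsClient interface are illustrative.

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CbsClientTracing {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("payments-service");

    // Wraps an outbound core-banking call in a span so its latency shows up in end-to-end traces.
    public static String fetchBalanceTraced(String accountId, CbsClient cbs) {
        Span span = tracer.spanBuilder("cbs.fetchBalance").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("account.id", accountId);
            return cbs.fetchBalance(accountId);
        } catch (Exception e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    // Hypothetical interface standing in for the real CBS integration.
    public interface CbsClient { String fetchBalance(String accountId); }
}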
The first graph details dependency latency & throughput in the time window being monitored. The second graph is a snapshot of a transaction trace to analyze how much time is spent and where.
In our case we have been using ELK-APM to monitor overall system performance. It provides a quick bird’s-eye view of the health and throughput of all microservices, database health and any long-running queries, caching service health and throughput, incoming and outgoing traffic rates, spikes or bursts of traffic, third-party API endpoint availability and latency, transactional errors, HTTP status codes being returned, etc.
Conclusion
Every enterprise system is unique, as it might be solving a different business problem and adapting to different deployment strategies. The architecture of such systems evolves with business needs and has to work within the constraints imposed by the enterprise systems they interface with (such as the stability and limitations of the core banking systems). They can be further challenged by the evolving needs of the enterprise, its changing priorities and external dependencies. Scaling and evolving such systems leads to a lot of learning, which is what we have attempted to bring forth in this blog. We have captured learnings around the various tiers, taking Java enterprise systems as an example. We are sure that a lot of you have had similar learnings. Do share yours as well.