
Friday, June 19, 2015

Information Security - Reducing Complexity


Change is constant, and everything around us is evolving. Primarily, the evolution is happening in the following areas:

Threats:

The threat landscape has changed drastically since the 1980s and 1990s. Between 1980 and 2000, a good anti-virus and firewall solution was considered sufficient for an organization. Those are no longer enough: hackers now use sophisticated tools, technology and skills to attack organizations. The motive behind hacking has also evolved, and hacking, though illegal, has become a commercially viable profession or business.

Compliance:

With the pace at which the threat landscape is evolving, governments have reason to be concerned, as they increasingly leverage technology to better serve citizens and thereby open the door to increased security risk. To combat these challenges, governments have introduced regulatory compliance requirements, which add further complexity for the CSOs of enterprises.

Technology:

Technology is evolving at a much faster pace, and the things around us are getting smarter, with the ability to connect and communicate over the internet. In parallel, considerable progress has been made in Artificial Intelligence, Machine Learning and related fields. These newer 'smart things' add to the complexity, as CSOs have to handle the threats they bring to the surface.

Needless to say, hackers too make the best use of this technology evolution, improving their attack capabilities day by day.

Business Needs:

The driver for adopting these evolving technologies is business need. Businesses that want to stay ahead of the competition leverage emerging technologies to surge ahead. With shorter time to market, all departments, including the security organization, must be able to accept and implement such changes at a faster pace. Under this time pressure, there is a tendency to look for easier and quicker ways to implement changes, ignoring best practices.


Consumerization

IT today is expected to simplify things for consumers both within and outside the organization. This raises user expectations and leads to many changes, some of them unrealistic. It includes users bringing their own anything (BYOA), and will soon include Bring Your Own Identity, with chips implanted under the skin. As you would know, employees at EpiCenter, the new high-tech office campus in Sweden, can already wave their hands to open doors using an RFID chip implanted under the skin.

Connected world

Most enterprises are now connected with their business partners for exchanging business data. With this, the IT system perimeter extends, to some extent, to that of the partners as well. Rules and policies have had to be relaxed to support such connected systems. Now the things we use every day are also transforming into connected things, adding to the complexity.

Big data

Security monitoring now generates data at a volume that basically calls for big data tools to handle it. While this complexity did exist earlier, the attacks were not as sophisticated then. Today, with the level of sophistication on the attack surface, simplifying the complexity of handling such huge volumes of data is very much required.

Skillset

The threat landscape is widening and the attacks are getting sophisticated, which calls for even better tools and technologies to prevent or counter them. This means continuous change in the methods, approaches, tools and technologies used, making it difficult to maintain and manage the skills of the people involved.

Application Ecosystem

A mid-sized organization will have hundreds of applications, each needing different exceptions to policies and rules. These applications may in turn use third-party components, so the chance of a vulnerability within them is very high. Given that these applications constantly change and evolve, code or components left behind might expose a vulnerability.


How does complexity impact security

Complexity impacts the security capability in many ways; the following are some of them:

Accuracy in Detection

Complexity makes the detection of a compromise difficult. Handling and correlating large volumes of logs from different devices, and from different vendors at that, will always be a challenge, and this makes timely and accurate detection a remote possibility. A successful countermeasure requires accurate detection in the pre-infection stage, or at least in the infection stage; the later a compromise is detected, the more complex it is to counter.
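
To make the correlation idea concrete, here is a minimal Python sketch (with hypothetical, already-normalized events) that flags a host reported by multiple sources within a short window:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, normalized events from different devices / vendors.
events = [
    {"time": "2015-06-19 10:01:05", "source": "firewall", "ip": "10.0.0.7", "signal": "port-scan"},
    {"time": "2015-06-19 10:02:40", "source": "ids",      "ip": "10.0.0.7", "signal": "exploit-attempt"},
    {"time": "2015-06-19 10:03:10", "source": "proxy",    "ip": "10.0.0.7", "signal": "suspicious-download"},
    {"time": "2015-06-19 11:15:00", "source": "ids",      "ip": "10.0.0.9", "signal": "exploit-attempt"},
]

def correlate(events, window=timedelta(minutes=5), threshold=2):
    """Flag hosts that raise signals from multiple sources within a short window."""
    by_ip = defaultdict(list)
    for e in events:
        by_ip[e["ip"]].append((datetime.strptime(e["time"], "%Y-%m-%d %H:%M:%S"), e["source"]))
    alerts = []
    for ip, entries in by_ip.items():
        entries.sort()
        for i, (start, _) in enumerate(entries):
            sources = {src for t, src in entries[i:] if t - start <= window}
            if len(sources) >= threshold:
                alerts.append((ip, sorted(sources)))
                break
    return alerts

print(correlate(events))  # [('10.0.0.7', ['firewall', 'ids', 'proxy'])]
```

Real SIEM correlation is far richer than this, but even the sketch shows why volume and vendor diversity make accurate, timely detection hard.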

Resources

Each new security technology requires people to properly deploy, operate and maintain it, but it is difficult to add new heads to the security organization whenever a new tool or technology is considered. Similarly, legacy solutions put in place by employees who have since left the organization are likely to remain untouched for fear of breaking something.

Vulnerabilities and Exposures

With the huge number of applications used by the enterprise, tracking vulnerabilities and exposures is a complex and huge exercise, unless it is integrated into the build and delivery process by mandating security vulnerability assessments. With innumerable applications, components and operating systems connecting to the enterprise network, full coverage is almost impossible. And with wearables and other smart things connecting to the network, who knows what vulnerabilities exist in them, waiting to be exploited by hackers.

Methods for reducing complexity

Complexity is certainly bad, and reducing it will be beneficial both in terms of cost and otherwise. However, simplification should not come at the cost of the needed detection and protection capabilities. A balanced approach is necessary so that risk, cost and complexity are weighed against each other to the benefit of the organization. The following are some methods that may help reduce complexity:

  • Integrated processes instead of isolated security processes. Every business process should have the security-related processes integrated within it, so that every person in the organization contributes to security by default. The security process framework should be designed so that it evolves over time based on experience and feedback.
  • Practicing an agile approach within the security organization, so that complexity is hidden inside tools and appliances through automation. An agile approach also helps the security organization embrace changes faster, especially when implementing changes in response to a detected threat or compromise. Such practices have to be adopted into the security framework carefully.
  • Outsourcing security operations to Managed Security Service Providers (MSSPs) is certainly an option for small and medium enterprises; it takes some of the complexity away and thus benefits the organization. Needless to say, outsourcing does not absolve the security organization of responsibility for any security incident or breach.
  • “Shrinking the rack”: consolidating technologies, whereby devices combining multiple technologies and capabilities may be easier to deploy and administer. At the same time this carries the risk of having all eggs in one basket; when such a device or solution is hacked, everything behind it is wide open to the attacker.
  • Mandating periodic code, component and process refactoring, whereby unneeded legacy code, components and processes are regularly reviewed and removed from the system. This helps keep applications maintainable and secure. Also, instil security as a culture among all employees, so that they handle security indicators responsibly.

Thursday, August 28, 2014

Architectural Security aspects of BGP/MPLS

Thanks to its inherent benefits, MPLS (Multi Protocol Label Switching) is gaining widespread use for providing IP VPN services. With the emerging trend of connected systems, a global enterprise today is well connected with its partners, and MPLS is the preferred choice. The Border Gateway Protocol (BGP) is used to interconnect such autonomous systems by exchanging routing information across them. The emergence of the Multiprotocol Extensions and other variations of BGP has furthered the choice of MPLS VPNs. Along the same lines, security concerns about using such networks are also on the rise, and customers are voicing specific security demands as they experience data breaches and security incidents.

The objective of this blog is not to explain BGP/MPLS as such, but to examine how BGP/MPLS addresses typical security requirements. The following sections have been extracted from RFC 4381, published by the Internet Engineering Task Force (IETF) in 2006.


Address Space, Routing, and Traffic Separation

BGP/MPLS allows distinct IP VPNs to use the same address space, which can also be private address space. This is achieved by adding a 64-bit Route Distinguisher (RD) to each IPv4 route, making VPN-unique addresses also unique in the MPLS core. This "extended" address is also called a "VPN-IPv4 address". Thus, customers of a BGP/MPLS IP VPN service do not need to change their current addressing plan. The address space on the CE-PE link (including the peering PE address) is considered part of the VPN address space. Since address space can overlap between VPNs, the CE-PE link addresses can overlap between VPNs. For practical management considerations, SPs typically address CE-PE links from a global pool, maintaining uniqueness across the core.

On the data plane, traffic separation is achieved by the ingress PE pre-pending a VPN-specific label to the packets. The packets with the VPN labels are sent through the core to the egress PE, where the VPN label is used to select the egress VRF. Given the addressing, routing, and traffic separation across a BGP/MPLS IP VPN core network, it can be assumed that this architecture offers in this respect the same security as a layer-2 VPN. It is not possible to intrude from a VPN or the core into another VPN unless this has been explicitly configured. If and when confidentiality is required, it can be achieved in BGP/MPLS IP VPNs by overlaying encryption services over the network. However, encryption is not a standard service on BGP/MPLS IP VPNs.
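
To illustrate why the overlapping customer address spaces described above do not clash in the core, here is a small Python sketch (illustrative only; the RD values are hypothetical) that builds VPN-IPv4 style keys by combining a Route Distinguisher with an IPv4 prefix:

```python
# Two customers ("VPN A" and "VPN B") reuse the same private prefix.
# Prepending a per-VPN Route Distinguisher (RD) makes the routes unique
# in the provider core, as described for VPN-IPv4 addresses above.
routes = [
    ("65000:1", "10.1.0.0/24"),   # VPN A, hypothetical RD
    ("65000:2", "10.1.0.0/24"),   # VPN B, same IPv4 prefix, different RD
]

vpn_ipv4 = {f"{rd}:{prefix}": rd for rd, prefix in routes}

print(sorted(vpn_ipv4))
# ['65000:1:10.1.0.0/24', '65000:2:10.1.0.0/24'] -- no collision despite
# identical customer prefixes.
```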

Hiding of the BGP/MPLS IP VPN Core Infrastructure

Service providers and end-customers do not normally want their network topology revealed to the outside. This makes attacks more difficult to execute: if an attacker doesn't know the address of a victim, he can only guess the IP addresses to attack. Since most DoS attacks don't provide direct feedback to the attacker, it would be difficult to attack the network. It has to be mentioned specifically that information hiding as such does not provide security; however, in the market this is a perceived requirement.

With a known IP address, a potential attacker can launch a DoS attack more easily against that device. Therefore, the ideal is to not reveal any information about the internal network to the outside world. This applies to the customer network and the core. A number of additional security measures also have to be taken: most of all, extensive packet filtering. For security reasons, it is recommended for any core network to filter packets from the "outside" (Internet or connected VPNs) destined to the core infrastructure. This makes it very hard to attack the core, although some functionality such as pinging core routers will be lost. Traceroute across the core will still work, since it addresses a destination outside the core.

Being reachable from the Internet automatically exposes a customer network to additional security threats. Appropriate security mechanisms have to be deployed such as firewalls and intrusion detection systems. This is true for any Internet access, over MPLS or direct. A BGP/MPLS IP VPN network with no interconnections to the Internet has security equal to that of FR or ATM VPN networks. With an Internet access from the MPLS cloud, the service provider has to reveal at least one IP address (of the peering PE router) to the next provider, and thus to the outside world.

Resistance to Attacks

To attack an element of a BGP/MPLS IP VPN network, it is first necessary to know the address of the element. The addressing structure of the BGP/MPLS IP VPN core is hidden from the outside world. Thus, an attacker cannot know the IP address of any router in the core to attack. The attacker could guess addresses and send packets to these addresses. However, due to the address separation of MPLS each incoming packet will be treated as belonging to the address space of the customer. Thus, it is impossible to reach an internal router, even by guessing IP addresses.

In the case of a static route that points to an interface, the CE router doesn't need to know any IP addresses of the core network or even of the PE router. This has the disadvantage of needing a more extensive (static) configuration, but is the most secure option. In this case, it is also possible to configure packet filters on the PE interface to deny any packet to the PE interface. This protects the router and the whole core from attack. In all other cases, each CE router needs to know at least the router ID (RID, i.e., peer IP address) of the PE router in the core, and thus has a potential destination for an attack.

A potential attack could be to send an extensive number of routes, or to flood the PE router with routing updates. Both could lead to a DoS, however, not to unauthorised access. To reduce this risk, it is necessary to configure the routing protocol on the PE router to operate as securely as possible. This can be done in various ways: 

  • By accepting only routing protocol packets, and only from the CE router. The inbound ACL on each CE interface of the PE router should allow only routing protocol packets from the CE to the PE. 
  • By configuring MD5 authentication for routing protocols. This is available for BGP (RFC 2385 [6]), OSPF (RFC 2154 [4]), and RIP2 (RFC 2082 [3]), for example. 

This avoids packets being spoofed from other parts of the customer network than the CE router. It requires the service provider and customer to agree on a shared secret between all CE and PE routers. It is necessary to do this for all VPN customers. It is not sufficient to do this only for the customer with the highest security requirements.

It is theoretically possible to attack the routing protocol port to execute a DoS attack against the PE router. This in turn might have a negative impact on other VPNs on this PE router. For this reason, PE routers must be extremely well secured, especially on their interfaces to CE routers. ACLs must be configured to limit access only to the port(s) of the routing protocol, and only from the CE router.
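
As a rough illustration of the filtering described above (not an actual router configuration), the following Python sketch models an inbound check on a PE's CE-facing interface that admits only BGP traffic, which runs over TCP port 179, sourced from the known CE peer address; the addresses are hypothetical:

```python
# Hypothetical addresses for a single CE-PE link.
CE_PEER_IP = "192.0.2.1"    # the CE router's peering address
PE_IF_IP   = "192.0.2.2"    # the PE interface address
BGP_PORT   = 179            # BGP runs over TCP port 179

def permit_inbound(src_ip: str, dst_ip: str, dst_port: int) -> bool:
    """Model of an inbound ACL on the PE's CE-facing interface:
    only routing-protocol traffic from the CE to the PE is allowed."""
    return src_ip == CE_PEER_IP and dst_ip == PE_IF_IP and dst_port == BGP_PORT

print(permit_inbound("192.0.2.1", "192.0.2.2", 179))   # True: BGP session from the CE
print(permit_inbound("192.0.2.99", "192.0.2.2", 179))  # False: spoofed or unknown source
print(permit_inbound("192.0.2.1", "192.0.2.2", 22))    # False: not the routing protocol port
```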

Label Spoofing

Similar to IP spoofing attacks, where an attacker fakes the source IP address of a packet, it is also theoretically possible to spoof the label of an MPLS packet. For security reasons, a PE router should never accept a packet with a label from a CE router. RFC 3031 [9] specifies: "Therefore, when a labeled packet is received with an invalid incoming label, it MUST be discarded, UNLESS it is determined by some means that forwarding it unlabeled cannot cause any harm."

There remains the possibility to spoof the IP address of a packet being sent to the MPLS core. Since there is strict address separation within the PE router, and each VPN has its own VRF, this can only harm the VPN the spoofed packet originated from; that is, a VPN customer can attack only himself. MPLS doesn't add any security risk here. The Inter-AS and Carrier's Carrier cases are special cases, since on the interfaces between providers typically packets with labels are exchanged. See section 4 for an analysis of these architectures.


There are a number of precautionary measures outlined above that a service provider can use to tighten security of the core, but the security of the BGP/MPLS IP VPN architecture depends on the security of the service provider. If the service provider is not trusted, the only way to fully secure a VPN against attacks from the "inside" of the VPN service is to run IPsec on top, from the CE devices or beyond. RFC 4381 discusses many aspects of BGP/MPLS IP VPN security. It has to be noted that the overall security of this architecture depends on all components and is determined by the security of the weakest part of the solution.

Saturday, June 1, 2013

Software Quality Attributes: Trade-off analysis

We all know that software quality is not just about meeting the functional requirements, but also about the extent to which the software meets a combination of quality attributes. Building quality software requires much attention to identifying and prioritizing the quality attributes, and designing and building the software to adhere to them. Again, going by the saying "you cannot manage what you cannot measure", it is also important to design the software with the ability to collect metrics around these quality attributes, so that the degree to which the end product satisfies each attribute can be measured and monitored.

It has always been a challenge for software architects and designers to come up with the right mix of quality attributes with appropriate priorities. This is further complicated because the attributes are highly interlinked: a higher priority on one can have an adverse impact on another. Here is a sample matrix showing the inter-dependencies among some of the software quality attributes.




|                  | Avail. | Effic. | Flex. | Integ. | Interop. | Maint. | Port. | Reliab. | Reus. | Robust. | Test. |
|------------------|--------|--------|-------|--------|----------|--------|-------|---------|-------|---------|-------|
| Availability     |        |        |       |        |          |        |       | +       |       | +       |       |
| Efficiency       |        |        | -     |        | -        | -      | -     | -       |       | -       | -     |
| Flexibility      |        | -      |       | -      |          | +      | +     | +       |       | +       |       |
| Integrity        |        | -      |       |        | -        |        |       |         | -     |         | -     |
| Interoperability |        | -      | +     | -      |          |        | +     |         |       |         |       |
| Maintainability  | +      | -      | +     |        |          |        |       | +       |       |         | +     |
| Portability      |        | -      | +     |        | +        | -      |       |         | +     |         | +     |
| Reliability      | +      | -      | +     |        |          | +      |       |         |       | +       | +     |
| Reusability      |        | -      | +     | -      |          |        |       | -       |       |         | +     |
| Robustness       | +      | -      |       |        |          |        |       | +       |       |         |       |
| Testability      | +      | -      | +     |        |          | +      |       | +       |       |         |       |

The '+' sign indicates a positive impact, while the '-' sign indicates a negative impact. This is only a likely indication of the dependencies; in reality, they could differ. The important takeaway, however, is that the quality attributes need to be planned and prioritized for every piece of software being designed or built, and the prioritization has to be done keeping in mind the inter-dependencies among the attributes. This means trade-offs will have to be made, and business and IT should be in agreement on these trade-off decisions.
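
To show how such a matrix can be used mechanically, here is a minimal Python sketch (encoding only a few rows of the matrix above) that lists the potential conflicts within a chosen set of prioritized attributes:

```python
# Partial encoding of the matrix above: attribute -> attributes it impacts negatively.
negative_impacts = {
    "Efficiency":  {"Flexibility", "Interoperability", "Maintainability",
                    "Portability", "Reliability", "Robustness", "Testability"},
    "Integrity":   {"Efficiency", "Interoperability", "Reusability", "Testability"},
    "Portability": {"Efficiency", "Maintainability"},
}

def conflicts(prioritized):
    """Return pairs (a, b) where prioritizing a works against prioritized attribute b."""
    chosen = set(prioritized)
    return [(a, b) for a in chosen
                   for b in negative_impacts.get(a, set())
                   if b in chosen]

# Example: a design that prioritizes efficiency, integrity and testability
# immediately surfaces trade-offs to negotiate with the business.
print(conflicts(["Efficiency", "Integrity", "Testability"]))
# e.g. [('Efficiency', 'Testability'), ('Integrity', 'Efficiency'), ('Integrity', 'Testability')]
```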

SEI's Architecture Trade-off Analysis Method (ATAM) provides a structured method to evaluate the trade-off points. The ATAM not only reveals how well an architecture satisfies particular quality goals (such as performance or modifiability), but also provides insight into how those quality attributes interact, that is, how they trade off against each other. Such design decisions are critical; they have the most far-reaching consequences and are the most difficult to change after a system has been implemented.

A prerequisite of an evaluation is to have a statement of quality attribute requirements and a specification of the architecture with a clear articulation of the architectural design decisions. However, it is not uncommon for quality attribute requirement specifications and architecture renderings to be vague and ambiguous. Therefore, two of the major goals of ATAM are to

  • elicit and refine a precise statement of the architecture’s driving quality attribute requirements 
  • elicit and refine a precise statement of the architectural design decisions

Sensitivity points use the language of the attribute characterizations. So, when performing an ATAM, the attribute characterizations are used as a vehicle for suggesting questions and analyses that guide the evaluation towards potential sensitivity points. For example, the priority of a specific quality attribute might be a sensitivity point if it is a key property for achieving an important latency goal (a response) of the system. It is not uncommon for an architect to answer an elicitation question by saying: “we haven’t made that decision yet”. However, it is important to flag key decisions that have been made as well as key decisions that have not yet been made.

All sensitivity points and tradeoff points are candidate risks. By the end of the ATAM, all sensitivity points and tradeoff points should be categorized as either a risk or a non-risk. The risks/non-risks, sensitivity points, and tradeoffs are gathered together in three separate lists. 

Sunday, April 7, 2013

NGINX - A High Performance HTTP Server

What you will find about NGINX on its website is that it is a free, open-source, lightweight HTTP web server and reverse proxy. NGINX doesn't rely on threads to handle requests. Instead, it uses a much more scalable event-driven (asynchronous) architecture. This architecture uses small, but more importantly, predictable amounts of memory under load. We have found that it holds up to this claim. Here is how we ended up using NGINX for one of our clients and found it meeting our expectations.


We were on a performance testing assignment for a game application, where much of the game content consisted of static Flash files (dynamic in a sense, as they carry embedded ActionScript executed on the front end) and image files, quite a few of them above 500 KB in size. Needless to say, this is already the compressed size, so enabling compression did not offer any performance gains. These files were served by Apache, which also serves the dynamic PHP requests.


Our initial performance tests indicated that at about 1000 concurrent hits, the Apache server's memory consumption shot up well beyond 2 GB and quite a few requests were timing out. Examination of the server configuration revealed that the server was configured for 256 child processes and was using the ‘prefork’ MPM module, which limits each child process to a single thread. This technically limits the number of concurrent requests the Apache server can serve to 256.


We also figured out that Apache needs to use the ‘prefork’ module in order to safely serve PHP requests. Though later versions of PHP are reported to work with the other MPM module (‘worker’, which can use multiple threads per child process), many still have concerns around thread safety. The 256-process limit was also a hard limit in this build, and increasing it would have required rebuilding the Apache server.


With this diagnosis, the choice we had was to serve the static content from a server other than Apache and leave Apache to serve only the PHP requests. We thought of trying out NGINX and, without wasting much time, went ahead and implemented it. NGINX was configured to listen on port 80 and serve the content from Apache's document root. The Apache server was moved to a non-standard port, and NGINX was also configured to act as a reverse proxy for all PHP requests, passing them to Apache for processing.


We performed the tests again and found tremendous improvement. The average latency at the 1000-concurrent-hit stress level came down from about 80 seconds to under 5 seconds. We could also observe that NGINX was consuming only around 10 MB of memory. NGINX, by itself, does not impose a hard limit on the number of concurrent connections; it is constrained only by the limits of the server itself and the other daemons running on it.

Saturday, February 9, 2013

Stress Testing a Multi-player Game Application

I recently had an opportunity to consult for a friend of mine on stress testing a multi-player game application. This was a totally new experience for me, and this blog details how I approached simulating the required amount of stress and testing the application under those conditions.

About the Application Architecture

The application was developed using Flash ActionScript, with a few PHP scripts for some of the supporting activities. The multi-player platform is provided by the SmartFox multi-player gaming middleware, and the application also makes use of MySQL. The Flash files containing the ActionScript and a large number of images are hosted on an Apache web server, which also hosts the PHP. Apache, MySQL and SmartFox are all hosted on a single cloud-hosted Linux server.

The test approach

My first take on the test approach was to focus on simulating stress on the server and to keep the client out of the scope of this test. This made sense, as all of the Flash ActionScript executes on the client side; in reality there is typically a single user playing the game on a client device, with all of that device's CPU, memory and related resources available to the client-side application. Thus there is no multi-player stress on the client.

Given that the focus was now on the impact of the stress on server resources, I had to understand how the client communicates with the server, and the request/response protocols and related payloads. I used Fiddler to monitor the traffic from the client device on which the game was being played, and could only see HTTP requests fetching the Flash and image files and a few of the PHP files. I could not find any HTTP traffic for the SmartFox server, and figured out that those requests are TCP socket requests and thus not captured by Fiddler.

The test tools

At this stage, it was clear that we needed to simulate stress on Apache by sending in a large volume of HTTP requests, and to simulate stress on the SmartFox server over TCP sockets as well. There are numerous open source tools for simulating HTTP traffic; I chose JMeter, which is open source, UI driven, easy to set up and use, and also supports multi-node load simulation.

I then needed to find a tool for simulating load on the sockets. I checked with SmartFox to see if they offer a stress test tool, but they don't. A search through the SmartFox forums revealed that a custom tool is the way to go, and that to make it easier we could use one of the SmartFox client API libraries, which are available for .NET, Java, ActionScript and a few other languages. I settled on the .NET route, as C# is the language I have been working with in recent years.

I built a multi-threaded custom .NET tool using the SmartFox client API to simulate the stress on SmartFox. To my surprise, the SmartFox client API library is not designed for multi-threaded use, and SmartFox support confirmed this behaviour. I then redesigned my custom tool to use a multi-process architecture, and it worked fine.
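
The actual tool was written in C# against the SmartFox client API; the sketch below shows only the general multi-process idea in Python, with a hypothetical host, port and payload, where each worker process opens its own TCP socket and sends a sequence of requests:

```python
import socket
from multiprocessing import Process

HOST, PORT = "game.example.com", 9933   # hypothetical SmartFox-style endpoint
REQUESTS_PER_CLIENT = 50

def simulated_client(client_id: int) -> None:
    """One OS process per simulated player, avoiding shared client-library state."""
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        for i in range(REQUESTS_PER_CLIENT):
            # Hypothetical request payload; the real tool used the SmartFox client API.
            sock.sendall(f"client {client_id} request {i}\n".encode())
            sock.recv(4096)  # wait for the server's response before the next request

if __name__ == "__main__":
    workers = [Process(target=simulated_client, args=(i,)) for i in range(200)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Using separate processes instead of threads sidesteps any thread-safety limitations in the client library, at the cost of more memory per simulated player.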

I also needed a monitoring tool to measure various server performance parameters under stress conditions. I chose the cloud-based New Relic to monitor the Linux server hosting the game components.

The test execution

I had JMeter configured on three nodes (one being the monitoring node) and set up to spawn the desired number of threads. The custom .NET tool ran on another client, set up to spawn the desired number of processes, each making a sequence of TCP socket requests. I also engaged a couple of QA resources to play the game and record the user experience under stress conditions.

The test execution went well and we could gather the needed data to form an opinion and make recommendations.



Saturday, December 29, 2012

Resilient Systems - Survivability of Software Systems

Resilience, as we all know, is the ability to withstand tough times. There is another term used quite interchangeably with it: Reliability. But Reliability and Resilience are different. Reliability is about a system or a process that has zero tolerance for failure, one that should not fail; when we talk about reliable systems, the context is that failure is neither expected nor acceptable. Resilience, on the other hand, is about the ability to recover from failures. What is important to understand about resilience is that failure is expected and inherent in any system or process, and might be triggered by changes to the platform, the environment or the data. While Reliability is about the system's robustness in not failing, Resilience is its ability to sense or detect impending failures and steer away from the events that lead to them, and, when failure cannot be avoided, to let it happen and then recover from it sooner.


A working definition for resilience (of a system) developed by the Resilient Systems Working Group (RSWG) is as follows:

“Resilience is the capability of a system with specific characteristics before, during and after a disruption to absorb the disruption, recover to an acceptable level of performance, and sustain that level for an acceptable period of time.” The following words were clarified:
  • The term capability is preferred over capacity since capacity has a specific meaning in the design principles.
  • The term system is limited to human-made systems containing software, hardware, humans, concepts, and processes. Infrastructures are also systems.
  • The term sustain allows determination of long-term performance to be stated.
  • Characteristics can be static features, such as redundancy, or dynamic features, such as corrective action to be specified.
  • Before, during and after – Allows the three phases of disruption to be considered.
    • Before – Allows anticipation and corrective action to be considered
    • During – How the system survives the impact of the disruption
    • After – How the system recovers from the disruption
  • Disruption is the initiating event of a reduction in performance. A disruption may be either a sudden or sustained event. A disruption may either be internal (human or software error) or external (earthquake, tsunami, hurricane, or terrorist attack).

Evan Marcus and Hal Stern, in their book Blueprints for High Availability, define a resilient system as one that can take a hit to a critical component and recover and come back for more in a known, bounded, and generally acceptable period of time.


In general, Resilience is a concern for Information Security professionals, as the final impact of a disruption (from which a system needs to recover) mostly falls on Availability, one of the three tenets of Information Security (CIA: Confidentiality, Integrity and Availability). But there is also much for system designers and developers, especially those building mission-critical systems, to consider in architecting the required level of resilience into their systems. For a system to be resilient, it must draw the necessary support from the software and hardware components, systems and platform it depends on. For instance, a disruption to a web application can come from a network outage or security attacks at the network layer, over which the software has no control. It is therefore important to consider software resiliency in relation to the resiliency of the entire system, including the human and operational components.


The PDR (Protect - Detect - React) strategy is no longer as effective as it used to be, due to various factors. It is time that predictive analytics and disruptive technologies such as big data and machine learning were considered for enhancing system resiliency. Based on the logs of the various inter-connected applications and components and other traffic data on the network, intelligence needs to be built into the system to trigger a combination of possible actions. For instance, if there is reason to suspect an attacker attempting to gain access, a possible action could be to operate the system in a reduced-access mode, i.e. parts of the system may be shut down or parts of the networks to which the system is exposed could be blocked.


OWASP’s AppSensor project is worth a look by architects and developers. The AppSensor project defines a conceptual framework and methodology that offers prescriptive guidance for implementing intrusion detection and automated response within an existing application. AppSensor defines over 50 different detection points which can be used to identify a malicious attacker, and provides guidance in the form of possible responses for each such detection.
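
As an illustration of the detection-point and response idea (a minimal sketch, not the AppSensor API; the thresholds and responses are hypothetical), an application could count suspicious events per user and escalate its response as the count grows:

```python
from collections import Counter

# Hypothetical escalation policy: number of suspicious events -> response.
RESPONSE_POLICY = [
    (3,  "log and alert security operations"),
    (5,  "force re-authentication"),
    (10, "lock account and block source for review"),
]

event_counts = Counter()

def record_detection_point(user: str, detection_point: str) -> str:
    """Record one suspicious event (e.g. repeated failed logins, tampered input)
    and return the response the application should take."""
    event_counts[user] += 1
    response = "none"
    for threshold, action in RESPONSE_POLICY:
        if event_counts[user] >= threshold:
            response = action
    print(f"{user}: {detection_point} (count={event_counts[user]}) -> {response}")
    return response

for _ in range(4):
    record_detection_point("alice", "failed login")
# After the third event the sketch starts recommending escalating responses.
```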


The following are some of the factors that need to be appropriately addressed to enhance the resilience of the systems:

Complexity - With systems continuously evolving and many discrete systems and components increasingly integrated into today’s IT ecosystem, complexity is on the rise. This means the resiliency routines built into individual systems need constant review and revision.

Cloud - Cloud computing is gaining wider acceptance, and as organizations embrace the cloud for their IT needs, data, systems and components are spread across the globe, which brings challenges for those building in resilience.

Agility - To stay on top of the competition, organizations need agility in their business processes, which means rapid changes in the underlying systems. This is a challenge, as it calls for constant checks to ensure that the changes being introduced do not downgrade or compromise the resiliency level of the systems.


While there are techniques and guiding principles which, when followed and applied, can greatly improve the resilience of systems, such design and implementation come at a price, and that is where the economics of resiliency needs to be considered. For instance, mission-critical software systems like those used in medical devices need a high degree of resilience, but many business systems can tolerate more and can therefore be less resilient. In any case, it is good to document the expected resilience level at the initial stage and work on it early in the system development life cycle; thinking about resilience late in the life cycle is not much good, as implementation will then call for a higher investment.



References:

Crosstalk - The Journal of Defense Software Engineering, Vol. 22, No. 6

Resilient Systems Working Group

OWASP - AppSensor Project

Sunday, December 9, 2012

Measuring Performance of Enterprise Architecture

Enterprise Architecture (EA) is viewed as an important function in IT Governance, and it plays a vital role in aligning IT with the business. This function is expected to define the technical direction and to ensure that architectural principles are applied to the design and maintenance of IT systems, which in turn should be aligned with the business vision, mission and strategies and the role that IT is expected to play within the organization. While commitment and funding are important, it is also important for the EA function to consider the following (not exhaustive) to be successful:

  • Aligning with the business strategy and the culture of the organization.
  • Actively involving itself in projects to ensure that the principles of design and evolution are adhered to, and maintaining a continued focus on the business requirements.
  • Offering technical consultancy to all business and IT functions, both internal and external.
  • Acting as a gate for all decisions impacting the design and evolution of the architecture.

IT Governance views EA as the hub of the IT wheel, with linkages to various processes, components and goals of the enterprise. Some of these key enabling links are:

  • Promoting and enabling business agility
  • Providing standards, policies and principles for the IT project, program and portfolio management functions
  • Guiding and enabling cost management and consolidation
  • Facilitating cost-effective, scalable integration of various IT systems
  • Supporting IT Governance by defining and providing the conceptual and technical priorities, thereby promoting informed decision making

Going by the premise that what is not measured does not get managed, it is important to identify measurable objectives for the architecture function itself, so that it is well managed and its contribution to the success of the organization is established. While measurement of architectural activities is difficult, COBIT suggests a set of measurable outcomes and performance measures, some of which are the following:

Number of technology solutions that are not aligned with the business strategy - One of the objectives of the EA function is to ensure that the technology solutions chosen or implemented are aligned with the business strategy, so a measure around this could be very useful to establish that the number of misaligned solutions is on the decline. Other measures could be derived from this, for instance expressing it relative to the total number of solutions.

Percent of non-compliant technology projects and platforms planned - With the fast-changing business environment, there will be times when the business needs solutions or technology projects that are not compliant with the standards and principles laid out by EA. EA has the responsibility to carefully review such needs and grant waivers. Such waivers should be short term and backed by a plan to normalize them; at times, this could call for a revision of the standards, policies or principles. A measure around this could be very useful to establish that EA is effective in dealing with non-compliant technology projects and platforms.

Decreased number of technology platforms to maintain - Standardization is one of the objectives of EA, and it contributes to cost reduction and lower technical complexity. Statistics and surveys show that enterprises without an active IT Governance / EA function tend to have multiple applications on different platforms serving the same business requirement in different departments. With an effective EA function, these should be far fewer in number and should decline over time. A measure around this is a very good indicator of EA being effective in this area.

Reduced application deployment effort and time-to-market - Supporting business agility is yet another key objective of the EA function. Today’s businesses operate in a highly dynamic environment and, in order to stay competitive and sustain their market position, need IT to deliver new or changed capabilities with a reduced time-to-market. A delayed delivery from IT could mean a lost opportunity. A measure around this indicator would be very helpful in establishing how well IT is supporting business change.

Increased interoperability between systems and applications - Most enterprises have multiple applications for specific needs, but these applications need to share data and information with each other. With cloud computing gaining wider acceptance, most enterprises are looking at discrete cloud-based solutions. With the benefits outweighing the concerns and constraints, and with the industry working to address those concerns, there will be an increasing move to the cloud. This means hosted applications from various providers will need to be interoperable with each other and with in-house applications. EA should ensure that the technology and solutions acquired or designed support this important attribute, i.e. interoperability. A measure around this parameter would be an important indicator of the EA function’s effectiveness.

Percent of IT budget assigned to technology infrastructure and research - Yet another expectation of the EA function is that it should help the business leverage emerging technology to its advantage. This requires the architects to be continuously looking for newer technologies and their application areas, and to recommend such technologies or solutions for implementation so that the business gets the most out of IT in accomplishing its mission. The extent of this activity should be in line with the identified and stated role of IT in the organization. While the percentage of the IT budget used for research is a useful measure, other useful measures can be derived from it, for instance the number of research solutions implemented as a percentage of the total number of solutions implemented in a given period. Business satisfaction with the timely identification and analysis of technology opportunities is another related measure, indicative of the outcome of this research.

Number of months since the last technology infrastructure review - With the fast-changing IT space, it is important to ensure that the technology infrastructure remains relevant to meeting the business objectives, and changes should be considered where needed. A measure indicating that the EA function performs periodic reviews of the technology infrastructure is a good indicator of its effectiveness. Measures derived from the outcome of the review will also be very useful.
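
As a simple illustration of how a couple of these measures might be computed from a project inventory (the fields and data below are hypothetical), a sketch in Python:

```python
# Hypothetical project inventory used to compute two of the measures above.
projects = [
    {"name": "CRM revamp",       "aligned": True,  "compliant": True},
    {"name": "Ad-hoc reporting", "aligned": False, "compliant": False},
    {"name": "Partner portal",   "aligned": True,  "compliant": False},
    {"name": "HR self-service",  "aligned": True,  "compliant": True},
]

total = len(projects)
misaligned = sum(1 for p in projects if not p["aligned"])
non_compliant_pct = 100 * sum(1 for p in projects if not p["compliant"]) / total

print(f"Technology solutions not aligned with business strategy: {misaligned} of {total}")
print(f"Percent of non-compliant technology projects: {non_compliant_pct:.0f}%")
# Tracking these numbers period over period shows whether the EA function
# is driving them down, as discussed above.
```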

There is no one-size-fits-all in IT, and as such the measures indicated above need to be tailored to suit the organization and the role of IT within it.

References:

COBIT - an IT Governance framework from ISACA.

Developing a successful governance strategy - A best practice guide for decision makers in IT from The National Computing Center, UK