An Optumis White Paper




                                                 www.optumis.com




               A Paradigm for Integrated IT Systems Management




                                                    Sanjay Raina




  December, 2010
Copyright ©Optumis Ltd.   www.optumis.com

Contents
Introduction
Problem Statement
Current Practice
Optumis Concerto
Implementation
Business Benefits
Summary
References

Introduction
IT Systems Management tools and technologies continue to be crucial to efficient
delivery of IT services. The IT landscape is evolving all the time and there are
increasing demands placed on the management of IT. Systems Management tools have
failed to keep pace with these developments and often fail to deliver the
potential value as promised by the vendors. The main reason behind this is that
IT Systems Management tends to be disjointed and silo based. This white paper
presents a holistic approach to IT Systems and Service Management. The approach
advocates a declarative, data-driven framework for specifying management
structures and policy. This, combined with the notion of abstraction of
management data, results in an integrated paradigm that allows stakeholders to
make effective decisions about the complex IT infrastructure and applications in
a coherent and consistent way.

Problem Statement
Today businesses rely heavily on IT to keep the revenue streams flowing and to
run the day to day back office functions. It is therefore more critical than
ever that the tools and technologies that manage the IT infrastructure are
effective in delivering high levels of availability and productivity whilst
keeping the costs down.

The Systems Management tools in use today have their origins in the days of
distributed, client server technology. Since then the IT landscape has undergone
considerable evolution to N-tier, Service Oriented Application architectures and
more recently towards Cloud based utility computing delivery models. However, IT
Systems Management tools and practices have not kept pace and there have not
been corresponding improvements in Systems Management
technologies. While the management tools were adequate in the distributed
computing era of twenty years ago, today's infrastructure is considerably more
complex and more dynamic.

The Systems Management tools marketplace is dominated by the Big Four vendors,
each with an arsenal of products [1][2][3][4] aimed at covering the entire
Enterprise Management space and purported to solve the challenging problems of
end to end Systems Management. In practice, however, these tools fail to deliver
the promised value and have proved to be difficult and costly to implement and
integrate. Often, tools from the same vendor have poor integration capabilities
and the only thing they share with each other is the brand name!

When it comes to deploying Enterprise Management tools, typically organizations
implement these as silo functions [7] managed by several specialized teams.
There may be separate monitoring teams for servers, applications and databases.
Other teams might be responsible for managing desktops, network and security.
Yet more teams may be responsible for service desk and operations functions.
That there are so many different teams performing IT management functions is not
a problem in itself, but the fact that there is no coherent framework or
mechanism to orchestrate and coordinate the activities of these disjointed
functions creates a number of problems. Firstly, there is the problem of
different teams and tools using different proprietary databases and data formats
to store management data, with the result that integration, although not
impossible, is extremely laborious. Often, organizations have to embark upon
costly integration projects to solve particular problems that span multiple
functional areas. Secondly, each team or management function is driven by a
narrow goal of delivering their specific piece without any investment of thought
on how their piece impacts other functions. Thirdly, the disjointed approach
prevents the fostering of higher level abstractions that can create value simply
by cross-pollination of management information from different sources. IT
organizations spend a disproportionate amount of time just to keep things
ticking over and fire fighting, leaving very little resource to spend on adding
value to their service offering.

Current Practice
In the past, vendors have responded with more tightly integrated Framework based
products, such as the older generation of IBM Tivoli [5] and HP OpenView [3].
These products use an underlying layer built out of technologies such as CORBA
to improve the integration capabilities. The problem with such an approach is
the lack of flexibility and vendor lock-in that prevents the use of best of
breed tools. Consequently, the Framework tools have nowadays lost favor and most
vendors now espouse a best of breed approach.

More recently, organizations have sought to make use of process frameworks such
as ITIL [9] to improve efficiencies. Such frameworks provide the ability to
organize people, processes and technology so that IT management is optimized.
However, while some organizations have benefited from the adoption of such
practices, the benefits have been limited because, in most cases, it is seen as
an overlay of a governance framework and doesn't fundamentally alter the way
various technical components of IT Systems Management interact with each other.
What is required is a fundamental rethink of the
structure and organization of Enterprise Management tools and techniques.

Optumis Concerto
The problems and challenges faced by IT Systems Management are no different from
those faced in the past by other domains in IT. The complexity of the resources
to be managed and the tools that manage them is no different from the complexity
of the hardware and devices used in the platforms for developing and running
applications. Likewise, the problem of interoperability between Systems
Management tools and functions is no more severe than the problem of
communication in a distributed computing environment.

We have taken some of the most common techniques in computing that are in
widespread use and applied them to Enterprise Management. The intent is to
develop holistic solutions by taking a broader view of the Enterprise Management
problem rather than point solutions for specific problems.

Enterprise Management as an abstract machine
The principle of abstraction has long been used in computing as well as other
fields to solve the problem of complexity. Operating systems use abstraction to
hide details of the underlying hardware and provide a collection of common
interfaces and services that work on a variety of hardware through the use of
device drivers. Compilers take abstraction further by using an intermediate
target abstract machine (e.g. Java Virtual Machine). An interpreter or assembler
then converts the intermediate code to machine language. As a result, an
application developer can build increasingly rich applications without worrying
about the specifics and compatibility of underlying hardware or software. Before
the advent of operating systems and compilers, life was tough for the
application developer, as programming involved getting to grips with the
complexities and workings of the underlying hardware and devices. Abstraction is
also used in network communication, where successive layers of network protocols
provide a more application centric view of communicating endpoints than offered
by the underlying network components. Without networking protocols, developing
internetworked applications would mean having to know the details of all
components of the underlying network between two communication endpoints - an
impossible task.

Enterprise Management can benefit from abstraction in the same way that
application development and network programming have. An Enterprise Management
Abstract Machine (EMAM) allows higher level management applications and tools to
be built quickly, without intimate knowledge of the underlying domain specific
tools. Fig. 1 below shows layered application and network stacks along with a
comparable Enterprise Management stack.

FIGURE 1. ENTERPRISE MANAGEMENT STACK COMPARED WITH APPLICATION DEVELOPMENT AND
NETWORK PROTOCOL STACKS

Like the counterpart layers in the application and network stacks, each layer in
the Enterprise Management stack supports the use of
constructs that use appropriate abstraction mechanisms to ensure they are
independent of the specific characteristics of the layers below.

The lowest layer consists of element managers and can be compared to the device
driver layer in application and network stacks. It constitutes agents that
produce raw event data from instrumented applications and infrastructure
components, as well as agents that perform command and control of managed
resources. This layer abstracts the specifics of element managers into common
constructs that offer a tool-agnostic view of management data. For instance, one
could specify that a disk be monitored in abstract terms as:

monitor(DISK, PctSpaceUsed, 95, Critical, 90, Warning)

This construct is then translated by an interpreter to tool specific instruction
using the API of the specific tool. Management operations can now be specified
in common terms without worrying about the representation in underlying tools.
The tools can now be replaced or augmented without affecting the higher level
applications; all that is required is a change to the interpreter.

The same principle is applied to the subsequent layers shown above. The next
layer provides an aggregated view of the management data to applications. At
this layer, management data from multiple element managers can be combined to
provide a more analytical interpretation of the data. Note that the context is
still technology related. The next layer titled Business Service Management
consists of abstraction that provides a business and service centric view of
management data. The top most layer provides high level reports and dashboard
views as well as interfaces to data marts, primarily for the consumption of
business analysts and senior management.

Cooperating state machines
Contemporary Systems Management tools are programmed using policies and rules to
model the various activities within a management function. Unlike data
processing or determinate programming, Enterprise Management is essentially
reactive in nature and is best represented as an event-driven system. State
machines are an ideal way to implement complex event driven functions and are
characterized by a collection of states and state transitions that occur in
response to events. State machines are typically represented by a state
transition diagram as shown in Fig. 2 below.

FIGURE 2. A STATE TRANSITION DIAGRAM

Due to the large number of states and state transitions, representing a complex
end to end Enterprise Management system as a single state machine is an
impossible task. In the past, techniques such as State Charts [6] have been
developed to overcome the state explosion problem. We have used the concept of
cooperating state machines. Each state machine represents one Enterprise
Management function or process and they link together to form an end to end
model. Fig. 3 below shows a chain of state
machines representing the detection, reporting and resolution of a fault.

FIGURE 3. COOPERATING STATE MACHINES

Data manipulation
Efficient manipulation of data is an important aspect of any abstract or
physical machine. Operands are commonly used in an instruction set (of abstract
or real machine) and allow instructions to perform operations efficiently.
Management data tends to be passed around quite frequently amongst various
components in a Systems Management solution and it is essential that the data is
optimally formatted and structured. This is addressed by employing two other
concepts widely used in general computing: normalization and encapsulation.

Normalization refers to the transformation of the structure of the management
data into a canonical form. Without normalization, considerable effort has to be
expended in interpreting the data emanating from various management tools.

Encapsulation is most prominently used by the TCP/IP protocol suite to provide
abstraction of network protocols and services. As shown in Fig. 4 below, data
packets are encapsulated with headers at each layer. The transport layer adds a
TCP/UDP header to identify the source and destination access points. The IP
header adds routing information and the data link layer adds information about
the physical media. Each layer acts as a provider of service that is consumed by
the layer above.

Layer         Encapsulated unit
Application   Data
Transport     TCP Header + Data
Internet      IP Header + TCP Data
Link          Frame Header + Frame data

FIGURE 4. THE WELL KNOWN NETWORK PROTOCOL STACK

A similar approach can be applied to Enterprise Management data that is
successively enriched by layers in the stack. Each consuming layer enriches the
data further before providing it to the next layer.

FIGURE 5. EVENT MANAGEMENT DATA BEING SUCCESSIVELY ENRICHED

Enrichment involves filling in the missing information into the normalized event
format as described above. The information can be supplied by external
information sources such as the CMDB, an operational data store or the Incident
Management database.

The standardization and enrichment of management data provide a number of
benefits:
   • Management Systems tend to generate large volumes of data, much of which is
     noise. This data needs to be aggregated and correlated to pin-point the
     root cause. Standardization of management
     data formats plays a crucial role in this regard. The standardized format
     makes matching of events efficient. The rules for duplicate detection and
     suppression become simplified. Detection and prevention of event storms are
     also simplified. It is also possible to apply more granular rules, e.g. one
     can put very specific alerts from a particular resource or from a whole
     datacenter into maintenance.
   • Due to the added context information available, it is easier to perform
     business impact management. Enrichment of management data enables more
     accurate and automated processing of events within a management system.
     New, service impacting events can be generated based on location or service
     information from the CMDB or Incident information from the Incident
     database.
   • Management data often traverses a number of boundaries when various
     functions are performed. The information conveyed by the data is often
     interpreted by a multitude of systems and personnel. It helps a great deal
     if the information being passed around is consistently structured.

Declarative, data-driven programming
Most programs are written in an imperative paradigm where the developer
instructs the computer how to get a certain task done. In declarative
programming, on the other hand, the developer simply states what is to be
achieved, and leaves it up to the system to get the job done. XML and related
markup languages are examples of the declarative paradigm. Specialized
configuration files can also be considered declarative, and even though they are
not programming languages, they do enable computation based on what rather than
how. Another example of a declarative language is Prolog where programs are
specified as facts and rules in a knowledge base. An inference engine then
attempts to find solutions based on the rules and facts.

Enterprise Management tools are generally programmed in an imperative manner
using proprietary rule bases and databases. When implementing Enterprise
Management solutions, a significant amount of effort is spent on encoding the
control flow, i.e. specifying how particular tasks are to be accomplished. We
advocate a declarative approach where the emphasis is on what rather than the
how. So, rather than specifying how to monitor a disk in a particular tool,
using tool specific data structures, we can specify the monitoring parameters in
an abstract form, as shown below.

<DiskThresh>
       <Hostname>Ferrari</Hostname>
       <Diskname>C:</Diskname>
       <PctUsedWarn>90</PctUsedWarn>
       <PctUsedCrit>95</PctUsedCrit>
</DiskThresh>

A driver component then takes this abstract notation and converts it into tool
specific instructions. The declarative approach can be applied right across the
board. The fragment below shows how an alert matching certain criteria can be
specified to be routed to a resolver group.

<IncidentProfile>
       <hostname> Ferrari </hostname>
       <resource> DISK </resource>
       <threshname> PctUsed </threshname>
       <threshop> LessThan </threshop>
       <threshval> 95 </threshval>
       <resolver_group>GTI_GB_WENG</resolver_group>
       <priority>P2</priority>
       <scim>Server OS, EMEA Intel</scim>
</IncidentProfile>

Similarly, the fragment below shows how an alert matching certain criteria can
be specified to be suppressed during a maintenance window.

<MaintenanceMode>
       <hostname> Ferrari </hostname>
       <resource> DISK </resource>
       <threshname> PctUsed </threshname>
       <threshop> LessThan </threshop>
       <threshval> 95 </threshval>
       <suppressstart>
              3-Aug-2010 11:00:00
       </suppressstart>
       <suppressend>
              27-Dec-2010 12:00:00
       </suppressend>
       <suppressday>Sunday</suppressday>
       <suppresshour>05:00--11:00</suppresshour>
</MaintenanceMode>

This has a major advantage in that the policies and rules for management have to
be specified just once. The underlying tools can be replaced at any time without
having to rewrite the policies and rules for the new tool. Integration to
various tools is done via SOAP/WSDL or tool specific APIs.

Another advantage is that the management data in declarative form is easier to
understand and you don't need specialists in the different tools to manage and
maintain the management data. The data can be managed by a wider section of the
IT service delivery organization rather than just the specialists.

Finally, there is the advantage that all data can now be made available to
personnel, based on their role, for configuration and reporting purposes. A user
can update monitoring, maintenance windows, enrichment data, Incident resolver
group information, notifications calendar and calling tree, all in one place.

Implementation
Although the Enterprise Management Abstract Machine covers a wide range of
functions, its implementation is expected to be a veneer of software that runs
on top of existing tools and systems. We do not intend to reinvent well
established functions of Systems Management and most of the heavy lifting is
expected to be done by existing tools and systems. This section describes how
the key aspects described in the previous section can be realized. Two scenarios
are outlined to demonstrate the use of the concepts discussed.

In common with other abstract (and indeed physical) machines, the operation of
EMAM is characterized by:
   • A workflow component that executes the logic of the computation being
     performed. This may take the form of a program of instructions compiled and
     then processed by a CPU or, in the case of an operating system, a sequence
     of processes being scheduled from a work queue. In the case of EMAM,
     program execution takes the form of a sequence of state machines.
   • Operands used by the instructions in a program. These take the form of
     local storage (registers or stack) in conventional machines. In the EMAM
     these operands are typically alert data, incident records, change records
     etc.
   • Reference data is used by the workflow component. This is usually general
     purpose storage in a conventional machine where the results of computation
     are stored. In the EMAM, the reference data is typically operational
     configuration data stored in a configuration/operational data store.

State-machine Workflows
A program in the EMAM is expressed in the form of a workflow of state machines.
The control of the program propagates through state machines, with one state
machine triggering another. The programming of the EMAM takes place by
specifying the sequence in which these state machines are triggered. The program
is specified declaratively, in the form of a database table or an XML based
markup. Table 1 below shows a sample workflow specification.

TABLE 1. STATE TRANSITION TABLE
State Machine  Current State  Event Condition  Next State  Next State Machine
Monitor        Idle           Check            Checking    Monitor
Monitor        Checking       Breach           Alerted     Monitor
Monitor        Checking       NotBreached      Idle        Monitor
Monitor        Alerted        DupDetected      Duplicate   Monitor
Monitor        Alerted        NotDup           Unique      Normalize
Monitor        Duplicate      Drop             Idle        Monitor
Normalize

The above specification can be represented in an XML markup such as XAML. XAML
is used by the Microsoft Workflow Foundation product [9]. The fragment below
shows a XAML representation of a state machine.

<StateMachineWorkflowActivity
   x:Class="EMAMWorkflow.Monitor"
   Name="Monitor"
   InitialStateName="Idle"
   xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow"
   xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
   <StateActivity x:Name="Idle">
      <EventDrivenActivity x:Name="CheckThreshold">
         <HandleExternalEventActivity
            x:Name="HandleCheckThreshold"
            EventName="Check">
         </HandleExternalEventActivity>
         <CodeActivity
            x:Name="DoCheckCode"
            ExecuteCode="DoCheckCode_ExecuteCode">
         </CodeActivity>
         <SetStateActivity
            x:Name="SetChecking"
            TargetStateName="Checking">
         </SetStateActivity>
      </EventDrivenActivity>
   </StateActivity>
   …………
</StateMachineWorkflowActivity>

The above XAML code can be loaded directly into Microsoft Workflow Foundation to
appear as in Fig. 6 below.
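
The transition table in Table 1 can also be interpreted directly by a small
table-driven engine. The following is a minimal Python sketch of that idea, not
the Optumis implementation. The rows are copied from Table 1 with event names
written without spaces, the hand-off target is assumed to be the Normalize state
machine, and the names TRANSITIONS, step and run are hypothetical.

```python
# Table 1 as data: (state machine, current state, event) -> (next state, next state machine)
TRANSITIONS = {
    ("Monitor", "Idle", "Check"): ("Checking", "Monitor"),
    ("Monitor", "Checking", "Breach"): ("Alerted", "Monitor"),
    ("Monitor", "Checking", "NotBreached"): ("Idle", "Monitor"),
    ("Monitor", "Alerted", "DupDetected"): ("Duplicate", "Monitor"),
    ("Monitor", "Alerted", "NotDup"): ("Unique", "Normalize"),  # hand-off (assumed target)
    ("Monitor", "Duplicate", "Drop"): ("Idle", "Monitor"),
}


def step(machine, state, event):
    """Look up one transition; returns (next_state, next_machine)."""
    try:
        return TRANSITIONS[(machine, state, event)]
    except KeyError:
        raise ValueError(f"no transition for {machine}/{state} on {event}") from None


def run(machine, state, events):
    """Feed events to one state machine and record each transition taken.

    Processing stops when a transition hands control to another state
    machine, which is how the machines cooperate in this sketch.
    """
    trace = []
    for event in events:
        next_state, next_machine = step(machine, state, event)
        trace.append((machine, state, event, next_state, next_machine))
        if next_machine != machine:
            break
        state = next_state
    return trace


if __name__ == "__main__":
    for row in run("Monitor", "Idle", ["Check", "Breach", "NotDup"]):
        print(row)
```

Because the workflow lives in a dictionary rather than in control flow, adding
or rerouting a step means changing a row of data, which is the property the
declarative specification is meant to provide.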
FIGURE 7. OPERATIONS MANAGEMENT DATA BEING
                                                 TRANSLATED TO TOOL SPECIFIC DATA

                                                 Scenario 1: Fault detection and
                                                   resolution
FIGURE 6. XAML BASED WORKFLOW IN MICROSOFT       The following scenario describes a fault being
WORKFLOW FOUNDATION
                                                 detected by the monitoring system, a problem
Operands                                         ticket being cut and then the problem
Management data that is processed by state machines comes in various types.
Examples include:
    • Alert
    • Incident
    • Change record
    • Service request
    • Provisioning
As the state machine workflow progresses, the operands are transformed into
more specific and more relevant management information.

Reference data

In addition to the operands, the state machines also use more permanent,
reference data. Such data may include IT Service Management data in a CMDB or
more transient data in another operational data store. The operational
management data store contains the declarative data previously mentioned and
meta data that may be mapped on, directly or indirectly, to tool specific data
as shown in Figure 7 below.

remediated following a change management procedure. Each step corresponds to a
state machine and describes the operation performed along with the operand and
the reference data used.

Table 2 below outlines the sequence of state machines that constitute this
scenario, followed by the description of activities performed in each step.
Note that the activities within each workflow may be automated or manual.

TABLE 2. STATE MACHINE WORKFLOW FOR FAULT DETECTION AND RESOLUTION
State Machine                   Operand                             Ref Data
Detect fault                    Alert                               Thresholds, Alert history
Normalize                       Alert                               Normalization map
Correlate                       Alert                               Alert history
Enrichment                      Alert                               CMDB, OMDB
Problem ticket                  Alert, Problem Ticket               Alert Ticket mapping
Escalate                        Problem notification                Escalation table
Notification                    Problem Ticket, Page, Email, SMS    Notification data, Call tree
Change Request                  Change Ticket                       CMDB
Update Monitoring
Close Change, Problem Tickets   Problem, Change Tickets
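
The first few rows of Table 2 can be sketched as a chain of small, data-driven steps. This is only an illustrative sketch, not the Concerto implementation: the function names, the dictionary layout and the location and service values are assumptions made for the example, while the DISK thresholds (90 for Warning, 95 for Critical) and the host name Ferrari follow the examples used earlier in this paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical reference data. In the paper this would be held in the
# operational management data store and the CMDB.
THRESHOLDS = {("DISK", "PctSpaceUsed"): {"Warning": 90, "Critical": 95}}
CMDB = {"ferrari": {"location": "London", "service": "Payroll"}}


@dataclass
class Step:
    name: str                               # state machine name, as in Table 2
    run: Callable[[dict], Optional[dict]]   # returns None to drop the operand


def detect_fault(sample: dict) -> Optional[dict]:
    """Compare a sampled metric against the thresholds table."""
    levels = THRESHOLDS.get((sample["resource"], sample["metric"]), {})
    # Check the highest limit first so the most severe level wins.
    for severity, limit in sorted(levels.items(), key=lambda kv: -kv[1]):
        if sample["value"] >= limit:
            return {**sample, "severity": severity}
    return None  # no threshold breached, so no alert is raised


def normalize(alert: dict) -> dict:
    """Put tool specific field values into one canonical form."""
    return {**alert, "hostname": alert["hostname"].strip().lower()}


def enrich(alert: dict) -> dict:
    """Add location and service context from the reference data."""
    return {**alert, **CMDB.get(alert["hostname"], {})}


WORKFLOW = [
    Step("Detect fault", detect_fault),
    Step("Normalize", normalize),
    Step("Enrichment", enrich),
]


def run_workflow(operand: dict) -> Optional[dict]:
    """Pass the operand through each state machine in turn."""
    for step in WORKFLOW:
        operand = step.run(operand)
        if operand is None:
            return None
    return operand
```

Because the thresholds and the CMDB entries are plain data, changing a threshold or adding a host in this sketch needs no change to the step functions, which is the point of the declarative approach.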
Detect fault
A monitoring tool samples metrics from a resource and compares them against the
thresholds table. An alert is generated if a threshold condition is breached.
Thresholds are specified in a database and may be made available synchronously
within the tool, for instance via a configuration file or programmatically
through tool specific APIs. The generated event follows a tool specific format.
The figure below shows fields from an event generated by an agent and those
from an SNMP trap, both depicting the same fault.

FIGURE 8. TOOL SPECIFIC ALERT FIELDS

Normalize
In this step, the event fields are normalized into a canonical form. The idea
is that no matter what tool or method is used to detect the fault, its
representation is the same, in an abstract form and not dependent on the
underlying tool. Table 3 below shows alert data in normalized form.

TABLE 3. NORMALIZED ALERT DATA

The fields can now be used consistently to perform matching and analytics at
various levels. These fields serve as a key in matching against the different
types of management data.

Correlation and root cause analysis
In this step, alerts are consolidated and the root cause is determined.
Existing alerts are searched to determine if this is a repeat symptom of an
underlying problem. Lookup tables are used to perform correlation. Fields
defined previously are used to perform the match against the lookup tables.
Also, maintenance windows may be consulted at this stage using the same
matching criteria, and the alert is dropped if the matched resource is under
maintenance.

Enrichment
In this step, additional fields are added to the alert. These fields can be
technology related to assist operations or business related to add business
context. Table 4 below shows the location and service fields enriched.

TABLE 4. ENRICHED ALERT DATA

Create Problem Ticket
Based on the mapping table, and using the alert data, a new problem ticket is
created. The problem ticket follows a standard form just like the alert, to
ensure consistency. Whether an alert was generated automatically as above or
the ticket created manually by a Service Desk operator, the representation
should be the same. The problem ticket now forms the basis for tracking the
alert and is used when performing escalation etc.

Escalation
Escalation is a core function of the Incident Management process. The
escalation function
can be performed on the problem ticket using escalation data in a table such as
below.

TABLE 5. ESCALATION TABLE
Priority          P1                      P2                            P3                   P4
Level of Impact   High, Critical, Fatal   Production severely impacted  Degraded operations  Minimal impact
2 hrs             First response
4 hrs             Work around             First response
24 hrs            Mgmt notification       Work around                   First response
48 hrs                                    Mgmt notification             Work around          First response
1 wk              Resolution                                            Mgmt notification    Work around
2 wks                                     Resolution
3 wks                                                                   Resolution
Release                                                                                      Resolution

Notification
Based on a calling tree and calendar information, the problem ticket can
generate notifications.

Create Change Ticket
Once the right personnel have been notified and the resolution identified, a
change record is created to perform the change. In our example here, the change
involves a change to the monitoring thresholds, as the alert was deemed to be
spurious. The change request follows the change management process, including
appropriate reviews, approvals and assignment of change implementers.

Update monitoring threshold
Once the change has been approved and the implementer notified, the monitoring
threshold is updated in the database. Note that no change has been made to the
monitoring tool or any rule sets, and such a change can be performed by a
non-specialist since it is a simple data change.

Close change, incident and alert
The final step in this sequence is to mark the Change as implemented and close
the Incident. The workflow will automatically close or clear the alert in the
monitoring tools.

Scenario 2: VM server provisioning

This scenario depicts another common situation of requesting and provisioning a
virtual server. As with the previous scenario, the management solution consists
of a series of state machines.

The state machine workflow is outlined in Table 6 below with the associated
operand and reference data.

TABLE 6. STATE MACHINE WORKFLOW FOR VM SERVER PROVISIONING
State Machine                   Operand                           Ref Data
Create Request                  Service Request
Check Service Catalog           Service Request                   CMDB, Service Catalog
Check capacity                  Service Request                   CMDB
Create Change Request           Service Request, Change Ticket    CMDB
Provision VM                    Change Ticket                     CMDB
Close Change, Service Request   Change Ticket, Service Request

Create request
A Service Request is created manually by a requester. As before, the request is
turned into a standardized form so as to make its processing easier. This state
machine workflow routes the request to individuals in the organization for
action, alerts the manager as necessary when the current owner does not respond
to the request, and escalates or transfers the request to the next level of
support. At this stage only a few fields such as request number and request
owner are populated in the service request.

Supplement information from Service Catalog
This step looks up the Service Catalog to fill in the details about the
servers. This step is comparable to the Enrichment step in the previous
scenario. Additional attributes include response deadline, server asset data
etc.
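
The Service Catalog lookup can be pictured as a small enrichment function over the standardized request. The sketch below is a hedged illustration only: the catalog item name, the attribute names and the 48 hour response time are hypothetical values chosen for the example, not data from an actual Service Catalog.

```python
from datetime import datetime, timedelta

# Hypothetical Service Catalog entries keyed by catalog item.
SERVICE_CATALOG = {
    "vm-small": {
        "cpu_cores": 2,
        "memory_gb": 4,
        "storage_gb": 50,
        "response_hours": 48,
    },
}


def supplement_request(request: dict, now: datetime) -> dict:
    """Return a copy of the request with Service Catalog details added."""
    item = request["catalog_item"]
    entry = SERVICE_CATALOG.get(item)
    if entry is None:
        raise KeyError(f"unknown catalog item: {item}")
    # Everything except the response time describes the server itself.
    spec = {k: v for k, v in entry.items() if k != "response_hours"}
    deadline = now + timedelta(hours=entry["response_hours"])
    return {**request, "server_spec": spec, "response_deadline": deadline}
```

A request that initially carries only a request number, an owner and a catalog item comes out of this step with the server details and a response deadline, ready for the capacity check that follows.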
Check capacity
Once the service request is sufficiently qualified, the next step checks that
there is adequate capacity on the physical infrastructure. Checks are performed
to determine CPU, Memory and Storage capacity, and appropriate personnel are
notified, if necessary.

Create and manage Change Ticket
Once the right personnel have been notified and the checks performed, a Change
Ticket is created to perform the change.

Provision VM
This is essentially a manual step, in which the implementer creates the Virtual
Machine.

Close change and service request
The final step in this sequence is to mark the Change as implemented and close
the corresponding Service Request.

Business Benefits

The integrated approach to Enterprise Systems Management provides a number of
key related benefits to the business.
    • The solution enables optimal use of technology and human resources to
      deliver significant cost reduction in managing IT systems.
    • Standardisation and systematic reuse of processes and procedures leads to
      increased automation and efficient practice.
    • The solution significantly improves productivity, allowing support staff
      to improve service delivery and add value rather than constantly fire
      fighting.

Summary

The approach described in this white paper is based on ideas and principles
widely used in general computing to overcome the problem of complexity and
inter-operability. The approach results in a more holistic solution to the
problem of Enterprise Management. The concept of an Enterprise Management
Abstract Machine is presented that utilizes state machine workflows and
declarative, data-driven programming to decouple management procedures and data
from the underlying tools. Such an approach results in a federated management
model that enables optimal use of people, processes and technology. Management
applications and processes can be implemented quickly and efficiently, without
getting bogged down by the mechanics of the tools.

References
[1]    BMC Patrol, http://www.bmc.com/products/product-listing/ProactiveNet-Performance-Management.html
[2]    CA, http://www.ca.com/us/products.aspx
[3]    HP OpenView Operations, https://h10078.www1.hp.com/cda/hpms/display/main/hpms_home.jsp?zn=bto&cp=1_4011_100
[4]    IBM Tivoli, http://www.ibm.com/software/tivoli
[5]    IBM, Tivoli Management Framework, http://www-01.ibm.com/software/tivoli/products/mgt-framework
[6]    D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming 8:231-274. North-Holland 1987.
[7]    Macehiter Ward-Dutton, The New Face of IT Service Management, 2007.
[8]    Microsoft System Center Operations Manager, http://www.microsoft.com/systemcenter/en/us/operations-manager.aspx
[9]    Microsoft Windows Workflow Foundation, http://msdn.microsoft.com/en-us/library/ms735921(VS.90).aspx
[10]   Office of Government Commerce: Best Management Practice, IT Service Management, http://www.best-management-practice.com/IT-Service-Management-ITIL

</suppressend> <suppressday>Sunday</suppressday> In common with other abstract (and indeed <suppresshour>05:00-- physical) machines, the operation of EMAM is 11:00</suppresshour> characterized by: </MaintenanceMode> • A workflow component that executes the logic of the computation being This has a major advantage in that the policies performed. This may take the form of a and rules for management have to be program of instructions compiled and specified just once. The underlying tools can then processed by a CPU or, in the case of be replaced at any time without having to an operating system a sequence of rewrite the policies and rules for the new processes being scheduled from a work tool. Integration to various tools is done via queue. In the case of EMAM, program SOAP/WSDL or tool specific APIs. execution takes the form of a sequence of state machines. Another advantage is that the management data in declarative form is easier to
• Operands used by the instructions in a program. These take the form of local storage (registers or stack) in conventional machines. In the EMAM these operands are typically alert data, incident records, change records etc.
• Reference data is used by the workflow component. This is usually general purpose storage in a conventional machine where the results of computation are stored. In the EMAM, the reference data is typically operational configuration data stored in a configuration/operational data store.

State-machine Workflows

A program in the EMAM is expressed in the form of a workflow of state machines. The control of the program propagates through state machines, with one state machine triggering another. The programming of the EMAM takes place by specifying the sequence in which these state machines are triggered. The program is specified declaratively, in the form of a database table or an XML based markup. Table 1 below shows a sample workflow specification.

TABLE 1. STATE TRANSITION TABLE

State Machine | Current State | Event Condition | Next State | Next State Machine
Monitor | Idle | Check | Checking | Monitor
Monitor | Checking | Breach | Alerted | Monitor
Monitor | Checking | NotBreached | Idle | Monitor
Monitor | Alerted | DupDetected | Duplicate | Monitor
Monitor | Alerted | Not Dup | Unique | Normalized
Monitor | Duplicate | Drop | Idle | Monitor
Normalize

The above specification can be represented in an XML markup such as XAML. XAML is used by the Microsoft Workflow foundation product [9]. The fragment below shows a XAML representation of a state machine.

    <StateMachineWorkflowActivity
        x:Class="EMAMWorkflow.Monitor"
        Name="Monitor"
        InitialStateName="Idle"
        xmlns="http://schemas.microsoft.com/winfx/2006/xaml/workflow"
        xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml">
      <StateActivity x:Name="Idle">
        <EventDrivenActivity x:Name="CheckThreshold">
          <HandleExternalEventActivity x:Name="HandleCheckThreshold" EventName="Check">
          </HandleExternalEventActivity>
          <CodeActivity x:Name="DoCheckCode" ExecuteCode="DoCheckCode_ExecuteCode">
          </CodeActivity>
          <SetStateActivity x:Name="SetChecking" TargetStateName="Checking">
          </SetStateActivity>
        </EventDrivenActivity>
      </StateActivity>
      …………
    </StateMachineWorkflowActivity>

The above XAML code can be loaded directly into Microsoft Workflow Foundation to appear as in Fig. 6 below.
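As a rough illustration of how a transition table such as Table 1 can drive execution, the following Python sketch encodes rows as data and steps through them with a small interpreter. It is not the paper's implementation, which targets Windows Workflow Foundation. The `TRANSITIONS` dictionary and the `run` function are names invented for this example, and only rows whose columns are unambiguous in Table 1 are included.

```python
# Hypothetical sketch: a subset of Table 1 encoded as data.
# Key:   (state machine, current state, event condition)
# Value: (next state, next state machine)
TRANSITIONS = {
    ("Monitor", "Idle", "Check"): ("Checking", "Monitor"),
    ("Monitor", "Checking", "Breach"): ("Alerted", "Monitor"),
    ("Monitor", "Checking", "NotBreached"): ("Idle", "Monitor"),
    ("Monitor", "Alerted", "DupDetected"): ("Duplicate", "Monitor"),
    ("Monitor", "Duplicate", "Drop"): ("Idle", "Monitor"),
}


def run(machine, state, events):
    """Feed events through the table and return the (machine, state) trace."""
    trace = [(machine, state)]
    for event in events:
        key = (machine, state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition defined for {key}")
        state, machine = TRANSITIONS[key]
        trace.append((machine, state))
    return trace


# A duplicate alert: checked, breached, recognised as a duplicate, dropped.
trace = run("Monitor", "Idle", ["Check", "Breach", "DupDetected", "Drop"])
for step in trace:
    print(step)
```

In a design like this, changing the workflow means editing rows in `TRANSITIONS` while the interpreter stays untouched, which is the separation the declarative approach is after.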
FIGURE 6. XAML BASED WORKFLOW IN MICROSOFT WORKFLOW FOUNDATION

Operands

Management data that is processed by state machines comes in various types. Examples include:
• Alert
• Incident
• Change record
• Service request
• Provisioning

As the state machine workflow progresses, the operands are transformed into more specific and more relevant management information.

Reference data

In addition to the operands, the state machines also use more permanent, reference data. Such data may include IT Service Management data in a CMDB or more transient data in another operational data store. The operational management data store contains the declarative data previously mentioned and meta data that may be mapped on, directly or indirectly, to tool specific data as shown in Figure 7 below.

FIGURE 7. OPERATIONS MANAGEMENT DATA BEING TRANSLATED TO TOOL SPECIFIC DATA

Scenario 1: Fault detection and resolution

The following scenario describes a fault being detected by the monitoring system, a problem ticket being cut and then the problem remediated following a change management procedure. Each step corresponds to a state machine and describes the operation performed along with the operand and the reference data used.

Table 2 below outlines the sequence of state machines that constitute this scenario, followed by the description of activities performed in each step. Note that the activities within each workflow may be automated or manual.

TABLE 2. STATE MACHINE WORKFLOW FOR FAULT DETECTION AND RESOLUTION

State Machine | Operand | Ref Data
Detect fault | Alert | Thresholds, Alert history
Normalize | Alert | Normalization map
Correlate | Alert | Alert history
Enrichment | Alert | CMDB, OMDB
Problem ticket | Alert, Problem Ticket | Alert Ticket mapping
Escalate | Problem notification | Escalation table
Notification | Problem Ticket, Page, Email, SMS | Notification data, Call tree
Change Request | Change Ticket | CMDB
Update Monitoring
Close Change, Problem Tickets | Problem, Change Tickets
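The way each step combines an operand with reference data can be sketched in Python as follows. This is an illustrative sketch only: it covers two of the steps in Table 2 (Detect fault and Enrichment), reuses the DiskThresh fragment shown earlier as threshold reference data, and assumes that a metric at or above a threshold counts as a breach. The function names, the dictionary layout and the CMDB values (location, service) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Declarative threshold taken from the DiskThresh fragment in the paper.
DISK_THRESH_XML = """
<DiskThresh>
  <Hostname>Ferrari</Hostname>
  <Diskname>C:</Diskname>
  <PctUsedWarn>90</PctUsedWarn>
  <PctUsedCrit>95</PctUsedCrit>
</DiskThresh>
"""


def load_threshold(xml_text):
    """Turn the abstract XML notation into a plain dictionary."""
    node = ET.fromstring(xml_text.strip())
    return {
        "hostname": node.findtext("Hostname"),
        "disk": node.findtext("Diskname"),
        "warn": float(node.findtext("PctUsedWarn")),
        "crit": float(node.findtext("PctUsedCrit")),
    }


def detect_fault(sample, ref):
    """Operand: a metric sample. Reference data: thresholds. Returns an alert or None."""
    t = ref["threshold"]
    if (sample["hostname"], sample["disk"]) != (t["hostname"], t["disk"]):
        return None
    if sample["pct_used"] >= t["crit"]:  # assumed breach semantics
        return {**sample, "severity": "CRITICAL"}
    if sample["pct_used"] >= t["warn"]:
        return {**sample, "severity": "WARNING"}
    return None


def enrich(alert, ref):
    """Operand: an alert. Reference data: a CMDB-like lookup keyed by hostname."""
    return {**alert, **ref["cmdb"].get(alert["hostname"], {})}


WORKFLOW = [detect_fault, enrich]  # order follows Table 2


def run_workflow(operand, ref):
    """Pass the operand through each step; stop if a step yields nothing."""
    for step in WORKFLOW:
        operand = step(operand, ref)
        if operand is None:
            return None
    return operand


REF_DATA = {
    "threshold": load_threshold(DISK_THRESH_XML),
    "cmdb": {"Ferrari": {"location": "London", "service": "Payroll"}},  # invented
}

alert = run_workflow({"hostname": "Ferrari", "disk": "C:", "pct_used": 96.0}, REF_DATA)
print(alert)
```

Because the thresholds and the CMDB data are passed in as reference data, the same steps can be reused with a different data store without changing their logic.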
Detect fault

A monitoring tool samples metrics from a resource and compares against the thresholds table. An alert is generated if a threshold condition is breached. Thresholds are specified in a database and may be made available synchronously within the tool, for instance via configuration file or programmatically through tool specific APIs. The generated event follows a tool specific format. The figure below shows fields from an event generated by an agent and those from an SNMP trap, both depicting the same fault.

FIGURE 8. TOOL SPECIFIC ALERT FIELDS

Normalize

In this step, the event fields are normalized into a canonical form. The idea is that no matter what tool or method is used to detect the fault, its representation is the same, in an abstract form and not dependent on the underlying tool. Table 3 below shows alert data in normalized form.

TABLE 3. NORMALIZED ALERT DATA

The fields can now be used consistently to perform matching and analytics at various levels. These fields serve as a key in matching against the different types of management data.

Correlation and root cause analysis

In this step, alerts are consolidated and root cause determined. Existing alerts are searched to determine if this is a repeat symptom of an underlying problem. Lookup tables are used to perform correlation. Fields defined previously are used to perform the match against the lookup tables. Also, maintenance windows may be consulted at this stage using the same matching criteria and the alert dropped if the matched resource is under maintenance.

Enrichment

In this step, additional fields are added to the alert. These fields can be technology related to assist operations or business related to add business context. Table 4 below shows the location and service fields enriched.

TABLE 4. ENRICHED ALERT DATA

Create Problem Ticket

Based on the mapping table, and using the alert data, a new problem ticket is created. The problem ticket follows a standard form just like the alert, to ensure consistency. Whether an alert was generated automatically as above or the ticket created manually by a Service Desk operator, the representation should be the same. The problem ticket now forms the basis for tracking the alert and is used when performing escalation etc.

Escalation

Escalation is a core function of the Incident Management process. The escalation function
can be performed on the problem ticket using escalation data in a table such as below.

TABLE 5. ESCALATION TABLE

Priority | P1 | P2 | P3 | P4
Level of Impact | High, Critical, Fatal | Production severely impacted | Degraded operations | Minimal impact
2 hrs | First response
4 hrs | Work around | First response
24 hrs | Mgmt notification | Work around | First response
48 hrs | Mgmt notification | Work around | First response
1 wk | Resolution | Mgmt notif | Work around
2 wks | Resolution
3 wks | Resolution
Release | Resolution

Notification

Based on a calling tree and calendar information the problem ticket can generate notifications.

Create Change Ticket

Once the right personnel have been notified and the resolution identified, a change record is created to perform the change. In our example here, the change involves a change to the monitoring thresholds as it was deemed to be a spurious alert. The change request follows the change management process, including appropriate reviews, approvals and assignment of change implementers.

Update monitoring threshold

Once the change has been approved and implementer notified, the monitoring threshold is updated in the database. Note that no change has been made to the monitoring tool or any rule sets, and such a change can be performed by a non-specialist since it is a simple data change.

Close change, incident and alert

The final step in this sequence is to mark the Change as implemented and close the Incident. The workflow will automatically close or clear the alert in the monitoring tools.

Scenario 2: VM server provisioning

This scenario depicts another common situation of requesting and provisioning a virtual server. As with the previous scenario, the management solution consists of a series of state machines. The state machine workflow is outlined in Table 6 below with the associated operand and reference data.

TABLE 6. STATE MACHINE WORKFLOW FOR VM SERVER PROVISIONING

State Machine | Operand | Ref Data
Create Request | Service Request
Check Service Catalog | Service Request | CMDB, Service Catalog
Check capacity | Service Request | CMDB
Create Change Request | Service Request, Change Ticket | CMDB
Provision VM | Change Ticket | CMDB
Close Change, Service Request | Change Ticket, Service Request

Create request

A Service Request is created manually by a requester. As before, the request is turned into a standardized form so as to make its processing easier. This state machine workflow routes the request to individuals in the organization for action, alerts the manager as necessary when the current owner does not respond to the request, and escalates or transfers the request to the next level of support. At this stage only a few fields such as request number and request owner are populated in the service request.

Supplement information from Service Catalog

This step looks up the Service Catalog to fill in the details about the servers. This step is comparable to the Enrichment step in the previous scenario. Additional attributes include response deadline, server asset data etc.
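The catalog lookup described above might look roughly like the Python sketch below. It is a minimal illustration under stated assumptions: the catalog item name, its attributes, the `response_hours` field and the `supplement` function are all invented for the example and do not come from any particular Service Catalog product.

```python
from datetime import datetime, timedelta

# Hypothetical Service Catalog; item names and fields are invented.
SERVICE_CATALOG = {
    "vm-small": {
        "cpu_cores": 2,
        "memory_gb": 4,
        "storage_gb": 50,
        "response_hours": 24,
    },
}


def supplement(request, catalog):
    """Return a copy of the request with catalog attributes and a response deadline."""
    entry = catalog[request["catalog_item"]]
    deadline = request["created"] + timedelta(hours=entry["response_hours"])
    server_spec = {k: v for k, v in entry.items() if k != "response_hours"}
    return {**request, **server_spec, "response_deadline": deadline}


# At this stage only a few fields of the request are populated.
request = {
    "request_number": "SR-0001",
    "request_owner": "requester@example.com",
    "catalog_item": "vm-small",
    "created": datetime(2010, 12, 1, 9, 0),
}

qualified = supplement(request, SERVICE_CATALOG)
print(qualified["response_deadline"], qualified["cpu_cores"])
```

The original request is left unmodified, so the workflow can keep both the submitted and the qualified form of the operand.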
Check capacity

Once the service request is sufficiently qualified, the next step checks there is adequate capacity on the physical infrastructure. Checks are performed to determine CPU, Memory and Storage capacity and appropriate personnel notified, if necessary.

Create and manage Change Ticket

Once the right personnel have been notified and the checks performed, a Change Ticket is created to perform the change.

Provision VM

This is essentially a manual step, in which the implementer creates the Virtual Machine.

Close change and service request

The final step in this sequence is to mark the Change as implemented and close the corresponding Service Request.

Business Benefits

The integrated approach to Enterprise Systems Management provides a number of key related benefits to the business.
• The solution enables optimal use of technology and human resources to deliver significant cost reduction in managing IT systems.
• Standardisation and systematic reuse of processes and procedures leads to increased automation and efficient practice.
• The solution significantly improves productivity, allowing support staff to improve service delivery and add value rather than constantly fire fighting.

Summary

The approach described in this white paper is based on ideas and principles widely used in general computing to overcome the problem of complexity and inter-operability. The approach results in a more holistic solution to the problem of Enterprise Management. A concept of Enterprise Management Abstract Machine is presented that utilizes state machine workflows and declarative, data-driven programming to decouple management procedures and data from the underlying tools. Such an approach results in a federated management model that enables optimal use of people, processes and technology. Management applications and processes can be implemented quickly and efficiently, without getting bogged down by the mechanics of the tools.

References

[1] BMC Patrol, http://www.bmc.com/products/product-listing/ProactiveNet-Performance-Management.html
[2] CA, http://www.ca.com/us/products.aspx
[3] HP OpenView Operations, https://h10078.www1.hp.com/cda/hpms/display/main/hpms_home.jsp?zn=bto&cp=1_4011_100
[4] IBM Tivoli, http://www.ibm.com/software/tivoli
[5] IBM, Tivoli Management Framework, http://www-01.ibm.com/software/tivoli/products/mgt-framework
[6] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming 8:231-274. North-Holland 1987.
[7] Macehiter Ward-Dutton, The New Face of IT Service Management, 2007.
[8] Microsoft System Center Operations Manager, http://www.microsoft.com/systemcenter/en/us/operations-manager.aspx
[9] Microsoft Windows Workflow Foundation, http://msdn.microsoft.com/en-us/library/ms735921(VS.90).aspx
[10] Office of Government Commerce: Best Management Practice, IT Service Management, http://www.best-management-practice.com/IT-Service-Management-ITIL