Debs 2011 tutorial on non functional properties of event processing

Non Functional Properties of Event Processing Presenters: Opher Etzion and Tali Yatzkar-Haham Participated in the preparation: Ella Rabinovich and Inna Skarbovsky

Introduction to non functional properties of event processing

The variety There are variety of cheesecakes There are many systems that conceptually look like EPN, but they are different in non functional properties

Two examples Very large network management: Millions of events every minute; Very few are significant, same event is repeated. Time windows are very short. Patient monitoring according to medical Treatment protocol : Sporadic events, but each is meaningful, time windows can span for weeks. Both of them can be implemented by event Processing – but very differently.

Agenda Introduction to Non functional properties of event processing Performance and scalability considerations Availability considerations Usability considerations Security and privacy considerations Summary I II III IV V VI

Performance and Scalability Considerations

Performance benchmarks There is a large variance among applications, thus a collection of benchmarks should be devised, and each application should be classified to a benchmark Some classification criteria: Application complexity Filtering rate Required Performance metrics

Performance benchmarks – cont. Adi A., Etzion O. Amit - the situation manager. The VLDB Journal – The International Journal on Very Large Databases. Volume 13 Issue 2, 2004. Mendes M., Bizarro P., Marques P. Benchmarking event processing systems: current state and future directions. WOSP/SIPEW 2010: 259-260 . Previous studies ‎indicate that there is a major performance degradation as application complexity increases.

Previous studies ‎indicate that there is a major performance degradation as application complexity increases  a single performance measure (e.g., event/s) is not good enough. Example for event processing system benchmark: Scenario 1: an empty scenario (upper bound on the performance) Scenario 2: low percentage of event instances is filtered in, agents are simple Scenario 3: low percentage of event instances is filtered in, agents are complex Scenario 4: high percentage of event instances is filtered in, agents are complex Some benchmarks scenarios Adi A., Etzion O. Amit - the situation manager. The VLDB Journal – The International Journal on Very Large Databases. Volume 13 Issue 2, 2004. 100000 100000 100000 100000 total external events 16503 7903 scenario 3 124319 1742 1372 accumulated latency (ms) 1923 57470 72887 throughput (event/s) scenario 4 scenario 2 scenario 1

Performance indicators One of the sources of variety Observations: The same system provides extremely different behavior based on type of functions employed Different application may require different metrics

Throughput Input throughput output throughput Processing throughput Measures: number of input events that the system can digest within a given time interval Measures: Total processing times / # of event processed within a given time interval Measures: # of events that were emitted to consumers within a given time interval

Latency latency In the E2E level it is defined as the elapsed time FROM the time-point when the producer emits an input event TO the time-point when the consumer receives an output event The latency definition But – input event may not result in output event: It may be filtered out, participate in a pattern but does not result in pattern detection, or participates in deferred operation (e.g. aggregation) Similar definitions for the EPA level, or path level

Latency definition – two variations: Producer 1 Producer 2 Producer 3 EPA Detecting Sequence (E1,E2,E3) within Sliding window of 1 hour E1 E2 E3 Consumer 11:00 12:00 11:10 11:15 11:30 E1 E2 E3 11:40 E2 Variation I: We measure the latency of E3 only Variation II: We measure the Latency of each event; for events that don’t create derived events directly, we measure the time until the system finishes processing them

Performance goals and metrics Multi-objective optimization function: min(  *avg latency + (1-  )*(1/thoughput)) Max throughput All/ 80% have max/avg latency < δ All/ 90% of time units have throughput > Ω minmax latency minavg latency latency leveling

Optimization tools Blackbox optimizations: Distribution Parallelism Scheduling Load balancing Load shedding Whitebox optimizations: Implementation selection Implementation optimization Pattern rewriting

Scalability Scalability is the ability of a system to handle growing amounts of work in a graceful manner, or its ability to be enlarged effortlessly and transparently to accommodate this growth Scale up Vertical scalability Adding resources within the same logical unit to increase capacity Scale out Horizontal scalability Adding additional logical units to increase processing power

Vertical Scalability- Scaling up Parallel concurrent execution support, such as multi-threading Qualifications of application designed for scale-up Common design patterns: the Actor model Utilizes the in-process memory for message passing Adding resources to a single logical unit to increase it’s processing abilities Adding CPUs, memory Expanding storage by adding hard-drives

Horizontal Scalability - Scaling out Qualifications of application designed for scale-out For stateful applications Master/Worker Shared Nothing approach Spaced Based Architecture Map Reduce Different patterns associated Distributed services -do not assume locality Load balancing Adding multiple logical units and making them work as a single unit Computer cluster Load balancing Distributed caching Partitioning of state (sharding)

Scale-out and scale-up tradeoffs Scale up Scale out Simpler programming model Simpler management layer No network overhead due to in-memory communication Finite growth limit Single point of failure Redundancy Flexibility Fault tolerance Increased management complexity More complex programming model Issues as throughput and latency between nodes

General approach to scalability Usually applications combine the two approaches… Scaling out by… Spreading application modules Load partitioning and load balancing Distributed cache Scaling up by… Running multiple threads in each module

Scalability in event processing: various dimensions # of producers # of input events # of EPA types # of concurrent runtime instances # of concurrent runtime contexts Internal state size # of consumers # of derived events Processing complexity # of geographical Locations # of geographical Locations

Event-processing techniques for scalability Load shedding Load partitioning according to EPAs topology and Runtime Contexts

Scalability in event volume Scalability in event volume is the ability to handle variable event loads effectively as the quantity of events may go up and down over time Scale out techniques Load partitioning Parallel processing Scale up techniques Load shedding Applicable scale-up and scale-out techniques Load balancing Scale out techniques Some applications requiring high event throughput financial weather phone-call tracking

Scalability in quantity of event processing agents Scalability in the quantity of EPAs is the ability of the system to adapt to substantial growth of event processing network and a high quantity of event processing agents Some applications allow users to create their own custom EPAs Applicable scale-up and scale-out techniques Partitioning Optimization in agent assignment (mapping between logical and physical artifacts) Parallelism and distribution

Scalability in quantity of event processing agents – partitioning and parallelism Parallelism : Running all artifacts in a single powerful unit Saves network communication overhead Distribution : Running all artifacts in multiple units When event load is also an issue Parallelism/Distribution Partitioning Dependency analysis Number of core processors Level of distribution Communication overhead Performance objective fun. EPA complexity analysis

Scalability in a number of producers/consumers Growth in a number of producers usually results in growth in event load even if number of events produced by each one is small Growth in a number of consumers Requires optimization at routing level, such as multicasting

Scalability in a number of context partitions and context-state size Hash (customer id) Nodes events Each context partition is represented by internal state of a certain size Use partitioning on context Growth in a number of context partitions Affects EPA performance since iterating on large states Significant growth of internal state for a single context partition Use EPA optimization techniques

Availability Availability is ratio of time the system is perceived as functioning by its users to the time it is required or expected to function Can be expressed as Direct proportion : 9/10 or 0.9 Percentage: 99% Can be expressed in terms of average or total downtime

Availability expectations and solutions Continuous availability provides the ability to keep the business application running without any noticeable downtime Major outages… Disaster recovery techniques Replicas on site Additional sites Continuous operation is the ability to avoid planned outages Minor outages… High availability System design and implementation approach Ensures pre-arranged level of availability during measuring period (SLA) Represents ability to avoid minor unplanned outages by eliminating single points of failure

Components of high availability Fault avoidance – redundancy and duplication Distributed application Clustering Duplication of storage systems Failover for systems and data Fault tolerance -recoverability Failure recovery

Redundancy and duplication Redundancy Using multiple components with a method to detect failure and perform failover of the failed component Scale out techniques Continuous monitoring of components (“heart-bit”) Failover – automatic reconfiguration Load balancing is one of the players When one fails – load balancer no longer sends traffic When initial component recovers the load balancer routes traffic back Duplication A single live component is paired with a single backup which takes over in event of failure Example : Storage – RAID 0

Recoverability in stateful applications – state management tradeoffs Data grid – replication of state between multiple machines Recoverability achieved by duplication of state Better performance than pure db Complexity in persistency layer implementation Performance costs on cache misses and cache outs Network overhead on replication of state Complexity in synchronization of replicas Memory based state Better performance than pure db Complexity in recoverability implementation In-memory db with caching capabilities Better performance than pure db Guaranteed recoverability

High availability costs Implementing some of HA practices can be very expensive… Performance costs State changes need to be logged Entire state has to be persisted at least periodically Toll on processing latency and overall event throughput Actual costs Duplication of hardware for redundancy and duplication Application complexity For implementing failover , recovery

Availability in event processing Fault avoidance Duplication and redundancy of processing components Failover mechanisms for processing components Fault tolerance Recoverability of state for all processing components EPAs state Context state Channels state Using the general availability techniques…

Cost-effectiveness of recoverability techniques in EP Have to consider if implementing recoverability is cost-effective? Applications not requiring recoverability solution Applications where events are symptom of some underlying problem and will occur again Systems looking for statistical trends, which might be based on sampling Mission critical applications Lost state might result in incorrect decisions Recoverability is a must

Usability 101 Definition by Jakob Nilsen * * http :// www . useit . com / alertbox / 20030825 . html Learnability: How easy it is for Users to accomplish basic tasks the first time they encounter the system? Efficiency: Once users have Learned the system, How quickly can they perform tasks? Memorability: When users return after period of not using the system, How easily can they reestablish proficiency ? Errors: How many errors do users make, how severe are these errors, and how easily they can recover from the errors? Satisfaction: How pleasant is it to use the system? Utility: Does the system do what the user intended?

In this part of the tutorial we’ll talk about Build time IDE Runtime control and audit tools Correctness – internal Consistency Debug and validation Consistency with the environment - Transactional behavior

Build time interfaces Text based programming languages Visual languages Form based languages Natural languages interfaces

Another Text-based IDE (Apama)

Visual language – StreamSQL EventFlow (Streambase)

Visual language – StreamSQL EventFlow (Streambase) – cont.

Form based language – Websphere Business Events (IBM) Whenever transfer occurs more than once in a month, then the Account Manager should be notified and Sales should contact the customer.

Business-oriented tool that intended to define business concepts that involve events and rules without consideration of the implementation details The tool uses an adaptation of the OMG's SBVR standard Natural language for event processing Based on work done by Mark H Linehan (IBM T.J.Watson Research Center) free text Frequent big cash deposit pattern is defined as “at least 4 big cash deposits to the same account”, where big deposit decision depends on customer’s profile. structured English A derived event that is derived from a big cash deposit using the frequent deposits in same account applying threshold the count of the participant event set of frequent big cash deposits is greater than or equal to 4.

Run time tools Performance monitoring Dashboards Audit and provenance Two types of run time tools: Monitoring the application Monitoring the event processing systems

Performance Monitoring (Aleri/Sybase)

Dashboard Construction (Apama)

Provenance and audit Tracking all consequences of an event Tracking the reasons that something happens Within the event processing system: Derivation of events, routing of events, Actions triggered by the events

Example: Pharmaceutical pedigree

Validation and debugging Debugger Testing and simulation Validation

Breakpoints and Debugging (StreamBase)

Testing & simulation – IBM WBE

Changing a certain event, what are the application artifacts affected? What are all possible ways to produce a certain action (derived event)? There was an event that should have resulted in a certain action, but that never happened! “ Wrong” action was taken, how did that happen? Application validation

Validation techniques Static Analysis Navigate through mass of information wisely Discover event processing application artifacts dependencies and change rules with confidence Dynamic Analysis Compare the actual output against the expected results Explore rule coverage with multiple scenario invocation System consistency tests Build-time Development phase Run-time Development and production phases Analysis with Formal Methods Advanced correctness and logical integrity observations Build-time Development phase

Disconnected agents Event possible consequences Event possible provenance Potential infinite cycles Static analysis

Event instance forward trace Event instance backward trace Application coverage by scenario execution Agent evaluation in context Dynamic Analysis Runtime Scenario Dynamic Analysis Component EP Application Definition History Data Store Observations for dynamic analysis EP system invocation on runtime scenario Results analysis for correctness and coverage Analysis results

Static analysis methods enable to derive a set of “shallow” observations on top of the application graph  an agent can be physically connected to the graph, but not reachable during the application runtime (e.g., due to a self-contradicting condition) Advanced verification with formal methods Agent/derived event unreachability Automatic generation of scenario for application coverage Logical equivalence of several agents Mutual exclusion of several agents

Correctness The ability of a developer to create correct implementation for all cases (including the boundaries) Observation: A substantial amount of effort is invested today in many of the tools to workaround the inability of the language to easily create correct solutions

Some correctness topics The right interpretation of language constructs The right order of events The right classification of events to windows

The right interpretation of language constructs – example All (E1, E2) – what do we mean? A customer both sells and buys the same security in value of more than $1M within a single day Deal fulfillment: Package arrival and payment arrival 6/3 10:00 7/3 11:00 8/3 11:00 8/3 14:00

Fine tuning of the semantics (I) When should the derived event be emitted? When the Pattern is matched ? At the window end?

Fine tuning of the semantics (II) How many instances of derived events should be emitted? Only once? Every time there is a match ?

Fine tuning of the semantics (III) What happens if the same event happens several times? Only one – first, last, higher/lower value on some predicate? All of them participate in a match?

Fine tuning of the semantics (IV) Can we consume or reuse events that participate in a match?

Fine tuning of semantics – conclusion Some languages have explicit policies: Example: CCL Keep policies KEEP LAST PER Id KEEP 3 MINUTES KEEP EVERY 3 MINUTES KEEP UNTIL (”MON 17:00:00”) KEEP 10 ROWS KEEP LAST ROW KEEP 10 ROWS PER Symbol In other cases – explicit programming and workarounds are used if semantics intended is different than the default semantics

The right order of events - scenario Bid scenario- ground rules: All bidders that issued a bid within the validity interval participate in the bid. The highest bid wins. In the case of tie between bids, the first accepted bid wins the auction ===Input Bids=== Bid Start 12:55:00 credit bid id=2,occurrence time=12:55:32,price=4 cash bid id=29,occurrence time=12:55:33,price=4 cash bid id=33,occurrence time=12:55:34,price=3 credit bid id=66,occurrence time=12:55:36,price=4 credit bid id=56,occurrence time=12:55:59,price=5 Bid End 12:56:00 ===Winning Bid=== cash bid id=29,occurrence time=12:55:33,price=4 Trace: Race conditions: Between events; Between events and Window start/end

Ordering in a distributed environment - possible issues Even if the occurrence time of an event is accurate, it might arrive after some processing has already been done If we used occurrence time of an event as reported by the sources it might not be accurate, due to clock accuracy in the source Most systems order event by detection time – but events may switch their order on the way

Clock accuracy in the source Clock synchronization Time server, example: http://tf.nist.gov/service/its.htm

Buffering technique Assumptions: Events are reported by the producers as soon as they occur; The delay in reporting events to the system is relatively small, and can be bounded by a time-out offset ; Events arriving after this time-out can be ignored. Principles: Let  be the time-out offset, according to the assumption it is safe to assume that at any time-point t , all events whose occurrence time is earlier than t -  have already arrived. Each event whose occurrence time is To is then kept in the buffer until To+  , at which time the buffer can be sorted by occurrence time, and then events can be processed in this sorted order. Sorted Buffer (by occurrence time) To t > To +  Producers Event Processing

Retrospective compensation Find out all EPAs that have already sent derived events which would have been affected by the "out-of-order" event if it had arrived at the right time. Retract all the derived events that should not have been emitted in their current form. Replay the original events with the late one inserted in its correct place in the sequence so that the correct derived events are generated.

Classification to windows - scenario Calculate Statistics for each Player (aggregate per quarter) Calculate Statistics for each Team (aggregate per quarter) Window classification: Player statistics are calculated at the end of each quarter Team statistics are calculated at the end of each quarter based on the players events arrived within the same quarter All instances of player statistics that occur within a quarter window must be classified to the same window, even if they are derived after the window termination.

Transactional Behavior In a complete transactional system: In event processing system this implies: Nothing gets out of the system until the transaction is committed The ability to track the effects of event (forward and backwards) The system knows to withdraw events from the EPAs’ internal state

Transactional behavior in event processing? Typically, event processing systems have decoupled architecture, and does not exhibit transactional behavior However, in several cases event processing is embedded within a transactional environment

CASE I: Transactional ECA at the consumer side When a derived event is emitted to a consumer, there is an ECA rule, with several actions, that is required to run as atomic unit. If failed, the Derived event should be withdrawn

CASE II: An event processing system monitors transactional system In this case, the producer may emit events that are not confirmed and may be rolled back.

Case III: Event processing is part of a chain There is some transactional relationship between the producer and consumer The event processing system should transfer rollback notice from the consumer to the producer Need to be able to track the effects/causes of event (forward and backwards) This implies rollback of other events

Case IV: A path in the event processing network should act as “unit of work” Example: the “determine winner” fails, and the bid is cancelled, all bid events are not kept in the event stores, and are withdrawn for other processing purposes

Transactions in event processing systems Usually in transactional systems there is assumption that a transaction time is short This is not necessarily the case in event processing systems All (E1, E2) - E2 arrived 5 days after E1 - The processing of the pattern failed – What do we mean? Withdraw only E2? Withdraw also E1 after 5 days?

Security and Privacy Considerations

Security, privacy and trust Security requirements ensure that operations are only performed by authorized parties, and that privacy considerations are met. Based on Enhancing the Development Life Cycle to Produce Secure Software [DHS/DACS 08] Characteristics of secure application: Containing no malicious logic that causes it to behave in a malicious manner. Trustworthiness Recovering as quickly as possible with as little damage as possible from attacks. Survivability Executing predictably and operating correctly under all conditions, including hostile conditions. Dependability

Towards security assurance Identify and categorize the information the software is going to contain Low sensitivity – The impact of security violation is minimal High sensitivity – Violation may pose a threat to human life Develop security requirements Access control (Authentication) Data management and data access (Authorization) Human resource security (Privacy) Audit trails

Security in event processing systems Only authorized parties are allowed to be event producers or consumers Incoming events are filtered to avoid events that producers are not entitled to publish Consumers only receive derived events to which they are entitled (in some cases only some attributes of an event) Extensive work on secure subscription was done in pub/sub systems authorized authorized

Security in event processing systems – cont. Unauthorized parties can not make modifications in the application Off-line definition modifications or hot updates All database and data communications links used by the system are secure, including data transfer in distributed environments Keeping auditable logs of events received and processed Preventing spam events Can all twitter events be trusted?

Security patterns in event processing Application definitions access patterns Access type control – view/edit/manage Access destination control – application parts access restrictions per user/group Both above should be enforced in development and runtime phases (hot updates) Event data access patterns Access to events satisfying a certain condition (selection) Access to a subset of event attributes (projection)

Summary Non Functional properties determine the nature of event processing applications – distribution, availability, optimization, correctness and security are some of the dimensions There are often the main decision factor in selecting whether to use an event processing system, and in the selection among various alternatives.

Debs 2011 tutorial on non functional properties of event processing

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Debs 2011 tutorial on non functional properties of event processing (20)

More from Opher Etzion (20)

Recently uploaded (20)

Debs 2011 tutorial on non functional properties of event processing

Editor's Notes