Apache Eagle Strata Hadoop World London 2016
Arun Manoharan
Product Manager – eBay
aruncarthick@gmail.com
@lycos_86
EBAY MARKETPLACE AT A GLANCE
$19.6B GMV
in Q1 2016
9.5M
New listings added via
mobile per week
300M
Searches each day
63%
Transactions that ship
for free
(in US, UK, DE)
79%
Items sold as new
Q1 2016 data
~900M
Live listings
One of the world’s largest and most vibrant
marketplaces
Most Powerful
Selling Platform
For business sellers:
the potential to drive
profitable sales and
build a brand
For consumer sellers:
an easy way to
declutter, sell and
make money
A partnership not a
competition
Best Choice
Providing the
greatest selection of
inventory for our
buyers
From new, everyday
items to rare and
unique goods
And incredible deals
only found on eBay
Most Relevance
A shopping experience
that is simple, data-
driven and personalized
Enabling buyers to
easily find, compare
and purchase items
they need and want
Highlighting the unique
value that eBay brings
OUR
STRATEGY
SMART
COMMERCE
Identify an interesting
set of candidate items,
trends, events, etc.
Personalize
the results
Inspiration
at scale!
Apache Eagle
Monitor Hadoop in Real Time
Arun Manoharan | Product Manager|
@lycos_86
+200 Petabytes
of Consumer
Data and
growing…
Consumers
on 6
Continents
Millions of
Transactions
1000’s of
Product
Categories
Multiple
cookies across
dozens of
business
Actionable
search
insights
+ 9M payments
every day
+ 6K
Total
Payment
Volume per
second
Loyalty
Click
behavior
and
patterns
Device
IDs
100’s of millions
of
Email addresses
Bank
accounts
POS
Autos
Products
IP Address
+200 Petabytes
of Consumer
Data and
growing…
Consumers
on 6
Continents
Credit cards
1000’s of
Product
Categories
Multiple
cookies across
dozens of
business
Actionable
search
insights
+ 9M payments
every day
+ 6K
Total
Payment
Volume per
second
Pair of
shoes sold
every 2
seconds
Loyalty
Cell phone
sold every 4
seconds
Click
behavior
and
patterns
Device
IDs
100’s of millions
of
Email addresses
Bank
accounts
POS
A ladies
handbag is
bought via
mobile every
12 seconds
Auto
Products
IP Address
COLLECT, ANALYZE, PREDICT
Big Data @ eBay
*Q3 2015 data
7
Hadoop Clusters*
800M
HDFS operations
(single cluster)*
120 PB
Data*
Hadoop @ eBay
HADOOP SECURITY
Authorization &
Access Control
Perimeter
Security
Data
Classification
Activity
Monitoring
Security
Security for Hadoop
Who is accessing the data?
What data are they accessing?
Is someone trying to access data that they don’t have access to?
Are there any anomalous access patterns?
Is there a security threat?
How to monitor and get notified during or prior to an anomalous event occurring?
Motivation for Eagle
Apache Eagle
Apache Eagle: Monitor Hadoop in Real Time
Apache Eagle is an open source monitoring platform for the Hadoop ecosystem,
which started with monitoring data activities in Hadoop. It can instantly
identify access to sensitive data, recognize attacks and malicious activities, and
block access in real time.
In conjunction with components such as Ranger, Sentry, Knox,
DgSecure and Splunk, Eagle provides a comprehensive solution to
secure sensitive data stored in Hadoop.
Eagle Architecture
Apache Eagle Composition
Apache Eagle
Integrations Alert Engine
HDFS
AUDIT
HIVE
QUERY
HBASE
AUDIT
CASSANDRA
AUDIT
MapR
AUDIT
HADOOP
Performance
Metric
Namenode
JMX
Metrics
Datanode
JMX
Metrics
System
Metrics
M/R Job
Performance
Metric
History Job
Metrics
Running
Job Metrics
SparkJob
Performance
Metric
Spark Job
Metrics
Queue
Metrics
Data Activity
Monitoring
RM
JMX
Metrics
Policy Store
Metadata API
Scalability
Extensibility
[Domains] [Applications]
Eagle
Data Classification - HDFS
•Browse HDFS file system
•Batch import sensitivity metadata through Eagle API
•Manually mark sensitivity in Eagle UI
Data Classification - Hive
•Browse Hive databases/tables/columns
•Batch import sensitivity metadata through Eagle API
•Manually mark sensitivity in Eagle UI
Define policy in UI and API
curl -u "${EAGLE_SERVICE_USER}:${EAGLE_SERVICE_PASSWD}" -X POST -H 'Content-Type:application/json' \
  "http://${EAGLE_SERVICE_HOST}:${EAGLE_SERVICE_PORT}/eagle-service/rest/entities?serviceName=AlertDefinitionService" \
  -d '
[
  {
    "prefix": "alertdef",
    "tags": {
      "site": "sandbox",
      "application": "hadoopJmxMetricDataSource",
      "policyId": "capacityUsedPolicy",
      "alertExecutorId": "hadoopJmxMetricAlertExecutor",
      "policyType": "siddhiCEPEngine"
    },
    "description": "jmx metric ",
    "policyDef": "{\"expression\":\"from hadoopJmxMetricEventStream[metric == \\\"hadoop.namenode.fsnamesystemstate.capacityused\\\" and convert(value, \\\"long\\\") > 0] select metric, host, value, timestamp, component, site insert into tmp; \",\"type\":\"siddhiCEPEngine\"}",
    "enabled": true,
    "dedupeDef": "{\"alertDedupIntervalMin\":10,\"emailDedupIntervalMin\":10}",
    "notificationDef": "[{\"sender\":\"eagle@apache.org\",\"recipients\":\"eagle@apache.org\",\"subject\":\"missing block found.\",\"flavor\":\"email\",\"id\":\"email_1\",\"tplFileName\":\"\"}]"
  }
]
'
1 Create policy using API 2 Create policy using UI
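The request body above nests JSON inside JSON: policyDef, dedupeDef and notificationDef are themselves JSON documents carried as strings, which is why the quotes need escaping. A minimal sketch of building the same body programmatically follows. The function name build_policy_entity is hypothetical and the field values are copied from the slide; this is an illustration of the encoding, not Eagle's own client code.

```python
import json

def build_policy_entity(site, policy_id, expression):
    """Return the entity list carried by the AlertDefinitionService request.

    Hypothetical helper: field names and values mirror the slide's example.
    """
    policy_def = {"expression": expression, "type": "siddhiCEPEngine"}
    dedupe_def = {"alertDedupIntervalMin": 10, "emailDedupIntervalMin": 10}
    notification_def = [{
        "sender": "eagle@apache.org",
        "recipients": "eagle@apache.org",
        "subject": "missing block found.",
        "flavor": "email",
        "id": "email_1",
        "tplFileName": "",
    }]
    return [{
        "prefix": "alertdef",
        "tags": {
            "site": site,
            "application": "hadoopJmxMetricDataSource",
            "policyId": policy_id,
            "alertExecutorId": "hadoopJmxMetricAlertExecutor",
            "policyType": "siddhiCEPEngine",
        },
        "description": "jmx metric ",
        # The three fields below are JSON documents serialized to strings,
        # so json.dumps is applied once here and once more for the whole body.
        "policyDef": json.dumps(policy_def),
        "enabled": True,
        "dedupeDef": json.dumps(dedupe_def),
        "notificationDef": json.dumps(notification_def),
    }]

EXPRESSION = (
    'from hadoopJmxMetricEventStream'
    '[metric == "hadoop.namenode.fsnamesystemstate.capacityused" '
    'and convert(value, "long") > 0] '
    'select metric, host, value, timestamp, component, site insert into tmp; '
)

body = json.dumps(build_policy_entity("sandbox", "capacityUsedPolicy", EXPRESSION), indent=2)
print(body)
```

Letting the serializer do the escaping avoids the hand-written backslashes that are easy to get wrong in a shell one-liner.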
Define policy
1 Single event evaluation
• threshold check with various conditions
Policy Capabilities
2 Event window based evaluation
• various window semantics (time/length sliding/batch window)
• comprehensive aggregation support
3 Correlation for multiple event
streams
• SQL-like join
4 Pattern Match and
Sequence
• a happens followed by b
Powered by Siddhi 3.0.5, and Eagle provides dynamic capabilities
and intuitive API/UI
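To make the "event window based evaluation" capability concrete, here is a small sketch of a sliding time window with a count aggregation. It is plain Python written for illustration, not Siddhi and not Eagle code; the class name SlidingWindowCounter, the threshold and the sample stream are all assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count events per key over the last `window_ms` milliseconds."""

    def __init__(self, window_ms, threshold):
        self.window_ms = window_ms
        self.threshold = threshold
        self.events = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp_ms):
        """Record one event; return True when the key's count exceeds the threshold."""
        window = self.events[key]
        window.append(timestamp_ms)
        # Expire timestamps that have slid out of the window.
        while window and window[0] <= timestamp_ms - self.window_ms:
            window.popleft()
        return len(window) > self.threshold

counter = SlidingWindowCounter(window_ms=60_000, threshold=3)
stream = [("alice", 0), ("alice", 10_000), ("bob", 15_000),
          ("alice", 20_000), ("alice", 30_000), ("alice", 100_000)]
alerts = [(user, ts) for user, ts in stream if counter.on_event(user, ts)]
print(alerts)  # [('alice', 30000)]
```

A single-event threshold check is the degenerate case with no window state; the windowed form needs per-key state, which is what makes partitioning and failover matter later in the deck.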
Scalability
•Scale with # of events
•Scale with # of policies
Eagle Alert Engine Overview
1 Runs CEP engine on Apache
Storm
• Use CEP engine as library (Siddhi CEP)
• Evaluate policy on streamed data
• Rule is hot deployable
2 Inject policy dynamically
• API
• Intuitive UI
3 Scalability
• Computation
# of policies (policy placement)
• Storage
# of events (event partition)
4 Extensibility for policy
enforcement
• Post-alert processing with plugin
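The two scalability axes named above, event partition and policy placement, can be sketched as follows. This is an illustrative assumption about how such a scheme might look (stable hash partitioning of events by a key, round-robin assignment of policies to executors), not a description of Eagle's actual implementation; partition_for and place_policies are hypothetical names.

```python
import zlib

def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def place_policies(policy_ids, num_executors):
    """Round-robin placement: spread policies evenly over executors."""
    placement = {executor: [] for executor in range(num_executors)}
    for index, policy_id in enumerate(sorted(policy_ids)):
        placement[index % num_executors].append(policy_id)
    return placement

events = [{"user": "alice", "cmd": "open"},
          {"user": "bob", "cmd": "delete"},
          {"user": "alice", "cmd": "delete"}]
# Partitioning by user keeps all of one user's events on the same partition.
partitions = [partition_for(event["user"], 4) for event in events]
placement = place_policies(["p1", "p2", "p3", "p4", "p5"], 2)
print(partitions, placement)
```

Because events are routed by one key, per-key state such as a window or a count is only complete for that key on a given partition.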
Eagle Alert
Statistics
• # of events evaluated per
second
• audit for policy change
Eagle Service
As of 0.3.0, Eagle stores metadata and statistics in HBase and
supports Druid as a metric store.
Metadata
• Policy
• Event schema
• Site/Application/UI Features
HBASE
• Store metrics
• Store M/R job/task data
• Rowkey design for time-series data
• HBase Coprocessor
Raw data
• Druid for metric
• HBASE for M/R job/task
etc.
• ES for log (future)
1 Data to be stored 2 Storage 3 API/UI
Druid
• Consume data from Kafka
HBASE
• filter, groupby, sort, top
Druid
• Druid query API
• Dashboard in Eagle
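The slide mentions a rowkey design for time-series data in HBase without giving the layout. One common pattern, shown here purely as an assumption and not as Eagle's actual key format, is a fixed-width prefix identifying the metric, followed by a reversed timestamp so that the newest rows sort first, followed by hashed tags. The function build_rowkey and the field widths are hypothetical.

```python
import struct
import zlib

MAX_LONG = 2**63 - 1

def build_rowkey(metric_name, timestamp_ms, tags):
    """Pack metric hash, reversed timestamp and tag hashes into a byte rowkey."""
    prefix = struct.pack(">I", zlib.crc32(metric_name.encode("utf-8")))
    # Reversing the timestamp makes newer rows sort first in a byte-ordered scan.
    reversed_ts = struct.pack(">q", MAX_LONG - timestamp_ms)
    tag_part = b"".join(
        struct.pack(">I", zlib.crc32(f"{name}={value}".encode("utf-8")))
        for name, value in sorted(tags.items())
    )
    return prefix + reversed_ts + tag_part

older = build_rowkey("capacityused", 1_460_000_000_000, {"site": "sandbox"})
newer = build_rowkey("capacityused", 1_460_000_060_000, {"site": "sandbox"})
print(newer < older)  # True
```

With keys like this, a scan over one metric's prefix returns the latest data points first, which suits dashboards and recent-range queries.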
Highlights
1. Ease of use: after installation, the user defines rules
2. Comprehensive rules on high volumes of data: Eagle solves some
problems unique to Hadoop
3. Hot-deployable rules: Eagle does not provide a lot of charts; instead it
allows users to write ad-hoc rules and hot deploy them
4. Metadata driven: metadata includes policy, event schema, UI
components, etc.
5. Monolithic Storm topology: application pre-processing runs
together with the alert engine
6. Extensibility: Eagle can’t succeed alone; it has to be integrated
with other systems, for example for data classification and policy
enforcement
Alert Engine Limitations in Eagle 0.3
1 High cost for integrating
• Coding for onboarding new data source
• Monolithic topology for pre-processing and alert
Even if it is a simple data source, you have
to write storm topology and then deploy
2 Not multi-tenant
• Alert engine is embedded into application
• Many separate Storm topologies
One storm topology even for one trivial data
source
3 Policy capability restricted by event
partition
• Can’t do ad-hoc group-by policy expression
For example from groupby user to groupby cmd
If traffic is partitioned by user, policy only
supports expression of user based group-by
4 Correlation is not declarative
• Coding for correlating existing data sources
Can’t declare correlations for multiple
metrics
5 Stateful policy evaluation
• fail over when bolt is down
How to replay one week history data when
node is down
Integrations
•Cassandra
•MapR
•Mongo DB
•Job Queue
Extensibility
▪ Sentry/Ranger
• As remediation engine
• As generic data source
▪ DgSecure
• Source of truth for data classification
▪ Splunk
• Syslog format output
• EAGLE alert output is the 1st abstraction of analytics and Splunk is the
2nd abstraction
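The Splunk integration is described only as "syslog format output". As a hedged illustration, the sketch below renders an alert as a single RFC 5424-style line. The alert fields (host, policy_id, message, timestamp_ms), the app name "eagle" and the chosen facility and severity are all assumptions for the example; the slide does not specify which syslog variant or fields Eagle emits.

```python
from datetime import datetime, timezone

def to_syslog(alert, facility=4, severity=4):
    """Format an alert dict as one RFC 5424-style syslog line.

    Layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
    """
    priority = facility * 8 + severity  # PRI = facility * 8 + severity
    timestamp = datetime.fromtimestamp(alert["timestamp_ms"] / 1000, tz=timezone.utc)
    return "<{pri}>1 {ts} {host} eagle - {policy} - {msg}".format(
        pri=priority,
        ts=timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
        host=alert["host"],
        policy=alert["policy_id"],  # carried in the MSGID position
        msg=alert["message"],
    )

sample_alert = {
    "timestamp_ms": 0,
    "host": "nn1.example.com",
    "policy_id": "capacityUsedPolicy",
    "message": "capacity used above zero",
}
line = to_syslog(sample_alert)
print(line)
```

A line-oriented format like this is what lets a downstream tool index alerts without knowing anything about Eagle's internal entities.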
USER PROFILE ALGORITHMS…
Eigen Value Decomposition
• Compute mean and variance
• Compute Eigen Vectors and determine Principal Components
• Normal data points lie near first few principal components
• Abnormal data points lie further from first few principal components and
closer to later components
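The bullets above can be sketched in a few lines: compute the mean and covariance of a user's historical activity, find the first principal component, and score a new point by how far it lies from that component. This is a toy two-feature example using power iteration and the standard library only; the feature values, function names and the idea of thresholding the residual are assumptions, not Eagle's user-profile implementation.

```python
import math

def mean_center(points):
    """Subtract the per-dimension mean from every point."""
    dims = len(points[0])
    means = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points], means

def covariance(centered):
    """Sample covariance matrix of mean-centered points."""
    dims, n = len(centered[0]), len(centered)
    return [[sum(p[i] * p[j] for p in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def top_eigenvector(matrix, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return vector

def residual(point, means, component):
    """Distance from a point to the line spanned by the first principal component."""
    centered = [x - m for x, m in zip(point, means)]
    projection = sum(c * v for c, v in zip(centered, component))
    return math.sqrt(max(sum(c * c for c in centered) - projection ** 2, 0.0))

# Hypothetical history: two activity features per time bucket for one user.
history = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
centered, means = mean_center(history)
component = top_eigenvector(covariance(centered))
normal_score = residual([6.0, 12.0], means, component)
abnormal_score = residual([6.0, 1.0], means, component)
print(normal_score, abnormal_score)
```

The point that follows the user's usual ratio between the two features scores close to zero, while the point that breaks the pattern lies far from the first component and would be flagged.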
USER PROFILE ARCHITECTURE
Eagle Next Releases
Eagle 0.4
• Improve User experience
▪ Remote start storm topology
▪ Metadata stored in RDBMS
Eagle 0.5
• Alert Engine as Platform
▪ No monolithic topology
▪ Declarative data source onboard
▪ Easy correlation
▪ Support policies with any field group-by
▪ Elastic capacity management
Dev Mail List: dev@eagle.incubator.apache.org
http://eagle.incubator.apache.org
GitHub: https://github.com/apache/incubator-eagle
Twitter: @TheApacheEagle
Q & A
Thank You!!

Editor's Notes

  • #4: Today’s eBay isn’t what it used to be - many people think of us only as an auction site, but that perception hasn’t kept up with reality. The reality is that 79% of what is sold on eBay is new merchandise, available for purchase immediately. We have more than 900 million items listed for sale and 162 million active buyers, effectively making us the world’s biggest shopping destination.
  • #5: Our vision for commerce is one that is enabled by people, powered by technology, and open to everyone. Our strategy is to drive the best choice, have the most relevance, and deliver the most powerful selling platform.
  • #6: Consumers are overwhelmed by the number of choices they face day-to-day. Smart brands are using data to surface inventory to their consumers in ways that feel relevant, helpful and familiar. At eBay, we are curating and simplifying content in ways that align to users’ stated (and sometimes unstated) preferences, serving up content in new, simplified interfaces that surprise and delight them. We are also experimenting with machine learning to help bridge the gap between intent and understanding.
  • #14: Storage – HBase and MySQL. Archived logs – HDFS. Eagle storage is only for a small amount of data: metadata, policies, etc. External storage for metrics. Raw metrics trend – Druid, to visualize too.
  • #15: Apache Eagle includes applications and an alert engine. Today an application connects to the alert engine through a Java API; in the future, the Alert Engine will be a separate component that applications can send data into. Policies are stored in HBase – all metadata is stored in HBase – and we support both HBase and MySQL. All logs are also kept in HDFS for historical auditing.
  • #16–#20: Setup – one single Eagle instance can manage multiple sites
  • #21–#22: Some policies, such as day-over-day or week-over-week comparisons, are not supported by CEP
  • #31: So far it’s policy-based alerting, but there are certain patterns that can’t be caught by policies. Machine learning: observe a user over a period of time, learn their typical/normal behaviour, and create a user profile, which in turn is a policy. Algorithms: EVD (eigenvalue decomposition) and density estimation.
  • #34: Mentors – Julian, Owen, Henry, Taylor, Amreshwari Champion – Henry