Apache Eagle in Action
Secure Hadoop in Real-time
Hao Chen / 陈浩 / hao@apache.org / @haozch
Who is the Guy
Co-creator, Committer and PMC @ Apache Eagle
hao@apache.org
Hao Chen / 陈浩
Sr. Software Engineer @ eBay Cloud Service
hchen9@ebay.com
Speaker @ Hadoop Summit (SJC, SHA, BJ) ...
http://people.apache.org/~hao
Agenda
• About Eagle
• Architecture
• Ecosystem
• Q & A
What’s Apache Eagle
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop from eBay.
Open sourced as an Apache Incubator project on Oct 26th, 2015.
Secure Hadoop in real time: a data activity monitoring solution that instantly identifies access to sensitive data, recognizes attacks and malicious activity, and blocks access in real time.
See http://eagle.incubator.apache.org or http://goeagle.io
Apache Eagle History
Donated to the Apache Software Foundation (ASF) by eBay on Oct 26th, 2015
• Dec 2013: Hadoop Eagle project initiative
• May 2014: Hadoop Eagle production release
• Oct 23 2015: GitHub open source (github.com/apache/incubator-eagle)
• Oct 26 2015: Apache Incubator (eagle.incubator.apache.org)
Why build Apache Eagle
Eagle was started at the end of 2013 for Hadoop ecosystem monitoring, because existing tools such as Zabbix and Ganglia could not handle the huge volume of metrics and logs generated by the Hadoop systems at eBay.
Hadoop @ eBay Inc
• 2007: 1-10 nodes
• 2009: 50+ nodes
• 2010: 100+ nodes, 1,000+ cores, 1 PB
• 2011: 1,000+ nodes, 10,000+ cores, 10+ PB
• 2012: 3,000+ nodes, 10,000+ cores, 50+ PB
• 2013/2014: 10,000 nodes, 150,000+ cores, 170 PB, 2,000+ users
Hadoop Data
• Security
• Activity
Hadoop Platform
• Health
• Availability
• Performance
Apache Eagle @ eBay
7 CLUSTERS
7427 NODES
160 PB DATA
10 B+ EVENTS / DAY
500+ METRIC TYPES
50,000+ JOBS / DAY
50,000,000+ TASKS / DAY
MONITOR
PROCESS
Agenda
• About Eagle
• Architecture
• Ecosystem
• Q & A
Apache Eagle Architecture Overview
• Scalable: scales to monitor thousands of policies and billions of access events
• Machine Learning: creates dynamic user profiles based on user behavior
• Real-time: generates alerts in real time and blocks users with malicious intent
• Extensible: Eagle can be easily extended to monitor other data sources
Apache Eagle Architecture Overview
[Architecture diagram; component labels]
• Data sources: HDFS, Audit, Security activities
• Data Collector, Kafka
• Stream Processing Engine: Policy Engine, Machine Learning Module, Custom Module
• Metadata Manager: Metadata Management, Policy Thresholds, User Properties, ML Thresholds
• Machine Learning Training Module, Datastores
• Outputs: Actionable Alerts, Insights
• Remediation Engine: Apache Ranger
• Real Time Alert Dashboard (Security Analyst), Admin Console (Security Engineer)
Apache Eagle Architecture Features
• Real-time Data Collection
• Distributed Policy Engine
• Stream Processing DSL
• Scalable Data Storage & Query
• Machine Learning Integration
NOTE: {NAME}-{NUMBER}, such as HDFS-6914, is the ID of an open source project ticket that we contributed
Apache Eagle - Data Collection
Decoupling with Message Bus
• Apache Kafka: high-throughput distributed messaging
• Partition: balance between logic and throughput
Cross-Platform Integration
• Community Kafka Client (18+)
• Python/Go/C/C++/JAVA ..
• Enhanced Log4j-kafka
• KAFKA-2041: Extensible Partition Key
• KAFKA-2077: Advanced Topic Selector
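The "balance between logic and throughput" trade-off in partitioning can be sketched as follows. This is a minimal illustration, not Kafka's or Eagle's partitioner: `PartitionChooser` is a hypothetical class, and `String.hashCode` stands in for whatever hash a real client uses. Keyed events always land on the same partition (which keeps per-key logic local), while keyless events are spread round-robin (which favors throughput).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of Kafka or Eagle.
class PartitionChooser {
    private final int numPartitions;
    private final AtomicInteger roundRobin = new AtomicInteger();

    PartitionChooser(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Keyed events: the same key always maps to the same partition.
    // Keyless events (null key): spread evenly in round-robin order.
    int partitionFor(String key) {
        if (key == null) {
            return Math.floorMod(roundRobin.getAndIncrement(), numPartitions);
        }
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Choosing the user as the key, for example, would keep all of one user's audit events in order on one partition, at the cost of skew when a few users dominate the traffic.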
Apache Eagle - Data Collection
Availability: Filebeat + Logstash
Lightweight collector (Golang) with a cluster of Logstash daemon instances
[Diagram labels: Logstash-1, Logstash-2, Logstash-…; Shuffle Grouping; Kafka; Field Grouping; Distributed Message Bus]
• Resource consumption balance
• Message throughput balance (LOGSTASH-179)
Apache Eagle - Data Collection
Scalability: Distributed Real-time Ingestion
• Storm Spout: distributed crawling of Hadoop jobs, node JMX metrics, service logs, etc.
• Zookeeper: centralized state management and distributed locking
[Diagram labels: METRIC, JOB, EVENT, LOG]
Apache Eagle - Distributed Real-time Policy Engine
[Diagram: real-time event streams (Stream_{1} … Stream_{*}) pass through stream processing to AlertExecutor_{1} … AlertExecutor_{N} in a distributed streaming cluster environment and produce real-time alerts; the metadata manager covers policy management, dynamic policy deployment, and dynamic stream schema]
Highlights
• Real-time
• Usability
• Scalability
• Extensibility
• Metadata-driven
Apache Eagle - Distributed Real-time Policy Engine
[Diagram: event stream (Kafka), policy engine, real-time alerts; the metadata manager covers policy management, dynamic stream schema, and dynamic policy deployment]
Real-time
• Kafka-based distributed message bus (extensible)
• Storm-based real-time execution environment (extensible)
• Stream events are processed and alerts are evaluated during streaming
Apache Eagle - Distributed Real-time Policy Engine
[Diagram: distributed streaming cluster environment producing real-time alerts; the metadata manager covers policy management and dynamic policy deployment]
Usability
• Powerful SQL-like CEP CQL for policy definition
• Dynamic policy metadata lifecycle management (deployment/update)
• Easy-to-use policy management and alert analytics UI
from metricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
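To make the semantics of the filter policy above concrete, here is a rough plain-Java equivalent. It is only a sketch: `MetricEvent` and `ReplLagPolicy` are hypothetical names, and in Eagle the policy text is evaluated by the CEP engine rather than hand-written like this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type with the two fields the policy refers to.
class MetricEvent {
    final String name;
    final double value;

    MetricEvent(String name, double value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical hand-written equivalent of the policy text.
class ReplLagPolicy {
    // Mirrors the filter [(name == 'ReplLag') and (value > 1000)].
    static boolean matches(MetricEvent e) {
        return "ReplLag".equals(e.name) && e.value > 1000;
    }

    // Mirrors "select * insert into outputStream":
    // every matching event is forwarded unchanged.
    static List<MetricEvent> apply(List<MetricEvent> metricStream) {
        List<MetricEvent> outputStream = new ArrayList<>();
        for (MetricEvent e : metricStream) {
            if (matches(e)) {
                outputStream.add(e);
            }
        }
        return outputStream;
    }
}
```

The point of expressing this as policy text instead of code is that the text can be stored as metadata and deployed or updated without rebuilding the topology.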
Apache Eagle - Distributed Real-time Policy Engine
Full-function streaming CEP CQL: Siddhi on Storm by default
from hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp, 10 min) select user, count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream;
• Filter
• Join
• Aggregation: Avg, Sum, Min, Max, etc.
• Group by
• Having
• Stream handlers for windows: Time Window, Batch Window, Length Window
• Conditions and expressions: and, or, not, ==, !=, >=, >, <=, <, and arithmetic operations
• Pattern processing
• Sequence processing
• Event Tables: integrate historical data into real-time processing
• SQL-like query: Query, Stream Definition, and Query Plan compilation
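The windowed query above (count accesses to '/tmp/private' per user over 10 minutes of event time, alert at 5 or more) can be approximated in plain Java as follows. This is a sketch under assumptions: `PrivateDirAccessPolicy` is a hypothetical class, timestamps are assumed to arrive in non-decreasing order per user, and the eviction rule only approximates what an external-time window does in a CEP engine.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-written approximation of the windowed policy.
class PrivateDirAccessPolicy {
    private static final long WINDOW_MS = 10 * 60 * 1000L; // 10 min
    private static final int THRESHOLD = 5;                // having aggValue >= 5

    // One event-time window per user ("group by user").
    private final Map<String, Deque<Long>> timestampsByUser = new HashMap<>();

    // Returns true when this event brings the user's in-window count
    // to THRESHOLD or more, i.e. when an alert would be emitted.
    boolean onEvent(String user, String src, long timestampMs) {
        if (!"/tmp/private".equals(src)) {
            return false; // filter: src == '/tmp/private'
        }
        Deque<Long> window = timestampsByUser.computeIfAbsent(user, u -> new ArrayDeque<>());
        window.addLast(timestampMs);
        // Evict events that fell out of the window, using the event's own time.
        while (!window.isEmpty() && window.peekFirst() <= timestampMs - WINDOW_MS) {
            window.removeFirst();
        }
        return window.size() >= THRESHOLD;
    }
}
```

Because the window is driven by the timestamp carried in the event rather than by wall-clock time, replaying old audit logs gives the same alerts as live traffic.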
Apache Eagle - Distributed Real-time Policy Engine
[Diagram: Stream_{1} … Stream_{*} pass through stream processing to AlertExecutor_{1} … AlertExecutor_{N} in a distributed streaming cluster environment]
Scalability: dynamic policy partitioning by {event} * {policy}
• With N users split into 3 partitions and M policies split into 2 partitions, there are 3 * 2 = 6 physical tasks
• Physical partition + policy-level partition
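The {event} * {policy} arithmetic can be sketched as a two-dimensional grid of tasks. The class below is a hypothetical illustration, not Eagle's implementation, and it assumes simple hash partitioning on both axes: an event is sent to every policy partition within its own event partition, and each task evaluates only the policies assigned to its policy partition.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a two-level (event x policy) partitioning scheme.
class PolicyPartitioner {
    private final int eventPartitions;
    private final int policyPartitions;

    PolicyPartitioner(int eventPartitions, int policyPartitions) {
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    int totalTasks() {
        return eventPartitions * policyPartitions;
    }

    // Index of the single task that evaluates this policy for this user's events.
    int taskFor(String user, String policyId) {
        int eventPart = Math.floorMod(user.hashCode(), eventPartitions);
        int policyPart = Math.floorMod(policyId.hashCode(), policyPartitions);
        return eventPart * policyPartitions + policyPart;
    }

    // An event must reach one task per policy partition so that
    // every policy gets a chance to evaluate it.
    List<Integer> tasksForEvent(String user) {
        int eventPart = Math.floorMod(user.hashCode(), eventPartitions);
        List<Integer> tasks = new ArrayList<>();
        for (int p = 0; p < policyPartitions; p++) {
            tasks.add(eventPart * policyPartitions + p);
        }
        return tasks;
    }
}
```

With 3 event partitions and 2 policy partitions, `totalTasks()` is 6 and each event is delivered to 2 of those tasks.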
Apache Eagle - Distributed Real-time Policy Engine
Distributed Streaming Partition Problem
https://en.wikipedia.org/wiki/Partition_problem
S = {3,1,1,2,2,1,1}
S1 = {1,1,1,1,1}
S2 = {2,2}
S3 = {3}
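One common heuristic for this kind of problem is greedy assignment: sort the weights in descending order and always give the next one to the currently lightest bucket. The sketch below applies it to the multiset S from the slide; `GreedyPartition` is a hypothetical class, and nothing here is claimed to be the exact strategy Eagle uses.

```java
import java.util.Arrays;

// Hypothetical illustration of a greedy heuristic for the partition problem.
class GreedyPartition {
    // Assigns each weight (heaviest first) to the currently lightest bucket
    // and returns the resulting load of each bucket.
    static long[] bucketLoads(long[] weights, int buckets) {
        long[] sorted = weights.clone();
        Arrays.sort(sorted); // ascending; iterated from the end below
        long[] loads = new long[buckets];
        for (int i = sorted.length - 1; i >= 0; i--) {
            int lightest = 0;
            for (int b = 1; b < buckets; b++) {
                if (loads[b] < loads[lightest]) {
                    lightest = b;
                }
            }
            loads[lightest] += sorted[i];
        }
        return loads;
    }
}
```

For S = {3,1,1,2,2,1,1} and three buckets, this yields loads of 4, 4, and 3, which is as balanced as a total weight of 11 allows.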
Apache Eagle - Distributed Real-time Policy Engine
Distributed Streaming Partition Strategy
groupBy[ GreedyStrategy ]((_.key1,_.key2 ))
[Diagram: key distribution statistics (online/offline) are kept in HBase and loaded asynchronously into a key statistics cache, which feeds the real-time partition strategy]
Strategy
• Greedy (Online/Offline)
• PoTC
• PKG
• Hashing
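Assuming PoTC and PKG refer to the "power of two choices" idea and partial key grouping, the core mechanism can be sketched as below: each key has two candidate workers, derived from two different hashes, and every message goes to whichever candidate is currently less loaded. `TwoChoicesGrouping` and its `mix` function are hypothetical and only illustrate the technique.

```java
// Hypothetical sketch of two-choice routing with local load tracking.
class TwoChoicesGrouping {
    private final long[] load;

    TwoChoicesGrouping(int workers) {
        this.load = new long[workers];
    }

    // Routes one message for the given key and returns the chosen worker.
    int route(String key) {
        int h = key.hashCode();
        int first = Math.floorMod(h, load.length);
        int second = Math.floorMod(mix(h), load.length);
        int chosen = load[first] <= load[second] ? first : second;
        load[chosen]++;
        return chosen;
    }

    long loadOf(int worker) {
        return load[worker];
    }

    // Illustrative second hash; any independent hash function would do.
    private static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }
}
```

Unlike plain hashing, a hot key is split across at most two workers, so downstream aggregation has to merge at most two partial results per key.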
Apache Eagle - Distributed Real-time Policy Engine
Distributed Real-time Policy Engine
Siddhi CEP
Policy
Evaluator
Machine
Learning Policy
Evaluator
Extensibility
• First-class support for WSO2 Siddhi CEP
• Extensible policy engine implementation
• Extensible policy lifecycle management
Extensible
Policy Evaluator
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();     // literal string to identify one type of policy
    public Class getPolicyEvaluator(); // get policy evaluator implementation
    public List getBindingModules();   // policy text with json format to object mapping
}

public interface PolicyEvaluator {
    public void evaluate(ValuesArray input) throws Exception;          // evaluate input event
    public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);  // policy update
    public void onPolicyDelete();                                      // invoked when policy is deleted
}
METADATA MANAGER
Policy/Metadata
Apache Eagle - Distributed Real-time Policy Engine
Metadata-Driven
• Stream Schema: AlertStreamSchemaEntity
• Policy Definition: AlertDefinitionAPIEntity
• Central metadata management
• Dynamic metadata deployment
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name = "Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    // remaining fields and accessors are not shown on the slide
}
METADATA MANAGER
Distributed Real-time Policy Engine
Dynamic Metadata Loading
Apache Eagle - Fluent Stream Processing DSL
val env = ExecutionEnvironment.getStorm()
env.fromKafka(KafkaConfig)
    .flatMap(AuditLogTransformer)
    .groupBy(_.user)
    .flatMap(UserProfileAggregator)
    .alert.persistAndEmail
env.execute()
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Alerts
Real-time
Event Stream
Stream_{1}
Stream_{*}
Stream
Processing
env.execute()
Apache Eagle - Fluent Stream Processing DSL
• Physical execution platform independent
• Easily assemble data transformation, filtering, join, and alerting DAGs in a fluent way
• DAG rewrite and optimization
• StreamUnionExpansion
• StreamGroupbyExpansion
• StreamNameExpansion
• StreamAlertExpansion
• StreamParallelismConfigExpansion
trait StreamProducer {
    filter
    flatMap
    map{1,2,3,4}
    groupBy
    streamUnion // stream join is hard, not implemented for Storm
    alertWithConsumer
}
StormExecutionEnvironment env = ExecutionEnvironmentFactory.getStorm(config);
env.newSource(new KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1)
    .flatMap(new AuditLogTransformer())
    .groupBy(0)
    .flatMap(new UserProfileAggregatorExecutor())
    .alertWithConsumer("userActivity", "userProfileExecutor");
env.execute();
Optimizer
1. Development 2. Optimization 3. Compile to native app
Apache Eagle - Scalable Data Storage and Query
• Entity metadata on large-scale NoSQL storage such as HBase
• Full-function SQL-like REST query
• Optimized rowkey design for time-series monitoring data
• HBase Coprocessor
• Secondary Index
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name = "Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    // remaining fields and accessors are not shown on the slide
}
query=
AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Apache Eagle – Uniform HBase Rowkey Design
Uniform rowkey design
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Metric: Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity: Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log: Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
  Rowvalue ::= Log Content
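The rowkey layout above can be illustrated with a small byte-level sketch. Everything specific here is an assumption made for illustration: `RowkeySketch` is a hypothetical class, and hashing each component to 4 bytes, writing the timestamp as a plain 8-byte long, and sorting tags by name are choices of this example rather than details confirmed by the slides.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of: Prefix | Partition Keys | timestamp | tagName | tagValue | ...
class RowkeySketch {
    static byte[] build(String prefix, List<String> partitionValues,
                        long timestamp, Map<String, String> tags) {
        // Sort tags by name so the same tag set always yields the same key.
        Map<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionValues.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(prefix.hashCode());          // Prefix (assumed: 4-byte hash)
        for (String value : partitionValues) {
            buf.putInt(value.hashCode());       // Partition Keys (assumed: 4-byte hash each)
        }
        buf.putLong(timestamp);                 // timestamp (assumed: 8-byte long)
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buf.putInt(tag.getKey().hashCode());   // tagName
            buf.putInt(tag.getValue().hashCode()); // tagValue
        }
        return buf.array();
    }
}
```

Placing the prefix and partition keys ahead of the timestamp means that rows of one type and partition are stored contiguously and can be range-scanned by time.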
Apache Eagle - Machine Learning Integration
Apache Eagle – User/System Activity Profiling
User Activity Profiling
• Offline: determine the kernel density function parameters (bandwidth) from the training dataset (KDE)
• Online: if a test data point lies outside the trained bandwidth, it is an anomaly (policy)
[Figures: Principal Components (PCs) in Eigenvalue Decomposition (EVD); Kernel Density Function]
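The offline/online split for kernel density estimation can be sketched in one dimension as follows. This is a simplified illustration, not Eagle's model: `KdeAnomalySketch` is a hypothetical class, it uses a Gaussian kernel, and it treats a point as anomalous when its estimated density falls below a minimum density, which is an assumed stand-in for "outside the trained bandwidth".

```java
// Hypothetical 1-D Gaussian KDE anomaly check.
class KdeAnomalySketch {
    private final double[] training;  // offline: observed activity values
    private final double bandwidth;   // offline: kernel bandwidth h
    private final double minDensity;  // assumed threshold for "normal"

    KdeAnomalySketch(double[] training, double bandwidth, double minDensity) {
        this.training = training.clone();
        this.bandwidth = bandwidth;
        this.minDensity = minDensity;
    }

    // f(x) = 1 / (n * h) * sum over i of K((x - x_i) / h),
    // where K is the standard normal density.
    double density(double x) {
        double sum = 0.0;
        for (double xi : training) {
            double u = (x - xi) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (training.length * bandwidth);
    }

    // Online: flag points that the trained density considers very unlikely.
    boolean isAnomaly(double x) {
        return density(x) < minDensity;
    }
}
```

In a streaming setting, the training data and bandwidth would be computed offline and loaded as policy metadata, so the online check stays a cheap per-event evaluation.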
Apache Eagle - Anomaly Metric Predictive Detection
• Offline: analyze and combine 500+ metrics for causal anomaly detection (IG -> PCA -> GMM -> MCC)
• Online: predictively alert on anomalous metrics
[Figures: normal (green) and abnormal (red) data; probability distribution and threshold selection; PCA (Principal Component Analysis); Anomaly Metric Predictive Detection case study]
Agenda
• About Apache Eagle
• Architecture
• Ecosystem
• Q & A
Apache Eagle Ecosystem
• Eagle Framework: distributed real-time framework for efficiently developing highly scalable monitoring applications
• Eagle Apps: Security / Hadoop / Cloud / Database
• Eagle Interface: REST Service / Management UI (Web Portal) / Customizable Analytics Visualization
• Eagle Integration: Ambari / Docker / Ranger / Dataguise
• Open Source: community-driven and cross-community cooperation
Apache Eagle Ecosystem - Security
How to Secure Hadoop in Realtime?
• Apache Eagle
• Apache Ranger
• Apache Knox
• Dataguise
Apache Eagle Ecosystem - Hadoop
Eagle in Apache Ambari: natively part of the Hadoop ecosystem
Apache Eagle Ecosystem - Docker
Eagle in Docker: fly natively on cloud and container platforms
STORM
KAFKA
ZOOKEEPER
HBASE
HADOOP
…
Powered by
git clone apache/incubator-eagle
eagle-docker
15 + 1 = 1
docker pull apacheeagle
Apache Eagle Ecosystem - Open Source
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Learn more about Apache Eagle
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
Q & A
apache/incubator-eagle
@TheApacheEagle
@ApacheEagle
http://eagle.incubator.apache.org
These slides are licensed under the Creative Commons Attribution 4.0 International license.

Editor's Notes

  • #12: Features
  • #13: Tech stack decision about the message bus
  • #14: Cross-node resource balance in the production environment
  • #33: A good ecosystem continuously keeps the architecture renewed and advanced, whether through technology stack iteration, staffing changes, or business growth