Graphene – Microsoft SCOPE on Tez
Hitesh Sharma (Principal Software Engineering Manager)
Anupam (Senior Software Engineer)
Agenda
• Overview of SCOPE and Cosmos
• SCOPE Job Manager responsibilities
• Design of Graphene
• Features required in Tez
Cosmos Environment
• A Microsoft-internal platform for building big-data applications
• Available externally as Azure Data Lake Analytics
• Enables customers to transform data of any scale into new business assets easily and at low cost in the cloud
Cosmos: World’s Biggest YARN Cluster!
• Single DC > 40K machines; multiple DCs
• > 500,000 jobs/day
• ~3 billion containers/day
• High avg. CPU utilization
• Three nines
• Exabytes in storage
• 100s of PB processed/day
• Exabytes of data moved
SCOPE
• Scripting language for Cosmos
• Influenced by SQL and relational concepts
• Works great with C# and .NET
• Very extensible
• Auto scale
• Naturally parallelizable computation
• Lowers the barrier to writing efficient programs
RawData =
    EXTRACT Clicks:int,
            Domain:string
    FROM @"RAWWEBDATA.TSV"
    USING DefaultTextExtractor();

WebData =
    SELECT *,
           Domain.Trim().ToUpper() AS NormalizedDomain
    FROM RawData;

OUTPUT WebData
    TO "WEBDATA.TSV"
    USING DefaultTextOutputter();
SCOPE platform
(Architecture diagram: Cosmos Front-End Service; SCOPE platform components include the Compiler, Optimizer, Job Manager, and Runtime/Engine.)
Job scale
• A single job can consume > 1 PB of data
• > 15,000 concurrent tasks (degree of parallelism)
• Thousands of vertices
• DAGs can be very wide, very deep, or both
• Millions of tasks in a job
• Billions of edges
Job Manager
• DAG execution
  • Builds the execution graph
  • Topologically executes the DAG (a generic sketch follows below)
  • Keeps track of the state of the job and its vertices
  • Dynamic DAG updates
    • Rack-level aggregation
    • Broadcast tree
• Fault tolerance
  • Handles failures and does revocations
  • Detects and mitigates outliers
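To make "topologically executes the DAG" concrete, the generic pattern is: keep a count of unfinished producers per vertex and schedule a vertex once that count reaches zero. The sketch below is a simplified, self-contained Java illustration of that pattern (all names are made up for the example), not the Cosmos Job Manager's code.

import java.util.*;

// Generic illustration of topological DAG execution: a vertex is scheduled
// only once all of its upstream vertices are done. Names are illustrative.
final class TopoExecutor {
  static void execute(Map<String, List<String>> downstream,   // vertex -> its consumers
                      Map<String, Integer> pendingInputs) {   // vertex -> unfinished producers
    Deque<String> ready = new ArrayDeque<>();
    pendingInputs.forEach((v, n) -> { if (n == 0) ready.add(v); });  // sources first

    while (!ready.isEmpty()) {
      String vertex = ready.poll();
      schedule(vertex);
      // Simplification: we treat the vertex as complete as soon as it is scheduled.
      // A real job manager would decrement consumers' counts only when the
      // upstream vertex actually finishes, asynchronously.
      for (String consumer : downstream.getOrDefault(vertex, List.of())) {
        if (pendingInputs.merge(consumer, -1, Integer::sum) == 0) {
          ready.add(consumer);   // all of this consumer's inputs are now ready
        }
      }
    }
  }

  static void schedule(String vertex) {
    System.out.println("Scheduling vertex " + vertex);   // placeholder for real scheduling
  }
}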
Job Manager
• Scheduling
  • Keeps track of cluster resources
  • Distributed scheduler
  • Requests bonus or opportunistic containers to increase utilization
  • Can upgrade opportunistic containers
• Container reuse
  • Opportunistic containers present some interesting choices for reuse
  • Tricky to implement
(Chart: parallel containers over time against the maximum containers allowed.)
Job Manager
• Finalization
  • Concatenate final outputs
  • Metadata operations
• Tooling
  • Near-real-time feedback
  • Finding the critical path
  • Structured error reporting
Current Challenges
• Higher cost of ownership
• No AM recovery
• Tied to Cosmos infrastructure
• Memory inefficient
• Native support for interactive workloads
Status and Roadmap
• Late 2017: Prototype
• Early 2018: Run benchmarks like TPC-X, TeraSort, etc.
• Mid 2018: Offline flighting of customer jobs
• Late 2018: First stage of production deployment
Design and Implementation of Graphene
Guiding Principles
• Minimal changes in the SCOPE stack
• Work with the community
• Use Tez extensibility
• Maintain compatibility
Graphene – Integration Points
• Algebra: consume the output of compilation to generate the DAG
• Engine: launch and communicate with the SCOPE engine (ScopeEngine)
• Tooling: produce status, debugging, and error details for existing tooling
• Store: interact with the storage layer
Graphene – Application Master
(Architecture diagram.) Inside the Graphene AM, a GrapheneDAGAppMaster hosts a DAG Converter that consumes the Algebra (an external component) and produces the Tez DAG, alongside a Store Client and an Input Initializer; the Tez DAGAppMaster and DAGImpl, together with custom edge and vertex managers, then drive the Tasks ("Tez magic"). Diagram legend: Tez component / uses Tez API / external component. (A sketch of the DAG-building step follows below.)
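For the DAG Converter step, stock Tez exposes a public DAG-building API. The sketch below shows roughly what emitting two vertices and a shuffle edge looks like with that API (org.apache.tez.dag.api). It is a minimal illustration, not Graphene's converter, and the com.example.scope.* class names are hypothetical stand-ins for the SCOPE processor, input, and output.

import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

// Minimal sketch of building a two-vertex Tez DAG, the kind of object a
// DAG Converter would emit after walking the SCOPE algebra.
// The com.example.scope.* class names are hypothetical.
public final class DagConverterSketch {
  public static DAG buildSampleDag() {
    Vertex extract = Vertex.create("Extract",
        ProcessorDescriptor.create("com.example.scope.ScopeProcessor"), 250);
    Vertex aggregate = Vertex.create("Aggregate",
        ProcessorDescriptor.create("com.example.scope.ScopeProcessor"), 50);

    // Scatter-gather (shuffle-style) edge between producer and consumer vertices.
    EdgeProperty shuffle = EdgeProperty.create(
        DataMovementType.SCATTER_GATHER,
        DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create("com.example.scope.ScopeOutput"),
        InputDescriptor.create("com.example.scope.ScopeInput"));

    return DAG.create("ScopeJob")
        .addVertex(extract)
        .addVertex(aggregate)
        .addEdge(Edge.create(extract, aggregate, shuffle));
  }
}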
Graphene – Task Execution
(Architecture diagram.) The Graphene AM, running in the AM container, launches a task container in which a SCOPE Task hosts the SCOPE Engine behind a SCOPE Processor with SCOPE Input and SCOPE Output. Task commands, InputDataInformationEvents, DataMovementEvents, and InputFailedEvents flow between the AM and the task, and the task reports status and errors back ("Tez magic"). Diagram legend: Tez component / uses Tez API / external component. (A processor sketch follows below.)
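The "SCOPE Processor" box is the Tez extension point that runs inside the task container. A processor plugs in by subclassing Tez's AbstractLogicalIOProcessor; the sketch below only shows the shape of such a class, with launchScopeEngine as a hypothetical stand-in for however the native SCOPE engine is actually started and wired to the Tez inputs and outputs.

import java.util.List;
import java.util.Map;

import org.apache.tez.runtime.api.AbstractLogicalIOProcessor;
import org.apache.tez.runtime.api.Event;
import org.apache.tez.runtime.api.LogicalInput;
import org.apache.tez.runtime.api.LogicalOutput;
import org.apache.tez.runtime.api.ProcessorContext;

// Shape of a Tez logical I/O processor. The body is illustrative only.
public class ScopeProcessorSketch extends AbstractLogicalIOProcessor {

  public ScopeProcessorSketch(ProcessorContext context) {
    super(context);
  }

  @Override
  public void initialize() throws Exception {
    // Read the vertex payload / configuration from getContext() here.
  }

  @Override
  public void run(Map<String, LogicalInput> inputs,
                  Map<String, LogicalOutput> outputs) throws Exception {
    launchScopeEngine(inputs, outputs);   // hand off to the external engine (illustrative)
  }

  @Override
  public void handleEvents(List<Event> processorEvents) {
    // React to events routed to this processor (e.g., custom control events).
  }

  @Override
  public void close() throws Exception {
    // Release engine resources.
  }

  // Hypothetical hand-off point; a real integration would start the native
  // SCOPE engine and wire it to the resolved Tez inputs and outputs.
  private void launchScopeEngine(Map<String, LogicalInput> inputs,
                                 Map<String, LogicalOutput> outputs) {
  }
}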
Graphene – Tooling Integration
(Architecture diagram.) The SCOPE Task (SCOPE Engine) in the task container emits periodic statistics and diagnostics; Tez delivers them to a JobProfiler/EventListener in the Graphene AM, which exposes real-time and historic stats at the task and vertex level. Diagram legend: Tez component / uses Tez API / external component. (A generic progress-polling sketch follows below.)
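Graphene gets these statistics from an event listener inside the AM (the JobProfiler above). For comparison, the simplest way to get near-real-time progress with stock Tez is to poll the DAGClient from the submitting process; the sketch below shows that generic approach and is not how the JobProfiler is implemented.

import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;
import org.apache.tez.dag.api.client.Progress;

// Generic near-real-time progress loop using the public Tez client API.
public final class ProgressPoller {
  public static void pollUntilDone(TezClient tezClient, DAG dag) throws Exception {
    DAGClient dagClient = tezClient.submitDAG(dag);

    DAGStatus status = dagClient.getDAGStatus(null);   // null => no counters requested
    while (status.getState() != DAGStatus.State.SUCCEEDED
        && status.getState() != DAGStatus.State.FAILED
        && status.getState() != DAGStatus.State.KILLED
        && status.getState() != DAGStatus.State.ERROR) {
      Progress p = status.getDAGProgress();
      if (p != null) {
        System.out.printf("tasks: %d/%d succeeded, %d running, %d failed%n",
            p.getSucceededTaskCount(), p.getTotalTaskCount(),
            p.getRunningTaskCount(), p.getFailedTaskCount());
      }
      // Per-vertex progress is also available via status.getVertexProgress().
      Thread.sleep(2000);
      status = dagClient.getDAGStatus(null);
    }
    System.out.println("final state: " + status.getState());
  }
}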
Experience So Far
• Reliability: as expected from production-ready software; no major bugs or reliability issues
• Onboarding: modular and tested code; documentation is an opportunity to contribute
• Community: very responsive; special thanks to Bikas Saha, Kuhu Shukla, and Jonathan Eagles
Scaling Tez
• Existing Cosmos workloads can have > 15k parallel tasks
  • Acquiring and managing these containers
  • Managing communications with these tasks
  • Providing real-time progress for all the tasks
Scaling Tez
• Optimize AM memory
  • Metadata management for large inputs
  • Memory pressure under large event throughput
• Large DAGs with > 2,000 vertices and > 1 million tasks
  • Optimizations for deep DAGs
Integrating with YARN: Opportunistic Containers
• A mechanism to drive up cluster utilization
• The AM has a deep understanding of the capability
• Effectively using opportunistic containers in the scheduler (a request sketch follows below)
• Harder scheduling choices with container reuse
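YARN surfaces opportunistic containers to an AM as an execution-type hint on the container request. The sketch below shows that mechanism with the plain YARN AMRMClient API (Hadoop 2.9+/3.x); it is not Tez's scheduler code, and how Graphene's scheduler should use the hint is exactly the open question this slide raises. The 2-core/6 GB capability mirrors the Cosmos token size mentioned in the speaker notes.

import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Requesting an OPPORTUNISTIC container from the RM (Hadoop 2.9+ / 3.x API).
public final class OpportunisticRequestSketch {
  public static void requestOne(AMRMClient<ContainerRequest> amRmClient) {
    Resource capability = Resource.newInstance(6 * 1024, 2);   // 6 GB, 2 vcores
    ExecutionTypeRequest oppo =
        ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true);

    ContainerRequest request = new ContainerRequest(
        capability,
        null,                      // nodes: no locality constraint in this sketch
        null,                      // racks
        Priority.newInstance(1),
        0L,                        // allocationRequestId
        true,                      // relaxLocality
        null,                      // node label expression
        oppo);                     // ask for an opportunistic container

    amRmClient.addContainerRequest(request);
    // Containers returned on later allocate() heartbeats carry
    // ExecutionType.OPPORTUNISTIC and can be queued or preempted by the NM.
  }
}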
AM Recovery
• High-priority customer ask
• Need to plug Graphene into this AM resiliency (see the YARN-side sketch below)
• Deterministic and reliable recovery with dynamic behavior
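Tez AM recovery relies on YARN starting a new AM attempt, which then replays what the previous attempt had recorded in its recovery log. As a generic illustration of the YARN side only (not Graphene-specific), an application opts in to multiple AM attempts when it is submitted:

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

// Generic YARN-side prerequisite for AM recovery: allow more than one AM attempt.
public final class AmRecoverySketch {
  public static void configure(ApplicationSubmissionContext appContext) {
    appContext.setMaxAppAttempts(3);   // retry the AM up to 3 times
    // Optional work-preserving restart; only useful if the framework can
    // reattach to containers that keep running across AM attempts.
    appContext.setKeepContainersAcrossApplicationAttempts(true);
  }
}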
Conclusion
Microsoft SCOPE analytics running on Apache YARN and Tez!
Our journey has just started. We invite you to collaborate.
References
• Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications [SIGMOD 2015]
• SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [VLDB 2008]
• Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing [OSDI 2014]
• Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks [EuroSys 2007]
• Lessons Learned from Scaling YARN to 40K Machines in a Multi-Tenancy Environment [DataWorks Summit 2017]


Editor's Notes

  • #2: We are here to talk about how we are looking to power SCOPE with Tez.
  • #3: We will do a quick overview of Cosmos and SCOPE. Then we will talk about the role the Job Manager plays in the system and how we are looking to fix some of the problems we have by leveraging Tez. Anupam will dig a little deeper into the design of Graphene; he will be talking about the challenges in front of us and why we need your help to take Tez to the next level.
  • #4: A Microsoft-internal platform for building big-data applications. Used across Microsoft by Bing, Azure, Windows, and Office for data mining and analysis. Available externally as Azure Data Lake Analytics. Lets users focus on transforming data to gain insights while we focus on operating the platform at lower COGS.
  • #6: SCOPE is the main scripting language for Cosmos, targeted at large-scale data analysis. You could run a script over 1 GB, a TB, or a PB, and we handle scaling it. It is a SQL-like language that allows C# and .NET developers to get started easily. On the right is a sample SCOPE script: we read a TSV file, run a select statement over it that adds a new column, and output the result as a new file. Users can easily define their own functions and even implement their own versions of operators like extractors, processors, and outputters. Users just write the script as if it were going to run on a single machine and we scale it out on the cluster, so the nitty-gritty of dealing with failures and retries is not something a user should worry about.
  • #7: Users submit a SCOPE script from Visual Studio using the SCOPE Studio plugin. The script goes through the Cosmos Job Service and front end, where it is compiled by the SCOPE compiler. The compiler produces an AST representation of the script along with the codegen DLLs for user code and other artifacts. The optimizer makes decisions about the execution plan and parallelism and generates an algebra. The Job Manager (which is us) parses the algebra and starts executing the DAG on the cluster. As part of the execution, the JM launches the SCOPE engine on the tasks, which provides implementations of many standard physical operators. The JM gives the SCOPE engine input paths to read and outputs to produce. Typically the outputs of one vertex become inputs to some other vertex, and DAG execution continues. In short, the SCOPE compiler and optimizer are responsible for generating an efficient execution plan and the runtime.
  • #9: So what are the responsibilities of the Job Manager? DAG execution: the JM is the central coordinating process for all processing vertices within an application. Its primary function is to construct the runtime DAG from the compile-time representation of the DAG and execute over it. The JM schedules a DAG vertex onto cluster nodes when all of its inputs are ready. The JM can also make dynamic updates to the graph, like a pod-level aggregation, or build a broadcast tree. Fault tolerance: the Job Manager monitors the progress of all executing vertices. Failing vertices are re-executed a limited number of times, and if there are too many failures the job is terminated. The JM also detects slower tasks in a vertex and re-executes them elsewhere on the cluster.
  • #10: Scheduling: when a task is ready, the JM looks for a machine in the cluster to run it on. The global cluster load information used by each JM is provided through the cooperation of two additional entities in the system: a Resource Monitor (RM) for each cluster and a Process Node (PN) on each server. The RM continuously aggregates load information from PNs across the cluster, providing a global view of cluster status that each JM uses to make informed scheduling decisions. It also enforces token limits: users typically give a job some tokens to run, and each token amounts to 2 cores and 6 GB. The JM ensures that the resources used by the job never exceed the allocated number of tokens.
  • #11: When the job finishes, the JM finalizes the outputs so they become visible to the user. It also supports some custom metadata operations like catalog updates.
  • #14: Go through the details. Explains the design and implementation decisions for Graphene and how we use Tez.
  • #15: 1m. Once we decided to implement the SCOPE AM using Tez, we settled on certain ground rules, or guiding principles, for accomplishing this goal. Hitesh already gave us an idea of the scale of Cosmos and SCOPE workloads and how critical they are for Microsoft’s business. The compiler, optimizer, execution engine, and tooling will be minimally changed in order to allow for a staged transition. Tez has a very powerful set of APIs that allow any system to plug in; we will be using these extensibility points as much as possible. Finally, for features that we feel the need to add to Tez, we will be working with the community and making them work generally for all Tez users as much as possible. With these ground rules set, we started working on porting SCOPE to run on top of Tez.
  • #16: 3m. The need to upgrade seamlessly from the current Job Manager to Graphene implies that Graphene should be a drop-in replacement for the current Job Manager. As Hitesh showed, doing this at Cosmos scale, while being the backbone of Microsoft’s analytics needs, calls for minimal perturbation. This meant that the SCOPE AM on Tez had to mimic the behavior of the existing Job Manager. Graphene has four unique integration points in the Cosmos SCOPE stack that are not native to Tez. This introduction to our guiding principles and integration points will be helpful for understanding our implementation and the rationale behind our design choices.
  • #17: 5m
  • #18: 6m
  • #19: 7m
  • #20: 8m
  • #21: 10m. Bring learnings from the Job Manager back to Tez.
  • #22: 12m
  • #23: 13m
  • #24: 15m
  • #25: 16m
  • #26: 16m