Automated Profiling of Virtualized Media Processing
Functions Using Telemetry and Machine Learning
Rufael Mekuria (Unified Streaming), Michael J. McGrath (Intel), Victor
Bayon-Molino (Intel), Vincenzo Riccobene (Intel), Christos Tselios (Citrix),
Artem Dobrodub (Nokia), John Thomson (Onapp)
ACM Multimedia Systems Conference 2018, June 12-15 Amsterdam the Netherlands: 5G Multimedia
Summary/Overview
- Context: Emerging 5G and 5G Multimedia technologies
- Research challenge
- Contribution
- TALE based profiling approach
- Virtualized media processing function
- Profiling using TALE and telemetry
- KPI Mapping using telemetry and machine learning
- Conclusion/Future work
Context (1): Emerging 5G Network Technologies
[Diagram labels]
- 5G RAN cloudification, distribution
- Virtualized network functions
- Network function virtualization, management & orchestration
- Massive MIMO
- Network slicing
- Millimetre wave technologies (26+ GHz)
- Multi-access edge computing
- Virtualisation technologies
- 5G radio access
- 5G cloud/NFV virtualized core
- Small cells
- WLAN + LTE
- Cloud operating system, e.g. hypervisor
Context (2): Advanced Media Services
[Diagram labels]
- Content encryption
- Watermarked streams
- Personalized advertisement
- 360 degree
- High Dynamic Range
- Personalized video streaming
- Content security
- Emerging formats (smarter pixels)
- Many pixels
- Point Cloud
- Light Field
- Personalized captions (language)
- Ultra HD 4K, 8K
- High Frame Rate
- Increased bandwidth of 5G radio link -> more data
- Virtualized 5G cloud native core -> smarter network, smarter streaming
- Computer Vision
- 5G transcode/transmux
- AR/MR
5G Context (3): Example edge converged cloud native 5G architecture
[Architecture figure labels]
- User equipment
- Access network: 4G LTE, LTE Broadcast, 5G wireless, Cloud-RAN
- Edge cloud, MEC cloud
- Radio network information, orchestration
- Aggregation network
- Regional cloud
- vEPC, vIMS
- Core cloud: scalable data center cloud infrastructure
- Virtualized network, virtualized network infrastructure
Source: Giuseppe Bianchi et al., "Superfluidity: a flexible functional architecture for 5G networks," Transactions on Emerging Telecommunications Technologies, Volume 27, Issue 9, Special Issue: 5GPPP Feature Issue, September 2016, pages 1178-1186.
Context (4): Design goals of Superfluid 5G network
- Hardware independence (x86, ARM, GPU, FPGA)
- Time independence
- Scale independence: scale from 1 to many users (millions)
- Location independence
- Reduce costly overprovisioning
A broad range of technologies is needed (NFV, MEC, SDN, etc.), with advances in cloud/NFV technologies at the core (hence our work)
https://www.unified-streaming.com/blog/5g-superfluidity-and-future-streaming-video
Context (5): Some NFV pointers
+ Rashid Mijumbi et al., "Network Function Virtualization: State-of-the-Art and Research Challenges," IEEE Communications Surveys & Tutorials, vol. 18, no. 1, First Quarter 2016
+ http://www.etsi.org/technologies-clusters/technologies/nfv
Difference between NFV and cloud: NFV adds a service/function abstraction (the VNF), which makes per-function optimization possible
Context (6): Virtualisation Stack
[Figure: virtualisation stacks compared: hypervisor type 1, hypervisor type 2, operating system level *]
Our focus is type 1 virtualisation
* Mark Croes, "Performance analysis of virtualized video streaming service," Bachelor Thesis, University of Amsterdam, June 2017
Research Challenges
R1: Performance modelling/workload characterization
-> a virtual box in the cloud vs. a physical box: the underlying hardware is heterogeneous and sits beneath a cloud operating system stack
R2: Reduce overprovisioning of the underlying physical/virtual infrastructure
R3: KPI mapping for more efficient telemetry/monitoring, identifying the most important metrics relating to the service quality defined in the SLA
Goal: “efficient carrier grade cloud native processing functions”; this work is a step in this direction
Prior art
- Cloud video streaming [1-5]
- Mathematical modelling (Jackson queue etc.) uses QoS constraints to enable more efficient scale-in/out etc.
- Not sufficient for real deployment (anomalies, unexpected behavior)
- Not sufficient for NFV -> heterogeneous underlying hardware (entropy)
- Not sufficient for MPEG-DASH, where client behavior is not standardized, hence modelling user load is difficult
- Google golden signals, CPU thread state, off-CPU analysis -> no automated step mapping to a specific workload (domain knowledge needed); automated mapping is useful for function abstraction in NFV
[1] Wu et al., "CloudMedia: When Cloud on Demand Meets Video on Demand," in IEEE ICDCS, Minneapolis, 2011, pp. 268-277.
[2] Nan et al., "Optimal allocation of virtual machines for cloud-based multimedia applications," in IEEE MMSP Workshop, 2012, pp. 175-180.
[3] D. Niu et al., "Quality-Assured Cloud Bandwidth Auto-Scaling for Video-on-Demand Applications," in IEEE INFOCOM, 2012.
[4] Y. Jin, Y. Wen, C. Westphal, "Optimal Transcoding and Caching for Adaptive Streaming in Media Cloud: An Analytical Approach," IEEE TCSVT, vol. 25, no. 12, pp. 1914-1925, December 2015.
[5] J. He, Y. Wen, J. Huang, D. Wu, "On the Cost-QoE Trade-off for Cloud-based Video Streaming under Amazon EC2’s Pricing Models," IEEE TCSVT, vol. 24, no. 4, pp. 669-680, September 2013.
Contribution
- Automated profiling of virtualized media processing functions to reduce overprovisioning in an NFV/cloud deployment
- Automated KPI mapping of virtualized media processing functions, enabling more efficient, targeted telemetry-based monitoring (this can enable more efficient scale-in/out etc.)
- Empirical approach for cloud video streaming (we only focus on profiling and KPI mapping)
TALE metric collection summary
TALE: Throughput, Anomaly, Latency, Entropy
Metrics are obtained from the cloud infrastructure with tools such as Intel Snap, OpenStack Ceilometer, Amazon CloudWatch etc.
Metrics are obtained from OS/hypervisors, compute, storage, networks etc.
Full stack monitoring: collect as many metrics as possible from each of the layers; statistics/machine learning are then used to identify the key metrics -> the analytics pipeline is discussed later
Virtualized Media Processing Function
- Media (audio/video) is a large percentage of online traffic, and this is expected to continue in 5G
- Compute capabilities in the network edge can improve video distribution [6], [7]
- We consider a streaming function that delivers content with adaptive streaming (HLS/DASH), adds captions and encrypts the content, all from a single media source, based on Unified Origin. NOTE: this is an Apache plugin
- In this paper the media processing function is deployed in the central cloud, where it serves as origin server.
[6] Rufael Mekuria, Jelte Fennema, and Dirk Griffioen. 2016. Multi-Protocol Video Delivery with Late Trans-Muxing. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 92-96. DOI: https://doi.org/10.1145/2964284.2967189
[7] https://www.unified-streaming.com/news/finnish-telecom-leader-elisa-teams-unified-streaming-late-transmuxing
Performance of Live vs. VoD
- Performance analysis and planning model for live and VoD
- Basic setup: source -> origin -> CDN -> client
[Plot: load on origin for increasing number of users]
Does not converge for VoD!
https://www.unified-streaming.com/blog/scaling-video-streaming-live-versus-vod
Experimental Setup
[Diagram components]
- Load generator by Citrix: simulated users (simusers) requesting different contents
- Origin VM (Apache): 2 vCPU, 4 GB RAM
- OpenStack Liberty release (KVM based)
- Telemetry agent
- Storage & visualisation
- KPI mapping framework (discussed later)
Profiling(1): Throughput tests (Tale)
- User arrival rate matters for stability!
- Bitrate capped at 3.6 Gbit/s
Profiling(2): Behavior tAle
- Fix the Apache MPM (multi-processing module) configuration
- Result -> larger number of connections and higher throughput: 7 Gbit/s (up from 3.6 Gbit/s)
Profiling(3): Throughput tests (Tale)
- Operational range leading to a breaking point
- Previous results only showed saturation
- How to identify/predict the breaking point?
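One simple way to approach the breaking-point question is to scan a measured load-versus-throughput curve and separate saturation (throughput flattens) from a breaking point (throughput drops sharply while load keeps rising). The sketch below is only an illustration, not the method used in this work; the function name, the 20% drop threshold and the sample curves are all assumptions.

```python
from typing import Optional, Sequence


def find_breaking_point(
    users: Sequence[int],
    throughput: Sequence[float],
    drop_fraction: float = 0.2,  # hypothetical threshold, tune per deployment
) -> Optional[int]:
    """Return the user count at which throughput first collapses, if any.

    A collapse is flagged when throughput falls more than `drop_fraction`
    below the best value seen so far. A plateau (saturation) never triggers.
    """
    if len(users) != len(throughput):
        raise ValueError("users and throughput must have the same length")
    best = 0.0
    for n, t in zip(users, throughput):
        if best > 0 and t < (1.0 - drop_fraction) * best:
            return n
        best = max(best, t)
    return None


# Made-up sample curves in Gbit/s, for illustration only.
load = [100, 200, 400, 800, 1600, 3200]
saturating = [1.0, 2.0, 3.0, 3.5, 3.6, 3.6]
collapsing = [1.5, 3.0, 5.0, 7.0, 6.9, 2.0]

print(find_breaking_point(load, saturating))  # no collapse on a plateau
print(find_breaking_point(load, collapsing))  # load level where it breaks
```

This only detects a collapse after it has happened in the measurements; predicting it ahead of time needs leading indicators from telemetry, which is what the KPI mapping part of the talk is about.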
Profiling(4): Anomalous Behavior tAle
- Interrupt storm: the VM is busy dealing with interrupts from I/O (NIC/storage)
- VM exits (communication between hypervisor and VM): at least one exit and entry per interrupt
- Overwhelms the CPU
- This is an example of virtualization overhead in hypervisor-based virtualization, which should be accounted for in practical systems
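An interrupt storm of this kind can be spotted from inside a Linux guest by diffing two snapshots of /proc/interrupts. The sketch below is a minimal illustration under stated assumptions: the snapshot text is made up, the 10,000 interrupts/s threshold is hypothetical, and the telemetry agent used in this work may collect these counters differently.

```python
from typing import Dict


def parse_interrupts(text: str) -> Dict[str, int]:
    """Sum the per-CPU counters of each row of /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals: Dict[str, int] = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()[:n_cpus]
        totals[label.strip()] = sum(int(f) for f in fields if f.isdigit())
    return totals


def interrupt_rates(before: str, after: str, seconds: float) -> Dict[str, float]:
    """Interrupts per second for each IRQ between two snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return {irq: (new[irq] - old.get(irq, 0)) / seconds for irq in new}


# Made-up snapshots taken 10 seconds apart (illustrative values only).
SNAPSHOT_T0 = """\
           CPU0       CPU1
 24:       1000       2000   PCI-MSI 49152-edge      virtio0-input.0
 25:        500        500   PCI-MSI 49153-edge      virtio0-output.0
"""
SNAPSHOT_T1 = """\
           CPU0       CPU1
 24:      61000      62000   PCI-MSI 49152-edge      virtio0-input.0
 25:        600        600   PCI-MSI 49153-edge      virtio0-output.0
"""

rates = interrupt_rates(SNAPSHOT_T0, SNAPSHOT_T1, seconds=10.0)
STORM_THRESHOLD = 10_000.0  # hypothetical interrupts/s limit
storming = [irq for irq, rate in rates.items() if rate > STORM_THRESHOLD]
print(rates, storming)
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a sleep in between.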
Profiling(5): VM Exits caused by interrupts
Steps in dealing with interrupts when the guest VM is not interruptible
[Diagram: Guest VM / Hypervisor]
- Hypervisor wants to send an interrupt
- Guest interruptible? No (RFLAGS.IF=0)
- Hypervisor sets interrupt-window exiting
- Exit: reason interrupt window
On the VM exit:
- Record the cause of the exit in the VM exit information
- Save processor state (control registers, debug registers, pending exceptions) in the guest state area
- Save MSRs (model-specific registers)
- Load processor state from the host state area
- After handling is completed, a similar VM entry operation happens
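Because every interrupt costs at least one exit and one entry, a back-of-the-envelope estimate shows how an interrupt storm can overwhelm a vCPU. The sketch below is only arithmetic under loudly stated assumptions: the cycles-per-exit cost, the clock speed and the interrupt rate are illustrative placeholders, not values measured in this work.

```python
def vm_exit_cpu_fraction(
    interrupts_per_s: float,
    exits_per_interrupt: float = 1.0,   # "at least one exit and entry per interrupt"
    cycles_per_exit: float = 2000.0,    # assumed round-trip cost, hardware dependent
    cpu_hz: float = 2.5e9,              # assumed core clock
) -> float:
    """Fraction of one core spent on VM exit/entry round trips."""
    return interrupts_per_s * exits_per_interrupt * cycles_per_exit / cpu_hz


# Illustrative input: 500,000 interrupts per second hitting one vCPU.
fraction = vm_exit_cpu_fraction(500_000)
print(f"{fraction:.0%} of one core spent in exit/entry overhead")
```

With these placeholder numbers a large share of a core goes to overhead before any interrupt handler or application code runs, which is consistent with the "overwhelms the CPU" observation; real costs have to be measured on the target hardware.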
Profiling(7): Latency and Entropy summary (taLE)
- Single hardware/OS configuration -> hardware entropy not studied
- Other entropic behavior (e.g. time of day, temperature) was not observed in the test setup
- Analysis of subsystem metrics (e.g. CPU utilization, memory) under normal operational conditions did not reveal any unexpected behaviours
- Undefined behavior after the breaking point
KPI(1): Automated KPI Mapping
[Figure: KPI mapping framework and analytics pipeline; machine learning approach with ensemble]
An ensemble approach is taken, combining the different algorithms, metrics with > 75%
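The ensemble idea can be illustrated with a small voting scheme: several scoring algorithms each rank the telemetry metrics against a KPI, and only metrics backed by a large enough share of the algorithms are kept. This is a sketch, not the pipeline used in this work: the three scorers, the top-k voting rule, the metric names and the reading of the 75% figure as a vote-share threshold are all assumptions.

```python
from typing import Callable, Dict, List, Sequence

Series = Sequence[float]
Scorer = Callable[[Series, Series], float]


def _pearson(x: Series, y: Series) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def _ranks(x: Series) -> List[float]:
    # Ties are not averaged; good enough for a sketch.
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks


def abs_pearson(x: Series, y: Series) -> float:
    return abs(_pearson(x, y))


def abs_spearman(x: Series, y: Series) -> float:
    return abs(_pearson(_ranks(x), _ranks(y)))


def step_agreement(x: Series, y: Series) -> float:
    """Fraction of consecutive samples where metric and KPI move the same way."""
    same = sum(1 for i in range(1, len(x)) if (x[i] - x[i - 1]) * (y[i] - y[i - 1]) > 0)
    return same / (len(x) - 1)


def ensemble_select(metrics: Dict[str, Series], kpi: Series, scorers: Sequence[Scorer],
                    top_k: int = 2, min_vote_share: float = 0.75) -> List[str]:
    """Keep metrics that more than `min_vote_share` of the scorers put in their top k."""
    votes = {name: 0 for name in metrics}
    for scorer in scorers:
        ranked = sorted(metrics, key=lambda m: scorer(metrics[m], kpi), reverse=True)
        for name in ranked[:top_k]:
            votes[name] += 1
    return sorted(n for n, v in votes.items() if v / len(scorers) > min_vote_share)


# Made-up samples: one client-side KPI and four hypothetical telemetry metrics.
kpi = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
metrics = {
    "net_tx_bytes": [10.0, 21.0, 29.0, 42.0, 50.0, 61.0],
    "mem_used": [5.0, 5.5, 6.5, 7.0, 8.5, 9.0],
    "disk_io": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "cpu_temp": [40.0, 39.0, 41.0, 40.0, 39.0, 40.0],
}
selected = ensemble_select(metrics, kpi, [abs_pearson, abs_spearman, step_agreement])
print(selected)
```

Voting across scorers makes the selection less sensitive to the quirks of any single algorithm, at the cost of possibly dropping a metric that only one method can see.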
KPI(2): KPI Mapping bootstrap
Client side metrics (measured by load generator):

KPI                      | Definition                               | Measured at
Throughput per client    | Received bytes per client (MB/s)         | Client
Latency per user request | Delay to first byte received             | Client
Transaction failure rate | Number of requests resulting in an error | Client
Requests / second        | Requests handled per second per user     | Client

The idea is to find the cloud metrics that can be measured by telemetry and that correlate most with these client-side KPIs
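The four client-side KPIs can be derived from per-request records kept by a load generator. The sketch below assumes a hypothetical record layout (`RequestRecord` and its fields are not taken from the Citrix tool), treats any HTTP status of 400 or above as an error, and reports the failure rate as a fraction of requests rather than a raw count.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:  # hypothetical layout of one load-generator log entry
    client_id: str
    start_s: float        # time the request was sent
    first_byte_s: float   # time the first response byte arrived
    bytes_received: int
    status: int           # HTTP status code


def client_kpis(records: List[RequestRecord], window_s: float) -> Dict[str, float]:
    """Aggregate the four client-side KPIs over one measurement window."""
    n_clients = len({r.client_id for r in records})
    total_bytes = sum(r.bytes_received for r in records)
    failures = sum(1 for r in records if r.status >= 400)
    latencies = [r.first_byte_s - r.start_s for r in records]
    return {
        "throughput_mb_s_per_client": total_bytes / 1e6 / window_s / n_clients,
        "latency_s_per_request": sum(latencies) / len(latencies),
        "failure_rate": failures / len(records),
        "requests_per_s_per_client": len(records) / window_s / n_clients,
    }


# Made-up records for two simulated users over a 2-second window.
records = [
    RequestRecord("u1", 0.0, 0.1, 1_000_000, 200),
    RequestRecord("u1", 1.0, 1.3, 3_000_000, 200),
    RequestRecord("u2", 0.5, 0.7, 2_000_000, 200),
    RequestRecord("u2", 1.5, 1.7, 0, 503),
]
kpis = client_kpis(records, window_s=2.0)
print(kpis)
```

Computing these per window gives one KPI time series per measure, which can then be lined up against the telemetry time series collected on the cloud side.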
KPI(3): KPI Mapping results
Non-optimized network media processing function: throughput and latency correlate to memory and network metrics
- throughput : network, 98%
- latency : memory, 80%
Optimized processing function: throughput and latency correlate to scheduler metrics
- throughput : proc_schedstat_wake_up_local (local wakeup calls)
- latency : proc_schedstat_running (time running processes)
Discussion highlights KPI/profiling
P1. Massive telemetry combined with machine learning is useful for understanding virtualized media processing function behavior
P2. An operational range was identified in which behavior is predictable, but performance collapses beyond the breaking point (not saturation); it will be important to detect catastrophic failure before it happens
K1. For the non-optimized function, throughput/latency KPIs map to subsystem metrics (memory, network)
K2. For the optimized function, scheduler metrics map to throughput and latency
K3. Steady state and non-steady state should be distinguished and detected
K4. In both states the respective KPI mapping can then be used to avoid catastrophic failure
K5. An 85% reduction in the metrics to be collected was achieved by not collecting non-relevant metrics
K6. 80% to 95% accuracy for throughput
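Point K3 can be illustrated with a sliding-window check: a window of throughput samples counts as steady when its coefficient of variation stays small. This is a minimal sketch, not the detector used in this work; the window length, the 10% threshold and the sample series are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def label_states(series: Sequence[float], window: int = 4,
                 max_cv: float = 0.1) -> List[str]:
    """Label each full sliding window as 'steady' or 'non-steady'.

    The coefficient of variation (std / mean) of the window is compared
    with `max_cv` (hypothetical threshold).
    """
    labels = []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        avg = mean(chunk)
        cv = pstdev(chunk) / avg if avg > 0 else float("inf")
        labels.append("steady" if cv <= max_cv else "non-steady")
    return labels


# Made-up throughput samples (Gbit/s): stable operation, then a collapse.
throughput = [7.0, 7.1, 6.9, 7.0, 7.0, 3.0, 1.0, 0.5]
states = label_states(throughput)
print(states)
```

Once each window carries a state label, the KPI mapping that belongs to that state can be applied to it, in the spirit of K4.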
Conclusion / Future Work
1. Systematic automated approach for KPI mapping and profiling
2. Not based on pre-defined analytical models
3. Based on real production technologies (OpenStack, Unified Origin),
results can be useful to enable KPI driven scaling, which can reduce overprovisioning
4. A small step towards the goal of “efficient carrier grade cloud native processing functions”
Thank You!
Contact me: Rufael Mekuria rufael@unified-streaming.com
Contact Vincenzo: vincenzo.m.riccobene@intel.com
This work is supported by the
European Union (H2020 RIA, GA No. 671566) Superfluidity.
