Professor Uri Weiser
Technion
Haifa, Israel
Handling Memory Accesses in Big Data Environment
Chipex 2016
The talk covers research done by: T. Horowitz, Prof. A. Kolodny, T. Morad, Prof. A. Mendelson, Daniel Raskin, Gil Shomron, Loren Jamal, Prof. U. Weiser
New Architecture Avenues in the Big Data Environment
• The era of heterogeneous computing
  - HW/SW fits the application
  - Dynamic tuning
  - Accelerators
  → performance, energy efficiency
• Big Data = big
  - In general, non-repeated access to all the "Big Data"
  - What are the implications?
Heterogeneous computing: Application-Specific Accelerators
Continue the performance trend with tuned architectures to bypass current technological hurdles.
[Figure: performance/power across the range of applications; labels: accelerators, tuned architectures, apps behavior]
New Architecture Avenues in the Big Data Environment
• Heterogeneous computing: "tuning" HW to respond to specific needs
• Example: Big Data memory access pattern
• Potential savings
• Reduction of data movement and DRAM bypass
• Bandwidth issue
• Potential solution
Big Data → usage of data
Input: unstructured data
• Read Once: Non-Temporal Memory Access
• Funnel: $\beta = BW_{out} / BW_{in}$
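Machine Learning
[Figure: unstructured input data is structured (data structuring = ETL) into structured data (aggregation), which feeds ML model creation, followed by model usage at the client. The stages are labeled A, B and C; C is model usage at the client.]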
Does Big Data exhibit a special memory access pattern?
It probably should, since:
• Revisiting ALL Big Data items would cause huge, slow data transfers from the data sources
• There are two modes of memory access:
  - Temporal Memory Access
  - Non-Temporal Memory Access
• Many Big Data computations exhibit Non-Temporal Memory Accesses and/or a Funnel operation
Non-Temporal Memory Access
Initial analysis: Hadoop-grep single memory access pattern
~50% of the unique memory references in Hadoop-grep are accessed only once
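The single-access statistic above can be illustrated with a minimal sketch that counts how many unique addresses in a memory trace are referenced exactly once. The function name `single_access_fraction` and the toy trace are hypothetical and are not taken from the talk; the Hadoop-grep measurement itself was obtained by the authors' own tooling.

```python
from collections import Counter


def single_access_fraction(trace):
    """Return the fraction of unique addresses referenced exactly once."""
    counts = Counter(trace)
    if not counts:
        return 0.0
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts)


# Hypothetical toy trace of cache-line addresses (illustration only, not measured data).
trace = [0x10, 0x20, 0x10, 0x30, 0x40, 0x20, 0x50, 0x60]
fraction = single_access_fraction(trace)
print(f"{fraction:.0%} of unique references are single access")
```

A high fraction suggests that caching those references brings little benefit, which is the read-once behavior the slide points at.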
Non-Temporal Memory Accesses: Preliminary Results
• WordCount: access to storage shows non-temporal locality
• Sort: access to storage shows NO non-temporal locality
[Figure: WordCount I/O utilization, access rate [KB/s] versus time [s]]
[Figure: Sort I/O, access rate [KB/s] versus time [s]]
Where is energy wasted?
• DRAM
• Limited BW
From: Mark Horowitz, Stanford, "Computing's Energy Problems"
From: Bill Dally (NVIDIA and Stanford), "Efficiency and Parallelism, the challenges of future computing"
Energy: DRAM
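Memory Subsystem: copies
Hierarchy and size, from the core outward:
- Registers: KBs
- L1$: 10's KBs
- L2$: MBs
- LL Cache: 10's MBs
- DRAM: GBs
- NV Storage: TBs
BW, from NV storage toward the core: 3 GB/sec, 25 GB/sec, 500 GB/sec, TB/sec
Copies made on the way from storage to the core:
- NV Storage: source
- Copy 1 (main memory)
- Copy 2 (LL Cache)
- Copy 3 (L2 Cache)
- Copy 4 (L1 Cache)
- Copy 5 (Registers): destination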
Memory Subsystem: DRAM bypass == DDIO
[Figure: memory hierarchy from NV Storage through DRAM, LL Cache, L2$ and L1$ to the registers and the core; BW from NV storage toward the core: 3-20 GB/sec, 25 GB/sec, 500 GB/sec, TB/sec]
Copies made on the way from storage to the core:
- NV Storage: source
- Copy 1 (main memory)
- Copy 2 (LL Cache)
- Copy 3 (L2 Cache)
- Copy 4 (L1 Cache)
- Copy 5 (Registers): destination
Potential savings: at 0.5 nJ/B (DRAM) and 10 to 20 GB/s NV BW → 5 W to 10 W
Reference: "Optimizing Read-Once Data Flow in Big-Data Applications", Morad, Shomron, Erez, Weiser, Kolodny, in Computer Architecture Letters Journal, 2016
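The potential-savings figure follows from multiplying the DRAM energy per byte by the storage bandwidth that no longer passes through DRAM. The sketch below reproduces that arithmetic; the helper name `dram_power_watts` is hypothetical, and decimal gigabytes (1 GB = 10^9 bytes) are assumed.

```python
def dram_power_watts(energy_per_byte_joules, bandwidth_bytes_per_sec):
    """Power spent moving data through DRAM: energy per byte times bytes per second."""
    return energy_per_byte_joules * bandwidth_bytes_per_sec


DRAM_ENERGY_PER_BYTE = 0.5e-9  # 0.5 nJ/B, the DRAM figure quoted on the slide
GB = 1e9                       # assumption: decimal gigabytes

low = dram_power_watts(DRAM_ENERGY_PER_BYTE, 10 * GB)   # 10 GB/s NV bandwidth
high = dram_power_watts(DRAM_ENERGY_PER_BYTE, 20 * GB)  # 20 GB/s NV bandwidth
print(f"Potential savings: {low:.1f} W to {high:.1f} W")
```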
Bandwidth
When should we use a Funnel at the data source?
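A: Bandwidth issue → systems are built for temporal locality
The memory hierarchy is optimized for highest bandwidth.
[Figure: memory hierarchy with sizes (Registers: KBs, L1$: 10's KBs, L2$: MBs, LLC: 10's MBs, DRAM: GBs, NV Storage: TBs) and BW from NV storage toward the core (3-20 GB/sec, 25 GB/sec, 500 GB/sec, TB/sec), comparing the existing BW with the BW desired for Non-Temporal Memory Accesses (NTMA)]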
B: Memory accesses per operation impact BW and CPU utilization
[Figure: bandwidth [MB/s] and CPU utilization [%] versus number of cores, for Read Once (Non-Temporal Memory Accesses)]
[Figure: bandwidth [MB/s] and CPU utilization [%] versus number of cores, for Temporal Memory Accesses]
Hint: memory accesses per operation
Solution: flow of "Non-Temporal Data Accesses"
[Figure: memory hierarchy (Core, Registers, L1$, L2$, LLC, DRAM, NV Storage) with the Funnel]
Use the Funnel when a bandwidth bottleneck occurs:
- "High" number of memory accesses per instruction
- Limited BW
- Memory accesses with no temporal locality
*Private communication with Moinuddin Qureshi
"Funnel"ing "Read-Once" data in storage
[Figure: typical SSD architecture*]
*Kang, Yangwook, Yang-suk Kee, Ethan L. Miller, and Chanik Park. "Enabling cost-effective data processing with smart SSD." In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pp. 1-12. IEEE, 2013.
**K. Eshghi and R. Micheloni. "SSD Architecture and PCI Express Interface"
Analytical model of the Funnel
[Figure: data enters the Funnel at bandwidth $BW_{in}$ and leaves toward post-processing at bandwidth $BW_{out}$]
$\beta = BW_{out} / BW_{in}$
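The ratio beta = BW_out / BW_in can be made concrete with a small sketch of a funnel that filters records at the data source and reports how many bytes it forwards relative to how many it reads. The function `run_funnel`, the grep-like predicate and the sample records are all hypothetical illustrations, not the implementation discussed in the talk.

```python
def run_funnel(records, keep):
    """Filter records with `keep` and return (kept records, beta).

    beta is bytes forwarded divided by bytes read, the byte-count analogue of
    BW_out / BW_in over the same time interval.
    """
    bytes_in = 0
    bytes_out = 0
    kept = []
    for record in records:
        bytes_in += len(record)
        if keep(record):
            bytes_out += len(record)
            kept.append(record)
    beta = bytes_out / bytes_in if bytes_in else 0.0
    return kept, beta


# Hypothetical read-once input: every record is inspected once and most are dropped.
records = [b"ok"] * 8 + [b"error: disk full", b"error: timeout"]
kept, beta = run_funnel(records, lambda r: r.startswith(b"error"))
print(f"kept {len(kept)} of {len(records)} records, beta = {beta:.2f}")
```

The smaller beta is, the less data has to cross the link between the storage device and the CPU.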
Proposed Architecture
B = bandwidth
• Baseline configuration: SSD storage connected to the CPU over PCIe; the CPU performs both the NTMA and the TMA (Temporal Memory Access) work
• Funnel configuration: SSD storage with a Funnel connected to the CPU over PCIe; the SSD performs the NTMA work and the CPU performs the TMA work
Funnel Performance
[Figure: performance versus beta for the two configurations. Annotations: performance improvement; "CPU becomes bottleneck" (twice); labels $1/BW_{PCIe}$ and $1/BW_{SSD}$]
• Baseline configuration: the CPU performs NTMA and TMA work
• Funnel configuration: the SSD performs NTMA work, the CPU performs TMA work
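One way to reason about where the bottleneck moves is a simple slowest-stage model. The sketch below is an illustrative simplification and not the analytical model from the talk: it assumes the sustained input rate is set by the slowest of the SSD, the PCIe link and the CPU, that the Funnel keeps up with the SSD, and that with a Funnel only a fraction beta of the input bytes crosses PCIe and reaches the CPU. The function name and the bandwidth numbers are hypothetical.

```python
def input_rate(ssd_bw, pcie_bw, cpu_bw, beta=None):
    """Sustained rate at which input data is consumed (same units as the arguments).

    beta=None models the baseline: every input byte crosses PCIe and is handled by the CPU.
    Otherwise only a fraction beta of the input bytes crosses PCIe and reaches the CPU.
    """
    if beta is None:
        return min(ssd_bw, pcie_bw, cpu_bw)
    if not 0 < beta <= 1:
        raise ValueError("beta must be in (0, 1]")
    return min(ssd_bw, pcie_bw / beta, cpu_bw / beta)


# Hypothetical bandwidths in GB/s (illustration only).
SSD_BW, PCIE_BW, CPU_BW = 20.0, 4.0, 8.0

baseline = input_rate(SSD_BW, PCIE_BW, CPU_BW)
for beta in (1.0, 0.25, 0.1):
    funnel = input_rate(SSD_BW, PCIE_BW, CPU_BW, beta)
    print(f"beta={beta}: speedup {funnel / baseline:.1f}x")
```

Under these assumptions the gain grows as beta shrinks until another stage, here the SSD, becomes the limit.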
Funnel Energy
[Figure: energy versus beta for the two configurations. Annotations: Funnel improvement; Funnel processor overhead; "CPU becomes the bottleneck" (twice)]
• Baseline configuration: the CPU performs NTMA and TMA work
• Funnel configuration: the SSD performs NTMA work, the CPU performs TMA work
Solution?
• Non-Temporal Memory Accesses should be processed as close as possible to the data source
• Data that exhibit temporal locality should use the current memory hierarchy
• Use machine learning (context-aware*) to distinguish between the two phases
Open questions:
• SW model
• Shared data
• HW implementation
• Computational requirements at the "Funnel"
*Reference: "Semantic locality and Context based prefetching", Peled, Mannor, Weiser, Etsion, in ISCA 2015
Summary
Memory access is a critical path in computing.
The Funnel should be used for:
• Resolving the system's BW bottleneck for specific applications
• Solving the system's BW issues for "Read Once" cases
• Reducing data movement
• Freeing up the system's memory resources (re-Spark)
• Simple, energy-efficient engines at the front end
Issues
…
