Compiling Algorithms
for Heterogeneous Systems
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Compiling Algorithms for Heterogeneous Systems
Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
2018
Architectural and Operating System Support for Virtual Memory
Abhishek Bhattacharjee and Daniel Lustig
2017
Deep Learning for Computer Architects
Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks
2017
On-Chip Networks, Second Edition
Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh
2017
Space-Time Computing with Temporal Neural Networks
James E. Smith
2017
Hardware and Software Support for Virtualization
Edouard Bugnion, Jason Nieh, and Dan Tsafrir
2017
Datacenter Design and Management: A Computer Architect’s Perspective
Benjamin C. Lee
2016
A Primer on Compression in the Memory Hierarchy
Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
2015
Research Infrastructures for Hardware Accelerators
Yakun Sophia Shao and David Brooks
2015
Analyzing Analytics
Rajesh Bordawekar, Bob Blainey, and Ruchir Puri
2015
Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015
Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015
Single-Instruction Multiple-Data Execution
Christopher J. Hughes
2015
Power-Efficient Computer Architectures: Recent Advances
Magnus Själander, Margaret Martonosi, and Stefanos Kaxiras
2014
FPGA-Accelerated Simulation of Computer Systems
Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
2014
A Primer on Hardware Prefetching
Babak Falsafi and Thomas F. Wenisch
2014
On-Chip Photonic Interconnects: A Computer Architect’s Perspective
Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella
2013
Optimization and Mathematical Modeling in Computer Architecture
Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and
David Wood
2013
Security Basics for Computer Architects
Ruby B. Lee
2013
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013
Shared-Memory Synchronization
Michael L. Scott
2013
Resilient Architecture Design for Voltage Variation
Vijay Janapa Reddi and Meeta Sharma Gupta
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
Automatic Parallelization: An Overview of Fundamental Compiler Techniques
Samuel P. Midkiff
2012
Phase Change Memory: From Devices to Systems
Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran
2011
Multi-Core Cache Hierarchies
Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar
2011
A Primer on Memory Consistency and Cache Coherence
Daniel J. Sorin, Mark D. Hill, and David A. Wood
2011
Dynamic Binary Modification: Tools, Techniques, and Applications
Kim Hazelwood
2011
Quantum Computing for Computer Architects, Second Edition
Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong
2011
High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities
Dennis Abts and John Kim
2011
Processor Microarchitecture: An Implementation Perspective
Antonio González, Fernando Latorre, and Grigorios Magklis
2010
Transactional Memory, 2nd edition
Tim Harris, James Larus, and Ravi Rajwar
2010
Computer Architecture Performance Evaluation Methods
Lieven Eeckhout
2010
Introduction to Reconfigurable Supercomputing
Marco Lanzagorta, Stephen Bique, and Robert Rosenberg
2009
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Fault Tolerant Computer Architecture
Daniel J. Sorin
2009
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, and James Laudon
2007
Transactional Memory
James R. Larus and Ravi Rajwar
2006
Quantum Computing for Computer Architects
Tzvetan S. Metodi and Frederic T. Chong
2006
Copyright © 2018 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Compiling Algorithms for Heterogeneous Systems
Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
www.morganclaypool.com
ISBN: 9781627059619 paperback
ISBN: 9781627057301 ebook
ISBN: 9781681732633 hardcover
DOI 10.2200/S00816ED1V01Y201711CAC043
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #43
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
Compiling Algorithms
for Heterogeneous Systems
Steven Bell
Stanford University
Jing Pu
Google
James Hegarty
Oculus
Mark Horowitz
Stanford University
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #43
Morgan & Claypool Publishers
ABSTRACT
Most emerging applications in imaging and machine learning must perform immense amounts
of computation while holding to strict limits on energy and power. To meet these goals, archi-
tects are building increasingly specialized compute engines tailored for these specific tasks. The
resulting computer systems are heterogeneous, containing multiple processing cores with wildly
different execution models. Unfortunately, the cost of producing this specialized hardware—and
the software to control it—is astronomical. Moreover, the task of porting algorithms to these
heterogeneous machines typically requires that the algorithm be partitioned across the machine
and rewritten for each specific architecture, which is time consuming and prone to error.
Over the last several years, the authors have approached this problem using domain-
specific languages (DSLs): high-level programming languages customized for specific domains,
such as database manipulation, machine learning, or image processing. By giving up general-
ity, these languages are able to provide high-level abstractions to the developer while producing
high-performance output. The purpose of this book is to spur the adoption and the creation of
domain-specific languages, especially for the task of creating hardware designs.
In the first chapter, a short historical journey explains the forces driving computer archi-
tecture today. Chapter 2 describes the various methods for producing designs for accelerators,
outlining the push for more abstraction and the tools that enable designers to work at a higher
conceptual level. From there, Chapter 3 provides a brief introduction to image processing al-
gorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and
compare Darkroom and Halide, two domain-specific languages created for image processing
that produce high-performance designs for both FPGAs and CPUs from the same source code,
enabling rapid design cycles and quick porting of algorithms. The final section describes how
the DSL approach also simplifies the problem of interfacing between application code and the
accelerator by generating the driver stack in addition to the accelerator configuration.
This book should serve as a useful introduction to domain-specialized computing for com-
puter architecture students and as a primer on domain-specific languages and image processing
hardware for those with more experience in the field.
KEYWORDS
domain-specific languages, high-level synthesis, compilers, image processing accel-
erators, stencil computation
Contents
Preface
Acknowledgments
1 Introduction
1.1 CMOS Scaling and the Rise of Specialization
1.2 What Will We Build Now?
1.2.1 Performance, Power, and Area
1.2.2 Flexibility
1.3 The Cost of Specialization
1.4 Good Applications for Acceleration
2 Computations and Compilers
2.1 Direct Specification
2.2 Compilers
2.3 High-level Synthesis
2.4 Domain-specific Languages
3 Image Processing with Stencil Pipelines
3.1 Image Signal Processors
3.2 Example Applications
4 Darkroom: A Stencil Language for Image Processing
4.1 Language Description
4.2 A Simple Pipeline in Darkroom
4.3 Optimal Synthesis of Line-buffered Pipelines
4.3.1 Generating Line-buffered Pipelines
4.3.2 Shift Operator
4.3.3 Finding Optimal Shifts
4.4 Implementation
4.4.1 ASIC and FPGA Synthesis
4.4.2 CPU Compilation
4.5 Evaluation
4.5.1 Scheduling for Hardware Synthesis
4.5.2 Scheduling for General-purpose Processors
4.6 Summary
5 Programming CPU/FPGA Systems from Halide
5.1 The Halide Language
5.2 Mapping Halide to Hardware
5.3 Compiler Implementation
5.3.1 Architecture Parameter Extraction
5.3.2 IR Transformation
5.3.3 Loop Perfection Optimization
5.3.4 Code Generation
5.4 Implementation and Evaluation
5.4.1 Programmability and Efficiency
5.4.2 Quality of Hardware Generation
5.5 Conclusion
6 Interfacing with Specialized Hardware
6.1 Common Interfaces
6.2 The Challenge of Interfaces
6.3 Solutions to the Interface Problem
6.3.1 Compiler Support
6.3.2 Library Interface
6.3.3 API plus DSL
6.4 Drivers for Darkroom and Halide on FPGA
6.4.1 Memory and Coherency
6.4.2 Running the Hardware
6.4.3 Generating Systems and Drivers
6.4.4 Generating the Whole Stack with Halide
6.4.5 Heterogeneous System Performance
7 Conclusions and Future Directions
Bibliography
Authors' Biographies
Preface
Cameras are ubiquitous, and computers are increasingly being used to process image data to
produce better images, recognize objects, build representations of the physical world, and extract
salient bits from massive streams of video, among countless other things. But while the data
deluge continues to increase, and while the number of transistors that can be cost-effectively
placed on a silicon die is still going up (for now), limitations on power and energy mean that
traditional CPUs alone are insufficient to meet the demand. As a result, architects are building
more and more specialized compute engines tailored to provide energy and performance gains
on these specific tasks.
Unfortunately, the cost of producing this specialized hardware—and the software to con-
trol it—is astronomical. Moreover, the resulting computer systems are heterogeneous, contain-
ing multiple processing cores with wildly different execution models. The task of porting al-
gorithms to these heterogeneous machines typically requires that the algorithm be partitioned
across the machine and rewritten for each specific architecture, which is time consuming and
prone to error.
Over the last several years, we have approached this problem using domain-specific lan-
guages (DSLs)—high-level programming languages customized for specific domains, such as
database manipulation, machine learning, or image processing. By giving up generality, these
languages are able to provide high-level abstractions to the developer while producing high-
performance output. Our purpose in writing this book is to spur the adoption and the creation
of domain-specific languages, especially for the task of creating hardware designs.
This book is not an exhaustive description of image processing accelerators, nor of domain-
specific languages. Instead, we aim to show why DSLs make sense in light of the current state
of computer architecture and development tools, and to illustrate with some specific examples
what advantages DSLs provide, and what tradeoffs must be made when designing them. Our
examples will come from image processing, and our primary targets are mixed CPU/FPGA
systems, but the underlying techniques and principles apply to other domains and platforms as
well. We assume only passing familiarity with image processing, and focus our discussion on the
architecture and compiler sides of the problem.
In the first chapter, we take a short historical journey to explain the forces driving com-
puter architecture today. Chapter 2 describes the various methods for producing designs for
accelerators, outlining the push for more abstraction and the tools that enable designers to work
at a higher conceptual level. In Chapter 3, we provide a brief introduction to image processing
algorithms and hardware design patterns for implementing them, which we use through the
rest of the book. Chapters 4 and 5 describe Darkroom and Halide, two domain-specific lan-
guages created for image processing. Both are able to produce high-performance designs for
both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick
porting of algorithms. We present both of these examples because comparing and contrasting
them illustrates some of the tradeoffs and design decisions encountered when creating a DSL.
The final portion of the book discusses the task of controlling specialized hardware within a het-
erogeneous system running a multiuser operating system. We give a brief overview of how this
works on Linux and show how DSLs enable us to automatically generate the necessary driver
and interface code, greatly simplifying the creation of that interface.
This book assumes at least some background in computer architecture, such as an advanced
undergraduate or early graduate course in CPU architecture. We also build on ideas from com-
pilers, programming languages, FPGA synthesis, and operating systems, but the book should
be accessible to those without extensive study on these topics.
Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
January 2018
Acknowledgments
Any work of this size is necessarily the result of many collaborations. We are grateful to John
Brunhaver, Zachary DeVito, Pat Hanrahan, Jonathan Ragan-Kelley, Steve Richardson, Jeff Set-
ter, Artem Vasilyev, and Xuan Yang, who influenced our thinking on these topics and helped
develop portions of the systems described in this book. We’re also thankful to Mike Morgan,
Margaret Martonosi, and the team at Morgan & Claypool for shepherding us through the
writing and production process, and to the reviewers whose feedback made this a much bet-
ter manuscript than it would have been otherwise.
Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
January 2018
CHAPTER 1
Introduction
When the International Technology Roadmap for Semiconductors organization announced its
final roadmap in 2016, it was widely heralded as the official end of Moore’s law [ITRS, 2016].
As we write this, 7 nm technology is still projected to provide cheaper transistors than current
technology, so it isn’t over just yet. But after decades of transistor scaling, the ITRS report
revealed at least modest agreement across the industry that cost-effective scaling to 5 nm and
below was hardly a guarantee.
While the death of Moore’s law remains a topic of debate, there isn’t any debate that the
nature and benefit of scaling has decreased dramatically. Since the early 2000s, scaling has not
brought the power reductions it used to provide. As a result, computing devices are limited by
the electrical power they can dissipate, and this limitation has forced designers to find more
energy-efficient computing structures. In the 2000s this power limitation led to the rise of mul-
ticore processing, and is the reason that practically all current computing devices (outside of
embedded systems) contain multiple CPUs on each die. But multiprocessing was not enough to
continue to scale performance, and specialized processors were also added to systems to make
them more energy efficient. GPUs were added for graphics and data-parallel floating point op-
erations, specialized image and video processors were added to handle video, and digital signal
processors were added to handle the processing required for wireless communication.
On one hand, this shift in structure has made computation more energy efficient; on the
other, it has made programming the resulting systems much more complex. The vast major-
ity of algorithms and programming languages were created for an abstract computing machine
running a single thread of control, with access to the entire memory of the machine. Changing
these algorithms and languages to leverage multiple threads is difficult, and mapping them to
use the specialized processors is near impossible. As a result, accelerators only get used when
performance is essential to the application; otherwise, the code is written for CPU and declared
“good enough.” Unless we develop new languages and tools that dramatically simplify the task
of mapping algorithms onto these modern heterogeneous machines, computing performance
will stagnate.
This book describes one approach to address this issue. By restricting the application do-
main, it is possible to create programming languages and compilers that can ease the burden of
creating and mapping applications to specialized computing resources, allowing us to run com-
plete applications on heterogeneous platforms. We will illustrate this with examples from image
processing and computer vision, but the underlying principles extend to other domains.
The rest of this chapter explains the constraints that any solution to this problem must
work within. The next section briefly reviews how computers were initially able to take advantage
of Moore’s law scaling without changing the programming model, why that is no longer the case,
and why energy efficiency is now key to performance scaling. Section 1.2 then shows how to
compare different power-constrained designs to determine which is best. Since performance
and power are tightly coupled, they both need to be considered to make the best decision. Using
these metrics, and some information about the energy and area cost of different operations, this
section also points out the types of algorithms that benefit the most from specialized compute
engines. While these metrics show the potential of specialization, Section 1.3 describes the costs
of this approach, which historically required large teams to design the customized hardware and
develop the software that ran on it. The remaining chapters in this book describe one approach
that addresses these cost issues.
1.1 CMOS SCALING AND THE RISE OF SPECIALIZATION
From the earliest days of electronic computers, improvements in physical technology have con-
tinually driven computer performance. The first few technology changes were discrete jumps,
first from vacuum tubes to bipolar transistors in the 1950s, and then from discrete transistors to
bipolar integrated circuits (ICs) in the 1960s. Once computers were built with ICs, they were
able to take advantage of Moore’s law, the prediction-turned-industry-roadmap which stated
that the number of components that could be economically packed onto an integrated circuit
would double every two years [Moore, 1965].
As MOS transistor technology matured, gates built with MOS transistors used less power
and area than gates built with bipolar transistors, and it became clear in the late 1970s that MOS
technology would dominate. During this time Robert Dennard at IBM Research published his
paper on MOS scaling rules, which showed different approaches that could be taken to scale
MOS transistors [Dennard et al., 1974]. In particular, he observed that if a transistor’s operating
voltage and doping concentration were scaled along with its physical dimensions, then a number
of other properties scaled nicely as well, and the resized transistor would behave predictably.
If a MOS transistor is shrunk by a factor of 1/κ in each linear dimension, and the operating
voltage is lowered by the same 1/κ, then several things follow:
1. Transistors get smaller, allowing κ² more logic gates in the same silicon area.
2. Voltages and currents inside the transistor scale by a factor of 1/κ.
3. The effective resistance of the transistor, V/I, remains constant, due to 2 above.
4. The gate capacitance C shrinks by a factor of 1/κ (1/κ² due to decreased area, multiplied
by κ due to reduced electrode spacing).
The switching time for a logic gate is proportional to the resistance of the driving transistor
multiplied by the capacitance of the driven transistor. If the effective resistance remains constant
while the capacitance decreases by 1/κ, then the overall delay also decreases by 1/κ, and the chip
can be run faster by a factor of κ.
Taken together, these scaling factors mean that κ² more logic gates are switched κ× faster,
for a total increase of κ³ more gate evaluations per second. At the same time, the energy required
to switch a logic gate is proportional to CV². With both capacitance and voltage decreasing by
a factor of 1/κ, the energy per gate evaluation decreased by a factor of 1/κ³.
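In symbols, the same argument can be restated compactly (our summary of the relations above, with κ the linear scaling factor):

\begin{align*}
\text{gate delay} &\propto R\,C \;\rightarrow\; 1 \cdot \tfrac{1}{\kappa} = \tfrac{1}{\kappa}
  &&\Rightarrow\ \text{clock frequency scales by } \kappa \\
\text{gates per unit area} &\propto \kappa^{2}
  &&\Rightarrow\ \text{gate evaluations per second scale by } \kappa^{3} \\
\text{energy per evaluation} &\propto C\,V^{2} \;\rightarrow\; \tfrac{1}{\kappa}\cdot\tfrac{1}{\kappa^{2}} = \tfrac{1}{\kappa^{3}}
  &&\Rightarrow\ \text{power per unit area is unchanged}
\end{align*}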
During this period, roughly every other year, a new technology process yielded transistors
which were about 1/√2 as large in each dimension. Following Dennard scaling, this would give
a chip with twice as many gates and a faster clock by a factor of 1.4, making it 2.8× more
powerful than the previous one. Simultaneously, however, the energy dissipated by each gate
evaluation dropped by 2.8×, meaning that total power required was the same as the previous
chip. This remarkable result allowed each new generation to achieve nearly a 3× improvement
for the same die area and power.
This scaling is great in theory, but what happened in practice is somewhat more circuitous.
First, until the mid-1980s, most complex ICs were made with nMOS rather than CMOS gates,
which dissipate power even when they aren’t switching (known as static power). Second, during
this period power supply voltages remained at 5 V, a standard set in the bipolar IC days. As
a result of both of these, the power per gate did not change much even as transistors scaled
down. As nMOS chips grew more complex, the power dissipation of these chips became a
serious problem. This eventually forced the entire industry to transition from nMOS to CMOS
technology, despite the additional manufacturing complexity and lower intrinsic gate speed of
CMOS.
After transitioning to CMOS ICs in the mid-1980s, power supply voltages began to scale
down, but not exactly in sync with technology. While transistor density and clock speed contin-
ued to scale, the energy per logic gate dropped more slowly. With the number of gate evaluations
per second increasing faster than the energy of gate evaluation was scaling down, the overall chip
power grew exponentially.
This power scaling is exactly what we see when we look at historical data from CMOS
microprocessors, shown in Figure 1.1. From 1980 to 2000, the number of transistors on a chip
increased by about 500× (Figure 1.1a), which corresponds to scaling transistor feature size by
roughly 20×. During this same period of time, processor clock frequency increased by 100×,
which is 5× faster than one would expect from simple gate speed (Figure 1.1b). Most of this ad-
ditional clock speed gain came from microarchitectural changes to create more deeply pipelined
“short tick” machines with fewer gates per cycle, which were enabled by better circuit designs
of key functional units. While these fast clocks were good for performance, they were bad from
a power perspective.
By 2000, computers were executing 50,000× more gate evaluations per second than they
had in the 1980s. During this time the average capacitance had scaled down, providing a 20×
energy savings, but power supply voltages had only scaled by 4–5× (Figure 1.1c), giving roughly
a 25× savings. Taken together the capacitance and supply scaling only reduce the gate energy
by around 500×, which means that the power dissipation of the processors should increase by
two orders of magnitude during this period. Figure 1.1d shows that is exactly what happened.
[Figure 1.1: four log-scale plots vs. year (1970–2020): (a) transistors per chip, (b) CPU frequency, (c) operating voltage, and (d) thermal design power (TDP).]
Figure 1.1: From the 1960s until the early 2000s, transistor density and operating frequency scaled up exponentially, providing exponential performance improvements. Power dissipation increased but was kept in check by lowering the operating voltage. Data from CPUDB [Danowitz et al., 2012].
Up to this point, all of these additional transistors were used for a host of architectural im-
provements that increased performance even further, including pipelined datapaths, superscalar
instruction issue, and out-of-order execution. However, the instruction set architectures (ISAs)
for various processors generally remained the same through multiple hardware revisions, mean-
ing that existing software could run on the newer machine without modification—and reap a
performance improvement.
But around 2004, Dennard scaling broke down. Lowering the gate threshold voltage fur-
ther caused the leakage power to rise unacceptably high, so it began to level out just below 1 V.
Without the possibility to manage the power density by scaling voltage, manufacturers hit
the “power wall” (the red line in Figure 1.1d). Chips such as the Intel Pentium 4 were dissipating
a little over 100 W at peak performance, which is roughly the limit of a traditional package
with a heatsink-and-fan cooling system. Running a CPU at significantly higher power than this
requires an increasingly complex cooling system, both at a system level and within the chip itself.
Pushed up against the power wall, the only choice was to stop increasing the clock fre-
quency and find other ways to increase performance. Although Intel had predicted processor
clock rates over 10 GHz, actual numbers peaked around 4 GHz and settled back between 2 and
4 GHz (Figure 1.1b).
Even though Dennard scaling had stopped, taking down frequency scaling with it,
Moore’s law continued its steady march forward. This left architects with an abundance of tran-
sistors, but the traditional microarchitectural approaches to improving performance had been
mostly mined out. As a result, computer architecture has turned in several new directions to
improve performance without increasing power consumption.
The first major tack was symmetric multicore, which stamped down two (and then four,
and then eight) copies of the CPU on each chip. This has the obvious benefit of delivering more
computational power for the same clock rate. Doubling the core count still doubles the total
power, but if the clock frequency is dialed back, the chip runs at a lower voltage, keeping the
energy constant while maintaining some of the performance advantage of having multiple cores.
This is especially true if the parallel cores are simplified and designed for energy efficiency rather
than single-thread performance. Nonetheless, even simple CPU cores incur significant overhead
to compute their results, and there is a limit to how much efficiency can be achieved simply by
making more copies.
The next theme was to build processors to exploit regularity in certain applications, lead-
ing to the rise of single-instruction-multiple-data (SIMD) instruction sets and general-purpose
GPU computing (GPGPU). These go further than symmetric multicore in that they amortize
the instruction fetch and decode steps across many hardware units, taking advantage of data
parallelism. Neither SIMD nor GPUs were new; SIMD had existed for decades as a staple
of supercomputer architectures and made its way into desktop processors for multimedia ap-
plications along with GPUs in the late 1990s. But in the mid-2000s, they started to become
prominent as a way to accelerate traditional compute-intensive applications.
A third major tack in architecture was the proliferation of specialized accelerators, which
go even further in stripping out control flow and optimizing data movement for particular appli-
cations. This trend was hastened by the widespread migration to mobile devices and “the cloud,”
where power is paramount and typical use is dominated by a handful of tasks. A modern smart-
phone System-on-chip (SoC) contains more than a dozen custom compute engines, created
specifically to perform intensive tasks that would be impossible to run in real time on the main
CPU. For example, communicating over WiFi and cellular networks requires complex coding
and modulation/demodulation, which is performed on a small collection of hardware units spe-
cialized for these signal processing tasks. Likewise, decoding or encoding video—whether for
watching Netflix, video chatting, or camera filming—is handled by hardware blocks that only
perform this specific task. And the process of capturing raw pixels and turning them into a
pleasing (or at least presentable) image is performed by a long pipeline of hardware units that
demosaic, color balance, denoise, sharpen, and gamma-correct the image.
Even low-intensity tasks are getting accelerators. For example, playing music from an
MP3 file requires relatively little computational work, but the CPU must wake up a few dozen
times per second to fill a buffer with sound samples. For power efficiency, it may be better to
have a dedicated chip (or accelerator within the SoC, decoupled from the CPU) that just handles
audio.
While there remain some performance gains still to be squeezed out of thread and data
parallelism by incrementally advancing CPU and GPU architectures, they cannot close the gap
to a fully customized ASIC. The reason, as we’ve already hinted, comes down to power.
Cell phones are power-limited both by their battery capacity (roughly 8–12 Wh) and the
amount of heat it is acceptable to dissipate in the user’s hand (around 2 W). The datacenter is
the same story at a different scale. A warehouse-sized datacenter consumes tens of megawatts,
requiring a dedicated substation and a cheap source of electrical power. And like phones, data
center performance is constrained partly by the limits of our ability to get heat out, as evidenced
by recent experiments and plans to build datacenters in caves or in frigid parts of the ocean.
Thus, in today’s power-constrained computing environment, the formula for improvement is
simple: performance per watt is performance.
Only specialized architectures can optimize the data storage and movement to achieve the
energy reduction we want. As we will discuss in Section 1.4, specialized accelerators are able to
eliminate the overhead of instructions by “baking” them into the computation hardware itself.
They also eliminate waste for data movement by designing the storage to match the algorithm.
Of course, general-purpose processors are still necessary for most code, and so modern
systems are increasingly heterogeneous. As mentioned earlier, SoCs for mobile devices contain
dozens of processors and specialized hardware units, and datacenters are increasingly adding
GPUs, FPGAs, and ASIC accelerators [AWS, 2017, Norman P. Jouppi et al., 2017].
In the remainder of this chapter, we’ll describe the metrics that characterize a “good”
accelerator and explain how these factors will determine the kind of systems we will build in the
future. Then we lay out the challenges to specialization and describe the kinds of applications
for which we can expect accelerators to be most effective.
1.2 WHAT WILL WE BUILD NOW?
Given that specialized accelerators are—and will continue to be—an important part of computer
architecture for the foreseeable future, the question arises: What makes a good accelerator? Or
said another way, if I have a potential set of designs, how do I choose what to add to my SoC
or datacenter, if anything?
1.2.1 PERFORMANCE, POWER, AND AREA
On the surface, the good things we want are obvious. We want high performance, low power,
and low cost.
Raw performance—the speed at which a device is able to perform a computation—is
the most obvious measure of “good-ness.” Consumers will throw down cash for faster devices,
whether that performance means quicker web page loads or richer graphics. Unfortunately, this
isn’t easy to quantify with the most commonly advertised metrics.
Clock speed matters, but we also need to account for how much work is done on each
clock cycle. Multiplying clock speed by the number of instructions issued per cycle is better, but
still ignores the fact that some instructions might do much more work than others. And on top
of this, we have the fact that utilization is rarely 100% and depends heavily on the architecture
and application.
We can quantify performance in a device-independent way by counting the number of
essential operations performed per unit time. For the purposes of this metric, we define “essen-
tial operations” to include only the operations that form the actual result of the computation.
Most devices require a great deal of non-essential computation, such as decoding instructions or
loading and storing intermediate data. These are “non-essential” not because they are pointless
or unnecessary but because they are not intrinsically required to perform the computation. They
are simply overhead incurred by the specific architecture.
With this definition, adding two pieces of data to produce an intermediate result is an
essential operation, but incrementing a loop counter is not since the latter is required by the
implementation and not the computation itself.
To make things concrete, a 3 × 3 convolution on a single-channel image requires nine mul-
tiplications (multiplying 3 × 3 pixels by their corresponding weights) and eight 2-input additions
per output pixel. For a 640 × 480 image (307,200 pixels), this is a little more than 5.2 million
total operations.
A CPU implementation requires many more instructions than this to compute the result
since the instruction stream includes conditional branches, loop index computations, and so
forth. On the flip side, some implementations might require fewer instructions than operations,
if they process multiple pieces of data on each instruction or have complex instructions that
fuse multiple operations. But implementations across this whole spectrum can be compared if
we calculate everything in terms of device-independent operations, rather than device-specific
instructions.
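As a small illustration of this accounting (our own sketch, not code from the book; the image size, kernel size, and runtime are made-up inputs), the following counts essential operations for a K × K convolution and converts a measured runtime into device-independent operations per second:

#include <cstdint>
#include <cstdio>

// Essential operations for a KxK convolution over a WxH single-channel image:
// K*K multiplies plus (K*K - 1) two-input adds per output pixel. Loop counters,
// address arithmetic, and loads/stores are architectural overhead and not counted.
uint64_t essential_ops(uint64_t width, uint64_t height, uint64_t k) {
    uint64_t per_pixel = k * k + (k * k - 1);
    return width * height * per_pixel;
}

int main() {
    uint64_t ops = essential_ops(640, 480, 3);   // 5,222,400 ops, i.e., ~5.2 million
    double runtime_s = 2.0e-3;                   // hypothetical measured runtime
    printf("essential ops: %llu\n", (unsigned long long)ops);
    printf("throughput:    %.2f Gops/s\n", ops / runtime_s / 1e9);
    return 0;
}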
The second metric is power consumption, measured in Watts. In a datacenter context,
the power consumption is directly related to the operating cost, and thus to the total cost of
ownership (TCO). In a mobile device, power consumption determines how long the battery will
last (or how large a battery is necessary for the device to survive all day). Power consumption also
determines the maximum computational load that can be sustained without causing the device
to overheat and throttle back.
The third metric is cost. We’ll discuss development costs further in the following section,
but for now it is sufficient to observe that the production cost of the final product is closely related
to the silicon area of the chip, typically measured in square millimeters (mm²). More chips of a
smaller design will fit on a fixed-size wafer, and smaller chips are likely to have somewhat higher
yield percentages, both of which reduce the manufacturing cost.
However, as important as performance, power, and silicon area are as metrics, they can’t
be used directly to compare designs, because it is relatively straightforward to trade one for the
other.
Running a chip at a higher operating voltage causes its transistors to switch more rapidly,
allowing us to increase the clock frequency and get increased performance, at the cost of in-
creased power consumption. Conversely, lowering the operating voltage along with the clock
frequency saves energy, at the cost of lower performance. (Modern CPUs do this scaling on the fly to match their performance to the ever-changing CPU load, a technique known as “Dynamic Voltage and Frequency Scaling,” or DVFS.)
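A standard first-order model (a general CMOS relation, not a figure from this book) makes this voltage–frequency tradeoff explicit: dynamic switching power is

\[ P_{\text{dyn}} \approx \alpha\, C\, V^{2} f, \]

where α is the switching activity, C the switched capacitance, V the supply voltage, and f the clock frequency. Because the maximum usable f grows roughly in proportion to V, performance rises about linearly with V while power rises roughly as V³, which is why lowering voltage and frequency together gives back energy per operation quadratically.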
It isn’t fair to compare the raw performance of a desktop Intel Core i7 to an ARM phone
SoC, if for no other reason than that the desktop processor has a 20–50× power advantage.
Instead, it is more appropriate to divide the power (Joules per second) by the performance (op-
erations per second) to get the average energy used per computation (Joules per operation).
Throughout the rest of this book, we’ll refer to this as “energy per operation” or pJ/op. We
could equivalently think about maximizing the inverse: operations/Joule.
For a battery-powered device, energy per operation relates directly to the amount of com-
putation that can be performed with a single battery charge; for anything plugged into the wall,
this relates the amount of useful computation that was done with the money you paid to the
electric company.
A similar difficulty is related to the area metric. For applications with sufficient parallelism,
we can double performance simply by stamping down two copies of the same processor on a chip.
This benefit requires no increase in clock speed or operating voltage—only more silicon. This
was, of course, the basic impetus behind going to multicore computation.
Even further, it is possible to lower the voltage and clock frequency of the two cores,
trading performance for energy efficiency as described earlier. As a result, it is possible to improve
either power or performance by increasing silicon area as long as there is enough parallelism.
Thus, when comparing between architectures for highly parallel applications, it is helpful to
normalize performance by the silicon area used. This gives us operations/Joule divided by area,
or ops/(mm²·J).
These two compound metrics, pJ/operation and ops/(mm²·J), give us meaningful ways to com-
pare and evaluate vastly different architectures. However, it isn’t sufficient to simply minimize
these in the abstract; we must consider the overall system and application workload.
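As a worked example of these metrics (a minimal sketch with invented numbers, not measurements from the book), the snippet below converts throughput, power, and area figures into pJ/operation and ops/(mm²·J):

#include <cstdio>

struct Design {
    const char* name;
    double ops_per_sec;  // device-independent operations per second
    double watts;        // average power while running the workload
    double area_mm2;     // silicon area of the compute engine
};

int main() {
    // Hypothetical data points, for illustration only.
    Design designs[] = {
        {"desktop CPU",    5.0e9,  65.0, 150.0},
        {"mobile SoC ISP", 2.0e10,  0.5,   5.0},
    };
    for (const Design& d : designs) {
        double pj_per_op = d.watts / d.ops_per_sec * 1e12;  // J/op converted to pJ/op
        double ops_per_j = d.ops_per_sec / d.watts;         // operations per Joule
        double ops_mm2_j = ops_per_j / d.area_mm2;          // ops/(mm^2 * J)
        printf("%-15s %10.1f pJ/op   %.3e ops/(mm^2*J)\n", d.name, pj_per_op, ops_mm2_j);
    }
    return 0;
}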
1.2.2 FLEXIBILITY
Engineers building a system are concerned with a particular application, or perhaps a collection
of applications, and the metrics discussed are only helpful insofar as they represent performance
on the applications of interest. If a specialized hardware module cannot run our problem, its
energy and area efficiency are irrelevant. Likewise, if a module can only accelerate parts of the
application, or only some applications out of a larger suite, then its benefit is capped by Ahm-
dahl’s law. As a result, we have a flexibility tradeoff: more flexible devices allow us to accelerate
computation that would otherwise remain on the CPU, but increased flexibility often means
reduced efficiency.
Suppose a hypothetical fixed-function device can accelerate 50% of a computation by a
factor of 100, reducing the total computation time from 1 second to 0.505 seconds. If adding
some flexibility to the device drops the performance to only 10× but allows us to accelerate 70%
of the computation, we will now complete the computation in 0.37 seconds—a clear win.
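These runtimes follow directly from Amdahl's law (our restatement of the arithmetic above). If a fraction f of the work is accelerated by a factor s, a task that originally took t₀ now takes

\[ t = t_{0}\left[(1 - f) + \frac{f}{s}\right], \]

so 1 s × (0.5 + 0.5/100) = 0.505 s for the fixed-function case and 1 s × (0.3 + 0.7/10) = 0.37 s for the more flexible one.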
Moreover, many applications demand flexibility, whether the product is a networking de-
vice that needs to support new protocols or an augmented-reality headset that must incorporate
the latest advances in computer vision. As more and more devices are connected to the internet,
consumers increasingly expect that features can be upgraded and bugs can be fixed via over-the-
air updates. In this market, a fixed-function device that cannot support rapid iteration during
prototyping and cannot be reconfigured once deployed is a major liability.
The tradeoff is that flexibility isn’t free, as we have already alluded to. It almost always hurts
efficiency (performance per watt or ops/(mm²·J)) since overhead is spent processing the configuration.
Figure 1.2 illustrates this by comparing the performance and efficiency for a range of designs
proposed at ISSCC a number of years ago. While newer semiconductor processes have reduced
energy across the board, the same trend holds: the most flexible devices (CPUs) are the least
efficient, and increasing specialization also increases performance, by as much as three orders of
magnitude.
In certain domains, this tension has created something of a paradox: applications that were
traditionally performed completely in hardware are moving toward software implementations,
even while competing forces push related applications away from software toward hardware. For
example, the fundamental premise of software defined radio (SDR) is that moving much (or all)
of the signal processing for a radio from hardware to software makes it possible to build a system
that is simpler, cheaper, and more flexible. With only a minimal analog front-end, an SDR
system can easily run numerous different coding and demodulation schemes, and be upgraded
[Figure 1.2: energy efficiency (MOPS/mW, log scale) for designs grouped as microprocessors, general-purpose DSPs, and dedicated hardware.]
Figure 1.2: Comparison of efficiency for a number of designs from ISSCC, showing the clear tradeoff between flexibility and efficiency. Designs are sorted by efficiency and grouped by overall design paradigm. Figure from Marković and Brodersen [2012].
over the air. But because real-time signal processing requires extremely high computation rates,
many SDR platforms use an FPGA, and carefully optimized libraries have been written to fully
exploit the SIMD and digital signal processing (DSP) hardware in common SoCs. Likewise,
software-defined networking aims to provide software-based reconfigurability to networks, but
at the same time more and more effort is being poured into custom networking chips.
1.3 THE COST OF SPECIALIZATION
To fit these metrics together, we must consider one more factor: cost. After all, given the enor-
mous benefits of specialization, the only thing preventing us from making a specialized acceler-
ator for everything is the expense.
Figure 1.3 compares the non-recurring engineering (NRE) cost of building a new high-
end SoC on the past few silicon process nodes. The price tags for the most recent technologies
are now well out of reach for all but the largest companies. Most ASICs are less expensive than
this, by virtue of being less complex, using purchased or existing IP, having lower performance
targets, and being produced on older and mature processes [Khazraee et al., 2017]. Yet these
costs still run into the millions of dollars and remain risky undertakings for many businesses.
Several components contribute to this cost. The most obvious is the price of the lithogra-
phy masks and tooling setup, which has been driven up by the increasingly high precision of each
process node. Likewise, these processes have ever-more-stringent design rules, which require
more engineering effort during the place and route process and in verification. The exponen-
tial increase in number of transistors has enabled a corresponding growth in design complexity,
which comes with increased development expense. Some of these additional transistors are used
[Figure 1.3: stacked bars of design cost (million USD) broken into software, physical, verification, architecture, IP, prototype, and validation, for process nodes from 65 nm (2006) through 10 nm (2017), 7 nm, and 5 nm.]
Figure 1.3: Estimated cost breakdown to build a large SoC. The overall cost is increasing exponentially, and software comprises nearly half of the total cost. (Data from International Business Strategies [IBS, 2017].)
in ways that do not appreciably increase the design complexity, such as additional copies of pro-
cessor cores or larger caches. But while the exact slope of the correlation is debatable, the trend
is clear: More transistors means more complexity, and therefore higher design costs. Moreover,
with increased complexity comes increased costs for testing and verification.
Last, but particularly relevant to this book, is the cost of developing software to run the
chip, which in the IBS estimates accounts for roughly 40% of the total cost. The accelerator
must be configured, whether with microcode, a set of registers, or something else, and it must
be interfaced with the software running on the rest of the system. Even the most rigid of “fixed”
devices usually have some degree of configurability, such as the ability to set an operating mode
or to control specific parameters or coefficients.
This by itself is unremarkable, except that all of these “configurations” are tied to a pro-
gramming model very different than the idealized CPU that most developers are used to. Timing
details become crucial, instructions execute out of order or in a massively parallel fashion, and
concurrency and synchronization are handled with device-specific primitives. Accelerators are,
almost by definition, difficult to program.
To state the obvious, the more configurable a device is, the more effort must go into con-
figuring it. In highly configurable accelerators such as GPUs or FPGAs, it is quite easy—even
typical—to produce configurations that do not perform well. Entire job descriptions revolve
around being able to work the magic to create high-performance configurations for accelera-
tors. These people, informally known as “the FPGA wizards” or “GPU gurus,” have an intimate
knowledge of the device hardware and carry a large toolbox of techniques for optimizing appli-
cations. They also have excellent job security.
This difficulty is exacerbated by a lack of tools. Specialized accelerators need specialized
tools, often including a compiler toolchain, debugger, and perhaps even an operating system.
This is not a problem in the CPU space: there are only a handful of competitive CPU archi-
tectures, and many groups are developing tools, both commercial and open source. Intel is but
one of many groups with an x86 C++ compiler, and the same is true for ARM. But specialized
accelerators are not as widespread, and making tools for them is less profitable. Unsurprisingly,
NVIDIA remains the primary source of compilers, debuggers, and development tools for their
GPUs. This software design effort cannot easily be pushed onto third-party companies or the
open-source community, and becomes part of the chip development cost.
As we stand today, bringing a new piece of silicon to market is as much about writing
software as it is designing logic. It isn’t sufficient to just “write a driver” for the hardware; what
is needed is an effective bridge to application-level code.
Ultimately, companies will only create and use accelerators if the improvement justifies
the expense. That is, an accelerator is only worthwhile if the engineering cost can be recouped
by savings in the operating cost, or if the accelerator enables an application that was previously
impossible. The operating cost is closely tied to the efficiency of the computing system, both in
terms of the number of units necessary (buying a dozen CPUs vs. a single customized accelerator)
and in terms of time and electricity. Because it is almost always easier to implement an algorithm
on a more flexible device, this cost optimization results in a tug-of-war between performance
and flexibility, illustrated in Figure 1.4.
This is particularly true for low-volume products, where the NRE cost dominates the
overall expense. In such cases, the cheapest solution—rather than the most efficient—might be
the best. Often, the most cost-effective solution to speed up an application is to buy a more
powerful computer (or a whole rack of computers!) and run the same horribly inefficient code
on it. This is why an enormous amount of code, even deployed production code, is written in
languages like Python and Matlab, which have poor runtime performance but terrific developer
productivity.
Our goal is to reduce the cost of developing accelerators and of mapping emerging applica-
tions onto heterogeneous systems, pushing down the NRE of the high-cost/high-performance
[Figure 1.4: operating cost vs. engineering cost, with CPU, optimized CPU, GPU, FPGA, and ASIC arranged from low engineering cost/high operating cost to high engineering cost/low operating cost.]
Figure 1.4: Tradeoff of operating cost (which is inversely related to runtime performance) vs. non-recurring engineering cost (which is inversely related to flexibility). More flexible devices (CPUs and GPUs) require less development effort but achieve worse performance compared to FPGAs and ASICs. We aim to reduce the engineering development cost (red arrows), making it more feasible to adopt specialized computing.
areas of this tradeoff space. Unless we do so, it will remain more cost effective to use general-
purpose systems, and computer performance in many areas will suffer.
1.4 GOOD APPLICATIONS FOR ACCELERATION
Before we launch into systems for programming accelerators, we’ll examine which applications
can be accelerated most effectively. Can all applications be accelerated with specialized proces-
sors, or just some of them?
The short answer is that only a few types of applications are worth accelerating. To see
why, we have to go back to the fundamentals of power and energy. Given that, for a modern chip,
performance per watt is equivalent to performance, we want to minimize the energy consumed
per unit of computation. That is, if the way to maximize operations per second is to maximize
operations per second per watt, we can cancel “seconds,” and simply maximize operations per
Joule.
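Written out with W = J/s, the cancellation is (a restatement of the argument above, not an additional claim):

\[
\frac{\text{ops}}{\text{s}} \;=\; \frac{\text{ops}}{\text{s}\cdot\text{W}} \times \text{W},
\qquad
\frac{\text{ops}}{\text{s}\cdot\text{W}} \;=\; \frac{\text{ops}}{\text{s}\cdot(\text{J}/\text{s})} \;=\; \frac{\text{ops}}{\text{J}},
\]

so under a fixed power budget W, maximizing operations per second is exactly maximizing operations per joule.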
Table 1.1 shows the energy required for a handful of fundamental operations in a 45 nm
process. The numbers are smaller for more recent process nodes, but the relative scale remains
essentially the same.
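As a preview of the ratios quoted in the next paragraph, here is a back-of-the-envelope check using rough, representative 45 nm energy figures. These values are assumptions chosen to be consistent with the ratios stated in the text; they are not the exact entries of Table 1.1.

# Rough, representative 45 nm energy costs in picojoules (assumed values,
# not the exact Table 1.1 entries).
ENERGY_PJ = {
    "8-bit add":       0.03,
    "32-bit multiply": 3.0,
    "DRAM fetch":      1500.0,   # roughly 1.5 nJ per access
}

dram = ENERGY_PJ["DRAM fetch"]
print(f"DRAM fetch vs. 32-bit multiply: {dram / ENERGY_PJ['32-bit multiply']:.0f}x")  # ~500x
print(f"DRAM fetch vs. 8-bit add:       {dram / ENERGY_PJ['8-bit add']:.0f}x")        # ~50,000x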
The crucial observation here is that a DRAM fetch requires 500× more energy than a 32-bit multiplication, and 50,000× more than an 8-bit addition. The cost of fetching data from
memory completely dwarfs the cost of computing with it. The cache hierarchy helps, of course,
Instant download Compiling Algorithms for Heterogeneous Systems Steven Bell pdf all chapter

  • 1. Experience Seamless Full Ebook Downloads for Every Genre at textbookfull.com Compiling Algorithms for Heterogeneous Systems Steven Bell https://textbookfull.com/product/compiling-algorithms-for- heterogeneous-systems-steven-bell/ OR CLICK BUTTON DOWNLOAD NOW Explore and download more ebook at https://textbookfull.com
  • 2. Recommended digital products (PDF, EPUB, MOBI) that you can download immediately if you are interested. Colloidal Nanoparticles for Heterogeneous Catalysis Priscila Destro https://textbookfull.com/product/colloidal-nanoparticles-for- heterogeneous-catalysis-priscila-destro/ textboxfull.com Radio Systems Engineering Steven W. Ellingson https://textbookfull.com/product/radio-systems-engineering-steven-w- ellingson/ textboxfull.com Intelligent Algorithms for Analysis and Control of Dynamical Systems Rajesh Kumar https://textbookfull.com/product/intelligent-algorithms-for-analysis- and-control-of-dynamical-systems-rajesh-kumar/ textboxfull.com International environmental risk management: a systems approach Second Edition Bell https://textbookfull.com/product/international-environmental-risk- management-a-systems-approach-second-edition-bell/ textboxfull.com
  • 3. Data Parallel C++ Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL 1st Edition James Reinders https://textbookfull.com/product/data-parallel-c-mastering-dpc-for- programming-of-heterogeneous-systems-using-c-and-sycl-1st-edition- james-reinders/ textboxfull.com Tools and Algorithms for the Construction and Analysis of Systems Dirk Beyer https://textbookfull.com/product/tools-and-algorithms-for-the- construction-and-analysis-of-systems-dirk-beyer/ textboxfull.com Tools and Algorithms for the Construction and Analysis of Systems Dirk Beyer https://textbookfull.com/product/tools-and-algorithms-for-the- construction-and-analysis-of-systems-dirk-beyer-2/ textboxfull.com How to Draw Manga Volume 1 Compiling Characters Society For The Study Of Manga Techniques https://textbookfull.com/product/how-to-draw-manga-volume-1-compiling- characters-society-for-the-study-of-manga-techniques/ textboxfull.com Smart Electronic Systems Heterogeneous Integration of Silicon and Printed Electronics Li-Rong Zheng https://textbookfull.com/product/smart-electronic-systems- heterogeneous-integration-of-silicon-and-printed-electronics-li-rong- zheng/ textboxfull.com
  • 7. Synthesis Lectures on Computer Architecture Editor Margaret Martonosi, Princeton University Founding Editor Emeritus Mark D. Hill, University of Wisconsin, Madison Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals. The scope will largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA, MICRO, and ASPLOS. Compiling Algorithms for Heterogeneous Systems Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz 2018 Architectural and Operating System Support for Virtual Memory Abhishek Bhattacharjee and Daniel Lustig 2017 Deep Learning for Computer Architects Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks 2017 On-Chip Networks, Second Edition Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh 2017 Space-Time Computing with Temporal Neural Networks James E. Smith 2017 Hardware and Software Support for Virtualization Edouard Bugnion, Jason Nieh, and Dan Tsafrir 2017
  • 8. iv Datacenter Design and Management: A Computer Architect’s Perspective Benjamin C. Lee 2016 A Primer on Compression in the Memory Hierarchy Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood 2015 Research Infrastructures for Hardware Accelerators Yakun Sophia Shao and David Brooks 2015 Analyzing Analytics Rajesh Bordawekar, Bob Blainey, and Ruchir Puri 2015 Customizable Computing Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao 2015 Die-stacking Architecture Yuan Xie and Jishen Zhao 2015 Single-Instruction Multiple-Data Execution Christopher J. Hughes 2015 Power-Efficient Computer Architectures: Recent Advances Magnus Själander, Margaret Martonosi, and Stefanos Kaxiras 2014 FPGA-Accelerated Simulation of Computer Systems Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe 2014 A Primer on Hardware Prefetching Babak Falsafi and Thomas F. Wenisch 2014 On-Chip Photonic Interconnects: A Computer Architect’s Perspective Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella 2013
  • 9. v Optimization and Mathematical Modeling in Computer Architecture Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and David Wood 2013 Security Basics for Computer Architects Ruby B. Lee 2013 The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle 2013 Shared-Memory Synchronization Michael L. Scott 2013 Resilient Architecture Design for Voltage Variation Vijay Janapa Reddi and Meeta Sharma Gupta 2013 Multithreading Architecture Mario Nemirovsky and Dean M. Tullsen 2013 Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU) Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu 2012 Automatic Parallelization: An Overview of Fundamental Compiler Techniques Samuel P. Midkiff 2012 Phase Change Memory: From Devices to Systems Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran 2011 Multi-Core Cache Hierarchies Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar 2011 A Primer on Memory Consistency and Cache Coherence Daniel J. Sorin, Mark D. Hill, and David A. Wood 2011
  • 10. vi Dynamic Binary Modification: Tools, Techniques, and Applications Kim Hazelwood 2011 Quantum Computing for Computer Architects, Second Edition Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong 2011 High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities Dennis Abts and John Kim 2011 Processor Microarchitecture: An Implementation Perspective Antonio González, Fernando Latorre, and Grigorios Magklis 2010 Transactional Memory, 2nd edition Tim Harris, James Larus, and Ravi Rajwar 2010 Computer Architecture Performance Evaluation Methods Lieven Eeckhout 2010 Introduction to Reconfigurable Supercomputing Marco Lanzagorta, Stephen Bique, and Robert Rosenberg 2009 On-Chip Networks Natalie Enright Jerger and Li-Shiuan Peh 2009 The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It Bruce Jacob 2009 Fault Tolerant Computer Architecture Daniel J. Sorin 2009 The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines Luiz André Barroso and Urs Hölzle 2009
  • 11. vii Computer Architecture Techniques for Power-Efficiency Stefanos Kaxiras and Margaret Martonosi 2008 Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency Kunle Olukotun, Lance Hammond, and James Laudon 2007 Transactional Memory James R. Larus and Ravi Rajwar 2006 Quantum Computing for Computer Architects Tzvetan S. Metodi and Frederic T. Chong 2006
  • 12. Copyright © 2018 by Morgan & Claypool All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher. Compiling Algorithms for Heterogeneous Systems Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz www.morganclaypool.com ISBN: 9781627059619 paperback ISBN: 9781627057301 ebook ISBN: 9781681732633 hardcover DOI 10.2200/S00816ED1V01Y201711CAC043 A Publication in the Morgan & Claypool Publishers series SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE Lecture #43 Series Editor: Margaret Martonosi, Princeton University Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison Series ISSN Print 1935-3235 Electronic 1935-3243
  • 13. Compiling Algorithms for Heterogeneous Systems Steven Bell Stanford University Jing Pu Google James Hegarty Oculus Mark Horowitz Stanford University SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #43 C M & cLaypool Morgan publishers &
  • 14. ABSTRACT Most emerging applications in imaging and machine learning must perform immense amounts of computation while holding to strict limits on energy and power. To meet these goals, archi- tects are building increasingly specialized compute engines tailored for these specific tasks. The resulting computer systems are heterogeneous, containing multiple processing cores with wildly different execution models. Unfortunately, the cost of producing this specialized hardware—and the software to control it—is astronomical. Moreover, the task of porting algorithms to these heterogeneous machines typically requires that the algorithm be partitioned across the machine and rewritten for each specific architecture, which is time consuming and prone to error. Over the last several years, the authors have approached this problem using domain- specific languages (DSLs): high-level programming languages customized for specific domains, such as database manipulation, machine learning, or image processing. By giving up general- ity, these languages are able to provide high-level abstractions to the developer while producing high-performance output. The purpose of this book is to spur the adoption and the creation of domain-specific languages, especially for the task of creating hardware designs. In the first chapter, a short historical journey explains the forces driving computer archi- tecture today. Chapter 2 describes the various methods for producing designs for accelerators, outlining the push for more abstraction and the tools that enable designers to work at a higher conceptual level. From there, Chapter 3 provides a brief introduction to image processing al- gorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and compare Darkroom and Halide, two domain-specific languages created for image processing that produce high-performance designs for both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick porting of algorithms. The final section describes how the DSL approach also simplifies the problem of interfacing between application code and the accelerator by generating the driver stack in addition to the accelerator configuration. This book should serve as a useful introduction to domain-specialized computing for com- puter architecture students and as a primer on domain-specific languages and image processing hardware for those with more experience in the field. KEYWORDS domain-specific languages, high-level synthesis, compilers, image processing accel- erators, stencil computation
Contents

Preface
Acknowledgments
1 Introduction
  1.1 CMOS Scaling and the Rise of Specialization
  1.2 What Will We Build Now?
    1.2.1 Performance, Power, and Area
    1.2.2 Flexibility
  1.3 The Cost of Specialization
  1.4 Good Applications for Acceleration
2 Computations and Compilers
  2.1 Direct Specification
  2.2 Compilers
  2.3 High-level Synthesis
  2.4 Domain-specific Languages
3 Image Processing with Stencil Pipelines
  3.1 Image Signal Processors
  3.2 Example Applications
4 Darkroom: A Stencil Language for Image Processing
  4.1 Language Description
  4.2 A Simple Pipeline in Darkroom
  4.3 Optimal Synthesis of Line-buffered Pipelines
    4.3.1 Generating Line-buffered Pipelines
    4.3.2 Shift Operator
    4.3.3 Finding Optimal Shifts
  4.4 Implementation
    4.4.1 ASIC and FPGA Synthesis
    4.4.2 CPU Compilation
  4.5 Evaluation
    4.5.1 Scheduling for Hardware Synthesis
    4.5.2 Scheduling for General-purpose Processors
  4.6 Summary
5 Programming CPU/FPGA Systems from Halide
  5.1 The Halide Language
  5.2 Mapping Halide to Hardware
  5.3 Compiler Implementation
    5.3.1 Architecture Parameter Extraction
    5.3.2 IR Transformation
    5.3.3 Loop Perfection Optimization
    5.3.4 Code Generation
  5.4 Implementation and Evaluation
    5.4.1 Programmability and Efficiency
    5.4.2 Quality of Hardware Generation
  5.5 Conclusion
6 Interfacing with Specialized Hardware
  6.1 Common Interfaces
  6.2 The Challenge of Interfaces
  6.3 Solutions to the Interface Problem
    6.3.1 Compiler Support
    6.3.2 Library Interface
    6.3.3 API plus DSL
  6.4 Drivers for Darkroom and Halide on FPGA
    6.4.1 Memory and Coherency
    6.4.2 Running the Hardware
    6.4.3 Generating Systems and Drivers
    6.4.4 Generating the Whole Stack with Halide
    6.4.5 Heterogeneous System Performance
7 Conclusions and Future Directions
Bibliography
Authors' Biographies
Preface

Cameras are ubiquitous, and computers are increasingly being used to process image data to produce better images, recognize objects, build representations of the physical world, and extract salient bits from massive streams of video, among countless other things. But while the data deluge continues to increase, and while the number of transistors that can be cost-effectively placed on a silicon die is still going up (for now), limitations on power and energy mean that traditional CPUs alone are insufficient to meet the demand. As a result, architects are building more and more specialized compute engines tailored to provide energy and performance gains on these specific tasks.

Unfortunately, the cost of producing this specialized hardware—and the software to control it—is astronomical. Moreover, the resulting computer systems are heterogeneous, containing multiple processing cores with wildly different execution models. The task of porting algorithms to these heterogeneous machines typically requires that the algorithm be partitioned across the machine and rewritten for each specific architecture, which is time consuming and prone to error.

Over the last several years, we have approached this problem using domain-specific languages (DSLs)—high-level programming languages customized for specific domains, such as database manipulation, machine learning, or image processing. By giving up generality, these languages are able to provide high-level abstractions to the developer while producing high-performance output. Our purpose in writing this book is to spur the adoption and the creation of domain-specific languages, especially for the task of creating hardware designs.

This book is not an exhaustive description of image processing accelerators, nor of domain-specific languages. Instead, we aim to show why DSLs make sense in light of the current state of computer architecture and development tools, and to illustrate with some specific examples what advantages DSLs provide, and what tradeoffs must be made when designing them. Our examples will come from image processing, and our primary targets are mixed CPU/FPGA systems, but the underlying techniques and principles apply to other domains and platforms as well. We assume only passing familiarity with image processing, and focus our discussion on the architecture and compiler sides of the problem.

In the first chapter, we take a short historical journey to explain the forces driving computer architecture today. Chapter 2 describes the various methods for producing designs for accelerators, outlining the push for more abstraction and the tools that enable designers to work at a higher conceptual level. In Chapter 3, we provide a brief introduction to image processing algorithms and hardware design patterns for implementing them, which we use through the rest of the book. Chapters 4 and 5 describe Darkroom and Halide, two domain-specific languages created for image processing. Both are able to produce high-performance designs for both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick porting of algorithms. We present both of these examples because comparing and contrasting them illustrates some of the tradeoffs and design decisions encountered when creating a DSL. The final portion of the book discusses the task of controlling specialized hardware within a heterogeneous system running a multiuser operating system. We give a brief overview of how this works on Linux and show how DSLs enable us to automatically generate the necessary driver and interface code, greatly simplifying the creation of that interface.

This book assumes at least some background in computer architecture, such as an advanced undergraduate or early graduate course in CPU architecture. We also build on ideas from compilers, programming languages, FPGA synthesis, and operating systems, but the book should be accessible to those without extensive study on these topics.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
January 2018

Acknowledgments

Any work of this size is necessarily the result of many collaborations. We are grateful to John Brunhaver, Zachary DeVito, Pat Hanrahan, Jonathan Ragan-Kelley, Steve Richardson, Jeff Setter, Artem Vasilyev, and Xuan Yang, who influenced our thinking on these topics and helped develop portions of the systems described in this book. We're also thankful to Mike Morgan, Margaret Martonosi, and the team at Morgan & Claypool for shepherding us through the writing and production process, and to the reviewers whose feedback made this a much better manuscript than it would have been otherwise.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
January 2018
CHAPTER 1
Introduction

When the International Technology Roadmap for Semiconductors organization announced its final roadmap in 2016, it was widely heralded as the official end of Moore's law [ITRS, 2016]. As we write this, 7 nm technology is still projected to provide cheaper transistors than current technology, so it isn't over just yet. But after decades of transistor scaling, the ITRS report revealed at least modest agreement across the industry that cost-effective scaling to 5 nm and below was hardly a guarantee.

While the death of Moore's law remains a topic of debate, there isn't any debate that the nature and benefit of scaling has decreased dramatically. Since the early 2000s, scaling has not brought the power reductions it used to provide. As a result, computing devices are limited by the electrical power they can dissipate, and this limitation has forced designers to find more energy-efficient computing structures. In the 2000s this power limitation led to the rise of multicore processing, and is the reason that practically all current computing devices (outside of embedded systems) contain multiple CPUs on each die. But multiprocessing was not enough to continue to scale performance, and specialized processors were also added to systems to make them more energy efficient. GPUs were added for graphics and data-parallel floating point operations, specialized image and video processors were added to handle video, and digital signal processors were added to handle the processing required for wireless communication.

On one hand, this shift in structure has made computation more energy efficient; on the other, it has made programming the resulting systems much more complex. The vast majority of algorithms and programming languages were created for an abstract computing machine running a single thread of control, with access to the entire memory of the machine. Changing these algorithms and languages to leverage multiple threads is difficult, and mapping them to use the specialized processors is near impossible. As a result, accelerators only get used when performance is essential to the application; otherwise, the code is written for CPU and declared "good enough." Unless we develop new languages and tools that dramatically simplify the task of mapping algorithms onto these modern heterogeneous machines, computing performance will stagnate.

This book describes one approach to address this issue. By restricting the application domain, it is possible to create programming languages and compilers that can ease the burden of creating and mapping applications to specialized computing resources, allowing us to run complete applications on heterogeneous platforms. We will illustrate this with examples from image processing and computer vision, but the underlying principles extend to other domains.
The rest of this chapter explains the constraints that any solution to this problem must work within. The next section briefly reviews how computers were initially able to take advantage of Moore's law scaling without changing the programming model, why that is no longer the case, and why energy efficiency is now key to performance scaling. Section 1.2 then shows how to compare different power-constrained designs to determine which is best. Since performance and power are tightly coupled, they both need to be considered to make the best decision. Using these metrics, and some information about the energy and area cost of different operations, this section also points out the types of algorithms that benefit the most from specialized compute engines. While these metrics show the potential of specialization, Section 1.3 describes the costs of this approach, which historically required large teams to design the customized hardware and develop the software that ran on it. The remaining chapters in this book describe one approach that addresses these cost issues.

1.1 CMOS SCALING AND THE RISE OF SPECIALIZATION

From the earliest days of electronic computers, improvements in physical technology have continually driven computer performance. The first few technology changes were discrete jumps, first from vacuum tubes to bipolar transistors in the 1950s, and then from discrete transistors to bipolar integrated circuits (ICs) in the 1960s. Once computers were built with ICs, they were able to take advantage of Moore's law, the prediction-turned-industry-roadmap which stated that the number of components that could be economically packed onto an integrated circuit would double every two years [Moore, 1965].

As MOS transistor technology matured, gates built with MOS transistors used less power and area than gates built with bipolar transistors, and it became clear in the late 1970s that MOS technology would dominate. During this time Robert Dennard at IBM Research published his paper on MOS scaling rules, which showed different approaches that could be taken to scale MOS transistors [Dennard et al., 1974]. In particular, he observed that if a transistor's operating voltage and doping concentration were scaled along with its physical dimensions, then a number of other properties scaled nicely as well, and the resized transistor would behave predictably. If a MOS transistor is shrunk by a factor of 1/κ in each linear dimension, and the operating voltage is lowered by the same 1/κ, then several things follow:

1. Transistors get smaller, allowing κ² more logic gates in the same silicon area.
2. Voltages and currents inside the transistor scale by a factor of 1/κ.
3. The effective resistance of the transistor, V/I, remains constant, due to 2 above.
4. The gate capacitance C shrinks by a factor of 1/κ (1/κ² due to decreased area, multiplied by κ due to reduced electrode spacing).

The switching time for a logic gate is proportional to the resistance of the driving transistor multiplied by the capacitance of the driven transistor. If the effective resistance remains constant while the capacitance decreases by 1/κ, then the overall delay also decreases by 1/κ, and the chip can be run faster by a factor of κ. Taken together, these scaling factors mean that κ² more logic gates are switched κ× faster, for a total increase of κ³ more gate evaluations per second. At the same time, the energy required to switch a logic gate is proportional to CV². With both capacitance and voltage decreasing by a factor of 1/κ, the energy per gate evaluation decreased by a factor of 1/κ³.

During this period, roughly every other year, a new technology process yielded transistors which were about 1/√2 as large in each dimension. Following Dennard scaling, this would give a chip with twice as many gates and a faster clock by a factor of 1.4, making it 2.8× more powerful than the previous one. Simultaneously, however, the energy dissipated by each gate evaluation dropped by 2.8×, meaning that total power required was the same as the previous chip. This remarkable result allowed each new generation to achieve nearly a 3× improvement for the same die area and power.
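A minimal C sketch of this arithmetic, assuming κ = √2 per generation and counting only switching power (gates × frequency × energy per gate), makes the result concrete; the second half previews what happens once the supply voltage stops scaling, which is discussed below.

```c
#include <math.h>
#include <stdio.h>

/* Ideal Dennard scaling for one process generation with scale factor k:
 *   gate count   grows by k^2
 *   clock rate   grows by k      (delay ~ R*C, R constant, C shrinks by 1/k)
 *   energy/gate  shrinks by k^3  (E ~ C*V^2, C and V both shrink by 1/k)
 * so total switching power (gates * frequency * energy-per-gate) stays flat. */
int main(void) {
    double k = sqrt(2.0);               /* linear shrink per generation */
    double gates = k * k;               /* 2x more gates                */
    double clock = k;                   /* ~1.4x faster clock           */
    double energy = 1.0 / (k * k * k);  /* ~0.35x energy per gate       */

    printf("Dennard generation: %.2fx performance at %.2fx power\n",
           gates * clock, gates * clock * energy);

    /* If the voltage stops scaling, energy per gate only falls with
     * capacitance (1/k), and the same performance gain costs 2x power. */
    double energy_flat_v = 1.0 / k;
    printf("Fixed-voltage generation: %.2fx performance at %.2fx power\n",
           gates * clock, gates * clock * energy_flat_v);
    return 0;
}
```

Running it prints a constant-power 2.83× gain per generation under ideal scaling, versus a 2× power increase per generation once the voltage stops falling.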
This scaling is great in theory, but what happened in practice is somewhat more circuitous. First, until the mid-1980s, most complex ICs were made with nMOS rather than CMOS gates, which dissipate power even when they aren't switching (known as static power). Second, during this period power supply voltages remained at 5 V, a standard set in the bipolar IC days. As a result of both of these, the power per gate did not change much even as transistors scaled down. As nMOS chips grew more complex, the power dissipation of these chips became a serious problem. This eventually forced the entire industry to transition from nMOS to CMOS technology, despite the additional manufacturing complexity and lower intrinsic gate speed of CMOS.

After transitioning to CMOS ICs in the mid-1980s, power supply voltages began to scale down, but not exactly in sync with technology. While transistor density and clock speed continued to scale, the energy per logic gate dropped more slowly. With the number of gate evaluations per second increasing faster than the energy of gate evaluation was scaling down, the overall chip power grew exponentially.

This power scaling is exactly what we see when we look at historical data from CMOS microprocessors, shown in Figure 1.1. From 1980 to 2000, the number of transistors on a chip increased by about 500× (Figure 1.1a), which corresponds to scaling transistor feature size by roughly 20×. During this same period of time, processor clock frequency increased by 100×, which is 5× faster than one would expect from simple gate speed (Figure 1.1b). Most of this additional clock speed gain came from microarchitectural changes to create more deeply pipelined "short tick" machines with fewer gates per cycle, which were enabled by better circuit designs of key functional units. While these fast clocks were good for performance, they were bad from a power perspective.

By 2000, computers were executing 50,000× more gate evaluations per second than they had in the 1980s. During this time the average capacitance had scaled down, providing a 20× energy savings, but power supply voltages had only scaled by 4–5× (Figure 1.1c), giving roughly a 25× savings. Taken together the capacitance and supply scaling only reduce the gate energy by around 500×, which means that the power dissipation of the processors should increase by two orders of magnitude during this period. Figure 1.1d shows that is exactly what happened.
[Figure 1.1: four panels of historical CPU data from 1970 to 2020 — (a) Transistors Per Chip, (b) CPU Frequency, (c) Operating Voltage, (d) Thermal Design Power (TDP).] Figure 1.1: From the 1960s until the early 2000s, transistor density and operating frequency scaled up exponentially, providing exponential performance improvements. Power dissipation increased but was kept in check by lowering the operating voltage. Data from CPUDB [Danowitz et al., 2012].

Up to this point, all of these additional transistors were used for a host of architectural improvements that increased performance even further, including pipelined datapaths, superscalar instruction issue, and out-of-order execution. However, the instruction set architectures (ISAs) for various processors generally remained the same through multiple hardware revisions, meaning that existing software could run on the newer machine without modification—and reap a performance improvement.
But around 2004, Dennard scaling broke down. Lowering the gate threshold voltage further caused the leakage power to rise unacceptably high, so the supply voltage began to level out just below 1 V. Without the possibility to manage the power density by scaling voltage, manufacturers hit the "power wall" (the red line in Figure 1.1d). Chips such as the Intel Pentium 4 were dissipating a little over 100 W at peak performance, which is roughly the limit of a traditional package with a heatsink-and-fan cooling system. Running a CPU at significantly higher power than this requires an increasingly complex cooling system, both at a system level and within the chip itself.

Pushed up against the power wall, the only choice was to stop increasing the clock frequency and find other ways to increase performance. Although Intel had predicted processor clock rates over 10 GHz, actual numbers peaked around 4 GHz and settled back between 2 and 4 GHz (Figure 1.1b). Even though Dennard scaling had stopped, taking down frequency scaling with it, Moore's law continued its steady march forward. This left architects with an abundance of transistors, but the traditional microarchitectural approaches to improving performance had been mostly mined out. As a result, computer architecture has turned in several new directions to improve performance without increasing power consumption.

The first major tack was symmetric multicore, which stamped down two (and then four, and then eight) copies of the CPU on each chip. This has the obvious benefit of delivering more computational power for the same clock rate. Doubling the core count still doubles the total power, but if the clock frequency is dialed back, the chip runs at a lower voltage, keeping the energy constant while maintaining some of the performance advantage of having multiple cores. This is especially true if the parallel cores are simplified and designed for energy efficiency rather than single-thread performance. Nonetheless, even simple CPU cores incur significant overhead to compute their results, and there is a limit to how much efficiency can be achieved simply by making more copies.

The next theme was to build processors to exploit regularity in certain applications, leading to the rise of single-instruction-multiple-data (SIMD) instruction sets and general-purpose GPU computing (GPGPU). These go further than symmetric multicore in that they amortize the instruction fetch and decode steps across many hardware units, taking advantage of data parallelism. Neither SIMD nor GPUs were new; SIMD had existed for decades as a staple of supercomputer architectures and made its way into desktop processors for multimedia applications along with GPUs in the late 1990s. But in the mid-2000s, they started to become prominent as a way to accelerate traditional compute-intensive applications.

A third major tack in architecture was the proliferation of specialized accelerators, which go even further in stripping out control flow and optimizing data movement for particular applications. This trend was hastened by the widespread migration to mobile devices and "the cloud," where power is paramount and typical use is dominated by a handful of tasks.
A modern smartphone System-on-chip (SoC) contains more than a dozen custom compute engines, created specifically to perform intensive tasks that would be impossible to run in real time on the main CPU. For example, communicating over WiFi and cellular networks requires complex coding and modulation/demodulation, which is performed on a small collection of hardware units specialized for these signal processing tasks. Likewise, decoding or encoding video—whether for watching Netflix, video chatting, or camera filming—is handled by hardware blocks that only perform this specific task. And the process of capturing raw pixels and turning them into a pleasing (or at least presentable) image is performed by a long pipeline of hardware units that demosaic, color balance, denoise, sharpen, and gamma-correct the image.

Even low-intensity tasks are getting accelerators. For example, playing music from an MP3 file requires relatively little computational work, but the CPU must wake up a few dozen times per second to fill a buffer with sound samples. For power efficiency, it may be better to have a dedicated chip (or accelerator within the SoC, decoupled from the CPU) that just handles audio.

While there remain some performance gains still to be squeezed out of thread and data parallelism by incrementally advancing CPU and GPU architectures, they cannot close the gap to a fully customized ASIC. The reason, as we've already hinted, comes down to power. Cell phones are power-limited both by their battery capacity (roughly 8–12 Wh) and the amount of heat it is acceptable to dissipate in the user's hand (around 2 W). The datacenter is the same story at a different scale. A warehouse-sized datacenter consumes tens of megawatts, requiring a dedicated substation and a cheap source of electrical power. And like phones, data center performance is constrained partly by the limits of our ability to get heat out, as evidenced by recent experiments and plans to build datacenters in caves or in frigid parts of the ocean.

Thus, in today's power-constrained computing environment, the formula for improvement is simple: performance per watt is performance. Only specialized architectures can optimize the data storage and movement to achieve the energy reduction we want. As we will discuss in Section 1.4, specialized accelerators are able to eliminate the overhead of instructions by "baking" them into the computation hardware itself. They also eliminate waste for data movement by designing the storage to match the algorithm.

Of course, general-purpose processors are still necessary for most code, and so modern systems are increasingly heterogeneous. As mentioned earlier, SoCs for mobile devices contain dozens of processors and specialized hardware units, and datacenters are increasingly adding GPUs, FPGAs, and ASIC accelerators [AWS, 2017, Jouppi et al., 2017]. In the remainder of this chapter, we'll describe the metrics that characterize a "good" accelerator and explain how these factors will determine the kind of systems we will build in the future. Then we lay out the challenges to specialization and describe the kinds of applications for which we can expect accelerators to be most effective.
1.2 WHAT WILL WE BUILD NOW?

Given that specialized accelerators are—and will continue to be—an important part of computer architecture for the foreseeable future, the question arises: What makes a good accelerator? Or said another way, if I have a potential set of designs, how do I choose what to add to my SoC or datacenter, if anything?

1.2.1 PERFORMANCE, POWER, AND AREA

On the surface, the good things we want are obvious. We want high performance, low power, and low cost.

Raw performance—the speed at which a device is able to perform a computation—is the most obvious measure of "good-ness." Consumers will throw down cash for faster devices, whether that performance means quicker web page loads or richer graphics. Unfortunately, this isn't easy to quantify with the most commonly advertised metrics. Clock speed matters, but we also need to account for how much work is done on each clock cycle. Multiplying clock speed by the number of instructions issued per cycle is better, but still ignores the fact that some instructions might do much more work than others. And on top of this, we have the fact that utilization is rarely 100% and depends heavily on the architecture and application.

We can quantify performance in a device-independent way by counting the number of essential operations performed per unit time. For the purposes of this metric, we define "essential operations" to include only the operations that form the actual result of the computation. Most devices require a great deal of non-essential computation, such as decoding instructions or loading and storing intermediate data. These are "non-essential" not because they are pointless or unnecessary but because they are not intrinsically required to perform the computation. They are simply overhead incurred by the specific architecture. With this definition, adding two pieces of data to produce an intermediate result is an essential operation, but incrementing a loop counter is not since the latter is required by the implementation and not the computation itself.

To make things concrete, a 3×3 convolution on a single-channel image requires nine multiplications (multiplying 3×3 pixels by their corresponding weights) and eight 2-input additions per output pixel. For a 640×480 image (307,200 pixels), this is a little more than 5.2 million total operations. A CPU implementation requires many more instructions than this to compute the result since the instruction stream includes conditional branches, loop index computations, and so forth. On the flip side, some implementations might require fewer instructions than operations, if they process multiple pieces of data on each instruction or have complex instructions that fuse multiple operations. But implementations across this whole spectrum can be compared if we calculate everything in terms of device-independent operations, rather than device-specific instructions.
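The operation count in this example is easy to tabulate; the short C sketch below (illustrative only—it ignores image borders) reproduces the roughly 5.2-million-operation figure and converts an assumed frame time into essential operations per second.

```c
#include <stdio.h>

/* Count the "essential" arithmetic for a 3x3 convolution: 9 multiplies and
 * 8 two-input adds per output pixel, with boundary handling ignored. */
int main(void) {
    const long width = 640, height = 480;
    const long pixels = width * height;        /* 307,200 output pixels */
    const long ops_per_pixel = 9 + 8;          /* multiplies + adds     */
    const long total_ops = pixels * ops_per_pixel;

    printf("essential ops per frame: %ld (about %.1f million)\n",
           total_ops, total_ops / 1e6);

    /* A hypothetical runtime of 2 ms per frame would correspond to: */
    double seconds = 2e-3;
    printf("at %.1f ms/frame: %.2f Gops/s of essential work\n",
           seconds * 1e3, total_ops / seconds / 1e9);
    return 0;
}
```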
The second metric is power consumption, measured in Watts. In a datacenter context, the power consumption is directly related to the operating cost, and thus to the total cost of ownership (TCO). In a mobile device, power consumption determines how long the battery will last (or how large a battery is necessary for the device to survive all day). Power consumption also determines the maximum computational load that can be sustained without causing the device to overheat and throttle back.

The third metric is cost. We'll discuss development costs further in the following section, but for now it is sufficient to observe that the production cost of the final product is closely related to the silicon area of the chip, typically measured in square millimeters (mm²). More chips of a smaller design will fit on a fixed-size wafer, and smaller chips are likely to have somewhat higher yield percentages, both of which reduce the manufacturing cost.

However, as important as performance, power, and silicon area are as metrics, they can't be used directly to compare designs, because it is relatively straightforward to trade one for the other. Running a chip at a higher operating voltage causes its transistors to switch more rapidly, allowing us to increase the clock frequency and get increased performance, at the cost of increased power consumption. Conversely, lowering the operating voltage along with the clock frequency saves energy, at the cost of lower performance.¹

¹ Of course, modern CPUs do this scaling on the fly to match their performance to the ever-changing CPU load, known as "Dynamic Voltage and Frequency Scaling" (DVFS).

It isn't fair to compare the raw performance of a desktop Intel Core i7 to an ARM phone SoC, if for no other reason than that the desktop processor has a 20–50× power advantage. Instead, it is more appropriate to divide the power (Joules per second) by the performance (operations per second) to get the average energy used per computation (Joules per operation). Throughout the rest of this book, we'll refer to this as "energy per operation" or pJ/op. We could equivalently think about maximizing the inverse: operations/Joule.

For a battery-powered device, energy per operation relates directly to the amount of computation that can be performed with a single battery charge; for anything plugged into the wall, this relates the amount of useful computation that was done with the money you paid to the electric company.

A similar difficulty is related to the area metric. For applications with sufficient parallelism, we can double performance simply by stamping down two copies of the same processor on a chip. This benefit requires no increase in clock speed or operating voltage—only more silicon. This was, of course, the basic impetus behind going to multicore computation. Even further, it is possible to lower the voltage and clock frequency of the two cores, trading performance for energy efficiency as described earlier.

As a result, it is possible to improve either power or performance by increasing silicon area as long as there is enough parallelism. Thus, when comparing between architectures for highly parallel applications, it is helpful to normalize performance by the silicon area used. This gives us operations/Joule divided by area, or ops/(mm²·J). These two compound metrics, pJ/operation and ops/(mm²·J), give us meaningful ways to compare and evaluate vastly different architectures. However, it isn't sufficient to simply minimize these in the abstract; we must consider the overall system and application workload.
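A small C sketch shows how the two compound metrics fall out of the raw measurements; the device numbers below are invented placeholders for illustration, not measurements of any real CPU or accelerator.

```c
#include <stdio.h>

/* Compute the two compound metrics from the text for a device:
 *   energy per operation  = power / performance            (shown in pJ/op)
 *   area-normalized value = (performance / power) / area   (ops per Joule per mm^2)
 * All numbers below are purely illustrative placeholders.                    */
struct design {
    const char *name;
    double ops_per_sec;   /* essential operations per second */
    double watts;         /* power drawn while doing them    */
    double area_mm2;      /* silicon area of the block       */
};

static void report(struct design d) {
    double pj_per_op = d.watts / d.ops_per_sec * 1e12;
    double ops_per_joule_per_mm2 = (d.ops_per_sec / d.watts) / d.area_mm2;
    printf("%-12s %10.1f pJ/op   %.2e ops/(mm^2*J)\n",
           d.name, pj_per_op, ops_per_joule_per_mm2);
}

int main(void) {
    struct design cpu   = { "big CPU",     5e9,  30.0, 100.0 };  /* hypothetical */
    struct design accel = { "accelerator", 2e10,  0.5,   5.0 };  /* hypothetical */
    report(cpu);
    report(accel);
    return 0;
}
```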
1.2.2 FLEXIBILITY

Engineers building a system are concerned with a particular application, or perhaps a collection of applications, and the metrics discussed are only helpful insofar as they represent performance on the applications of interest. If a specialized hardware module cannot run our problem, its energy and area efficiency are irrelevant. Likewise, if a module can only accelerate parts of the application, or only some applications out of a larger suite, then its benefit is capped by Amdahl's law. As a result, we have a flexibility tradeoff: more flexible devices allow us to accelerate computation that would otherwise remain on the CPU, but increased flexibility often means reduced efficiency.

Suppose a hypothetical fixed-function device can accelerate 50% of a computation by a factor of 100, reducing the total computation time from 1 second to 0.505 seconds. If adding some flexibility to the device drops the performance to only 10× but allows us to accelerate 70% of the computation, we will now complete the computation in 0.37 seconds—a clear win.
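This is just Amdahl's law applied to an accelerated fraction; a few lines of C (a sketch of ours, not code from any of the systems described later) reproduce the 0.505 s and 0.37 s figures.

```c
#include <stdio.h>

/* Amdahl-style runtime: the accelerated fraction runs `speedup` times
 * faster, while the rest of the program is unchanged. */
static double accelerated_time(double total, double fraction, double speedup) {
    return total * (1.0 - fraction) + total * fraction / speedup;
}

int main(void) {
    double t = 1.0;  /* original runtime in seconds */
    printf("fixed-function: 50%% at 100x -> %.3f s\n",
           accelerated_time(t, 0.50, 100.0));
    printf("more flexible:  70%% at  10x -> %.3f s\n",
           accelerated_time(t, 0.70, 10.0));
    return 0;
}
```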
Moreover, many applications demand flexibility, whether the product is a networking device that needs to support new protocols or an augmented-reality headset that must incorporate the latest advances in computer vision. As more and more devices are connected to the internet, consumers increasingly expect that features can be upgraded and bugs can be fixed via over-the-air updates. In this market, a fixed-function device that cannot support rapid iteration during prototyping and cannot be reconfigured once deployed is a major liability.

The tradeoff is that flexibility isn't free, as we have already alluded to. It almost always hurts efficiency (performance per watt or ops/(mm²·J)) since overhead is spent processing the configuration. Figure 1.2 illustrates this by comparing the performance and efficiency for a range of designs proposed at ISSCC a number of years ago. While newer semiconductor processes have reduced energy across the board, the same trend holds: the most flexible devices (CPUs) are the least efficient, and increasing specialization also increases performance, by as much as three orders of magnitude.

[Figure 1.2: energy efficiency (MOPS/mW, log scale from 0.01 to 1,000) for microprocessors, general-purpose DSPs, and dedicated designs.] Figure 1.2: Comparison of efficiency for a number of designs from ISSCC, showing the clear tradeoff between flexibility and efficiency. Designs are sorted by efficiency and grouped by overall design paradigm. Figure from Marković and Brodersen [2012].

In certain domains, this tension has created something of a paradox: applications that were traditionally performed completely in hardware are moving toward software implementations, even while competing forces push related applications away from software toward hardware. For example, the fundamental premise of software defined radio (SDR) is that moving much (or all) of the signal processing for a radio from hardware to software makes it possible to build a system that is simpler, cheaper, and more flexible. With only a minimal analog front-end, an SDR system can easily run numerous different coding and demodulation schemes, and be upgraded over the air. But because real-time signal processing requires extremely high computation rates, many SDR platforms use an FPGA, and carefully optimized libraries have been written to fully exploit the SIMD and digital signal processing (DSP) hardware in common SoCs. Likewise, software-defined networking aims to provide software-based reconfigurability to networks, but at the same time more and more effort is being poured into custom networking chips.

1.3 THE COST OF SPECIALIZATION

To fit these metrics together, we must consider one more factor: cost. After all, given the enormous benefits of specialization, the only thing preventing us from making a specialized accelerator for everything is the expense.

Figure 1.3 compares the non-recurring engineering (NRE) cost of building a new high-end SoC on the past few silicon process nodes. The price tags for the most recent technologies are now well out of reach for all but the largest companies. Most ASICs are less expensive than this, by virtue of being less complex, using purchased or existing IP, having lower performance targets, and being produced on older and mature processes [Khazraee et al., 2017]. Yet these costs still run into the millions of dollars and remain risky undertakings for many businesses.

Several components contribute to this cost. The most obvious is the price of the lithography masks and tooling setup, which has been driven up by the increasingly high precision of each process node. Likewise, these processes have ever-more-stringent design rules, which require more engineering effort during the place and route process and in verification. The exponential increase in number of transistors has enabled a corresponding growth in design complexity, which comes with increased development expense.
[Figure 1.3: stacked cost breakdown (million USD, 0–500) by process node from 65 nm (2006) through 45/40 nm, 28 nm, 22 nm, 16/14 nm, 10 nm (2017), 7 nm, and 5 nm, split into software, physical design, verification, architecture, IP, and prototype validation.] Figure 1.3: Estimated cost breakdown to build a large SoC. The overall cost is increasing exponentially, and software comprises nearly half of the total cost. (Data from International Business Strategies [IBS, 2017].)

Some of these additional transistors are used in ways that do not appreciably increase the design complexity, such as additional copies of processor cores or larger caches. But while the exact slope of the correlation is debatable, the trend is clear: More transistors means more complexity, and therefore higher design costs. Moreover, with increased complexity comes increased costs for testing and verification.

Last, but particularly relevant to this book, is the cost of developing software to run the chip, which in the IBS estimates accounts for roughly 40% of the total cost. The accelerator must be configured, whether with microcode, a set of registers, or something else, and it must be interfaced with the software running on the rest of the system. Even the most rigid of "fixed" devices usually have some degree of configurability, such as the ability to set an operating mode or to control specific parameters or coefficients.

This by itself is unremarkable, except that all of these "configurations" are tied to a programming model very different than the idealized CPU that most developers are used to. Timing details become crucial, instructions execute out of order or in a massively parallel fashion, and concurrency and synchronization are handled with device-specific primitives. Accelerators are, almost by definition, difficult to program.
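To make "configured with a set of registers" concrete, the C sketch below models a hypothetical memory-mapped accelerator. The register offsets and bit meanings are invented for illustration, and a real driver would map the device's physical register window (for example with mmap on Linux) instead of the local array used here, but the write-configuration, kick-off, then poll pattern is representative.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical register map for an imaginary stencil accelerator.
 * Offsets and bit meanings are invented for illustration only.       */
enum {
    REG_SRC_ADDR = 0,   /* physical address of the input image  */
    REG_DST_ADDR = 1,   /* physical address of the output image */
    REG_WIDTH    = 2,
    REG_HEIGHT   = 3,
    REG_CTRL     = 4,   /* bit 0 = start                        */
    REG_STATUS   = 5,   /* bit 0 = done                         */
    NUM_REGS
};

/* Stand-in for the device: in a real system this would be a volatile
 * pointer returned by mmap()ing the device's register window.        */
static volatile uint32_t regs[NUM_REGS];

static void run_accelerator(uint32_t src, uint32_t dst, uint32_t w, uint32_t h) {
    regs[REG_SRC_ADDR] = src;
    regs[REG_DST_ADDR] = dst;
    regs[REG_WIDTH]    = w;
    regs[REG_HEIGHT]   = h;
    regs[REG_CTRL]     = 1;              /* kick off the computation    */
    while ((regs[REG_STATUS] & 1) == 0)  /* busy-wait; real drivers     */
        ;                                /* usually sleep on an IRQ     */
}

int main(void) {
    regs[REG_STATUS] = 1;  /* pretend the device finished immediately */
    run_accelerator(0x10000000u, 0x20000000u, 640, 480);
    printf("configured a 640x480 run through %d registers\n", NUM_REGS);
    return 0;
}
```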
To state the obvious, the more configurable a device is, the more effort must go into configuring it. In highly configurable accelerators such as GPUs or FPGAs, it is quite easy—even typical—to produce configurations that do not perform well. Entire job descriptions revolve around being able to work the magic to create high-performance configurations for accelerators. These people, informally known as "the FPGA wizards" or "GPU gurus," have an intimate knowledge of the device hardware and carry a large toolbox of techniques for optimizing applications. They also have excellent job security.

This difficulty is exacerbated by a lack of tools. Specialized accelerators need specialized tools, often including a compiler toolchain, debugger, and perhaps even an operating system. This is not a problem in the CPU space: there are only a handful of competitive CPU architectures, and many groups are developing tools, both commercial and open source. Intel is but one of many groups with an x86 C++ compiler, and the same is true for ARM. But specialized accelerators are not as widespread, and making tools for them is less profitable. Unsurprisingly, NVIDIA remains the primary source of compilers, debuggers, and development tools for their GPUs. This software design effort cannot easily be pushed onto third-party companies or the open-source community, and becomes part of the chip development cost. As we stand today, bringing a new piece of silicon to market is as much about writing software as it is designing logic. It isn't sufficient to just "write a driver" for the hardware; what is needed is an effective bridge to application-level code.

Ultimately, companies will only create and use accelerators if the improvement justifies the expense. That is, an accelerator is only worthwhile if the engineering cost can be recouped by savings in the operating cost, or if the accelerator enables an application that was previously impossible. The operating cost is closely tied to the efficiency of the computing system, both in terms of the number of units necessary (buying a dozen CPUs vs. a single customized accelerator) and in terms of time and electricity. Because it is almost always easier to implement an algorithm on a more flexible device, this cost optimization results in a tug-of-war between performance and flexibility, illustrated in Figure 1.4.

This is particularly true for low-volume products, where the NRE cost dominates the overall expense. In such cases, the cheapest solution—rather than the most efficient—might be the best. Often, the most cost-effective solution to speed up an application is to buy a more powerful computer (or a whole rack of computers!) and run the same horribly inefficient code on it. This is why an enormous amount of code, even deployed production code, is written in languages like Python and Matlab, which have poor runtime performance but terrific developer productivity.

Our goal is to reduce the cost of developing accelerators and of mapping emerging applications onto heterogeneous systems, pushing down the NRE of the high-cost/high-performance areas of this tradeoff space. Unless we do so, it will remain more cost effective to use general-purpose systems, and computer performance in many areas will suffer.
[Figure 1.4: engineering cost vs. operating cost for CPU, optimized CPU, GPU, FPGA, and ASIC implementations.] Figure 1.4: Tradeoff of operating cost (which is inversely related to runtime performance) vs. non-recurring engineering cost (which is inversely related to flexibility). More flexible devices (CPUs and GPUs) require less development effort but achieve worse performance compared to FPGAs and ASICs. We aim to reduce the engineering development cost (red arrows), making it more feasible to adopt specialized computing.

1.4 GOOD APPLICATIONS FOR ACCELERATION

Before we launch into systems for programming accelerators, we'll examine which applications can be accelerated most effectively. Can all applications be accelerated with specialized processors, or just some of them? The short answer is that only a few types of applications are worth accelerating. To see why, we have to go back to the fundamentals of power and energy. Given that, for a modern chip, performance per watt is equivalent to performance, we want to minimize the energy consumed per unit of computation. That is, if the way to maximize operations per second is to maximize operations per second per watt, we can cancel "seconds," and simply maximize operations per Joule.

Table 1.1 shows the energy required for a handful of fundamental operations in a 45 nm process. The numbers are smaller for more recent process nodes, but the relative scale remains essentially the same. The crucial observation here is that a DRAM fetch requires 500× more energy than a 32-bit multiplication, and 50,000× more than an 8-bit addition. The cost of fetching data from memory completely dwarfs the cost of computing with it. The cache hierarchy helps, of course,
  • 35. V SHE GOES ON SUNDAY TO THE CHURCH Eumenes Fane’s marriage had been both more respectable and more romantic than his kind enemies believed: living in Paris, he had eloped with a handsome, wilful French girl of noble family. Her relations swallowed the match as a bitter pill, his did not exist; and the married lovers lived in isolation far away in Brittany until death cut short their long honeymoon. Eumenes returned to England embittered; he had always been disagreeable. The relations between him and his children were eccentric. He lived with them, he had taught them, yet he lavished satire upon their boorishness and stupidity; he had been devoted to the mother, yet for the children he had no feeling but unamiable contempt. They, on their part, repaid him with indifference. Bernard at eighteen, on his own initiative, took control of the farm and made it pay; Dolly managed the dairy and the household. Their lives were isolated equally from their father and from the world. Bernard was not much of a reader, and never strayed far from his Shakespeare and his farming journals, with excursions into Tennyson; but Dolly was insatiable. She had read and digested every book in their heterogeneous library. Unfortunately, the collection was not representative; the modern French novelists were there arranged in full tale, and fresh volumes were added as they appeared, but it had no single work of English fiction later than the date of the admirable Sir Charles Grandison. Both Bernard and Dolly could read and speak French as easily as English, though they did not know the worth of their accomplishment; and from their study of fin-de-siècle literature they had gained an innocently lurid knowledge of the world which hardly fitted in with the conditions of English country life, and was particularly inappropriate as applied to the blameless households at the vicarage, the surgery, or The Lilacs. When young Merton of The Hall brought home a pretty bride, Dolly seriously looked for the appearance of Tertium Quid. He delayed his
  • 36. coming for a year, and then arrived in the cradle. Dolly was surprised; but she ascribed this breach of custom to the fact that Merton senior’s money was made in soap. Only the true aristocrats indulge in a friend of the house. After Farquhar’s visit Dolly made a dress for herself. It was then the fashion to wear a bodice opening at the sleeves and in front to show a lighter under-dress, which also appeared beneath the skirt, as the corolla of a flower beneath the calyx. Dolly’s gown of dark chestnut matched her hair; the colour of the vest was white. She was more skilful in the dairy than with her needle, but she gave her mind to this, and in the end her work was crowned with fair success. “I guess that colour, what they call, suits you,” said Bernard, whom she called in to assist at the full-dress rehearsal. “I expect it does,” assented Dolly, bending back her swan’s-neck to catch a glimpse of her supple young waist in the spotty mirror. “It fits rather badly; any one can see it is homemade, but that can’t be helped. I am going to wear it to church on Christmas Day.” “Father’ll be awfully angry if you go to church.” “Of course, but that doesn’t matter. No one except small shopkeepers and mill-girls goes to chapel now. Besides, the minister drops his h’s and mixes his metaphors and talks the silliest nonsense: I wouldn’t listen to him even if it were the fashion. Shall you come with me?” “I guess I’d better. Have you seen that Farquhar chap again?” “I have,” Dolly answered, composedly. “You’ll get yourself into a mess if you don’t look out.” “Oh no. He may get into a mess, but I shall not.” “Then I don’t think you are playing fair.” “Yes, I am. He knows why I spoke to him.” “Why did you?” “To know how ladies behave.”
  • 37. “I suppose you’ll go your own way,” said Bernard, after a pause; “but people’ll talk if you go on meeting him.” “Let them. I don’t mean to stay down here.” “I do,” said Bernard. Dolly perceived the force of this objection. She valued Farquhar’s advice; but where her own aims clashed with Bernard’s well-being, she rarely hesitated. “Very well; I won’t meet him again,” she said. “But, Bernard, if he speaks to you, do you respond. Ask him here; no one can find fault if I see him in my own house. Or I don’t think they can; do you?” She was reassured by Bernard’s hearty assent, backed by a special instance. “For,” said he, “when Maude had his sister staying here, Farquhar went and saw them; and I guess if he goes to Maude’s house he can come to us.” And the point was thus settled. Two days before Christmas the wind blew softly from the south, the snow melted from the earth and the clouds from the sky, the robins broke out into their pure celestial strains, and it was spring in all but name. Farquhar’s invalid began to pester his doctor for permission to go out, and Dolly got a white hat to go with her chestnut gown. Christmas Day itself was a flash of summer. Dolly came down dressed for church at half-past ten, and found her brother ready in a Norfolk jacket, knickerbockers, and a cap. An inward monitor told her that this attire was incorrect, and she said so; but as Bernard had nothing else to wear, the question solvitur ambulando. Neither of them had ever been to church. In early days Bernard had been sent to a chapel with a damnatory creed, and he took his sister with him till she developed opinions of her own: an epoch early in Dolly’s history. She rebelled: Bernard, who was bored by the service, outraged by the music, and submissive only from indifference, supported her: and Mr. Fane’s graceless children took their own way, and henceforth spent the Sabbath hours in reading, prefaced always by a chapter of the Bible.
  • 38. They arrived late, having lingered in the woods because Dolly said, and Bernard agreed, that Mrs. Merton and the lady in the black frills had never entered the church till after the bells stopped ringing. Such is the force of bad example. Bernard held the door open for his sister, and followed her in, according to instructions which he had received from her, and she from Noel Farquhar. The aisles were crossed by dim sunbeams swimming with drowsy motes, the people were sleepy, the priest was monotoning monotonously out of tune; and Dolly’s entrance, in company with a beam of pure sunshine and a gust of wind which set the Christmas wreaths rustling all round the church, electrified everybody. Heads turned to stare; the choristers, ever the devotees of inattention, nudged and whispered. Up the aisle came Dolly, a glowing piece of colour in her rich dress and richer hair, with the immaculate whiteness of her brow and the deepening carmine of her cheeks, her eyes shining like brown diamonds. She walked steadily, carrying her head high, up to the big square pew assigned by tradition to the house of Fanes, unlatched the door, and took her seat. Bernard followed, his height and his patent unconcern making his figure quite as imposing as hers. For a space Dolly knelt, as she saw others doing, and hid her hot face; but when the time came she rose, and pinched Bernard, who had sat down and stayed there. He got up slowly, plunged his hands into his pockets, and looked round him. Dolly was convinced that his behaviour was improper; she also looked round her, but without moving her head, and found her exemplar in the person of Noel Farquhar, who was attentively following the service in a large prayer- book. Three volumes lay on the shelf of their pew; Dolly opened one and handed another to her brother, signing to him to do his duty. He looked into it helplessly; it was a copy of Hymns Ancient and Modern, and it is not surprising that he could not find the place. Dolly was no better off, but she had a model to imitate; she turned over the pages as though they were perfectly familiar, found her place near the beginning of the volume, and devoutly studied the evening hymns while the choristers chanted the Venite. The recollection of that morning always brought a smile to Dolly’s lips. Occupied by her culte of deportment, and still more by her culte
  • 39. of Bernard’s deportment, she missed the humours at the moment, but found them all the more amusing under the enchantment lent by distance. Bernard, who was not thinking about himself, was not amused. Music at chapel had been bad enough, but this, more ambitious, was really horrible. The choir sang neither better nor worse than most village performers; there was a preponderance of trebles out of tune and raucous, an absence of altos, two tenors who sang wrong, and three basses who sang treble. When they should have monotoned they climbed unevenly and one by one in linked sweetness long drawn out down a chromatic scale, until Bernard suddenly launched the true note at them in a voice of startling richness and power, which would have made his fortune had he taken it to market in town. It had the true bass quality, but an unusually extensive compass, ranging from the C below the bass clef up to the octave of middle C. After he began to sing, most of the curious eyes were diverted from Dolly to him, and she regained her composure. Farquhar had not looked at her; it was not his cue to let his eye wander during service. But Dolly was sure, from the dark flush which overspread his face, that he had seen her enter. She designed this meeting as a test. If he refused to acknowledge her before his friends, Dolly vowed that she would never speak to him again. Her pride of birth was keen; she went to the length of thinking her brother the only gentleman present, inasmuch as he alone, so far as she knew, had the right to bear arms. She took little part in the religious ceremonies. Dolly had her creed, and held to it in practice, but at this time she was too intent on this world to think much of the next. She got up with alacrity after the benediction, and marshalled out Bernard, glad to go. The organist was now playing music soft and slow, and tenderly touching the pedals with boots so large that he frequently put down two notes at once by accident. Music was really the only subject about which Bernard was sensitive; as a false quantity to a Latinist, as a curse to a Quaker, as a red rag to a bull, so was a wrong note to Bernard Fane. Outside shone the sun and breathed the wind and danced the grasses over the graves of women as young and beautiful as Dolly;
but she was not thinking of them. The stream of people began to condense into groups of two and three, who gave each other the accustomed greetings and echoed cheerful wishes at cross purpose in a babel of inanity. Farquhar was shaking hands with Mrs. Merton, a fragile little lady with dark eyes, frileuse, as Dolly christened her, who dressed very well and talked plaintive nonsense in an erratic fashion. Dolly knew by instinct that they were speaking of her. She went on at an even pace. Farquhar broke from his friends and followed, and Dolly, with true Christmas good-will in her heart, found herself shaking his hand in the overhand style, according to the custom of the lady in black frills.

"I wish I could walk home your way; I've a hundred things to say about that Burnt House business, and one never has a chance of seeing Mr. Fane. But I've an invalid at home who's to take his first airing to-day, and I know he'll go too far if I don't look after him."

"Is that the chap you picked up on the road?" asked Bernard, who had heard the story from the men, with romantic embellishments.

"Oh, I didn't pick him up; don't think it; he was planted on me by Providence. I say, Fane, if you've nothing better to do, I wish you'd come in to-night and have a knock-up at billiards. It would be a Christian act, for I've not a soul in the house except the invalid, who toddles off to bye-bye at seven."

"I can't play billiards," was Bernard's reply, rather proudly spoken.

"Right; I'll teach you. There's nothing I like better; is there, Mrs. Merton?"

"Don't ask me; I never pretend to fathom you," said Mrs. Merton, plaintively, shaking her head. And she put out a very small hand to Dolly. "Please don't snub me, Miss Fane; I'd so like to come and call, if you'll let me. I was told you were a dreadful person, who dropped the h and divided the hoof—skirt, I mean; besides, it was your turn to call first on me. But you aren't dreadful, are you? So may I come?"

Had there been any patronage in Mrs. Merton's manner, Dolly would have been delighted to snub her; but there was none. The formula of gracious acceptance was less easy than a refusal, but Dolly let no one guess her difficulties. An interesting general discussion of the weather followed, during which one remarked that it gave the doctors quite a holiday, a second that it was muggy and unwholesome and why didn't we have a nice healthy frost, a third that it was excellent for the crops, and a fourth that the harvest would be certainly ruined by wireworms, and each agreed with all the rest. Bernard, standing still, thought fashionable people talked like imbeciles. Dolly, shy, though no one saw it, was in a glow of triumph.

Their way home led through woods. So much rain had fallen that the mossy bridle-path was scored with deep ruts full of water, and Dolly had to hold her skirt away from the black leaf-mould. Rain-drops held in crumpled copper leaves shone gemlike, smooth young stems glistened; only the grey boles of the forest trees looked warm and dry. Dolly, herself like a russet leaf, harmonised with the woodland scenery, which seemed a frame made for her.

Farther on down the path, resignedly sitting on a bundle of fagots, and beginning to grow chilly, Lucian de Saumarez was waiting for some one to pass. He had set out with the virtuous intention of returning home in half an hour precisely, but had been lured on by a shrew-mouse, a squirrel, and the enchanting sun, till the end of his strength put a period to his walk; his legs gave way under him. Then he sat down and whistled "Just Break the News to Mother," very cheerfully. It was fortunate that in Bernard's hearing he did not attempt to sing, for his voice can only be described by the adjective squawky. He looked like a tramp who had stolen a coat, for over his own he wore one of Farquhar's, which was truly a giant's robe to him. At first glimpse of Dolly he whipped off his cap, and stood up bareheaded and recklessly polite.

"Excuse me—" he began.

"If you want relief, you'd better go to Alresworth workhouse; they'll take you in there," interrupted Bernard, who would never give to tramps.

"Be quiet, Bernard. Is there anything we can do for you?" asked Dolly, in her gentlest voice.

"Candidly, I only ask an arm, and not an alms," said Lucian, laughing in Bernard's face. "Fact is, I've walked up from The Lilacs and just petered out. Your woods are such a very remarkably long way through."

"Then your name is De Saumarez. Bernard, give Mr. de Saumarez your arm. You must come home with us and rest; afterwards you can go back. You ought not to be sitting down out-of-doors this weather," said Dolly, fixing her imperious young eyes upon him, between pity and severity.

"No, I'm an abomination, I confess it," answered the culprit, meekly.

"You must be feeling very tired."

"I'm feeling more like boned goose than anything else, especially in the legs. By-the-way, I wonder if Farquhar will leave his to look for the strayed lamb?"

"Let him; it won't do him any harm."

Lucian's eyes opened wide; Farquhar had described the ladies of Monkswell in picture-making phrases, and he was trying to fit this vivid young beauty into some one of the frames provided, which all seemed too strait.

"Am I speaking to Miss Maude?" he asked at a venture, choosing the likeliest.

"Oh no. I am Mirabelle Fane, and this is my brother Bernard."

"The dickens you are!" said Lucian to himself; for Farquhar, in relating the adventure of Mr. Fane and the copper, had not mentioned Miss Fane. Her foreign name and intonation caught Lucian's ear, and he asked if she were French.

"My mother was Comtesse de Beaufort," Dolly told him, and her naïve pride was quaint and pretty. Lucian mentioned Paris, and she fastened upon him with a string of eager questions, but put him to silence before half were answered, by declaring that he had talked too much.

"I've been off the silent list this fortnight past," Lucian pleaded.

"But you are already overtired. You ought to lie down directly you get in, and take a dose of cod-liver oil."

"I take cod-liver oil three times a day," Lucian assured her, with equal gravity.

"How? In port wine?"

"I should consider that a sacrilege. No; I will describe the operation," said Lucian, warming to his subject, which in any of his many conversations with pretty girls he had never discussed before. "I squeeze half a lemon into a wineglass, so; then I pour the oil in on it; next I squeeze the juice of the other half-lemon into another wineglass; and finally I swallow first the lemon plus oil and then the lemon solus. It is a process which requires great nicety and precision. Farquhar is not so careful as I could wish. Of course, it is nothing to him if I suffer."

"Port wine would be far more nourishing than lemon-juice," Dolly asseverated, knitting her brows. "Or milk would be better. Have you ever tried goat's milk?"

"I have not; is it a sovereign specific?"

"I have known it work wonderful cures on emaciated people. How much do you weigh?"

"Six stone eleven, I believe."

"That is far too little. You should test your weight every day—are you laughing at me?"

"I'm awfully sorry!" said Lucian, who certainly was. "But, Miss Fane, what a nurse you would make! I was expecting you to feel my pulse, and take my temperature, and look at my tongue."

"So I was intending to do; I have a clinical thermometer at home," Dolly proudly answered. "I do not know how to behave. I have never learned any manners."

"Say you've never learned customs; manners come by nature." Lucian's smile was irresistible.

"Mine come very badly, then," said Dolly, smiling back at him; "for when we get in you will certainly have to lie down; and, what's more, I shall give you a glass of goat's milk."
VI
HONESTY IS THE BEST POLICY

A royal stag, whose many-branched and palmate antlers showed that he had seen at least ten springs, looked down upon the mantel-piece of Noel Farquhar's library; a huge elk fronted him across the room. This style of decoration, which took its origin in the simple skull palisades of primitive Britain and latter-day Africa, which was handed down by the traditions of Tower Hill, and which is rampant in the modern hall, had in Noel Farquhar a devotee. The walls of his smoking-room bristled with the heads of digested enemies. Thither the two men repaired after dinner on Christmas night, taking with them a decanter of mid-century port, cigars of indubitable excellence, and a dish of nuts for Lucian, who took a childlike interest in extracting and peeling walnuts without breaking the kernel. Farquhar was inclined to be silent, in which mood Lucian, the student of the abnormal, found him specially interesting.

"Queer chap you are Farquhar," he suddenly remarked. "Why didn't you ever tell me about the fascinating Fanes?"

"Didn't I? I thought I had." Farquhar did not think any such thing, and Lucian knew it.

"The day I went there Miss Dolly Fane stopped me in the hall, and would know whether I thought she'd make an actress. An odd girl."

"Well, and what did you say to her?"

"Said she would. I couldn't do otherwise, could I?"

"My immaculate friend, I'm afraid the charms of Miss Fane have persuaded you into a statement which is very remarkably near to a L, I, E, lie. At the least, you were disingenuous, decidedly."

"Who says I am immaculate? Not I. You thrust virtues upon me and then cry out when I don't come up to your notions of an archangel."

"And your church-going and your alms-giving and your brand-new coppers and general holiness? Eh, sonny?"

"I've a creed, as four-fifths of the men down here are supposed to have; but whereas they deny in their acts what they repeat with their tongues, I prefer to perform what I profess. There's a fine lack of logic about the way men regard their faith; each time they repeat their Credo they're self-condemned fools. Well, I don't relish making a fool of myself. Either I'll be an infidel, and thus set myself free, or else I'll act up to what I say. For that you praise me. Now, the only virtue to which I do lay claim is patience, of which I think I possess an extraordinary store."

Lucian peeled a walnut with painstaking earnestness, and ate it with salt and pepper. The shell he flicked across at Farquhar, who had fallen into a brown study and was looking very grim. He looked up with a quick, involuntary smile.

"Did you shoot all these horned beasties yourself?" Lucian inquired, introducing the elk and the stag with a wave of the hand.

"Yes. I shot the elk in Russia; the horns weigh a good eighty pounds. Shy brutes they are, and fierce when at bay; this one lamed me with a kick after I thought I had done for him."

"My biggest bag was twenty sjamboks running," said Lucian, pensively. "I and some others were up country on a big shoot, and, of course, I got fever and had to lie up. Well, they used to come in with their blesbok and their springbok, and all the rest of it, so I didn't see why I shouldn't do a little on my own. So I lined up all our niggers with a sjambok apiece, and made my bag from my couch of pain. I worked those sjamboks afterwards for all they were worth. Yes, sir-ree."

"Sometimes I really think you're daft, De Saumarez!"

"Pray don't mention it. Let's see, where were you? Oh, in Russia. No, I've never been there—I don't know Russia at all."

"I do."

"What, intimately?"

Farquhar turned his head, met Lucian's eyes, and smiled. "Oh no; quite slightly," he said, lying with candour and glee.

"Oh, indeed," said Lucian. "Now that's queer; I thought I'd met you there. By the way, do you believe in eternal constancy?"

"In what?"

"In eternal constancy; did you never hear of it before?"

"Well, yes, pulex irritans, I've seen a man go mourning all his life long; so I do believe in it."

"No, no, sonny; I'm not discussing its existence, but its merits. Do you hold that a man should be eternally faithful to the memory of a dead woman?"

"Not if he doesn't want to."

"My point is that he oughtn't to want to. See here; your body changes every seven years, and I'll be hanged if your mind doesn't change, too. Now, your married couple change together and so keep abreast. But if the woman dies, she comes to a stop. In seven years the survivor will have grown right away from her. The constant husband prides himself on his loyalty, and is ashamed to admit even in camera that a resurrected wife wouldn't fit into his present life; but in nine cases out of ten the wound's healed and cicatrised, and only a sentimental scruple bars him from saying so. And there, as I take it, he's wrong."

"What would you have him do?"

"Take another woman and make her and himself happy."

"What becomes of the dead wife's point of view?"

"According to my creed, you know, she's got no point of view at all."

"You can't expect me to follow you there."

"No; and so I'll cite your own creed. After the resurrection there shall be no marrying or giving in marriage. She's no call to be jealous."

"You've no romance about you."

"No sentimentalism, you mean. Half the feelings consecrated by public opinion are trash. It's astounding how we do adore the dumps. Happiness is our first duty. It seems to me that one needs more courage to forget than to remember. That's where I've been weak myself."

Lucian put his hand inside his coat and took out the letter which Farquhar had read; he had been leading up to this point. He spread it open on his knee, showing the thick, chafed, blue paper, the gilded monogram and daisy crest, the thin Italian writing.

"I've carried that about for nine years," he said, glancing up, and then held the paper to the fire and watched it catch light. The advancing line of brown, the blue-edged flame, crept across the letter, leaving shrivelled ash in its track. Lucian held it till the heat scorched his fingers, and then let it fall in the fire.

"A passionate letter, was it not?" he said, turning from the black, rustling tinder to meet Farquhar's eyes.

"My dear De Saumarez!"

"Don't humbug; you read it when you thought I was unconscious."

"Ah," said Farquhar, "now I understand why you understood." He altered his pose slightly, relaxing as though freed from some slight, omnipresent constraint; the nature which confronted Lucian was different in gross and in detail from the mask of excellence which he had hitherto kept on. Vices were there, and virtues unsuspected: coarse, barbaric, potent qualities, dominated by a will-power mightier than they. Race-characteristics, hitherto overlaid, suddenly started out; and Lucian, recurring quickly to the last fresh lie which Farquhar had told him, exclaimed, "Why, man, you're a Russian yourself!"

"Half-breed. My mother was Russian. My father was Scotch, but a naturalized Russian subject. The worse for him; he died in the mines. Confound him: a pretty ancestry he's given me, and a pretty job I've had to keep the story out of the papers. I've done it, though."

"But what's it for?" asked Lucian, whose mind was flying to the story of Jekyll and Hyde.

"Respectability; that's the god of England. Do you think I could confess myself the son of a couple of dirty Russian nihilists and keep my position? Not much. It's the only crevice in my armour. Scores of men have tried to get on by shamming virtuous, but I've gone one better than they; I am virtuous. You can't pick a hole in my character, because there's none to pick. I speak the truth, I do my duty, I'm honest and honourable down to the end of the whole fool's catalogue, I even go out of my way to be chivalrously charitable, as when I picked you up, or made a fool of myself over that confounded copper. That's all the political muck-worms find when they come burrowing about me. Yes, honesty's the best policy; it pays."

"H'm! well, my most honourable friend, you'd find yourself in Queer Street if I related how you'd read my letter."

"Not in the least. I was glancing at it to find your address."

"You took a mighty long time over your glance."

"The paper was so much rubbed that I could hardly see where it began or ended."

"There was the monogram for a sign-post."

"Plenty women begin on the back sheet."

"You're abominable; faith, you are," said Lucian. "You're a regular prayer-mill of lies!"

"I'd never have touched it if I hadn't prepared my excuse beforehand. Ruin my career for the sake of reading an old love-letter? Not I!"

Even as Farquhar wished it, the contemptuous and insulting reference displeased Lucian; the letter was still sacred in his eyes. But he would not, and he did not, allow the feeling to be seen. Farquhar's measure of reserve was matched by his present openness; but Lucian, whose affairs were everybody's business, kept his mind as a fenced garden and a fountain sealed. Action and reaction are always equal and opposite; the law is true in the moral as well as the physical world.

"Kindly speak of my letter with more respect, will you?" was all Lucian said.

"Oh, the letter was charming; I wish it had been addressed to me!"

"You shut up, and don't try to be a profane and foolish babbler. I want to know what it's all for—what's your aim and object, sonny?"

"I'm going to get into the Cabinet."

"You are, are you?" said Lucian. "And why not be premier?"

"And why not king? Because I happen to know my own limitations. I'll make a damned good understrapper, but the other's beyond me."

"You'll change your mind when you've got your wish."

"And there you're wrong. I'll be content then. I'm content now, for that matter. It's as good as a play to see how the virtuous people look up to me."

Lucian leaned back in the attitude proper to meditation, and studied his vis-à-vis over his joined finger-tips. Strength of body, strength of mind, a will keen as a knife-blade to cut through obstacles, an arrogant pride in himself and his sins, all these had writ themselves large on Farquhar's face; but the acute mind of the critic was questing after more amiable qualities.

"And so you took me in as an instance of chivalrous charity, eh? And what do you keep me here for, now I'm sain and safe?"

"You're not well enough to be dismissed cured."

"I beg your pardon. I could go and hold horses to-morrow."

"I shall have to find some work for you before I let you go. I like to do the thing thoroughly."

"I see. I'm being kept as an object-lesson in generosity; is that so?"

"You've hit it," said Farquhar. "Hope you like the position. Have a cigar?"

"No, thanks. I don't mind being a sandwich-man, but I draw the line at an object-lesson." Lucian got up, and began buttoning his coat round him. "If that's your reason for keeping me, I'm off."

"De Saumarez, don't be a fool."

"I will not be an object-lesson," said Lucian, making for the door. "My conscience rebels against the deception. I will expire on your threshold."

Farquhar jumped up and put his back against the door. "Go and sit down, you fool!"

"I've not the slightest intention of sitting down. I will be a body—a demd, damp, moist, unpleasant body."

"Do you mean this?"

"I do. I'm too proud to take money from a man who's not a friend."

Farquhar was very angry. He knew what Lucian wanted, but he would not say it. "Go, and be hanged to you, then!" he retorted, and flung round towards the fire.

"All right, I'm going," said Lucian, as he went into the hall. He took his cap and his stick. Overcoat he had none, and he could not now borrow Farquhar's. His own clothes were inadequate even for mid-day wearing, and for night were absurd. All this Farquhar knew. He heard Lucian unbolt and unlock the front door, and presently the wind swept in, invaded the hall, and made Farquhar shiver, sitting by the fire. Lucian coughed. Up sprang Farquhar; he ran into the hall, flung the door closed, caught Lucian round the shoulders, and in the impatient pride of his strength literally carried him back to the library close to the fire.

"You fool!" he said. "You dashed fool!"

"Well?" said Lucian, looking up, laughing, from the sofa upon which he had been cast. "Own up! Why do you keep me here?"

"Because you have a damnable way of getting yourself liked. Because you're sick."

"Sh! don't swear like that, sonny; you really do shock me. And so you like me?"

"I've always a respect for people who find me out," retorted Farquhar. "The others—Lord, what fools—what fools colossal! But you've grit; you know your own mind; you do what you want, and not what your dashed twopenny-halfpenny passions want. Besides, you're ill," he wound up again, with a change of tone which sent Lucian's eyebrows up to his shaggy hair.

"You're a nice person for a small Sunday-school!" was his comment. "Well, well! So you profess yourself superior to dashed twopenny-halfpenny passions—such as affection, for example?"

"I was bound to stop you going. You'd have died at my door and made a scandal."

"You know very well that never entered your head. Take care what you say; I can still go, you know."

Farquhar laughed, half angry; he chafed under Lucian's control; would fain have denied it, but could not. "Confound you, I wish I'd never seen you!" he said.

"You'll wish that more before you've done. I'm safe to bring bad luck. Gimme your hand and I'll tell your fortune. I can read the palm like any gypsy; got a drop of Romany blood in me, I guess."

"You'll not read mine," said Farquhar, grimly, putting it out.

"Won't I? Hullo! You've got a nice little handful!" The hand was scarred from wrist to finger-tips.

"Never noticed it before, did you? I'm pretty good at hiding it by now."

"How on earth was it done?"

"In hell—that's Africa. I told you I learned massage from an old Arab sheikh; well, I practised on him. I was alone and down with fever, and they don't have river police on the Lualaba. He made me his slave. Used to thrash me when he chose to say I'd not done my work; make me kneel at his feet and strike me on the face."

"Good Lord! How did you like that, sonny?"

"I smiled at him till he got sick of it. Then he put me on silence: one word, death. He thought he'd catch me out, but I'd no notion of that; I held my tongue. So one day the old devil sent me to fetch his knife. It was dusk, and I picked it up carelessly; the handle was white-hot. He'd tried that trick with slaves before. Liked to see them howl and drop it, and then finish them off with the very identical knife—confound him!"

"Amen. And what did you do?"

"I? Brought him his knife by the blade; do you think I was going to let him cheat me out of my career?"

Lucian stared at him. "You—you!" he said. "And I verily believe the man's telling the truth. What happened next?"

"Something to do with termites that I won't repeat; it might make you ill."

"Only a channel steamer does that, sonny. You got away, though?"

"Eventually; half blind and deadly sick. By the way, you've not told me why you made up your mind to burn that letter at this precise time?"

"To draw you, of course. And now you'll be pleased to go and see that my room's ready; I can hear Bernard Fane hammering at the door, so you can play billiards with him while I go to bye-low."
VII
COURAGE QUAND MÊME

January came with the snow-drop, February brought the crocus, and March violets were empurpling the woods before the next scene came on the stage and introduced a new actor. In the meanwhile, Lucian continued to live on Noel Farquhar's bounty. It should have been an intolerable position, but Lucian's luckless head had received such severe bludgeonings at the hands of Fate that he was glad to hide it anywhere, and give his pride the congé. His choice lay between remaining with Farquhar, retiring to the workhouse, and expiring in a haystack without benefit of clergy; he chose the least heroic course, and, sad to say, he found it very pleasant.

One night alarm he gave Farquhar. Punctual to its time, the cold snap of mid-January arrived on the eleventh of the month, and Lucian went skating at Fanes. His tutelary divinity Dolly being absent, he was beguiled into staying late, got chilled, and awoke Farquhar at three in the morning by one of his usual attacks. It was very slight and soon checked, but the incident strengthened the bond between them; for Lucian did not forget Farquhar's face when he found him fighting for breath, nor the lavish tenderness of his subsequent nursing, which seemed to be extorted from him by a force stronger than his would-be carelessness. That constraining force Lucian declined to christen: friendship seemed too mild a term for Farquhar's crude emotions. No one could have felt more horribly ashamed than Lucian, on finding that his host gave up all engagements to wait upon him. He was soon about again, but he now guarded his health as though he had it on a repairing lease.

When Dolly consulted him on points of etiquette, as she soon learned to do, he retaliated with questions concerning the proper conduct of an invalid; it is only fair to say that Dolly was the more correct informant. He was welcome at Fanes.