MULTISCALAR
PROCESSORS
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
by
Manoj Franklin
University of Maryland, U.S.A.
SPRINGER SCIENCE+BUSINESS MEDIA, LLC
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
Franklin, Manoj
MULTISCALAR PROCESSORS
ISBN 978-1-4613-5364-5 ISBN 978-1-4615-1039-0 (eBook)
DOI 10.1007/978-1-4615-1039-0
Copyright © 2003 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2003
Softcover reprint of the hardcover 1st edition 2003
All rights reserved. No part of this work may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, microfilming, recording, or otherwise, without the written
permission from the Publisher, with the exception of any material supplied
specifically for the purpose of being entered and executed on a computer system,
for exclusive use by the purchaser of the work.
Permission for books published in Europe: permissions@wkap.nl
Permissions for books published in the United States of America: permissions@wkap.com
Printed on acid-free paper.
Foreword
The revolution of semiconductor technology has continued to provide mi-
croprocessor architects with an ever-increasing number of faster transistors with
which to build microprocessors. Microprocessor architects have responded by
using the available transistors to build faster microprocessors which exploit
instruction-level parallelism (ILP) to attain their performance objectives. Start-
ing with serial instruction processing in the 1970s microprocessors progressed
to pipelined and superscalar instruction processing in the 1980s and eventually
(mid 1990s) to the currently popular dynamically-scheduled instruction pro-
cessing models. During this progression, microprocessor architects borrowed
heavily from ideas that were initially developed for processors of mainframe
computers and rapidly adopted them for their designs. In the late 1980s it was
clear that most of the ideas developed for high-performance instruction pro-
cessing were either already adopted, or were soon going to be adopted. New
ideas would have to be developed to continue the march of microprocessor per-
formance. The initial multiscalar ideas were developed with this background
in the late 1980s at the University of Wisconsin. The objective was to develop
an instruction processing paradigm for future microprocessors when transistors
were abundant, but other constraints such as wire delays and design verification
were important.
The multiscalar research at Wisconsin started out small but quickly grew to
a much larger effort as the ideas generated interest in the research community.
Manoj Franklin's Ph.D. thesis was the first to develop and study the initial ideas.
This was followed by the Wisconsin Ph.D. theses of Scott Breach, T.N. Vijayku-
mar, Andreas Moshovos, Quinn Jacobson and Eric Rotenberg which studied
various aspects of the multiscalar execution models. A significant amount of
research on processing models derived from multiscalar was also carried out at
other universities and research labs in the 1990s. Today variants of the basic
multiscalar paradigm and other follow-on models continue to be the focus of
significant research activity as researchers continue to build the knowledge base
that will be crucial to the design of future microprocessors.
This book provides an excellent synopsis of a large body of research carried
out on multiscalar processors in the 1990s. It will be a valuable resource for
designers of future microprocessors as well as for students interested in learning
about the concepts of speculative multithreading.
GURI SOHI
UNIVERSITY OF WISCONSIN-MADISON
Soli Deo Gloria
Contents
Foreword by Guri Sohi v
Preface xv
Acknowledgments xix
1 INTRODUCTION 1
1.1 Technology Trends 2
1.1.1 Sub-Micron Technology 2
1.1.2 Implications of Sub-Micron Technology 2
1.2 Instruction-Level Parallelism (ILP) 3
1.2.1 Extracting ILP by Software 5
1.2.2 Extracting ILP by Hardware 9
1.3 Thread-Level Parallelism (TLP) 12
1.3.1 Speculative TLP 13
1.3.2 Challenges for TLP Processing 14
1.4 The Multiscalar Paradigm 15
1.5 The Multiscalar Story 16
1.5.1 Developing the Idea 16
1.5.2 Multi-block based Threads and the ARB 17
1.5.3 Maturing of the Ideas 18
1.5.4 Other Speculative Multithreading Models 19
1.6 The Rest of the Story 20
2 THE MULTISCALAR PARADIGM 25
2.1 Ideal TLP Processing Paradigm-The Goal 26
2.2 Multiscalar Paradigm-The Basic Idea 27
2.3 Multiscalar Execution Example 29
2.3.1 Control Dependences 30
2.3.2 Register Data Dependences 31
2.3.3 Memory Data Dependences 32
2.4 Interesting Aspects of the Multiscalar Paradigm 32
2.5 Comparison with Other Processing Paradigms 35
2.5.1 Multiprocessing Paradigm 35
2.5.2 Superscalar Paradigm 36
2.5.3 VLIW Paradigm 38
2.6 The Multiscalar Processor 38
2.7 Summary 40
3 MULTISCALAR THREADS-STATIC ASPECTS 43
3.1 Structural Aspects of Multiscalar Threads 43
3.1.1 Definition 43
3.1.2 Thread Spawning Model 44
3.1.3 Thread Flow Graph 46
3.1.4 Thread Granularity 48
3.1.5 Thread Size Variance 49
3.1.6 Thread Shape 50
3.1.7 Thread Entry Points 52
3.1.8 Thread Exit Points 54
3.2 Data Flow Aspects of Multiscalar Threads 55
3.2.1 Shared Name Spaces 55
3.2.2 Inter-Thread Data Dependence 55
3.3 Program Partitioning 57
3.3.1 Compiler-based Partitioning 58
3.3.2 Hardware-based Partitioning 59
3.4 Static Thread Descriptor 59
3.4.1 Nature of Information 59
3.4.2 Compatibility Issues and Binary Representation 61
3.5 Concluding Remarks 62
4 MULTISCALAR THREADS-DYNAMIC ASPECTS 65
4.1 Multiscalar Microarchitecture 65
4.1.1 Circular Queue Organization of Processing Units 66
4.1.2 PU Interconnect 68
4.2 Thread Processing Phases 69
4.2.1 Spawn: Inter-Thread Control Prediction 69
4.2.2 Activate 69
4.2.3 Execute 70
4.2.4 Resolve 70
4.2.5 Commit 70
4.2.6 Squash 71
4.3 Thread Assignment Policies 71
4.3.1 Number of Threads in a PU 71
4.3.2 Thread-PU Mapping Policy 72
4.4 Thread Execution Policies 74
4.4.1 Intra-PU Thread Concurrency Policy: TLP 74
4.4.2 Intra-Thread Instruction Concurrency Policy: ILP 75
4.5 Recovery Policies 76
4.5.1 Thread Squashing 77
4.5.2 Basic Block Squashing 77
4.5.3 Instruction Re-execution 78
4.6 Exception Handling 78
4.6.1 Exceptions 78
4.6.2 Interrupt Handling 79
4.7 Concluding Remarks 80
5 MULTISCALAR PROCESSOR-CONTROL FLOW 81
5.1 Inter-Thread Control Flow Predictor 81
5.1.1 Dynamic Inter-Thread Control Prediction 82
5.1.2 Control Flow Outcome 83
5.1.3 Thread History 84
5.1.4 Prediction Automata 85
5.1.5 History Updates 86
5.1.6 Return Address Prediction 87
5.2 Intra-Thread Branch Prediction 92
5.2.1 Problems with Conventional Branch Predictors 93
5.2.2 Bimodal Predictor 96
5.2.3 Extrapolation with Shared Predictor 96
5.2.4 Correlation with Thread-Level Information to Obtain
Accurate History 97
5.2.5 Hybrid of Extrapolation and Correlation 99
5.3 Intra-Thread Return Address Prediction 99
5.3.1 Private RASes with Support from Inter-Thread RAS 100
5.3.2 Detailed Example 100
5.4 Instruction Supply 101
5.4.1 Instruction Cache Options 101
5.4.2 A Hybrid Instruction Cache Organization for Multiscalar
Processor 104
5.4.3 Static Thread Descriptor Cache (STDC) 105
5.5 Concluding Remarks 106
6 MULTISCALAR PROCESSOR-REGISTER DATA FLOW 109
6.1 Nature of Register Data Flow in a Multiscalar Processor 110
6.1.1 Correctness Issues: Synchronization 111
6.1.2 Register Data Flow in Example Code 112
6.1.3 Performance Issues 113
6.1.4 Decentralized Register File 114
6.2 Multi-Version Register File-Basic Idea 115
6.2.1 Local Register File 116
6.2.2 Performing Intra-Thread Register Data Flow 116
6.2.3 Performing Inter-Thread Register Data Flow 117
6.3 Inter-Thread Synchronization: Busy Bits 119
6.3.1 How are Busy Bits Set? Forwarding of Create Mask 119
6.3.2 How are Busy Bits Reset? Forwarding of Register
Values 121
6.3.3 Strategies for Inter-Thread Forwarding 123
6.4 Multi-Version Register File-Detailed Operation 126
6.4.1 Algorithms for Register Write and Register Read 127
6.4.2 Committing a Thread 128
6.4.3 Squashing a Thread 130
6.4.4 Example 131
6.5 Data Speculation: Relaxing Inter-Thread Synchronization 133
6.5.1 Producer Identity Speculation 134
6.5.2 Producer Result Speculation 138
6.5.3 Consumer Source Speculation 143
6.6 Compiler and ISA Support 144
6.6.1 Inter-Thread Data Flow Information 145
6.6.2 Utilizing Dead Register Information 146
6.6.3 Effect of Anti-Dependences 147
6.7 Concluding Remarks 148
7 MULTISCALAR PROCESSOR-MEMORY DATA FLOW 151
7.1 Nature of Memory Data Flow in a Multiscalar Processor 152
7.1.1 Example 152
7.1.2 Performance Issues 154
7.2 Address Resolution Buffer (ARB) 156
7.2.1 Basic Idea 156
7.2.2 Hardware Structure 157
7.2.3 Handling of Loads and Stores 158
7.2.4 Committing or Squashing a Thread 160
7.2.5 Reclaiming the ARB Entries 161
7.2.6 Example 162
7.2.7 Two-Level Hierarchical ARB 164
7.2.8 Novel Features of ARB 164
7.2.9 ARB Extensions 166
7.2.10 Memory Dependence Table: Controlled Dependence
Speculation 167
7.3 Multi-Version Cache 168
7.3.1 Local Data Cache 168
7.3.2 Performing Intra-Thread Memory Data Flow 170
7.3.3 Performing Inter-Thread Memory Data Flow 171
7.3.4 Detailed Working 172
7.3.5 Comparison with Multiprocessor Caches 175
7.4 Speculative Version Cache 175
7.5 Concluding Remarks 177
8 MULTISCALAR COMPILATION 179
8.1 Role of the Compiler 179
8.1.1 Correctness Issues 181
8.1.2 Performance Issues 181
8.1.3 Compiler Organization 181
8.2 Program Partitioning Criteria 183
8.2.1 Thread Size Criteria 183
8.2.2 Control Flow Criteria 184
8.2.3 Data Dependence Criteria 185
8.2.4 Interaction Among the Criteria 188
8.3 Program Partitioning Heuristics 188
8.3.1 Basic Thread Formation Process 189
8.3.2 Control Flow Heuristic 190
8.3.3 Data Dependence Heuristics 190
8.3.4 Loop Recurrence Heuristics 194
8.4 Implementation of Program Partitioning 194
8.4.1 Program Profiling 194
8.4.2 Optimizations 195
8.4.3 Code Replication 195
8.4.4 Code Layout 195
8.5 Intra-Thread Static Scheduling 196
8.5.1 Identifying the Instructions for Motion 197
8.5.2 Cost Model 198
8.5.3 Code Transformations 199
8.5.4 Scheduling Loop Induction Variables 199
8.5.5 Controlling Code Explosion 200
8.5.6 Crosscutting Issues 202
8.6 Concluding Remarks 204
9 RECENT DEVELOPMENTS 207
9.1 Incorporating Fault Tolerance 207
9.1.1 Where to Execute the Duplicate Thread? 208
9.1.2 When to Execute the Duplicate Thread? 209
9.1.3 Partitioning of PUs 210
9.2 Multiscalar Processor with Trace-based Threads 211
9.2.1 Implementation Hurdles of Complex Threads 212
9.2.2 Tree-Like Threads 213
9.2.3 Instruction Cache Organization 215
9.2.4 Advantages 216
9.2.5 Trace Processors 216
9.3 Hierarchical Multiscalar Processor 217
9.3.1 Microarchitecture 219
9.3.2 Inter-Superthread Register Data Flow 219
9.3.3 Inter-Superthread Memory Data Flow 221
9.3.4 Advantages of Hierarchical Multiscalar Processing 221
9.4 Compiler-Directed Thread Execution 221
9.4.1 Non-speculative Inter-Thread Memory Data Flow 221
9.4.2 Thread-Level Pipelining 222
9.4.3 Increased Role of Compiler 222
9.5 A Commercial Implementation: NEC Merlot 223
Index 235
Preface
Semiconductor technology projections indicate that we are on the verge of
having billion-transistor chips. This ongoing explosion in transistor count is
complemented by similar projections for clock speeds, thanks again to advances
in semiconductor process technology. These projections are tempered by two
problems that are germane to single-chip microprocessors-on-chip wire delays
and power consumption constraints. Wire delays, especially in the global wires,
become more important, as only a small portion of the chip area will be reachable
in a single clock cycle. Power density levels, which already exceed that of a
kitchen hot plate, threaten to reach that of a nuclear reactor!
Looking at software trends, sequential programs still constitute a major por-
tion of the real-world software used by various professionals as well as the
general public. State-of-the-art processors are therefore designed with sequen-
tial applications as the primary target. Continued demands for performance
boost have been traditionally met by increasing the clock speed and incor-
porating an array of sophisticated microarchitectural techniques and compiler
optimizations to extract instruction level parallelism (ILP) from sequential pro-
grams. From that perspective, ILP can be viewed as the main success story
form of parallelism, as it was adopted in a big way in the commercial world
for reducing the completion time of ordinary applications. Today's superscalar
processors are able to issue up to six instructions per cycle from a sequential
instruction stream. VLSI technology may soon allow future microprocessors to
issue even more instructions per cycle. Despite this success story, the amount
of parallelism that can be realistically exploited in the form of ILP appears to
be reaching its limits, especially when the hardware is limited to pursuing a
single flow of control. Limitations arise primarily from the inability to support
large instruction windows-due to wire delay limitations and complex program
control flow characteristics-and the ever-increasing latency to memory.
Research on the multiscalar execution model started in the early 1990s, after
recognizing this inadequacy of just relying on ILP. The goal was to expand the
"parallelism bridgehead" established by ILP by augmenting it with the "ground
forces" of thread-level parallelism (TLP), a coarser form of parallelism that is
more amenable to exploiting control independence. Many studies on paral-
lelism indeed confirm the significant performance potential of parallelly exe-
cuting multiple threads of a program. The difficulties that have been plaguing
the parallelization of ordinary, non-numeric programs for decades have been
complex control flows and ambiguous data dependences through memory. The
breakthrough provided by the multiscalar execution model was the use of "se-
quential threads," i.e., threads that form a strict sequential ordering.
Multiscalar threads are nothing but subgraphs of the control flow graph of
the original sequential program. The sequential ordering of threads dictates
that control passes from a thread to exactly one successor thread (among dif-
ferent alternatives). At run-time, the multiscalar hardware exploits TLP (in
addition to ILP) by predicting and executing a dynamic sequence of threads on
multiple processing units (PUs). This sequence is constructed by performing
the required number of thread-level control predictions in succession. Thread-
level control speculation is the essence of multiscalar processing; sequentially
ordered threads are executed in parallel in a speculative manner on independent
PUs, without violating sequential program semantics. In case of misspecula-
tion, the results of the incorrectly speculated thread and subsequent threads are
discarded. The collection of PUs is built in such a way that (i) there are only
a few global wires, and (ii) very little communication occurs through global
wires. Localized communication can be done using short wires, and can be
expected to be fast. Thus the use of multiple hardware sequencers (to fetch
and execute multiple threads)-besides making judicious use of the available
transistor budget increase-fits nicely with the goal of reducing on-chip wire
delays through decentralization.
Besides forming the backbone of several Ph.D. theses, the multiscalar model
has sparked research on several other speculative multithreading models-
superthreading, trace processing, clustered multithreading, and dynamic mul-
tithreading. It has become one of the landmark paradigms, with appearances
in the Call for Papers of important conferences such as ISCA and Micro. It has
been featured in an article entitled "What's Next for Microprocessor Design?"
in the October 2, 1995 issue of Microprocessor Report. Recently multiscalar
ideas have found their way into a commercial implementation from NEC called
Merlot, furthering expectation for this execution model to become one of the
"paradigms of choice" for future microprocessor design.
A detailed understanding of the software and hardware issues related to the
multiscalar paradigm is of utmost importance to researchers and graduate stu-
dents working in advanced computer architecture. The past few years have
indeed seen many publications on the multiscalar paradigm, both from the
academia and industry. However, there has been no book that integrates all
of the concepts in a cohesive manner. This book is intended to serve computer
professionals and students by providing a comprehensive treatment of the basic
principles of multiscalar execution as well as advanced techniques for imple-
menting the multiscalar concepts. The presentation benefits from the many
years of experience the author has had with the multiscalar execution model,
both as Ph.D. dissertation work and as follow-up research work. The discussion
within most of the sections follows a top-down approach. This discussion is
accompanied by a wealth of examples for clarity and ease of understanding.
For each major building block, the book presents alternative designs and dis-
cusses design trade-offs. Special emphasis is placed on highlighting the major
challenges. Of particular importance is deciding where a thread should start
and end. Another challenge is enforcing proper synchronization and commu-
nication of register values as well as memory values from an active thread to
its successors.
The book provides a comprehensive coverage of all topics related to multi-
scalar processors. It starts with an introduction to this topic, including technol-
ogy trends that provided an impetus to the development of multiscalar proces-
sors and are likely to shape the future development of processors. It ends with
a review of the recent developments related to multiscalar processors. We have
three audiences in mind: (1) designers and programmers of next-generation
processors, (2) researchers in computer architecture, and (3) graduate students
studying advanced computer architecture. The primary intended audience are
computer engineers and researchers in the field of computer science and engi-
neering. The book can also be used as a textbook for advanced graduate-level
computer architecture courses where the students have a strong background
in computer architecture. This book would certainly engage the students, and
would better prepare them to be effective researchers in the broad areas of
multithreading and parallel processing.
MANOJ FRANKLIN
Acknowledgments
First of all, I praise and thank my Lord JESUS CHRIST-to whom this book
is dedicated-for HIS love and divine guidance all through my life. Everything
that I am and will ever be will be because of HIM. It was HE who bestowed
me with the ability to do research and write this book. Over the years, I have
come to realize that without such an acknowledgement, all achievements are
meaningless, and a mere chasing after the wind. So, to HIM be praise, glory,
and honor, for ever and ever.
I thank my family and friends for their support and encouragement throughout
the writing of this book. I would like to acknowledge my parents Prof. G. Aruldhas
and Mrs. Myrtle Grace Aruldhas who have been a constant inspiration to
me in intellectual pursuits. My father has always encouraged me to strive
for insight and excellence. Thanks to my wife, Bini, for her companionship,
love, understanding, and undying support. And thanks to my children, Zaneta,
Joshua, and Tesiya, who often succeeded in stealing my time away from this
book and have provided the necessary distraction.
Prof. Guri Sohi, my Ph.D. advisor, was instrumental in the development
and publicizing of the multiscalar paradigm. He provided much insightful
advice while I was working on the multiscalar architecture for my Ph.D. Besides
myself, Scott Breach and T. N. Vijaykumar also completed Ph.D. theses on the
multiscalar paradigm. Much of the information presented in this book has been
assimilated from our theses and papers on the multiscalar paradigm.
The National Science Foundation, DARPA, and IBM have been instrumen-
tal in funding the research on the multiscalar architecture at University of
Wisconsin-Madison, University of Minnesota, and University of Maryland.
Without their support, multiscalar research would not have progressed very far.
Finally, I thank Susan Lagerstrom-Fife and Sharon Palleschi of Kluwer Aca-
demic Publishers for their hard work in bringing this manuscript to publication.
Chapter 1
INTRODUCTION
What to do with slow wires and 1 billion fast transistors?
We have witnessed tremendous increases in computing power over the years,
yet no computer user has ever complained of a glut in computing power; the
demand for computing power seems to increase with supply. To satisfy this
demand in the midst of fast approaching physical limits such as speed of light
and high power density, scientists should find ever more ingenious ways of increasing
the computing power. The main technique computer architects use to achieve
speedup is to do parallel processing of various kinds.
The execution of a computer program involves computation operations as
well as communication of values, both of which are constrained by control
structures in the program. The time taken to execute the program is a function
of the total number of computation operations and communication operations.
It is also a function of the cycle time and the average number of computa-
tion operations and communication operations performed in a cycle. The basic
idea behind parallel processing is to use multiple hardware resources to perform
multiple computation operations and multiple communication operations in par-
allel so as to reduce the program's execution time. With continued advances in
semiconductor technology, switching components have become progressively
smaller and more efficient, with the effect that computation operations have
become very fast. Communication speed, on the other hand, seems to be more
restricted by the effects of physical factors such as the speed of light, and has
become the major bottleneck.
1.1 Technology Trends
Technology has always played a major role in motivating the development
of specific architectural techniques. In the past decade, processor performance
has been increasing at an approximate rate of 50-60% per year. Semiconductor
technology has played a major part in this monotonic increase.
1.1.1 Sub-Micron Technology
Processor performance improvements in the last few decades have been
driven to a large extent by developments in silicon fabrication technology that
have enabled transistor sizes to reduce monotonically. Reduced feature sizes
impact processor design in two important ways:
• They permit more transistors to be integrated into a processor chip. Gathering
from the trends in the late 1990s and the early 2000s, there appears to be no
end in sight to the growth in the number of transistors that can be integrated
on a single chip. Technology projections even suggest the integration of 1
billion transistors in this decade [10] [101], a significant improvement over
what is integrated today. This increasing transistor budget has opened up
new opportunities and challenges for the development of new microarchi-
tectures as well as compilation techniques for the new microarchitectures.
• Technology scaling reduces the transistor gate length and hence the transistor
switching time. This enables the clock speed to be increased.
Ongoing improvements in semiconductor technology have thus provided
computer architects with an increasing number of faster transistors with which
to build processors.
1.1.2 Implications of Sub-Micron Technology
The technological advances described above are tempered, however, by the
fact that in the sub-micron technology era, wire delays are increasing! From
one generation of process technology to the next, the wires are made thinner in
order to cope with the shrinking of logic gates, because it may not be possible
to always increase the number of metal layers. This causes an increase in the
resistance of the interconnecting wires without a commensurate decrease in their
capacitance, thereby increasing the wire delays. This effect will be predominant
in global wires because their length depends on the die size, which is steadily
increasing. The increase in wire delays poses some unique challenges:
• The speed of a logic circuit depends on the sum of gate delays and wire
delays along the critical path from the input to the output of the circuit.
Wire delays become significant compared to gate delays starting with the
0.25 µm CMOS process [101]. This impacts the design of complex circuits
that cannot be easily pipelined to take advantage of potential increases in
clock speed. For instance, detailed studies with 0.8 µm, 0.35 µm, and 0.18
µm CMOS technology [64] show that a centralized dynamic scheduling
hardware does not scale well. This limitation makes it difficult in the fu-
ture to keep up with the current rate of reduction in processor cycle time
[57]. Today digital computing is at a point where clock cycle times of less than
0.5 ns are the norm, and further improvements in the clock speed may re-
quire tremendous engineering effort. An order of magnitude improvement
in clock speed-to achieve clock cycles in the sub-nanosecond range-is
fraught with difficulties, especially because of approaching physical limits
[1].
• The slower wires, along with faster clock rates, will place a severe limit
on the fraction of the chip that is reachable in a single cycle [1]. In other
words, an important implication of the physical limits of wire scaling is that
the area that is reachable in a single clock cycle of future processors will be
confined to a small portion of the die.
Apart from wire delays, increases in power consumption also pose a major
challenge to microprocessor design. How does the microarchitect deal with
these challenges? These challenges have in fact prompted computer architects
to consider new ways of utilizing the additional transistor resources for carrying
out parallel processing. Before looking at these new ways, let us briefly review
the prevalent execution models of the day.
1.2 Instruction-Level Parallelism (ILP)
The parallelism present in programs can be classified into different types-
regular versus irregular parallelism, coarse-grain versus fine-grain (instruction
level) parallelism, etc. Regular parallelism, also known as data parallelism,
refers to the parallelism present in performing the same set of operations on
different elements of a data set, and is very easy to exploit. Irregular parallelism
refers to parallelism that is not regular, and is harder to exploit. Coarse-grain
parallelism refers to the parallelism between large sets of operations such as
subprograms, and is best exploited by a multiprocessor. Fine-grain parallelism,
or instruction level parallelism refers to the parallelism between individual op-
erations. Over the last few decades, several parallel processing paradigms,
including some special purpose paradigms, have been proposed to exploit dif-
ferent types of parallelism. In this section, we take a brief look at techniques to
exploit instruction-level parallelism, the dominant form of parallelism exploited
by microprocessors.
Converting a high-level language program into one that a machine can exe-
cute involves taking several decisions at various levels. Parallelism exploitation
involves additional decisions on top of this. The fundamental aspect in ILP pro-
cessing is: Given a program graph with control and data constraints, arrive at
a good execution schedule in which multiple computation operations are exe-
cuted in a cycle as allowed by the resource constraints in the machine. Arriving
at a good schedule involves manipulations on the program graph, taking into
consideration several aspects such as the ISA and the resources in the machine.
Since there can only be a finite amount of fast storage (such as registers) for
temporarily storing the intermediate computation values, the values have to be
either consumed immediately or stored away into some form of backup storage
(such as main memory), creating additional communication arcs. Thus, the
challenge in ILP processing is not only to identify a large number of indepen-
dent operations to be executed every cycle from a large block of computation
operations having intricate control dependences and data dependences, but also
reduce the inter-operation communication costs and the costs of storing tempo-
rary results. A good paradigm should not only attempt to increase the number of
operations executed in parallel, but also decrease the inter-operation communi-
cation costs by reducing the communication distances and the temporary storing
away of values, thereby allowing the hardware to be expanded as allowed by
technology improvements in hardware and software.
Optimal scheduling (under finite resource constraints) is an NP-complete
problem, necessitating the use of heuristics to take decisions. Although pro-
grammers can ease scheduling by expressing some of the parallelism present
in programs by using a non-standard high-level language (HLL), the major
scheduling decisions have to be taken by the compiler, by the hardware, or by
both of them. There are different trade-offs in taking the decisions at program-
ming time, at compile time, and at run time.
A program's inputs (which can affect scheduling decisions) are available only
at run-time when the program is executed, leaving compilers to work with con-
servative assumptions while taking scheduling decisions. Run-time deviations
from the compile-time assumptions render the quality of the compiler-generated
schedule poor, and increase the program execution time significantly. On the
other hand, any scheduling decisions taken by the hardware could increase the
hardware complexity, and hence the machine cycle time, making it practical for
the hardware to analyze only small portions of the program at a time. Different
ILP processing paradigms differ in the extent to which scheduling decisions are
taken by the compiler or by the hardware. In this section, we explore the differ-
ent steps involved in ILP processing. To explore the full possibilities of what
can be done by the compiler and what can be done by the hardware, this dis-
cussion assumes a combination of control-driven specification and data-driven
execution.
1.2.1 Extracting ILP by Software
Extraction of ILP can be performed by software and by hardware. The
motivation for using software to extract ILP is to keep the hardware simpler,
and therefore faster. The motivation for using hardware to extract ILP, on the
other hand, is to extract that parallelism which can be detected only at run
time. A central premise of this book is that these two methods are not mutually
exclusive, and can both be used in conjunction to extract as much parallelism as
possible. There are three fundamental steps in extracting ILP from a program:
(1) Establish a window of operations. (2) Determine and minimize dependences
between operations in this window. (3) Schedule operations.
1.2.1.1 Establishing a Window of Operations
The first step in extracting ILP from a program at compile time is to establish
a path or a subgraph in the program's control flow graph (CFG), called an oper-
ation window. The two important criteria in establishing the operation window
are that the window should be both large and accurate. Small windows tend to
have only small amounts of parallelism. Control dependences caused by con-
ditional branches are the major hurdle in establishing a large and accurate static
window. To overcome this, compilers typically analyze both paths of a condi-
tional branch or do a prediction as to which direction the branch is most likely to
go. Because an important component of most window-establishment schemes
is the accurate prediction of conditional branches, a considerable amount of
research has gone into better branch prediction techniques. Initial static predic-
tion schemes were based on branch opcodes, and were not accurate. Now, static
prediction schemes are much more sophisticated, and use profile information
or heuristics to take decisions [40] [70].
In addition to branch prediction, the compiler uses several other techniques
to overcome the effects of control dependences. Some of these techniques are
if-conversion, loop unrolling, loop peeling, loop conditioning, loop exchange,
function inlining, replacing a set of IF-THEN statements by a jump table [70],
and even changing data structures. All these techniques modify the CFG of the
program, mostly by reducing the number of control decision points in the CFG.
We shall review some of these schemes in terms of the type of modifications
done to the CFG and how the modifications are incorporated.
1.2.1.2 Determining and Minimizing Dependences
Once a window of operations has been established, the next step is to deter-
mine the data dependences between the operations in the window, which exist
through (pseudo)registers and memory locations. If register allocation has al-
ready been performed, then this step involves determining the register storage
dependences (anti- and output dependences) as well.
Static Memory Address Disambiguation: Static memory address disam-
biguation is the process of determining if two memory references (at least
one of which is a store) could ever point to the same memory location in any
valid execution of the program. Good static memory disambiguation is fun-
damental to the success of any parallelizing compiler. This is a hard task as
memory addresses could correspond to pointer variables, whose values might
change at run time. Two memory references may be dependent in one instance
of program execution and not dependent in another instance, and static dis-
ambiguation has to consider all possible executions of the program. Various
techniques have been proposed to do static disambiguation of memory refer-
ences involving arrays [19]. These techniques involve the use of conventional
flow analyses of reaching definitions to derive symbolic expressions for array
indexes, in terms of compile-time constants, loop-invariants, and induction
variables, as well as variables whose values cannot be derived at compile time.
For arbitrary multi-dimensional arrays and complex array subscripts, unfortu-
nately, many of the test results can be too conservative; several techniques have
been proposed to produce exact dependence relations for certain subclasses of
multi-dimensional arrays. Current static disambiguation techniques are able to
perform inter-procedural analysis as well as some pointer analysis. It is also
possible to utilize annotations from the programmer.
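To see why this is hard, consider the small C fragment below, written for this discussion: the array references can be disambiguated symbolically, while the pointer references remain ambiguous unless pointer analysis proves otherwise.

/* The compiler can prove a[i] and a[i+1] refer to different locations
 * within an iteration (their indexes differ by a constant), so these
 * two references may be reordered.  The store *p, however, may alias
 * the load *q, so the load cannot be hoisted above the store without
 * further pointer analysis or run-time checks. */
void example(int *a, int *p, int *q, int n)
{
    for (int i = 0; i < n - 1; i++) {
        a[i] = a[i + 1] * 2;   /* provably independent references */
        *p = a[i];             /* ambiguous store                 */
        a[i] += *q;            /* load that may alias *p          */
    }
}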
Once the dependences in the window are determined, the dependences can
be minimized by techniques such as software register renaming (if register
allocation has been performed), induction variable expansion, and accumulator
variable expansion. A description of some of these techniques is given below.
Software Register Renaming: Reuse of storage names (variables by the
programmer and registers by the compiler) introduces artificial anti- and output
dependences, and restricts the static scheduler's opportunities for reordering
operations. Many of these artificial dependences can be eliminated with soft-
ware register renaming. The idea behind software register renaming is to use
a unique architectural register for each assignment in the window, in similar
spirit to static single assignment [13].
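A minimal sketch of the transformation follows, with invented variable names standing in for architectural registers.

/* Before renaming: the temporary t is reused, so the second write to t
 * (an output dependence) and the earlier read of t (an anti-dependence)
 * serialize two otherwise independent computations. */
void before(int a, int b, int c, int d, int *x, int *y)
{
    int t;
    t  = a + b;
    *x = t * 2;
    t  = c + d;    /* must wait for the read of t above */
    *y = t * 3;
}

/* After renaming: each assignment gets a fresh name, in the spirit of
 * static single assignment; the two chains can now be scheduled in
 * parallel or in either order. */
void after(int a, int b, int c, int d, int *x, int *y)
{
    int t1 = a + b;
    int t2 = c + d;
    *x = t1 * 2;
    *y = t2 * 3;
}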
Induction Variable Expansion: Induction variables, used within loops to
index through loop iterations and arrays, can cause anti-, output, and flow de-
pendences between different iterations of a loop. Induction variable expansion
is a technique to reduce the effects of such dependences caused by induction
variables. The main idea is to eliminate re-assignments of the induction variable
within the window, by giving each re-assignment of the induction variable a new
induction variable name, thereby eliminating all dependences due to multiple
assignments.
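The following sketch, invented for illustration, shows the effect on an unrolled loop.

/* Before expansion: in the unrolled body, every increment of i depends
 * on the previous one, chaining the two stores together. */
void before(int *a, int n)
{
    for (int i = 0; i < n - 1; ) {
        a[i] = 0;  i = i + 1;   /* flow dependence into the next pair */
        a[i] = 0;  i = i + 1;
    }
}

/* After expansion: i1 and i2 are computed directly from the value of i
 * at loop entry, so the two stores are independent and there is a
 * single re-assignment of i per iteration. */
void after(int *a, int n)
{
    for (int i = 0; i < n - 1; i = i + 2) {
        int i1 = i;
        int i2 = i + 1;
        a[i1] = 0;
        a[i2] = 0;
    }
}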
1.2.1.3 Scheduling Operations
Once an operation window is established, and the register dependences and
memory dependences in the window are determined and minimized, the next
step is to move independent operations up in the CFG, and schedule them
in parallel with other operations so that they can be initiated and executed
earlier than they would be in a sequential execution. If a static scheduler uses a
basic block as the operation window, then the scheduling is called basic block
scheduling. If the scheduler uses multiple basic blocks as the operation window,
then the scheduling is called global scheduling. Basic block schedulers are
simpler than global schedulers, as they do not deal with control dependences;
however, their use for extracting parallelism is limited. Global scheduling is
more useful, as it considers large operation windows. Several global scheduling
techniques have been developed over the years to establish large static windows
and to carry out static code motions in the windows. These include trace
scheduling [19], superblock scheduling [40], software pipelining [45, 89, 102,
103, 161], and boosting [79].
Trace Scheduling: The key idea of trace scheduling is to reduce the execu-
tion time along the more frequently executed paths, possibly by increasing the
execution time in the less frequently executed paths. Originally developed for
microcode compaction, trace scheduling later found application in ILP process-
ing. The compiler forms the operation window by selecting from an acyclic part
of the CFG the most likely path, called a trace, that will be taken at run time. The
compiler typically uses profile-based estimates of conditional branch outcomes
to make judicious decisions in selecting the traces. There may be conditional
branches out of the middle of the trace and branches into the middle of the
trace from outside. However, the trace is treated and scheduled as if there were
no control dependences within the trace; special compensation codes are in-
serted on the off-trace branch edges to ensure program correctness. Then the
next likely path is selected and scheduled, and the process is continued until
the entire program is scheduled. Trace scheduling is very useful for numeric
programs in which there are a few most likely executed paths. In non-numeric
programs, however, many conditional branches are statically difficult to predict,
let alone have a high probability of branching in any one direction.
Superblock Scheduling: Superblock scheduling is a variant of trace schedul-
ing. A superblock is a trace with a unique entry point and one or more exit points,
and is the operation window used by the compiler to extract parallelism. Su-
perblocks are formed by identifying traces using profile information, and then
using tail duplication to eliminate any control entries into the middle of the
trace. In order to generate large traces, techniques such as branch target expan-
sion, loop peeling, and loop unrolling are used. Once a superblock is formed,
the anti-, output, and flow dependences within the superblock are reduced by
standard techniques, and then scheduling is performed within the superblock.
In order to reduce the effect of control dependences, operations are speculatively
moved above conditional branches.
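The invented fragment below illustrates tail duplication: in the original loop the join block has a side entrance from the infrequent path, and duplicating it lets the frequent path form a single-entry superblock.

/* Original: the statement 'sum += a[i]' is a join block with two
 * entries, so the frequent path (condition true) cannot form a
 * superblock by itself. */
int before(int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= 0)        /* profile: almost always taken */
            a[i] = a[i] * 2;
        else
            a[i] = -a[i];     /* infrequent path */
        sum += a[i];          /* join block with a side entrance */
    }
    return sum;
}

/* After tail duplication: the join block is copied into the infrequent
 * path, so the frequent path has a unique entry (the loop head) and
 * only exits -- a superblock suitable for aggressive scheduling. */
int after(int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] >= 0) {
            a[i] = a[i] * 2;
            sum += a[i];      /* frequent path, single entry */
        } else {
            a[i] = -a[i];
            sum += a[i];      /* duplicated tail */
        }
    }
    return sum;
}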
Hyperblock Scheduling: In hyperblock scheduling, the operation window
is a hyperblock, which is an enhancement on superblock. A hyperblock is a set
of predicated basic blocks in which control may enter only from the top, but
may exit from one or more points. The difference between a hyperblock and
a superblock is that a superblock contains instructions from only one path
of control, whereas a hyperblock contains instructions from
multiple paths of control. If-conversion is used to convert control dependences
within the hyperblock to data dependences. The predicated instructions are
reordered without consideration to the availability of their predicates. The
compiler assumes architectural support to guarantee correct execution.
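A small if-conversion sketch follows, with predication approximated in plain C; a real target would use predicate registers or conditional moves.

/* Before: a hard-to-predict branch creates a control dependence. */
int before(int a, int b, int x)
{
    if (a > b)
        x = x + a;
    else
        x = x - b;
    return x;
}

/* After if-conversion: both arms are computed and the predicate
 * selects the result, so the control dependence becomes a data
 * dependence on p and no branch is needed. */
int after(int a, int b, int x)
{
    int p  = (a > b);    /* predicate                       */
    int t1 = x + a;      /* executed under predicate p      */
    int t2 = x - b;      /* executed under predicate !p     */
    return p ? t1 : t2;  /* select, e.g. a conditional move */
}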
Software Pipelining: The static scheduling techniques described so far deal
mostly with operation windows involving acyclic codes. Software pipelining is
a static technique for scheduling windows involving loops. The principle behind
software pipelining is to overlap or pipeline different iterations of the loop body.
The methodology is to do loop unrolling and scheduling of successive iterations
until a repeating pattern is detected in the schedule. The repeating pattern can
be re-rolled to yield a loop whose body is the repeating schedule. Different
techniques have been proposed to do software pipelining: perfect pipelining
[10], enhanced pipeline scheduling [47], GURPR* [149], modulo scheduling
[48, 124], and polycyclic scheduling [125].
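The hand-pipelined sketch below, invented for this discussion, shows the idea for a simple loop: the load of iteration i+1 is overlapped with the multiply and store of iteration i, yielding a prologue, a repeating kernel, and an epilogue.

/* Source loop: load, multiply, and store form a serial chain in each
 * iteration. */
void scale(int *a, int n, int k)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * k;
}

/* Software-pipelined form: the kernel overlaps the load for iteration
 * i+1 with the multiply and store for iteration i. */
void scale_pipelined(int *a, int n, int k)
{
    if (n <= 0)
        return;
    int t = a[0];               /* prologue: first load               */
    for (int i = 0; i < n - 1; i++) {
        int t_next = a[i + 1];  /* load for iteration i+1 (overlap)   */
        a[i] = t * k;           /* multiply + store for iteration i   */
        t = t_next;
    }
    a[n - 1] = t * k;           /* epilogue: drain the last iteration */
}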
Boosting: Boosting is a technique for statically specifying speculative ex-
ecution. Conceptually, boosting converts control dependences into data de-
pendences using a technique similar to if-conversion, and then executes the
if-converted operations in a speculative manner before their predicates are
available. Extra buffering is provided in the processor to hold the results
of speculative operations. When the predicate of a speculatively executed op-
eration becomes available, the hardware checks if the operation's execution
was required. If the execution was required, the non-speculative state of the
machine is updated with the buffered effects of that operation's execution. If
the operation should not have been executed, the hardware simply discards the
state and side-effects of that operation's execution. Boosting provides the com-
piler with additional opportunity for reordering operations, while making the
hardware responsible for ensuring that the effects of speculatively executed op-
erations do not affect the correctness of program execution when the compiler
is incorrect in its speculation.
Advantages of Static Extraction of ILP: The singular advantage of using
the compiler to extract ILP is that the compiler can do a global and much more
thorough analysis of the program than is possible by the hardware. It can even
consider the entire program as a single window, and do global scheduling in this
window. Furthermore, extraction of ILP by software allows the hardware to be
simpler. In any case, it is a good idea to use the compiler to extract whatever
parallelism it can extract, and to do whatever scheduling it can to match the
parallelism to the hardware model.
Limitations of Static Extraction of ILP: Static extraction of ILP has its
limitations. The main limitation is the extent to which static extraction can be
done for non-numeric programs in the midst of a conglomeration of ambiguous
memory dependences and data-dependent conditional branches. The inflexi-
bility in moving ambiguous memory operations can pose severe restrictions on
static code motion in non-numeric programs. Realizing this, researchers have
proposed schemes that allow ambiguous references to be statically reordered,
with checks made at run time to determine if any dependences are violated by the
static code motions [62]. Ambiguous references that are statically reordered are
called statically unresolved references. A limitation of this scheme, however,
is that the run-time checks need extra code and, in some schemes, associative
compares of store addresses with preceding load addresses in the active window.
Another issue of concern in static extraction of ILP is code explosion. An issue,
probably of less concern nowadays, is that any extraction of parallelism done
at compile time is architectural, and hence may be tailored to a specific archi-
tecture or implementation. This is not a major concern, as specific compilers
have become an integral part of any new architecture or implementation.
1.2.2 Extracting ILP by Hardware
Given a program with a particular static ordering, the hardware can change
the order and execute instructions concurrently or even out-of-order in order
to extract additional parallelism, so long as the data dependences and control
dependences in the program are honored. There is a price paid in doing this
run-time scheduling, however. The price is the complexity it introduces to the
hardware, which could lead to potential increases in cycle time. For hardware
scheduling to be effective, any increase in cycle time should be offset by the
additional parallelism extracted at run time. When the hardware extracts ILP,
the same 3 steps mentioned in Section 1.2.1 are employed. However, instead
of doing the 3 steps in sequence, the hardware usually overlaps the steps, and
performs all of them in each clock cycle.
1.2.2.1 Establishing a Window of Instructions
To extract large amounts of ILP at run time, the hardware has to establish a
large window of instructions. It typically does that by fetching a fixed number
of instructions every cycle, and collecting these instructions in a hardware win-
dow structure. The main hurdles in creating a large dynamic window are control
dependences, introduced by conditional branches. To overcome these hurdles,
the hardware usually performs speculative fetching of instructions. With spec-
ulative fetching, rather than waiting for the outcome of a conditional branch
to be determined, the branch outcome is predicted, and operations from the
predicted path are entered into the window for execution. Dynamic prediction
techniques have significantly evolved over the years [58] [98]. Although the
accuracies of contemporary dynamic branch prediction techniques are fairly
high, averaging 95% for the SPEC non-numeric programs, the accuracy of a
large window obtained through n independent branch predictions in a row is
only (0.95)^n on the average, and is therefore poor even for moderate values of
n. Notice that this problem is an inherent limitation of following a single line
of control. The multiscalar paradigm that we describe in this book breaks this
restriction by following multiple flows of control.
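The arithmetic is easy to reproduce; the short program below, written for this discussion, tabulates the window accuracy p^n using the 95% per-branch accuracy quoted above.

/* Probability that a window built from n branch predictions in a row
 * is entirely correct: p^n.  With p = 0.95 the accuracy collapses
 * quickly as the window deepens. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double p = 0.95;
    for (int n = 1; n <= 32; n *= 2)
        printf("n = %2d branches: window accuracy = %.3f\n", n, pow(p, n));
    /* prints roughly 0.950, 0.902, 0.815, 0.663, 0.440, 0.194 */
    return 0;
}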
1.2.2.2 Determining and Minimizing Dependences
In parallel to establishing the window, the hardware also determines the dif-
ferent types (flow, anti-, and output) of register and memory dependences be-
tween the instructions in the window. Register dependences are comparatively
easy to determine as they require only the comparison of the source and desti-
nation operand specifiers of the operations. Determining memory dependences
is harder, and is described below.
Dynamic Memory Address Disambiguation: To determine the memory
dependences in the established window, memory references must be disam-
biguated. Disambiguating two memory references at run time means deter-
mining if the two references point to the same memory location or not. In
processors that perform dynamic extraction of parallelism, dynamic disam-
biguation involves comparing the addresses of all loads and stores in the active
window; a simple approach is to perform this comparison by means of associa-
tive searches, which becomes extremely complex for large windows. Chapter
7 further addresses the issues involved in dynamic disambiguation. Over the
years, different techniques have been proposed for performing dynamic disam-
biguation [29].
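A behavioral sketch of this associative search follows; the structure names are invented, and real hardware would use content-addressable memories rather than a software loop.

/* Before a load issues, its address is compared against every older,
 * not-yet-committed store in the window.  An unresolved store address
 * or an address match forces the load to wait (or to obtain the data
 * by forwarding).  The scan below models the associative search whose
 * hardware cost grows rapidly with window size. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;        /* store address, if known               */
    bool     addr_known;  /* false while the address is unresolved */
} StoreEntry;

bool load_must_wait(const StoreEntry *older_stores, int n_stores,
                    uint64_t load_addr)
{
    for (int i = 0; i < n_stores; i++) {
        if (!older_stores[i].addr_known)
            return true;   /* ambiguous: conservatively wait */
        if (older_stores[i].addr == load_addr)
            return true;   /* true memory dependence         */
    }
    return false;          /* provably independent: issue now */
}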
After determining the register and memory dependences in the window, the
next focus is on reducing the anti- and output dependences (storage conflicts)
in the window, in order to facilitate aggressive reordering of instructions. The
natural hardware solution to reduce such storage conflicts is to provide more
physical storage, and use some dynamic renaming scheme to map from the
limited architectural storage to the not-so-limited physical storage. An example
for this technique is register renaming.
Hardware Register Renaming: Storage conflicts occur very frequently with
registers, because they are limited in number, and serve as the hub for inter-
operation communication. The effect of these storage conflicts becomes very
severe if the compiler has attempted to keep as many values in as few registers
as possible, because the execution order assumed by a compiler is different
from the one the hardware attempts to create. A hardware solution to decrease
such storage conflicts is to provide additional physical registers, which are
then dynamically allocated by hardware register renaming techniques. With
hardware register renaming, typically a free physical register is allocated for
every assignment to a register in the window, much like the way software
register renaming allocates architectural registers. Many different techniques
are available to perform hardware register renaming.
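As a concrete sketch, the toy rename stage below uses a map table and a free list; the table sizes and names are invented, not taken from any particular processor.

/* Renaming allocates a fresh physical register for every destination,
 * eliminating anti- and output dependences at run time.  Sources read
 * the current mapping, preserving flow dependences. */
#include <stdio.h>

#define NUM_ARCH 32
#define NUM_PHYS 64

static int map_table[NUM_ARCH];            /* architectural -> physical */
static int free_list[NUM_PHYS], free_top;  /* stack of free phys regs   */

static void rename_init(void)
{
    for (int a = 0; a < NUM_ARCH; a++)
        map_table[a] = a;
    free_top = 0;
    for (int p = NUM_PHYS - 1; p >= NUM_ARCH; p--)
        free_list[free_top++] = p;
}

/* Rename one instruction "rd = rs1 op rs2". */
static void rename(int rd, int rs1, int rs2)
{
    int p1 = map_table[rs1];        /* sources read current mapping */
    int p2 = map_table[rs2];
    int pd = free_list[--free_top]; /* fresh physical destination   */
    map_table[rd] = pd;             /* later readers see new name   */
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void)
{
    rename_init();
    rename(1, 2, 3);  /* r1 = r2 op r3                                  */
    rename(4, 1, 1);  /* r4 reads the renamed r1 (flow dependence kept) */
    rename(1, 5, 6);  /* rewrite of r1: no output dependence remains    */
    return 0;
}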
1.2.2.3 Scheduling Instructions
In parallel to establishing a window and enforcing the register and mem-
ory dependences, the hardware performs scheduling of ready-to-execute in-
structions. Instructions that are speculatively fetched from beyond unresolved
branches are executed speculatively, i.e., before determining that their execu-
tion is needed. The hardware support for speculative execution consists of extra
buffering in the processor, which holds the effects of speculatively executed in-
structions. When a conditional branch is resolved, if the earlier prediction was
correct, all speculative instructions that are directly control dependent on the
branch are committed. If the prediction was incorrect, then the results of spec-
ulatively executed instructions are discarded, and instructions are fetched and
executed from the correct path. Several dynamic techniques are available to
carry out speculative execution along with precise state recovery [36].
Hardware schedulers often use simplistic heuristics to choose from the in-
structions that are ready for execution. This is because any sophistication of the
instruction scheduler directly impacts the hardware complexity. A number of
dynamic scheduling techniques have been proposed: CDC 6600's scoreboard
[85], Tomasulo's algorithm [86], decoupled execution [80], register update unit
(RUU) [81], dispatch stack [18], deferred-scheduling register-renaming instruc-
tion shelf (DRIS) [67], etc. A detailed treatment of some of these schemes is
available in [25] [36] [44].
Advantages of Dynamic Extraction of ILP: The major advantage in doing
(further) extraction of ILP at run-time is that the hardware can utilize the in-
formation that is available only at run time to extract the ILP that could not be
extracted at compile time. In particular, the hardware can resolve ambiguous
memory dependences, which cannot be resolved at compile time, and use that
information to make more informed decisions in extracting ILP. The schedule
developed at run time is also better adapted to run-time uncertainties such as
cache misses and memory bank conflicts.
Limitations of Dynamic Extraction of ILP: Although dynamic scheduling
with a large centralized window has the potential to extract large amounts of ILP,
a realistic implementation of a wide-issue (say a 16-issue) processor with a fast
clock is not likely to be possible because of its complexity. A major reason has to
do with the hardware required to parse a number of instructions every cycle. The
hardware required to extract independent instructions from a large centralized
window and to enforce data dependences typically involves wide associative
searches, and is non-trivial. While this hardware is tolerable for 2-issue and 4-
issue processors, its complexity increases rapidly as the issue width is increased.
The major issues of concern for wide-issue processors include: (i) the ability
to create accurate windows of perhaps 100s of instructions, needed to sustain
significant levels of ILP, (ii) elaborate mechanisms to enforce dependences
between instructions in the window, (iii) possibly wide associative searches in
the window for detecting independent instructions, and (iv) possibly centralized
or serial resources for disambiguating memory references at run time.
1.3 Thread-Level Parallelism (TLP)
Modern microprocessors make use of a variety of instruction-level parallel
processing techniques to achieve high performance. The commodity micropro-
cessor industry uses a variety of microarchitectural techniques such as pipelin-
ing, branch prediction, out-of-order execution, and superscalar execution, and
sophisticated compiler optimizations. Such hardware-centered techniques ap-
pear to have scalability problems in the sub-micron technology era, and are
already appearing to run out of steam. According to a recent position paper by
Dally and Lacy [14], "over the past 20 years, the increased density of VLSI chips
was applied to close the gap between microprocessors and high-end CPUs. To-
day this gap is fully closed and adding devices to uniprocessors is well beyond
the point of diminishing returns". We view ILP as the form of parallelism that has
been the main success story thus far, as it was adopted in a big way in the commercial world
for reducing the completion time of general purpose applications. The future
promises to expand the "parallelism bridgehead" established by ILP with the
"ground forces" of thread-level parallelism (TLP), by using multiple process-
ing elements to exploit both fine-grained and coarse-grained parallelism in a
natural way.
Why, in any case, must we look at ingenious ways to exploit thread-level par-
allelism? After all, medium-grain and coarse-grain parallelism have been regu-
larly exploited by multiprocessors for several decades. The answer is that many
important applications exist (mostly non-numeric) in which conventional TLP
techniques appear to be ineffective. For these applications, speculative TLP
appears to be the only type of parallelism that can be exploited. Exploitation
of parallelism at the instruction level can only provide limited performance for
such programs. Many studies have confirmed that there exists a large amount of
parallelism in ordinary programs [5] [11] [61] [94]. Even in other applications,
no matter how much parallelism is exploited by ILP processing, a substantial
amount of parallelism will still remain to be exploited at a higher granularity.
Therefore, irrespective of the speedup obtained by ILP processing, TLP pro-
cessing can provide additional speedup on top of it. Thus, TLP processing
and ILP processing complement each other, and we can expect future processors
to be doing both.
1.3.1 Speculative TLP
A natural way to make use of the additional transistor budget and to deal with
the wire delay problem is to use the concept of speculative multithreading in
the processor microarchitecture. That is, build the processor as a collection of
independent processing units (PUs), each of which executes a separate thread or
flow of control. By designing the processor as a collection of PUs, (i) the number
of global wires is reduced, and (ii) very little communication occurs through global
wires. Thus, much of the communication occurring in the multi-PU processor is
local in nature, and occurs through short wires. Such a decentralized processor
can execute groups of instructions independently, and is not fundamentally
limited by technological constraints like the processors made of centralized
hardware resources.
Although multithreading and multiprocessing have been used in the high-
est performance computer systems for the past 30 years, they were traditionally
confined to special-purpose paradigms for exploiting regular parallelism from
numeric programs. In this book we place a strong emphasis on exploiting TLP
from non-numeric programs, which mostly contain irregular parallelism. This
is not to belittle the importance of numeric programs, which are the backbone
of many theoretical and simulation studies in scientific applications. Numeric
programs have received substantial attention in the past, whereas non-numerical
programs have received only passing attention. The multiscalar research was
an attempt not only to bridge that gap, but also to lay the foundation for future
microprocessors.
Parallelization has been a success for scientific applications, but not quite
so for non-numeric applications which use irregular data structures and have
complex control flows that make them hard to parallelize. The emergence of the
speculative multithreading model in the last decade to exploit speculative TLP
has provided the much awaited breakthrough for non-numeric applications.
Hardware support for speculative thread execution makes it possible for the
compiler to parallelize sequential applications without worrying about data and
control dependences.
1.3.2 Challenges for TLP Processing
There are several issues to be tackled in developing a good TLP processing
paradigm. First, there are different schools of thought on when the extraction
of parallelism is to be done: at programming time, compile time, or run time.
Each method has its own strengths and shortcomings. Any processing model
that relies entirely on compile-time scheduling or on run-time scheduling is
very likely to fail because of inherent limitations of both. So the challenge is
to use the right mix of compile-time and run-time extraction of parallelism.
The alternatives differ widely, based on the extent to which this question is
answered by the compiler or the hardware, and on the manner in which the
compiler-extracted parallelism information is conveyed to the hardware.
Second, studies have found little TLP within a small sequential block of
instructions, but significant amounts in large blocks [5] [11] [50] [94]. There
are several inter-related factors that contribute to this. Because most programs
are written in an imperative language for a sequential machine with a limited
number of architectural registers for storing temporary values, instructions of
close proximity are very likely to be data dependent, unless they are reordered
by the compiler. This means that most of the parallelism can be found only
amongst instructions that are further apart in the instruction stream. The obvious
way to get to that parallelism is to establish a large window of instructions, and
look for parallelism in this window.
The creation of the large window, whether done statically or dynamically,
should be accurate. That is, the window should consist mostly of instructions
that are guaranteed to execute, and not instructions that might be executed.
Given the basic block sizes and branch prediction accuracies for some common
C programs, following a single thread of control while establishing a window
may not be sufficient: the maximum parallelism that can be extracted from such
a window is limited to about 7 [50]. A more complex window, which contains
instructions from multiple threads of control might be needed; analysis of the
control dependence graph [13] [21] of a program can aid in the selection of the
threads of control.
Another major challenge in designing the TLP hardware is to decentralize
the critical resources in the system. These include the hardware for fetching
from multiple threads, the hardware for carrying out the inter-operation com-
munication of the many operations in flight, a memory system that can handle
multiple accesses simultaneously, and in a dynamically scheduled processor,
the hardware for detecting the parallelism at run time.
1.4 The Multiscalar Paradigm
This book explores the issues involved in TLP processing, and focuses on
the first speculative multithreading paradigm, the multiscalar paradigm, for
TLP processing. This paradigm executes programs by means of the parallel
execution of multiple threads that are derived from a sequential instruction
stream. This type of execution is achieved by considering a subgraph of the
program's control flow graph to be a thread, and executing many such threads
in parallel. The multiple threads in execution can have both data dependences
and control dependences between them. The execution model within each
thread can be a simple, sequential processor. As we will see in this book,
such an approach has the synergistic effect of combining the advantages of the
sequential and the dataflow execution models, and the advantages of static and
dynamic scheduling. Executing multiple threads in parallel, although simple
in concept, has powerful implications:
1 Most of the hardware structures can be built by replicating a conventional
processor core. This allows the critical hardware resources to be decen-
tralized by a divide-and-conquer strategy, as will be seen in Chapters 4-7. A
decentralized hardware realization facilitates clock speeds comparable to
that of contemporary processors. Furthermore, it allows expandability.
2 Sequential programs can be partitioned into threads (as far as possible)
at those points that facilitate the execution of control-independent code in
parallel. Even if the program partitioning agent (most likely the compiler)
may not know the exact path that will be taken through a thread at run time,
it may be fairly sure of the next thread that will be executed. Thus, the
overall large window can be made very accurate.
3 It helps to overlap the execution of blocks of code that are not guaranteed to
be data-independent. The program partitioning agent can, of course, attempt
to pack data-dependent instructions into a thread, and as far as possible form
threads that are independent so as to improve the processor performance.
However, the processing paradigm does not require the threads to be inde-
pendent, which is a significant advantage.
4 Because the multiscalar paradigm considers a block of instructions as a
single unit (thread), the program partitioning agent can convey to the run-
time hardware more information such as inter-thread register dependences
and control flow information. Thus, the hardware need not reconstruct some
of the information that was already available at compile time.
5 It helps to exploit the localities of communication present in a program.
These statements may appear a bit "rough-and-ready", and may not make much
sense before a detailed study of the new paradigm. It is precisely this paradigm
and its implementation that we discuss in the ensuing chapters of this book.
1.5 The Multiscalar Story
The multiscalar paradigm originated at the University of Wisconsin-Madison
in the early 1990s. A detailed retrospective on multiscalar processors is pro-
vided by Guri Sohi in [83]; here we provide the highlights from the author's
perspective. Research work on multiscalar ideas started after recognizing the
limitations of using a centralized scheduler for dynamic scheduling. The main
point of attack was the logic needed to implement the instruction scheduling and
wakeup functions: a large centralized instruction window was not a long-term
solution.
Another motivation was the publication of an article entitled "Micro-
processors Circa 2000," in the October 1989 issue of IEEE Spectrum [30], with
projections of 100 million transistors on a single chip. The question that
begged for an answer was: how could these resources be used to speed up
computation? What would be the execution model for a 100 million transistor
processor? The proposal in [30] amounted to a 4-way multiprocessor on a chip.
The explicitly parallel multiprocessor model had practical limitations because
it appeared unlikely that parallelizing compiler technology would be able to
automatically parallelize a majority of applications in the near future.
1.5.1 Developing the Idea
Guri Sohi started thinking about possible architectural paradigms for a circa
2000 processor, i.e., what lay beyond superscalar. He started the search by
looking at the dataflow model. The concepts looked good - thinking about the
RUU-based superscalar processor as a dataflow engine makes it possible to get
good insight into its operation. However, the use of this model in its entirety
had limitations. In particular, giving up sequential programming semantics did
not appear to be a good option, as it appeared unlikely that inherently parallel
languages were going to be adopted widely in the near future. This meant
that dataflow-like execution should be achieved for a serial program. Rather
than consider this a drawback, he considered this an asset: exploit the inherent
sequentiality to create "localities" in the inter-operation communication that
could be exploited to simplify the inter-operation communication mechanism
(aka token store in a dataflow machine).
Earlier experiments with the RUU also had shown that although increasing
the RUU size would allow more parallelism to be exploited, much of the par-
allelism was coming from points that were far apart in the RUU - there was
little parallelism from "close by". As increasing the size of a centralized RUU
entailed significant overheads, the importance of decentralization by exploiting
the localities of communication became apparent.
At about the same time, Jim Smith introduced Guri to the concept of a
dependence architecture. This model was based upon an early version of the
Cray-2, which was abandoned. The machine consisted of 4 independent units,
each with an accumulator, and collectively backed by a shared register file.
Sequences of dependent operations were submitted to each unit, where they
would execute in parallel.
The author had started Ph.D. work with Guri in the Fall of 1988. After
building a MIPS version of the RUU-based superscalar processor, and studying
the design of non-blocking caches, in the Summer of 1990, Guri shared with
him the idea of an architecture in which the instruction window (aka regis-
ter update unit (RUU)) could be split into multiple sub-windows. The author
started implementing this concept in the beginning of Fall 1990. He developed
a circular queue of sub-windows in which the major aspects of the machine
were decentralized. The author built a MIPS ISA-based cycle-accurate sim-
ulator to test out the basic concepts by the end of Fall 1990. This simulator
allocated a basic block to each sub-window. Branch-level prediction was used
to decide the next basic block to be allocated. Multiple sequencers were used
to parallelly fetch instructions in the active sub-windows. The last updates of
each architectural register were forwarded from each sub-window to the next.
A create mask was used to decide whether a register value arriving from a
previous sub-window should be forwarded or not.
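A minimal C sketch of that create-mask decision is given below, assuming 32 architectural registers and a one-bit-per-register mask. The structure and function names are hypothetical, and the sketch ignores timing, as noted in the comment.

#include <stdbool.h>
#include <stdint.h>

#define NREGS 32

typedef struct {
    uint32_t create_mask; /* bit r set: this sub-window writes register r */
    uint32_t regs[NREGS]; /* local copies of the architectural registers  */
} SubWindow;

/* A value for register r arrives from the previous sub-window.  It updates
 * the local copy (simplification: assumes this sub-window has not yet
 * produced its own value of r), and it is forwarded onward only when this
 * sub-window does not create a later value of r itself. */
static bool receive_reg(SubWindow *sw, int r, uint32_t value)
{
    sw->regs[r] = value;
    return (sw->create_mask & (1u << r)) == 0; /* true: pass to next PU */
}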
1.5.2 Multi-block based Threads and the ARB
Very soon it became apparent that the sub-windows should be larger than a
basic block. As there was no compiler support, the author formed these "multi-
block" threads by a post-compilation phase ofthe MIPS binary. It was easier for
the post-compilation phase to consider statically adjacent basic blocks as multi-
blocks. Restrictions were imposed on the multi-blocks' length and number
of successors (only 2 successors were initially allowed). Multi-blocks were
terminated immediately after any unconditional control-changing instruction
such as subroutine call, subroutine return, and direct as well as indirect jump.
Information about the formed multi-blocks was kept in a separate file, and
supplied as input to the simulator.
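The following sketch outlines such a post-compilation pass under stated assumptions: basic blocks are available as an array in static program order, the length cap MAX_MB_LEN is a hypothetical constant, and the at-most-two-successors restriction is omitted for brevity.

#include <stdbool.h>

#define MAX_MB_LEN 32 /* hypothetical cap on multi-block length, in instructions */

typedef struct {
    int  num_instrs;
    bool ends_in_uncond; /* ends in call, return, or direct/indirect jump */
} BasicBlock;

/* Greedily extend a multi-block over statically adjacent basic blocks,
 * terminating immediately after an unconditional control-changing
 * instruction or when the length cap would be exceeded.  Returns the
 * index one past the last block included. */
static int form_multiblock(const BasicBlock bbs[], int nbbs, int start)
{
    int len = 0;
    int i = start;
    while (i < nbbs) {
        if (len > 0 && len + bbs[i].num_instrs > MAX_MB_LEN)
            break;            /* adding this block would exceed the cap */
        len += bbs[i].num_instrs;
        i++;
        if (bbs[i - 1].ends_in_uncond)
            break;            /* terminate after call/return/jump */
    }
    return i;
}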
Executing the multi-blocks in parallel required some changes to the hard-
ware. First of all, it required novel control prediction techniques that could
go beyond multiple branches simultaneously, as well as the ability for the ma-
chine to resolve multiple branches simultaneously. A technique called control
flow prediction was developed to do this [65]. The most notable change was
in forwarding register values. It was no longer possible to determine the last
updates of registers by a static inspection of the thread. If each sub-window
waits until the completion of its multi-block to forward the register values that
it produced, then very poor performance would result. The author allevi-
ated this problem by incorporating register dependence speculation. Whenever
a misspeculation occurs, selective re-execution is done to recover: only the
affected instruction and its dependent slice of instructions are re-executed.
By Spring 1991, the author extended the cycle-accurate simulator to incor-
porate multi-blocks. Still, there was no decentralized mechanism for carrying
out memory address disambiguation. Because the different sub-windows op-
erate independently, loads would need to execute before the identities of prior
stores (in a different sub-window) were known. This would require a significant
rethinking of how memory operations are to be carried out. In May 1991, Guri
gave a talk about the basic multiscalar concepts at Cray Research, and in June
1991, at DEC, Marlboro. After the latter talk, he had a long conversation with
Joel Emer and Bob Nix about the memory system aspects of such a machine.
They told him that they had a solution, in the context of a VLIW processor,
but were unable to share the details. Guri convinced the author to come up
with a solution applicable to the new paradigm, and the author came up with
the address resolution buffer (ARB) in Fall 1991. (Later, it turned out that the
two solutions, and the problems they were solving, were entirely different.)
The ARB performed memory data dependence speculation in an aggressive
manner. Misspeculations resulted in squashing all multi-blocks from the mis-
speculation point.
In Fall 1991, the author incorporated the ARB into the cycle-accurate simu-
lator, and submitted the ISCA92 paper along with Guri [23]. The multiscalar
paradigm was then called the Expandable Split Window paradigm. Guri also
gave talks at several companies, and discussed the multiscalar ideas with many
people. Most notably, he had detailed discussions with, and received critiques
from Mitch Alsup, Jim Smith, and Bob Rau. These discussions were crucial
in the refinement of the multiscalar concept. In January 1992 Guri gave the
first "public" presentation of the multiscalar paradigm at HICCS. He received
a number of difficult questions from the audience, which included Mike Flynn,
Andy Heller, Peter Hsu, Wen-Mei Hwu, Yale Patt, and Bob Rau.
In the Summer of 1992, Mark Hill convinced Guri to come up with a better
name for the concept; the term "Expandable Split Window" was not sufficiently
catchy. After trying several variations of "scalar", Guri coined the term "Mul-
tiscalar" .
1.5.3 Maturing of the Ideas
Guri and the author continued with experiments of the multiscalar concept.
One of the performance impediments that they faced was squashes due to mem-
ory data dependences: the MIPS compiler would often spill a register (assuming
it would be a cache hit) and reload it shortly afterwards - this would cause
memory data dependence misspeculations. The author alleviated this problem
using selective re-execution. Guri then mentioned the need to decentralize the
ARB itself, and the need to bring the top level of the memory hierarchy "on
the same side of the interconnect as the processing units". The author then
developed the multi-version cache, along the lines of the multi-version register
file used for decentralizing register communication. In Fall 1993, the author
wrote his Ph.D. dissertation entitled "The Multiscalar Architecture" [25].
Significant enhancements were done to the multiscalar paradigm since the
author left the University of Wisconsin. These enhancements were primarily geared
towards enhancing the performance. The main restriction to multiscalar per-
formance at that time was the lack of a compiler that could do a better job of
program partitioning. Post-compilation program partitioning had several lim-
itations. The program was sometimes getting divided at improper points, for
example, between the two halves of a double-word load or halfway through build-
ing an address. This aggravated inter-thread data dependences. Moreover,
threads could not include entire loops or function call invocations, because of
the use of selective re-execution in the multiscalar processing units. Selective
re-execution during times of register dependence misspeculation and memory
dependence misspeculation required all the instructions of the thread to be
present in the instruction queue of a processing unit. This meant that threads
could not be larger than the instruction queue size, because conceptually any
instruction is likely to require re-execution.
In 1993-94, T. N. Vijaykumar developed a multiscalar compiler on top of the
GNU C compiler. This compiler could perform program partitioning as well as
intra-thread static scheduling, and generate a multiscalar binary. The compiler
used a detailed set of heuristics to guide program partitioning. Intra-thread
static scheduling was also done to reduce the impact of inter-thread data depen-
dences. This compiler also incorporated features such as release register
instructions and forward bit annotations.
During the same period, Scott Breach refined the multiscalar hardware to
incorporate the new features, and updated the cycle-accurate simulator to in-
corporate the new hardware features. He developed different strategies for
performing inter-thread register communication. He also developed different
policies for allocating spawned threads to processing units. In the Fall of 1994,
Guri, Vijay, and Scott wrote the ISCA95 paper [82], with these enhancements
and the new set of simulation results.
In Fall 1994, Jim Smith returned to the University of Wisconsin, and started direct
involvement in the multiscalar project. NSF and ARPA provided extensive
funds to test out the feasibility and practicality of the concept. This resulted in
the Kestrel project.
1.5.4 Other Speculative Multithreading Models
Since the development of the multiscalar paradigm, several related paradigms
have been proposed. Notable ones among them are superthreading, trace pro-
cessors, chip multiprocessing, dynamic multithreading, clustered speculative
multithreading, and dynamic vectorization. In current literature, the term "spec-
ulative multithreading" is used to refer to all of these execution models. After
moving to Clemson University, the author looked at the applicability of trace-
based threads for the multiscalar processor. Restricting multiscalar threads to
traces makes the hardware substantially simpler. Trace-based threads have been
found to have so many unique features that researchers have come up with trace pro-
cessors, which have some differences with traditional multiscalar processors.
Trace processors were originally proposed by Sriram Vajapeyam and Tulika
Mitra [90], and improved upon by Eric Rotenberg and Jim Smith [72].
Prior to that, Jenn-Yuan Tsai and Pen-Chung Yew developed the superthread-
ing execution model at the University of Minnesota [88]. This execution model uses
the compiler not only to form threads, but also to do intra-thread scheduling
in such a manner as to allow the hardware to execute multiple threads in a
pipelined fashion.
Pedro Marcuello and Antonio Gonzalez investigated a speculative multi-
threading scheme in which loop-based threads are dynamically formed at run-
time [53]. Haitham Akkary and Mike Driscoll proposed the dynamic multi-
threading execution model [3] in which multiscalar threads are executed in a
single pipeline as in simultaneous multithreading (SMT) [89].
More recently, Sriram Vajapeyam, P. J. Joseph, and Tulika Mitra proposed
dynamic vectorization as a technique for exploiting distant parallelism [91].
Mohamed Zahran and the author proposed hierarchical multithreading, which
uses a 2-level hierarchical multiscalar processor to exploit thread-level paral-
lelism at two granularities. With Intel's recent paper on Micro 2010, it is time for
computer architects to start thinking about architectural and microarchitectural
models for processor chips of that era.
1.6 The Rest of the Story
We have outlined the important technological trends in processor design,
and have now sketched in enough common ground for our study of thread-level
parallelism and the multiscalar paradigm to begin. Chapter 1 has provided
the background for the subject of the book. It started with technology trends
that play a major role in processor development, and introduced thread-level
parallelism to complement instruction-level parallelism, the prominent type
of parallelism exploited by microprocessors until recently. The chapter then
proceeded to speculative thread-level parallelism, which sets the multiscalar
execution model in context. Finally, the chapter provided a brief introduction
to the multiscalar paradigm, and concluded with a history of its development.
The rest of the book is organized into 8 more chapters.
Chapter 2 expounds on the multiscalar paradigm. It presents the basic idea
first, and then proceeds to a detailed example control flow graph that shows
how a program fragment is partitioned into speculative threads, which are spec-
ulatively executed in parallel. The ensuing discussion highlights how the mul-
tiscalar execution model deals with complex control dependences and data
dependences that are germane to non-numeric programs. Different types of
speculation are shown to be the key to dealing with control dependences as
well as data dependences. A qualitative assessment of the performance po-
tential is presented next, along with justifications. The chapter also provides
a review of the interesting aspects of the multiscalar execution model, and a
comparison of the model with other popular execution models. It concludes by
introducing a possible hardware implementation of the multiscalar paradigm.
With the basic multiscalar idea introduced in Chapter 2, Chapter 3 examines a
set of cross-cutting issues related to static threads. These issues deal with thread
granularity, thread structure, thread boundaries, number of successor threads,
program partitioning agent, and thread specification. Threads can come in
many forms and at different granularities, and the chapter discusses the trade-
offs involved in selecting a thread model. It also provides an understanding of
the trade-offs involved in performing program partitioning at compile time and
at execution time.
Chapter 4 discusses dynamic aspects related to threads, including the execu-
tion of threads on a multiscalar microarchitectural platform. It discusses how
multiple processing units (PUs) can be organized, what kind of interconnects
can be used to connect the PUs, and the detailed microarchitecture of a PU. This
discussion is followed by a breakdown of a dynamic thread's lifetime into its
constituent phases: spawn, activate, execute, resolve, commit, and sometimes
squash. These phases account for the period of processing that takes place in
the multiscalar processor from the spawn to the exit of a thread. Each of these
phases is then discussed in detail, with special emphasis given to presenting
different schemes and their trade-offs. The chapter ends with a discussion on
schemes for handling interrupts and exceptions in the multiscalar processor.
Chapter 5 focuses on microarchitectural aspects that are specific to control
flow. This chapter deals with 3 central topics related to a thread's execution:
spawning, activation, and retirement. Thread spawning often requires perform-
ing thread-level control speculation to decide which thread should be spawned next,
and the chapter begins with a discussion on hardware schemes for performing
thread-level control speculation. The discussion then continues onto strategies
that can be used for deciding which of the spawned threads should be activated
in the available processing units. Another important topic in any speculative
multithreading processor is recovery from incorrectly speculated threads. The
chapter discusses different strategies for performing this recovery in multiscalar
processors.
Chapters 6 and 7 provide a complete understanding of the microarchitectural
aspects of data communication occurring in a multiscalar processor. Chapter
6 discusses issues related to register data flow, whereas chapter 7 focuses on
memory data flow. In Chapter 6 we talk about the need to synchronize between
a producer thread and a consumer thread, and the use of data value prediction to
relax this synchronization. We then go on to discuss different strategies for for-
warding register values from producer threads to consumer threads. Compiler
support, particularly in providing inter-thread register data dependence infor-
mation, is discussed next. Finally, the chapter ends with a detailed discussion on
a multi-version register file structure for implementing the architected registers
and to carry out proper synchronization and communication. This discussion is
supported with a detailed example depicting the structure's operation.
The discussion in chapter 7 on memory data flow parallels the discussion in
chapter 6 on register data flow, as there are many similarities between register
data flow and memory data flow. A few differences arise, however, owing to the
dynamic determination of memory addresses, in contrast to static determination
of register addresses. For memory data flow, inter-thread data dependence
speculation is very important, because it is not possible to statically know
all of the inter-thread memory data dependences. The hardware structures
for managing memory data flow are therefore slightly different from the ones
used for managing register data flow. Chapter 7 documents under a common
framework well-researched hardware structures for the multiscalar processor
such as the address resolution buffer (ARB), the multi-version cache (MVC),
and the speculative versioning cache (SVC).
Chapter 8 details the subject of compiling for a multiscalar processor in
which threads are formed statically by the compiler. It begins by highlighting
the challenges involved in performing a good job of program partitioning. This
discussion is followed by a consideration of the cost model used for multi-
scalar compilation. This cost model includes such factors as thread start and
end overheads, thread imbalance overhead, and wait times due to data depen-
dences. Afterwards, the discussion focuses on program transformations that are
geared to facilitate multiscalar execution and the creation of better multiscalar
threads. The chapter then describes a set of heuristics used for deciding thread
boundaries. These heuristics include control flow heuristics, data dependence
heuristics, and other special heuristics. After determining the thread bound-
aries, the multiscalar compiler performs intra-thread scheduling to reduce the
wait times due to inter-thread data dependences; a detailed treatment of intra-
thread scheduling is presented in this chapter. Finally, register management,
thread annotation, and code generation are discussed.
Chapter 9 concludes the book by taking a look at recent developments in
multiscalar processing. These include topics such as incorporating fault toler-
ance, the use of trace-based threads, the hierarchical multiscalar processor, and a
commercial implementation of the multiscalar processor. Fault tolerance can
be easily incorporated at the PU level by executing the same thread in adjacent
PUs and comparing the two sets of results. Features such as these are likely
to provide an edge for the multiscalar paradigm in its quest for becoming the
paradigm of choice for next-generation processors. The chapter concludes by
discussing a commercial implementation named Merlot from NEC.
Chapter 2
THE MULTISCALAR PARADIGM
How to exploit irregular parallelism from non-numeric programs?
We have seen the technological trends that have motivated the development
of the multiscalar paradigm. We saw that ILP processing paradigms are un-
able to extract and exploit parallelism that is present at a distance. They also
fail to exploit control independence present in programs. In this chapter, we
continue our discussion of the multiscalar paradigm that we began in the last
chapter. The multiscalar paradigm not only combines the best of both worlds
in TLP extraction (software extraction and hardware extraction), but also ex-
ploits the localities of communication present in programs. Because of these
and a host of other features, which we will study in this chapter, the multiscalar
paradigm is poised to become a cornerstone for future microprocessor design.
The name multiscalar is derived from the fact that the overall computing engine
is a collection of scalar processors that cooperate in the execution of a sequen-
tial program. In the initial phases of its research, the multiscalar paradigm was
called the Expandable Split Window (ESW) paradigm [23].
This chapter is organized in seven sections. The first section describes our view
of an ideal processing paradigm. The attributes mentioned in Section 2.1 had a
significant impact on the development of the multiscalar concept and later be-
came the driving force behind an implementation of the paradigm. Section 2.2
discusses the basics of the multiscalar paradigm. This introduction is followed
by a detailed example in Section 2.3 to illustrate the multiscalar execution ba-
sics. Section 2.4 describes the interesting and novel aspects of the multiscalar
paradigm. Section 2.5 compares and contrasts the multiscalar paradigm with
some of the existing processing paradigms such as the multiprocessor, super-
scalar, and VLIW paradigms. Section 2.6 introduces a multiscalar processor,
one possible implementation of the multiscalar paradigm. Section 2.7 sum-
marizes the chapter by drawing attention to the highlights of the multiscalar
paradigm.
2.1 Ideal TLP Processing Paradigm-The Goal
Before embarking on a discussion of the multiscalar paradigm, it is worth
our while contemplating the desired features that shaped its development.
Ideally, these features should take into consideration the hardware and software
technological developments that we expect to see in the next several years.
We can categorize the features into those related to software issues and those
related to hardware issues. First, let us look at the software issues. These issues
can be classified under three attributes, namely practicality, parallelism, and
versatility.
1 Practicality: By practicality we mean the ability to execute ordinary pro-
grams on the processor. The paradigm should not require the programmers
to write programs in specific programming languages; instead, program-
mers should be given the freedom to write programs in ordinary, imperative
languages such as C. The programmers should not be forced to spend too
much effort finding the thread-level parallelism in an application. In short,
the paradigm should place no unnecessary burden on the programmers to
carry out TLP processing.
2 Versatility: As far as possible, the high-level language programs should not
be tailored for specific architectures and specific hardware implementations,
so that the same high-level language program can be used for a wide variety
of architectures and implementations. The programmer should not have to
consider the number or logical connectivity of the processing units in the
computer system.
3 Parallelism: The compiler should extract the maximum amount of TLP
possible at compile time. The compiler could also convey additional in-
formation about the program, such as inter-thread register dependences and
control flow information, to the hardware. These steps will not only sim-
plify the hardware, but also allow it to concentrate more on extracting the
parallelism that can be detected only at run time.
Now let us consider the desired features for the hardware. We classify the de-
sired hardware features also under the same three attributes, namely parallelism,
practicality, and versatility.
1 Parallelism: The hardware should extract the parallelism that could not be
detected at compile time, and should exploit the maximum amount of par-
allelism possible. The hardware should be able to execute multiple threads
in parallel.
2 Practicality: Here, by practicality we mean realizability of the hardware.
That is, the execution model should have attributes that facilitate commercial
realization. A processor based on the paradigm should be implementable in
technology that we expect to see in the next several years, and the hardware
structures should be regular to facilitate implementation with clock speeds
comparable to the clock speeds of contemporary processors, resulting in the
highest performance processor of a given generation.
3 Versatility: The paradigm should facilitate hardware implementations with
no centralized resources. Decentralization of resources is important for fu-
ture expansion of the system (as allowed by technology improvements in
hardware and software). These resources include the hardware for extracting
TLP such as inter-thread register and memory synchronization enforcement
and identification of independent instructions; and the hardware for exploit-
ing TLP such as instruction supply mechanism, register data flow, and data
memory system. The hardware implementation should be such that it pro-
vides an easy growth path from one generation of processors to the next,
with minimum hardware and software effort. An easy hardware growth path
implies the reuse of hardware components, as much as possible, from one
generation to the next.
2.2 Multiscalar Paradigm-The Basic Idea
Realization of the software and hardware features described above has been
the main driving force behind the development of the multiscalar paradigm.
Bringing all of the above features together requires uniting, in a new manner,
the worlds of control-driven execution and data-driven execution, and
combining the best of both.
The basic idea of the multiscalar paradigm is to split the jobs of TLP ex-
traction and exploitation amongst multiple processing units. Each PU can be
assigned a reasonably sized thread, and parallelism can be exploited by over-
lapping the execution of multiple threads. So far, it looks no different from
a conventional multiprocessor. But the difference - a key one indeed - is
that the threads being executed in parallel in the multiscalar paradigm can have
both control and data dependences between them. Whereas the multiprocessor
takes control-independent portions (preferably data-independent as well) of the
control flow graph (CFG) of a program, and assigns them to different process-
ing units, the multiscalar processor takes a sequential instruction stream, and
assigns contiguous portions of it to different processing units.
The multiple processing units are connected together as a circular queue.
The multiscalar processor traverses the CFG of a program as follows: take a
subgraph (thread) T from the CFG and assign it to the tail PU, advance the tail
pointer by one PU, do a prediction as to where control is most likely to go after
the execution of T, and assign a subgraph starting at that target to the next PU in
the next cycle, and so on until the circular queue is full. The assigned threads
together encompass a contiguous portion of the dynamic instruction stream.
These threads are executed in parallel, although the paradigm preserves logical
sequentiality among the threads. The PUs are connected as a circular queue to
obtain a sliding or continuous big window (as opposed to a fixed window), a
feature that allows more parallelism to be exploited [94]. When the execution
of the thread at the head PU is over, the head pointer is advanced by one PU.
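A behavioral C sketch of this head/tail management follows. NUM_PUS, the structure fields, and predict_next_thread() are illustrative placeholders rather than any real interface; the sketch shows only the queue discipline, not speculation or recovery.

#include <stdbool.h>

#define NUM_PUS 4

typedef struct {
    int  thread_id;
    bool busy;
} PU;

typedef struct {
    PU  pu[NUM_PUS];
    int head, tail, count; /* circular queue of active PUs */
} Multiscalar;

/* Placeholder for inter-thread control prediction: given the thread just
 * assigned, guess which thread control flows to next. */
extern int predict_next_thread(int thread_id);

/* Assign a new thread to the tail PU (if one is free) and advance tail. */
static bool assign_thread(Multiscalar *m, int thread_id)
{
    if (m->count == NUM_PUS)
        return false;                       /* all PUs busy: queue full */
    m->pu[m->tail] = (PU){ thread_id, true };
    m->tail = (m->tail + 1) % NUM_PUS;
    m->count++;
    return true;
}

/* When the head (non-speculative) thread completes, retire it. */
static void commit_head(Multiscalar *m)
{
    m->pu[m->head].busy = false;
    m->head = (m->head + 1) % NUM_PUS;
    m->count--;
}

In use, each cycle the sequencer would call predict_next_thread() on the most recently assigned thread and feed the result to assign_thread() until the queue fills.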
A thread could be as simple as a basic block or even part of a basic block.
More complex threads could be sequences of basic blocks, entire loops, or even
entire function calls. In its most general form, a thread can be any connected
subgraph of the control flow graph of the program being executed. The moti-
vation behind considering a subgraph as a thread is to collapse several nodes
of the CFG into a single node, as shown later in Figure 2.1. Traversing the
CFG in steps of subgraphs helps to tide over the problem of poor predictability
of some CFG nodes, by incorporating those nodes within subgraphs. Multi-
scalar threads, in general, encompass alternate control flow edges (otherwise
threads would be nothing other than basic blocks or traces). Parallelly executed
threads can have both control dependences and data dependences between them.
The execution model within each thread can be a simple, sequential process-
ing paradigm, or more complicated paradigms such as a small-issue VLIW or
superscalar paradigm.
Let us throw more light on multiscalar execution. The multiscalar paradigm
executes multiple threads in parallel, with distinct PUs. Each of these PUs
can be a sequential, single-issue processor. Collectively, several instructions
are executed per cycle, one from each thread. Apart from any static code
motions done by the compiler, by simultaneously executing instructions from
multiple threads, the multiscalar execution moves some instructions "up in time"
within the overall dynamic window. That is, some instructions from later in
the sequential instruction stream are initiated earlier in time, thereby exploiting
parallelism, and decreasing the overall execution time. Notice that the compiler
did not give any guarantee that these instructions are independent; the hardware
determines the inter-thread dependences (possibly with additional information
provided by the compiler), and determines the independent instructions. If
a new thread is assigned to a different PU each cycle, collectively the PUs
establish a large dynamic window of instructions. If all active PUs execute
instructions in parallel, overall the multiscalar processor could be executing
multiple instructions per cycle.
2.3 Multiscalar Execution Example
We shall illustrate the details of the working of the multiscalar paradigm with
the help of an example. This example is only meant to be illustrative, and is not
meant to be exclusive. Consider the simple code fragment shown in Figure 2.1.
The figure shows the control flow graph as well as the assembly code within each
basic block. The example is a simple loop with a data-dependent conditional
branch in the loop body. The loop adds the number 10 to 100 elements of an
array A, and sets an element to 1000 if it is greater than 50. The loop body
consists of 3 basic blocks, and the overall CFG consists of 4 basic blocks. This
example is chosen for its simplicity. Whereas it does not illustrate some of
the complexities of the control flow graphs that are generally encountered in
practice, it does provide a background for discussing these complexities.
Figure 2.1. Example Control Flow Graph and Code
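The figure itself is not reproduced in this extraction; the following C fragment reconstructs the loop from the prose description above (the array and index names are assumed).

/* Reconstruction of the loop of Figure 2.1 from its prose description. */
void example_loop(int A[100])
{
    for (int i = 0; i < 100; i++) {
        A[i] = A[i] + 10;  /* add 10 to each of the 100 elements     */
        if (A[i] > 50)     /* data-dependent branch in the loop body */
            A[i] = 1000;
    }
}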
On inspection of the assembly code in Figure 2.1, we can see that almost all
the instructions of an iteration are data-dependent on previous instructions of
the same iteration, and that there is very little ILP in a single iteration of the loop.
However, all the iterations are independent (except for the data dependences
through the loop induction variable allocated in register R1) because each itera-
tion operates on a different element of the array. Thus, there is significant TLP
if each iteration is considered as a separate thread.
Now, let us look at how the multiscalar paradigm executes this loop. The
program partitioning process (which is typically done by the compiler) has
formed two overlapping static threads from this CFG. The first static thread,
T0, encompasses all 4 basic blocks into a single thread. This thread has two
possible successors, one of which is T1, and the other is the thread starting at
the post-dominator of the loop. The second static thread, T1, begins at the loop
starting point, and encompasses one iteration of the loop. This thread also
has the same two successors as T0.
At run time, the multiscalar processor forms multiple dynamic threads as
shown in Figure 2.2, effectively establishing a large dynamic window of dy-
namic threads. The large dynamic window encompasses a contiguous portion
of the dynamic instruction stream. The multiscalar paradigm executes these
multiple threads in parallel, with distinct PUs. Collectively, several instruc-
tions are executed per cycle, one from each thread. For instance, consider the
shaded horizontal slice in Figure 2.2, which refers to a particular time-frame
(cycle). In that cycle, three instructions are executed from the three threads.
Figure 2.2. Multiscalar Execution of Example Code in Figure 2.1
Given the background experience assumed here, it would be coy not to rec-
ognize the reader's familiarity with software scheduling techniques such as loop
unrolling and software pipelining. However, it cannot be emphasized too of-
ten that the multiscalar paradigm is far more general than loop unrolling and
other similar techniques for redressing the effect of control dependences. The
structure of a multiscalar thread can be as general as a connected subgraph of
the control flow graph, and is far more general than a loop body. Let us look
into more detail how inter-thread control dependences and data dependences
are handled in the multiscalar paradigm.
2.3.1 Control Dependences
We will first see how inter-thread control dependences are overcome. Once
thread T0 is dynamically assigned to PU 0, a prediction is made by the hardware
(based on static or dynamic techniques) to determine the next thread to which
control will most likely flow after the execution of thread T0. In this example,
it determines that control is most likely to go to thread T1, and so in the next
cycle, an instance of T1 is spawned and assigned to the next PU. This process
is repeated. The type of prediction used by the multiscalar paradigm is called
inter-thread control prediction [65].
In the multiscalar paradigm, the execution of all active threads, except the
first, is speculative in nature. The hardware provides facilities for recovery when
it is determined that an incorrect control flow prediction has been made. It is
important to note that among the two branches in an iteration of the above loop,
the first branch, which has poor predictability, has been encompassed within
threads so that the control flow prediction need not consider its targets at all
while making the prediction. Only the targets of the second branch, which can
be predicted with good accuracy, have been included in the thread's successors.
Thus, the constraints introduced by control dependences are overcome by doing
speculative execution (along the control paths indicated by the light dotted
arrows in Figure 2.2), but doing predictions at those points in the control flow
graph that are easily predictable. This facilitates the multiscalar hardware in
establishing accurate and large dynamic windows.
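As an illustration only, the sketch below lifts a classic two-bit-counter predictor to thread granularity, choosing between a thread's two statically known successors. The actual inter-thread control prediction scheme of [65] differs in its details, and the table size here is an arbitrary assumption.

#include <stdint.h>

#define PRED_ENTRIES 1024 /* arbitrary table size */

/* One 2-bit saturating counter per static thread, selecting between the
 * thread's two statically known successors (0 or 1). */
static uint8_t ctr[PRED_ENTRIES]; /* values 0..3; >= 2 predicts successor 1 */

static int predict_successor(int static_thread_id)
{
    return ctr[static_thread_id % PRED_ENTRIES] >= 2;
}

static void train_predictor(int static_thread_id, int actual_successor)
{
    uint8_t *c = &ctr[static_thread_id % PRED_ENTRIES];
    if (actual_successor == 1 && *c < 3)
        (*c)++;
    else if (actual_successor == 0 && *c > 0)
        (*c)--;
}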
2.3.2 Register Data Dependences
Next we will look at how inter-thread register data dependences are han-
dled. These data dependences are taken care of by forwarding the last update
of each register in a thread to the subsequent threads, preferably as and when
the last updates are generated. In Figure 2.2, the register instances produced
in different threads are shown with different subscripts, for example, R11, R12,
and R13, and the inter-thread register data dependences are marked by solid
arrows. As we can gather from Figure 2.2, the only register data dependences
that are carried across the threads are the ones through register R1, which cor-
responds to the induction variable. Thus, although the instructions of a thread
are mostly sequentially dependent, the next thread can start execution once the
first instruction of a thread has been executed (in this example), and its result
forwarded to the next thread.
2.3.3 Memory Data Dependences
Now let us see how potential inter-thread data dependences through mem-
ory, occurring through loads and stores, are handled. These dependences are
marked by long dash arrows in Figure 2.2. In a sequential execution of the
program, the load of the second iteration is performed after the store of the first
iteration, and thus any potential data dependence is automatically taken care of.
However, in the multiscalar paradigm, because the two iterations are executed
in parallel, it is quite likely that the load of the second iteration may be ready to
be executed earlier than the store of the first iteration. If a load is made to wait
until all preceding stores are executed, then much of the code reordering oppor-
tunities are inhibited, and performance may be badly affected. The multiscalar
paradigm cannot afford such a callous wait; so it allows memory references to
be executed out-of-order, along with special hardware to check if the dynamic
reordering of memory references produces any violation of dependences. For
this recovery, it is possible to use the same facility that is provided for recov-
ery in times of incorrect control flow prediction. If the dynamic code motion
rarely results in a violation of dependences, significantly more parallelism can
be exploited. This is a primary mechanism that we use for breaking the restric-
tion due to ambiguous data dependences, which cannot be resolved by static
memory disambiguation techniques.
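The following C sketch conveys the flavor of this checking: it records the addresses of speculatively executed loads per thread and, on a store from an earlier thread, searches later threads for a conflicting load. This is a simplified stand-in; the actual address resolution buffer (Chapter 7) is organized by address for associative lookup, and the sizes here are arbitrary.

#include <stdint.h>

#define MAX_THREADS 4
#define MAX_LOADS   64

static uint32_t load_addr[MAX_THREADS][MAX_LOADS];
static int      num_loads[MAX_THREADS];

/* Record the address of a speculatively executed load. */
static void record_load(int thread, uint32_t addr)
{
    if (num_loads[thread] < MAX_LOADS)
        load_addr[thread][num_loads[thread]++] = addr;
}

/* On a store from `thread`, search all sequentially later threads for a
 * load to the same address.  Returns the first conflicting thread (it and
 * its successors must be recovered), or -1 if the reordering was safe. */
static int check_store(int thread, uint32_t addr)
{
    for (int t = thread + 1; t < MAX_THREADS; t++)
        for (int i = 0; i < num_loads[t]; i++)
            if (load_addr[t][i] == addr)
                return t;
    return -1;
}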
2.4 Interesting Aspects of the Multiscalar Paradigm
The astute reader would have realized by now that the multiscalar paradigm
allows very flexible dynamic scheduling that could be assisted with software
scheduling. The compiler has a big role to play in bringing to fruit the full
capabilities of this paradigm. The compiler decides which parts of the CFG
should be brought together as a thread, and performs static scheduling within
each thread. The role of the compiler is discussed in great detail in Chapter 8.
Figure 2.3 gives ac1earpicture ofwhere the multiscalarparadigm stands in terms
of what is done by software and what is done by hardware. The multiscalar
paradigm is grounded on a good interplay between compile-time extraction of
ILP and run-time extraction of ILP. Below, we describe the interesting aspects
of the multiscalar paradigm.
Figure 2.3. The Multiscalar Execution Model-What is done by Software and What is done
by Hardware
Decentralization of Critical Resources: Chapters 4-7 describe one possible
hardware implementation of the multiscalar paradigm. Without considering the
details of the multiscalar implementation here, we can make one observation
about the strategy it employs for decentralizing the critical resources. By split-
ting the large dynamic window of instructions into smaller threads (cf. Figure
3.7), the complex task of searching a large window for independent instruc-
tions is split into two simpler subtasks: (i) independent searches (if need be) in
smaller threads, all of which can be done in parallel by separate PUs, and (ii)
enforcement of control and data dependences between the threads. This allows
the dynamic scheduling hardware to be divided into a two-level hierarchical
structure - a distributed top-level unit that enforces dependences between the
threads, and several independent lower-level units at the bottom level, each of
which enforces dependences within a thread and identifies the independent in-
structions in that thread. Each of these lower-level units can be a separate PU
akin to a simple (possibly sequential) execution datapath. A direct outgrowth
of the decentralization of critical resources is expandability of the hardware.
Parallel Execution of Multiple Threads: The multiscalar paradigm is spe-
cially geared to execute multiple threads in parallel. While partitioning a pro-
gram into threads, as far as possible, an attempt is made to generate threads
that are control-independent of each other, so that the multiscalar hardware can
parallelly execute non-speculative threads. However, most non-numeric pro-
grams have such complex flows of control that finding non-speculative threads
of reasonable size is often infeasible. So, the multiscalar solution is to parallelly
execute possibly control-dependent, and possibly data-dependent threads, in a
speculative manner. Thus, as far as possible, an attempt is made to demarcate
threads at those points where it is easy to speculate the next thread to be executed
when control leaves a thread (although the exact path taken through the thread
may vary in different dynamic instances). Such a division into threads will not
only allow the overall large window to be accurate, but also facilitate the execu-
tion of (mostly) control-independent code in parallel, thereby pursuing multiple
flows of control, which is needed to exploit significant levels of parallelism in
non-numeric applications [50]. By encompassing complex control structures
within a thread, the overall prediction accuracy is significantly improved.
Speculative Execution: The multiscalar paradigm is an epitome of spec-
ulative execution; almost all of the execution in the multiscalar hardware is
speculative in nature. At any time, the only thread that is guaranteed to be
executed non-speculatively is the sequentially earliest thread that is being ex-
ecuted at that time. There are different kinds of speculative execution taking
place across threads in the multiscalar hardware: (i) speculative execution of
control-dependent code across threads, and (ii) speculative execution of loads
before stores from preceding threads, and stores before loads and stores from
preceding threads. The importance of speculative execution for exploiting par-
allelism in non-numeric codes was underscored in [50].
Parallel Execution of Data-Dependent Threads: Another important fea-
ture and big advantage of the multiscalar paradigm is that it does not require
the parallelly executed threads to be data independent either. If inter-thread
dependences are present, either through registers or through memory locations,
the hardware automatically enforces these dependences. This feature gives sig-
nificant flexibility to the compiler. It is worthwhile to point out, however, that
although the execution of data-dependent threads can be overlapped, the parti-
tioning agent can and should as far as possible attempt to pack data-dependent
instructions into the same thread, so that at run time the threads can be executed
Another Random Document on
Scribd Without Any Related Topics
Multiscalar Processors 1st Edition Manoj Franklin Auth
Multiscalar Processors 1st Edition Manoj Franklin Auth
Multiscalar Processors 1st Edition Manoj Franklin Auth
Franklin, Manoj. Multiscalar Processors.
ISBN 978-1-4613-5364-5; ISBN 978-1-4615-1039-0 (eBook); DOI 10.1007/978-1-4615-1039-0.
Copyright © 2003 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2003. Softcover reprint of the hardcover first edition, 2003. Printed on acid-free paper.
Foreword

The revolution of semiconductor technology has continued to provide microprocessor architects with an ever-increasing number of faster transistors with which to build microprocessors. Microprocessor architects have responded by using the available transistors to build faster microprocessors which exploit instruction-level parallelism (ILP) to attain their performance objectives. Starting with serial instruction processing in the 1970s, microprocessors progressed to pipelined and superscalar instruction processing in the 1980s, and eventually (mid 1990s) to the currently popular dynamically scheduled instruction processing models. During this progression, microprocessor architects borrowed heavily from ideas that were initially developed for processors of mainframe computers and rapidly adopted them for their designs. In the late 1980s it was clear that most of the ideas developed for high-performance instruction processing were either already adopted, or were soon going to be adopted. New ideas would have to be developed to continue the march of microprocessor performance.

The initial multiscalar ideas were developed with this background in the late 1980s at the University of Wisconsin. The objective was to develop an instruction processing paradigm for future microprocessors when transistors were abundant, but other constraints such as wire delays and design verification were important.

The multiscalar research at Wisconsin started out small but quickly grew to a much larger effort as the ideas generated interest in the research community. Manoj Franklin's Ph.D. thesis was the first to develop and study the initial ideas. This was followed by the Wisconsin Ph.D. theses of Scott Breach, T.N. Vijaykumar, Andreas Moshovos, Quinn Jacobson and Eric Rotenberg, which studied various aspects of the multiscalar execution models. A significant amount of research on processing models derived from multiscalar was also carried out at other universities and research labs in the 1990s. Today, variants of the basic multiscalar paradigm and other follow-on models continue to be the focus of significant research activity as researchers continue to build the knowledge base that will be crucial to the design of future microprocessors.
This book provides an excellent synopsis of a large body of research carried out on multiscalar processors in the 1990s. It will be a valuable resource for designers of future microprocessors as well as for students interested in learning about the concepts of speculative multithreading.

Guri Sohi
University of Wisconsin-Madison
Contents

Foreword by Guri Sohi
Preface
Acknowledgments

1 INTRODUCTION
  1.1 Technology Trends
    1.1.1 Sub-Micron Technology
    1.1.2 Implications of Sub-Micron Technology
  1.2 Instruction-Level Parallelism (ILP)
    1.2.1 Extracting ILP by Software
    1.2.2 Extracting ILP by Hardware
  1.3 Thread-Level Parallelism (TLP)
    1.3.1 Speculative TLP
    1.3.2 Challenges for TLP Processing
  1.4 The Multiscalar Paradigm
  1.5 The Multiscalar Story
    1.5.1 Developing the Idea
    1.5.2 Multi-block based Threads and the ARB
    1.5.3 Maturing of the Ideas
    1.5.4 Other Speculative Multithreading Models
  1.6 The Rest of the Story

2 THE MULTISCALAR PARADIGM
  2.1 Ideal TLP Processing Paradigm - The Goal
  2.2 Multiscalar Paradigm - The Basic Idea
  2.3 Multiscalar Execution Example
    2.3.1 Control Dependences
    2.3.2 Register Data Dependences
    2.3.3 Memory Data Dependences
  2.4 Interesting Aspects of the Multiscalar Paradigm
  2.5 Comparison with Other Processing Paradigms
    2.5.1 Multiprocessing Paradigm
    2.5.2 Superscalar Paradigm
    2.5.3 VLIW Paradigm
  2.6 The Multiscalar Processor
  2.7 Summary

3 MULTISCALAR THREADS - STATIC ASPECTS
  3.1 Structural Aspects of Multiscalar Threads
    3.1.1 Definition
    3.1.2 Thread Spawning Model
    3.1.3 Thread Flow Graph
    3.1.4 Thread Granularity
    3.1.5 Thread Size Variance
    3.1.6 Thread Shape
    3.1.7 Thread Entry Points
    3.1.8 Thread Exit Points
  3.2 Data Flow Aspects of Multiscalar Threads
    3.2.1 Shared Name Spaces
    3.2.2 Inter-Thread Data Dependence
  3.3 Program Partitioning
    3.3.1 Compiler-based Partitioning
    3.3.2 Hardware-based Partitioning
  3.4 Static Thread Descriptor
    3.4.1 Nature of Information
    3.4.2 Compatibility Issues and Binary Representation
  3.5 Concluding Remarks

4 MULTISCALAR THREADS - DYNAMIC ASPECTS
  4.1 Multiscalar Microarchitecture
    4.1.1 Circular Queue Organization of Processing Units
    4.1.2 PU Interconnect
  4.2 Thread Processing Phases
    4.2.1 Spawn: Inter-Thread Control Prediction
    4.2.2 Activate
    4.2.3 Execute
    4.2.4 Resolve
    4.2.5 Commit
    4.2.6 Squash
  4.3 Thread Assignment Policies
    4.3.1 Number of Threads in a PU
    4.3.2 Thread-PU Mapping Policy
  4.4 Thread Execution Policies
    4.4.1 Intra-PU Thread Concurrency Policy: TLP
    4.4.2 Intra-Thread Instruction Concurrency Policy: ILP
  4.5 Recovery Policies
    4.5.1 Thread Squashing
    4.5.2 Basic Block Squashing
    4.5.3 Instruction Re-execution
  4.6 Exception Handling
    4.6.1 Exceptions
    4.6.2 Interrupt Handling
  4.7 Concluding Remarks

5 MULTISCALAR PROCESSOR - CONTROL FLOW
  5.1 Inter-Thread Control Flow Predictor
    5.1.1 Dynamic Inter-Thread Control Prediction
    5.1.2 Control Flow Outcome
    5.1.3 Thread History
    5.1.4 Prediction Automata
    5.1.5 History Updates
    5.1.6 Return Address Prediction
  5.2 Intra-Thread Branch Prediction
    5.2.1 Problems with Conventional Branch Predictors
    5.2.2 Bimodal Predictor
    5.2.3 Extrapolation with Shared Predictor
    5.2.4 Correlation with Thread-Level Information to Obtain Accurate History
    5.2.5 Hybrid of Extrapolation and Correlation
  5.3 Intra-Thread Return Address Prediction
    5.3.1 Private RASes with Support from Inter-Thread RAS
    5.3.2 Detailed Example
  5.4 Instruction Supply
    5.4.1 Instruction Cache Options
    5.4.2 A Hybrid Instruction Cache Organization for Multiscalar Processor
    5.4.3 Static Thread Descriptor Cache (STDC)
  5.5 Concluding Remarks

6 MULTISCALAR PROCESSOR - REGISTER DATA FLOW
  6.1 Nature of Register Data Flow in a Multiscalar Processor
    6.1.1 Correctness Issues: Synchronization
    6.1.2 Register Data Flow in Example Code
    6.1.3 Performance Issues
    6.1.4 Decentralized Register File
  6.2 Multi-Version Register File - Basic Idea
    6.2.1 Local Register File
    6.2.2 Performing Intra-Thread Register Data Flow
    6.2.3 Performing Inter-Thread Register Data Flow
  6.3 Inter-Thread Synchronization: Busy Bits
    6.3.1 How are Busy Bits Set? Forwarding of Create Mask
    6.3.2 How are Busy Bits Reset? Forwarding of Register Values
    6.3.3 Strategies for Inter-Thread Forwarding
  6.4 Multi-Version Register File - Detailed Operation
    6.4.1 Algorithms for Register Write and Register Read
    6.4.2 Committing a Thread
    6.4.3 Squashing a Thread
    6.4.4 Example
  6.5 Data Speculation: Relaxing Inter-Thread Synchronization
    6.5.1 Producer Identity Speculation
    6.5.2 Producer Result Speculation
    6.5.3 Consumer Source Speculation
  6.6 Compiler and ISA Support
    6.6.1 Inter-Thread Data Flow Information
    6.6.2 Utilizing Dead Register Information
    6.6.3 Effect of Anti-Dependences
  6.7 Concluding Remarks

7 MULTISCALAR PROCESSOR - MEMORY DATA FLOW
  7.1 Nature of Memory Data Flow in a Multiscalar Processor
    7.1.1 Example
    7.1.2 Performance Issues
  7.2 Address Resolution Buffer (ARB)
    7.2.1 Basic Idea
    7.2.2 Hardware Structure
    7.2.3 Handling of Loads and Stores
    7.2.4 Committing or Squashing a Thread
    7.2.5 Reclaiming the ARB Entries
    7.2.6 Example
    7.2.7 Two-Level Hierarchical ARB
    7.2.8 Novel Features of ARB
    7.2.9 ARB Extensions
    7.2.10 Memory Dependence Table: Controlled Dependence Speculation
  7.3 Multi-Version Cache
    7.3.1 Local Data Cache
    7.3.2 Performing Intra-Thread Memory Data Flow
    7.3.3 Performing Inter-Thread Memory Data Flow
    7.3.4 Detailed Working
    7.3.5 Comparison with Multiprocessor Caches
  7.4 Speculative Version Cache
  7.5 Concluding Remarks

8 MULTISCALAR COMPILATION
  8.1 Role of the Compiler
    8.1.1 Correctness Issues
    8.1.2 Performance Issues
    8.1.3 Compiler Organization
  8.2 Program Partitioning Criteria
    8.2.1 Thread Size Criteria
    8.2.2 Control Flow Criteria
    8.2.3 Data Dependence Criteria
    8.2.4 Interaction Among the Criteria
  8.3 Program Partitioning Heuristics
    8.3.1 Basic Thread Formation Process
    8.3.2 Control Flow Heuristic
    8.3.3 Data Dependence Heuristics
    8.3.4 Loop Recurrence Heuristics
  8.4 Implementation of Program Partitioning
    8.4.1 Program Profiling
    8.4.2 Optimizations
    8.4.3 Code Replication
    8.4.4 Code Layout
  8.5 Intra-Thread Static Scheduling
    8.5.1 Identifying the Instructions for Motion
    8.5.2 Cost Model
    8.5.3 Code Transformations
    8.5.4 Scheduling Loop Induction Variables
    8.5.5 Controlling Code Explosion
    8.5.6 Crosscutting Issues
  8.6 Concluding Remarks

9 RECENT DEVELOPMENTS
  9.1 Incorporating Fault Tolerance
    9.1.1 Where to Execute the Duplicate Thread?
    9.1.2 When to Execute the Duplicate Thread?
    9.1.3 Partitioning of PUs
  9.2 Multiscalar Processor with Trace-based Threads
    9.2.1 Implementation Hurdles of Complex Threads
    9.2.2 Tree-Like Threads
    9.2.3 Instruction Cache Organization
    9.2.4 Advantages
    9.2.5 Trace Processors
  9.3 Hierarchical Multiscalar Processor
    9.3.1 Microarchitecture
    9.3.2 Inter-Superthread Register Data Flow
    9.3.3 Inter-Superthread Memory Data Flow
    9.3.4 Advantages of Hierarchical Multiscalar Processing
  9.4 Compiler-Directed Thread Execution
    9.4.1 Non-speculative Inter-Thread Memory Data Flow
    9.4.2 Thread-Level Pipelining
    9.4.3 Increased Role of Compiler
  9.5 A Commercial Implementation: NEC Merlot

Index
Preface

Semiconductor technology projections indicate that we are on the verge of having billion-transistor chips. This ongoing explosion in transistor count is complemented by similar projections for clock speeds, thanks again to advances in semiconductor process technology. These projections are tempered by two problems that are germane to single-chip microprocessors: on-chip wire delays and power consumption constraints. Wire delays, especially in the global wires, become more important, as only a small portion of the chip area will be reachable in a single clock cycle. Power density levels, which already exceed that of a kitchen hot plate, threaten to reach that of a nuclear reactor!

Looking at software trends, sequential programs still constitute a major portion of the real-world software used by various professionals as well as the general public. State-of-the-art processors are therefore designed with sequential applications as the primary target. Continued demands for performance boost have been traditionally met by increasing the clock speed and incorporating an array of sophisticated microarchitectural techniques and compiler optimizations to extract instruction level parallelism (ILP) from sequential programs. From that perspective, ILP can be viewed as the main success story form of parallelism, as it was adopted in a big way in the commercial world for reducing the completion time of ordinary applications. Today's superscalar processors are able to issue up to six instructions per cycle from a sequential instruction stream. VLSI technology may soon allow future microprocessors to issue even more instructions per cycle.

Despite this success story, the amount of parallelism that can be realistically exploited in the form of ILP appears to be reaching its limits, especially when the hardware is limited to pursuing a single flow of control. Limitations arise primarily from the inability to support large instruction windows (due to wire delay limitations and complex program control flow characteristics) and the ever-increasing latency to memory.
Research on the multiscalar execution model started in the early 1990s, after recognizing this inadequacy of just relying on ILP. The goal was to expand the "parallelism bridgehead" established by ILP by augmenting it with the "ground forces" of thread-level parallelism (TLP), a coarser form of parallelism that is more amenable to exploiting control independence. Many studies on parallelism indeed confirm the significant performance potential of executing multiple threads of a program in parallel. The difficulties that have been plaguing the parallelization of ordinary, non-numeric programs for decades have been complex control flows and ambiguous data dependences through memory. The breakthrough provided by the multiscalar execution model was the use of "sequential threads," i.e., threads that form a strict sequential ordering.

Multiscalar threads are nothing but subgraphs of the control flow graph of the original sequential program. The sequential ordering of threads dictates that control passes from a thread to exactly one successor thread (among different alternatives). At run-time, the multiscalar hardware exploits TLP (in addition to ILP) by predicting and executing a dynamic sequence of threads on multiple processing units (PUs). This sequence is constructed by performing the required number of thread-level control predictions in succession. Thread-level control speculation is the essence of multiscalar processing; sequentially ordered threads are executed in parallel in a speculative manner on independent PUs, without violating sequential program semantics. In case of misspeculation, the results of the incorrectly speculated thread and subsequent threads are discarded.

The collection of PUs is built in such a way that (i) there are only a few global wires, and (ii) very little communication occurs through global wires. Localized communication can be done using short wires, and can be expected to be fast. Thus the use of multiple hardware sequencers (to fetch and execute multiple threads), besides making judicious use of the available transistor budget increase, fits nicely with the goal of reducing on-chip wire delays through decentralization.

Besides forming the backbone of several Ph.D. theses, the multiscalar model has sparked research on several other speculative multithreading models: superthreading, trace processing, clustered multithreading, and dynamic multithreading. It has become one of the landmark paradigms, with appearances in the Call for Papers of important conferences such as ISCA and MICRO. It has been featured in an article entitled "What's Next for Microprocessor Design?" in the October 2, 1995 issue of Microprocessor Report. Recently multiscalar ideas have found their way into a commercial implementation from NEC called Merlot, furthering expectation for this execution model to become one of the "paradigms of choice" for future microprocessor design.

A detailed understanding of the software and hardware issues related to the multiscalar paradigm is of utmost importance to researchers and graduate students working in advanced computer architecture. The past few years have
indeed seen many publications on the multiscalar paradigm, both from academia and industry. However, there has been no book that integrates all of the concepts in a cohesive manner. This book is intended to serve computer professionals and students by providing a comprehensive treatment of the basic principles of multiscalar execution as well as advanced techniques for implementing the multiscalar concepts. The presentation benefits from the many years of experience the author has had with the multiscalar execution model, both as Ph.D. dissertation work and as follow-up research work. The discussion within most of the sections follows a top-down approach. This discussion is accompanied by a wealth of examples for clarity and ease of understanding. For each major building block, the book presents alternative designs and discusses design trade-offs. Special emphasis is placed on highlighting the major challenges. Of particular importance is deciding where a thread should start and end. Another challenge is enforcing proper synchronization and communication of register values as well as memory values from an active thread to its successors.

The book provides a comprehensive coverage of all topics related to multiscalar processors. It starts with an introduction to this topic, including technology trends that provided an impetus to the development of multiscalar processors and are likely to shape the future development of processors. It ends with a review of the recent developments related to multiscalar processors. We have three audiences in mind: (1) designers and programmers of next-generation processors, (2) researchers in computer architecture, and (3) graduate students studying advanced computer architecture. The primary intended audience is computer engineers and researchers in the field of computer science and engineering. The book can also be used as a textbook for advanced graduate-level computer architecture courses where the students have a strong background in computer architecture. This book would certainly engage the students, and would better prepare them to be effective researchers in the broad areas of multithreading and parallel processing.

MANOJ FRANKLIN
Acknowledgments

First of all, I praise and thank my Lord JESUS CHRIST, to whom this book is dedicated, for HIS love and divine guidance all through my life. Everything that I am and will ever be will be because of HIM. It was HE who bestowed me with the ability to do research and write this book. Over the years, I have come to realize that without such an acknowledgement, all achievements are meaningless, and a mere chasing after the wind. So, to HIM be praise, glory, and honor, for ever and ever.

I thank my family and friends for their support and encouragement throughout the writing of this book. I would like to acknowledge my parents, Prof. G. Aruldhas and Mrs. Myrtle Grace Aruldhas, who have been a constant inspiration to me in intellectual pursuits. My father has always encouraged me to strive for insight and excellence. Thanks to my wife, Bini, for her companionship, love, understanding, and undying support. And thanks to my children, Zaneta, Joshua, and Tesiya, who often succeeded in stealing my time away from this book and have provided the necessary distraction.

Prof. Guri Sohi, my Ph.D. advisor, was instrumental in the development and publicizing of the multiscalar paradigm. He provided much insightful advice while I was working on the multiscalar architecture for my Ph.D. Besides myself, Scott Breach and T. N. Vijaykumar also completed Ph.D. theses on the multiscalar paradigm. Much of the information presented in this book has been assimilated from our theses and papers on the multiscalar paradigm.

The National Science Foundation, DARPA, and IBM have been instrumental in funding the research on the multiscalar architecture at University of Wisconsin-Madison, University of Minnesota, and University of Maryland. Without their support, multiscalar research would not have progressed very far.

Finally, I thank Susan Lagerstrom-Fife and Sharon Palleschi of Kluwer Academic Publishers for their hard work in bringing this manuscript to publication.
Chapter 1

INTRODUCTION

What to do with slow wires and 1 billion fast transistors?

We have witnessed tremendous increases in computing power over the years, yet no computer user has ever complained of a glut in computing power; the demand for computing power seems to increase with supply. To satisfy this demand in the midst of fast approaching physical limits such as speed of light and high power density, scientists should find ever more ingenious ways of increasing the computing power. The main technique computer architects use to achieve speedup is to do parallel processing of various kinds.

The execution of a computer program involves computation operations as well as communication of values, both of which are constrained by control structures in the program. The time taken to execute the program is a function of the total number of computation operations and communication operations. It is also a function of the cycle time and the average number of computation operations and communication operations performed in a cycle. The basic idea behind parallel processing is to use multiple hardware resources to perform multiple computation operations and multiple communication operations in parallel so as to reduce the program's execution time. With continued advances in semiconductor technology, switching components have become progressively smaller and more efficient, with the effect that computation operations have become very fast. Communication speed, on the other hand, seems to be more restricted by the effects of physical factors such as the speed of light, and has become the major bottleneck.
1.1 Technology Trends

Technology has always played a major role in motivating the development of specific architectural techniques. In the past decade, processor performance has been increasing at an approximate rate of 50-60% per year. Semiconductor technology has played a major part in this monotonic increase.

1.1.1 Sub-Micron Technology

Processor performance improvements in the last few decades have been driven to a large extent by developments in silicon fabrication technology that have enabled transistor sizes to reduce monotonically. Reduced feature sizes impact processor design in two important ways:

• They permit more transistors to be integrated into a processor chip. Gathering from the trends in the late 1990s and the early 2000s, there appears to be no end in sight to the growth in the number of transistors that can be integrated on a single chip. Technology projections even suggest the integration of 1 billion transistors in this decade [10] [101], a significant improvement over what is integrated today. This increasing transistor budget has opened up new opportunities and challenges for the development of new microarchitectures as well as compilation techniques for the new microarchitectures.

• Technology scaling reduces the transistor gate length and hence the transistor switching time. This enables the clock speed to be increased.

Ongoing improvements in semiconductor technology have thus provided computer architects with an increasing number of faster transistors with which to build processors.

1.1.2 Implications of Sub-Micron Technology

The technological advances described above are tempered, however, by the fact that in the sub-micron technology era, wire delays are increasing! From one generation of process technology to the next, the wires are made thinner in order to cope with the shrinking of logic gates, because it may not be possible to always increase the number of metal layers. This causes an increase in the resistance of the interconnecting wires without a commensurate decrease in their capacitance, thereby increasing the wire delays. This effect will be predominant in global wires because their length depends on the die size, which is steadily increasing. The increase in wire delays poses some unique challenges:

• The speed of a logic circuit depends on the sum of gate delays and wire delays along the critical path from the input to the output of the circuit. Wire delays become significant compared to gate delays starting with the 0.25 µm CMOS process [101]. This impacts the design of complex circuits
that cannot be easily pipelined to take advantage of potential increases in clock speed. For instance, detailed studies with 0.8 µm, 0.35 µm, and 0.18 µm CMOS technology [64] show that centralized dynamic scheduling hardware does not scale well. This limitation makes it difficult in the future to keep up with the current rate of reduction in processor cycle time [57]. Today digital computing is at a point where clock periods of less than 0.5 ns are the norm, and further improvements in the clock speed may require tremendous engineering effort. An order of magnitude improvement in clock speed, to achieve clock cycles in the sub-nanosecond range, is fraught with difficulties, especially because of approaching physical limits [1].

• The slower wires, along with faster clock rates, will place a severe limit on the fraction of the chip that is reachable in a single cycle [1]. In other words, an important implication of the physical limits of wire scaling is that the area that is reachable in a single clock cycle of future processors will be confined to a small portion of the die.

Apart from wire delays, increases in power consumption also pose a major challenge to microprocessor design. How does the microarchitect deal with these challenges? These challenges have in fact prompted computer architects to consider new ways of utilizing the additional transistor resources for carrying out parallel processing. Before looking at these new ways, let us briefly review the prevalent execution models of the day.

1.2 Instruction-Level Parallelism (ILP)

The parallelism present in programs can be classified into different types: regular versus irregular parallelism, coarse-grain versus fine-grain (instruction-level) parallelism, etc. Regular parallelism, also known as data parallelism, refers to the parallelism present in performing the same set of operations on different elements of a data set, and is very easy to exploit. Irregular parallelism refers to parallelism that is not regular, and is harder to exploit. Coarse-grain parallelism refers to the parallelism between large sets of operations such as subprograms, and is best exploited by a multiprocessor. Fine-grain parallelism, or instruction-level parallelism, refers to the parallelism between individual operations. Over the last few decades, several parallel processing paradigms, including some special purpose paradigms, have been proposed to exploit different types of parallelism. In this section, we take a brief look at techniques to exploit instruction-level parallelism, the dominant form of parallelism exploited by microprocessors.

Converting a high-level language program into one that a machine can execute involves taking several decisions at various levels. Parallelism exploitation
involves additional decisions on top of this. The fundamental aspect in ILP processing is: given a program graph with control and data constraints, arrive at a good execution schedule in which multiple computation operations are executed in a cycle as allowed by the resource constraints in the machine. Arriving at a good schedule involves manipulations on the program graph, taking into consideration several aspects such as the ISA and the resources in the machine. Since there can only be a finite amount of fast storage (such as registers) for temporarily storing the intermediate computation values, the values have to be either consumed immediately or stored away into some form of backup storage (such as main memory), creating additional communication arcs. Thus, the challenge in ILP processing is not only to identify a large number of independent operations to be executed every cycle from a large block of computation operations having intricate control dependences and data dependences, but also to reduce the inter-operation communication costs and the costs of storing temporary results. A good paradigm should not only attempt to increase the number of operations executed in parallel, but also decrease the inter-operation communication costs by reducing the communication distances and the temporary storing away of values, thereby allowing the hardware to be expanded as allowed by technology improvements in hardware and software.

Optimal scheduling (under finite resource constraints) is an NP-complete problem, necessitating the use of heuristics to take decisions. Although programmers can ease scheduling by expressing some of the parallelism present in programs by using a non-standard high-level language (HLL), the major scheduling decisions have to be taken by the compiler, by the hardware, or by both of them. There are different trade-offs in taking the decisions at programming time, at compile time, and at run time.

A program's input (which can affect scheduling decisions) is available only at run-time when the program is executed, leaving compilers to work with conservative assumptions while taking scheduling decisions. Run-time deviations from the compile-time assumptions render the quality of the compiler-generated schedule poor, and increase the program execution time significantly. On the other hand, any scheduling decisions taken by the hardware could increase the hardware complexity, and hence the machine cycle time, making it practical for the hardware to analyze only small portions of the program at a time. Different ILP processing paradigms differ in the extent to which scheduling decisions are taken by the compiler or by the hardware. In this section, we explore the different steps involved in ILP processing. To explore the full possibilities of what can be done by the compiler and what can be done by the hardware, this discussion assumes a combination of control-driven specification and data-driven execution.
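To make the scheduling problem concrete, here is a minimal, hedged illustration (the operands and temporaries are hypothetical): four operations, their dependences, and a schedule for a machine that can issue two operations per cycle.

    /* Four operations and their dependences (t1..t4 are temporaries). */
    int block(int a, int b, int c, int d, int e) {
        int t1 = a + b;       /* op1                                */
        int t2 = c + d;       /* op2: independent of op1            */
        int t3 = t1 * t2;     /* op3: flow-dependent on op1 and op2 */
        int t4 = t3 + e;      /* op4: flow-dependent on op3         */
        return t4;
        /* A 2-issue schedule honoring the dependences:
           cycle 1: op1, op2   cycle 2: op3   cycle 3: op4
           The op3 -> op4 dependence chain, not the issue width,
           bounds how short the schedule can get. */
    }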
1.2.1 Extracting ILP by Software

Extraction of ILP can be performed by software and by hardware. The motivation for using software to extract ILP is to keep the hardware simpler, and therefore faster. The motivation for using hardware to extract ILP, on the other hand, is to extract that parallelism which can be detected only at run time. A central premise of this book is that these two methods are not mutually exclusive, and can both be used in conjunction to extract as much parallelism as possible. There are three fundamental steps in extracting ILP from a program: (1) establish a window of operations; (2) determine and minimize dependences between operations in this window; (3) schedule operations.

1.2.1.1 Establishing a Window of Operations

The first step in extracting ILP from a program at compile time is to establish a path or a subgraph in the program's control flow graph (CFG), called an operation window. The two important criteria in establishing the operation window are that the window should be both large and accurate. Small windows tend to have only small amounts of parallelism. Control dependences caused by conditional branches are the major hurdle in establishing a large and accurate static window. To overcome this, compilers typically analyze both paths of a conditional branch or do a prediction as to which direction the branch is most likely to go. Because an important component of most window-establishment schemes is the accurate prediction of conditional branches, a considerable amount of research has gone into better branch prediction techniques. Initial static prediction schemes were based on branch opcodes, and were not accurate. Now, static prediction schemes are much more sophisticated, and use profile information or heuristics to take decisions [40] [70].

In addition to branch prediction, the compiler uses several other techniques to overcome the effects of control dependences. Some of these techniques are if-conversion, loop unrolling, loop peeling, loop conditioning, loop exchange, function inlining, replacing a set of IF-THEN statements by a jump table [70], and even changing data structures. All these techniques modify the CFG of the program, mostly by reducing the number of control decision points in the CFG. We shall review some of these schemes in terms of the type of modifications done to the CFG and how the modifications are incorporated.
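As a flavor of these CFG transformations, the following minimal C sketch (all names are hypothetical) shows two of them, loop unrolling and if-conversion, removing control decision points:

    /* Loop unrolling: assuming n is a multiple of 4, three of every four
       loop-branch decisions disappear, enlarging the straight-line window. */
    long sum_unrolled(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i += 4) {
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
        }
        return s;
    }

    /* If-conversion: a two-path control dependence becomes a data
       dependence on the condition, yielding straight-line code. */
    int select_val(int c, int x, int y) {
        return c > 0 ? x + y : x - y;   /* before: if (c > 0) ... else ... */
    }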
1.2.1.2 Determining and Minimizing Dependences

Once a window of operations has been established, the next step is to determine the data dependences between the operations in the window, which exist through (pseudo)registers and memory locations. If register allocation has already been performed, then this step involves determining the register storage dependences (anti- and output dependences) as well.

Static Memory Address Disambiguation: Static memory address disambiguation is the process of determining if two memory references (at least one of which is a store) could ever point to the same memory location in any valid execution of the program. Good static memory disambiguation is fundamental to the success of any parallelizing compiler. This is a hard task as memory addresses could correspond to pointer variables, whose values might change at run time. Two memory references may be dependent in one instance of program execution and not dependent in another instance, and static disambiguation has to consider all possible executions of the program. Various techniques have been proposed to do static disambiguation of memory references involving arrays [19]. These techniques involve the use of conventional flow analyses of reaching definitions to derive symbolic expressions for array indexes, in terms of compile-time constants, loop invariants, and induction variables, as well as variables whose values cannot be derived at compile time. For arbitrary multi-dimensional arrays and complex array subscripts, unfortunately, many of the test results can be too conservative; several techniques have been proposed to produce exact dependence relations for certain subclasses of multi-dimensional arrays. Current static disambiguation techniques are able to perform inter-procedural analysis also. Moreover, they can do some pointer analysis also. It is also possible to utilize annotations from the programmer.

Once the dependences in the window are determined, the dependences can be minimized by techniques such as software register renaming (if register allocation has been performed), induction variable expansion, and accumulator variable expansion. A description of some of these techniques is given below.

Software Register Renaming: Reuse of storage names (variables by the programmer and registers by the compiler) introduces artificial anti- and output dependences, and restricts the static scheduler's opportunities for reordering operations. Many of these artificial dependences can be eliminated with software register renaming. The idea behind software register renaming is to use a unique architectural register for each assignment in the window, in similar spirit to static single assignment [13].

Induction Variable Expansion: Induction variables, used within loops to index through loop iterations and arrays, can cause anti-, output, and flow dependences between different iterations of a loop. Induction variable expansion is a technique to reduce the effects of such dependences caused by induction variables. The main idea is to eliminate re-assignments of the induction variable within the window, by giving each re-assignment of the induction variable a new induction variable name, thereby eliminating all dependences due to multiple assignments.
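The following hedged sketch, a hypothetical array-sum kernel, shows the flavor of induction and accumulator variable expansion; once each unrolled copy has its own names, the two chains can be scheduled independently:

    /* Hypothetical kernel illustrating induction and accumulator variable
       expansion; assumes n is even. Each unrolled copy gets its own
       induction variable (i0, i1) and its own accumulator (s0, s1). */
    long sum_expanded(const int *a, int n) {
        long s0 = 0, s1 = 0;
        int i0, i1;
        for (i0 = 0, i1 = 1; i0 < n; i0 += 2, i1 += 2) {
            s0 += a[i0];    /* chain 0: no dependence on chain 1 */
            s1 += a[i1];    /* chain 1: free to run in parallel  */
        }
        return s0 + s1;     /* combine the partial accumulators  */
    }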
1.2.1.3 Scheduling Operations

Once an operation window is established, and the register dependences and memory dependences in the window are determined and minimized, the next step is to move independent operations up in the CFG, and schedule them in parallel with other operations so that they can be initiated and executed earlier than they would be in a sequential execution. If a static scheduler uses a basic block as the operation window, then the scheduling is called basic block scheduling. If the scheduler uses multiple basic blocks as the operation window, then the scheduling is called global scheduling. Basic block schedulers are simpler than global schedulers, as they do not deal with control dependences; however, their use for extracting parallelism is limited. Global scheduling is more useful, as it considers large operation windows. Several global scheduling techniques have been developed over the years to establish large static windows and to carry out static code motions in the windows. These include trace scheduling [19], superblock scheduling [40], software pipelining [45, 89, 102, 103, 161], and boosting [79].

Trace Scheduling: The key idea of trace scheduling is to reduce the execution time along the more frequently executed paths, possibly by increasing the execution time in the less frequently executed paths. Originally developed for microcode compaction, trace scheduling later found application in ILP processing. The compiler forms the operation window by selecting from an acyclic part of the CFG the most likely path, called a trace, that will be taken at run time. The compiler typically uses profile-based estimates of conditional branch outcomes to make judicious decisions in selecting the traces. There may be conditional branches out of the middle of the trace and branches into the middle of the trace from outside. However, the trace is treated and scheduled as if there were no control dependences within the trace; special compensation codes are inserted on the off-trace branch edges to ensure program correctness. Then the next likely path is selected and scheduled, and the process is continued until the entire program is scheduled. Trace scheduling is very useful for numeric programs in which there are a few most likely executed paths. In non-numeric programs, however, many conditional branches are statically difficult to predict, let alone have a high probability of branching in any one direction.
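To illustrate the kind of bookkeeping a trace imposes, the hedged sketch below (hypothetical variables) shows a join point being removed by duplicating the trace's tail; this is the same transformation used to build the superblocks described next.

    /* Before: b = a * 2 sits at a join and is entered from both paths,
       so scheduling it with the likely path would need compensation code. */
    int before(int p, int x) {
        int a, b;
        if (p) a = x + 1;       /* likely path: on the trace  */
        else   a = x - 1;       /* off-trace path             */
        b = a * 2;              /* join: side entrance        */
        return b;
    }

    /* After tail duplication: the likely path {a = x + 1; b = a * 2} has
       a single entry at its top and can be scheduled as straight-line code. */
    int after(int p, int x) {
        int a, b;
        if (p) { a = x + 1; b = a * 2; }
        else   { a = x - 1; b = a * 2; }   /* duplicated tail */
        return b;
    }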
Superblock Scheduling: Superblock scheduling is a variant of trace scheduling. A superblock is a trace with a unique entry point and one or more exit points, and is the operation window used by the compiler to extract parallelism. Superblocks are formed by identifying traces using profile information, and then using tail duplication to eliminate any control entries into the middle of the trace. In order to generate large traces, techniques such as branch target expansion, loop peeling, and loop unrolling are used. Once a superblock is formed, the anti-, output, and flow dependences within the superblock are reduced by standard techniques, and then scheduling is performed within the superblock. In order to reduce the effect of control dependences, operations are speculatively moved above conditional branches.

Hyperblock Scheduling: In hyperblock scheduling, the operation window is a hyperblock, which is an enhancement on the superblock. A hyperblock is a set of predicated basic blocks in which control may enter only from the top, but may exit from one or more points. The difference between a hyperblock and a superblock is that a superblock contains instructions from only one path of control, whereas a hyperblock contains instructions from multiple paths of control. If-conversion is used to convert control dependences within the hyperblock to data dependences. The predicated instructions are reordered without consideration to the availability of their predicates. The compiler assumes architectural support to guarantee correct execution.

Software Pipelining: The static scheduling techniques described so far deal mostly with operation windows involving acyclic codes. Software pipelining is a static technique for scheduling windows involving loops. The principle behind software pipelining is to overlap or pipeline different iterations of the loop body. The methodology is to do loop unrolling and scheduling of successive iterations until a repeating pattern is detected in the schedule. The repeating pattern can be re-rolled to yield a loop whose body is the repeating schedule. Different techniques have been proposed to do software pipelining: perfect pipelining [10], enhanced pipeline scheduling [47], GURPR* [149], modulo scheduling [48, 124], and polycyclic scheduling [125].
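The following is a minimal, hedged sketch of the idea (f is a hypothetical stand-in for the per-element work): the load for iteration i+1 is overlapped with the compute and store of iteration i, so the steady-state loop body holds one stage from each of two iterations.

    static int f(int x) { return x * x; }   /* hypothetical per-element work */

    /* Software-pipelined form of: for (i = 0; i < n; i++) b[i] = f(a[i]); */
    void map_f(const int *a, int *b, int n) {
        int i, t, u;
        if (n <= 0) return;
        t = a[0];                 /* prologue: issue the first load early */
        for (i = 0; i < n - 1; i++) {
            u = f(t);             /* compute for iteration i              */
            t = a[i + 1];         /* load for iteration i+1, overlapped   */
            b[i] = u;             /* store for iteration i                */
        }
        b[n - 1] = f(t);          /* epilogue: drain the pipeline         */
    }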
Boosting: Boosting is a technique for statically specifying speculative execution. Conceptually, boosting converts control dependences into data dependences using a technique similar to if-conversion, and then executes the if-converted operations in a speculative manner before their predicates are available. Extra buffering is provided in the processor to hold the results of speculative operations. When the predicate of a speculatively executed operation becomes available, the hardware checks if the operation's execution was required. If the execution was required, the non-speculative state of the machine is updated with the buffered effects of that operation's execution. If the operation should not have been executed, the hardware simply discards the state and side-effects of that operation's execution. Boosting provides the compiler with additional opportunity for reordering operations, while making the hardware responsible for ensuring that the effects of speculatively executed operations do not affect the correctness of program execution when the compiler is incorrect in its speculation.

Advantages of Static Extraction of ILP: The singular advantage of using the compiler to extract ILP is that the compiler can do a global and much more thorough analysis of the program than is possible by the hardware. It can even consider the entire program as a single window, and do global scheduling in this window. Furthermore, extraction of ILP by software allows the hardware to be simpler. In any case, it is a good idea to use the compiler to extract whatever parallelism it can extract, and to do whatever scheduling it can to match the parallelism to the hardware model.

Limitations of Static Extraction of ILP: Static extraction of ILP has its limitations. The main limitation is the extent to which static extraction can be done for non-numeric programs in the midst of a conglomeration of ambiguous memory dependences and data-dependent conditional branches. The inflexibility in moving ambiguous memory operations can pose severe restrictions on static code motion in non-numeric programs. Realizing this, researchers have proposed schemes that allow ambiguous references to be statically reordered, with checks made at run time to determine if any dependences are violated by the static code motions [62]. Ambiguous references that are statically reordered are called statically unresolved references. A limitation of this scheme, however, is that the run-time checks need extra code and, in some schemes, associative compare of store addresses with preceding load addresses in the active window. Another issue of concern in static extraction of ILP is code explosion. An issue, probably of less concern nowadays, is that any extraction of parallelism done at compile time is architectural, and hence may be tailored to a specific architecture or implementation. This is not a major concern, as specific compilers have become an integral part of any new architecture or implementation.
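A hedged sketch of the run-time-check idea (not the exact scheme of [62]; p, q, and v are hypothetical): a load is hoisted above an ambiguous store, and a compiler-inserted check repairs the value if the two references turn out to alias.

    /* Original order:
           *q = v;
           t  = *p;    (ambiguous: p may equal q)
       After static reordering with a run-time check: */
    int hoisted_use(int *p, int *q, int v) {
        int t = *p;    /* statically unresolved load, moved above the store */
        *q = v;        /* ambiguous store                                   */
        if (q == p)    /* run-time check inserted by the compiler           */
            t = v;     /* repair: take the value the store just wrote       */
        return t + 1;  /* use of the (possibly repaired) value              */
    }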
1.2.2 Extracting ILP by Hardware

Given a program with a particular static ordering, the hardware can change the order and execute instructions concurrently or even out-of-order in order to extract additional parallelism, so long as the data dependences and control dependences in the program are honored. There is a price paid in doing this run-time scheduling, however. The price is the complexity it introduces to the hardware, which could lead to potential increases in cycle time. For hardware scheduling to be effective, any increase in cycle time should be offset by the additional parallelism extracted at run time. When the hardware extracts ILP, the same three steps mentioned in Section 1.2.1 are employed. However, instead of doing the three steps in sequence, the hardware usually overlaps the steps, and performs all of them in each clock cycle.

1.2.2.1 Establishing a Window of Instructions

To extract large amounts of ILP at run time, the hardware has to establish a large window of instructions. It typically does that by fetching a fixed number of instructions every cycle, and collecting these instructions in a hardware window structure. The main hurdles in creating a large dynamic window are control dependences, introduced by conditional branches. To overcome these hurdles, the hardware usually performs speculative fetching of instructions. With speculative fetching, rather than waiting for the outcome of a conditional branch to be determined, the branch outcome is predicted, and operations from the predicted path are entered into the window for execution. Dynamic prediction techniques have significantly evolved over the years [58] [98]. Although the accuracies of contemporary dynamic branch prediction techniques are fairly high, averaging 95% for the SPEC non-numeric programs, the accuracy of a large window obtained through n independent branch predictions in a row is only (0.95)^n on the average, and is therefore poor even for moderate values of n. Notice that this problem is an inherent limitation of following a single line of control. The multiscalar paradigm that we describe in this book breaks this restriction by following multiple flows of control.
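For concreteness, here is a minimal sketch of one common building block of such schemes, a bimodal predictor built from two-bit saturating counters; the table size and the hash are illustrative assumptions, not a specific published design. Note how quickly compound accuracy decays: at 95% per branch, a window spanning eight predictions is entirely correct only about 0.95^8 ≈ 66% of the time.

    #define PRED_ENTRIES 4096                     /* illustrative table size */
    static unsigned char counters[PRED_ENTRIES];  /* 2-bit states 0..3       */

    int predict_taken(unsigned long pc) {
        return counters[pc % PRED_ENTRIES] >= 2;  /* states 2,3: predict taken */
    }

    void train(unsigned long pc, int taken) {
        unsigned char *c = &counters[pc % PRED_ENTRIES];
        if (taken) { if (*c < 3) (*c)++; }        /* saturate upward   */
        else       { if (*c > 0) (*c)--; }        /* saturate downward */
    }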
1.2.2.2 Determining and Minimizing Dependences

In parallel to establishing the window, the hardware also determines the different types (flow, anti-, and output) of register and memory dependences between the instructions in the window. Register dependences are comparatively easy to determine as they require only the comparison of the source and destination operand specifiers of the operations. Determining memory dependences is harder, and is described below.

Dynamic Memory Address Disambiguation: To determine the memory dependences in the established window, memory references must be disambiguated. Disambiguating two memory references at run time means determining if the two references point to the same memory location or not. In processors that perform dynamic extraction of parallelism, dynamic disambiguation involves comparing the addresses of all loads and stores in the active window; a simple approach is to perform this comparison by means of associative searches, which becomes extremely complex for large windows. Chapter 7 further addresses the issues involved in dynamic disambiguation. Over the years, different techniques have been proposed for performing dynamic disambiguation [29].

After determining the register and memory dependences in the window, the next focus is on reducing the anti- and output dependences (storage conflicts) in the window, in order to facilitate aggressive reordering of instructions. The natural hardware solution to reduce such storage conflicts is to provide more physical storage, and use some dynamic renaming scheme to map from the limited architectural storage to the not-so-limited physical storage. An example for this technique is register renaming.

Hardware Register Renaming: Storage conflicts occur very frequently with registers, because they are limited in number, and serve as the hub for inter-operation communication. The effect of these storage conflicts becomes very severe if the compiler has attempted to keep as many values in as few registers as possible, because the execution order assumed by a compiler is different from the one the hardware attempts to create. A hardware solution to decrease such storage conflicts is to provide additional physical registers, which are then dynamically allocated by hardware register renaming techniques. With hardware register renaming, typically a free physical register is allocated for every assignment to a register in the window, much like the way software register renaming allocates architectural registers. Many different techniques are available to perform hardware register renaming.
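A hedged sketch of the bookkeeping such schemes share (the sizes are illustrative, and misprediction recovery and free-list refill at commit are omitted): a map table tracks the current physical name of each architectural register, and every new destination gets a fresh physical register.

    #define NARCH 32               /* architectural registers (illustrative) */
    #define NPHYS 128              /* physical registers (illustrative)      */

    static int map_table[NARCH];   /* architectural -> physical mapping      */
    static int free_list[NPHYS];   /* pool of unused physical registers,     */
    static int free_top;           /* assumed initialized at reset           */

    /* Rename "dest = src1 op src2". Sources read the current map, so flow
       dependences are preserved; the fresh destination register removes
       anti- and output dependences on dest. */
    void rename_op(int dest, int src1, int src2,
                   int *pdest, int *psrc1, int *psrc2) {
        *psrc1 = map_table[src1];
        *psrc2 = map_table[src2];
        *pdest = free_list[--free_top];   /* allocate a free physical register */
        map_table[dest] = *pdest;         /* later readers see the new name    */
    }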
1.2.2.3 Scheduling Instructions

In parallel to establishing a window and enforcing the register and memory dependences, the hardware performs scheduling of ready-to-execute instructions. Instructions that are speculatively fetched from beyond unresolved branches are executed speculatively, i.e., before determining that their execution is needed. The hardware support for speculative execution consists of extra buffering in the processor, which holds the effects of speculatively executed instructions. When a conditional branch is resolved, if the earlier prediction was correct, all speculative instructions that are directly control dependent on the branch are committed. If the prediction was incorrect, then the results of speculatively executed instructions are discarded, and instructions are fetched and executed from the correct path. Several dynamic techniques are available to carry out speculative execution along with precise state recovery [36].

Hardware schedulers often use simplistic heuristics to choose from the instructions that are ready for execution. This is because any sophistication of the instruction scheduler directly impacts the hardware complexity. A number of dynamic scheduling techniques have been proposed: the CDC 6600's scoreboard [85], Tomasulo's algorithm [86], decoupled execution [80], the register update unit (RUU) [81], the dispatch stack [18], the deferred-scheduling register-renaming instruction shelf (DRIS) [67], etc. A detailed treatment of some of these schemes is available in [25] [36] [44].

Advantages of Dynamic Extraction of ILP: The major advantage in doing (further) extraction of ILP at run-time is that the hardware can utilize the information that is available only at run time to extract the ILP that could not be extracted at compile time. In particular, the hardware can resolve ambiguous memory dependences, which cannot be resolved at compile time, and use that information to make more informed decisions in extracting ILP. The schedule developed at run time is also better adapted to run-time uncertainties such as cache misses and memory bank conflicts.

Limitations of Dynamic Extraction of ILP: Although dynamic scheduling with a large centralized window has the potential to extract large amounts of ILP, a realistic implementation of a wide-issue (say a 16-issue) processor with a fast clock is not likely to be possible because of its complexity. A major reason has to do with the hardware required to parse a number of instructions every cycle. The hardware required to extract independent instructions from a large centralized window and to enforce data dependences typically involves wide associative searches, and is non-trivial. While this hardware is tolerable for 2-issue and 4-issue processors, its complexity increases rapidly as the issue width is increased. The major issues of concern for wide-issue processors include: (i) the ability to create accurate windows of perhaps 100s of instructions, needed to sustain significant levels of ILP, (ii) elaborate mechanisms to enforce dependences between instructions in the window, (iii) possibly wide associative searches in the window for detecting independent instructions, and (iv) possibly centralized or serial resources for disambiguating memory references at run time.

1.3 Thread-Level Parallelism (TLP)

Modern microprocessors make use of a variety of instruction-level parallel processing techniques to achieve high performance. The commodity microprocessor industry uses a variety of microarchitectural techniques such as pipelining, branch prediction, out-of-order execution, and superscalar execution, and sophisticated compiler optimizations. Such hardware-centered techniques appear to have scalability problems in the sub-micron technology era, and are already appearing to run out of steam. According to a recent position paper by Dally and Lacy [14], "over the past 20 years, the increased density of VLSI chips was applied to close the gap between microprocessors and high-end CPUs. Today this gap is fully closed and adding devices to uniprocessors is well beyond the point of diminishing returns". We view ILP as the main success story form of parallelism thus far, as it was adopted in a big way in the commercial world for reducing the completion time of general purpose applications. The future promises to expand the "parallelism bridgehead" established by ILP with the "ground forces" of thread-level parallelism (TLP), by using multiple processing elements to exploit both fine-grained and coarse-grained parallelism in a natural way.
Why, in any case, must we look at ingenious ways to exploit thread-level parallelism? After all, medium-grain and coarse-grain parallelism have been regularly exploited by multiprocessors for several decades. The answer is that many important applications exist (mostly non-numeric) in which conventional TLP techniques appear to be ineffective. For these applications, speculative TLP appears to be the only type of parallelism that can be exploited. Exploitation of parallelism at the instruction level can only provide limited performance for such programs. Many studies have confirmed that there exists a large amount of parallelism in ordinary programs [5] [11] [61] [94]. Even in other applications, no matter how much parallelism is exploited by ILP processing, a substantial amount of parallelism will still remain to be exploited at a higher granularity. Therefore, irrespective of the speedup obtained by ILP processing, TLP processing can give additional speedups over that speedup. Thus, TLP processing and ILP processing complement each other, and we can expect future processors to be doing both.

1.3.1 Speculative TLP

A natural way to make use of the additional transistor budget and to deal with the wire delay problem is to use the concept of speculative multithreading in the processor microarchitecture. That is, build the processor as a collection of independent processing units (PUs), each of which executes a separate thread or flow of control. By designing the processor as a collection of PUs, (i) the number of global wires reduces, and (ii) very little communication occurs through global wires. Thus, much of the communication occurring in the multi-PU processor is local in nature, and occurs through short wires. Such a decentralized processor can execute groups of instructions independently, and is not fundamentally limited by technological constraints like the processors made of centralized hardware resources.

Although multithreading and multiprocessing have been used in the highest performance computer systems for the past 30 years, they were traditionally confined to special-purpose paradigms for exploiting regular parallelism from numeric programs. In this book we place a strong emphasis on exploiting TLP from non-numeric programs, which mostly contain irregular parallelism. This is not to belittle the importance of numeric programs, which are the backbone of many theoretical and simulation studies in scientific applications. Numeric programs have received substantial attention in the past, whereas non-numeric programs have received only passing attention. The multiscalar research was an attempt not only to bridge that gap, but also to lay the foundation for future microprocessors.

Parallelization has been a success for scientific applications, but not quite so for non-numeric applications, which use irregular data structures and have complex control flows that make them hard to parallelize.
The emergence of the speculative multithreading model in the last decade to exploit speculative TLP has provided the much awaited breakthrough for non-numeric applications. Hardware support for speculative thread execution makes it possible for the compiler to parallelize sequential applications without worrying about data and control dependences.

1.3.2 Challenges for TLP Processing

There are several issues to be tackled in developing a good TLP processing paradigm. First, there are different schools of thought on when the extraction of parallelism is to be done: at programming time, compile time, or run time. Each method has its own strengths and shortcomings. Any processing model that relies entirely on compile-time scheduling or on run-time scheduling is very likely to fail because of inherent limitations of both. So the challenge is to use the right mix of compile-time and run-time extraction of parallelism. The alternatives differ widely, based on the extent to which this question is answered by the compiler or the hardware, and on the manner in which the compiler-extracted parallelism information is conveyed to the hardware.

Second, studies have found little TLP within a small sequential block of instructions, but significant amounts in large blocks [5] [11] [50] [94]. There are several inter-related factors that contribute to this. Because most programs are written in an imperative language for a sequential machine with a limited number of architectural registers for storing temporary values, instructions of close proximity are very likely to be data dependent, unless they are reordered by the compiler. This means that most of the parallelism can be found only amongst instructions that are further apart in the instruction stream. The obvious way to get to that parallelism is to establish a large window of instructions, and look for parallelism in this window. The creation of the large window, whether done statically or dynamically, should be accurate. That is, the window should consist mostly of instructions that are guaranteed to execute, and not instructions that might be executed. Given the basic block sizes and branch prediction accuracies for some common C programs, following a single thread of control while establishing a window may not be sufficient: the maximum parallelism that can be extracted from such a window is limited to about 7 [50]. A more complex window, which contains instructions from multiple threads of control, might be needed; analysis of the control dependence graph [13] [21] of a program can aid in the selection of the threads of control.

Another major challenge in designing the TLP hardware is to decentralize the critical resources in the system.
These include the hardware for fetching from multiple threads, the hardware for carrying out the inter-operation communication of the many operations in flight, a memory system that can handle multiple accesses simultaneously, and, in a dynamically scheduled processor, the hardware for detecting the parallelism at run time.

1.4 The Multiscalar Paradigm

This book explores the issues involved in TLP processing, and focuses on the first speculative multithreading paradigm, the multiscalar paradigm, for TLP processing. This paradigm executes programs by means of the parallel execution of multiple threads that are derived from a sequential instruction stream. This type of execution is achieved by considering a subgraph of the program's control flow graph to be a thread, and executing many such threads in parallel. The multiple threads in execution can have both data dependences and control dependences between them. The execution model within each thread can be a simple, sequential processor. As we will see in this book, such an approach has the synergistic effect of combining the advantages of the sequential and the dataflow execution models, and the advantages of static and dynamic scheduling. Executing multiple threads in parallel, although simple in concept, has powerful implications:

1 Most of the hardware structures can be built by replicating a conventional processor core. This allows the critical hardware resources to be decentralized by a divide-and-conquer strategy, as will be seen in Chapters 4-7. A decentralized hardware realization facilitates clock speeds comparable to that of contemporary processors. Furthermore, it allows expandability.

2 Sequential programs can be partitioned into threads (as far as possible) at those points that facilitate the execution of control-independent code in parallel. Even if the program partitioning agent (most likely the compiler) may not know the exact path that will be taken through a thread at run time, it may be fairly sure of the next thread that will be executed. Thus, the overall large window can be made very accurate.

3 It helps to overlap the execution of blocks of code that are not guaranteed to be data-independent. The program partitioning agent can, of course, attempt to pack data-dependent instructions into a thread, and as far as possible form threads that are independent so as to improve the processor performance. However, the processing paradigm does not require the threads to be independent, which is a significant advantage.

4 Because the multiscalar paradigm considers a block of instructions as a single unit (thread), the program partitioning agent can convey to the run-time hardware more information such as inter-thread register dependences and control flow information. Thus, the hardware need not reconstruct some of the information that was already available at compile time.

5 It helps to exploit the localities of communication present in a program.

These statements may appear a bit "rough-and-ready", and may not make much sense before a detailed study of the new paradigm. It is precisely this paradigm and its implementation that we discuss in the ensuing chapters of this book.
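As a first taste of what a multiscalar thread might look like, consider this hedged sketch: a hypothetical linked-list walk, with process standing in for arbitrary per-node work. Each loop iteration could be one thread: control passes to exactly one successor, sum and p induce inter-thread dependences, and the per-node work overlaps across threads.

    struct node { int val; struct node *next; };
    extern void process(struct node *p);     /* hypothetical per-node work */

    long walk(struct node *p) {
        long sum = 0;
        /* Each iteration below could be spawned as one multiscalar thread:
           the next thread is predicted (loop back vs. exit), sum and p are
           forwarded between neighboring threads, and process(p) supplies
           the thread-local work that runs in parallel across PUs. */
        while (p != NULL) {
            sum += p->val;     /* inter-thread register dependence */
            process(p);        /* bulk of the work, overlapped     */
            p = p->next;       /* needed by the successor thread   */
        }
        return sum;
    }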
1.5 The Multiscalar Story

The multiscalar paradigm originated at the University of Wisconsin-Madison in the early 1990s. A detailed retrospective on multiscalar processors is provided by Guri Sohi in [83]; here we provide the highlights from the author's perspective. Research work on multiscalar ideas started after recognizing the limitations of using a centralized scheduler for dynamic scheduling. The main point of attack was the logic needed to implement the instruction scheduling and wakeup functions: a large centralized instruction window was not a long-term solution.

Another motivating point was the publication of an article entitled "Microprocessors Circa 2000," in the October 1989 issue of IEEE Spectrum [30], with projections of 100 million transistors on a single chip. The question that begged for an answer was: how could these resources be used to speed up computation? What would be the execution model for a 100 million transistor processor? The proposal in [30] amounted to a 4-way multiprocessor on a chip. The explicitly parallel multiprocessor model had practical limitations because it appeared unlikely that parallelizing compiler technology would be able to automatically parallelize a majority of applications in the near future.

1.5.1 Developing the Idea

Guri Sohi started thinking about possible architectural paradigms for a circa-2000 processor, i.e., what lay beyond superscalar. He started the search by looking at the dataflow model. The concepts looked good: thinking about the RUU-based superscalar processor as a dataflow engine makes it possible to get good insight into its operation. However, the use of this model in its entirety had limitations. In particular, giving up sequential programming semantics did not appear to be a good option, as it appeared unlikely that inherently parallel languages were going to be adopted widely in the near future. This meant that dataflow-like execution should be achieved for a serial program. Rather than consider this a drawback, he considered this an asset: exploit the inherent sequentiality to create "localities" in the inter-operation communication that could be exploited to simplify the inter-operation communication mechanism (aka the token store in a dataflow machine).

Earlier experiments with the RUU also had shown that although increasing the RUU size would allow more parallelism to be exploited, much of the parallelism was coming from points that were far apart in the RUU; there was little parallelism from "close by".
As increasing the size of a centralized RUU entailed significant overheads, the importance of decentralization by exploiting the localities of communication became apparent.

At about the same time, Jim Smith introduced Guri to the concept of a dependence architecture. This model was based upon an early version of the Cray-2, which was abandoned. The machine consisted of 4 independent units, each with an accumulator, and collectively backed by a shared register file. Sequences of dependent operations were submitted to each unit, where they would execute in parallel.

The author had started Ph.D. work with Guri in the Fall of 1988. After building a MIPS version of the RUU-based superscalar processor, and studying the design of non-blocking caches, in the Summer of 1990, Guri shared with him the idea of an architecture in which the instruction window (aka register update unit (RUU)) could be split into multiple sub-windows. The author started implementing this concept in the beginning of Fall 1990. He developed a circular queue of sub-windows in which the major aspects of the machine were decentralized. The author built a MIPS ISA-based cycle-accurate simulator to test out the basic concepts by the end of Fall 1990. This simulator allocated a basic block to each sub-window. Branch-level prediction was used to decide the next basic block to be allocated. Multiple sequencers were used to fetch instructions in parallel from the active sub-windows. The last updates of each architectural register were forwarded from each sub-window to the next. A create mask was used to decide whether a register value arriving from a previous sub-window should be forwarded or not.
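To make the create mask concrete, the following C sketch shows one plausible form of the forwarding decision. It is only an illustration: the structure names, the 32-register mask width, and the array-of-sub-windows organization are assumptions, not details of the original simulator.

    #include <stdint.h>

    /* One sub-window in the circular queue. Bit r of create_mask is set if
       this sub-window itself writes architectural register r (illustrative
       32-register machine). */
    typedef struct {
        uint32_t create_mask;   /* registers produced locally */
        uint32_t regs[32];      /* incoming register values, for local use */
    } SubWindow;

    /* Propagate the last update of register r, produced by sub-window src,
       toward younger sub-windows. Each recipient keeps the value for its own
       consumers; propagation stops at the first sub-window whose create mask
       indicates that it generates its own (logically later) value of r. */
    void forward_register(SubWindow win[], int nwin, int src, int tail,
                          int r, uint32_t val)
    {
        int stop = (tail + 1) % nwin;
        for (int i = (src + 1) % nwin; i != stop; i = (i + 1) % nwin) {
            win[i].regs[r] = val;
            if (win[i].create_mask & (1u << r))
                break;      /* a later value of r originates here */
        }
    }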
1.5.2 Multi-block based Threads and the ARB

Very soon it became apparent that the sub-windows should be larger than a basic block. As there was no compiler support, the author formed these "multi-block" threads by a post-compilation phase of the MIPS binary. It was easier for the post-compilation phase to consider statically adjacent basic blocks as multi-blocks. Restrictions were imposed on the multi-blocks' length and number of successors (only 2 successors were initially allowed). Multi-blocks were terminated immediately after any unconditional control-changing instruction such as subroutine call, subroutine return, and direct as well as indirect jump. Information about the formed multi-blocks was kept in a separate file, and supplied as input to the simulator.

Executing the multi-blocks in parallel required some changes to the hardware. First of all, it required novel control prediction techniques that could go beyond multiple branches simultaneously, as well as the ability for the machine to resolve multiple branches simultaneously. A technique called control flow prediction was developed to do this [65]. The most notable change was in forwarding register values. It was no longer possible to determine the last updates of registers by a static inspection of the thread. If each sub-window waits until the completion of its multi-block to forward the register values that it produced, then very poor performance will be achieved. The author alleviated this problem by incorporating register dependence speculation. Whenever a misspeculation occurs, selective re-execution is done to recover: only the affected instruction and its dependent slice of instructions are re-executed.

By Spring 1991, the author extended the cycle-accurate simulator to incorporate multi-blocks. Still, there was no decentralized mechanism for carrying out memory address disambiguation. Because the different sub-windows operate independently, loads would need to execute before the identities of prior stores (in a different sub-window) were known. This would require a significant rethinking of how memory operations are to be carried out. In May 1991, Guri gave a talk about the basic multiscalar concepts at Cray Research, and in June 1991, at DEC, Marlboro. After the latter talk, he had a long conversation with Joel Emer and Bob Nix about the memory system aspects of such a machine. They told him that they had a solution, in the context of a VLIW processor, but were unable to share the details. Guri convinced the author to come up with a solution applicable to the new paradigm, and the author came up with the address resolution buffer (ARB) in Fall 1991. (Later, it turned out that the two solutions, and the problems they were solving, were entirely different.) The ARB performed memory data dependence speculations in an aggressive manner. Misspeculations resulted in squashing all multi-blocks from the misspeculation point.

In Fall 1991, the author incorporated the ARB into the cycle-accurate simulator, and submitted the ISCA92 paper along with Guri [23]. The multiscalar paradigm was then called the Expandable Split Window paradigm. Guri also gave talks at several companies, and discussed the multiscalar ideas with many people. Most notably, he had detailed discussions with, and received critiques from, Mitch Alsup, Jim Smith, and Bob Rau. These discussions were crucial in the refinement of the multiscalar concept. In January 1992 Guri gave the first "public" presentation of the multiscalar paradigm at HICSS. He received a number of difficult questions from the audience, which included Mike Flynn, Andy Heller, Peter Hsu, Wen-Mei Hwu, Yale Patt, and Bob Rau.

In the Summer of 1992, Mark Hill convinced Guri to come up with a better name for the concept; the term "Expandable Split Window" was not sufficiently catchy. After trying several variations of "scalar", Guri coined the term "Multiscalar".
1.5.3 Maturing of the Ideas

Guri and the author continued with experiments on the multiscalar concept. One of the performance impediments that they faced was squashes due to memory data dependences: the MIPS compiler would often spill a register (assuming it would be a cache hit) and reload it shortly afterwards; this would cause memory data dependence misspeculations. The author alleviated this problem using selective re-execution. Guri then mentioned the need to decentralize the ARB itself, and the need to bring the top level of the memory hierarchy "on the same side of the interconnect as the processing units". The author then developed the multi-version cache, along the lines of the multi-version register file used for decentralizing register communication. In Fall 1993, the author wrote his Ph.D. dissertation entitled "The Multiscalar Architecture" [25].

Significant enhancements were made to the multiscalar paradigm after the author left the University of Wisconsin. These enhancements were primarily geared towards improving performance. The main restriction on multiscalar performance at that time was the lack of a compiler that could do a better job of program partitioning. Post-compilation program partitioning had several limitations. The program was sometimes getting divided at improper points, for example, after half of a double-word load or half-way through building an address. This aggravated inter-thread data dependences. Moreover, threads could not include entire loops or function call invocations, because of the use of selective re-execution in the multiscalar processing units. Selective re-execution during times of register dependence misspeculation and memory dependence misspeculation required all the instructions of the thread to be present in the instruction queue of a processing unit. This meant that threads could not be larger than the instruction queue size, because conceptually any instruction is likely to require re-execution.

In 1993-94, T. N. Vijaykumar developed a multiscalar compiler on top of the GNU C compiler. This compiler could perform program partitioning as well as intra-thread static scheduling, and generate a multiscalar binary. The compiler used a detailed set of heuristics to guide program partitioning. Intra-thread static scheduling was also done to reduce the impact of inter-thread data dependences. This compiler also incorporated features such as release register instructions and forward bit annotations. During the same period, Scott Breach refined the multiscalar hardware to incorporate the new features, and updated the cycle-accurate simulator accordingly. He developed different strategies for performing inter-thread register communication. He also developed different policies for allocating spawned threads to processing units. In the Fall of 1994, Guri, Vijay, and Scott wrote the ISCA95 paper [82], with these enhancements and the new set of simulation results.

In Fall 1994, Jim Smith returned to the University of Wisconsin, and became directly involved in the multiscalar project. NSF and ARPA provided extensive funds to test out the feasibility and practicality of the concept. This resulted in the Kestrel project.
1.5.4 Other Speculative Multithreading Models

Since the development of the multiscalar paradigm, several related paradigms have been proposed. Notable ones among them are superthreading, trace processors, chip multiprocessing, dynamic multithreading, clustered speculative multithreading, and dynamic vectorization. In current literature, the term "speculative multithreading" is used to refer to all of these execution models. After moving to Clemson University, the author looked at the applicability of trace-based threads for the multiscalar processor. Restricting multiscalar threads to traces makes the hardware substantially simpler. Trace-based threads have been found to have so many unique features that researchers have come up with trace processors, which have some differences from traditional multiscalar processors. Trace processors were originally proposed by Sriram Vajapeyam and Tulika Mitra [90], and improved upon by Eric Rotenberg and Jim Smith [72].

Prior to that, Jenn-Yuan Tsai and Pen-Chung Yew developed the superthreading execution model at the University of Minnesota [88]. This execution model uses the compiler not only to form threads, but also to do intra-thread scheduling in such a manner as to allow the hardware to execute multiple threads in a pipelined fashion. Pedro Marcuello and Antonio Gonzalez investigated a speculative multithreading scheme in which loop-based threads are dynamically formed at run-time [53]. Haitham Akkary and Mike Driscoll proposed the dynamic multithreading execution model [3], in which multiscalar threads are executed in a single pipeline as in simultaneous multithreading (SMT) [89]. More recently, Sriram Vajapeyam, P. J. Joseph, and Tulika Mitra proposed dynamic vectorization as a technique for exploiting distant parallelism [91]. Mohamed Zahran and the author proposed hierarchical multithreading, which uses a 2-level hierarchical multiscalar processor to exploit thread-level parallelism at two granularities.

With Intel's recent paper on Micro 2010, it is time for computer architects to start thinking about architectural and microarchitectural models for processor chips of that era.

1.6 The Rest of the Story

We have outlined the important technological trends in processor design, and have now sketched in enough common ground for our study of thread-level parallelism and the multiscalar paradigm to begin. Chapter 1 has provided the background for the subject of the book. It started with technology trends that play a major role in processor development, and introduced thread-level parallelism to complement instruction-level parallelism, the prominent type of parallelism exploited by microprocessors until recently. The chapter then proceeded to speculative thread-level parallelism, which sets the multiscalar execution model in context. Finally, the chapter provided a brief introduction
to the multiscalar paradigm, and concluded with a history of its development. The rest of the book is organized into 8 more chapters.

Chapter 2 expounds on the multiscalar paradigm. It presents the basic idea first, and then proceeds to a detailed example control flow graph that shows how a program fragment is partitioned into speculative threads, which are then executed in parallel. The ensuing discussion highlights how the multiscalar execution model deals with complex control dependences and data dependences that are germane to non-numeric programs. Different types of speculation are shown to be the key to dealing with control dependences as well as data dependences. A qualitative assessment of the performance potential is presented next, along with justifications. The chapter also provides a review of the interesting aspects of the multiscalar execution model, and a comparison of the model with other popular execution models. It concludes by introducing a possible hardware implementation of the multiscalar paradigm.

With the basic multiscalar idea introduced in Chapter 2, Chapter 3 examines a set of cross-cutting issues related to static threads. These issues deal with thread granularity, thread structure, thread boundaries, number of successor threads, program partitioning agent, and thread specification. Threads can come in many forms and at different granularities, and the chapter discusses the trade-offs involved in selecting a thread model. It also provides an understanding of the trade-offs involved in performing program partitioning at compile time and at execution time.

Chapter 4 discusses dynamic aspects related to threads, including the execution of threads on a multiscalar microarchitectural platform. It discusses how multiple processing units (PUs) can be organized, what kind of interconnects can be used to connect the PUs, and the detailed microarchitecture of a PU. This discussion is followed by a breakdown of a dynamic thread's lifetime into its constituent phases: spawn, activate, execute, resolve, commit, and sometimes squash. These phases account for the period of processing that takes place in the multiscalar processor from the spawn to the exit of a thread. Each of these phases is then discussed in detail, with special emphasis given to presenting different schemes and their trade-offs. The chapter ends with a discussion on schemes for handling interrupts and exceptions in the multiscalar processor.

Chapter 5 focuses on microarchitectural aspects that are specific to control flow. This chapter deals with 3 central topics related to a thread's execution: spawning, activation, and retirement. Thread spawning often requires performing thread-level control speculation to decide which thread to spawn next, and the chapter begins with a discussion on hardware schemes for performing thread-level control speculation. The discussion then continues on to strategies that can be used for deciding which of the spawned threads should be activated in the available processing units. Another important topic in any speculative multithreading processor is recovery from incorrectly speculated threads. The
chapter discusses different strategies for performing this recovery in multiscalar processors.

Chapters 6 and 7 provide a complete understanding of the microarchitectural aspects of data communication occurring in a multiscalar processor. Chapter 6 discusses issues related to register data flow, whereas chapter 7 focuses on memory data flow. In Chapter 6 we talk about the need to synchronize between a producer thread and a consumer thread, and the use of data value prediction to relax this synchronization. We then go on to discuss different strategies for forwarding register values from producer threads to consumer threads. Compiler support, particularly in providing inter-thread register data dependence information, is discussed next. Finally, the chapter ends with a detailed discussion of a multi-version register file structure for implementing the architected registers and carrying out proper synchronization and communication. This discussion is supported with a detailed working example depicting the structure's operation.

The discussion in chapter 7 on memory data flow parallels the discussion in chapter 6 on register data flow, as there are many similarities between register data flow and memory data flow. A few differences arise, however, owing to the dynamic determination of memory addresses, in contrast to the static determination of register addresses. For memory data flow, inter-thread data dependence speculation is very important, because it is not possible to statically know all of the inter-thread memory data dependences. The hardware structures for managing memory data flow are therefore slightly different from the ones used for managing register data flow. Chapter 7 documents under a common framework well-researched hardware structures for the multiscalar processor such as the address resolution buffer (ARB), the multi-version cache (MVC), and the speculative versioning cache (SVC).

Chapter 8 details the subject of compiling for a multiscalar processor in which threads are formed statically by the compiler. It begins by highlighting the challenges involved in performing a good job of program partitioning. This discussion is followed by a consideration of the cost model used for multiscalar compilation. This cost model includes such factors as thread start and end overheads, thread imbalance overhead, and wait times due to data dependences. Afterwards, the discussion focuses on program transformations that are geared to facilitate multiscalar execution and the creation of better multiscalar threads. The chapter then describes a set of heuristics used for deciding thread boundaries. These heuristics include control flow heuristics, data dependence heuristics, and other special heuristics. After determining the thread boundaries, the multiscalar compiler performs intra-thread scheduling to reduce the wait times due to inter-thread data dependences; a detailed treatment of intra-thread scheduling is presented in this chapter. Finally, register management, thread annotation, and code generation are discussed.
Chapter 9 concludes the book by taking a look at recent developments in multiscalar processing. These include topics such as incorporating fault tolerance, the use of trace-based threads, the hierarchical multiscalar processor, and a commercial implementation of the multiscalar processor. Fault tolerance can be easily incorporated at the PU level by executing the same thread in adjacent PUs and comparing the two sets of results. Features such as these are likely to provide an edge for the multiscalar paradigm in its quest for becoming the paradigm of choice for next-generation processors. The chapter concludes by discussing a commercial implementation named Merlot from NEC.
Chapter 2

THE MULTISCALAR PARADIGM

How to exploit irregular parallelism from non-numeric programs?

We have seen the technological trends that have motivated the development of the multiscalar paradigm. We saw that ILP processing paradigms are unable to extract and exploit parallelism that is present at a distance. They also fail to exploit control independence present in programs. In this chapter, we continue our discussion of the multiscalar paradigm that we began in the last chapter. The multiscalar paradigm not only combines the best of both worlds in TLP extraction (software extraction and hardware extraction), but also exploits the localities of communication present in programs. Because of these and a host of other features, which we will study in this chapter, the multiscalar paradigm is poised to become a cornerstone for future microprocessor design. The name multiscalar is derived from the fact that the overall computing engine is a collection of scalar processors that cooperate in the execution of a sequential program. In the initial phases of its research, the multiscalar paradigm was called the Expandable Split Window (ESW) paradigm [23].

This chapter is organized in seven sections. The first section describes our view of an ideal processing paradigm. The attributes mentioned in Section 2.1 had a significant impact on the development of the multiscalar concept and later became the driving force behind an implementation of the paradigm. Section 2.2 discusses the basics of the multiscalar paradigm. This introduction is followed by a detailed example in Section 2.3 to illustrate the multiscalar execution basics. Section 2.4 describes the interesting and novel aspects of the multiscalar paradigm. Section 2.5 compares and contrasts the multiscalar paradigm with some of the existing processing paradigms such as the multiprocessor, superscalar, and VLIW paradigms.
Section 2.6 introduces a multiscalar processor, one possible implementation of the multiscalar paradigm. Section 2.7 summarizes the chapter by drawing attention to the highlights of the multiscalar paradigm.

2.1 Ideal TLP Processing Paradigm: The Goal

Before embarking on a discussion of the multiscalar paradigm, it is worth our while contemplating the desired features that shaped its development. Ideally, these features should take into consideration the hardware and software technological developments that we expect to see in the next several years. We can categorize the features into those related to software issues and those related to hardware issues. First, let us look at the software issues. These issues can be classified under three attributes, namely practicality, parallelism, and versatility.

1 Practicality: By practicality we mean the ability to execute ordinary programs on the processor. The paradigm should not require the programmers to write programs in specific programming languages; instead, programmers should be given the freedom to write programs in ordinary, imperative languages such as C. The programmers should not be forced to spend too much effort finding the thread-level parallelism in an application. In short, the paradigm should place no unnecessary burden on the programmers to carry out TLP processing.

2 Versatility: As far as possible, the high-level language programs should not be tailored for specific architectures and specific hardware implementations, so that the same high-level language program can be used for a wide variety of architectures and implementations. The programmer should not have to consider the number or logical connectivity of the processing units in the computer system.

3 Parallelism: The compiler should extract the maximum amount of TLP possible at compile time. The compiler could also convey additional information about the program, such as inter-thread register dependences and control flow information, to the hardware. These steps will not only simplify the hardware, but also allow it to concentrate more on extracting the parallelism that can be detected only at run time.

Now let us consider the desired features for the hardware. We classify the desired hardware features under the same three attributes, namely parallelism, practicality, and versatility.

1 Parallelism: The hardware should extract the parallelism that could not be detected at compile time, and should exploit the maximum amount of parallelism possible.
The hardware should be able to execute multiple threads in parallel.

2 Practicality: Here, by practicality we mean realizability of the hardware. That is, the execution model should have attributes that facilitate commercial realization. A processor based on the paradigm should be implementable in technology that we expect to see in the next several years, and the hardware structures should be regular to facilitate implementation with clock speeds comparable to the clock speeds of contemporary processors, resulting in the highest performance processor of a given generation.

3 Versatility: The paradigm should facilitate hardware implementations with no centralized resources. Decentralization of resources is important for future expansion of the system (as allowed by technology improvements in hardware and software). These resources include the hardware for extracting TLP, such as inter-thread register and memory synchronization enforcement and identification of independent instructions; and the hardware for exploiting TLP, such as the instruction supply mechanism, register data flow, and the data memory system. The hardware implementation should be such that it provides an easy growth path from one generation of processors to the next, with minimum hardware and software effort. An easy hardware growth path implies the reuse of hardware components, as much as possible, from one generation to the next.

2.2 Multiscalar Paradigm: The Basic Idea

Realization of the software and hardware features described above has been the main driving force behind the development of the multiscalar paradigm. Bringing all of the above features together requires bringing together in a new manner the worlds of control-driven execution and data-driven execution, and combining the best of both worlds.

The basic idea of the multiscalar paradigm is to split the jobs of TLP extraction and exploitation amongst multiple processing units. Each PU can be assigned a reasonably sized thread, and parallelism can be exploited by overlapping the execution of multiple threads. So far, it looks no different from a conventional multiprocessor. But the difference, a key one indeed, is that the threads being executed in parallel in the multiscalar paradigm can have both control and data dependences between them. Whereas the multiprocessor takes control-independent portions (preferably data-independent as well) of the control flow graph (CFG) of a program, and assigns them to different processing units,
the multiscalar processor takes a sequential instruction stream, and assigns contiguous portions of it to different processing units.

The multiple processing units are connected together as a circular queue. The multiscalar processor traverses the CFG of a program as follows: take a subgraph (thread) T from the CFG and assign it to the tail PU, advance the tail pointer by one PU, do a prediction as to where control is most likely to go after the execution of T, and assign a subgraph starting at that target to the next PU in the next cycle, and so on until the circular queue is full. The assigned threads together encompass a contiguous portion of the dynamic instruction stream. These threads are executed in parallel, although the paradigm preserves logical sequentiality among the threads. The PUs are connected as a circular queue to obtain a sliding or continuous big window (as opposed to a fixed window), a feature that allows more parallelism to be exploited [94]. When the execution of the thread at the head PU is over, the head pointer is advanced by one PU.

A thread could be as simple as a basic block or even part of a basic block. More complex threads could be sequences of basic blocks, entire loops, or even entire function calls. In its most general form, a thread can be any connected subgraph of the control flow graph of the program being executed. The motivation behind considering a subgraph as a thread is to collapse several nodes of the CFG into a single node, as shown later in Figure 2.1. Traversing the CFG in steps of subgraphs helps to tide over the problem of poor predictability of some CFG nodes, by incorporating those nodes within subgraphs. Multiscalar threads, in general, encompass alternate control flow edges (otherwise threads would be nothing other than basic blocks or traces). Threads executed in parallel can have both control dependences and data dependences between them. The execution model within each thread can be a simple, sequential processing paradigm, or more complicated paradigms such as a small-issue VLIW or superscalar paradigm.

Let us throw more light on multiscalar execution. The multiscalar paradigm executes multiple threads in parallel, with distinct PUs. Each of these PUs can be a sequential, single-issue processor. Collectively, several instructions are executed per cycle, one from each thread. Apart from any static code motions done by the compiler, by simultaneously executing instructions from multiple threads, the multiscalar execution moves some instructions "up in time" within the overall dynamic window. That is, some instructions from later in the sequential instruction stream are initiated earlier in time, thereby exploiting parallelism, and decreasing the overall execution time. Notice that the compiler did not give any guarantee that these instructions are independent; the hardware determines the inter-thread dependences (possibly with additional information provided by the compiler), and determines the independent instructions. If a new thread is assigned to a different PU each cycle, collectively the PUs establish a large dynamic window of instructions. If all active PUs execute instructions in parallel, overall the multiscalar processor could be executing multiple instructions per cycle. A sketch of this head-and-tail traversal appears below.
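The following toy C program is one way to make the circular-queue traversal concrete. It is a minimal sketch, not an implementation of any actual multiscalar design: threads are reduced to IDs with a fixed latency, prediction always succeeds, and speculation recovery (discussed in later chapters) is omitted.

    #include <stdio.h>

    #define NUM_PU 4    /* number of processing units (illustrative) */

    /* Toy model: a "thread" is just an id plus a remaining latency. */
    typedef struct { int id; int cycles_left; int active; } PU;

    static int next_id = 0;
    static int predict_next_thread(void) { return next_id++; }
                        /* stand-in for inter-thread control prediction */

    int main(void) {
        PU pu[NUM_PU] = {{0}};
        int head = 0, tail = 0, count = 0, committed = 0;

        while (committed < 10) {                /* run 10 threads, then stop */
            /* Spawn: fill the circular queue from the tail. */
            while (count < NUM_PU) {
                pu[tail].id = predict_next_thread();
                pu[tail].cycles_left = 3;       /* pretend each thread takes 3 cycles */
                pu[tail].active = 1;
                tail = (tail + 1) % NUM_PU;
                count++;
            }
            /* Execute: all active PUs make progress in the same cycle. */
            for (int i = 0; i < NUM_PU; i++)
                if (pu[i].active && pu[i].cycles_left > 0)
                    pu[i].cycles_left--;
            /* Commit: only the head (oldest, non-speculative) thread retires. */
            if (pu[head].active && pu[head].cycles_left == 0) {
                printf("commit thread %d\n", pu[head].id);
                pu[head].active = 0;
                head = (head + 1) % NUM_PU;
                committed++;
                count--;
            }
        }
        return 0;
    }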
2.3 Multiscalar Execution Example

We shall illustrate the details of the working of the multiscalar paradigm with the help of an example. This example is only meant to be illustrative, and is not meant to be exclusive. Consider the simple code fragment shown in Figure 2.1. The figure shows the control flow graph as well as the assembly code within each basic block. The example is a simple loop with a data-dependent conditional branch in the loop body. The loop adds the number 10 to 100 elements of an array A, and sets an element to 1000 if it is greater than 50. The loop body consists of 3 basic blocks, and the overall CFG consists of 4 basic blocks. This example is chosen for its simplicity. Whereas it does not illustrate some of the complexities of the control flow graphs that are generally encountered in practice, it does provide a background for discussing these complexities.

Figure 2.1. Example Control Flow Graph and Code
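The figure itself is not reproduced here. As a stand-in, the following C fragment is a plausible source-level rendering of the loop it depicts, based only on the description above; whether the greater-than-50 test is applied to the updated element is an assumption.

    /* Add 10 to each of the 100 elements of array A; if the (updated)
       element exceeds 50, set it to 1000. The data-dependent conditional
       branch inside the loop gives the 3-basic-block body described in
       the text. */
    for (int i = 0; i < 100; i++) {
        A[i] = A[i] + 10;
        if (A[i] > 50)
            A[i] = 1000;
    }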
On inspection of the assembly code in Figure 2.1, we can see that almost all the instructions of an iteration are data-dependent on previous instructions of the same iteration, and that there is very little ILP in a single iteration of the loop. However, all the iterations are independent (except for the data dependences through the loop induction variable allocated in register R1) because each iteration operates on a different element of the array. Thus, there is significant TLP if each iteration is considered as a separate thread.

Now, let us look at how the multiscalar paradigm executes this loop. The program partitioning process (which is typically done by the compiler) has formed two overlapping static threads from this CFG. The first static thread, T0, encompasses all 4 basic blocks into a single thread. This thread has two possible successors, one of which is T1, and the other is the thread starting at the post-dominator of the loop. The second static thread, T1, begins at the loop starting point, and encompasses one iteration of the loop. This thread also has the same two successors as T0.

At run time, the multiscalar processor forms multiple dynamic threads as shown in Figure 2.2, effectively establishing a large dynamic window of dynamic threads. The large dynamic window encompasses a contiguous portion of the dynamic instruction stream. The multiscalar paradigm executes these multiple threads in parallel, with distinct PUs. Collectively, several instructions are executed per cycle, one from each thread. For instance, consider the shaded horizontal slice in Figure 2.2, which refers to a particular time-frame (cycle). In that cycle, three instructions are executed from the three threads.

Figure 2.2. Multiscalar Execution of Example Code in Figure 2.1
Given the background experience assumed here, it would be coy not to recognize the reader's familiarity with software scheduling techniques such as loop unrolling and software pipelining. However, it cannot be emphasized too often that the multiscalar paradigm is far more general than loop unrolling and other similar techniques for redressing the effect of control dependences. The structure of a multiscalar thread can be as general as a connected subgraph of the control flow graph, and is far more general than a loop body. Let us look in more detail at how inter-thread control dependences and data dependences are handled in the multiscalar paradigm.

2.3.1 Control Dependences

We will first see how inter-thread control dependences are overcome. Once thread T0 is dynamically assigned to PU 0, a prediction is made by the hardware (based on static or dynamic techniques) to determine the next thread to which control will most likely flow after the execution of thread T0. In this example, it determines that control is most likely to go to thread T1, and so in the next cycle, an instance of T1 is spawned and assigned to the next PU. This process is repeated. The type of prediction used by the multiscalar paradigm is called inter-thread control prediction [65]. In the multiscalar paradigm, the execution of all active threads, except the first, is speculative in nature. The hardware provides facilities for recovery when it is determined that an incorrect control flow prediction has been made.

It is important to note that among the two branches in an iteration of the above loop, the first branch, which has poor predictability, has been encompassed within threads so that the control flow prediction need not consider its targets at all while making the prediction. Only the targets of the second branch, which can be predicted with good accuracy, have been included in the thread's successors. Thus, the constraints introduced by control dependences are overcome by doing speculative execution (along the control paths indicated by the light dotted arrows in Figure 2.2), but doing predictions at those points in the control flow graph that are easily predictable. This facilitates the multiscalar hardware in establishing accurate and large dynamic windows.
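As a rough illustration of what inter-thread control prediction might look like in hardware, the sketch below picks between a thread's two possible successors using a table of 2-bit saturating counters. This particular organization is an assumption made for illustration; it is not the control flow prediction scheme of [65], and Chapter 5 discusses real alternatives.

    #include <stdint.h>

    #define TABLE_SIZE 1024

    /* One 2-bit saturating counter per entry, indexed by a hash of the
       completing thread's starting address. A value >= 2 predicts that
       control flows to successor 1, otherwise to successor 0. */
    static uint8_t counter[TABLE_SIZE];

    int predict_successor(uint32_t thread_addr) {
        return counter[thread_addr % TABLE_SIZE] >= 2;
    }

    /* Training: called when the thread's actual successor becomes known. */
    void train_successor(uint32_t thread_addr, int took_successor_1) {
        uint8_t *c = &counter[thread_addr % TABLE_SIZE];
        if (took_successor_1 && *c < 3) (*c)++;
        if (!took_successor_1 && *c > 0) (*c)--;
    }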
2.3.2 Register Data Dependences

Next we will look at how inter-thread register data dependences are handled. These data dependences are taken care of by forwarding the last update of each register in a thread to the subsequent threads, preferably as and when the last updates are generated. In Figure 2.2, the register instances produced in different threads are shown with different subscripts, for example, R1_1, R1_2, and R1_3, and the inter-thread register data dependences are marked by solid arrows. As we can gather from Figure 2.2, the only register data dependences that are carried across the threads are the ones through register R1, which corresponds to the induction variable. Thus, although the instructions of a thread are mostly sequentially dependent, the next thread can start execution once the first instruction of a thread has been executed (in this example), and its result forwarded to the next thread.

2.3.3 Memory Data Dependences

Now let us see how potential inter-thread data dependences through memory, occurring through loads and stores, are handled. These dependences are marked by long dash arrows in Figure 2.2. In a sequential execution of the program, the load of the second iteration is performed after the store of the first iteration, and thus any potential data dependence is automatically taken care of. However, in the multiscalar paradigm, because the two iterations are executed in parallel, it is quite likely that the load of the second iteration may be ready to execute earlier than the store of the first iteration. If a load is made to wait until all preceding stores are executed, then much of the code reordering opportunities are inhibited, and performance may be badly affected. The multiscalar paradigm cannot afford such a callous wait; so it allows memory references to be executed out-of-order, along with special hardware to check if the dynamic reordering of memory references produces any violation of dependences. For this recovery, it is possible to use the same facility that is provided for recovery in times of incorrect control flow prediction. If the dynamic code motion rarely results in a violation of dependences, significantly more parallelism can be exploited. This is a primary mechanism that we use for breaking the restriction due to ambiguous data dependences, which cannot be resolved by static memory disambiguation techniques.
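The checking hardware can be pictured with the following deliberately simplified sketch, loosely in the spirit of the address resolution buffer: for each tracked address it remembers which threads have loaded and stored it, and a store flags any logically later thread that has already loaded the address. All names, sizes, and the flat program-order thread numbering are assumptions made for illustration; address tags, collision handling, and the squash/recovery machinery are omitted. The real structures (ARB, MVC, SVC) are covered in Chapter 7.

    #include <stdint.h>

    #define ARB_ENTRIES 64
    #define MAX_THREADS 8      /* threads numbered in program order */

    typedef struct {
        uint8_t loaded[MAX_THREADS];   /* loaded[t]: thread t read this address */
        uint8_t stored[MAX_THREADS];   /* stored[t]: thread t wrote this address */
    } ArbEntry;

    static ArbEntry arb[ARB_ENTRIES];

    static ArbEntry *lookup(uint32_t addr) { return &arb[addr % ARB_ENTRIES]; }

    void record_load(uint32_t addr, int thread) {
        lookup(addr)->loaded[thread] = 1;
    }

    /* A store by `thread` violates sequential semantics if a logically later
       thread has already loaded this address without first producing its own
       value: that load consumed a stale value. Returns the first violating
       thread (which must be squashed and re-executed), or -1 if the
       speculative reordering was legal. */
    int record_store(uint32_t addr, int thread) {
        ArbEntry *e = lookup(addr);
        e->stored[thread] = 1;
        for (int t = thread + 1; t < MAX_THREADS; t++)
            if (e->loaded[t] && !e->stored[t])
                return t;
        return -1;
    }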
2.4 Interesting Aspects of the Multiscalar Paradigm

The astute reader would have realized by now that the multiscalar paradigm allows very flexible dynamic scheduling that can be assisted with software scheduling. The compiler has a big role to play in bringing to fruition the full capabilities of this paradigm. The compiler decides which parts of the CFG should be brought together as a thread, and performs static scheduling within each thread. The role of the compiler is discussed in great detail in Chapter 8. Figure 2.3 gives a clear picture of where the multiscalar paradigm stands in terms of what is done by software and what is done by hardware. The multiscalar paradigm is grounded on a good interplay between compile-time extraction of ILP and run-time extraction of ILP. Below, we describe the interesting aspects of the multiscalar paradigm.

Figure 2.3. The Multiscalar Execution Model: What is done by Software and What is done by Hardware

Decentralization of Critical Resources: Chapters 4-7 describe one possible hardware implementation of the multiscalar paradigm. Without considering the details of the multiscalar implementation here, we can make one observation about the strategy it employs for decentralizing the critical resources. By splitting the large dynamic window of instructions into smaller threads (cf. Figure 3.7), the complex task of searching a large window for independent instructions is split into two simpler subtasks: (i) independent searches (if need be) in smaller threads, all of which can be done in parallel by separate PUs, and (ii) enforcement of control and data dependences between the threads. This allows the dynamic scheduling hardware to be divided into a two-level hierarchical structure: a distributed top-level unit that enforces dependences between the threads, and several independent lower-level units at the bottom level, each of which enforces dependences within a thread and identifies the independent instructions in that thread. Each of these lower-level units can be a separate PU,
akin to a simple (possibly sequential) execution datapath. A direct outgrowth of the decentralization of critical resources is expandability of the hardware.

Parallel Execution of Multiple Threads: The multiscalar paradigm is specially geared to execute multiple threads in parallel. While partitioning a program into threads, as far as possible, an attempt is made to generate threads that are control-independent of each other, so that the multiscalar hardware can execute non-speculative threads in parallel. However, most non-numeric programs have such complex flows of control that finding non-speculative threads of reasonable size is often infeasible. So, the multiscalar solution is to execute possibly control-dependent, and possibly data-dependent, threads in parallel, in a speculative manner. Thus, as far as possible, an attempt is made to demarcate threads at those points where it is easy to speculate the next thread to be executed when control leaves a thread (although the exact path taken through the thread may vary in different dynamic instances). Such a division into threads will not only allow the overall large window to be accurate, but also facilitate the execution of (mostly) control-independent code in parallel, thereby pursuing multiple flows of control, which is needed to exploit significant levels of parallelism in non-numeric applications [50]. By encompassing complex control structures within a thread, the overall prediction accuracy is significantly improved.

Speculative Execution: The multiscalar paradigm is an epitome of speculative execution; almost all of the execution in the multiscalar hardware is speculative in nature. At any time, the only thread that is guaranteed to be executed non-speculatively is the sequentially earliest thread that is being executed at that time. There are different kinds of speculative execution taking place across threads in the multiscalar hardware: (i) speculative execution of control-dependent code across threads, and (ii) speculative execution of loads before stores from preceding threads, and stores before loads and stores from preceding threads. The importance of speculative execution for exploiting parallelism in non-numeric codes was underscored in [50].

Parallel Execution of Data-Dependent Threads: Another important feature and big advantage of the multiscalar paradigm is that it does not require the threads executed in parallel to be data independent either. If inter-thread dependences are present, either through registers or through memory locations, the hardware automatically enforces these dependences. This feature gives significant flexibility to the compiler. It is worthwhile to point out, however, that although the execution of data-dependent threads can be overlapped, the partitioning agent can and should as far as possible attempt to pack data-dependent instructions into the same thread, so that at run time the threads can be executed
  • 55. Another Random Document on Scribd Without Any Related Topics
  • 59. The Project Gutenberg eBook of Kinship and Social Organisation
  • 60. This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: Kinship and Social Organisation Author: W. H. R. Rivers Release date: January 22, 2014 [eBook #44728] Most recently updated: October 24, 2024 Language: English Credits: Produced by Henry Flower and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images generously made available by The Internet Archive/Canadian Libraries) *** START OF THE PROJECT GUTENBERG EBOOK KINSHIP AND SOCIAL ORGANISATION ***
  • 62. STUDIES IN ECONOMIC AND POLITICAL SCIENCE Edited by the HON. W. PEMBER REEVES Director of the London School of Economics and Political Science No. 36 in the Series of Monographs by Writers connected with the London School of Economics and Political Science. KINSHIP AND SOCIAL ORGANISATION
  • 63. Kinship and Social Organisation By W. H. R. RIVERS, m.d., f.r.s., Fellow of St. John’s College, Cambridge LONDON CONSTABLE & CO LTD 1914
  • 64. CONTENTS PAGE PREFACE vii. LECTURE I 1 LECTURE II 28 LECTURE III 60 INDEX 95
  • 65. PREFACE. These lectures were delivered at the London School of Economics in May of the present year. They are largely based on experience gained in the work of the Percy Sladen Trust Expedition to Melanesia of 1908, and give a simplified record of social conditions which will be described in detail in the full account of the work of that expedition. A few small additions and modifications have been made since the lectures were given, some of these being due to suggestions made by Professor Westermarck and Dr. Malinowski in the discussions which followed the lectures. I am also indebted to Miss B. Freire- Marreco for allowing me to refer to unpublished material collected during her recent work among the Pueblo Indians of North America. W. H. R. Rivers. St. John’s College, Cambridge. November 19th, 1913.
  • 67. LECTURE I The aim of these lectures is to demonstrate the close connection which exists between methods of denoting relationship or kinship and forms of social organisation, including those based on different varieties of the institution of marriage. In other words, my aim will be to show that the terminology of relationship has been rigorously determined by social conditions and that, if this position has been established and accepted, systems of relationship furnish us with a most valuable instrument in studying the history of social institutions. In the controversy of the present and of recent times, it is the special mode of denoting relationship known as the classificatory system which has formed the chief subject of discussion. It is in connection with this system that there have arisen the various vexed questions which have so excited the interest—I might almost say the passions—of sociologists during the last quarter of a century. I am afraid it would be dangerous to assume your familiarity with this system, and I must therefore begin with a brief description of its main characters. The essential feature of the classificatory system, that to which it owes its name, is the application of its terms, not to single individual persons, but to classes of relatives which may often be very large. Objections have been made to the use of the term “classificatory” on the ground that our own terms of relationship also apply to classes of persons; the term “brother,” for instance, to all the male children of the same father and mother, the term “uncle” to all the brothers of the father and mother as well as to the husband of an aunt, while the term “cousin” may denote a still larger class. It is, of course, true that many of our own terms of relationship apply to classes of persons, but in the systems to which the word
  • 68. “classificatory” is usually applied, the classificatory principle applies far more widely, and in some cases even, more logically and consistently. In the most complete form of the classificatory system there is not one single term of relationship the use of which tells us that reference is being made to one person and to one person only, whereas in our own system there are six such terms, viz., husband, wife, father, mother, father-in-law and mother-in-law. In those systems in which the classificatory principle is carried to its extreme degree every term is applied to a class of persons. The term “father,” for instance, is applied to all those whom the father would call brother, and to all the husbands of those whom the mother calls sister, both brother and sister being used in a far wider sense than among ourselves. In some forms of the classificatory system the term “father” is also used for all those whom the mother would call brother, and for all the husbands of those whom the father would call sister, and in other systems the application of the term may be still more extensive. Similarly, the term used for the wife may be applied to all those whom the wife would call sister and to the wives of all those whom the speaker calls brother, brother and sister again being used in a far wider sense than in our own language. The classificatory system has many other features which mark it off more or less sharply from our own mode of denoting relationship, but I do not think it would be profitable to attempt a full description at this stage of our enquiry. As I have said, the object of these lectures is to show how the various features of the classificatory system have arisen out of, and can therefore be explained historically by, social facts. If you are not already acquainted with these features, you will learn to know them the more easily if at the same time you learn how they have come into existence. I will begin with a brief history of the subject. So long as it was supposed that all the peoples of the world denoted relationship in the same way, namely, that which is customary among ourselves, there was no problem. There was no reason why the subject should have awakened any interest, and so far as I have been able to find, it is only since the discovery of the classificatory system of
  • 69. relationship that the problem now before us was ever raised. I imagine that, if students ever thought about the matter at all, it must have seemed obvious that the way in which they and the other known peoples of the world used terms of relationship was conditioned and determined by the social relations which the terms denoted. The state of affairs became very different as soon as it was known that many peoples of the world use terms of relationship in a manner, and according to rules, so widely different from our own that they seem to belong to an altogether different order, a difference well illustrated by the confusion which is apt to arise when we use English words in the translation of classificatory terms or classificatory terms as the equivalents of our own. The difficulty or impossibility of conforming to complete truth and reality, when we attempt this task, is the best witness to the fundamental difference between the two modes of denoting relationship. I do not know of any discovery in the whole range of science which can be more certainly put to the credit of one man than that of the classificatory system of relationship by Lewis Morgan. By this I mean, not merely that he was the first to point out clearly the existence of this mode of denoting relationship, but that it was he who collected the vast mass of material by which the essential characters of the system were demonstrated, and it was he who was the first to recognise the great theoretical importance of his new discovery. It is the denial of this importance by his contemporaries and successors which furnishes the best proof of the credit which is due to him for the discovery. The very extent of the material he collected[1] has probably done much to obstruct the recognition of the importance of his work. It is a somewhat discouraging thought that, if Morgan had been less industrious and had amassed a smaller collection of material which could have been embodied in a more available form, the value of his work would probably have been far more widely recognised than it is to-day. The volume of his material is, however, only a subsidiary factor in the process which has led to
  • 70. the neglect or rejection of the importance of Morgan’s discovery. The chief cause of the neglect is one for which Morgan must himself largely bear the blame. He was not content to demonstrate, as he might to some extent have done from his own material, the close connection between the terminology of the classificatory system of relationship and forms of social organisation. There can be little doubt that he recognised this connection, but he was not content to demonstrate the dependence of the terminology of relationship upon social forms the existence of which was already known, or which were capable of demonstration with the material at his disposal. He passed over all these early stages of the argument, and proceeded directly to refer the origin of the terminology to forms of social organisation which were not known to exist anywhere on the earth and of which there was no direct evidence in the past. When, further, the social condition which Morgan was led to formulate was one of general promiscuity developing into group-marriage, conditions bitterly repugnant to the sentiments of most civilised persons, it is not surprising that he aroused a mass of heated opposition which led, not merely to widespread rejection of his views, but also to the neglect of lessons to be learnt from his new discovery which must have received general recognition long before this, if they had not been obscured by other issues. The first to take up the cudgels in opposition to Morgan was our own pioneer in the study of the early forms of human society, John Ferguson McLennan.[2] He criticised the views of Morgan severely and often justly, and then pointing out, as was then believed to be the case, that no duties or rights were connected with the relationships of the classificatory system, he concluded that the terms formed merely a code of courtesies and ceremonial addresses for social intercourse. Those who have followed him have usually been content to repeat the conclusion that the classificatory system is nothing more than a body of mutual salutations and terms of address. They have failed to see that it still remains necessary to explain how the terms of the classificatory system came to be used in mutual salutation. They have failed to recognise that they were
  • 71. either rejecting the principle of determinism in sociology, or were only putting back to a conveniently remote distance the consideration of the problem how and why the classificatory terms came to be used in the way now customary among so many peoples of the earth. This aspect of the problem, which has been neglected or put on one side by the followers of McLennan, was not so treated by McLennan himself. As we should expect from the general character of his work, McLennan clearly recognised that the classificatory system must have been determined by social conditions, and he tried to show how it might have arisen as the result of the change from the Nair to the Tibetan form of polyandry.[3] He even went so far as to formulate varieties of this process by means of which there might have been produced the chief varieties of the classificatory system, the existence of which had been demonstrated by Morgan. It is quite clear that McLennan had no doubts about the necessity of tracing back the social institution of the classificatory system of relationship to social causes, a necessity which has been ignored or even explicitly denied by those who have followed him in rejecting the views of Morgan. It is one of the many unfortunate consequences of McLennan’s belief in the importance of polyandry in the history of human society that it has helped to prevent his followers from seeing the social importance of the classificatory system. They have failed to see that the classificatory system may be the result neither of promiscuity nor of polyandry, and yet have been determined, both in its general character and in its details, by forms of social organisation. Since the time of Morgan and McLennan few have attempted to deal with the question in any comprehensive manner. The problem has inevitably been involved in the controversy which has raged between the advocates of the original promiscuity or the primitive monogamy of mankind, but most of the former have been ready to accept Morgan’s views blindly, while the latter have been content to try to explain away the importance of conclusions derived from the
  • 72. classificatory system without attempting any real study of the evidence. On the side of Morgan there has been one exception in the person of Professor J. Kohler,[4] who has recognised the lines on which the problem must be studied, while on the other side there has been, so far as I am aware, only one writer who has recognised that the evidence from the nature of the classificatory system of relationship cannot be ignored or belittled, but must be faced and some explanation alternative to that of Morgan provided. This attempt was made four years ago by Professor Kroeber,[5] of the University of California. The line he takes is absolutely to reject the view common to both Morgan and McLennan that the nature of the classificatory system has been determined by social conditions. He explicitly rejects the view that the mode of using terms of relationship depends on social causes, and puts forward as the alternative that they are conditioned by causes purely linguistic and psychological. It is not quite easy to understand what is meant by the linguistic causation of terms of relationship. In the summary at the end of his paper Kroeber concludes that “they (terms of relationship) are determined primarily by language.” Terms of relationship, however, are elements of language, so that Kroeber’s proposition is that elements of language are determined primarily by language. In so far as this proposition has any meaning, it must be that, in the process of seeking the origin of linguistic phenomena, it is our business to ignore any but linguistic facts. It would follow that the student of the subject should seek the antecedents of linguistic phenomena in other linguistic phenomena, and put on one side as not germane to his task all reference to the objects and relations which the words denote and connote. Professor Kroeber’s alternative proposition is that terms of relationship reflect psychology, not sociology, or, in other words, that the way in which terms of relationship are used depends on a chain of causation in which psychological processes are the direct antecedents of this use. I will try to make his meaning clear by
means of an instance which he himself gives. He says that at the present time there is a tendency among ourselves to speak of the brother-in-law as a brother; in other words, we tend to class the brother-in-law and the brother together in the nomenclature of our own system of relationship. He supposes that we do this because there is a psychological similarity between the two relationships which leads us to class them together in our customary nomenclature. I shall return both to this and other of his examples later.

We have now seen that the opponents of Morgan have taken up two main positions which it is possible to attack: one, that the classificatory system is nothing more than a body of terms of address; the other, that it and other modes of denoting relationship are determined by psychological and not by sociological causes. I propose to consider these two positions in turn.

Morgan himself was evidently deeply impressed by the function of the classificatory system of relationship as a body of salutations. His own experience was derived from the North American Indians, and he notes the exclusive use of terms of relationship in address, a usage so habitual that an omission to recognise a relative in this manner would amount almost to an affront. Morgan also points out, as one motive for the custom, the presence of a reluctance to utter personal names. McLennan had to rely entirely on the evidence collected by Morgan, and there can be no doubt that he was greatly influenced by the stress Morgan himself laid on the function of the classificatory terms as mutual salutations.

That in rude societies certain relatives have social functions definitely assigned to them by custom was known in Morgan’s time, and I think it might even then have been discovered that the relationships which carried these functions were of the classificatory kind. It is, however, only by more recent work, beginning with that of Howitt, of Spencer and Gillen, and of Roth in Australia, and of the Cambridge Expedition to Torres Straits, that the great importance of the functions of relatives through the classificatory system has been forced upon the attention of sociologists. The social and ceremonial proceedings of the
Australian aborigines abound in features in which special functions are performed by such relatives as the elder brother or the brother of the mother, while in Torres Straits I was able to record large groups of duties, privileges and restrictions associated with different classificatory relationships. Further work has shown that widely, though not universally, the nomenclature of the classificatory system carries with it a number of clearly defined social practices. One who applies a given term of relationship to another person has to behave towards that person in certain definite ways. He has to perform certain duties towards him, and enjoys certain privileges, and is subject to certain restrictions in his conduct in relation to him. These duties, privileges and restrictions vary greatly in number among different peoples, but wherever they exist, I know of no exception to their importance and to the regard in which they are held by all members of the community.

You doubtless know of many examples of such functions associated with relationship, and I need give only one. In the Banks Islands the term used between two brothers-in-law is wulus, walus, or walui, and a man who applies one of these terms to another may not utter his name, nor may the two behave familiarly towards one another in any way. In one island, Merlav, these relatives have all their possessions in common, and it is the duty of one to help the other in any difficulty, to warn him in danger, and, if need be, to die with him. If one dies, the other has to help to support his widow and has to abstain from certain foods.

Further, there are a number of curious regulations in which the sanctity of the head plays a great part. A man must take nothing from above the head of his brother-in-law, nor may he even eat a bird which has flown over his head. A person has only to say of an object “That is the head of your brother-in-law,” and the person addressed will have to desist from the use of the object. If the object is edible, it may not be eaten; if it is one which is being manufactured, such as a mat, the person addressed will have to cease from his work if the object be thus called the head of his brother-in-law. He will only be allowed to finish it on making compensation, not to the person who
has prevented the work by reference to the head, but to the brother-in-law whose head had been mentioned.

Ludicrous as some of these customs may seem to us, they are very far from being so to those who practise them. They show clearly the very important part taken in the lives of those who use the classificatory system by the social functions associated with relationship. As I have said, these functions are not universally associated with the classificatory system, but they are very general in many parts of the world and only need more careful investigation to be found even more general and more important than appears at present.

Let us now look at our own system of relationship from this point of view. Two striking features present themselves: first, the great paucity of definite social functions associated with relationship; and secondly, the almost complete limitation of such functions to those relationships which apply only to individual persons and not to classes of persons. Of such relationships as cousin, uncle, aunt, father-in-law, or mother-in-law there may be said to be no definite social functions. A school-boy believes it is the duty of his uncle to tip him, but this is about as near as one can get to any social obligation on the part of this relative. The same will be found to hold good to a large extent if we turn to those social regulations which have been embodied in our laws. It is only in the case of the transmission of hereditary rank and of the property of a person dying intestate that more distant relatives are brought into any legal relationship with one another, and then only if there is an absence of nearer relatives. It is only when forced to do so by exceptional circumstances that the law recognises any of the persons to whom the more classificatory of our terms of relationship apply. If we pay regard to the social functions associated with relationship, it is our own system, rather than the classificatory, which is open to the reproach that its relationships carry with them no rights and duties.

In the course of the recent work of the Percy Sladen Trust Expedition in Melanesia and Polynesia I have been able to collect a
body of facts which bring out, even more clearly than has hitherto been recognised, the dependence of classificatory terms on social rights.[6] The classificatory systems of Oceania vary greatly in character. In some places relationships are definitely distinguished in nomenclature which are classed with other relationships elsewhere. Thus, while most Melanesian and some Polynesian systems have a definite term for the mother’s brother and for the class of relatives whom the mother calls brother, in other systems this relative is classed with, and is denoted by, the same term as the father.

The point to which I now call your attention is that there is a very close correlation between the presence of a special term for this relative and the presence of special functions attached to the relationship. In Polynesia, both the Hawaiians and the inhabitants of Niue class the mother’s brother with the father, and in neither place was I able to discover that there were any special duties, privileges or restrictions ascribed to the mother’s brother. In the Polynesian islands of Tonga and Tikopia, on the other hand, where there are special terms for the mother’s brother, this relative has also special functions. The only place in Melanesia where I failed to find a special term for the mother’s brother was in the western Solomon Islands, and that was also the only part of Melanesia where I failed to find any trace of special social functions ascribed to this relative. I do not know of such functions in Santa Cruz, but my information about the system of that island is derived from others, and further research will almost certainly show that they are present.

In my own experience, then, among two different peoples, I have been able to establish a definite correlation between the presence of a term of relationship and special functions associated with the relationship. Information kindly given to me by Father Egidi, however, seems to show that the correlation among the Melanesians is not complete. In Mekeo, the mother’s brother has the duty of putting on the first perineal garment of his nephew, but he has no special term and is classed with the father. Among the Kuni, on the other hand, there is a definite term for the mother’s brother
distinguishing him from the father, but yet he has not, so far as Father Egidi knows, any special functions.

Both in Melanesia and Polynesia a similar correlation comes out in connection with other relationships, the most prominent exception being the absence of a special term for the father’s sister in the Banks Islands, although this relative has very definite and important functions. In these islands the father’s sister is classed with the mother as vev or veve, but even here, where the generalisation seems to break down, it does not do so completely, for the father’s sister is distinguished from the mother as veve vus rawe, the mother who kills a pig, as opposed to the simple veve used for the mother and her sisters.

There is thus definite evidence, not only for the association of classificatory terms of relationship with special social functions, but from one part of the world we now have evidence which shows that the presence or absence of special terms is largely dependent on whether there are or are not such functions. We may take it as established that the terms of the classificatory system are not, as McLennan supposed, merely terms of address and modes of mutual salutation. McLennan came to this conclusion because he believed that the classificatory terms were associated with no such functions as those of which we now have abundant evidence. He asks, “What duties or rights are affected by the relationships comprised in the classificatory system?” and answers himself according to the knowledge at his disposal, “Absolutely none.”[7] This passage makes it clear that, if McLennan had known what we know to-day, he would never have taken up the line of attack upon Morgan’s position in which he has had, and still has, so many followers.

I can now turn to the second line of attack, that which boldly discards the origin of the terminology of relationship in social conditions, and seeks for its explanation in psychology. The line of argument I propose to follow is first to show that many details of classificatory systems have been directly determined by social
factors. If that task can be accomplished, we shall have firm ground from which to take off in the attempt to refer the general characters of the classificatory and other systems of relationship to forms of social organisation. Any complete theory of a social institution has not only to account for its general characters, but also for its details, and I propose to begin with the details.

I must first return to the history of the subject, and stay for a moment to ask why the line of argument I propose to follow was not adopted by Morgan and has been so largely disregarded by others. Whenever a new phenomenon is discovered in any part of the world, there is a natural tendency to seek for its parallels elsewhere. Morgan lived at a time when the unity of human culture was a topic which greatly excited ethnologists, and it is evident that one of his chief interests in the new discovery arose from the possibility it seemed to open of showing the uniformity of human culture. He hoped to demonstrate the uniformity of the classificatory system throughout the world, and he was content to observe certain broad varieties of the system and refer them to supposed stages in the history of human society. He paid but little attention to such varieties of the classificatory system as are illustrated in his own record of North American systems, and seems to have overlooked entirely certain features of the Indian and Oceanic systems he recorded, which might have enabled him to demonstrate the close relation between the terminology of relationship and social institutions.

Morgan’s neglect to attend to these differences must be ascribed in some measure to the ignorance of rude forms of social organisation which existed when he wrote, but the failure of others to recognise the dependence of the details of classificatory systems upon social institutions is rather to be ascribed to the absence of interest in the subject induced by their adherence to McLennan’s primary error. Those who believe that the classificatory system is merely an unimportant code of mutual salutations are not likely to attend to relatively minute differences in the customs they despise. The credit of having been the first fully to recognise the social importance of these differences belongs to J. Kohler. In his book “Zur Urgeschichte der Ehe,” which I have already mentioned, he studied minutely the details of many different systems, and showed that they could be explained by certain forms of marriage practised by those who use the terms.

I propose now to deal with classificatory terminology from this point of view. My procedure will be first to show that the details which distinguish different forms of the classificatory system from one another have been directly determined by the social institutions of those who use the systems, and only when this has been established, shall I attempt to bring the more general characters of the classificatory and other systems into relation with social institutions. I am able to carry out this task more fully than has hitherto been possible because I have collected in Melanesia a number of systems of relationship which differ far more widely from one another than those recorded in Morgan’s book or others which have been collected since. Some of the features which characterise these Melanesian systems will be wholly new to ethnologists, not having yet been recorded elsewhere, but I propose to begin with a long familiar mode of terminology which accompanies that widely distributed custom known as the cross-cousin marriage.

In the more frequent form of this marriage a man marries the daughter either of his mother’s brother or of his father’s sister; more rarely his choice is limited to one of these relatives. Such a marriage will have certain definite consequences. Let us take a case in which a man marries the daughter of his mother’s brother, as is represented in the following diagram:

Diagram 1[8]
One consequence of the marriage between C and d will be that A, who before the marriage of C was only his mother’s brother, now becomes also his wife’s father, while b, who before the marriage was the mother’s brother’s wife of C, now becomes his wife’s mother. Reciprocally, C, who before his marriage had been the sister’s son of A and the husband’s sister’s son of b, now becomes their son-in-law. Further, E and f, the other children of A and b, who before the marriage had been only the cousins of C, now become his wife’s brother and sister. Similarly, a, who before the marriage of d was her father’s sister, now becomes also her husband’s mother, and B, her father’s sister’s husband, comes to stand in the relation of husband’s father; if C should have any brothers and sisters, these cousins now become her brothers- and sisters-in-law.

The combinations of relationship which follow from the marriage of a man with the daughter of his mother’s brother thus differ for a man and a woman, but if, as is usual, a man may marry the daughter either of his mother’s brother or of his father’s sister, these combinations of relationship will hold good for both men and women.

Another and more remote consequence of the cross-cousin marriage, if this become an established institution, is that the relationships of mother’s brother and father’s sister’s husband will come to be combined in one and the same person, and that there will be a similar combination of the relationships of father’s sister and mother’s brother’s wife. If the cross-cousin marriage be the habitual custom, B and b in Diagram 1 will be brother and sister; in consequence A will be at once the mother’s brother and the father’s sister’s husband of C, while b will be both his father’s sister and his mother’s brother’s wife. Since, however, the mother’s brother is also the father-in-law, and the father’s sister the mother-in-law, three different relationships will be combined in each case. Through the cross-cousin marriage the relationships of mother’s brother, father’s sister’s husband and father-in-law will be combined in one and the
same person, and the relationships of father’s sister, mother’s brother’s wife and mother-in-law will be similarly combined.

In many places where we know the cross-cousin marriage to be an established institution, we find just those common designations which I have described. Thus, in the Mbau dialect of Fiji the word vungo is applied to the mother’s brother, the husband of the father’s sister and the father-in-law. The word nganei is used for the father’s sister, the mother’s brother’s wife and the mother-in-law. The term tavale is used by a man for the son of the mother’s brother or of the father’s sister as well as for the wife’s brother and the sister’s husband. Ndavola is used not only for the child of the mother’s brother or father’s sister when differing in sex from the speaker, but this word is also used by a man for his wife’s sister and his brother’s wife, and by a woman for her husband’s brother and her sister’s husband. Every one of these details of the Mbau system is the direct and inevitable consequence of the cross-cousin marriage, if it become an established and habitual practice.

This Fijian system does not stand alone in Melanesia. In the southern islands of the New Hebrides, in Tanna, Eromanga, Anaiteum and Aniwa, the cross-cousin marriage is practised and their systems of relationship have features similar to those of Fiji. Thus, in Anaiteum the word matak applies to the mother’s brother, the father’s sister’s husband and the father-in-law, while the word engak used for the cross-cousin is not only used for the wife’s sister and the brother’s wife, but also for the wife herself.

Again, in the island of Guadalcanar in the Solomons the system of relationship is just such as would result from the cross-cousin marriage. One term, nia, is used for the mother’s brother and the wife’s father, and probably also for the father’s sister’s husband and the husband’s father, though my stay in the island was not long enough to enable me to collect sufficient genealogical material to demonstrate these points completely. Similarly, tarunga includes in its connotation the father’s sister, the mother’s brother’s wife and the wife’s mother, and probably also the husband’s mother, while the
word iva is used for both cross-cousins and brothers- and sisters-in-law. Corresponding to this terminology there seemed to be no doubt that it was the custom for a man to marry the daughter of his mother’s brother or his father’s sister, though I was not able to demonstrate this form of marriage genealogically.

These three regions, Fiji, the southern New Hebrides and Guadalcanar, are the only parts of Melanesia included in my survey where I found the practice of the cross-cousin marriage, and in all three regions the systems of relationship are just such as would follow from this form of marriage.

Let us now turn to inquire how far it is possible to explain these features of Melanesian systems of relationship by psychological similarity. If it were not for the cross-cousin marriage, what can there be to give the mother’s brother a greater psychological similarity to the father-in-law than the father’s brother, or the father’s sister a greater similarity to the mother-in-law than the mother’s sister? Why should it be two special kinds of cousin who are classed with two special kinds of brother- and sister-in-law or with the husband or wife? Once granted the presence of the cross-cousin marriage, and there are psychological similarities certainly, though even here the matter is not quite straightforward from the point of view of the believer in their importance, for we have to do not merely with the similarity of two relatives, but with their identity, with the combination of two or more relationships in one and the same person. Even if we put this on one side, however, it remains to ask how it is possible to say that terms of relationship do not reflect sociology, if such psychological similarities are themselves the result of the cross-cousin marriage? What point is there in bringing in hypothetical psychological similarities which are only at the best intermediate links in the chain of causation connecting the terminology of relationship with antecedent social conditions?

If you concede the causal relation between the characteristic features of a Fijian or Anaiteum or Guadalcanar system and the cross-cousin marriage, there can be no question that it is the cross-cousin marriage which is the antecedent and the features of the system of relationship the consequences. I do not suppose that, even in this subject, there will be found anyone to claim that the Fijians took to marrying their cross-cousins because such a marriage was suggested to them by the nature of their system of relationship. We have to do in this case, not merely with one or two features which might be the consequence of the cross-cousin marriage, but with a large and complicated meshwork of resemblances and differences in the nomenclature of relationship, each and every element of which follows directly from such a marriage, while no one of the systems I have considered possesses a single feature which is not compatible with social conditions arising out of this marriage. Apart from quantitative verification, I doubt whether it would be possible in the whole range of science to find a case where we can be more confident that one phenomenon has been conditioned by another. I feel almost guilty of wasting your time by going into it so fully, and should hardly have ventured to do so if this case of social causation had not been explicitly denied by one with so high a reputation as Professor Kroeber. I hope, however, that the argument will be useful as an example of the method I shall apply to other cases in which the evidence is less conclusive.

The features of terminology which follow from the cross-cousin marriage were known to Morgan, being present in three of the systems he recorded from Southern India and in the Fijian system collected for him by Mr. Fison. The earliest reference[9] to the cross-cousin marriage which I have been able to discover is among the Gonds of Central India. This marriage was recorded in 1870, which, though earlier than the appearance of Morgan’s book, was after it had been accepted for publication, so that I think we can be confident that Morgan was unacquainted with the form of marriage which would have explained the peculiar features of the Indian and Fijian systems. It is evident, however, that Morgan was so absorbed in his demonstration of the similarity of these systems to those of America that he paid but little, if any, attention to their peculiarities.

He thus lost a great opportunity; if he had attended to these peculiarities and had seen their meaning, he might have predicted a form of marriage which would soon afterwards have been independently discovered. Such an example of successful prediction would have forced the social significance of the terminology of relationship upon the attention of students in such a way that we should have been spared much of the controversy which has so long obstructed progress in this branch of sociology. It must at the very least have acted as a stimulus to the collection of systems of relationship. It would hardly have been possible that now, more than forty years after the appearance of Morgan’s book, we should still be in complete ignorance of the terminology of relationship of many peoples about whom volumes have been written. It would seem impossible, for instance, that our knowledge of Indian systems of relationship could have been what it is to-day. India would have been the country in which the success of Morgan’s prediction would first have shown itself, and such an event must have prevented the almost total neglect which the subject of relationship has suffered at the hands of students of Indian sociology.