Thomas Rauber · Gudula Rünger
Parallel Programming
For Multicore and Cluster Systems
Thomas Rauber
Universität Bayreuth
Computer Science Department
95440 Bayreuth
Germany
rauber@uni-bayreuth.de
Gudula Rünger
Technische Universität Chemnitz
Computer Science Department
09107 Chemnitz
Germany
ruenger@informatik.tu-chemnitz.de
ISBN 978-3-642-04817-3 e-ISBN 978-3-642-04818-0
DOI 10.1007/978-3-642-04818-0
Springer Heidelberg Dordrecht London New York
ACM Computing Classification (1998): D.1, C.1, C.2, C.4
Library of Congress Control Number: 2009941473
© Springer-Verlag Berlin Heidelberg 2010
This is an extended English language translation of the German language edition:
Parallele Programmierung (2nd edn.) by T. Rauber and G. Rünger
Published in the book series: Springer-Lehrbuch
Copyright © Springer-Verlag Berlin Heidelberg 2007
Springer-Verlag is part of Springer Science+Business Media.
All Rights Reserved.
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations
are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Cover design: KuenkelLopka GmbH, Heidelberg
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Innovations in hardware architecture, like hyperthreading or multicore processors,
make parallel computing resources available for inexpensive desktop computers.
However, the use of these innovations requires parallel programming techniques.
In a few years, many standard software products will be based on concepts of
parallel programming to use the hardware resources of future multicore proces-
sors efficiently. Thus, the need for parallel programming will extend to all areas
of software development. The application area will be much larger than the area
of scientific computing, which used to be the main area for parallel computing for
many years. The expansion of the application area for parallel computing will lead to
an enormous need for software developers with parallel programming skills. Some
chip manufacturers already demand that parallel programming be included as a standard
course in computer science curricula.
This book takes up the new development in processor architecture by giving a
detailed description of important parallel programming techniques that are neces-
sary for developing efficient programs for multicore processors as well as for par-
allel cluster systems or supercomputers. Both shared and distributed address space
architectures are covered. The main goal of the book is to present parallel program-
ming techniques that can be used in many situations for many application areas
and to enable the reader to develop correct and efficient parallel programs. Many
example programs and exercises are provided to support this goal and to show how
the techniques can be applied to further applications. The book can be used as both
a textbook for students and a reference book for professionals. The material of the
book has been used for courses in parallel programming at different universities for
many years.
This is the third version of the book on parallel programming. The first two ver-
sions have been published in German in the years 2000 and 2007, respectively. This
new English version is an updated and revised version of the newest German edition
of the book. The update especially covers new developments in the area of multicore
processors as well as a more detailed description of OpenMP and Java threads.
The content of the book consists of three main parts, covering all areas of par-
allel computing: the architecture of parallel systems, parallel programming models
and environments, and the implementation of efficient application algorithms. The
emphasis lies on parallel programming techniques needed for different architec-
tures.
The first part contains an overview of the architecture of parallel systems, includ-
ing cache and memory organization, interconnection networks, routing and switch-
ing techniques, as well as technologies that are relevant for modern and future mul-
ticore processors.
The second part presents parallel programming models, performance models,
and parallel programming environments for message passing and shared memory
models, including MPI, Pthreads, Java threads, and OpenMP. For each of these
parallel programming environments, the book gives basic concepts as well as more
advanced programming methods and enables the reader to write and run semanti-
cally correct and efficient parallel programs. Parallel design patterns like pipelining,
client–server, and task pools are presented for different environments to illustrate
parallel programming techniques and to facilitate the implementation of efficient
parallel programs for a wide variety of application areas. Performance models and
techniques for runtime analysis are described in detail, as they are a prerequisite for
achieving efficiency and high performance.
The third part applies the programming techniques from the second part to repre-
sentative algorithms from scientific computing. The emphasis lies on basic methods
for solving linear equation systems, which play an important role in many scientific
simulations. The focus of the presentation lies on the analysis of the algorithmic
structure of the different algorithms, which is the basis for parallelization, and not on
the mathematical properties of the solution methods. For each algorithm, the book
discusses different parallelization variants, using different methods and strategies.
Many colleagues and students have helped to improve the quality of this book.
We would like to thank all of them for their help and constructive criticisms. For
numerous corrections and suggestions we would like to thank Jörg Dümmler, Mar-
vin Ferber, Michael Hofmann, Ralf Hoffmann, Sascha Hunold, Matthias Korch,
Raphael Kunis, Jens Lang, John O’Donnell, Andreas Prell, Carsten Scholtes, and
Michael Schwind. Many thanks to Matthias Korch, Carsten Scholtes, and Michael
Schwind for help with the exercises. We thank Monika Glaser for her help and
support with the LaTeX typesetting of the book. We also thank all the people who
have been involved in the writing of the first two German versions of this book. It
has been a pleasure working with the Springer Verlag in the development of this
book. We especially thank Ralf Gerstner for his support and patience.
Bayreuth Thomas Rauber
Chemnitz Gudula Rünger
August 2009
Chapter 1
Introduction
In this short introduction, we give an overview of the use of parallelism and try
to explain why parallel programming will be used for software development in the
future. We also give an overview of the rest of the book and show how it can be used
for courses with various foci.
1.1 Classical Use of Parallelism
Parallel programming and the design of efficient parallel programs have been well
established in high-performance scientific computing for many years. The simu-
lation of scientific problems is an area of growing importance in the natural and
engineering sciences. More precise simulations or simulations of larger problems
require ever greater computing power and memory space. In the last decades,
research in high-performance computing has included new developments in parallel
hardware and software technologies, and steady progress in parallel high-performance
computing can be observed. Popular examples are weather forecasting simulations
based on complex mathematical models involving partial differential equations and
crash simulations from the automotive industry based on finite element methods.
Other examples include drug design and computer graphics applications for the
film and advertising industries. Depending on the specific application, computer simu-
lation is the main method to obtain the desired result or it is used to replace or
enhance physical experiments. A typical example of the first application area is
weather forecasting, where the future development of the atmosphere has to be
predicted, which can only be done by simulation. In the second application area,
computer simulations are used to obtain results that are more precise than results
from practical experiments or that can be performed with less financial effort. An
example is the use of simulations to determine the air resistance of vehicles: Com-
pared to a classical wind tunnel experiment, a computer simulation can give more
precise results because the relative movement of the vehicle in relation to the ground
can be included in the simulation. This is not possible in the wind tunnel, since the
vehicle cannot be moved. Crash tests of vehicles are an obvious example where
computer simulations can be performed with less financial effort.
Computer simulations often require a large computational effort. A low perfor-
mance of the computer system used can restrict the simulations and the accuracy
of the results obtained significantly. In particular, using a high-performance system
allows larger simulations which lead to better results. Therefore, parallel comput-
ers have often been used to perform computer simulations. Today, cluster systems
built up from server nodes are widely available and are now often used for par-
allel simulations. To use parallel computers or cluster systems, the computations
to be performed must be partitioned into several parts which are assigned to the
parallel resources for execution. These computation parts should be independent of
each other, and the algorithm performed must provide enough independent compu-
tations to be suitable for a parallel execution. This is normally the case for scientific
simulations. To obtain a parallel program, the algorithm must be formulated in a
suitable programming language. Parallel execution is often controlled by specific
runtime libraries or compiler directives which are added to a standard programming
language like C, Fortran, or Java. The programming techniques needed to obtain
efficient parallel programs are described in this book. Popular runtime systems and
environments are also presented.
1.2 Parallelism in Today’s Hardware
Parallel programming is an important aspect of high-performance scientific com-
puting but it used to be a niche within the entire field of hardware and software
products. However, more recently parallel programming has left this niche and will
become the mainstream of software development techniques due to a radical change
in hardware technology.
Major chip manufacturers have started to produce processors with several power-
efficient computing units on one chip, which have independent control and can
access the same memory concurrently. Normally, the term core is used for single
computing units and the term multicore is used for the entire processor having sev-
eral cores. Thus, using multicore processors makes each desktop computer a small
parallel system. The technological development toward multicore processors was
forced by physical reasons, since the clock speed of chips with more and more
transistors cannot be increased at the previous rate without overheating.
Multicore architectures in the form of single multicore processors, shared mem-
ory systems of several multicore processors, or clusters of multicore processors
with a hierarchical interconnection network will have a large impact on software
development. In 2009, dual-core and quad-core processors are standard for normal
desktop computers, and chip manufacturers have already announced the introduc-
tion of oct-core processors for 2010. It can be predicted from Moore’s law that the
number of cores per processor chip will double every 18–24 months. According to
a report of Intel, in 2015 a typical processor chip will likely consist of dozens to
hundreds of cores, where a part of the cores will be dedicated to specific pur-
poses like network management, encryption and decryption, or graphics [109]; the
majority of the cores will be available for application programs, providing a huge
performance potential.
The users of a computer system are interested in benefitting from the perfor-
mance increase provided by multicore processors. If this can be achieved, they can
expect their application programs to keep getting faster and to gain more and more
features that could not be integrated in previous versions of the software because
they required too much computing power. To ensure this, there
should be support from the operating system, e.g., by using dedicated
cores for their intended purpose or by running multiple user programs in parallel,
if they are available. But when a large number of cores are provided, which will
be the case in the near future, there is also the need to execute a single application
program on multiple cores. The best situation for the software developer would
be the availability of an automatic transformer that takes a sequential program as input
and generates a parallel program that runs efficiently on the new architectures. If
such a transformer were available, software development could proceed as before.
But unfortunately, the experience of the research in parallelizing compilers during
the last 20 years has shown that for many sequential programs it is not possible to
extract enough parallelism automatically. Therefore, there must be some help from
the programmer, and application programs need to be restructured accordingly.
For the software developer, the new hardware development toward multicore
architectures is a challenge, since existing software must be restructured toward
parallel execution to take advantage of the additional computing resources. In partic-
ular, software developers can no longer expect that the increase of computing power
can automatically be used by their software products. Instead, additional effort is
required at the software level to take advantage of the increased computing power.
If a software company is able to transform its software so that it runs efficiently on
novel multicore architectures, it will likely have an advantage over its competitors.
There is much research going on in the area of parallel programming languages
and environments with the goal of facilitating parallel programming by providing
support at the right level of abstraction. But there are many effective techniques
and environments already available. We give an overview in this book and present
important programming techniques, enabling the reader to develop efficient parallel
programs. There are several aspects that must be considered when developing a
parallel program, no matter which specific environment or system is used. We give
a short overview in the following section.
1.3 Basic Concepts
A first step in parallel programming is the design of a parallel algorithm or pro-
gram for a given application problem. The design starts with the decomposition
of the computations of an application into several parts, called tasks, which can
be computed in parallel on the cores or processors of the parallel hardware. The
decomposition into tasks can be complicated and laborious, since there are usually
many different possibilities of decomposition for the same application algorithm.
The size of tasks (e.g., in terms of the number of instructions) is called granularity
and there is typically the possibility of choosing tasks of different sizes. Defining
the tasks of an application appropriately is one of the main intellectual challenges in
the development of a parallel program and is difficult to automate. Potential par-
allelism is an inherent property of an application algorithm and influences how an
application can be split into tasks.
The tasks of an application are coded in a parallel programming language or
environment and are assigned to processes or threads which are then assigned to
physical computation units for execution. The assignment of tasks to processes or
threads is called scheduling and fixes the order in which the tasks are executed.
Scheduling can be done by hand in the source code or by the programming envi-
ronment, at compile time or dynamically at runtime. The assignment of processes
or threads onto the physical units, processors or cores, is called mapping and is
usually done by the runtime system but can sometimes be influenced by the pro-
grammer. The tasks of an application algorithm can be independent but can also
depend on each other resulting in data or control dependencies of tasks. Data and
control dependencies may require a specific execution order of the parallel tasks:
If a task needs data produced by another task, the execution of the first task can
start only after the other task has actually produced these data and has provided the
information. Thus, dependencies between tasks are constraints for the scheduling.
In addition, parallel programs need synchronization and coordination of threads
and processes in order to execute correctly. The methods of synchronization and
coordination in parallel computing are strongly connected with the way in which
information is exchanged between processes or threads, and this depends on the
memory organization of the hardware.
A coarse classification of the memory organization distinguishes between shared
memory machines and distributed memory machines. Often the term thread is
connected with shared memory and the term process is connected with distributed
memory. For shared memory machines, a global shared memory stores the data
of an application and can be accessed by all processors or cores of the hardware
systems. Information exchange between threads is done by shared variables written
by one thread and read by another thread. The correct behavior of the entire pro-
gram has to be achieved by synchronization between threads so that the access to
shared data is coordinated, i.e., a thread does not read a data element before the write
operation by another thread storing that data element has been completed. Depending
on the programming language or environment, synchronization is done by the run-
time system or by the programmer. For distributed memory machines, there exists
a private memory for each processor, which can only be accessed by this processor,
and no synchronization for memory access is needed. Information exchange is done
by sending data from one processor to another processor via an interconnection
network by explicit communication operations.
Specific barrier operations offer another form of coordination which is avail-
able for both shared memory and distributed memory machines. All processes or
threads have to wait at a barrier synchronization point until all other processes or
threads have also reached that point. Only after all processes or threads have exe-
cuted the code before the barrier, they can continue their work with the subsequent
code after the barrier.
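A minimal sketch of barrier synchronization using the POSIX barrier interface is given below (the number of threads and the function names are chosen only for illustration; OpenMP and MPI provide analogous barrier operations):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_barrier_t barrier;          /* shared by all participating threads */

static void *phase_worker(void *arg) {
    long id = (long) arg;
    printf("thread %ld: phase 1\n", id);   /* phase 1 computation */
    pthread_barrier_wait(&barrier);        /* wait until all threads have arrived */
    printf("thread %ld: phase 2\n", id);   /* phase 2 may use results of phase 1 */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, phase_worker, (void *) i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}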
An important aspect of parallel computing is the parallel execution time which
consists of the time for the computation on processors or cores and the time for data
exchange or synchronization. The parallel execution time should be smaller than the
sequential execution time on one processor so that designing a parallel program is
worth the effort. The parallel execution time is the time elapsed between the start of
the application on the first processor and the end of the execution of the application
on all processors. This time is influenced by the distribution of work to processors or
cores, the time for information exchange or synchronization, and idle times in which
a processor cannot do anything useful but wait for an event to happen. In general,
a smaller parallel execution time results when the work load is assigned equally
to processors or cores, which is called load balancing, and when the overhead for
information exchange, synchronization, and idle times is small. Finding a specific
scheduling and mapping strategy which leads to a good load balance and a small
overhead is often difficult because of many interactions. For example, reducing the
overhead for information exchange may lead to load imbalance whereas a good load
balance may require more overhead for information exchange or synchronization.
For a quantitative evaluation of the execution time of parallel programs, cost
measures like speedup and efficiency are used, which compare the resulting parallel
execution time with the sequential execution time on one processor. There are differ-
ent ways to measure the cost or runtime of a parallel program and a large variety of
parallel cost models based on parallel programming models have been proposed and
used. These models are meant to bridge the gap between specific parallel hardware
and more abstract parallel programming languages and environments.
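As a brief preview of the two most common cost measures (the precise definitions and their discussion follow in Chapter 4; the notation T_seq for the sequential execution time and T_par(p) for the parallel execution time on p processors is chosen here for illustration):

S(p) \;=\; \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)}, \qquad
E(p) \;=\; \frac{S(p)}{p} \;=\; \frac{T_{\mathrm{seq}}}{p \cdot T_{\mathrm{par}}(p)}.

An ideal parallelization yields S(p) = p and E(p) = 1; communication, synchronization, idle times, and load imbalance reduce these values.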
1.4 Overview of the Book
The rest of the book is structured as follows. Chapter 2 gives an overview of
important aspects of the hardware of parallel computer systems and addresses new
developments like the trends toward multicore architectures. In particular, the chap-
ter covers important aspects of memory organization with shared and distributed
address spaces as well as popular interconnection networks with their topological
properties. Since memory hierarchies with several levels of caches may have an
important influence on the performance of (parallel) computer systems, they are
covered in this chapter. The architecture of multicore processors is also described in
detail. The main purpose of the chapter is to give a solid overview of the important
aspects of parallel computer architectures that play a role in parallel programming
and the development of efficient parallel programs.
Chapter 3 considers popular parallel programming models and paradigms and
discusses how the inherent parallelism of algorithms can be presented to a par-
allel runtime environment to enable an efficient parallel execution. An impor-
tant part of this chapter is the description of mechanisms for the coordination
of parallel programs, including synchronization and communication operations.
Moreover, mechanisms for exchanging information and data between computing
resources for different memory models are described. Chapter 4 is devoted to the
performance analysis of parallel programs. It introduces popular performance or
cost measures that are also used for sequential programs, as well as performance
measures that have been developed for parallel programs. Especially, popular com-
munication patterns for distributed address space architectures are considered and
their efficient implementations for specific interconnection networks are given.
Chapter 5 considers the development of parallel programs for distributed address
spaces. In particular, a detailed description of MPI (Message Passing Interface) is
given, which is by far the most popular programming environment for distributed
address spaces. The chapter describes important features and library functions of
MPI and shows which programming techniques must be used to obtain efficient
MPI programs. Chapter 6 considers the development of parallel programs for shared
address spaces. Popular programming environments are Pthreads, Java threads, and
OpenMP. The chapter describes all three and considers programming techniques to
obtain efficient parallel programs. Many examples help to understand the relevant
concepts and to avoid common programming errors that may lead to low perfor-
mance or cause problems like deadlocks or race conditions. Programming examples
and parallel programming patterns are presented. Chapter 7 considers algorithms
from numerical analysis as representative examples and shows how the sequential
algorithms can be transferred into parallel programs in a systematic way.
The main emphasis of the book is to provide the reader with the programming
techniques that are needed for developing efficient parallel programs for different
architectures and to give enough examples to enable the reader to use these tech-
niques for programs from other application areas. In particular, reading and using the
book is a good training for software development for modern parallel architectures,
including multicore architectures.
The content of the book can be used for courses in the area of parallel com-
puting with different emphasis. All chapters are written in a self-contained way so
that chapters of the book can be used in isolation; cross-references are given when
material from other chapters might be useful. Thus, different courses in the area of
parallel computing can be assembled from chapters of the book in a modular way.
Exercises are provided for each chapter separately. For a course on the programming
of multicore systems, Chaps. 2, 3, and 6 should be covered. In particular, Chapter 6
provides an overview of the relevant programming environments and techniques.
For a general course on parallel programming, Chaps. 2, 5, and 6 can be used. These
chapters introduce programming techniques for both distributed and shared address
spaces. For a course on parallel numerical algorithms, mainly Chaps. 5 and 7 are
suitable; Chap. 6 can be used additionally. These chapters consider the parallel algo-
rithms used as well as the programming techniques required. For a general course
on parallel computing, Chaps. 2, 3, 4, 5, and 6 can be used with selected applications
from Chap. 7. The following web page will be maintained for additional and new
material: ai2.inf.uni-bayreuth.de/pp book.
Chapter 2
Parallel Computer Architecture
The possibility for parallel execution of computations strongly depends on the
architecture of the execution platform. This chapter gives an overview of the gen-
eral structure of parallel computers which determines how computations of a pro-
gram can be mapped to the available resources such that a parallel execution is
obtained. Section 2.1 gives a short overview of the use of parallelism within a
single processor or processor core. Using the available resources within a single
processor core at instruction level can lead to a significant performance increase.
Sections 2.2 and 2.3 describe the control and data organization of parallel plat-
forms. Based on this, Sect. 2.4.2 presents an overview of the architecture of multi-
core processors and describes the use of thread-based parallelism for simultaneous
multithreading.
The following sections are devoted to specific components of parallel plat-
forms. Section 2.5 describes important aspects of interconnection networks which
are used to connect the resources of parallel platforms and to exchange data and
information between these resources. Interconnection networks also play an impor-
tant role in multicore processors for the connection between the cores of a pro-
cessor chip. Section 2.5 describes static and dynamic interconnection networks
and discusses important characteristics like diameter, bisection bandwidth, and
connectivity of different network types as well as the embedding of networks
into other networks. Section 2.6 addresses routing techniques for selecting paths
through networks and switching techniques for message forwarding over a given
path. Section 2.7 considers memory hierarchies of sequential and parallel plat-
forms and discusses cache coherence and memory consistency for shared memory
platforms.
2.1 Processor Architecture and Technology Trends
Processor chips are the key components of computers. Considering the trends
observed for processor chips during the last years, estimations for future develop-
ments can be deduced. Internally, processor chips consist of transistors. The number
of transistors contained in a processor chip can be used as a rough estimate of
its complexity and performance. Moore’s law is an empirical observation which
states that the number of transistors of a typical processor chip doubles every 18–24
months. This observation was first made by Gordon Moore in 1965 and is valid now
for more than 40 years. The increasing number of transistors can be used for archi-
tectural improvements like additional functional units, more and larger caches, and
more registers. A typical processor chip for desktop computers from 2009 consists
of 400–800 million transistors.
The increase in the number of transistors has been accompanied by an increase in
clock speed for quite a long time. Increasing the clock speed leads to a faster compu-
tational speed of the processor, and often the clock speed has been used as the main
characteristic of the performance of a computer system. In the past, the increase
in clock speed and in the number of transistors has led to an average annual perfor-
mance increase of processors of 55% for integer operations and 75% for floating-
point operations [84]. This can be measured by specific benchmark programs
that have been selected from different application areas to get a representative per-
formance measure of computer systems. Often, the SPEC benchmarks (System Per-
formance and Evaluation Cooperative) are used to measure the integer and floating-
point performance of computer systems [137, 84], see www.spec.org. The aver-
age performance increase of processors exceeds the increase in clock speed. This
indicates that the increasing number of transistors has led to architectural improve-
ments which reduce the average time for executing an instruction. In the following,
we give a short overview of such architectural improvements. Four phases of micro-
processor design trends can be observed [35] which are mainly driven by the internal
use of parallelism:
1. Parallelism at bit level: Up to about 1986, the word size used by processors for
operations increased stepwise from 4 bits to 32 bits. This trend has slowed down
and ended with the adoption of 64-bit operations beginning in the 1990s. This
development has been driven by demands for improved floating-point accuracy
and a larger address space. The trend has stopped at a word size of 64 bits, since
this gives sufficient accuracy for floating-point numbers and covers a sufficiently
large address space of 2^64 bytes.
2. Parallelism by pipelining: The idea of pipelining at instruction level is an over-
lapping of the execution of multiple instructions. The execution of each instruc-
tion is partitioned into several steps which are performed by dedicated hardware
units (pipeline stages) one after another. A typical partitioning could result in the
following steps:
(a) fetch: fetch the next instruction to be executed from memory;
(b) decode: decode the instruction fetched in step (a);
(c) execute: load the operands specified and execute the instruction;
(d) write-back: write the result into the target register.
An instruction pipeline is like an assembly line in the automobile industry. The
advantage is that the different pipeline stages can operate in parallel, if there
are no control or data dependencies between the instructions to be executed, see
Fig. 2.1 for an illustration.
[Fig. 2.1: Overlapping execution of four independent instructions by pipelining. The execution of each
instruction is split into four stages: fetch (F), decode (D), execute (E), and write-back (W).]
To avoid waiting times, the execution of the different
pipeline stages should take about the same amount of time. This time deter-
mines the cycle time of the processor. If there are no dependencies between
the instructions, in each clock cycle the execution of one instruction is fin-
ished and the execution of another instruction started. The number of instruc-
tions finished per time unit is defined as the throughput of the pipeline. Thus,
in the absence of dependencies, the throughput is one instruction per clock
cycle.
In the absence of dependencies, all pipeline stages work in parallel. Thus, the
number of pipeline stages determines the degree of parallelism attainable by a
pipelined computation. The number of pipeline stages used in practice depends
on the specific instruction and its potential to be partitioned into stages. Typical
numbers of pipeline stages lie between 2 and 26 stages. Processors which use
pipelining to execute instructions are called ILP processors (instruction-level
parallelism). Processors with a relatively large number of pipeline stages are
sometimes called superpipelined. Although the available degree of parallelism
increases with the number of pipeline stages, this number cannot be arbitrarily
increased, since it is not possible to partition the execution of the instruction into
a very large number of steps of equal size. Moreover, data dependencies often
inhibit a completely parallel use of the stages; a simple quantitative sketch of the attainable pipeline speedup is given at the end of this section.
3. Parallelism by multiple functional units: Many processors are multiple-issue
processors. They use multiple, independent functional units like ALUs (arith-
metic logical units), FPUs (floating-point units), load/store units, or branch units.
These units can work in parallel, i.e., different independent instructions can be
executed in parallel by different functional units. Thus, the average execution rate
of instructions can be increased. Multiple-issue processors can be distinguished
into superscalar processors and VLIW (very long instruction word) processors,
see [84, 35] for a more detailed treatment.
The number of functional units that can efficiently be utilized is restricted
because of data dependencies between neighboring instructions. For superscalar
processors, these dependencies are determined at runtime dynamically by the
hardware, and decoded instructions are dispatched to the instruction units using
dynamic scheduling by the hardware. This may increase the complexity of the
circuit significantly. Moreover, simulations have shown that superscalar proces-
sors with up to four functional units yield a substantial benefit over a single
functional unit. But using even more functional units provides little additional
gain [35, 99] because of dependencies between instructions and branching of
control flow.
4. Parallelism at process or thread level: The three techniques described so far
assume a single sequential control flow which is provided by the compiler and
which determines the execution order if there are dependencies between instruc-
tions. For the programmer, this has the advantage that a sequential programming
language can be used while nevertheless obtaining a parallel execution of instructions.
However, the degree of parallelism obtained by pipelining and multiple func-
tional units is limited. This limit has already been reached for some time for
typical processors. But more and more transistors are available per processor
chip according to Moore’s law. This can be used to integrate larger caches on the
chip. But the cache sizes cannot be arbitrarily increased either, as larger caches
lead to a larger access time, see Sect. 2.7.
An alternative approach to use the increasing number of transistors on a chip
is to put multiple, independent processor cores onto a single processor chip. This
approach has been used for typical desktop processors since 2005. The resulting
processor chips are called multicore processors. Each of the cores of a multi-
core processor must obtain a separate flow of control, i.e., parallel programming
techniques must be used. The cores of a processor chip access the same mem-
ory and may even share caches. Therefore, memory accesses of the cores must
be coordinated. The coordination and synchronization techniques required are
described in later chapters.
A more detailed description of parallelism by multiple functional units can be found
in [35, 84, 137, 164]. Section 2.4.2 describes techniques like simultaneous multi-
threading and multicore processors requiring an explicit specification of parallelism.
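As announced for pipelining in item 2 above, a simple quantitative sketch can be given under the idealized assumptions of k pipeline stages of equal duration t and no dependencies between instructions. Executing n instructions then takes

T_{\mathrm{pipe}}(n) \;=\; (k + n - 1)\cdot t
\qquad \text{instead of} \qquad
T_{\mathrm{seq}}(n) \;=\; n \cdot k \cdot t,

so the attainable speedup T_seq(n)/T_pipe(n) = nk/(k+n-1) approaches the number k of pipeline stages for large n. Dependencies and unequal stage times reduce this value in practice.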
2.2 Flynn’s Taxonomy of Parallel Architectures
Parallel computers have been used for many years, and many different architec-
tural alternatives have been proposed and used. In general, a parallel computer can
be characterized as a collection of processing elements that can communicate and
cooperate to solve large problems fast [14]. This definition is intentionally quite
vague to capture a large variety of parallel platforms. Many important details are not
addressed by the definition, including the number and complexity of the processing
elements, the structure of the interconnection network between the processing ele-
ments, the coordination of the work between the processing elements, as well as
important characteristics of the problem to be solved.
For a more detailed investigation, it is useful to make a classification according
to important characteristics of a parallel computer. A simple model for such a clas-
sification is given by Flynn’s taxonomy [52]. This taxonomy characterizes parallel
computers according to the global control and the resulting data and control flows.
Four categories are distinguished:
1. Single-Instruction, Single-Data (SISD): There is one processing element which
has access to a single program and data storage. In each step, the processing
element loads an instruction and the corresponding data and executes the instruc-
tion. The result is stored back in the data storage. Thus, SISD is the conventional
sequential computer according to the von Neumann model.
2. Multiple-Instruction, Single-Data (MISD): There are multiple processing ele-
ments each of which has a private program memory, but there is only one com-
mon access to a single global data memory. In each step, each processing element
obtains the same data element from the data memory and loads an instruction
from its private program memory. These possibly different instructions are then
executed in parallel by the processing elements using the previously obtained
(identical) data element as operand. This execution model is very restrictive and
no commercial parallel computer of this type has ever been built.
3. Single-Instruction, Multiple-Data (SIMD): There are multiple processing ele-
ments each of which has a private access to a (shared or distributed) data memory,
see Sect. 2.3 for a discussion of shared and distributed address spaces. But there
is only one program memory from which a special control processor fetches and
dispatches instructions. In each step, each processing element obtains from the
control processor the same instruction and loads a separate data element through
its private data access on which the instruction is performed. Thus, the instruction
is synchronously applied in parallel by all processing elements to different data
elements.
For applications with a significant degree of data parallelism, the SIMD
approach can be very efficient. Examples are multimedia applications or com-
puter graphics algorithms to generate realistic three-dimensional views of
computer-generated environments.
4. Multiple-Instruction, Multiple-Data (MIMD): There are multiple processing
elements each of which has a separate instruction and data access to a (shared
or distributed) program and data memory. In each step, each processing element
loads a separate instruction and a separate data element, applies the instruction
to the data element, and stores a possible result back into the data storage. The
processing elements work asynchronously with each other. Multicore processors
or cluster systems are examples for the MIMD model.
Compared to MIMD computers, SIMD computers have the advantage that they
are easy to program, since there is only one program flow, and the synchronous
execution does not require synchronization at program level. But the synchronous
execution is also a restriction, since conditional statements of the form
if (b==0) c=a; else c = a/b;
must be executed in two steps. In the first step, all processing elements whose local
value of b is zero execute the then part. In the second step, all other process-
ing elements execute the else part. MIMD computers are more flexible, as each
processing element can execute its own program flow. Most parallel computers
are based on the MIMD concept. Although Flynn’s taxonomy only provides a
coarse classification, it is useful to give an overview of the design space of parallel
computers.
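The two-step execution of the conditional statement above can be modeled by the following sequential C sketch (the array names and the element-wise formulation are chosen for illustration only; on a real SIMD machine all elements of a step are processed in lockstep, and the condition acts as a mask that deactivates some processing elements):

#include <stddef.h>

/* element-wise SIMD-style evaluation of: if (b==0) c=a; else c=a/b; */
void simd_conditional(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)       /* step 1: elements with b[i] == 0 */
        if (b[i] == 0.0) c[i] = a[i];
    for (size_t i = 0; i < n; i++)       /* step 2: all remaining elements */
        if (b[i] != 0.0) c[i] = a[i] / b[i];
}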
2.3 Memory Organization of Parallel Computers
Nearly all general-purpose parallel computers are based on the MIMD model. A
further classification of MIMD computers can be done according to their memory
organization. Two aspects can be distinguished: the physical memory organization
and the view of the programmer of the memory. For the physical organization,
computers with a physically shared memory (also called multiprocessors) and com-
puters with a physically distributed memory (also called multicomputers) can be
distinguished, see Fig. 2.2 for an illustration. But there also exist many hybrid orga-
nizations, for example providing a virtually shared memory on top of a physically
distributed memory.
[Fig. 2.2: Forms of memory organization of MIMD computers: multiprocessor systems (computers with
shared memory), multicomputer systems (computers with distributed memory), and parallel and
distributed computers with virtually shared memory.]
From the programmer's point of view, a distinction can be made between comput-
ers with a distributed address space and computers with a shared address space.
This view does not necessarily conform to the physical memory organization. For
example, a parallel computer with a physically distributed memory may appear to
the programmer as a computer with a shared address space when a corresponding
programming environment is used. In the following, we have a closer look at the
physical organization of the memory.
2.3.1 Computers with Distributed Memory Organization
Computers with a physically distributed memory are also called distributed mem-
ory machines (DMM). They consist of a number of processing elements (called
nodes) and an interconnection network which connects nodes and supports the
transfer of data between nodes. A node is an independent unit, consisting of pro-
cessor, local memory, and, sometimes, periphery elements, see Fig. 2.3 (a) for an
illustration.
[Fig. 2.3: Illustration of computers with distributed memory: (a) abstract structure, (b) computer
with distributed memory and hypercube as interconnection structure, (c) DMA (direct memory
access), (d) processor–memory node with router, and (e) interconnection network in the form of a
mesh to connect the routers of the different processor–memory nodes.]
Program data is stored in the local memory of one or several nodes. All local
memory is private and only the local processor can access the local memory directly.
When a processor needs data from the local memory of other nodes to perform
local computations, message-passing has to be performed via the interconnection
network. Therefore, distributed memory machines are strongly connected with the
message-passing programming model which is based on communication between
cooperating sequential processes and which will be considered in more detail in
Chaps. 3 and 5. To perform message-passing, two processes PA and PB on different
nodes A and B issue corresponding send and receive operations. When PB needs
data from the local memory of node A, PA performs a send operation containing
the data for the destination process PB. PB performs a receive operation specifying
a receive buffer to store the data from the source process PA from which the data is
expected.
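A minimal MPI sketch of such a message transfer is shown below (MPI is treated in detail in Chapter 5; the message size, the tag value 0, and the ranks 0 and 1 for PA and PB are chosen arbitrarily for this illustration):

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    double buf[100];                       /* data to be transferred */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                       /* process PA on node A */
        /* ... fill buf with local data ... */
        MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* process PB on node B */
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* ... buf now contains the data sent by PA ... */
    }
    MPI_Finalize();
    return 0;
}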
The architecture of computers with a distributed memory has experienced many
changes over the years, especially concerning the interconnection network and the
coupling of network and nodes. The interconnection network of earlier multicom-
puters were often based on point-to-point connections between nodes. A node is
connected to a fixed set of other nodes by physical connections. The structure of the
interconnection network can be represented as a graph structure. The nodes repre-
sent the processors, the edges represent the physical interconnections (also called
links). Typically, the graph exhibits a regular structure. A typical network structure
is the hypercube which is used in Fig. 2.3(b) to illustrate the node connections; a
detailed description of interconnection structures is given in Sect. 2.5. In networks
with point-to-point connections, the structure of the network determines the possible
communications, since each node can only exchange data with its direct neighbors.
To decouple send and receive operations, buffers can be used to store a message
until the communication partner is ready. Point-to-point connections restrict paral-
lel programming, since the network topology determines the possibilities for data
exchange, and parallel algorithms have to be formulated such that their communi-
cation fits the given network structure [8, 115].
The execution of communication operations can be decoupled from the proces-
sor’s operations by adding a DMA controller (DMA – direct memory access) to the
nodes to control the data transfer between the local memory and the I/O controller.
This enables data transfer from or to the local memory without participation of the
processor (see Fig. 2.3(c) for an illustration) and allows asynchronous communica-
tion. A processor can issue a send operation to the DMA controller and can then
continue local operations while the DMA controller executes the send operation.
Messages are received at the destination node by its DMA controller which copies
the enclosed data to a specific system location in local memory. When the processor
then performs a receive operation, the data are copied from the system location to
the specified receive buffer. Communication is still restricted to neighboring nodes
in the network. Communication between nodes that do not have a direct connection
must be controlled by software to send a message along a path of direct inter-
connections. Therefore, communication times between nodes that are not directly
connected can be much larger than communication times between direct neighbors.
Thus, it is still more efficient to use algorithms with communication according to
the given network structure.
A further decoupling can be obtained by putting routers into the network, see
Fig. 2.3(d). The routers form the actual network over which communication can
be performed. The nodes are connected to the routers, see Fig. 2.3(e). Hardware-
supported routing reduces communication times as messages for processors on
remote nodes can be forwarded by the routers along a preselected path without
interaction of the processors in the nodes along the path. With router support, there
is not a large difference in communication time between neighboring nodes and
remote nodes, depending on the switching technique, see Sect. 2.6.3. Each physical
I/O channel of a router can be used by one message only at a specific point in time.
To decouple message forwarding, message buffers are used for each I/O channel to
store messages and apply specific routing algorithms to avoid deadlocks, see also
Sect. 2.6.1.
Technically, DMMs are quite easy to assemble since standard desktop computers
can be used as nodes. The programming of DMMs requires a careful data layout,
since each processor can directly access only its local data. Non-local data must
be accessed via message-passing, and the execution of the corresponding send and
receive operations takes significantly longer than a local memory access. Depending
on the interconnection network and the communication library used, the difference
can be more than a factor of 100. Therefore, data layout may have a significant influ-
ence on the resulting parallel runtime of a program. Data layout should be selected
such that the number of message transfers and the size of the data blocks exchanged
are minimized.
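As a small illustration of this point (a sketch only; the buffer data, the element count n, and the destination rank dest are assumed to be provided by the surrounding program), sending a block of values in one message is usually much cheaper than sending the values one by one, since every message transfer pays a startup (latency) cost:

#include <mpi.h>

/* n messages: the startup cost of a message transfer is paid n times */
void send_elementwise(double *data, int n, int dest) {
    for (int i = 0; i < n; i++)
        MPI_Send(&data[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

/* one aggregated message: the same data, but the startup cost is paid once */
void send_blocked(double *data, int n, int dest) {
    MPI_Send(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}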
The structure of DMMs has many similarities with networks of workstations
(NOWs) in which standard workstations are connected by a fast local area net-
work (LAN). An important difference is that interconnection networks of DMMs
are typically more specialized and provide larger bandwidths and lower latencies,
thus leading to a faster message exchange.
Collections of complete computers with a dedicated interconnection network are
often called clusters. Clusters are usually based on standard computers and even
standard network topologies. The entire cluster is addressed and programmed as a
single unit. The popularity of clusters as parallel machines comes from the availabil-
ity of standard high-speed interconnections like FCS (Fiber Channel Standard), SCI
(Scalable Coherent Interface), Switched Gigabit Ethernet, Myrinet, or InfiniBand,
see [140, 84, 137]. A natural programming model of DMMs is the message-passing
model that is supported by communication libraries like MPI or PVM, see Chap. 5
for a detailed treatment of MPI. These libraries are often based on standard protocols
like TCP/IP [110, 139].
The difference between cluster systems and distributed systems lies in the fact
that the nodes in cluster systems use the same operating system and can usually
not be addressed individually; instead a special job scheduler must be used. Several
cluster systems can be connected to grid systems by using middleware software like
the Globus Toolkit, see www.globus.org [59]. This allows a coordinated collab-
oration of several clusters. In grid systems, the execution of application programs is
controlled by the middleware software.
2.3.2 Computers with Shared Memory Organization
Computers with a physically shared memory are also called shared memory ma-
chines (SMMs); the shared memory is also called global memory. SMMs consist
of a number of processors or cores, a shared physical memory (global memory), and
an interconnection network to connect the processors with the memory. The shared
memory can be implemented as a set of memory modules. Data can be exchanged
between processors via the global memory by reading or writing shared variables.
The cores of a multicore processor are an example for an SMM, see Sect. 2.4.2 for
a more detailed description. Physically, the global memory usually consists of sep-
arate memory modules providing a common address space which can be accessed
by all processors, see Fig. 2.4 for an illustration.
[Fig. 2.4: Illustration of a computer with shared memory: (a) abstract view and (b) implementation
of the shared memory with memory modules.]
A natural programming model for SMMs is the use of shared variables which
can be accessed by all processors. Communication and cooperation between the
processors is organized by writing and reading shared variables that are stored in
the global memory. Accessing shared variables concurrently by several processors
should be avoided since race conditions with unpredictable effects can occur, see
also Chaps. 3 and 6.
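A minimal OpenMP sketch illustrates this programming model (OpenMP is covered in Chapter 6; the loop and the variable sum are chosen for illustration only). The reduction clause lets each thread accumulate into a private partial sum, so the shared variable is not accessed concurrently and no race condition occurs:

#include <stdio.h>

int main(void) {
    double sum = 0.0;                     /* shared variable in global memory */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000; i++)       /* iterations are distributed among threads */
        sum += 1.0 / i;                   /* partial sums are combined safely at the end */
    printf("sum = %f\n", sum);
    return 0;
}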
The existence of a global memory is a significant advantage, since communi-
cation via shared variables is easy and since no data replication is necessary as is
sometimes the case for DMMs. But technically, the realization of SMMs requires
a larger effort, in particular because the interconnection network must provide fast
access to the global memory for each processor. This can be ensured for a small
number of processors, but scaling beyond a few dozen processors is difficult.
A special variant of SMMs are symmetric multiprocessors (SMPs). SMPs have
a single shared memory which provides a uniform access time from any processor
for all memory locations, i.e., all memory locations are equidistant to all processors
[35, 84]. SMPs usually have a small number of processors that are connected via a
central bus which also provides access to the shared memory. There are usually no
private memories of processors or specific I/O processors, but each processor has a
private cache hierarchy. As usual, access to a local cache is faster than access to the
global memory. In the spirit of the definition from above, each multicore processor
with several cores is an SMP system.
SMPs usually have only a small number of processors, since the central bus
provides a constant bandwidth which is shared by all processors. When too many
processors are connected, more and more access collisions may occur, thus increas-
ing the effective memory access time. This can be alleviated by the use of caches
and suitable cache coherence protocols, see Sect. 2.7.3. The maximum number of
processors used in bus-based SMPs typically lies between 32 and 64.
Parallel programs for SMMs are often based on the execution of threads. A thread
is a separate control flow which shares data with other threads via a global address
space. A distinction can be made between kernel threads, which are managed by the operating system, and user threads, which are explicitly generated and controlled by the parallel program, see Sect. 3.7.2. The kernel threads are mapped by the oper-
ating system to processors for execution. User threads are managed by the specific
programming environment used and are mapped to kernel threads for execution.
The mapping algorithms as well as the exact number of processors can be hidden
from the user by the operating system. The processors are completely controlled
by the operating system. The operating system can also start multiple sequential
programs from several users on different processors, when no parallel program is
available. Small-size SMP systems are often used as servers, because of their cost-
effectiveness, see [35, 140] for a detailed description.
SMP systems can be used as nodes of a larger parallel computer by employing
an interconnection network for data exchange between processors of different SMP
nodes. For such systems, a shared address space can be defined by using a suitable
cache coherence protocol, see Sect. 2.7.3. A coherence protocol provides the view of
a shared address space, although the physical memory might be distributed. Such a
protocol must ensure that any memory access returns the most recently written value
for a specific memory address, no matter where this value is physically stored. The
resulting systems are also called distributed shared memory (DSM) architectures.
In contrast to single SMP systems, the access time in DSM systems depends on
the location of a data value in the global memory, since an access to a data value
in the local SMP memory is faster than an access to a data value in the memory
of another SMP node via the coherence protocol. These systems are therefore also
called NUMAs (non-uniform memory access), see Fig. 2.5. Since single SMP sys-
tems have a uniform memory latency for all processors, they are also called UMAs
(uniform memory access).
2.3.3 Reducing Memory Access Times
Memory access time has a large influence on program performance. This can also be
observed for computer systems with a shared address space. Technological develop-
ment with a steady reduction in the VLSI (very large scale integration) feature size
has led to significant improvements in processor performance. Since 1980, integer
performance on the SPEC benchmark suite has been increasing at about 55% per
year, and floating-point performance at about 75% per year [84], see Sect. 2.1.
Using the LINPACK benchmark, floating-point performance has been increasing
at more than 80% per year. A significant contribution to these improvements comes
from a reduction in processor cycle time. At the same time, the capacity of DRAM
chips that are used for building main memory has been increasing by about 60%
per year. In contrast, the access time of DRAM chips has only been decreasing by
about 25% per year. Thus, memory access time does not keep pace with processor
performance improvement, and there is an increasing gap between processor cycle
time and memory access time.
Fig. 2.5 Illustration of the architecture of computers with shared memory: (a) SMP – symmetric multiprocessors, (b) NUMA – non-uniform memory access, (c) CC-NUMA – cache-coherent NUMA, and (d) COMA – cache-only memory access
A suitable organization of memory access becomes more and more important to get good performance results at program level. This
is also true for parallel programs, in particular if a shared address space is used.
Reducing the average latency observed by a processor when accessing memory can
increase the resulting program performance significantly.
Two important approaches have been considered to reduce the average latency
for memory access [14]: the simulation of virtual processors by each physical
processor (multithreading) and the use of local caches to store data values that are
accessed often. We now give a short overview of these two approaches.
2.3.3.1 Multithreading
The idea of interleaved multithreading is to hide the latency of memory accesses
by simulating a fixed number of virtual processors for each physical processor. The
physical processor contains a separate program counter (PC) as well as a separate
set of registers for each virtual processor. After the execution of a machine instruc-
tion, an implicit switch to the next virtual processor is performed, i.e., the virtual
processors are simulated by the physical processor in a round-robin fashion. The
number of virtual processors per physical processor should be selected such that
the time between the executions of successive instructions of a virtual processor is
sufficiently large to load required data from the global memory. Thus, the memory
latency will be hidden by executing instructions of other virtual processors. This
approach does not reduce the amount of data loaded from the global memory via
the network. Instead, instruction execution is organized such that a virtual processor does not access requested data before they have arrived. Therefore, from the point of view of
a virtual processor, memory latency cannot be observed. This approach is also called
fine-grained multithreading, since a switch is performed after each instruction. An
alternative approach is coarse-grained multithreading which switches between
virtual processors only on costly stalls, such as level 2 cache misses [84]. For the
programming of fine-grained multithreading architectures, a PRAM-like program-
ming model can be used, see Sect. 4.5.1. There are two drawbacks of fine-grained
multithreading:
• The programming must be based on a large number of virtual processors. There-
fore, the algorithm used must have a sufficiently large potential of parallelism to
employ all virtual processors.
• The physical processors must be specially designed for the simulation of virtual
processors. A software-based simulation using standard microprocessors is too
slow.
There have been several examples for the use of fine-grained multithreading in
the past, including Denelcor HEP (heterogeneous element processor) [161], NYU
Ultracomputer [73], SB-PRAM [1], Tera MTA [35, 95], as well as the Sun T1 and
T2 multiprocessors. For example, each T1 processor contains eight processor cores,
each supporting four threads which act as virtual processors [84]. Section 2.4.1 will
describe another variation of multithreading which is simultaneous multithreading.
2.3.3.2 Caches
A cache is a small, but fast memory between the processor and main memory. A
cache can be used to store data that is often accessed by the processor, thus avoiding
expensive main memory access. The data stored in a cache is always a subset of the
data in the main memory, and the management of the data elements in the cache
is done by hardware, e.g., by employing a set-associative strategy, see [84] and
Sect. 2.7.1 for a detailed treatment. For each memory access issued by the processor,
the hardware first checks whether the memory address specified currently resides
in the cache. If so, the data is loaded from the cache and no memory access is
necessary. Therefore, memory accesses that go into the cache are significantly faster
than memory accesses that require a load from the main memory. Since fast memory
is expensive, several levels of caches are typically used, starting from a small, fast,
and expensive level 1 (L1) cache over several stages (L2, L3) to the large, but slow
main memory. For a typical processor architecture, access to the L1 cache only takes
2–4 cycles whereas access to main memory can take up to several hundred cycles.
The primary goal of cache organization is to reduce the average memory access time
as far as possible and to achieve an access time as close as possible to that of the L1
cache. Whether this can be achieved depends on the memory access behavior of the
program considered, see Sect. 2.7.
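The effect can be made visible with a rough microbenchmark such as the following C sketch (illustrative only; the absolute times depend on the processor and its cache sizes): the same number of array elements is summed up with different strides, so that small strides reuse cache lines while large strides cause many main memory accesses.

/* Rough illustration of cache effects: the number of accesses is the same
   for every stride, but the spatial locality decreases with the stride. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)                 /* 16M ints, larger than typical caches */

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (a == NULL) return 1;
    for (int i = 0; i < N; i++) a[i] = i;

    for (int stride = 1; stride <= 1024; stride *= 2) {
        clock_t start = clock();
        long sum = 0;
        for (int rep = 0; rep < stride; rep++)        /* always N accesses */
            for (int i = rep; i < N; i += stride)
                sum += a[i];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("stride %4d: %6.3f s (sum = %ld)\n", stride, secs, sum);
    }
    free(a);
    return 0;
}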
Caches are used for single-processor computers, but they also play an important
role in SMPs and parallel computers with different memory organization. SMPs
provide a shared address space. If shared data is used by multiple processors, it
may be replicated in multiple caches to reduce access latencies. Each processor
should have a coherent view of the memory system, i.e., any read access should
return the most recently written value no matter which processor has issued the
corresponding write operation. A coherent view would be destroyed if a processor
p changes the value of a memory address in its local cache without writing this value
back to main memory. If another processor q later reads this memory address, it will not get the most recently written value. But even if p writes the value back
to main memory, this may not be sufficient if q has a copy of the same memory
location in its local cache. In this case, it is also necessary to update the copy in the
local cache of q. The problem of providing a coherent view of the memory system
is often referred to as the cache coherence problem. To ensure cache coherence, a cache coherence protocol must be used, see Sect. 2.7.3 and [35, 84, 81] for a more
detailed description.
2.4 Thread-Level Parallelism
The architectural organization within a processor chip may require the use of
explicitly parallel programs to efficiently use the resources provided. This is called
thread-level parallelism, since the multiple control flows needed are often called
threads. The corresponding architectural organization is also called chip multipro-
cessing (CMP). An example for CMP is the placement of multiple independent exe-
cution cores with all execution resources onto a single processor chip. The resulting
processors are called multicore processors, see Sect. 2.4.2.
An alternative approach is the use of multithreading to execute multiple threads
simultaneously on a single processor by switching between the different threads
when needed by the hardware. As described in Sect. 2.3.3, this can be obtained by
fine-grained or coarse-grained multithreading. A variant of coarse-grained multi-
threading is timeslice multithreading in which the processor switches between the
threads after a predefined timeslice interval has elapsed. This can lead to situations
where the timeslices are not effectively used if a thread must wait for an event. If
this happens in the middle of a timeslice, the processor may remain unused for the
rest of the timeslice because of the waiting. Such unnecessary waiting times can
be avoided by using switch-on-event multithreading [119] in which the processor
can switch to the next thread if the current thread must wait for an event to occur as
can happen for cache misses.
A variant of this technique is simultaneous multithreading (SMT) which will
be described in the following. This technique is called hyperthreading for some
Intel processors. The technique is based on the observation that a single thread of
control often does not provide enough instruction-level parallelism to use all func-
tional units of modern superscalar processors.
2.4.1 Simultaneous Multithreading
The idea of simultaneous multithreading (SMT) is to use several threads and
to schedule executable instructions from different threads in the same cycle if
necessary, thus using the functional units of a processor more effectively. This
leads to a simultaneous execution of several threads which gives the technique
its name. In each cycle, instructions from several threads compete for the func-
tional units of a processor. Hardware support for simultaneous multithreading is
based on the replication of the chip area which is used to store the processor
state. This includes the program counter (PC), user and control registers, as well
as the interrupt controller with the corresponding registers. With this replication,
the processor appears to the operating system and the user program as a set of
logical processors to which processes or threads can be assigned for execution.
These processes or threads can come from a single or several user programs. The
number of replications of the processor state determines the number of logical
processors.
Each logical processor stores its processor state in a separate processor resource.
This avoids overhead for saving and restoring processor states when switching to
another logical processor. All other resources of the processor chip like caches, bus
system, and function and control units are shared by the logical processors. There-
fore, the implementation of SMT only leads to a small increase in chip size. For two
logical processors, the required increase in chip area for an Intel Xeon processor is
less than 5% [119, 178]. The shared resources are assigned to the logical processors
for simultaneous use, thus leading to a simultaneous execution of logical processors.
When a logical processor must wait for an event, the resources can be assigned to
another logical processor. This leads to a continuous use of the resources from the
view of the physical processor. Waiting times for logical processors can occur for
cache misses, wrong branch predictions, dependencies between instructions, and
pipeline hazards.
Investigations have shown that the simultaneous use of processor resources by
two logical processors can lead to performance improvements between 15% and
30%, depending on the application program [119]. Since the processor resources are
shared by the logical processors, it cannot be expected that the use of more than two
logical processors can lead to a significant additional performance improvement.
Therefore, SMT will likely be restricted to a small number of logical processors.
Examples of processors that support SMT are the IBM Power5 and Power6 proces-
sors (two logical processors) and the Sun T1 and T2 processors (four/eight logical
processors), see, e.g., [84] for a more detailed description.
To use SMT to obtain performance improvements, it is necessary that the oper-
ating system be able to control logical processors. From the point of view of the
application program, it is necessary that every logical processor has a separate thread
available for execution. Therefore, the application program must apply parallel pro-
gramming techniques to get performance improvements for SMT processors.
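From the programmer's point of view, the logical processors simply appear as additional processors reported by the operating system. On Linux and most Unix-like systems their number can be queried as in the following sketch (sysconf() with _SC_NPROCESSORS_ONLN is a widespread extension, but not available on every platform).

/* Print the number of logical processors (hardware threads) seen by the
   operating system, e.g. 2 per physical core on an SMT processor. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long online = sysconf(_SC_NPROCESSORS_ONLN);       /* currently online */
    long configured = sysconf(_SC_NPROCESSORS_CONF);   /* configured in total */
    printf("logical processors: %ld online, %ld configured\n",
           online, configured);
    return 0;
}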
2.4.2 Multicore Processors
According to Moore’s law, the number of transistors of a processor chip doubles
every 18–24 months. This enormous increase has enabled hardware manufacturers
for many years to provide a significant performance increase for application pro-
grams, see also Sect. 2.1. Thus, a typical computer is considered old-fashioned and
too slow after at most 5 years, and customers buy new computers quite often. Hard-
ware manufacturers are therefore trying to keep the obtained performance increase
at least at the current level to avoid reduction in computer sales figures.
As discussed in Sect. 2.1, the most important factors for the performance increase
per year have been an increase in clock speed and the internal use of parallel pro-
cessing like pipelined execution of instructions and the use of multiple functional
units. But these traditional techniques have mainly reached their limits:
• Although it is possible to put additional functional units on the processor chip,
this would not increase performance for most application programs because
dependencies between instructions of a single control thread inhibit their par-
allel execution. A single control flow does not provide enough instruction-level
parallelism to keep a large number of functional units busy.
• There are two main reasons why the speed of processor clocks cannot be
increased significantly [106]. First, the increase in the number of transistors
on a chip is mainly achieved by increasing the transistor density. But this also
increases the power density and heat production because of leakage current and
power consumption, thus requiring an increased effort and more energy for cool-
ing. Second, memory access time could not be reduced at the same rate as the
processor clock period. This leads to an increased number of machine cycles for
a memory access. For example, in 1990 main memory access was between 6 and
8 cycles for a typical desktop computer system, whereas in 2006 memory access
typically took between 100 and 250 cycles, depending on the DRAM technology
used to build the main memory. Therefore, memory access times could become
a limiting factor for further performance increase, and cache memories are used
to prevent this, see Sect. 2.7 for a further discussion.
There are more problems that processor designers have to face: Using the
increased number of transistors to increase the complexity of the processor archi-
tecture may also lead to an increase in processor–internal wire length to transfer
control and data between the functional units of the processor. Here, the speed
of signal transfers within the wires could become a limiting factor. For exam-
ple, a 3 GHz processor has a cycle time of 0.33 ns. Assuming a signal transfer
at the speed of light (0.3 ·109
m/s), a signal can cross a distance of 0.33 ·10−9
s
·0.3 · 109
m/s = 10 cm in one processor cycle. This is not significantly larger
than the typical size of a processor chip, and wire lengths become an important
issue.
Another problem is the following: The physical size of a processor chip limits
the number of pins that can be used, thus limiting the bandwidth between CPU and
main memory. This may lead to a processor-to-memory performance gap which
is sometimes referred to as memory wall. This makes the use of high-bandwidth
memory architectures with an efficient cache hierarchy necessary [17].
All these reasons inhibit a processor performance increase at the previous rate
using the traditional techniques. Instead, new processor architectures have to be
used, and the use of multiple cores on a single processor die is considered as
the most promising approach. Instead of further increasing the complexity of the
internal organization of a processor chip, this approach integrates multiple indepen-
dent processing cores with a relatively simple architecture onto one processor chip.
This has the additional advantage that the energy consumption of a processor chip
can be reduced if necessary by switching off unused processor cores during idle
times [83].
Multicore processors integrate multiple execution cores on a single processor
chip. For the operating system, each execution core represents an independent log-
ical processor with separate execution resources like functional units or execution
pipelines. Each core has to be controlled separately, and the operating system can
assign different application programs to the different cores to obtain a parallel
execution. Background applications like virus checking, image compression, and
encoding can run in parallel to application programs of the user. By using techniques
of parallel programming, it is also possible to execute a computation-intensive appli-
cation program (like computer games, computer vision, or scientific simulations) in
parallel on a set of cores, thus reducing the execution time compared to an execution
on a single core or leading to more accurate results by performing more computations than in the sequential case. In the future, users of standard application programs such as computer games will likely expect an efficient use of the execution cores of a
processor chip. To achieve this, programmers have to use techniques from parallel
programming.
The use of multiple cores on a single processor chip also enables standard
programs, like text processing, office applications, or computer games, to provide
additional features that are computed in the background on a separate core so that
the user does not notice any delay in the main application. But again, techniques of
parallel programming have to be used for the implementation.
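A minimal Pthreads sketch of this structure (the background computation is only a placeholder loop, not a real feature) creates one worker thread for the background task while the main thread continues with the interactive part of the program; on a multicore processor the operating system can map the two threads to different cores.

/* The main thread stays responsive while a worker thread computes a
   background feature on a (potentially) different core. */
#include <stdio.h>
#include <pthread.h>

static void *background_feature(void *arg) {
    long sum = 0;                    /* placeholder for real background work */
    for (long i = 0; i < 100000000L; i++) sum += i;
    printf("background feature finished (checksum %ld)\n", sum);
    return NULL;
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, background_feature, NULL);
    printf("main application continues without noticeable delay\n");
    /* ... interactive work of the main application would happen here ... */
    pthread_join(worker, NULL);
    return 0;
}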
2.4.3 Architecture of Multicore Processors
There are many different design variants for multicore processors, differing in the
number of cores, the structure and size of the caches, the access of cores to caches,
and the use of heterogeneous components. From a high-level view, three different
types of architectures can be distinguished, and there are also hybrid organizations
[107].
2.4.3.1 Hierarchical Design
For a hierarchical design, multiple cores share multiple caches. The caches are orga-
nized in a tree-like configuration, and the size of the caches increases from the leaves
to the root, see Fig. 2.6 (left) for an illustration. The root represents the connection
to external memory. Thus, each core can have a separate L1 cache and shares the
L2 cache with other cores. All cores share the common external memory, resulting
in a three-level hierarchy as illustrated in Fig. 2.6 (left). This can be extended to
more levels. Additional sub-components can be used to connect the caches of one
level with each other. A typical usage area for a hierarchical design is the SMP
configuration.
A hierarchical design is also often used for standard desktop or server processors.
Examples are the IBM Power6 architecture, the processors of the Intel Xeon and
AMD Opteron family, as well as the Sun Niagara processors (T1 and T2). Figure 2.7
shows the design of the Quad-Core AMD Opteron and the Intel Quad-Core Xeon
processors as a typical example for desktop processors with a hierarchical design.
Many graphics processing units (GPUs) also exhibit a hierarchical design. An exam-
ple is shown in Fig. 2.8 for the Nvidia GeForce 8800, which has 128 stream proces-
sors (SP) at 1.35 GHz organized in 8 texture/processor clusters (TPC) such that each
TPC contains 16 SPs. This architecture is scalable to smaller and larger configura-
tions by scaling the number of SPs and memory partitions, see [137] for a detailed
description.
Fig. 2.6 Design choices for multicore chips according to [107]: hierarchical design, pipelined design, and network-based design
Fig. 2.7 Quad-Core AMD Opteron (left) vs. Intel Quad-Core Xeon architecture (right) as examples for a hierarchical design
Fig. 2.8 Architectural overview of Nvidia GeForce 8800, see [128, 137] for a detailed description
2.4.3.2 Pipelined Designs
For a pipelined design, data elements are processed by multiple execution cores in
a pipelined way. Data elements enter the processor chip via an input port and are
passed successively through different cores until the processed data elements leave
the last core and the entire processor chip via an output port, see Fig. 2.6 (middle).
Each core performs specific processing steps on each data element.
Pipelined designs are useful for application areas in which the same computation
steps have to be applied to a long sequence of data elements. Network processors
used in routers and graphics processors both perform this style of computations.
Examples for network processors with a pipelined design are the Xelerator X10 and
X11 processors [176, 107] for the successive processing of network packets in a
pipelined way within the chip. The Xelerator X11 contains up to 800 separate cores
which are arranged in a logically linear pipeline, see Fig. 2.9 for an illustration. The
network packets to be processed enter the chip via multiple input ports on one side
of the chip, are successively processed by the cores, and then exit the chip.
Fig. 2.9 Xelerator X11 network processor as an example for a pipelined design [176]
2.4.3.3 Network-Based Design
For a network-based design, the cores of a processor chip and their local caches and
memories are connected via an interconnection network with other cores of the chip,
see Fig. 2.6 (right) for an illustration. Data transfer between the cores is performed
via the interconnection network. This network may also provide support for the
synchronization of the cores. Off-chip interfaces may be provided via specialized
cores or DMA ports. An example for a network-based design is the Intel Teraflop
processor, which has been designed by the Intel Tera-scale Computing Research
Program [83, 17].
This research program addresses the challenges of building processor chips with
tens to hundreds of execution cores, including core design, energy management,
cache and memory hierarchy, and I/O. The Teraflop processor developed as a pro-
totype contains 80 cores, which are arranged in an 8×10 mesh, see Fig. 2.10 for an
illustration. Each core can perform floating-point operations and contains a local
cache as well as a router to perform data transfer between the cores and the main
memory. There are additional cores for processing video data, encryption, and
graphics computations. Depending on the application area, the number of special-
ized cores of such a processor chip could be varied.
2.4.3.4 Future Trends and Developments
The potential of multicore processors has been realized by most processor man-
ufacturers like Intel or AMD, and since about 2005, many manufacturers deliver
processors with two or more cores. Since 2007, Intel and AMD provide quad-core
processors (like the Quad-Core AMD Opteron and the Quad-Core Intel Xeon), and
the provision of oct-core processors is expected in 2010.
Fig. 2.10 Intel Teraflop processor according to [83] as an example for a network-based design of a multicore processor
The IBM Cell processor
integrates one standard desktop core based on the Power Architecture and eight
specialized processing cores. The UltraSPARC T2 processor from Sun has up to
eight processing cores each of which can simulate eight threads using SMT (which
is called CoolThreads by Sun). Thus, an UltraSPARC T2 processor can simultane-
ously execute up to 64 threads.
An important issue for the integration of a large number of cores in one processor
chip is an efficient on-chip interconnection, which provides enough bandwidth for
data transfers between the cores [83]. This interconnection should be scalable to
support an increasing number of cores for future generations of processor designs
and robust to tolerate failures of specific cores. If one or a few cores exhibit hard-
ware failures, the rest of the cores should be able to continue operation. The inter-
connection should also support an efficient energy management which allows the
scale-down of power consumption of individual cores by reducing the clock speed.
For an efficient use of processing cores, it is also important that the data to be
processed be transferred to the cores fast enough so that the cores do not have to wait for the data to be available. Therefore, an efficient memory system and I/O system
are important. The memory system may use private first-level (L1) caches which
can only be accessed by their associated cores, as well as shared second-level (L2)
caches which can contain data of different cores. In addition, a shared third-level
(L3) cache is often used. Processor chips with dozens or hundreds of cores will likely
require an additional level of caches in the memory hierarchy to fulfill bandwidth
requirements [83]. The I/O system must be able to provide enough bandwidth to
keep all cores busy for typical application programs. At the physical layer, the I/O
system must be able to bring hundreds of gigabits per second onto the chip. Such
powerful I/O systems are currently under development [83].
Table 2.1 gives a short overview of typical multicore processors in 2009. For
a more detailed treatment of the architecture of multicore processors and further
examples, we refer to [137, 84].
Table 2.1 Examples for multicore processors in 2009

Processor                        Cores  Threads  Clock (GHz)  L1 cache    L2 cache     L3 cache  Year released
Intel Xeon E5450 “Harpertown”      4       4        3.0       4 × 32 KB   2 × 6 MB       –         2007
Intel Xeon E5540 “Gainestown”      4       8        2.53      4 × 64 KB   4 × 256 KB    8 MB       2009
AMD Opteron “Barcelona”            4       4        2.0       4 × 64 KB   4 × 512 KB    2 MB       2007
AMD Opteron “Istanbul”             6       6        2.8       6 × 128 KB  6 × 512 KB    6 MB       2009
IBM Power6                         2       4        4.7       128 KB      2 × 4 MB     32 MB       2007
Sun T2 “Niagara 2”                 8      64        1.17      8 × 8 KB    4 MB           –         2007
2.5 Interconnection Networks
A physical connection between the different components of a parallel system is
provided by an interconnection network. Similar to control flow and data flow,
see Sect. 2.2, or memory organization, see Sect. 2.3, the interconnection network
can also be used for a classification of parallel systems. Internally, the network
consists of links and switches which are arranged and connected in some regular
way. In multicomputer systems, the interconnection network is used to connect
the processors or nodes with each other. Interactions between the processors for
coordination, synchronization, or exchange of data are obtained by communication
through message-passing over the links of the interconnection network. In multipro-
cessor systems, the interconnection network is used to connect the processors with
the memory modules. Thus, memory accesses of the processors are performed via
the interconnection network.
In both cases, the main task of the interconnection network is to transfer a mes-
sage from a specific processor to a specific destination. The message may contain
data or a memory request. The destination may be another processor or a memory
module. The requirement for the interconnection network is to perform the message
transfer correctly as fast as possible, even if several messages have to be transferred
at the same time. Message transfer and memory accesses represent a significant part
of operations of parallel systems with a distributed or shared address space. There-
fore, the interconnection network used represents a significant part of the design of a
parallel system and may have a large influence on its performance. Important design
criteria of networks are
• the topology describing the interconnection structure used to connect different
processors or processors and memory modules and
• the routing technique describing the exact message transmission used within the
network between processors or processors and memory modules.
The topology of an interconnection network describes the geometric structure
used for the arrangement of switches and links to connect processors or processors
and memory modules. The geometric structure can be described as a graph in which
switches, processors, or memory modules are represented as vertices and physical
links are represented as edges. A distinction can be made between static and dynamic
interconnection networks. Static interconnection networks connect nodes (proces-
sors or memory modules) directly with each other by fixed physical links. They are
also called direct networks or point-to-point networks. The number of connec-
tions to or from a node may vary from only one in a star network to the total number
of nodes in the network for a completely connected graph, see Sect. 2.5.2. Static
networks are often used for systems with a distributed address space where a node
comprises a processor and the corresponding memory module. Dynamic intercon-
nection networks connect nodes indirectly via switches and links. They are also
called indirect networks. Examples of indirect networks are bus-based networks
or switching networks which consist of switches connected by links. Dynamic net-
works are used for both parallel systems with distributed and shared address space.
Often, hybrid strategies are used [35].
The routing technique determines how and along which path messages are trans-
ferred in the network from a sender to a receiver. A path in the network is a series
of nodes along which the message is transferred. Important aspects of the routing
technique are the routing algorithm which determines the path to be used for the
transmission and the switching strategy which determines whether and how mes-
sages are cut into pieces, how a routing path is assigned to a message, and how a
message is forwarded along the processors or switches on the routing path.
The combination of routing algorithm, switching strategy, and network topology
determines the performance of a network significantly. In Sects. 2.5.2 and 2.5.4,
important direct and indirect networks are described in more detail. Specific routing
algorithms and switching strategies are presented in Sects. 2.6.1 and 2.6.3. Efficient
algorithms for the realization of common communication operations on different
static networks are given in Chap. 4. A more detailed treatment of interconnection
networks is given in [19, 35, 44, 75, 95, 115, 158].
2.5.1 Properties of Interconnection Networks
Static interconnection networks use fixed links between the nodes. They can be
described by a connection graph G = (V, E) where V is a set of nodes to be con-
nected and E is a set of direct connection links between the nodes. If there is a direct
physical connection in the network between the nodes u ∈ V and v ∈ V , then it is
(u, v) ∈ E. For most parallel systems, the interconnection network is bidirectional.
This means that along a physical link messages can be transferred in both directions
at the same time. Therefore, the connection graph is usually defined as an undirected
graph. When a message must be transmitted from a node u to a node v and there
is no direct connection between u and v in the network, a path from u to v must
be selected which consists of several intermediate nodes along which the message
is transferred. A sequence of nodes (v0, . . . , vk) is called a path of length k between v0 and vk, if (vi, vi+1) ∈ E for 0 ≤ i < k. For parallel systems, all interconnection
networks fulfill the property that there is at least one path between any pair of nodes
u, v ∈ V .
Static networks can be characterized by specific properties of the connection
graph, including the following properties: number of nodes, diameter of the net-
work, degree of the nodes, bisection bandwidth, node and edge connectivity of the
network, and flexibility of embeddings into other networks as well as the embedding
of other networks. In the following, a precise definition of these properties is given.
The diameter δ(G) of a network G is defined as the maximum distance between
any pair of nodes:
δ(G) = max_{u,v ∈ V} min_{ϕ path from u to v} {k | k is the length of the path ϕ from u to v}.
The diameter of a network determines the length of the paths to be used for message
transmission between any pair of nodes. The degree g(G) of a network G is the
maximum degree of a node of the network where the degree of a node n is the
number of direct neighbor nodes of n:
g(G) = max{g(v) | g(v) degree of v ∈ V }.
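For a concrete network given by its connection graph, both quantities can be computed directly; the following C sketch (illustrative only) determines g(G) and δ(G) for a small example network, here a ring with eight nodes, using one breadth-first search per node.

/* Degree g(G) and diameter delta(G) of a small connected network given as
   an adjacency matrix; one BFS per start node yields its eccentricity. */
#include <stdio.h>
#include <string.h>

#define N 8                                 /* number of nodes */
static int adj[N][N];

static int eccentricity(int s) {
    int dist[N], queue[N], head = 0, tail = 0, ecc = 0;
    memset(dist, -1, sizeof dist);
    dist[s] = 0; queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < N; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                if (dist[v] > ecc) ecc = dist[v];
                queue[tail++] = v;
            }
    }
    return ecc;
}

int main(void) {
    for (int i = 0; i < N; i++)             /* example network: a ring */
        adj[i][(i + 1) % N] = adj[(i + 1) % N][i] = 1;
    int degree = 0, diameter = 0;
    for (int u = 0; u < N; u++) {
        int d = 0;
        for (int v = 0; v < N; v++) d += adj[u][v];
        if (d > degree) degree = d;
        int e = eccentricity(u);
        if (e > diameter) diameter = e;
    }
    printf("g(G) = %d, delta(G) = %d\n", degree, diameter);  /* ring: 2 and N/2 */
    return 0;
}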
In the following, we assume that |A| denotes the number of elements in a set A.
The bisection bandwidth B(G) of a network G is defined as the minimum number
of edges that must be removed to partition the network into two parts of equal size
without any connection between the two parts. For an uneven total number of nodes,
the size of the parts may differ by 1. This leads to the following definition for B(G):
B(G) = min_{U1,U2 partition of V, ||U1| − |U2|| ≤ 1} |{(u, v) ∈ E | u ∈ U1, v ∈ U2}|.
B(G) + 1 messages can saturate a network G, if these messages must be transferred
at the same time over the corresponding edges. Thus, bisection bandwidth is a mea-
sure for the capacity of a network when transmitting messages simultaneously.
The node and edge connectivity of a network measure the number of nodes or
edges that must fail to disconnect the network. A high connectivity value indicates
a high reliability of the network and is therefore desirable. Formally, the node con-
nectivity of a network is defined as the minimum number of nodes that must be
deleted to disconnect the network, i.e., to obtain two unconnected network parts
(which do not necessarily need to have the same size as is required for the bisection
bandwidth). For an exact definition, let G_{V\M} be the rest graph which is obtained by deleting all nodes in M ⊂ V as well as all edges adjacent to these nodes. Thus, it is G_{V\M} = (V \ M, E ∩ ((V \ M) × (V \ M))). The node connectivity nc(G) of G is then defined as
nc(G) = min_{M ⊂ V} {|M| | there exist u, v ∈ V \ M such that there exists no path in G_{V\M} from u to v}.
Similarly, the edge connectivity of a network is defined as the minimum number
of edges that must be deleted to disconnect the network. For an arbitrary subset
F ⊂ E, let G_{E\F} be the rest graph which is obtained by deleting the edges in F, i.e., it is G_{E\F} = (V, E \ F). The edge connectivity ec(G) of G is then defined as
ec(G) = min_{F ⊂ E} {|F| | there exist u, v ∈ V such that there exists no path in G_{E\F} from u to v}.
The node and edge connectivity of a network is a measure of the number of indepen-
dent paths between any pair of nodes. A high connectivity of a network is important
for its availability and reliability, since many nodes or edges can fail before the
network is disconnected. The minimum degree of a node in the network is an upper
bound on the node or edge connectivity, since such a node can be completely sepa-
rated from its neighboring nodes by deleting all incoming edges. Figure 2.11 shows
that the node connectivity of a network can be smaller than its edge connectivity.
Fig. 2.11 Network with node connectivity 1, edge connectivity 2, and degree 4. The smallest degree of a node is 3
The flexibility of a network can be captured by the notion of embedding. Let
G = (V, E) and G′ = (V′, E′) be two networks. An embedding of G′ into G assigns each node of G′ to a node of G such that different nodes of G′ are mapped to different nodes of G and such that edges between two nodes in G′ are also present between their associated nodes in G [19]. An embedding of G′ into G can formally be described by a mapping function σ : V′ → V such that the following holds:
• if u ≠ v for u, v ∈ V′, then σ(u) ≠ σ(v) and
• if (u, v) ∈ E′, then (σ(u), σ(v)) ∈ E.
If a network G′ can be embedded into a network G, this means that G is at least as flexible as G′, since any algorithm that is based on the network structure of G′, e.g., by using edges between nodes for communication, can be re-formulated for G with the mapping function σ, thus using corresponding edges in G for communication.
The network of a parallel system should be designed to meet the requirements
formulated for the architecture of the parallel system based on typical usage pat-
terns. Generally, the following topological properties are desirable:
• a small diameter to ensure small distances for message transmission,
• a small node degree to reduce the hardware overhead for the nodes,
• a large bisection bandwidth to obtain large data throughputs,
• a large connectivity to ensure reliability of the network,
• embedding into a large number of networks to ensure flexibility, and
• easy extendability to a larger number of nodes.
Some of these properties are conflicting and there is no network that meets all
demands in an optimal way. In the following, some popular direct networks are
presented and analyzed. The topologies are illustrated in Fig. 2.12. The topological
properties are summarized in Table 2.2.
2.5.2 Direct Interconnection Networks
Direct interconnection networks usually have a regular structure which is transferred
to their graph representation G = (V, E). In the following, we use n = |V | for the
number of nodes in the network and use this as a parameter of the network type
considered. Thus, each network type captures an entire class of networks instead of
a fixed network with a given number of nodes.
A complete graph is a network G in which each node is directly connected with
every other node, see Fig. 2.12(a). This results in diameter δ(G) = 1 and degree
g(G) = n − 1. The node and edge connectivity is nc(G) = ec(G) = n − 1, since a
node can only be disconnected by deleting all n − 1 adjacent edges or neighboring
nodes. For even values of n, the bisection bandwidth is B(G) = n^2/4: If two subsets
of nodes of size n/2 each are built, there are n/2 edges from each of the nodes of
one subset into the other subset, resulting in n/2·n/2 edges between the subsets. All
other networks can be embedded into a complete graph, since there is a connection
between any two nodes. Because of the large node degree, complete graph networks
can only be built physically for a small number of nodes.
In a linear array network, nodes are arranged in a sequence and there is a
bidirectional connection between any pair of neighboring nodes, see Fig. 2.12(b),
i.e., it is V = {v1, . . . , vn} and E = {(vi, vi+1) | 1 ≤ i < n}. Since n − 1 edges
have to be traversed to reach vn starting from v1, the diameter is δ(G) = n − 1.
The connectivity is nc(G) = ec(G) = 1, since the elimination of one node or edge
disconnects the network. The network degree is g(G) = 2 because of the inner
nodes, and the bisection bandwidth is B(G) = 1. A linear array network can be
embedded in nearly all standard networks except a tree network, see below. Since
there is a link only between neighboring nodes, a linear array network does not
provide fault tolerance for message transmission.
In a ring network, nodes are arranged in ring order. Compared to the linear array
network, there is one additional bidirectional edge from the first node to the last
node, see Fig. 2.12(c). The resulting diameter is δ(G) = ⌊n/2⌋, the degree is g(G) = 2, the connectivity is nc(G) = ec(G) = 2, and the bisection bandwidth is also B(G) = 2. In practice, ring networks can be used for a small number of processors
and as part of more complex networks.
Table 2.2 Summary of important characteristics of static interconnection networks for selected topologies

Network G with n nodes                 Degree g(G)   Diameter δ(G)          Edge connectivity ec(G)   Bisection bandwidth B(G)
Complete graph                         n − 1         1                      n − 1                     (n/2)^2
Linear array                           2             n − 1                  1                         1
Ring                                   2             ⌊n/2⌋                  2                         2
d-dimensional mesh (n = r^d)           2d            d(n^(1/d) − 1)         d                         n^((d−1)/d)
d-dimensional torus (n = r^d)          2d            d⌊n^(1/d)/2⌋           2d                        2n^((d−1)/d)
k-dimensional hypercube (n = 2^k)      log n         log n                  log n                     n/2
k-dimensional CCC network
  (n = k·2^k for k ≥ 3)                3             2k − 1 + ⌊k/2⌋         3                         n/(2k)
Complete binary tree (n = 2^k − 1)     3             2 log((n + 1)/2)       1                         1
k-ary d-cube (n = k^d)                 2d            d⌊k/2⌋                 2d                        2k^(d−1)

A d-dimensional mesh (also called a d-dimensional array) for d ≥ 1 consists of n = n1 · n2 · . . . · nd nodes that are arranged as a d-dimensional mesh, see
Fig. 2.12(d). The parameter n j denotes the extension of the mesh in dimension j
for j = 1, . . . , d. Each node in the mesh is represented by its position (x1, . . . , xd)
in the mesh with 1 ≤ xj ≤ n j for j = 1, . . . , d. There is an edge between node
(x1, . . . , xd) and (x′1, . . . , x′d), if there exists μ ∈ {1, . . . , d} with |xμ − x′μ| = 1 and xj = x′j for all j ≠ μ.
In the case that the mesh has the same extension in all dimensions (also called
symmetric mesh), i.e., nj = r = n^(1/d) for all j = 1, . . . , d, and therefore n = r^d, the network diameter is δ(G) = d · (n^(1/d) − 1), resulting from the path length
between nodes on opposite sides of the mesh. The node and edge connectivity is
nc(G) = ec(G) = d, since the corner nodes of the mesh can be disconnected by
deleting all d incoming edges or neighboring nodes. The network degree is g(G) =
2d, resulting from inner mesh nodes which have two neighbors in each dimension.
A two-dimensional mesh has been used for the Teraflop processor from Intel, see
Sect. 2.4.3.
A d-dimensional torus is a variation of a d-dimensional mesh. The difference is
the additional edges between the first and the last node in each dimension, i.e., for
each dimension j = 1, . . . , d there is an edge between node (x1, . . . , xj−1, 1, xj+1,
. . . , xd) and (x1, . . . , xj−1, n j , xj+1, . . . , xd), see Fig. 2.12(e). For the symmetric
case nj = n^(1/d) for all j = 1, . . . , d, the diameter of the torus network is δ(G) = d · ⌊n^(1/d)/2⌋. The node degree is 2d for each node, i.e., g(G) = 2d. Therefore, node
and edge connectivities are also nc(G) = ec(G) = 2d.
A k-dimensional cube or hypercube consists of n = 2^k nodes which are
connected by edges according to a recursive construction, see Fig. 2.12(f). Each
node is represented by a binary word of length k, corresponding to the numbers
0, . . . , 2^k − 1. A one-dimensional cube consists of two nodes with bit representations
0 and 1 which are connected by an edge. A k-dimensional cube is constructed
from two given (k − 1)-dimensional cubes, each using binary node representa-
tions 0, . . . , 2^(k−1) − 1. A k-dimensional cube results by adding edges between each
pair of nodes with the same binary representation in the two (k − 1)-dimensional
cubes. The binary representations of the nodes in the resulting k-dimensional
cube are obtained by adding a leading 0 to the previous representation of the
first (k − 1)-dimensional cube and adding a leading 1 to the previous represen-
tations of the second (k − 1)-dimensional cube. Using the binary representations
of the nodes V = {0, 1}^k, the recursive construction just mentioned implies that there is an edge between node α0 . . . αj . . . αk−1 and node α0 . . . ᾱj . . . αk−1 for 0 ≤ j ≤ k − 1, where ᾱj = 1 for αj = 0 and ᾱj = 0 for αj = 1. Thus,
there is an edge between every pair of nodes whose binary representation dif-
fers in exactly one bit position. This fact can also be captured by the Hamming
distance.
The Hamming distance of two binary words of the same length is defined as
the number of bit positions in which their binary representations differ. Thus, two
nodes of a k-dimensional cube are directly connected, if their Hamming distance is
1. Between two nodes v, w ∈ V with Hamming distance d, 1 ≤ d ≤ k, there exists
a path of length d connecting v and w. This path can be determined by traversing
the bit representation of v bitwise from left to right and inverting the bits in which
v and w differ. Each bit inversion corresponds to a traversal of the corresponding
edge to a neighboring node. Since the bit representation of any two nodes can differ
in at most k positions, there is a path of length ≤ k between any pair of nodes. Thus,
the diameter of a k-dimensional cube is δ(G) = k. The node degree is g(G) = k,
since a binary representation of length k allows k bit inversions, i.e., each node has
exactly k neighbors. The node and edge connectivity is nc(G) = ec(G) = k as will
be described in the following.
The connectivity of a hypercube is at most k, i.e., nc(G) ≤ k, since each node
can be completely disconnected from its neighbors by deleting all k neighbors or
all k adjacent edges. To show that the connectivity is at least k, we show that there
are exactly k independent paths between any pair of nodes v and w. Two paths are
independent of each other if they do not share any edge, i.e., independent paths
between v and w only share the two nodes v and w. The independent paths are
constructed based on the binary representations of v and w, which are denoted
by A and B, respectively, in the following. We assume that A and B differ in l
positions, 1 ≤ l ≤ k, and that these are the first l positions (which can be obtained
by a renumbering). We can construct l paths of length l each between v and w by
inverting the first l bits of A in different orders. For path i, 0 ≤ i < l, we stepwise invert bits i, . . . , l − 1 in this order first, and then invert bits 0, . . . , i − 1 in this order. This results in l independent paths. Additional k − l independent paths between v and w of length l + 2 each can be constructed as follows: For i with 0 ≤ i < k − l, we first invert the bit (l + i) of A and then the bits at positions 0, . . . , l − 1 stepwise.
Finally, we invert the bit (l + i) again, obtaining bit representation B. This is shown
in Fig. 2.13 for an example.
Fig. 2.13 In a three-dimensional cube network, three independent paths can be constructed between node 000 and node 110. The Hamming distance between node 000 and node 110 is l = 2. There are two independent paths between 000 and 110 of length l = 2: path (000, 100, 110) and path (000, 010, 110). Additionally, there is k − l = 1 path of length l + 2 = 4: path (000, 001, 101, 111, 110)
All k paths constructed are independent of each other,
showing that nc(G) ≥ k holds.
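The path construction by stepwise bit inversion can be written down directly; the following C sketch (illustrative only; it traverses the differing bit positions from the least significant bit upward) prints a shortest path between two nodes of a k-dimensional cube, using the example nodes 000 and 110 of Fig. 2.13, together with their Hamming distance.

/* Shortest path in a k-dimensional hypercube: invert, one after the other,
   the bits in which the labels of the two nodes differ; the Hamming
   distance of the labels is the length of the path. */
#include <stdio.h>

int main(void) {
    unsigned k = 3, v = 0u /* 000 */, w = 6u /* 110 */;
    unsigned diff = v ^ w, cur = v, dist = 0;

    printf("path: %u", cur);
    for (unsigned j = 0; j < k; j++)
        if (diff & (1u << j)) {            /* bit j differs -> traverse edge */
            cur ^= (1u << j);
            dist++;
            printf(" -> %u", cur);
        }
    printf("\nHamming distance (path length): %u\n", dist);
    return 0;
}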
A k-dimensional cube allows the embedding of many other networks as will be
shown in the next subsection.
A cube-connected cycles (CCC) network results from a k-dimensional cube by
replacing each node with a cycle of k nodes. Each of the nodes in the cycle has
one off-cycle connection to one neighbor of the original node of the k-dimensional
cube, thus covering all neighbors, see Fig. 2.12(g). The nodes of a CCC network
can be represented by V = {0, 1}^k × {0, . . . , k − 1}, where {0, 1}^k are the binary representations of the k-dimensional cube and i ∈ {0, . . . , k − 1} represents the position in the cycle. A distinction can be made between cycle edges F and cube edges E:
F = {((α, i), (α, (i + 1) mod k)) | α ∈ {0, 1}^k, 0 ≤ i < k},
E = {((α, i), (β, i)) | αi ≠ βi and αj = βj for j ≠ i}.
Each of the k · 2^k nodes of the CCC network has degree g(G) = 3, thus eliminating a
drawback of the k-dimensional cube. The connectivity is nc(G) = ec(G) = 3 since
each node can be disconnected by deleting its three neighboring nodes or edges. An
upper bound for the diameter is δ(G) = 2k − 1 + ⌊k/2⌋. To construct a path of this
length, we consider two nodes in two different cycles with maximum hypercube
distance k. These are nodes (α, i) and (β, j) for which α and β differ in all k bits.
We construct a path from (α, i) to (β, j) by sequentially traversing a cube edge
and a cycle edge for each bit position. The path starts with (α0 . . . αi . . . αk−1, i) and
reaches the next node by inverting αi to ᾱi = βi . From (α0 . . . βi . . . αk−1, i) the next
node (α0 . . . βi . . . αk−1, (i + 1) mod k) is reached by using a cycle edge. In the next
steps, the bits αi+1, . . . , αk−1 and α0, . . . , αi−1 are successively inverted in this way,
using a cycle edge between the steps. This results in 2k − 1 edge traversals. Using
at most ⌊k/2⌋ additional traversals of cycle edges starting from (β, (i + k − 1) mod k)
leads to the target node (β, j).
A complete binary tree network has n = 2^k − 1 nodes which are arranged as
a binary tree in which all leaf nodes have the same depth, see Fig. 2.12(h). The
degree of inner nodes is 3, leading to a total degree of g(G) = 3. The diameter of
the network is δ(G) = 2 · log((n + 1)/2) and is determined by the path length between
two leaf nodes in different subtrees of the root node; the path consists of a subpath
from the first leaf to the root followed by a subpath from the root to the second leaf.
The connectivity of the network is nc(G) = ec(G) = 1, since the network can be
disconnected by deleting the root or one of the edges to the root.
A k-dimensional shuffle–exchange network has n = 2^k nodes and 3 · 2^(k−1) edges
[167]. The nodes can be represented by k-bit words. A node with bit representation
α is connected with a node with bit representation β, if
• α and β differ in the last bit (exchange edge) or
• α results from β by a cyclic left shift or a cyclic right shift (shuffle edge).
Figure 2.12(i) shows a shuffle–exchange network with 8 nodes. The permutation
(α, β) where β results from α by a cyclic left shift is called perfect shuffle.
The permutation (α, β) where β results from α by a cyclic right shift is called
inverse perfect shuffle, see [115] for a detailed treatment of shuffle–exchange
networks.
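The shuffle and exchange edges can be computed with simple bit operations on the k-bit node labels, as in the following C sketch (illustrative only; the inverse perfect shuffle would be a cyclic right shift and is analogous).

/* Neighbors of a node in a k-dimensional shuffle-exchange network:
   the exchange edge inverts the last bit, the shuffle edge performs a
   cyclic left shift of the k-bit node label. */
#include <stdio.h>

static unsigned exchange_neighbor(unsigned a) {
    return a ^ 1u;                                   /* invert last bit */
}

static unsigned shuffle_neighbor(unsigned a, unsigned k) {
    unsigned msb = (a >> (k - 1)) & 1u;              /* bit shifted out left */
    return ((a << 1) | msb) & ((1u << k) - 1u);      /* cyclic left shift */
}

int main(void) {
    unsigned k = 3;
    for (unsigned a = 0; a < (1u << k); a++)
        printf("node %u: exchange neighbor %u, shuffle neighbor %u\n",
               a, exchange_neighbor(a), shuffle_neighbor(a, k));
    return 0;
}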
A k-ary d-cube with k ≥ 2 is a generalization of the d-dimensional cube with
n = k^d nodes where each dimension i with i = 0, . . . , d − 1 contains k nodes. Each
node can be represented by a word with d numbers (a0, . . . , ad−1) with 0 ≤ ai ≤
k −1, where ai represents the position of the node in dimension i, i = 0, . . . , d −1.
Two nodes A = (a0, . . . , ad−1) and B = (b0, . . . , bd−1) are connected by an edge if
there is a dimension j ∈ {0, . . . , d − 1} for which aj = (bj ± 1) mod k and ai = bi
for all other dimensions i = 0, . . . , d − 1, i ≠ j. For k = 2, each node has one neighbor in each dimension, resulting in degree g(G) = d. For k > 2, each node
has two neighbors in each dimension, resulting in degree g(G) = 2d. The k-ary
d-cube captures some of the previously considered topologies as special cases: A k-ary 1-cube is a ring with k nodes, a k-ary 2-cube is a torus with k^2 nodes, a 3-ary
3-cube is a three-dimensional torus with 3 × 3 × 3 nodes, and a 2-ary d-cube is a
d-dimensional cube.
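The neighborhood relation of a k-ary d-cube follows directly from this coordinate representation; the following C sketch (illustrative only) prints all neighbors of a given node by incrementing and decrementing each coordinate modulo k.

/* Neighbors of a node of a k-ary d-cube: in each dimension j the
   coordinate is changed by +1 and -1 modulo k; for k = 2 both directions
   coincide, so the degree is d instead of 2d. */
#include <stdio.h>

#define D 3                                /* dimension d */
#define K 4                                /* k nodes per dimension */

static void print_node(const int a[D]) {
    printf("(");
    for (int j = 0; j < D; j++) printf(j ? ",%d" : "%d", a[j]);
    printf(")\n");
}

int main(void) {
    int node[D] = {0, 3, 1};               /* example node (a0, a1, a2) */
    for (int j = 0; j < D; j++) {
        int nb[D];
        for (int i = 0; i < D; i++) nb[i] = node[i];
        nb[j] = (node[j] + 1) % K;         /* +1 neighbor in dimension j */
        print_node(nb);
        if (K > 2) {
            nb[j] = (node[j] - 1 + K) % K; /* -1 neighbor in dimension j */
            print_node(nb);
        }
    }
    return 0;
}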
Table 2.2 summarizes important characteristics of the network topologies
described.
2.5.3 Embeddings
In this section, we consider the embedding of several networks into a hypercube
network, demonstrating that the hypercube topology is versatile and flexible.
2.5.3.1 Embedding a Ring into a Hypercube Network
For an embedding of a ring network with n = 2^k nodes represented by V′ = {1, . . . , n} in a k-dimensional cube with nodes V = {0, 1}^k, a bijective function from V′ to V is constructed such that a ring edge (i, j) ∈ E′ is mapped to a hypercube edge. In the ring, there are edges between neighboring nodes in the sequence
1, …, n. To construct the embedding, we have to arrange the hypercube nodes
in V in a sequence such that there is also an edge between neighboring nodes in
the sequence. The sequence is constructed as a reflected Gray code (RGC) sequence,
which is defined as follows:
A k-bit RGC is a sequence of 2^k binary strings of length k such that two neighboring
strings differ in exactly one bit position. The RGC sequence is constructed
recursively, as follows:
• The 1-bit RGC sequence is RGC_1 = (0, 1).
• The 2-bit RGC sequence is obtained from RGC_1 by inserting a 0 and a 1 in front
of RGC_1, resulting in the two sequences (00, 01) and (10, 11). Reversing the
second sequence and concatenating yields RGC_2 = (00, 01, 11, 10).
• For k ≥ 2, the k-bit Gray code RGC_k is constructed from the (k − 1)-bit Gray
code RGC_{k−1} = (b_1, …, b_m) with m = 2^{k−1}, where each entry b_i for 1 ≤ i ≤ m
is a binary string of length k − 1. To construct RGC_k, RGC_{k−1} is duplicated; a 0
is inserted in front of each b_i of the original sequence, and a 1 is inserted in front
of each b_i of the duplicated sequence. This results in the sequences (0b_1, …, 0b_m)
and (1b_1, …, 1b_m). RGC_k results from reversing the second sequence and concatenating
the two sequences; thus RGC_k = (0b_1, …, 0b_m, 1b_m, …, 1b_1). A short code
sketch of this construction is given after the list.
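The following C sketch, my own illustration rather than code from the book, follows the reflect-and-prefix rule above and builds RGC_k as an array of integers whose k-bit binary representations, printed most significant bit first, form the sequence.

#include <stdio.h>
#include <stdlib.h>

/* Sketch only: build the k-bit reflected Gray code RGC_k as an array of
 * 2^k integers. The loop applies the reflect-and-prefix rule iteratively:
 * the second half is the reversed first half with the new leading bit set. */
unsigned *rgc(int k)
{
    unsigned len = 1u << k;
    unsigned *seq = malloc(len * sizeof *seq);
    if (seq == NULL)
        exit(EXIT_FAILURE);
    seq[0] = 0;                                  /* RGC_0: the empty word */
    for (int bit = 0; bit < k; bit++) {
        unsigned m = 1u << bit;                  /* current sequence length 2^bit */
        for (unsigned i = 0; i < m; i++)
            seq[m + i] = seq[m - 1 - i] | (1u << bit);
    }                                            /* the first half keeps its leading 0 */
    return seq;
}

int main(void)
{
    int k = 3;
    unsigned *seq = rgc(k);
    for (unsigned i = 0; i < (1u << k); i++) {
        for (int b = k - 1; b >= 0; b--)
            putchar(((seq[i] >> b) & 1u) ? '1' : '0');
        putchar(i + 1 < (1u << k) ? ' ' : '\n'); /* prints 000 001 011 010 110 111 101 100 */
    }
    free(seq);
    return 0;
}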
The Gray code sequences RGC_k constructed in this way have the property that
they contain all binary representations of a k-dimensional hypercube, since the
construction corresponds to the construction of a k-dimensional cube from two
(k − 1)-dimensional cubes as described in the previous section. Two neighboring
k-bit words of RGC_k differ in exactly one bit position, as can be shown by induction.
The statement is surely true for RGC_1. Assuming that the statement is true for
RGC_{k−1}, it is true for the first 2^{k−1} elements of RGC_k as well as for the last 2^{k−1}
elements, since these differ only by a leading 0 or 1 from RGC_{k−1}. The statement
is also true for the two middle elements 0b_m and 1b_m at which the two sequences
of length 2^{k−1} are concatenated. Similarly, the first element 0b_1 and the last element
1b_1 of RGC_k differ only in the first bit. Thus, neighboring elements of RGC_k are
connected by a hypercube edge.
An embedding of a ring into a k-dimensional cube can be defined by the mapping

    σ : {1, …, n} → {0, 1}^k  with  σ(i) := RGC_k(i),

where RGC_k(i) denotes the i-th element of RGC_k. Figure 2.14(a) shows an example
for k = 3.
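With the rgc() sketch above, the embedding itself reduces to an index lookup; the small helper below is again only an illustration and accounts for ring nodes being numbered from 1 while the array is indexed from 0.

/* Sketch only, reusing the rgc() function above: ring node i in
 * {1, ..., 2^k} is mapped to the hypercube node whose k-bit word is the
 * i-th Gray code element, i.e. sigma(i) = RGC_k(i). */
unsigned sigma_ring(const unsigned *rgc_seq, unsigned i)
{
    return rgc_seq[i - 1];   /* ring nodes start at 1, the array at 0 */
}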
2.5.3.2 Embedding a Two-Dimensional Mesh into a Hypercube Network
The embedding of a two-dimensional mesh with n = n_1 · n_2 nodes into a
k-dimensional cube with n = 2^k nodes can be obtained by a generalization of the
embedding of a ring network. For k_1 and k_2 with n_1 = 2^{k_1} and n_2 = 2^{k_2}, i.e.,
k_1 + k_2 = k, the Gray codes RGC_{k_1} = (a_1, …, a_{n_1}) and RGC_{k_2} = (b_1, …, b_{n_2}) are
[Fig. 2.14 Embeddings into a hypercube network: (a) embedding of a ring network with 8 nodes into a three-dimensional hypercube and (b) embedding of a two-dimensional 2 × 4 mesh into a three-dimensional hypercube]
used to construct an n_1 × n_2 matrix M whose entries are k-bit strings. In particular,
it is

    M = ⎡ a_1 b_1        a_1 b_2        …   a_1 b_{n_2}     ⎤
        ⎢ a_2 b_1        a_2 b_2        …   a_2 b_{n_2}     ⎥
        ⎢    ⋮              ⋮           ⋱      ⋮            ⎥
        ⎣ a_{n_1} b_1    a_{n_1} b_2    …   a_{n_1} b_{n_2} ⎦ .
The matrix is constructed such that neighboring entries differ in exactly one bit
position. This is true for neighboring elements in a row, since identical elements
of RGC_{k_1} and neighboring elements of RGC_{k_2} are used. Similarly, this is true for
neighboring elements in a column, since identical elements of RGC_{k_2} and neighboring
elements of RGC_{k_1} are used. All elements of M are bit strings of length k, and
there are no identical bit strings according to the construction. Thus, the matrix M
contains all bit representations of nodes in a k-dimensional cube, and neighboring
entries in M correspond to neighboring nodes in the k-dimensional cube, which are
connected by an edge. Thus, the mapping

    σ : {1, …, n_1} × {1, …, n_2} → {0, 1}^k  with  σ(i, j) = M(i, j)

is an embedding of the two-dimensional mesh into the k-dimensional cube.
Figure 2.14(b) shows an example.
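The construction of M translates into two nested loops. The following C sketch is my own illustration, reusing the rgc() sketch from above rather than code from the book; each entry is formed by concatenating a_i and b_j as bit words, i.e. shifting a_i left by k_2 bits and combining it with b_j.

#include <stdio.h>
#include <stdlib.h>

unsigned *rgc(int k);   /* the RGC sketch given earlier */

/* Sketch only: print the n1 x n2 matrix M whose entry M(i, j) is the
 * concatenation a_i b_j of the i-th word of RGC_{k1} and the j-th word
 * of RGC_{k2}. */
void mesh_embedding(int k1, int k2)
{
    unsigned *a = rgc(k1), *b = rgc(k2);
    unsigned n1 = 1u << k1, n2 = 1u << k2;
    int k = k1 + k2;

    for (unsigned i = 0; i < n1; i++) {
        for (unsigned j = 0; j < n2; j++) {
            unsigned m = (a[i] << k2) | b[j];          /* k-bit word a_i b_j */
            for (int bit = k - 1; bit >= 0; bit--)     /* print most significant bit first */
                putchar(((m >> bit) & 1u) ? '1' : '0');
            putchar(j + 1 < n2 ? ' ' : '\n');
        }
    }
    free(a);
    free(b);
}
/* mesh_embedding(1, 2) prints a 2 x 4 matrix of 3-bit words of the kind
 * shown in Fig. 2.14(b). */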
2.5.3.3 Embedding of a d-Dimensional Mesh into a Hypercube Network
In a d-dimensional mesh with n_i = 2^{k_i} nodes in dimension i, 1 ≤ i ≤ d, there are
n = n_1 · … · n_d nodes in total. Each node can be represented by its mesh coordinates
(x_1, …, x_d) with 1 ≤ x_i ≤ n_i. The mapping

    σ : {1, …, n_1} × … × {1, …, n_d} → {0, 1}^k  with  σ(x_1, …, x_d) = RGC_{k_1}(x_1) RGC_{k_2}(x_2) … RGC_{k_d}(x_d),

i.e., the concatenation of the Gray code words for the individual coordinates, is then
an embedding of the d-dimensional mesh into the k-dimensional cube with k = k_1 + … + k_d.
THE STRUGGLE AGAINST NATURAL
ECONOMY
Capitalism arises and develops historically amidst a non-capitalist
society. In Western Europe it is found at first in a feudal environment
from which it in fact sprang—the system of bondage in rural areas
and the guild system in the towns—and later, after having swallowed
up the feudal system, it exists mainly in an environment of peasants
and artisans, that is to say in a system of simple commodity
production both in agriculture and trade. European capitalism is
further surrounded by vast territories of non-European civilisation
ranging over all levels of development, from the primitive communist
hordes of nomad herdsmen, hunters and gatherers to commodity
production by peasants and artisans. This is the setting for the
accumulation of capital.
We must distinguish three phases: the struggle of capital against
natural economy, the struggle against commodity economy, and the
competitive struggle of capital on the international stage for the
remaining conditions of accumulation.
The existence and development of capitalism requires an
environment of non-capitalist forms of production, but not every one
of these forms will serve its ends. Capitalism needs non-capitalist
social strata as a market for its surplus value, as a source of supply
for its means of production and as a reservoir of labour power for its
wage system. For all these purposes, forms of production based
upon a natural economy are of no use to capital. In all social
organisations where natural economy prevails, where there are
primitive peasant communities with common ownership of the land,
a feudal system of bondage or anything of this nature, economic
organisation is essentially in response to the internal demand; and
therefore there is no demand, or very little, for foreign goods, and
also, as a rule, no surplus production, or at least no urgent need to
dispose of surplus products. What is most important, however, is
that, in any natural economy, production only goes on because both
means of production and labour power are bound in one form or
another. The communist peasant community no less than the feudal
corvée farm and similar institutions maintain their economic
organisation by subjecting the labour power, and the most important
means of production, the land, to the rule of law and custom. A
natural economy thus confronts the requirements of capitalism at
every turn with rigid barriers. Capitalism must therefore always and
everywhere fight a battle of annihilation against every historical form
of natural economy that it encounters, whether this is slave
economy, feudalism, primitive communism, or patriarchal peasant
economy. The principal methods in this struggle are political force
(revolution, war), oppressive taxation by the state, and cheap
goods; they are partly applied simultaneously, and partly they
succeed and complement one another. In Europe, force assumed
revolutionary forms in the fight against feudalism (this is the
ultimate explanation of the bourgeois revolutions in the seventeenth,
eighteenth and nineteenth centuries); in the non-European
countries, where it fights more primitive social organisations, it
assumes the forms of colonial policy. These methods, together with
the systems of taxation applied in such cases, and commercial
relations also, particularly with primitive communities, form an
alliance in which political power and economic factors go hand in
hand.
In detail, capital in its struggle against societies with a natural
economy pursues the following ends:
(1) To gain immediate possession of important sources of productive
forces such as land, game in primeval forests, minerals, precious
stones and ores, products of exotic flora such as rubber, etc.
(2) To ‘liberate’ labour power and to coerce it into service.
(3) To introduce a commodity economy.
(4) To separate trade and agriculture.
At the time of primitive accumulation, i.e. at the end of the Middle
Ages, when the history of capitalism in Europe began, and right into
the nineteenth century, dispossessing the peasants in England and
on the Continent was the most striking weapon in the large-scale
transformation of means of production and labour power into capital.
Yet capital in power performs the same task even to-day, and on an
even more important scale—by modern colonial policy. It is an
illusion to hope that capitalism will ever be content with the means
of production which it can acquire by way of commodity exchange.
In this respect already, capital is faced with difficulties because vast
tracts of the globe’s surface are in the possession of social
organisations that have no desire for commodity exchange or
cannot, because of the entire social structure and the forms of
ownership, offer for sale the productive forces in which capital is
primarily interested. The most important of these productive forces
is of course the land, its hidden mineral treasure, and its meadows,
woods and water, and further the flocks of the primitive shepherd
tribes. If capital were here to rely on the process of slow internal
disintegration, it might take centuries. To wait patiently until the
most important means of production could be alienated by trading in
consequence of this process were tantamount to renouncing the
productive forces of those territories altogether. Hence derives the
vital necessity for capitalism in its relations with colonial countries to
appropriate the most important means of production. Since the
primitive associations of the natives are the strongest protection for
their social organisations and for their material bases of existence,
capital must begin by planning for the systematic destruction and
annihilation of all the non-capitalist social units which obstruct its
development. With that we have passed beyond the stage of
primitive accumulation; this process is still going on. Each new
colonial expansion is accompanied, as a matter of course, by a
relentless battle of capital against the social and economic ties of the
natives, who are also forcibly robbed of their means of production
and labour power. Any hope to restrict the accumulation of capital
exclusively to ‘peaceful competition’, i.e. to regular commodity
exchange such as takes place between capitalist producer-countries,
rests on the pious belief that capital can accumulate without
mediation of the productive forces and without the demand of more
primitive organisations, and that it can rely upon the slow internal
process of a disintegrating natural economy. Accumulation, with its
spasmodic expansion, can no more wait for, and be content with, a
natural internal disintegration of non-capitalist formations and their
transition to commodity economy, than it can wait for, and be
content with, the natural increase of the working population. Force is
the only solution open to capital; the accumulation of capital, seen
as an historical process, employs force as a permanent weapon, not
only at its genesis, but further on down to the present day. From the
point of view of the primitive societies involved, it is a matter of life
or death; for them there can be no other attitude than opposition
and fight to the finish—complete exhaustion and extinction. Hence
permanent occupation of the colonies by the military, native risings
and punitive expeditions are the order of the day for any colonial
regime. The method of violence, then, is the immediate
consequence of the clash between capitalism and the organisations
of a natural economy which would restrict accumulation. Their
means of production and their labour power no less than their
demand for surplus products is necessary to capitalism. Yet the latter
is fully determined to undermine their independence as social units,
in order to gain possession of their means of production and labour
power and to convert them into commodity buyers. This method is
the most profitable and gets the quickest results, and so it is also
the most expedient for capital. In fact, it is invariably accompanied
by a growing militarism whose importance for accumulation will be
demonstrated below in another connection. British policy in India
and French policy in Algeria are the classical examples of the
application of these methods by capitalism.
The ancient economic organisations of the Indians—the communist
village community—had been preserved in their various forms
throughout thousands of years, in spite of all the political
disturbances during their long history. In the sixth century b.c. the
Persians invaded the Indus basin and subjected part of the country.
Two centuries later the Greeks entered and left behind them
colonies, founded by Alexander on the pattern of a completely alien
civilisation. Then the savage Scythians invaded the country, and for
centuries India remained under Arab rule. Later, the Afghans
swooped down from the Iran mountains, until they, too, were
expelled by the ruthless onslaught of Tartar hordes. The Mongols’
path was marked by terror and destruction, by the massacre of
entire villages—the peaceful countryside with the tender shoots of
rice made crimson with blood. And still the Indian village community
survived. For none of the successive Mahometan conquerors had
ultimately violated the internal social life of the peasant masses and
its traditional structure. They only set up their own governors in the
provinces to supervise military organisation and to collect taxes from
the population. All conquerors pursued the aim of dominating and
exploiting the country, but none was interested in robbing the people
of their productive forces and in destroying their social organisation.
In the Moghul Empire, the peasant had to pay his annual tribute in
kind to the foreign ruler, but he could live undisturbed in his village
and could cultivate his rice on his sholgura as his father had done
before him. Then came the British—and the blight of capitalist
civilisation succeeded in disrupting the entire social organisation of
the people; it achieved in a short time what thousands of years,
what the sword of the Nogaians, had failed to accomplish. The
ultimate purpose of British capital was to possess itself of the very
basis of existence of the Indian community: the land.
This end was served above all by the fiction, always popular with
European colonisers, that all the land of a colony belongs to the
political ruler. In retrospect, the British endowed the Moghul and his
governors with private ownership of the whole of India, in order to
‘legalise’ their succession. Economic experts of the highest repute,
such as James Mill, duly supported this fiction with ‘scientific’
arguments, so in particular with the famous conclusion given below.[356]
As early as 1793, the British in Bengal gave landed property to all
the zemindars (Mahometan tax collectors) or hereditary market
superintendents they had found in their district so as to win native
support for the campaign against the peasant masses. Later they
adopted the same policy for their new conquests in the Agram
province, in Oudh, and in the Central Provinces. Turbulent peasant
risings followed in their wake, in the course of which tax collectors
were frequently driven out. In the resulting confusion and anarchy
British capitalists successfully appropriated a considerable portion of
the land.
The burden of taxation, moreover, was so ruthlessly increased that it
swallowed up nearly all the fruits of the people’s labour. This went to
such an extreme in the Delhi and Allahabad districts that, according
to the official evidence of the British tax authorities in 1854, the
peasants found it convenient to lease or pledge their shares in land
for the bare amount of the tax levied. Under the auspices of this
taxation, usury came to the Indian village, to stay and eat up the
social organisation from within like a canker.[357] In order to
accelerate this process, the British passed a law that flew in the face
of every tradition and justice known to the village community:
compulsory alienation of village land for tax arrears. In vain did the
old family associations try to protect themselves by options on their
hereditary land and that of their kindred. There was no stopping the
rot. Every day another plot of land fell under the hammer; individual
members withdrew from the family unit, and the peasants got into
debt and lost their land.
The British, with their wonted colonial stratagems, tried to make it
appear as if their power policy, which had in fact undermined the
traditional forms of landownership and brought about the collapse of
the Hindu peasant economy, had been dictated by the need to
protect the peasants against native oppression and exploitation and
served to safeguard their own interests.[358] Britain artificially
created a landed aristocracy at the expense of the ancient property-
rights of the peasant communities, and then proceeded to ‘protect’
the peasants against these alleged oppressors, and to bring this
illegally usurped land into the possession of British capitalists.
Thus large estates developed in India in a short time, while over
large areas the peasants in their masses were turned into
impoverished small tenants with a short-term lease.
Lastly, one more striking fact shows the typically capitalist method of
colonisation. The British were the first conquerors of India who
showed gross indifference to public utilities. Arabs, Afghans and
Mongols had organised and maintained magnificent works of
canalisation in India, they had given the country a network of roads,
spanned the rivers with bridges and seen to the sinking of wells.
Timur or Tamerlane, the founder of the Mongol dynasty in India, had
a care for the cultivation of the soil, for irrigation, for the safety of
the roads and the provision of food for travellers.[359] The primitive
Indian Rajahs, the Afghan or Mongol conquerors, at any rate, in
spite of occasional cruelty against individuals, made their mark with
the marvellous constructions we can find to-day at every step and
which seem to be the work of a giant race. ‘The (East India)
Company which ruled India until 1858 did not make one spring
accessible, did not sink a single well, nor build a bridge for the
benefit of the Indians.’[360]
Another witness, the Englishman James Wilson, says: ‘In the Madras
Province, no-one can help being impressed by the magnificent
ancient irrigation systems, traces of which have been preserved until
our time. Locks and weirs dam the rivers into great lakes, from
which canals distribute the water for an area of 60 or 70 miles
around. On the large rivers, there are 30 to 40 of such weirs.... The
rain water from the mountains was collected in artificial ponds, many
of which still remain and boast circumferences of between 15 and 25
miles. Nearly all these gigantic constructions were completed before
the year 1750. During the war between the Company and the
Mongol rulers—and, be it said, during the entire period of our rule in
India—they have sadly decayed.’[361]
No wonder! British capital had no object in giving the Indian
communities economic support or helping them to survive. Quite the
reverse, it aimed to destroy them and to deprive them of their
productive forces. The unbridled greed, the acquisitive instinct of
accumulation must by its very nature take every advantage of the
‘conditions of the market’ and can have no thought for the morrow.
It is incapable of seeing far enough to recognise the value of the
economic monuments of an older civilisation. (Recently British
engineers in Egypt feverishly tried to discover traces of an ancient
irrigation system rather like the one a stupid lack of vision had
allowed to decay in India, when they were charged with damming
the Nile on a grand scale in furtherance of capitalist enterprise.) Not
until 1867 was England able to appreciate the results of her noble
efforts in this respect. In the terrible famine of that year a million
people were killed in the Orissa district alone; and Parliament was
shocked into investigating the causes of the emergency. The British
government has now introduced administrative measures in an
attempt to save the peasant from usury. The Punjab Alienation Act
of 1900 made it illegal to sell or mortgage peasant lands to persons
other than of the peasant caste, though exceptions can be made in
individual cases, subject to the tax collector’s approval.[362] Having
deliberately disrupted the protecting ties of the ancient Hindu social
associations, after having nurtured a system of usury where nothing
is thought of a 15 per cent charge of interest, the British now
entrust the ruined Indian peasant to the tender care of the
Exchequer and its officials, under the ‘protection’, that is to say, of
those draining him of his livelihood.
Next to tormented British India, Algeria under French rule claims
pride of place in the annals of capitalist colonisation. When the
French conquered Algeria, ancient social and economic institutions
prevailed among the Arab-Kabyle population. These had been
preserved until the nineteenth century, and in spite of the long and
turbulent history of the country they survive in part even to the
present day.
Private property may have existed no doubt in the towns, among the
Moors and Jews, among merchants, artisans and usurers. Large
rural areas may have been seized by the State under Turkish
suzerainty—yet nearly half of the productive land is jointly held by
Arab and Kabyle tribes who still keep up the ancient patriarchal
customs. Many Arab families led the same kind of nomad life in the
nineteenth century as they had done since time immemorial, an
existence that appears restless and irregular only to the superficial
observer, but one that is in fact strictly regulated and extremely
monotonous. In summer they were wont, man, woman and child, to
take their herds and tents and migrate to the sea-swept shores of
the Tell district; and in the winter they would move back again to the
protective warmth of the desert. They travelled along definite routes,
and the summer and winter stations were fixed for every tribe and
family. The fields of those Arabs who had settled on the land were in
most cases the joint property of the clans, and the great Kabyle
family associations also lived according to old traditional rules under
the patriarchal guidance of their elected heads.
The women would take turns for household duties; a matriarch,
again elected by the family, being in complete charge of the clan’s
domestic affairs, or else the women taking turns of duty. This
organisation of the Kabyle clans on the fringe of the African desert
bears a startling resemblance to that of the famous Southern
Slavonic Zadruga—not only the fields but all the tools, weapons and
monies, all that the members acquire or need for their work, are
communal property of the clan. Personal property is confined to one
suit of clothing, and in the case of a woman to the dresses and
ornaments of her dowry. More valuable attire and jewels, however,
are considered common property, and individuals were allowed to
use them only if the whole family approved. If the clan was not too
numerous, meals were taken at a common table; the women took it
in turns to cook, but the eldest were entrusted with the dishing out.
If a family circle was too large, the head of the family would each
month ration out strictly proportionate quantities of uncooked food
to the individual families who then prepared them. These
communities were bound together by close ties of kinship, mutual
assistance and equality, and a patriarch would implore his sons on
his deathbed to remain faithful to the family.[363]
These social relations were already seriously impaired by the rule of
the Turks, established in Algeria in the sixteenth century. Yet the
Turkish exchequer had by no means confiscated all the land. That is
a legend invented by the French at a much later date. Indeed, only a
European mind is capable of such a flight of fancy which is contrary
to the entire economic foundation of Islam both in theory and
practice. In truth, the facts were quite different. The Turks did not
touch the communal fields of the village communities. They merely
confiscated a great part of uncultivated land from the clans and
converted it into crownland under Turkish local administrators
(Beyliks). The state worked these lands in part with native labour,
and in part they were leased out on rent or against payment in kind.
Further the Turks took advantage of every revolt of the subjected
families and of every disturbance in the country to add to their
possessions by large-scale confiscation of land, either for military
establishments or for public auction, when most of it went to Turkish
or other usurers. To escape from the burden of taxation and
confiscation, many peasants placed themselves under the protection
of the Church, just as they had done in medieval Germany. Hence
considerable areas became Church-property. All these changes
finally resulted in the following distribution of Algerian land at the
time of the French conquest: crownlands occupied nearly 3,750,000
acres, and a further 7,500,000 acres of uncultivated land as common
property of All the Faithful (Bled-el-Islam). 7,500,000 acres had been
privately owned by the Berbers since Roman times, and under
Turkish rule a further 3,750,000 acres had come into private
ownership, a mere 12,500,000 acres remaining communal property
of individual Arab clans. In the Sahara, some of the 7,500,000 acres
fertile land near the Sahara Oases was communally owned by the
clans and some belonged to private owners. The remaining
57,000,000 acres were mainly waste land.
With their conquest of Algeria, the French made a great ado about
their work of civilisation, since the country, having shaken off the
Turkish yoke at the beginning of the eighteenth century, was
harbouring the pirates who infested the Mediterranean and trafficked
in Christian slaves. Spain and the North American Union in particular,
themselves at that time slave traders on no mean scale, declared
relentless war on this Moslem iniquity. France, in the very throes of
the Great Revolution, proclaimed a crusade against Algerian anarchy.
Her subjection of that country was carried through under the
slogans of ‘combating slavery’ and ‘instituting orderly and civilised
conditions’. Yet practice was soon to show what was at the bottom of
it all. It is common knowledge that in the forty years following the
subjection of Algeria, no European state suffered so many changes
in its political system as France: the restoration of the monarchy was
followed by the July Revolution and the reign of the ‘Citizen King’,
and this was succeeded by the February Revolution, the Second
Republic, the Second Empire, and finally, after the disaster of 1870,
by the Third Republic. In turn, the aristocracy, high finance, petty
bourgeoisie and the large middle classes in general gained political
ascendancy. Yet French policy in Algeria remained undeflected by
this succession of events; it pursued a single aim from beginning to
end; at the fringe of the African desert, it demonstrated plainly that
all the political revolutions in nineteenth-century France centred in a
single basic interest: the rule of a capitalist bourgeoisie and its
institutions of ownership.
‘The bill submitted for your consideration’, said Deputy Humbert on
June 30, 1873, in the Session of the French National Assembly as
spokesman for the Commission for Regulating Agrarian Conditions in
Algeria, ‘is but the crowning touch to an edifice well-founded on a
whole series of ordinances, edicts, laws and decrees of the Senate
which together and severally have as the same object: the
establishment of private property among the Arabs.’
In spite of the ups and downs of internal French politics French
colonial policy persevered for fifty years in its systematic and
deliberate efforts to destroy and disrupt communal property. It
served two distinct purposes: The break-up of communal property
was primarily intended to smash the social power of the Arab family
associations and to quell their stubborn resistance against the
French yoke, in the course of which there were innumerable risings
so that, in spite of France’s military superiority, the country was in a
continual state of war.[364] Secondly, communal property had to be
disrupted in order to gain the economic assets of the conquered
country; the Arabs, that is to say, had to be deprived of the land
they had owned for a thousand years, so that French capitalists
could get it. Once again the fiction we know so well, that under
Moslem law all land belongs to the ruler, was brought into play. Just
as the English had done in British India, so Louis Philippe’s
governors in Algeria declared the existence of communal property
owned by the clan to be ‘impossible’. This fiction served as an
excuse to claim for the state most of the uncultivated areas, and
especially the commons, woods and meadows, and to use them for
purposes of colonisation. A complete system of settlement
developed, the so-called cantonments which settled French colonists
on the clan land and herded the tribes into a small area. Under the
decrees of 1830, 1831, 1840, 1844, 1845 and 1846 these thefts of
Arab family land were legalised. Yet this system of settlement did not
actually further colonisation; it only bred wild speculation and usury.
In most instances the Arabs managed to buy back the land that had
been taken from them, although they were thus incurring heavy
debts. French methods of oppressive taxation had the same
tendency, in particular the law of June 16, 1851, proclaiming all
forests to be state property, which robbed the natives of 6,000,000
acres of pasture and brushwood, and took away the prime essential
for animal husbandry. This spate of laws, ordinances and regulations
wrought havoc with the ownership of land in the country. Under the
prevailing condition of feverish speculation in land, many natives
sold their estates to the French in the hope of ultimately recovering
them. Quite often they sold the same plot to two or three buyers at
a time, and what is more, it was quite often inalienable family land
and did not even belong to them. A company of speculators from
Rouen, e.g., believed that they had bought 50,000 acres, but in fact
they had only acquired a disputed title to 3,425 acres. There
followed an infinite number of lawsuits in which the French courts
supported on principle all partitions and claims of the buyers. In
these uncertain conditions, speculation, usury and anarchy were rife.
But although the introduction of French colonists in large numbers
among the Arab population had aimed at securing support for the
French government, this scheme failed miserably. Thus, under the
Second Empire, French policy tried another tack. The government,
with its European lack of vision, had stubbornly denied the existence
of communal property for thirty years, but it had learned better at
last. By a single stroke of the pen, joint family property was officially
recognised and condemned to be broken up. This is the double
significance of the decree of the Senate dated April 22, 1864.
General Allard declared in the Senate:
‘The government does not lose sight of the fact that the general aim
of its policy is to weaken the influence of the tribal chieftains and to
dissolve the family associations. By this means, it will sweep away
the last remnants of feudalism [sic!] defended by the opponents of
the government bill.... The surest method of accelerating the process
of dissolving the family associations will be to institute private
property and to settle European colonists among the Arab
families.’[365]
The law of 1863 created special Commissions for cutting up the
landed estates, consisting of the Chairman, either a Brigadier-
General or Colonel, one sous-préfet, one representative of the Arab
military authorities and an official bailiff. These natural experts on
African economics and social conditions were faced with the
threefold task, first of determining the precise boundaries of the
great family estates, secondly to distribute the estates of each clan
among its various branches, and finally to break up this family land
into separate private allotments. This expedition of the Brigadiers
into the interior of Africa duly took place. The Commissions
proceeded to their destinations. They were to combine the office of
judge in all land disputes with that of surveyor and land distributor,
the final decision resting with the Governor-General of Algeria. Ten
years’ valiant efforts by the Commissions yielded the following
result: between 1863 and 1873, of 700 hereditary estates, 400 were
shared out among the branches of each clan, and the foundations
for future inequalities between great landed estates and small
allotments were thus laid. One family, in fact, might receive between
2·5 and 10 acres, while another might get as much as 250 or even
450 acres, depending on the size of the estate and the number of
collaterals within the clan. Partition, however, stopped at that point.
Arab customs presented unsurmountable difficulties to a further
division of family land. In spite of Colonels and Brigadiers, French
policy had again failed in its object to create private property for
transfer to the French.
But the Third Republic, an undisguised regime of the bourgeoisie,
had the courage and the cynicism to go straight for its goal and to
attack the problem from the other end, disdaining the preliminaries
of the Second Empire. In 1873, the National Assembly worked out a
law with the avowed intention immediately to split up the entire
estates of all the 700 Arab clans, and forcibly to institute private
property in the shortest possible time. Desperate conditions in the
colony were the pretext for this measure. It had taken the great
Indian famine of 1866 to awaken the British public to the marvellous
exploits of British colonial policy and to call for a parliamentary
investigation; and similarly, Europe was alarmed at the end of the
sixties by the crying needs of Algeria where more than forty years of
French rule culminated in wide-spread famine and a disastrous
mortality rate among the Arabs. A commission of inquiry was set up
to recommend new legislation with which to bless the Arabs: it was
unanimously resolved that there was only one life-buoy for them—
the institution of private property; that alone could save the Arab
from destitution, since he would then always be able to sell or
mortgage his land. It was decided therefore, that the only means of
alleviating the distress of the Arabs, deeply involved in debts as they
were because of the French land robberies and oppressive taxation,
was to deliver them completely into the hands of the usurers. This
farce was expounded in all seriousness before the National Assembly
and was accepted with equal gravity by that worthy body. The
‘victors’ of the Paris Commune flaunted their brazenness.
In the National Assembly, two arguments in particular served to
support the new law: those in favour of the bill emphasised over and
over again that the Arabs themselves urgently desired the
introduction of private property. And so they did, or rather the
Algerian land speculators and usurers did, since they were vitally
interested in ‘liberating’ their victims from the protection of the
family ties. As long as Moslem law prevailed in Algeria, hereditary
clan and family lands were inalienable, which laid insuperable
difficulties in the way of anyone who wished to mortgage his land.
The law of 1863 had merely made a breach in these obstacles, and
the issue now at stake was their complete abolition so as to give a
free hand to the usurers. The second argument was ‘scientific’, part
of the same intellectual equipment from which that worthy, James
Mill, had drawn for his abstruse conclusions regarding Indian
relations of ownership: English classical economics. Thoroughly
versed in their masters’ teachings, the disciples of Smith and Ricardo
impressively declaimed that private property is indispensable for the
prevention of famines in Algeria, for more intensive and better
cultivation of the land, since obviously no one would be prepared to
invest capital or intensive labour in a piece of land which does not
belong to him and whose produce is not his own to enjoy. But the
facts spoke a different language. They proved that the French
speculators employed the private property they had created in
Algeria for anything but the more intensive and improved cultivation
of the soil. In 1873, 1,000,000 acres were French property. But the
capitalist companies, the Algerian and Setif Company which owned
300,000 acres, did not cultivate the land at all but leased it to the
natives who tilled it in the traditional manner, nor were 25 per cent
of the other French owners engaged in agriculture. It was simply
impossible to conjure up capitalist investments and intensive
agriculture overnight, just as capitalist conditions in general could
not be created out of nothing. They existed only in the imagination
of profit-seeking French speculators, and in the benighted
doctrinaire visions of their scientific economists. The essential point,
shorn of all pretexts and flourishes which seem to justify the law of
1873, was simply the desire to deprive the Arabs of their land, their
livelihood. And although these arguments had worn threadbare and
were evidently insincere, this law which was to put paid to the
Algerian population and their material prosperity, was passed
unanimously on July 26, 1873.
But even this master-stroke soon proved a failure. The policy of the
Third Republic miscarried because of the difficulties in substituting at
one stroke bourgeois private property for the ancient clan
communism, just as the policy of the Second Empire had come to
grief over the same issue. In 1890, when the law of July 26, 1873,
supplemented by a second law on April 28, 1887, had been in force
for seventeen years, 14,000,000 francs had been spent on dealing
with 40,000,000 acres. It was estimated that the process would not
be completed before 1950 and would require a further 60,000,000
francs. And still abolition of clan communism, the ultimate purpose,
had not been accomplished. What had really been attained was all
too evident: reckless speculation in land, thriving usury and the
economic ruin of the natives.
Since it had been impossible to institute private property by force, a
new experiment was undertaken. The laws of 1873 and 1887 had
been condemned by a commission appointed for their revision by
the Algerian government in 1890. It was another seven years before
the legislators on the Seine made the effort to consider reforms for
the ruined country. The new decree of the Senate refrained in
principle from instituting private property by compulsion or
administrative measures. The laws of February 2, 1897, and the
edict of the Governor-General of Algeria (March 3, 1898) both
provided chiefly for the introduction of private property following a
voluntary application by the prospective purchaser or owner.[366] But
there were clauses to permit a single owner, without the consent of
the others, to claim private property; further, such a ‘voluntary’
application can be extorted at any convenient moment if the owner
is in debt and the usurer exerts pressure. And so the new law left
the doors wide open for French and native capitalists further to
disrupt and exploit the hereditary and clan lands.
Of recent years, this mutilation of Algeria which had been going on
for eight decades meets with even less opposition, since the Arabs,
surrounded as they are by French capital following the subjection of
Tunisia (1881) and the recent conquest of Morocco, have been
rendered more and more helpless. The latest result of the French
regime in Algeria is an Arab exodus into Turkey.[367]
THE INTRODUCTION OF
COMMODITY ECONOMY
The second condition of importance for acquiring means of
production and realising the surplus value is that commodity
exchange and commodity economy should be introduced in societies
based on natural economy as soon as their independence has been
abrogated, or rather in the course of this disruptive process. Capital
requires to buy the products of, and sell its commodities to, all non-
capitalist strata and societies. Here at last we seem to find the
beginnings of that ‘peace’ and ‘equality’, the do ut des, mutual
interest, ‘peaceful competition’ and the ‘influences of civilisation’. For
capital can indeed deprive alien social associations of their means of
production by force, it can compel the workers to submit to capitalist
exploitation, but it cannot force them to buy its commodities or to
realise its surplus value. In districts where natural economy formerly
prevailed, the introduction of means of transport—railways,
navigation, canals—is vital for the spreading of commodity economy,
a further hopeful sign. The triumphant march of commodity
economy thus begins in most cases with magnificent constructions
of modern transport, such as railway lines which cross primeval
forests and tunnel through the mountains, telegraph wires which
bridge the deserts, and ocean liners which call at the most outlying
ports. But it is a mere illusion that these are peaceful changes.
Under the standard of commerce, the relations between the East
India Company and the spice-producing countries were quite as
piratical, extortionate and blatantly fraudulent as present-day
relations between American capitalists and the Red Indians of
Canada whose furs they buy, or between German merchants and the
Negroes of Africa. Modern China presents a classical example of the
‘gentle’, ‘peace-loving’ practices of commodity exchange with
backward countries. Throughout the nineteenth century, beginning
with the early forties, her history has been punctuated by wars with
the object of opening her up to trade by brute force. Missionaries
provoked persecutions of Christians, Europeans instigated risings,
and in periodical massacres a completely helpless and peaceful
agrarian population was forced to match arms with the most modern
capitalist military technique of all the Great Powers of Europe. Heavy
war contributions necessitated a public debt, China taking up
European loans, resulting in European control over her finances and
occupation of her fortifications; the opening of free ports was
enforced, railway concessions to European capitalists extorted. By all
these measures commodity exchange was fostered in China, from
the early thirties of the last century until the beginning of the
Chinese revolution.
European civilisation, that is to say commodity exchange with
European capital, made its first impact on China with the Opium
Wars when she was compelled to buy the drug from Indian
plantations in order to make money for British capitalists. In the
seventeenth century, the East India Company had introduced the
cultivation of poppies in Bengal; the use of the drug was
disseminated in China by its Canton branch. At the beginning of the
nineteenth century, opium fell so considerably in price that it rapidly
became the ‘luxury of the people’. In 1821, 4,628 chests of opium
were imported to China at an average price of £265; then the price
fell by 50 per cent, and Chinese imports rose to 9,621 chests in
1825, and to 26,670 chests in 1830.[368] The deadly effects of the
drug, especially of the cheaper kinds used by the poorer population,
became a public calamity and made it necessary for China to lay an
embargo on imports, as an emergency measure. Already in 1828,
the viceroy of Canton had prohibited imports of opium, only to
deflect the trade to other ports. One of the Peking censors
commanded to investigate the question gave the following report:
‘I have learnt that people who smoke opium have developed such a
craving for this noxious drug that they make every effort to obtain
this gratification. If they do not get their opium at the usual hour,
their limbs begin to tremble, they break out in sweat, and they
cannot perform the slightest tasks. But as soon as they are given the
pipe, they inhale a few puffs and are cured immediately.
‘Opium has therefore become a necessity for all who smoke it, and it
is not surprising that under cross-examination by the local
authorities they will submit to every punishment rather than reveal
the names of their suppliers. Local authorities are also in some cases
given presents to tolerate the evil or to delay any investigation
already under way. Most merchants who bring goods for sale into
Canton also deal in smuggled opium.
‘I am of the opinion that opium is by far a greater evil than
gambling, and that opium smokers should therefore be punished no
less than gamblers.’
The censor suggested that every convicted opium smoker should be
sentenced to eighty strokes of the bamboo, and anybody refusing to
give the name of his supplier to a hundred strokes and three years
of exile. The pigtailed Cato of Peking concludes his report with a
frankness staggering to any European official: ‘Apparently opium is
mostly introduced from abroad by dishonest officials in connivance
with profit-seeking merchants who transport it into the interior of the
country. Then the first to indulge are people of good family, wealthy
private persons and merchants, but ultimately the drug habit
spreads among the common people. I have learnt that in all
provinces opium is smoked not only in the civil service but also in
the army. The officials of the various districts indeed enjoin the legal
prohibition of sale by special edicts. But at the same time, their
parents, families, dependants and servants simply go on smoking
opium, and the merchants profit from the ban by increased prices.
Even the police have been won over; they buy the stuff instead of
helping to suppress it, and this is an additional reason for the
disregard in which all prohibitions and ordinances are held.’[369]
Consequently, a stricter law was passed in 1833 which made every
opium smoker liable to a hundred strokes and two months in the
stocks, and provincial governors were ordered to report annually on
their progress in the battle against opium. But there were two
sequels to this campaign: on the one hand large-scale poppy
plantations sprang up in the interior, particularly in the Honan,
Setchuan, and Kueitchan provinces, and on the other, England
declared war on China to get her to lift the embargo. These were the
splendid beginnings of ‘opening China’ to European civilisation—by
the opium pipe.
Canton was the first objective. The fortifications of the town at the
main arm of the Perl estuary could not have been more primitive.
Every day at sunset a barrier of iron chains was attached to wooden
rafts anchored at various distances, and this was the main defence.
Moreover, the Chinese guns could only fire at a certain angle and
were therefore completely ineffectual. With such primitive defences,
just adequate to prevent a few merchant ships from landing, did the
Chinese meet the British attack. A couple of British cruisers, then,
sufficed to effect an entry on September 7, 1839. The sixteen battle-
junks and thirteen fire-ships which the Chinese put up for resistance
were shot up or dispersed in a matter of forty-five minutes. After this
initial victory, the British renewed the attack in the beginning of 1841
with a considerably reinforced fleet. This time the fleet, consisting in
a number of battle-junks, and the forts were attacked
simultaneously. The first incendiary rocket that was fired penetrated
through the armour casing of a junk into the powder chamber and
blew the ship with the entire crew sky-high. In a short time eleven
junks, including the flag-ship, were destroyed, and the remainder
precipitately made for safety. The action on land took a little longer.
Since the Chinese guns were quite useless, the British walked right
through the fortifications, climbed to a strategic position—which was
not even guarded—and proceeded to slaughter the helpless Chinese
from above. The casualty list of the battle was: for the Chinese 600
dead, and for the British, 1 dead and 30 wounded, more than half of
the latter having been injured by the accidental explosion of a
powder magazine. A few weeks later, there followed another British
exploit. The forts of Anung-Hoy and North Wantong were to be
taken. No less than twelve fully equipped cruisers were available for
this task. What is more, the Chinese, once again forgetful of the
most important thing, had omitted to fortify the island of South
Wantong. Thus the British calmly landed a battery of howitzers to
bombard the fort from one side, the cruisers shelling it from the
other. After that, the Chinese were driven from the forts in a matter
of minutes, and the landing met with no resistance. The ensuing
display of inhumanity—an English report says—will be for ever
deeply deplored by the British staff. The Chinese, trying to escape
from the barricades, had fallen into a moat which was soon literally
filled to the brim with helpless soldiers begging for mercy. Into this
mass of prostrate human bodies, the sepoys—acting against orders,
it is claimed—fired again and again. This is the way in which Canton
was made receptive to commodity exchange.
Nor did the other ports fare better. On July 4, 1841, three British
cruisers with 120 cannon appeared off the islands in the entrance to
the town of Ningpo. More cruisers arrived the following day. In the
evening the British admiral sent a message to the Chinese governor,
demanding the capitulation of the island. The governor explained
that he had no power to resist but could not surrender without
orders from Peking. He therefore asked for a delay. This was
refused, and at half-past two in the morning the British stormed the
defenceless island. Within eight minutes, the fort and the houses on
the shore were reduced to smouldering rubble. Having landed on the
deserted coast littered with broken spears, sabres, shields, rifles and
a few dead bodies, the troops advanced on the walls of the island
town of Tinghai. With daybreak, reinforced by the crews of other
ships which had meanwhile arrived, they proceeded to put scaling-
ladders to the scarcely defended ramparts. A few more minutes gave
them mastery of the town. This splendid victory was announced with
becoming modesty in an Order of the Day: ‘Fate has decreed that
the morning of July 5, 1841, should be the historic date on which
Her Majesty’s flag was first raised over the most beautiful island of
the Celestial Empire, the first European flag to fly triumphantly
above this lovely countryside.’[370]
On August 25, 1841, the British approached the town of Amoy,
whose forts were armed with a hundred of the heaviest Chinese
guns. These guns being almost useless, and the commanders
lacking in resource, the capture of the harbour was child’s play.
Under cover of a heavy barrage, British ships drew near the walls of
Kulangau, landed their marines, and after a short stand the Chinese
troops were driven out. The twenty-six battle-junks with 128 guns in
the harbour were also captured, their crews having fled. One
battery, manned by Tartars, heroically held out against the combined
fire of three British ships, but a British landing was effected in their
rear and the post wiped out.
This was the finale of the notorious Opium War. By the peace treaty
of August 27, 1842, the island of Hongkong was ceded to Britain. In
addition, the towns of Canton, Amoy, Futchou, Ningpo and Shanghai
were to open their ports to foreign commerce. But within fifteen
years, there was a further war against China. This time, Britain had
joined forces with the French. In 1857, the allied navies captured
Canton with a heroism equal to that of the first war. By the peace of
Tientsin (1858), the opium traffic, European commerce and Christian
missions were admitted into the interior. Already in 1859, however,
the British resumed hostilities and attempted to destroy the Chinese
fortifications on the Peiho river, but were driven off after a fierce
battle in which 464 people were wounded or killed.[371]
After that, Britain and France again joined forces. At the end of
August 1860, 12,600 English and 7,500 French troops under General
Cousin-Montauban first captured the Taku forts without a single shot
having been fired. Then they proceeded towards Tientsin and on
towards Peking. A bloody battle was joined at Palikao, and Peking
fell to the European Powers. Entering the almost depopulated and
completely undefended city, the victors began by pillaging the
Imperial Palace, manfully helped by General Cousin himself, who was
later to become field marshal and Count of Palikao. Then the Palace
went up in flames, fired on Lord Elgin’s order as an imposed
penance.[372]
The European Powers now obtained concessions to set up embassies
in Peking, and to start trading with Tientsin and other towns. The
Tchi-fu Convention of 1876 guaranteed full facilities for importing
opium into China—at a time when the Anti-Opium League in England
agitated against the spreading of the drug habit in London,
Manchester and other industrial districts, when a parliamentary
commission declared the consumption of opium to be harmful in the
extreme. By all treaties made at that time between China and the
Great Powers any European, whether merchant or missionary, was
guaranteed the right to acquire land, to which end the legitimate
arguments were ably supported by deliberate fraud.
First and foremost the ambiguity of the treaty texts made a
convenient excuse for European capital to encroach beyond the
Treaty Ports. It used every loophole in the wording of the treaties to
begin with, and subsequently blackmailed the Chinese government
into permitting the missions to acquire land not alone in the Treaty
Ports but in all the provinces of the realm. Their claim was based
upon the notorious bare-faced distortion of the Chinese original in
Abbé Delamarre’s official translation of the supplementary
convention with France. French diplomacy, and the Protestant
missions in particular, unanimously condemned the crafty swindle of
the Catholic padre, but nevertheless they were firm that the rights of
French missions obtained by this fraud should be explicitly extended
to the Protestant missions as well.[373]
China’s entry into commodity exchange, having begun with the
Opium Wars, was finally accomplished with a series of ‘leases’ and
the China campaign of 1900, when the commercial interests of
European capital sank to a brazen international dogfight over
Chinese land. The description of the Dowager Empress, who wrote
to Queen Victoria after the capture of the Taku forts, subtly
underlines this contrast between the initial theory and the ultimate
practice of the ‘agents of European civilisation’:
‘To your Majesty, greeting!—In all the dealings of England with the
Empire of China, since first relations were established between us,
there has never been any idea of territorial aggrandisement on the
part of Great Britain, but only a keen desire to promote the interests
of her trade. Reflecting upon the fact that our country is now
plunged into a dreadful condition of warfare, we bear in mind that a
large proportion of China’s trade, seventy or eighty per cent, is done
with England; moreover, your Customs duties are the lightest in the
world, and few restrictions are made at your sea-ports in the matter
of foreign importations; for these reasons our amiable relations with
British merchants at our Treaty Ports have continued unbroken for
the last half century, to our mutual benefit.—But a sudden change
has now occurred and general suspicion has been created against
us. We would therefore ask you now to consider that if, by any
conceivable combination of circumstances, the independence of our
Empire should be lost, and the Powers unite to carry out their long-
plotted schemes to possess themselves of our territory’—(in a
simultaneous message to the Emperor of Japan, the impulsive Tzu
Hsi openly refers to ‘The earth-hungry Powers of the West, whose
tigerish eyes of greed are fixed in our direction’[374])—‘the results to
your country’s interests would be disastrous and fatal to your trade.
At this moment our Empire is striving to the utmost to raise an army
and funds sufficient for its protection; in the meanwhile we rely on
your good services to act as mediator, and now anxiously await your
decision.’[375]
Both during the wars and in the interim periods, European
civilisation was busy looting and thieving on a grand scale in the
Chinese Imperial Palaces, in the public buildings and in the
monuments of ancient civilisation, not only in 1860, when the French
pillaged the Emperor’s Palace with its legendary treasures, and in
1900, ‘when all the nations vied with each other to steal public and
private property’. Every European advance was marked not only by
the progress of commodity exchange, but by the smouldering ruins
of the largest and most venerable towns, by the decay of agriculture
over large rural areas, and by intolerably oppressive taxation for war
contributions. There are more than 40 Chinese Treaty Ports—and
every one of them has been paid for with streams of blood, with
massacre and ruin.
THE STRUGGLE AGAINST PEASANT ECONOMY
An important final phase in the campaign against natural economy is
to separate industry from agriculture, to eradicate rural industries
altogether from peasant economy. Handicraft in its historical
beginnings was a subsidiary occupation, a mere appendage to
agriculture in civilised and settled societies. In medieval Europe it
became gradually independent of the corvée farm and of agriculture,
and developed into specialised occupations, i.e. the production of
commodities by urban guilds. In industrial districts, production had
progressed from home craft by way of primitive manufacture to the
capitalist factory of the staple industries, but in the rural areas,
under peasant economy, home crafts persisted as an intrinsic part of
agriculture. Every hour that could be spared from cultivating the soil
was devoted to handicrafts which, as an auxiliary domestic industry,
played an important part in providing for personal needs.[376]
It is a recurrent phenomenon in the development of capitalist
production that one branch of industry after the other is singled out,
isolated from agriculture and concentrated in factories for mass
production. The textile industry provides the textbook example, but
the same thing has happened, though less obviously, in the case of
other rural industries. Capital must get the peasants to buy its
commodities and will therefore begin by restricting peasant economy
to a single sphere—that of agriculture—which will not immediately
submit to capitalist domination and, under European conditions of
ownership, will do so only with great difficulty.[377] To all outward
appearance, this process is quite peaceful. It is scarcely noticeable
and seemingly caused by purely economic factors. There can be no
doubt that mass production in the factories is technically superior to
primitive peasant industry, owing to a higher degree of
specialisation, scientific analysis and management of the productive
process, improved machinery and access to international resources
of raw materials. In reality, however, the process of separating
agriculture and industry is determined by factors such as oppressive
taxation, war, or squandering and monopolisation of the nation’s
land, and thus belongs to the spheres of political power and criminal
law no less than to that of economics.
Nowhere has this process been brought to such perfection as in the
United States. In the wake of the railways, financed by European
and in particular British capital, the American farmer crossed the
Union from East to West and in his progress over vast areas killed off
the Red Indians with fire-arms and blood-hounds, liquor and
venereal disease, pushing the survivors to the West, in order to
appropriate the land they had ‘vacated’, to clear it and bring it under
the plough. The American farmer, the ‘backwoodsman’ of the good
old times before the War of Secession, was very different indeed
from his modern counterpart. There was hardly anything he could
not do, and he led a practically self-sufficient life on his isolated
farm.
In the beginning of the nineties, one of the leaders of the Farmers’
Alliance, Senator Peffer, wrote as follows: ‘The American farmer of
to-day is altogether a different sort of man from his ancestor of fifty
or a hundred years ago. A great many men and women now living
remember when farmers were largely manufacturers; that is to say,
they made a great many implements for their own use. Every farmer
had an assortment of tools with which he made wooden implements,
as forks and rakes, handles for his hoes and ploughs, spokes for his
wagon, and various other implements made wholly out of wood.
Then the farmer produced flax and hemp and wool and cotton.
These fibres were prepared upon the farm; they were spun into
yarn, woven into cloth, made into garments, and worn at home.
Every farm had upon it a little shop for wood and iron work, and in
the dwelling were cards and looms; carpets were woven, bed-
clothing of different sorts was prepared; upon every farm geese
were kept, their feathers used for supplying the home demand with