Lecture Notes in Computer Science 6590
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Per Stenström (Ed.)
Transactions on
High-Performance
Embedded Architectures
and Compilers III
Volume Editor
Per Stenström
Chalmers University of Technology
Department of Computer Science and Engineering
412 96 Gothenburg, Sweden
E-mail: per.stenstrom@chalmers.se
ISSN 0302-9743 (LNCS) e-ISSN 1611-3349 (LNCS)
ISSN 1864-306X (THIPEAC) e-ISSN 1864-3078 (THIPEAC)
ISBN 978-3-642-19447-4 e-ISBN 978-3-642-19448-1
DOI 10.1007/978-3-642-19448-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2007923068
CR Subject Classification (1998): B.2, C.1, D.3.4, B.5, C.2, D.4
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Editor-in-Chief’s Message
It is my pleasure to introduce you to the third volume of Transactions on High-
Performance Embedded Architectures and Compilers. This journal was created as
an archive for scientific articles in the converging fields of high-performance and
embedded computer architectures and compiler systems. Design considerations
in both general-purpose and embedded systems are increasingly being based on
similar scientific insights. For example, a state-of-the-art game console today
consists of a powerful parallel computer whose building blocks are the same as
those found in computational clusters for high-performance computing. More-
over, keeping power/energy consumption at a low level for high-performance
general-purpose systems as well as in, for example, mobile embedded systems is
equally important in order to either keep heat dissipation at a manageable level
or to maintain a long operating time despite the limited battery capacity. It is
clear that similar scientific issues have to be solved to build competitive systems
in both segments. Additionally, for high-performance systems to be realized – be
it embedded or general-purpose – a holistic design approach has to be taken by
factoring in the impact of applications as well as the underlying technology when
making design trade-offs. The main topics of this journal reflect this development
and include (among others):
– Processor architecture, e.g., network and security architectures, application-specific processors and accelerators, and reconfigurable architectures
– Memory system design
– Power, temperature, performance, and reliability constrained designs
– Evaluation methodologies, program characterization, and analysis techniques
– Compiler techniques for embedded systems, e.g., feedback-directed optimization, dynamic compilation, adaptive execution, continuous profiling/optimization, back-end code generation, and binary translation/optimization
– Code size/memory footprint optimizations
This volume contains 14 papers divided into four sections. The first section
is a special section containing the top four papers from the Third International
Conference on High-Performance and Embedded Architectures and Compilers -
HiPEAC. I would like to thank Manolis Katevenis (University of Crete and
FORTH) and Rajiv Gupta (University of California at Riverside) for acting as
guest editors of that section. Papers in this section deal with cache performance
issues and improved branch prediction.
The second section is a set of four papers providing a snapshot from the
Eighth MEDEA Workshop. I am indebted to Sandro Bartolini and Pierfrancesco
Foglia for putting together this special section.
The third section contains two regular papers and the fourth section pro-
vides a snapshot from the First Workshop on Programmability Issues for Mul-
ticore Computers (MULTIPROG). The organizers – Eduard Ayguade, Roberto
Gioiosa, and Osman Unsal – have put together this section. I thank them for
their effort.
The editorial board has worked diligently to handle the papers for the journal.
I would like to thank all the contributing authors, editors, and reviewers for their
excellent work.
Per Stenström, Chalmers University of Technology
Editor-in-chief
Transactions on HiPEAC
Editorial Board
Per Stenström is a professor of computer engineering at Chalmers University
of Technology. His research interests are devoted to design principles for high-
performance computer systems and he has made multiple contributions to espe-
cially high-performance memory systems. He has authored or co-authored three
textbooks and more than 100 publications in international journals and con-
ferences. He regularly serves Program Committees of major conferences in the
computer architecture field. He is also an associate editor of IEEE Transactions
on Parallel and Distributed Systems, a subject-area editor of
the Journal of Parallel and Distributed Computing, an associate editor of the
IEEE TCCA Computer Architecture Letters, and the founding Editor-in-Chief
of Transactions on High-Performance Embedded Architectures and Compilers.
He co-founded the HiPEAC Network of Excellence funded by the European
Commission. He has acted as General and Program Chair for a large number
of conferences including the ACM/IEEE Int. Symposium on Computer Archi-
tecture, the IEEE High-Performance Computer Architecture Symposium, and
the IEEE Int. Parallel and Distributed Processing Symposium. He is a Fellow
of the ACM and the IEEE and a member of Academia Europaea and the Royal
Swedish Academy of Engineering Sciences.
Koen De Bosschere obtained his PhD from Ghent University in 1992. He is a
professor in the ELIS Department at the Universiteit Gent where he teaches
courses on computer architecture and operating systems. His current research
interests include computer architecture, system software, and code optimization.
He has co-authored 150 contributions in the domain of optimization, perfor-
mance modeling, microarchitecture, and debugging. He is the coordinator of the
ACACES research network and of the European HiPEAC2 network. Contact
him at Koen.DeBosschere@elis.UGent.be.
Jose Duato is a professor in the Department of Computer Engineering (DISCA)
at UPV, Spain. His research interests include interconnection networks and mul-
tiprocessor architectures. He has published over 340 papers. His research results
have been used in the design of the Alpha 21364 microprocessor, the Cray T3E,
IBM BlueGene/L, and Cray Black Widow supercomputers. Dr. Duato is the
first author of the book Interconnection Networks: An Engineering Approach.
He has served as associate editor of IEEE TPDS and IEEE TC. He was General
Co-chair of ICPP 2001, Program Chair of HPCA-10, and Program Co-chair of
ICPP 2005. Also, he has served as Co-chair, Steering Committee member, Vice-
Chair, or Program Committee member in more than 55 conferences, including
HPCA, ISCA, IPPS/SPDP, IPDPS, ICPP, ICDCS, Europar, and HiPC.
Manolis Katevenis received his PhD degree from U.C. Berkeley in 1983 and the
ACM Doctoral Dissertation Award in 1984 for his thesis on “Reduced Instruc-
tion Set Computer Architectures for VLSI.” After a brief term on the faculty of
Computer Science at Stanford University, he has been in Greece, with the Uni-
versity of Crete and with FORTH since 1986. After RISC, his research has been
on interconnection networks and interprocessor communication. In packet switch
architectures, his contributions since 1987 have been mostly in per-flow queue-
ing, credit-based flow control, congestion management, weighted round-robin
scheduling, buffered crossbars, and non-blocking switching fabrics. In multipro-
cessing and clustering, his contributions since 1993 have been on remote-write-
based, protected, user-level communication.
His home URL is http://archvlsi.ics.forth.gr/~kateveni
Michael O’Boyle is a professor in the School of Informatics at the University
of Edinburgh and an EPSRC Advanced Research Fellow. He received his PhD
in Computer Science from the University of Manchester in 1992. He was for-
merly a SERC Postdoctoral Research Fellow, a Visiting Research Scientist at
IRISA/INRIA Rennes, a Visiting Research Fellow at the University of Vienna
and a Visiting Scholar at Stanford University. More recently he was a Visiting
Professor at UPC, Barcelona.
Dr. O’Boyle’s main research interests are in adaptive compilation, formal
program transformation representations, the compiler impact on embedded sys-
tems, compiler directed low-power optimization and automatic compilation for
parallel single-address space architectures. He has published over 50 papers in
international journals and conferences in this area and manages the Compiler
and Architecture Design group consisting of 18 members.
Cosimo Antonio Prete is a full professor of computer systems at the Univer-
sity of Pisa, Italy, faculty member of the PhD School in Computer Science and
Engineering (IMT), Italy. He is Coordinator of the Graduate Degree Program
in Computer Engineering and Rector’s Adviser for Innovative Training Tech-
nologies at the University of Pisa. His research interests are focused on multi-
processor architectures, cache memory, performance evaluation and embedded
systems. He is an author of more than 100 papers published in international
journals and conference proceedings. He has been project manager for several
research projects, including: the SPP project, OMI, Esprit IV; the CCO project,
supported by VLSI Technology, Sophia Antipolis; the ChArm project, supported
by VLSI Technology, San Jose, and the Esprit III Tracs project.
André Seznec is “directeur de recherches” at IRISA/INRIA. Since 1994, he
has been the head of the CAPS (Compiler Architecture for Superscalar and
Special-purpose Processors) research team. He has been conducting research
on computer architecture for more than 20 years. His research topics have in-
cluded memory hierarchy, pipeline organization, simultaneous multithreading
and branch prediction. In 1999–2000, he spent a sabbatical year with the Alpha
Group at Compaq.
Olivier Temam obtained a PhD in computer science from the University of
Rennes in 1993. He was assistant professor at the University of Versailles from
1994 to 1999, and then professor at the University of Paris Sud until 2004. Since
then, he has been a senior researcher at INRIA Futurs in Paris, where he heads the
Alchemy group. His research interests include program optimization, processor
architecture, and emerging technologies, with a general emphasis on long-term
research.
Theo Ungerer is Chair of Systems and Networking at the University of Augsburg,
Germany, and Scientific Director of the Computing Center of the University of
Augsburg. He received a Diploma in Mathematics at the Technical University
of Berlin in 1981, a Doctoral Degree at the University of Augsburg in 1986,
and a second Doctoral Degree (Habilitation) at the University of Augsburg in
1992. Before his current position he was scientific assistant at the University of
Augsburg (1982–1989 and 1990–1992), visiting assistant professor at the Uni-
versity of California, Irvine (1989–1990), professor of computer architecture at
the University of Jena (1992–1993) and the Technical University of Karlsruhe
(1993–2001). He is Steering Committee member of HiPEAC and of the German
Science Foundation’s priority programme on “Organic Computing.” His current
research interests are in the areas of embedded processor architectures, embed-
ded real-time systems, organic, bionic and ubiquitous systems.
Mateo Valero obtained his PhD at UPC in 1980. He is a professor in the
Computer Architecture Department at UPC. His research interests focus on
high-performance architectures. He has published approximately 400 papers on
these topics. He is the director of the Barcelona Supercomputing Center, the
National Center of Supercomputing in Spain. Dr. Valero has been honored with
several awards, including the King Jaime I award by the Generalitat Valen-
ciana, and the Spanish national award “Julio Rey Pastor” for his research on IT
technologies. In 2001, he was appointed Fellow of the IEEE, in 2002 Intel Distin-
guished Research Fellow and since 2003 a Fellow of the ACM. Since 1994, he has
been a foundational member of the Royal Spanish Academy of Engineering. In
2005 he was elected Correspondant Academic of the Spanish Royal Academy of
Sciences, and his native town of Alfamén named their public college after him.
Georgi Gaydadjiev is a professor in the computer engineering laboratory of the
Technical University of Delft, The Netherlands. His research interests focus on
many aspects of embedded systems design with an emphasis on reconfigurable
computing. He has published about 50 papers on these topics in international
refereed journals and conferences. He has acted as Program Committee mem-
ber of many conferences and is subject area editor for the Journal of Systems
Architecture.
Table of Contents
Third International Conference on High-Performance
and Embedded Architectures and Compilers
(HiPEAC)
Dynamic Cache Partitioning Based on the MLP of Cache Misses . . . . . . . 3
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and
Mateo Valero
Cache Sensitive Code Arrangement for Virtual Machine. . . . . . . . . . . . . . . 24
Chun-Chieh Lin and Chuen-Liang Chen
Data Layout for Cache Performance on a Multithreaded Architecture . . . 43
Subhradyuti Sarkar and Dean M. Tullsen
Improving Branch Prediction by Considering Affectors and Affectees
Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Yiannakis Sazeides, Andreas Moustakas, Kypros Constantinides, and
Marios Kleanthous
Eighth MEDEA Workshop (Selected Papers)
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Sandro Bartolini, Pierfrancesco Foglia, and Cosimo Antonio Prete
Exploring the Architecture of a Stream Register-Based Snoop Filter . . . . 93
Matthias Blumrich, Valentina Salapura, and Alan Gara
CROB: Implementing a Large Instruction Window through
Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Fernando Latorre, Grigorios Magklis, Jose González,
Pedro Chaparro, and Antonio González
Power-Aware Dynamic Cache Partitioning for CMPs . . . . . . . . . . . . . . . . . 135
Isao Kotera, Kenta Abe, Ryusuke Egawa, Hiroyuki Takizawa, and
Hiroaki Kobayashi
A Multithreaded Multicore System for Embedded Media Processing . . . . 154
Jan Hoogerbrugge and Andrei Terechko
Regular Papers
Parallelization Schemes for Memory Optimization on the Cell
Processor: A Case Study on the Harris Corner Detector . . . . . . . . . . . . . . . 177
Tarik Saidani, Lionel Lacassagne, Joel Falcou, Claude Tadonki, and
Samir Bouaziz
Constructing Application-Specific Memory Hierarchies on FPGAs . . . . . . 201
Harald Devos, Jan Van Campenhout, Ingrid Verbauwhede, and
Dirk Stroobandt
First Workshop on Programmability Issues for
Multi-core Computers (MULTIPROG)
autopin – Automated Optimization of Thread-to-Core Pinning on
Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis
Robust Adaptation to Available Parallelism in Transactional Memory
Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Mohammad Ansari, Mikel Luján, Christos Kotselidis, Kim Jarvis,
Chris Kirkham, and Ian Watson
Efficient Partial Roll-Backing Mechanism for Transactional Memory
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
M.M. Waliullah
Software-Level Instruction-Cache Leakage Reduction Using
Value-Dependence of SRAM Leakage in Nanometer Technologies . . . . . . . 275
Maziar Goudarzi, Tohru Ishihara, and Hamid Noori
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Dynamic Cache Partitioning Based on the MLP
of Cache Misses
Miquel Moreto¹, Francisco J. Cazorla², Alex Ramirez¹,², and Mateo Valero¹,²
¹ Universitat Politècnica de Catalunya, DAC, Barcelona, Spain
  HiPEAC European Network of Excellence
² Barcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain
{mmoreto,aramirez,mateo}@ac.upc.edu, francisco.cazorla@bsc.es
Abstract. Dynamic partitioning of shared caches has been proposed
to improve performance of traditional eviction policies in modern multi-
threaded architectures. All existing Dynamic Cache Partitioning (DCP)
algorithms work on the number of misses caused by each thread and
treat all misses equally. However, it has been shown that cache misses
cause different impact in performance depending on their distribution.
Clustered misses share their miss penalty as they can be served in par-
allel, while isolated misses have a greater impact on performance as the
memory latency is not shared with other misses.
We take this fact into account and propose a new DCP algorithm that
considers misses differently depending on their influence in performance.
Our proposal obtains improvements over traditional eviction policies up
to 63.9% (10.6% on average) and it also outperforms previous DCP pro-
posals by up to 15.4% (4.1% on average) in a four-core architecture. Our
proposal reaches the same performance as a 50% larger shared cache. Fi-
nally, we present a practical implementation of our proposal that requires
less than 8KB of storage.
1 Introduction
The limitation imposed by instruction-level parallelism (ILP) has motivated
the use of thread-level parallelism (TLP) as a common strategy for improv-
ing processor performance. TLP paradigms such as simultaneous multithreading
(SMT) [1,2], chip multiprocessor (CMP) [3] and combinations of both offer the
opportunity to obtain higher throughputs. However, they also have to face the
challenge of sharing resources of the architecture. Simply avoiding any resource
control can lead to undesired situations where one thread is monopolizing all the
resources and harming the other threads. Some studies deal with the resource
sharing problem in SMTs at core level resources like issue queues, registers,
etc. [4]. In CMPs, resource sharing is focused on the cache hierarchy.
Some applications present low reuse of their data and pollute caches with
data streams, such as multimedia, communications or streaming applications,
or have many compulsory misses that cannot be solved by assigning more cache
space to the application. Traditional eviction policies such as Least Recently
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 3–23, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Used (LRU), pseudo LRU or random are demand-driven, that is, they tend
to give more space to the application that has more accesses and misses to
the cache hierarchy [5, 6]. As a consequence, some threads can suffer a severe
degradation in performance. Previous work has tried to solve this problem by
using static and dynamic partitioning algorithms that monitor the L2 cache
accesses and decide a partition for a fixed amount of cycles in order to maximize
throughput [7,8,9] or fairness [10]. Basically, these proposals predict the number
of misses per application for each possible cache partition. Then, they use the
cache partition that leads to the minimum number of misses for the next interval.
A common characteristic of these proposals is that they treat all L2 misses
equally. However, in out-of-order architectures L2 misses affect performance
differently depending on how clustered they are. An isolated L2 miss has
approximately the same miss penalty as a cluster of L2 misses, as they can be served
in parallel if they all fit in the reorder buffer (ROB) [11]. In Figure 1 we can
see this behavior. We have represented an ideal IPC curve that is constant until
an L2 miss occurs. After some cycles, commit stops. When the cache line comes
from main memory, commit ramps up to its steady state value. As a consequence,
an isolated L2 miss has a higher impact on performance than a miss in a burst
of misses as the memory latency is shared by all clustered misses.
Fig. 1. Isolated and clustered L2 misses: (a) an isolated L2 miss; (b) clustered L2 misses
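This intuition can be sketched numerically: if each miss is charged the memory latency divided by the number of misses it overlaps with in time, an isolated miss pays the full penalty while a burst shares it. The following is an illustrative sketch, not the paper's exact cost mechanism; the latency value and the overlap test are assumptions.

```python
MEM_LATENCY = 300  # assumed L2 miss latency in cycles (illustrative)

def mlp_costs(miss_cycles, latency=MEM_LATENCY):
    """miss_cycles: issue cycle of each L2 miss, sorted ascending.
    Two misses overlap if their service windows of `latency` cycles
    intersect; each miss is charged latency / (number of overlapping misses)."""
    costs = []
    for c in miss_cycles:
        overlap = sum(1 for c2 in miss_cycles if abs(c2 - c) < latency)
        costs.append(latency / overlap)
    return costs

# An isolated miss pays the full latency...
print(mlp_costs([1000]))              # [300.0]
# ...while three clustered misses split it.
print(mlp_costs([1000, 1010, 1020]))  # [100.0, 100.0, 100.0]
```

Under this model, minimizing total cost (rather than total misses) favors partitions that avoid creating isolated misses.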
Based on this fact, we propose a new DCP algorithm that gives a cost to each
L2 access according to its impact in final performance. We detect isolated and
clustered misses and assign a higher cost to isolated misses. Then, our algorithm
determines the partition that minimizes the total cost for all threads, which is
used in the next interval. Our results show that differentiating between clustered
and isolated L2 misses leads to cache partitions with higher performance than
previous proposals. The main contributions of this work are the following.
1) A runtime mechanism to dynamically partition shared L2 caches in a CMP
scenario that takes into account the MLP of each L2 access. We obtain improve-
ments over LRU up to 63.9% (10.6% on average) and over previous proposals
up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches
the same performance as a 50% larger shared cache.
2) We extend previous workload classifications to CMP architectures with
more than two cores, which allows results to be analyzed within each workload group.
3) We present a sampling technique that reduces the hardware cost in terms
of storage to less than 1% of the total L2 cache size with an average throughput
degradation of 0.76% (compared to the throughput obtained without sampling).
We also show that scalable algorithms to decide cache partitions give near opti-
mal partitions, 0.59% close to the optimal decision.
The rest of this paper is structured as follows. Section 2 introduces the meth-
ods that have been previously proposed to decide L2 cache partitions and related
work. Next, Section 3 explains our MLP-aware DCP algorithm. Section 4 de-
scribes the experimental environment and in Section 5 we discuss simulation
results. Finally, Section 6 summarizes our results.
2 Prior Work in Dynamic Cache Partitioning
Stack Distance Histogram (SDH). Mattson et al. introduce the concept of
stack distance to study the behavior of storage hierarchies [12]. Common eviction
policies such as LRU have the stack property. Thus, each set in a cache can be
seen as an LRU stack, where lines are sorted by their last access cycle. In that
way, the first line of the LRU stack is the Most Recently Used (MRU) line while
the last line is the LRU line. The position that a line has in the LRU stack
when it is accessed again is defined as the stack distance of the access. As an
example, we can see in Table 1(a) a stream of accesses to the same set with their
corresponding stack distances.
Table 1. Stack Distance Histogram

(a) Stream of accesses to a given cache set:

    # Reference     1  2  3  4  5  6  7  8
    Cache Line      A  B  C  C  A  D  B  D
    Stack Distance  -  -  -  1  3  -  4  2

(b) SDH example:

    Stack Distance  1   2   3   4   >4
    # Accesses      60  20  10  5   5
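The stack distances in Table 1(a) can be reproduced with a small single-set LRU-stack model (an illustrative sketch):

```python
def stack_distances(accesses):
    """Return the stack distance of each access to one cache set
    (None for a first-touch miss); the set is modeled as an LRU stack."""
    stack = []   # lines ordered MRU first
    dists = []
    for line in accesses:
        if line in stack:
            dists.append(stack.index(line) + 1)  # 1-based stack position
            stack.remove(line)
        else:
            dists.append(None)  # not in the stack: compulsory miss
        stack.insert(0, line)   # accessed line becomes the MRU line
    return dists

# The access stream from Table 1(a):
print(stack_distances(list("ABCCADBD")))
# [None, None, None, 1, 3, None, 4, 2]
```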
For a K-way associative cache with LRU replacement algorithm, we need
K + 1 counters to build SDHs, denoted C1, C2, . . . , CK, C>K. On each cache
access, one of the counters is incremented. If it is a cache access to a line in
the i-th position in the LRU stack of the set, Ci is incremented. If it is a cache
miss, the line is not found in the LRU stack and, as a result, we increment
the miss counter C>K. SDHs can be obtained during execution by running the
thread alone in the system [7] or by adding some hardware counters that profile
this information [8, 9]. A characteristic of these histograms is that the number
of cache misses for a smaller cache with the same number of sets can be easily
computed. For example, for a K′-way associative cache, where K′ < K, the new
number of misses can be computed as misses′ = C>K + Σ_{i=K′+1}^{K} Ci.
As an example, in Table 1(b) we show an SDH for a set with 4 ways. Here,
we have 5 cache misses. However, if we reduce the number of ways to 2 (keeping
the number of sets constant), we will experience 20 misses (5 + 5 + 10).
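The formula above is easy to check against the example SDH (a sketch; hit counters are indexed from 1):

```python
def misses_for_ways(sdh_hits, sdh_miss, k_prime):
    """sdh_hits[i-1] holds C_i for a K-way cache; sdh_miss holds C_>K.
    Returns the miss count for a k_prime-way cache with the same sets:
    accesses with stack distance > k_prime become misses."""
    return sdh_miss + sum(sdh_hits[k_prime:])

# The SDH of Table 1(b): C1..C4 = 60, 20, 10, 5 and C_>4 = 5.
print(misses_for_ways([60, 20, 10, 5], 5, 4))  # 5 misses with 4 ways
print(misses_for_ways([60, 20, 10, 5], 5, 2))  # 20 misses with 2 ways
```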
Minimizing Total Misses. Using the SDHs of N applications, we can de-
rive the L2 cache partition that minimizes the total number of misses: this last
number corresponds to the sum of the number of misses of each thread for the
given configuration. The optimal partition in the last period of time is a suitable
candidate to become the future optimal partition. Partitions are decided period-
ically after a fixed amount of cycles. In this scenario, partitions are decided at a
way granularity. This mechanism is used in order to minimize the total number
of misses and try to maximize throughput. A first approach proposed a static
partitioning of the L2 cache using profiling information [7]. Then, a dynamic ap-
proach estimated SDHs with information inside the cache [9]. Finally, Qureshi
et al. presented a suitable and scalable circuit to measure SDHs using sampling
and obtained performance gains with just 0.2% extra space in the L2 cache [8].
Throughout this paper, we will call this last policy MinMisses.
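For two threads, the MinMisses decision reduces to trying every split of the ways and keeping the one with the fewest predicted total misses. A sketch of that search, with made-up SDH values for illustration:

```python
def min_misses_partition(sdh_a, sdh_b, assoc):
    """sdh_x = (hit counters C_1..C_K, miss counter C_>K) per thread.
    Try every split of `assoc` ways between two threads (at least one
    way each) and return (total_misses, ways_a, ways_b) minimizing misses."""
    def misses(sdh, ways):
        hits, miss = sdh
        return miss + sum(hits[ways:])
    best = None
    for ways_a in range(1, assoc):
        total = misses(sdh_a, ways_a) + misses(sdh_b, assoc - ways_a)
        if best is None or total < best[0]:
            best = (total, ways_a, assoc - ways_a)
    return best

# Hypothetical SDHs for an 8-way shared cache: thread A reuses a small
# working set, while thread B streams and rarely hits.
sdh_a = ([90, 40, 20, 10, 5, 2, 1, 1], 10)
sdh_b = ([5, 2, 1, 1, 1, 1, 1, 1], 200)
print(min_misses_partition(sdh_a, sdh_b, 8))  # → (218, 6, 2)
```

With these (invented) histograms, the best split gives the reuse-friendly thread 6 ways and the streaming thread 2.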
Fair Partitioning. In some situations, MinMisses can lead to unfair parti-
tions that assign nearly all the resources to one thread while harming the oth-
ers [10]. For that reason, the authors propose considering fairness when deciding
new partitions. In that way, instead of minimizing the total number of misses,
they try to equalize the statistic Xi = misses_shared_i / misses_alone_i of each
thread i, that is, to force all threads to have the same percentage increase in misses. Partitions
are decided periodically using an iterative method. The thread with largest Xi
receives a way from the thread with smallest Xi until all threads have a similar
value of Xi. Throughout this paper, we will call this policy Fair.
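The iterative method can be sketched as follows (a simplification: the miss predictor, the tolerance, and the stopping rule are assumptions; the real proposal repartitions periodically from measured statistics):

```python
def fair_partition(pred_misses, misses_alone, ways, tol=0.2, max_steps=20):
    """Repeatedly move one way from the thread with the smallest
    X_i = misses_shared_i / misses_alone_i to the one with the largest,
    until the X_i values are within `tol` of each other.
    pred_misses(i, w) predicts thread i's shared-cache misses with w ways."""
    n = len(misses_alone)
    for _ in range(max_steps):
        x = [pred_misses(i, ways[i]) / misses_alone[i] for i in range(n)]
        donor = min(range(n), key=lambda i: x[i])
        receiver = max(range(n), key=lambda i: x[i])
        if x[receiver] - x[donor] < tol or ways[donor] <= 1:
            break
        ways[donor] -= 1
        ways[receiver] += 1
    return ways

# Hypothetical model: thread i's shared-cache misses shrink as 1/ways.
base = [400, 100]   # assumed per-thread miss pressure
alone = [100, 100]  # misses when running alone
print(fair_partition(lambda i, w: base[i] / w, alone, [4, 4]))  # → [6, 2]
```

The suffering thread (larger Xi) keeps receiving ways until the relative miss increases roughly match.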
Table 2. Different Partitioning Proposals

Paper  Partitioning  Objective         Decision          Algorithm      Eviction Policy
[7]    Static        Minimize Misses   Programmer        -              Column Caching
[9]    Dynamic       Minimize Misses   Architecture      Marginal Gain  Augmented LRU
[8]    Dynamic       Maximize Utility  Architecture      Lookahead      Augmented LRU
[10]   Dynamic       Fairness          Architecture      Equalize Xi    Augmented LRU
[13]   Dynamic       Maximize Reuse    Architecture      Reuse          Column Caching
[14]   Dyn./Static   Configurable      Operating System  Configurable   Augmented LRU
Other Related Work. Several papers propose different DCP algorithms in a
multithreaded scenario. In Table 2 we summarize these proposals with their most
significant characteristics. Settle et al. introduce a DCP similar to MinMisses
that decides partitions depending on the average data reuse of each application
[13]. Rafique et al. propose to manage shared caches with a hardware cache
quota enforcement mechanism and an interface between the architecture and
the OS to let the latter decide quotas [14]. Note that this mechanism is
completely orthogonal to our proposal; in fact, the two are compatible, as we
can let the OS decide quotas according to our scheme. Hsu et al. evaluate
different cache policies in a CMP scenario [15]. They show that none of them is
optimal among all benchmarks and that the best cache policy varies depending
on the performance metric being used. Thus, they propose to use a thread-aware
Dynamic Cache Partitioning Based on the MLP of Cache Misses 7
cache resource allocation. In fact, their results reinforce the motivation of our
paper: if we do not consider the impact of each L2 miss in performance, we can
decide suboptimal L2 partitions in terms of throughput.
Cache partitions at a way granularity can be implemented with column caching
[7], which uses a bit mask to mark reserved ways, or by augmenting the LRU
policy with counters that keep track of the number of lines in a set belonging
to a thread [9]. The evicted line is then the LRU line among the thread's own
lines if it has reached its quota, or the LRU line among other threads' lines
otherwise.
In [16] a new eviction policy for private caches was proposed in single-threaded
architectures. This policy gives a weight to each L2 miss according to its MLP
when the block is filled from memory. Eviction is decided using the LRU counters
and this weight. This idea was proposed for a different scenario, as it
focuses on single-threaded architectures.
3 MLP-Aware Dynamic Cache Partitioning
3.1 Algorithm Overview
Algorithm 3.1 shows the necessary steps to dynamically decide cache partitions
according to the MLP of each L2 access. At the beginning of the execution, we
decide an initial partition of the L2 cache. As we have no prior knowledge of
the applications, we evenly distribute ways among cores: each core receives
Associativity / NumberOfCores ways of the shared L2 cache.
Algorithm 3.1. MLP-aware DCP()
Step 1: Establish an initial even partition for each core.
Step 2: Run threads and collect data for the MLP-aware SDHs.
Step 3: Decide new partition.
Step 4: Update MLP-aware SDHs.
Step 5: Go back to Step 2.
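The control loop of Algorithm 3.1 can be sketched as follows (a structural
sketch only; the measurement and decision steps are abstracted behind
callbacks, and all names are ours):

```python
def mlp_aware_dcp(num_cores, assoc, n_periods, run_period, decide_partition,
                  rho=0.5):
    """Skeleton of Algorithm 3.1. run_period(partition, sdhs) models Step 2:
    it runs the threads for one period and accumulates quantified MLP costs
    into the per-core histograms. decide_partition(sdhs) models Step 3."""
    # Step 1: even initial partition (Associativity / NumberOfCores each).
    partition = [assoc // num_cores] * num_cores
    # One MLP-aware SDH per core: K+1 buckets (distances 1..K, plus >K).
    sdhs = [[0.0] * (assoc + 1) for _ in range(num_cores)]
    for _ in range(n_periods):              # Step 5 loops back to Step 2
        run_period(partition, sdhs)         # Step 2: measure MLP costs
        partition = decide_partition(sdhs)  # Step 3: choose next partition
        for h in sdhs:                      # Step 4: age histograms by rho
            for j in range(len(h)):
                h[j] *= rho                 # rho = 0.5 is a 1-bit shift
    return partition
```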
Afterwards, we begin a period where we measure the total MLP cost of each
application. The histogram of each thread containing the total MLP cost for
each possible partition is denoted MLP-aware SDH. For a K-way associative
cache, K+1 registers are needed to store this histogram (one per stack
distance from 1 to K, plus one for distances beyond K). For short periods,
dynamic cache partitioning (DCP) algorithms react more quickly to phase
changes. Our results show that, for periods ranging from 10^5 to 10^8 cycles,
only small performance variations are obtained, with a peak for a period of
5 million cycles.
At the end of each interval, MLP-aware SDHs are analyzed and a new partition
is decided for the next interval. We assume that running threads will have a
similar pattern of L2 accesses in the next measuring period; thus, the optimal
partition for the last period is chosen for the following one. Evaluating
8 M. Moreto et al.
all possible cache partitions gives the optimal partition. This evaluation is
done concurrently by dedicated hardware, which sets the partition for each
process in the next period. Working with slightly stale partition decisions
does not impact the correctness of the running applications and does not
measurably affect performance, as deciding new partitions typically takes a
few thousand cycles and is invoked only once every 5 million cycles.
Since the characteristics of applications change dynamically, MLP-aware SDHs
should reflect these changes. However, we also wish to retain some history of
past MLP-aware SDHs when making new decisions. Thus, after a new partition is
decided, we multiply all the values of the MLP-aware SDHs by ρ ∈ [0, 1].
Large values of ρ give longer reaction times to phase changes, while small
values of ρ adapt quickly to phase changes but tend to forget the behavior of
the application. Small performance variations are obtained for different
values of ρ ranging from 0 to 1, with a peak for ρ = 0.5. Furthermore, this
value is very convenient, as histograms can then be updated with a simple
shifter. Next, a new period of measuring MLP-aware SDHs begins. The key
contribution of this paper is the method to obtain MLP-aware SDHs, which we
explain in the following subsection.
3.2 MLP-Aware Stack Distance Histogram
As previously stated, MinMisses assumes that all L2 accesses are equally
important in terms of performance. However, it has been shown that cache
misses affect the performance of applications differently, even within the
same application [11, 16]. An isolated L2 data miss has a penalty cost that
can be approximated by the average memory latency. In the case of a burst of
L2 data misses that fit in the ROB, the penalty cost is shared among misses,
as L2 misses can be served in parallel. L2 instruction misses, in contrast,
are serialized, as fetch stops. Thus, L2 instruction misses have a constant
miss penalty and MLP.
We want to assign a cost to each L2 access according to its effect on
performance. In [16], a similar idea was used to modify the LRU eviction
policy for single-core, single-threaded architectures. In our situation, we
have a CMP scenario where the shared L2 cache has a number of reserved ways
for each core. At the end of each period, we decide either to continue with
the same partition or to change it. If we decide to modify the partition, a
core i that had wi reserved ways will receive w'i ≠ wi ways. If wi < w'i, the
thread receives more ways and, as a consequence, some misses in the old
configuration will become hits. Conversely, if wi > w'i, the thread receives
fewer ways and some hits in the old configuration will become misses. Thus,
we want an estimation of the performance effects when misses are converted
into hits and vice versa. Throughout this paper, we will call this impact on
performance the MLP cost.
(a) MLP cost of an L2 miss.
(b) Estimated MLP cost when an L2 hit becomes a miss.
Fig. 2. MLP cost of L2 accesses

MLP cost of L2 misses. In order to compute the MLP cost of an L2 miss with
stack distance di, we consider the situation shown in Figure 2(a). If we
force an L2 configuration that assigns exactly w'i = di ways to thread i,
with w'i > wi, some of the L2 misses of this thread will become hits, while
others will remain misses, depending on their stack distance. In order to
track the stack distance and MLP cost of each L2 miss, we have modified the
L2 Miss Status Holding
Registers (MSHR) [17]. This structure is similar to an L2 miss buffer and is used
to hold information about any load that has missed in the L2 cache. The modified
L2 MSHR has one extra field that contains the MLP cost of the miss as can be
seen in Figure 3(b). It is also necessary to store the stack distance of each access
in the MSHR. In Figure 3(a) we show the MSHR in the cache hierarchy.
(a) MSHR. (b) MSHR fields.
Fig. 3. Miss Status Holding Register
When the L2 cache is accessed and an L2 miss is determined, we assign an
MSHR entry to the miss and wait until the data comes from Main Memory. We
initialize the MLP cost field to zero when the entry is assigned. We store the
access stack distance together with the identifier of the owner core. Every
cycle, we obtain N, the number of in-flight L2 accesses with stack distance
greater than or equal to di. A hardware counter tracks this number for each
possible value of di, which means a total of Associativity counters. If N L2
misses are being served in parallel, the miss penalty is shared among them,
so we assign an equal share of 1/N to each miss. The MLP cost is updated
every cycle until the data comes from Main Memory and fills the L2, at which
moment we can free the MSHR entry.
The number of adders required to update the MLP cost of all entries is equal
to the number of MSHR entries. However, this number can be reduced by sharing
several adders between valid MSHR entries in a round-robin fashion: if an
MSHR entry updates its MLP cost only every 4 cycles, it has to add 4/N
instead. In this work, we assume that the MSHR contains only four adders for
updating MLP cost values, which has a negligible effect on the final MLP
cost [16].
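The counting logic can be modeled in software as follows (a toy model, not
the paper's hardware; it applies the per-cycle 1/N update directly rather
than the shared-adder 4/N variant, and all names are ours):

```python
class MSHRModel:
    """Toy model of the modified L2 MSHR. Each in-flight miss accumulates
    1/N per cycle, where N is the number of in-flight accesses whose stack
    distance is greater than or equal to its own."""
    def __init__(self, assoc):
        # In-flight accesses per stack distance (1..K, plus K+1 for >K).
        self.dist_count = [0] * (assoc + 2)
        self.entries = {}            # miss id -> [stack_distance, mlp_cost]

    def allocate(self, miss_id, stack_distance):
        """An L2 miss gets an entry; its MLP cost field starts at zero."""
        self.entries[miss_id] = [stack_distance, 0.0]
        self.dist_count[stack_distance] += 1

    def tick(self):
        """Per-cycle update of every valid entry's MLP cost field."""
        for entry in self.entries.values():
            n = sum(self.dist_count[entry[0]:])  # accesses with dist >= d_i
            entry[1] += 1.0 / n

    def free(self, miss_id):
        """Data returned from Main Memory: release the entry, report cost."""
        d, cost = self.entries.pop(miss_id)
        self.dist_count[d] -= 1
        return cost
```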
MLP cost of L2 hits. Next, we want to estimate the MLP cost of an L2 hit with
stack distance di when it becomes a miss. If we forced an L2 configuration
that assigned exactly w'i = di ways to thread i, with w'i < wi, some of the
L2 hits of this thread would become misses, while L2 misses would remain
misses (see Figure 2(b)). The hits that would become misses are the ones with
stack distance greater than or equal to di. Thus, we count the total number
of accesses with stack distance greater than or equal to di (including both
L2 hits and misses) to estimate the length of the cluster of L2 misses in
this configuration.
Deciding when to free the entry used by an L2 hit is more complex than in the
case of the MSHR. As noted in [11], in a balanced architecture, L2 data
misses can be served in parallel if they all fit in the ROB. Equivalently, we
say that L2 data misses can be served in parallel if they are at a ROB
distance smaller than the ROB size. Thus, we free the entry either when the
number of committed instructions since the access reaches the ROB size, or
when the number of cycles since the hit reaches the average latency to
memory. The first condition is clear, as L2 misses can overlap only if their
ROB distance is less than the ROB size. When the entry is freed, we add the
number of pending cycles divided by the number of misses with stack distance
greater than or equal to di. The second condition is also necessary, as it
can occur that no L2 access is made for a long period of time. To obtain the
average latency to memory, we add specific hardware that counts and averages
the number of cycles that a given entry spends in the MSHR.
We use new hardware to obtain the MLP cost of L2 hits, which we denote Hit
Status Holding Registers (HSHR), as it is similar to the MSHR. However, the
HSHR is private to each core. Each HSHR entry needs an identifier of the ROB
entry of the access, the address accessed by the L2 hit, the stack distance
value and a field with the corresponding MLP cost, as can be seen in
Figure 4(b). In Figure 4(a) we show the HSHR in the cache hierarchy.
When the L2 cache is accessed and an L2 hit is determined, we assign an
HSHR entry to the L2 hit. We initialize the fields of the entry as in the case of
the MSHR. We have a stack distance di and we want to update the MLP cost
field in every cycle. With this objective, we need to know the number of active
entries with stack distance greater or equal to di in the HSHR, which can be
tracked with one hardware counter per core. We also need a ROB entry identifier
for each L2 access.

(a) HSHR. (b) HSHR fields.
Fig. 4. Hit Status Holding Register

Every cycle, we obtain N, the number of L2 accesses with stack distance
greater than or equal to di, as in the L2 MSHR case. A hardware counter
tracks this number for each possible value of di, which means a total of
Associativity counters.
In order to avoid array conflicts, we need as many entries in the HSHR as
there can be L2 accesses in flight. This number is bounded by the L1 MSHR
size: in our scenario, we have 32 L1 MSHR entries, which means a maximum of
32 in-flight L2 accesses per core. However, we have checked that 24 entries
per core are enough to ensure an available slot 95% of the time in an
architecture with a 256-entry ROB. If there are no available slots, we simply
assign the minimum weight to the L2 access, as there are many L2 accesses in
flight. The number of adders required to update the MLP cost of all entries
is equal to the number of HSHR entries. As we did with the MSHR, HSHR entries
can share four adders with a negligible effect on the final MLP cost.
Quantification of MLP cost. Dealing with values of MLP cost between 0 and the
memory latency (or even greater) would represent a significant hardware cost.
Instead, we quantify the MLP cost with an integer value between 0 and 7, as
was done in [16]. For a memory latency of 300 cycles, Table 3 shows how we
quantify the MLP cost: we have split the interval [0, 300] into 7 intervals
of equal length.
Table 3. MLP cost quantification
MLP cost Quantification MLP cost Quantification
From 0 to 42 cycles 0 From 171 to 213 cycles 4
From 43 to 85 cycles 1 From 214 to 256 cycles 5
From 86 to 128 cycles 2 From 257 to 300 cycles 6
From 129 to 170 cycles 3 300 or more cycles 7
Finally, when we have to update the corresponding MLP-aware SDH, we add
the quantified value of MLP cost. Thus, isolated L2 misses will have a weight
of 7, while two overlapped L2 misses will have a weight of 3 in the MLP-aware
SDH. In contrast, MinMisses always adds one to its histograms.
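The quantification of Table 3 can be expressed as a small lookup (the
thresholds are taken from the table; the function name is ours):

```python
def quantize_mlp_cost(cycles):
    """Map an MLP cost in cycles to the 3-bit value of Table 3
    (memory latency of 300 cycles, seven equal-length intervals,
    plus a top bucket for costs of 300 cycles or more)."""
    thresholds = [43, 86, 129, 171, 214, 257, 300]  # lower bound of 1..7
    level = 0
    for t in thresholds:
        if cycles >= t:
            level += 1
    return level
```

For example, an isolated miss accrues the full 300-cycle latency and
quantizes to 7, while each of two fully overlapped misses accrues about 150
cycles and quantizes to 3, matching the weights discussed above.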
3.3 Obtaining Stack Distance Histograms
Normally, L2 caches have two separate parts that store data and address tags
to determine whether an access is a hit. Our prediction mechanism needs to
track every L2 access and store a separate copy of the L2 tag information in
an Auxiliary Tag Directory (ATD), together with the LRU counters [8]. We need
an ATD for each core that keeps track of the L2 accesses for any possible
cache configuration. Independently of the number of ways assigned to each
core, we store the tags and LRU counters of the last K accesses of the
thread, where K is the L2 associativity. As we have explained in Section 2,
an access with stack distance di corresponds to a cache miss in any
configuration that assigns fewer than di ways to the thread. Thus, with this
ATD we can determine whether an L2 access would be a miss or a hit in every
possible cache configuration.
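One set of the ATD can be modeled as an LRU stack (a behavioral sketch; real
hardware keeps tags plus LRU counters rather than an ordered list, and the
class name is ours):

```python
class ATDSet:
    """One set of the Auxiliary Tag Directory: an LRU stack of the last K
    tags. probe() returns the 1-based stack distance of an access; K+1
    denotes an access that misses in every possible configuration. A
    configuration with w assigned ways sees a hit iff distance <= w."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.stack = []                # index 0 = most recently used tag

    def probe(self, tag):
        if tag in self.stack:
            distance = self.stack.index(tag) + 1   # 1-based stack distance
            self.stack.remove(tag)
        else:
            distance = self.assoc + 1              # deeper than the stack
            if len(self.stack) == self.assoc:
                self.stack.pop()                   # evict ATD-LRU tag
        self.stack.insert(0, tag)                  # promote to MRU
        return distance
```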
3.4 Putting All Together
In Figure 5 we can see a sketch of the hardware implementation of our proposal.
When we have an L2 access, the ATD is used to determine its stack distance di.
Depending on whether it is a miss or a hit, either the MSHR or the HSHR is
used to compute the MLP cost of the access. Using the quantification process we
obtain the final MLP cost. This number estimates how performance would be
affected if the application had exactly w'i = di assigned ways. If w'i > wi,
we are estimating the performance benefit of converting this L2 miss into a
hit. If w'i < wi, we are estimating the performance degradation of converting
this L2 hit into a miss. Finally, using the stack distance, the MLP cost and
the core identifier, we can update the corresponding MLP-aware SDH.
Fig. 5. Hardware implementation

We have used two different partitioning algorithms. The first one, which we
denote MLP-DCP (standing for MLP-aware Dynamic Cache Partitioning), decides
the optimal partition according to the MLP cost of each way. Denoting the
total MLP cost of all accesses of thread i with stack distance j as
MLP_SDH_{i,j}, we define the total MLP cost of a thread i that uses wi ways
as

  TMLP(i, wi) = MLP_SDH_{i,>K} + Σ_{j=wi+1..K} MLP_SDH_{i,j}.

Thus, we have to minimize the sum of the total MLP costs of all cores:

  Σ_{i=1..N} TMLP(i, wi),  subject to  Σ_{i=1..N} wi = Associativity.
The second algorithm assigns a weight to each total MLP cost using the IPC of
the application in core i, IPCi. In this situation, we give priority to
threads with higher IPC, which yields better results in throughput at the
cost of being less fair. IPCi is measured at runtime with a hardware counter
per core. We denote this proposal MLPIPC-DCP; it consists in minimizing the
following expression:

  Σ_{i=1..N} IPCi · TMLP(i, wi),  subject to  Σ_{i=1..N} wi = Associativity.
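Both decision algorithms can be sketched as an exhaustive search (our own
software model, not the paper's lookahead hardware; the histogram layout and
helper names are assumptions, and the search is exponential in the number of
cores, a point Section 5.4 returns to):

```python
from itertools import combinations

def compositions(total, n):
    """All ways of splitting `total` ways into n positive parts (ordered)."""
    for cuts in combinations(range(1, total), n - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(bounds[k + 1] - bounds[k] for k in range(n))

def tmlp(sdh, w):
    """TMLP(i, w): with w ways, accesses with stack distance > w miss.
    sdh[j-1] holds the MLP cost of distance-j accesses (j = 1..K);
    sdh[K] is the >K bucket."""
    return sum(sdh[w:])

def best_partition(sdhs, assoc, weights=None):
    """Exhaustive search minimizing the (optionally IPC-weighted) sum of
    total MLP costs: weights=None gives MLP-DCP, weights=per-core IPCs
    gives MLPIPC-DCP."""
    weights = weights or [1.0] * len(sdhs)
    return min(compositions(assoc, len(sdhs)),
               key=lambda part: sum(ipc * tmlp(s, w)
                                    for ipc, s, w in zip(weights, sdhs, part)))
```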
3.5 Case Study
We have seen that SDHs can give the optimal partition in terms of total L2
misses. However, minimizing the total number of L2 misses is not the goal of
DCP algorithms; throughput is. The underlying idea of MinMisses is that, by
minimizing total L2 misses, we also increase throughput. This is intuitive,
as performance is clearly related to the L2 miss rate. However, this
heuristic can lead to inadequate partitions in terms of throughput, as the
following case study shows.
In Figure 6, we show the IPC curves of benchmarks galgel and gzip as the L2
cache size increases at way granularity (each way is 64KB). We also show
throughput for all 15 possible partitions; in this curve, we assign x ways to
gzip and 16−x ways to galgel. The optimal partition assigns 6 ways to gzip
and 10 ways to galgel, obtaining a total throughput of 3.091 instructions per
cycle. However, if we use the MinMisses algorithm to determine the new
partition, we choose 4 ways for gzip and 12 for galgel according to the SDH
values. Figure 6 also shows the total number of misses for each cache
partition as well as the per-thread number of misses.
In this situation, misses in gzip are more important in terms of performance
than misses in galgel. Furthermore, gzip's IPC is larger than galgel's. As a
consequence, MinMisses obtains a non-optimal partition in terms of IPC, and
its throughput is 2.897, which is 6.3% smaller than the optimal one. In fact,
galgel's clusters of L2 misses are, on average, longer than gzip's.
Accordingly, MLP-DCP assigns one extra way to gzip and increases performance
by 3%. With MLPIPC-DCP, we give more importance to gzip, as it has a higher
IPC; as a consequence, we end up assigning yet another way to gzip, reaching
the optimal partition and increasing throughput by an extra 3%.
Fig. 6. Misses and IPC curves for galgel and gzip
4 Experimental Environment
4.1 Simulator Configuration
We target this study to the case of a CMP with two and four cores with their
respective own data and instruction L1 caches and a unified L2 cache shared
among threads as in previous studies [8,9,10]. Each core is single-threaded and
fetches up to 8 instructions each cycle. It has 6 integer (I), 3 floating point (FP),
and 4 load/store functional units and 32-entry I, load/store, and FP instruction
queues. Each thread has a 256-entry ROB and 256 physical registers. We use a
two-level cache hierarchy with 64B lines with separate 16KB, 4-way associative
data and instruction caches, and a unified L2 cache that is shared among all
cores. We have used two different L2 caches, one of size 1MB and 16-way asso-
ciativity, and the second one of size 2MB and 32-way associativity. Latency from
L1 to L2 is 15 cycles, and from L2 to memory 300 cycles. We use a 32B width
bus to access L2 and a multibanked L2 of 16 banks with 3 cycles of access time.
We extended the SMTSim simulator [2] to model CMPs. We collected traces of
the most representative 300-million-instruction segment of each program,
following the SimPoint methodology [18]. We use the FAME simulation
methodology [19] with a Maximum Allowable IPC Variance of 5%. This evaluation
methodology measures the performance of multithreaded processors by
reexecuting all threads in a multithreaded workload until all of them are
fairly represented in the final IPC taken from the workload.
4.2 Workload Classification
In [20] two metrics are used to model the performance of a partitioning algorithm
like MinMisses for pairings of benchmarks in the SPEC CPU 2000 benchmark
suite. Here, we extend this classification for architectures with more cores.
Metric 1. The wP %(B) metric measures the number of ways needed by a
benchmark B to obtain at least a given percentage P% of its maximum IPC
(when it uses all L2 ways).
(a) IPC as we vary the number of assigned
ways of a 1MB 16-way L2 cache.
(b) Average miss penalty of an L2 miss
with a 1MB 16-way L2 cache.
Fig. 7. Benchmark classification
The intuition behind this metric is to classify benchmarks depending on their
cache utilization. Using P = 90%, we can classify benchmarks into three
groups: Low utility (L), Small working set or saturated utility (S), and High
utility (H). L benchmarks have 1 ≤ w90% ≤ K/8, where K is the L2
associativity; they are hardly affected by L2 cache space because nearly all
their L2 accesses are misses. S benchmarks have K/8 < w90% ≤ K/2 and just
need some ways to reach maximum throughput, as they fit in the L2 cache.
Finally, H benchmarks have w90% > K/2 and always improve IPC as the number of
ways given to them is increased. Clear representatives of these three groups
are applu (L), gzip (S) and ammp (H) in Figure 7(a). In Table 4 we give w90%
for all SPEC CPU 2000 benchmarks.
Table 4. The applications used in our evaluation. For each benchmark, we give the two
metrics needed to classify workloads together with IPC for a 1MB 16-way L2 cache.
Bench w90% APTC IPC Bench w90% APTC IPC Bench w90% APTC IPC
ammp 14 23.63 1.27 applu 1 16.83 1.03 apsi 10 21.14 2.17
art 10 46.04 0.52 bzip2 1 1.18 2.62 crafty 4 7.66 1.71
eon 3 7.09 2.31 equake 1 18.6 0.27 facerec 11 10.96 1.16
fma3d 9 15.1 0.11 galgel 15 18.9 1.14 gap 1 2.68 0.96
gcc 3 6.97 1.64 gzip 4 21.5 2.20 lucas 1 7.60 0.35
mcf 1 9.12 0.06 mesa 2 3.98 3.04 mgrid 11 9.52 0.71
parser 11 9.09 0.89 perl 5 3.82 2.68 sixtrack 1 1.34 2.02
swim 1 28.0 0.40 twolf 15 12.0 0.81 vortex 7 9.65 1.35
vpr 14 11.9 0.97 wupw 1 5.99 1.32
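The L/S/H classification can be written directly from the thresholds above (a
sketch; the example values in the tests come from Table 4 for a 16-way L2):

```python
def classify(w90, assoc):
    """Classify a benchmark by cache utility: L if w90% <= K/8,
    S if K/8 < w90% <= K/2, and H otherwise."""
    if w90 <= assoc / 8:
        return 'L'
    if w90 <= assoc / 2:
        return 'S'
    return 'H'
```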
The average miss penalty of an L2 miss for the whole SPEC CPU 2000 benchmark
suite is shown in Figure 7(b). Note that this average miss penalty varies
widely, even within each group of benchmarks, ranging from 30 to 294 cycles.
This figure reinforces the main motivation of the paper, as it proves that
the clustering level of L2 misses changes across applications.
Metric 2. The wLRU(thi) metric measures the number of ways given by LRU to
each thread thi in a workload composed of N threads. It can be computed by
simulating all benchmarks alone and using the frequency of L2 accesses of
each thread [5]. We denote the number of L2 Accesses in a Period of one
Thousand Cycles for thread i as APTCi. In Table 4 we list these values for
each benchmark.

  wLRU(thi) = APTCi / (Σ_{j=1..N} APTCj) · Associativity
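The metric can be computed directly from the APTC values of Table 4 (a
sketch; the function name is ours):

```python
def w_lru(aptc, assoc):
    """Estimate the ways LRU implicitly gives each thread from the threads'
    APTC values (L2 Accesses in a Period of one Thousand Cycles)."""
    total = sum(aptc)
    return [a / total * assoc for a in aptc]
```

For instance, pairing gzip (APTC 21.5) with galgel (APTC 18.9) on a 16-way L2
gives gzip roughly 8.5 ways under LRU.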
Next, we use these two metrics to extend previous classifications [20] for work-
loads with more than two benchmarks.
Case 1. When w90%(thi) ≤ wLRU(thi) for all threads. In this situation, LRU
attains 90% of each benchmark's performance, so intuitively there is very
little room for improvement.
Case 2. When there exist two threads A and B such that
w90%(thA) > wLRU(thA) and w90%(thB) < wLRU(thB). In this situation, LRU is
harming the performance of thread A because it gives more ways than necessary
to thread B. That is, LRU assigns some shared resources to a thread that does
not need them, while the other thread could benefit from those resources.
Case 3. Finally, the third case is obtained when w90%(thi) > wLRU(thi) for
all threads. In this situation, the L2 cache configuration is not big enough
to ensure that all benchmarks reach at least 90% of their peak performance.
In [20] it was observed that pairings belonging to this group show worse
results as the value of |w90%(th1) − w90%(th2)| grows. In this case, one
thread requires much less L2 cache space than the other to attain 90% of its
peak IPC. LRU treats threads equally and manages to satisfy the needs of the
less demanding thread. MinMisses, in contrast, assumes that all misses are
equally important for throughput and tends to give more space to the thread
with the higher L2 cache necessity, harming the less demanding thread. This
problem is inherent to the MinMisses algorithm. We will show in the next
subsections that MLP-aware partitioning policies are able to overcome this
situation.
Table 5. Workloads belonging to each case for a 1MB 16-way and a 2MB 32-way
shared L2 cache

                   1MB 16-way                             2MB 32-way
#cores  Case 1      Case 2         Case 3      Case 1      Case 2         Case 3
2       155 (48%)   135 (41%)      35 (11%)    159 (49%)   146 (45%)      20 (6.2%)
4       624 (4%)    12785 (86%)    1541 (10%)  286 (1.9%)  12914 (86%)    1750 (12%)
6       306 (0.1%)  219790 (95%)   10134 (5%)  57 (0.02%)  212384 (92%)   17789 (7.7%)
8       19 (0%)     1538538 (98%)  23718 (2%)  1 (0%)      1496215 (96%)  66059 (4.2%)
In Table 5 we show the total number of workloads that belong to each case for
different configurations. We generated all possible combinations without
repeating benchmarks; the order of benchmarks is not important. In the case
of a 1MB 16-way L2, we note that Case 2 becomes the dominant case as the
number of cores increases. The same trend is observed for L2 caches with
larger associativity: Table 5 also gives the number of workloads belonging to
each case as the number of cores increases for a 32-way 2MB L2 cache. Note
that with different L2 cache configurations, the values of w90% and APTCi
change for each benchmark. An important conclusion from Table 5 is that, as
we increase the number of cores, more combinations belong to the second case,
which is the one with the most room for improvement.
To evaluate our proposals, we randomly generate 16 workloads belonging to
each group for three different configurations. We denote these configurations 2C
(2 cores and 1MB 16-way L2), 4C-1 (4 cores and 1MB 16-way L2) and 4C-2 (4
cores and 2MB 32-way L2). We have also used a 2MB 32-way L2 cache as future
CMP architectures will continue scaling L2 size and associativity. For example,
the IBM Power5 [21] has a 10-way 1.875MB L2 cache and the Niagara 2 has a
16-way 4MB L2.
4.3 Performance Metrics
As performance metrics we have used the IPC throughput, which corresponds to
the sum of individual IPCs. We also use the harmonic mean of relative IPCs to
measure fairness, which we denote Hmean. We use Hmean instead of weighted
speed up because it has been shown to provide better fairness-throughput bal-
ance than weighted speed up [22].
Average improvements do consider the distribution of workloads among the
three groups. We denote this mean the weighted mean, as we assign a weight to
the speed up of each case according to the distribution of workloads from
Table 5. For example, for the 2C configuration, we compute the weighted mean
improvement as 0.48 · x1 + 0.41 · x2 + 0.11 · x3, where xi is the average
improvement in Case i.
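The weighted mean can be computed from the raw workload counts of Table 5 (a
sketch; the 2C weights 0.48/0.41/0.11 are simply the normalized counts
155/135/35):

```python
def weighted_mean_improvement(case_improvements, case_counts):
    """Weight the average improvement of each case by the fraction of
    workloads that fall into that case."""
    total = sum(case_counts)
    return sum(x * c / total for x, c in zip(case_improvements, case_counts))
```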
5 Evaluation Results
5.1 Performance Results
Throughput. The first experiment compares throughput for different DCP
algorithms, using the LRU policy as the baseline. We simulate MinMisses and
our two proposals with the 48 workloads selected in the previous subsection.
Figure 8(a) shows the average speed up over LRU for these mechanisms.
MLPIPC-DCP systematically obtains the best average results, nearly doubling
the performance benefits of MinMisses over LRU in the four-core
configurations. In configuration 4C-1, MLPIPC-DCP outperforms MinMisses by
4.1%. MLP-DCP always improves on MinMisses but obtains worse results than
MLPIPC-DCP.
All algorithms have similar results in Case 1. This is intuitive, as in this
situation there is little room for improvement. In Case 2, MinMisses obtains
a relevant improvement over LRU in configuration 2C, and MLP-DCP and
MLPIPC-DCP achieve an extra 2.5% and 5% improvement, respectively.

(a) Throughput speed up over LRU. (b) Fairness speed up over LRU.
Fig. 8. Average performance speed ups over LRU

In the other configurations, MLP-DCP and MLPIPC-DCP still outperform
MinMisses by 2.1% and 3.6%, respectively. In Case 3, MinMisses suffers larger
performance degradation as the asymmetry between the necessities of the two
cores increases; as a consequence, it has worse average throughput than LRU.
Assigning an appropriate weight to each L2 access makes it possible to obtain
better results than LRU using MLP-DCP and MLPIPC-DCP.
Fairness. We have used the harmonic mean of relative IPCs [22] to measure
fairness. The relative IPC is computed as IPC_shared / IPC_alone. In
Figure 8(b) we show the average speed up over LRU of the harmonic mean of
relative IPCs. Fair stands for the policy explained in Section 2. We can see
that MLP-DCP improves over both MinMisses and LRU in all situations (except
in Case 3 for two cores). It even obtains better results than Fair in
configurations 2C and 4C-1. MLPIPC-DCP is a variant of the MLP-DCP algorithm
optimized for throughput; as a consequence, it obtains worse results in
fairness than MLP-DCP.
Fig. 9. Average throughput speed up over LRU with a 1MB 16-way L2 cache
Equivalent cache space. DCP algorithms can reach the performance of a larger
L2 cache managed with the LRU eviction policy. Figure 9 shows the performance
evolution as the L2 size is increased from 1MB to 2MB with LRU as the
eviction policy. In this experiment, the workloads correspond to the ones
selected for configuration 4C-1. Figure 9 also shows the average speed up
over LRU of MinMisses, MLP-DCP and MLPIPC-DCP with a 1MB 16-way L2 cache.
MinMisses has the same average performance as a 1.25MB 20-way L2 cache with
LRU, which means that MinMisses provides the performance obtained with a 25%
larger shared cache. MLP-DCP reaches the performance of a 37.5% larger cache.
Finally, MLPIPC-DCP doubles the size gain of MinMisses, reaching the
performance of a 50% larger L2 cache.
5.2 Design Parameters
Figure 10(a) shows the sensitivity of our proposal to the period of partition
decisions. For shorter periods, the partitioning algorithm reacts more
quickly to phase changes. Once again, only small performance variations are
obtained for different periods, although for longer periods throughput tends
to decrease. As can be seen in Figure 10(a), peak performance is obtained
with a period of 5 million cycles.
(a) Average throughput for different pe-
riods for the MLP-DCP algorithm with
the 2C configuration.
(b) Average speed up over LRU for different
ROB sizes with the 4C-1 configuration.
Fig. 10. Sensitivity analysis to different design parameters
Finally, we varied the ROB size from 128 to 512 entries to show the
sensitivity of our proposals to this architectural parameter. Our mechanism
is the only one that is aware of the ROB size: the larger the ROB, the longer
the clusters of L2 misses can grow. Other policies only work with the number
of L2 misses, which does not change if we vary the ROB size. When the ROB
size increases, clusters can contain more misses and, as a consequence, our
mechanism can better differentiate between isolated and clustered misses. As
shown in Figure 10(b), average improvements in the 4C-1 configuration are
slightly higher for a ROB with 512 entries, while MinMisses shows worse
results. MLPIPC-DCP outperforms LRU and MinMisses by 10.4% and 4.3%,
respectively.
5.3 Hardware Cost
We have used the hardware implementation of Figure 5 to estimate the hardware
cost of our proposal. In this subsection, we focus our attention on the
configuration 2C. We suppose a 40-bit physical address space. Each entry in the
ATD needs 29 bits (1 valid bit + 24-bit tag + 4-bit for LRU counter). Each set
has 16 ways, so we have an overhead of 58 Bytes (B) for each set. As we have
1024 sets, we have a total cost of 58KB per core.
The hardware cost corresponding to the extra fields of each entry in the L2
MSHR is 5 bits for the stack distance and 2B for the MLP cost. As we have 32
entries, this gives a total of 84B. Four adders are needed to update the MLP cost
of the active MSHR entries. HSHR entries need 1 valid bit, 8 bits to identify
the ROB entry, 34 bits for the address, 5 bits for the stack distance and 2B for
the MLP cost. In total we need 64 bits per entry. As we have 24 entries in each
HSHR, this gives a total of 192B per core. Four adders per core are needed to
update the MLP cost of the active HSHR entries. Finally, we need 17 counters
of 4B for each MLP-aware SDH, which amounts to 68B per core. In
addition to the storage bits, we also need an adder for incrementing MLP-aware
SDHs and a shifter to halve the hit counters after each partitioning interval.
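The storage figures above follow from simple arithmetic; the sketch below reproduces them using the field widths quoted in the text for the 2C configuration.

```python
# Reproduce the storage-cost arithmetic for the 2C configuration
# (all field widths are the figures quoted in the text).

WAYS, SETS = 16, 1024

# ATD entry: 1 valid bit + 24-bit tag + 4-bit LRU counter = 29 bits
atd_bits_per_set = 29 * WAYS                 # 464 bits = 58 B per set
atd_bytes = atd_bits_per_set // 8 * SETS
print(atd_bytes // 1024, "KB of ATD per core")       # 58 KB

# L2 MSHR extension: 5-bit stack distance + 2 B (16-bit) MLP cost, 32 entries
mshr_bits = (5 + 16) * 32
print(mshr_bits // 8, "B of extra MSHR fields")      # 84 B

# HSHR entry: 1 valid + 8-bit ROB id + 34-bit address + 5-bit stack distance
# + 16-bit MLP cost = 64 bits; 24 entries per HSHR
hshr_bits = (1 + 8 + 34 + 5 + 16) * 24
print(hshr_bits // 8, "B of HSHR per core")          # 192 B

# MLP-aware SDH: 17 counters of 4 B each
print(17 * 4, "B of SDH counters per core")          # 68 B
```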
Fig. 11. Throughput and hardware cost depending on ds in a two-core CMP
Sampled ATD. The main contribution to hardware cost corresponds to the
ATD. Instead of monitoring every cache set, we can decide to track accesses
from a reduced number of sets. This idea was also used in [8] with MinMisses
in a CMP environment. Here, we use it in a different situation, namely to estimate
MLP-aware SDHs with a sampled number of sets. We define a sampling distance
ds that gives the distance between tracked sets. For example, if ds = 1, we are
tracking all the sets. If ds = 2, we track half of the sets, and so on. Sampling
reduces the size of the ATD at the expense of less accurate MLP-aware
SDH predictions, as some accesses are not tracked. Figure 11 shows throughput
degradation in a two-core scenario as ds increases. This curve is measured
on the left y-axis. We also show the storage overhead as a percentage of the total
L2 cache size, measured on the right y-axis. Thanks to the sampling technique,
storage overhead drastically decreases. Thus, with a sampling distance of 16
we obtain an average throughput degradation of 0.76% and a storage overhead of
0.77% of the L2 cache size, which is less than 8KB of storage. We consider this
an attractive design point.
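The storage saving can be sketched as follows. The per-set ATD state and set count come from the text; the two-core duplication and the 1 MB L2 size are illustrative assumptions, since the exact percentage depends on the L2 size, which is not restated here.

```python
# Sketch: ATD storage as a function of the sampling distance ds.
# Assumed parameters: 1024 sets, 58 B of ATD state per tracked set,
# one ATD per core (2 cores), and a 1 MB L2 -- illustrative values only.

def atd_overhead(ds, sets=1024, bytes_per_set=58, cores=2, l2_bytes=1 << 20):
    tracked = sets // ds                      # track one of every ds sets
    overhead = cores * tracked * bytes_per_set
    return overhead, 100.0 * overhead / l2_bytes

for ds in (1, 2, 4, 8, 16):
    b, pct = atd_overhead(ds)
    print(f"ds={ds:2d}: {b:6d} B ({pct:.2f}% of L2)")
```

With ds = 16, the overhead under these assumptions falls below 8 KB, in line with the figure quoted above.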
Dynamic Cache Partitioning Based on the MLP of Cache Misses 21
5.4 Scalable Algorithm to Decide Cache Partitions
Evaluating all possible combinations allows determining the optimal partition
for the next period. However, this algorithm does not scale adequately when the
associativity and the number of applications sharing the cache are raised. If we
have a K-way set-associative L2 cache shared by N cores, the number of possible
partitions without considering the order is the binomial coefficient
C(N+K−1, K) = (N+K−1)! / (K! (N−1)!). For example, for 8 cores
and 16 ways, we have 245157 possible combinations. Consequently, the time to
decide new cache partitions does not scale. Several heuristics have been proposed
to reduce the number of cycles required to decide the new partition [8,9], which
can be used in our situation. These proposals bound the length of the decision
period by 10000 cycles. This overhead is very low compared to the 5-million-cycle
partitioning period (less than 0.2%).
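The combinatorial blow-up is easy to check directly (a sketch using Python's binomial coefficient):

```python
from math import comb

# Number of ways to divide K cache ways among N cores, per the formula above:
# C(N + K - 1, K).
def num_partitions(n_cores, k_ways):
    return comb(n_cores + k_ways - 1, k_ways)

print(num_partitions(2, 16))   # 2 cores, 16 ways: only 17 candidates
print(num_partitions(8, 16))   # 8 cores, 16 ways: 245157 -- exhaustive search no longer scales
```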
Fig. 12. Average throughput speed up over LRU for different decision algorithms in
the 4C-1 configuration
Figure 12 shows the average speed up of MLP-DCP over LRU with the 4C-1
configuration for three different decision algorithms. Evaluating all possible par-
titions (denoted EvalAll) gives the highest speed up. The first greedy algorithm
(denoted Marginal Gains) assigns one way to a thread in each iteration [9]. The
selected way is the one that gives the largest increase in MLP cost. This process
is repeated until all ways have been assigned. The number of operations (com-
parisons) is of order K · N, where K is the associativity of the L2 cache and N
the number of cores. With this heuristic, an average throughput degradation of
0.59% is obtained. The second greedy algorithm (denoted Look Ahead) is similar
to Marginal Gains. The basic difference between them is that Look Ahead con-
siders the total MLP cost for all possible numbers of blocks that the application
can receive [8] and can assign more than one way in each iteration. The number
of operations (add-divide-compare) is of order N·K²/2, where K is the associativ-
ity of the L2 cache and N the number of cores. With this heuristic, an average
throughput degradation of 1.04% is obtained.
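As an illustration, the Marginal Gains loop can be sketched as below. The `benefit` table is a hypothetical stand-in for the per-core utility that would be read out of the MLP-aware SDHs, not the paper's actual data.

```python
# Sketch of the "Marginal Gains" heuristic [9]: in each of K iterations, give one
# way to the core whose benefit from one more way is currently the largest.
# K iterations x N cores scanned = O(K * N) comparisons in total.

def marginal_gains(benefit, n_cores, k_ways):
    """benefit[c][w] is the (hypothetical) gain for core c from its (w+1)-th way."""
    ways = [0] * n_cores
    for _ in range(k_ways):                       # K iterations...
        best = max(range(n_cores),                # ...each scanning N cores
                   key=lambda c: benefit[c][ways[c]])
        ways[best] += 1
    return ways

# Toy input: core 0 benefits steeply from its first two ways, core 1 is flat.
benefit = [[9, 8, 1, 1, 1, 1, 1, 1], [3, 3, 3, 3, 3, 3, 3, 3]]
print(marginal_gains(benefit, n_cores=2, k_ways=8))   # -> [2, 6]
```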
6 Conclusions
In this paper we propose a new DCP algorithm that assigns a cost to each
L2 access according to its impact on final performance: isolated misses receive
higher costs than clustered misses. Next, our algorithm decides the L2 cache
partition that minimizes the total cost for all running threads. Furthermore, we
have classified workloads for multiple cores into three groups and shown that
the dominant situation is precisely the one that offers room for improvement.
We show that our proposal reaches high throughput for two- and four-core
architectures. In all evaluated configurations, our proposal consistently outper-
forms both LRU and MinMisses, reaching speed ups of up to 63.9% (10.6% on
average) and 15.4% (4.1% on average), respectively. With our proposals, we reach
the performance of a 50% larger cache. Finally, we used a sampling technique to
propose a practical implementation with a storage cost of less than 1% of the
total L2 cache size, and a scalable algorithm to determine cache partitions with
nearly no performance degradation.
Acknowledgments
This work is supported by the Ministry of Education and Science of Spain under
contracts TIN2004-07739 and TIN2007-60625 and grant AP-2005-3318, and by
the SARC European Project. The authors would like to thank C. Acosta, A. Falcon,
D. Ortega, J. Vermoulen and O. J. Santana for their work in the simulation tool.
We also thank F. Cabarcas, I. Gelado, A. Rico and C. Villavieja for comments
on earlier drafts of this paper and the reviewers for their helpful comments.
References
1. Serrano, M.J., Wood, R., Nemirovsky, M.: A study on multistreamed superscalar
processors, Technical Report 93-05, University of California Santa Barbara (1993)
2. Tullsen, D.M., Eggers, S.J., Levy, H.M.: Simultaneous multithreading: maximizing
on-chip parallelism. In: ISCA (1995)
3. Hammond, L., Nayfeh, B.A., Olukotun, K.: A single-chip multiprocessor. Com-
puter 30(9), 79–85 (1997)
4. Cazorla, F.J., Ramirez, A., Valero, M., Fernandez, E.: Dynamically controlled
resource allocation in SMT processors. In: MICRO (2004)
5. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting inter-thread cache contention
on a chip multi-processor architecture. In: HPCA (2005)
6. Petoumenos, P., Keramidas, G., Zeffer, H., Kaxiras, S., Hagersten, E.: Modeling
cache sharing on chip multiprocessor architectures. In: IISWC, pp. 160–171 (2006)
7. Chiou, D., Jain, P., Devadas, S., Rudolph, L.: Dynamic cache partitioning via
columnization. In: Design Automation Conference (2000)
8. Qureshi, M.K., Patt, Y.N.: Utility-based cache partitioning: A low-overhead, high-
performance, runtime mechanism to partition shared caches. In: MICRO (2006)
9. Suh, G.E., Devadas, S., Rudolph, L.: A new memory monitoring scheme for
memory-aware scheduling and partitioning. In: HPCA (2002)
10. Kim, S., Chandra, D., Solihin, Y.: Fair cache sharing and partitioning in a chip
multiprocessor architecture. In: PACT (2004)
11. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: ISCA
(2004)
12. Mattson, R.L., Gecsei, J., Slutz, D.R., Traiger, I.L.: Evaluation techniques for
storage hierarchies. IBM Systems Journal 9(2), 78–117 (1970)
13. Settle, A., Connors, D., Gibert, E., Gonzalez, A.: A dynamically reconfigurable
cache for multithreaded processors. Journal of Embedded Computing 1(3-4) (2005)
14. Rafique, N., Lim, W.T., Thottethodi, M.: Architectural support for operating
system-driven CMP cache management. In: PACT (2006)
15. Hsu, L.R., Reinhardt, S.K., Iyer, R., Makineni, S.: Communist, utilitarian, and
capitalist cache policies on CMPs: caches as a shared resource. In: PACT (2006)
16. Qureshi, M.K., Lynch, D.N., Mutlu, O., Patt, Y.N.: A case for MLP-aware cache
replacement. In: ISCA (2006)
17. Kroft, D.: Lockup-free instruction fetch/prefetch cache organization. In: ISCA
(1981)
18. Sherwood, T., Perelman, E., Hamerly, G., Sair, S., Calder, B.: Discovering and
exploiting program phases. IEEE Micro (2003)
19. Vera, J., Cazorla, F.J., Pajuelo, A., Santana, O.J., Fernandez, E., Valero, M.:
FAME: Fairly measuring multithreaded architectures. In: PACT (2007)
20. Moreto, M., Cazorla, F.J., Ramirez, A., Valero, M.: Explaining dynamic cache
partitioning speed ups. IEEE CAL (2007)
21. Sinharoy, B., Kalla, R.N., Tendler, J.M., Eickemeyer, R.J., Joyner, J.B.: Power5
system microarchitecture. IBM J. Res. Dev. 49(4/5), 505–521 (2005)
22. Luo, K., Gummaraju, J., Franklin, M.: Balancing throughput and fairness in SMT
processors. In: ISPASS (2001)
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 24–42, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Cache Sensitive Code Arrangement for
Virtual Machine*
Chun-Chieh Lin and Chuen-Liang Chen
Department of Computer Science and Information Engineering,
National Taiwan University, Taipei,
10764, Taiwan
{d93020,clchen}@csie.ntu.edu.tw
Abstract. This paper proposes a systematic approach to optimizing the code layout
of a Java ME virtual machine for an embedded system with a cache-sensitive
architecture. A practical example is running the JVM directly (execute-in-place)
from NAND flash memory, for which the cache miss penalty is too high to endure. The
refined virtual machine generated 96% fewer cache misses than the original
version. We developed a mathematical approach that helps to predict the flow of
the interpreter inside the virtual machine. This approach analyzes both the static
control flow graph and the patterns in bytecode instruction streams, since we
found that the input sequence drives the program flow of the virtual machine
interpreter. We then proposed a rule to model the execution flows of Java
instructions in real applications. Furthermore, we applied a graph partition
algorithm to this mathematical model, which guided
the relocation process in moving program blocks to proper memory pages. The
refinement approach dramatically improved the locality of the virtual machine
and thus reduced cache miss rates. Our technique can help Java ME-enabled devices
run faster and extend battery life. The approach also brings potential for
designers to integrate the XIP function into a System-on-Chip, thanks to the lower
demand for cache memory.
Keywords: cache sensitive, cache miss, NAND flash memory, code arrange-
ment, Java virtual machine, interpreter, embedded system.
1 Introduction
The Java platform exists extensively in all kinds of embedded and mobile devices. The
Java™ Platform, Micro Edition (Java ME) [1] is without doubt the de facto standard platform
of smart phones. The Java virtual machine (the KVM in Java ME) is a key component
that affects performance and power consumption.
NAND flash memory comes with a serial bus interface. It does not allow random
access: the CPU must read out a whole page at a time, which is a slow operation
compared to RAM. This property makes it hard for a processor to execute programs stored
* We acknowledge the support for this study through grants from the National Science
Council of Taiwan (NSC 95-2221-E-002-137).
in NAND flash memory using the "execute-in-place" (XIP) technique. Meanwhile,
NAND flash memory offers fast write access times and, most important of all,
the technology has advantages in offering higher capacity than NOR flash technology
does. As the applications of embedded devices become large and complicated, more
mainstream devices adopt NAND flash memory to replace NOR flash memory.
In this paper, we tried to offer an answer to the question: can we speed up an em-
bedded device that uses NAND flash memory to store programs? "Page-based" storage
media, like NAND flash memory, have a higher access penalty than RAM does, so re-
ducing page misses becomes a critical issue. Thus, we set forth to find a way to reduce
the page miss rate generated by the KVM. Due to the unique structure of the KVM
interpreter, we found a special way to exploit the dynamic locality of the KVM, which is to
trace the patterns of executed bytecode instructions instead of the internal flow of the
KVM. It turned out to be a combinatorial optimization problem, because the code layout
must fulfill certain code size constraints. Our approach achieved the effect of static
page preloading by properly arranging program blocks. In the experiment, we imple-
mented a post-processing program to modify the intermediate files generated by the C
compiler. The post-processing program refined the machine code placement of the KVM
based on the mathematical model. Finally, the tuned KVMs dramatically
reduced page accesses to NAND flash memories. The outcome of this study helps
embedded systems to boost performance and extend battery life as well.
2 Related Works
Park et al., in [2], proposed a hardware module to allow direct code execution from
NAND flash memory. In this approach, program code stored in NAND flash pages
is loaded into a RAM cache on demand instead of moving the entire contents into
RAM. Their work is a universal hardware-based solution that does not consider appli-
cation-specific characteristics.
Samsung Electronics offers a commercial product called "OneNAND" [3] based on
the same idea. It is a single chip with a standard NOR flash interface that actually
contains a NAND flash memory array for storage. The vendor's intent was to provide a
cost-effective alternative to the NOR flash memory used in existing designs. The internal
structure of OneNAND comprises a NAND flash memory, control logic, hardware
ECC, and 5KB of buffer RAM. The buffer RAM is comprised of three buffers: 1KB
for boot RAM, and a pair of 2KB buffers used as bi-directional data buffers. Our
approach is suitable for systems using this type of flash memory.
Park et al., in [4], proposed yet another pure software approach to achieve exe-
cute-in-place by using a customized compiler that inserts NAND flash read opera-
tions into program code at proper places. Their compiler determines insertion points by
summing up the sizes of basic blocks along the calling tree. Special hardware is no longer
required, but in contrast to the earlier work [2], there is still a need for a tailor-made compiler.
Typical studies of refining code placement to minimize cache misses can apply to a
NAND flash cache system. Parameswaran et al., in [5], used a bin-packing approach.
It reorders program code by examining the execution frequency of basic blocks.
Code segments with higher execution frequencies are placed next to each other within
the cache. Janapsatya et al., in [6], proposed a pure software heuristic approach to
reduce the number of cache misses by relocating program sections in main memory.
Their approach was to analyze the program flow graph, then identify and pack basic blocks
within the same loop. They also established relations between cache misses and energy
consumption. Although their approach can identify loops within a program, breaking
the interpreter of a virtual machine into individual circuits is hard because all the loops
share the same starting point.
There has been research on improving program locality and optimizing code placement
for either cache or virtual memory environments. Pettis [7] proposed a systematic
approach that uses a dynamic call graph to position procedures. They tried to place two
procedures as close as possible if one procedure calls the other frequently. The
first step of Pettis' approach uses the profiling information to create a weighted call
graph. The second step iteratively merges the vertices connected by the heaviest-weight edges.
The process repeats until the whole graph is composed of one or more individual vertices
without edges.
However, how to collect profiling information, and its accuracy, is yet
another issue. For example, Young and Smith in [8] developed techniques to extract
effective branch profile information from a limited depth of branch history. Ball and
Larus in [9] described an algorithm for inserting monitoring code to trace programs.
Our approach is very different in nature: previous studies all focused on the flow of
program code, but we tried to model the profile by its input data.
This research project created a post-processor to optimize the code arrangement. It
is analogous to the "Diablo linker" [10], which utilized symbolic information in the object
files to generate optimized executable files. However, our approach generates
feedback intermediate files for the compiler and invokes the compiler to generate
optimized machine code.
3 Background
3.1 XIP with NAND Flash
NOR flash memory is popular as code memory because of the XIP feature. Several
approaches have been designed for using NAND flash memory as an alternative to NOR
flash memory. Because the NAND flash memory interface cannot connect to the CPU host
bus, there has to be a memory interface controller to move data from NAND flash
memory to RAM.
Fig. 1. Access NAND flash through shadow RAM
From a system-level view, Figure 1 shows a straightforward design that uses RAM as
a shadow copy of the NAND flash. The system treats NAND flash memory as a secondary
storage device [11]. A boot loader or RTOS resides in ROM or NOR
flash memory; it copies program code from NAND flash to RAM, and the processor then
executes the code in RAM [12]. This approach offers the best execution speed
because the processor operates with RAM. The downside of this approach is that it needs a
huge amount of RAM to mirror the NAND flash. In embedded devices, RAM is a precious
resource. For example, the Sony Ericsson T610 mobile phone [13] reserves 256KB of
RAM for the Java heap. Given the choice between that and using 256MB to mirror NAND flash memory, all
designers should agree that they would prefer to retain RAM for Java applets rather
than for mirroring. The second pitfall is that this implementation takes longer to boot,
because the system must copy the contents to RAM prior to execution.
Figure 2 shows a demand paging approach that uses a limited amount of RAM as a cache
of the NAND flash. The "romized" program code stays in NAND flash memory, and an
MMU loads from NAND into the cache only the portions of program code that are about
to be executed. The major advantage of this approach is that it consumes less RAM: several
kilobytes of RAM are enough to mirror the NAND flash memory. Using less RAM means that
integrating the CPU, MMU and cache into a single chip (the shadowed part in Figure 2) is
easier. The startup latency is shorter, since the CPU is ready to run soon after the first
NAND flash page is loaded into the cache. The component cost is lower than in the
previous approach. The realization of the MMU might be a hardware or a software
approach, which is not covered in this paper.
Fig. 2. Using a cache unit to access NAND flash
However, performance is the major drawback of this approach. The penalty of each
cache miss is high, because loading contents from a NAND flash page is nearly 200
times slower than the same operation with RAM. Therefore, reducing cache
misses becomes a critical issue for such configurations.
3.2 KVM Internals
Source Level. In respect of functionality, the KVM can be broken down into several
parts: startup, class file loading, constant pool resolving, the interpreter, garbage collection,
and KVM cleanup. Lafond et al., in [14], measured the energy consumption of
each part of the KVM. Their study showed that the interpreter consumed more than
50% of the total energy. In our experiments running the Embedded Caffeine Benchmark [15],
the interpreter contributed 96% of all memory accesses. This evidence leads to the
conclusion that the interpreter is the performance bottleneck of the KVM, and it
motivated us to focus on reducing the cache misses generated by the interpreter.
Figure 3 shows the program structure of the interpreter. It is a loop enclosing a large
switch-case dispatcher. The loop fetches bytecode instructions from the Java application,
and each "case" sub-clause deals with one bytecode instruction. The control flow graph
of the interpreter, as illustrated in Figure 4, is a flat and shallow spanning tree. There are
three major steps in the interpreter:
ReschedulePoint:
    RESCHEDULE
    opcode = FETCH_BYTECODE(ProgramCounter);
    switch (opcode)
    {
        case ALOAD: /* do something */
            goto ReschedulePoint;
        case IADD: /* do something */
        …
        case IFEQ: /* do something */
            goto BranchPoint;
        …
    }
BranchPoint:
    take care of the program counter;
    goto ReschedulePoint;
Fig. 3. Pseudo code of KVM interpreter
Fig. 4. Control flow graph of the interpreter
(1) Rescheduling and Fetching. In this step, the KVM prepares the execution context and
the stack frame. Then it fetches a bytecode instruction from the Java program.
(2) Dispatching and Execution. After reading a bytecode instruction from the Java pro-
gram, the interpreter jumps to the corresponding bytecode handler through the big
"switch…case…" statement. Each bytecode handler carries out the function of the
corresponding bytecode instruction.
(3) Branching. Branch bytecode instructions may bring the Java program flow
away from its original track. In this step, the interpreter resolves the target address and
modifies the program counter.
Fig. 5. The organization of the interpreter at assembly level
Assembly Level. Our analysis of the source files revealed the peculiar program
structure of the VM interpreter. Analyzing the code layout in the compiled executables
of the interpreter helped this study create a code placement strategy. The assembly
code analysis in this study is restricted to ARM and gcc for the sake of demonstration,
but applying our theory to other platforms and tools is easy. Figure 5 illustrates
the layout of the interpreter in assembly form (FastInterpret() in interp.c). The first
trunk, BytecodeFetching, is the code block for rescheduling and fetching; it is exactly the
first part of the original source code. The second trunk, LookupTable, is a large lookup
table used in dispatching bytecode instructions. Each entry links to a bytecode handler.
It is actually the translated result of the "switch…case…case" statement.
The third trunk, BytecodeDispatch, is the aggregation of more than a hundred byte-
code handlers. Most bytecode handlers are self-contained, meaning that a bytecode
handler occupies a contiguous memory space in this trunk and does not jump to
program code stored in other trunks. There are only a few exceptions that call
functions stored in other trunks, such as "invokevirtual." Besides, several
constant symbol tables are spread over this trunk; these tables are referenced by the pro-
gram code within the BytecodeDispatch trunk.
The last trunk, ExceptionHandling, contains code fragments for exception handling.
Each trunk occupies a number of NAND flash pages. In fact, the total size of Byteco-
deFetching and LookupTable is about 1200 bytes (compiled with arm-elf-gcc-3.4.3),
which is almost small enough to fit into two or three 512-byte pages. Figure 6 shows
the size distribution of bytecode handlers. The average size of a bytecode handler is 131
bytes, and 79 handlers are smaller than 56 bytes. In other words, a 512-byte page
can gather 4 to 8 bytecode handlers. The inter-handler execution flow dominates the
number of cache misses generated by the interpreter. This is why our ap-
proach rearranges the bytecode handlers within the BytecodeDispatch trunk.
Fig. 6. Distribution of Bytecode Handler Size (compiled with gcc-3.4.3)
4 Analyzing Control Flow
4.1 Indirect Control Flow Graph
Static branch prediction and typical code placement approaches derive the layout of a
program from its control flow graph (CFG). However, the CFG of a VM interpreter is a
special case: it is a flat spanning tree enclosed by a loop. The CFG does not
provide sufficient information to distinguish the temporal relations of each pair of
bytecode handlers. If someone wants to improve program locality by observing the dy-
namic execution order of program blocks, the CFG is apparently not a good tool to this
end. Therefore, we propose a concept called the "Indirect Control Flow Graph" (ICFG); it
uses the real bytecode instruction sequences to construct the dual CFG of the interpreter.
Consider a simplified virtual machine with 5 bytecode instructions: A, B, C, D, and E,
and use the virtual machine to run a very simple user applet. Consider the following
short alphabetic sequence as the instruction sequence of the user applet:
A-B-A-B-C-D-E-C
Each letter in the sequence represents a bytecode instruction. In Figure 7, the graph
connected with the solid lines is the CFG of the simplified interpreter. Following the
flow in the CFG, the program flow becomes:
Fig. 7. The CFG of the simplified interpreter
It is hard to tell the relation between handler A and handler B because the loop
header hides it. In other words, this CFG cannot easily show which handler will be
invoked after handler A is executed. The idea of the ICFG is to observe the patterns of
the bytecode sequences executed by the virtual machine, not to analyze the structure of
the virtual machine itself. Figure 8 expresses the ICFG in a readable way; it happens to
be the sub-graph connected by the dashed directed lines in Figure 7.
Fig. 8. An ICFG example. The number inside the circle represents the size of the handler
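For the toy applet above, constructing the ICFG edge weights amounts to counting adjacent pairs in the observed bytecode trace (a minimal sketch):

```python
from collections import Counter

# Build directed ICFG edge weights from an observed bytecode trace:
# edges[(i, j)] counts how often instruction j executes right after instruction i.

def icfg_edges(trace):
    return Counter(zip(trace, trace[1:]))

edges = icfg_edges("ABABCDEC")
for (a, b), w in sorted(edges.items()):
    print(f"{a}->{b}: {w}")
```

For the sequence A-B-A-B-C-D-E-C this yields A→B with weight 2 and the remaining transitions with weight 1.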
4.2 Tracing the Locality of the Interpreter
As stated, the Java applications that a KVM runs dominate the temporal locality of the
interpreter. Precisely speaking, the incoming Java instruction sequence dominates the
temporal locality of the KVM. Therefore, the first step in exploiting the temporal locality
is to consider the bytecode sequences executed by the virtual machine. Consider the
previous example sequence; the order of accessed NAND flash pages is supposed to be:
[BytecodeFetching]–[LookupTable]–[A]–[BytecodeFetching]–[LookupTable]–
[B]–[BytecodeFetching]–[LookupTable]–[A]…
Obviously, memory pages containing BytecodeFetching and LookupTable appear much
more often in the sequence than those containing BytecodeDispatch. As a result,
pages containing BytecodeFetching and LookupTable are favored to last in the cache,
while pages holding bytecode handlers have to compete with each other to stay in the cache.
Thus, we deduced that the order of executed bytecode instructions is the key factor
impacting cache misses.
Consider an extreme case: in a system with three cache blocks, two cache blocks
always hold the memory pages containing BytecodeFetching and LookupTable, for the
reason stated above. Therefore, only one cache block is available for swapping pages
containing bytecode handlers. If all the bytecode handlers were located in distinct
memory pages, processing a bytecode instruction would always cause a cache miss,
because the next-to-execute bytecode handler would always be located in an uncached memory
page. In other words, the sample sequence causes at least eight cache misses. Never-
theless, if the handlers of A and B are grouped into the same page, the cache misses
decline to five, and the page access trace becomes:
fault-A-B-A-B-fault-C-fault-D-fault-E-fault-C
If we extend the group (A, B) to include the handler of C, the cache miss count
drops to four, and the page access trace looks like the following one:
fault-A-B-A-B-C-fault-D-fault-E-fault-C
Therefore, the core issue of this study is to find an efficient code layout method parti-
tioning all bytecode instructions into disjoint sets based on their execution relevance.
Each NAND flash page contains one set of bytecode handlers. We propose that partitioning
the ICFG reaches this goal.
Back to Figure 8: the directed edges represent the temporal order of the instruction
sequence. The weight of an edge is the transition count for transitions from one bytecode
instruction to the next. If we remove the edge (B, C), the ICFG is divided into two
disjoint sets. That is, the bytecode handlers of A and B are placed in one page, and the
bytecode handlers of C, D, and E are placed in the other. The page access trace becomes:
fault-A-B-A-B-fault-C-D-E-C
This placement causes only two cache misses, which is 75% lower than the worst case!
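The miss counts quoted in these examples can be reproduced with a one-page cache simulation (a sketch; it assumes, as in the extreme case above, that a single cache block is available for handler pages):

```python
# Count page faults for a bytecode trace under a given handler-to-page grouping,
# assuming exactly one cache block is free for handler pages.

def count_faults(trace, pages):
    page_of = {h: i for i, page in enumerate(pages) for h in page}
    cached, faults = None, 0
    for h in trace:
        if page_of[h] != cached:          # needed handler not in the cached page
            faults += 1
            cached = page_of[h]
    return faults

trace = "ABABCDEC"
print(count_faults(trace, ["A", "B", "C", "D", "E"]))  # one handler per page: 8
print(count_faults(trace, ["AB", "C", "D", "E"]))      # group (A, B): 5
print(count_faults(trace, ["ABC", "D", "E"]))          # group (A, B, C): 4
print(count_faults(trace, ["AB", "CDE"]))              # cut edge (B, C): 2
```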
The next step is to transform the ICFG into an undirected graph by merging the
reversed edges connecting the same vertices; the weight of an undirected edge is the
sum of the weights of the two directed edges. The result is actually a variation of the
classical MIN k-CUT problem. Formally speaking, we can model a given graph
G(V, E) as follows:
- Vi – the i-th bytecode instruction.
- Ei,j – the edge connecting the i-th and j-th bytecode instructions.
- Fi,j – the number of times that bytecode instructions i and j are executed after each
  other; it is the weight of edge Ei,j.
- K – the number of expected partitions.
- Wx,y – the inter-set weight: for all x ≠ y, Wx,y = ΣFi,j where Vi ∈ Px and Vj ∈ Py.
The goal is to model the problem as the following definition:
Definition 1. The MIN k-CUT problem is to divide G into K disjoint partitions {P1,
P2, …, Pk} such that ΣWi,j is minimized.
4.3 The Mathematical Model
Yet there is an additional constraint in our model. It is impractical to gather bytecode
instructions into a partition regardless of the total program size of the constituent byte-
code handlers. The size of each bytecode handler is distinct, and the code size of a
partition cannot exceed the size of a memory page (e.g., a NAND flash page). Our aim is
to distribute the bytecode handlers into several disjoint partitions {P1, P2, …, Pk}. We
define the following notations:
- Si – the code size of bytecode handler Vi.
- N – the size of a memory page.
- M(Pk) – the size of partition Pk; it is ΣSm for all Vm ∈ Pk.
- H(Pk) – the value of partition Pk; it is ΣFi,j for all Vi, Vj ∈ Pk.
Our goal is to construct partitions that satisfy the following constraints.
Definition 2. The problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such
that M(Pk) ≤ N for each Pk, the total inter-partition weight ΣWi,j is minimized, and
ΣH(Pi) is maximized over all Pi ∈ {P1, P2, …, Pk}.
This rectified model is exactly an application of the graph partition problem, i.e., the
size of each partition must satisfy the constraint (the size of a memory page), and the sum
of inter-partition path weights is minimal. The graph partition problem is NP-complete
[16]. However, the purpose of this paper is neither to create a new graph partition
algorithm nor to discuss the differences between existing algorithms. The experimental
implementation simply adopted the following algorithm to demonstrate that our approach
works. Other implementations based on this approach may choose another graph
partition algorithm that satisfies specific requirements.
Partition (G)
1. Find the edge with maximal weight Fi,j in graph G such that Si + Sj ≤ N. If
   there is no such edge, go to step 4.
2. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
3. Remove both Vi and Vj from G; go to step 1.
4. Find a pair of vertices Vi and Vj in G such that Si + Sj ≤ N. If there is no pair
   satisfying the criteria, go to step 7.
5. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
6. Remove both Vi and Vj from G; go to step 4.
7. End.
The procedure for merging vertices Vi and Vj is:
Merge (Vi, Vj)
1. Add a new vertex Vk to G.
2. Pick an edge E connecting Vt with either Vi or Vj. If there is no such edge, go
   to step 6.
3. If there is already an edge F connecting Vt to Vk,
4. then add the weight of E to F, discard E, and go to step 2.
5. Else, replace the end of E which is either Vi or Vj with Vk, and go to step 2.
6. End.
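A compact sketch of Partition/Merge follows (steps 1-3 of Partition plus the Merge bookkeeping; the leftover pairing of steps 4-6 is omitted for brevity). The handler names, sizes, and edge weights are toy values, with the (A, B) edge weighted as in the earlier trace.

```python
# Greedy ICFG partitioning, following Partition/Merge above: repeatedly merge
# the heaviest-weight edge whose combined handler size still fits in one page.

def partition(sizes, weights, page_size):
    verts = {frozenset([v]): s for v, s in sizes.items()}          # vertex -> size
    edges = {frozenset([frozenset([a]), frozenset([b])]): w        # {u, v} -> weight
             for (a, b), w in weights.items()}

    def merge(u, v):                      # the Merge procedure
        k = u | v
        verts[k] = verts.pop(u) + verts.pop(v)
        for e in list(edges):
            if u in e or v in e:
                w = edges.pop(e)
                rest = e - {u, v}
                if rest:                  # external edge: re-point it at k,
                    t = next(iter(rest))  # summing weights of parallel edges
                    ek = frozenset([t, k])
                    edges[ek] = edges.get(ek, 0) + w
                # else: the (u, v) edge itself becomes internal and is dropped

    while True:                           # Partition, steps 1-3
        feasible = [(w, e) for e, w in edges.items()
                    if sum(verts[x] for x in e) <= page_size]
        if not feasible:
            return [set(v) for v in verts]
        merge(*max(feasible)[1])

sizes = {"A": 40, "B": 30, "C": 60, "D": 50, "E": 45}       # handler sizes (bytes)
weights = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 1,
           ("D", "E"): 1, ("E", "C"): 1}
print(sorted(sorted(p) for p in partition(sizes, weights, page_size=100)))
# -> [['A', 'B'], ['C'], ['D', 'E']]
```

The heavily connected pair (A, B) ends up on one page, while C is left alone because adding it to any neighbor would exceed the page size.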
Finally, each vertex in G is a collection of several bytecode handlers. The refinement
process is to collect bytecode handlers belonging to the same vertex and place them into
one memory page.
5 The Process of Rewriting the Virtual Machine
Our approach emphasizes that the arrangement of bytecode handlers affects the cache
miss rate. In other words, programmers should be able to speed up their programs
simply by reordering the “case” sub-clauses in the source files. Therefore, this study
optimizes the virtual machine in two distinct ways. The first approach revises the
order of the “case” sub-clauses in the sources of the virtual machine; if our theory is
correct, this tentative approach should show that the modified virtual machine
performs better in most test cases. The second version precisely reorganizes the
layout of the assembly code blocks of the bytecode handlers, and should therefore
yield larger improvements than the first version.
5.1 Source-Level Rearrangement
The concept of the refining process is to rearrange the order of the “case” statements in
the source file (execute.c). As a consequence, after translating the rearranged source
files, the compiler places the machine code of the bytecode handlers in the intended
order. The following steps outline the refining procedure.
A. Profiling. Run the Java benchmark program on the unmodified KVM. A custom
profiler traces the bytecode instruction sequence and generates statistics of
inter-bytecode transition counts. Although some patterns of instruction combinations
could be collected by inspecting the Java compiler, the dynamic approach also captures
application-dependent patterns.
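The pair statistics gathered in this profiling step can be sketched as follows. The trace format and the bytecode mnemonics here are illustrative assumptions; the actual profiler instruments the KVM interpreter.

```python
# Sketch of the profiling statistic: counting inter-bytecode transition
# pairs in a dynamic trace. The list-of-mnemonics trace format is assumed.
from collections import Counter

def count_transitions(trace):
    """Return a Counter mapping (previous, next) opcode pairs to counts."""
    return Counter(zip(trace, trace[1:]))

# A short synthetic trace:
trace = ["iload", "iload", "iadd", "istore", "iload", "iadd"]
pairs = count_transitions(trace)
# The hot pair ("iload", "iadd") occurs twice, so those two handlers are
# good candidates for placement in the same NAND flash page.
```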
B. Measuring the size of each bytecode handler. The refining program compiles the
KVM source files and measures the code size of each bytecode handler (i.e., the size of
each ‘case’ sub-clause) by parsing intermediate files generated by the compiler.
C. Partitioning the ICFG. The previous steps collect all the information necessary for
constructing the ICFG. The refining program then partitions the ICFG with a graph
partitioning algorithm; the result tells it how to group bytecode handlers together. For
example, a partition result groups (A, B) into one bundle and (C, D, E) into another, as
shown in Figure 8.
Cache Sensitive Code Arrangement for Virtual Machine 35
D. Rewriting the source file. According to the computed results, the refining program
rewrites the source file by arranging the order of all “case” sub-clauses within the
interpreter loop. Figure 9 shows the order of all “case” sub-clauses in the previous
example.
switch ( opcode ) {
case B: …;
case A: …;
case E: …;
case D: …;
case C: …;
}
Fig. 9. The output of rearranged case statements
5.2 Assembly-Level Rearrangement
The robust implementation of the refinement process consists of two steps. The
refinement process acts as a post-processor of the compiler: it parses the intermediate
files generated by the compiler, rearranges program blocks, and generates optimized
assembly code. Our implementation is inevitably compiler-dependent and CPU-
dependent; the current implementation is tightly integrated with gcc for ARM, but the
approach is easy to apply to other platforms. Figure 10 illustrates the outline of the
processing flow, entities, and relations between each entity. The following paragraphs
explain the functions of each step.
Fig. 10. Entities in the refinement process
A. Collecting dynamic bytecode instruction trace. The first step is to collect statis-
tics from real Java applications or benchmarks, because the following steps will need
these data for partitioning bytecode handlers. The modified KVM dumps the bytecode
instruction trace while running Java applications. A special program called TRACER
analyzes the trace dump to find the transition counts for all instruction pairs.
B. Rearranging the KVM interpreter. This core step is realized by a program called
REFINER, which acts as a post-processor of gcc. Its duty is to parse the bytecode
handlers expressed in assembly code and organize them into partitions, each of which
fits into one NAND flash page. The program consists of several subtasks, described
as follows.
(i) Parsing layout information of the original KVM. The first step is to compile the
original KVM. REFINER parses the intermediate files generated by gcc. Following the
structure of the interpreter in assembly code introduced in §3.2, REFINER analyzes the
jump table in the LookupTable trunk to find the address and size of each bytecode
handler.
(ii) Using the graph partition algorithm to group bytecode handlers into disjoint
partitions. At this stage, REFINER constructs the ICFG from two key inputs: (1) the
transition counts of bytecode instructions collected by TRACER; (2) the machine
code layout information collected in subtask (i). It uses the approximate algorithm
described in §4.3 to divide the undirected ICFG into disjoint partitions.
(iii) Rewriting the assembly code. REFINER parses and extracts assembly codes of
all bytecode handlers. Then, it creates a new assembly file and dumps all bytecode
handlers partition by partition according to the result of (ii).
(iv) Propagating symbol tables to each partition. As described in §3.2, there are
several symbol tables distributed in the BytecodeDispatch trunk. For most RISC pro-
cessors like ARM and MIPS, an instruction is unable to carry arbitrary constants as
operands because of limited instruction word length. The solution is to gather used
constants into a symbol table and place this table near the instructions that will access
these constants. Hence, the compiler generates instructions with relative addressing
operands to load constants from the nearby symbol tables. Take ARM for example, its
application binary interface (ABI) defines two instructions called LDR and ADR for
loading a constant from a symbol table to a register [17]. The ABI restricts the maximal
distance between a LDR/ADR instruction and the referred symbol table to 4K bytes.
Moreover, a cache miss would occur if a machine instruction in page X loaded a
constant si from a symbol table SY located in another page Y. Our solution is to create
a local symbol table SX in page X and copy the value si into the new table.
Consequently, the relative distance between si and the instruction never exceeds 4 KB,
nor does loading si cause a cache miss.
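The decision rule described in this subtask can be sketched as follows. This is a simplified model: the function name, the example page size, and the flat address arithmetic are assumptions rather than REFINER's actual logic; only the 4 KB reach limit comes from the text.

```python
# Sketch: decide whether a constant referenced by an LDR/ADR instruction
# must be copied into a local symbol table in the instruction's own page.
# PAGE is one of the page sizes used in the paper; the 4 KB reach limit is
# the ARM ABI restriction cited in the text. The function is illustrative.

PAGE = 2048    # NAND flash page size used in this example
REACH = 4096   # maximal LDR/ADR displacement allowed by the ARM ABI

def needs_local_copy(insn_addr, const_addr):
    """True if the reference would cross a page boundary (cache-miss risk)
    or exceed the ABI displacement limit, so the constant must be
    duplicated into a local symbol table within the instruction's page."""
    same_page = insn_addr // PAGE == const_addr // PAGE
    in_reach = abs(const_addr - insn_addr) <= REACH
    return not (same_page and in_reach)
```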
(v) Dumping the contents of partitions to NAND flash pages. The aim is to map
bytecode handlers to NAND flash pages: the reassembled bytecode handlers belonging
to the same partition are placed in one NAND flash page. After that, REFINER
refreshes the address and size information of all bytecode handlers. The updated
information lets REFINER pad each partition so that the starting address of every
partition aligns to a NAND flash page boundary.
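Step (v) amounts to the following address arithmetic, shown here as a sketch under assumed inputs (the per-partition sizes computed in the earlier steps; the function name is illustrative).

```python
# Sketch of page alignment: place each partition at a page boundary and
# record the padding needed. PAGE and the input format are assumptions.

PAGE = 2048

def layout(partition_sizes):
    """Return (start_address, padding_bytes) for each partition, with every
    start address aligned to a NAND flash page boundary."""
    result, addr = [], 0
    for sz in partition_sizes:
        assert sz <= PAGE, "a partition must fit in one flash page"
        result.append((addr, PAGE - sz))   # pad up to the next boundary
        addr += PAGE
    return result
```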
6 Evaluation
In this section, we begin with a brief introduction of the environment and conditions
used in the experiments. The first part of the experimental results covers the
source-level rearranged virtual machine; those positive results demonstrate that our
theory works. The second part covers the assembly-level rearranged virtual machine,
and further shows that this refinement produces better results than the source-level
version.
6.1 Evaluation Environment
Figure 11 shows the block diagram of our experimental setup. In order to mimic real
embedded applications, we ported the Java ME KVM to uClinux for ARM7. One
reason for using this platform is that uClinux supports the FLAT executable file
format, which is well suited to realizing XIP. We ran KVM/uClinux on a customized
gdb that dumped memory access traces and performance statistics to files. The
experimental setup assumed a specialized hardware unit acting as the NAND flash
memory controller, which loads program code from NAND flash pages into the cache,
and that all flash access operations work transparently without help from the operating
system; in other words, modifying the OS kernel for the experiment was unnecessary.
This experiment used “Embedded Caffeine Mark 3.0” [15] as the benchmark.
Fig. 11. Hierarchy of simulation environment. Software stack (top to bottom):
Embedded Caffeine Mark, J2ME API, K Virtual Machine (KVM) 1.1, uClinux kernel,
and GDB 5.0/ARMulator on Windows/Cygwin (Intel x86); the figure annotates the
layers as Java/RAM, ARM7/FLASH, and ARM7/ROM. Toolchain versions:
arm-elf-binutil 2.15, arm-elf-gcc 3.4.3, uClibc 0.9.18, J2ME (KVM) CLDC 1.1,
elf2flt 20040326.
There are several kinds of NAND flash commodities on the market: 512, 2048, and
4096 bytes per page. In this experiment, we modeled the cache simulator under the
following conditions:
1. There were four NAND flash page size options: 512, 1024, 2048 and 4096 bytes.
2. The cache was fully associative with a FIFO replacement policy.
3. The number of cache memory blocks varied from 2, 4, … up to 32.
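The cache model in conditions 1-3 can be sketched as a minimal simulator. The address-trace input format and the function name are assumptions; the fully associative FIFO policy and page-sized blocks come from the conditions above.

```python
# Sketch of the evaluated cache: fully associative, FIFO replacement,
# page-sized blocks. Input is a sequence of code addresses (assumed format).
from collections import deque

def count_misses(addresses, page_size, num_blocks):
    """Count cache misses. FIFO evicts the oldest-loaded page; unlike LRU,
    hits do not change the eviction order."""
    fifo, resident, misses = deque(), set(), 0
    for addr in addresses:
        page = addr // page_size
        if page not in resident:
            misses += 1
            if len(fifo) == num_blocks:
                resident.discard(fifo.popleft())   # evict oldest page
            fifo.append(page)
            resident.add(page)
    return misses
```

With two 512-byte blocks, the trace 0, 512, 1024, 0 misses on every access because page 0 is evicted before it is revisited; a third block removes the last miss.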
6.2 Results of Source-Level Rearrangement
First, we rearranged the “case” sub-clauses in the source codes using the introduced
method. Table 1 lists the raw statistics of cache miss rates, and Figure 12 plots the
charts of normalized cache miss rates from the optimized KVM. The experiment
assumed a maximal cache size of 64K bytes. For each NAND flash page size, the
number of cache blocks ranges from 4 up to (64K / NAND flash page size).
In Table 1, each column is the experimental result from one kind of KVM. The
“original” column refers to statistics from the original KVM, in which the bytecode
handlers are laid out in their original machine-code order. The second column,
“optimized”, is the result from the KVM refined with our approach.
For example, in the best case (2048 bytes per page, 8 cache pages), the optimized
KVM generates 105,157 misses, which is only 4.5% of the misses caused by the
original KVM, and the improvement ratio is 95%.
Broadly speaking, the experiment shows that the optimized KVM outperforms the
original KVM in most cases. Looking at the charts in Figure 12, the curves of
normalized cache miss rates (i.e., optimized_miss_rate / original_miss_rate) tend to be
concave: the improvement for eight cache pages is greater than that for four pages.
This benefit comes from the smaller “locality” of the optimized KVM; the cache can
hold more localities, which helps reduce cache misses. After the curve bottoms out,
the cache is large enough to hold most of the KVM program code, and as the cache
size grows further, the cache miss counts of all configurations converge.
However, the miss rate at 1024 bytes × 32 blocks is an exceptional case. This is
because our approach rearranges the order of bytecode handlers at the source level,
where it can hardly predict the precise starting address and code size of each bytecode
handler. This is the drawback of the source-level approach.
Fig. 12. The charts of normalized cache-miss rates from the source-level refined virtual
machine, with one curve per page size (512, 1024, 2048, and 4096 bytes/page). Each chart is
an experiment performed with a specific page size. The x-axis is the cache memory size
(number_of_pages * page_size, from 2048 to 65536 bytes); the y-axis is the normalized cache
miss rate (0.00 to 1.20).
Another Random Scribd Document
with Unrelated Content
Transactions On Highperformance Embedded Architectures And Compilers Iii 1st Edition Miquel Moreto
1631
§ 1. Bernhard of
Saxe Weimar.
CHAPTER IX.
THE DEATH OF WALLENSTEIN AND
THE TREATY OF PRAGUE.
Section I.—French Influence in Germany.
In Germany, after the death of Gustavus at Lützen,
it was as it was in Greece after the death of
Epaminondas at Mantinea. There was more
disturbance and more dispute after the battle than
before it. In Sweden, Christina, the infant daughter of Gustavus,
succeeded peaceably to her father's throne, and authority was
exercised without contradiction by the Chancellor Oxenstjerna. But,
wise and prudent as Oxenstjerna was, it was not in the nature of
things that he should be listened to as Gustavus had been listened
to. The chiefs of the army, no longer held in by a soldier's hand,
threatened to assume an almost independent position. Foremost of
these was the young Bernhard of Weimar, demanding, like
Wallenstein, a place among the princely houses of Germany. In his
person he hoped the glories of the elder branch of the Saxon House
would revive, and the disgrace inflicted upon it by Charles V. for its
attachment to the Protestant cause would be repaired. He claimed
the rewards of victory for those whose swords had gained it, and
payment for the soldiers, who during the winter months following
the victory at Lützen had received little or nothing. His own share
was to be a new duchy of Franconia, formed out of the united
bishoprics of Würzburg and Bamberg. Oxenstjerna was compelled to
admit his pretensions, and to confirm him in his duchy.
The step was thus taken which Gustavus had undoubtedly
contemplated, but which he had prudently refrained from carrying
§ 2. The League
of Heilbronn.
§ 3. Defection of
Saxony.
1631
§ 4. French
politics.
into action. The seizure of ecclesiastical lands in
which the population was Catholic was as great a
barrier to peace on the one side as the seizure of
the Protestant bishoprics in the north had been on the other. There
was, therefore, all the more necessity to be ready for war. If a
complete junction of all the Protestant forces was not to be had,
something at least was attainable. On April 23, 1633, the League of
Heilbronn was signed. The four circles of Swabia, Franconia, and the
Upper and Lower Rhine formed a union with Sweden for mutual
support.
It is not difficult to explain the defection of the
Elector of Saxony. The seizure of a territory by
military violence had always been most obnoxious
to him. He had resisted it openly in the case of Frederick in
Bohemia. He had resisted it, as far as he dared, in the case of
Wallenstein in Mecklenburg. He was not inclined to put up with it in
the case of Bernhard in Franconia. Nor could he fail to see that with
the prolongation of the war, the chances of French intervention were
considerably increasing.
In 1631 there had been a great effervescence of
the French feudal aristocracy against the royal
authority. But Richelieu stood firm. In March the
king's brother, Gaston Duke of Orleans, fled from
the country. In July his mother, Mary of Medici, followed his
example. But they had no intention of abandoning their position.
From their exile in the Spanish Netherlands they formed a close
alliance with Spain, and carried on a thousand intrigues with the
nobility at home. The Cardinal smote right and left with a heavy
hand. Amongst his enemies were the noblest names in France. The
Duke of Guise shrank from the conflict and retired to Italy to die far
from his native land. The keeper of the seals died in prison. His
kinsman, a marshal of France, perished on the scaffold. In the
summer of the year 1632, whilst Gustavus was conducting his last
campaign, there was a great rising in the south of France. Gaston
§ 5. Richelieu did
for France all that
could be done.
§ 6. Richelieu and
Germany.
himself came to share in the glory or the disgrace of the rebellion.
The Duke of Montmorenci was the real leader of the enterprise. He
was a bold and vigorous commander, the Rupert of the French
cavaliers. But his gay horsemen dashed in vain against the serried
ranks of the royal infantry, and he expiated his fault upon the
scaffold. Gaston, helpless and low-minded as he was, could live on,
secure under an ignominious pardon.
It was not the highest form of political life which
Richelieu was establishing. For the free expression
of opinion, as a foundation of government, France,
in that day, was not prepared. But within the limits
of possibility, Richelieu's method of ruling was a magnificent
spectacle. He struck down a hundred petty despotisms that he might
exalt a single despotism in their place. And if the despotism of the
Crown was subject to all the dangers and weaknesses by which
sooner or later the strength of all despotisms is eaten away,
Richelieu succeeded for the time in gaining the co-operation of those
classes whose good will was worth conciliating. Under him
commerce and industry lifted up their heads, knowledge and
literature smiled at last. Whilst Corneille was creating the French
drama, Descartes was seizing the sceptre of the world of science.
The first play of the former appeared on the stage in 1629. Year by
year he rose in excellence, till in 1636 he produced the 'Cid;' and
from that time one masterpiece followed another in rapid
succession. Descartes published his first work in Holland in 1637, in
which he laid down those principles of metaphysics which were to
make his name famous in Europe.
All this, however welcome to France, boded no
good to Germany. In the old struggles of the
sixteenth century, Catholic and Protestant each
believed himself to be doing the best, not merely for his own
country, but for the world in general. Alva, with his countless
executions in the Netherlands, honestly believed that the
Netherlands as well as Spain would be the better for the rude
surgery. The English volunteers, who charged home on a hundred
battle-fields in Europe, believed that they were benefiting Europe,
not England alone. It was time that all this should cease, and that
the long religious strife should have its end. It was well that
Richelieu should stand forth to teach the world that there were
objects for a Catholic state to pursue better than slaughtering
Protestants. But the world was a long way, in the seventeenth
century, from the knowledge that the good of one nation is the good
of all, and in putting off its religious partisanship France became
terribly hard and selfish in its foreign policy. Gustavus had been half
a German, and had sympathized deeply with Protestant Germany.
Richelieu had no sympathy with Protestantism, no sympathy with
German nationality. He doubtless had a general belief that the
predominance of the House of Austria was a common evil for all, but
he cared chiefly to see Germany too weak to support Spain. He
accepted the alliance of the League of Heilbronn, but he would have
been equally ready to accept the alliance of the Elector of Bavaria if
it would have served him as well in his purpose of dividing Germany.
§ 7. His policy
French, not
European.
§ 1. Saxon
negotiations with
Wallenstein.
The plan of Gustavus might seem unsatisfactory to
a patriotic German, but it was undoubtedly
conceived with the intention of benefiting Germany.
Richelieu had no thought of constituting any new
organization in Germany. He was already aiming at the left bank of
the Rhine. The Elector of Treves, fearing Gustavus, and doubtful of
the power of Spain to protect him, had called in the French, and had
established them in his new fortress of Ehrenbreitstein, which looked
down from its height upon the low-lying buildings of Coblentz, and
guarded the junction of the Rhine and the Moselle. The Duke of
Lorraine had joined Spain, and had intrigued with Gaston. In the
summer of 1632 he had been compelled by a French army to make
his submission. The next year he moved again, and the French again
interfered, and wrested from him his capital of Nancy. Richelieu
treated the old German frontier-land as having no rights against the
King of France.
Section II.—Wallenstein's Attempt to dictate
Peace.
Already, before the League of Heilbronn was
signed, the Elector of Saxony was in negotiation
with Wallenstein. In June peace was all but
concluded between them. The Edict of Restitution
was to be cancelled. A few places on the Baltic coast were to be
ceded to Sweden, and a portion at least of the Palatinate was to be
restored to the son of the Elector Frederick, whose death in the
preceding winter had removed one of the difficulties in the way of an
agreement. The precise form in which the restitution should take
place, however, still remained to be settled.
Such a peace would doubtless have been highly disagreeable to
adventurers like Bernhard of Weimar, but it would have given the
Protestants of Germany all that they could reasonably expect to
gain, and would have given the House of Austria one last chance of
§ 2. Opposition to
Wallenstein.
§ 3. General
disapprobation of
his proceedings.
§ 4. Wallenstein
and the Swedes.
taking up the championship of national interests against foreign
aggression.
Such last chances, in real life, are seldom taken
hold of for any useful purpose. If Ferdinand had
had it in him to rise up in the position of a national
ruler, he would have been in that position long before. His confessor,
Father Lamormain, declared against the concessions which
Wallenstein advised, and the word of Father Lamormain had always
great weight with Ferdinand.
Even if Wallenstein had been single-minded he
would have had difficulty in meeting such
opposition. But Wallenstein was not single-minded.
He proposed to meet the difficulties which were
made to the restitution of the Palatinate by giving the Palatinate,
largely increased by neighbouring territories, to himself. He would
thus have a fair recompense for the loss of Mecklenburg, which he
could no longer hope to regain. He fancied that the solution would
satisfy everybody. In fact, it displeased everybody. Even the
Spaniards, who had been on his side in 1632 were alienated by it.
They were especially jealous of the rise of any strong power near
the line of march between Italy and the Spanish Netherlands.
The greater the difficulties in Wallenstein's way the
more determined he was to overcome them.
Regarding himself, with some justification, as a
power in Germany, he fancied himself able to act at the head of his
army as if he were himself the ruler of an independent state. If the
Emperor listened to Spain and his confessor in 1633 as he had
listened to Maximilian and his confessor in 1630, Wallenstein might
step forward and force upon him a wiser policy. Before the end of
August he had opened a communication with Oxenstjerna, asking for
his assistance in effecting a reasonable compromise, whether the
Emperor liked it or not. But he had forgotten that such a proposal as
this can only be accepted where there is confidence in him who
makes it. In Wallenstein—the man of many schemes and many
§ 5. Was he in
earnest?
§ 6. He attacks
the Saxons.
§ 7. Bernhard at
Ratisbon.
§ 8. Wallenstein's
difficulties.
intrigues—no man had any confidence whatever. Oxenstjerna
cautiously replied that if Wallenstein meant to join him against the
Emperor he had better be the first to begin the attack.
Whether Wallenstein seriously meant at this time
to move against the emperor it is impossible to say.
He loved to enter upon plots in every direction
without binding himself to any; but he was plainly in a dangerous
position. How could he impose peace upon all parties when no single
party trusted him?
If he was not trusted, however, he might still make
himself feared. Throwing himself vigorously upon
Silesia, he forced the Swedish garrisons to
surrender, and, presenting himself upon the frontiers of Saxony,
again offered peace to the two northern electors.
But Wallenstein could not be everywhere. Whilst
the electors were still hesitating, Bernhard made a
dash at Ratisbon, and firmly established himself in
the city, within a little distance of the Austrian frontier. Wallenstein,
turning sharply southward, stood in the way of his further advance,
but he did nothing to recover the ground which had been lost. He
was himself weary of the war. In his first command he had aimed at
crushing out all opposition in the name of the imperial authority. His
judgment was too clear to allow him to run the old course. He saw
plainly that strength was now to be gained only by allowing each of
the opposing forces their full weight. 'If the Emperor,' he said, 'were
to gain ten victories it would do him no good. A single defeat would
ruin him.' In December he was back again in Bohemia.
It was a strange, Cassandra-like position, to be
wiser than all the world, and to be listened to by
no one; to suffer the fate of supreme intelligence
which touches no moral chord and awakens no human sympathy.
For many months the hostile influences had been gaining strength at
Vienna. There were War-Office officials whose wishes Wallenstein
§ 9. Opposition of
Spain.
§ 10. The Cardinal
Infant.
§ 11. The
Emperor's
systematically disregarded; Jesuits who objected to peace with
heretics at all; friends of the Bavarian Maximilian who thought that
the country round Ratisbon should have been better defended
against the enemy; and Spaniards who were tired of hearing that all
matters of importance were to be settled by Wallenstein alone.
The Spanish opposition was growing daily. Spain
now looked to the German branch of the House of
Austria to make a fitting return for the aid which
she had rendered in 1620. Richelieu, having mastered Lorraine, was
pushing on towards Alsace, and if Spain had good reasons for
objecting to see Wallenstein established in the Palatinate, she had
far better reasons for objecting to see France established in Alsace.
Yet for all these special Spanish interests Wallenstein cared nothing.
His aim was to place himself at the head of a German national force,
and to regard all questions simply from his own point of view. If he
wished to see the French out of Alsace and Lorraine, he wished to
see the Spaniards out of Alsace and Lorraine as well.
And, as was often the case with Wallenstein, a
personal difference arose by the side of the political
difference. The Emperor's eldest son, Ferdinand,
the King of Hungary, was married to a Spanish Infanta, the sister of
Philip IV., who had once been the promised bride of Charles I. of
England. Her brother, another Ferdinand, usually known from his
rank in Church and State as the Cardinal-Infant, had recently been
appointed Governor of the Spanish Netherlands, and was waiting in
Italy for assistance to enable him to conduct an army through
Germany to Brussels. That assistance Wallenstein refused to give.
The military reasons which he alleged for his refusal may have been
good enough, but they had a dubious sound in Spanish ears. It
looked as if he was simply jealous of Spanish influence in Western
Germany.
Such were the influences which were brought to
bear upon the Emperor after Wallenstein's return
from Ratisbon in December. Ferdinand, as usual,
hesitation.
1634
§ 12. Wallenstein
and the army.
§ 1. Oñate's
movements.
§ 2. Belief at
Vienna that
Wallenstein was a
traitor.
was distracted between the two courses proposed.
Was he to make the enormous concessions to the
Protestants involved in the plan of Wallenstein; or was he to fight it
out with France and the Protestants together according to the plan
of Spain? To Wallenstein by this time the Emperor's resolutions had
become almost a matter of indifference. He had resolved to force a
reasonable peace upon Germany, with the Emperor, if it might be so;
without him, if he refused his support.
Wallenstein was well aware that his whole plan
depended on his hold over the army. In January he
received assurances from three of his principal
generals, Piccolomini, Gallas, and Aldringer, that
they were ready to follow him wheresoever he might lead them, and
he was sanguine enough to take these assurances for far more than
they were worth. Neither they nor he himself were aware to what
lengths he would go in the end. For the present it was a mere
question of putting pressure upon the Emperor to induce him to
accept a wise and beneficent peace.
Section III.—Resistance to Wallenstein's Plans.
The Spanish ambassador, Oñate, was ill at ease.
Wallenstein, he was convinced, was planning
something desperate. What it was he could hardly
guess; but he was sure that it was something most prejudicial to the
Catholic religion and the united House of Austria. The worst was that
Ferdinand could not be persuaded that there was cause for
suspicion. The sick man, said Oñate, speaking of the Emperor, will
die in my arms without my being able to help him.
Such was Oñate's feelings toward the end of
January. Then came information that the case was
worse than even he had deemed possible.
Wallenstein, he learned, had been intriguing with
the Bohemian exiles, who had offered, with
§ 3. Oñate
informs
Ferdinand.
§ 4. Decision of
the Emperor
against
Wallenstein.
§ 5.
Determination to
displace
Wallenstein.
Richelieu's consent, to place upon his head the crown of Bohemia,
which had fourteen years before been snatched from the unhappy
Frederick. In all this there was much exaggeration. Though
Wallenstein had listened to these overtures, it is almost certain that
he had not accepted them. But neither had he revealed them to the
government. It was his way to keep in his hands the threads of
many intrigues to be used or not to be used as occasion might
serve.
Oñate, naturally enough, believed the worst. And
for him the worst was the best. He went
triumphantly to Eggenberg with his news, and then
to Ferdinand. Coming alone, this statement might
perhaps have been received with suspicion. Coming, as it did, after
so many evidences that the general had been acting in complete
independence of the government, it carried conviction with it.
Ferdinand had long been tossed backwards and
forwards by opposing influences. He had given no
answer to Wallenstein's communication of the
terms of peace arranged with Saxony. The
necessity of deciding, he said, would not allow him
to sleep. It was in his thoughts when he lay down and when he
arose. Prayers to God to enlighten the mind of the Emperor had
been offered in the churches of Vienna.
All this hesitation was now at an end. Ferdinand
resolved to continue the war in alliance with Spain,
and, as a necessary preliminary, to remove
Wallenstein from his generalship. But it was more
easily said than done. A declaration was drawn up
releasing the army from its obedience to Wallenstein, and
provisionally appointing Gallas, who had by this time given
assurances of loyalty, to the chief command. It was intended, if
circumstances proved favourable, to intrust the command ultimately
to the young King of Hungary.
§ 6. The Generals
gained over.
§ 7. Attempt to
seize Wallenstein.
§ 8. Wallenstein
at Pilsen.
§ 9. The colonels
engage to support
The declaration was kept secret for many days. To
publish it would only be to provoke the rebellion
which was feared. The first thing to be done was to
gain over the principal generals. In the beginning of February
Piccolomini and Aldringer expressed their readiness to obey the
Emperor rather than Wallenstein. Commanders of a secondary rank
would doubtless find their position more independent under an
inexperienced young man like the King of Hungary than under the
first living strategist. These two generals agreed to make themselves
masters of Wallenstein's person and to bring him to Vienna to
answer the accusations of treason against him.
For Oñate this was not enough. It would be easier,
he said, to kill the general than to carry him off.
The event proved that he was right. On February 7,
Aldringer and Piccolomini set off for Pilsen with the intention of
capturing Wallenstein. But they found the garrison faithful to its
general, and they did not even venture to make the attempt.
Wallenstein's success depended on his chance of
carrying with him the lower ranks of the army. On
the 19th he summoned the colonels round him and
assured them that he would stand security for money which they
had advanced in raising their regiments, the repayment of which had
been called in question. Having thus won them to a favourable
mood, he told them that it had been falsely stated that he wished to
change his religion and attack the Emperor. On the contrary, he was
anxious to conclude a peace which would benefit the Emperor and
all who were concerned. As, however, certain persons at Court had
objected to it, he wished to ask the opinion of the army on its terms.
But he must first of all know whether they were ready to support
him, as he knew that there was an intention to put a disgrace upon
him.
It was not the first time that Wallenstein had
appealed to the colonels. A month before, when
the news had come of the alienation of the Court,
he had induced them to sign an acknowledgment
that they would stand by him, from which all
reference to the possibility of his dismissal was expressly excluded.
They now, on February 20, signed a fresh agreement, in which they
engaged to defend him against the machinations of his enemies,
upon his promising to undertake nothing against the Emperor or the
Catholic religion.
§ 1. The garrison of Prague abandons him.
Section IV.—Assassination of Wallenstein.
Wallenstein thus hoped, with the help of the army,
to force the Emperor's hand, and to obtain his
signature to the peace. Of the co-operation of the
Elector of Saxony he was already secure; and since
the beginning of February he had been pressing Oxenstjerna and
Bernhard to come to his aid. If all the armies in the field declared for
peace, Ferdinand would be compelled to abandon the Spaniards and
to accept the offered terms. Without some such hazardous venture,
Wallenstein would be checkmated by Oñate. The Spaniard had been
unceasingly busy during these weeks of intrigue. Spanish gold was
provided to content the colonels for their advances, and hopes of
promotion were scattered broadcast amongst them. Two other of
the principal generals had gone over to the Court, and on February
18, the day before the meeting at Pilsen, a second declaration had
been issued accusing Wallenstein of treason, and formally depriving
him of the command. Wallenstein, before this declaration reached
him, had already appointed a meeting of large masses of troops to
take place on the White Hill before Prague on the 21st, where he
hoped to make his intentions more generally known. But he had
miscalculated the devotion of the army to his person. The garrison
of Prague refused to obey his orders. Soldiers and citizens alike
declared for the Emperor. He was obliged to retrace his steps. 'I had
peace in my hands,' he said. Then he added, 'God is righteous,' as
if still counting on the aid of Heaven in so good a work.
§ 2. Understanding with the Swedes.
§ 3. His arrival at Eger.
§ 4. Wallenstein's assassination.
He did not yet despair. He ordered the colonels to
meet him at Eger, assuring them that all that he
was doing was for the Emperor's good. He had
now at last hopes of other assistance. Oxenstjerna,
indeed, ever cautious, still refused to do anything for him till he had
positively declared against the Emperor. Bernhard, equally prudent
for some time, had been carried away by the news, which reached
him on the 21st, of the meeting at Pilsen, and the Emperor's
denouncement of the general. Though he was still suspicious, he
moved in the direction of Eger.
On the 24th Wallenstein entered Eger. In what
precise way he meant to escape from the labyrinth
in which he was, or whether he had still any clear
conception of the course before him, it is impossible to say. But
Arnim was expected at Eger, as well as Bernhard, and it may be that
Wallenstein fancied still that he could gather all the armies of
Germany into his hands, to defend the peace which he was ready to
make. The great scheme, however, whatever it was, was doomed to
failure. Amongst the officers who accompanied him was a Colonel
Butler, an Irish Catholic, who had no fancy for such dealings with
Swedish and Saxon heretics. Already he had received orders from
Piccolomini to bring in Wallenstein dead or alive. No official
instructions had been given to Piccolomini. But the thought was
certain to arise in the minds of all who retained their loyalty to the
Emperor. A general who attempts to force his sovereign to a certain
political course with the help of the enemy is placed, by that very
fact, beyond the pale of law.
The actual decision did not lie with Butler. The
fortress was in the hands of two Scotch officers,
Leslie and Gordon. As Protestants, they might have
been expected to feel some sympathy with Wallenstein. But the
sentiment of military honour prevailed. On the morning of the 25th
they were called upon by one of the general's confederates to take
orders from Wallenstein alone. 'I have sworn to obey the Emperor,'
answered Gordon, at last, 'and who shall release me from my oath?'
'You, gentlemen,' was the reply, 'are strangers in the Empire. What
have you to do with the Empire?' Such arguments were addressed
to deaf ears. That afternoon Butler, Leslie, and Gordon consulted
together. Leslie, usually a silent, reserved man, was the first to
speak. 'Let us kill the traitors,' he said. That evening Wallenstein's
chief supporters were butchered at a banquet. Then there was a
short and sharp discussion whether Wallenstein's life should be
spared. Bernhard's troops were known to be approaching, and the
conspirators dared not leave a chance of escape open. An Irish
captain, Devereux by name, was selected to do the deed. Followed
by a few soldiers, he burst into the room where Wallenstein was
preparing for rest. 'Scoundrel and traitor,' were the words which he
flung at Devereux as he entered. Then, stretching out his arms, he
received the fatal blow in his breast. The busy brain of the great
calculator was still forever.
§ 5. Reason of his failure.
§ 6. Comparison between Gustavus and Wallenstein.
The attempt to snatch at a wise and beneficent
peace by mingled force and intrigue had failed.
Other generals—Cæsar, Cromwell, Napoleon—have
succeeded to supreme power with the support of an armed force.
But they did so by placing themselves at the head of the civil
institutions of their respective countries, and by making themselves
the organs of a strong national policy. Wallenstein stood alone in
attempting to guide the political destinies of a people, while
remaining a soldier and nothing more. The plan was doomed to
failure, and is only excusable on the ground that there were no
national institutions at the head of which Wallenstein could place
himself; not even a chance of creating such institutions afresh.
In spite of all his faults, Germany turns ever to
Wallenstein as she turns to no other amongst the
leaders of the Thirty Years' War. From amidst the
divisions and weaknesses of his native country, a
great poet enshrined his memory in a succession of noble dramas.
Such faithfulness is not without a reason. Gustavus's was a higher
nature than Wallenstein's. Some of his work, at least the rescue of
German Protestantism from oppression, remained imperishable,
whilst Wallenstein's military and political success vanished into
nothingness. But Gustavus was a hero not of Germany as a nation,
but of European Protestantism. His Corpus Evangelicorum was at the
best a choice of evils to a German. Wallenstein's wildest schemes,
impossible of execution as they were by military violence, were
always built upon the foundation of German unity. In the way in
which he walked that unity was doubtless unattainable. To combine
devotion to Ferdinand with religious liberty was as hopeless a
conception as it was to burst all bonds of political authority on the
chance that a new and better world would spring into being out of
the discipline of the camp. But during the long dreary years of
confusion which were to follow, it was something to think of the last
supremely able man whose life had been spent in battling against
the great evils of the land, against the spirit of religious intolerance,
and the spirit of division.
Section V.—Imperialist Victories and the Treaty of Prague.
§ 1. Campaign of 1634.
For the moment, the House of Austria seemed to
have gained everything by the execution or the
murder of Wallenstein, whichever we may choose
to call it. The army was reorganized and placed under the command
of the Emperor's son, the King of Hungary. The Cardinal-Infant, now
eagerly welcomed, was preparing to join him through Tyrol. And
while on the one side there was union and resolution, there was
division and hesitation on the other. The Elector of Saxony stood
aloof from the League of Heilbronn, weakly hoping that the terms of
peace which had been offered him by Wallenstein would be
confirmed by the Emperor now that Wallenstein was gone. Even
amongst those who remained under arms there was no unity of
purpose. Bernhard, the daring and impetuous, was not of one mind
with the cautious Horn, who commanded the Swedish forces, and
both agreed in thinking Oxenstjerna remiss because he did not
supply them with more money than he was able to provide.
§ 2. The Battle of Nördlingen.
§ 3. Important results from it.
§ 4. French intervention.
As might have been expected under these
circumstances, the imperials made rapid progress.
Ratisbon, the prize of Bernhard the year before,
surrendered to the king of Hungary in July. Then Donauwörth was
stormed, and siege was laid to Nördlingen. On September 2 the
Cardinal-Infant came up with 15,000 men. The enemy watched the
siege with a force far inferior in numbers. Bernhard was eager to put
all to the test of battle. Horn recommended caution in vain. Against
his better judgment he consented to fight. On September 6 the
attack was made. By the end of the day Horn was a prisoner, and
Bernhard was in full retreat, leaving 10,000 of his men dead upon
the field, and 6,000 prisoners in the hands of the enemy, whilst the
imperialists lost only 1,200 men.
Since the day of Breitenfeld, three years before,
there had been no such battle fought as this of
Nördlingen. As Breitenfeld had recovered the
Protestant bishoprics of the north, Nördlingen recovered the Catholic
bishoprics of the south. Bernhard's Duchy of Franconia disappeared
in a moment under the blow. Before the spring of 1635 came, the
whole of South Germany, with the exception of one or two fortified
posts, was in the hands of the imperial commanders. The Cardinal-
Infant was able to pursue his way to Brussels, with the assurance
that he had done a good stroke of work on the way.
The victories of mere force are never fruitful of
good. As it had been after the successes of Tilly in
1622, and the successes of Wallenstein in 1626
and 1627, so it was now with the successes of the King of Hungary
in 1634 and 1635. The imperialist armies had gained victories, and
had taken cities. But the Emperor was none the nearer to the
confidence of Germans. An alienated people, crushed by military
force, served merely as a bait to tempt foreign aggression, and to
make the way easy before it. After 1622, the King of Denmark had
been called in. After 1627, an appeal was made to the King of
Sweden. After 1634, Richelieu found his opportunity. The bonds
between France and the mutilated League of Heilbronn were drawn
more closely. German troops were to be taken into French pay, and
the empty coffers of the League were filled with French livres. He
who holds the purse holds the sceptre, and the princes of Southern
and Western Germany, whether they wished it or not, were reduced
to the position of satellites revolving round the central orb at Paris.
§ 5. The Peace of Prague.
Nowhere was the disgrace of submitting to French
intervention felt so deeply as at Dresden. The
battle of Nördlingen had cut short any hopes which
John George might have entertained of obtaining that which
Wallenstein would willingly have granted him. But, on the other
hand, Ferdinand had learned something from experience. He would
allow the Edict of Restitution to fall, though he was resolved not to
make the sacrifice in so many words. But he refused to replace the
Empire in the condition in which it had been before the war. The
year 1627 was to be chosen as the starting point for the new
arrangement. The greater part of the northern bishoprics would thus
be saved to Protestantism. But Halberstadt would remain in the
hands of a Catholic bishop, and the Palatinate would be lost to
Protestantism for ever. Lusatia, which had been held in the hands of
the Elector of Saxony for his expenses in the war of 1620, was to be
ceded to him permanently, and Protestantism in Silesia was to be
placed under the guarantee of the Emperor. Finally, Lutheranism
alone was still reckoned as the privileged religion, so that Hesse
Cassel and the other Calvinist states gained no security at all. On
May 30, 1635, a treaty embodying these arrangements was signed
at Prague by the representatives of the Emperor and the Elector of
Saxony. It was intended not to be a separate treaty, but to be the
starting point of a general pacification. Most of the princes and
towns so accepted it, after more or less delay, and acknowledged
the supremacy of the Emperor on its conditions. Yet it was not in the
nature of things that it should put an end to the war. It was not an
agreement which any one was likely to be enthusiastic about. The
ties which bound Ferdinand to his Protestant subjects had been
rudely broken, and the solemn promise to forget and forgive could
not weld the nation into that unity of heart and spirit which was
needed to resist the foreigner. A Protestant of the north might
reasonably come to the conclusion that the price to be paid to the
Swede and the Frenchman for the vindication of the rights of the
southern Protestants was too high to make it prudent for him to
continue the struggle against the Emperor. But it was hardly likely
that he would be inclined to fight very vigorously for the Emperor on
such terms.
§ 6. It fails in securing general acceptance.
§ 7. Degeneration of the war.
If the treaty gave no great encouragement to
anyone who was comprehended by it, it threw still
further into the arms of the enemy those who were
excepted from its benefits. The leading members of
the League of Heilbronn were excepted from the general amnesty,
though hopes of better treatment were held out to them if they
made their submission. The Landgrave of Hesse Cassel was shut out
as a Calvinist. Besides such as nourished legitimate grievances, there
were others who, like Bernhard, were bent upon carving out a
fortune for themselves, or who had so blended in their own minds
consideration for the public good as to lose all sense of any
distinction between the two.
There was no lack here of materials for a long and
terrible struggle. But there was no longer any noble
aim in view on either side. The ideal of Ferdinand
and Maximilian was gone. The Church was not to recover its lost
property. The Empire was not to recover its lost dignity. The ideal of
Gustavus of a Protestant political body was equally gone. Even the
ideal of Wallenstein, that unity might be founded on an army, had
vanished. From henceforth French and Swedes on the one side,
Austrians and Spaniards on the other, were busily engaged in riving
at the corpse of the dead Empire. The great quarrel of principle had
merged into a mere quarrel between the Houses of Austria and
Bourbon, in which the shred of principle which still remained in the
question of the rights of the southern Protestants was almost
entirely disregarded.
§ 8. Condition of Germany.
§ 9. Notes of an English traveller (1636).
Horrible as the war had been from its
commencement, it was every day assuming a more
horrible character. On both sides all traces of
discipline had vanished in the dealings of the armies with the
inhabitants of the countries in which they were quartered. Soldiers
treated men and women as none but the vilest of mankind would
now treat brute beasts. 'He who had money,' says a contemporary,
'was their enemy. He who had none was tortured because he had it
not.' Outrages of unspeakable atrocity were committed everywhere.
Human beings were driven naked into the streets, their flesh pierced
with needles, or cut to the bone with saws. Others were scalded
with boiling water, or hunted with fierce dogs. The horrors of a town
taken by storm were repeated every day in the open country. Even
apart from its excesses, the war itself was terrible enough. When
Augsburg was besieged by the imperialists, after their victory at
Nördlingen, it contained an industrious population of 70,000 souls.
After a siege of seven months, 10,000 living beings, wan and
haggard with famine, remained to open the gates to the conquerors,
and the great commercial city of the Fuggers dwindled down into a
country town.
How is it possible to bring such scenes before our
eyes in their ghastly reality? Let us turn for the
moment to some notes taken by the companion of
an English ambassador who passed through the
country in 1636. As the party were towed up the Rhine from
Cologne, on the track so well known to the modern tourist, they
passed by many villages 'pillaged and shot down.' Further on, a
French garrison was in Ehrenbreitstein, firing down upon Coblentz,
which had just been taken by the imperialists. 'They in the town, if
they do but look out of their windows, have a bullet presently
presented at their head.' More to the south, things grew worse. At
Bacharach, 'the poor people are found dead with grass in their
mouths.' At Rüdesheim, 'many persons were praying where dead
bones were in a little old house; and here his Excellency gave some
relief to the poor, which were almost starved, as it appeared by the
violence they used to get it from one another.' At Mentz, the
ambassador was obliged to remain on shipboard, for there was
'nothing to relieve us, since it was taken by the King of Sweden, and
miserably battered.... Here, likewise, the poor people were almost
starved, and those that could relieve others before now humbly
begged to be relieved; and after supper all had relief sent from the
ship ashore, at the sight of which they strove so violently that some
of them fell into the Rhine, and were like to have been drowned.' Up
the Main, again, all the towns, villages, and castles 'be battered,
pillaged, or burnt.' After leaving Würzburg, the ambassador's train
came to plundered villages, and then to Neustadt, which 'hath been
a fair city, though now pillaged and burnt miserably.' Poor children
were sitting at their doors 'almost starved to death, his Excellency
giving them food and leaving money with their parents to help them,
if but for a time.' In the Upper Palatinate, they passed by churches
'demolished to the ground,' and through woods in danger,
'understanding that Croats were lying hereabout.' Further on they
stayed for dinner at a poor little village which 'hath been pillaged
eight-and-twenty times in two years, and twice in one day.' And so
on, and so on. The corner of the veil is lifted up in the pages of the
old book, and the rest is left to the imagination to picture forth, as
best it may, the misery behind. After reading the sober narrative, we
shall perhaps not be inclined to be so very hard upon the Elector of
Saxony for making peace at Prague.
CHAPTER X.
THE PREPONDERANCE OF FRANCE.
Section I.—Open Intervention of France.
§ 1. Protestantism not yet out of danger.
§ 2. The allies of France.
The peacemakers of Prague hoped to restore the
Empire to its old form. But this could not be.
Things done cannot pass away as though they had
never been. Ferdinand's attempt to gain a
partizan's advantage for his religion by availing himself of legal forms
had given rise to a general distrust. Nations and governments, like
individual men, are tied and bound by the chain of their sins, from
which they can be freed only when a new spirit is breathed into
them. Unsatisfactory as the territorial arrangements of the peace
were, the entire absence of any constitutional reform in connexion
with the peace was more unsatisfactory still. The majority in the two
Upper Houses of the Diet was still Catholic; the Imperial Council was
still altogether Catholic. It was possible that the Diet and Council,
under the teaching of experience, might refrain from pushing their
pretensions as far as they had pushed them before; but a
government which refrains from carrying out its principles from
motives of prudence cannot inspire confidence. A strong central
power would never arise in such a way, and a strong central power
to defend Germany against foreign invasion was the especial need of
the hour.
In the failure of the Elector of Saxony to obtain
some of the most reasonable of the Protestant
demands lay the best excuse of men like Bernhard
of Saxe-Weimar and William of Hesse Cassel for refusing the terms
of accommodation offered. Largely as personal ambition and greed
of territory found a place in the motives of these men, it is not
absolutely necessary to assert that their religious enthusiasm was
nothing more than mere hypocrisy. They raised the war-cry of 'God
with us' before rushing to the storm of a city doomed to massacre
and pillage; they set apart days for prayer and devotion when battle
was at hand—veiling, perhaps, from their own eyes the hideous
misery which they were spreading around, in contemplation of the
loftiness of their aim: for, in all but the most vile, there is a natural
tendency to shrink from contemplating the lower motives of action,
and to fix the eyes solely on the higher. But the ardour inspired by a
military career, and the mere love of fighting for its own sake, must
have counted for much; and the refusal to submit to a domination
which had been so harshly used soon grew into a restless disdain of
all authority whatever. The nobler motives which had imparted a
glow to the work of Tilly and Gustavus, and which even lit up the
profound selfishness of Wallenstein, flickered and died away, till the
fatal disruption of the Empire was accomplished amidst the strivings
and passions of heartless and unprincipled men.
§ 3. Foreign intervention.
The work of riving Germany in pieces was not
accomplished by Germans alone. As in nature a
living organism which has become unhealthy and
corrupt is seized upon by the lower forms of animal life, a nation
divided amongst itself, and devoid of a sense of life within it higher
than the aims of parties and individuals, becomes the prey of
neighbouring nations, which would not have ventured to meddle
with it in the days of its strength. The carcase was there, and the
eagles were gathered together. The gathering of Wallenstein's army
in 1632, the overthrow of Wallenstein in 1634, had alike been made
possible by the free use of Spanish gold. The victory of Nördlingen
had been owing to the aid of Spanish troops; and the aim of Spain
was not the greatness or peace of Germany, but at the best the
greatness of the House of Austria in Germany; at the worst, the
maintenance of the old system of intolerance and unthinking
obedience, which had been the ruin of Germany. With Spain for an
ally, France was a necessary enemy. The strife for supreme power
between the two representative states of the old system and the
new could not long be delayed, and the German parties would be
dragged, consciously or unconsciously, in their wake. If Bernhard
became a tool of Richelieu, Ferdinand became a tool of Spain.
§ 4. Alsace and Lorraine.
§ 5. Richelieu asks for fortresses in Alsace.
§ 6. War between France and Spain.
In this phase of the war Protestantism and
Catholicism, tolerance and intolerance, ceased to
be the immediate objects of the strife. The
possession of Alsace and Lorraine rose into primary importance, not
because, as in our own days, Germany needed a bulwark against
France, or France needed a bulwark against Germany, but because
Germany was not strong enough to prevent these territories from
becoming the highway of intercourse between Spain and the
Spanish Netherlands. The command of the sea was in the hands of
the Dutch, and the valley of the Upper Rhine was the artery through
which the life blood of the Spanish monarchy flowed. If Spain or the
Emperor, the friend of Spain, could hold that valley, men and
munitions of warfare would flow freely to the Netherlands to support
the Cardinal-Infant in his struggle with the Dutch. If Richelieu could
lay his hand heavily upon it, he had seized his enemy by the throat,
and could choke him as he lay.
After the battle of Nördlingen, Richelieu's first
demand from Oxenstjerna as the price of his
assistance had been the strong places held by
Swedish garrisons in Alsace. As soon as he had
them safely under his control, he felt himself strong enough to
declare war openly against Spain.
On May 19, eleven days before peace was agreed
upon at Prague, the declaration of war was
delivered at Brussels by a French herald. To the
astonishment of all, France was able to place in the field what was
then considered the enormous number of 132,000 men. One army
was to drive the Spaniards out of the Milanese, and to set free the
Italian princes. Another was to defend Lorraine whilst Bernhard
crossed the Rhine and carried on war in Germany. The main force
was to be thrown upon the Spanish Netherlands, and, after effecting
a junction with the Prince of Orange, was to strike directly at
Brussels.
Section II.—Spanish Successes.
§ 1. Failure of the French attack on the Netherlands.
§ 2. Spanish invasion of France.
Precisely in the most ambitious part of his
programme Richelieu failed most signally. The
junction with the Dutch was effected without
difficulty; but the hoped-for instrument of success
proved the parent of disaster. Whatever Flemings and Brabanters
might think of Spain, they soon made it plain that they would have
nothing to do with the Dutch. A national enthusiasm against
Protestant aggression from the north made defence easy, and the
French army had to return completely unsuccessful. Failure, too, was
reported from other quarters. The French armies had no experience
of war on a large scale, and no military leader of eminent ability had
yet appeared to command them. The Italian campaign came to
nothing, and it was only by a supreme effort of military skill that
Bernhard, driven to retreat, preserved his army from complete
destruction.
In 1636 France was invaded. The Cardinal-Infant
crossed the Somme, took Corbie, and advanced to
the banks of the Oise. All Paris was in commotion.
An immediate siege was expected, and inquiry was
anxiously made into the state of the defences. Then Richelieu,
coming out of his seclusion, threw himself upon the nation. He
appealed to the great legal, ecclesiastical, and commercial
corporations of Paris, and he did not appeal in vain. Money,
voluntarily offered, came pouring into the treasury for the payment
of the troops. Those who had no money gave themselves eagerly for
military service. It was remarked that Paris, so fanatically Catholic in
the days of St. Bartholomew and the League, entrusted its defence
to the Protestant marshal La Force, whose reputation for integrity
inspired universal confidence.
§ 3. The invaders driven back.
§ 4. Battle of Wittstock.
§ 5. Death of Ferdinand II.
§ 6. Ferdinand III.
The resistance undertaken in such a spirit in Paris
was imitated by the other towns of the kingdom.
Even the nobility, jealous as they were of the
Cardinal, forgot their grievances as an aristocracy in their duties as
Frenchmen. Their devotion was not put to the test of action. The
invaders, frightened at the unanimity opposed to them, hesitated
and turned back. In September, Lewis took the field in person. In
November he appeared before Corbie; and the last days of the year
saw the fortress again in the keeping of a French garrison. The war,
which was devastating Germany, was averted from France by the
union produced by the mild tolerance of Richelieu.
In Germany, too, affairs had taken a turn. The
Elector of Saxony had hoped to drive the Swedes
across the sea; but a victory gained on October 4,
at Wittstock, by the Swedish general, Baner, the ablest of the
successors of Gustavus, frustrated his intentions. Henceforward
North Germany was delivered over to a desolation with which even
the misery inflicted by Wallenstein affords no parallel.
Amidst these scenes of failure and misfortune the
man whose policy had been mainly responsible for
the miseries of his country closed his eyes for ever.
On February 15, 1637, Ferdinand II. died at Vienna. Shortly before
his death the King of Hungary had been elected King of the Romans,
and he now, by his father's death, became the Emperor Ferdinand
III.
The new Emperor had no vices. He did not even
care, as his father did, for hunting and music.
When the battle of Nördlingen was won under his
command he was praying in his tent whilst his soldiers were fighting.
He sometimes took upon himself to give military orders, but the
handwriting in which they were conveyed was such an abominable
scrawl that they only served to enable his generals to excuse their
defeats by the impossibility of reading their instructions. His great
passion was for keeping strict accounts. Even the Jesuits, it is said,
found out that, devoted as he was to his religion, he had a sharp eye
for his expenditure. One day they complained that some tolls
bequeathed to them by his father had not been made over to them,
and represented the value of the legacy as a mere trifle of 500
florins a year. The Emperor at once gave them an order upon the
treasury for the yearly payment of the sum named, and took
possession of the tolls for the maintenance of the fortifications of
Vienna. The income thus obtained is said to have been no less than
12,000 florins a year.
§ 7. Campaign of 1637.
§ 1. The capture of Breisach.
Such a man was not likely to rescue the Empire
from its miseries. The first year of his reign,
however, was marked by a gleam of good fortune.
Baner lost all that he had gained at Wittstock, and was driven back
to the shores of the Baltic. On the western frontier the imperialists
were equally successful. Würtemberg accepted the Peace of Prague,
and submitted to the Emperor. A more general peace was talked of.
But till Alsace was secured to one side or the other no peace was
possible.
Section III.—The Struggle for Alsace.
The year 1638 was to decide the question.
Bernhard was looking to the Austrian lands in
Alsace and the Breisgau as a compensation for his
lost duchy of Franconia. In February he was besieging Rheinfelden.
Driven off by the imperialists on the 26th, he re-appeared
unexpectedly on March 3, taking the enemy by surprise. They had
not even sufficient powder with them to load their guns, and the
victory of Rheinfelden was the result. On the 24th Rheinfelden itself
surrendered. Freiburg followed its example on April 22, and
Bernhard proceeded to undertake the siege of Breisach, the great
fortress which domineered over the whole valley of the Upper Rhine.
Small as his force was, he succeeded, by a series of rapid
movements, in beating off every attempt to introduce supplies, and
on December 19 he entered the place in triumph.
§ 2. The capture a turning point in the war.
§ 3. Bernhard wishes to keep Breisach.
§ 4. Refuses to dismember the Empire.
§ 5. Death of Bernhard.
The campaign of 1638 was the turning point in the
struggle between France and the united House of
Austria. A vantage ground was then won which
was never lost.
Bernhard himself, however, was loth to realize the
world-wide importance of the events in which he
had played his part. He fancied that he had been
fighting for his own, and he claimed the lands
which he had conquered for himself. He received the homage of the
citizens of Breisach in his own name. He celebrated a Lutheran
thanksgiving festival in the cathedral. But the French Government
looked upon the rise of an independent German principality in Alsace
with as little pleasure as the Spanish government had contemplated
the prospect of the establishment of Wallenstein in the Palatinate.
They ordered Bernhard to place his conquests under the orders of
the King of France.
Strange as it may seem, the man who had done so
much to tear in pieces the Empire believed, in a
sort of way, in the Empire still. 'I will never suffer,'
he said, in reply to the French demands, 'that men
can truly reproach me with being the first to dismember the Empire.'
The next year he crossed the Rhine with the most
brilliant expectations. Baner had recovered
strength, and was pushing on through North
Germany into Bohemia. Bernhard hoped that he too might strike a
blow which would force on a peace on his own conditions. But his
greatest achievement, the capture of Breisach, was also his last. A
fatal disease seized upon him when he had hardly entered upon the
campaign. On July 8, 1639, he died.
[Sidenotes: § 6. Alsace in French possession. § 1. State of Italy. § 2. Maritime warfare. § 3. The Spanish fleet in the Downs.]
There was no longer any question of the ownership
of the fortresses in Alsace and the Breisgau. French
governors entered into possession. A French
general took the command of Bernhard's army. For
the next two or three years Bernhard's old troops fought up and
down Germany in conjunction with Baner, not without success, but
without any decisive victory. The French soldiers were becoming, like
the Germans, inured to war. The lands on the Rhine were not easily
to be wrenched out of the strong hands which had grasped them.
Section IV.—French Successes.
Richelieu had other successes to count besides
these victories on the Rhine. In 1637 the Spaniards
drove out of Turin the Duchess-Regent Christina,
the mother of the young Duke of Savoy. She was a sister of the King
of France; and, even if that had not been the case, the enemy of
Spain was, in the nature of the case, the friend of France. In 1640
she re-entered her capital with French assistance.
At sea, too, where Spain, though unable to hold its
own against the Dutch, had long continued to be
superior to France, the supremacy of Spain was
coming to an end. During the whole course of his ministry, Richelieu
had paid special attention to the encouragement of commerce and
the formation of a navy. Troops could no longer be despatched with
safety to Italy from the coasts of Spain. In 1638 a French squadron
burnt Spanish galleys in the Bay of Biscay.
In 1639 a great Spanish fleet on its way to the
Netherlands was strong enough to escape the
French, who were watching to intercept it. It sailed
up the English Channel with the not distant goal of
the Flemish ports almost in view. But the huge galleons were ill-
manned and ill-found. They were still less able to resist the lighter,
well-equipped vessels of the Dutch fleet, which was waiting to
intercept them, than the Armada had been able to resist Drake and
Raleigh fifty-one years before. The Spanish commander sought
refuge in the Downs, under the protection of the neutral flag of
England.
[Sidenotes: § 4. Destruction of the fleet. § 5. France and England.]
The French ambassador pleaded hard with the king
of England to allow the Dutch to follow up their
success. The Spanish ambassador pleaded hard
with him for protection to those who had taken refuge on his shores.
Charles saw in the occurrence an opportunity to make a bargain with
one side or the other. He offered to abandon the Spaniards if the
French would agree to restore his nephew, Charles Lewis, the son of
his sister Elizabeth, to his inheritance in the Palatinate. He offered to
protect the Spaniards if Spain would pay him the large sum which he
would want for the armaments needed to bid defiance to France.
Richelieu had no intention of completing the bargain offered to him.
He deluded Charles with negotiations, whilst the Dutch admiral
treated the English neutrality with scorn. He dashed amongst the tall
Spanish ships as they lay anchored in the Downs: some he sank,
some he set on fire. Eleven of the galleons were soon destroyed.
The remainder took advantage of a thick fog, slipped across the
Straits, and placed themselves in safety under the guns of Dunkirk.
Never again did such a fleet as this venture to leave the Spanish
coast for the harbours of Flanders. The injury to Spain went far
beyond the actual loss. Coming, as the blow did, within a few
months after the surrender of Breisach, it all but severed the
connexion for military purposes between Brussels and Madrid.
Charles at first took no umbrage at the insult. He
still hoped that Richelieu would forward his
nephew's interests, and he even expected that
Charles Lewis would be placed by the King of France in command of
the army which had been under Bernhard's orders. But Richelieu
was in no mood to place a German at the head of these well-trained
veterans, and the proposal was definitively rejected. The King of
England, dissatisfied at this repulse, inclined once more to the side
of Spain.
[Sidenotes: § 6. Insurrection in Catalonia. § 7. Break-up of the Spanish monarchy.]
But Richelieu found a way to prevent Spain from securing
even what assistance it was in the power of a king so unpopular as
Charles to render. It was easy to enter into communication with
Charles's domestic enemies. His troubles, indeed, were mostly of his
own making, and he would doubtless have lost his throne whether
Richelieu had stirred the fire or not. But the French minister
contributed all that was in his power to make the confusion greater,
and encouraged, as far as possible, the resistance which had already
broken out in Scotland, and which was threatening to break out in
England.
The failure of 1636 had been fully redeemed. No
longer attacking any one of the masses of which
the Spanish monarchy was composed, Richelieu
placed his hands upon the lines of communication between them. He
made his presence felt not at Madrid, at Brussels, at Milan, or at
Naples, but in Alsace, in the Mediterranean, in the English Channel.
The effect was as complete as is the effect of snapping the wire of a
telegraph. At once the Peninsula startled Europe by showing signs of
dissolution. In 1639 the Catalonians had manfully defended
Roussillon against a French invasion. In 1640 they were prepared to
fight with equal vigour. But the Spanish Government, in its desperate
straits, was not content to leave them to combat in their own way,
after the irregular fashion which befitted mountaineers. Orders were
issued commanding all men capable of fighting to arm themselves
for the war, all women to bear food and supplies for the army on
their backs. A royal edict followed, threatening those who showed
themselves remiss with imprisonment and the confiscation of their
goods.
The cord which bound the hearts of Spaniards to
their king was a strong one; but it snapped at last.
It was not by threats that Richelieu had defended
France in 1636. The old traditions of provincial
independence were strong in Catalonia, and the Catalans were soon
in full revolt. Who were they, to be driven to the combat by
menaces, as the Persian slaves had been driven on at Thermopylæ
by the blows of their masters' officers?
[Sidenotes: § 8. Independence of Portugal. § 9. Failure of Soissons in France. § 10. Richelieu's last days.]
Equally alarming was the news which reached
Madrid from the other side of the Peninsula. Ever
since the days of Philip II. Portugal had formed an
integral part of the Spanish monarchy. In
December 1640 Portugal renounced its allegiance, and reappeared
amongst European States under a sovereign of the House of
Braganza.
Everything prospered in Richelieu's hands. In 1641
a fresh attempt was made by the partizans of
Spain to raise France against him. The Count of
Soissons, a prince of the blood, placed himself at
the head of an imperialist army to attack his native country. He
succeeded in defeating the French forces sent to oppose him not far
from Sedan. But a chance shot passing through the brain of Soissons
made the victory a barren one. His troops, without the support of his
name, could not hope to rouse the country against Richelieu. They
had become mere invaders, and they were far too few to think of
conquering France.
Equal success attended the French arms in
Germany. In 1641 Guebriant, with his German and
Swedish army, defeated the imperialists at
Wolfenbüttel, in the north. In 1642 he defeated them again at
Kempten, in the south. In the same year Roussillon submitted to
France. Nor was Richelieu less fortunate at home. The conspiracy of
a young courtier, the last of the efforts of the aristocracy to shake off
the heavy rule of the Cardinal, was detected, and expiated on the
scaffold. Richelieu did not long survive his latest triumph. He died on
December 4, 1642.
Section V.—Aims and Character of Richelieu.
  • 6. Lecture Notes in Computer Science 6590
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board:
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
  • 7. Per Stenström (Ed.)
Transactions on High-Performance Embedded Architectures and Compilers III
  • 8. Volume Editor
Per Stenström
Chalmers University of Technology
Department of Computer Science and Engineering
412 96 Gothenburg, Sweden
E-mail: per.stenstrom@chalmers.se
ISSN 0302-9743 (LNCS), e-ISSN 1611-3349 (LNCS)
ISSN 1864-306X (THIPEAC), e-ISSN 1864-3078 (THIPEAC)
ISBN 978-3-642-19447-4, e-ISBN 978-3-642-19448-1
DOI 10.1007/978-3-642-19448-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2007923068
CR Subject Classification (1998): B.2, C.1, D.3.4, B.5, C.2, D.4
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
  • 9. Editor-in-Chief’s Message
It is my pleasure to introduce you to the third volume of Transactions on High-Performance Embedded Architectures and Compilers. This journal was created as an archive for scientific articles in the converging fields of high-performance and embedded computer architectures and compiler systems. Design considerations in both general-purpose and embedded systems are increasingly being based on similar scientific insights. For example, a state-of-the-art game console today consists of a powerful parallel computer whose building blocks are the same as those found in computational clusters for high-performance computing. Moreover, keeping power/energy consumption at a low level for high-performance general-purpose systems as well as in, for example, mobile embedded systems is equally important in order to either keep heat dissipation at a manageable level or to maintain a long operating time despite the limited battery capacity. It is clear that similar scientific issues have to be solved to build competitive systems in both segments. Additionally, for high-performance systems to be realized – be it embedded or general-purpose – a holistic design approach has to be taken by factoring in the impact of applications as well as the underlying technology when making design trade-offs.
The main topics of this journal reflect this development and include (among others):
– Processor architecture, e.g., network and security architectures, application-specific processors and accelerators, and reconfigurable architectures
– Memory system design
– Power, temperature, performance, and reliability constrained designs
– Evaluation methodologies, program characterization, and analysis techniques
– Compiler techniques for embedded systems, e.g., feedback-directed optimization, dynamic compilation, adaptive execution, continuous profiling/optimization, back-end code generation, and binary translation/optimization
– Code size/memory footprint optimizations
This volume contains 14 papers divided into four sections. The first section is a special section containing the top four papers from the Third International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC). I would like to thank Manolis Katevenis (University of Crete and FORTH) and Rajiv Gupta (University of California at Riverside) for acting as guest editors of that section. Papers in this section deal with cache performance issues and improved branch prediction. The second section is a set of four papers providing a snapshot from the Eighth MEDEA Workshop. I am indebted to Sandro Bartolini and Pierfrancesco Foglia for putting together this special section. The third section contains two regular papers, and the fourth section provides a snapshot from the First Workshop on Programmability Issues for Multicore Computers (MULTIPROG). The organizers – Eduard Ayguade, Roberto
  • 10. Gioiosa, and Osman Unsal – have put together this section. I thank them for their effort.
The editorial board has worked diligently to handle the papers for the journal. I would like to thank all the contributing authors, editors, and reviewers for their excellent work.
Per Stenström, Chalmers University of Technology
Editor-in-Chief, Transactions on HiPEAC
  • 11. Editorial Board
Per Stenström is a professor of computer engineering at Chalmers University of Technology. His research interests are devoted to design principles for high-performance computer systems and he has made multiple contributions to especially high-performance memory systems. He has authored or co-authored three textbooks and more than 100 publications in international journals and conferences. He regularly serves Program Committees of major conferences in the computer architecture field. He is also an associate editor of IEEE Transactions on Parallel and Distributed Processing Systems, a subject-area editor of the Journal of Parallel and Distributed Computing, an associate editor of the IEEE TCCA Computer Architecture Letters, and the founding Editor-in-Chief of Transactions on High-Performance Embedded Architectures and Compilers. He co-founded the HiPEAC Network of Excellence funded by the European Commission. He has acted as General and Program Chair for a large number of conferences including the ACM/IEEE Int. Symposium on Computer Architecture, the IEEE High-Performance Computer Architecture Symposium, and the IEEE Int. Parallel and Distributed Processing Symposium. He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.
Koen De Bosschere obtained his PhD from Ghent University in 1992. He is a professor in the ELIS Department at the Universiteit Gent, where he teaches courses on computer architecture and operating systems. His current research interests include computer architecture, system software, and code optimization. He has co-authored 150 contributions in the domain of optimization, performance modeling, microarchitecture, and debugging. He is the coordinator of the ACACES research network and of the European HiPEAC2 network. Contact him at Koen.DeBosschere@elis.UGent.be.
  • 12. Jose Duato is a professor in the Department of Computer Engineering (DISCA) at UPV, Spain. His research interests include interconnection networks and multiprocessor architectures. He has published over 340 papers. His research results have been used in the design of the Alpha 21364 microprocessor, the Cray T3E, IBM BlueGene/L, and Cray Black Widow supercomputers. Dr. Duato is the first author of the book Interconnection Networks: An Engineering Approach. He has served as associate editor of IEEE TPDS and IEEE TC. He was General Co-chair of ICPP 2001, Program Chair of HPCA-10, and Program Co-chair of ICPP 2005. Also, he has served as Co-chair, Steering Committee member, Vice-Chair, or Program Committee member in more than 55 conferences, including HPCA, ISCA, IPPS/SPDP, IPDPS, ICPP, ICDCS, Europar, and HiPC.
Manolis Katevenis received his PhD degree from U.C. Berkeley in 1983 and the ACM Doctoral Dissertation Award in 1984 for his thesis on “Reduced Instruction Set Computer Architectures for VLSI.” After a brief term on the faculty of Computer Science at Stanford University, he has been in Greece, with the University of Crete and with FORTH, since 1986. After RISC, his research has been on interconnection networks and interprocessor communication. In packet switch architectures, his contributions since 1987 have been mostly in per-flow queueing, credit-based flow control, congestion management, weighted round-robin scheduling, buffered crossbars, and non-blocking switching fabrics. In multiprocessing and clustering, his contributions since 1993 have been on remote-write-based, protected, user-level communication. His home URL is http://archvlsi.ics.forth.gr/∼kateveni
  • 13. Michael O’Boyle is a professor in the School of Informatics at the University of Edinburgh and an EPSRC Advanced Research Fellow. He received his PhD in Computer Science from the University of Manchester in 1992. He was formerly a SERC Postdoctoral Research Fellow, a Visiting Research Scientist at IRISA/INRIA Rennes, a Visiting Research Fellow at the University of Vienna, and a Visiting Scholar at Stanford University. More recently he was a Visiting Professor at UPC, Barcelona. Dr. O’Boyle’s main research interests are in adaptive compilation, formal program transformation representations, the compiler impact on embedded systems, compiler-directed low-power optimization, and automatic compilation for parallel single-address-space architectures. He has published over 50 papers in international journals and conferences in this area and manages the Compiler and Architecture Design group consisting of 18 members.
Cosimo Antonio Prete is a full professor of computer systems at the University of Pisa, Italy, and a faculty member of the PhD School in Computer Science and Engineering (IMT), Italy. He is Coordinator of the Graduate Degree Program in Computer Engineering and Rector’s Adviser for Innovative Training Technologies at the University of Pisa. His research interests are focused on multiprocessor architectures, cache memory, performance evaluation, and embedded systems. He is an author of more than 100 papers published in international journals and conference proceedings. He has been project manager for several research projects, including: the SPP project, OMI, Esprit IV; the CCO project, supported by VLSI Technology, Sophia Antipolis; the ChArm project, supported by VLSI Technology, San Jose; and the Esprit III Tracs project.
  • 14. André Seznec is “directeur de recherches” at IRISA/INRIA. Since 1994, he has been the head of the CAPS (Compiler Architecture for Superscalar and Special-purpose Processors) research team. He has been conducting research on computer architecture for more than 20 years. His research topics have included memory hierarchy, pipeline organization, simultaneous multithreading, and branch prediction. In 1999–2000, he spent a sabbatical year with the Alpha Group at Compaq.
Olivier Temam obtained a PhD in computer science from the University of Rennes in 1993. He was assistant professor at the University of Versailles from 1994 to 1999, and then professor at the University of Paris Sud until 2004. Since then, he has been a senior researcher at INRIA Futurs in Paris, where he heads the Alchemy group. His research interests include program optimization, processor architecture, and emerging technologies, with a general emphasis on long-term research.
  • 15. Theo Ungerer is Chair of Systems and Networking at the University of Augsburg, Germany, and Scientific Director of the Computing Center of the University of Augsburg. He received a Diploma in Mathematics at the Technical University of Berlin in 1981, a Doctoral Degree at the University of Augsburg in 1986, and a second Doctoral Degree (Habilitation) at the University of Augsburg in 1992. Before his current position he was scientific assistant at the University of Augsburg (1982–1989 and 1990–1992), visiting assistant professor at the University of California, Irvine (1989–1990), and professor of computer architecture at the University of Jena (1992–1993) and the Technical University of Karlsruhe (1993–2001). He is a Steering Committee member of HiPEAC and of the German Science Foundation’s priority programme on “Organic Computing.” His current research interests are in the areas of embedded processor architectures, embedded real-time systems, and organic, bionic and ubiquitous systems.
Mateo Valero obtained his PhD at UPC in 1980. He is a professor in the Computer Architecture Department at UPC. His research interests focus on high-performance architectures. He has published approximately 400 papers on these topics. He is the director of the Barcelona Supercomputing Center, the National Center of Supercomputing in Spain. Dr. Valero has been honored with several awards, including the King Jaime I award by the Generalitat Valenciana, and the Spanish national award “Julio Rey Pastor” for his research on IT technologies. In 2001, he was appointed Fellow of the IEEE, in 2002 Intel Distinguished Research Fellow, and since 2003 a Fellow of the ACM. Since 1994, he has been a foundational member of the Royal Spanish Academy of Engineering. In 2005 he was elected Correspondant Academic of the Spanish Royal Academy of Sciences, and his native town of Alfamén named their public college after him.
Georgi Gaydadjiev is a professor in the computer engineering laboratory of the Technical University of Delft, The Netherlands. His research interests focus on many aspects of embedded systems design, with an emphasis on reconfigurable computing. He has published about 50 papers on these topics in international refereed journals and conferences. He has served as a Program Committee member of many conferences and is a subject area editor for the Journal of Systems Architecture.
Table of Contents

Third International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)

Dynamic Cache Partitioning Based on the MLP of Cache Misses . . . 3
  Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero
Cache Sensitive Code Arrangement for Virtual Machine . . . 24
  Chun-Chieh Lin and Chuen-Liang Chen
Data Layout for Cache Performance on a Multithreaded Architecture . . . 43
  Subhradyuti Sarkar and Dean M. Tullsen
Improving Branch Prediction by Considering Affectors and Affectees Correlations . . . 69
  Yiannakis Sazeides, Andreas Moustakas, Kypros Constantinides, and Marios Kleanthous

Eighth MEDEA Workshop (Selected Papers)

Introduction . . . 91
  Sandro Bartolini, Pierfrancesco Foglia, and Cosimo Antonio Prete
Exploring the Architecture of a Stream Register-Based Snoop Filter . . . 93
  Matthias Blumrich, Valentina Salapura, and Alan Gara
CROB: Implementing a Large Instruction Window through Compression . . . 115
  Fernando Latorre, Grigorios Magklis, Jose González, Pedro Chaparro, and Antonio González
Power-Aware Dynamic Cache Partitioning for CMPs . . . 135
  Isao Kotera, Kenta Abe, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi
A Multithreaded Multicore System for Embedded Media Processing . . . 154
  Jan Hoogerbrugge and Andrei Terechko
Regular Papers

Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector . . . 177
  Tarik Saidani, Lionel Lacassagne, Joel Falcou, Claude Tadonki, and Samir Bouaziz
Constructing Application-Specific Memory Hierarchies on FPGAs . . . 201
  Harald Devos, Jan Van Campenhout, Ingrid Verbauwhede, and Dirk Stroobandt

First Workshop on Programmability Issues for Multi-core Computers (MULTIPROG)

autopin – Automated Optimization of Thread-to-Core Pinning on Multicore Systems . . . 219
  Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis
Robust Adaptation to Available Parallelism in Transactional Memory Applications . . . 236
  Mohammad Ansari, Mikel Luján, Christos Kotselidis, Kim Jarvis, Chris Kirkham, and Ian Watson
Efficient Partial Roll-Backing Mechanism for Transactional Memory Systems . . . 256
  M.M. Waliullah
Software-Level Instruction-Cache Leakage Reduction Using Value-Dependence of SRAM Leakage in Nanometer Technologies . . . 275
  Maziar Goudarzi, Tohru Ishihara, and Hamid Noori

Author Index . . . 301
Dynamic Cache Partitioning Based on the MLP of Cache Misses

Miquel Moreto1, Francisco J. Cazorla2, Alex Ramirez1,2, and Mateo Valero1,2

1 Universitat Politècnica de Catalunya, DAC, Barcelona, Spain; HiPEAC European Network of Excellence
2 Barcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain
{mmoreto,aramirez,mateo}@ac.upc.edu, francisco.cazorla@bsc.es

Abstract. Dynamic partitioning of shared caches has been proposed to improve the performance of traditional eviction policies in modern multithreaded architectures. All existing Dynamic Cache Partitioning (DCP) algorithms work on the number of misses caused by each thread and treat all misses equally. However, it has been shown that cache misses have a different impact on performance depending on their distribution. Clustered misses share their miss penalty, as they can be served in parallel, while isolated misses have a greater impact on performance, as the memory latency is not shared with other misses. We take this fact into account and propose a new DCP algorithm that weights misses differently depending on their influence on performance. Our proposal obtains improvements over traditional eviction policies of up to 63.9% (10.6% on average) and also outperforms previous DCP proposals by up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches the same performance as a 50% larger shared cache. Finally, we present a practical implementation of our proposal that requires less than 8KB of storage.

1 Introduction

The limitation imposed by instruction-level parallelism (ILP) has motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. TLP paradigms such as simultaneous multithreading (SMT) [1,2], chip multiprocessor (CMP) [3] and combinations of both offer the opportunity to obtain higher throughputs. However, they also have to face the challenge of sharing the resources of the architecture.
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 3–23, 2011. © Springer-Verlag Berlin Heidelberg 2011

Simply avoiding any resource control can lead to undesired situations where one thread monopolizes all the resources and harms the other threads. Some studies deal with the resource sharing problem in SMTs at the level of core resources such as issue queues, registers, etc. [4]. In CMPs, resource sharing is focused on the cache hierarchy. Some applications present low reuse of their data and pollute caches with data streams, such as multimedia, communications or streaming applications, or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently
Used (LRU), pseudo-LRU or random are demand-driven; that is, they tend to give more space to the application that has more accesses and misses in the cache hierarchy [5,6]. As a consequence, some threads can suffer a severe degradation in performance. Previous work has tried to solve this problem by using static and dynamic partitioning algorithms that monitor the L2 cache accesses and decide a partition for a fixed amount of cycles in order to maximize throughput [7,8,9] or fairness [10]. Basically, these proposals predict the number of misses per application for each possible cache partition. Then, they use the cache partition that leads to the minimum number of misses for the next interval.

A common characteristic of these proposals is that they treat all L2 misses equally. However, in out-of-order architectures L2 misses affect performance differently depending on how clustered they are. An isolated L2 miss has approximately the same miss penalty as a cluster of L2 misses, as clustered misses can be served in parallel if they all fit in the reorder buffer (ROB) [11]. Figure 1 illustrates this behavior. We have represented an ideal IPC curve that is constant until an L2 miss occurs. After some cycles, commit stops. When the cache line comes from main memory, commit ramps up to its steady state value. As a consequence, an isolated L2 miss has a higher impact on performance than a miss in a burst of misses, as the memory latency is shared by all clustered misses.

Fig. 1. Isolated and clustered L2 misses: (a) isolated L2 miss; (b) clustered L2 misses.

Based on this fact, we propose a new DCP algorithm that gives a cost to each L2 access according to its impact on final performance. We detect isolated and clustered misses and assign a higher cost to isolated misses. Then, our algorithm determines the partition that minimizes the total cost for all threads, which is used in the next interval.
Our results show that differentiating between clustered and isolated L2 misses leads to cache partitions with higher performance than previous proposals. The main contributions of this work are the following.

1) A runtime mechanism to dynamically partition shared L2 caches in a CMP scenario that takes into account the MLP of each L2 access. We obtain improvements over LRU of up to 63.9% (10.6% on average) and over previous proposals of up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches the same performance as a 50% larger shared cache.

2) We extend previous workload classifications to CMP architectures with more than two cores, so that results can be analyzed per workload group.
3) We present a sampling technique that reduces the hardware cost in terms of storage to less than 1% of the total L2 cache size, with an average throughput degradation of 0.76% (compared to the throughput obtained without sampling). We also show that scalable algorithms to decide cache partitions give near-optimal partitions, 0.59% away from the optimal decision.

The rest of this paper is structured as follows. Section 2 introduces the methods that have been previously proposed to decide L2 cache partitions and related work. Next, Section 3 explains our MLP-aware DCP algorithm. Section 4 describes the experimental environment, and in Section 5 we discuss simulation results. Finally, Section 6 summarizes our results.

2 Prior Work in Dynamic Cache Partitioning

Stack Distance Histogram (SDH). Mattson et al. introduce the concept of stack distance to study the behavior of storage hierarchies [12]. Common eviction policies such as LRU have the stack property. Thus, each set in a cache can be seen as an LRU stack, where lines are sorted by their last access cycle. In that way, the first line of the LRU stack is the Most Recently Used (MRU) line, while the last line is the LRU line. The position that a line has in the LRU stack when it is accessed again is defined as the stack distance of the access. As an example, Table 1(a) shows a stream of accesses to the same set with their corresponding stack distances.

Table 1. Stack Distance Histogram
(a) Stream of accesses to a given cache set.
  # Reference:     1  2  3  4  5  6  7  8
  Cache Line:      A  B  C  C  A  D  B  D
  Stack Distance:  -  -  -  1  3  -  4  2
(b) SDH example.
  Stack Distance:  1   2   3   4   >4
  # Accesses:      60  20  10  5   5

For a K-way associative cache with an LRU replacement algorithm, we need K + 1 counters to build SDHs, denoted C1, C2, ..., CK, C>K. On each cache access, one of the counters is incremented.
If it is a cache access to a line in the ith position in the LRU stack of the set, Ci is incremented. If it is a cache miss, the line is not found in the LRU stack and, as a result, we increment the miss counter C>K. SDHs can be obtained during execution by running the thread alone in the system [7] or by adding some hardware counters that profile this information [8,9]. A useful property of these histograms is that the number of cache misses for a smaller cache with the same number of sets can be easily computed: for a K'-way associative cache, where K' < K, the new number of misses is misses(K') = C>K + Σ_{i=K'+1..K} Ci. As an example, Table 1(b) shows an SDH for a set with 4 ways. Here, we have 5 cache misses. However, if we reduce the number of ways to 2 (keeping the number of sets constant), we will experience 20 misses (5 + 5 + 10).
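This property of SDHs can be sketched in a few lines of Python (a sketch of ours, not the paper's hardware; names are hypothetical):

```python
# Computing the miss count of a smaller cache from a Stack Distance
# Histogram, following the formula misses(K') = C>K + sum of C_i for
# i = K'+1 .. K.

def misses_for_ways(sdh, over_k, k_prime):
    """sdh[j] = accesses with stack distance j+1 (j = 0..K-1);
    over_k = accesses that miss even with all K ways (C>K).
    Returns misses for a cache with k_prime <= K ways (same set count)."""
    return over_k + sum(sdh[k_prime:])

# Table 1(b): C1=60, C2=20, C3=10, C4=5, C>4=5
sdh = [60, 20, 10, 5]
print(misses_for_ways(sdh, 5, 4))  # 5 misses with all 4 ways
print(misses_for_ways(sdh, 5, 2))  # 20 misses with only 2 ways (5 + 5 + 10)
```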
Minimizing Total Misses. Using the SDHs of N applications, we can derive the L2 cache partition that minimizes the total number of misses, that is, the sum of the number of misses of each thread for the given configuration. The optimal partition in the last period of time is a suitable candidate to become the future optimal partition. Partitions are decided periodically, after a fixed amount of cycles, at a way granularity. This mechanism minimizes the total number of misses in an attempt to maximize throughput. A first approach proposed a static partitioning of the L2 cache using profiling information [7]. Then, a dynamic approach estimated SDHs with information inside the cache [9]. Finally, Qureshi et al. presented a suitable and scalable circuit to measure SDHs using sampling and obtained performance gains with just 0.2% extra space in the L2 cache [8]. Throughout this paper, we will call this last policy MinMisses.

Fair Partitioning. In some situations, MinMisses can lead to unfair partitions that assign nearly all the resources to one thread while harming the others [10]. For that reason, the authors propose considering fairness when deciding new partitions. Instead of minimizing the total number of misses, they try to equalize the statistic Xi = misses_shared_i / misses_alone_i of each thread i, forcing all threads to have the same relative increase in misses. Partitions are decided periodically using an iterative method: the thread with the largest Xi receives a way from the thread with the smallest Xi until all threads have a similar value of Xi. Throughout this paper, we will call this policy Fair.

Table 2.
Different Partitioning Proposals

  Paper  Partitioning  Objective         Decision          Algorithm      Eviction Policy
  [7]    Static        Minimize Misses   Programmer        −              Column Caching
  [9]    Dynamic       Minimize Misses   Architecture      Marginal Gain  Augmented LRU
  [8]    Dynamic       Maximize Utility  Architecture      Lookahead      Augmented LRU
  [10]   Dynamic       Fairness          Architecture      Equalize Xi    Augmented LRU
  [13]   Dynamic       Maximize reuse    Architecture      Reuse          Column Caching
  [14]   Dyn./Static   Configurable      Operating System  Configurable   Augmented LRU

Other Related Work. Several papers propose different DCP algorithms in a multithreaded scenario. Table 2 summarizes these proposals with their most significant characteristics. Settle et al. introduce a DCP similar to MinMisses that decides partitions depending on the average data reuse of each application [13]. Rafique et al. propose to manage shared caches with a hardware cache quota enforcement mechanism and an interface between the architecture and the OS to let the latter decide quotas [14]. We note that this mechanism is completely orthogonal to our proposal and, in fact, they are compatible, as we can let the OS decide quotas according to our scheme. Hsu et al. evaluate different cache policies in a CMP scenario [15]. They show that none of them is optimal among all benchmarks and that the best cache policy varies depending on the performance metric being used. Thus, they propose to use a thread-aware
cache resource allocation. In fact, their results reinforce the motivation of our paper: if we do not consider the impact of each L2 miss on performance, we can decide suboptimal L2 partitions in terms of throughput.

Cache partitions at a way granularity can be implemented with column caching [7], which uses a bit mask to mark reserved ways, or by augmenting the LRU policy with counters that keep track of the number of lines in a set belonging to a thread [9]. The evicted line is the LRU line among the thread's own lines or other threads' lines, depending on whether the thread has reached its quota or not. In [16] a new eviction policy for private caches was proposed in single-threaded architectures. This policy gives a weight to each L2 miss according to its MLP when the block is filled from memory. Eviction is decided using the LRU counters and this weight. This idea was proposed for a different scenario, as it focuses on single-threaded architectures.

3 MLP-Aware Dynamic Cache Partitioning

3.1 Algorithm Overview

Algorithm 3.1 shows the necessary steps to dynamically decide cache partitions according to the MLP of each L2 access. At the beginning of the execution, we decide an initial partition of the L2 cache. As we have no prior knowledge of the applications, we evenly distribute ways among cores. Hence, each core receives Associativity / (Number of Cores) ways of the shared L2 cache.

Algorithm 3.1. MLP-aware DCP()
  Step 1: Establish an initial even partition for each core.
  Step 2: Run threads and collect data for the MLP-aware SDHs.
  Step 3: Decide new partition.
  Step 4: Update MLP-aware SDHs.
  Step 5: Go back to Step 2.

Afterwards, we begin a period where we measure the total MLP cost of each application. The histogram of each thread containing the total MLP cost for each possible partition is denoted the MLP-aware SDH. For a K-way associative cache, exactly K registers are needed to store this histogram.
For short periods, dynamic cache partitioning (DCP) algorithms react more quickly to phase changes. Our results show that, for periods ranging from 10^5 to 10^8 cycles, only small performance variations are obtained, with a peak for a period of 5 million cycles.

At the end of each interval, MLP-aware SDHs are analyzed and a new partition is decided for the next interval. We assume that running threads will have a similar pattern of L2 accesses in the next measuring period. Thus, the optimal partition for the last period is chosen for the following period. Evaluating
all possible cache partitions gives the optimal partition. This evaluation is done concurrently with dedicated hardware, which sets the partition for each process in the next period. Having old values of partition decisions does not impact the correctness of the running applications and does not affect performance, as deciding new partitions typically takes a few thousand cycles and is invoked only once every 5 million cycles.

Since the characteristics of applications change dynamically, MLP-aware SDHs should reflect these changes. However, we also wish to maintain some history of the past MLP-aware SDHs to make new decisions. Thus, after a new partition is decided, we multiply all the values of the MLP-aware SDHs by ρ ∈ [0, 1]. Large values of ρ give larger reaction times to phase changes, while small values of ρ adapt quickly to phase changes but tend to forget the behavior of the application. Small performance variations are obtained for different values of ρ ranging from 0 to 1, with a peak for ρ = 0.5. Furthermore, this value is very convenient, as we can use a shifter to update histograms. Next, a new period of measuring MLP-aware SDHs begins. The key contribution of this paper is the method to obtain MLP-aware SDHs, which we explain in the following subsection.

3.2 MLP-Aware Stack Distance Histogram

As previously stated, MinMisses assumes that all L2 accesses are equally important in terms of performance. However, it has been shown that cache misses affect the performance of applications differently, even inside the same application [11,16]. An isolated L2 data miss has a penalty cost that can be approximated by the average memory latency. In the case of a burst of L2 data misses that fit in the ROB, the penalty cost is shared among misses, as the L2 misses can be served in parallel. L2 instruction misses, in contrast, are serialized, as fetch stops; thus, L2 instruction misses have a constant miss penalty and MLP.
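The ρ = 0.5 history decay reduces to a one-bit right shift of every counter, which is why a shifter suffices in hardware. A minimal sketch (our names, not the paper's):

```python
# Per-interval history decay of the MLP-aware SDHs. With rho = 0.5 the
# multiplication reduces to a right shift by one bit on each integer
# counter.

def decay_histograms(mlp_sdh):
    """mlp_sdh: per-core list of integer MLP-cost counters."""
    for core_hist in mlp_sdh:
        for j in range(len(core_hist)):
            core_hist[j] >>= 1  # equivalent to multiplying by rho = 0.5

hists = [[100, 40, 7], [12, 0, 3]]
decay_histograms(hists)
print(hists)  # -> [[50, 20, 3], [6, 0, 1]]
```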
We want to assign a cost to each L2 access according to its effect on performance. In [16] a similar idea was used to modify the LRU eviction policy for single-core, single-threaded architectures. In our situation, we have a CMP scenario where the shared L2 cache has a number of reserved ways for each core. At the end of each period, we decide either to continue with the same partition or to change it. If we decide to modify the partition, a core i that had wi reserved ways will receive w'i ≠ wi. If wi < w'i, the thread receives more ways and, as a consequence, some misses in the old configuration will become hits. Conversely, if wi > w'i, the thread receives fewer ways and some hits in the old configuration will become misses. Thus, we want an estimation of the performance effect when misses are converted into hits and vice versa. Throughout this paper, we will call this impact on performance the MLP cost.

MLP cost of L2 misses. In order to compute the MLP cost of an L2 miss with stack distance di, we consider the situation shown in Figure 2(a). If we force an L2 configuration that assigns exactly w'i = di ways to thread i, with w'i > wi, some of the L2 misses of this thread will become hits, while others will remain misses, depending on their stack distance. In order to track the stack distance
Fig. 2. MLP cost of L2 accesses: (a) MLP cost of an L2 miss; (b) estimated MLP cost when an L2 hit becomes a miss.

and MLP cost of each L2 miss, we have modified the L2 Miss Status Holding Registers (MSHR) [17]. This structure is similar to an L2 miss buffer and is used to hold information about any load that has missed in the L2 cache. The modified L2 MSHR has one extra field that contains the MLP cost of the miss, as can be seen in Figure 3(b). It is also necessary to store the stack distance of each access in the MSHR. Figure 3(a) shows the MSHR in the cache hierarchy.

Fig. 3. Miss Status Holding Register: (a) MSHR; (b) MSHR fields.

When the L2 cache is accessed and an L2 miss is determined, we assign an MSHR entry to the miss and wait until the data comes from main memory. We initialize the MLP cost field to zero when the entry is assigned. We store the access stack distance together with the identifier of the owner core. Every cycle, we obtain N, the number of L2 accesses with stack distance greater than or equal to di. We have a hardware counter that tracks this number for each possible
value of di, which means a total of Associativity counters. If we have N L2 misses that are being served in parallel, the miss penalty is shared. Thus, we assign an equal share of 1/N to each miss. The value of the MLP cost is updated until the data comes from main memory and fills the L2. At that moment we can free the MSHR entry. The number of adders required to update the MLP cost of all entries is equal to the number of MSHR entries. However, this number can be reduced by sharing several adders between valid MSHR entries in a round-robin fashion. Then, if an MSHR entry updates its MLP cost every 4 cycles, it has to add 4/N. In this work, we assume that the MSHR contains only four adders for updating MLP cost values, which has a negligible effect on the final MLP cost [16].

MLP cost of L2 hits. Next, we want to estimate the MLP cost of an L2 hit with stack distance di when it becomes a miss. If we forced an L2 configuration that assigned exactly w'i = di ways to thread i, with w'i < wi, some of the L2 hits of this thread would become misses, while L2 misses would remain misses (see Figure 2(b)). The hits that would become misses are the ones with stack distance greater than or equal to di. Thus, we count the total number of accesses with stack distance greater than or equal to di (including L2 hits and misses) to estimate the length of the cluster of L2 misses in this configuration.

Deciding the moment to free the entry used by an L2 hit is more complex than in the case of the MSHR. As noted in [11], in a balanced architecture, L2 data misses can be served in parallel if they all fit in the ROB. Equivalently, we say that L2 data misses can be served in parallel if they are at a ROB distance smaller than the ROB size. Thus, we should free the entry if the number of committed instructions since the access has reached the ROB size or if the number of cycles since the hit has reached the average latency to memory.
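The 1/N sharing rule can be illustrated with a small cycle-by-cycle simulation. This is a toy sketch under our own assumptions (a fixed 300-cycle latency, no ROB limit, and no per-stack-distance filtering), not the hardware mechanism itself:

```python
# Toy model of the 1/N accumulation rule: every cycle, each in-flight L2
# miss accrues 1/N, where N is the number of misses overlapping with it
# that cycle. Isolated misses therefore accumulate the full memory
# latency, fully overlapped misses an equal share of it.

MEM_LATENCY = 300  # cycles from L2 miss to fill, as in the paper's setup

def mlp_costs(miss_start_cycles):
    """miss_start_cycles: cycle at which each miss is issued.
    Returns the accumulated MLP cost of each miss at fill time."""
    costs = [0.0] * len(miss_start_cycles)
    last = max(miss_start_cycles) + MEM_LATENCY
    for cycle in range(last):
        in_flight = [i for i, s in enumerate(miss_start_cycles)
                     if s <= cycle < s + MEM_LATENCY]
        for i in in_flight:
            costs[i] += 1.0 / len(in_flight)
    return costs

print(mlp_costs([0]))     # -> [300.0]: an isolated miss pays the full latency
print(mlp_costs([0, 0]))  # -> [150.0, 150.0]: two overlapped misses share it
```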
The first condition is clear, as L2 misses can overlap only if their ROB distance is less than the ROB size. When the entry is freed, we add the number of pending cycles divided by the number of misses with stack distance greater than or equal to di. The second condition is also necessary, as it can occur that no L2 access is made for a period of time. To obtain the average latency to memory, we add specific hardware that counts and averages the number of cycles that a given entry spends in the MSHR.

We use new hardware to obtain the MLP cost of L2 hits. We denote this hardware the Hit Status Holding Registers (HSHR), as it is similar to the MSHR. However, the HSHR is private for each core. In each entry, the HSHR needs an identifier of the ROB entry of the access, the address accessed by the L2 hit, the stack distance value, and a field with the corresponding MLP cost, as can be seen in Figure 4(b). Figure 4(a) shows the HSHR in the cache hierarchy.

When the L2 cache is accessed and an L2 hit is determined, we assign an HSHR entry to the L2 hit. We initialize the fields of the entry as in the case of the MSHR. We have a stack distance di and we want to update the MLP cost field every cycle. With this objective, we need to know the number of active entries with stack distance greater than or equal to di in the HSHR, which can be tracked with one hardware counter per core. We also need a ROB entry identifier
Fig. 4. Hit Status Holding Register: (a) HSHR; (b) HSHR fields.

for each L2 access. Every cycle, we obtain N, the number of L2 accesses with stack distance greater than or equal to di, as in the L2 MSHR case. We have a hardware counter that tracks this number for each possible value of di, which means a total of Associativity counters.

In order to avoid array conflicts, we need as many entries in the HSHR as possible L2 accesses in flight. This number is equal to the L1 MSHR size. In our scenario, we have 32 L1 MSHR entries, which means a maximum of 32 in-flight L2 accesses per core. However, we have checked that 24 entries per core are enough to ensure that an available slot exists 95% of the time in an architecture with a ROB of 256 entries. If there are no available slots, we simply assign the minimum weight to the L2 access, as there are many L2 accesses in flight. The number of adders required to update the MLP cost of all entries is equal to the number of HSHR entries. As with the MSHR, HSHR entries can share four adders with a negligible effect on the final MLP cost.

Quantification of MLP cost. Dealing with values of MLP cost between 0 and the memory latency (or even greater) can represent a significant hardware cost. Instead, we quantize the MLP cost to an integer value between 0 and 7, as was done in [16]. For a memory latency of 300 cycles, Table 3 shows how the MLP cost is quantized: we have split the interval [0, 300] into 7 intervals of roughly equal length.

Table 3. MLP cost quantification
  MLP cost                 Quantification
  From 0 to 42 cycles      0
  From 43 to 85 cycles     1
  From 86 to 128 cycles    2
  From 129 to 170 cycles   3
  From 171 to 213 cycles   4
  From 214 to 256 cycles   5
  From 257 to 300 cycles   6
  300 or more cycles       7
Thus, isolated L2 misses will have a weight of 7, while two overlapped L2 misses will have a weight of 3 in the MLP-aware SDH. In contrast, MinMisses always adds one to its histograms.
3.3 Obtaining Stack Distance Histograms

Normally, L2 caches have two separate parts that store data and address tags to determine whether an access is a hit. Basically, our prediction mechanism needs to track every L2 access and store a separate copy of the L2 tag information in an Auxiliary Tag Directory (ATD), together with the LRU counters [8]. We need an ATD for each core that keeps track of the L2 accesses for any possible cache configuration. Independently of the number of ways assigned to each core, we store the tags and LRU counters of the last K accesses of the thread, where K is the L2 associativity. As explained in Section 2, an access with stack distance di corresponds to a cache miss in any configuration that assigns fewer than di ways to the thread. Thus, with this ATD we can determine whether an L2 access would be a miss or a hit in all possible cache configurations.

3.4 Putting All Together

Figure 5 shows a sketch of the hardware implementation of our proposal. When we have an L2 access, the ATD is used to determine its stack distance di. Depending on whether it is a miss or a hit, either the MSHR or the HSHR is used to compute the MLP cost of the access. Using the quantification process, we obtain the final MLP cost. This number estimates how performance is affected when the application has exactly w'i = di assigned ways. If w'i > wi, we are estimating the performance benefit of converting this L2 miss into a hit. If w'i < wi, we are estimating the performance degradation of converting this L2 hit into a miss. Finally, using the stack distance, the MLP cost and the core identifier, we can update the corresponding MLP-aware SDH.

Fig. 5. Hardware implementation

We have used two different partitioning algorithms. The first one, which we denote MLP-DCP (standing for MLP-aware Dynamic Cache Partitioning), decides
the optimal partition according to the MLP cost of each way. We define the total MLP cost of a thread i that uses wi ways as

  TMLP(i, wi) = MLP_SDH(i, >K) + Σ_{j=wi+1..K} MLP_SDH(i, j),

where MLP_SDH(i, j) denotes the total MLP cost of all accesses of thread i with stack distance j, and MLP_SDH(i, >K) that of its accesses with stack distance greater than K. Thus, we have to minimize the sum of total MLP costs of all cores, Σ_{i=1..N} TMLP(i, wi), subject to Σ_{i=1..N} wi = Associativity.

The second algorithm weights each total MLP cost with the IPC of the application in core i, IPCi, measured at runtime with a hardware counter per core. In this situation, we give priority to threads with higher IPC, which gives better results in throughput at the cost of being less fair. We denote this proposal MLPIPC-DCP; it consists in minimizing Σ_{i=1..N} IPCi · TMLP(i, wi), subject to Σ_{i=1..N} wi = Associativity.

3.5 Case Study

We have seen that SDHs can give the optimal partition in terms of total L2 misses. However, the total number of L2 misses is not the goal of DCP algorithms; throughput is the objective of these policies. The underlying idea of MinMisses is that, while minimizing total L2 misses, we are also increasing throughput. This idea is intuitive, as performance is clearly related to the L2 miss rate. However, this heuristic can lead to inadequate partitions in terms of throughput, as the following case study shows. Figure 6 shows the IPC curves of the benchmarks galgel and gzip as we increase the L2 cache size at a way granularity (each way is 64KB in size). It also shows throughput for all 15 possible partitions, assigning x ways to gzip and 16−x ways to galgel. The optimal partition assigns 6 ways to gzip and 10 ways to galgel, obtaining a total throughput of 3.091 instructions per cycle.
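For small core counts, the minimization of the total MLP cost can be sketched as an exhaustive search over way partitions (a sketch of ours; the paper uses scalable decision algorithms instead, and MinMisses corresponds to the same search with plain miss counts in place of MLP costs):

```python
# Per-interval partition decision: pick the way allocation minimizing the
# sum of TMLP(i, w_i) over all cores, subject to the ways summing to the
# L2 associativity. The toy MLP-aware SDHs below are illustrative only.

def total_mlp(mlp_sdh, over_k, w):
    """TMLP for one thread with w ways: cost of accesses beyond K ways
    plus cost of accesses with stack distance exceeding w (mlp_sdh[j] is
    the cost of accesses with stack distance j+1)."""
    return over_k + sum(mlp_sdh[w:])

def best_partition(threads, total_ways, min_ways=1):
    """threads: list of (mlp_sdh, over_k) pairs. Exhaustively searches
    all partitions giving each core at least min_ways ways."""
    best = (float('inf'), None)

    def rec(i, left, alloc):
        nonlocal best
        if i == len(threads) - 1:  # last core takes the remaining ways
            cost = sum(total_mlp(s, o, w)
                       for (s, o), w in zip(threads, alloc + [left]))
            if cost < best[0]:
                best = (cost, alloc + [left])
            return
        remaining = len(threads) - 1 - i
        for w in range(min_ways, left - min_ways * remaining + 1):
            rec(i + 1, left - w, alloc + [w])

    rec(0, total_ways, [])
    return best

# Two threads sharing a 4-way cache (toy MLP-aware SDHs):
threads = [([20, 10, 6, 4], 8), ([30, 2, 1, 1], 5)]
print(best_partition(threads, 4))  # -> (21, [3, 1])
```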
However, if we use the MinMisses algorithm to determine the new partition, we will choose 4 ways for gzip and 12 for galgel according to the SDH values. Figure 6 also shows the total number of misses for each cache partition, as well as the per-thread number of misses. In this situation, misses in gzip are more important in terms of performance than misses in galgel. Furthermore, the IPC of gzip is larger than that of galgel. As a consequence, MinMisses obtains a non-optimal partition in terms of IPC, and its throughput is 2.897, which is 6.3% smaller than the optimal one. In fact, galgel's clusters of L2 misses are, on average, longer than those of gzip. Accordingly, MLP-DCP assigns one extra way to gzip and increases performance by 3%. If we use MLPIPC-DCP, we give more importance to gzip, as it has a higher IPC, and, as a consequence, we end up assigning another extra way to gzip, reaching the optimal partition and increasing throughput by an extra 3%.
M. Moreto et al.

Fig. 6. Misses and IPC curves for galgel and gzip

4 Experimental Environment

4.1 Simulator Configuration

We target this study to the case of a CMP with two and four cores, each with its own data and instruction L1 caches and a unified L2 cache shared among threads, as in previous studies [8,9,10]. Each core is single-threaded and fetches up to 8 instructions per cycle. It has 6 integer (I), 3 floating point (FP), and 4 load/store functional units and 32-entry I, load/store, and FP instruction queues. Each thread has a 256-entry ROB and 256 physical registers. We use a two-level cache hierarchy with 64B lines, with separate 16KB 4-way associative data and instruction caches and a unified L2 cache that is shared among all cores. We have used two different L2 caches: one of 1MB and 16-way associativity, and a second of 2MB and 32-way associativity. Latency from L1 to L2 is 15 cycles, and from L2 to memory 300 cycles. We use a 32B-wide bus to access L2 and a multibanked L2 of 16 banks with a 3-cycle access time. We extended the SMTSim simulator [2] to model a CMP. We collected traces of the most representative 300 million instruction segment of each program, following the SimPoint methodology [18]. We use the FAME simulation methodology [19] with a Maximum Allowable IPC Variance of 5%. This evaluation methodology measures the performance of multithreaded processors by reexecuting all threads in a multithreaded workload until all of them are fairly represented in the final IPC taken from the workload.

4.2 Workload Classification

In [20], two metrics are used to model the performance of a partitioning algorithm like MinMisses for pairings of benchmarks from the SPEC CPU 2000 benchmark suite. Here, we extend this classification to architectures with more cores.

Metric 1.
The w_{P%}(B) metric measures the number of ways needed by a benchmark B to obtain at least a given percentage P% of its maximum IPC (i.e., its IPC when it uses all L2 ways).
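As a sketch, w_{P%}(B) can be computed directly from a benchmark's IPC-versus-ways curve. The curve below is made-up data shaped like an S benchmark, not a SPEC measurement:

```python
def w_p(ipc_by_ways, p=0.90):
    """Smallest number of ways giving at least p * max IPC.

    ipc_by_ways[w-1] is the benchmark's IPC when it runs alone with
    w ways; the last entry (all ways) is its maximum IPC.
    """
    target = p * ipc_by_ways[-1]
    for w, ipc in enumerate(ipc_by_ways, start=1):
        if ipc >= target:
            return w
    return len(ipc_by_ways)

# A saturating 16-way curve: 90% of peak IPC is reached at 4 ways,
# so this benchmark would be classified as S (K/8 < 4 <= K/2).
curve = [0.5, 1.2, 1.9, 2.1, 2.15] + [2.2] * 11
w_p(curve)   # -> 4
```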
(a) IPC as we vary the number of assigned ways of a 1MB 16-way L2 cache. (b) Average miss penalty of an L2 miss with a 1MB 16-way L2 cache.

Fig. 7. Benchmark classification

The intuition behind this metric is to classify benchmarks by their cache utilization. Using P = 90%, we can classify benchmarks into three groups: Low utility (L), Small working set or saturated utility (S), and High utility (H). L benchmarks have 1 ≤ w90% ≤ K/8, where K is the L2 associativity; they are barely affected by L2 cache space because nearly all their L2 accesses are misses. S benchmarks have K/8 < w90% ≤ K/2 and just need some ways to reach maximum throughput, as they fit in the L2 cache. Finally, H benchmarks have w90% > K/2 and always improve IPC as the number of ways given to them is increased. Clear representatives of these three groups are applu (L), gzip (S) and ammp (H) in Figure 7(a). Table 4 gives w90% for all SPEC CPU 2000 benchmarks.

Table 4. The applications used in our evaluation. For each benchmark, we give the two metrics needed to classify workloads together with the IPC for a 1MB 16-way L2 cache.

Bench     w90%  APTC   IPC
ammp       14   23.63  1.27
applu       1   16.83  1.03
apsi       10   21.14  2.17
art        10   46.04  0.52
bzip2       1    1.18  2.62
crafty      4    7.66  1.71
eon         3    7.09  2.31
equake      1   18.6   0.27
facerec    11   10.96  1.16
fma3d       9   15.1   0.11
galgel     15   18.9   1.14
gap         1    2.68  0.96
gcc         3    6.97  1.64
gzip        4   21.5   2.20
lucas       1    7.60  0.35
mcf         1    9.12  0.06
mesa        2    3.98  3.04
mgrid      11    9.52  0.71
parser     11    9.09  0.89
perl        5    3.82  2.68
sixtrack    1    1.34  2.02
swim        1   28.0   0.40
twolf      15   12.0   0.81
vortex      7    9.65  1.35
vpr        14   11.9   0.97
wupw        1    5.99  1.32

Figure 7(b) shows the average miss penalty of an L2 miss for the whole SPEC CPU 2000 benchmark suite. We note that this average miss penalty varies a lot, even inside each group of benchmarks, ranging from 30 to 294 cycles.
This figure reinforces the main motivation of the paper, as it shows that the clustering level of L2 misses changes across applications.
Metric 2. The w_LRU(th_i) metric measures the number of ways given by LRU to each thread th_i in a workload composed of N threads. This can be estimated by simulating each benchmark alone and using the frequency of L2 accesses of each thread [5]. We denote the number of L2 Accesses in a Period of one Thousand Cycles for thread i as APTC_i; Table 4 lists these values for each benchmark.

w_LRU(th_i) = (APTC_i / Σ_{j=1..N} APTC_j) · Associativity

Next, we use these two metrics to extend previous classifications [20] to workloads with more than two benchmarks.

Case 1. w90%(th_i) ≤ w_LRU(th_i) for all threads. In this situation LRU attains 90% of each benchmark's performance, so intuitively there is very little room for improvement.

Case 2. There exist two threads A and B such that w90%(th_A) > w_LRU(th_A) and w90%(th_B) < w_LRU(th_B). In this situation, LRU is harming the performance of thread A because it gives more ways than necessary to thread B. Thus, LRU is assigning some shared resources to a thread that does not need them, while the other thread could benefit from those resources.

Case 3. Finally, the third case arises when w90%(th_i) > w_LRU(th_i) for all threads. In this situation, our L2 cache configuration is not big enough to ensure that all benchmarks reach at least 90% of their peak performance. In [20] it was observed that pairings belonging to this group show worse results as the value of |w90%(th_1) − w90%(th_2)| grows. In this case, one thread requires much less L2 cache space than the other to attain 90% of its peak IPC. LRU treats threads equally and manages to satisfy the needs of the less demanding thread. MinMisses, in contrast, assumes that all misses are equally important for throughput and tends to give more space to the thread with the higher L2 cache demand, harming the less demanding thread. This is a problem inherent to the MinMisses algorithm.
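The classification driven by these two metrics can be sketched as follows. The example call uses the gzip and ammp values from Table 4; the function names are our own:

```python
def w_lru(aptcs, assoc):
    """Ways LRU implicitly gives each thread, proportional to its
    L2 access frequency (APTC)."""
    total = sum(aptcs)
    return [a / total * assoc for a in aptcs]

def classify(w90s, aptcs, assoc):
    """Return 1, 2 or 3 following the three-case classification."""
    lru = w_lru(aptcs, assoc)
    if all(w <= l for w, l in zip(w90s, lru)):
        return 1   # LRU already satisfies every thread
    if all(w > l for w, l in zip(w90s, lru)):
        return 3   # cache too small for all threads
    return 2       # LRU starves at least one thread

# gzip (w90=4, APTC=21.5) + ammp (w90=14, APTC=23.63), 16-way L2:
# LRU gives each roughly 8 ways, so ammp is starved -> Case 2.
classify([4, 14], [21.5, 23.63], 16)   # -> 2
```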
We will show in the next subsections that MLP-aware partitioning policies are able to overcome this situation.

Table 5. Workloads belonging to each case for a 1MB 16-way and a 2MB 32-way shared L2 cache

1MB 16-way L2
#cores    2            4              6               8
Case 1    155 (48%)    624 (4%)       306 (0.1%)      19 (0%)
Case 2    135 (41%)    12785 (86%)    219790 (95%)    1538538 (98%)
Case 3    35 (11%)     1541 (10%)     10134 (5%)      23718 (2%)

2MB 32-way L2
#cores    2            4              6               8
Case 1    159 (49%)    286 (1.9%)     57 (0.02%)      1 (0%)
Case 2    146 (45%)    12914 (86%)    212384 (92%)    1496215 (96%)
Case 3    20 (6.2%)    1750 (12%)     17789 (7.7%)    66059 (4.2%)

Table 5 shows the total number of workloads that belong to each case for different configurations. We generated all possible combinations without repeating benchmarks; the order of benchmarks is not important. In the case of a 1MB 16-way L2, we note that Case 2 becomes the dominant case as the
number of cores increases. The same trend is observed for L2 caches with larger associativity. Table 5 also gives the total number of workloads in each case as the number of cores increases for a 32-way 2MB L2 cache. Note that with different L2 cache configurations, the values of w90% and APTC_i change for each benchmark. An important conclusion from Table 5 is that as we increase the number of cores, more combinations belong to the second case, which is the one with the most room for improvement. To evaluate our proposals, we randomly generate 16 workloads belonging to each group for three different configurations. We denote these configurations 2C (2 cores and a 1MB 16-way L2), 4C-1 (4 cores and a 1MB 16-way L2) and 4C-2 (4 cores and a 2MB 32-way L2). We have also used a 2MB 32-way L2 cache because future CMP architectures will continue scaling L2 size and associativity; for example, the IBM Power5 [21] has a 10-way 1.875MB L2 cache and the Niagara 2 has a 16-way 4MB L2.

4.3 Performance Metrics

As performance metric we use IPC throughput, which corresponds to the sum of individual IPCs. We also use the harmonic mean of relative IPCs to measure fairness, which we denote Hmean. We use Hmean instead of weighted speed up because it has been shown to provide a better fairness-throughput balance than weighted speed up [22]. Average improvements consider the distribution of workloads among the three groups. We call this mean the weighted mean, as we assign a weight to the speed up of each case according to the distribution of workloads from Table 5. For example, for the 2C configuration, we compute the weighted mean improvement as 0.48 · x1 + 0.41 · x2 + 0.11 · x3, where xi is the average improvement in Case i.

5 Evaluation Results

5.1 Performance Results

Throughput.
The first experiment compares throughput for different DCP algorithms, using the LRU policy as the baseline. We simulate MinMisses and our two proposals with the 48 workloads selected in the previous subsection. Figure 8(a) shows the average speed up over LRU for these mechanisms. MLPIPC-DCP systematically obtains the best average results, nearly doubling the performance benefit of MinMisses over LRU in the four-core configurations. In configuration 4C-1, MLPIPC-DCP outperforms MinMisses by 4.1%. MLP-DCP always improves on MinMisses but obtains worse results than MLPIPC-DCP. All algorithms have similar results in Case 1; this is intuitive, as in this situation there is little room for improvement. In Case 2, MinMisses obtains a relevant improvement over LRU in configuration 2C, and MLP-DCP and MLPIPC-DCP achieve an extra 2.5% and 5% improvement, respectively. In the other
(a) Throughput speed up over LRU. (b) Fairness speed up over LRU.

Fig. 8. Average performance speed ups over LRU

configurations, MLP-DCP and MLPIPC-DCP still outperform MinMisses by 2.1% and 3.6%. In Case 3, MinMisses presents larger performance degradation as the asymmetry between the necessities of the two cores increases; as a consequence, it has worse average throughput than LRU. Assigning an appropriate weight to each L2 access makes it possible to obtain better results than LRU using MLP-DCP and MLPIPC-DCP.

Fairness. We use the harmonic mean of relative IPCs [22] to measure fairness. The relative IPC is computed as IPC_shared / IPC_alone. Figure 8(b) shows the average speed up over LRU of the harmonic mean of relative IPCs. Fair stands for the policy explained in Section 2. We can see that in all situations MLP-DCP improves over both MinMisses and LRU (except in Case 3 for two cores). It even obtains better results than Fair in configurations 2C and 4C-1. MLPIPC-DCP is a variant of the MLP-DCP algorithm optimized for throughput; as a consequence, it obtains worse fairness results than MLP-DCP.

Fig. 9. Average throughput speed up over LRU with a 1MB 16-way L2 cache

Equivalent cache space. DCP algorithms can reach the performance of a larger L2 cache with the LRU eviction policy. Figure 9 shows the performance evolution when the L2 size is increased from 1MB to 2MB with LRU as the eviction policy. In this
experiment, the workloads correspond to the ones selected for the configuration 4C-1. Figure 9 also shows the average speed up over LRU of MinMisses, MLP-DCP and MLPIPC-DCP with a 1MB 16-way L2 cache. MinMisses has the same average performance as a 1.25MB 20-way L2 cache with LRU, which means that MinMisses provides the performance obtained with a 25% larger shared cache. MLP-DCP reaches the performance of a 37.5% larger cache. Finally, MLPIPC-DCP doubles the size gain of MinMisses, reaching the performance of a 50% larger L2 cache.

5.2 Design Parameters

Figure 10(a) shows the sensitivity of our proposal to the period of partition decisions. For shorter periods, the partitioning algorithm reacts more quickly to phase changes. Only small performance variations are obtained for different periods; however, for longer periods throughput tends to decrease. As can be seen in Figure 10(a), peak performance is obtained with a period of 5 million cycles.

(a) Average throughput for different periods for the MLP-DCP algorithm with the 2C configuration. (b) Average speed up over LRU for different ROB sizes with the 4C-1 configuration.

Fig. 10. Sensitivity analysis to different design parameters

Finally, we varied the size of the ROB from 128 to 512 entries to show the sensitivity of our proposals to this architectural parameter. Our mechanism is the only one that is aware of the ROB size: the larger the ROB, the larger the possible clusters of L2 misses. Other policies only work with the number of L2 misses, which does not change if we vary the ROB size. When the ROB size increases, clusters of misses can contain more misses and, as a consequence, our mechanism can better differentiate between isolated and clustered misses.
As Figure 10(b) shows, average improvements in the 4C-1 configuration are slightly higher for a ROB with 512 entries, while MinMisses shows worse results. MLPIPC-DCP outperforms LRU and MinMisses by 10.4% and 4.3%, respectively.

5.3 Hardware Cost

We have used the hardware implementation of Figure 5 to estimate the hardware cost of our proposal. In this subsection, we focus our attention on the
configuration 2C. We assume a 40-bit physical address space. Each entry in the ATD needs 29 bits (1 valid bit + 24-bit tag + 4-bit LRU counter). Each set has 16 ways, so we have an overhead of 58 Bytes (B) per set. As we have 1024 sets, the total cost is 58KB per core. The hardware cost corresponding to the extra fields of each entry in the L2 MSHR is 5 bits for the stack distance and 2B for the MLP cost; with 32 entries, this gives a total of 84B. Four adders are needed to update the MLP cost of the active MSHR entries. HSHR entries need 1 valid bit, 8 bits to identify the ROB entry, 34 bits for the address, 5 bits for the stack distance and 2B for the MLP cost, 64 bits per entry in total. With 24 entries in each HSHR, this gives a total of 192B per core. Four adders per core are needed to update the MLP cost of the active HSHR entries. Finally, we need 17 counters of 4B for each MLP-aware SDH, a total of 68B per core. In addition to the storage bits, we also need an adder for incrementing the MLP-aware SDHs and a shifter to halve the hit counters after each partitioning interval.

Fig. 11. Throughput and hardware cost depending on ds in a two-core CMP

Sampled ATD. The main contribution to hardware cost corresponds to the ATD. Instead of monitoring every cache set, we can track accesses from a reduced number of sets. This idea was also used in [8] with MinMisses in a CMP environment; here, we use it in a different situation, namely to estimate MLP-aware SDHs from a sampled number of sets. We define a sampling distance ds that gives the distance between tracked sets. For example, if ds = 1, we track all sets; if ds = 2, we track half of the sets, and so on. Sampling reduces the size of the ATD at the expense of less accurate MLP-aware SDH predictions, as some accesses are not tracked. Figure 11 shows throughput degradation in a two-core scenario as ds increases.
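The storage-cost arithmetic above can be checked mechanically. The counts below only restate the figures given in the text:

```python
def bits_to_bytes(bits):
    return bits / 8

# ATD: 29 bits/entry (valid + 24-bit tag + 4-bit LRU counter),
# 16 ways per set, 1024 sets per core.
atd_set_bytes = bits_to_bytes(29 * 16)       # 58 B per set
atd_core_kb = atd_set_bytes * 1024 / 1024    # 58 KB per core

# MSHR extension: 5-bit stack distance + 2B (16-bit) MLP cost,
# 32 entries.
mshr_bytes = bits_to_bytes((5 + 16) * 32)    # 84 B

# HSHR: 1 + 8 + 34 + 5 + 16 = 64 bits per entry, 24 entries.
hshr_bytes = bits_to_bytes(64 * 24)          # 192 B per core

# MLP-aware SDH: 17 counters of 4 B each.
sdh_bytes = 17 * 4                           # 68 B per core
```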
This curve is measured on the left y-axis. We also show the storage overhead as a percentage of the total L2 cache size, measured on the right y-axis. Thanks to the sampling technique, the storage overhead decreases drastically. With a sampling distance of 16, we obtain an average throughput degradation of 0.76% and a storage overhead of 0.77% of the L2 cache size, which is less than 8KB of storage. We consider this an attractive design point.
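A simple sketch of the overhead trade-off, using the 2C parameters above (58B per tracked set, 1024 sets, two cores, 1MB L2). Counting only the ATD sets, this gives roughly 0.7% at ds = 16, in the same range as the 0.77% reported in the text, which presumably also includes the auxiliary counters:

```python
def sampled_atd_overhead(ds, sets=1024, set_bytes=58, cores=2,
                         l2_bytes=1 << 20):
    """Total sampled-ATD storage (bytes) and its fraction of the
    L2 size, for sampling distance ds (track every ds-th set)."""
    tracked = sets // ds
    total = tracked * set_bytes * cores
    return total, total / l2_bytes

total, frac = sampled_atd_overhead(16)   # 7424 B, well under 8KB
```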
5.4 Scalable Algorithm to Decide Cache Partitions

Evaluating all possible combinations allows determining the optimal partition for the next period. However, this algorithm does not scale adequately as the associativity and the number of applications sharing the cache grow. If a K-way associative L2 cache is shared by N cores, the number of possible partitions, without considering order, is C(N+K−1, K). For example, for 8 cores and 16 ways, we have 245157 possible combinations. Consequently, the time to decide new cache partitions does not scale. Several heuristics have been proposed to reduce the number of cycles required to decide the new partition [8,9], which can be used in our situation. These proposals bound the length of the decision period by 10000 cycles. This overhead is very low compared to the 5-million-cycle period (less than 0.2%).

Fig. 12. Average throughput speed up over LRU for different decision algorithms in the 4C-1 configuration

Figure 12 shows the average speed up of MLP-DCP over LRU in the 4C-1 configuration with three different decision algorithms. Evaluating all possible partitions (denoted EvalAll) gives the highest speed up. The first greedy algorithm (denoted Marginal Gains) assigns one way to a thread in each iteration [9]. The selected way is the one that gives the largest reduction in total MLP cost. This process is repeated until all ways have been assigned. The number of operations (comparisons) is of order K · N, where K is the associativity of the L2 cache and N the number of cores. With this heuristic, an average throughput degradation of 0.59% is obtained. The second greedy algorithm (denoted Look Ahead) is similar to Marginal Gains. The basic difference between them is that Look Ahead considers the total MLP cost for all possible numbers of blocks that the application can receive [8] and can assign more than one way in each iteration.
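The partition count is a standard stars-and-bars binomial coefficient, which can be verified directly. If every core must receive at least one way (as in the two-core case study, which has 15 partitions of 16 ways), the count becomes C(K−1, N−1):

```python
from math import comb

def num_partitions(cores, ways):
    """Unordered assignments of `ways` ways to `cores` cores,
    allowing zero-way allocations: C(N + K - 1, K)."""
    return comb(cores + ways - 1, ways)

def num_partitions_min1(cores, ways):
    """Variant where every core gets at least one way: C(K-1, N-1)."""
    return comb(ways - 1, cores - 1)

num_partitions(8, 16)        # -> 245157, as in the text
num_partitions_min1(2, 16)   # -> 15, the case-study partitions
```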
The number of operations (add-divide-compare) is of order N · K²/2, where K is the associativity of the L2 cache and N the number of cores. With this heuristic, an average throughput degradation of 1.04% is obtained.
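A sketch of the Marginal Gains greedy adapted to MLP costs. The original heuristic [9] works on marginal utility; here each iteration hands one way to the thread whose total MLP cost drops the most, each core is seeded with one way, and the thread data (per-distance MLP costs, overflow cost) is made up for illustration:

```python
def marginal_gains(threads, assoc):
    """Greedy partitioning: repeatedly give one way to the thread
    whose total MLP cost drops the most (its marginal gain).

    threads[i] = (list of MLP costs indexed by stack distance 1..K,
                  overflow cost for distances beyond K).
    """
    def tmlp(sdh, ov, w):
        return ov + sum(sdh[w:])

    ways = [1] * len(threads)            # every core starts with one way
    for _ in range(assoc - len(threads)):
        gains = [tmlp(s, o, w) - tmlp(s, o, w + 1)
                 for (s, o), w in zip(threads, ways)]
        ways[gains.index(max(gains))] += 1
    return tuple(ways)

# Thread 0 has expensive (isolated) misses, thread 1 cheap ones,
# so the greedy funnels the remaining ways to thread 0.
t0 = ([10.0, 8.0, 6.0, 4.0], 2.0)
t1 = ([3.0, 1.0, 0.5, 0.2], 0.1)
marginal_gains([t0, t1], 4)   # -> (3, 1)
```

Each iteration performs N comparisons and there are at most K iterations, matching the O(K · N) operation count quoted above.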
6 Conclusions

In this paper we propose a new DCP algorithm that assigns a cost to each L2 access according to its impact on final performance: isolated misses receive higher costs than clustered misses. Our algorithm then decides the L2 cache partition that minimizes the total cost for all running threads. Furthermore, we have classified workloads for multiple cores into three groups and shown that the dominant situation is precisely the one that offers room for improvement. We show that our proposal reaches high throughput for two- and four-core architectures. In all evaluated configurations, our proposal consistently outperforms both LRU and MinMisses, reaching speed ups of up to 63.9% (10.6% on average) and 15.4% (4.1% on average), respectively. With our proposals, we reach the performance of a 50% larger cache. Finally, we used a sampling technique to propose a practical implementation with a storage cost of less than 1% of the total L2 cache size, together with a scalable algorithm to determine cache partitions with nearly no performance degradation.

Acknowledgments

This work is supported by the Ministry of Education and Science of Spain under contracts TIN2004-07739, TIN2007-60625 and grant AP-2005-3318, and by the SARC European Project. The authors would like to thank C. Acosta, A. Falcon, D. Ortega, J. Vermoulen and O. J. Santana for their work on the simulation tool. We also thank F. Cabarcas, I. Gelado, A. Rico and C. Villavieja for comments on earlier drafts of this paper, and the reviewers for their helpful comments.

References

1. Serrano, M.J., Wood, R., Nemirovsky, M.: A study on multistreamed superscalar processors. Technical Report 93-05, University of California Santa Barbara (1993)
2. Tullsen, D.M., Eggers, S.J., Levy, H.M.: Simultaneous multithreading: maximizing on-chip parallelism. In: ISCA (1995)
3. Hammond, L., Nayfeh, B.A., Olukotun, K.: A single-chip multiprocessor. Computer 30(9), 79–85 (1997)
4.
Cazorla, F.J., Ramirez, A., Valero, M., Fernandez, E.: Dynamically controlled resource allocation in SMT processors. In: MICRO (2004)
5. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting inter-thread cache contention on a chip multi-processor architecture. In: HPCA (2005)
6. Petoumenos, P., Keramidas, G., Zeffer, H., Kaxiras, S., Hagersten, E.: Modeling cache sharing on chip multiprocessor architectures. In: IISWC, pp. 160–171 (2006)
7. Chiou, D., Jain, P., Devadas, S., Rudolph, L.: Dynamic cache partitioning via columnization. In: Design Automation Conference (2000)
8. Qureshi, M.K., Patt, Y.N.: Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In: MICRO (2006)
9. Suh, G.E., Devadas, S., Rudolph, L.: A new memory monitoring scheme for memory-aware scheduling and partitioning. In: HPCA (2002)
10. Kim, S., Chandra, D., Solihin, Y.: Fair cache sharing and partitioning in a chip multiprocessor architecture. In: PACT (2004)
11. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: ISCA (2004)
12. Mattson, R.L., Gecsei, J., Slutz, D.R., Traiger, I.L.: Evaluation techniques for storage hierarchies. IBM Systems Journal 9(2), 78–117 (1970)
13. Settle, A., Connors, D., Gibert, E., Gonzalez, A.: A dynamically reconfigurable cache for multithreaded processors. Journal of Embedded Computing 1(3-4) (2005)
14. Rafique, N., Lim, W.T., Thottethodi, M.: Architectural support for operating system-driven CMP cache management. In: PACT (2006)
15. Hsu, L.R., Reinhardt, S.K., Iyer, R., Makineni, S.: Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource. In: PACT (2006)
16. Qureshi, M.K., Lynch, D.N., Mutlu, O., Patt, Y.N.: A case for MLP-aware cache replacement. In: ISCA (2006)
17. Kroft, D.: Lockup-free instruction fetch/prefetch cache organization. In: ISCA (1981)
18. Sherwood, T., Perelman, E., Hamerly, G., Sair, S., Calder, B.: Discovering and exploiting program phases. IEEE Micro (2003)
19. Vera, J., Cazorla, F.J., Pajuelo, A., Santana, O.J., Fernandez, E., Valero, M.: FAME: Fairly measuring multithreaded architectures. In: PACT (2007)
20. Moreto, M., Cazorla, F.J., Ramirez, A., Valero, M.: Explaining dynamic cache partitioning speed ups. IEEE CAL (2007)
21. Sinharoy, B., Kalla, R.N., Tendler, J.M., Eickemeyer, R.J., Joyner, J.B.: Power5 system microarchitecture. IBM J. Res. Dev. 49(4/5), 505–521 (2005)
22. Luo, K., Gummaraju, J., Franklin, M.: Balancing throughput and fairness in SMT processors. In: ISPASS (2001)
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 24–42, 2011. © Springer-Verlag Berlin Heidelberg 2011

Cache Sensitive Code Arrangement for Virtual Machine*

Chun-Chieh Lin and Chuen-Liang Chen

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 10764, Taiwan {d93020,clchen}@csie.ntu.edu.tw

Abstract. This paper proposes a systematic approach to optimize the code layout of a Java ME virtual machine for an embedded system with a cache-sensitive architecture. A practical example is running the JVM directly (execution-in-place) in NAND flash memory, for which the cache miss penalty is too high to endure. The refined virtual machine generated 96% fewer cache misses than the original version. We developed a mathematical approach that helps to predict the flow of the interpreter inside the virtual machine. This approach analyzes both the static control flow graph and the pattern of bytecode instruction streams, since we found that the input sequence drives the program flow of the virtual machine interpreter. We then proposed a rule to model the execution flows of Java instructions of real applications. Furthermore, we used a graph partition algorithm as a tool to deal with the mathematical model, and this finding helped the relocation process to move program blocks to proper memory pages. The refinement approach dramatically improved the locality of the virtual machine and thus reduced cache miss rates. Our technique can help Java ME-enabled devices run faster and extend battery life. The approach also gives designers the potential to integrate the XIP function into a System-on-Chip, thanks to the lower demand for cache memory.

Keywords: cache sensitive, cache miss, NAND flash memory, code arrangement, Java virtual machine, interpreter, embedded system.

1 Introduction

Java platforms exist extensively in all kinds of embedded and mobile devices.
The Java™ Platform, Micro Edition (Java ME) [1] is without doubt the de facto standard platform of smart phones. The Java virtual machine (the KVM in Java ME) is a key component that affects performance and power consumption. NAND flash memory comes with a serial bus interface. It does not allow random access, and the CPU must read out a whole page at a time, which is a slow operation compared to RAM. This property makes it hard for a processor to execute programs stored

* We acknowledge the support for this study through grants from the National Science Council of Taiwan (NSC 95-2221-E-002-137).
in NAND flash memory using the "execute-in-place" (XIP) technique. Meanwhile, NAND flash memory offers fast write access time and, most important of all, the technology has the advantage of offering higher capacity than NOR flash technology does. As the applications of embedded devices become large and complicated, more mainstream devices adopt NAND flash memory to replace NOR flash memory. In this paper, we try to offer an answer to the question: can we speed up an embedded device that uses NAND flash memory to store programs? "Page-based" storage media, like NAND flash memory, have a higher access penalty than RAM does, so reducing page misses becomes a critical issue. Thus, we set out to find a way to reduce the page miss rate generated by the KVM. Due to the unique structure of the KVM interpreter, we found a special way to exploit the dynamic locality of the KVM, which is to trace the patterns of executed bytecode instructions instead of the internal flow of the KVM. It turned out to be a combinatorial optimization problem, because the code layout must fulfill certain code size constraints. Our approach achieved the effect of static page preloading by properly arranging program blocks. In the experiment, we implemented a post-processing program to modify the intermediate files generated by the C compiler. The post-processing program refined the machine code placement of the KVM based on the mathematical model. Finally, the tuned KVMs dramatically reduced page accesses to NAND flash memories. The outcome of this study helps embedded systems to boost performance and extend battery life as well.

2 Related Works

Park et al., in [2], proposed a hardware module to allow direct code execution from NAND flash memory. In this approach, program code stored in NAND flash pages is loaded into a RAM cache on demand instead of moving the entire contents into RAM.
Their work is a universal hardware-based solution that does not consider application-specific characteristics. Samsung Electronics offers a commercial product called "OneNAND" [3] based on the same idea. It is a single chip with a standard NOR flash interface that actually contains a NAND flash memory array for storage. The vendor's intent was to provide a cost-effective alternative to the NOR flash memory used in existing designs. The internal structure of OneNAND comprises a NAND flash memory, control logic, hardware ECC, and 5KB of buffer RAM. The 5KB buffer RAM consists of three buffers: 1KB for boot RAM, and a pair of 2KB bi-directional data buffers. Our approach is suitable for systems using this type of flash memory. Park et al., in [4], proposed yet another pure software approach to achieve execute-in-place by using a customized compiler that inserts NAND flash reading operations into program code at proper places. Their compiler determines insertion points by summing up the sizes of basic blocks along the calling tree. Special hardware is no longer required, but in contrast to earlier work [2], there is still a need for a tailor-made compiler. Typical studies of refining code placement to minimize cache misses can apply to a NAND flash cache system. Parameswaran et al., in [5], used the bin-packing approach. It reorders program code by examining the execution frequency of basic blocks. Code segments with higher execution frequency are placed next to each other within the cache. Janapsatya et al., in [6], proposed a pure software heuristic approach to reduce the number of cache misses by relocating program sections in main memory.
Their approach was to analyze the program flow graph, and to identify and pack basic blocks within the same loop. They also related cache misses to energy consumption. Although their approach can identify loops within a program, breaking the interpreter of a virtual machine into individual circuits is hard because all the loops share the same starting point. There has been research on improving program locality and optimizing code placement for either cache or virtual memory environments. Pettis [7] proposed a systematic approach that uses the dynamic call graph to position procedures, trying to place two procedures as close as possible if one of them calls the other frequently. The first step of Pettis' approach uses profiling information to create a weighted call graph. The second step iteratively merges the vertices connected by the heaviest-weight edges. The process repeats until the whole graph is composed of one or more individual vertices without edges. However, how to collect profiling information, and its accuracy, is yet another issue. For example, Young and Smith in [8] developed techniques to extract effective branch profile information from a limited depth of branch history. Ball and Larus in [9] described an algorithm for inserting monitoring code to trace programs. Our approach is very different by nature: previous studies all focused on the flow of program code, while we model the profile by the input data. This research project created a post-processor to optimize the code arrangement. It is analogous to the "Diablo linker" [10], which utilized symbolic information in object files to generate optimized executable files. However, our approach generates feedback intermediate files for the compiler, and invokes the compiler to generate optimized machine code.

3 Background

3.1 XIP with NAND Flash

NOR flash memory is popular as code memory because of the XIP feature.
There are several approaches designed for using NAND flash memory as an alternative to NOR flash memory. Because the NAND flash memory interface cannot connect to the CPU host bus, there has to be a memory interface controller that moves data from NAND flash memory to RAM.

Fig. 1. Access NAND flash through shadow RAM
At the system level, Figure 1 shows a straightforward design that uses RAM as a shadow copy of the NAND flash. The system treats NAND flash memory as a secondary storage device [11]. A boot loader or RTOS resides in ROM or NOR flash memory; it copies program code from NAND flash to RAM, and the processor then executes the code in RAM [12]. This approach offers the best execution speed because the processor operates on RAM. The downside of this approach is that it needs a huge amount of RAM to mirror the NAND flash. In embedded devices, RAM is a precious resource. For example, the Sony Ericsson T610 mobile phone [13] reserved 256KB of RAM for the Java heap. In contrast to using 256MB for mirroring NAND flash memory, all designers should agree that they would prefer to retain RAM for Java applets rather than for mirroring. The second pitfall is that this implementation takes longer to boot, because the system must copy the contents to RAM prior to execution. Figure 2 shows a demand paging approach that uses a limited amount of RAM as the cache of the NAND flash. The "romized" program code stays in NAND flash memory, and an MMU loads from NAND into the cache only the portions of program code that are about to be executed. The major advantage of this approach is that it consumes less RAM: several kilobytes of RAM are enough to mirror the NAND flash memory. Using less RAM means that integrating the CPU, MMU and cache into a single chip (the shadowed part in Figure 2) becomes easier. The startup latency is shorter, since the CPU is ready to run soon after the first NAND flash page is loaded into the cache. The component cost is lower than in the previous approach. The realization of the MMU might be either a hardware or a software approach, which is not covered in this paper.

Fig. 2. Using cache unit to access NAND flash

However, performance is the major drawback of this approach.
The penalty of each cache miss is high, because loading contents from a NAND flash page is nearly 200 times slower than the same operation on RAM. Reducing cache misses therefore becomes a critical issue for such configurations.

3.2 KVM Internals

Source Level. In terms of functionality, the KVM can be broken down into several parts: startup, class file loading, constant pool resolving, the interpreter, garbage collection,
C.-C. Lin and C.-L. Chen

and KVM cleanup. Lafond et al. [14] measured the energy consumption of each part of the KVM. Their study showed that the interpreter consumed more than 50% of the total energy. In our experiments running the Embedded Caffeine Benchmark [15], the interpreter contributed 96% of all memory accesses. This evidence leads to the conclusion that the interpreter is the performance bottleneck of the KVM, and it motivated us to focus on reducing the cache misses generated by the interpreter.

Figure 3 shows the program structure of the interpreter. It is a loop enclosing a large switch-case dispatcher. The loop fetches bytecode instructions from the Java application, and each "case" sub-clause handles one bytecode instruction. The control flow graph of the interpreter, illustrated in Figure 4, is a flat and shallow spanning tree. There are three major steps in the interpreter.

ReschedulePoint:
    RESCHEDULE
    opcode = FETCH_BYTECODE(ProgramCounter);
    switch (opcode) {
    case ALOAD: /* do something */ goto ReschedulePoint;
    case IADD:  /* do something */ ...
    case IFEQ:  /* do something */ goto BranchPoint;
    ...
    }
BranchPoint:
    take care of the program counter;
    goto ReschedulePoint;

Fig. 3. Pseudo code of the KVM interpreter

Fig. 4. Control flow graph of the interpreter
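The dispatch loop of Figure 3 can be rendered as a runnable toy (a sketch only: the opcode set, encoding, and handler bodies here are ours, not the KVM's):

```python
def interpret(code):
    """Toy rendering of the Fig. 3 structure: fetch a bytecode,
    dispatch to its handler, and let branch handlers adjust the
    program counter (the BranchPoint of the pseudo code)."""
    pc, stack = 0, []
    while pc < len(code):                 # ReschedulePoint
        opcode = code[pc]                 # FETCH_BYTECODE(ProgramCounter)
        if opcode[0] == "PUSH":           # one "case" per bytecode handler
            stack.append(opcode[1]); pc += 1
        elif opcode[0] == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b); pc += 1
        elif opcode[0] == "IFEQ":         # BranchPoint: adjust the pc
            pc = opcode[1] if stack.pop() == 0 else pc + 1
        elif opcode[0] == "HALT":
            break
    return stack

interpret([("PUSH", 2), ("PUSH", 3), ("ADD",), ("HALT",)])  # -> [5]
```

Every iteration passes through the fetch and dispatch code before reaching a handler, which is exactly why the pages holding those parts stay hot in the cache.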
(1) Rescheduling and fetching. The KVM prepares the execution context and the stack frame, then fetches a bytecode instruction from the Java program.

(2) Dispatching and execution. After reading a bytecode instruction from the Java program, the interpreter jumps to the corresponding bytecode handler through the big "switch…case…" statement. Each bytecode handler carries out the function of its bytecode instruction.

(3) Branching. A branch bytecode instruction may take the Java program flow away from its original track. In this step, the interpreter resolves the target address and modifies the program counter.

Fig. 5. The organization of the interpreter at assembly level

Assembly Level. Our analysis of the source files revealed the peculiar program structure of the VM interpreter, and analyzing the code layout in its compiled executables helped this study create a code placement strategy. The assembly code analysis in this study is restricted to ARM and gcc for the sake of demonstration,
but applying our approach to other platforms and tools is straightforward. Figure 5 illustrates the layout of the interpreter in assembly form (FastInterpret() in interp.c). The first trunk, BytecodeFetching, is the code block for rescheduling and fetching; it corresponds exactly to the first part of the original source code. The second trunk, LookupTable, is a large lookup table used to dispatch bytecode instructions; each entry links to a bytecode handler. It is the translated result of the "switch…case…case" statement. The third trunk, BytecodeDispatch, is the aggregation of more than a hundred bytecode handlers. Most bytecode handlers are self-contained: such a handler occupies a contiguous memory space in this trunk and does not jump to code stored in other trunks. There are only a few exceptions that call functions stored in other trunks, such as "invokevirtual." In addition, several constant symbol tables are spread over this trunk; these tables are referenced by the code within the BytecodeDispatch trunk. The last trunk, ExceptionHandling, contains code fragments for exception handling.

Each trunk occupies a number of NAND flash pages. In fact, the total size of BytecodeFetching and LookupTable is about 1200 bytes (compiled with arm-elf-gcc 3.4.3), which is almost small enough to fit into two or three 512-byte pages. Figure 6 shows the size distribution of bytecode handlers. The average size of a bytecode handler is 131 bytes, and 79 handlers are smaller than 56 bytes. In other words, a 512-byte page can gather 4 to 8 bytecode handlers. The inter-handler execution flow dominates the number of cache misses generated by the interpreter; this is why our approach rearranges the bytecode handlers within the BytecodeDispatch trunk.

Fig. 6. Distribution of bytecode handler size (compiled with gcc 3.4.3)

4 Analyzing Control Flow

4.1 Indirect Control Flow Graph

Static branch prediction and typical code placement approaches derive the layout of a program from its control flow graph (CFG). However, the CFG of a VM interpreter is a special case: it is a flat spanning tree enclosed by a loop. The CFG does not provide sufficient information to distinguish the temporal relations of each pair of bytecode handlers. For anyone who wants to improve program locality by observing the dynamic execution order of program blocks, the CFG is apparently not a good tool to this
end. Therefore, we propose a concept called the "Indirect Control Flow Graph" (ICFG); it uses the real bytecode instruction sequences to construct the dual CFG of the interpreter.

Consider a simplified virtual machine with five bytecode instructions: A, B, C, D, and E, and use it to run a very simple user applet whose instruction sequence is the following short alphabetic sequence:

A-B-A-B-C-D-E-C

Each letter in the sequence represents one bytecode instruction. In Figure 7, the graph connected with the solid lines is the CFG of the simplified interpreter. Following the flow in the CFG, the program flow becomes: [Dispatch] – [Handler A] – [Dispatch] – [Handler B]…

Fig. 7. The CFG of the simplified interpreter

It is hard to tell the relation between handler A and handler B because the loop header hides it. In other words, this CFG cannot easily show which handler will be invoked after handler A is executed. The idea of the ICFG is to observe the patterns of the bytecode sequences executed by the virtual machine, not to analyze the structure of the virtual machine itself. Figure 8 expresses the ICFG in a readable way; it happens to be the sub-graph connected by the dashed directed lines in Figure 7.

Fig. 8. An ICFG example. The number inside each circle represents the size of the handler
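Constructing the ICFG from a trace amounts to counting transitions between consecutive executed bytecodes; merging the two directions of each pair, as §4.2 later does, gives the undirected edge weights. A minimal sketch (the function name is ours; the trace is the applet sequence above):

```python
from collections import Counter

def icfg_edge_weights(trace):
    """Undirected ICFG edge weights: count each adjacent pair in the
    executed bytecode sequence, merging the two edge directions."""
    weights = Counter()
    for prev, nxt in zip(trace, trace[1:]):
        if prev != nxt:  # a self-transition never crosses handlers
            weights[frozenset((prev, nxt))] += 1
    return weights

w = icfg_edge_weights("ABABCDEC")
# (A, B) is the heaviest edge, so A and B are the first candidates
# to share a flash page.
```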
4.2 Tracing the Locality of the Interpreter

As stated, the Java applications that a KVM runs dominate the temporal locality of the interpreter; precisely speaking, the incoming Java instruction sequence dominates the temporal locality of the KVM. Therefore, the first step in exploiting the temporal locality is to consider the bytecode sequences executed by the virtual machine. For the previous example sequence, the order of accessed NAND flash pages is:

[BytecodeFetching]–[LookupTable]–[A]–[BytecodeFetching]–[LookupTable]–[B]–[BytecodeFetching]–[LookupTable]–[A]…

Obviously, memory pages containing BytecodeFetching and LookupTable appear in the sequence far more often than those containing BytecodeDispatch. As a result, pages containing BytecodeFetching and LookupTable are favored to last in the cache, while pages holding bytecode handlers have to compete with each other to stay in it. We therefore conclude that the order of executed bytecode instructions is the key factor affecting cache misses.

Consider an extreme case: in a system with three cache blocks, two cache blocks always hold the memory pages containing BytecodeFetching and LookupTable, for the reason just stated. That leaves only one cache block available for swapping pages that contain bytecode handlers. If all the bytecode handlers were located in distinct memory pages, processing a bytecode instruction would always cause a cache miss, because the next-to-execute bytecode handler would always be located in an uncached memory page. In other words, the sample sequence causes at least eight cache misses. Nevertheless, if the handlers of A and B are grouped into the same page, the cache misses decline to five, and the page access trace becomes:

fault-A-B-A-B-fault-C-fault-D-fault-E-fault-C

If we extend the group (A, B) to include the handler of C, the cache miss count even drops to four, and the page access trace looks like this:

fault-A-B-A-B-C-fault-D-fault-E-fault-C

The core issue of this study is therefore to find an efficient code layout method that partitions all bytecode instructions into disjoint sets based on their execution relevance, with each NAND flash page holding one set of bytecode handlers. We propose that partitioning the ICFG achieves this goal. Back in Figure 8, the directed edges represent the temporal order of the instruction sequence, and the weight of an edge is the transition count from one bytecode instruction to the next. If we remove the edge (B, C), the ICFG is divided into two disjoint sets: the bytecode handlers of A and B are placed in one page, and the bytecode handlers of C, D, and E in the other. The page access trace becomes:

fault-A-B-A-B-fault-C-D-E-C

This placement causes only two cache misses, 75% fewer than the worst case! The next step is to transform the ICFG into an undirected graph by merging reversed edges that connect the same vertices; the weight of the undirected edge is the sum of the weights of the two directed edges. The result is actually a variation of the classical MIN k-CUT problem. Formally speaking, we can model a given graph G(V, E) as:
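The miss counts quoted above can be reproduced with a one-block page cache standing in for the single cache block left over after BytecodeFetching and LookupTable are pinned (a sketch; the page numbering and function name are ours):

```python
def page_misses(trace, page_of):
    """Faults seen by the one free cache block while the trace of
    bytecode handlers executes; page_of maps handler -> flash page."""
    cached, misses = None, 0
    for handler in trace:
        if page_of[handler] != cached:
            misses += 1
            cached = page_of[handler]
    return misses

trace = "ABABCDEC"
worst     = {h: h for h in "ABCDE"}                   # one handler per page
group_ab  = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 4}  # A, B share a page
group_abc = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3}  # A, B, C share a page
two_sets  = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}  # cut the edge (B, C)
# -> 8, 5, 4 and 2 misses respectively, matching the traces in the text.
```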
- Vi – the i-th bytecode instruction.
- Ei,j – the edge connecting the i-th and j-th bytecode instructions.
- Fi,j – the number of times the two bytecode instructions i and j executed after each other; it is the weight of edge Ei,j.
- K – the number of expected partitions.
- Wx,y – the inter-set weight: for all x ≠ y, Wx,y = ΣFi,j where Vi ∈ Px and Vj ∈ Py.

The goal is to model the problem as the following definition:

Definition 1. The MIN k-CUT problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such that ΣWi,j is minimized.

4.3 The Mathematical Model

Yet there is an additional constraint in our model. It is impractical to gather bytecode instructions into a partition regardless of the total program size of its bytecode handlers: the size of each bytecode handler is distinct, and the code size of a partition cannot exceed the size of a memory page (e.g., a NAND flash page). Our aim is to distribute the bytecode handlers into several disjoint partitions {P1, P2, …, Pk}. We define the following notation:

- Si – the code size of bytecode handler Vi.
- N – the size of a memory page.
- M(Pk) – the size of partition Pk: ΣSm for all Vm ∈ Pk.
- H(Pk) – the value of partition Pk: ΣFi,j for all Vi, Vj ∈ Pk.

Our goal is to construct partitions that satisfy the following constraints.

Definition 2. The problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such that M(Pk) ≤ N for each Pk, the inter-partition weight ΣWi,j is minimized, and ΣH(Pi) is maximized over all Pi ∈ {P1, P2, …, Pk}.

This rectified model is exactly an instance of the graph partition problem: the size of each partition must satisfy the constraint (the size of a memory page), and the sum of inter-partition path weights must be minimal. The graph partition problem is NP-complete [16]. However, the purpose of this paper is neither to create a new graph partition algorithm nor to discuss the differences between existing algorithms.
Our experimental implementation simply adopted the following algorithm to demonstrate that the approach works; other implementations based on this approach may choose another graph partition algorithm that satisfies their specific requirements.

Partition (G)
1. Find the edge with maximal weight Fi,j in graph G such that Si + Sj ≤ N. If there is no such edge, go to step 4.
2. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
3. Remove both Vi and Vj from G, and go to step 1.
4. Find a pair of vertices Vi and Vj in G such that Si + Sj ≤ N. If no pair satisfies the criterion, go to step 7.
5. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
6. Remove both Vi and Vj from G, and go to step 4.
7. End.
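A sketch of this greedy procedure in Python (all names are ours; merging two vertices and summing their parallel edge weights follows the Merge subroutine described next). The paper's step 4, which pairs up leftover vertices that share no edge, is covered here because a missing edge simply has weight 0:

```python
def partition(sizes, weights, page_size):
    """Greedy handler grouping: repeatedly merge the pair of partitions
    with the heaviest connecting weight whose combined code size still
    fits in one flash page; stop when no fitting pair remains.

    sizes:   handler -> code size in bytes
    weights: frozenset({a, b}) -> transition count (undirected ICFG)
    Returns a list of handler sets, one per flash page.
    """
    parts = [{v} for v in sizes]
    psize = [sizes[v] for v in sizes]

    def joint_weight(i, j):
        # Parallel edges into a merged vertex add up (Merge, steps 3-4).
        return sum(weights.get(frozenset((a, b)), 0)
                   for a in parts[i] for b in parts[j])

    while True:
        best = None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                if psize[i] + psize[j] <= page_size:
                    w = joint_weight(i, j)
                    if best is None or w > best[0]:
                        best = (w, i, j)
        if best is None:          # steps 1/4: no pair fits any more
            break
        _, i, j = best
        parts[i] |= parts[j]
        psize[i] += psize[j]
        del parts[j], psize[j]
    return parts
```

With the example ICFG of Figure 8 (taking unit-size handlers and two handlers per page), the heaviest edge (A, B) is merged first, reproducing the grouping used in §4.2.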
The procedure for merging vertices Vi and Vj is:

Merge (Vi, Vj)
1. Add a new vertex Vk to G.
2. Pick an edge E connecting Vt with either Vi or Vj. If there is no such edge, go to step 6.
3. If an edge F already connects Vt to Vk,
4. then add the weight of E to F, discard E, and go to step 2;
5. else replace the end of E that is either Vi or Vj with Vk, and go to step 2.
6. End.

Finally, each vertex in G is a collection of several bytecode handlers. The refinement process then collects the bytecode handlers belonging to the same vertex and places them in one memory page.

5 The Process of Rewriting the Virtual Machine

Our approach emphasizes that the arrangement of bytecode handlers affects the cache miss rate. In other words, it implies that programmers should be able to speed up their programs by properly changing the order of the "case" sub-clauses in the source files. Therefore, this study optimizes the virtual machine in two distinct ways. The first approach revises the order of the "case" sub-clauses in the sources of the virtual machine; if our theory is correct, this tentative approach should show that the modified virtual machine performs better in most test cases. The second version precisely reorganizes the layout of the assembly code blocks of the bytecode handlers, and this approach should produce larger improvements than the first version.

5.1 Source-Level Rearrangement

The concept of the refining process is to arrange the order of the "case" statements in the source file (execute.c). After translating the rearranged source files, the compiler places the bytecode handlers, in machine code form, in the intended order. The following steps outline the refining procedure.

A. Profiling. Run the Java benchmark program on the unmodified KVM. A custom profiler traces the bytecode instruction sequence and generates the statistics of inter-bytecode instruction counts.
Although we could collect some patterns of instruction combinations by investigating the Java compiler, a dynamic approach can capture further application-dependent patterns.

B. Measuring the size of each bytecode handler. The refining program compiles the KVM source files and measures the code size of each bytecode handler (i.e., the size of each "case" sub-clause) by parsing intermediate files generated by the compiler.

C. Partitioning the ICFG. The previous steps collect all the information necessary for constructing the ICFG. The refining program then partitions the ICFG using a graph partition algorithm. From that result, the refining program knows how to group bytecode handlers together. For example, a partition result groups (A, B) into one bundle and (C, D, E) into another, as shown in Figure 8.
D. Rewriting the source file. According to the computed results, the refining program rewrites the source file by arranging the order of all "case" sub-clauses within the interpreter loop. Figure 9 shows the order of the "case" sub-clauses for the previous example.

switch (opcode) {
case B: …;
case A: …;
case E: …;
case D: …;
case C: …;
}

Fig. 9. The output of rearranged case statements

5.2 Assembly-Level Rearrangement

The robust implementation of the refinement process consists of two steps. The refinement process acts as a post-processor of the compiler: it parses intermediate files generated by the compiler, rearranges program blocks, and generates optimized assembly code. Our implementation is inevitably compiler-dependent and CPU-dependent; the current implementation is tightly integrated with gcc for ARM, but the approach is easy to apply to other platforms. Figure 10 outlines the processing flow, the entities involved, and the relations between them. The following paragraphs explain the function of each step.

Fig. 10. Entities in the refinement process
A. Collecting a dynamic bytecode instruction trace. The first step is to collect statistics from real Java applications or benchmarks, because the following steps need these data to partition the bytecode handlers. The modified KVM dumps the bytecode instruction trace while running Java applications, and a special program called TRACER analyzes the trace dump to find the transition counts for all instruction pairs.

B. Rearranging the KVM interpreter. This core step is realized by a program called REFINER, which acts as a post-processor of gcc. Its duty is to parse the bytecode handlers expressed in assembly code and organize them into partitions, each of which fits into one NAND flash page. The program consists of several sub-tasks, described as follows.

(i) Parsing the layout information of the original KVM. The first task is to compile the original KVM. REFINER parses the intermediate files generated by gcc and, following the structure of the interpreter in assembly code introduced in §3.2, analyzes the jump table in the LookupTable trunk to find the address and size of each bytecode handler.

(ii) Using the graph partition algorithm to group bytecode handlers into disjoint partitions. At this stage, REFINER constructs the ICFG from two key inputs: (1) the transition counts of bytecode instructions collected by TRACER, and (2) the machine code layout information collected in sub-task (i). It uses the approximate algorithm described in §4.3 to divide the undirected ICFG into disjoint partitions.

(iii) Rewriting the assembly code. REFINER parses and extracts the assembly code of all bytecode handlers. It then creates a new assembly file and dumps all bytecode handlers, partition by partition, according to the result of (ii).

(iv) Propagating symbol tables to each partition. As described in §3.2, several symbol tables are distributed in the BytecodeDispatch trunk.
For most RISC processors, such as ARM and MIPS, an instruction cannot carry arbitrary constants as operands because of the limited instruction word length. The solution is to gather the constants used into a symbol table and place this table near the instructions that access these constants; the compiler then generates instructions with relative-addressing operands to load constants from the nearby symbol tables. Take ARM for example: its application binary interface (ABI) defines two instructions, LDR and ADR, for loading a constant from a symbol table into a register [17]. The ABI restricts the maximal distance between an LDR/ADR instruction and the referenced symbol table to 4K bytes. Moreover, it would cause a cache miss if a machine instruction in page X loaded a constant si from a symbol table SY located in page Y. Our solution is to create a local symbol table SX in page X and copy the value si into the new table. The relative distance between si and the instruction then never exceeds 4 KB, nor does loading si cause a cache miss.

(v) Dumping the contents of the partitions to NAND flash pages. The aim is to map bytecode handlers to NAND flash pages: the reassembled bytecode handlers belonging to the same partition go into one NAND flash page. After that, REFINER refreshes the address and size information of all bytecode handlers. The updated information helps REFINER add padding to each partition and force the starting address of each partition to align with the boundary of a NAND flash page.
6 Evaluation

In this section, we start with a brief introduction of the environment and conditions used in the experiments. The first part of the experimental results is the outcome of the source-level rearranged virtual machine; those positive results prove that our theory works. The second part is the experiment with the assembly-level rearranged virtual machine; it further proves that our refinement approach produces better results than the source-level version.

6.1 Evaluation Environment

Figure 11 shows the block diagram of our experimental setup. To mimic real embedded applications, we implanted the Java ME KVM into uClinux for ARM7. One reason for using this platform is that uClinux supports the FLAT executable file format, which is perfect for realizing XIP. We ran KVM/uClinux on a customized gdb, which dumped memory access traces and performance statistics to files. The experimental setup assumed a specialized hardware unit acting as the NAND flash memory controller, which loads program code from NAND flash pages into the cache. It also assumed that all flash access operations work transparently, without help from the operating system; in other words, modifying the OS kernel for the experiment was unnecessary. The experiment used "Embedded Caffeine Mark 3.0" [15] as the benchmark.

Fig. 11. Hierarchy of the simulation environment. The simulated stack comprises Embedded Caffeine Mark, the J2ME API, the K Virtual Machine (KVM) 1.1, the uClinux kernel, and GDB 5.0/ARMulator on Windows/Cygwin (Intel x86), with ARM7 flash/ROM and Java heap in RAM. The tool versions were:

Title            Version
arm-elf-binutil  2.15
arm-elf-gcc      3.4.3
uClibc           0.9.18
J2ME (KVM)       CLDC 1.1
elf2flt          20040326

There are several kinds of NAND flash commodities on the market: 512, 2048, and 4096 bytes per page. In this experiment, we modeled the cache simulator under the following conditions:

1. There were four NAND flash page size options: 512, 1024, 2048, and 4096 bytes.
2. The cache was fully associative with a FIFO replacement policy.
3. The number of cache memory blocks varied from 2, 4, … up to 32.

6.2 Results of Source-Level Rearrangement

First, we rearranged the "case" sub-clauses in the source code using the method introduced above. Table 1 lists the raw cache miss statistics, and Figure 12 plots the normalized cache miss rates of the optimized KVM. The experiment
assumed a maximal cache size of 64K bytes. For each NAND flash page size, the number of cache blocks runs from 4 to (64K / NAND flash page size).

In Table 1, each column is the experimental result from one kind of KVM. The "original" column refers to statistics from the original KVM, in which the bytecode handlers are ordered as in the original machine code. The second column, "optimized," is the result from the KVM refined with our approach. For example, in the best case (2048 bytes per page, 8 cache pages), the optimized KVM generates 105,157 misses, which is only 4.5% of the misses caused by the original KVM, an improvement of 95%. Broadly speaking, the experiment shows that the optimized KVM outperforms the original KVM in most cases. Looking at the charts in Figure 12, the curves of normalized cache miss rates (i.e., optimized_miss_rate / original_miss_rate) tend to be concave: the improvement for the case of eight pages is greater than that for four pages. This benefit comes from the smaller "locality" of the optimized KVM; the cache can hold more localities, which helps reduce cache misses. After touching bottom, the cache is large enough to hold most of the KVM program code, and as the cache size grows, the cache miss numbers of all configurations converge. The miss rate at 1024 bytes × 32 blocks, however, is an exceptional case. Because our approach rearranges the order of bytecode handlers at the source level, it can hardly predict the precise starting address and code size of each bytecode handler; this is the drawback of the approach.

Fig. 12. The charts of normalized cache-miss rates from the source-level refined virtual machine (one curve per page size: 512, 1024, 2048, and 4096 bytes/page). Each chart is an experiment performed with a specific page size.
The x-axis is the size of the cache memory ( number_of_pages * page_size ).
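The cache model used in these experiments, fully associative with FIFO replacement, can be sketched as follows (the function name is ours):

```python
from collections import deque

def fifo_misses(page_trace, num_blocks):
    """Fully associative FIFO page cache: a hit leaves the queue
    untouched; a miss evicts the page that entered the cache first."""
    fifo, resident, misses = deque(), set(), 0
    for page in page_trace:
        if page in resident:
            continue
        misses += 1
        if len(fifo) == num_blocks:
            resident.discard(fifo.popleft())
        fifo.append(page)
        resident.add(page)
    return misses

# A loop over three pages thrashes two blocks but fits in three:
# fifo_misses([1, 2, 3] * 3, 2) vs. fifo_misses([1, 2, 3] * 3, 3)
```

This captures why the miss curves fall sharply once the block count covers the working set of the rearranged interpreter.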
CHAPTER IX. THE DEATH OF WALLENSTEIN AND THE TREATY OF PRAGUE.

Section I.—French Influence in Germany.

1631. § 1. Bernhard of Saxe Weimar.

In Germany, after the death of Gustavus at Lützen, it was as it was in Greece after the death of Epaminondas at Mantinea. There was more disturbance and more dispute after the battle than before it. In Sweden, Christina, the infant daughter of Gustavus, succeeded peaceably to her father's throne, and authority was exercised without contradiction by the Chancellor Oxenstjerna. But, wise and prudent as Oxenstjerna was, it was not in the nature of things that he should be listened to as Gustavus had been listened to. The chiefs of the army, no longer held in by a soldier's hand, threatened to assume an almost independent position. Foremost of these was the young Bernhard of Weimar, demanding, like Wallenstein, a place among the princely houses of Germany. In his person he hoped the glories of the elder branch of the Saxon House would revive, and the disgrace inflicted upon it by Charles V. for its attachment to the Protestant cause would be repaired. He claimed the rewards of victory for those whose swords had gained it, and payment for the soldiers, who during the winter months following the victory at Lützen had received little or nothing. His own share was to be a new duchy of Franconia, formed out of the united bishoprics of Würzburg and Bamberg. Oxenstjerna was compelled to admit his pretensions, and to confirm him in his duchy. The step was thus taken which Gustavus had undoubtedly contemplated, but which he had prudently refrained from carrying
into action. The seizure of ecclesiastical lands in which the population was Catholic was as great a barrier to peace on the one side as the seizure of the Protestant bishoprics in the north had been on the other. There was, therefore, all the more necessity to be ready for war.

§ 2. The League of Heilbronn.

If a complete junction of all the Protestant forces was not to be had, something at least was attainable. On April 23, 1633, the League of Heilbronn was signed. The four circles of Swabia, Franconia, and the Upper and Lower Rhine formed a union with Sweden for mutual support.

§ 3. Defection of Saxony.

It is not difficult to explain the defection of the Elector of Saxony. The seizure of a territory by military violence had always been most obnoxious to him. He had resisted it openly in the case of Frederick in Bohemia. He had resisted it, as far as he dared, in the case of Wallenstein in Mecklenburg. He was not inclined to put up with it in the case of Bernhard in Franconia. Nor could he fail to see that with the prolongation of the war, the chances of French intervention were considerably increasing.

§ 4. French politics.

In 1631 there had been a great effervescence of the French feudal aristocracy against the royal authority. But Richelieu stood firm. In March the king's brother, Gaston Duke of Orleans, fled from the country. In July his mother, Mary of Medici, followed his example. But they had no intention of abandoning their position. From their exile in the Spanish Netherlands they formed a close alliance with Spain, and carried on a thousand intrigues with the nobility at home. The Cardinal smote right and left with a heavy hand. Amongst his enemies were the noblest names in France. The Duke of Guise shrank from the conflict and retired to Italy to die far from his native land. The keeper of the seals died in prison. His kinsman, a marshal of France, perished on the scaffold.
In the summer of the year 1632, whilst Gustavus was conducting his last campaign, there was a great rising in the south of France. Gaston
himself came to share in the glory or the disgrace of the rebellion. The Duke of Montmorenci was the real leader of the enterprise. He was a bold and vigorous commander, the Rupert of the French cavaliers. But his gay horsemen dashed in vain against the serried ranks of the royal infantry, and he expiated his fault upon the scaffold. Gaston, helpless and low-minded as he was, could live on, secure under an ignominious pardon.

§ 5. Richelieu did for France all that could be done.

It was not the highest form of political life which Richelieu was establishing. For the free expression of opinion, as a foundation of government, France, in that day, was not prepared. But within the limits of possibility, Richelieu's method of ruling was a magnificent spectacle. He struck down a hundred petty despotisms that he might exalt a single despotism in their place. And if the despotism of the Crown was subject to all the dangers and weaknesses by which sooner or later the strength of all despotisms is eaten away, Richelieu succeeded for the time in gaining the co-operation of those classes whose good will was worth conciliating. Under him commerce and industry lifted up their heads, knowledge and literature smiled at last. Whilst Corneille was creating the French drama, Descartes was seizing the sceptre of the world of science. The first play of the former appeared on the stage in 1629. Year by year he rose in excellence, till in 1636 he produced the 'Cid;' and from that time one masterpiece followed another in rapid succession. Descartes published his first work in Holland in 1637, in which he laid down those principles of metaphysics which were to make his name famous in Europe.

§ 6. Richelieu and Germany.

All this, however welcome to France, boded no good to Germany. In the old struggles of the sixteenth century, Catholic and Protestant each believed himself to be doing the best, not merely for his own country, but for the world in general.
Alva, with his countless executions in the Netherlands, honestly believed that the Netherlands as well as Spain would be the better for the rude
surgery. The English volunteers, who charged home on a hundred battle-fields in Europe, believed that they were benefiting Europe, not England alone. It was time that all this should cease, and that the long religious strife should have its end. It was well that Richelieu should stand forth to teach the world that there were objects for a Catholic state to pursue better than slaughtering Protestants. But the world was a long way, in the seventeenth century, from the knowledge that the good of one nation is the good of all, and in putting off its religious partisanship France became terribly hard and selfish in its foreign policy. Gustavus had been half a German, and had sympathized deeply with Protestant Germany. Richelieu had no sympathy with Protestantism, no sympathy with German nationality. He doubtless had a general belief that the predominance of the House of Austria was a common evil for all, but he cared chiefly to see Germany too weak to support Spain. He accepted the alliance of the League of Heilbronn, but he would have been equally ready to accept the alliance of the Elector of Bavaria if it would have served him as well in his purpose of dividing Germany.
§ 7. His policy French, not European.

The plan of Gustavus might seem unsatisfactory to a patriotic German, but it was undoubtedly conceived with the intention of benefiting Germany. Richelieu had no thought of constituting any new organization in Germany. He was already aiming at the left bank of the Rhine. The Elector of Treves, fearing Gustavus, and doubtful of the power of Spain to protect him, had called in the French, and had established them in his new fortress of Ehrenbreitstein, which looked down from its height upon the low-lying buildings of Coblentz, and guarded the junction of the Rhine and the Moselle. The Duke of Lorraine had joined Spain, and had intrigued with Gaston. In the summer of 1632 he had been compelled by a French army to make his submission. The next year he moved again, and the French again interfered, and wrested from him his capital of Nancy. Richelieu treated the old German frontier-land as having no rights against the King of France.

Section II.—Wallenstein's Attempt to dictate Peace.

§ 1. Saxon negotiations with Wallenstein.

Already, before the League of Heilbronn was signed, the Elector of Saxony was in negotiation with Wallenstein. In June peace was all but concluded between them. The Edict of Restitution was to be cancelled. A few places on the Baltic coast were to be ceded to Sweden, and a portion at least of the Palatinate was to be restored to the son of the Elector Frederick, whose death in the preceding winter had removed one of the difficulties in the way of an agreement. The precise form in which the restitution should take place, however, still remained to be settled. Such a peace would doubtless have been highly disagreeable to adventurers like Bernhard of Weimar, but it would have given the Protestants of Germany all that they could reasonably expect to gain, and would have given the House of Austria one last chance of
taking up the championship of national interests against foreign aggression.
[Sidenotes: § 2. Opposition to Wallenstein. § 3. General disapprobation of his proceedings. § 4. Wallenstein and the Swedes.]
Such last chances, in real life, are seldom taken hold of for any useful purpose. If Ferdinand had had it in him to rise up in the position of a national ruler, he would have been in that position long before. His confessor, Father Lamormain, declared against the concessions which Wallenstein advised, and the word of Father Lamormain had always great weight with Ferdinand. Even if Wallenstein had been single-minded he would have had difficulty in meeting such opposition. But Wallenstein was not single-minded. He proposed to meet the difficulties which were made to the restitution of the Palatinate by giving the Palatinate, largely increased by neighbouring territories, to himself. He would thus have a fair recompense for the loss of Mecklenburg, which he could no longer hope to regain. He fancied that the solution would satisfy everybody. In fact, it displeased everybody. Even the Spaniards, who had been on his side in 1632, were alienated by it. They were especially jealous of the rise of any strong power near the line of march between Italy and the Spanish Netherlands. The greater the difficulties in Wallenstein's way the more determined he was to overcome them. Regarding himself, with some justification, as a power in Germany, he fancied himself able to act at the head of his army as if he were himself the ruler of an independent state. If the Emperor listened to Spain and his confessor in 1633 as he had listened to Maximilian and his confessor in 1630, Wallenstein might step forward and force upon him a wiser policy. Before the end of August he had opened a communication with Oxenstjerna, asking for his assistance in effecting a reasonable compromise, whether the Emperor liked it or not.
But he had forgotten that such a proposal as this can only be accepted where there is confidence in him who makes it. In Wallenstein—the man of many schemes and many
intrigues—no man had any confidence whatever.
[Sidenotes: § 5. Was he in earnest? § 6. He attacks the Saxons. § 7. Bernhard at Ratisbon. § 8. Wallenstein's difficulties.]
Oxenstjerna cautiously replied that if Wallenstein meant to join him against the Emperor he had better be the first to begin the attack. Whether Wallenstein seriously meant at this time to move against the Emperor it is impossible to say. He loved to enter upon plots in every direction without binding himself to any; but he was plainly in a dangerous position. How could he impose peace upon all parties when no single party trusted him? If he was not trusted, however, he might still make himself feared. Throwing himself vigorously upon Silesia, he forced the Swedish garrisons to surrender, and, presenting himself upon the frontiers of Saxony, again offered peace to the two northern electors. But Wallenstein could not be everywhere. Whilst the electors were still hesitating, Bernhard made a dash at Ratisbon, and firmly established himself in the city, within a little distance of the Austrian frontier. Wallenstein, turning sharply southward, stood in the way of his further advance, but he did nothing to recover the ground which had been lost. He was himself weary of the war. In his first command he had aimed at crushing out all opposition in the name of the imperial authority. His judgment was too clear to allow him to run the old course. He saw plainly that strength was now to be gained only by allowing each of the opposing forces their full weight. 'If the Emperor,' he said, 'were to gain ten victories it would do him no good. A single defeat would ruin him.' In December he was back again in Bohemia. It was a strange, Cassandra-like position, to be wiser than all the world, and to be listened to by no one; to suffer the fate of supreme intelligence which touches no moral chord and awakens no human sympathy. For many months the hostile influences had been gaining strength at Vienna.
There were War-Office officials whose wishes Wallenstein
systematically disregarded; Jesuits who objected to peace with heretics at all; friends of the Bavarian Maximilian who thought that the country round Ratisbon should have been better defended against the enemy; and Spaniards who were tired of hearing that all matters of importance were to be settled by Wallenstein alone.
[Sidenotes: § 9. Opposition of Spain. § 10. The Cardinal-Infant. § 11. The Emperor's hesitation.]
The Spanish opposition was growing daily. Spain now looked to the German branch of the House of Austria to make a fitting return for the aid which she had rendered in 1620. Richelieu, having mastered Lorraine, was pushing on towards Alsace, and if Spain had good reasons for objecting to see Wallenstein established in the Palatinate, she had far better reasons for objecting to see France established in Alsace. Yet for all these special Spanish interests Wallenstein cared nothing. His aim was to place himself at the head of a German national force, and to regard all questions simply from his own point of view. If he wished to see the French out of Alsace and Lorraine, he wished to see the Spaniards out of Alsace and Lorraine as well. And, as was often the case with Wallenstein, a personal difference arose by the side of the political difference. The Emperor's eldest son, Ferdinand, the King of Hungary, was married to a Spanish Infanta, the sister of Philip IV., who had once been the promised bride of Charles I. of England. Her brother, another Ferdinand, usually known from his rank in Church and State as the Cardinal-Infant, had recently been appointed Governor of the Spanish Netherlands, and was waiting in Italy for assistance to enable him to conduct an army through Germany to Brussels. That assistance Wallenstein refused to give. The military reasons which he alleged for his refusal may have been good enough, but they had a dubious sound in Spanish ears. It looked as if he was simply jealous of Spanish influence in Western Germany.
Such were the influences which were brought to bear upon the Emperor after Wallenstein's return from Ratisbon in December. Ferdinand, as usual, was distracted between the two courses proposed. Was he to make the enormous concessions to the Protestants involved in the plan of Wallenstein; or was he to fight it out with France and the Protestants together according to the plan of Spain?
[Sidenotes: 1634. § 12. Wallenstein and the army.]
To Wallenstein by this time the Emperor's resolutions had become almost a matter of indifference. He had resolved to force a reasonable peace upon Germany, with the Emperor, if it might be so; without him, if he refused his support. Wallenstein was well aware that his whole plan depended on his hold over the army. In January he received assurances from three of his principal generals, Piccolomini, Gallas, and Aldringer, that they were ready to follow him wheresoever he might lead them, and he was sanguine enough to take these assurances for far more than they were worth. Neither they nor he himself were aware to what lengths he would go in the end. For the present it was a mere question of putting pressure upon the Emperor to induce him to accept a wise and beneficent peace.

Section III.—Resistance to Wallenstein's Plans.

[Sidenotes: § 1. Oñate's movements. § 2. Belief at Vienna that Wallenstein was a traitor.]
The Spanish ambassador, Oñate, was ill at ease. Wallenstein, he was convinced, was planning something desperate. What it was he could hardly guess; but he was sure that it was something most prejudicial to the Catholic religion and the united House of Austria. The worst was that Ferdinand could not be persuaded that there was cause for suspicion. 'The sick man,' said Oñate, speaking of the Emperor, 'will die in my arms without my being able to help him.' Such were Oñate's feelings toward the end of January. Then came information that the case was worse than even he had deemed possible. Wallenstein, he learned, had been intriguing with the Bohemian exiles, who had offered, with
Richelieu's consent, to place upon his head the crown of Bohemia, which had fourteen years before been snatched from the unhappy Frederick.
[Sidenotes: § 3. Oñate informs Ferdinand. § 4. Decision of the Emperor against Wallenstein. § 5. Determination to displace Wallenstein.]
In all this there was much exaggeration. Though Wallenstein had listened to these overtures, it is almost certain that he had not accepted them. But neither had he revealed them to the government. It was his way to keep in his hands the threads of many intrigues to be used or not to be used as occasion might serve. Oñate, naturally enough, believed the worst. And for him the worst was the best. He went triumphantly to Eggenberg with his news, and then to Ferdinand. Coming alone, this statement might perhaps have been received with suspicion. Coming, as it did, after so many evidences that the general had been acting in complete independence of the government, it carried conviction with it. Ferdinand had long been tossed backwards and forwards by opposing influences. He had given no answer to Wallenstein's communication of the terms of peace arranged with Saxony. The necessity of deciding, he said, would not allow him to sleep. It was in his thoughts when he lay down and when he arose. Prayers to God to enlighten the mind of the Emperor had been offered in the churches of Vienna. All this hesitation was now at an end. Ferdinand resolved to continue the war in alliance with Spain, and, as a necessary preliminary, to remove Wallenstein from his generalship. But it was more easily said than done. A declaration was drawn up releasing the army from its obedience to Wallenstein, and provisionally appointing Gallas, who had by this time given assurances of loyalty, to the chief command. It was intended, if circumstances proved favourable, to intrust the command ultimately to the young King of Hungary.
[Sidenotes: § 6. The Generals gained over. § 7. Attempt to seize Wallenstein. § 8. Wallenstein at Pilsen. § 9. The colonels engage to support him.]
The declaration was kept secret for many days. To publish it would only be to provoke the rebellion which was feared. The first thing to be done was to gain over the principal generals. In the beginning of February Piccolomini and Aldringer expressed their readiness to obey the Emperor rather than Wallenstein. Commanders of a secondary rank would doubtless find their position more independent under an inexperienced young man like the King of Hungary than under the first living strategist. These two generals agreed to make themselves masters of Wallenstein's person and to bring him to Vienna to answer the accusations of treason against him. For Oñate this was not enough. It would be easier, he said, to kill the general than to carry him off. The event proved that he was right. On February 7, Aldringer and Piccolomini set off for Pilsen with the intention of capturing Wallenstein. But they found the garrison faithful to its general, and they did not even venture to make the attempt. Wallenstein's success depended on his chance of carrying with him the lower ranks of the army. On the 19th he summoned the colonels round him and assured them that he would stand security for money which they had advanced in raising their regiments, the repayment of which had been called in question. Having thus won them to a favourable mood, he told them that it had been falsely stated that he wished to change his religion and attack the Emperor. On the contrary, he was anxious to conclude a peace which would benefit the Emperor and all who were concerned. As, however, certain persons at Court had objected to it, he wished to ask the opinion of the army on its terms. But he must first of all know whether they were ready to support him, as he knew that there was an intention to put a disgrace upon him.
It was not the first time that Wallenstein had appealed to the colonels. A month before, when the news had come of the alienation of the Court, he had induced them to sign an acknowledgment that they would stand by him, from which all reference to the possibility of his dismissal was expressly excluded. They now, on February 20, signed a fresh agreement, in which they engaged to defend him against the machinations of his enemies, upon his promising to undertake nothing against the Emperor or the Catholic religion.

Section IV.—Assassination of Wallenstein.

Wallenstein thus hoped, with the help of the army, to force the Emperor's hand, and to obtain his signature to the peace. Of the co-operation of the Elector of Saxony he was already secure; and since the beginning of February he had been pressing Oxenstjerna and Bernhard to come to his aid. If all the armies in the field declared for peace, Ferdinand would be compelled to abandon the Spaniards and to accept the offered terms. Without some such hazardous venture, Wallenstein would be checkmated by Oñate. The Spaniard had been unceasingly busy during these weeks of intrigue. Spanish gold was provided to content the colonels for their advances, and hopes of promotion were scattered broadcast amongst them. Two other of the principal generals had gone over to the Court, and on February 18, the day before the meeting at Pilsen, a second declaration had been issued accusing Wallenstein of treason, and formally depriving him of the command.
[Sidenotes: § 1. The garrison of Prague abandons him.]
Wallenstein, before this declaration reached him, had already appointed a meeting of large masses of troops to take place on the White Hill before Prague on the 21st, where he hoped to make his intentions more generally known. But he had miscalculated the devotion of the army to his person. The garrison of Prague refused to obey his orders. Soldiers and citizens alike declared for the Emperor. He was obliged to retrace his steps. 'I had peace in my hands,' he said. Then he added, 'God is righteous,' as if still counting on the aid of Heaven in so good a work.
[Sidenotes: § 2. Understanding with the Swedes. § 3. His arrival at Eger. § 4. Wallenstein's assassination.]
He did not yet despair. He ordered the colonels to meet him at Eger, assuring them that all that he was doing was for the Emperor's good. He had now at last hopes of other assistance. Oxenstjerna, indeed, ever cautious, still refused to do anything for him till he had positively declared against the Emperor. Bernhard, equally prudent for some time, had been carried away by the news, which reached him on the 21st, of the meeting at Pilsen, and the Emperor's denouncement of the general. Though he was still suspicious, he moved in the direction of Eger. On the 24th Wallenstein entered Eger. In what precise way he meant to escape from the labyrinth in which he was, or whether he had still any clear conception of the course before him, it is impossible to say. But Arnim was expected at Eger, as well as Bernhard, and it may be that Wallenstein fancied still that he could gather all the armies of Germany into his hands, to defend the peace which he was ready to make. The great scheme, however, whatever it was, was doomed to failure. Amongst the officers who accompanied him was a Colonel Butler, an Irish Catholic, who had no fancy for such dealings with Swedish and Saxon heretics. Already he had received orders from Piccolomini to bring in Wallenstein dead or alive. No official instructions had been given to Piccolomini. But the thought was certain to arise in the minds of all who retained their loyalty to the Emperor. A general who attempts to force his sovereign to a certain political course with the help of the enemy is placed, by that very fact, beyond the pale of law. The actual decision did not lie with Butler. The fortress was in the hands of two Scotch officers, Leslie and Gordon. As Protestants, they might have been expected to feel some sympathy with Wallenstein. But the sentiment of military honour prevailed.
On the morning of the 25th they were called upon by one of the general's confederates to take orders from Wallenstein alone. 'I have sworn to obey the Emperor,' answered Gordon, at last, 'and who shall release me from my oath?' 'You, gentlemen,' was the reply, 'are strangers in the Empire. What have you to do with the Empire?' Such arguments were addressed to deaf ears. That afternoon Butler, Leslie, and Gordon consulted together. Leslie, usually a silent, reserved man, was the first to speak. 'Let us kill the traitors,' he said. That evening Wallenstein's chief supporters were butchered at a banquet. Then there was a short and sharp discussion whether Wallenstein's life should be spared. Bernhard's troops were known to be approaching, and the conspirators dared not leave a chance of escape open. An Irish captain, Devereux by name, was selected to do the deed. Followed by a few soldiers, he burst into the room where Wallenstein was preparing for rest. 'Scoundrel and traitor,' were the words which Wallenstein flung at Devereux as he entered. Then, stretching out his arms, he received the fatal blow in his breast. The busy brain of the great calculator was still for ever.
[Sidenotes: § 5. Reason of his failure. § 6. Comparison between Gustavus and Wallenstein.]
The attempt to snatch at a wise and beneficent peace by mingled force and intrigue had failed. Other generals—Cæsar, Cromwell, Napoleon—have succeeded to supreme power with the support of an armed force. But they did so by placing themselves at the head of the civil institutions of their respective countries, and by making themselves the organs of a strong national policy. Wallenstein stood alone in attempting to guide the political destinies of a people, while remaining a soldier and nothing more. The plan was doomed to failure, and is only excusable on the ground that there were no national institutions at the head of which Wallenstein could place himself; not even a chance of creating such institutions afresh. In spite of all his faults, Germany turns ever to Wallenstein as she turns to no other amongst the leaders of the Thirty Years' War.
From amidst the divisions and weaknesses of his native country, a great poet enshrined his memory in a succession of noble dramas. Such faithfulness is not without a reason. Gustavus's was a higher
nature than Wallenstein's. Some of his work, at least the rescue of German Protestantism from oppression, remained imperishable, whilst Wallenstein's military and political success vanished into nothingness. But Gustavus was a hero not of Germany as a nation, but of European Protestantism. His Corpus Evangelicorum was at the best a choice of evils to a German. Wallenstein's wildest schemes, impossible of execution as they were by military violence, were always built upon the foundation of German unity. In the way in which he walked that unity was doubtless unattainable. To combine devotion to Ferdinand with religious liberty was as hopeless a conception as it was to burst all bonds of political authority on the chance that a new and better world would spring into being out of the discipline of the camp. But during the long dreary years of confusion which were to follow, it was something to think of the last supremely able man whose life had been spent in battling against the great evils of the land, against the spirit of religious intolerance, and the spirit of division.

Section V.—Imperialist Victories and the Treaty of Prague.

[Sidenotes: § 1. Campaign of 1634.]
For the moment, the House of Austria seemed to have gained everything by the execution or the murder of Wallenstein, whichever we may choose to call it. The army was reorganized and placed under the command of the Emperor's son, the King of Hungary. The Cardinal-Infant, now eagerly welcomed, was preparing to join him through Tyrol. And while on the one side there was union and resolution, there was division and hesitation on the other. The Elector of Saxony stood aloof from the League of Heilbronn, weakly hoping that the terms of peace which had been offered him by Wallenstein would be confirmed by the Emperor now that Wallenstein was gone. Even amongst those who remained under arms there was no unity of purpose.
Bernhard, the daring and impetuous, was not of one mind with the cautious Horn, who commanded the Swedish forces, and
both agreed in thinking Oxenstjerna remiss because he did not supply them with more money than he was able to provide.
[Sidenotes: § 2. The Battle of Nördlingen. § 3. Important results from it. § 4. French intervention.]
As might have been expected under these circumstances, the imperialists made rapid progress. Ratisbon, the prize of Bernhard the year before, surrendered to the King of Hungary in July. Then Donauwörth was stormed, and siege was laid to Nördlingen. On September 2 the Cardinal-Infant came up with 15,000 men. The enemy watched the siege with a force far inferior in numbers. Bernhard was eager to put all to the test of battle. Horn recommended caution in vain. Against his better judgment he consented to fight. On September 6 the attack was made. By the end of the day Horn was a prisoner, and Bernhard was in full retreat, leaving 10,000 of his men dead upon the field, and 6,000 prisoners in the hands of the enemy, whilst the imperialists lost only 1,200 men. Since the day of Breitenfeld, three years before, there had been no such battle fought as this of Nördlingen. As Breitenfeld had recovered the Protestant bishoprics of the north, Nördlingen recovered the Catholic bishoprics of the south. Bernhard's Duchy of Franconia disappeared in a moment under the blow. Before the spring of 1635 came, the whole of South Germany, with the exception of one or two fortified posts, was in the hands of the imperial commanders. The Cardinal-Infant was able to pursue his way to Brussels, with the assurance that he had done a good stroke of work on the way. The victories of mere force are never fruitful of good. As it had been after the successes of Tilly in 1622, and the successes of Wallenstein in 1626 and 1627, so it was now with the successes of the King of Hungary in 1634 and 1635. The imperialist armies had gained victories, and had taken cities. But the Emperor was none the nearer to the confidence of Germans.
An alienated people, crushed by military force, served merely as a bait to tempt foreign aggression, and to make the way easy before it. After 1622, the King of Denmark had
been called in. After 1627, an appeal was made to the King of Sweden. After 1634, Richelieu found his opportunity. The bonds between France and the mutilated League of Heilbronn were drawn more closely. German troops were to be taken into French pay, and the empty coffers of the League were filled with French livres. He who holds the purse holds the sceptre, and the princes of Southern and Western Germany, whether they wished it or not, were reduced to the position of satellites revolving round the central orb at Paris.
[Sidenotes: § 5. The Peace of Prague.]
Nowhere was the disgrace of submitting to French intervention felt so deeply as at Dresden. The battle of Nördlingen had cut short any hopes which John George might have entertained of obtaining that which Wallenstein would willingly have granted him. But, on the other hand, Ferdinand had learned something from experience. He would allow the Edict of Restitution to fall, though he was resolved not to make the sacrifice in so many words. But he refused to replace the Empire in the condition in which it had been before the war. The year 1627 was to be chosen as the starting point for the new arrangement. The greater part of the northern bishoprics would thus be saved to Protestantism. But Halberstadt would remain in the hands of a Catholic bishop, and the Palatinate would be lost to Protestantism for ever. Lusatia, which had been held by the Elector of Saxony as security for his expenses in the war of 1620, was to be ceded to him permanently, and Protestantism in Silesia was to be placed under the guarantee of the Emperor. Finally, Lutheranism alone was still reckoned as the privileged religion, so that Hesse Cassel and the other Calvinist states gained no security at all. On May 30, 1635, a treaty embodying these arrangements was signed at Prague by the representatives of the Emperor and the Elector of Saxony. It was intended not to be a separate treaty, but to be the starting point of a general pacification.
Most of the princes and towns so accepted it, after more or less delay, and acknowledged the supremacy of the Emperor on its conditions. Yet it was not in the nature of things that it should put an end to the war. It was not an agreement which any one was likely to be enthusiastic about. The
ties which bound Ferdinand to his Protestant subjects had been rudely broken, and the solemn promise to forget and forgive could not weld the nation into that unity of heart and spirit which was needed to resist the foreigner.
[Sidenotes: § 6. It fails in securing general acceptance. § 7. Degeneration of the war.]
A Protestant of the north might reasonably come to the conclusion that the price to be paid to the Swede and the Frenchman for the vindication of the rights of the southern Protestants was too high to make it prudent for him to continue the struggle against the Emperor. But it was hardly likely that he would be inclined to fight very vigorously for the Emperor on such terms. If the treaty gave no great encouragement to anyone who was comprehended by it, it threw still further into the arms of the enemy those who were excepted from its benefits. The leading members of the League of Heilbronn were excepted from the general amnesty, though hopes of better treatment were held out to them if they made their submission. The Landgrave of Hesse Cassel was shut out as a Calvinist. Besides such as nourished legitimate grievances, there were others who, like Bernhard, were bent upon carving out a fortune for themselves, or who had so blended their own interests in their minds with consideration for the public good as to lose all sense of any distinction between the two. There was no lack here of materials for a long and terrible struggle. But there was no longer any noble aim in view on either side. The ideal of Ferdinand and Maximilian was gone. The Church was not to recover its lost property. The Empire was not to recover its lost dignity. The ideal of Gustavus of a Protestant political body was equally gone. Even the ideal of Wallenstein, that unity might be founded on an army, had vanished. From henceforth French and Swedes on the one side, Austrians and Spaniards on the other, were busily engaged in riving at the corpse of the dead Empire.
The great quarrel of principle had merged into a mere quarrel between the Houses of Austria and Bourbon, in which the shred of principle which still remained in the
question of the rights of the southern Protestants was almost entirely disregarded.
[Sidenotes: § 8. Condition of Germany. 1636. § 9. Notes of an English traveller.]
Horrible as the war had been from its commencement, it was every day assuming a more horrible character. On both sides all traces of discipline had vanished in the dealings of the armies with the inhabitants of the countries in which they were quartered. Soldiers treated men and women as none but the vilest of mankind would now treat brute beasts. 'He who had money,' says a contemporary, 'was their enemy. He who had none was tortured because he had it not.' Outrages of unspeakable atrocity were committed everywhere. Human beings were driven naked into the streets, their flesh pierced with needles, or cut to the bone with saws. Others were scalded with boiling water, or hunted with fierce dogs. The horrors of a town taken by storm were repeated every day in the open country. Even apart from its excesses, the war itself was terrible enough. When Augsburg was besieged by the imperialists, after their victory at Nördlingen, it contained an industrious population of 70,000 souls. After a siege of seven months, 10,000 living beings, wan and haggard with famine, remained to open the gates to the conquerors, and the great commercial city of the Fuggers dwindled down into a country town. How is it possible to bring such scenes before our eyes in their ghastly reality? Let us turn for the moment to some notes taken by the companion of an English ambassador who passed through the country in 1636. As the party were towed up the Rhine from Cologne, on the track so well known to the modern tourist, they passed by many villages 'pillaged and shot down.' Further on, a French garrison was in Ehrenbreitstein, firing down upon Coblentz, which had just been taken by the imperialists. 'They in the town, if they do but look out of their windows, have a bullet presently presented at their head.' More to the south, things grew worse.
At Bacharach, 'the poor people are found dead with grass in their mouths.' At Rüdesheim, many persons 'were praying where dead bones were in a little old house'; and here his Excellency gave some relief to the poor, 'which were almost starved, as it appeared by the violence they used to get it from one another.' At Mentz, the ambassador was obliged to remain on shipboard, for there was 'nothing to relieve us, since it was taken by the King of Sweden, and miserably battered.'... Here, likewise, the poor people were almost starved, 'and those that could relieve others before now humbly begged to be relieved; and after supper all had relief sent from the ship ashore, at the sight of which they strove so violently that some of them fell into the Rhine, and were like to have been drowned.' Up the Main, again, all the towns, villages, and castles 'be battered, pillaged, or burnt.' After leaving Würzburg, the ambassador's train came to plundered villages, and then to Neustadt, 'which hath been a fair city, though now pillaged and burnt miserably.' Poor children were sitting at their doors 'almost starved to death, his Excellency giving them food and leaving money with their parents to help them, if but for a time.' In the Upper Palatinate, they passed by churches 'demolished to the ground,' and through woods in danger, understanding that 'Croats were lying hereabout.' Further on they stayed for dinner at 'a poor little village which hath been pillaged eight-and-twenty times in two years, and twice in one day.' And so on, and so on. The corner of the veil is lifted up in the pages of the old book, and the rest is left to the imagination to picture forth, as best it may, the misery behind. After reading the sober narrative, we shall perhaps not be inclined to be so very hard upon the Elector of Saxony for making peace at Prague.
CHAPTER X.
THE PREPONDERANCE OF FRANCE.

Section I.—Open Intervention of France.

[Sidenotes: § 1. Protestantism not yet out of danger. § 2. The allies of France.]
The peacemakers of Prague hoped to restore the Empire to its old form. But this could not be. Things done cannot pass away as though they had never been. Ferdinand's attempt to gain a partizan's advantage for his religion by availing himself of legal forms had given rise to a general distrust. Nations and governments, like individual men, are tied and bound by the chain of their sins, from which they can be freed only when a new spirit is breathed into them. Unsatisfactory as the territorial arrangements of the peace were, the entire absence of any constitutional reform in connexion with the peace was more unsatisfactory still. The majority in the two Upper Houses of the Diet was still Catholic; the Imperial Council was still altogether Catholic. It was possible that the Diet and Council, under the teaching of experience, might refrain from pushing their pretensions as far as they had pushed them before; but a government which refrains from carrying out its principles from motives of prudence cannot inspire confidence. A strong central power would never arise in such a way, and a strong central power to defend Germany against foreign invasion was the especial need of the hour. In the failure of the Elector of Saxony to obtain some of the most reasonable of the Protestant demands lay the best excuse of men like Bernhard of Saxe-Weimar and William of Hesse Cassel for refusing the terms of accommodation offered. Largely as personal ambition and greed
of territory found a place in the motives of these men, it is not absolutely necessary to assert that their religious enthusiasm was nothing more than mere hypocrisy. They raised the war-cry of God with us before rushing to the storm of a city doomed to massacre and pillage; they set apart days for prayer and devotion when battle was at hand—veiling, perhaps, from their own eyes the hideous misery which they were spreading around, in contemplation of the loftiness of their aim: for, in all but the most vile, there is a natural tendency to shrink from contemplating the lower motives of action, and to fix the eyes solely on the higher. But the ardour inspired by a military career, and the mere love of fighting for its own sake, must have counted for much; and the refusal to submit to a domination which had been so harshly used soon grew into a restless disdain of all authority whatever. The nobler motives which had imparted a glow to the work of Tilly and Gustavus, and which even lit up the profound selfishness of Wallenstein, flickered and died away, till the fatal disruption of the Empire was accomplished amidst the strivings and passions of heartless and unprincipled men.

§ 3. Foreign intervention.

The work of riving Germany in pieces was not accomplished by Germans alone. As in nature a living organism which has become unhealthy and corrupt is seized upon by the lower forms of animal life, a nation divided amongst itself, and devoid of a sense of life within it higher than the aims of parties and individuals, becomes the prey of neighbouring nations, which would not have ventured to meddle with it in the days of its strength. The carcase was there, and the eagles were gathered together. The gathering of Wallenstein's army in 1632, the overthrow of Wallenstein in 1634, had alike been made possible by the free use of Spanish gold.
The victory of Nördlingen had been owing to the aid of Spanish troops; and the aim of Spain was not the greatness or peace of Germany, but at the best the greatness of the House of Austria in Germany; at the worst, the maintenance of the old system of intolerance and unthinking obedience, which had been the ruin of Germany. With Spain for an ally, France was a necessary enemy. The strife for supreme power
between the two representative states of the old system and the new could not long be delayed, and the German parties would be dragged, consciously or unconsciously, in their wake. If Bernhard became a tool of Richelieu, Ferdinand became a tool of Spain. In this phase of the war Protestantism and Catholicism, tolerance and intolerance, ceased to be the immediate objects of the strife.

§ 4. Alsace and Lorraine.

The possession of Alsace and Lorraine rose into primary importance, not because, as in our own days, Germany needed a bulwark against France, or France needed a bulwark against Germany, but because Germany was not strong enough to prevent these territories from becoming the highway of intercourse between Spain and the Spanish Netherlands. The command of the sea was in the hands of the Dutch, and the valley of the Upper Rhine was the artery through which the life blood of the Spanish monarchy flowed. If Spain or the Emperor, the friend of Spain, could hold that valley, men and munitions of warfare would flow freely to the Netherlands to support the Cardinal-Infant in his struggle with the Dutch. If Richelieu could lay his hand heavily upon it, he had seized his enemy by the throat, and could choke him as he lay.

§ 5. Richelieu asks for fortresses in Alsace.

After the battle of Nördlingen, Richelieu's first demand from Oxenstjerna as the price of his assistance had been the strong places held by Swedish garrisons in Alsace.

§ 6. War between France and Spain.

As soon as he had them safely under his control, he felt himself strong enough to declare war openly against Spain. On May 19, eleven days before peace was agreed upon at Prague, the declaration of war was delivered at Brussels by a French herald. To the astonishment of all, France was able to place in the field what was then considered the enormous number of 132,000 men. One army was to drive the Spaniards out of the Milanese, and to set free the Italian princes.
Another was to defend Lorraine whilst Bernhard crossed the Rhine and carried on war in Germany. The main force
was to be thrown upon the Spanish Netherlands, and, after effecting a junction with the Prince of Orange, was to strike directly at Brussels.

Section II.—Spanish Successes.

§ 1. Failure of the French attack on the Netherlands.

Precisely in the most ambitious part of his programme Richelieu failed most signally. The junction with the Dutch was effected without difficulty; but the hoped-for instrument of success proved the parent of disaster. Whatever Flemings and Brabanters might think of Spain, they soon made it plain that they would have nothing to do with the Dutch. A national enthusiasm against Protestant aggression from the north made defence easy, and the French army had to return completely unsuccessful. Failure, too, was reported from other quarters. The French armies had no experience of war on a large scale, and no military leader of eminent ability had yet appeared to command them. The Italian campaign came to nothing, and it was only by a supreme effort of military skill that Bernhard, driven to retreat, preserved his army from complete destruction.

§ 2. Spanish invasion of France.

In 1636 France was invaded. The Cardinal-Infant crossed the Somme, took Corbie, and advanced to the banks of the Oise. All Paris was in commotion. An immediate siege was expected, and inquiry was anxiously made into the state of the defences. Then Richelieu, coming out of his seclusion, threw himself upon the nation. He appealed to the great legal, ecclesiastical, and commercial corporations of Paris, and he did not appeal in vain. Money, voluntarily offered, came pouring into the treasury for the payment of the troops. Those who had no money gave themselves eagerly for military service. It was remarked that Paris, so fanatically Catholic in the days of St. Bartholomew and the League, entrusted its defence
to the Protestant marshal La Force, whose reputation for integrity inspired universal confidence. The resistance undertaken in such a spirit in Paris was imitated by the other towns of the kingdom. Even the nobility, jealous as they were of the Cardinal, forgot their grievances as an aristocracy in their duties as Frenchmen. Their devotion was not put to the test of action.

§ 3. The invaders driven back.

The invaders, frightened at the unanimity opposed to them, hesitated and turned back. In September, Lewis took the field in person. In November he appeared before Corbie; and the last days of the year saw the fortress again in the keeping of a French garrison. The war, which was devastating Germany, was averted from France by the union produced by the mild tolerance of Richelieu.

§ 4. Battle of Wittstock.

In Germany, too, affairs had taken a turn. The Elector of Saxony had hoped to drive the Swedes across the sea; but a victory gained on October 4, at Wittstock, by the Swedish general, Baner, the ablest of the successors of Gustavus, frustrated his intentions. Henceforward North Germany was delivered over to a desolation with which even the misery inflicted by Wallenstein affords no parallel.

§ 5. Death of Ferdinand II.

Amidst these scenes of failure and misfortune the man whose policy had been mainly responsible for the miseries of his country closed his eyes for ever. On February 15, 1637, Ferdinand II. died at Vienna. Shortly before his death the King of Hungary had been elected King of the Romans, and he now, by his father's death, became the Emperor Ferdinand III.

§ 6. Ferdinand III.

The new Emperor had no vices. He did not even care, as his father did, for hunting and music. When the battle of Nördlingen was won under his command he was praying in his tent whilst his soldiers were fighting. He sometimes took upon himself to give military orders, but the handwriting in which they were conveyed was such an abominable
scrawl that they only served to enable his generals to excuse their defeats by the impossibility of reading their instructions. His great passion was for keeping strict accounts. Even the Jesuits, it is said, found out that, devoted as he was to his religion, he had a sharp eye for his expenditure. One day they complained that some tolls bequeathed to them by his father had not been made over to them, and represented the value of the legacy as a mere trifle of 500 florins a year. The Emperor at once gave them an order upon the treasury for the yearly payment of the sum named, and took possession of the tolls for the maintenance of the fortifications of Vienna. The income thus obtained is said to have been no less than 12,000 florins a year. Such a man was not likely to rescue the Empire from its miseries.

§ 7. Campaign of 1637.

The first year of his reign, however, was marked by a gleam of good fortune. Baner lost all that he had gained at Wittstock, and was driven back to the shores of the Baltic. On the western frontier the imperialists were equally successful. Würtemberg accepted the Peace of Prague, and submitted to the Emperor. A more general peace was talked of. But till Alsace was secured to one side or the other no peace was possible.

Section III.—The Struggle for Alsace.

§ 1. The capture of Breisach.

The year 1638 was to decide the question. Bernhard was looking to the Austrian lands in Alsace and the Breisgau as a compensation for his lost duchy of Franconia. In February he was besieging Rheinfelden. Driven off by the imperialists on the 26th, he re-appeared unexpectedly on March 3, taking the enemy by surprise. They had not even sufficient powder with them to load their guns, and the victory of Rheinfelden was the result. On the 24th Rheinfelden itself surrendered. Freiburg followed its example on April 22, and Bernhard proceeded to undertake the siege of Breisach, the great
fortress which domineered over the whole valley of the Upper Rhine. Small as his force was, he succeeded, by a series of rapid movements, in beating off every attempt to introduce supplies, and on December 19 he entered the place in triumph.

§ 2. The capture a turning point in the war.

The campaign of 1638 was the turning point in the struggle between France and the united House of Austria. A vantage ground was then won which was never lost.

§ 3. Bernhard wishes to keep Breisach.

Bernhard himself, however, was loth to realize the world-wide importance of the events in which he had played his part. He fancied that he had been fighting for his own, and he claimed the lands which he had conquered for himself. He received the homage of the citizens of Breisach in his own name. He celebrated a Lutheran thanksgiving festival in the cathedral. But the French Government looked upon the rise of an independent German principality in Alsace with as little pleasure as the Spanish government had contemplated the prospect of the establishment of Wallenstein in the Palatinate. They ordered Bernhard to place his conquests under the orders of the King of France.

§ 4. Refuses to dismember the Empire.

Strange as it may seem, the man who had done so much to tear in pieces the Empire believed, in a sort of way, in the Empire still. "I will never suffer," he said, in reply to the French demands, "that men can truly reproach me with being the first to dismember the Empire."

§ 5. Death of Bernhard.

The next year he crossed the Rhine with the most brilliant expectations. Baner had recovered strength, and was pushing on through North Germany into Bohemia. Bernhard hoped that he too might strike a blow which would force on a peace on his own conditions. But his greatest achievement, the capture of Breisach, was also his last. A fatal disease seized upon him when he had hardly entered upon the campaign. On July 8, 1639, he died.
§ 6. Alsace in French possession.

There was no longer any question of the ownership of the fortresses in Alsace and the Breisgau. French governors entered into possession. A French general took the command of Bernhard's army. For the next two or three years Bernhard's old troops fought up and down Germany in conjunction with Baner, not without success, but without any decisive victory. The French soldiers were becoming, like the Germans, inured to war. The lands on the Rhine were not easily to be wrenched out of the strong hands which had grasped them.

Section IV.—French Successes.

§ 1. State of Italy.

Richelieu had other successes to count besides these victories on the Rhine. In 1637 the Spaniards drove out of Turin the Duchess-Regent Christina, the mother of the young Duke of Savoy. She was a sister of the King of France; and, even if that had not been the case, the enemy of Spain was, in the nature of the case, the friend of France. In 1640 she re-entered her capital with French assistance.

§ 2. Maritime warfare.

At sea, too, where Spain, though unable to hold its own against the Dutch, had long continued to be superior to France, the supremacy of Spain was coming to an end. During the whole course of his ministry, Richelieu had paid special attention to the encouragement of commerce and the formation of a navy. Troops could no longer be despatched with safety to Italy from the coasts of Spain. In 1638 a French squadron burnt Spanish galleys in the Bay of Biscay.

§ 3. The Spanish fleet in the Downs.

In 1639 a great Spanish fleet on its way to the Netherlands was strong enough to escape the French, who were watching to intercept it. It sailed up the English Channel with the not distant goal of the Flemish ports almost in view. But the huge galleons were ill-manned and ill-found. They were still less able to resist the lighter, well-equipped vessels of the Dutch fleet, which was waiting to
intercept them, than the Armada had been able to resist Drake and Raleigh fifty-one years before. The Spanish commander sought refuge in the Downs, under the protection of the neutral flag of England. The French ambassador pleaded hard with the king of England to allow the Dutch to follow up their success. The Spanish ambassador pleaded hard with him for protection to those who had taken refuge on his shores. Charles saw in the occurrence an opportunity to make a bargain with one side or the other. He offered to abandon the Spaniards if the French would agree to restore his nephew, Charles Lewis, the son of his sister Elizabeth, to his inheritance in the Palatinate. He offered to protect the Spaniards if Spain would pay him the large sum which he would want for the armaments needed to bid defiance to France.

§ 4. Destruction of the fleet.

Richelieu had no intention of completing the bargain offered to him. He deluded Charles with negotiations, whilst the Dutch admiral treated the English neutrality with scorn. He dashed amongst the tall Spanish ships as they lay anchored in the Downs: some he sank, some he set on fire. Eleven of the galleons were soon destroyed. The remainder took advantage of a thick fog, slipped across the Straits, and placed themselves in safety under the guns of Dunkirk. Never again did such a fleet as this venture to leave the Spanish coast for the harbours of Flanders. The injury to Spain went far beyond the actual loss. Coming, as the blow did, within a few months after the surrender of Breisach, it all but severed the connexion for military purposes between Brussels and Madrid.

§ 5. France and England.

Charles at first took no umbrage at the insult. He still hoped that Richelieu would forward his nephew's interests, and he even expected that Charles Lewis would be placed by the King of France in command of the army which had been under Bernhard's orders.
But Richelieu was in no mood to place a German at the head of these well-trained veterans, and the proposal was definitively rejected. The King of England, dissatisfied at this repulse, inclined once more to the side
of Spain. But Richelieu found a way to prevent Spain from securing even what assistance it was in the power of a king so unpopular as Charles to render. It was easy to enter into communication with Charles's domestic enemies. His troubles, indeed, were mostly of his own making, and he would doubtless have lost his throne whether Richelieu had stirred the fire or not. But the French minister contributed all that was in his power to make the confusion greater, and encouraged, as far as possible, the resistance which had already broken out in Scotland, and which was threatening to break out in England.

§ 6. Insurrection in Catalonia. § 7. Break-up of the Spanish monarchy.

The failure of 1636 had been fully redeemed. No longer attacking any one of the masses of which the Spanish monarchy was composed, Richelieu placed his hands upon the lines of communication between them. He made his presence felt not at Madrid, at Brussels, at Milan, or at Naples, but in Alsace, in the Mediterranean, in the English Channel. The effect was as complete as is the effect of snapping the wire of a telegraph. At once the Peninsula startled Europe by showing signs of dissolution. In 1639 the Catalonians had manfully defended Roussillon against a French invasion. In 1640 they were prepared to fight with equal vigour. But the Spanish Government, in its desperate straits, was not content to leave them to combat in their own way, after the irregular fashion which befitted mountaineers. Orders were issued commanding all men capable of fighting to arm themselves for the war, all women to bear food and supplies for the army on their backs. A royal edict followed, threatening those who showed themselves remiss with imprisonment and the confiscation of their goods. The cord which bound the hearts of Spaniards to their king was a strong one; but it snapped at last. It was not by threats that Richelieu had defended France in 1636.
The old traditions of provincial independence were strong in Catalonia, and the Catalans were soon in full revolt. Who were they, to be driven to the combat by
menaces, as the Persian slaves had been driven on at Thermopylæ by the blows of their masters' officers?

§ 8. Independence of Portugal.

Equally alarming was the news which reached Madrid from the other side of the Peninsula. Ever since the days of Philip II. Portugal had formed an integral part of the Spanish monarchy. In December 1640 Portugal renounced its allegiance, and reappeared amongst European States under a sovereign of the House of Braganza.

§ 9. Failure of Soissons in France.

Everything prospered in Richelieu's hands. In 1641 a fresh attempt was made by the partizans of Spain to raise France against him. The Count of Soissons, a prince of the blood, placed himself at the head of an imperialist army to attack his native country. He succeeded in defeating the French forces sent to oppose him not far from Sedan. But a chance shot passing through the brain of Soissons made the victory a barren one. His troops, without the support of his name, could not hope to rouse the country against Richelieu. They had become mere invaders, and they were far too few to think of conquering France.

§ 10. Richelieu's last days.

Equal success attended the French arms in Germany. In 1641 Guebriant, with his German and Swedish army, defeated the imperialists at Wolfenbüttel, in the north. In 1642 he defeated them again at Kempten, in the south. In the same year Roussillon submitted to France. Nor was Richelieu less fortunate at home. The conspiracy of a young courtier, the last of the efforts of the aristocracy to shake off the heavy rule of the Cardinal, was detected, and expiated on the scaffold. Richelieu did not long survive his latest triumph. He died on December 4, 1642.

Section V.—Aims and Character of Richelieu.