Lecture Notes in Computer Science 6590
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Per Stenström (Ed.)
Transactions on
High-Performance
Embedded Architectures
and Compilers III
Volume Editor
Per Stenström
Chalmers University of Technology
Department of Computer Science and Engineering
412 96 Gothenburg, Sweden
E-mail: per.stenstrom@chalmers.se
ISSN 0302-9743 (LNCS) e-ISSN 1611-3349 (LNCS)
ISSN 1864-306X (THIPEAC) e-ISSN 1864-3078 (THIPEAC)
ISBN 978-3-642-19447-4 e-ISBN 978-3-642-19448-1
DOI 10.1007/978-3-642-19448-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2007923068
CR Subject Classification (1998): B.2, C.1, D.3.4, B.5, C.2, D.4
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Editor-in-Chief’s Message
It is my pleasure to introduce you to the third volume of Transactions on High-
Performance Embedded Architectures and Compilers. This journal was created as
an archive for scientific articles in the converging fields of high-performance and
embedded computer architectures and compiler systems. Design considerations
in both general-purpose and embedded systems are increasingly being based on
similar scientific insights. For example, a state-of-the-art game console today
consists of a powerful parallel computer whose building blocks are the same as
those found in computational clusters for high-performance computing. More-
over, keeping power/energy consumption at a low level for high-performance
general-purpose systems as well as in, for example, mobile embedded systems is
equally important in order to either keep heat dissipation at a manageable level
or to maintain a long operating time despite the limited battery capacity. It is
clear that similar scientific issues have to be solved to build competitive systems
in both segments. Additionally, for high-performance systems to be realized – be
it embedded or general-purpose – a holistic design approach has to be taken by
factoring in the impact of applications as well as the underlying technology when
making design trade-offs. The main topics of this journal reflect this development
and include (among others):
– Processor architecture, e.g., network and security architectures, application-specific processors and accelerators, and reconfigurable architectures
– Memory system design
– Power, temperature, performance, and reliability constrained designs
– Evaluation methodologies, program characterization, and analysis techniques
– Compiler techniques for embedded systems, e.g., feedback-directed optimization, dynamic compilation, adaptive execution, continuous profiling/optimization, back-end code generation, and binary translation/optimization
– Code size/memory footprint optimizations
This volume contains 14 papers divided into four sections. The first section
is a special section containing the top four papers from the Third International
Conference on High-Performance and Embedded Architectures and Compilers -
HiPEAC. I would like to thank Manolis Katevenis (University of Crete and
FORTH) and Rajiv Gupta (University of California at Riverside) for acting as
guest editors of that section. Papers in this section deal with cache performance
issues and improved branch prediction.
The second section is a set of four papers providing a snapshot from the
Eighth MEDEA Workshop. I am indebted to Sandro Bartolini and Pierfrancesco
Foglia for putting together this special section.
The third section contains two regular papers and the fourth section pro-
vides a snapshot from the First Workshop on Programmability Issues for Mul-
ticore Computers (MULTIPROG). The organizers – Eduard Ayguade, Roberto
Gioiosa, and Osman Unsal – have put together this section. I thank them for
their effort.
The editorial board has worked diligently to handle the papers for the journal.
I would like to thank all the contributing authors, editors, and reviewers for their
excellent work.
Per Stenström, Chalmers University of Technology
Editor-in-chief
Transactions on HiPEAC
Editorial Board
Per Stenström is a professor of computer engineering at Chalmers University
of Technology. His research interests are devoted to design principles for high-
performance computer systems and he has made multiple contributions to espe-
cially high-performance memory systems. He has authored or co-authored three
textbooks and more than 100 publications in international journals and con-
ferences. He regularly serves Program Committees of major conferences in the
computer architecture field. He is also an associate editor of IEEE Transactions
on Parallel and Distributed Systems, a subject-area editor of
the Journal of Parallel and Distributed Computing, an associate editor of the
IEEE TCCA Computer Architecture Letters, and the founding Editor-in-Chief
of Transactions on High-Performance Embedded Architectures and Compilers.
He co-founded the HiPEAC Network of Excellence funded by the European
Commission. He has acted as General and Program Chair for a large number
of conferences including the ACM/IEEE Int. Symposium on Computer Archi-
tecture, the IEEE High-Performance Computer Architecture Symposium, and
the IEEE Int. Parallel and Distributed Processing Symposium. He is a Fellow
of the ACM and the IEEE and a member of Academia Europaea and the Royal
Swedish Academy of Engineering Sciences.
Koen De Bosschere obtained his PhD from Ghent University in 1992. He is a
professor in the ELIS Department at the Universiteit Gent where he teaches
courses on computer architecture and operating systems. His current research
interests include computer architecture, system software, and code optimization.
He has co-authored 150 contributions in the domain of optimization, perfor-
mance modeling, microarchitecture, and debugging. He is the coordinator of the
ACACES research network and of the European HiPEAC2 network. Contact
him at Koen.DeBosschere@elis.UGent.be.
Jose Duato is a professor in the Department of Computer Engineering (DISCA)
at UPV, Spain. His research interests include interconnection networks and mul-
tiprocessor architectures. He has published over 340 papers. His research results
have been used in the design of the Alpha 21364 microprocessor, the Cray T3E,
IBM BlueGene/L, and Cray Black Widow supercomputers. Dr. Duato is the
first author of the book Interconnection Networks: An Engineering Approach.
He has served as associate editor of IEEE TPDS and IEEE TC. He was General
Co-chair of ICPP 2001, Program Chair of HPCA-10, and Program Co-chair of
ICPP 2005. Also, he has served as Co-chair, Steering Committee member, Vice-
Chair, or Program Committee member in more than 55 conferences, including
HPCA, ISCA, IPPS/SPDP, IPDPS, ICPP, ICDCS, Europar, and HiPC.
Manolis Katevenis received his PhD degree from U.C. Berkeley in 1983 and the
ACM Doctoral Dissertation Award in 1984 for his thesis on “Reduced Instruc-
tion Set Computer Architectures for VLSI.” After a brief term on the faculty of
Computer Science at Stanford University, he has been in Greece, with the Uni-
versity of Crete and with FORTH since 1986. After RISC, his research has been
on interconnection networks and interprocessor communication. In packet switch
architectures, his contributions since 1987 have been mostly in per-flow queue-
ing, credit-based flow control, congestion management, weighted round-robin
scheduling, buffered crossbars, and non-blocking switching fabrics. In multipro-
cessing and clustering, his contributions since 1993 have been on remote-write-
based, protected, user-level communication.
His home URL is http://archvlsi.ics.forth.gr/~kateveni
Michael O’Boyle is a professor in the School of Informatics at the University
of Edinburgh and an EPSRC Advanced Research Fellow. He received his PhD
in Computer Science from the University of Manchester in 1992. He was for-
merly a SERC Postdoctoral Research Fellow, a Visiting Research Scientist at
IRISA/INRIA Rennes, a Visiting Research Fellow at the University of Vienna
and a Visiting Scholar at Stanford University. More recently he was a Visiting
Professor at UPC, Barcelona.
Dr. O’Boyle’s main research interests are in adaptive compilation, formal
program transformation representations, the compiler impact on embedded sys-
tems, compiler directed low-power optimization and automatic compilation for
parallel single-address space architectures. He has published over 50 papers in
international journals and conferences in this area and manages the Compiler
and Architecture Design group consisting of 18 members.
Cosimo Antonio Prete is a full professor of computer systems at the Univer-
sity of Pisa, Italy, faculty member of the PhD School in Computer Science and
Engineering (IMT), Italy. He is Coordinator of the Graduate Degree Program
in Computer Engineering and Rector’s Adviser for Innovative Training Tech-
nologies at the University of Pisa. His research interests are focused on multi-
processor architectures, cache memory, performance evaluation and embedded
systems. He is an author of more than 100 papers published in international
journals and conference proceedings. He has been project manager for several
research projects, including: the SPP project, OMI, Esprit IV; the CCO project,
supported by VLSI Technology, Sophia Antipolis; the ChArm project, supported
by VLSI Technology, San Jose, and the Esprit III Tracs project.
André Seznec is “directeur de recherches” at IRISA/INRIA. Since 1994, he
has been the head of the CAPS (Compiler Architecture for Superscalar and
Special-purpose Processors) research team. He has been conducting research
on computer architecture for more than 20 years. His research topics have in-
cluded memory hierarchy, pipeline organization, simultaneous multithreading
and branch prediction. In 1999–2000, he spent a sabbatical year with the Alpha
Group at Compaq.
Olivier Temam obtained a PhD in computer science from the University of
Rennes in 1993. He was assistant professor at the University of Versailles from
1994 to 1999, and then professor at the University of Paris Sud until 2004. Since
then, he has been a senior researcher at INRIA Futurs in Paris, where he heads the
Alchemy group. His research interests include program optimization, processor
architecture, and emerging technologies, with a general emphasis on long-term
research.
Theo Ungerer is Chair of Systems and Networking at the University of Augsburg,
Germany, and Scientific Director of the Computing Center of the University of
Augsburg. He received a Diploma in Mathematics at the Technical University
of Berlin in 1981, a Doctoral Degree at the University of Augsburg in 1986,
and a second Doctoral Degree (Habilitation) at the University of Augsburg in
1992. Before his current position he was scientific assistant at the University of
Augsburg (1982–1989 and 1990–1992), visiting assistant professor at the Uni-
versity of California, Irvine (1989–1990), professor of computer architecture at
the University of Jena (1992–1993) and the Technical University of Karlsruhe
(1993–2001). He is Steering Committee member of HiPEAC and of the German
Science Foundation’s priority programme on “Organic Computing.” His current
research interests are in the areas of embedded processor architectures, embed-
ded real-time systems, organic, bionic and ubiquitous systems.
Mateo Valero obtained his PhD at UPC in 1980. He is a professor in the
Computer Architecture Department at UPC. His research interests focus on
high-performance architectures. He has published approximately 400 papers on
these topics. He is the director of the Barcelona Supercomputing Center, the
National Center of Supercomputing in Spain. Dr. Valero has been honored with
several awards, including the King Jaime I award by the Generalitat Valen-
ciana, and the Spanish national award “Julio Rey Pastor” for his research on IT
technologies. In 2001, he was appointed Fellow of the IEEE, in 2002 Intel Distin-
guished Research Fellow and since 2003 a Fellow of the ACM. Since 1994, he has
been a foundational member of the Royal Spanish Academy of Engineering. In
2005 he was elected Correspondant Academic of the Spanish Royal Academy of
Sciences, and his native town of Alfamén named their public college after him.
Georgi Gaydadjiev is a professor in the computer engineering laboratory of the
Technical University of Delft, The Netherlands. His research interests focus on
many aspects of embedded systems design with an emphasis on reconfigurable
computing. He has published about 50 papers on these topics in international
refereed journals and conferences. He has acted as Program Committee mem-
ber of many conferences and is subject area editor for the Journal of Systems
Architecture.
Table of Contents
Third International Conference on High-Performance
and Embedded Architectures and Compilers
(HiPEAC)
Dynamic Cache Partitioning Based on the MLP of Cache Misses . . . . . . . 3
Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and
Mateo Valero
Cache Sensitive Code Arrangement for Virtual Machine. . . . . . . . . . . . . . . 24
Chun-Chieh Lin and Chuen-Liang Chen
Data Layout for Cache Performance on a Multithreaded Architecture . . . 43
Subhradyuti Sarkar and Dean M. Tullsen
Improving Branch Prediction by Considering Affectors and Affectees
Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Yiannakis Sazeides, Andreas Moustakas, Kypros Constantinides, and
Marios Kleanthous
Eighth MEDEA Workshop (Selected Papers)
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Sandro Bartolini, Pierfrancesco Foglia, and Cosimo Antonio Prete
Exploring the Architecture of a Stream Register-Based Snoop Filter . . . . 93
Matthias Blumrich, Valentina Salapura, and Alan Gara
CROB: Implementing a Large Instruction Window through
Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Fernando Latorre, Grigorios Magklis, Jose González,
Pedro Chaparro, and Antonio González
Power-Aware Dynamic Cache Partitioning for CMPs . . . . . . . . . . . . . . . . . 135
Isao Kotera, Kenta Abe, Ryusuke Egawa, Hiroyuki Takizawa, and
Hiroaki Kobayashi
A Multithreaded Multicore System for Embedded Media Processing . . . . 154
Jan Hoogerbrugge and Andrei Terechko
Regular Papers
Parallelization Schemes for Memory Optimization on the Cell
Processor: A Case Study on the Harris Corner Detector . . . . . . . . . . . . . . . 177
Tarik Saidani, Lionel Lacassagne, Joel Falcou, Claude Tadonki, and
Samir Bouaziz
Constructing Application-Specific Memory Hierarchies on FPGAs . . . . . . 201
Harald Devos, Jan Van Campenhout, Ingrid Verbauwhede, and
Dirk Stroobandt
First Workshop on Programmability Issues for
Multi-core Computers (MULTIPROG)
autopin – Automated Optimization of Thread-to-Core Pinning on
Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis
Robust Adaptation to Available Parallelism in Transactional Memory
Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Mohammad Ansari, Mikel Luján, Christos Kotselidis, Kim Jarvis,
Chris Kirkham, and Ian Watson
Efficient Partial Roll-Backing Mechanism for Transactional Memory
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
M.M. Waliullah
Software-Level Instruction-Cache Leakage Reduction Using
Value-Dependence of SRAM Leakage in Nanometer Technologies . . . . . . . 275
Maziar Goudarzi, Tohru Ishihara, and Hamid Noori
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Dynamic Cache Partitioning Based on the MLP
of Cache Misses
Miquel Moreto¹, Francisco J. Cazorla², Alex Ramirez¹,², and Mateo Valero¹,²
¹ Universitat Politècnica de Catalunya, DAC, Barcelona, Spain
  HiPEAC European Network of Excellence
² Barcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain
{mmoreto,aramirez,mateo}@ac.upc.edu, francisco.cazorla@bsc.es
Abstract. Dynamic partitioning of shared caches has been proposed
to improve performance of traditional eviction policies in modern multi-
threaded architectures. All existing Dynamic Cache Partitioning (DCP)
algorithms work on the number of misses caused by each thread and
treat all misses equally. However, it has been shown that cache misses
cause different impact in performance depending on their distribution.
Clustered misses share their miss penalty as they can be served in par-
allel, while isolated misses have a greater impact on performance as the
memory latency is not shared with other misses.
We take this fact into account and propose a new DCP algorithm that
considers misses differently depending on their influence in performance.
Our proposal obtains improvements over traditional eviction policies up
to 63.9% (10.6% on average) and it also outperforms previous DCP pro-
posals by up to 15.4% (4.1% on average) in a four-core architecture. Our
proposal reaches the same performance as a 50% larger shared cache. Fi-
nally, we present a practical implementation of our proposal that requires
less than 8KB of storage.
1 Introduction
The limitation imposed by instruction-level parallelism (ILP) has motivated
the use of thread-level parallelism (TLP) as a common strategy for improv-
ing processor performance. TLP paradigms such as simultaneous multithreading
(SMT) [1,2], chip multiprocessor (CMP) [3] and combinations of both offer the
opportunity to obtain higher throughputs. However, they also have to face the
challenge of sharing resources of the architecture. Simply avoiding any resource
control can lead to undesired situations where one thread is monopolizing all the
resources and harming the other threads. Some studies deal with the resource
sharing problem in SMTs at core level resources like issue queues, registers,
etc. [4]. In CMPs, resource sharing is focused on the cache hierarchy.
Some applications present low reuse of their data and pollute caches with
data streams, such as multimedia, communications or streaming applications,
or have many compulsory misses that cannot be solved by assigning more cache
space to the application. Traditional eviction policies such as Least Recently
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 3–23, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Used (LRU), pseudo LRU or random are demand-driven, that is, they tend
to give more space to the application that has more accesses and misses to
the cache hierarchy [5, 6]. As a consequence, some threads can suffer a severe
degradation in performance. Previous work has tried to solve this problem by
using static and dynamic partitioning algorithms that monitor the L2 cache
accesses and decide a partition for a fixed amount of cycles in order to maximize
throughput [7,8,9] or fairness [10]. Basically, these proposals predict the number
of misses per application for each possible cache partition. Then, they use the
cache partition that leads to the minimum number of misses for the next interval.
A common characteristic of these proposals is that they treat all L2 misses
equally. However, in out-of-order architectures L2 misses affect performance
differently depending on how clustered they are. An isolated L2 miss has
approximately the same miss penalty as a cluster of L2 misses, as they can be served
in parallel if they all fit in the reorder buffer (ROB) [11]. In Figure 1 we can
see this behavior. We have represented an ideal IPC curve that is constant until
an L2 miss occurs. After some cycles, commit stops. When the cache line comes
from main memory, commit ramps up to its steady state value. As a consequence,
an isolated L2 miss has a higher impact on performance than a miss in a burst
of misses as the memory latency is shared by all clustered misses.
Fig. 1. Isolated and clustered L2 misses: (a) an isolated L2 miss; (b) clustered L2 misses
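This intuition can be sketched numerically: if each miss is charged the memory latency divided by the number of misses it overlaps with in time, an isolated miss pays the full penalty while a burst shares it. The following is an illustrative sketch, not the paper's exact cost mechanism; the latency value and the overlap test are assumptions.

```python
MEM_LATENCY = 300  # assumed L2 miss latency in cycles (illustrative)

def mlp_costs(miss_cycles, latency=MEM_LATENCY):
    """miss_cycles: issue cycle of each L2 miss, sorted ascending.
    Two misses overlap if their service windows of `latency` cycles
    intersect; each miss is charged latency / (number of overlapping misses)."""
    costs = []
    for c in miss_cycles:
        overlap = sum(1 for c2 in miss_cycles if abs(c2 - c) < latency)
        costs.append(latency / overlap)
    return costs

# An isolated miss pays the full latency...
print(mlp_costs([1000]))              # [300.0]
# ...while three clustered misses split it.
print(mlp_costs([1000, 1010, 1020]))  # [100.0, 100.0, 100.0]
```

Under this model, minimizing total cost (rather than total misses) favors partitions that avoid creating isolated misses.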
Based on this fact, we propose a new DCP algorithm that gives a cost to each
L2 access according to its impact in final performance. We detect isolated and
clustered misses and assign a higher cost to isolated misses. Then, our algorithm
determines the partition that minimizes the total cost for all threads, which is
used in the next interval. Our results show that differentiating between clustered
and isolated L2 misses leads to cache partitions with higher performance than
previous proposals. The main contributions of this work are the following.
1) A runtime mechanism to dynamically partition shared L2 caches in a CMP
scenario that takes into account the MLP of each L2 access. We obtain improve-
ments over LRU up to 63.9% (10.6% on average) and over previous proposals
up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches
the same performance as a 50% larger shared cache.
2) We extend previous workload classifications to CMP architectures with
more than two cores, which allows results to be analyzed within each workload group.
3) We present a sampling technique that reduces the hardware cost in terms
of storage to less than 1% of the total L2 cache size with an average throughput
degradation of 0.76% (compared to the throughput obtained without sampling).
We also show that scalable algorithms to decide cache partitions give near opti-
mal partitions, 0.59% close to the optimal decision.
The rest of this paper is structured as follows. Section 2 introduces the meth-
ods that have been previously proposed to decide L2 cache partitions and related
work. Next, Section 3 explains our MLP-aware DCP algorithm. Section 4 de-
scribes the experimental environment and in Section 5 we discuss simulation
results. Finally, Section 6 summarizes our results.
2 Prior Work in Dynamic Cache Partitioning
Stack Distance Histogram (SDH). Mattson et al. introduce the concept of
stack distance to study the behavior of storage hierarchies [12]. Common eviction
policies such as LRU have the stack property. Thus, each set in a cache can be
seen as an LRU stack, where lines are sorted by their last access cycle. In that
way, the first line of the LRU stack is the Most Recently Used (MRU) line while
the last line is the LRU line. The position that a line has in the LRU stack
when it is accessed again is defined as the stack distance of the access. As an
example, we can see in Table 1(a) a stream of accesses to the same set with their
corresponding stack distances.
Table 1. Stack Distance Histogram

(a) Stream of accesses to a given cache set:

    # Reference     1  2  3  4  5  6  7  8
    Cache Line      A  B  C  C  A  D  B  D
    Stack Distance  -  -  -  1  3  -  4  2

(b) SDH example:

    Stack Distance  1   2   3   4   >4
    # Accesses      60  20  10  5   5
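The stack distances in Table 1(a) can be reproduced with a small single-set LRU-stack model (an illustrative sketch):

```python
def stack_distances(accesses):
    """Return the stack distance of each access to one cache set
    (None for a first-touch miss); the set is modeled as an LRU stack."""
    stack = []   # lines ordered MRU first
    dists = []
    for line in accesses:
        if line in stack:
            dists.append(stack.index(line) + 1)  # 1-based stack position
            stack.remove(line)
        else:
            dists.append(None)  # not in the stack: compulsory miss
        stack.insert(0, line)   # accessed line becomes the MRU line
    return dists

# The access stream from Table 1(a):
print(stack_distances(list("ABCCADBD")))
# [None, None, None, 1, 3, None, 4, 2]
```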
For a K-way associative cache with LRU replacement algorithm, we need
K + 1 counters to build SDHs, denoted C1, C2, . . . , CK, C>K. On each cache
access, one of the counters is incremented. If it is a cache access to a line in
the i-th position in the LRU stack of the set, Ci is incremented. If it is a cache
miss, the line is not found in the LRU stack and, as a result, we increment
the miss counter C>K. SDHs can be obtained during execution by running the
thread alone in the system [7] or by adding some hardware counters that profile
this information [8, 9]. A characteristic of these histograms is that the number
of cache misses for a smaller cache with the same number of sets can be easily
computed. For example, for a K′-way associative cache, where K′ < K, the new
number of misses can be computed as misses′ = C>K + Σ_{i=K′+1}^{K} Ci.
As an example, in Table 1(b) we show an SDH for a set with 4 ways. Here,
we have 5 cache misses. However, if we reduce the number of ways to 2 (keeping
the number of sets constant), we will experience 20 misses (5 + 5 + 10).
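The formula above is easy to check against the example SDH (a sketch; hit counters are indexed from 1):

```python
def misses_for_ways(sdh_hits, sdh_miss, k_prime):
    """sdh_hits[i-1] holds C_i for a K-way cache; sdh_miss holds C_>K.
    Returns the miss count for a k_prime-way cache with the same sets:
    accesses with stack distance > k_prime become misses."""
    return sdh_miss + sum(sdh_hits[k_prime:])

# The SDH of Table 1(b): C1..C4 = 60, 20, 10, 5 and C_>4 = 5.
print(misses_for_ways([60, 20, 10, 5], 5, 4))  # 5 misses with 4 ways
print(misses_for_ways([60, 20, 10, 5], 5, 2))  # 20 misses with 2 ways
```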
Minimizing Total Misses. Using the SDHs of N applications, we can de-
rive the L2 cache partition that minimizes the total number of misses: this last
number corresponds to the sum of the number of misses of each thread for the
given configuration. The optimal partition in the last period of time is a suitable
candidate to become the future optimal partition. Partitions are decided period-
ically after a fixed amount of cycles. In this scenario, partitions are decided at a
way granularity. This mechanism is used in order to minimize the total number
of misses and try to maximize throughput. A first approach proposed a static
partitioning of the L2 cache using profiling information [7]. Then, a dynamic ap-
proach estimated SDHs with information inside the cache [9]. Finally, Qureshi
et al. presented a suitable and scalable circuit to measure SDHs using sampling
and obtained performance gains with just 0.2% extra space in the L2 cache [8].
Throughout this paper, we will call this last policy MinMisses.
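For two threads, the MinMisses decision reduces to trying every split of the ways and keeping the one with the fewest predicted total misses. A sketch of that search, with made-up SDH values for illustration:

```python
def min_misses_partition(sdh_a, sdh_b, assoc):
    """sdh_x = (hit counters C_1..C_K, miss counter C_>K) per thread.
    Try every split of `assoc` ways between two threads (at least one
    way each) and return (total_misses, ways_a, ways_b) minimizing misses."""
    def misses(sdh, ways):
        hits, miss = sdh
        return miss + sum(hits[ways:])
    best = None
    for ways_a in range(1, assoc):
        total = misses(sdh_a, ways_a) + misses(sdh_b, assoc - ways_a)
        if best is None or total < best[0]:
            best = (total, ways_a, assoc - ways_a)
    return best

# Hypothetical SDHs for an 8-way shared cache: thread A reuses a small
# working set, while thread B streams and rarely hits.
sdh_a = ([90, 40, 20, 10, 5, 2, 1, 1], 10)
sdh_b = ([5, 2, 1, 1, 1, 1, 1, 1], 200)
print(min_misses_partition(sdh_a, sdh_b, 8))  # → (218, 6, 2)
```

With these (invented) histograms, the best split gives the reuse-friendly thread 6 ways and the streaming thread 2.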
Fair Partitioning. In some situations, MinMisses can lead to unfair parti-
tions that assign nearly all the resources to one thread while harming the oth-
ers [10]. For that reason, the authors propose considering fairness when deciding
new partitions. In that way, instead of minimizing the total number of misses,
they try to equalize the statistic Xi = misses_shared_i / misses_alone_i of each
thread i, that is, to force all threads to have the same percentage increase in misses. Partitions
are decided periodically using an iterative method. The thread with largest Xi
receives a way from the thread with smallest Xi until all threads have a similar
value of Xi. Throughout this paper, we will call this policy Fair.
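The iterative method can be sketched as follows (a simplification: the miss predictor, the tolerance, and the stopping rule are assumptions; the real proposal repartitions periodically from measured statistics):

```python
def fair_partition(pred_misses, misses_alone, ways, tol=0.2, max_steps=20):
    """Repeatedly move one way from the thread with the smallest
    X_i = misses_shared_i / misses_alone_i to the one with the largest,
    until the X_i values are within `tol` of each other.
    pred_misses(i, w) predicts thread i's shared-cache misses with w ways."""
    n = len(misses_alone)
    for _ in range(max_steps):
        x = [pred_misses(i, ways[i]) / misses_alone[i] for i in range(n)]
        donor = min(range(n), key=lambda i: x[i])
        receiver = max(range(n), key=lambda i: x[i])
        if x[receiver] - x[donor] < tol or ways[donor] <= 1:
            break
        ways[donor] -= 1
        ways[receiver] += 1
    return ways

# Hypothetical model: thread i's shared-cache misses shrink as 1/ways.
base = [400, 100]   # assumed per-thread miss pressure
alone = [100, 100]  # misses when running alone
print(fair_partition(lambda i, w: base[i] / w, alone, [4, 4]))  # → [6, 2]
```

The suffering thread (larger Xi) keeps receiving ways until the relative miss increases roughly match.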
Table 2. Different Partitioning Proposals

Paper  Partitioning  Objective         Decision          Algorithm      Eviction Policy
[7]    Static        Minimize Misses   Programmer        -              Column Caching
[9]    Dynamic       Minimize Misses   Architecture      Marginal Gain  Augmented LRU
[8]    Dynamic       Maximize Utility  Architecture      Lookahead      Augmented LRU
[10]   Dynamic       Fairness          Architecture      Equalize Xi    Augmented LRU
[13]   Dynamic       Maximize Reuse    Architecture      Reuse          Column Caching
[14]   Dyn./Static   Configurable      Operating System  Configurable   Augmented LRU
Other Related Work. Several papers propose different DCP algorithms in a
multithreaded scenario. In Table 2 we summarize these proposals with their most
significant characteristics. Settle et al. introduce a DCP similar to MinMisses
that decides partitions depending on the average data reuse of each application
[13]. Rafique et al. propose to manage shared caches with a hardware cache
quota enforcement mechanism and an interface between the architecture and
the OS to let the latter decide quotas [14]. Note that this mechanism is
completely orthogonal to our proposal; in fact, the two are compatible, as we
can let the OS decide quotas according to our scheme. Hsu et al. evaluate
different cache policies in a CMP scenario [15]. They show that none of them is
optimal among all benchmarks and that the best cache policy varies depending
on the performance metric being used. Thus, they propose to use a thread-aware
Dynamic Cache Partitioning Based on the MLP of Cache Misses 7
cache resource allocation. In fact, their results reinforce the motivation of our
paper: if we do not consider the impact of each L2 miss in performance, we can
decide suboptimal L2 partitions in terms of throughput.
Cache partitions at a way granularity can be implemented with column caching
[7], which uses a bit mask to mark reserved ways, or by augmenting the LRU
policy with counters that keep track of the number of lines in a set belonging
to a thread [9]. The evicted line is then the LRU line among the thread's own
lines if it has reached its quota, or the LRU line among other threads' lines
otherwise.
In [16] a new eviction policy for private caches was proposed in single-threaded
architectures. This policy gives a weight to each L2 miss according to its MLP
when the block is filled from memory. Eviction is decided using the LRU counters
and this weight. This idea was proposed for a different scenario, as it
focuses on single-threaded architectures.
3 MLP-Aware Dynamic Cache Partitioning
3.1 Algorithm Overview
Algorithm 3.1 shows the necessary steps to dynamically decide cache partitions
according to the MLP of each L2 access. At the beginning of the execution, we
decide an initial partition of the L2 cache. As we have no prior knowledge of
the applications, we evenly distribute ways among cores: each core receives
Associativity / NumberOfCores ways of the shared L2 cache.
Algorithm 3.1. MLP-aware DCP()
Step 1: Establish an initial even partition for each core.
Step 2: Run threads and collect data for the MLP-aware SDHs.
Step 3: Decide new partition.
Step 4: Update MLP-aware SDHs.
Step 5: Go back to Step 2.
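The control loop of Algorithm 3.1 can be sketched as follows (a structural
sketch only; the measurement and decision steps are abstracted behind
callbacks, and all names are ours):

```python
def mlp_aware_dcp(num_cores, assoc, n_periods, run_period, decide_partition,
                  rho=0.5):
    """Skeleton of Algorithm 3.1. run_period(partition, sdhs) models Step 2:
    it runs the threads for one period and accumulates quantified MLP costs
    into the per-core histograms. decide_partition(sdhs) models Step 3."""
    # Step 1: even initial partition (Associativity / NumberOfCores each).
    partition = [assoc // num_cores] * num_cores
    # One MLP-aware SDH per core: K+1 buckets (distances 1..K, plus >K).
    sdhs = [[0.0] * (assoc + 1) for _ in range(num_cores)]
    for _ in range(n_periods):              # Step 5 loops back to Step 2
        run_period(partition, sdhs)         # Step 2: measure MLP costs
        partition = decide_partition(sdhs)  # Step 3: choose next partition
        for h in sdhs:                      # Step 4: age histograms by rho
            for j in range(len(h)):
                h[j] *= rho                 # rho = 0.5 is a 1-bit shift
    return partition
```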
Afterwards, we begin a period where we measure the total MLP cost of each
application. The histogram of each thread containing the total MLP cost for
each possible partition is denoted MLP-aware SDH. For a K-way associative
cache, K+1 registers are needed to store this histogram (one per stack
distance from 1 to K, plus one for distances beyond K). For short periods,
dynamic cache partitioning (DCP) algorithms react more quickly to phase
changes. Our results show that, for periods ranging from 10^5 to 10^8 cycles,
only small performance variations are obtained, with a peak for a period of
5 million cycles.
At the end of each interval, MLP-aware SDHs are analyzed and a new partition
is decided for the next interval. We assume that running threads will have a
similar pattern of L2 accesses in the next measuring period; thus, the optimal
partition for the last period is chosen for the following one. Evaluating
8 M. Moreto et al.
all possible cache partitions gives the optimal partition. This evaluation is
done concurrently by dedicated hardware, which sets the partition for each
process in the next period. Working with slightly stale partition decisions
does not impact the correctness of the running applications and does not
measurably affect performance, as deciding new partitions typically takes a
few thousand cycles and is invoked only once every 5 million cycles.
Since the characteristics of applications change dynamically, MLP-aware SDHs
should reflect these changes. However, we also wish to retain some history of
past MLP-aware SDHs when making new decisions. Thus, after a new partition is
decided, we multiply all the values of the MLP-aware SDHs by ρ ∈ [0, 1].
Large values of ρ give longer reaction times to phase changes, while small
values of ρ adapt quickly to phase changes but tend to forget the behavior of
the application. Small performance variations are obtained for different
values of ρ ranging from 0 to 1, with a peak for ρ = 0.5. Furthermore, this
value is very convenient, as histograms can then be updated with a simple
shifter. Next, a new period of measuring MLP-aware SDHs begins. The key
contribution of this paper is the method to obtain MLP-aware SDHs, which we
explain in the following subsection.
3.2 MLP-Aware Stack Distance Histogram
As previously stated, MinMisses assumes that all L2 accesses are equally
important in terms of performance. However, it has been shown that cache
misses affect the performance of applications differently, even within the
same application [11, 16]. An isolated L2 data miss has a penalty cost that
can be approximated by the average memory latency. In the case of a burst of
L2 data misses that fit in the ROB, the penalty cost is shared among misses,
as L2 misses can be served in parallel. L2 instruction misses, in contrast,
are serialized, as fetch stops. Thus, L2 instruction misses have a constant
miss penalty and MLP.
We want to assign a cost to each L2 access according to its effect on
performance. In [16], a similar idea was used to modify the LRU eviction
policy for single-core, single-threaded architectures. In our situation, we
have a CMP scenario where the shared L2 cache has a number of reserved ways
for each core. At the end of each period, we decide either to continue with
the same partition or to change it. If we decide to modify the partition, a
core i that had wi reserved ways will receive w'i ≠ wi ways. If wi < w'i, the
thread receives more ways and, as a consequence, some misses in the old
configuration will become hits. Conversely, if wi > w'i, the thread receives
fewer ways and some hits in the old configuration will become misses. Thus,
we want an estimation of the performance effects when misses are converted
into hits and vice versa. Throughout this paper, we will call this impact on
performance the MLP cost.
(a) MLP cost of an L2 miss.
(b) Estimated MLP cost when an L2 hit becomes a miss.
Fig. 2. MLP cost of L2 accesses

MLP cost of L2 misses. In order to compute the MLP cost of an L2 miss with
stack distance di, we consider the situation shown in Figure 2(a). If we
force an L2 configuration that assigns exactly w'i = di ways to thread i,
with w'i > wi, some of the L2 misses of this thread will become hits, while
others will remain misses, depending on their stack distance. In order to
track the stack distance and MLP cost of each L2 miss, we have modified the
L2 Miss Status Holding
Registers (MSHR) [17]. This structure is similar to an L2 miss buffer and is used
to hold information about any load that has missed in the L2 cache. The modified
L2 MSHR has one extra field that contains the MLP cost of the miss as can be
seen in Figure 3(b). It is also necessary to store the stack distance of each access
in the MSHR. In Figure 3(a) we show the MSHR in the cache hierarchy.
(a) MSHR. (b) MSHR fields.
Fig. 3. Miss Status Holding Register
When the L2 cache is accessed and an L2 miss is determined, we assign an
MSHR entry to the miss and wait until the data comes from Main Memory. We
initialize the MLP cost field to zero when the entry is assigned. We store the
access stack distance together with the identifier of the owner core. Every
cycle, we obtain N, the number of in-flight L2 accesses with stack distance
greater than or equal to di. A hardware counter tracks this number for each
possible value of di, which means a total of Associativity counters. If N L2
misses are being served in parallel, the miss penalty is shared among them,
so we assign an equal share of 1/N to each miss. The MLP cost is updated
every cycle until the data comes from Main Memory and fills the L2, at which
moment we can free the MSHR entry.
The number of adders required to update the MLP cost of all entries is equal
to the number of MSHR entries. However, this number can be reduced by sharing
several adders between valid MSHR entries in a round-robin fashion: if an
MSHR entry updates its MLP cost only every 4 cycles, it has to add 4/N
instead. In this work, we assume that the MSHR contains only four adders for
updating MLP cost values, which has a negligible effect on the final MLP
cost [16].
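The counting logic can be modeled in software as follows (a toy model, not
the paper's hardware; it applies the per-cycle 1/N update directly rather
than the shared-adder 4/N variant, and all names are ours):

```python
class MSHRModel:
    """Toy model of the modified L2 MSHR. Each in-flight miss accumulates
    1/N per cycle, where N is the number of in-flight accesses whose stack
    distance is greater than or equal to its own."""
    def __init__(self, assoc):
        # In-flight accesses per stack distance (1..K, plus K+1 for >K).
        self.dist_count = [0] * (assoc + 2)
        self.entries = {}            # miss id -> [stack_distance, mlp_cost]

    def allocate(self, miss_id, stack_distance):
        """An L2 miss gets an entry; its MLP cost field starts at zero."""
        self.entries[miss_id] = [stack_distance, 0.0]
        self.dist_count[stack_distance] += 1

    def tick(self):
        """Per-cycle update of every valid entry's MLP cost field."""
        for entry in self.entries.values():
            n = sum(self.dist_count[entry[0]:])  # accesses with dist >= d_i
            entry[1] += 1.0 / n

    def free(self, miss_id):
        """Data returned from Main Memory: release the entry, report cost."""
        d, cost = self.entries.pop(miss_id)
        self.dist_count[d] -= 1
        return cost
```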
MLP cost of L2 hits. Next, we want to estimate the MLP cost of an L2 hit with
stack distance di when it becomes a miss. If we forced an L2 configuration
that assigned exactly w'i = di ways to thread i, with w'i < wi, some of the
L2 hits of this thread would become misses, while L2 misses would remain
misses (see Figure 2(b)). The hits that would become misses are the ones with
stack distance greater than or equal to di. Thus, we count the total number
of accesses with stack distance greater than or equal to di (including both
L2 hits and misses) to estimate the length of the cluster of L2 misses in
this configuration.
Deciding when to free the entry used by an L2 hit is more complex than in the
case of the MSHR. As noted in [11], in a balanced architecture, L2 data
misses can be served in parallel if they all fit in the ROB. Equivalently, we
say that L2 data misses can be served in parallel if they are at a ROB
distance smaller than the ROB size. Thus, we free the entry either when the
number of committed instructions since the access reaches the ROB size, or
when the number of cycles since the hit reaches the average latency to
memory. The first condition is clear, as L2 misses can overlap only if their
ROB distance is less than the ROB size. When the entry is freed, we add the
number of pending cycles divided by the number of misses with stack distance
greater than or equal to di. The second condition is also necessary, as it
can occur that no L2 access is made for a long period of time. To obtain the
average latency to memory, we add specific hardware that counts and averages
the number of cycles that a given entry spends in the MSHR.
We use new hardware to obtain the MLP cost of L2 hits, which we denote Hit
Status Holding Registers (HSHR), as it is similar to the MSHR. However, the
HSHR is private to each core. Each HSHR entry needs an identifier of the ROB
entry of the access, the address accessed by the L2 hit, the stack distance
value and a field with the corresponding MLP cost, as can be seen in
Figure 4(b). In Figure 4(a) we show the HSHR in the cache hierarchy.
When the L2 cache is accessed and an L2 hit is determined, we assign an
HSHR entry to the L2 hit. We initialize the fields of the entry as in the case of
the MSHR. We have a stack distance di and we want to update the MLP cost
field in every cycle. With this objective, we need to know the number of active
entries with stack distance greater or equal to di in the HSHR, which can be
tracked with one hardware counter per core. We also need a ROB entry identifier
for each L2 access.

(a) HSHR. (b) HSHR fields.
Fig. 4. Hit Status Holding Register

Every cycle, we obtain N, the number of L2 accesses with stack distance
greater than or equal to di, as in the L2 MSHR case. A hardware counter
tracks this number for each possible value of di, which means a total of
Associativity counters.
In order to avoid array conflicts, we need as many entries in the HSHR as
there can be L2 accesses in flight. This number is bounded by the L1 MSHR
size: in our scenario, we have 32 L1 MSHR entries, which means a maximum of
32 in-flight L2 accesses per core. However, we have checked that 24 entries
per core are enough to ensure an available slot 95% of the time in an
architecture with a 256-entry ROB. If there are no available slots, we simply
assign the minimum weight to the L2 access, as there are many L2 accesses in
flight. The number of adders required to update the MLP cost of all entries
is equal to the number of HSHR entries. As we did with the MSHR, HSHR entries
can share four adders with a negligible effect on the final MLP cost.
Quantification of MLP cost. Dealing with values of MLP cost between 0 and the
memory latency (or even greater) would represent a significant hardware cost.
Instead, we quantify the MLP cost with an integer value between 0 and 7, as
was done in [16]. For a memory latency of 300 cycles, Table 3 shows how we
quantify the MLP cost: we have split the interval [0, 300] into 7 intervals
of equal length.
Table 3. MLP cost quantification
MLP cost Quantification MLP cost Quantification
From 0 to 42 cycles 0 From 171 to 213 cycles 4
From 43 to 85 cycles 1 From 214 to 256 cycles 5
From 86 to 128 cycles 2 From 257 to 300 cycles 6
From 129 to 170 cycles 3 300 or more cycles 7
Finally, when we have to update the corresponding MLP-aware SDH, we add
the quantified value of MLP cost. Thus, isolated L2 misses will have a weight
of 7, while two overlapped L2 misses will have a weight of 3 in the MLP-aware
SDH. In contrast, MinMisses always adds one to its histograms.
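The quantification of Table 3 can be expressed as a small lookup (the
thresholds are taken from the table; the function name is ours):

```python
def quantize_mlp_cost(cycles):
    """Map an MLP cost in cycles to the 3-bit value of Table 3
    (memory latency of 300 cycles, seven equal-length intervals,
    plus a top bucket for costs of 300 cycles or more)."""
    thresholds = [43, 86, 129, 171, 214, 257, 300]  # lower bound of 1..7
    level = 0
    for t in thresholds:
        if cycles >= t:
            level += 1
    return level
```

For example, an isolated miss accrues the full 300-cycle latency and
quantizes to 7, while each of two fully overlapped misses accrues about 150
cycles and quantizes to 3, matching the weights discussed above.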
3.3 Obtaining Stack Distance Histograms
Normally, L2 caches have two separate parts that store data and address tags
to determine whether an access is a hit. Our prediction mechanism needs to
track every L2 access and store a separate copy of the L2 tag information in
an Auxiliary Tag Directory (ATD), together with the LRU counters [8]. We need
an ATD for each core that keeps track of the L2 accesses for any possible
cache configuration. Independently of the number of ways assigned to each
core, we store the tags and LRU counters of the last K accesses of the
thread, where K is the L2 associativity. As we have explained in Section 2,
an access with stack distance di corresponds to a cache miss in any
configuration that assigns fewer than di ways to the thread. Thus, with this
ATD we can determine whether an L2 access would be a miss or a hit in every
possible cache configuration.
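One set of the ATD can be modeled as an LRU stack (a behavioral sketch; real
hardware keeps tags plus LRU counters rather than an ordered list, and the
class name is ours):

```python
class ATDSet:
    """One set of the Auxiliary Tag Directory: an LRU stack of the last K
    tags. probe() returns the 1-based stack distance of an access; K+1
    denotes an access that misses in every possible configuration. A
    configuration with w assigned ways sees a hit iff distance <= w."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.stack = []                # index 0 = most recently used tag

    def probe(self, tag):
        if tag in self.stack:
            distance = self.stack.index(tag) + 1   # 1-based stack distance
            self.stack.remove(tag)
        else:
            distance = self.assoc + 1              # deeper than the stack
            if len(self.stack) == self.assoc:
                self.stack.pop()                   # evict ATD-LRU tag
        self.stack.insert(0, tag)                  # promote to MRU
        return distance
```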
3.4 Putting All Together
In Figure 5 we can see a sketch of the hardware implementation of our proposal.
When we have an L2 access, the ATD is used to determine its stack distance di.
Depending on whether it is a miss or a hit, either the MSHR or the HSHR is
used to compute the MLP cost of the access. Using the quantification process we
obtain the final MLP cost. This number estimates how performance would be
affected if the application had exactly w'i = di assigned ways. If w'i > wi,
we are estimating the performance benefit of converting this L2 miss into a
hit. If w'i < wi, we are estimating the performance degradation of converting
this L2 hit into a miss. Finally, using the stack distance, the MLP cost and
the core identifier, we can update the corresponding MLP-aware SDH.
Fig. 5. Hardware implementation

We have used two different partitioning algorithms. The first one, which we
denote MLP-DCP (standing for MLP-aware Dynamic Cache Partitioning), decides
the optimal partition according to the MLP cost of each way. Denoting the
total MLP cost of all accesses of thread i with stack distance j as
MLP_SDH_{i,j}, we define the total MLP cost of a thread i that uses wi ways
as

  TMLP(i, wi) = MLP_SDH_{i,>K} + Σ_{j=wi+1..K} MLP_SDH_{i,j}.

Thus, we have to minimize the sum of the total MLP costs of all cores:

  Σ_{i=1..N} TMLP(i, wi),  subject to  Σ_{i=1..N} wi = Associativity.
The second algorithm assigns a weight to each total MLP cost using the IPC of
the application in core i, IPCi. In this situation, we give priority to
threads with higher IPC, which yields better results in throughput at the
cost of being less fair. IPCi is measured at runtime with a hardware counter
per core. We denote this proposal MLPIPC-DCP; it consists in minimizing the
following expression:

  Σ_{i=1..N} IPCi · TMLP(i, wi),  subject to  Σ_{i=1..N} wi = Associativity.
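Both decision algorithms can be sketched as an exhaustive search (our own
software model, not the paper's lookahead hardware; the histogram layout and
helper names are assumptions, and the search is exponential in the number of
cores, a point Section 5.4 returns to):

```python
from itertools import combinations

def compositions(total, n):
    """All ways of splitting `total` ways into n positive parts (ordered)."""
    for cuts in combinations(range(1, total), n - 1):
        bounds = (0,) + cuts + (total,)
        yield tuple(bounds[k + 1] - bounds[k] for k in range(n))

def tmlp(sdh, w):
    """TMLP(i, w): with w ways, accesses with stack distance > w miss.
    sdh[j-1] holds the MLP cost of distance-j accesses (j = 1..K);
    sdh[K] is the >K bucket."""
    return sum(sdh[w:])

def best_partition(sdhs, assoc, weights=None):
    """Exhaustive search minimizing the (optionally IPC-weighted) sum of
    total MLP costs: weights=None gives MLP-DCP, weights=per-core IPCs
    gives MLPIPC-DCP."""
    weights = weights or [1.0] * len(sdhs)
    return min(compositions(assoc, len(sdhs)),
               key=lambda part: sum(ipc * tmlp(s, w)
                                    for ipc, s, w in zip(weights, sdhs, part)))
```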
3.5 Case Study
We have seen that SDHs can give the optimal partition in terms of total L2
misses. However, minimizing the total number of L2 misses is not the goal of
DCP algorithms; throughput is. The underlying idea of MinMisses is that, by
minimizing total L2 misses, we also increase throughput. This is intuitive,
as performance is clearly related to the L2 miss rate. However, this
heuristic can lead to inadequate partitions in terms of throughput, as the
following case study shows.
In Figure 6, we show the IPC curves of benchmarks galgel and gzip as the L2
cache size increases at way granularity (each way is 64KB). We also show
throughput for all 15 possible partitions; in this curve, we assign x ways to
gzip and 16−x ways to galgel. The optimal partition assigns 6 ways to gzip
and 10 ways to galgel, obtaining a total throughput of 3.091 instructions per
cycle. However, if we use the MinMisses algorithm to determine the new
partition, we choose 4 ways for gzip and 12 for galgel according to the SDH
values. Figure 6 also shows the total number of misses for each cache
partition as well as the per-thread number of misses.
In this situation, misses in gzip are more important in terms of performance
than misses in galgel. Furthermore, gzip's IPC is larger than galgel's. As a
consequence, MinMisses obtains a non-optimal partition in terms of IPC, and
its throughput is 2.897, which is 6.3% smaller than the optimal one. In fact,
galgel's clusters of L2 misses are, on average, longer than gzip's.
Accordingly, MLP-DCP assigns one extra way to gzip and increases performance
by 3%. With MLPIPC-DCP, we give more importance to gzip, as it has a higher
IPC; as a consequence, we end up assigning yet another way to gzip, reaching
the optimal partition and increasing throughput by an extra 3%.
Fig. 6. Misses and IPC curves for galgel and gzip
4 Experimental Environment
4.1 Simulator Configuration
We target this study to the case of a CMP with two and four cores with their
respective own data and instruction L1 caches and a unified L2 cache shared
among threads as in previous studies [8,9,10]. Each core is single-threaded and
fetches up to 8 instructions each cycle. It has 6 integer (I), 3 floating point (FP),
and 4 load/store functional units and 32-entry I, load/store, and FP instruction
queues. Each thread has a 256-entry ROB and 256 physical registers. We use a
two-level cache hierarchy with 64B lines with separate 16KB, 4-way associative
data and instruction caches, and a unified L2 cache that is shared among all
cores. We have used two different L2 caches, one of size 1MB and 16-way asso-
ciativity, and the second one of size 2MB and 32-way associativity. Latency from
L1 to L2 is 15 cycles, and from L2 to memory 300 cycles. We use a 32B width
bus to access L2 and a multibanked L2 of 16 banks with 3 cycles of access time.
We extended the SMTSim simulator [2] to model CMPs. We collected traces of
the most representative 300-million-instruction segment of each program,
following the SimPoint methodology [18]. We use the FAME simulation
methodology [19] with a Maximum Allowable IPC Variance of 5%. This evaluation
methodology measures the performance of multithreaded processors by
reexecuting all threads in a multithreaded workload until all of them are
fairly represented in the final IPC taken from the workload.
4.2 Workload Classification
In [20] two metrics are used to model the performance of a partitioning algorithm
like MinMisses for pairings of benchmarks in the SPEC CPU 2000 benchmark
suite. Here, we extend this classification for architectures with more cores.
Metric 1. The wP %(B) metric measures the number of ways needed by a
benchmark B to obtain at least a given percentage P% of its maximum IPC
(when it uses all L2 ways).
(a) IPC as we vary the number of assigned
ways of a 1MB 16-way L2 cache.
(b) Average miss penalty of an L2 miss
with a 1MB 16-way L2 cache.
Fig. 7. Benchmark classification
The intuition behind this metric is to classify benchmarks depending on their
cache utilization. Using P = 90%, we can classify benchmarks into three
groups: Low utility (L), Small working set or saturated utility (S), and High
utility (H). L benchmarks have 1 ≤ w90% ≤ K/8, where K is the L2
associativity; they are hardly affected by L2 cache space because nearly all
their L2 accesses are misses. S benchmarks have K/8 < w90% ≤ K/2 and just
need some ways to reach maximum throughput, as they fit in the L2 cache.
Finally, H benchmarks have w90% > K/2 and always improve IPC as the number of
ways given to them is increased. Clear representatives of these three groups
are applu (L), gzip (S) and ammp (H) in Figure 7(a). In Table 4 we give w90%
for all SPEC CPU 2000 benchmarks.
Table 4. The applications used in our evaluation. For each benchmark, we give the two
metrics needed to classify workloads together with IPC for a 1MB 16-way L2 cache.
Bench w90% APTC IPC Bench w90% APTC IPC Bench w90% APTC IPC
ammp 14 23.63 1.27 applu 1 16.83 1.03 apsi 10 21.14 2.17
art 10 46.04 0.52 bzip2 1 1.18 2.62 crafty 4 7.66 1.71
eon 3 7.09 2.31 equake 1 18.6 0.27 facerec 11 10.96 1.16
fma3d 9 15.1 0.11 galgel 15 18.9 1.14 gap 1 2.68 0.96
gcc 3 6.97 1.64 gzip 4 21.5 2.20 lucas 1 7.60 0.35
mcf 1 9.12 0.06 mesa 2 3.98 3.04 mgrid 11 9.52 0.71
parser 11 9.09 0.89 perl 5 3.82 2.68 sixtrack 1 1.34 2.02
swim 1 28.0 0.40 twolf 15 12.0 0.81 vortex 7 9.65 1.35
vpr 14 11.9 0.97 wupw 1 5.99 1.32
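The L/S/H classification can be written directly from the thresholds above (a
sketch; the example values in the tests come from Table 4 for a 16-way L2):

```python
def classify(w90, assoc):
    """Classify a benchmark by cache utility: L if w90% <= K/8,
    S if K/8 < w90% <= K/2, and H otherwise."""
    if w90 <= assoc / 8:
        return 'L'
    if w90 <= assoc / 2:
        return 'S'
    return 'H'
```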
The average miss penalty of an L2 miss for the whole SPEC CPU 2000 benchmark
suite is shown in Figure 7(b). Note that this average miss penalty varies
widely, even within each group of benchmarks, ranging from 30 to 294 cycles.
This figure reinforces the main motivation of the paper, as it proves that
the clustering level of L2 misses changes across applications.
Metric 2. The wLRU(thi) metric measures the number of ways given by LRU to
each thread thi in a workload composed of N threads. It can be computed by
simulating all benchmarks alone and using the frequency of L2 accesses of
each thread [5]. We denote the number of L2 Accesses in a Period of one
Thousand Cycles for thread i as APTCi. In Table 4 we list these values for
each benchmark.

  wLRU(thi) = APTCi / (Σ_{j=1..N} APTCj) · Associativity
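The metric can be computed directly from the APTC values of Table 4 (a
sketch; the function name is ours):

```python
def w_lru(aptc, assoc):
    """Estimate the ways LRU implicitly gives each thread from the threads'
    APTC values (L2 Accesses in a Period of one Thousand Cycles)."""
    total = sum(aptc)
    return [a / total * assoc for a in aptc]
```

For instance, pairing gzip (APTC 21.5) with galgel (APTC 18.9) on a 16-way L2
gives gzip roughly 8.5 ways under LRU.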
Next, we use these two metrics to extend previous classifications [20] for work-
loads with more than two benchmarks.
Case 1. When w90%(thi) ≤ wLRU(thi) for all threads. In this situation, LRU
attains 90% of each benchmark's performance, so intuitively there is very
little room for improvement.
Case 2. When there exist two threads A and B such that
w90%(thA) > wLRU(thA) and w90%(thB) < wLRU(thB). In this situation, LRU is
harming the performance of thread A because it gives more ways than necessary
to thread B. That is, LRU assigns some shared resources to a thread that does
not need them, while the other thread could benefit from those resources.
Case 3. Finally, the third case is obtained when w90%(thi) > wLRU(thi) for
all threads. In this situation, the L2 cache configuration is not big enough
to ensure that all benchmarks reach at least 90% of their peak performance.
In [20] it was observed that pairings belonging to this group show worse
results as the value of |w90%(th1) − w90%(th2)| grows. In this case, one
thread requires much less L2 cache space than the other to attain 90% of its
peak IPC. LRU treats threads equally and manages to satisfy the needs of the
less demanding thread. MinMisses, in contrast, assumes that all misses are
equally important for throughput and tends to give more space to the thread
with the higher L2 cache necessity, harming the less demanding thread. This
problem is inherent to the MinMisses algorithm. We will show in the next
subsections that MLP-aware partitioning policies are able to overcome this
situation.
Table 5. Workloads belonging to each case for a 1MB 16-way and a 2MB 32-way
shared L2 cache

                   1MB 16-way                             2MB 32-way
#cores  Case 1      Case 2         Case 3      Case 1      Case 2         Case 3
2       155 (48%)   135 (41%)      35 (11%)    159 (49%)   146 (45%)      20 (6.2%)
4       624 (4%)    12785 (86%)    1541 (10%)  286 (1.9%)  12914 (86%)    1750 (12%)
6       306 (0.1%)  219790 (95%)   10134 (5%)  57 (0.02%)  212384 (92%)   17789 (7.7%)
8       19 (0%)     1538538 (98%)  23718 (2%)  1 (0%)      1496215 (96%)  66059 (4.2%)
In Table 5 we show the total number of workloads that belong to each case for
different configurations. We generated all possible combinations without
repeating benchmarks; the order of benchmarks is not important. In the case
of a 1MB 16-way L2, we note that Case 2 becomes the dominant case as the
number of cores increases. The same trend is observed for L2 caches with
larger associativity: Table 5 also gives the number of workloads belonging to
each case as the number of cores increases for a 32-way 2MB L2 cache. Note
that with different L2 cache configurations, the values of w90% and APTCi
change for each benchmark. An important conclusion from Table 5 is that, as
we increase the number of cores, more combinations belong to the second case,
which is the one with the most room for improvement.
To evaluate our proposals, we randomly generate 16 workloads belonging to
each group for three different configurations. We denote these configurations 2C
(2 cores and 1MB 16-way L2), 4C-1 (4 cores and 1MB 16-way L2) and 4C-2 (4
cores and 2MB 32-way L2). We have also used a 2MB 32-way L2 cache as future
CMP architectures will continue scaling L2 size and associativity. For example,
the IBM Power5 [21] has a 10-way 1.875MB L2 cache and the Niagara 2 has a
16-way 4MB L2.
4.3 Performance Metrics
As performance metrics we have used the IPC throughput, which corresponds to
the sum of individual IPCs. We also use the harmonic mean of relative IPCs to
measure fairness, which we denote Hmean. We use Hmean instead of weighted
speed up because it has been shown to provide better fairness-throughput bal-
ance than weighted speed up [22].
Average improvements do consider the distribution of workloads among the
three groups. We denote this mean the weighted mean, as we assign a weight to
the speed up of each case according to the distribution of workloads from
Table 5. For example, for the 2C configuration, we compute the weighted mean
improvement as 0.48 · x1 + 0.41 · x2 + 0.11 · x3, where xi is the average
improvement in Case i.
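The weighted mean can be computed from the raw workload counts of Table 5 (a
sketch; the 2C weights 0.48/0.41/0.11 are simply the normalized counts
155/135/35):

```python
def weighted_mean_improvement(case_improvements, case_counts):
    """Weight the average improvement of each case by the fraction of
    workloads that fall into that case."""
    total = sum(case_counts)
    return sum(x * c / total for x, c in zip(case_improvements, case_counts))
```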
5 Evaluation Results
5.1 Performance Results
Throughput. The first experiment compares throughput for different DCP
algorithms, using the LRU policy as the baseline. We simulate MinMisses and
our two proposals with the 48 workloads selected in the previous subsection.
Figure 8(a) shows the average speed up over LRU for these mechanisms.
MLPIPC-DCP systematically obtains the best average results, nearly doubling
the performance benefits of MinMisses over LRU in the four-core
configurations. In configuration 4C-1, MLPIPC-DCP outperforms MinMisses by
4.1%. MLP-DCP always improves on MinMisses but obtains worse results than
MLPIPC-DCP.
All algorithms have similar results in Case 1. This is intuitive, as in this
situation there is little room for improvement. In Case 2, MinMisses obtains
a relevant improvement over LRU in configuration 2C, and MLP-DCP and
MLPIPC-DCP achieve an extra 2.5% and 5% improvement, respectively.

(a) Throughput speed up over LRU. (b) Fairness speed up over LRU.
Fig. 8. Average performance speed ups over LRU

In the other configurations, MLP-DCP and MLPIPC-DCP still outperform
MinMisses by 2.1% and 3.6%, respectively. In Case 3, MinMisses suffers larger
performance degradation as the asymmetry between the necessities of the two
cores increases; as a consequence, it has worse average throughput than LRU.
Assigning an appropriate weight to each L2 access makes it possible to obtain
better results than LRU using MLP-DCP and MLPIPC-DCP.
Fairness. We have used the harmonic mean of relative IPCs [22] to measure
fairness. The relative IPC is computed as IPC_shared / IPC_alone. In
Figure 8(b) we show the average speed up over LRU of the harmonic mean of
relative IPCs. Fair stands for the policy explained in Section 2. We can see
that MLP-DCP improves over both MinMisses and LRU in all situations (except
in Case 3 for two cores). It even obtains better results than Fair in
configurations 2C and 4C-1. MLPIPC-DCP is a variant of the MLP-DCP algorithm
optimized for throughput; as a consequence, it obtains worse results in
fairness than MLP-DCP.
Fig. 9. Average throughput speed up over LRU with a 1MB 16-way L2 cache
Equivalent cache space. DCP algorithms can reach the performance of a larger
L2 cache managed with the LRU eviction policy. Figure 9 shows the performance
evolution as the L2 size is increased from 1MB to 2MB with LRU as the
eviction policy. In this experiment, the workloads correspond to the ones
selected for configuration 4C-1. Figure 9 also shows the average speed up
over LRU of MinMisses, MLP-DCP and MLPIPC-DCP with a 1MB 16-way L2 cache.
MinMisses has the same average performance as a 1.25MB 20-way L2 cache with
LRU, which means that MinMisses provides the performance obtained with a 25%
larger shared cache. MLP-DCP reaches the performance of a 37.5% larger cache.
Finally, MLPIPC-DCP doubles the size gain of MinMisses, reaching the
performance of a 50% larger L2 cache.
5.2 Design Parameters
Figure 10(a) shows the sensitivity of our proposal to the period of partition
decisions. For shorter periods, the partitioning algorithm reacts more
quickly to phase changes. Once again, only small performance variations are
obtained for different periods, although for longer periods throughput tends
to decrease. As can be seen in Figure 10(a), peak performance is obtained
with a period of 5 million cycles.
(a) Average throughput for different pe-
riods for the MLP-DCP algorithm with
the 2C configuration.
(b) Average speed up over LRU for different
ROB sizes with the 4C-1 configuration.
Fig. 10. Sensitivity analysis to different design parameters
Finally, we varied the ROB size from 128 to 512 entries to show the
sensitivity of our proposals to this architectural parameter. Our mechanism
is the only one that is aware of the ROB size: the larger the ROB, the longer
the clusters of L2 misses can grow. Other policies only work with the number
of L2 misses, which does not change if we vary the ROB size. When the ROB
size increases, clusters can contain more misses and, as a consequence, our
mechanism can better differentiate between isolated and clustered misses. As
shown in Figure 10(b), average improvements in the 4C-1 configuration are
slightly higher for a ROB with 512 entries, while MinMisses shows worse
results. MLPIPC-DCP outperforms LRU and MinMisses by 10.4% and 4.3%,
respectively.
5.3 Hardware Cost
We have used the hardware implementation of Figure 5 to estimate the hardware
cost of our proposal. In this subsection, we focus our attention on the
configuration 2C. We suppose a 40-bit physical address space. Each entry in the
ATD needs 29 bits (1 valid bit + 24-bit tag + 4-bit for LRU counter). Each set
has 16 ways, so we have an overhead of 58 Bytes (B) for each set. As we have
1024 sets, we have a total cost of 58KB per core.
The hardware cost corresponding to the extra fields of each entry in the L2
MSHR is 5 bits for the stack distance and 2B for the MLP cost. As we have 32
entries, this gives a total of 84B. Four adders are needed to update the MLP cost
of the active MSHR entries. HSHR entries need 1 valid bit, 8 bits to identify
the ROB entry, 34 bits for the address, 5 bits for the stack distance and 2B for
the MLP cost. In total we need 64 bits per entry. As we have 24 entries in each
HSHR, this gives a total of 192B per core. Four adders per core are needed to
update the MLP cost of the active HSHR entries. Finally, we need 17 counters
of 4B for each MLP-aware SDH, which amounts to 68B per core. In
addition to the storage bits, we also need an adder for incrementing MLP-aware
SDHs and a shifter to halve the hit counters after each partitioning interval.
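The storage figures above follow from simple arithmetic; the sketch below reproduces them using the field widths quoted in the text for the 2C configuration.

```python
# Reproduce the storage-cost arithmetic for the 2C configuration
# (all field widths are the figures quoted in the text).

WAYS, SETS = 16, 1024

# ATD entry: 1 valid bit + 24-bit tag + 4-bit LRU counter = 29 bits
atd_bits_per_set = 29 * WAYS                 # 464 bits = 58 B per set
atd_bytes = atd_bits_per_set // 8 * SETS
print(atd_bytes // 1024, "KB of ATD per core")       # 58 KB

# L2 MSHR extension: 5-bit stack distance + 2 B (16-bit) MLP cost, 32 entries
mshr_bits = (5 + 16) * 32
print(mshr_bits // 8, "B of extra MSHR fields")      # 84 B

# HSHR entry: 1 valid + 8-bit ROB id + 34-bit address + 5-bit stack distance
# + 16-bit MLP cost = 64 bits; 24 entries per HSHR
hshr_bits = (1 + 8 + 34 + 5 + 16) * 24
print(hshr_bits // 8, "B of HSHR per core")          # 192 B

# MLP-aware SDH: 17 counters of 4 B each
print(17 * 4, "B of SDH counters per core")          # 68 B
```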
Fig. 11. Throughput and hardware cost depending on ds in a two-core CMP
Sampled ATD. The main contribution to hardware cost corresponds to the
ATD. Instead of monitoring every cache set, we can decide to track accesses
from a reduced number of sets. This idea was also used in [8] with MinMisses
in a CMP environment. Here, we use it in a different situation, namely to estimate
MLP-aware SDHs with a sampled number of sets. We define a sampling distance
ds that gives the distance between tracked sets. For example, if ds = 1, we are
tracking all the sets. If ds = 2, we track half of the sets, and so on. Sampling
reduces the size of the ATD at the expense of less accurate MLP-aware
SDH predictions, as some accesses are not tracked. Figure 11 shows throughput
degradation in a two-core scenario as ds increases. This curve is measured
on the left y-axis. We also show the storage overhead as a percentage of the total
L2 cache size, measured on the right y-axis. Thanks to the sampling technique,
storage overhead drastically decreases. Thus, with a sampling distance of 16
we obtain an average throughput degradation of 0.76% and a storage overhead of
0.77% of the L2 cache size, which is less than 8KB of storage. We consider this
an attractive design point.
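The storage saving can be sketched as follows. The per-set ATD state and set count come from the text; the two-core duplication and the 1 MB L2 size are illustrative assumptions, since the exact percentage depends on the L2 size, which is not restated here.

```python
# Sketch: ATD storage as a function of the sampling distance ds.
# Assumed parameters: 1024 sets, 58 B of ATD state per tracked set,
# one ATD per core (2 cores), and a 1 MB L2 -- illustrative values only.

def atd_overhead(ds, sets=1024, bytes_per_set=58, cores=2, l2_bytes=1 << 20):
    tracked = sets // ds                      # track one of every ds sets
    overhead = cores * tracked * bytes_per_set
    return overhead, 100.0 * overhead / l2_bytes

for ds in (1, 2, 4, 8, 16):
    b, pct = atd_overhead(ds)
    print(f"ds={ds:2d}: {b:6d} B ({pct:.2f}% of L2)")
```

With ds = 16, the overhead under these assumptions falls below 8 KB, in line with the figure quoted above.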
Dynamic Cache Partitioning Based on the MLP of Cache Misses 21
5.4 Scalable Algorithm to Decide Cache Partitions
Evaluating all possible combinations allows determining the optimal partition
for the next period. However, this algorithm does not scale adequately when the
associativity and the number of applications sharing the cache are raised. If we
have a K-way set-associative L2 cache shared by N cores, the number of possible
partitions without considering the order is the binomial coefficient
C(N+K−1, K) = (N+K−1)! / (K! (N−1)!). For example, for 8 cores
and 16 ways, we have 245157 possible combinations. Consequently, the time to
decide new cache partitions does not scale. Several heuristics have been proposed
to reduce the number of cycles required to decide the new partition [8,9], which
can be used in our situation. These proposals bound the length of the decision
period by 10000 cycles. This overhead is very low compared to the 5-million-cycle
partitioning period (less than 0.2%).
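The combinatorial blow-up is easy to check directly (a sketch using Python's binomial coefficient):

```python
from math import comb

# Number of ways to divide K cache ways among N cores, per the formula above:
# C(N + K - 1, K).
def num_partitions(n_cores, k_ways):
    return comb(n_cores + k_ways - 1, k_ways)

print(num_partitions(2, 16))   # 2 cores, 16 ways: only 17 candidates
print(num_partitions(8, 16))   # 8 cores, 16 ways: 245157 -- exhaustive search no longer scales
```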
Fig. 12. Average throughput speed up over LRU for different decision algorithms in
the 4C-1 configuration
Figure 12 shows the average speed up of MLP-DCP over LRU with the 4C-1
configuration for three different decision algorithms. Evaluating all possible par-
titions (denoted EvalAll) gives the highest speed up. The first greedy algorithm
(denoted Marginal Gains) assigns one way to a thread in each iteration [9]. The
selected way is the one that gives the largest increase in MLP cost. This process
is repeated until all ways have been assigned. The number of operations (com-
parisons) is of order K · N, where K is the associativity of the L2 cache and N
the number of cores. With this heuristic, an average throughput degradation of
0.59% is obtained. The second greedy algorithm (denoted Look Ahead) is similar
to Marginal Gains. The basic difference between them is that Look Ahead con-
siders the total MLP cost for all possible numbers of blocks that the application
can receive [8] and can assign more than one way in each iteration. The number
of operations (add-divide-compare) is of order N·K²/2, where K is the associativ-
ity of the L2 cache and N the number of cores. With this heuristic, an average
throughput degradation of 1.04% is obtained.
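As an illustration, the Marginal Gains loop can be sketched as below. The `benefit` table is a hypothetical stand-in for the per-core utility that would be read out of the MLP-aware SDHs, not the paper's actual data.

```python
# Sketch of the "Marginal Gains" heuristic [9]: in each of K iterations, give one
# way to the core whose benefit from one more way is currently the largest.
# K iterations x N cores scanned = O(K * N) comparisons in total.

def marginal_gains(benefit, n_cores, k_ways):
    """benefit[c][w] is the (hypothetical) gain for core c from its (w+1)-th way."""
    ways = [0] * n_cores
    for _ in range(k_ways):                       # K iterations...
        best = max(range(n_cores),                # ...each scanning N cores
                   key=lambda c: benefit[c][ways[c]])
        ways[best] += 1
    return ways

# Toy input: core 0 benefits steeply from its first two ways, core 1 is flat.
benefit = [[9, 8, 1, 1, 1, 1, 1, 1], [3, 3, 3, 3, 3, 3, 3, 3]]
print(marginal_gains(benefit, n_cores=2, k_ways=8))   # -> [2, 6]
```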
6 Conclusions
In this paper we propose a new DCP algorithm that assigns a cost to each
L2 access according to its impact on final performance: isolated misses receive
higher costs than clustered misses. Next, our algorithm decides the L2 cache
partition that minimizes the total cost for all running threads. Furthermore, we
have classified workloads for multiple cores into three groups and shown that
the dominant situation is precisely the one that offers room for improvement.
We show that our proposal reaches high throughput for two- and four-core
architectures. In all evaluated configurations, our proposal consistently outper-
forms both LRU and MinMisses, reaching speed ups of up to 63.9% (10.6% on
average) and 15.4% (4.1% on average), respectively. With our proposals, we reach
the performance of a 50% larger cache. Finally, we used a sampling technique to
propose a practical implementation with a storage cost of less than 1% of the
total L2 cache size, and a scalable algorithm to determine cache partitions with
nearly no performance degradation.
Acknowledgments
This work is supported by the Ministry of Education and Science of Spain under
contracts TIN2004-07739 and TIN2007-60625 and grant AP-2005-3318, and by
the SARC European Project. The authors would like to thank C. Acosta, A. Falcon,
D. Ortega, J. Vermoulen and O. J. Santana for their work in the simulation tool.
We also thank F. Cabarcas, I. Gelado, A. Rico and C. Villavieja for comments
on earlier drafts of this paper and the reviewers for their helpful comments.
References
1. Serrano, M.J., Wood, R., Nemirovsky, M.: A study on multistreamed superscalar
processors, Technical Report 93-05, University of California Santa Barbara (1993)
2. Tullsen, D.M., Eggers, S.J., Levy, H.M.: Simultaneous multithreading: maximizing
on-chip parallelism. In: ISCA (1995)
3. Hammond, L., Nayfeh, B.A., Olukotun, K.: A single-chip multiprocessor. Com-
puter 30(9), 79–85 (1997)
4. Cazorla, F.J., Ramirez, A., Valero, M., Fernandez, E.: Dynamically controlled
resource allocation in SMT processors. In: MICRO (2004)
5. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting inter-thread cache contention
on a chip multi-processor architecture. In: HPCA (2005)
6. Petoumenos, P., Keramidas, G., Zeffer, H., Kaxiras, S., Hagersten, E.: Modeling
cache sharing on chip multiprocessor architectures. In: IISWC, pp. 160–171 (2006)
7. Chiou, D., Jain, P., Devadas, S., Rudolph, L.: Dynamic cache partitioning via
columnization. In: Design Automation Conference (2000)
8. Qureshi, M.K., Patt, Y.N.: Utility-based cache partitioning: A low-overhead, high-
performance, runtime mechanism to partition shared caches. In: MICRO (2006)
9. Suh, G.E., Devadas, S., Rudolph, L.: A new memory monitoring scheme for
memory-aware scheduling and partitioning. In: HPCA (2002)
10. Kim, S., Chandra, D., Solihin, Y.: Fair cache sharing and partitioning in a chip
multiprocessor architecture. In: PACT (2004)
11. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: ISCA
(2004)
12. Mattson, R.L., Gecsei, J., Slutz, D.R., Traiger, I.L.: Evaluation techniques for
storage hierarchies. IBM Systems Journal 9(2), 78–117 (1970)
13. Settle, A., Connors, D., Gibert, E., Gonzalez, A.: A dynamically reconfigurable
cache for multithreaded processors. Journal of Embedded Computing 1(3-4) (2005)
14. Rafique, N., Lim, W.T., Thottethodi, M.: Architectural support for operating
system-driven CMP cache management. In: PACT (2006)
15. Hsu, L.R., Reinhardt, S.K., Iyer, R., Makineni, S.: Communist, utilitarian, and
capitalist cache policies on CMPs: caches as a shared resource. In: PACT (2006)
16. Qureshi, M.K., Lynch, D.N., Mutlu, O., Patt, Y.N.: A case for MLP-aware cache
replacement. In: ISCA (2006)
17. Kroft, D.: Lockup-free instruction fetch/prefetch cache organization. In: ISCA
(1981)
18. Sherwood, T., Perelman, E., Hamerly, G., Sair, S., Calder, B.: Discovering and
exploiting program phases. IEEE Micro (2003)
19. Vera, J., Cazorla, F.J., Pajuelo, A., Santana, O.J., Fernandez, E., Valero, M.:
FAME: Fairly measuring multithreaded architectures. In: PACT (2007)
20. Moreto, M., Cazorla, F.J., Ramirez, A., Valero, M.: Explaining dynamic cache
partitioning speed ups. IEEE CAL (2007)
21. Sinharoy, B., Kalla, R.N., Tendler, J.M., Eickemeyer, R.J., Joyner, J.B.: Power5
system microarchitecture. IBM J. Res. Dev. 49(4/5), 505–521 (2005)
22. Luo, K., Gummaraju, J., Franklin, M.: Balancing throughput and fairness in SMT
processors. In: ISPASS (2001)
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 24–42, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Cache Sensitive Code Arrangement for
Virtual Machine*
Chun-Chieh Lin and Chuen-Liang Chen
Department of Computer Science and Information Engineering,
National Taiwan University, Taipei,
10764, Taiwan
{d93020,clchen}@csie.ntu.edu.tw
Abstract. This paper proposes a systematic approach to optimizing the code layout
of a Java ME virtual machine for an embedded system with a cache-sensitive
architecture. A practical example is running the JVM directly (execute-in-place)
from NAND flash memory, for which the cache miss penalty is too high to endure. The
refined virtual machine generated 96% fewer cache misses than the original
version. We developed a mathematical approach that helps to predict the flow of
the interpreter inside the virtual machine. This approach analyzes both the static
control flow graph and the patterns in bytecode instruction streams, since we
found that the input sequence drives the program flow of the virtual machine
interpreter. We then proposed a rule to model the execution flows of Java
instructions in real applications. Furthermore, we applied a graph partition
algorithm to this mathematical model, which guided
the relocation process in moving program blocks to proper memory pages. The
refinement approach dramatically improved the locality of the virtual machine
and thus reduced cache miss rates. Our technique can help Java ME-enabled devices
run faster and extend battery life. The approach also brings potential for
designers to integrate the XIP function into a System-on-Chip, thanks to the lower
demand for cache memory.
Keywords: cache sensitive, cache miss, NAND flash memory, code arrange-
ment, Java virtual machine, interpreter, embedded system.
1 Introduction
The Java platform exists extensively in all kinds of embedded and mobile devices. The
Java™ Platform, Micro Edition (Java ME) [1] is without doubt the de facto standard platform
of smart phones. The Java virtual machine (the KVM in Java ME) is a key component
that affects performance and power consumption.
NAND flash memory comes with a serial bus interface. It does not allow random
access: the CPU must read out a whole page at a time, which is a slow operation
compared to RAM. This property makes it hard for a processor to execute programs stored
* We acknowledge the support for this study through grants from the National Science
Council of Taiwan (NSC 95-2221-E-002-137).
in NAND flash memory using the "execute-in-place" (XIP) technique. Meanwhile,
NAND flash memory offers fast write access times and, most important of all,
the technology has advantages in offering higher capacity than NOR flash technology
does. As the applications of embedded devices become large and complicated, more
mainstream devices adopt NAND flash memory to replace NOR flash memory.
In this paper, we tried to offer an answer to the question: can we speed up an em-
bedded device that uses NAND flash memory to store programs? "Page-based" storage
media, like NAND flash memory, have a higher access penalty than RAM does, so re-
ducing page misses becomes a critical issue. Thus, we set forth to find a way to reduce
the page miss rate generated by the KVM. Due to the unique structure of the KVM
interpreter, we found a special way to exploit the dynamic locality of the KVM, which is to
trace the patterns of executed bytecode instructions instead of the internal flow of the
KVM. It turned out to be a combinatorial optimization problem, because the code layout
must fulfill certain code size constraints. Our approach achieved the effect of static
page preloading by properly arranging program blocks. In the experiment, we imple-
mented a post-processing program to modify the intermediate files generated by the C
compiler. The post-processing program refined the machine code placement of the KVM
based on the mathematical model. Finally, the tuned KVMs dramatically
reduced page accesses to NAND flash memories. The outcome of this study helps
embedded systems to boost performance and extend battery life as well.
2 Related Works
Park et al., in [2], proposed a hardware module to allow direct code execution from
NAND flash memory. In this approach, program code stored in NAND flash pages
is loaded into a RAM cache on demand instead of moving the entire contents into
RAM. Their work is a universal hardware-based solution that does not consider appli-
cation-specific characteristics.
Samsung Electronics offers a commercial product called "OneNAND" [3] based on
the same idea. It is a single chip with a standard NOR flash interface that actually
contains a NAND flash memory array for storage. The vendor's intent was to provide a
cost-effective alternative to the NOR flash memory used in existing designs. The internal
structure of OneNAND comprises a NAND flash memory, control logic, hardware
ECC, and 5KB of buffer RAM. The buffer RAM is comprised of three buffers: 1KB
for boot RAM, and a pair of 2KB buffers used as bi-directional data buffers. Our
approach is suitable for systems using this type of flash memory.
Park et al., in [4], proposed yet another pure software approach to achieve exe-
cute-in-place by using a customized compiler that inserts NAND flash read opera-
tions into program code at proper places. Their compiler determines insertion points by
summing up the sizes of basic blocks along the calling tree. Special hardware is no longer
required, but in contrast to the earlier work [2], there is still a need for a tailor-made compiler.
Typical studies of refining code placement to minimize cache misses can apply to a
NAND flash cache system. Parameswaran et al., in [5], used a bin-packing approach.
It reorders program code by examining the execution frequency of basic blocks.
Code segments with higher execution frequencies are placed next to each other within
the cache. Janapsatya et al., in [6], proposed a pure software heuristic approach to
reduce the number of cache misses by relocating program sections in main memory.
Their approach was to analyze the program flow graph, then identify and pack basic blocks
within the same loop. They also established relations between cache misses and energy
consumption. Although their approach can identify loops within a program, breaking
the interpreter of a virtual machine into individual circuits is hard because all the loops
share the same starting point.
There has been research on improving program locality and optimizing code placement
for either cache or virtual memory environments. Pettis [7] proposed a systematic
approach that uses a dynamic call graph to position procedures. They tried to place two
procedures as close as possible if one procedure calls the other frequently. The
first step of Pettis' approach uses the profiling information to create a weighted call
graph. The second step iteratively merges the vertices connected by the heaviest-weight edges.
The process repeats until the whole graph is composed of one or more individual vertices
without edges.
However, how to collect profiling information, and its accuracy, is yet
another issue. For example, Young and Smith in [8] developed techniques to extract
effective branch profile information from a limited depth of branch history. Ball and
Larus in [9] described an algorithm for inserting monitoring code to trace programs.
Our approach is very different in nature: previous studies all focused on the flow of
program code, but we tried to model the profile by its input data.
This research project created a post-processor to optimize the code arrangement. It
is analogous to the "Diablo linker" [10], which utilized symbolic information in the object
files to generate optimized executable files. However, our approach generates
feedback intermediate files for the compiler and invokes the compiler to generate
optimized machine code.
3 Background
3.1 XIP with NAND Flash
NOR flash memory is popular as code memory because of the XIP feature. Several
approaches have been designed for using NAND flash memory as an alternative to NOR
flash memory. Because the NAND flash memory interface cannot connect to the CPU host
bus, there has to be a memory interface controller to move data from NAND flash
memory to RAM.
Fig. 1. Access NAND flash through shadow RAM
From a system-level view, Figure 1 shows a straightforward design that uses RAM as
a shadow copy of the NAND flash. The system treats NAND flash memory as a secondary
storage device [11]. A boot loader or RTOS resides in ROM or NOR
flash memory; it copies program code from NAND flash to RAM, and the processor then
executes the code in RAM [12]. This approach offers the best execution speed
because the processor operates with RAM. The downside of this approach is that it needs a
huge amount of RAM to mirror the NAND flash. In embedded devices, RAM is a precious
resource. For example, the Sony Ericsson T610 mobile phone [13] reserves 256KB of
RAM for the Java heap. Given the choice between that and using 256MB to mirror NAND flash memory, all
designers should agree that they would prefer to retain RAM for Java applets rather
than for mirroring. The second pitfall is that this implementation takes longer to boot,
because the system must copy the contents to RAM prior to execution.
Figure 2 shows a demand paging approach that uses a limited amount of RAM as a cache
of the NAND flash. The "romized" program code stays in NAND flash memory, and an
MMU loads from NAND into the cache only the portions of program code that are about
to be executed. The major advantage of this approach is that it consumes less RAM: several
kilobytes of RAM are enough to mirror the NAND flash memory. Using less RAM means that
integrating the CPU, MMU and cache into a single chip (the shadowed part in Figure 2) is
easier. The startup latency is shorter, since the CPU is ready to run soon after the first
NAND flash page is loaded into the cache. The component cost is lower than in the
previous approach. The realization of the MMU might be a hardware or a software
approach, which is not covered in this paper.
Fig. 2. Using a cache unit to access NAND flash
However, performance is the major drawback of this approach. The penalty of each
cache miss is high, because loading contents from a NAND flash page is nearly 200
times slower than the same operation with RAM. Therefore, reducing cache
misses becomes a critical issue for such configurations.
3.2 KVM Internals
Source Level. In respect of functionality, the KVM can be broken down into several
parts: startup, class file loading, constant pool resolving, the interpreter, garbage collection,
and KVM cleanup. Lafond et al., in [14], measured the energy consumption of
each part of the KVM. Their study showed that the interpreter consumed more than
50% of the total energy. In our experiments running the Embedded Caffeine Benchmark [15],
the interpreter contributed 96% of all memory accesses. This evidence leads to the
conclusion that the interpreter is the performance bottleneck of the KVM, and it
motivated us to focus on reducing the cache misses generated by the interpreter.
Figure 3 shows the program structure of the interpreter. It is a loop enclosing a large
switch-case dispatcher. The loop fetches bytecode instructions from the Java application,
and each "case" sub-clause deals with one bytecode instruction. The control flow graph
of the interpreter, as illustrated in Figure 4, is a flat and shallow spanning tree. There are
three major steps in the interpreter:
ReschedulePoint:
    RESCHEDULE
    opcode = FETCH_BYTECODE(ProgramCounter);
    switch (opcode)
    {
        case ALOAD: /* do something */
            goto ReschedulePoint;
        case IADD: /* do something */
        …
        case IFEQ: /* do something */
            goto BranchPoint;
        …
    }
BranchPoint:
    take care of the program counter;
    goto ReschedulePoint;
Fig. 3. Pseudo code of KVM interpreter
Fig. 4. Control flow graph of the interpreter
(1) Rescheduling and Fetching. In this step, the KVM prepares the execution context and
the stack frame. Then it fetches a bytecode instruction from the Java program.
(2) Dispatching and Execution. After reading a bytecode instruction from the Java pro-
gram, the interpreter jumps to the corresponding bytecode handler through the big
"switch…case…" statement. Each bytecode handler carries out the function of the
corresponding bytecode instruction.
(3) Branching. Branch bytecode instructions may bring the Java program flow
away from its original track. In this step, the interpreter resolves the target address and
modifies the program counter.
Fig. 5. The organization of the interpreter at assembly level
Assembly Level. Our analysis of the source files revealed the peculiar program
structure of the VM interpreter. Analyzing the code layout in the compiled executables
of the interpreter helped this study create a code placement strategy. The assembly
code analysis in this study is restricted to ARM and gcc for the sake of demonstration,
but applying our theory to other platforms and tools is easy. Figure 5 illustrates
the layout of the interpreter in assembly form (FastInterpret() in interp.c). The first
trunk, BytecodeFetching, is the code block for rescheduling and fetching; it is exactly the
first part of the original source code. The second trunk, LookupTable, is a large lookup
table used in dispatching bytecode instructions. Each entry links to a bytecode handler.
It is actually the translated result of the "switch…case…case" statement.
The third trunk, BytecodeDispatch, is the aggregation of more than a hundred byte-
code handlers. Most bytecode handlers are self-contained, meaning that a bytecode
handler occupies a contiguous memory space in this trunk and does not jump to
program code stored in other trunks. There are only a few exceptions that call
functions stored in other trunks, such as "invokevirtual." Besides, several
constant symbol tables are spread over this trunk; these tables are referenced by the pro-
gram code within the BytecodeDispatch trunk.
The last trunk, ExceptionHandling, contains code fragments for exception handling.
Each trunk occupies a number of NAND flash pages. In fact, the total size of Byteco-
deFetching and LookupTable is about 1200 bytes (compiled with arm-elf-gcc-3.4.3),
which is almost small enough to fit into two or three 512-byte pages. Figure 6 shows
the size distribution of bytecode handlers. The average size of a bytecode handler is 131
bytes, and 79 handlers are smaller than 56 bytes. In other words, a 512-byte page
can gather 4 to 8 bytecode handlers. The inter-handler execution flow dominates the
number of cache misses generated by the interpreter. This is why our ap-
proach rearranges the bytecode handlers within the BytecodeDispatch trunk.
Fig. 6. Distribution of Bytecode Handler Size (compiled with gcc-3.4.3)
4 Analyzing Control Flow
4.1 Indirect Control Flow Graph
Static branch prediction and typical code placement approaches derive the layout of a
program from its control flow graph (CFG). However, the CFG of a VM interpreter is a
special case: it is a flat spanning tree enclosed by a loop. The CFG does not
provide sufficient information to distinguish the temporal relations of each pair of
bytecode handlers. If someone wants to improve program locality by observing the dy-
namic execution order of program blocks, the CFG is apparently not a good tool to this
end. Therefore, we propose a concept called the "Indirect Control Flow Graph" (ICFG); it
uses the real bytecode instruction sequences to construct the dual CFG of the interpreter.
Consider a simplified virtual machine with 5 bytecode instructions: A, B, C, D, and E,
and use the virtual machine to run a very simple user applet. Consider the following
short alphabetic sequence as the instruction sequence of the user applet:
A-B-A-B-C-D-E-C
Each letter in the sequence represents a bytecode instruction. In Figure 7, the graph
connected with the solid lines is the CFG of the simplified interpreter. Following the
flow in the CFG, the program flow becomes:
Fig. 7. The CFG of the simplified interpreter
It is hard to tell the relation between handler A and handler B because the loop
header hides it. In other words, this CFG cannot easily show which handler will be
invoked after handler A is executed. The idea of the ICFG is to observe the patterns of
the bytecode sequences executed by the virtual machine, not to analyze the structure of
the virtual machine itself. Figure 8 expresses the ICFG in a readable way; it happens to
be the sub-graph connected by the dashed directed lines in Figure 7.
Fig. 8. An ICFG example. The number inside the circle represents the size of the handler
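For the toy applet above, constructing the ICFG edge weights amounts to counting adjacent pairs in the observed bytecode trace (a minimal sketch):

```python
from collections import Counter

# Build directed ICFG edge weights from an observed bytecode trace:
# edges[(i, j)] counts how often instruction j executes right after instruction i.

def icfg_edges(trace):
    return Counter(zip(trace, trace[1:]))

edges = icfg_edges("ABABCDEC")
for (a, b), w in sorted(edges.items()):
    print(f"{a}->{b}: {w}")
```

For the sequence A-B-A-B-C-D-E-C this yields A→B with weight 2 and the remaining transitions with weight 1.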
4.2 Tracing the Locality of the Interpreter
As stated, the Java applications that a KVM runs dominate the temporal locality of the
interpreter. Precisely speaking, the incoming Java instruction sequence dominates the
temporal locality of the KVM. Therefore, the first step in exploiting the temporal locality
is to consider the bytecode sequences executed by the virtual machine. Consider the
previous example sequence; the order of accessed NAND flash pages is supposed to be:
[BytecodeFetching]–[LookupTable]–[A]–[BytecodeFetching]–[LookupTable]–
[B]–[BytecodeFetching]–[LookupTable]–[A]…
Obviously, memory pages containing BytecodeFetching and LookupTable appear much
more often in the sequence than those containing BytecodeDispatch. As a result,
pages containing BytecodeFetching and LookupTable are favored to last in the cache,
while pages holding bytecode handlers have to compete with each other to stay in the cache.
Thus, we deduced that the order of executed bytecode instructions is the key factor
impacting cache misses.
Consider an extreme case: in a system with three cache blocks, two cache blocks
always hold the memory pages containing BytecodeFetching and LookupTable, for the
reason stated above. Therefore, only one cache block is available for swapping pages
containing bytecode handlers. If all the bytecode handlers were located in distinct
memory pages, processing a bytecode instruction would always cause a cache miss,
because the next-to-execute bytecode handler would always be located in an uncached memory
page. In other words, the sample sequence causes at least eight cache misses. Never-
theless, if the handlers of A and B are grouped into the same page, the cache misses
decline to five, and the page access trace becomes:
fault-A-B-A-B-fault-C-fault-D-fault-E-fault-C
If we extend the group (A, B) to include the handler of C, the cache miss count
drops to four, and the page access trace looks like the following one:
fault-A-B-A-B-C-fault-D-fault-E-fault-C
Therefore, the core issue of this study is to find an efficient code layout method parti-
tioning all bytecode instructions into disjoint sets based on their execution relevance.
Each NAND flash page contains one set of bytecode handlers. We propose that partitioning
the ICFG reaches this goal.
Back to Figure 8: the directed edges represent the temporal order of the instruction
sequence. The weight of an edge is the transition count for transitions from one bytecode
instruction to the next. If we remove the edge (B, C), the ICFG is divided into two
disjoint sets. That is, the bytecode handlers of A and B are placed in one page, and the
bytecode handlers of C, D, and E are placed in the other. The page access trace becomes:
fault-A-B-A-B-fault-C-D-E-C
This placement causes only two cache misses, which is 75% lower than the worst case!
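The miss counts quoted in these examples can be reproduced with a one-page cache simulation (a sketch; it assumes, as in the extreme case above, that a single cache block is available for handler pages):

```python
# Count page faults for a bytecode trace under a given handler-to-page grouping,
# assuming exactly one cache block is free for handler pages.

def count_faults(trace, pages):
    page_of = {h: i for i, page in enumerate(pages) for h in page}
    cached, faults = None, 0
    for h in trace:
        if page_of[h] != cached:          # needed handler not in the cached page
            faults += 1
            cached = page_of[h]
    return faults

trace = "ABABCDEC"
print(count_faults(trace, ["A", "B", "C", "D", "E"]))  # one handler per page: 8
print(count_faults(trace, ["AB", "C", "D", "E"]))      # group (A, B): 5
print(count_faults(trace, ["ABC", "D", "E"]))          # group (A, B, C): 4
print(count_faults(trace, ["AB", "CDE"]))              # cut edge (B, C): 2
```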
The next step is to transform the ICFG into an undirected graph by merging the
reversed edges connecting the same vertices; the weight of an undirected edge is the
sum of the weights of the two directed edges. The result is actually a variation of the
classical MIN k-CUT problem. Formally speaking, we can model a given graph
G(V, E) as follows:
- Vi – the i-th bytecode instruction.
- Ei,j – the edge connecting the i-th and j-th bytecode instructions.
- Fi,j – the number of times that bytecode instructions i and j are executed after each
  other; it is the weight of edge Ei,j.
- K – the number of expected partitions.
- Wx,y – the inter-set weight: for all x ≠ y, Wx,y = ΣFi,j where Vi ∈ Px and Vj ∈ Py.
The goal is to model the problem as the following definition:
Definition 1. The MIN k-CUT problem is to divide G into K disjoint partitions {P1,
P2, …, Pk} such that ΣWi,j is minimized.
4.3 The Mathematical Model
Yet there is an additional constraint in our model. It is impractical to gather bytecode
instructions into a partition regardless of the total program size of the constituent byte-
code handlers. The size of each bytecode handler is distinct, and the code size of a
partition cannot exceed the size of a memory page (e.g., a NAND flash page). Our aim is
to distribute the bytecode handlers into several disjoint partitions {P1, P2, …, Pk}. We
define the following notations:
- Si – the code size of bytecode handler Vi.
- N – the size of a memory page.
- M(Pk) – the size of partition Pk; it is ΣSm for all Vm ∈ Pk.
- H(Pk) – the value of partition Pk; it is ΣFi,j for all Vi, Vj ∈ Pk.
Our goal is to construct partitions that satisfy the following constraints.
Definition 2. The problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such
that M(Pk) ≤ N for each Pk, the total inter-partition weight ΣWi,j is minimized, and
ΣH(Pi) is maximized over all Pi ∈ {P1, P2, …, Pk}.
This rectified model is exactly an application of the graph partition problem, i.e., the
size of each partition must satisfy the constraint (the size of a memory page), and the sum
of inter-partition path weights is minimal. The graph partition problem is NP-complete
[16]. However, the purpose of this paper is neither to create a new graph partition
algorithm nor to discuss the differences between existing algorithms. The experimental
implementation simply adopted the following algorithm to demonstrate that our approach
works. Other implementations based on this approach may choose another graph
partition algorithm that satisfies specific requirements.
Partition (G)
1. Find the edge with maximal weight Fi,j in graph G such that Si + Sj ≤ N. If
   there is no such edge, go to step 4.
2. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
3. Remove both Vi and Vj from G; go to step 1.
4. Find a pair of vertices Vi and Vj in G such that Si + Sj ≤ N. If there is no pair
   satisfying the criteria, go to step 7.
5. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
6. Remove both Vi and Vj from G; go to step 4.
7. End.
The procedure for merging vertices Vi and Vj is:
Merge (Vi, Vj)
1. Add a new vertex Vk to G.
2. Pick an edge E connecting Vt with either Vi or Vj. If there is no such edge, go
   to step 6.
3. If there is already an edge F connecting Vt to Vk,
4. then add the weight of E to F, discard E, and go to step 2.
5. Else, replace the end of E which is either Vi or Vj with Vk, and go to step 2.
6. End.
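A compact sketch of Partition/Merge follows (steps 1-3 of Partition plus the Merge bookkeeping; the leftover pairing of steps 4-6 is omitted for brevity). The handler names, sizes, and edge weights are toy values, with the (A, B) edge weighted as in the earlier trace.

```python
# Greedy ICFG partitioning, following Partition/Merge above: repeatedly merge
# the heaviest-weight edge whose combined handler size still fits in one page.

def partition(sizes, weights, page_size):
    verts = {frozenset([v]): s for v, s in sizes.items()}          # vertex -> size
    edges = {frozenset([frozenset([a]), frozenset([b])]): w        # {u, v} -> weight
             for (a, b), w in weights.items()}

    def merge(u, v):                      # the Merge procedure
        k = u | v
        verts[k] = verts.pop(u) + verts.pop(v)
        for e in list(edges):
            if u in e or v in e:
                w = edges.pop(e)
                rest = e - {u, v}
                if rest:                  # external edge: re-point it at k,
                    t = next(iter(rest))  # summing weights of parallel edges
                    ek = frozenset([t, k])
                    edges[ek] = edges.get(ek, 0) + w
                # else: the (u, v) edge itself becomes internal and is dropped

    while True:                           # Partition, steps 1-3
        feasible = [(w, e) for e, w in edges.items()
                    if sum(verts[x] for x in e) <= page_size]
        if not feasible:
            return [set(v) for v in verts]
        merge(*max(feasible)[1])

sizes = {"A": 40, "B": 30, "C": 60, "D": 50, "E": 45}       # handler sizes (bytes)
weights = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 1,
           ("D", "E"): 1, ("E", "C"): 1}
print(sorted(sorted(p) for p in partition(sizes, weights, page_size=100)))
# -> [['A', 'B'], ['C'], ['D', 'E']]
```

The heavily connected pair (A, B) ends up on one page, while C is left alone because adding it to any neighbor would exceed the page size.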
Finally, each vertex in G is a collection of several bytecode handlers. The refinement
process is to collect bytecode handlers belonging to the same vertex and place them into
one memory page.
5 The Process of Rewriting the Virtual Machine
Our approach emphasizes that the arrangement of bytecode handlers affects the cache
miss rate. In other words, programmers should be able to speed up their programs
simply by reordering the “case” sub-clauses in the source files. Therefore, this study
optimizes the virtual machine in two distinct ways. The first approach revises the
order of the “case” sub-clauses in the sources of the virtual machine; if our theory is
correct, this tentative approach should show that the modified virtual machine
performs better in most test cases. The second version precisely reorganizes the
layout of the assembly code blocks of the bytecode handlers, and should therefore
yield larger improvements than the first version.
5.1 Source-Level Rearrangement
The concept of the refining process is to rearrange the order of the “case” statements in
the source file (execute.c). As a consequence, after translating the rearranged source
files, the compiler places the machine code of the bytecode handlers in the intended
order. The following steps outline the refining procedure.
A. Profiling. Run the Java benchmark program on the unmodified KVM. A custom
profiler traces the bytecode instruction sequence and generates statistics of
inter-bytecode transition counts. Although some patterns of instruction combinations
could be collected by inspecting the Java compiler, the dynamic approach also captures
application-dependent patterns.
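The pair statistics gathered in this profiling step can be sketched as follows. The trace format and the bytecode mnemonics here are illustrative assumptions; the actual profiler instruments the KVM interpreter.

```python
# Sketch of the profiling statistic: counting inter-bytecode transition
# pairs in a dynamic trace. The list-of-mnemonics trace format is assumed.
from collections import Counter

def count_transitions(trace):
    """Return a Counter mapping (previous, next) opcode pairs to counts."""
    return Counter(zip(trace, trace[1:]))

# A short synthetic trace:
trace = ["iload", "iload", "iadd", "istore", "iload", "iadd"]
pairs = count_transitions(trace)
# The hot pair ("iload", "iadd") occurs twice, so those two handlers are
# good candidates for placement in the same NAND flash page.
```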
B. Measuring the size of each bytecode handler. The refining program compiles the
KVM source files and measures the code size of each bytecode handler (i.e., the size of
each ‘case’ sub-clause) by parsing intermediate files generated by the compiler.
C. Partitioning the ICFG. The previous steps collect all the information necessary for
constructing the ICFG. The refining program then partitions the ICFG with a graph
partitioning algorithm; the result tells it how to group bytecode handlers together. For
example, a partition result groups (A, B) into one bundle and (C, D, E) into another, as
shown in Figure 8.
Cache Sensitive Code Arrangement for Virtual Machine 35
D. Rewriting the source file. According to the computed results, the refining program
rewrites the source file by arranging the order of all “case” sub-clauses within the
interpreter loop. Figure 9 shows the order of all “case” sub-clauses in the previous
example.
switch ( opcode ) {
case B: …;
case A: …;
case E: …;
case D: …;
case C: …;
}
Fig. 9. The output of rearranged case statements
5.2 Assembly-Level Rearrangement
The robust implementation of the refinement process consists of two steps. The
refinement process acts as a post-processor of the compiler: it parses the intermediate
files generated by the compiler, rearranges program blocks, and generates optimized
assembly code. Our implementation is inevitably compiler-dependent and CPU-
dependent; the current implementation is tightly integrated with gcc for ARM, but the
approach is easy to apply to other platforms. Figure 10 illustrates the outline of the
processing flow, entities, and relations between each entity. The following paragraphs
explain the functions of each step.
Fig. 10. Entities in the refinement process
A. Collecting dynamic bytecode instruction trace. The first step is to collect statis-
tics from real Java applications or benchmarks, because the following steps will need
these data for partitioning bytecode handlers. The modified KVM dumps the bytecode
instruction trace while running Java applications. A special program called TRACER
analyzes the trace dump to find the transition counts for all instruction pairs.
B. Rearranging the KVM interpreter. This core step is realized by a program called
REFINER, which acts as a post-processor of gcc. Its duty is to parse the bytecode
handlers expressed in assembly code and organize them into partitions, each of which
fits into one NAND flash page. The program consists of several subtasks, described
as follows.
(i) Parsing layout information of the original KVM. The first step is to compile the
original KVM. REFINER parses the intermediate files generated by gcc. Following the
structure of the interpreter in assembly code introduced in §3.2, REFINER analyzes the
jump table in the LookupTable trunk to find the address and size of each bytecode
handler.
(ii) Using the graph partition algorithm to group bytecode handlers into disjoint
partitions. At this stage, REFINER constructs the ICFG from two key inputs: (1) the
transition counts of bytecode instructions collected by TRACER; (2) the machine
code layout information collected in subtask (i). It uses the approximate algorithm
described in §4.3 to divide the undirected ICFG into disjoint partitions.
(iii) Rewriting the assembly code. REFINER parses and extracts assembly codes of
all bytecode handlers. Then, it creates a new assembly file and dumps all bytecode
handlers partition by partition according to the result of (ii).
(iv) Propagating symbol tables to each partition. As described in §3.2, there are
several symbol tables distributed in the BytecodeDispatch trunk. For most RISC pro-
cessors like ARM and MIPS, an instruction is unable to carry arbitrary constants as
operands because of limited instruction word length. The solution is to gather used
constants into a symbol table and place this table near the instructions that will access
these constants. Hence, the compiler generates instructions with relative addressing
operands to load constants from the nearby symbol tables. Take ARM for example, its
application binary interface (ABI) defines two instructions called LDR and ADR for
loading a constant from a symbol table to a register [17]. The ABI restricts the maximal
distance between a LDR/ADR instruction and the referred symbol table to 4K bytes.
Moreover, a cache miss would occur if a machine instruction in page X loaded a
constant si from a symbol table SY located in another page Y. Our solution is to create
a local symbol table SX in page X and copy the value si into the new table.
Consequently, the relative distance between si and the instruction never exceeds 4 KB,
nor does loading si cause a cache miss.
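The decision rule described in this subtask can be sketched as follows. This is a simplified model: the function name, the example page size, and the flat address arithmetic are assumptions rather than REFINER's actual logic; only the 4 KB reach limit comes from the text.

```python
# Sketch: decide whether a constant referenced by an LDR/ADR instruction
# must be copied into a local symbol table in the instruction's own page.
# PAGE is one of the page sizes used in the paper; the 4 KB reach limit is
# the ARM ABI restriction cited in the text. The function is illustrative.

PAGE = 2048    # NAND flash page size used in this example
REACH = 4096   # maximal LDR/ADR displacement allowed by the ARM ABI

def needs_local_copy(insn_addr, const_addr):
    """True if the reference would cross a page boundary (cache-miss risk)
    or exceed the ABI displacement limit, so the constant must be
    duplicated into a local symbol table within the instruction's page."""
    same_page = insn_addr // PAGE == const_addr // PAGE
    in_reach = abs(const_addr - insn_addr) <= REACH
    return not (same_page and in_reach)
```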
(v) Dumping the contents of partitions to NAND flash pages. The aim is to map
bytecode handlers to NAND flash pages: the reassembled bytecode handlers belonging
to the same partition are placed in one NAND flash page. After that, REFINER
refreshes the address and size information of all bytecode handlers. The updated
information lets REFINER pad each partition so that the starting address of every
partition aligns to a NAND flash page boundary.
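Step (v) amounts to the following address arithmetic, shown here as a sketch under assumed inputs (the per-partition sizes computed in the earlier steps; the function name is illustrative).

```python
# Sketch of page alignment: place each partition at a page boundary and
# record the padding needed. PAGE and the input format are assumptions.

PAGE = 2048

def layout(partition_sizes):
    """Return (start_address, padding_bytes) for each partition, with every
    start address aligned to a NAND flash page boundary."""
    result, addr = [], 0
    for sz in partition_sizes:
        assert sz <= PAGE, "a partition must fit in one flash page"
        result.append((addr, PAGE - sz))   # pad up to the next boundary
        addr += PAGE
    return result
```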
6 Evaluation
In this section, we begin with a brief introduction of the environment and conditions
used in the experiments. The first part of the experimental results covers the
source-level rearranged virtual machine; those positive results demonstrate that our
theory works. The second part covers the assembly-level rearranged virtual machine,
and further shows that this refinement produces better results than the source-level
version.
6.1 Evaluation Environment
Figure 11 shows the block diagram of our experimental setup. In order to mimic real
embedded applications, we ported the Java ME KVM to uClinux for ARM7. One
reason for using this platform is that uClinux supports the FLAT executable file
format, which is well suited to realizing XIP. We ran KVM/uClinux on a customized
gdb that dumped memory access traces and performance statistics to files. The
experimental setup assumed a specialized hardware unit acting as the NAND flash
memory controller, which loads program code from NAND flash pages into the cache,
and that all flash access operations work transparently without help from the operating
system; in other words, modifying the OS kernel for the experiment was unnecessary.
This experiment used “Embedded Caffeine Mark 3.0” [15] as the benchmark.
Fig. 11. Hierarchy of simulation environment. Software stack (top to bottom):
Embedded Caffeine Mark, J2ME API, K Virtual Machine (KVM) 1.1, uClinux kernel,
and GDB 5.0/ARMulator on Windows/Cygwin (Intel x86); the figure annotates the
layers as Java/RAM, ARM7/FLASH, and ARM7/ROM. Toolchain versions:
arm-elf-binutil 2.15, arm-elf-gcc 3.4.3, uClibc 0.9.18, J2ME (KVM) CLDC 1.1,
elf2flt 20040326.
There are several kinds of NAND flash commodities on the market: 512, 2048, and
4096 bytes per page. In this experiment, we modeled the cache simulator under the
following conditions:
1. There were four NAND flash page size options: 512, 1024, 2048 and 4096 bytes.
2. The cache was fully associative with a FIFO replacement policy.
3. The number of cache memory blocks varied from 2, 4, … up to 32.
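The cache model in conditions 1-3 can be sketched as a minimal simulator. The address-trace input format and the function name are assumptions; the fully associative FIFO policy and page-sized blocks come from the conditions above.

```python
# Sketch of the evaluated cache: fully associative, FIFO replacement,
# page-sized blocks. Input is a sequence of code addresses (assumed format).
from collections import deque

def count_misses(addresses, page_size, num_blocks):
    """Count cache misses. FIFO evicts the oldest-loaded page; unlike LRU,
    hits do not change the eviction order."""
    fifo, resident, misses = deque(), set(), 0
    for addr in addresses:
        page = addr // page_size
        if page not in resident:
            misses += 1
            if len(fifo) == num_blocks:
                resident.discard(fifo.popleft())   # evict oldest page
            fifo.append(page)
            resident.add(page)
    return misses
```

With two 512-byte blocks, the trace 0, 512, 1024, 0 misses on every access because page 0 is evicted before it is revisited; a third block removes the last miss.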
6.2 Results of Source-Level Rearrangement
First, we rearranged the “case” sub-clauses in the source codes using the introduced
method. Table 1 lists the raw statistics of cache miss rates, and Figure 12 plots the
charts of normalized cache miss rates from the optimized KVM. The experiment
assumed a maximal cache size of 64K bytes. For each NAND flash page size, the
number of cache blocks ranges from 4 up to (64K / NAND flash page size).
In Table 1, each column is the experimental result from one kind of KVM. The
“original” column refers to statistics from the original KVM, in which the bytecode
handlers are laid out in their original machine-code order. The second column,
“optimized”, is the result from the KVM refined with our approach.
For example, in the best case (2048 bytes per page, 8 cache pages), the optimized
KVM generates 105,157 misses, which is only 4.5% of the misses caused by the
original KVM, and the improvement ratio is 95%.
Broadly speaking, the experiment shows that the optimized KVM outperforms the
original KVM in most cases. Looking at the charts in Figure 12, the curves of
normalized cache miss rates (i.e., optimized_miss_rate / original_miss_rate) tend to be
concave: the improvement for eight cache pages is greater than that for four pages.
This benefit comes from the smaller “locality” of the optimized KVM; the cache can
hold more localities, which helps reduce cache misses. After the curve bottoms out,
the cache is large enough to hold most of the KVM program code, and as the cache
size grows further, the cache miss counts of all configurations converge.
However, the miss rate at 1024 bytes × 32 blocks is an exceptional case. This is
because our approach rearranges the order of bytecode handlers at the source level,
where it can hardly predict the precise starting address and code size of each bytecode
handler. This is the drawback of the source-level approach.
Fig. 12. The charts of normalized cache-miss rates from the source-level refined virtual
machine, with one curve per page size (512, 1024, 2048, and 4096 bytes/page). Each chart is
an experiment performed with a specific page size. The x-axis is the cache memory size
(number_of_pages * page_size, from 2048 to 65536 bytes); the y-axis is the normalized cache
miss rate (0.00 to 1.20).
Another Random Scribd Document
with Unrelated Content
Transactions On Highperformance Embedded Architectures And Compilers Iii 1st Edition Miquel Moreto
1631
§ 1. Bernhard of
Saxe Weimar.
CHAPTER IX.
THE DEATH OF WALLENSTEIN AND
THE TREATY OF PRAGUE.
Section I.—French Influence in Germany.
In Germany, after the death of Gustavus at Lützen,
it was as it was in Greece after the death of
Epaminondas at Mantinea. There was more
disturbance and more dispute after the battle than
before it. In Sweden, Christina, the infant daughter of Gustavus,
succeeded peaceably to her father's throne, and authority was
exercised without contradiction by the Chancellor Oxenstjerna. But,
wise and prudent as Oxenstjerna was, it was not in the nature of
things that he should be listened to as Gustavus had been listened
to. The chiefs of the army, no longer held in by a soldier's hand,
threatened to assume an almost independent position. Foremost of
these was the young Bernhard of Weimar, demanding, like
Wallenstein, a place among the princely houses of Germany. In his
person he hoped the glories of the elder branch of the Saxon House
would revive, and the disgrace inflicted upon it by Charles V. for its
attachment to the Protestant cause would be repaired. He claimed
the rewards of victory for those whose swords had gained it, and
payment for the soldiers, who during the winter months following
the victory at Lützen had received little or nothing. His own share
was to be a new duchy of Franconia, formed out of the united
bishoprics of Würzburg and Bamberg. Oxenstjerna was compelled to
admit his pretensions, and to confirm him in his duchy.
The step was thus taken which Gustavus had undoubtedly
contemplated, but which he had prudently refrained from carrying
§ 2. The League
of Heilbronn.
§ 3. Defection of
Saxony.
1631
§ 4. French
politics.
into action. The seizure of ecclesiastical lands in
which the population was Catholic was as great a
barrier to peace on the one side as the seizure of
the Protestant bishoprics in the north had been on the other. There
was, therefore, all the more necessity to be ready for war. If a
complete junction of all the Protestant forces was not to be had,
something at least was attainable. On April 23, 1633, the League of
Heilbronn was signed. The four circles of Swabia, Franconia, and the
Upper and Lower Rhine formed a union with Sweden for mutual
support.
It is not difficult to explain the defection of the
Elector of Saxony. The seizure of a territory by
military violence had always been most obnoxious
to him. He had resisted it openly in the case of Frederick in
Bohemia. He had resisted it, as far as he dared, in the case of
Wallenstein in Mecklenburg. He was not inclined to put up with it in
the case of Bernhard in Franconia. Nor could he fail to see that with
the prolongation of the war, the chances of French intervention were
considerably increasing.
In 1631 there had been a great effervescence of
the French feudal aristocracy against the royal
authority. But Richelieu stood firm. In March the
king's brother, Gaston Duke of Orleans, fled from
the country. In July his mother, Mary of Medici, followed his
example. But they had no intention of abandoning their position.
From their exile in the Spanish Netherlands they formed a close
alliance with Spain, and carried on a thousand intrigues with the
nobility at home. The Cardinal smote right and left with a heavy
hand. Amongst his enemies were the noblest names in France. The
Duke of Guise shrank from the conflict and retired to Italy to die far
from his native land. The keeper of the seals died in prison. His
kinsman, a marshal of France, perished on the scaffold. In the
summer of the year 1632, whilst Gustavus was conducting his last
campaign, there was a great rising in the south of France. Gaston
§ 5. Richelieu did
for France all that
could be done.
§ 6. Richelieu and
Germany.
himself came to share in the glory or the disgrace of the rebellion.
The Duke of Montmorenci was the real leader of the enterprise. He
was a bold and vigorous commander, the Rupert of the French
cavaliers. But his gay horsemen dashed in vain against the serried
ranks of the royal infantry, and he expiated his fault upon the
scaffold. Gaston, helpless and low-minded as he was, could live on,
secure under an ignominious pardon.
It was not the highest form of political life which
Richelieu was establishing. For the free expression
of opinion, as a foundation of government, France,
in that day, was not prepared. But within the limits
of possibility, Richelieu's method of ruling was a magnificent
spectacle. He struck down a hundred petty despotisms that he might
exalt a single despotism in their place. And if the despotism of the
Crown was subject to all the dangers and weaknesses by which
sooner or later the strength of all despotisms is eaten away,
Richelieu succeeded for the time in gaining the co-operation of those
classes whose good will was worth conciliating. Under him
commerce and industry lifted up their heads, knowledge and
literature smiled at last. Whilst Corneille was creating the French
drama, Descartes was seizing the sceptre of the world of science.
The first play of the former appeared on the stage in 1629. Year by
year he rose in excellence, till in 1636 he produced the 'Cid;' and
from that time one masterpiece followed another in rapid
succession. Descartes published his first work in Holland in 1637, in
which he laid down those principles of metaphysics which were to
make his name famous in Europe.
All this, however welcome to France, boded no
good to Germany. In the old struggles of the
sixteenth century, Catholic and Protestant each
believed himself to be doing the best, not merely for his own
country, but for the world in general. Alva, with his countless
executions in the Netherlands, honestly believed that the
Netherlands as well as Spain would be the better for the rude
surgery. The English volunteers, who charged home on a hundred
battle-fields in Europe, believed that they were benefiting Europe,
not England alone. It was time that all this should cease, and that
the long religious strife should have its end. It was well that
Richelieu should stand forth to teach the world that there were
objects for a Catholic state to pursue better than slaughtering
Protestants. But the world was a long way, in the seventeenth
century, from the knowledge that the good of one nation is the good
of all, and in putting off its religious partisanship France became
terribly hard and selfish in its foreign policy. Gustavus had been half
a German, and had sympathized deeply with Protestant Germany.
Richelieu had no sympathy with Protestantism, no sympathy with
German nationality. He doubtless had a general belief that the
predominance of the House of Austria was a common evil for all, but
he cared chiefly to see Germany too weak to support Spain. He
accepted the alliance of the League of Heilbronn, but he would have
been equally ready to accept the alliance of the Elector of Bavaria if
it would have served him as well in his purpose of dividing Germany.
§ 7. His policy
French, not
European.
§ 1. Saxon
negotiations with
Wallenstein.
The plan of Gustavus might seem unsatisfactory to
a patriotic German, but it was undoubtedly
conceived with the intention of benefiting Germany.
Richelieu had no thought of constituting any new
organization in Germany. He was already aiming at the left bank of
the Rhine. The Elector of Treves, fearing Gustavus, and doubtful of
the power of Spain to protect him, had called in the French, and had
established them in his new fortress of Ehrenbreitstein, which looked
down from its height upon the low-lying buildings of Coblentz, and
guarded the junction of the Rhine and the Moselle. The Duke of
Lorraine had joined Spain, and had intrigued with Gaston. In the
summer of 1632 he had been compelled by a French army to make
his submission. The next year he moved again, and the French again
interfered, and wrested from him his capital of Nancy. Richelieu
treated the old German frontier-land as having no rights against the
King of France.
Section II.—Wallenstein's Attempt to dictate
Peace.
Already, before the League of Heilbronn was
signed, the Elector of Saxony was in negotiation
with Wallenstein. In June peace was all but
concluded between them. The Edict of Restitution
was to be cancelled. A few places on the Baltic coast were to be
ceded to Sweden, and a portion at least of the Palatinate was to be
restored to the son of the Elector Frederick, whose death in the
preceding winter had removed one of the difficulties in the way of an
agreement. The precise form in which the restitution should take
place, however, still remained to be settled.
Such a peace would doubtless have been highly disagreeable to
adventurers like Bernhard of Weimar, but it would have given the
Protestants of Germany all that they could reasonably expect to
gain, and would have given the House of Austria one last chance of
§ 2. Opposition to
Wallenstein.
§ 3. General
disapprobation of
his proceedings.
§ 4. Wallenstein
and the Swedes.
taking up the championship of national interests against foreign
aggression.
Such last chances, in real life, are seldom taken
hold of for any useful purpose. If Ferdinand had
had it in him to rise up in the position of a national
ruler, he would have been in that position long before. His confessor,
Father Lamormain, declared against the concessions which
Wallenstein advised, and the word of Father Lamormain had always
great weight with Ferdinand.
Even if Wallenstein had been single-minded he
would have had difficulty in meeting such
opposition. But Wallenstein was not single-minded.
He proposed to meet the difficulties which were
made to the restitution of the Palatinate by giving the Palatinate,
largely increased by neighbouring territories, to himself. He would
thus have a fair recompense for the loss of Mecklenburg, which he
could no longer hope to regain. He fancied that the solution would
satisfy everybody. In fact, it displeased everybody. Even the
Spaniards, who had been on his side in 1632 were alienated by it.
They were especially jealous of the rise of any strong power near
the line of march between Italy and the Spanish Netherlands.
The greater the difficulties in Wallenstein's way the
more determined he was to overcome them.
Regarding himself, with some justification, as a
power in Germany, he fancied himself able to act at the head of his
army as if he were himself the ruler of an independent state. If the
Emperor listened to Spain and his confessor in 1633 as he had
listened to Maximilian and his confessor in 1630, Wallenstein might
step forward and force upon him a wiser policy. Before the end of
August he had opened a communication with Oxenstjerna, asking for
his assistance in effecting a reasonable compromise, whether the
Emperor liked it or not. But he had forgotten that such a proposal as
this can only be accepted where there is confidence in him who
makes it. In Wallenstein—the man of many schemes and many
§ 5. Was he in
earnest?
§ 6. He attacks
the Saxons.
§ 7. Bernhard at
Ratisbon.
§ 8. Wallenstein's
difficulties.
intrigues—no man had any confidence whatever. Oxenstjerna
cautiously replied that if Wallenstein meant to join him against the
Emperor he had better be the first to begin the attack.
Whether Wallenstein seriously meant at this time
to move against the emperor it is impossible to say.
He loved to enter upon plots in every direction
without binding himself to any; but he was plainly in a dangerous
position. How could he impose peace upon all parties when no single
party trusted him?
If he was not trusted, however, he might still make
himself feared. Throwing himself vigorously upon
Silesia, he forced the Swedish garrisons to
surrender, and, presenting himself upon the frontiers of Saxony,
again offered peace to the two northern electors.
But Wallenstein could not be everywhere. Whilst
the electors were still hesitating, Bernhard made a
dash at Ratisbon, and firmly established himself in
the city, within a little distance of the Austrian frontier. Wallenstein,
turning sharply southward, stood in the way of his further advance,
but he did nothing to recover the ground which had been lost. He
was himself weary of the war. In his first command he had aimed at
crushing out all opposition in the name of the imperial authority. His
judgment was too clear to allow him to run the old course. He saw
plainly that strength was now to be gained only by allowing each of
the opposing forces their full weight. 'If the Emperor,' he said, 'were
to gain ten victories it would do him no good. A single defeat would
ruin him.' In December he was back again in Bohemia.
It was a strange, Cassandra-like position, to be
wiser than all the world, and to be listened to by
no one; to suffer the fate of supreme intelligence
which touches no moral chord and awakens no human sympathy.
For many months the hostile influences had been gaining strength at
Vienna. There were War-Office officials whose wishes Wallenstein
§ 9. Opposition of
Spain.
§ 10. The Cardinal
Infant.
§ 11. The
Emperor's
systematically disregarded; Jesuits who objected to peace with
heretics at all; friends of the Bavarian Maximilian who thought that
the country round Ratisbon should have been better defended
against the enemy; and Spaniards who were tired of hearing that all
matters of importance were to be settled by Wallenstein alone.
The Spanish opposition was growing daily. Spain
now looked to the German branch of the House of
Austria to make a fitting return for the aid which
she had rendered in 1620. Richelieu, having mastered Lorraine, was
pushing on towards Alsace, and if Spain had good reasons for
objecting to see Wallenstein established in the Palatinate, she had
far better reasons for objecting to see France established in Alsace.
Yet for all these special Spanish interests Wallenstein cared nothing.
His aim was to place himself at the head of a German national force,
and to regard all questions simply from his own point of view. If he
wished to see the French out of Alsace and Lorraine, he wished to
see the Spaniards out of Alsace and Lorraine as well.
And, as was often the case with Wallenstein, a
personal difference arose by the side of the political
difference. The Emperor's eldest son, Ferdinand,
the King of Hungary, was married to a Spanish Infanta, the sister of
Philip IV., who had once been the promised bride of Charles I. of
England. Her brother, another Ferdinand, usually known from his
rank in Church and State as the Cardinal-Infant, had recently been
appointed Governor of the Spanish Netherlands, and was waiting in
Italy for assistance to enable him to conduct an army through
Germany to Brussels. That assistance Wallenstein refused to give.
The military reasons which he alleged for his refusal may have been
good enough, but they had a dubious sound in Spanish ears. It
looked as if he was simply jealous of Spanish influence in Western
Germany.
Such were the influences which were brought to
bear upon the Emperor after Wallenstein's return
from Ratisbon in December. Ferdinand, as usual,
hesitation.
1634
§ 12. Wallenstein
and the army.
§ 1. Oñate's
movements.
§ 2. Belief at
Vienna that
Wallenstein was a
traitor.
was distracted between the two courses proposed.
Was he to make the enormous concessions to the
Protestants involved in the plan of Wallenstein; or was he to fight it
out with France and the Protestants together according to the plan
of Spain? To Wallenstein by this time the Emperor's resolutions had
become almost a matter of indifference. He had resolved to force a
reasonable peace upon Germany, with the Emperor, if it might be so;
without him, if he refused his support.
Wallenstein was well aware that his whole plan
depended on his hold over the army. In January he
received assurances from three of his principal
generals, Piccolomini, Gallas, and Aldringer, that
they were ready to follow him wheresoever he might lead them, and
he was sanguine enough to take these assurances for far more than
they were worth. Neither they nor he himself were aware to what
lengths he would go in the end. For the present it was a mere
question of putting pressure upon the Emperor to induce him to
accept a wise and beneficent peace.
Section III.—Resistance to Wallenstein's Plans.
The Spanish ambassador, Oñate, was ill at ease.
Wallenstein, he was convinced, was planning
something desperate. What it was he could hardly
guess; but he was sure that it was something most prejudicial to the
Catholic religion and the united House of Austria. The worst was that
Ferdinand could not be persuaded that there was cause for
suspicion. The sick man, said Oñate, speaking of the Emperor, will
die in my arms without my being able to help him.
Such was Oñate's feelings toward the end of
January. Then came information that the case was
worse than even he had deemed possible.
Wallenstein, he learned, had been intriguing with
the Bohemian exiles, who had offered, with
§ 3. Oñate
informs
Ferdinand.
§ 4. Decision of
the Emperor
against
Wallenstein.
§ 5.
Determination to
displace
Wallenstein.
Richelieu's consent, to place upon his head the crown of Bohemia,
which had fourteen years before been snatched from the unhappy
Frederick. In all this there was much exaggeration. Though
Wallenstein had listened to these overtures, it is almost certain that
he had not accepted them. But neither had he revealed them to the
government. It was his way to keep in his hands the threads of
many intrigues to be used or not to be used as occasion might
serve.
Oñate, naturally enough, believed the worst. And
for him the worst was the best. He went
triumphantly to Eggenberg with his news, and then
to Ferdinand. Coming alone, this statement might
perhaps have been received with suspicion. Coming, as it did, after
so many evidences that the general had been acting in complete
independence of the government, it carried conviction with it.
Ferdinand had long been tossed backwards and
forwards by opposing influences. He had given no
answer to Wallenstein's communication of the
terms of peace arranged with Saxony. The
necessity of deciding, he said, would not allow him
to sleep. It was in his thoughts when he lay down and when he
arose. Prayers to God to enlighten the mind of the Emperor had
been offered in the churches of Vienna.
All this hesitation was now at an end. Ferdinand
resolved to continue the war in alliance with Spain,
and, as a necessary preliminary, to remove
Wallenstein from his generalship. But it was more
easily said than done. A declaration was drawn up
releasing the army from its obedience to Wallenstein, and
provisionally appointing Gallas, who had by this time given
assurances of loyalty, to the chief command. It was intended, if
circumstances proved favourable, to intrust the command ultimately
to the young King of Hungary.
§ 6. The Generals
gained over.
§ 7. Attempt to
seize Wallenstein.
§ 8. Wallenstein
at Pilsen.
§ 9. The colonels
engage to support
The declaration was kept secret for many days. To
publish it would only be to provoke the rebellion
which was feared. The first thing to be done was to
gain over the principal generals. In the beginning of February
Piccolomini and Aldringer expressed their readiness to obey the
Emperor rather than Wallenstein. Commanders of a secondary rank
would doubtless find their position more independent under an
inexperienced young man like the King of Hungary than under the
first living strategist. These two generals agreed to make themselves
masters of Wallenstein's person and to bring him to Vienna to
answer the accusations of treason against him.
For Oñate this was not enough. It would be easier,
he said, to kill the general than to carry him off.
The event proved that he was right. On February 7,
Aldringer and Piccolomini set off for Pilsen with the intention of
capturing Wallenstein. But they found the garrison faithful to its
general, and they did not even venture to make the attempt.
Wallenstein's success depended on his chance of
carrying with him the lower ranks of the army. On
the 19th he summoned the colonels round him and
assured them that he would stand security for money which they
had advanced in raising their regiments, the repayment of which had
been called in question. Having thus won them to a favourable
mood, he told them that it had been falsely stated that he wished to
change his religion and attack the Emperor. On the contrary, he was
anxious to conclude a peace which would benefit the Emperor and
all who were concerned. As, however, certain persons at Court had
objected to it, he wished to ask the opinion of the army on its terms.
But he must first of all know whether they were ready to support
him, as he knew that there was an intention to put a disgrace upon
him.
It was not the first time that Wallenstein had
appealed to the colonels. A month before, when
the news had come of the alienation of the Court,
he had induced them to sign an acknowledgment
that they would stand by him, from which all
reference to the possibility of his dismissal was expressly excluded.
They now, on February 20, signed a fresh agreement, in which they
engaged to defend him against the machinations of his enemies,
upon his promising to undertake nothing against the Emperor or the
Catholic religion.
§ 1. The garrison of Prague abandons him.
Section IV.—Assassination of Wallenstein.
Wallenstein thus hoped, with the help of the army,
to force the Emperor's hand, and to obtain his
signature to the peace. Of the co-operation of the
Elector of Saxony he was already secure; and since
the beginning of February he had been pressing Oxenstjerna and
Bernhard to come to his aid. If all the armies in the field declared for
peace, Ferdinand would be compelled to abandon the Spaniards and
to accept the offered terms. Without some such hazardous venture,
Wallenstein would be checkmated by Oñate. The Spaniard had been
unceasingly busy during these weeks of intrigue. Spanish gold was
provided to content the colonels for their advances, and hopes of
promotion were scattered broadcast amongst them. Two other of
the principal generals had gone over to the Court, and on February
18, the day before the meeting at Pilsen, a second declaration had
been issued accusing Wallenstein of treason, and formally depriving
him of the command. Wallenstein, before this declaration reached
him, had already appointed a meeting of large masses of troops to
take place on the White Hill before Prague on the 21st, where he
hoped to make his intentions more generally known. But he had
miscalculated the devotion of the army to his person. The garrison
of Prague refused to obey his orders. Soldiers and citizens alike
declared for the Emperor. He was obliged to retrace his steps. 'I had
peace in my hands,' he said. Then he added, 'God is righteous,' as
if still counting on the aid of Heaven in so good a work.
§ 2. Understanding with the Swedes.
§ 3. His arrival at Eger.
§ 4. Wallenstein's assassination.
He did not yet despair. He ordered the colonels to
meet him at Eger, assuring them that all that he
was doing was for the Emperor's good. He had
now at last hopes of other assistance. Oxenstjerna,
indeed, ever cautious, still refused to do anything for him till he had
positively declared against the Emperor. Bernhard, equally prudent
for some time, had been carried away by the news, which reached
him on the 21st, of the meeting at Pilsen, and the Emperor's
denouncement of the general. Though he was still suspicious, he
moved in the direction of Eger.
On the 24th Wallenstein entered Eger. In what
precise way he meant to escape from the labyrinth
in which he was, or whether he had still any clear
conception of the course before him, it is impossible to say. But
Arnim was expected at Eger, as well as Bernhard, and it may be that
Wallenstein fancied still that he could gather all the armies of
Germany into his hands, to defend the peace which he was ready to
make. The great scheme, however, whatever it was, was doomed to
failure. Amongst the officers who accompanied him was a Colonel
Butler, an Irish Catholic, who had no fancy for such dealings with
Swedish and Saxon heretics. Already he had received orders from
Piccolomini to bring in Wallenstein dead or alive. No official
instructions had been given to Piccolomini. But the thought was
certain to arise in the minds of all who retained their loyalty to the
Emperor. A general who attempts to force his sovereign to a certain
political course with the help of the enemy is placed, by that very
fact, beyond the pale of law.
The actual decision did not lie with Butler. The
fortress was in the hands of two Scotch officers,
Leslie and Gordon. As Protestants, they might have
been expected to feel some sympathy with Wallenstein. But the
sentiment of military honour prevailed. On the morning of the 25th
they were called upon by one of the general's confederates to take
orders from Wallenstein alone. 'I have sworn to obey the Emperor,'
answered Gordon, at last, 'and who shall release me from my oath?'
'You, gentlemen,' was the reply, 'are strangers in the Empire. What
have you to do with the Empire?' Such arguments were addressed
to deaf ears. That afternoon Butler, Leslie, and Gordon consulted
together. Leslie, usually a silent, reserved man, was the first to
speak. 'Let us kill the traitors,' he said. That evening Wallenstein's
chief supporters were butchered at a banquet. Then there was a
short and sharp discussion whether Wallenstein's life should be
spared. Bernhard's troops were known to be approaching, and the
conspirators dared not leave a chance of escape open. An Irish
captain, Devereux by name, was selected to do the deed. Followed
by a few soldiers, he burst into the room where Wallenstein was
preparing for rest. 'Scoundrel and traitor,' were the words which he
flung at Devereux as he entered. Then, stretching out his arms, he
received the fatal blow in his breast. The busy brain of the great
calculator was still forever.
§ 5. Reason of his failure.
§ 6. Comparison between Gustavus and Wallenstein.
The attempt to snatch at a wise and beneficent
peace by mingled force and intrigue had failed.
Other generals—Cæsar, Cromwell, Napoleon—have
succeeded to supreme power with the support of an armed force.
But they did so by placing themselves at the head of the civil
institutions of their respective countries, and by making themselves
the organs of a strong national policy. Wallenstein stood alone in
attempting to guide the political destinies of a people, while
remaining a soldier and nothing more. The plan was doomed to
failure, and is only excusable on the ground that there were no
national institutions at the head of which Wallenstein could place
himself; not even a chance of creating such institutions afresh.
In spite of all his faults, Germany turns ever to
Wallenstein as she turns to no other amongst the
leaders of the Thirty Years' War. From amidst the
divisions and weaknesses of his native country, a
great poet enshrined his memory in a succession of noble dramas.
Such faithfulness is not without a reason. Gustavus's was a higher
nature than Wallenstein's. Some of his work, at least the rescue of
German Protestantism from oppression, remained imperishable,
whilst Wallenstein's military and political success vanished into
nothingness. But Gustavus was a hero not of Germany as a nation,
but of European Protestantism. His Corpus Evangelicorum was at the
best a choice of evils to a German. Wallenstein's wildest schemes,
impossible of execution as they were by military violence, were
always built upon the foundation of German unity. In the way in
which he walked that unity was doubtless unattainable. To combine
devotion to Ferdinand with religious liberty was as hopeless a
conception as it was to burst all bonds of political authority on the
chance that a new and better world would spring into being out of
the discipline of the camp. But during the long dreary years of
confusion which were to follow, it was something to think of the last
supremely able man whose life had been spent in battling against
the great evils of the land, against the spirit of religious intolerance,
and the spirit of division.
Section V.—Imperialist Victories and the Treaty of Prague.
§ 1. Campaign of 1634.
For the moment, the House of Austria seemed to
have gained everything by the execution or the
murder of Wallenstein, whichever we may choose
to call it. The army was reorganized and placed under the command
of the Emperor's son, the King of Hungary. The Cardinal-Infant, now
eagerly welcomed, was preparing to join him through Tyrol. And
while on the one side there was union and resolution, there was
division and hesitation on the other. The Elector of Saxony stood
aloof from the League of Heilbronn, weakly hoping that the terms of
peace which had been offered him by Wallenstein would be
confirmed by the Emperor now that Wallenstein was gone. Even
amongst those who remained under arms there was no unity of
purpose. Bernhard, the daring and impetuous, was not of one mind
with the cautious Horn, who commanded the Swedish forces, and
both agreed in thinking Oxenstjerna remiss because he did not
supply them with more money than he was able to provide.
§ 2. The Battle of Nördlingen.
§ 3. Important results from it.
§ 4. French intervention.
As might have been expected under these
circumstances, the imperials made rapid progress.
Ratisbon, the prize of Bernhard the year before,
surrendered to the king of Hungary in July. Then Donauwörth was
stormed, and siege was laid to Nördlingen. On September 2 the
Cardinal-Infant came up with 15,000 men. The enemy watched the
siege with a force far inferior in numbers. Bernhard was eager to put
all to the test of battle. Horn recommended caution in vain. Against
his better judgment he consented to fight. On September 6 the
attack was made. By the end of the day Horn was a prisoner, and
Bernhard was in full retreat, leaving 10,000 of his men dead upon
the field, and 6,000 prisoners in the hands of the enemy, whilst the
imperialists lost only 1,200 men.
Since the day of Breitenfeld, three years before,
there had been no such battle fought as this of
Nördlingen. As Breitenfeld had recovered the
Protestant bishoprics of the north, Nördlingen recovered the Catholic
bishoprics of the south. Bernhard's Duchy of Franconia disappeared
in a moment under the blow. Before the spring of 1635 came, the
whole of South Germany, with the exception of one or two fortified
posts, was in the hands of the imperial commanders. The Cardinal-
Infant was able to pursue his way to Brussels, with the assurance
that he had done a good stroke of work on the way.
The victories of mere force are never fruitful of
good. As it had been after the successes of Tilly in
1622, and the successes of Wallenstein in 1626
and 1627, so it was now with the successes of the King of Hungary
in 1634 and 1635. The imperialist armies had gained victories, and
had taken cities. But the Emperor was none the nearer to the
confidence of Germans. An alienated people, crushed by military
force, served merely as a bait to tempt foreign aggression, and to
make the way easy before it. After 1622, the King of Denmark had
been called in. After 1627, an appeal was made to the King of
Sweden. After 1634, Richelieu found his opportunity. The bonds
between France and the mutilated League of Heilbronn were drawn
more closely. German troops were to be taken into French pay, and
the empty coffers of the League were filled with French livres. He
who holds the purse holds the sceptre, and the princes of Southern
and Western Germany, whether they wished it or not, were reduced
to the position of satellites revolving round the central orb at Paris.
§ 5. The Peace of Prague.
Nowhere was the disgrace of submitting to French
intervention felt so deeply as at Dresden. The
battle of Nördlingen had cut short any hopes which
John George might have entertained of obtaining that which
Wallenstein would willingly have granted him. But, on the other
hand, Ferdinand had learned something from experience. He would
allow the Edict of Restitution to fall, though he was resolved not to
make the sacrifice in so many words. But he refused to replace the
Empire in the condition in which it had been before the war. The
year 1627 was to be chosen as the starting point for the new
arrangement. The greater part of the northern bishoprics would thus
be saved to Protestantism. But Halberstadt would remain in the
hands of a Catholic bishop, and the Palatinate would be lost to
Protestantism for ever. Lusatia, which had been held in the hands of
the Elector of Saxony for his expenses in the war of 1620, was to be
ceded to him permanently, and Protestantism in Silesia was to be
placed under the guarantee of the Emperor. Finally, Lutheranism
alone was still reckoned as the privileged religion, so that Hesse
Cassel and the other Calvinist states gained no security at all. On
May 30, 1635, a treaty embodying these arrangements was signed
at Prague by the representatives of the Emperor and the Elector of
Saxony. It was intended not to be a separate treaty, but to be the
starting point of a general pacification. Most of the princes and
towns so accepted it, after more or less delay, and acknowledged
the supremacy of the Emperor on its conditions. Yet it was not in the
nature of things that it should put an end to the war. It was not an
agreement which any one was likely to be enthusiastic about. The
ties which bound Ferdinand to his Protestant subjects had been
rudely broken, and the solemn promise to forget and forgive could
not weld the nation into that unity of heart and spirit which was
needed to resist the foreigner. A Protestant of the north might
reasonably come to the conclusion that the price to be paid to the
Swede and the Frenchman for the vindication of the rights of the
southern Protestants was too high to make it prudent for him to
continue the struggle against the Emperor. But it was hardly likely
that he would be inclined to fight very vigorously for the Emperor on
such terms.
§ 6. It fails in securing general acceptance.
§ 7. Degeneration of the war.
If the treaty gave no great encouragement to
anyone who was comprehended by it, it threw still
further into the arms of the enemy those who were
excepted from its benefits. The leading members of
the League of Heilbronn were excepted from the general amnesty,
though hopes of better treatment were held out to them if they
made their submission. The Landgrave of Hesse Cassel was shut out
as a Calvinist. Besides such as nourished legitimate grievances, there
were others who, like Bernhard, were bent upon carving out a
fortune for themselves, or who had so blended in their own minds
consideration for the public good as to lose all sense of any
distinction between the two.
There was no lack here of materials for a long and
terrible struggle. But there was no longer any noble
aim in view on either side. The ideal of Ferdinand
and Maximilian was gone. The Church was not to recover its lost
property. The Empire was not to recover its lost dignity. The ideal of
Gustavus of a Protestant political body was equally gone. Even the
ideal of Wallenstein, that unity might be founded on an army, had
vanished. From henceforth French and Swedes on the one side,
Austrians and Spaniards on the other, were busily engaged in riving
at the corpse of the dead Empire. The great quarrel of principle had
merged into a mere quarrel between the Houses of Austria and
Bourbon, in which the shred of principle which still remained in the
question of the rights of the southern Protestants was almost
entirely disregarded.
§ 8. Condition of Germany.
§ 9. Notes of an English traveller (1636).
Horrible as the war had been from its
commencement, it was every day assuming a more
horrible character. On both sides all traces of
discipline had vanished in the dealings of the armies with the
inhabitants of the countries in which they were quartered. Soldiers
treated men and women as none but the vilest of mankind would
now treat brute beasts. 'He who had money,' says a contemporary,
'was their enemy. He who had none was tortured because he had it
not.' Outrages of unspeakable atrocity were committed everywhere.
Human beings were driven naked into the streets, their flesh pierced
with needles, or cut to the bone with saws. Others were scalded
with boiling water, or hunted with fierce dogs. The horrors of a town
taken by storm were repeated every day in the open country. Even
apart from its excesses, the war itself was terrible enough. When
Augsburg was besieged by the imperialists, after their victory at
Nördlingen, it contained an industrious population of 70,000 souls.
After a siege of seven months, 10,000 living beings, wan and
haggard with famine, remained to open the gates to the conquerors,
and the great commercial city of the Fuggers dwindled down into a
country town.
How is it possible to bring such scenes before our
eyes in their ghastly reality? Let us turn for the
moment to some notes taken by the companion of
an English ambassador who passed through the
country in 1636. As the party were towed up the Rhine from
Cologne, on the track so well known to the modern tourist, they
passed by many villages 'pillaged and shot down.' Further on, a
French garrison was in Ehrenbreitstein, firing down upon Coblentz,
which had just been taken by the imperialists. 'They in the town, if
they do but look out of their windows, have a bullet presently
presented at their head.' More to the south, things grew worse. At
Bacharach, 'the poor people are found dead with grass in their
mouths.' At Rüdesheim, 'many persons were praying where dead
bones were in a little old house; and here his Excellency gave some
relief to the poor, which were almost starved, as it appeared by the
violence they used to get it from one another.' At Mentz, the
ambassador was obliged to remain on shipboard, for there was
'nothing to relieve us, since it was taken by the King of Sweden, and
miserably battered.... Here, likewise, the poor people were almost
starved, and those that could relieve others before now humbly
begged to be relieved; and after supper all had relief sent from the
ship ashore, at the sight of which they strove so violently that some
of them fell into the Rhine, and were like to have been drowned.' Up
the Main, again, all the towns, villages, and castles 'be battered,
pillaged, or burnt.' After leaving Würzburg, the ambassador's train
came to plundered villages, and then to Neustadt, which 'hath been
a fair city, though now pillaged and burnt miserably.' Poor children
were sitting at their doors 'almost starved to death, his Excellency
giving them food and leaving money with their parents to help them,
if but for a time.' In the Upper Palatinate, they passed by churches
'demolished to the ground,' and through woods in danger,
'understanding that Croats were lying hereabout.' Further on they
stayed for dinner at a poor little village which 'hath been pillaged
eight-and-twenty times in two years, and twice in one day.' And so
on, and so on. The corner of the veil is lifted up in the pages of the
old book, and the rest is left to the imagination to picture forth, as
best it may, the misery behind. After reading the sober narrative, we
shall perhaps not be inclined to be so very hard upon the Elector of
Saxony for making peace at Prague.
CHAPTER X.
THE PREPONDERANCE OF FRANCE.
Section I.—Open Intervention of France.
§ 1. Protestantism not yet out of danger.
§ 2. The allies of France.
The peacemakers of Prague hoped to restore the
Empire to its old form. But this could not be.
Things done cannot pass away as though they had
never been. Ferdinand's attempt to gain a
partizan's advantage for his religion by availing himself of legal forms
had given rise to a general distrust. Nations and governments, like
individual men, are tied and bound by the chain of their sins, from
which they can be freed only when a new spirit is breathed into
them. Unsatisfactory as the territorial arrangements of the peace
were, the entire absence of any constitutional reform in connexion
with the peace was more unsatisfactory still. The majority in the two
Upper Houses of the Diet was still Catholic; the Imperial Council was
still altogether Catholic. It was possible that the Diet and Council,
under the teaching of experience, might refrain from pushing their
pretensions as far as they had pushed them before; but a
government which refrains from carrying out its principles from
motives of prudence cannot inspire confidence. A strong central
power would never arise in such a way, and a strong central power
to defend Germany against foreign invasion was the especial need of
the hour.
In the failure of the Elector of Saxony to obtain
some of the most reasonable of the Protestant
demands lay the best excuse of men like Bernhard
of Saxe-Weimar and William of Hesse Cassel for refusing the terms
of accommodation offered. Largely as personal ambition and greed
of territory found a place in the motives of these men, it is not
absolutely necessary to assert that their religious enthusiasm was
nothing more than mere hypocrisy. They raised the war-cry of 'God
with us' before rushing to the storm of a city doomed to massacre
and pillage; they set apart days for prayer and devotion when battle
was at hand—veiling, perhaps, from their own eyes the hideous
misery which they were spreading around, in contemplation of the
loftiness of their aim: for, in all but the most vile, there is a natural
tendency to shrink from contemplating the lower motives of action,
and to fix the eyes solely on the higher. But the ardour inspired by a
military career, and the mere love of fighting for its own sake, must
have counted for much; and the refusal to submit to a domination
which had been so harshly used soon grew into a restless disdain of
all authority whatever. The nobler motives which had imparted a
glow to the work of Tilly and Gustavus, and which even lit up the
profound selfishness of Wallenstein, flickered and died away, till the
fatal disruption of the Empire was accomplished amidst the strivings
and passions of heartless and unprincipled men.
§ 3. Foreign intervention.
The work of riving Germany in pieces was not
accomplished by Germans alone. As in nature a
living organism which has become unhealthy and
corrupt is seized upon by the lower forms of animal life, a nation
divided amongst itself, and devoid of a sense of life within it higher
than the aims of parties and individuals, becomes the prey of
neighbouring nations, which would not have ventured to meddle
with it in the days of its strength. The carcase was there, and the
eagles were gathered together. The gathering of Wallenstein's army
in 1632, the overthrow of Wallenstein in 1634, had alike been made
possible by the free use of Spanish gold. The victory of Nördlingen
had been owing to the aid of Spanish troops; and the aim of Spain
was not the greatness or peace of Germany, but at the best the
greatness of the House of Austria in Germany; at the worst, the
maintenance of the old system of intolerance and unthinking
obedience, which had been the ruin of Germany. With Spain for an
ally, France was a necessary enemy. The strife for supreme power
between the two representative states of the old system and the
new could not long be delayed, and the German parties would be
dragged, consciously or unconsciously, in their wake. If Bernhard
became a tool of Richelieu, Ferdinand became a tool of Spain.
§ 4. Alsace and Lorraine.
§ 5. Richelieu asks for fortresses in Alsace.
§ 6. War between France and Spain.
In this phase of the war Protestantism and
Catholicism, tolerance and intolerance, ceased to
be the immediate objects of the strife. The
possession of Alsace and Lorraine rose into primary importance, not
because, as in our own days, Germany needed a bulwark against
France, or France needed a bulwark against Germany, but because
Germany was not strong enough to prevent these territories from
becoming the highway of intercourse between Spain and the
Spanish Netherlands. The command of the sea was in the hands of
the Dutch, and the valley of the Upper Rhine was the artery through
which the life blood of the Spanish monarchy flowed. If Spain or the
Emperor, the friend of Spain, could hold that valley, men and
munitions of warfare would flow freely to the Netherlands to support
the Cardinal-Infant in his struggle with the Dutch. If Richelieu could
lay his hand heavily upon it, he had seized his enemy by the throat,
and could choke him as he lay.
After the battle of Nördlingen, Richelieu's first
demand from Oxenstjerna as the price of his
assistance had been the strong places held by
Swedish garrisons in Alsace. As soon as he had
them safely under his control, he felt himself strong enough to
declare war openly against Spain.
On May 19, eleven days before peace was agreed
upon at Prague, the declaration of war was
delivered at Brussels by a French herald. To the
astonishment of all, France was able to place in the field what was
then considered the enormous number of 132,000 men. One army
was to drive the Spaniards out of the Milanese, and to set free the
Italian princes. Another was to defend Lorraine whilst Bernhard
crossed the Rhine and carried on war in Germany. The main force
was to be thrown upon the Spanish Netherlands, and, after effecting
a junction with the Prince of Orange, was to strike directly at
Brussels.
Section II.—Spanish Successes.
§ 1. Failure of the French attack on the Netherlands.
§ 2. Spanish invasion of France.
Precisely in the most ambitious part of his
programme Richelieu failed most signally. The
junction with the Dutch was effected without
difficulty; but the hoped-for instrument of success
proved the parent of disaster. Whatever Flemings and Brabanters
might think of Spain, they soon made it plain that they would have
nothing to do with the Dutch. A national enthusiasm against
Protestant aggression from the north made defence easy, and the
French army had to return completely unsuccessful. Failure, too, was
reported from other quarters. The French armies had no experience
of war on a large scale, and no military leader of eminent ability had
yet appeared to command them. The Italian campaign came to
nothing, and it was only by a supreme effort of military skill that
Bernhard, driven to retreat, preserved his army from complete
destruction.
In 1636 France was invaded. The Cardinal-Infant
crossed the Somme, took Corbie, and advanced to
the banks of the Oise. All Paris was in commotion.
An immediate siege was expected, and inquiry was
anxiously made into the state of the defences. Then Richelieu,
coming out of his seclusion, threw himself upon the nation. He
appealed to the great legal, ecclesiastical, and commercial
corporations of Paris, and he did not appeal in vain. Money,
voluntarily offered, came pouring into the treasury for the payment
of the troops. Those who had no money gave themselves eagerly for
military service. It was remarked that Paris, so fanatically Catholic in
the days of St. Bartholomew and the League, entrusted its defence
to the Protestant marshal La Force, whose reputation for integrity
inspired universal confidence.
§ 3. The invaders driven back.
§ 4. Battle of Wittstock.
§ 5. Death of Ferdinand II.
§ 6. Ferdinand III.
The resistance undertaken in such a spirit in Paris
was imitated by the other towns of the kingdom.
Even the nobility, jealous as they were of the
Cardinal, forgot their grievances as an aristocracy in their duties as
Frenchmen. Their devotion was not put to the test of action. The
invaders, frightened at the unanimity opposed to them, hesitated
and turned back. In September, Lewis took the field in person. In
November he appeared before Corbie; and the last days of the year
saw the fortress again in the keeping of a French garrison. The war,
which was devastating Germany, was averted from France by the
union produced by the mild tolerance of Richelieu.
In Germany, too, affairs had taken a turn. The
Elector of Saxony had hoped to drive the Swedes
across the sea; but a victory gained on October 4,
at Wittstock, by the Swedish general, Baner, the ablest of the
successors of Gustavus, frustrated his intentions. Henceforward
North Germany was delivered over to a desolation with which even
the misery inflicted by Wallenstein affords no parallel.
Amidst these scenes of failure and misfortune the
man whose policy had been mainly responsible for
the miseries of his country closed his eyes for ever.
On February 15, 1637, Ferdinand II. died at Vienna. Shortly before
his death the King of Hungary had been elected King of the Romans,
and he now, by his father's death, became the Emperor Ferdinand
III.
The new Emperor had no vices. He did not even
care, as his father did, for hunting and music.
When the battle of Nördlingen was won under his
command he was praying in his tent whilst his soldiers were fighting.
He sometimes took upon himself to give military orders, but the
handwriting in which they were conveyed was such an abominable
scrawl that they only served to enable his generals to excuse their
defeats by the impossibility of reading their instructions. His great
passion was for keeping strict accounts. Even the Jesuits, it is said,
found out that, devoted as he was to his religion, he had a sharp eye
for his expenditure. One day they complained that some tolls
bequeathed to them by his father had not been made over to them,
and represented the value of the legacy as a mere trifle of 500
florins a year. The Emperor at once gave them an order upon the
treasury for the yearly payment of the sum named, and took
possession of the tolls for the maintenance of the fortifications of
Vienna. The income thus obtained is said to have been no less than
12,000 florins a year.
§ 7. Campaign of 1637.
§ 1. The capture of Breisach.
Such a man was not likely to rescue the Empire
from its miseries. The first year of his reign,
however, was marked by a gleam of good fortune.
Baner lost all that he had gained at Wittstock, and was driven back
to the shores of the Baltic. On the western frontier the imperialists
were equally successful. Würtemberg accepted the Peace of Prague,
and submitted to the Emperor. A more general peace was talked of.
But till Alsace was secured to one side or the other no peace was
possible.
Section III.—The Struggle for Alsace.
The year 1638 was to decide the question.
Bernhard was looking to the Austrian lands in
Alsace and the Breisgau as a compensation for his
lost duchy of Franconia. In February he was besieging Rheinfelden.
Driven off by the imperialists on the 26th, he re-appeared
unexpectedly on March 3, taking the enemy by surprise. They had
not even sufficient powder with them to load their guns, and the
victory of Rheinfelden was the result. On the 24th Rheinfelden itself
surrendered. Freiburg followed its example on April 22, and
Bernhard proceeded to undertake the siege of Breisach, the great
fortress which domineered over the whole valley of the Upper Rhine.
Small as his force was, he succeeded, by a series of rapid
movements, in beating off every attempt to introduce supplies, and
on December 19 he entered the place in triumph.
§ 2. The capture a turning point in the war.
§ 3. Bernhard wishes to keep Breisach.
§ 4. Refuses to dismember the Empire.
§ 5. Death of Bernhard.
The campaign of 1638 was the turning point in the
struggle between France and the united House of
Austria. A vantage ground was then won which
was never lost.
Bernhard himself, however, was loth to realize the
world-wide importance of the events in which he
had played his part. He fancied that he had been
fighting for his own, and he claimed the lands
which he had conquered for himself. He received the homage of the
citizens of Breisach in his own name. He celebrated a Lutheran
thanksgiving festival in the cathedral. But the French Government
looked upon the rise of an independent German principality in Alsace
with as little pleasure as the Spanish government had contemplated
the prospect of the establishment of Wallenstein in the Palatinate.
They ordered Bernhard to place his conquests under the orders of
the King of France.
Strange as it may seem, the man who had done so
much to tear in pieces the Empire believed, in a
sort of way, in the Empire still. 'I will never suffer,'
he said, in reply to the French demands, 'that men
can truly reproach me with being the first to dismember the Empire.'
The next year he crossed the Rhine with the most
brilliant expectations. Baner had recovered
strength, and was pushing on through North
Germany into Bohemia. Bernhard hoped that he too might strike a
blow which would force on a peace on his own conditions. But his
greatest achievement, the capture of Breisach, was also his last. A
fatal disease seized upon him when he had hardly entered upon the
campaign. On July 8, 1639, he died.
[Sidenotes: § 6. Alsace in French possession. § 1. State of Italy. § 2. Maritime warfare. § 3. The Spanish fleet in the Downs.]
There was no longer any question of the ownership
of the fortresses in Alsace and the Breisgau. French
governors entered into possession. A French
general took the command of Bernhard's army. For
the next two or three years Bernhard's old troops fought up and
down Germany in conjunction with Baner, not without success, but
without any decisive victory. The French soldiers were becoming, like
the Germans, inured to war. The lands on the Rhine were not easily
to be wrenched out of the strong hands which had grasped them.
Section IV.—French Successes.
Richelieu had other successes to count besides
these victories on the Rhine. In 1637 the Spaniards
drove out of Turin the Duchess-Regent Christina,
the mother of the young Duke of Savoy. She was a sister of the King
of France; and, even if that had not been the case, the enemy of
Spain was, in the nature of the case, the friend of France. In 1640
she re-entered her capital with French assistance.
At sea, too, where Spain, though unable to hold its
own against the Dutch, had long continued to be
superior to France, the supremacy of Spain was
coming to an end. During the whole course of his ministry, Richelieu
had paid special attention to the encouragement of commerce and
the formation of a navy. Troops could no longer be despatched with
safety to Italy from the coasts of Spain. In 1638 a French squadron
burnt Spanish galleys in the Bay of Biscay.
In 1639 a great Spanish fleet on its way to the
Netherlands was strong enough to escape the
French, who were watching to intercept it. It sailed
up the English Channel with the not distant goal of
the Flemish ports almost in view. But the huge galleons were ill-
manned and ill-found. They were still less able to resist the lighter,
well-equipped vessels of the Dutch fleet, which was waiting to
intercept them, than the Armada had been able to resist Drake and
Raleigh fifty-one years before. The Spanish commander sought
refuge in the Downs, under the protection of the neutral flag of
England.
[Sidenotes: § 4. Destruction of the fleet. § 5. France and England.]
The French ambassador pleaded hard with the king
of England to allow the Dutch to follow up their
success. The Spanish ambassador pleaded hard
with him for protection to those who had taken refuge on his shores.
Charles saw in the occurrence an opportunity to make a bargain with
one side or the other. He offered to abandon the Spaniards if the
French would agree to restore his nephew, Charles Lewis, the son of
his sister Elizabeth, to his inheritance in the Palatinate. He offered to
protect the Spaniards if Spain would pay him the large sum which he
would want for the armaments needed to bid defiance to France.
Richelieu had no intention of completing the bargain offered to him.
He deluded Charles with negotiations, whilst the Dutch admiral
treated the English neutrality with scorn. He dashed amongst the tall
Spanish ships as they lay anchored in the Downs: some he sank,
some he set on fire. Eleven of the galleons were soon destroyed.
The remainder took advantage of a thick fog, slipped across the
Straits, and placed themselves in safety under the guns of Dunkirk.
Never again did such a fleet as this venture to leave the Spanish
coast for the harbours of Flanders. The injury to Spain went far
beyond the actual loss. Coming, as the blow did, within a few
months after the surrender of Breisach, it all but severed the
connexion for military purposes between Brussels and Madrid.
Charles at first took no umbrage at the insult. He
still hoped that Richelieu would forward his
nephew's interests, and he even expected that
Charles Lewis would be placed by the King of France in command of
the army which had been under Bernhard's orders. But Richelieu
was in no mood to place a German at the head of these well-trained
veterans, and the proposal was definitively rejected. The King of
England, dissatisfied at this repulse, inclined once more to the side
of Spain.
[Sidenotes: § 6. Insurrection in Catalonia. § 7. Break-up of the Spanish monarchy.]
But Richelieu found a way to prevent Spain from securing
even what assistance it was in the power of a king so unpopular as
Charles to render. It was easy to enter into communication with
Charles's domestic enemies. His troubles, indeed, were mostly of his
own making, and he would doubtless have lost his throne whether
Richelieu had stirred the fire or not. But the French minister
contributed all that was in his power to make the confusion greater,
and encouraged, as far as possible, the resistance which had already
broken out in Scotland, and which was threatening to break out in
England.
The failure of 1636 had been fully redeemed. No
longer attacking any one of the masses of which
the Spanish monarchy was composed, Richelieu
placed his hands upon the lines of communication between them. He
made his presence felt not at Madrid, at Brussels, at Milan, or at
Naples, but in Alsace, in the Mediterranean, in the English Channel.
The effect was as complete as is the effect of snapping the wire of a
telegraph. At once the Peninsula startled Europe by showing signs of
dissolution. In 1639 the Catalonians had manfully defended
Roussillon against a French invasion. In 1640 they were prepared to
fight with equal vigour. But the Spanish Government, in its desperate
straits, was not content to leave them to combat in their own way,
after the irregular fashion which befitted mountaineers. Orders were
issued commanding all men capable of fighting to arm themselves
for the war, all women to bear food and supplies for the army on
their backs. A royal edict followed, threatening those who showed
themselves remiss with imprisonment and the confiscation of their
goods.
The cord which bound the hearts of Spaniards to
their king was a strong one; but it snapped at last.
It was not by threats that Richelieu had defended
France in 1636. The old traditions of provincial
independence were strong in Catalonia, and the Catalans were soon
in full revolt. Who were they, to be driven to the combat by
menaces, as the Persian slaves had been driven on at Thermopylæ
by the blows of their masters' officers?
[Sidenotes: § 8. Independence of Portugal. § 9. Failure of Soissons in France. § 10. Richelieu's last days.]
Equally alarming was the news which reached
Madrid from the other side of the Peninsula. Ever
since the days of Philip II. Portugal had formed an
integral part of the Spanish monarchy. In
December 1640 Portugal renounced its allegiance, and reappeared
amongst European States under a sovereign of the House of
Braganza.
Everything prospered in Richelieu's hands. In 1641
a fresh attempt was made by the partizans of
Spain to raise France against him. The Count of
Soissons, a prince of the blood, placed himself at
the head of an imperialist army to attack his native country. He
succeeded in defeating the French forces sent to oppose him not far
from Sedan. But a chance shot passing through the brain of Soissons
made the victory a barren one. His troops, without the support of his
name, could not hope to rouse the country against Richelieu. They
had become mere invaders, and they were far too few to think of
conquering France.
Equal success attended the French arms in
Germany. In 1641 Guebriant, with his German and
Swedish army, defeated the imperialists at
Wolfenbüttel, in the north. In 1642 he defeated them again at
Kempten, in the south. In the same year Roussillon submitted to
France. Nor was Richelieu less fortunate at home. The conspiracy of
a young courtier, the last of the efforts of the aristocracy to shake off
the heavy rule of the Cardinal, was detected, and expiated on the
scaffold. Richelieu did not long survive his latest triumph. He died on
December 4, 1642.
Section V.—Aims and Character of Richelieu.
  • 6. Lecture Notes in Computer Science 6590
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board:
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
  • 7. Per Stenström (Ed.)
Transactions on High-Performance Embedded Architectures and Compilers III
  • 8. Volume Editor
Per Stenström
Chalmers University of Technology
Department of Computer Science and Engineering
412 96 Gothenburg, Sweden
E-mail: per.stenstrom@chalmers.se
ISSN 0302-9743 (LNCS), e-ISSN 1611-3349 (LNCS)
ISSN 1864-306X (THIPEAC), e-ISSN 1864-3078 (THIPEAC)
ISBN 978-3-642-19447-4, e-ISBN 978-3-642-19448-1
DOI 10.1007/978-3-642-19448-1
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2007923068
CR Subject Classification (1998): B.2, C.1, D.3.4, B.5, C.2, D.4
© Springer-Verlag Berlin Heidelberg 2011
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
  • 9. Editor-in-Chief’s Message
It is my pleasure to introduce you to the third volume of Transactions on High-Performance Embedded Architectures and Compilers. This journal was created as an archive for scientific articles in the converging fields of high-performance and embedded computer architectures and compiler systems. Design considerations in both general-purpose and embedded systems are increasingly being based on similar scientific insights. For example, a state-of-the-art game console today consists of a powerful parallel computer whose building blocks are the same as those found in computational clusters for high-performance computing. Moreover, keeping power/energy consumption at a low level for high-performance general-purpose systems as well as in, for example, mobile embedded systems is equally important in order to either keep heat dissipation at a manageable level or to maintain a long operating time despite the limited battery capacity. It is clear that similar scientific issues have to be solved to build competitive systems in both segments. Additionally, for high-performance systems to be realized – be it embedded or general-purpose – a holistic design approach has to be taken by factoring in the impact of applications as well as the underlying technology when making design trade-offs.
The main topics of this journal reflect this development and include (among others):
– Processor architecture, e.g., network and security architectures, application-specific processors and accelerators, and reconfigurable architectures
– Memory system design
– Power, temperature, performance, and reliability constrained designs
– Evaluation methodologies, program characterization, and analysis techniques
– Compiler techniques for embedded systems, e.g., feedback-directed optimization, dynamic compilation, adaptive execution, continuous profiling/optimization, back-end code generation, and binary translation/optimization
– Code size/memory footprint optimizations
This volume contains 14 papers divided into four sections. The first section is a special section containing the top four papers from the Third International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC). I would like to thank Manolis Katevenis (University of Crete and FORTH) and Rajiv Gupta (University of California at Riverside) for acting as guest editors of that section. Papers in this section deal with cache performance issues and improved branch prediction. The second section is a set of four papers providing a snapshot from the Eighth MEDEA Workshop. I am indebted to Sandro Bartolini and Pierfrancesco Foglia for putting together this special section. The third section contains two regular papers, and the fourth section provides a snapshot from the First Workshop on Programmability Issues for Multicore Computers (MULTIPROG). The organizers – Eduard Ayguade, Roberto
  • 10. Gioiosa, and Osman Unsal – have put together this section. I thank them for their effort.
The editorial board has worked diligently to handle the papers for the journal. I would like to thank all the contributing authors, editors, and reviewers for their excellent work.
Per Stenström, Chalmers University of Technology
Editor-in-Chief, Transactions on HiPEAC
  • 11. Editorial Board
Per Stenström is a professor of computer engineering at Chalmers University of Technology. His research interests are devoted to design principles for high-performance computer systems and he has made multiple contributions to especially high-performance memory systems. He has authored or co-authored three textbooks and more than 100 publications in international journals and conferences. He regularly serves Program Committees of major conferences in the computer architecture field. He is also an associate editor of IEEE Transactions on Parallel and Distributed Processing Systems, a subject-area editor of the Journal of Parallel and Distributed Computing, an associate editor of the IEEE TCCA Computer Architecture Letters, and the founding Editor-in-Chief of Transactions on High-Performance Embedded Architectures and Compilers. He co-founded the HiPEAC Network of Excellence funded by the European Commission. He has acted as General and Program Chair for a large number of conferences including the ACM/IEEE Int. Symposium on Computer Architecture, the IEEE High-Performance Computer Architecture Symposium, and the IEEE Int. Parallel and Distributed Processing Symposium. He is a Fellow of the ACM and the IEEE and a member of Academia Europaea and the Royal Swedish Academy of Engineering Sciences.
Koen De Bosschere obtained his PhD from Ghent University in 1992. He is a professor in the ELIS Department at the Universiteit Gent, where he teaches courses on computer architecture and operating systems. His current research interests include computer architecture, system software, and code optimization. He has co-authored 150 contributions in the domain of optimization, performance modeling, microarchitecture, and debugging. He is the coordinator of the ACACES research network and of the European HiPEAC2 network. Contact him at Koen.DeBosschere@elis.UGent.be.
  • 12. Jose Duato is a professor in the Department of Computer Engineering (DISCA) at UPV, Spain. His research interests include interconnection networks and multiprocessor architectures. He has published over 340 papers. His research results have been used in the design of the Alpha 21364 microprocessor, the Cray T3E, IBM BlueGene/L, and Cray Black Widow supercomputers. Dr. Duato is the first author of the book Interconnection Networks: An Engineering Approach. He has served as associate editor of IEEE TPDS and IEEE TC. He was General Co-chair of ICPP 2001, Program Chair of HPCA-10, and Program Co-chair of ICPP 2005. Also, he has served as Co-chair, Steering Committee member, Vice-Chair, or Program Committee member in more than 55 conferences, including HPCA, ISCA, IPPS/SPDP, IPDPS, ICPP, ICDCS, Europar, and HiPC.
Manolis Katevenis received his PhD degree from U.C. Berkeley in 1983 and the ACM Doctoral Dissertation Award in 1984 for his thesis on “Reduced Instruction Set Computer Architectures for VLSI.” After a brief term on the faculty of Computer Science at Stanford University, he has been in Greece, with the University of Crete and with FORTH, since 1986. After RISC, his research has been on interconnection networks and interprocessor communication. In packet switch architectures, his contributions since 1987 have been mostly in per-flow queueing, credit-based flow control, congestion management, weighted round-robin scheduling, buffered crossbars, and non-blocking switching fabrics. In multiprocessing and clustering, his contributions since 1993 have been on remote-write-based, protected, user-level communication. His home URL is http://archvlsi.ics.forth.gr/∼kateveni
  • 13. Michael O’Boyle is a professor in the School of Informatics at the University of Edinburgh and an EPSRC Advanced Research Fellow. He received his PhD in Computer Science from the University of Manchester in 1992. He was formerly a SERC Postdoctoral Research Fellow, a Visiting Research Scientist at IRISA/INRIA Rennes, a Visiting Research Fellow at the University of Vienna, and a Visiting Scholar at Stanford University. More recently he was a Visiting Professor at UPC, Barcelona. Dr. O’Boyle’s main research interests are in adaptive compilation, formal program transformation representations, the compiler impact on embedded systems, compiler-directed low-power optimization, and automatic compilation for parallel single-address-space architectures. He has published over 50 papers in international journals and conferences in this area and manages the Compiler and Architecture Design group consisting of 18 members.
Cosimo Antonio Prete is a full professor of computer systems at the University of Pisa, Italy, and a faculty member of the PhD School in Computer Science and Engineering (IMT), Italy. He is Coordinator of the Graduate Degree Program in Computer Engineering and Rector’s Adviser for Innovative Training Technologies at the University of Pisa. His research interests are focused on multiprocessor architectures, cache memory, performance evaluation, and embedded systems. He is an author of more than 100 papers published in international journals and conference proceedings. He has been project manager for several research projects, including: the SPP project, OMI, Esprit IV; the CCO project, supported by VLSI Technology, Sophia Antipolis; the ChArm project, supported by VLSI Technology, San Jose; and the Esprit III Tracs project.
  • 14. André Seznec is “directeur de recherches” at IRISA/INRIA. Since 1994, he has been the head of the CAPS (Compiler Architecture for Superscalar and Special-purpose Processors) research team. He has been conducting research on computer architecture for more than 20 years. His research topics have included memory hierarchy, pipeline organization, simultaneous multithreading, and branch prediction. In 1999–2000, he spent a sabbatical year with the Alpha Group at Compaq.
Olivier Temam obtained a PhD in computer science from the University of Rennes in 1993. He was assistant professor at the University of Versailles from 1994 to 1999, and then professor at the University of Paris Sud until 2004. Since then, he has been a senior researcher at INRIA Futurs in Paris, where he heads the Alchemy group. His research interests include program optimization, processor architecture, and emerging technologies, with a general emphasis on long-term research.
  • 15. Theo Ungerer is Chair of Systems and Networking at the University of Augsburg, Germany, and Scientific Director of the Computing Center of the University of Augsburg. He received a Diploma in Mathematics at the Technical University of Berlin in 1981, a Doctoral Degree at the University of Augsburg in 1986, and a second Doctoral Degree (Habilitation) at the University of Augsburg in 1992. Before his current position he was scientific assistant at the University of Augsburg (1982–1989 and 1990–1992), visiting assistant professor at the University of California, Irvine (1989–1990), and professor of computer architecture at the University of Jena (1992–1993) and the Technical University of Karlsruhe (1993–2001). He is a Steering Committee member of HiPEAC and of the German Science Foundation’s priority programme on “Organic Computing.” His current research interests are in the areas of embedded processor architectures, embedded real-time systems, and organic, bionic and ubiquitous systems.
Mateo Valero obtained his PhD at UPC in 1980. He is a professor in the Computer Architecture Department at UPC. His research interests focus on high-performance architectures. He has published approximately 400 papers on these topics. He is the director of the Barcelona Supercomputing Center, the National Center of Supercomputing in Spain. Dr. Valero has been honored with several awards, including the King Jaime I award by the Generalitat Valenciana, and the Spanish national award “Julio Rey Pastor” for his research on IT technologies. In 2001, he was appointed Fellow of the IEEE, in 2002 Intel Distinguished Research Fellow, and since 2003 a Fellow of the ACM. Since 1994, he has been a foundational member of the Royal Spanish Academy of Engineering. In 2005 he was elected Correspondant Academic of the Spanish Royal Academy of Sciences, and his native town of Alfamén named their public college after him.
Georgi Gaydadjiev is a professor in the computer engineering laboratory of the Technical University of Delft, The Netherlands. His research interests focus on many aspects of embedded systems design, with an emphasis on reconfigurable computing. He has published about 50 papers on these topics in international refereed journals and conferences. He has served as a Program Committee member of many conferences and is a subject area editor for the Journal of Systems Architecture.
Table of Contents

Third International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC)

Dynamic Cache Partitioning Based on the MLP of Cache Misses . . . 3
  Miquel Moreto, Francisco J. Cazorla, Alex Ramirez, and Mateo Valero
Cache Sensitive Code Arrangement for Virtual Machine . . . 24
  Chun-Chieh Lin and Chuen-Liang Chen
Data Layout for Cache Performance on a Multithreaded Architecture . . . 43
  Subhradyuti Sarkar and Dean M. Tullsen
Improving Branch Prediction by Considering Affectors and Affectees Correlations . . . 69
  Yiannakis Sazeides, Andreas Moustakas, Kypros Constantinides, and Marios Kleanthous

Eighth MEDEA Workshop (Selected Papers)

Introduction . . . 91
  Sandro Bartolini, Pierfrancesco Foglia, and Cosimo Antonio Prete
Exploring the Architecture of a Stream Register-Based Snoop Filter . . . 93
  Matthias Blumrich, Valentina Salapura, and Alan Gara
CROB: Implementing a Large Instruction Window through Compression . . . 115
  Fernando Latorre, Grigorios Magklis, Jose González, Pedro Chaparro, and Antonio González
Power-Aware Dynamic Cache Partitioning for CMPs . . . 135
  Isao Kotera, Kenta Abe, Ryusuke Egawa, Hiroyuki Takizawa, and Hiroaki Kobayashi
A Multithreaded Multicore System for Embedded Media Processing . . . 154
  Jan Hoogerbrugge and Andrei Terechko
Regular Papers

Parallelization Schemes for Memory Optimization on the Cell Processor: A Case Study on the Harris Corner Detector . . . 177
  Tarik Saidani, Lionel Lacassagne, Joel Falcou, Claude Tadonki, and Samir Bouaziz
Constructing Application-Specific Memory Hierarchies on FPGAs . . . 201
  Harald Devos, Jan Van Campenhout, Ingrid Verbauwhede, and Dirk Stroobandt

First Workshop on Programmability Issues for Multi-core Computers (MULTIPROG)

autopin – Automated Optimization of Thread-to-Core Pinning on Multicore Systems . . . 219
  Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis
Robust Adaptation to Available Parallelism in Transactional Memory Applications . . . 236
  Mohammad Ansari, Mikel Luján, Christos Kotselidis, Kim Jarvis, Chris Kirkham, and Ian Watson
Efficient Partial Roll-Backing Mechanism for Transactional Memory Systems . . . 256
  M.M. Waliullah
Software-Level Instruction-Cache Leakage Reduction Using Value-Dependence of SRAM Leakage in Nanometer Technologies . . . 275
  Maziar Goudarzi, Tohru Ishihara, and Hamid Noori

Author Index . . . 301
Dynamic Cache Partitioning Based on the MLP of Cache Misses

Miquel Moreto1, Francisco J. Cazorla2, Alex Ramirez1,2, and Mateo Valero1,2

1 Universitat Politècnica de Catalunya, DAC, Barcelona, Spain; HiPEAC European Network of Excellence
2 Barcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain
{mmoreto,aramirez,mateo}@ac.upc.edu, francisco.cazorla@bsc.es

Abstract. Dynamic partitioning of shared caches has been proposed to improve the performance of traditional eviction policies in modern multithreaded architectures. All existing Dynamic Cache Partitioning (DCP) algorithms work on the number of misses caused by each thread and treat all misses equally. However, it has been shown that cache misses have a different impact on performance depending on their distribution. Clustered misses share their miss penalty, as they can be served in parallel, while isolated misses have a greater impact on performance, as the memory latency is not shared with other misses. We take this fact into account and propose a new DCP algorithm that weights misses differently depending on their influence on performance. Our proposal obtains improvements over traditional eviction policies of up to 63.9% (10.6% on average) and also outperforms previous DCP proposals by up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches the same performance as a 50% larger shared cache. Finally, we present a practical implementation of our proposal that requires less than 8KB of storage.

1 Introduction

The limitation imposed by instruction-level parallelism (ILP) has motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. TLP paradigms such as simultaneous multithreading (SMT) [1,2], chip multiprocessor (CMP) [3] and combinations of both offer the opportunity to obtain higher throughputs. However, they also have to face the challenge of sharing the resources of the architecture.
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 3–23, 2011. © Springer-Verlag Berlin Heidelberg 2011

Simply avoiding any resource control can lead to undesired situations where one thread monopolizes all the resources and harms the other threads. Some studies deal with the resource sharing problem in SMTs at the level of core resources such as issue queues, registers, etc. [4]. In CMPs, resource sharing is focused on the cache hierarchy. Some applications present low reuse of their data and pollute caches with data streams, such as multimedia, communications or streaming applications, or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently
Used (LRU), pseudo-LRU or random are demand-driven; that is, they tend to give more space to the application that has more accesses and misses in the cache hierarchy [5,6]. As a consequence, some threads can suffer a severe degradation in performance. Previous work has tried to solve this problem by using static and dynamic partitioning algorithms that monitor the L2 cache accesses and decide a partition for a fixed amount of cycles in order to maximize throughput [7,8,9] or fairness [10]. Basically, these proposals predict the number of misses per application for each possible cache partition. Then, they use the cache partition that leads to the minimum number of misses for the next interval.

A common characteristic of these proposals is that they treat all L2 misses equally. However, in out-of-order architectures L2 misses affect performance differently depending on how clustered they are. An isolated L2 miss has approximately the same miss penalty as a cluster of L2 misses, as clustered misses can be served in parallel if they all fit in the reorder buffer (ROB) [11]. Figure 1 illustrates this behavior. We have represented an ideal IPC curve that is constant until an L2 miss occurs. After some cycles, commit stops. When the cache line comes from main memory, commit ramps up to its steady state value. As a consequence, an isolated L2 miss has a higher impact on performance than a miss in a burst of misses, as the memory latency is shared by all clustered misses.

Fig. 1. Isolated and clustered L2 misses: (a) isolated L2 miss; (b) clustered L2 misses.

Based on this fact, we propose a new DCP algorithm that gives a cost to each L2 access according to its impact on final performance. We detect isolated and clustered misses and assign a higher cost to isolated misses. Then, our algorithm determines the partition that minimizes the total cost for all threads, which is used in the next interval.
Our results show that differentiating between clustered and isolated L2 misses leads to cache partitions with higher performance than previous proposals. The main contributions of this work are the following.

1) A runtime mechanism to dynamically partition shared L2 caches in a CMP scenario that takes into account the MLP of each L2 access. We obtain improvements over LRU of up to 63.9% (10.6% on average) and over previous proposals of up to 15.4% (4.1% on average) in a four-core architecture. Our proposal reaches the same performance as a 50% larger shared cache.

2) We extend previous workload classifications to CMP architectures with more than two cores, so that results can be analyzed per workload group.
3) We present a sampling technique that reduces the hardware cost in terms of storage to less than 1% of the total L2 cache size, with an average throughput degradation of 0.76% (compared to the throughput obtained without sampling). We also show that scalable algorithms to decide cache partitions give near-optimal partitions, 0.59% away from the optimal decision.

The rest of this paper is structured as follows. Section 2 introduces the methods that have been previously proposed to decide L2 cache partitions and related work. Next, Section 3 explains our MLP-aware DCP algorithm. Section 4 describes the experimental environment, and in Section 5 we discuss simulation results. Finally, Section 6 summarizes our results.

2 Prior Work in Dynamic Cache Partitioning

Stack Distance Histogram (SDH). Mattson et al. introduce the concept of stack distance to study the behavior of storage hierarchies [12]. Common eviction policies such as LRU have the stack property. Thus, each set in a cache can be seen as an LRU stack, where lines are sorted by their last access cycle. In that way, the first line of the LRU stack is the Most Recently Used (MRU) line, while the last line is the LRU line. The position that a line has in the LRU stack when it is accessed again is defined as the stack distance of the access. As an example, Table 1(a) shows a stream of accesses to the same set with their corresponding stack distances.

Table 1. Stack Distance Histogram
(a) Stream of accesses to a given cache set.
  # Reference:     1  2  3  4  5  6  7  8
  Cache Line:      A  B  C  C  A  D  B  D
  Stack Distance:  -  -  -  1  3  -  4  2
(b) SDH example.
  Stack Distance:  1   2   3   4   >4
  # Accesses:      60  20  10  5   5

For a K-way associative cache with an LRU replacement algorithm, we need K + 1 counters to build SDHs, denoted C1, C2, ..., CK, C>K. On each cache access, one of the counters is incremented.
If it is a cache access to a line in the ith position in the LRU stack of the set, Ci is incremented. If it is a cache miss, the line is not found in the LRU stack and, as a result, we increment the miss counter C>K. SDHs can be obtained during execution by running the thread alone in the system [7] or by adding some hardware counters that profile this information [8,9]. A useful property of these histograms is that the number of cache misses for a smaller cache with the same number of sets can be easily computed: for a K'-way associative cache, where K' < K, the new number of misses is misses(K') = C>K + Σ_{i=K'+1..K} Ci. As an example, Table 1(b) shows an SDH for a set with 4 ways. Here, we have 5 cache misses. However, if we reduce the number of ways to 2 (keeping the number of sets constant), we will experience 20 misses (5 + 5 + 10).
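This property of SDHs can be sketched in a few lines of Python (a sketch of ours, not the paper's hardware; names are hypothetical):

```python
# Computing the miss count of a smaller cache from a Stack Distance
# Histogram, following the formula misses(K') = C>K + sum of C_i for
# i = K'+1 .. K.

def misses_for_ways(sdh, over_k, k_prime):
    """sdh[j] = accesses with stack distance j+1 (j = 0..K-1);
    over_k = accesses that miss even with all K ways (C>K).
    Returns misses for a cache with k_prime <= K ways (same set count)."""
    return over_k + sum(sdh[k_prime:])

# Table 1(b): C1=60, C2=20, C3=10, C4=5, C>4=5
sdh = [60, 20, 10, 5]
print(misses_for_ways(sdh, 5, 4))  # 5 misses with all 4 ways
print(misses_for_ways(sdh, 5, 2))  # 20 misses with only 2 ways (5 + 5 + 10)
```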
Minimizing Total Misses. Using the SDHs of N applications, we can derive the L2 cache partition that minimizes the total number of misses, that is, the sum of the number of misses of each thread for the given configuration. The optimal partition in the last period of time is a suitable candidate to become the future optimal partition. Partitions are decided periodically, after a fixed amount of cycles, at a way granularity. This mechanism minimizes the total number of misses in an attempt to maximize throughput. A first approach proposed a static partitioning of the L2 cache using profiling information [7]. Then, a dynamic approach estimated SDHs with information inside the cache [9]. Finally, Qureshi et al. presented a suitable and scalable circuit to measure SDHs using sampling and obtained performance gains with just 0.2% extra space in the L2 cache [8]. Throughout this paper, we will call this last policy MinMisses.

Fair Partitioning. In some situations, MinMisses can lead to unfair partitions that assign nearly all the resources to one thread while harming the others [10]. For that reason, the authors propose considering fairness when deciding new partitions. Instead of minimizing the total number of misses, they try to equalize the statistic Xi = misses_shared_i / misses_alone_i of each thread i, forcing all threads to have the same relative increase in misses. Partitions are decided periodically using an iterative method: the thread with the largest Xi receives a way from the thread with the smallest Xi until all threads have a similar value of Xi. Throughout this paper, we will call this policy Fair.

Table 2.
Different Partitioning Proposals

  Paper  Partitioning  Objective         Decision          Algorithm      Eviction Policy
  [7]    Static        Minimize Misses   Programmer        −              Column Caching
  [9]    Dynamic       Minimize Misses   Architecture      Marginal Gain  Augmented LRU
  [8]    Dynamic       Maximize Utility  Architecture      Lookahead      Augmented LRU
  [10]   Dynamic       Fairness          Architecture      Equalize Xi    Augmented LRU
  [13]   Dynamic       Maximize reuse    Architecture      Reuse          Column Caching
  [14]   Dyn./Static   Configurable      Operating System  Configurable   Augmented LRU

Other Related Work. Several papers propose different DCP algorithms in a multithreaded scenario. Table 2 summarizes these proposals with their most significant characteristics. Settle et al. introduce a DCP similar to MinMisses that decides partitions depending on the average data reuse of each application [13]. Rafique et al. propose to manage shared caches with a hardware cache quota enforcement mechanism and an interface between the architecture and the OS to let the latter decide quotas [14]. We note that this mechanism is completely orthogonal to our proposal and, in fact, they are compatible, as we can let the OS decide quotas according to our scheme. Hsu et al. evaluate different cache policies in a CMP scenario [15]. They show that none of them is optimal among all benchmarks and that the best cache policy varies depending on the performance metric being used. Thus, they propose to use a thread-aware
cache resource allocation. In fact, their results reinforce the motivation of our paper: if we do not consider the impact of each L2 miss on performance, we can decide suboptimal L2 partitions in terms of throughput.

Cache partitions at a way granularity can be implemented with column caching [7], which uses a bit mask to mark reserved ways, or by augmenting the LRU policy with counters that keep track of the number of lines in a set belonging to a thread [9]. The evicted line is the LRU line among the thread's own lines or other threads' lines, depending on whether the thread has reached its quota or not. In [16] a new eviction policy for private caches was proposed in single-threaded architectures. This policy gives a weight to each L2 miss according to its MLP when the block is filled from memory. Eviction is decided using the LRU counters and this weight. This idea was proposed for a different scenario, as it focuses on single-threaded architectures.

3 MLP-Aware Dynamic Cache Partitioning

3.1 Algorithm Overview

Algorithm 3.1 shows the necessary steps to dynamically decide cache partitions according to the MLP of each L2 access. At the beginning of the execution, we decide an initial partition of the L2 cache. As we have no prior knowledge of the applications, we evenly distribute ways among cores. Hence, each core receives Associativity / (Number of Cores) ways of the shared L2 cache.

Algorithm 3.1. MLP-aware DCP()
  Step 1: Establish an initial even partition for each core.
  Step 2: Run threads and collect data for the MLP-aware SDHs.
  Step 3: Decide new partition.
  Step 4: Update MLP-aware SDHs.
  Step 5: Go back to Step 2.

Afterwards, we begin a period where we measure the total MLP cost of each application. The histogram of each thread containing the total MLP cost for each possible partition is denoted the MLP-aware SDH. For a K-way associative cache, exactly K registers are needed to store this histogram.
For short periods, dynamic cache partitioning (DCP) algorithms react more quickly to phase changes. Our results show that, for periods ranging from 10^5 to 10^8 cycles, only small performance variations are obtained, with a peak for a period of 5 million cycles.

At the end of each interval, MLP-aware SDHs are analyzed and a new partition is decided for the next interval. We assume that running threads will have a similar pattern of L2 accesses in the next measuring period. Thus, the optimal partition for the last period is chosen for the following period. Evaluating
all possible cache partitions gives the optimal partition. This evaluation is done concurrently with dedicated hardware, which sets the partition for each process in the next period. Having old values of partition decisions does not impact the correctness of the running applications and does not affect performance, as deciding new partitions typically takes a few thousand cycles and is invoked only once every 5 million cycles.

Since the characteristics of applications change dynamically, MLP-aware SDHs should reflect these changes. However, we also wish to maintain some history of the past MLP-aware SDHs to make new decisions. Thus, after a new partition is decided, we multiply all the values of the MLP-aware SDHs by ρ ∈ [0, 1]. Large values of ρ give larger reaction times to phase changes, while small values of ρ adapt quickly to phase changes but tend to forget the behavior of the application. Small performance variations are obtained for different values of ρ ranging from 0 to 1, with a peak for ρ = 0.5. Furthermore, this value is very convenient, as we can use a shifter to update histograms. Next, a new period of measuring MLP-aware SDHs begins. The key contribution of this paper is the method to obtain MLP-aware SDHs, which we explain in the following subsection.

3.2 MLP-Aware Stack Distance Histogram

As previously stated, MinMisses assumes that all L2 accesses are equally important in terms of performance. However, it has been shown that cache misses affect the performance of applications differently, even inside the same application [11,16]. An isolated L2 data miss has a penalty cost that can be approximated by the average memory latency. In the case of a burst of L2 data misses that fit in the ROB, the penalty cost is shared among misses, as the L2 misses can be served in parallel. L2 instruction misses, in contrast, are serialized, as fetch stops; thus, L2 instruction misses have a constant miss penalty and MLP.
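The ρ = 0.5 history decay reduces to a one-bit right shift of every counter, which is why a shifter suffices in hardware. A minimal sketch (our names, not the paper's):

```python
# Per-interval history decay of the MLP-aware SDHs. With rho = 0.5 the
# multiplication reduces to a right shift by one bit on each integer
# counter.

def decay_histograms(mlp_sdh):
    """mlp_sdh: per-core list of integer MLP-cost counters."""
    for core_hist in mlp_sdh:
        for j in range(len(core_hist)):
            core_hist[j] >>= 1  # equivalent to multiplying by rho = 0.5

hists = [[100, 40, 7], [12, 0, 3]]
decay_histograms(hists)
print(hists)  # -> [[50, 20, 3], [6, 0, 1]]
```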
We want to assign a cost to each L2 access according to its effect on performance. In [16] a similar idea was used to modify the LRU eviction policy for single-core, single-threaded architectures. In our situation, we have a CMP scenario where the shared L2 cache has a number of reserved ways for each core. At the end of each period, we decide either to continue with the same partition or to change it. If we decide to modify the partition, a core i that had wi reserved ways will receive w'i ≠ wi. If wi < w'i, the thread receives more ways and, as a consequence, some misses in the old configuration will become hits. Conversely, if wi > w'i, the thread receives fewer ways and some hits in the old configuration will become misses. Thus, we want an estimation of the performance effect when misses are converted into hits and vice versa. Throughout this paper, we will call this impact on performance the MLP cost.

MLP cost of L2 misses. In order to compute the MLP cost of an L2 miss with stack distance di, we consider the situation shown in Figure 2(a). If we force an L2 configuration that assigns exactly w'i = di ways to thread i, with w'i > wi, some of the L2 misses of this thread will become hits, while others will remain misses, depending on their stack distance. In order to track the stack distance
Fig. 2. MLP cost of L2 accesses: (a) MLP cost of an L2 miss; (b) estimated MLP cost when an L2 hit becomes a miss.

and MLP cost of each L2 miss, we have modified the L2 Miss Status Holding Registers (MSHR) [17]. This structure is similar to an L2 miss buffer and is used to hold information about any load that has missed in the L2 cache. The modified L2 MSHR has one extra field that contains the MLP cost of the miss, as can be seen in Figure 3(b). It is also necessary to store the stack distance of each access in the MSHR. Figure 3(a) shows the MSHR in the cache hierarchy.

Fig. 3. Miss Status Holding Register: (a) MSHR; (b) MSHR fields.

When the L2 cache is accessed and an L2 miss is determined, we assign an MSHR entry to the miss and wait until the data comes from main memory. We initialize the MLP cost field to zero when the entry is assigned. We store the access stack distance together with the identifier of the owner core. Every cycle, we obtain N, the number of L2 accesses with stack distance greater than or equal to di. We have a hardware counter that tracks this number for each possible
value of di, which means a total of Associativity counters. If we have N L2 misses that are being served in parallel, the miss penalty is shared. Thus, we assign an equal share of 1/N to each miss. The value of the MLP cost is updated until the data comes from main memory and fills the L2. At that moment we can free the MSHR entry. The number of adders required to update the MLP cost of all entries is equal to the number of MSHR entries. However, this number can be reduced by sharing several adders between valid MSHR entries in a round-robin fashion. Then, if an MSHR entry updates its MLP cost every 4 cycles, it has to add 4/N. In this work, we assume that the MSHR contains only four adders for updating MLP cost values, which has a negligible effect on the final MLP cost [16].

MLP cost of L2 hits. Next, we want to estimate the MLP cost of an L2 hit with stack distance di when it becomes a miss. If we forced an L2 configuration that assigned exactly w'i = di ways to thread i, with w'i < wi, some of the L2 hits of this thread would become misses, while L2 misses would remain misses (see Figure 2(b)). The hits that would become misses are the ones with stack distance greater than or equal to di. Thus, we count the total number of accesses with stack distance greater than or equal to di (including L2 hits and misses) to estimate the length of the cluster of L2 misses in this configuration.

Deciding the moment to free the entry used by an L2 hit is more complex than in the case of the MSHR. As noted in [11], in a balanced architecture, L2 data misses can be served in parallel if they all fit in the ROB. Equivalently, we say that L2 data misses can be served in parallel if they are at a ROB distance smaller than the ROB size. Thus, we should free the entry if the number of committed instructions since the access has reached the ROB size or if the number of cycles since the hit has reached the average latency to memory.
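The 1/N sharing rule can be illustrated with a small cycle-by-cycle simulation. This is a toy sketch under our own assumptions (a fixed 300-cycle latency, no ROB limit, and no per-stack-distance filtering), not the hardware mechanism itself:

```python
# Toy model of the 1/N accumulation rule: every cycle, each in-flight L2
# miss accrues 1/N, where N is the number of misses overlapping with it
# that cycle. Isolated misses therefore accumulate the full memory
# latency, fully overlapped misses an equal share of it.

MEM_LATENCY = 300  # cycles from L2 miss to fill, as in the paper's setup

def mlp_costs(miss_start_cycles):
    """miss_start_cycles: cycle at which each miss is issued.
    Returns the accumulated MLP cost of each miss at fill time."""
    costs = [0.0] * len(miss_start_cycles)
    last = max(miss_start_cycles) + MEM_LATENCY
    for cycle in range(last):
        in_flight = [i for i, s in enumerate(miss_start_cycles)
                     if s <= cycle < s + MEM_LATENCY]
        for i in in_flight:
            costs[i] += 1.0 / len(in_flight)
    return costs

print(mlp_costs([0]))     # -> [300.0]: an isolated miss pays the full latency
print(mlp_costs([0, 0]))  # -> [150.0, 150.0]: two overlapped misses share it
```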
The first condition is clear, as L2 misses can overlap only if their ROB distance is less than the ROB size. When the entry is freed, we add the number of pending cycles divided by the number of misses with stack distance greater than or equal to di. The second condition is also necessary, as it can occur that no L2 access is made for a period of time. To obtain the average latency to memory, we add specific hardware that counts and averages the number of cycles that a given entry spends in the MSHR.

We use new hardware to obtain the MLP cost of L2 hits. We denote this hardware the Hit Status Holding Registers (HSHR), as it is similar to the MSHR. However, the HSHR is private for each core. In each entry, the HSHR needs an identifier of the ROB entry of the access, the address accessed by the L2 hit, the stack distance value, and a field with the corresponding MLP cost, as can be seen in Figure 4(b). Figure 4(a) shows the HSHR in the cache hierarchy.

When the L2 cache is accessed and an L2 hit is determined, we assign an HSHR entry to the L2 hit. We initialize the fields of the entry as in the case of the MSHR. We have a stack distance di and we want to update the MLP cost field every cycle. With this objective, we need to know the number of active entries with stack distance greater than or equal to di in the HSHR, which can be tracked with one hardware counter per core. We also need a ROB entry identifier
Fig. 4. Hit Status Holding Register: (a) HSHR; (b) HSHR fields.

for each L2 access. Every cycle, we obtain N, the number of L2 accesses with stack distance greater than or equal to di, as in the L2 MSHR case. We have a hardware counter that tracks this number for each possible value of di, which means a total of Associativity counters.

In order to avoid array conflicts, we need as many entries in the HSHR as possible L2 accesses in flight. This number is equal to the L1 MSHR size. In our scenario, we have 32 L1 MSHR entries, which means a maximum of 32 in-flight L2 accesses per core. However, we have checked that 24 entries per core are enough to ensure that an available slot exists 95% of the time in an architecture with a ROB of 256 entries. If there are no available slots, we simply assign the minimum weight to the L2 access, as there are many L2 accesses in flight. The number of adders required to update the MLP cost of all entries is equal to the number of HSHR entries. As with the MSHR, HSHR entries can share four adders with a negligible effect on the final MLP cost.

Quantification of MLP cost. Dealing with values of MLP cost between 0 and the memory latency (or even greater) can represent a significant hardware cost. Instead, we quantize the MLP cost to an integer value between 0 and 7, as was done in [16]. For a memory latency of 300 cycles, Table 3 shows how the MLP cost is quantized: we have split the interval [0, 300] into 7 intervals of roughly equal length.

Table 3. MLP cost quantification
  MLP cost                 Quantification
  From 0 to 42 cycles      0
  From 43 to 85 cycles     1
  From 86 to 128 cycles    2
  From 129 to 170 cycles   3
  From 171 to 213 cycles   4
  From 214 to 256 cycles   5
  From 257 to 300 cycles   6
  300 or more cycles       7
Thus, isolated L2 misses will have a weight of 7, while two overlapped L2 misses will have a weight of 3 in the MLP-aware SDH. In contrast, MinMisses always adds one to its histograms.
3.3 Obtaining Stack Distance Histograms

Normally, L2 caches have two separate parts that store data and address tags to determine whether an access is a hit. Basically, our prediction mechanism needs to track every L2 access and store a separate copy of the L2 tag information in an Auxiliary Tag Directory (ATD), together with the LRU counters [8]. We need an ATD for each core that keeps track of the L2 accesses for any possible cache configuration. Independently of the number of ways assigned to each core, we store the tags and LRU counters of the last K accesses of the thread, where K is the L2 associativity. As explained in Section 2, an access with stack distance di corresponds to a cache miss in any configuration that assigns fewer than di ways to the thread. Thus, with this ATD we can determine whether an L2 access would be a miss or a hit in all possible cache configurations.

3.4 Putting All Together

Figure 5 shows a sketch of the hardware implementation of our proposal. When we have an L2 access, the ATD is used to determine its stack distance di. Depending on whether it is a miss or a hit, either the MSHR or the HSHR is used to compute the MLP cost of the access. Using the quantification process, we obtain the final MLP cost. This number estimates how performance is affected when the application has exactly w'i = di assigned ways. If w'i > wi, we are estimating the performance benefit of converting this L2 miss into a hit. If w'i < wi, we are estimating the performance degradation of converting this L2 hit into a miss. Finally, using the stack distance, the MLP cost and the core identifier, we can update the corresponding MLP-aware SDH.

Fig. 5. Hardware implementation

We have used two different partitioning algorithms. The first one, which we denote MLP-DCP (standing for MLP-aware Dynamic Cache Partitioning), decides
the optimal partition according to the MLP cost of each way. We define the total MLP cost of a thread i that uses wi ways as

  TMLP(i, wi) = MLP_SDH(i, >K) + Σ_{j=wi+1..K} MLP_SDH(i, j),

where MLP_SDH(i, j) denotes the total MLP cost of all accesses of thread i with stack distance j, and MLP_SDH(i, >K) that of its accesses with stack distance greater than K. Thus, we have to minimize the sum of total MLP costs of all cores, Σ_{i=1..N} TMLP(i, wi), subject to Σ_{i=1..N} wi = Associativity.

The second algorithm weights each total MLP cost with the IPC of the application in core i, IPCi, measured at runtime with a hardware counter per core. In this situation, we give priority to threads with higher IPC, which gives better results in throughput at the cost of being less fair. We denote this proposal MLPIPC-DCP; it consists in minimizing Σ_{i=1..N} IPCi · TMLP(i, wi), subject to Σ_{i=1..N} wi = Associativity.

3.5 Case Study

We have seen that SDHs can give the optimal partition in terms of total L2 misses. However, the total number of L2 misses is not the goal of DCP algorithms; throughput is the objective of these policies. The underlying idea of MinMisses is that, while minimizing total L2 misses, we are also increasing throughput. This idea is intuitive, as performance is clearly related to the L2 miss rate. However, this heuristic can lead to inadequate partitions in terms of throughput, as the following case study shows. Figure 6 shows the IPC curves of the benchmarks galgel and gzip as we increase the L2 cache size at a way granularity (each way is 64KB in size). It also shows throughput for all 15 possible partitions, assigning x ways to gzip and 16−x ways to galgel. The optimal partition assigns 6 ways to gzip and 10 ways to galgel, obtaining a total throughput of 3.091 instructions per cycle.
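For small core counts, the minimization of the total MLP cost can be sketched as an exhaustive search over way partitions (a sketch of ours; the paper uses scalable decision algorithms instead, and MinMisses corresponds to the same search with plain miss counts in place of MLP costs):

```python
# Per-interval partition decision: pick the way allocation minimizing the
# sum of TMLP(i, w_i) over all cores, subject to the ways summing to the
# L2 associativity. The toy MLP-aware SDHs below are illustrative only.

def total_mlp(mlp_sdh, over_k, w):
    """TMLP for one thread with w ways: cost of accesses beyond K ways
    plus cost of accesses with stack distance exceeding w (mlp_sdh[j] is
    the cost of accesses with stack distance j+1)."""
    return over_k + sum(mlp_sdh[w:])

def best_partition(threads, total_ways, min_ways=1):
    """threads: list of (mlp_sdh, over_k) pairs. Exhaustively searches
    all partitions giving each core at least min_ways ways."""
    best = (float('inf'), None)

    def rec(i, left, alloc):
        nonlocal best
        if i == len(threads) - 1:  # last core takes the remaining ways
            cost = sum(total_mlp(s, o, w)
                       for (s, o), w in zip(threads, alloc + [left]))
            if cost < best[0]:
                best = (cost, alloc + [left])
            return
        remaining = len(threads) - 1 - i
        for w in range(min_ways, left - min_ways * remaining + 1):
            rec(i + 1, left - w, alloc + [w])

    rec(0, total_ways, [])
    return best

# Two threads sharing a 4-way cache (toy MLP-aware SDHs):
threads = [([20, 10, 6, 4], 8), ([30, 2, 1, 1], 5)]
print(best_partition(threads, 4))  # -> (21, [3, 1])
```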
However, if we use the MinMisses algorithm to determine the new partition, we will choose 4 ways for gzip and 12 for galgel according to the SDH values. Figure 6 also shows the total number of misses for each cache partition, as well as the per-thread number of misses. In this situation, misses in gzip are more important in terms of performance than misses in galgel. Furthermore, the IPC of gzip is larger than that of galgel. As a consequence, MinMisses obtains a non-optimal partition in terms of IPC, and its throughput is 2.897, which is 6.3% smaller than the optimal one. In fact, galgel's clusters of L2 misses are, on average, longer than those of gzip. Accordingly, MLP-DCP assigns one extra way to gzip and increases performance by 3%. If we use MLPIPC-DCP, we give more importance to gzip, as it has a higher IPC, and, as a consequence, we end up assigning another extra way to gzip, reaching the optimal partition and increasing throughput by an extra 3%.
M. Moreto et al.

Fig. 6. Misses and IPC curves for galgel and gzip

4 Experimental Environment

4.1 Simulator Configuration

We target this study to the case of a CMP with two and four cores, each with its own data and instruction L1 caches and a unified L2 cache shared among threads, as in previous studies [8,9,10]. Each core is single-threaded and fetches up to 8 instructions per cycle. It has 6 integer (I), 3 floating point (FP), and 4 load/store functional units and 32-entry I, load/store, and FP instruction queues. Each thread has a 256-entry ROB and 256 physical registers. We use a two-level cache hierarchy with 64B lines, with separate 16KB 4-way associative data and instruction caches and a unified L2 cache that is shared among all cores. We have used two different L2 caches: one of 1MB and 16-way associativity, and a second of 2MB and 32-way associativity. Latency from L1 to L2 is 15 cycles, and from L2 to memory 300 cycles. We use a 32B-wide bus to access L2 and a multibanked L2 of 16 banks with a 3-cycle access time. We extended the SMTSim simulator [2] to model a CMP. We collected traces of the most representative 300 million instruction segment of each program, following the SimPoint methodology [18]. We use the FAME simulation methodology [19] with a Maximum Allowable IPC Variance of 5%. This evaluation methodology measures the performance of multithreaded processors by reexecuting all threads in a multithreaded workload until all of them are fairly represented in the final IPC taken from the workload.

4.2 Workload Classification

In [20], two metrics are used to model the performance of a partitioning algorithm like MinMisses for pairings of benchmarks from the SPEC CPU 2000 benchmark suite. Here, we extend this classification to architectures with more cores.

Metric 1.
The w_{P%}(B) metric measures the number of ways needed by a benchmark B to obtain at least a given percentage P% of its maximum IPC (i.e., its IPC when it uses all L2 ways).
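As a sketch, w_{P%}(B) can be computed directly from a benchmark's IPC-versus-ways curve. The curve below is made-up data shaped like an S benchmark, not a SPEC measurement:

```python
def w_p(ipc_by_ways, p=0.90):
    """Smallest number of ways giving at least p * max IPC.

    ipc_by_ways[w-1] is the benchmark's IPC when it runs alone with
    w ways; the last entry (all ways) is its maximum IPC.
    """
    target = p * ipc_by_ways[-1]
    for w, ipc in enumerate(ipc_by_ways, start=1):
        if ipc >= target:
            return w
    return len(ipc_by_ways)

# A saturating 16-way curve: 90% of peak IPC is reached at 4 ways,
# so this benchmark would be classified as S (K/8 < 4 <= K/2).
curve = [0.5, 1.2, 1.9, 2.1, 2.15] + [2.2] * 11
w_p(curve)   # -> 4
```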
(a) IPC as we vary the number of assigned ways of a 1MB 16-way L2 cache. (b) Average miss penalty of an L2 miss with a 1MB 16-way L2 cache.

Fig. 7. Benchmark classification

The intuition behind this metric is to classify benchmarks by their cache utilization. Using P = 90%, we can classify benchmarks into three groups: Low utility (L), Small working set or saturated utility (S), and High utility (H). L benchmarks have 1 ≤ w90% ≤ K/8, where K is the L2 associativity; they are barely affected by L2 cache space because nearly all their L2 accesses are misses. S benchmarks have K/8 < w90% ≤ K/2 and just need some ways to reach maximum throughput, as they fit in the L2 cache. Finally, H benchmarks have w90% > K/2 and always improve IPC as the number of ways given to them is increased. Clear representatives of these three groups are applu (L), gzip (S) and ammp (H) in Figure 7(a). Table 4 gives w90% for all SPEC CPU 2000 benchmarks.

Table 4. The applications used in our evaluation. For each benchmark, we give the two metrics needed to classify workloads together with the IPC for a 1MB 16-way L2 cache.

Bench     w90%  APTC   IPC
ammp       14   23.63  1.27
applu       1   16.83  1.03
apsi       10   21.14  2.17
art        10   46.04  0.52
bzip2       1    1.18  2.62
crafty      4    7.66  1.71
eon         3    7.09  2.31
equake      1   18.6   0.27
facerec    11   10.96  1.16
fma3d       9   15.1   0.11
galgel     15   18.9   1.14
gap         1    2.68  0.96
gcc         3    6.97  1.64
gzip        4   21.5   2.20
lucas       1    7.60  0.35
mcf         1    9.12  0.06
mesa        2    3.98  3.04
mgrid      11    9.52  0.71
parser     11    9.09  0.89
perl        5    3.82  2.68
sixtrack    1    1.34  2.02
swim        1   28.0   0.40
twolf      15   12.0   0.81
vortex      7    9.65  1.35
vpr        14   11.9   0.97
wupw        1    5.99  1.32

Figure 7(b) shows the average miss penalty of an L2 miss for the whole SPEC CPU 2000 benchmark suite. We note that this average miss penalty varies a lot, even inside each group of benchmarks, ranging from 30 to 294 cycles.
This figure reinforces the main motivation of the paper, as it shows that the clustering level of L2 misses changes across applications.
Metric 2. The w_LRU(th_i) metric measures the number of ways given by LRU to each thread th_i in a workload composed of N threads. This can be estimated by simulating each benchmark alone and using the frequency of L2 accesses of each thread [5]. We denote the number of L2 Accesses in a Period of one Thousand Cycles for thread i as APTC_i; Table 4 lists these values for each benchmark.

w_LRU(th_i) = (APTC_i / Σ_{j=1..N} APTC_j) · Associativity

Next, we use these two metrics to extend previous classifications [20] to workloads with more than two benchmarks.

Case 1. w90%(th_i) ≤ w_LRU(th_i) for all threads. In this situation LRU attains 90% of each benchmark's performance, so intuitively there is very little room for improvement.

Case 2. There exist two threads A and B such that w90%(th_A) > w_LRU(th_A) and w90%(th_B) < w_LRU(th_B). In this situation, LRU is harming the performance of thread A because it gives more ways than necessary to thread B. Thus, LRU is assigning some shared resources to a thread that does not need them, while the other thread could benefit from those resources.

Case 3. Finally, the third case arises when w90%(th_i) > w_LRU(th_i) for all threads. In this situation, our L2 cache configuration is not big enough to ensure that all benchmarks reach at least 90% of their peak performance. In [20] it was observed that pairings belonging to this group show worse results as the value of |w90%(th_1) − w90%(th_2)| grows. In this case, one thread requires much less L2 cache space than the other to attain 90% of its peak IPC. LRU treats threads equally and manages to satisfy the needs of the less demanding thread. MinMisses, in contrast, assumes that all misses are equally important for throughput and tends to give more space to the thread with the higher L2 cache demand, harming the less demanding thread. This is a problem inherent to the MinMisses algorithm.
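The classification driven by these two metrics can be sketched as follows. The example call uses the gzip and ammp values from Table 4; the function names are our own:

```python
def w_lru(aptcs, assoc):
    """Ways LRU implicitly gives each thread, proportional to its
    L2 access frequency (APTC)."""
    total = sum(aptcs)
    return [a / total * assoc for a in aptcs]

def classify(w90s, aptcs, assoc):
    """Return 1, 2 or 3 following the three-case classification."""
    lru = w_lru(aptcs, assoc)
    if all(w <= l for w, l in zip(w90s, lru)):
        return 1   # LRU already satisfies every thread
    if all(w > l for w, l in zip(w90s, lru)):
        return 3   # cache too small for all threads
    return 2       # LRU starves at least one thread

# gzip (w90=4, APTC=21.5) + ammp (w90=14, APTC=23.63), 16-way L2:
# LRU gives each roughly 8 ways, so ammp is starved -> Case 2.
classify([4, 14], [21.5, 23.63], 16)   # -> 2
```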
We will show in the next subsections that MLP-aware partitioning policies are able to overcome this situation.

Table 5. Workloads belonging to each case for a 1MB 16-way and a 2MB 32-way shared L2 cache

1MB 16-way L2
#cores    2            4              6               8
Case 1    155 (48%)    624 (4%)       306 (0.1%)      19 (0%)
Case 2    135 (41%)    12785 (86%)    219790 (95%)    1538538 (98%)
Case 3    35 (11%)     1541 (10%)     10134 (5%)      23718 (2%)

2MB 32-way L2
#cores    2            4              6               8
Case 1    159 (49%)    286 (1.9%)     57 (0.02%)      1 (0%)
Case 2    146 (45%)    12914 (86%)    212384 (92%)    1496215 (96%)
Case 3    20 (6.2%)    1750 (12%)     17789 (7.7%)    66059 (4.2%)

Table 5 shows the total number of workloads that belong to each case for different configurations. We generated all possible combinations without repeating benchmarks; the order of benchmarks is not important. In the case of a 1MB 16-way L2, we note that Case 2 becomes the dominant case as the
number of cores increases. The same trend is observed for L2 caches with larger associativity. Table 5 also gives the total number of workloads in each case as the number of cores increases for a 32-way 2MB L2 cache. Note that with different L2 cache configurations, the values of w90% and APTC_i change for each benchmark. An important conclusion from Table 5 is that as we increase the number of cores, more combinations belong to the second case, which is the one with the most room for improvement. To evaluate our proposals, we randomly generate 16 workloads belonging to each group for three different configurations. We denote these configurations 2C (2 cores and a 1MB 16-way L2), 4C-1 (4 cores and a 1MB 16-way L2) and 4C-2 (4 cores and a 2MB 32-way L2). We have also used a 2MB 32-way L2 cache because future CMP architectures will continue scaling L2 size and associativity; for example, the IBM Power5 [21] has a 10-way 1.875MB L2 cache and the Niagara 2 has a 16-way 4MB L2.

4.3 Performance Metrics

As performance metric we use IPC throughput, which corresponds to the sum of individual IPCs. We also use the harmonic mean of relative IPCs to measure fairness, which we denote Hmean. We use Hmean instead of weighted speed up because it has been shown to provide a better fairness-throughput balance than weighted speed up [22]. Average improvements consider the distribution of workloads among the three groups. We call this mean the weighted mean, as we assign a weight to the speed up of each case according to the distribution of workloads from Table 5. For example, for the 2C configuration, we compute the weighted mean improvement as 0.48 · x1 + 0.41 · x2 + 0.11 · x3, where xi is the average improvement in Case i.

5 Evaluation Results

5.1 Performance Results

Throughput.
The first experiment compares throughput for different DCP algorithms, using the LRU policy as the baseline. We simulate MinMisses and our two proposals with the 48 workloads selected in the previous subsection. Figure 8(a) shows the average speed up over LRU for these mechanisms. MLPIPC-DCP systematically obtains the best average results, nearly doubling the performance benefit of MinMisses over LRU in the four-core configurations. In configuration 4C-1, MLPIPC-DCP outperforms MinMisses by 4.1%. MLP-DCP always improves on MinMisses but obtains worse results than MLPIPC-DCP. All algorithms have similar results in Case 1; this is intuitive, as in this situation there is little room for improvement. In Case 2, MinMisses obtains a relevant improvement over LRU in configuration 2C, and MLP-DCP and MLPIPC-DCP achieve an extra 2.5% and 5% improvement, respectively. In the other
(a) Throughput speed up over LRU. (b) Fairness speed up over LRU.

Fig. 8. Average performance speed ups over LRU

configurations, MLP-DCP and MLPIPC-DCP still outperform MinMisses by 2.1% and 3.6%. In Case 3, MinMisses presents larger performance degradation as the asymmetry between the necessities of the two cores increases; as a consequence, it has worse average throughput than LRU. Assigning an appropriate weight to each L2 access makes it possible to obtain better results than LRU using MLP-DCP and MLPIPC-DCP.

Fairness. We use the harmonic mean of relative IPCs [22] to measure fairness. The relative IPC is computed as IPC_shared / IPC_alone. Figure 8(b) shows the average speed up over LRU of the harmonic mean of relative IPCs. Fair stands for the policy explained in Section 2. We can see that in all situations MLP-DCP improves over both MinMisses and LRU (except in Case 3 for two cores). It even obtains better results than Fair in configurations 2C and 4C-1. MLPIPC-DCP is a variant of the MLP-DCP algorithm optimized for throughput; as a consequence, it obtains worse fairness results than MLP-DCP.

Fig. 9. Average throughput speed up over LRU with a 1MB 16-way L2 cache

Equivalent cache space. DCP algorithms can reach the performance of a larger L2 cache with the LRU eviction policy. Figure 9 shows the performance evolution when the L2 size is increased from 1MB to 2MB with LRU as the eviction policy. In this
experiment, the workloads correspond to the ones selected for the configuration 4C-1. Figure 9 also shows the average speed up over LRU of MinMisses, MLP-DCP and MLPIPC-DCP with a 1MB 16-way L2 cache. MinMisses has the same average performance as a 1.25MB 20-way L2 cache with LRU, which means that MinMisses provides the performance obtained with a 25% larger shared cache. MLP-DCP reaches the performance of a 37.5% larger cache. Finally, MLPIPC-DCP doubles the size gain of MinMisses, reaching the performance of a 50% larger L2 cache.

5.2 Design Parameters

Figure 10(a) shows the sensitivity of our proposal to the period of partition decisions. For shorter periods, the partitioning algorithm reacts more quickly to phase changes. Only small performance variations are obtained for different periods; however, for longer periods throughput tends to decrease. As can be seen in Figure 10(a), peak performance is obtained with a period of 5 million cycles.

(a) Average throughput for different periods for the MLP-DCP algorithm with the 2C configuration. (b) Average speed up over LRU for different ROB sizes with the 4C-1 configuration.

Fig. 10. Sensitivity analysis to different design parameters

Finally, we varied the size of the ROB from 128 to 512 entries to show the sensitivity of our proposals to this architectural parameter. Our mechanism is the only one that is aware of the ROB size: the larger the ROB, the larger the possible clusters of L2 misses. Other policies only work with the number of L2 misses, which does not change if we vary the ROB size. When the ROB size increases, clusters of misses can contain more misses and, as a consequence, our mechanism can better differentiate between isolated and clustered misses.
As Figure 10(b) shows, average improvements in the 4C-1 configuration are slightly higher for a ROB with 512 entries, while MinMisses shows worse results. MLPIPC-DCP outperforms LRU and MinMisses by 10.4% and 4.3%, respectively.

5.3 Hardware Cost

We have used the hardware implementation of Figure 5 to estimate the hardware cost of our proposal. In this subsection, we focus our attention on the
configuration 2C. We assume a 40-bit physical address space. Each entry in the ATD needs 29 bits (1 valid bit + 24-bit tag + 4-bit LRU counter). Each set has 16 ways, so we have an overhead of 58 Bytes (B) per set. As we have 1024 sets, the total cost is 58KB per core. The hardware cost corresponding to the extra fields of each entry in the L2 MSHR is 5 bits for the stack distance and 2B for the MLP cost; with 32 entries, this gives a total of 84B. Four adders are needed to update the MLP cost of the active MSHR entries. HSHR entries need 1 valid bit, 8 bits to identify the ROB entry, 34 bits for the address, 5 bits for the stack distance and 2B for the MLP cost, 64 bits per entry in total. With 24 entries in each HSHR, this gives a total of 192B per core. Four adders per core are needed to update the MLP cost of the active HSHR entries. Finally, we need 17 counters of 4B for each MLP-aware SDH, a total of 68B per core. In addition to the storage bits, we also need an adder for incrementing the MLP-aware SDHs and a shifter to halve the hit counters after each partitioning interval.

Fig. 11. Throughput and hardware cost depending on ds in a two-core CMP

Sampled ATD. The main contribution to hardware cost corresponds to the ATD. Instead of monitoring every cache set, we can track accesses from a reduced number of sets. This idea was also used in [8] with MinMisses in a CMP environment; here, we use it in a different situation, namely to estimate MLP-aware SDHs from a sampled number of sets. We define a sampling distance ds that gives the distance between tracked sets. For example, if ds = 1, we track all sets; if ds = 2, we track half of the sets, and so on. Sampling reduces the size of the ATD at the expense of less accurate MLP-aware SDH predictions, as some accesses are not tracked. Figure 11 shows throughput degradation in a two-core scenario as ds increases.
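The storage-cost arithmetic above can be checked mechanically. The counts below only restate the figures given in the text:

```python
def bits_to_bytes(bits):
    return bits / 8

# ATD: 29 bits/entry (valid + 24-bit tag + 4-bit LRU counter),
# 16 ways per set, 1024 sets per core.
atd_set_bytes = bits_to_bytes(29 * 16)       # 58 B per set
atd_core_kb = atd_set_bytes * 1024 / 1024    # 58 KB per core

# MSHR extension: 5-bit stack distance + 2B (16-bit) MLP cost,
# 32 entries.
mshr_bytes = bits_to_bytes((5 + 16) * 32)    # 84 B

# HSHR: 1 + 8 + 34 + 5 + 16 = 64 bits per entry, 24 entries.
hshr_bytes = bits_to_bytes(64 * 24)          # 192 B per core

# MLP-aware SDH: 17 counters of 4 B each.
sdh_bytes = 17 * 4                           # 68 B per core
```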
This curve is measured on the left y-axis. We also show the storage overhead as a percentage of the total L2 cache size, measured on the right y-axis. Thanks to the sampling technique, the storage overhead decreases drastically. With a sampling distance of 16, we obtain an average throughput degradation of 0.76% and a storage overhead of 0.77% of the L2 cache size, which is less than 8KB of storage. We consider this an attractive design point.
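A simple sketch of the overhead trade-off, using the 2C parameters above (58B per tracked set, 1024 sets, two cores, 1MB L2). Counting only the ATD sets, this gives roughly 0.7% at ds = 16, in the same range as the 0.77% reported in the text, which presumably also includes the auxiliary counters:

```python
def sampled_atd_overhead(ds, sets=1024, set_bytes=58, cores=2,
                         l2_bytes=1 << 20):
    """Total sampled-ATD storage (bytes) and its fraction of the
    L2 size, for sampling distance ds (track every ds-th set)."""
    tracked = sets // ds
    total = tracked * set_bytes * cores
    return total, total / l2_bytes

total, frac = sampled_atd_overhead(16)   # 7424 B, well under 8KB
```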
5.4 Scalable Algorithm to Decide Cache Partitions

Evaluating all possible combinations allows determining the optimal partition for the next period. However, this algorithm does not scale adequately as the associativity and the number of applications sharing the cache grow. If a K-way associative L2 cache is shared by N cores, the number of possible partitions, without considering order, is C(N+K−1, K). For example, for 8 cores and 16 ways, we have 245157 possible combinations. Consequently, the time to decide new cache partitions does not scale. Several heuristics have been proposed to reduce the number of cycles required to decide the new partition [8,9], which can be used in our situation. These proposals bound the length of the decision period by 10000 cycles. This overhead is very low compared to the 5-million-cycle period (less than 0.2%).

Fig. 12. Average throughput speed up over LRU for different decision algorithms in the 4C-1 configuration

Figure 12 shows the average speed up of MLP-DCP over LRU in the 4C-1 configuration with three different decision algorithms. Evaluating all possible partitions (denoted EvalAll) gives the highest speed up. The first greedy algorithm (denoted Marginal Gains) assigns one way to a thread in each iteration [9]. The selected way is the one that gives the largest reduction in total MLP cost. This process is repeated until all ways have been assigned. The number of operations (comparisons) is of order K · N, where K is the associativity of the L2 cache and N the number of cores. With this heuristic, an average throughput degradation of 0.59% is obtained. The second greedy algorithm (denoted Look Ahead) is similar to Marginal Gains. The basic difference between them is that Look Ahead considers the total MLP cost for all possible numbers of blocks that the application can receive [8] and can assign more than one way in each iteration.
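The partition count is a standard stars-and-bars binomial coefficient, which can be verified directly. If every core must receive at least one way (as in the two-core case study, which has 15 partitions of 16 ways), the count becomes C(K−1, N−1):

```python
from math import comb

def num_partitions(cores, ways):
    """Unordered assignments of `ways` ways to `cores` cores,
    allowing zero-way allocations: C(N + K - 1, K)."""
    return comb(cores + ways - 1, ways)

def num_partitions_min1(cores, ways):
    """Variant where every core gets at least one way: C(K-1, N-1)."""
    return comb(ways - 1, cores - 1)

num_partitions(8, 16)        # -> 245157, as in the text
num_partitions_min1(2, 16)   # -> 15, the case-study partitions
```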
The number of operations (add-divide-compare) is of order N · K²/2, where K is the associativity of the L2 cache and N the number of cores. With this heuristic, an average throughput degradation of 1.04% is obtained.
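A sketch of the Marginal Gains greedy adapted to MLP costs. The original heuristic [9] works on marginal utility; here each iteration hands one way to the thread whose total MLP cost drops the most, each core is seeded with one way, and the thread data (per-distance MLP costs, overflow cost) is made up for illustration:

```python
def marginal_gains(threads, assoc):
    """Greedy partitioning: repeatedly give one way to the thread
    whose total MLP cost drops the most (its marginal gain).

    threads[i] = (list of MLP costs indexed by stack distance 1..K,
                  overflow cost for distances beyond K).
    """
    def tmlp(sdh, ov, w):
        return ov + sum(sdh[w:])

    ways = [1] * len(threads)            # every core starts with one way
    for _ in range(assoc - len(threads)):
        gains = [tmlp(s, o, w) - tmlp(s, o, w + 1)
                 for (s, o), w in zip(threads, ways)]
        ways[gains.index(max(gains))] += 1
    return tuple(ways)

# Thread 0 has expensive (isolated) misses, thread 1 cheap ones,
# so the greedy funnels the remaining ways to thread 0.
t0 = ([10.0, 8.0, 6.0, 4.0], 2.0)
t1 = ([3.0, 1.0, 0.5, 0.2], 0.1)
marginal_gains([t0, t1], 4)   # -> (3, 1)
```

Each iteration performs N comparisons and there are at most K iterations, matching the O(K · N) operation count quoted above.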
6 Conclusions

In this paper we propose a new DCP algorithm that assigns a cost to each L2 access according to its impact on final performance: isolated misses receive higher costs than clustered misses. Our algorithm then decides the L2 cache partition that minimizes the total cost for all running threads. Furthermore, we have classified workloads for multiple cores into three groups and shown that the dominant situation is precisely the one that offers room for improvement. We show that our proposal reaches high throughput for two- and four-core architectures. In all evaluated configurations, our proposal consistently outperforms both LRU and MinMisses, reaching speed ups of up to 63.9% (10.6% on average) and 15.4% (4.1% on average), respectively. With our proposals, we reach the performance of a 50% larger cache. Finally, we used a sampling technique to propose a practical implementation with a storage cost of less than 1% of the total L2 cache size, together with a scalable algorithm to determine cache partitions with nearly no performance degradation.

Acknowledgments

This work is supported by the Ministry of Education and Science of Spain under contracts TIN2004-07739, TIN2007-60625 and grant AP-2005-3318, and by the SARC European Project. The authors would like to thank C. Acosta, A. Falcon, D. Ortega, J. Vermoulen and O. J. Santana for their work on the simulation tool. We also thank F. Cabarcas, I. Gelado, A. Rico and C. Villavieja for comments on earlier drafts of this paper, and the reviewers for their helpful comments.

References

1. Serrano, M.J., Wood, R., Nemirovsky, M.: A study on multistreamed superscalar processors. Technical Report 93-05, University of California Santa Barbara (1993)
2. Tullsen, D.M., Eggers, S.J., Levy, H.M.: Simultaneous multithreading: maximizing on-chip parallelism. In: ISCA (1995)
3. Hammond, L., Nayfeh, B.A., Olukotun, K.: A single-chip multiprocessor. Computer 30(9), 79–85 (1997)
4.
Cazorla, F.J., Ramirez, A., Valero, M., Fernandez, E.: Dynamically controlled resource allocation in SMT processors. In: MICRO (2004)
5. Chandra, D., Guo, F., Kim, S., Solihin, Y.: Predicting inter-thread cache contention on a chip multi-processor architecture. In: HPCA (2005)
6. Petoumenos, P., Keramidas, G., Zeffer, H., Kaxiras, S., Hagersten, E.: Modeling cache sharing on chip multiprocessor architectures. In: IISWC, pp. 160–171 (2006)
7. Chiou, D., Jain, P., Devadas, S., Rudolph, L.: Dynamic cache partitioning via columnization. In: Design Automation Conference (2000)
8. Qureshi, M.K., Patt, Y.N.: Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In: MICRO (2006)
9. Suh, G.E., Devadas, S., Rudolph, L.: A new memory monitoring scheme for memory-aware scheduling and partitioning. In: HPCA (2002)
10. Kim, S., Chandra, D., Solihin, Y.: Fair cache sharing and partitioning in a chip multiprocessor architecture. In: PACT (2004)
11. Karkhanis, T.S., Smith, J.E.: A first-order superscalar processor model. In: ISCA (2004)
12. Mattson, R.L., Gecsei, J., Slutz, D.R., Traiger, I.L.: Evaluation techniques for storage hierarchies. IBM Systems Journal 9(2), 78–117 (1970)
13. Settle, A., Connors, D., Gibert, E., Gonzalez, A.: A dynamically reconfigurable cache for multithreaded processors. Journal of Embedded Computing 1(3-4) (2005)
14. Rafique, N., Lim, W.T., Thottethodi, M.: Architectural support for operating system-driven CMP cache management. In: PACT (2006)
15. Hsu, L.R., Reinhardt, S.K., Iyer, R., Makineni, S.: Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource. In: PACT (2006)
16. Qureshi, M.K., Lynch, D.N., Mutlu, O., Patt, Y.N.: A case for MLP-aware cache replacement. In: ISCA (2006)
17. Kroft, D.: Lockup-free instruction fetch/prefetch cache organization. In: ISCA (1981)
18. Sherwood, T., Perelman, E., Hamerly, G., Sair, S., Calder, B.: Discovering and exploiting program phases. IEEE Micro (2003)
19. Vera, J., Cazorla, F.J., Pajuelo, A., Santana, O.J., Fernandez, E., Valero, M.: FAME: Fairly measuring multithreaded architectures. In: PACT (2007)
20. Moreto, M., Cazorla, F.J., Ramirez, A., Valero, M.: Explaining dynamic cache partitioning speed ups. IEEE CAL (2007)
21. Sinharoy, B., Kalla, R.N., Tendler, J.M., Eickemeyer, R.J., Joyner, J.B.: Power5 system microarchitecture. IBM J. Res. Dev. 49(4/5), 505–521 (2005)
22. Luo, K., Gummaraju, J., Franklin, M.: Balancing throughput and fairness in SMT processors. In: ISPASS (2001)
P. Stenström (Ed.): Transactions on HiPEAC III, LNCS 6590, pp. 24–42, 2011. © Springer-Verlag Berlin Heidelberg 2011

Cache Sensitive Code Arrangement for Virtual Machine*

Chun-Chieh Lin and Chuen-Liang Chen

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 10764, Taiwan {d93020,clchen}@csie.ntu.edu.tw

Abstract. This paper proposes a systematic approach to optimize the code layout of a Java ME virtual machine for an embedded system with a cache-sensitive architecture. A practical example is running the JVM directly (execution-in-place) in NAND flash memory, for which the cache miss penalty is too high to endure. The refined virtual machine generated 96% fewer cache misses than the original version. We developed a mathematical approach that helps to predict the flow of the interpreter inside the virtual machine. This approach analyzes both the static control flow graph and the pattern of bytecode instruction streams, since we found that the input sequence drives the program flow of the virtual machine interpreter. We then proposed a rule to model the execution flows of Java instructions of real applications. Furthermore, we used a graph partition algorithm as a tool to deal with the mathematical model, and this finding helped the relocation process to move program blocks to proper memory pages. The refinement approach dramatically improved the locality of the virtual machine and thus reduced cache miss rates. Our technique can help Java ME-enabled devices run faster and extend battery life. The approach also gives designers the potential to integrate the XIP function into a System-on-Chip, thanks to the lower demand for cache memory.

Keywords: cache sensitive, cache miss, NAND flash memory, code arrangement, Java virtual machine, interpreter, embedded system.

1 Introduction

Java platforms exist extensively in all kinds of embedded and mobile devices.
The Java™ Platform, Micro Edition (Java ME) [1] is without doubt the de facto standard platform of smart phones. The Java virtual machine (the KVM in Java ME) is a key component that affects performance and power consumption. NAND flash memory comes with a serial bus interface. It does not allow random access, and the CPU must read out a whole page at a time, which is a slow operation compared to RAM. This property makes it hard for a processor to execute programs stored

* We acknowledge the support for this study through grants from the National Science Council of Taiwan (NSC 95-2221-E-002-137).
in NAND flash memory using the "execute-in-place" (XIP) technique. Meanwhile, NAND flash memory offers fast write access time and, most important of all, the technology has the advantage of offering higher capacity than NOR flash technology does. As the applications of embedded devices become large and complicated, more mainstream devices adopt NAND flash memory to replace NOR flash memory. In this paper, we try to offer an answer to the question: can we speed up an embedded device that uses NAND flash memory to store programs? "Page-based" storage media, like NAND flash memory, have a higher access penalty than RAM does, so reducing page misses becomes a critical issue. Thus, we set out to find a way to reduce the page miss rate generated by the KVM. Due to the unique structure of the KVM interpreter, we found a special way to exploit the dynamic locality of the KVM, which is to trace the patterns of executed bytecode instructions instead of the internal flow of the KVM. It turned out to be a combinatorial optimization problem, because the code layout must fulfill certain code size constraints. Our approach achieved the effect of static page preloading by properly arranging program blocks. In the experiment, we implemented a post-processing program to modify the intermediate files generated by the C compiler. The post-processing program refined the machine code placement of the KVM based on the mathematical model. Finally, the tuned KVMs dramatically reduced page accesses to NAND flash memories. The outcome of this study helps embedded systems to boost performance and extend battery life as well.

2 Related Works

Park et al., in [2], proposed a hardware module to allow direct code execution from NAND flash memory. In this approach, program code stored in NAND flash pages is loaded into a RAM cache on demand instead of moving the entire contents into RAM.
Their work is a universal hardware-based solution that does not consider application-specific characteristics. Samsung Electronics offers a commercial product called "OneNAND" [3] based on the same idea. It is a single chip with a standard NOR flash interface that actually contains a NAND flash memory array for storage. The vendor's intent was to provide a cost-effective alternative to the NOR flash memory used in existing designs. The internal structure of OneNAND comprises a NAND flash memory, control logic, hardware ECC, and 5KB of buffer RAM. The 5KB buffer RAM consists of three buffers: 1KB for boot RAM, and a pair of 2KB bi-directional data buffers. Our approach is suitable for systems using this type of flash memory. Park et al., in [4], proposed yet another pure software approach to achieve execute-in-place by using a customized compiler that inserts NAND flash reading operations into program code at proper places. Their compiler determines insertion points by summing up the sizes of basic blocks along the calling tree. Special hardware is no longer required, but in contrast to earlier work [2], there is still a need for a tailor-made compiler. Typical studies of refining code placement to minimize cache misses can apply to a NAND flash cache system. Parameswaran et al., in [5], used the bin-packing approach. It reorders program code by examining the execution frequency of basic blocks. Code segments with higher execution frequency are placed next to each other within the cache. Janapsatya et al., in [6], proposed a pure software heuristic approach to reduce the number of cache misses by relocating program sections in main memory.
Their approach was to analyze the program flow graph, and to identify and pack basic blocks within the same loop. They also related cache misses to energy consumption. Although their approach can identify loops within a program, breaking the interpreter of a virtual machine into individual circuits is hard because all the loops share the same starting point. There has been research on improving program locality and optimizing code placement for either cache or virtual memory environments. Pettis [7] proposed a systematic approach that uses the dynamic call graph to position procedures, trying to place two procedures as close as possible if one of them calls the other frequently. The first step of Pettis' approach uses profiling information to create a weighted call graph. The second step iteratively merges the vertices connected by the heaviest-weight edges. The process repeats until the whole graph is composed of one or more individual vertices without edges. However, how to collect profiling information, and its accuracy, is yet another issue. For example, Young and Smith in [8] developed techniques to extract effective branch profile information from a limited depth of branch history. Ball and Larus in [9] described an algorithm for inserting monitoring code to trace programs. Our approach is very different by nature: previous studies all focused on the flow of program code, while we model the profile by the input data. This research project created a post-processor to optimize the code arrangement. It is analogous to the "Diablo linker" [10], which utilized symbolic information in object files to generate optimized executable files. However, our approach generates feedback intermediate files for the compiler, and invokes the compiler to generate optimized machine code.

3 Background

3.1 XIP with NAND Flash

NOR flash memory is popular as code memory because of the XIP feature.
There are several approaches designed for using NAND flash memory as an alternative to NOR flash memory. Because the NAND flash memory interface cannot connect to the CPU host bus, there has to be a memory interface controller that moves data from NAND flash memory to RAM.

Fig. 1. Access NAND flash through shadow RAM
At the system level, Figure 1 shows a straightforward design that uses RAM as a shadow copy of the NAND flash. The system treats NAND flash memory as a secondary storage device [11]. A boot loader or RTOS resides in ROM or NOR flash memory; it copies program code from NAND flash to RAM, and the processor then executes the code in RAM [12]. This approach offers the best execution speed because the processor operates on RAM. The downside of this approach is that it needs a huge amount of RAM to mirror the NAND flash. In embedded devices, RAM is a precious resource. For example, the Sony Ericsson T610 mobile phone [13] reserved 256KB of RAM for the Java heap. In contrast to using 256MB for mirroring NAND flash memory, all designers should agree that they would prefer to retain RAM for Java applets rather than for mirroring. The second pitfall is that this implementation takes longer to boot, because the system must copy the contents to RAM prior to execution. Figure 2 shows a demand paging approach that uses a limited amount of RAM as the cache of the NAND flash. The "romized" program code stays in NAND flash memory, and an MMU loads from NAND into the cache only the portions of program code that are about to be executed. The major advantage of this approach is that it consumes less RAM: several kilobytes of RAM are enough to mirror the NAND flash memory. Using less RAM means that integrating the CPU, MMU and cache into a single chip (the shadowed part in Figure 2) becomes easier. The startup latency is shorter, since the CPU is ready to run soon after the first NAND flash page is loaded into the cache. The component cost is lower than in the previous approach. The realization of the MMU might be either a hardware or a software approach, which is not covered in this paper.

Fig. 2. Using cache unit to access NAND flash

However, performance is the major drawback of this approach.
The penalty of each cache miss is high, because loading contents from a NAND flash page is nearly 200 times slower than the same operation on RAM. Reducing cache misses therefore becomes a critical issue for such configurations.

3.2 KVM Internals

Source Level. In terms of functionality, the KVM can be broken down into several parts: startup, class file loading, constant pool resolving, the interpreter, garbage collection,
C.-C. Lin and C.-L. Chen

and KVM cleanup. Lafond et al. [14] measured the energy consumption of each part of the KVM. Their study showed that the interpreter consumed more than 50% of the total energy. In our experiments running the Embedded Caffeine Benchmark [15], the interpreter contributed 96% of all memory accesses. This evidence leads to the conclusion that the interpreter is the performance bottleneck of the KVM, and it motivated us to focus on reducing the cache misses generated by the interpreter.

Figure 3 shows the program structure of the interpreter. It is a loop enclosing a large switch-case dispatcher. The loop fetches bytecode instructions from the Java application, and each "case" sub-clause handles one bytecode instruction. The control flow graph of the interpreter, illustrated in Figure 4, is a flat and shallow spanning tree. There are three major steps in the interpreter.

ReschedulePoint:
    RESCHEDULE
    opcode = FETCH_BYTECODE(ProgramCounter);
    switch (opcode) {
    case ALOAD: /* do something */ goto ReschedulePoint;
    case IADD:  /* do something */ ...
    case IFEQ:  /* do something */ goto BranchPoint;
    ...
    }
BranchPoint:
    take care of the program counter;
    goto ReschedulePoint;

Fig. 3. Pseudo code of the KVM interpreter

Fig. 4. Control flow graph of the interpreter
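The dispatch loop of Figure 3 can be rendered as a runnable toy (a sketch only: the opcode set, encoding, and handler bodies here are ours, not the KVM's):

```python
def interpret(code):
    """Toy rendering of the Fig. 3 structure: fetch a bytecode,
    dispatch to its handler, and let branch handlers adjust the
    program counter (the BranchPoint of the pseudo code)."""
    pc, stack = 0, []
    while pc < len(code):                 # ReschedulePoint
        opcode = code[pc]                 # FETCH_BYTECODE(ProgramCounter)
        if opcode[0] == "PUSH":           # one "case" per bytecode handler
            stack.append(opcode[1]); pc += 1
        elif opcode[0] == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b); pc += 1
        elif opcode[0] == "IFEQ":         # BranchPoint: adjust the pc
            pc = opcode[1] if stack.pop() == 0 else pc + 1
        elif opcode[0] == "HALT":
            break
    return stack

interpret([("PUSH", 2), ("PUSH", 3), ("ADD",), ("HALT",)])  # -> [5]
```

Every iteration passes through the fetch and dispatch code before reaching a handler, which is exactly why the pages holding those parts stay hot in the cache.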
(1) Rescheduling and fetching. The KVM prepares the execution context and the stack frame, then fetches a bytecode instruction from the Java program.

(2) Dispatching and execution. After reading a bytecode instruction from the Java program, the interpreter jumps to the corresponding bytecode handler through the big "switch…case…" statement. Each bytecode handler carries out the function of its bytecode instruction.

(3) Branching. A branch bytecode instruction may take the Java program flow away from its original track. In this step, the interpreter resolves the target address and modifies the program counter.

Fig. 5. The organization of the interpreter at assembly level

Assembly Level. Our analysis of the source files revealed the peculiar program structure of the VM interpreter, and analyzing the code layout in its compiled executables helped this study create a code placement strategy. The assembly code analysis in this study is restricted to ARM and gcc for the sake of demonstration,
but applying our approach to other platforms and tools is straightforward. Figure 5 illustrates the layout of the interpreter in assembly form (FastInterpret() in interp.c). The first trunk, BytecodeFetching, is the code block for rescheduling and fetching; it corresponds exactly to the first part of the original source code. The second trunk, LookupTable, is a large lookup table used to dispatch bytecode instructions; each entry links to a bytecode handler. It is the translated result of the "switch…case…case" statement. The third trunk, BytecodeDispatch, is the aggregation of more than a hundred bytecode handlers. Most bytecode handlers are self-contained: such a handler occupies a contiguous memory space in this trunk and does not jump to code stored in other trunks. There are only a few exceptions that call functions stored in other trunks, such as "invokevirtual." In addition, several constant symbol tables are spread over this trunk; these tables are referenced by the code within the BytecodeDispatch trunk. The last trunk, ExceptionHandling, contains code fragments for exception handling.

Each trunk occupies a number of NAND flash pages. In fact, the total size of BytecodeFetching and LookupTable is about 1200 bytes (compiled with arm-elf-gcc 3.4.3), which is almost small enough to fit into two or three 512-byte pages. Figure 6 shows the size distribution of bytecode handlers. The average size of a bytecode handler is 131 bytes, and 79 handlers are smaller than 56 bytes. In other words, a 512-byte page can gather 4 to 8 bytecode handlers. The inter-handler execution flow dominates the number of cache misses generated by the interpreter; this is why our approach rearranges the bytecode handlers within the BytecodeDispatch trunk.

Fig. 6. Distribution of bytecode handler size (compiled with gcc 3.4.3)

4 Analyzing Control Flow

4.1 Indirect Control Flow Graph

Static branch prediction and typical code placement approaches derive the layout of a program from its control flow graph (CFG). However, the CFG of a VM interpreter is a special case: it is a flat spanning tree enclosed by a loop. The CFG does not provide sufficient information to distinguish the temporal relations of each pair of bytecode handlers. For anyone who wants to improve program locality by observing the dynamic execution order of program blocks, the CFG is apparently not a good tool to this
end. Therefore, we propose a concept called the "Indirect Control Flow Graph" (ICFG); it uses the real bytecode instruction sequences to construct the dual CFG of the interpreter.

Consider a simplified virtual machine with five bytecode instructions: A, B, C, D, and E, and use it to run a very simple user applet whose instruction sequence is the following short alphabetic sequence:

A-B-A-B-C-D-E-C

Each letter in the sequence represents one bytecode instruction. In Figure 7, the graph connected with the solid lines is the CFG of the simplified interpreter. Following the flow in the CFG, the program flow becomes: [Dispatch] – [Handler A] – [Dispatch] – [Handler B]…

Fig. 7. The CFG of the simplified interpreter

It is hard to tell the relation between handler A and handler B because the loop header hides it. In other words, this CFG cannot easily show which handler will be invoked after handler A is executed. The idea of the ICFG is to observe the patterns of the bytecode sequences executed by the virtual machine, not to analyze the structure of the virtual machine itself. Figure 8 expresses the ICFG in a readable way; it happens to be the sub-graph connected by the dashed directed lines in Figure 7.

Fig. 8. An ICFG example. The number inside each circle represents the size of the handler
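Constructing the ICFG from a trace amounts to counting transitions between consecutive executed bytecodes; merging the two directions of each pair, as §4.2 later does, gives the undirected edge weights. A minimal sketch (the function name is ours; the trace is the applet sequence above):

```python
from collections import Counter

def icfg_edge_weights(trace):
    """Undirected ICFG edge weights: count each adjacent pair in the
    executed bytecode sequence, merging the two edge directions."""
    weights = Counter()
    for prev, nxt in zip(trace, trace[1:]):
        if prev != nxt:  # a self-transition never crosses handlers
            weights[frozenset((prev, nxt))] += 1
    return weights

w = icfg_edge_weights("ABABCDEC")
# (A, B) is the heaviest edge, so A and B are the first candidates
# to share a flash page.
```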
4.2 Tracing the Locality of the Interpreter

As stated, the Java applications that a KVM runs dominate the temporal locality of the interpreter; precisely speaking, the incoming Java instruction sequence dominates the temporal locality of the KVM. Therefore, the first step in exploiting the temporal locality is to consider the bytecode sequences executed by the virtual machine. For the previous example sequence, the order of accessed NAND flash pages is:

[BytecodeFetching]–[LookupTable]–[A]–[BytecodeFetching]–[LookupTable]–[B]–[BytecodeFetching]–[LookupTable]–[A]…

Obviously, memory pages containing BytecodeFetching and LookupTable appear in the sequence far more often than those containing BytecodeDispatch. As a result, pages containing BytecodeFetching and LookupTable are favored to last in the cache, while pages holding bytecode handlers have to compete with each other to stay in it. We therefore conclude that the order of executed bytecode instructions is the key factor affecting cache misses.

Consider an extreme case: in a system with three cache blocks, two cache blocks always hold the memory pages containing BytecodeFetching and LookupTable, for the reason just stated. That leaves only one cache block available for swapping pages that contain bytecode handlers. If all the bytecode handlers were located in distinct memory pages, processing a bytecode instruction would always cause a cache miss, because the next-to-execute bytecode handler would always be located in an uncached memory page. In other words, the sample sequence causes at least eight cache misses. Nevertheless, if the handlers of A and B are grouped into the same page, the cache misses decline to five, and the page access trace becomes:

fault-A-B-A-B-fault-C-fault-D-fault-E-fault-C

If we extend the group (A, B) to include the handler of C, the cache miss count even drops to four, and the page access trace looks like this:

fault-A-B-A-B-C-fault-D-fault-E-fault-C

The core issue of this study is therefore to find an efficient code layout method that partitions all bytecode instructions into disjoint sets based on their execution relevance, with each NAND flash page holding one set of bytecode handlers. We propose that partitioning the ICFG achieves this goal. Back in Figure 8, the directed edges represent the temporal order of the instruction sequence, and the weight of an edge is the transition count from one bytecode instruction to the next. If we remove the edge (B, C), the ICFG is divided into two disjoint sets: the bytecode handlers of A and B are placed in one page, and the bytecode handlers of C, D, and E in the other. The page access trace becomes:

fault-A-B-A-B-fault-C-D-E-C

This placement causes only two cache misses, 75% fewer than the worst case! The next step is to transform the ICFG into an undirected graph by merging reversed edges that connect the same vertices; the weight of the undirected edge is the sum of the weights of the two directed edges. The result is actually a variation of the classical MIN k-CUT problem. Formally speaking, we can model a given graph G(V, E) as:
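The miss counts quoted above can be reproduced with a one-block page cache standing in for the single cache block left over after BytecodeFetching and LookupTable are pinned (a sketch; the page numbering and function name are ours):

```python
def page_misses(trace, page_of):
    """Faults seen by the one free cache block while the trace of
    bytecode handlers executes; page_of maps handler -> flash page."""
    cached, misses = None, 0
    for handler in trace:
        if page_of[handler] != cached:
            misses += 1
            cached = page_of[handler]
    return misses

trace = "ABABCDEC"
worst     = {h: h for h in "ABCDE"}                   # one handler per page
group_ab  = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 4}  # A, B share a page
group_abc = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3}  # A, B, C share a page
two_sets  = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 2}  # cut the edge (B, C)
# -> 8, 5, 4 and 2 misses respectively, matching the traces in the text.
```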
- Vi – the i-th bytecode instruction.
- Ei,j – the edge connecting the i-th and j-th bytecode instructions.
- Fi,j – the number of times the two bytecode instructions i and j executed after each other; it is the weight of edge Ei,j.
- K – the number of expected partitions.
- Wx,y – the inter-set weight: for all x ≠ y, Wx,y = ΣFi,j where Vi ∈ Px and Vj ∈ Py.

The goal is to model the problem as the following definition:

Definition 1. The MIN k-CUT problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such that ΣWi,j is minimized.

4.3 The Mathematical Model

Yet there is an additional constraint in our model. It is impractical to gather bytecode instructions into a partition regardless of the total program size of its bytecode handlers: the size of each bytecode handler is distinct, and the code size of a partition cannot exceed the size of a memory page (e.g., a NAND flash page). Our aim is to distribute the bytecode handlers into several disjoint partitions {P1, P2, …, Pk}. We define the following notation:

- Si – the code size of bytecode handler Vi.
- N – the size of a memory page.
- M(Pk) – the size of partition Pk: ΣSm for all Vm ∈ Pk.
- H(Pk) – the value of partition Pk: ΣFi,j for all Vi, Vj ∈ Pk.

Our goal is to construct partitions that satisfy the following constraints.

Definition 2. The problem is to divide G into K disjoint partitions {P1, P2, …, Pk} such that M(Pk) ≤ N for each Pk, the inter-partition weight ΣWi,j is minimized, and ΣH(Pi) is maximized over all Pi ∈ {P1, P2, …, Pk}.

This rectified model is exactly an instance of the graph partition problem: the size of each partition must satisfy the constraint (the size of a memory page), and the sum of inter-partition path weights must be minimal. The graph partition problem is NP-complete [16]. However, the purpose of this paper is neither to create a new graph partition algorithm nor to discuss the differences between existing algorithms.
Our experimental implementation simply adopted the following algorithm to demonstrate that the approach works; other implementations based on this approach may choose another graph partition algorithm that satisfies their specific requirements.

Partition (G)
1. Find the edge with maximal weight Fi,j in graph G such that Si + Sj ≤ N. If there is no such edge, go to step 4.
2. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
3. Remove both Vi and Vj from G, and go to step 1.
4. Find a pair of vertices Vi and Vj in G such that Si + Sj ≤ N. If no pair satisfies the criterion, go to step 7.
5. Call Merge(Vi, Vj) to combine vertices Vi and Vj.
6. Remove both Vi and Vj from G, and go to step 4.
7. End.
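A sketch of this greedy procedure in Python (all names are ours; merging two vertices and summing their parallel edge weights follows the Merge subroutine described next). The paper's step 4, which pairs up leftover vertices that share no edge, is covered here because a missing edge simply has weight 0:

```python
def partition(sizes, weights, page_size):
    """Greedy handler grouping: repeatedly merge the pair of partitions
    with the heaviest connecting weight whose combined code size still
    fits in one flash page; stop when no fitting pair remains.

    sizes:   handler -> code size in bytes
    weights: frozenset({a, b}) -> transition count (undirected ICFG)
    Returns a list of handler sets, one per flash page.
    """
    parts = [{v} for v in sizes]
    psize = [sizes[v] for v in sizes]

    def joint_weight(i, j):
        # Parallel edges into a merged vertex add up (Merge, steps 3-4).
        return sum(weights.get(frozenset((a, b)), 0)
                   for a in parts[i] for b in parts[j])

    while True:
        best = None
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                if psize[i] + psize[j] <= page_size:
                    w = joint_weight(i, j)
                    if best is None or w > best[0]:
                        best = (w, i, j)
        if best is None:          # steps 1/4: no pair fits any more
            break
        _, i, j = best
        parts[i] |= parts[j]
        psize[i] += psize[j]
        del parts[j], psize[j]
    return parts
```

With the example ICFG of Figure 8 (taking unit-size handlers and two handlers per page), the heaviest edge (A, B) is merged first, reproducing the grouping used in §4.2.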
The procedure for merging vertices Vi and Vj is:

Merge (Vi, Vj)
1. Add a new vertex Vk to G.
2. Pick an edge E connecting Vt with either Vi or Vj. If there is no such edge, go to step 6.
3. If an edge F already connects Vt to Vk,
4. then add the weight of E to F, discard E, and go to step 2;
5. else replace the end of E that is either Vi or Vj with Vk, and go to step 2.
6. End.

Finally, each vertex in G is a collection of several bytecode handlers. The refinement process then collects the bytecode handlers belonging to the same vertex and places them in one memory page.

5 The Process of Rewriting the Virtual Machine

Our approach emphasizes that the arrangement of bytecode handlers affects the cache miss rate. In other words, it implies that programmers should be able to speed up their programs by properly changing the order of the "case" sub-clauses in the source files. Therefore, this study optimizes the virtual machine in two distinct ways. The first approach revises the order of the "case" sub-clauses in the sources of the virtual machine; if our theory is correct, this tentative approach should show that the modified virtual machine performs better in most test cases. The second version precisely reorganizes the layout of the assembly code blocks of the bytecode handlers, and this approach should produce larger improvements than the first version.

5.1 Source-Level Rearrangement

The concept of the refining process is to arrange the order of the "case" statements in the source file (execute.c). After translating the rearranged source files, the compiler places the bytecode handlers, in machine code form, in the intended order. The following steps outline the refining procedure.

A. Profiling. Run the Java benchmark program on the unmodified KVM. A custom profiler traces the bytecode instruction sequence and generates the statistics of inter-bytecode instruction counts.
Although we could collect some patterns of instruction combinations by investigating the Java compiler, a dynamic approach can capture further application-dependent patterns.

B. Measuring the size of each bytecode handler. The refining program compiles the KVM source files and measures the code size of each bytecode handler (i.e., the size of each "case" sub-clause) by parsing intermediate files generated by the compiler.

C. Partitioning the ICFG. The previous steps collect all the information necessary for constructing the ICFG. The refining program then partitions the ICFG using a graph partition algorithm. From that result, the refining program knows how to group bytecode handlers together. For example, a partition result groups (A, B) into one bundle and (C, D, E) into another, as shown in Figure 8.
D. Rewriting the source file. According to the computed results, the refining program rewrites the source file by arranging the order of all "case" sub-clauses within the interpreter loop. Figure 9 shows the order of the "case" sub-clauses for the previous example.

switch (opcode) {
case B: …;
case A: …;
case E: …;
case D: …;
case C: …;
}

Fig. 9. The output of rearranged case statements

5.2 Assembly-Level Rearrangement

The robust implementation of the refinement process consists of two steps. The refinement process acts as a post-processor of the compiler: it parses intermediate files generated by the compiler, rearranges program blocks, and generates optimized assembly code. Our implementation is inevitably compiler-dependent and CPU-dependent; the current implementation is tightly integrated with gcc for ARM, but the approach is easy to apply to other platforms. Figure 10 outlines the processing flow, the entities involved, and the relations between them. The following paragraphs explain the function of each step.

Fig. 10. Entities in the refinement process
A. Collecting a dynamic bytecode instruction trace. The first step is to collect statistics from real Java applications or benchmarks, because the following steps need these data to partition the bytecode handlers. The modified KVM dumps the bytecode instruction trace while running Java applications, and a special program called TRACER analyzes the trace dump to find the transition counts for all instruction pairs.

B. Rearranging the KVM interpreter. This core step is realized by a program called REFINER, which acts as a post-processor of gcc. Its duty is to parse the bytecode handlers expressed in assembly code and organize them into partitions, each of which fits into one NAND flash page. The program consists of several sub-tasks, described as follows.

(i) Parsing the layout information of the original KVM. The first task is to compile the original KVM. REFINER parses the intermediate files generated by gcc and, following the structure of the interpreter in assembly code introduced in §3.2, analyzes the jump table in the LookupTable trunk to find the address and size of each bytecode handler.

(ii) Using the graph partition algorithm to group bytecode handlers into disjoint partitions. At this stage, REFINER constructs the ICFG from two key inputs: (1) the transition counts of bytecode instructions collected by TRACER, and (2) the machine code layout information collected in sub-task (i). It uses the approximate algorithm described in §4.3 to divide the undirected ICFG into disjoint partitions.

(iii) Rewriting the assembly code. REFINER parses and extracts the assembly code of all bytecode handlers. It then creates a new assembly file and dumps all bytecode handlers, partition by partition, according to the result of (ii).

(iv) Propagating symbol tables to each partition. As described in §3.2, several symbol tables are distributed in the BytecodeDispatch trunk.
For most RISC processors, such as ARM and MIPS, an instruction cannot carry arbitrary constants as operands because of the limited instruction word length. The solution is to gather the constants used into a symbol table and place this table near the instructions that access these constants; the compiler then generates instructions with relative-addressing operands to load constants from the nearby symbol tables. Take ARM for example: its application binary interface (ABI) defines two instructions, LDR and ADR, for loading a constant from a symbol table into a register [17]. The ABI restricts the maximal distance between an LDR/ADR instruction and the referenced symbol table to 4K bytes. Moreover, it would cause a cache miss if a machine instruction in page X loaded a constant si from a symbol table SY located in page Y. Our solution is to create a local symbol table SX in page X and copy the value si into the new table. The relative distance between si and the instruction then never exceeds 4 KB, nor does loading si cause a cache miss.

(v) Dumping the contents of the partitions to NAND flash pages. The aim is to map bytecode handlers to NAND flash pages: the reassembled bytecode handlers belonging to the same partition go into one NAND flash page. After that, REFINER refreshes the address and size information of all bytecode handlers. The updated information helps REFINER add padding to each partition and force the starting address of each partition to align with the boundary of a NAND flash page.
6 Evaluation

In this section, we start with a brief introduction of the environment and conditions used in the experiments. The first part of the experimental results is the outcome of the source-level rearranged virtual machine; those positive results prove that our theory works. The second part is the experiment with the assembly-level rearranged virtual machine; it further proves that our refinement approach produces better results than the source-level version.

6.1 Evaluation Environment

Figure 11 shows the block diagram of our experimental setup. To mimic real embedded applications, we implanted the Java ME KVM into uClinux for ARM7. One reason for using this platform is that uClinux supports the FLAT executable file format, which is perfect for realizing XIP. We ran KVM/uClinux on a customized gdb, which dumped memory access traces and performance statistics to files. The experimental setup assumed a specialized hardware unit acting as the NAND flash memory controller, which loads program code from NAND flash pages into the cache. It also assumed that all flash access operations work transparently, without help from the operating system; in other words, modifying the OS kernel for the experiment was unnecessary. The experiment used "Embedded Caffeine Mark 3.0" [15] as the benchmark.

Fig. 11. Hierarchy of the simulation environment. The simulated stack comprises Embedded Caffeine Mark, the J2ME API, the K Virtual Machine (KVM) 1.1, the uClinux kernel, and GDB 5.0/ARMulator on Windows/Cygwin (Intel x86), with ARM7 flash/ROM and Java heap in RAM. The tool versions were:

Title            Version
arm-elf-binutil  2.15
arm-elf-gcc      3.4.3
uClibc           0.9.18
J2ME (KVM)       CLDC 1.1
elf2flt          20040326

There are several kinds of NAND flash commodities on the market: 512, 2048, and 4096 bytes per page. In this experiment, we modeled the cache simulator under the following conditions:

1. There were four NAND flash page size options: 512, 1024, 2048, and 4096 bytes.
2. The cache was fully associative with a FIFO replacement policy.
3. The number of cache memory blocks varied from 2, 4, … up to 32.

6.2 Results of Source-Level Rearrangement

First, we rearranged the "case" sub-clauses in the source code using the method introduced above. Table 1 lists the raw cache miss statistics, and Figure 12 plots the normalized cache miss rates of the optimized KVM. The experiment
assumed a maximal cache size of 64K bytes. For each NAND flash page size, the number of cache blocks runs from 4 to (64K / NAND flash page size).

In Table 1, each column is the experimental result from one kind of KVM. The "original" column refers to statistics from the original KVM, in which the bytecode handlers are ordered as in the original machine code. The second column, "optimized," is the result from the KVM refined with our approach. For example, in the best case (2048 bytes per page, 8 cache pages), the optimized KVM generates 105,157 misses, which is only 4.5% of the misses caused by the original KVM, an improvement of 95%. Broadly speaking, the experiment shows that the optimized KVM outperforms the original KVM in most cases. Looking at the charts in Figure 12, the curves of normalized cache miss rates (i.e., optimized_miss_rate / original_miss_rate) tend to be concave: the improvement for the case of eight pages is greater than that for four pages. This benefit comes from the smaller "locality" of the optimized KVM; the cache can hold more localities, which helps reduce cache misses. After touching bottom, the cache is large enough to hold most of the KVM program code, and as the cache size grows, the cache miss numbers of all configurations converge. The miss rate at 1024 bytes × 32 blocks, however, is an exceptional case. Because our approach rearranges the order of bytecode handlers at the source level, it can hardly predict the precise starting address and code size of each bytecode handler; this is the drawback of the approach.

Fig. 12. The charts of normalized cache-miss rates from the source-level refined virtual machine (one curve per page size: 512, 1024, 2048, and 4096 bytes/page). Each chart is an experiment performed with a specific page size.
The x-axis is the size of the cache memory ( number_of_pages * page_size ).
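The cache model used in these experiments, fully associative with FIFO replacement, can be sketched as follows (the function name is ours):

```python
from collections import deque

def fifo_misses(page_trace, num_blocks):
    """Fully associative FIFO page cache: a hit leaves the queue
    untouched; a miss evicts the page that entered the cache first."""
    fifo, resident, misses = deque(), set(), 0
    for page in page_trace:
        if page in resident:
            continue
        misses += 1
        if len(fifo) == num_blocks:
            resident.discard(fifo.popleft())
        fifo.append(page)
        resident.add(page)
    return misses

# A loop over three pages thrashes two blocks but fits in three:
# fifo_misses([1, 2, 3] * 3, 2) vs. fifo_misses([1, 2, 3] * 3, 3)
```

This captures why the miss curves fall sharply once the block count covers the working set of the rearranged interpreter.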
CHAPTER IX. THE DEATH OF WALLENSTEIN AND THE TREATY OF PRAGUE.

Section I.—French Influence in Germany.

1631. § 1. Bernhard of Saxe Weimar.

In Germany, after the death of Gustavus at Lützen, it was as it was in Greece after the death of Epaminondas at Mantinea. There was more disturbance and more dispute after the battle than before it. In Sweden, Christina, the infant daughter of Gustavus, succeeded peaceably to her father's throne, and authority was exercised without contradiction by the Chancellor Oxenstjerna. But, wise and prudent as Oxenstjerna was, it was not in the nature of things that he should be listened to as Gustavus had been listened to. The chiefs of the army, no longer held in by a soldier's hand, threatened to assume an almost independent position. Foremost of these was the young Bernhard of Weimar, demanding, like Wallenstein, a place among the princely houses of Germany. In his person he hoped the glories of the elder branch of the Saxon House would revive, and the disgrace inflicted upon it by Charles V. for its attachment to the Protestant cause would be repaired. He claimed the rewards of victory for those whose swords had gained it, and payment for the soldiers, who during the winter months following the victory at Lützen had received little or nothing. His own share was to be a new duchy of Franconia, formed out of the united bishoprics of Würzburg and Bamberg. Oxenstjerna was compelled to admit his pretensions, and to confirm him in his duchy. The step was thus taken which Gustavus had undoubtedly contemplated, but which he had prudently refrained from carrying
into action. The seizure of ecclesiastical lands in which the population was Catholic was as great a barrier to peace on the one side as the seizure of the Protestant bishoprics in the north had been on the other. There was, therefore, all the more necessity to be ready for war.

§ 2. The League of Heilbronn.

If a complete junction of all the Protestant forces was not to be had, something at least was attainable. On April 23, 1633, the League of Heilbronn was signed. The four circles of Swabia, Franconia, and the Upper and Lower Rhine formed a union with Sweden for mutual support.

§ 3. Defection of Saxony.

It is not difficult to explain the defection of the Elector of Saxony. The seizure of a territory by military violence had always been most obnoxious to him. He had resisted it openly in the case of Frederick in Bohemia. He had resisted it, as far as he dared, in the case of Wallenstein in Mecklenburg. He was not inclined to put up with it in the case of Bernhard in Franconia. Nor could he fail to see that with the prolongation of the war, the chances of French intervention were considerably increasing.

§ 4. French politics.

In 1631 there had been a great effervescence of the French feudal aristocracy against the royal authority. But Richelieu stood firm. In March the king's brother, Gaston Duke of Orleans, fled from the country. In July his mother, Mary of Medici, followed his example. But they had no intention of abandoning their position. From their exile in the Spanish Netherlands they formed a close alliance with Spain, and carried on a thousand intrigues with the nobility at home. The Cardinal smote right and left with a heavy hand. Amongst his enemies were the noblest names in France. The Duke of Guise shrank from the conflict and retired to Italy to die far from his native land. The keeper of the seals died in prison. His kinsman, a marshal of France, perished on the scaffold.
In the summer of the year 1632, whilst Gustavus was conducting his last campaign, there was a great rising in the south of France. Gaston
himself came to share in the glory or the disgrace of the rebellion. The Duke of Montmorenci was the real leader of the enterprise. He was a bold and vigorous commander, the Rupert of the French cavaliers. But his gay horsemen dashed in vain against the serried ranks of the royal infantry, and he expiated his fault upon the scaffold. Gaston, helpless and low-minded as he was, could live on, secure under an ignominious pardon.

§ 5. Richelieu did for France all that could be done.

It was not the highest form of political life which Richelieu was establishing. For the free expression of opinion, as a foundation of government, France, in that day, was not prepared. But within the limits of possibility, Richelieu's method of ruling was a magnificent spectacle. He struck down a hundred petty despotisms that he might exalt a single despotism in their place. And if the despotism of the Crown was subject to all the dangers and weaknesses by which sooner or later the strength of all despotisms is eaten away, Richelieu succeeded for the time in gaining the co-operation of those classes whose good will was worth conciliating. Under him commerce and industry lifted up their heads, knowledge and literature smiled at last. Whilst Corneille was creating the French drama, Descartes was seizing the sceptre of the world of science. The first play of the former appeared on the stage in 1629. Year by year he rose in excellence, till in 1636 he produced the 'Cid;' and from that time one masterpiece followed another in rapid succession. Descartes published his first work in Holland in 1637, in which he laid down those principles of metaphysics which were to make his name famous in Europe.

§ 6. Richelieu and Germany.

All this, however welcome to France, boded no good to Germany. In the old struggles of the sixteenth century, Catholic and Protestant each believed himself to be doing the best, not merely for his own country, but for the world in general.
Alva, with his countless executions in the Netherlands, honestly believed that the Netherlands as well as Spain would be the better for the rude
surgery. The English volunteers, who charged home on a hundred battle-fields in Europe, believed that they were benefiting Europe, not England alone. It was time that all this should cease, and that the long religious strife should have its end. It was well that Richelieu should stand forth to teach the world that there were objects for a Catholic state to pursue better than slaughtering Protestants. But the world was a long way, in the seventeenth century, from the knowledge that the good of one nation is the good of all, and in putting off its religious partisanship France became terribly hard and selfish in its foreign policy. Gustavus had been half a German, and had sympathized deeply with Protestant Germany. Richelieu had no sympathy with Protestantism, no sympathy with German nationality. He doubtless had a general belief that the predominance of the House of Austria was a common evil for all, but he cared chiefly to see Germany too weak to support Spain. He accepted the alliance of the League of Heilbronn, but he would have been equally ready to accept the alliance of the Elector of Bavaria if it would have served him as well in his purpose of dividing Germany.
§ 7. His policy French, not European.

The plan of Gustavus might seem unsatisfactory to a patriotic German, but it was undoubtedly conceived with the intention of benefiting Germany. Richelieu had no thought of constituting any new organization in Germany. He was already aiming at the left bank of the Rhine. The Elector of Treves, fearing Gustavus, and doubtful of the power of Spain to protect him, had called in the French, and had established them in his new fortress of Ehrenbreitstein, which looked down from its height upon the low-lying buildings of Coblentz, and guarded the junction of the Rhine and the Moselle. The Duke of Lorraine had joined Spain, and had intrigued with Gaston. In the summer of 1632 he had been compelled by a French army to make his submission. The next year he moved again, and the French again interfered, and wrested from him his capital of Nancy. Richelieu treated the old German frontier-land as having no rights against the King of France.

Section II.—Wallenstein's Attempt to dictate Peace.

§ 1. Saxon negotiations with Wallenstein.

Already, before the League of Heilbronn was signed, the Elector of Saxony was in negotiation with Wallenstein. In June peace was all but concluded between them. The Edict of Restitution was to be cancelled. A few places on the Baltic coast were to be ceded to Sweden, and a portion at least of the Palatinate was to be restored to the son of the Elector Frederick, whose death in the preceding winter had removed one of the difficulties in the way of an agreement. The precise form in which the restitution should take place, however, still remained to be settled. Such a peace would doubtless have been highly disagreeable to adventurers like Bernhard of Weimar, but it would have given the Protestants of Germany all that they could reasonably expect to gain, and would have given the House of Austria one last chance of
taking up the championship of national interests against foreign aggression.
[Sidenotes: § 2. Opposition to Wallenstein. § 3. General disapprobation of his proceedings. § 4. Wallenstein and the Swedes.]
Such last chances, in real life, are seldom taken hold of for any useful purpose. If Ferdinand had had it in him to rise up in the position of a national ruler, he would have been in that position long before. His confessor, Father Lamormain, declared against the concessions which Wallenstein advised, and the word of Father Lamormain had always great weight with Ferdinand. Even if Wallenstein had been single-minded he would have had difficulty in meeting such opposition. But Wallenstein was not single-minded. He proposed to meet the difficulties which were made to the restitution of the Palatinate by giving the Palatinate, largely increased by neighbouring territories, to himself. He would thus have a fair recompense for the loss of Mecklenburg, which he could no longer hope to regain. He fancied that the solution would satisfy everybody. In fact, it displeased everybody. Even the Spaniards, who had been on his side in 1632, were alienated by it. They were especially jealous of the rise of any strong power near the line of march between Italy and the Spanish Netherlands. The greater the difficulties in Wallenstein's way the more determined he was to overcome them. Regarding himself, with some justification, as a power in Germany, he fancied himself able to act at the head of his army as if he were himself the ruler of an independent state. If the Emperor listened to Spain and his confessor in 1633 as he had listened to Maximilian and his confessor in 1630, Wallenstein might step forward and force upon him a wiser policy. Before the end of August he had opened a communication with Oxenstjerna, asking for his assistance in effecting a reasonable compromise, whether the Emperor liked it or not.
But he had forgotten that such a proposal as this can only be accepted where there is confidence in him who makes it. In Wallenstein—the man of many schemes and many
intrigues—no man had any confidence whatever.
[Sidenotes: § 5. Was he in earnest? § 6. He attacks the Saxons. § 7. Bernhard at Ratisbon. § 8. Wallenstein's difficulties.]
Oxenstjerna cautiously replied that if Wallenstein meant to join him against the Emperor he had better be the first to begin the attack. Whether Wallenstein seriously meant at this time to move against the Emperor it is impossible to say. He loved to enter upon plots in every direction without binding himself to any; but he was plainly in a dangerous position. How could he impose peace upon all parties when no single party trusted him? If he was not trusted, however, he might still make himself feared. Throwing himself vigorously upon Silesia, he forced the Swedish garrisons to surrender, and, presenting himself upon the frontiers of Saxony, again offered peace to the two northern electors. But Wallenstein could not be everywhere. Whilst the electors were still hesitating, Bernhard made a dash at Ratisbon, and firmly established himself in the city, within a little distance of the Austrian frontier. Wallenstein, turning sharply southward, stood in the way of his further advance, but he did nothing to recover the ground which had been lost. He was himself weary of the war. In his first command he had aimed at crushing out all opposition in the name of the imperial authority. His judgment was too clear to allow him to run the old course. He saw plainly that strength was now to be gained only by allowing each of the opposing forces their full weight. 'If the Emperor,' he said, 'were to gain ten victories it would do him no good. A single defeat would ruin him.' In December he was back again in Bohemia. It was a strange, Cassandra-like position, to be wiser than all the world, and to be listened to by no one; to suffer the fate of supreme intelligence which touches no moral chord and awakens no human sympathy. For many months the hostile influences had been gaining strength at Vienna.
There were War-Office officials whose wishes Wallenstein
systematically disregarded; Jesuits who objected to peace with heretics at all; friends of the Bavarian Maximilian who thought that the country round Ratisbon should have been better defended against the enemy; and Spaniards who were tired of hearing that all matters of importance were to be settled by Wallenstein alone.
[Sidenotes: § 9. Opposition of Spain. § 10. The Cardinal-Infant. § 11. The Emperor's hesitation.]
The Spanish opposition was growing daily. Spain now looked to the German branch of the House of Austria to make a fitting return for the aid which she had rendered in 1620. Richelieu, having mastered Lorraine, was pushing on towards Alsace, and if Spain had good reasons for objecting to see Wallenstein established in the Palatinate, she had far better reasons for objecting to see France established in Alsace. Yet for all these special Spanish interests Wallenstein cared nothing. His aim was to place himself at the head of a German national force, and to regard all questions simply from his own point of view. If he wished to see the French out of Alsace and Lorraine, he wished to see the Spaniards out of Alsace and Lorraine as well. And, as was often the case with Wallenstein, a personal difference arose by the side of the political difference. The Emperor's eldest son, Ferdinand, the King of Hungary, was married to a Spanish Infanta, the sister of Philip IV., who had once been the promised bride of Charles I. of England. Her brother, another Ferdinand, usually known from his rank in Church and State as the Cardinal-Infant, had recently been appointed Governor of the Spanish Netherlands, and was waiting in Italy for assistance to enable him to conduct an army through Germany to Brussels. That assistance Wallenstein refused to give. The military reasons which he alleged for his refusal may have been good enough, but they had a dubious sound in Spanish ears. It looked as if he was simply jealous of Spanish influence in Western Germany.
Such were the influences which were brought to bear upon the Emperor after Wallenstein's return from Ratisbon in December. Ferdinand, as usual, was distracted between the two courses proposed. Was he to make the enormous concessions to the Protestants involved in the plan of Wallenstein; or was he to fight it out with France and the Protestants together according to the plan of Spain?
[Sidenotes: 1634. § 12. Wallenstein and the army.]
To Wallenstein by this time the Emperor's resolutions had become almost a matter of indifference. He had resolved to force a reasonable peace upon Germany, with the Emperor, if it might be so; without him, if he refused his support. Wallenstein was well aware that his whole plan depended on his hold over the army. In January he received assurances from three of his principal generals, Piccolomini, Gallas, and Aldringer, that they were ready to follow him wheresoever he might lead them, and he was sanguine enough to take these assurances for far more than they were worth. Neither they nor he himself were aware to what lengths he would go in the end. For the present it was a mere question of putting pressure upon the Emperor to induce him to accept a wise and beneficent peace.

Section III.—Resistance to Wallenstein's Plans.

[Sidenotes: § 1. Oñate's movements. § 2. Belief at Vienna that Wallenstein was a traitor.]
The Spanish ambassador, Oñate, was ill at ease. Wallenstein, he was convinced, was planning something desperate. What it was he could hardly guess; but he was sure that it was something most prejudicial to the Catholic religion and the united House of Austria. The worst was that Ferdinand could not be persuaded that there was cause for suspicion. 'The sick man,' said Oñate, speaking of the Emperor, 'will die in my arms without my being able to help him.' Such were Oñate's feelings toward the end of January. Then came information that the case was worse than even he had deemed possible. Wallenstein, he learned, had been intriguing with the Bohemian exiles, who had offered, with
Richelieu's consent, to place upon his head the crown of Bohemia, which had fourteen years before been snatched from the unhappy Frederick.
[Sidenotes: § 3. Oñate informs Ferdinand. § 4. Decision of the Emperor against Wallenstein. § 5. Determination to displace Wallenstein.]
In all this there was much exaggeration. Though Wallenstein had listened to these overtures, it is almost certain that he had not accepted them. But neither had he revealed them to the government. It was his way to keep in his hands the threads of many intrigues to be used or not to be used as occasion might serve. Oñate, naturally enough, believed the worst. And for him the worst was the best. He went triumphantly to Eggenberg with his news, and then to Ferdinand. Coming alone, this statement might perhaps have been received with suspicion. Coming, as it did, after so many evidences that the general had been acting in complete independence of the government, it carried conviction with it. Ferdinand had long been tossed backwards and forwards by opposing influences. He had given no answer to Wallenstein's communication of the terms of peace arranged with Saxony. The necessity of deciding, he said, would not allow him to sleep. It was in his thoughts when he lay down and when he arose. Prayers to God to enlighten the mind of the Emperor had been offered in the churches of Vienna. All this hesitation was now at an end. Ferdinand resolved to continue the war in alliance with Spain, and, as a necessary preliminary, to remove Wallenstein from his generalship. But it was more easily said than done. A declaration was drawn up releasing the army from its obedience to Wallenstein, and provisionally appointing Gallas, who had by this time given assurances of loyalty, to the chief command. It was intended, if circumstances proved favourable, to intrust the command ultimately to the young King of Hungary.
[Sidenotes: § 6. The Generals gained over. § 7. Attempt to seize Wallenstein. § 8. Wallenstein at Pilsen. § 9. The colonels engage to support him.]
The declaration was kept secret for many days. To publish it would only be to provoke the rebellion which was feared. The first thing to be done was to gain over the principal generals. In the beginning of February Piccolomini and Aldringer expressed their readiness to obey the Emperor rather than Wallenstein. Commanders of a secondary rank would doubtless find their position more independent under an inexperienced young man like the King of Hungary than under the first living strategist. These two generals agreed to make themselves masters of Wallenstein's person and to bring him to Vienna to answer the accusations of treason against him. For Oñate this was not enough. It would be easier, he said, to kill the general than to carry him off. The event proved that he was right. On February 7, Aldringer and Piccolomini set off for Pilsen with the intention of capturing Wallenstein. But they found the garrison faithful to its general, and they did not even venture to make the attempt. Wallenstein's success depended on his chance of carrying with him the lower ranks of the army. On the 19th he summoned the colonels round him and assured them that he would stand security for money which they had advanced in raising their regiments, the repayment of which had been called in question. Having thus won them to a favourable mood, he told them that it had been falsely stated that he wished to change his religion and attack the Emperor. On the contrary, he was anxious to conclude a peace which would benefit the Emperor and all who were concerned. As, however, certain persons at Court had objected to it, he wished to ask the opinion of the army on its terms. But he must first of all know whether they were ready to support him, as he knew that there was an intention to put a disgrace upon him.
It was not the first time that Wallenstein had appealed to the colonels. A month before, when the news had come of the alienation of the Court, he had induced them to sign an acknowledgment that they would stand by him, from which all reference to the possibility of his dismissal was expressly excluded. They now, on February 20, signed a fresh agreement, in which they engaged to defend him against the machinations of his enemies, upon his promising to undertake nothing against the Emperor or the Catholic religion.

Section IV.—Assassination of Wallenstein.

Wallenstein thus hoped, with the help of the army, to force the Emperor's hand, and to obtain his signature to the peace. Of the co-operation of the Elector of Saxony he was already secure; and since the beginning of February he had been pressing Oxenstjerna and Bernhard to come to his aid. If all the armies in the field declared for peace, Ferdinand would be compelled to abandon the Spaniards and to accept the offered terms. Without some such hazardous venture, Wallenstein would be checkmated by Oñate. The Spaniard had been unceasingly busy during these weeks of intrigue. Spanish gold was provided to content the colonels for their advances, and hopes of promotion were scattered broadcast amongst them. Two other of the principal generals had gone over to the Court, and on February 18, the day before the meeting at Pilsen, a second declaration had been issued accusing Wallenstein of treason, and formally depriving him of the command.
[Sidenotes: § 1. The garrison of Prague abandons him.]
Wallenstein, before this declaration reached him, had already appointed a meeting of large masses of troops to take place on the White Hill before Prague on the 21st, where he hoped to make his intentions more generally known. But he had miscalculated the devotion of the army to his person. The garrison of Prague refused to obey his orders. Soldiers and citizens alike declared for the Emperor. He was obliged to retrace his steps. 'I had peace in my hands,' he said. Then he added, 'God is righteous,' as if still counting on the aid of Heaven in so good a work.
[Sidenotes: § 2. Understanding with the Swedes. § 3. His arrival at Eger. § 4. Wallenstein's assassination.]
He did not yet despair. He ordered the colonels to meet him at Eger, assuring them that all that he was doing was for the Emperor's good. He had now at last hopes of other assistance. Oxenstjerna, indeed, ever cautious, still refused to do anything for him till he had positively declared against the Emperor. Bernhard, equally prudent for some time, had been carried away by the news, which reached him on the 21st, of the meeting at Pilsen, and the Emperor's denouncement of the general. Though he was still suspicious, he moved in the direction of Eger. On the 24th Wallenstein entered Eger. In what precise way he meant to escape from the labyrinth in which he was, or whether he had still any clear conception of the course before him, it is impossible to say. But Arnim was expected at Eger, as well as Bernhard, and it may be that Wallenstein fancied still that he could gather all the armies of Germany into his hands, to defend the peace which he was ready to make. The great scheme, however, whatever it was, was doomed to failure. Amongst the officers who accompanied him was a Colonel Butler, an Irish Catholic, who had no fancy for such dealings with Swedish and Saxon heretics. Already he had received orders from Piccolomini to bring in Wallenstein dead or alive. No official instructions had been given to Piccolomini. But the thought was certain to arise in the minds of all who retained their loyalty to the Emperor. A general who attempts to force his sovereign to a certain political course with the help of the enemy is placed, by that very fact, beyond the pale of law. The actual decision did not lie with Butler. The fortress was in the hands of two Scotch officers, Leslie and Gordon. As Protestants, they might have been expected to feel some sympathy with Wallenstein. But the sentiment of military honour prevailed.
On the morning of the 25th they were called upon by one of the general's confederates to take orders from Wallenstein alone. 'I have sworn to obey the Emperor,' answered Gordon, at last, 'and who shall release me from my oath?' 'You, gentlemen,' was the reply, 'are strangers in the Empire. What have you to do with the Empire?' Such arguments were addressed to deaf ears. That afternoon Butler, Leslie, and Gordon consulted together. Leslie, usually a silent, reserved man, was the first to speak. 'Let us kill the traitors,' he said. That evening Wallenstein's chief supporters were butchered at a banquet. Then there was a short and sharp discussion whether Wallenstein's life should be spared. Bernhard's troops were known to be approaching, and the conspirators dared not leave a chance of escape open. An Irish captain, Devereux by name, was selected to do the deed. Followed by a few soldiers, he burst into the room where Wallenstein was preparing for rest. 'Scoundrel and traitor,' were the words which Wallenstein flung at Devereux as he entered. Then, stretching out his arms, he received the fatal blow in his breast. The busy brain of the great calculator was still for ever.
[Sidenotes: § 5. Reason of his failure. § 6. Comparison between Gustavus and Wallenstein.]
The attempt to snatch at a wise and beneficent peace by mingled force and intrigue had failed. Other generals—Cæsar, Cromwell, Napoleon—have succeeded to supreme power with the support of an armed force. But they did so by placing themselves at the head of the civil institutions of their respective countries, and by making themselves the organs of a strong national policy. Wallenstein stood alone in attempting to guide the political destinies of a people, while remaining a soldier and nothing more. The plan was doomed to failure, and is only excusable on the ground that there were no national institutions at the head of which Wallenstein could place himself; not even a chance of creating such institutions afresh. In spite of all his faults, Germany turns ever to Wallenstein as she turns to no other amongst the leaders of the Thirty Years' War.
From amidst the divisions and weaknesses of his native country, a great poet enshrined his memory in a succession of noble dramas. Such faithfulness is not without a reason. Gustavus's was a higher
nature than Wallenstein's. Some of his work, at least the rescue of German Protestantism from oppression, remained imperishable, whilst Wallenstein's military and political success vanished into nothingness. But Gustavus was a hero not of Germany as a nation, but of European Protestantism. His Corpus Evangelicorum was at the best a choice of evils to a German. Wallenstein's wildest schemes, impossible of execution as they were by military violence, were always built upon the foundation of German unity. In the way in which he walked that unity was doubtless unattainable. To combine devotion to Ferdinand with religious liberty was as hopeless a conception as it was to burst all bonds of political authority on the chance that a new and better world would spring into being out of the discipline of the camp. But during the long dreary years of confusion which were to follow, it was something to think of the last supremely able man whose life had been spent in battling against the great evils of the land, against the spirit of religious intolerance, and the spirit of division.

Section V.—Imperialist Victories and the Treaty of Prague.

[Sidenotes: § 1. Campaign of 1634.]
For the moment, the House of Austria seemed to have gained everything by the execution or the murder of Wallenstein, whichever we may choose to call it. The army was reorganized and placed under the command of the Emperor's son, the King of Hungary. The Cardinal-Infant, now eagerly welcomed, was preparing to join him through Tyrol. And while on the one side there was union and resolution, there was division and hesitation on the other. The Elector of Saxony stood aloof from the League of Heilbronn, weakly hoping that the terms of peace which had been offered him by Wallenstein would be confirmed by the Emperor now that Wallenstein was gone. Even amongst those who remained under arms there was no unity of purpose.
Bernhard, the daring and impetuous, was not of one mind with the cautious Horn, who commanded the Swedish forces, and
both agreed in thinking Oxenstjerna remiss because he did not supply them with more money than he was able to provide.
[Sidenotes: § 2. The Battle of Nördlingen. § 3. Important results from it. § 4. French intervention.]
As might have been expected under these circumstances, the imperialists made rapid progress. Ratisbon, the prize of Bernhard the year before, surrendered to the King of Hungary in July. Then Donauwörth was stormed, and siege was laid to Nördlingen. On September 2 the Cardinal-Infant came up with 15,000 men. The enemy watched the siege with a force far inferior in numbers. Bernhard was eager to put all to the test of battle. Horn recommended caution in vain. Against his better judgment he consented to fight. On September 6 the attack was made. By the end of the day Horn was a prisoner, and Bernhard was in full retreat, leaving 10,000 of his men dead upon the field, and 6,000 prisoners in the hands of the enemy, whilst the imperialists lost only 1,200 men. Since the day of Breitenfeld, three years before, there had been no such battle fought as this of Nördlingen. As Breitenfeld had recovered the Protestant bishoprics of the north, Nördlingen recovered the Catholic bishoprics of the south. Bernhard's Duchy of Franconia disappeared in a moment under the blow. Before the spring of 1635 came, the whole of South Germany, with the exception of one or two fortified posts, was in the hands of the imperial commanders. The Cardinal-Infant was able to pursue his way to Brussels, with the assurance that he had done a good stroke of work on the way. The victories of mere force are never fruitful of good. As it had been after the successes of Tilly in 1622, and the successes of Wallenstein in 1626 and 1627, so it was now with the successes of the King of Hungary in 1634 and 1635. The imperialist armies had gained victories, and had taken cities. But the Emperor was none the nearer to the confidence of Germans.
An alienated people, crushed by military force, served merely as a bait to tempt foreign aggression, and to make the way easy before it. After 1622, the King of Denmark had
been called in. After 1627, an appeal was made to the King of Sweden. After 1634, Richelieu found his opportunity. The bonds between France and the mutilated League of Heilbronn were drawn more closely. German troops were to be taken into French pay, and the empty coffers of the League were filled with French livres. He who holds the purse holds the sceptre, and the princes of Southern and Western Germany, whether they wished it or not, were reduced to the position of satellites revolving round the central orb at Paris.
[Sidenotes: § 5. The Peace of Prague.]
Nowhere was the disgrace of submitting to French intervention felt so deeply as at Dresden. The battle of Nördlingen had cut short any hopes which John George might have entertained of obtaining that which Wallenstein would willingly have granted him. But, on the other hand, Ferdinand had learned something from experience. He would allow the Edict of Restitution to fall, though he was resolved not to make the sacrifice in so many words. But he refused to replace the Empire in the condition in which it had been before the war. The year 1627 was to be chosen as the starting point for the new arrangement. The greater part of the northern bishoprics would thus be saved to Protestantism. But Halberstadt would remain in the hands of a Catholic bishop, and the Palatinate would be lost to Protestantism for ever. Lusatia, which had been held by the Elector of Saxony as security for his expenses in the war of 1620, was to be ceded to him permanently, and Protestantism in Silesia was to be placed under the guarantee of the Emperor. Finally, Lutheranism alone was still reckoned as the privileged religion, so that Hesse Cassel and the other Calvinist states gained no security at all. On May 30, 1635, a treaty embodying these arrangements was signed at Prague by the representatives of the Emperor and the Elector of Saxony. It was intended not to be a separate treaty, but to be the starting point of a general pacification.
Most of the princes and towns so accepted it, after more or less delay, and acknowledged the supremacy of the Emperor on its conditions. Yet it was not in the nature of things that it should put an end to the war. It was not an agreement which any one was likely to be enthusiastic about. The
ties which bound Ferdinand to his Protestant subjects had been rudely broken, and the solemn promise to forget and forgive could not weld the nation into that unity of heart and spirit which was needed to resist the foreigner.
[Sidenotes: § 6. It fails in securing general acceptance. § 7. Degeneration of the war.]
A Protestant of the north might reasonably come to the conclusion that the price to be paid to the Swede and the Frenchman for the vindication of the rights of the southern Protestants was too high to make it prudent for him to continue the struggle against the Emperor. But it was hardly likely that he would be inclined to fight very vigorously for the Emperor on such terms. If the treaty gave no great encouragement to anyone who was comprehended by it, it threw still further into the arms of the enemy those who were excepted from its benefits. The leading members of the League of Heilbronn were excepted from the general amnesty, though hopes of better treatment were held out to them if they made their submission. The Landgrave of Hesse Cassel was shut out as a Calvinist. Besides such as nourished legitimate grievances, there were others who, like Bernhard, were bent upon carving out a fortune for themselves, or who had so blended their own interests in their minds with consideration for the public good as to lose all sense of any distinction between the two. There was no lack here of materials for a long and terrible struggle. But there was no longer any noble aim in view on either side. The ideal of Ferdinand and Maximilian was gone. The Church was not to recover its lost property. The Empire was not to recover its lost dignity. The ideal of Gustavus of a Protestant political body was equally gone. Even the ideal of Wallenstein, that unity might be founded on an army, had vanished. From henceforth French and Swedes on the one side, Austrians and Spaniards on the other, were busily engaged in riving at the corpse of the dead Empire.
The great quarrel of principle had merged into a mere quarrel between the Houses of Austria and Bourbon, in which the shred of principle which still remained in the
question of the rights of the southern Protestants was almost entirely disregarded.
[Sidenotes: § 8. Condition of Germany. 1636. § 9. Notes of an English traveller.]
Horrible as the war had been from its commencement, it was every day assuming a more horrible character. On both sides all traces of discipline had vanished in the dealings of the armies with the inhabitants of the countries in which they were quartered. Soldiers treated men and women as none but the vilest of mankind would now treat brute beasts. 'He who had money,' says a contemporary, 'was their enemy. He who had none was tortured because he had it not.' Outrages of unspeakable atrocity were committed everywhere. Human beings were driven naked into the streets, their flesh pierced with needles, or cut to the bone with saws. Others were scalded with boiling water, or hunted with fierce dogs. The horrors of a town taken by storm were repeated every day in the open country. Even apart from its excesses, the war itself was terrible enough. When Augsburg was besieged by the imperialists, after their victory at Nördlingen, it contained an industrious population of 70,000 souls. After a siege of seven months, 10,000 living beings, wan and haggard with famine, remained to open the gates to the conquerors, and the great commercial city of the Fuggers dwindled down into a country town. How is it possible to bring such scenes before our eyes in their ghastly reality? Let us turn for the moment to some notes taken by the companion of an English ambassador who passed through the country in 1636. As the party were towed up the Rhine from Cologne, on the track so well known to the modern tourist, they passed by many villages 'pillaged and shot down.' Further on, a French garrison was in Ehrenbreitstein, firing down upon Coblentz, which had just been taken by the imperialists. 'They in the town, if they do but look out of their windows, have a bullet presently presented at their head.' More to the south, things grew worse.
At Bacharach, 'the poor people are found dead with grass in their mouths.' At Rüdesheim, many persons 'were praying where dead bones were in a little old house'; and here his Excellency gave some relief to the poor, 'which were almost starved, as it appeared by the violence they used to get it from one another.' At Mentz, the ambassador was obliged to remain on shipboard, for there was 'nothing to relieve us, since it was taken by the King of Sweden, and miserably battered.'... Here, likewise, the poor people were almost starved, 'and those that could relieve others before now humbly begged to be relieved; and after supper all had relief sent from the ship ashore, at the sight of which they strove so violently that some of them fell into the Rhine, and were like to have been drowned.' Up the Main, again, all the towns, villages, and castles 'be battered, pillaged, or burnt.' After leaving Würzburg, the ambassador's train came to plundered villages, and then to Neustadt, 'which hath been a fair city, though now pillaged and burnt miserably.' Poor children were sitting at their doors 'almost starved to death, his Excellency giving them food and leaving money with their parents to help them, if but for a time.' In the Upper Palatinate, they passed by churches 'demolished to the ground,' and through woods in danger, understanding that 'Croats were lying hereabout.' Further on they stayed for dinner at 'a poor little village which hath been pillaged eight-and-twenty times in two years, and twice in one day.' And so on, and so on. The corner of the veil is lifted up in the pages of the old book, and the rest is left to the imagination to picture forth, as best it may, the misery behind. After reading the sober narrative, we shall perhaps not be inclined to be so very hard upon the Elector of Saxony for making peace at Prague.
CHAPTER X.
THE PREPONDERANCE OF FRANCE.

Section I.—Open Intervention of France.

[Sidenotes: § 1. Protestantism not yet out of danger. § 2. The allies of France.]
The peacemakers of Prague hoped to restore the Empire to its old form. But this could not be. Things done cannot pass away as though they had never been. Ferdinand's attempt to gain a partizan's advantage for his religion by availing himself of legal forms had given rise to a general distrust. Nations and governments, like individual men, are tied and bound by the chain of their sins, from which they can be freed only when a new spirit is breathed into them. Unsatisfactory as the territorial arrangements of the peace were, the entire absence of any constitutional reform in connexion with the peace was more unsatisfactory still. The majority in the two Upper Houses of the Diet was still Catholic; the Imperial Council was still altogether Catholic. It was possible that the Diet and Council, under the teaching of experience, might refrain from pushing their pretensions as far as they had pushed them before; but a government which refrains from carrying out its principles from motives of prudence cannot inspire confidence. A strong central power would never arise in such a way, and a strong central power to defend Germany against foreign invasion was the especial need of the hour. In the failure of the Elector of Saxony to obtain some of the most reasonable of the Protestant demands lay the best excuse of men like Bernhard of Saxe-Weimar and William of Hesse Cassel for refusing the terms of accommodation offered. Largely as personal ambition and greed
of territory found a place in the motives of these men, it is not absolutely necessary to assert that their religious enthusiasm was nothing more than mere hypocrisy. They raised the war-cry of God with us before rushing to the storm of a city doomed to massacre and pillage; they set apart days for prayer and devotion when battle was at hand—veiling, perhaps, from their own eyes the hideous misery which they were spreading around, in contemplation of the loftiness of their aim: for, in all but the most vile, there is a natural tendency to shrink from contemplating the lower motives of action, and to fix the eyes solely on the higher. But the ardour inspired by a military career, and the mere love of fighting for its own sake, must have counted for much; and the refusal to submit to a domination which had been so harshly used soon grew into a restless disdain of all authority whatever. The nobler motives which had imparted a glow to the work of Tilly and Gustavus, and which even lit up the profound selfishness of Wallenstein, flickered and died away, till the fatal disruption of the Empire was accomplished amidst the strivings and passions of heartless and unprincipled men.

§ 3. Foreign intervention.

The work of riving Germany in pieces was not accomplished by Germans alone. As in nature a living organism which has become unhealthy and corrupt is seized upon by the lower forms of animal life, a nation divided amongst itself, and devoid of a sense of life within it higher than the aims of parties and individuals, becomes the prey of neighbouring nations, which would not have ventured to meddle with it in the days of its strength. The carcase was there, and the eagles were gathered together. The gathering of Wallenstein's army in 1632, the overthrow of Wallenstein in 1634, had alike been made possible by the free use of Spanish gold.
The victory of Nördlingen had been owing to the aid of Spanish troops; and the aim of Spain was not the greatness or peace of Germany, but at the best the greatness of the House of Austria in Germany; at the worst, the maintenance of the old system of intolerance and unthinking obedience, which had been the ruin of Germany. With Spain for an ally, France was a necessary enemy. The strife for supreme power
between the two representative states of the old system and the new could not long be delayed, and the German parties would be dragged, consciously or unconsciously, in their wake. If Bernhard became a tool of Richelieu, Ferdinand became a tool of Spain. In this phase of the war Protestantism and Catholicism, tolerance and intolerance, ceased to be the immediate objects of the strife.

§ 4. Alsace and Lorraine.

The possession of Alsace and Lorraine rose into primary importance, not because, as in our own days, Germany needed a bulwark against France, or France needed a bulwark against Germany, but because Germany was not strong enough to prevent these territories from becoming the highway of intercourse between Spain and the Spanish Netherlands. The command of the sea was in the hands of the Dutch, and the valley of the Upper Rhine was the artery through which the life blood of the Spanish monarchy flowed. If Spain or the Emperor, the friend of Spain, could hold that valley, men and munitions of warfare would flow freely to the Netherlands to support the Cardinal-Infant in his struggle with the Dutch. If Richelieu could lay his hand heavily upon it, he had seized his enemy by the throat, and could choke him as he lay.

§ 5. Richelieu asks for fortresses in Alsace.

After the battle of Nördlingen, Richelieu's first demand from Oxenstjerna as the price of his assistance had been the strong places held by Swedish garrisons in Alsace.

§ 6. War between France and Spain.

As soon as he had them safely under his control, he felt himself strong enough to declare war openly against Spain. On May 19, eleven days before peace was agreed upon at Prague, the declaration of war was delivered at Brussels by a French herald. To the astonishment of all, France was able to place in the field what was then considered the enormous number of 132,000 men. One army was to drive the Spaniards out of the Milanese, and to set free the Italian princes.
Another was to defend Lorraine whilst Bernhard crossed the Rhine and carried on war in Germany. The main force
was to be thrown upon the Spanish Netherlands, and, after effecting a junction with the Prince of Orange, was to strike directly at Brussels.

Section II.—Spanish Successes.

§ 1. Failure of the French attack on the Netherlands.

Precisely in the most ambitious part of his programme Richelieu failed most signally. The junction with the Dutch was effected without difficulty; but the hoped-for instrument of success proved the parent of disaster. Whatever Flemings and Brabanters might think of Spain, they soon made it plain that they would have nothing to do with the Dutch. A national enthusiasm against Protestant aggression from the north made defence easy, and the French army had to return completely unsuccessful. Failure, too, was reported from other quarters. The French armies had no experience of war on a large scale, and no military leader of eminent ability had yet appeared to command them. The Italian campaign came to nothing, and it was only by a supreme effort of military skill that Bernhard, driven to retreat, preserved his army from complete destruction.

§ 2. Spanish invasion of France.

In 1636 France was invaded. The Cardinal-Infant crossed the Somme, took Corbie, and advanced to the banks of the Oise. All Paris was in commotion. An immediate siege was expected, and inquiry was anxiously made into the state of the defences. Then Richelieu, coming out of his seclusion, threw himself upon the nation. He appealed to the great legal, ecclesiastical, and commercial corporations of Paris, and he did not appeal in vain. Money, voluntarily offered, came pouring into the treasury for the payment of the troops. Those who had no money gave themselves eagerly for military service. It was remarked that Paris, so fanatically Catholic in the days of St. Bartholomew and the League, entrusted its defence
to the Protestant marshal La Force, whose reputation for integrity inspired universal confidence. The resistance undertaken in such a spirit in Paris was imitated by the other towns of the kingdom. Even the nobility, jealous as they were of the Cardinal, forgot their grievances as an aristocracy in their duties as Frenchmen. Their devotion was not put to the test of action.

§ 3. The invaders driven back.

The invaders, frightened at the unanimity opposed to them, hesitated and turned back. In September, Lewis took the field in person. In November he appeared before Corbie; and the last days of the year saw the fortress again in the keeping of a French garrison. The war, which was devastating Germany, was averted from France by the union produced by the mild tolerance of Richelieu.

§ 4. Battle of Wittstock.

In Germany, too, affairs had taken a turn. The Elector of Saxony had hoped to drive the Swedes across the sea; but a victory gained on October 4, at Wittstock, by the Swedish general, Baner, the ablest of the successors of Gustavus, frustrated his intentions. Henceforward North Germany was delivered over to a desolation with which even the misery inflicted by Wallenstein affords no parallel.

§ 5. Death of Ferdinand II.

Amidst these scenes of failure and misfortune the man whose policy had been mainly responsible for the miseries of his country closed his eyes for ever. On February 15, 1637, Ferdinand II. died at Vienna. Shortly before his death the King of Hungary had been elected King of the Romans, and he now, by his father's death, became the Emperor Ferdinand III.

§ 6. Ferdinand III.

The new Emperor had no vices. He did not even care, as his father did, for hunting and music. When the battle of Nördlingen was won under his command he was praying in his tent whilst his soldiers were fighting. He sometimes took upon himself to give military orders, but the handwriting in which they were conveyed was such an abominable
scrawl that they only served to enable his generals to excuse their defeats by the impossibility of reading their instructions. His great passion was for keeping strict accounts. Even the Jesuits, it is said, found out that, devoted as he was to his religion, he had a sharp eye for his expenditure. One day they complained that some tolls bequeathed to them by his father had not been made over to them, and represented the value of the legacy as a mere trifle of 500 florins a year. The Emperor at once gave them an order upon the treasury for the yearly payment of the sum named, and took possession of the tolls for the maintenance of the fortifications of Vienna. The income thus obtained is said to have been no less than 12,000 florins a year. Such a man was not likely to rescue the Empire from its miseries.

§ 7. Campaign of 1637.

The first year of his reign, however, was marked by a gleam of good fortune. Baner lost all that he had gained at Wittstock, and was driven back to the shores of the Baltic. On the western frontier the imperialists were equally successful. Würtemberg accepted the Peace of Prague, and submitted to the Emperor. A more general peace was talked of. But till Alsace was secured to one side or the other no peace was possible.

Section III.—The Struggle for Alsace.

§ 1. The capture of Breisach.

The year 1638 was to decide the question. Bernhard was looking to the Austrian lands in Alsace and the Breisgau as a compensation for his lost duchy of Franconia. In February he was besieging Rheinfelden. Driven off by the imperialists on the 26th, he re-appeared unexpectedly on March 3, taking the enemy by surprise. They had not even sufficient powder with them to load their guns, and the victory of Rheinfelden was the result. On the 24th Rheinfelden itself surrendered. Freiburg followed its example on April 22, and Bernhard proceeded to undertake the siege of Breisach, the great
fortress which domineered over the whole valley of the Upper Rhine. Small as his force was, he succeeded, by a series of rapid movements, in beating off every attempt to introduce supplies, and on December 19 he entered the place in triumph.

§ 2. The capture a turning point in the war.

The campaign of 1638 was the turning point in the struggle between France and the united House of Austria. A vantage ground was then won which was never lost.

§ 3. Bernhard wishes to keep Breisach.

Bernhard himself, however, was loth to realize the world-wide importance of the events in which he had played his part. He fancied that he had been fighting for his own, and he claimed the lands which he had conquered for himself. He received the homage of the citizens of Breisach in his own name. He celebrated a Lutheran thanksgiving festival in the cathedral. But the French Government looked upon the rise of an independent German principality in Alsace with as little pleasure as the Spanish government had contemplated the prospect of the establishment of Wallenstein in the Palatinate. They ordered Bernhard to place his conquests under the orders of the King of France.

§ 4. Refuses to dismember the Empire.

Strange as it may seem, the man who had done so much to tear in pieces the Empire believed, in a sort of way, in the Empire still. "I will never suffer," he said, in reply to the French demands, "that men can truly reproach me with being the first to dismember the Empire."

§ 5. Death of Bernhard.

The next year he crossed the Rhine with the most brilliant expectations. Baner had recovered strength, and was pushing on through North Germany into Bohemia. Bernhard hoped that he too might strike a blow which would force on a peace on his own conditions. But his greatest achievement, the capture of Breisach, was also his last. A fatal disease seized upon him when he had hardly entered upon the campaign. On July 8, 1639, he died.
§ 6. Alsace in French possession.

There was no longer any question of the ownership of the fortresses in Alsace and the Breisgau. French governors entered into possession. A French general took the command of Bernhard's army. For the next two or three years Bernhard's old troops fought up and down Germany in conjunction with Baner, not without success, but without any decisive victory. The French soldiers were becoming, like the Germans, inured to war. The lands on the Rhine were not easily to be wrenched out of the strong hands which had grasped them.

Section IV.—French Successes.

§ 1. State of Italy.

Richelieu had other successes to count besides these victories on the Rhine. In 1637 the Spaniards drove out of Turin the Duchess-Regent Christina, the mother of the young Duke of Savoy. She was a sister of the King of France; and, even if that had not been the case, the enemy of Spain was, in the nature of the case, the friend of France. In 1640 she re-entered her capital with French assistance.

§ 2. Maritime warfare.

At sea, too, where Spain, though unable to hold its own against the Dutch, had long continued to be superior to France, the supremacy of Spain was coming to an end. During the whole course of his ministry, Richelieu had paid special attention to the encouragement of commerce and the formation of a navy. Troops could no longer be despatched with safety to Italy from the coasts of Spain. In 1638 a French squadron burnt Spanish galleys in the Bay of Biscay.

§ 3. The Spanish fleet in the Downs.

In 1639 a great Spanish fleet on its way to the Netherlands was strong enough to escape the French, who were watching to intercept it. It sailed up the English Channel with the not distant goal of the Flemish ports almost in view. But the huge galleons were ill-manned and ill-found. They were still less able to resist the lighter, well-equipped vessels of the Dutch fleet, which was waiting to
intercept them, than the Armada had been able to resist Drake and Raleigh fifty-one years before. The Spanish commander sought refuge in the Downs, under the protection of the neutral flag of England. The French ambassador pleaded hard with the king of England to allow the Dutch to follow up their success. The Spanish ambassador pleaded hard with him for protection to those who had taken refuge on his shores. Charles saw in the occurrence an opportunity to make a bargain with one side or the other. He offered to abandon the Spaniards if the French would agree to restore his nephew, Charles Lewis, the son of his sister Elizabeth, to his inheritance in the Palatinate. He offered to protect the Spaniards if Spain would pay him the large sum which he would want for the armaments needed to bid defiance to France.

§ 4. Destruction of the fleet.

Richelieu had no intention of completing the bargain offered to him. He deluded Charles with negotiations, whilst the Dutch admiral treated the English neutrality with scorn. He dashed amongst the tall Spanish ships as they lay anchored in the Downs: some he sank, some he set on fire. Eleven of the galleons were soon destroyed. The remainder took advantage of a thick fog, slipped across the Straits, and placed themselves in safety under the guns of Dunkirk. Never again did such a fleet as this venture to leave the Spanish coast for the harbours of Flanders. The injury to Spain went far beyond the actual loss. Coming, as the blow did, within a few months after the surrender of Breisach, it all but severed the connexion for military purposes between Brussels and Madrid.

§ 5. France and England.

Charles at first took no umbrage at the insult. He still hoped that Richelieu would forward his nephew's interests, and he even expected that Charles Lewis would be placed by the King of France in command of the army which had been under Bernhard's orders.
But Richelieu was in no mood to place a German at the head of these well-trained veterans, and the proposal was definitively rejected. The King of England, dissatisfied at this repulse, inclined once more to the side
of Spain. But Richelieu found a way to prevent Spain from securing even what assistance it was in the power of a king so unpopular as Charles to render. It was easy to enter into communication with Charles's domestic enemies. His troubles, indeed, were mostly of his own making, and he would doubtless have lost his throne whether Richelieu had stirred the fire or not. But the French minister contributed all that was in his power to make the confusion greater, and encouraged, as far as possible, the resistance which had already broken out in Scotland, and which was threatening to break out in England.

§ 6. Insurrection in Catalonia. § 7. Break-up of the Spanish monarchy.

The failure of 1636 had been fully redeemed. No longer attacking any one of the masses of which the Spanish monarchy was composed, Richelieu placed his hands upon the lines of communication between them. He made his presence felt not at Madrid, at Brussels, at Milan, or at Naples, but in Alsace, in the Mediterranean, in the English Channel. The effect was as complete as is the effect of snapping the wire of a telegraph. At once the Peninsula startled Europe by showing signs of dissolution. In 1639 the Catalonians had manfully defended Roussillon against a French invasion. In 1640 they were prepared to fight with equal vigour. But the Spanish Government, in its desperate straits, was not content to leave them to combat in their own way, after the irregular fashion which befitted mountaineers. Orders were issued commanding all men capable of fighting to arm themselves for the war, all women to bear food and supplies for the army on their backs. A royal edict followed, threatening those who showed themselves remiss with imprisonment and the confiscation of their goods. The cord which bound the hearts of Spaniards to their king was a strong one; but it snapped at last. It was not by threats that Richelieu had defended France in 1636.
The old traditions of provincial independence were strong in Catalonia, and the Catalans were soon in full revolt. Who were they, to be driven to the combat by
menaces, as the Persian slaves had been driven on at Thermopylæ by the blows of their masters' officers?

§ 8. Independence of Portugal.

Equally alarming was the news which reached Madrid from the other side of the Peninsula. Ever since the days of Philip II. Portugal had formed an integral part of the Spanish monarchy. In December 1640 Portugal renounced its allegiance, and reappeared amongst European States under a sovereign of the House of Braganza.

§ 9. Failure of Soissons in France.

Everything prospered in Richelieu's hands. In 1641 a fresh attempt was made by the partizans of Spain to raise France against him. The Count of Soissons, a prince of the blood, placed himself at the head of an imperialist army to attack his native country. He succeeded in defeating the French forces sent to oppose him not far from Sedan. But a chance shot passing through the brain of Soissons made the victory a barren one. His troops, without the support of his name, could not hope to rouse the country against Richelieu. They had become mere invaders, and they were far too few to think of conquering France.

§ 10. Richelieu's last days.

Equal success attended the French arms in Germany. In 1641 Guebriant, with his German and Swedish army, defeated the imperialists at Wolfenbüttel, in the north. In 1642 he defeated them again at Kempten, in the south. In the same year Roussillon submitted to France. Nor was Richelieu less fortunate at home. The conspiracy of a young courtier, the last of the efforts of the aristocracy to shake off the heavy rule of the Cardinal, was detected, and expiated on the scaffold. Richelieu did not long survive his latest triumph. He died on December 4, 1642.

Section V.—Aims and Character of Richelieu.