Data Streams Models And Algorithms Charu C Aggarwal Ed
ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907
Other books in the Series:
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V.
Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi
Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang
and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB
APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni
Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and
Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid;
ISBN: 1-4020-7067-5
INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and
Marcela Genero; ISBN: 0-7923-7599-8
DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee;
ISBN: 0-7923-7215-8
THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the
Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND
BROWSING, Shu-Ching Chen, R. L. Kashyap, and Arif Ghafoor;
ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA:
A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS,
Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet
Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dalibor Vrsalovic;
ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis,
Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil
Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
For a complete listing of books in this series, go to http://www.springer.com
Data Streams
Models and Algorithms
edited by
Charu C. Aggarwal
IBM, T.J. Watson Research Center
Yorktown Heights, NY, USA
Springer
Charu C. Aggarwal
IBM
Thomas J. Watson Research Center
19 Skyline Drive
Hawthorne NY 10532
Library of Congress Control Number: 2006934111
DATA STREAMS: Models and Algorithms edited by Charu C. Aggarwal
ISBN-10: 0-387-28759-0
ISBN-13: 978-0-387-28759-1
e-ISBN-10: 0-387-47534-6
e-ISBN-13: 978-0-387-47534-9
Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch
utilizing NRL's GIDB® Portal System that can be utilized at
http://dmap.nrlssc.navy.mil
Printed on acid-free paper.
© 2007 Springer Science+Business Media, LLC.
All rights reserved. This work may not be translated or copied in whole or
in part without the written permission of the publisher (Springer
Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and
retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and
similar terms, even if they are not identified as such, is not to be taken as
an expression of opinion as to whether or not they are subject to
proprietary rights.
Contents
List of Figures
List of Tables
Preface
1
An Introduction to Data Streams
Charu C. Aggarwal
1. Introduction
2. Stream Mining Algorithms
3. Conclusions and Summary
References
2
On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
1. Introduction
2. The Micro-clustering Based Stream Mining Framework
3. Clustering Evolving Data Streams: A Micro-clustering Approach
3.1 Micro-clustering Challenges
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
3.3 High Dimensional Projected Stream Clustering
4. Classification of Data Streams: A Micro-clustering Approach
4.1 On-Demand Stream Classification
5. Other Applications of Micro-clustering and Research Directions
6. Performance Study and Experimental Results
7. Discussion
References
3
A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
1. Introduction
2. Research Issues
3. Solution Approaches
4. Classification Techniques
4.1 Ensemble Based Classification
4.2 Very Fast Decision Trees (VFDT)
4.3 On Demand Classification
4.4 Online Information Network (OLIN)
4.5 LWClass Algorithm
4.6 ANNCAD Algorithm
4.7 SCALLOP Algorithm
5. Summary
References
4
Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
1. Introduction
2. Overview
3. New Algorithm
4. Work on Other Related Problems
5. Conclusions and Future Directions
References
5
A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
1. Introduction
2. The Velocity Density Method
2.1 Spatial Velocity Profiles
2.2 Evolution Computations in High Dimensional Case
2.3 On the use of clustering for characterizing stream evolution
3. On the Effect of Evolution in Data Mining Algorithms
4. Conclusions
References
6
Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and
Jianyong Wang
1. Introduction
2. Problem Definition
3. Architecture for On-line Analysis of Data Streams
3.1 Tilted time frame
3.2 Critical layers
3.3 Partial materialization of stream cube
4. Stream Data Cube Computation
4.1 Algorithms for cube computation
5. Performance Study
6. Related Work
7. Possible Extensions
8. Conclusions
References
7
Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
1. Load Shedding for Aggregation Queries
1.1 Problem Formulation
1.2 Load Shedding Algorithm
1.3 Extensions
2. Load Shedding in Aurora
3. Load Shedding for Sliding Window Joins
4. Load Shedding for Classification Queries
5. Summary
References
8
The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
0.1 Motivation and Road Map
1. A Solution to the BASICCOUNTING Problem
1.1 The Approximation Scheme
2. Space Lower Bound for BASICCOUNTING Problem
3. Beyond 0's and 1's
4. References and Related Work
5. Conclusion
References
9
A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
1. Introduction
2. Sampling Methods
2.1 Random Sampling with a Reservoir
2.2 Concise Sampling
3. Wavelets
3.1 Recent Research on Wavelet Decomposition in Data Streams
4. Sketches
4.1 Fixed Window Sketches for Massive Time Series
4.2 Variable Window Sketches of Massive Time Series
4.3 Sketches and their applications in Data Streams
4.4 Sketches with p-stable distributions
4.5 The Count-Min Sketch
4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
4.7 Advantages and Limitations of Sketch Based Methods
5. Histograms
5.1 One Pass Construction of Equi-depth Histograms
5.2 Constructing V-Optimal Histograms
5.3 Wavelet Based Histograms for Query Answering
5.4 Sketch Based Methods for Multi-dimensional Histograms
6. Discussion and Challenges
References
10
A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
1. Introduction
2. Model and Semantics
3. State Management for Stream Joins
3.1 Exploiting Constraints
3.2 Exploiting Statistical Properties
4. Fundamental Algorithms for Stream Join Processing
5. Optimizing Stream Joins
6. Conclusion
Acknowledgments
References
11
Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
1. Introduction
2. Indexing Streams
2.1 Preliminaries and definitions
2.2 Feature extraction
2.3 Index maintenance
2.4 Discrete Wavelet Transform
3. Querying Streams
3.1 Monitoring an aggregate query
3.2 Monitoring a pattern query
3.3 Monitoring a correlation query
4. Related Work
5. Future Directions
5.1 Distributed monitoring systems
5.2 Probabilistic modeling of sensor networks
5.3 Content distribution networks
6. Chapter Summary
References
12
Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
1. Related work
2. Principal component analysis (PCA)
3. Auto-regressive models and recursive least squares
4. MUSCLES
5. Tracking correlations and hidden variables: SPIRIT
6. Putting SPIRIT to work
7. Experimental case studies
8. Performance and accuracy
9. Conclusion
Acknowledgments
References
13
A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
1. Introduction
2. Outlier and Anomaly Detection
3. Clustering
4. Frequent itemset mining
5. Classification
6. Summarization
7. Mining Distributed Data Streams in Resource Constrained Environments
8. Systems Support
References
14
Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran
Wolff and Rong Chen
1. Introduction
2. Motivation: Why Distributed Data Stream Mining?
3. Existing Distributed Data Stream Mining Algorithms
4. A local algorithm for distributed data stream mining
4.1 Local Algorithms: definition
4.2 Algorithm details
4.3 Experimental results
4.4 Modifications and extensions
5. Bayesian Network Learning from Distributed Data Streams
5.1 Distributed Bayesian Network Learning Algorithm
5.2 Selection of samples for transmission to global site
5.3 Online Distributed Bayesian Network Learning
5.4 Experimental Results
6. Conclusion
References
15
A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
1. Challenges
2. The Data Collection Model
3. Data Communication
4. Query Processing
4.1 Aggregate Queries
4.2 Join Queries
4.3 Top-k Monitoring
4.4 Continuous Queries
5. Compression and Modeling
5.1 Data Distribution Modeling
5.2 Outlier Detection
6. Application: Tracking of Objects using Sensor Networks
7. Summary
References
Index
List of Figures
Micro-clustering Examples
Some Simple Time Windows
Varying Horizons for the classification process
Quality comparison (Network Intrusion dataset, horizon=256, stream_speed=200)
Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200)
Accuracy comparison (Network Intrusion dataset, stream_speed=80,
buffer_size=1600, kfit=80, init_number=400)
Distribution of the (smallest) best horizon (Network Intrusion dataset,
Time units=2500, buffer_size=1600, kfit=80, init_number=400)
Accuracy comparison (Synthetic dataset B300kC5D20, stream_speed=100,
buffer_size=500, kfit=25, init_number=400)
Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20,
Time units=2000, buffer_size=500, kfit=25, init_number=400)
Stream Proc. Rate (Charit. Donation data, stream_speed=2000)
Stream Proc. Rate (Ntwk. Intrusion data, stream_speed=2000)
Scalability with Data Dimensionality (stream_speed=2000)
Scalability with Number of Clusters (stream_speed=2000)
The ensemble based classification method
VFDT Learning Systems
On Demand Classification
Online Information Network System
Algorithm Output Granularity
ANNCAD Framework
SCALLOP Process
Karp et al. Algorithm to Find Frequent Items
Improving Algorithm with An Accuracy Bound
StreamMining-Fixed: Algorithm Assuming Fixed Length Transactions
Subroutines Description
StreamMining-Bounded: Algorithm with a Bound on Accuracy
StreamMining: Final Algorithm
The Forward Time Slice Density Estimate
The Reverse Time Slice Density Estimate
The Temporal Velocity Profile
The Spatial Velocity Profile
A tilted time frame with natural time partition
A tilted time frame with logarithmic time partition
A tilted time frame with progressive logarithmic time partition
Two critical layers in the stream cube
Cube structure from the m-layer to the o-layer
H-tree structure for cube computation
Cube computation: time and memory usage vs. # tuples at the m-layer for the
data set D5L3C10
Cube computation: time and space vs. # of dimensions for the data set
L3C10T100K
Cube computation: time and space vs. # of levels for the data set D5C10T50K
Data Flow Diagram
Illustration of Example 7.1
Illustration of Observation 1.4
Procedure SetSamplingRate(x, R_x)
Sliding window model notation
An illustration of an Exponential Histogram (EH).
Illustration of the Wavelet Decomposition
The Error Tree from the Wavelet Decomposition
Drifting normal distributions.
Example ECBs.
ECBs for sliding-window joins under the frequency-based model.
ECBs under the age-based model.
The system architecture for a multi-resolution index structure consisting of
3 levels and stream-specific auto-regressive (AR) models for capturing
multi-resolution trends in the data.
Exact feature extraction, update rate T = 1.
Incremental feature extraction, update rate T = 1.
Approximate feature extraction, update rate T = 1.
Incremental feature extraction, update rate T = 2.
Transforming an MBR using discrete wavelet transform. Transformation
corresponds to rotating the axes (the rotation angle = 45° for Haar wavelets)
Aggregate query decomposition and approximation composition for a query
window of size w = 26.
Subsequence query decomposition for a query window of size |Q| = 9.
Illustration of problem.
Illustration of updating w_1 when a new point x_{t+1} arrives.
Chlorine dataset.
Mote dataset.
Critter dataset
Detail of forecasts on Critter with blanked values.
River data.
Wall-clock times (including time to update forecasting models).
Hidden variable tracking accuracy.
Centralized Stream Processing Architecture (left) Distributed Stream
Processing Architecture (right)
(A) the area inside an ε circle. (B) Seven evenly spaced vectors u_1 ... u_7.
(C) The borders of the seven halfspaces û_i · x ≥ ε define a polygon in which
the circle is circumscribed. (D) The area between the circle and the union of
half-spaces.
Quality of the algorithm with increasing number of nodes
Cost of the algorithm with increasing number of nodes
ASIA Model
Bayesian network for online distributed parameter learning
Simulation results for online Bayesian learning: (left) KL distance between
the conditional probabilities for the networks Bol(k) and Bb, for three nodes
(right) KL distance between the conditional probabilities for the networks
Bol(k) and Bb, for three nodes
An instance of dynamic cluster assignment in sensor system according to LEACH
protocol. Sensor nodes of the same clusters are shown with same symbol and the
cluster heads are marked with highlighted symbols.
Interest Propagation, gradient setup and path reinforcement for data
propagation in directed-diffusion paradigm. Event is described in terms of
attribute value pairs. The figure illustrates an event detected based on the
location of the node and target detection.
Sensors aggregating the result for a MAX query in-network
Error filter assignments in tree topology. The nodes that are shown shaded
are the passive nodes that take part only in routing the measurements. A
sensor communicates a measurement only if it lies outside the interval of
values specified by E_i, i.e., maximum permitted error at the node. A sensor
that receives partial results from its children aggregates the results and
communicates them to its parent after checking against the error interval
Usage of duplicate-sensitive sketches to allow result propagation to multiple
parents providing fault tolerance. The system is divided into levels during
the query propagation phase. Partial results from a higher level (level 2 in
the figure) is received at more than one node in the lower level (Level 1 in
the figure)
(a) Two dimensional Gaussian model of the measurements from sensors S1 and S2
(b) The marginal distribution of the values of sensor S1, given S2: New
observations from one sensor is used to estimate the posterior density of the
other sensors
Estimation of probability distribution of the measurements over sliding window
Trade-offs in modeling sensor data
Tracking a target. The leader nodes estimate the probability of the target's
direction and determines the next monitoring region that the target is going
to traverse. The leaders of the cells within the next monitoring region are
alerted
List of Tables
An example of snapshots stored for α = 2 and l = 2
A geometric time window
Data Based Techniques
Task Based Techniques
Typical LWClass Training Results
Summary of Reviewed Techniques
Algorithms for Frequent Itemsets Mining over Data Streams
Summary of results for the sliding-window model.
An Example of Wavelet Coefficient Computation
Description of notation.
Description of datasets.
Reconstruction accuracy (mean squared error rate).
Preface
In recent years, the progress in hardware technology has made it possible
for organizations to store and record large streams of transactional data. Such
data sets, which grow continuously and rapidly over time, are referred to as data
streams. In addition, the development of sensor technology has resulted in
the possibility of monitoring many events in real time. While data mining has
become a fairly well established field now, the data stream problem poses a
number of unique challenges which are not easily solved by traditional data
mining methods.
The topic of data streams is a very recent one. The first research papers on
this topic appeared slightly under a decade ago, and since then this field has
grown rapidly. There is a large volume of literature which has been published
in this field over the past few years. The work is also of great interest to
practitioners in the field who have to mine actionable insights from large volumes
of continuously growing data. Because of the large volume of literature in the
field, practitioners and researchers may often find it an arduous task to isolate
the right literature for a given topic. In addition, from a practitioner's point of
view, the use of research literature is even more difficult, since much of the
relevant material is buried in publications. While handling a real problem, it
may often be difficult to know where to look in order to solve the problem.
This book contains contributed chapters from a variety of well known
researchers in the data mining field. While the chapters are written by
different researchers, the topics and content are organized so as
to present the most important models, algorithms, and applications in the data
mining field in a structured and concise way. In addition, the book is organized
in order to make it more accessible to application driven practitioners. Given
the lack of structurally organized information on the topic, the book will
provide insights which are not easily accessible otherwise. In addition, the book
will be a great help to researchers and graduate students interested in the topic.
The popularity and current nature of the topic of data streams is likely to make
it an important source of information for researchers interested in the topic.
The data mining community has grown rapidly over the past few years, and the
topic of data streams is one of the most relevant and current areas of interest to
the community. This is because of the rapid advancement of the field of data
streams in the past two to three years. While the data stream field clearly falls
in the emerging category because of its recency, it is now beginning to reach a
maturation and popularity point, where the development of an overview book
on the topic becomes both possible and necessary. While this book attempts to
provide an overview of the stream mining area, it also tries to discuss current
topics of interest so as to be useful to students and researchers. It is hoped that
this book will provide a reference to students, researchers and practitioners in
both introducing the topic of data streams and understanding the practical and
algorithmic aspects of the area.
Chapter 1
AN INTRODUCTION TO DATA STREAMS
Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Abstract
In recent years, advances in hardware technology have facilitated new ways of
collecting data continuously. In many applications such as network monitoring,
the volume of such data is so large that it may be impossible to store the data
on disk. Furthermore, even when the data can be stored, the volume of the
incoming data may be so large that it may be impossible to process any particular
record more than once. Therefore, many data mining and database operations
such as classification, clustering, frequent pattern mining and indexing become
significantly more challenging in this context.
In many cases, the data patterns may evolve continuously, as a result of which
it is necessary to design the mining algorithms effectively in order to account for
changes in the underlying structure of the data stream. This makes the solutions of
the underlying problems even more difficult from an algorithmic and computational
point of view. This book contains a number of chapters which are carefully chosen
in order to discuss the broad research issues in data streams. The purpose of this
chapter is to provide an overview of the organization of the stream processing
and mining techniques which are covered in this book.
1. Introduction
In recent years, advances in hardware technology have facilitated the ability
to collect data continuously. Simple transactions of everyday life such as using
a credit card, a phone or browsing the web lead to automated data storage.
Similarly, advances in information technology have led to large flows of data
across IP networks. In many cases, these large volumes of data can be mined for
interesting and relevant information in a wide variety of applications. When the
volume of the underlying data is very large, it leads to a number of computational
and mining challenges:
With increasing volume of the data, it is no longer possible to process the
data efficiently by using multiple passes. Rather, one can process a data
item at most once. This leads to constraints on the implementation of the
underlying algorithms. Therefore, stream mining algorithms typically
need to be designed so that the algorithms work with one pass of the
data.
In most cases, there is an inherent temporal component to the stream
mining process. This is because the data may evolve over time. This
behavior of data streams is referred to as temporal locality. Therefore,
a straightforward adaptation of one-pass mining algorithms may not be
an effective solution to the task. Stream mining algorithms need to be
carefully designed with a clear focus on the evolution of the underlying
data.
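To make the one-pass constraint concrete, the following minimal sketch (our own illustration, not an algorithm from this book) maintains the mean and variance of a stream while reading each item exactly once and keeping only constant state. It uses the well known Welford update; the class name RunningStats is hypothetical. Note that it does not address temporal locality: all items are weighted equally, however old they are.

```python
class RunningStats:
    """Mean and variance of a stream, computed in one pass with O(1) memory."""

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: each item is read once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of the items seen so far.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

A summary of this kind can be queried at any time without revisiting earlier items, which is the property that one-pass stream algorithms rely on.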
Another important characteristic of data streams is that they are often mined in
a distributed fashion. Furthermore, the individual processors may have limited
processing and memory. Examples of such cases include sensor networks, in
which it may be desirable to perform in-network processing of data streams with
limited processing and memory [8, 19]. This book will also contain a number
of chapters devoted to these topics.
This chapter will provide an overview of the different stream mining
algorithms covered in this book. We will discuss the challenges associated with each
kind of problem, and give an overview of the material in the corresponding
chapter.
2. Stream Mining Algorithms
In this section, we will discuss the key stream mining problems and the
challenges associated with each problem. We will also give an
overview of the material covered in each chapter of this book. The broad topics
covered in this book are as follows:
Data Stream Clustering. Clustering is a widely studied problem in the
data mining literature. However, it is more difficult to adapt arbitrary
clustering algorithms to data streams because of one-pass constraints on the data
set. An interesting adaptation of the k-means algorithm has been discussed
in [14] which uses a partitioning based approach on the entire data set. This
approach uses an adaptation of a k-means technique in order to create clusters
over the entire data stream. In the context of data streams, it may be more
desirable to determine clusters in specific user defined horizons rather than on
the entire data set. In Chapter 2, we discuss the micro-clustering technique [3]
which determines clusters over the entire data set. We also discuss a variety
of applications of micro-clustering which can perform effective summarization
based analysis of the data set. For example, micro-clustering can be extended
to the problem of classification on data streams [5]. In many cases, it can also
be used for arbitrary data mining applications such as privacy preserving data
mining or query estimation.
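As a rough illustration of why summarization suits the one-pass setting, the sketch below keeps an additive summary of a group of points: a count, a per-dimension linear sum, and a per-dimension sum of squares. This is a simplification written for this overview, not the CluStream algorithm of Chapter 2; in particular, the micro-clusters described there also carry timestamp statistics, which are omitted here. The class name MicroCluster and its methods are hypothetical.

```python
import math


class MicroCluster:
    """Additive summary of a group of d-dimensional points (simplified)."""

    def __init__(self, dim):
        self.n = 0                # number of points absorbed
        self.ls = [0.0] * dim     # per-dimension linear sum
        self.ss = [0.0] * dim     # per-dimension sum of squares

    def add(self, point):
        # Absorb one point; the point itself need not be stored.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        # Additivity: the summary of a union is the sum of the summaries.
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_radius(self):
        # Root-mean-square deviation of the points from the centroid.
        var = sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                  for i in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


a = MicroCluster(2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    a.add(p)
b = MicroCluster(2)
b.add((4.0, 3.0))
a.merge(b)
```

Because the three components simply add, summaries can be updated per point and combined later, which is what makes them usable for downstream tasks such as classification.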
Data Stream Classification. The problem of classification is perhaps one
of the most widely studied in the context of data stream mining. The problem
of classification is made more difficult by the evolution of the underlying data
stream. Therefore, effective algorithms need to be designed in order to take
temporal locality into account. In Chapter 3, we discuss a survey of
classification algorithms for data streams. A wide variety of data stream classification
algorithms are covered in this chapter. Some of these algorithms are designed to
be purely one-pass adaptations of conventional classification algorithms [12],
whereas others (such as the methods in [5, 16]) are more effective in
accounting for the evolution of the underlying data stream. Chapter 3 discusses the
different kinds of algorithms and the relative advantages of each.
Frequent Pattern Mining. The problem of frequent pattern mining was
first introduced in [6], and was extensively analyzed for the conventional case
of disk resident data sets. In the case of data streams, one may wish to find the
frequent itemsets either over a sliding window or the entire data stream [15, 17].
In Chapter 4, we discuss an overview of the different frequent pattern mining
algorithms, and also provide a detailed discussion of some interesting recent
algorithms on the topic.
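For intuition, here is a small counter-based sketch in the style of the frequent-items algorithms attributed to Misra and Gries and to Karp et al. It is an assumption-laden simplification: it handles single items rather than itemsets, and the function name frequent_item_candidates is ours. With k - 1 counters and one pass, every item occurring more than n/k times in a stream of length n is guaranteed to survive; items that survive are only candidates, and a second check is needed to confirm them.

```python
def frequent_item_candidates(stream, k):
    """Return a superset of the items occurring more than n/k times.

    Uses at most k - 1 counters and a single pass over the stream.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
candidates = frequent_item_candidates(stream, k=3)
```

In this example "a" occurs 5 times out of 9, which exceeds 9/3, so it must appear among the candidates.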
Change Detection in Data Streams. As discussed earlier, the patterns
in a data stream may evolve over time. In many cases, it is desirable to track
and analyze the nature of these changes over time. In [1, 11, 18], a number of
methods have been discussed for change detection in data streams. In addition,
data stream evolution can also affect the behavior of the underlying data mining
algorithms since the results can become stale over time. Therefore, in Chapter
5, we have discussed the different methods for change detection in data streams.
We have also discussed the effect of evolution on data stream mining algorithms.
Stream Cube Analysis of Multi-dimensional Streams. Much of stream
data resides in a multi-dimensional space and at a rather low level of abstraction,
whereas most analysts are interested in relatively high-level dynamic changes in
some combination of dimensions. To discover high-level dynamic and evolving
characteristics, one may need to perform multi-level, multi-dimensional on-line
analytical processing (OLAP) of stream data. Such necessity calls for the
investigation of new architectures that may facilitate on-line analytical processing of
multi-dimensional stream data [7, 10].
Chapter 6 presents an interesting stream-cube architecture that effectively
performs on-line partial aggregation of multi-dimensional stream data, captures
the essential dynamic and evolving characteristics of data streams, and
facilitates fast OLAP on stream data. The stream cube architecture also forms a
preliminary structure for online stream mining. The impact of the design and
implementation of the stream cube in the context of stream mining is also
discussed in the chapter.
Load Shedding in Data Streams. Since data streams are generated by
processes which are extraneous to the stream processing application, it is not
possible to control the incoming stream rate. As a result, it is necessary for the
system to have the ability to quickly adjust to varying incoming stream
processing rates. Chapter 7 discusses one particular type of adaptivity: the ability
to gracefully degrade performance via "load shedding" (dropping unprocessed
tuples to reduce system load) when the demands placed on the system
cannot be met in full given available resources. Focusing on aggregation queries,
the chapter presents algorithms that determine at which points in a query plan
load shedding should be performed and what amount of load should be shed at
each point in order to minimize the degree of inaccuracy introduced into query
answers.
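The basic trade-off can be illustrated with a deliberately simple sketch, written for this overview under our own assumptions: a single SUM query, one shedding point, and a fixed sampling rate. It does not implement the placement algorithms of Chapter 7. Each tuple is kept with probability sampling_rate and dropped otherwise; kept values are scaled by the inverse of the rate so that the estimate remains unbiased. The function name shed_and_sum is hypothetical.

```python
import random


def shed_and_sum(stream, sampling_rate, rng):
    """Estimate the sum of a stream while processing only a sample of it.

    Each tuple is kept with probability `sampling_rate` and dropped otherwise.
    Kept values are scaled by 1 / sampling_rate so the estimate is unbiased.
    """
    total = 0.0
    kept = 0
    for value in stream:
        if rng.random() < sampling_rate:
            total += value / sampling_rate
            kept += 1
    return total, kept


rng = random.Random(0)               # fixed seed, for reproducibility only
stream = [1.0] * 10_000              # the true sum is 10000
estimate, kept = shed_and_sum(stream, sampling_rate=0.25, rng=rng)
```

Lowering the sampling rate reduces the processing load roughly in proportion, at the price of a larger variance in the answer; choosing where and how much to shed is the optimization problem the chapter addresses.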
Sliding Window Computations in Data Streams. Many of the synopsis
structures discussed use the entire data stream in order to construct the
corresponding synopsis structure. The sliding-window model of computation is
motivated by the assumption that it is more important to use recent data in data
stream computation [9]. Therefore, the processing and analysis is only done on
a fixed history of the data stream. Chapter 8 formalizes this model of
computation and answers questions about how much space and computation time is
required to solve certain problems under the sliding-window model.
Synopsis Construction in Data Streams. The large volume of data streams
poses unique space and time constraints on the computation process. Many
query processing, database operations, and mining algorithms require efficient
execution which can be difficult to achieve with a fast data stream. In many
cases, it may be acceptable to generate approximate solutions for such
problems. In recent years a number of synopsis structures have been developed,
which can be used in conjunction with a variety of mining and query
processing techniques [13]. Some key synopsis methods include those of sampling,
wavelets, sketches and histograms. Chapter 9 surveys the key synopsis
techniques and the mining techniques supported by such methods.
The chapter discusses the challenges and tradeoffs associated with using
different kinds of techniques, and the important research directions for synopsis
construction.
Join Processing in Data Streams. Stream join is a fundamental operation
for relating information from different streams. This is especially useful in
many applications such as sensor networks in which the streams arriving from
different sources may need to be related with one another. In the stream setting,
input tuples arrive continuously, and result tuples need to be produced
continuously as well. We cannot assume that the input data is already stored or indexed,
or that the input rate can be controlled by the query plan. Standard join
algorithms that use blocking operations, e.g., sorting, no longer work. Conventional
methods for cost estimation and query optimization are also inappropriate,
because they assume finite input. Moreover, the long-running nature of stream
queries calls for more adaptive processing strategies that can react to changes
and fluctuations in data and stream characteristics. The "stateful" nature of
stream joins adds another dimension to the challenge. In general, in order to
compute the complete result of a stream join, we need to retain all past arrivals
as part of the processing state, because a new tuple may join with an arbitrarily
old tuple that arrived in the past. This problem is exacerbated by unbounded input
streams, limited processing resources, and high performance requirements, as
it is impossible in the long run to keep all past history in fast memory.
Chapter 10 provides an overview of research problems, recent advances, and future
research directions in stream join processing.
Indexing Data Streams. The problem of indexing data streams is
to create an indexed representation, so that it is possible to efficiently answer
different kinds of queries such as aggregation queries or trend based queries.
This is especially important in the data stream case because of the huge
volume of the underlying data. Chapter 11 explores the problem of indexing and
querying data streams.
Dimensionality Reduction and Forecasting in Data Streams. Because
of the inherent temporal nature of data streams, the problems of
dimensionality reduction and forecasting are particularly important. When there are a
large number of simultaneous data streams, we can use the correlations between
different data streams in order to make effective predictions [20, 21] on the
future behavior of the data stream. Chapter 12 presents an overview of
dimensionality reduction and forecasting methods for data streams. In particular,
the well known MUSCLES method [21] is discussed, and its application to data
streams is explored. In addition,
the chapter presents the SPIRIT algorithm, which explores the relationship
between dimensionality reduction and forecasting in data streams. In particular,
the chapter explores the use of a compact number of hidden variables to
comprehensively describe the data stream. This compact representation can also be
used for effective forecasting of the data streams.
Distributed Mining of Data Streams. In many instances, streams are
generated at multiple distributed computing nodes. Analyzing and monitoring
data in such environments requires data mining technology that optimizes
a variety of criteria such as communication costs across different
nodes, as well as computational, memory or storage requirements at each node.
A comprehensive survey of the adaptation of different conventional mining
algorithms to the distributed case is provided in Chapter 13. In particular, the
clustering, classification, outlier detection, frequent pattern mining, and
summarization problems are discussed. In Chapter 14, some recent advances in
stream mining algorithms are discussed.
Stream Mining in Sensor Networks. With recent advances in hardware
technology, it has become possible to track large amounts of data in a distributed
fashion with the use of sensor technology. The large amounts of data collected
by the sensor nodes make the problem of monitoring a challenging one from
many technological standpoints. Sensor nodes have limited local storage,
computational power, and battery life, as a result of which it is desirable to
minimize the storage, processing and communication from these nodes. The
problem is further magnified by the fact that a given network may have millions
of sensor nodes and therefore it is very expensive to localize all the data at a given
global node for analysis both from a storage and communication point of view.
In Chapter 15, we discuss an overview of a number of stream mining issues
in the context of sensor networks. This topic is closely related to distributed
stream mining, and a number of concepts related to sensor mining have also
been discussed in Chapters 13 and 14.
3. Conclusions and Summary
Data streams are a computational challenge to data mining problems because
of the additional algorithmic constraints created by the large volume of data. In
addition, the problem of temporal locality leads to a number of unique mining
challenges in the data stream case. This chapter provides an overview of the
different mining algorithms which are covered in this book. We discussed the
different problems and the challenges which are associated with each problem.
We also provided an overview of the material in each chapter of the book.
References
[1] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving
Data Streams. ACM SIGMOD Conference.
[2] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in
Evolving Data Streams. IEEE ICDE Conference.
[3] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering
Evolving Data Streams. VLDB Conference.
[4] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High
Dimensional Projected Clustering of Data Streams. VLDB Conference.
[5] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of
Data Streams. ACM KDD Conference.
[6] Agrawal R., Imielinski T., Swami A. (1993). Mining Association Rules
between Sets of Items in Large Databases. ACM SIGMOD Conference.
[7] Chen Y., Dong G., Han J., Wah B. W., Wang J. (2002). Multi-dimensional
regression analysis of time-series data streams. VLDB Conference.
[8] Cormode G., Garofalakis M. (2005). Sketching Streams Through the Net:
Distributed Approximate Query Tracking. VLDB Conference.
[9] Datar M., Gionis A., Indyk P., Motwani R. (2002). Maintaining stream
statistics over sliding windows. SIAM Journal on Computing, 31(6):1794-1813.
[10] Dong G., Han J., Lam J., Pei J., Wang K. (2001). Mining multi-dimensional
constrained gradients in data cubes. VLDB Conference.
[11] Dasu T., Krishnan S., Venkatasubramanian S., Yi K. (2005).
An Information-Theoretic Approach to Detecting Changes in
Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.
[12] Domingos P., Hulten G. (2000). Mining High-Speed Data Streams. In
Proceedings of the ACM KDD Conference.
[13] Garofalakis M., Gehrke J., Rastogi R. (2002). Querying and mining data
streams: you only get one look (a tutorial). SIGMOD Conference.
[14] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data
Streams. IEEE FOCS Conference.
[15] Giannella C., Han J., Pei J., Yan X., Yu P. (2002). Mining Frequent
Patterns in Data Streams at Multiple Time Granularities. Proceedings of
the NSF Workshop on Next Generation Data Mining.
[16] Hulten G., Spencer L., Domingos P. (2001). Mining Time-Changing Data
Streams. ACM KDD Conference.
[17] Jin R., Agrawal G. (2005). An algorithm for in-core frequent itemset
mining on streaming data. ICDM Conference.
[18] Kifer D., David S.-B., Gehrke J. (2004). Detecting Change in Data
Streams. VLDB Conference.
[19] Kollios G., Byers J., Considine J., Hadjieleftheriou M., Li F. (2005).
Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.
[20] Sakurai Y., Papadimitriou S., Faloutsos C. (2005). BRAID: Stream mining
through group lag correlations. ACM SIGMOD Conference.
[21] Yi B.-K., Sidiropoulos N. D., Johnson T., Jagadish H. V., Faloutsos C.,
Biliris A. (2000). Online data mining for co-evolving time sequences. ICDE
Conference.
Chapter 2
ON CLUSTERING MASSIVE DATA STREAMS: A
SUMMARIZATION PARADIGM
Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL
hanj@cs.uiuc.edu
Jianyong Wang
University of Illinois at Urbana-Champaign
Urbana, IL
jianyong@tsinghua.edu.cn
Philip S. Yu
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Abstract
In recent years, data streams have become ubiquitous because of the large
number of applications which generate huge volumes of data in an automated
way. Many existing data mining methods cannot be applied directly on data
streams because the data needs to be mined in one pass.
Furthermore, data streams show a considerable amount of temporal locality, because
of which a direct application of the existing methods may lead to misleading
results. In this paper, we develop an efficient and effective approach for
mining fast evolving data streams, which integrates the micro-clustering technique
with the high-level data mining process, and discovers data evolution regularities
as well. Our analysis and experiments demonstrate that two important data mining
problems, namely stream clustering and stream classification, can be performed
effectively using this approach, with high quality mining results. We discuss
the use of micro-clustering as a general summarization technology to solve data
mining problems on streams. Our discussion illustrates the importance of our
approach for a variety of mining problems in the data stream domain.
1. Introduction
In recent years, advances in hardware technology have allowed us to auto-
matically record transactions and other pieces of information of everyday life
at a rapid rate. Such processes generate huge amounts of online data which
grow at an unlimited rate. These kinds of online data are referred to as data
streams. The issues of management and analysis of data streams have been
researched extensively in recent years because of their emerging, imminent, and
broad applications [11, 14, 17, 23].
Many important problems such as clustering and classification have been
widely studied in the data mining community. However, a majority of such
methods may not be working effectively on data streams. Data streams pose
special challenges to a number of data mining algorithms, not only because
of the huge volume of the online data streams, but also because of the fact
that the data in the streams may show temporal correlations. Such temporal
correlations may help disclose important data evolution characteristics, and they
can also be used to develop efficient and effective mining algorithms. Moreover,
data streams require online mining, in which we wish to mine the data in a
continuous fashion. Furthermore, the system needs to have the capability to
perform an offline analysis as well based on the user interests. This is similar
to an online analytical processing (OLAP) framework which uses the paradigm
of pre-processing once, querying many times.
Based on the above considerations, we propose a new stream mining
framework, which adopts a tilted time window framework, takes micro-clustering
as a preprocessing process, and integrates the preprocessing with the
incremental, dynamic mining process. Micro-clustering preprocessing effectively
compresses the data, preserves the general temporal locality of data, and
facilitates both online and offline analysis, as well as the analysis of current data and
data evolution regularities.
In this study, we primarily concentrate on the application of this technique
to two problems: (1) stream clustering, and (2) stream classification. The heart
of the approach is to use an online summarization approach which is efficient
and also allows for effective processing of the data streams. We also discuss
a number of research directions, in which we show how the approach can be
adapted to a variety of other problems.

Figure 2.1. Micro-clustering Examples

Figure 2.2. Some Simple Time Windows
This paper is organized as follows. In the next section, we will present our
micro-clustering based stream mining framework. In Section 3, we discuss the
stream clustering problem. The classification methods are developed in Section
4. In Section 5, we discuss a number of other problems which can be solved
with the micro-clustering approach, and other possible research directions. In
Section 6, we will discuss some empirical results for the clustering and
classification problems. In Section 7 we discuss the issues related to our proposed
stream mining methodology and compare it with other related work. Section 8
concludes our study.
2. The Micro-clustering Based Stream Mining
Framework
In order to apply our technique to a variety of data mining algorithms, we
utilize a micro-clustering based stream mining framework. This framework is
designed by capturing summary information about the nature of the data stream.
This summary information is defined by the following structures:
Micro-clusters: We maintain statistical information about the data locality
in terms of micro-clusters. These micro-clusters are defined as a temporal
extension of the cluster feature vector [24]. The additivity property of the
micro-clusters makes them a natural choice for the data stream problem.
Pyramidal Time Frame: The micro-clusters are stored at snapshots in
time which follow a pyramidal pattern. This pattern provides an effective
trade-off between the storage requirements and the ability to recall summary statistics
from different time horizons.
The summary information in the micro-clusters is used by an offline
component which is dependent upon a wide variety of user inputs such as the time
horizon or the granularity of clustering. In order to define the micro-clusters,
we will introduce a few concepts. It is assumed that the data stream consists
of a set of multi-dimensional records $\overline{X}_1 \ldots \overline{X}_k \ldots$
arriving at time stamps $T_1 \ldots T_k \ldots$. Each $\overline{X}_i$ is a
multi-dimensional record containing $d$ dimensions
which are denoted by $\overline{X}_i = (x_i^1 \ldots x_i^d)$.
We will first begin by defining the concept of micro-clusters and pyramidal
time frame more precisely.
DEFINITION 2.1 A micro-cluster for a set of $d$-dimensional points
$\overline{X}_{i_1} \ldots \overline{X}_{i_n}$ with time stamps
$T_{i_1} \ldots T_{i_n}$ is the $(2 \cdot d + 3)$ tuple
$(CF2^x, CF1^x, CF2^t, CF1^t, n)$, wherein $CF2^x$ and $CF1^x$ each
correspond to a vector of $d$ entries. The definition of each of these entries
is as follows:
• For each dimension, the sum of the squares of the data values is maintained
in $CF2^x$. Thus, $CF2^x$ contains $d$ values. The $p$-th entry of $CF2^x$ is
equal to $\sum_{j=1}^{n} (x_{i_j}^p)^2$.
• For each dimension, the sum of the data values is maintained in $CF1^x$.
Thus, $CF1^x$ contains $d$ values. The $p$-th entry of $CF1^x$ is equal to
$\sum_{j=1}^{n} x_{i_j}^p$.
• The sum of the squares of the time stamps $T_{i_1} \ldots T_{i_n}$ is
maintained in $CF2^t$.
• The sum of the time stamps $T_{i_1} \ldots T_{i_n}$ is maintained in $CF1^t$.
• The number of data points is maintained in $n$.
We note that the above definition of micro-cluster maintains similar summary
information as the cluster feature vector of [24], except for the additional
information about time stamps. We will refer to this temporal extension of the
cluster feature vector for a set of points $C$ by $CFT(C)$. As in [24], this summary
information can be expressed in an additive way over the different data points.
This makes it a natural choice for use in data stream algorithms.
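
The additivity of the temporal cluster feature vector can be illustrated with a
minimal Python sketch. The class name MicroCluster and the methods add_point,
merge, and centroid are illustrative assumptions and do not come from the
chapter; only the five stored quantities follow Definition 2.1.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MicroCluster:
    """Temporal cluster feature vector (CF2x, CF1x, CF2t, CF1t, n)."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points absorbed

    @classmethod
    def empty(cls, d: int) -> "MicroCluster":
        return cls([0.0] * d, [0.0] * d, 0.0, 0.0, 0)

    def add_point(self, x: Sequence[float], t: float) -> None:
        """Absorb one d-dimensional point x that arrived at time t."""
        for p, value in enumerate(x):
            self.cf2x[p] += value * value
            self.cf1x[p] += value
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        """Additivity: the summary of a disjoint union is the entry-wise sum."""
        return MicroCluster(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.cf1x]


# Toy data (hypothetical): three 2-dimensional points with time stamps 1, 2, 3.
points = [([1.0, 2.0], 1), ([3.0, 4.0], 2), ([5.0, 6.0], 3)]
left, right, whole = MicroCluster.empty(2), MicroCluster.empty(2), MicroCluster.empty(2)
left.add_point(*points[0])
for x, t in points[1:]:
    right.add_point(x, t)
for x, t in points:
    whole.add_point(x, t)
merged = left.merge(right)  # same statistics as summarizing all points at once
```

Each update touches only 2·d + 3 numbers, so in this sketch the cost of
absorbing a point does not depend on how many points the micro-cluster
already summarizes.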
We note that maintaining a large number of micro-clusters is essential
for retaining more detailed information about the micro-clustering
process. For example, Figure 2.1 shows 3 clusters, which are denoted by a, b, c.
At a later stage, evolution forms 3 different clusters a1, a2, bc: a splits into a1
and a2, whereas b and c merge into bc. If we keep micro-clusters (each point
represents a micro-cluster), such evolution can be easily captured. However, if
we keep only 3 cluster centers a, b, c, it is impossible to derive the later a1, a2, bc
clusters since the information about the more detailed points is already lost.
The data stream clustering algorithm discussed in this paper can generate
approximate clusters in any user-specified length of history from the current
instant. This is achieved by storing the micro-clusters at particular moments
in the stream which are referred to as snapshots. At the same time, the current
snapshot of micro-clusters is always maintained by the algorithm. The
macro-clustering algorithm discussed at a later stage in this paper will use these finer
level micro-clusters in order to create higher level clusters which can be more
easily understood by the user. Consider for example, the case when the current
clock time is $t_c$ and the user wishes to find clusters in the stream based on
a history of length $h$. Then, the macro-clustering algorithm discussed in this
paper will use some of the additive properties of the micro-clusters stored at
snapshots $t_c$ and $(t_c - h)$ in order to find the higher level clusters in a history
or time horizon of length $h$. Of course, since it is not possible to store the
snapshots at each and every moment in time, it is important to choose particular
instants of time at which it is possible to store the state of the micro-clusters so
that clusters in any user-specified time horizon $(t_c - h, t_c)$ can be approximated.
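
As a hedged illustration of how additivity supports a time horizon, the sketch
below subtracts the statistics stored at time t_c - h from those stored at
time t_c. It assumes a single micro-cluster and a toy stream; the helper names
cft_of and subtract are hypothetical and are not taken from the chapter.

```python
from typing import List, Tuple

# (CF2x, CF1x, CF2t, CF1t, n) for one micro-cluster, laid out as in Definition 2.1.
CFT = Tuple[List[float], List[float], float, float, int]


def cft_of(points: List[Tuple[List[float], float]], d: int) -> CFT:
    """Build the temporal cluster feature vector of (point, time stamp) pairs."""
    cf2x, cf1x, cf2t, cf1t = [0.0] * d, [0.0] * d, 0.0, 0.0
    for x, t in points:
        for p in range(d):
            cf2x[p] += x[p] ** 2
            cf1x[p] += x[p]
        cf2t += t ** 2
        cf1t += t
    return cf2x, cf1x, cf2t, cf1t, len(points)


def subtract(later: CFT, earlier: CFT) -> CFT:
    """Statistics of the points that arrived after the earlier snapshot was taken."""
    return (
        [a - b for a, b in zip(later[0], earlier[0])],
        [a - b for a, b in zip(later[1], earlier[1])],
        later[2] - earlier[2],
        later[3] - earlier[3],
        later[4] - earlier[4],
    )


# Toy stream (hypothetical): one 2-dimensional point per clock tick, times 1..8.
stream = [([float(t), 2.0 * t], float(t)) for t in range(1, 9)]
t_c, h = 8, 3
snapshot_now = cft_of(stream[:t_c], d=2)       # state stored at time t_c
snapshot_old = cft_of(stream[:t_c - h], d=2)   # state stored at time t_c - h
window = subtract(snapshot_now, snapshot_old)  # points arriving after t_c - h
direct = cft_of(stream[t_c - h:t_c], d=2)      # the same statistics computed directly
```

In this toy setting window and direct coincide, which is why only the two
snapshots, and not the raw points, are needed to summarize the horizon.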
We note that some examples of time frames used for the clustering process
are the natural time frame (Figure 2.2(a) and (b)), and the logarithmic time
frame (Figure 2.2(c)). In the natural time frame the snapshots are stored at
regular intervals. We note that the scale of the natural time frame could be
based on the application requirements. For example, we could choose days,
months or years depending upon the level of granularity required in the analysis.
A more flexible approach is to use the logarithmic time frame in which different
variations of the time interval can be stored. As illustrated in Figure 2.2(c), we
store snapshots at times of $t$, $2 \cdot t$, $4 \cdot t$, .... The danger of this is that we may
jump too far between successive levels of granularity. We need an intermediate
solution which provides a good balance between storage requirements and the
accuracy with which a user-specified horizon can be approximated.
In order to achieve this, we will introduce the concept of a pyramidal time
frame. In this technique, the snapshots are stored at differing levels of
granularity depending upon the recency. Snapshots are classified into different orders
which can vary from 1 to $\log(T)$, where $T$ is the clock time elapsed since the
beginning of the stream. The order of a particular class of snapshots defines
the level of granularity in time at which the snapshots are maintained. The
snapshots of different order are maintained as follows:
• Snapshots of the $i$-th order occur at time intervals of $\alpha^i$, where
$\alpha$ is an integer and $\alpha \ge 1$. Specifically, each snapshot of the
$i$-th order is taken at a moment in time when the clock value from the
beginning of the stream is exactly divisible by $\alpha^i$.
• At any given moment in time, only the last $\alpha + 1$ snapshots of order $i$
are stored.
We note that the above definition allows for considerable redundancy in
storage of snapshots. For example, the clock time of 8 is divisible by $2^0$, $2^1$,
$2^2$, and $2^3$ (where $\alpha = 2$). Therefore, the state of the micro-clusters at a clock
time of 8 simultaneously corresponds to order 0, order 1, order 2 and order
3 snapshots. From an implementation point of view, a snapshot needs to be
maintained only once. We make the following observations:
• For a data stream, the maximum order of any snapshot stored at $T$ time
units since the beginning of the stream mining process is $\log_\alpha(T)$.
• For a data stream, the maximum number of snapshots maintained at $T$ time
units since the beginning of the stream mining process is
$(\alpha + 1) \cdot \log_\alpha(T)$.
• For any user-specified time window of $h$, at least one stored snapshot can
be found within $2 \cdot h$ units of the current time.
While the first two results are quite easy to see, the last one needs to be
proven formally.
LEMMA 2.2 Let $h$ be a user-specified time window, $t_c$ be the current time, and
$t_s$ be the time of the last stored snapshot of any order just before the time $t_c - h$.
Then $t_c - t_s \le 2 \cdot h$.
Proof: Let $r$ be the smallest integer such that $\alpha^r \ge h$. Therefore, we know that
$\alpha^{r-1} < h$. Since we know that there are $\alpha + 1$ snapshots of order $(r-1)$, at least
one snapshot of order $r-1$ must always exist before $t_c - h$. Let $t_s$ be the snapshot
of order $r-1$ which occurs just before $t_c - h$. Then $(t_c - h) - t_s \le \alpha^{r-1}$.
Therefore, we have $t_c - t_s \le h + \alpha^{r-1} < 2 \cdot h$.
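
The bound of Lemma 2.2 can be checked numerically with a small sketch. It
assumes $\alpha = 2$, a stream that starts at clock time 1, and a current time of
1000; the function names pyramidal_snapshots and gap_to_snapshot are
hypothetical and are used only for illustration.

```python
def pyramidal_snapshots(t_c: int, alpha: int = 2, per_order: int = 3) -> set:
    """Clock times stored at t_c: the last per_order multiples of alpha**i for each order i."""
    stored = set()
    order = 0
    while alpha ** order <= t_c:
        step = alpha ** order
        latest = (t_c // step) * step  # most recent time divisible by alpha**order
        for k in range(per_order):
            t = latest - k * step
            if t >= 1:                 # assumption: the stream starts at clock time 1
                stored.add(t)
        order += 1
    return stored


def gap_to_snapshot(t_c: int, h: int, stored: set) -> int:
    """Distance t_c - t_s, where t_s is the last stored snapshot at or before t_c - h."""
    t_s = max(t for t in stored if t <= t_c - h)
    return t_c - t_s


t_c = 1000
stored = pyramidal_snapshots(t_c)  # alpha = 2, so alpha + 1 = 3 snapshots per order
worst_ratio = max(gap_to_snapshot(t_c, h, stored) / h for h in range(1, 257))
```

The horizons are restricted to h ≤ 256 so that every order used in the proof
already holds its full set of $\alpha + 1$ snapshots at time 1000; under that
assumption worst_ratio stays at or below 2.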
Thus, in this case, it is possible to find a snapshot within a factor of 2 of
any user-specified time window. Furthermore, the total number of snapshots
which need to be maintained is relatively modest. For example, for a data
stream running for 100 years with a clock time granularity of 1 second, the
total number of snapshots which need to be maintained is given by
$(2 + 1) \cdot \log_2(100 \cdot 365 \cdot 24 \cdot 60 \cdot 60) \approx 95$.
This is quite a modest requirement given
the fact that a snapshot within a factor of 2 can always be found within any
user-specified time window.
Table 2.1. An example of snapshots stored for $\alpha = 2$ and $l = 2$

Order of Snapshots    Clock Times (Last 5 Snapshots)
0                     55 54 53 52 51
1                     54 52 50 48 46
2                     52 48 44 40 36
3                     48 40 32 24 16
4                     48 32 16
5                     32

It is possible to improve the accuracy of time horizon approximation at a
modest additional cost. In order to achieve this, we save the $\alpha^l + 1$ snapshots
of order $r$ for $l > 1$. In this case, the storage requirement of the technique
corresponds to $(\alpha^l + 1) \cdot \log_\alpha(T)$ snapshots. On the other hand, the accuracy of
time horizon approximation also increases substantially. In this case, any time
horizon can be approximated to a factor of $(1 + 1/\alpha^{l-1})$. We summarize this
result as follows:
LEMMA 2.3 Let $h$ be a user-specified time horizon, $t_c$ be the current time, and
$t_s$ be the time of the last stored snapshot of any order just before the time $t_c - h$.
Then $t_c - t_s \le (1 + 1/\alpha^{l-1}) \cdot h$.
Proof: Similar to the previous case.
For larger values of $l$, the time horizon can be approximated as closely as
desired. For example, by choosing $l = 10$, it is possible to approximate any
time horizon within 0.2%, while a total of only
$(2^{10} + 1) \cdot \log_2(100 \cdot 365 \cdot 24 \cdot 60 \cdot 60) \approx 32343$
snapshots are required for 100 years. Since historical
snapshots can be stored on disk and only the current snapshot needs to be
maintained in main memory, this requirement is quite feasible from a practical
point of view. It is also possible to specify the pyramidal time window in
accordance with user preferences corresponding to particular moments in time
such as beginning of calendar years, months, and days. While the storage
requirements and horizon estimation possibilities of such a scheme are different,
all the algorithmic descriptions of this paper are directly applicable.
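
The storage and accuracy figures quoted above can be reproduced with a few
lines of Python. This is only an arithmetic check of the two formulas; the
function names snapshot_bound and horizon_factor are hypothetical.

```python
import math

seconds = 100 * 365 * 24 * 60 * 60  # 100 years at a clock granularity of 1 second
alpha = 2


def snapshot_bound(l: int) -> float:
    """Worst-case snapshot count (alpha**l + 1) * log_alpha(T)."""
    return (alpha ** l + 1) * math.log(seconds, alpha)


def horizon_factor(l: int) -> float:
    """Approximation factor 1 + 1/alpha**(l - 1) from Lemma 2.3."""
    return 1 + 1 / alpha ** (l - 1)


for l in (1, 2, 10):
    print(l, round(snapshot_bound(l)), horizon_factor(l))
```

With l = 1 the formulas reduce to the basic scheme of Lemma 2.2 (a factor of 2
and about 95 snapshots), while l = 10 gives the figure of about 32343
snapshots and an error of roughly 0.2%.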
In order to clarify the way in which snapshots are stored, let us consider the
case when the stream has been running starting at a clock time of 1, with
$\alpha = 2$ and $l = 2$. Therefore $2^2 + 1 = 5$ snapshots of each order are stored.
Then, at a clock time of 55, snapshots at the clock times illustrated in Table 2.1
are stored.
We note that a large number of snapshots are common among different orders.
From an implementation point of view, the states of the micro-clusters at times
of 16, 24, 32, 36, 40, 44, 46, 48, 50, 51, 52, 53, 54, and 55 are stored. It is easy
to see that for more recent clock times, there is less distance between
successive snapshots (better granularity). We also note that the storage requirements
estimated in this section do not take this redundancy into account. Therefore,
the requirements which have been presented so far are actually worst-case
requirements.
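
The snapshot times of this example can be regenerated with a short sketch. It
assumes the stream starts at clock time 1 and uses the hypothetical function
name snapshot_table; the rule itself (keep the last $\alpha^l + 1$ clock times
divisible by $\alpha^i$ for each order $i$) is the one described in the text.

```python
def snapshot_table(t_c: int, alpha: int = 2, l: int = 2) -> dict:
    """Map each order i to the last alpha**l + 1 clock times <= t_c divisible by alpha**i."""
    keep = alpha ** l + 1
    table = {}
    order = 0
    while alpha ** order <= t_c:
        step = alpha ** order
        latest = (t_c // step) * step
        times = [latest - k * step for k in range(keep)]
        table[order] = [t for t in times if t >= 1]  # the stream starts at clock time 1
        order += 1
    return table


table = snapshot_table(55)
for order, times in table.items():
    print(order, times)

# Snapshots shared between orders need to be kept only once.
distinct = sorted({t for times in table.values() for t in times})
print(len(distinct), "distinct snapshots:", distinct)
```

The table holds 24 entries across the six orders, but only 14 distinct clock
times, which is the redundancy that the worst-case estimates ignore.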
These redundancies can be eliminated by using a systematic rule described
in [6], or by using a more sophisticated geometric time frame. In this technique,
snapshots are classified into different frame numbers which can vary from 0 to a
value no larger than $\log_2(T)$, where $T$ is the maximum length of the stream. The
frame number of a particular class of snapshots defines the level of granularity
in time at which the snapshots are maintained. Specifically, snapshots of frame
number $i$ are stored at clock times which are divisible by $2^i$, but not by $2^{i+1}$.
Therefore, snapshots of frame number 0 are stored only at odd clock times. It
is assumed that for each frame number, at most max_capacity snapshots are
stored.
We note that for a data stream, the maximum frame number of any snapshot
stored at $T$ time units since the beginning of the stream mining process is
$\log_2(T)$. Since at most max_capacity snapshots of any order are stored, this
also means that the maximum number of snapshots maintained at $T$ time units
since the beginning of the stream mining process is (max_capacity) $\cdot \log_2(T)$.
One interesting characteristic of the geometric time window is that for any
user-specified time window of $h$, at least one stored snapshot can be found within
a factor of 2 of the specified horizon. This ensures that sufficient granularity
is available for analyzing the behavior of the data stream over different time
horizons. We will formalize this result in the lemma below.
LEMMA 2.4 Let $h$ be a user-specified time window, and $t_c$ be the current time.
Let us also assume that max_capacity $\ge 2$. Then a snapshot exists at time $t_s$,
such that $h/2 \le t_c - t_s \le 2 \cdot h$.
Proof: Let $r$ be the smallest integer such that $h < 2^{r+1}$. Since $r$ is the smallest
such integer, it also means that $h \ge 2^r$. This means that for any interval
$(t_c - h, t_c)$ of length $h$, at least one integer $t' \in (t_c - h, t_c)$ must exist which
satisfies the property that $t' \bmod 2^{r-1} = 0$ and $t' \bmod 2^r \ne 0$. Let $t'$ be the time
stamp of the last (most current) such snapshot. This also means the following:

$t_c - h < t' < t_c$    (2.1)

Then, if max_capacity is at least 2, the second last snapshot of order $(r-1)$
is also stored and has a time-stamp value of $t' - 2^r$. Let us pick the time
$t_s = t' - 2^r$. By substituting the value of $t_s$, we get:

$t_c - t_s = (t_c - t' + 2^r)$    (2.2)

Since $(t_c - t') \ge 0$ and $2^r > h/2$, it easily follows from Equation 2.2 that
$t_c - t_s > h/2$.
Since $t'$ is the position of the latest snapshot of frame $(r-1)$ occurring before
the current time $t_c$, it follows that $(t_c - t') \le 2^r$. Substituting this inequality in
Equation 2.2, we get $t_c - t_s \le 2^r + 2^r \le h + h = 2 \cdot h$. Thus, we have:

$h/2 \le t_c - t_s \le 2 \cdot h$    (2.3)

The above result ensures that every possible horizon can be closely
approximated within a modest level of accuracy. While the geometric time frame
shares a number of conceptual similarities with the pyramidal time frame [6],
it is actually quite different and also much more efficient. This is because it
eliminates the double counting of the snapshots over different frame numbers,
as is the case with the pyramidal time frame [6]. In Table 2.2, we present
an example of a frame table illustrating snapshots of different frame numbers.
The rules for insertion of a snapshot $t$ (at time $t$) into the snapshot frame table
are defined as follows: (1) if $(t \bmod 2^i) = 0$ but $(t \bmod 2^{i+1}) \ne 0$, $t$ is
inserted into frame number $i$; (2) each slot has a max_capacity (which is 3 in
our example). At the insertion of $t$ into frame number $i$, if the slot has already
reached its max_capacity, the oldest snapshot in this frame is removed and
the new snapshot inserted. For example, at time 70, since $(70 \bmod 2^1) = 0$
but $(70 \bmod 2^2) \ne 0$, 70 is inserted into frame number 1, which knocks out
the oldest snapshot 58 if the slot capacity is 3. Following this rule, when slot
capacity is 3, the following snapshots are stored in the geometric time window
table: 16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70, as shown in
Table 2.2. From the table, one can see that the closer to the current time, the
denser are the snapshots stored.

Table 2.2. A geometric time window

Frame no.    Snapshots (by clock time)
0            69 67 65
1            70 66 62
...
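
The insertion rule of the geometric time frame can be sketched as follows.
The names frame_number and build_frame_table are hypothetical, and the use of
a bounded deque per frame is an implementation assumption rather than
something prescribed by the text.

```python
from collections import deque


def frame_number(t: int) -> int:
    """Largest i such that 2**i divides t, so t mod 2**i == 0 and t mod 2**(i+1) != 0."""
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    return i


def build_frame_table(t_c: int, max_capacity: int = 3) -> dict:
    """Insert snapshots 1..t_c one by one, evicting the oldest when a frame is full."""
    frames = {}
    for t in range(1, t_c + 1):
        slot = frames.setdefault(frame_number(t), deque(maxlen=max_capacity))
        slot.append(t)  # a full deque drops its oldest entry automatically
    return frames


frames = build_frame_table(70)
for i in sorted(frames):
    print(i, list(reversed(frames[i])))  # most recent snapshot first

stored = sorted(t for slot in frames.values() for t in slot)
```

Because each clock time belongs to exactly one frame number, no snapshot is
counted twice, and at time 70 the sketch retains the 16 clock times listed in
the text.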
3. Clustering Evolving Data Streams: A Micro-clustering
Approach
The clustering problem is defined as follows: for a given set of data points,
we wish to partition them into one or more groups of similar objects. The
similarity of the objects with one another is typically defined with the use of
some distance measure or objective function. The clustering problem has been
widely researched in the database, data mining and statistics communities [12,
18, 22, 20, 21, 24] because of its use in a wide range of applications. Recently,
the clustering problem has also been studied in the context of the data stream
environment [17, 23].
A previous algorithm called STREAM [23] assumes that the clusters are to be
computed over the entire data stream. While such a task may be useful in many
applications, a clustering problem may often be defined only over a portion of
a data stream. This is because a data stream should be viewed as an infinite
process consisting of data which continuously evolves with time. As a result,
the underlying clusters may also change considerably with time. The nature of
the clusters may vary with both the moment at which they are computed as well
as the time horizon over which they are measured. For example, a data analyst
may wish to examine clusters occurring in the last month, last year, or last
decade. Such clusters may be considerably different. Therefore, we assume
that one of the inputs to the clustering algorithm is a time horizon over which
the clusters are found. Next, we will discuss CluStream, the online algorithm
used for clustering data streams.
3.1 Micro-clustering Challenges
We note that since stream data naturally imposes a one-pass constraint on the
design of the algorithms, it becomes more difficult to provide such flexibility
in computing clusters over different kinds of time horizons using conventional
algorithms. For example, a direct extension of the stream based k-means
algorithm in [23] to such a case would require a simultaneous maintenance of the
intermediate results of clustering algorithms over all possible time horizons.
Such a computational burden increases with progression of the data stream and
can rapidly become a bottleneck for online implementation. Furthermore, in
many cases, an analyst may wish to determine the clusters at a previous moment
in time, and compare them to the current clusters. This requires even greater
book-keeping and can rapidly become unwieldy for fast data streams.
Since a data stream cannot be revisited over the course of the computation,
the clustering algorithm needs to maintain a substantial amount of information
so that important details are not lost. For example, the algorithm in [23] is
implemented as a continuous version of the k-means algorithm which continues
to maintain a number of cluster centers which change or merge as necessary
throughout the execution of the algorithm. Such an approach is especially risky
when the characteristics of the stream change over time. This is because the
amount of information maintained by a k-means type approach is too
approximate in granularity, and once two cluster centers are joined, there is no way to
informatively split the clusters when required by the changes in the stream at a
later stage.
Therefore, a natural design for stream clustering would be to separate out the
process into an online micro-clustering component and an offline macro-clustering
component. The online micro-clustering component requires a very efficient
process for storage of appropriate summary statistics in a fast data stream. The
offline component uses these summary statistics in conjunction with other user
input in order to provide the user with a quick understanding of the clusters
whenever required. Since the offline component requires only the summary
statistics as input, it turns out to be very efficient in practice. This leads to
several challenges:
• What is the nature of the summary information which can be stored
efficiently in a continuous data stream? The summary statistics should provide
sufficient temporal and spatial information for a horizon-specific offline
clustering process, while being amenable to an efficient (online) update process.
• At what moments in time should the summary information be stored away
on disk? How can an effective trade-off be achieved between the storage
requirements of such a periodic process and the ability to cluster for a specific
time horizon to within a desired level of approximation?
• How can the periodic summary statistics be used to provide clustering and
evolution insights over user-specified time horizons?
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
The micro-clustering phase is the online statistical data collection portion
of the algorithm. This process is not dependent on any user input such as the
time horizon or the required granularity of the clustering process. The aim
is to maintain statistics at a sufficiently high level of (temporal and spatial)
granularity so that it can be effectively used by the offline components such
as horizon-specific macro-clustering as well as evolution analysis. The basic
concept of the micro-cluster maintenance algorithm derives ideas from the k-
means and nearest neighbor algorithms. The algorithm works in an iterative
fashion, by always maintaining a current set of micro-clusters. It is assumed that
a total of q micro-clusters are stored at any moment by the algorithm. We will
denote these micro-clusters by $M_1 \ldots M_q$. Associated with each micro-cluster
i, we create a unique id whenever it is first created. If two micro-clusters are
merged (as will become evident from the details of our maintenance algorithm),
a list of ids is created in order to identify the constituent micro-clusters. The
value of q is determined by the amount of main memory available in order to
store the micro-clusters. Therefore, typical values of q are significantly larger
than the natural number of clusters in the data but are also significantly smaller
than the number of data points arriving in a long period of time for a massive
data stream. These micro-clusters represent the current snapshot of clusters
which change over the course of the stream as new points arrive. Their status is
stored away on disk whenever the clock time is divisible by $\alpha^{i}$ for any integer
i. At the same time any micro-clusters of order r which were stored at a time
in the past more remote than $\alpha^{l+r}$ units are deleted by the algorithm.
We first need to create the initial q micro-clusters. This is done using an
offline process at the very beginning of the data stream computation process.
At the very beginning of the data stream, we store the first InitNumber points
on disk and use a standard k-means clustering algorithm in order to create the
q initial micro-clusters. The value of InitNumber is chosen to be as large as
permitted by the computational complexity of a k-means algorithm creating q
clusters.
Once these initial micro-clusters have been established, the online process of
updating the micro-clusters is initiated. Whenever a new data point $X_{i_k}$ arrives,
the micro-clusters are updated in order to reflect the changes. Each data point
either needs to be absorbed by a micro-cluster, or it needs to be put in a cluster of
its own. The first preference is to absorb the data point into a currently existing
micro-cluster. We first find the distance of each data point to the micro-cluster
centroids $M_1 \ldots M_q$. Let us denote the distance of the data point $X_{i_k}$
to the centroid of the micro-cluster $M_j$ by $dist(M_j, X_{i_k})$. Since the centroid
of the micro-cluster is available in the cluster feature vector, this value can be
computed relatively easily.
We find the closest cluster $M_p$ to the data point $X_{i_k}$. We note that in many
cases, the point $X_{i_k}$ does not naturally belong to the cluster $M_p$. These cases
are as follows:
• The data point $X_{i_k}$ corresponds to an outlier.
• The data point $X_{i_k}$ corresponds to the beginning of a new cluster because
of evolution of the data stream.
While the two cases above cannot be distinguished until more data points
arrive, the data point needs to be assigned a (new) micro-cluster of its own
with a unique id. How do we decide whether a completely new cluster should
be created? In order to make this decision, we use the cluster feature vector
of $M_p$ to decide if this data point falls within the maximum boundary of the
micro-cluster $M_p$. If so, then the data point $X_{i_k}$ is added to the micro-cluster
$M_p$ using the CF additivity property. The maximum boundary of the
micro-cluster $M_p$ is defined as a factor of t of the RMS deviation of the data points
in $M_p$ from the centroid. We define this as the maximal boundary factor. We
note that the RMS deviation can only be defined for a cluster with more than
1 point. For a cluster with only 1 previous point, the maximum boundary is
defined in a heuristic way. Specifically, we choose it to be r times that of the
next closest cluster.
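
The absorption test described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `MicroCluster` and `try_absorb` are hypothetical, the cluster feature vector is reduced to the point count and the per-dimension linear and squared sums, and the boundary used for single-point clusters is supplied by the caller because the text defines it only heuristically.

```python
import math


class MicroCluster:
    """Spatial part of a cluster feature vector: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.cf1 = list(point)               # per-dimension sum of values
        self.cf2 = [x * x for x in point]    # per-dimension sum of squares

    def centroid(self):
        return [s / self.n for s in self.cf1]

    def rms_deviation(self):
        # Square root of the mean squared distance of the points from the centroid.
        var = sum(q / self.n - (s / self.n) ** 2
                  for s, q in zip(self.cf1, self.cf2))
        return math.sqrt(max(var, 0.0))

    def absorb(self, point):
        # Additivity: a new point only adds its own contribution to each sum.
        self.n += 1
        for d, x in enumerate(point):
            self.cf1[d] += x
            self.cf2[d] += x * x


def try_absorb(clusters, point, t=2.0, singleton_boundary=1.0):
    """Absorb `point` into the nearest micro-cluster if it lies within the
    maximum boundary; return True if it was absorbed, False otherwise."""
    nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point))
    if nearest.n > 1:
        boundary = t * nearest.rms_deviation()
    else:
        boundary = singleton_boundary        # assumption: chosen by the caller
    if math.dist(nearest.centroid(), point) <= boundary:
        nearest.absorb(point)
        return True
    return False
```

When `try_absorb` returns False, the caller would create a new micro-cluster for the point and free memory by deleting or merging an existing one, as described next.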
If the data point does not lie within the maximum boundary of the nearest
micro-cluster, then a new micro-cluster must be created containing the data
point $X_{i_k}$. This newly created micro-cluster is assigned a new id which can
identify it uniquely at any future stage of the data stream process. However,
in order to create this new micro-cluster, the number of other clusters must
be reduced by one in order to create memory space. This can be achieved by
either deleting an old cluster or joining two of the old clusters. Our maintenance
algorithm first determines if it is safe to delete any of the current micro-clusters
as outliers. If not, then a merge of two micro-clusters is initiated.
The first step is to identify if any of the old micro-clusters are possibly
outliers which can be safely deleted by the algorithm. While it might be tempting
to simply pick the micro-cluster with the fewest points as the micro-cluster
to be deleted, this may often lead to misleading results. In many cases,
a given micro-cluster might correspond to a point of considerable cluster presence
in the past history of the stream, but may no longer be an active cluster
in the recent stream activity. Such a micro-cluster can be considered an out-
lier from the current point of view. An ideal goal would be to estimate the
average timestamp of the last m arrivals in each micro-cluster², and delete
the micro-cluster with the least recent timestamp. While the above estimation
can be achieved by simply storing the last m points in each micro-cluster, this
increases the memory requirements of a micro-cluster by a factor of m. Such
a requirement reduces the number of micro-clusters that can be stored by the
available memory and therefore reduces the effectiveness of the algorithm.
We will find a way to approximate the average timestamp of the last m data
points of the cluster M. This will be achieved by using the data about the
timestamps stored in the micro-cluster M. We note that the timestamp data
allows us to calculate the mean and standard deviation³ of the arrival times of
points in a given micro-cluster M. Let these values be denoted by $\mu_M$ and
$\sigma_M$ respectively. Then, we find the time of arrival of the $m/(2 \cdot n)$-th percentile
of the points in M assuming that the timestamps are normally distributed. This
timestamp is used as the approximate value of the recency. We shall call this
value the relevance stamp of cluster M. When the least relevance stamp of
any micro-cluster is below a user-defined threshold $\delta$, it can be eliminated and
a new micro-cluster can be created with a unique id corresponding to the newly
arrived data point $X_{i_k}$.
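
The relevance stamp computation can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `relevance_stamp` and its parameter names are hypothetical, n is assumed to be the number of points in the micro-cluster, and the percentile is assumed to be counted from the most recent arrival (the middle of the last m points sits m/2 points from the end), so it is evaluated as the 1 - m/(2n) quantile of the fitted normal distribution. The fallback to the mean when that quantile is undefined is also an assumption.

```python
import math
from statistics import NormalDist


def relevance_stamp(n, cf1_t, cf2_t, m):
    """Approximate the average arrival time of the last m points.

    n      -- number of points in the micro-cluster
    cf1_t  -- sum of the arrival timestamps
    cf2_t  -- sum of the squared arrival timestamps
    m      -- number of most recent arrivals to summarize
    """
    mu = cf1_t / n
    sigma = math.sqrt(max(cf2_t / n - mu * mu, 0.0))
    fraction = m / (2 * n)
    # Assumption: use the mean when the normal quantile cannot be evaluated.
    if sigma == 0.0 or fraction >= 1.0:
        return mu
    # Assumption: percentile measured from the most recent end of the cluster.
    return NormalDist(mu, sigma).inv_cdf(1.0 - fraction)
```

Only three numbers per micro-cluster are needed, which is the point of the approximation: no individual timestamps are stored.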
In some cases, none of the micro-clusters can be readily eliminated. This
happens when all relevance stamps are sufficiently recent and lie above the
user-defined threshold $\delta$. In such a case, two of the micro-clusters need to be
merged. We merge the two micro-clusters which are closest to one another.
The new micro-cluster no longer corresponds to one id. Instead, an idlist is
created which is a union of the ids in the individual micro-clusters. Thus,
any micro-cluster which is the result of one or more merging operations can be
identified in terms of the individual micro-clusters merged into it.
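
A merge with an id list can be sketched as below. The dictionary layout (`ids`, `n`, `cf1`, `cf2`) and the function names are hypothetical, and measuring closeness by the Euclidean distance between centroids is an assumption; the text only says that the two closest micro-clusters are merged.

```python
import math
from itertools import combinations


def centroid(c):
    return [s / c["n"] for s in c["cf1"]]


def closest_pair(clusters):
    """Indices (i, j), i < j, of the two micro-clusters with nearest centroids."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                        centroid(clusters[ij[1]])))


def merge(a, b):
    """Merge two summaries: statistics add up, id lists are unioned."""
    return {
        "ids": sorted(set(a["ids"]) | set(b["ids"])),
        "n": a["n"] + b["n"],
        "cf1": [x + y for x, y in zip(a["cf1"], b["cf1"])],
        "cf2": [x + y for x, y in zip(a["cf2"], b["cf2"])],
    }
```

Because the merged summary keeps every constituent id, a later snapshot can still be matched against earlier snapshots in which the two micro-clusters were separate.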
While the above process of updating is executed at the arrival of each data
point, an additional process is executed at each clock time which is divisible
by $\alpha^{i}$ for any integer i. At each such time, we store away the current set of
micro-clusters (possibly on disk) together with their id list, and indexed by their
time of storage. We also delete the least recent snapshot of order i, if $\alpha^{l} + 1$
snapshots of such order had already been stored on disk, and if the clock time for
this snapshot is not divisible by $\alpha^{i+1}$. (In the latter case, the snapshot continues
to be a viable snapshot of order (i+1).) These micro-clusters can then be used
to form higher level clusters or an evolution analysis of the data stream.
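
The snapshot bookkeeping can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `record_snapshot` and `retained_times` are hypothetical names, only the clock times are tracked (the micro-cluster statistics themselves would be stored alongside), and a snapshot time is listed under every order whose power of α divides it, so dropping it from order i leaves it retained whenever it still appears under a higher order.

```python
from collections import defaultdict, deque


def record_snapshot(frames, t, alpha=2, l=2):
    """Register clock time t under each order i for which alpha**i divides t.

    frames maps an order to a deque of retained snapshot times, oldest first.
    At most alpha**l + 1 snapshots are kept per order.
    """
    if t < 1 or alpha < 2:
        raise ValueError("t must be >= 1 and alpha must be >= 2")
    capacity = alpha ** l + 1
    i = 0
    while t % (alpha ** i) == 0:
        frames[i].append(t)
        if len(frames[i]) > capacity:
            frames[i].popleft()          # drop the least recent of this order
        i += 1


def retained_times(frames):
    """All clock times whose snapshot is still kept under some order."""
    return sorted({t for times in frames.values() for t in times})


frames = defaultdict(deque)
for clock in range(1, 56):
    record_snapshot(frames, clock, alpha=2, l=2)
```

With α = 2 and l = 2 each order keeps at most five snapshots, so recent history is stored densely and older history progressively more sparsely.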
3.3 High Dimensional Projected Stream Clustering
The method can also be extended to the case of high dimensional projected
stream clustering. The algorithm is referred to as HPSTREAM. The
high-dimensional case presents a special challenge to clustering algorithms even in
the traditional domain of static data sets. This is because of the sparsity of
the data in the high-dimensional case. In high-dimensional space, all pairs
of points tend to be almost equidistant from one another. As a result, it is
often unrealistic to define distance-based clusters in a meaningful way. Some
recent work on high-dimensional data uses techniques for projected clustering
which can determine clusters for a specific subset of dimensions [1, 4]. In these
methods, the definitions of the clusters are such that each cluster is specific
to a particular group of dimensions. This alleviates the sparsity problem in
high-dimensional space to some extent. Even though a cluster may not be
meaningfully defined on all the dimensions because of the sparsity of the data,
some subset of the dimensions can always be found on which particular subsets
of points form high quality and meaningful clusters. Of course, these subsets
of dimensions may vary over the different clusters. Such clusters are referred
to as projected clusters [1].
In [8], we have discussed methods for high dimensional projected clustering
of data streams. The basic idea is to use an (incremental) algorithm in which
we associate a set of dimensions with each cluster. The set of dimensions is
represented as a d-dimensional bit vector B(Ci) for each cluster structure in
FCS. This bit vector contains a 1 bit for each dimension which is included
in cluster Ci. In addition, the maximum number of clusters k and the average
cluster dimensionality l are used as input parameters. The average cluster
dimensionality l represents the average number of dimensions used in the cluster
projection. An iterative approach is used in which the dimensions are used to
update the clusters and vice-versa. The structure in FCS uses a decay-based
mechanism in order to adjust for evolution in the underlying data stream. Details
are discussed in [8].
[Figure 2.3. Varying Horizons for the classification process. Horizontal axis: Time, with marked times t1 and t2.]
4. Classification of Data Streams: A Micro-clustering Approach
One important data mining problem which has been studied in the context of
data streams is that of stream classification [15]. The main thrust on data stream
mining in the context of classification has been that of one-pass mining [14, 19].
In general, the use of one-pass mining does not recognize the changes which
have occurred in the model since the beginning of the stream construction
process [5]. While the work in [19] works on time changing data streams,
the focus is on providing effective methods for incremental updating of the
classification model. We note that the accuracy of such a model cannot be
greater than the best sliding window model on a data stream. For example, in
the case illustrated in Figure 2.3, we have illustrated two classes (labeled by
'x' and '-') whose distribution changes over time. Correspondingly, the best
horizon at times $t_1$ and $t_2$ will also be different. As our empirical results will
show, the true behavior of the data stream is captured in a temporal model which
is sensitive to the level of evolution of the data stream.
The classification process may require simultaneous model construction and
testing in an environment which constantly evolves over time. We assume that
the testing process is performed concurrently with the training process. This
is often the case in many practical applications, in which only a portion of
the data is labeled, whereas the remaining is not. Therefore, such data can
be separated out into the (labeled) training stream, and the (unlabeled) testing
stream. The main difference in the construction of the micro-clusters is that
the micro-clusters are associated with a class label; therefore an incoming data
point in the training stream can only be added to a micro-cluster belonging to
the same class. Therefore, we construct micro-clusters in almost the same way
as the unsupervised algorithm, with an additional class-label restriction.
From the testing perspective, the important point to be noted is that the most
effective classification model does not stay constant over time, but varies with
progression of the data stream. If a static classification model were used for
an evolving test stream, the accuracy of the underlying classification process
is likely to drop suddenly when there is a sudden burst of records belonging to
a particular class. In such a case, a classification model which is constructed
using a smaller history of data is likely to provide better accuracy. In other
cases, a longer history of training provides greater robustness.
In the classification process of an evolving data stream, either the short
term or long term behavior of the stream may be more important, and it often
cannot be known a-priori as to which one is more important. How do we
decide the window or horizon of the training data to use so as to obtain the best
classification accuracy? While techniques such as decision trees are useful for
one-pass mining of data streams [14, 19], these cannot be easily used in the
context of an on-demand classifier in an evolving environment. This is because
such a classifier requires rapid variation in the horizon selection process due
to data stream evolution. Furthermore, it is too expensive to keep track of
the entire history of the data in its original fine granularity. Therefore, the
on-demand classification process still requires the appropriate machinery for
efficient statistical data collection in order to perform the classification process.
4.1 On-Demand Stream Classification
We use the micro-clusters to perform an On Demand Stream Classification
Process. In order to perform effective classification of the stream, it is important
to find the correct time-horizon which should be used for classification. How
do we find the most effective horizon for classification at a given moment in
time? In order to do so, a small portion of the training stream is not used
for the creation of the micro-clusters. This portion of the training stream is
referred to as the horizon fitting stream segment. The number of points in the
stream used for horizon fitting is denoted by $k_{fit}$. The remaining portion of the
training stream is used for the creation and maintenance of the class-specific
micro-clusters as discussed in the previous section.
Since the micro-clusters are based on the entire history of the stream, they
cannot directly be used to test the effectiveness of the classification process over
different time horizons. This is essential, since we would like to find the time
horizon which provides the greatest accuracy during the classification process.
We will denote the set of micro-clusters at time $t_c$ and horizon h by $N(t_c, h)$.
This set of micro-clusters is determined by subtracting out the micro-clusters
at time $t_c - h$ from the micro-clusters at time $t_c$. The subtraction operation
is naturally defined for the micro-clustering approach. The essential idea is
to match the micro-clusters at time $t_c$ to the micro-clusters at time $t_c - h$,
and subtract out the corresponding statistics. The additive property of
micro-clusters ensures that the resulting clusters correspond to the horizon $(t_c - h, t_c)$.
More details can be found in [6].
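
The subtraction over a horizon can be sketched as below. This is a simplified illustration under assumptions, not the procedure of [6]: the function name and dictionary layout are hypothetical, a past micro-cluster is matched to a current one when its id set is contained in the current id set, and micro-clusters that received no points within the horizon are dropped from the result.

```python
def subtract_snapshot(current, past):
    """Return micro-cluster summaries for the horizon between two snapshots.

    Each micro-cluster is a dict with 'ids' (a set), 'n', 'cf1' and 'cf2'.
    """
    result = []
    for c in current:
        n, cf1, cf2 = c["n"], list(c["cf1"]), list(c["cf2"])
        for p in past:
            if p["ids"] <= c["ids"]:     # p is c itself or was merged into c
                n -= p["n"]
                cf1 = [a - b for a, b in zip(cf1, p["cf1"])]
                cf2 = [a - b for a, b in zip(cf2, p["cf2"])]
        if n > 0:                        # keep clusters active in the horizon
            result.append({"ids": set(c["ids"]), "n": n,
                           "cf1": cf1, "cf2": cf2})
    return result
```

The id lists created during merging are what make the matching possible: a current micro-cluster with ids {1, 2} has the statistics of both earlier micro-clusters 1 and 2 subtracted from it.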
Once the micro-clusters for a particular time horizon have been determined,
they are utilized to determine the classification accuracy of that particular
horizon. This process is executed periodically in order to adjust for the changes
which have occurred in the stream in recent time periods. For this purpose,
we use the horizon fitting stream segment. The last $k_{fit}$ points which have
arrived in the horizon fitting stream segment are utilized in order to test the
classification accuracy of that particular horizon. The value of $k_{fit}$ is chosen
while taking into consideration the computational complexity of the horizon
accuracy estimation. In addition, the value of $k_{fit}$ should be small enough so
that the points in it reflect the immediate locality of $t_c$. Typically, the value of
$k_{fit}$ should be chosen in such a way that the least recent point should be no
further than a pre-specified number of time units from the current time $t_c$. Let us
denote this set of points by $Q_{fit}$. Note that since $Q_{fit}$ is a part of the training
stream, the class labels are known a-priori.
In order to test the classification accuracy of the process, each point $X \in Q_{fit}$
is used in the following nearest neighbor classification procedure:
• We find the closest micro-cluster in $N(t_c, h)$ to X.
• We determine the class label of this micro-cluster and compare it to the true
class label of X.
The accuracy over all the points in $Q_{fit}$ is then determined.
This provides the accuracy over that particular time horizon.
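
The accuracy estimate for one horizon can be sketched as follows. The function name `horizon_accuracy` and the tuple layout are hypothetical; each micro-cluster of the horizon is represented only by its centroid and class label, and Euclidean distance is assumed.

```python
import math


def horizon_accuracy(micro_clusters, q_fit):
    """Fraction of horizon-fitting points whose nearest micro-cluster
    carries their true class label.

    micro_clusters -- list of (centroid, class_label) for one horizon
    q_fit          -- list of (point, true_label)
    """
    if not micro_clusters or not q_fit:
        return 0.0
    correct = 0
    for point, true_label in q_fit:
        _, predicted = min(micro_clusters,
                           key=lambda mc: math.dist(mc[0], point))
        correct += predicted == true_label
    return correct / len(q_fit)
```

Running this once per tracked horizon yields the accuracy values from which the best horizons are chosen.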
The accuracy of all the time horizons which are tracked by the geometric
time frame is determined. The p time horizons which provide the greatest
dynamic classification accuracy (using the last $k_{fit}$ points) are selected for the
classification of the stream. Let us denote the corresponding horizon values
by $\mathcal{H} = \{h_1 \ldots h_p\}$. We note that since $k_{fit}$ represents only a small locality
of the points within the current time period $t_c$, it would seem at first sight
that the system would always pick the smallest possible horizons in order to
maximize the accuracy of classification. However, this is often not the case
for evolving data streams. Consider for example, a data stream in which the
records for a given class arrive for a period, and then subsequently start arriving
again after a time interval in which the records for another class have arrived.
In such a case, the horizon which includes previous occurrences of the same
class is likely to provide higher accuracy than shorter horizons. Thus, such a
system dynamically adapts to the most effective horizon for classification of
data streams. In addition, for a stable stream the system is also likely to pick
larger horizons because of the greater accuracy resulting from use of larger data
sizes.
The classification of the test stream is a separate process which is executed
continuously throughout the algorithm. For each given test instance X, the
above described nearest neighbor classification process is applied using each
$h_i \in \mathcal{H}$. It is often possible that in the case of a rapidly evolving data stream,
different horizons may result in the determination of different class labels.
The majority class among these p class labels is reported as the relevant class.
More details on the technique may be found in [7].
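
Horizon selection and the majority vote can be sketched as below. The names `best_horizons` and `classify`, and the dictionary layouts, are hypothetical; ties in accuracy or in the vote are resolved arbitrarily by Python's ordering rules, which the text does not specify.

```python
import math
from collections import Counter


def best_horizons(accuracy_by_horizon, p):
    """The p horizons with the highest measured accuracy."""
    return sorted(accuracy_by_horizon,
                  key=accuracy_by_horizon.get, reverse=True)[:p]


def classify(x, clusters_by_horizon, horizons):
    """Nearest-neighbor label under each selected horizon, then majority vote.

    clusters_by_horizon maps a horizon to a list of (centroid, class_label).
    """
    votes = Counter()
    for h in horizons:
        _, label = min(clusters_by_horizon[h],
                       key=lambda mc: math.dist(mc[0], x))
        votes[label] += 1
    return votes.most_common(1)[0][0]
```

For example, with three selected horizons of which two assign a test point to class 'x' and one to class 'y', the reported class is 'x'.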
5. Other Applications of Micro-clustering and Research
Directions
While this paper discusses two applications of micro-clustering, we note that
a number of other problems can be handled with the micro-clustering approach.
This is because the process of micro-clustering creates a summary of the data
which can be leveraged in a variety of ways for other problems in data mining.
Some examples of such problems are as follows:
• Privacy Preserving Data Mining: In the problem of privacy preserving
data mining, we create condensed representations [3] of the data which
show k-anonymity. These condensed representations are like micro-clusters,
except that each cluster has a minimum cardinality threshold
on the number of data points in it. Thus, each cluster contains at least
k data points, and we ensure that each record in the data cannot be
distinguished from at least k other records. For this purpose, we only
maintain the summary statistics for the data points in the clusters as
opposed to the individual data points themselves. In addition to the first
and second order moments we also maintain the covariance matrix for
the data in each cluster. We note that the covariance matrix provides
a complete overview of the distribution of the data. This covariance
matrix can be used in order to generate the pseudo-points which match
the distribution behavior of the data in each micro-cluster. For relatively
small micro-clusters, it is possible to match the probabilistic distribution
in the data fairly closely. The pseudo-points can be used as a surrogate for
the actual data points in the clusters in order to generate the relevant data
mining results. Since the pseudo-points match the original distribution
quite closely, they can be used for the purpose of a variety of data mining
algorithms. In [3], we have illustrated the use of the privacy-preserving
technique in the context of the classification problem. Our results show
that the classification accuracy is not significantly reduced because of the
use of pseudo-points instead of the individual data points.
• Query Estimation: Since micro-clusters encode summary information
about the data, they can also be used for query estimation. A typical
example of such a technique is that of estimating the selectivity of queries.
In such cases, the summary statistics of micro-clusters can be used in
order to estimate the number of data points which lie within a certain
interval such as a range query. Such an approach can be very efficient
in a variety of applications since voluminous data streams are difficult to
use if they need to be utilized for query estimation. However, the
micro-clustering approach can condense the data into summary statistics, so that
it is possible to efficiently use it for various kinds of queries. We note
that the technique is quite flexible as long as it can be used for different
kinds of queries. An example of such a technique is illustrated in [9], in
which we use the micro-clustering technique (with some modifications
on the tracked statistics) for futuristic query processing in data streams.
• Statistical Forecasting: Since micro-clusters contain temporal and condensed
information, they can be used for methods such as statistical
forecasting of streams. While it can be computationally intensive to
use standard forecasting methods with large volumes of data points, the
micro-clustering approach provides a methodology in which the condensed
data can be used as a surrogate for the original data points. For
example, for a standard regression problem, it is possible to use the centroids
of different micro-clusters over the various temporal time frames in
order to estimate the values of the data points. These values can then be
used for making aggregate statistical observations about the future. We
note that this is a useful approach in many applications since it is often
not possible to effectively make forecasts about the future using the large
volume of the data in the stream. In [9], it has been shown how to use the
technique for querying and analysis of future behavior of data streams.
In addition, we believe that the micro-clustering approach is powerful enough
to accommodate a wide variety of problems which require information about the
summary distribution of the data. In general, since many new data mining
problems require summary information about the data, it is conceivable that the
micro-clustering approach can be used as a methodology to store condensed
statistics for general data mining and exploration applications.
6. Performance Study and Experimental Results
All of our experiments are conducted on a PC with Intel Pentium III processor
and 512 MB memory, which runs the Windows XP Professional operating system.
For testing the accuracy and efficiency of the CluStream algorithm, we compare
CluStream with the STREAM algorithm [17, 23], the best algorithm reported
so far for clustering data streams. CluStream is implemented according to the
description in this paper, and the STREAM K-means is done strictly according
to [23], which shows better accuracy than BIRCH [24]. To make the comparison
fair, both CluStream and STREAM K-means use the same amount of memory.
Specifically, they use the same stream incoming speed, the same amount of
memory to store intermediate clusters (called Micro-clusters in CluStream), and
the same amount of memory to store the final clusters (called Macro-clusters
in CluStream).
Because the synthetic datasets can be generated by controlling the number
of data points, the dimensionality, and the number of clusters, with different
distribution or evolution characteristics, they are used to evaluate the scalability
in our experiments. However, since synthetic datasets are usually rather dif-
ferent from real ones, we will mainly use real datasets to test accuracy, cluster
evolution, and outlier detection.
Real datasets. First, we need to find some real datasets that evolve significantly
over time in order to test the effectiveness of CluStream. A good candidate for
such testing is the KDD-CUP'99 Network Intrusion Detection stream data set
which has been used earlier [23] to evaluate STREAM accuracy with respect
to BIRCH. This data set corresponds to the important problem of automatic
and real-time detection of cyber attacks. This is also a challenging problem
for dynamic stream clustering in its own right. The offline clustering algo-
rithms cannot detect such intrusions in real time. Even the recently proposed
stream clustering algorithms such as BIRCH and STREAM cannot be very
effective because the clusters reported by these algorithms are all generated from
the entire history of the data stream, whereas the current cases may have evolved
significantly.
The Network Intrusion Detection dataset consists of a series of TCP con-
nection records from two weeks of LAN network traffic managed by MIT
Lincoln Labs. Each record can either correspond to a normal connection, or
an intrusion or attack. The attacks fall into four main categories: DOS (i.e.,
denial-of-service), R2L (i.e., unauthorized access from a remote machine), U2R
(i.e., unauthorized access to local superuser privileges), and PROBING (i.e.,
surveillance and other probing). As a result, the data contains a total of five
clusters including the class for "normal connections". The attack-types are
further classified into one of 24 types, such as buffer-overflow, guess-passwd,
neptune, portsweep, rootkit, smurf, warezclient, spy, and so on. It is evident
that each specific attack type can be treated as a sub-cluster. Most of the
connections in this dataset are normal, but occasionally there could be a burst of
attacks at certain times. Also, each connection record in this dataset contains
42 attributes, such as duration of the connection, the number of data bytes
transmitted from source to destination (and vice versa), percentile of connections
that have "SYN" errors, the number of "root" accesses, etc. As in [23], all 34
continuous attributes will be used for clustering and one outlier point has been
removed.
Second, besides testing on the rapidly evolving network intrusion data stream,
we also test our method over relatively stable streams. Since previously reported
stream clustering algorithms work on the entire history of stream data,
we believe that they should perform effectively for some data sets with stable
distribution over time. An example of such a data set is the KDD-CUP'98
Charitable Donation data set. We will show that even for such datasets,
CluStream can consistently beat the STREAM algorithm.
The KDD-CUP'98 Charitable Donation data set has also been used in evaluating
several one-scan clustering algorithms, such as [16]. This data set contains
95412 records of information about people who have made charitable
donations in response to direct mailing requests, and clustering can be used to
group donors showing similar donation behavior. As in [16], we will only use
56 fields which can be extracted from the total 481 fields of each record. This
data set is converted into a data stream by taking the data input order as the
order of streaming and assuming that they flow in with a uniform speed.
Synthetic datasets. To test the scalability of CluStream, we generate some
synthetic datasets by varying base size from 100K to 1000K points, the number
of clusters from 4 to 64, and the dimensionality in the range of 10 to 100.
Because we know the true cluster distribution a priori, we can compare the
clusters found with the true clusters. The data points of each synthetic dataset
will follow a series of Gaussian distributions, and to reflect the evolution of the
stream data over time, we change the mean and variance of the current Gaussian
distribution every 10K points in the synthetic data generation.
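
A generator in the spirit of this description can be sketched as follows. It is an illustration only: the function name, the use of a single Gaussian at a time with one shared standard deviation, and the ranges from which new means and deviations are drawn are all assumptions, since the text does not give these details.

```python
import random


def synthetic_stream(num_points, dim, change_every=10_000, seed=0):
    """Yield points from a Gaussian whose parameters change periodically."""
    rng = random.Random(seed)
    mean, std = [], 0.0
    for i in range(num_points):
        if i % change_every == 0:
            # Assumption: new parameters are drawn uniformly from fixed ranges.
            mean = [rng.uniform(0.0, 1.0) for _ in range(dim)]
            std = rng.uniform(0.01, 0.1)
        yield [rng.gauss(mu, std) for mu in mean]
```

A multi-cluster stream could be obtained by interleaving several such generators, one per cluster.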
The quality of clustering on the real data sets was measured using the sum
of square distance (SSQ), defined as follows. Assume that there are a total of
N points in the past horizon at current time $T_c$. For each point $p_i$, we find the
centroid $C_{p_i}$ of its closest macro-cluster, and compute $d(p_i, C_{p_i})$, the distance
between $p_i$ and $C_{p_i}$. Then the SSQ at time $T_c$ with horizon H (denoted as
$SSQ(T_c, H)$) is equal to the sum of $d^2(p_i, C_{p_i})$ for all the N points within the
previous horizon H. Unless otherwise mentioned, the algorithm parameters
were set at $\alpha = 2$, $l = 10$, InitNumber = 2000, and t = 2.
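
The SSQ measure defined above can be computed directly. In this small sketch the function name `ssq` is hypothetical, Euclidean distance is assumed for d, and `points` is taken to be the set of points that fall within the horizon.

```python
import math


def ssq(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
```

A lower SSQ means the macro-cluster centroids summarize the points of the horizon more tightly.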
We compare the clustering quality of CluStream with that of STREAM for
different horizons at different times using the Network Intrusion dataset and the
Charitable Donation data set. The results are illustrated in Figures 2.4 and 2.5.
We run each algorithm 5 times and compute their average SSQs. The results
show that CluStream is almost always better than STREAM. All experiments
for these datasets have shown that CluStream has substantially higher quality
than STREAM. However, the Network Intrusion data set showed significantly
better results than the Charitable Donation data set because the network
intrusion data set was a highly evolving data set. For such cases, the evolution
sensitive CluStream algorithm was much more effective than the STREAM
algorithm.
Figure 2.4. Quality comparison of CluStream and STREAM (Network Intrusion dataset, horizon=256, stream_speed=200). Horizontal axis: stream (in time units).
Figure 2.5. Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200). Horizontal axis: stream (in time units).
Figure 2.6. Accuracy comparison of On Demand Stream, Fixed Sliding Window, and Entire Dataset (Network Intrusion dataset, buffer_size=1600, kfit=80, init_number=400). Horizontal axis: stream (in time units).
Figure 2.7. Distribution of the (smallest) best horizon (Network Intrusion dataset, time units=2500, buffer_size=1600, kfit=80, init_number=400). Horizontal axis: best horizon.
Figure 2.8. Accuracy comparison of On Demand Stream, Fixed Sliding Window, and Entire Dataset (Synthetic dataset B300kC5D20, buffer_size=500, kfit=25, init_number=400). Horizontal axis: stream (in time units).
Figure 2.9. Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, time units=2000, buffer_size=500, kfit=25, init_number=400). Horizontal axis: best horizon.
We also tested the accuracy of the On-Demand Stream Classifier. The first
test was performed on the Network Intrusion dataset. The first experiment
was conducted with a stream speed of 80 connections per time unit (i.e., there
are 40 training stream points and 40 test stream points per time unit). We
set the buffer size at 1600 points, which means that upon receiving 1600 points
(including both training and test stream points) we use a small set of the
training data points (in this case, kfit = 80) to choose the best horizon. We
compared the accuracy of the On-Demand-Stream classifier with two simple
one-pass stream classifiers over the entire data stream and the selected sliding
window (i.e., sliding window H = 8). Figure 2.6 shows the accuracy comparison
among the three algorithms. We can see that the On-Demand-Stream classifier
consistently beats the two simple one-pass classifiers. For example, at time unit
2000, the On-Demand-Stream classifier's accuracy is about 4% higher than the
classifier with a fixed sliding window, and is about 2% higher than the classifier
with the entire dataset. Because the class distribution of this dataset evolves
significantly over time, neither the entire dataset nor a fixed sliding window
always captures the underlying evolution of the stream. As a result, they
always have a worse accuracy than the On-Demand-Stream classifier, which
always dynamically chooses the best horizon for classifying.
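The idea of dynamically choosing the best horizon can be illustrated with a simplified sketch. It is not the On-Demand-Stream classifier itself: we assume here that each candidate horizon has already been summarized as a list of labeled centroids, that the kfit held-out training points are classified by their nearest centroid, and that ties are resolved in favor of the smallest horizon. The names `nearest_label` and `best_horizon` are hypothetical.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def nearest_label(point, labeled_centroids):
    """Return the label of the centroid closest to `point`."""
    centroid, label = min(labeled_centroids,
                          key=lambda cl: squared_distance(point, cl[0]))
    return label


def best_horizon(centroids_by_horizon, validation):
    """Pick the horizon whose centroids classify the validation points best.

    centroids_by_horizon : dict mapping horizon -> list of (centroid, label)
    validation           : non-empty list of (point, true_label), e.g. kfit points
    Horizons are scanned in increasing order and replaced only on a strict
    improvement, so the smallest of several equally good horizons is kept.
    """
    best_h, best_acc = None, -1.0
    for h in sorted(centroids_by_horizon):
        correct = sum(nearest_label(p, centroids_by_horizon[h]) == y
                      for p, y in validation)
        acc = correct / len(validation)
        if acc > best_acc:
            best_h, best_acc = h, acc
    return best_h, best_acc
```

Re-running such a selection each time the buffer fills is what lets the chosen horizon follow the evolution of the stream instead of staying fixed.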
Figure 2.7 shows the distribution of the best horizons (when several horizons
are best at the same time, the smallest one is taken). Although about 78.4%
of the (smallest) best horizons have a value of 1/4, about 21.6% of the best
horizons range from 1/2 to 32 (e.g., about 6.4% of the best horizons have a
value of 32). This illustrates that no fixed sliding window can
achieve the best accuracy, which is why the On-Demand-Stream classifier
can outperform the simple one-pass classifiers over either the entire dataset or
a fixed sliding window.
We have also generated one synthetic dataset, B300kC5D20, to test the
classification accuracy of these algorithms. This dataset contains 5 class labels and
300K data points with 20 dimensions. We first set the stream speed at 100 points
The journey to Italy duly took place, the proposed party of two
being enlarged to one of four by the addition of Ignaz Brüll and
Simrock. Original plans had to be modified on account of the
exceptionally wet season, and the chief places visited were Vicenza,
Padua, and Venice.
The personnel of Brahms' intimate friends in Vienna had remained
on the whole much what it had become a very few years after his
arrival in the Austrian capital. Of its closest circle the Fabers,
Billroths, and Hanslicks, with whom must be associated Joachim's
cousins, the various members of the Wittgenstein family—amongst
them Frau Franz and Frau Dr. Oser—still formed the nucleus. An
acquaintance with Herr Victor von Miller zu Aichholz and his wife had
meanwhile ripened into warm friendship, and their house became
one of those whose hospitality was most frequently and gladly
accepted by the master. Amongst the musicians, Carl Ferdinand Pohl,
author of the standard Life of Haydn, and, since 1866, archivar to
the Gesellschaft, was one of his dearest friends. With the leading
professors of the conservatoire his relations continued very cordial,
and amongst the younger musicians to whom, in addition to his
early allies, Goldmark, Gänsbacher and Epstein, he extended his
friendly regard, may be mentioned Anton Door and Robert Fuchs.
The feeling of warm friendship existing between Brahms and Johann
Strauss has been commemorated in several well-known anecdotes.
The autumn of 1881, however, brought to permanent residence in
Vienna a family that before long made notable addition to the
master's intimate circle. Special circumstances conduced to the
speedy formation of a bond of friendship between Brahms and the
new-comers, Dr. and Frau Fellinger. In the first place, they were
friends of Frau Schumann and her daughters, and as such had an
instant claim on his courtesy, which he acknowledged by calling on
them as soon as possible after their arrival. In the second, his
interest was awakened by the fact that Frau Dr. Fellinger was the
daughter of Frau Professor Lang-Köstlin, the gifted Josephine Lang,
whose attractive personality and talent for composition made a
strong impression upon Mendelssohn when he was a youth of
twenty-one and some six years the lady's senior. The story of
Josephine, who at the age of twenty-six married Professor Köstlin of
Tübingen, is given in Hiller's 'Tonleben,' and Mendelssohn's
congratulations to her bridegroom-elect may be read in the second
volume of the 'Letters.' The talent for art which had come to her as
a family inheritance was transmitted to her daughter, though with a
difference. Frau Dr. Fellinger's gifts have associated themselves
especially with the plastic arts; in the first place with that of
painting, but they have become well known in the musical world also
by her busts and statuettes of Brahms, Billroth, and others belonging
to their circle. Her photographs of our master are now familiar to
most music-lovers. When it is added that Brahms found he could
command in Dr. Fellinger's hospitable house, not only congenial
intellectual sympathy, but the unceremonious intercourse with a
simple, affectionate family circle in which he had through life found a
pre-eminent source of happiness, it will easily be understood that he
became a more and more frequent guest there, until, during the
closing years of his life, it became for him almost a second home.
The master introduced two of his new works in the course of a few
weeks' journey undertaken in the winter of 1882-83. According to
Simrock's Thematic Catalogue, the Pianoforte Trio in C major, the
String Quintet in F major, and the 'Parzenlied' constitute the
publications of 1883. Early copies of the trio and quintet were sent
out, however, and the works were publicly performed from them in
December, 1882. An interesting entry in Frau Schumann's diary says:
'I had invited Koning and Müller to come and try Brahms'
new trio with me on Thursday 21st [December]. Who
should surprise us as we were playing it—he himself! He
came from Strassburg and means to stay with us for
Christmas. I played the trio first and he repeated it.'
Both works were performed on December 29 at a Museum chamber
music concert—the Quintet by the Heermann-Müller party, the Trio
by Brahms, Heermann, and Müller.
Amongst the early performances of the Trio were those on January
17 and 22 respectively in Berlin (Trio Concerts: Barth, de Ahna,
Hausmann) and London (Monday Popular Concerts: Hallé, Madame
Néruda, Piatti), and at Hellmesberger's in Vienna on March 15.
The work has not become one of the most generally familiar of the
master's compositions, though it is not easy to say why. It contains
no trace of the 'heaven-storming Johannes,' but, like many of the
later compositions, it breathes, and especially the first movement,
with a rich, mellow warmth suggestive of one to whom the
experiences of life have brought a solution of their own to its
problems, which has quieted, if it has not altogether satisfied, the
aspirations and impulses of youth.
The Quintet in F for strings is, for the most part, bright, concise, and
easy to follow. As one of its special features may be mentioned the
combination of the usual two middle movements in the second. It
was given in Hamburg on the 22nd and in Berlin on the 23rd of
January, respectively by Bargheer and Joachim and their colleagues
(it should be noted that Hausmann had at this time succeeded
Müller as the violoncellist of the Joachim Quartet), at
Hellmesberger's on February 15, and at the Monday Popular,
London, of March 5.
Brahms conducted the first performance of the Parzenlied in Basle
on December 8, 1882. Excellently sung by the members of the Basle
Choral Society, the work met with extraordinary success, and was
repeated after the New Year by general desire. Similar results
followed its performance in other towns, of which Strassburg and
Crefeld should be specially mentioned. The programme of the
Crefeld concert included the fifth movement of the Requiem. 'What
is your tempo?' Brahms inquired, on the morning of the rehearsal, of
Fräulein Antonia Kufferath, who was to sing the solo. The lady, not
taking the question seriously from the composer of the music,
waived a reply. 'No, I mean it; you have to hold out the long notes.
Well, we shall understand each other,' he added; 'sing only as you
feel, and I will follow with the chorus.'
These are characteristic words, and valuable in more than one
sense. To most of the few works to which the master has placed
metronome indications—and the Requiem is amongst these—he
added them by special request, and attached to them only a limited
importance. An absolutely and uniformly 'correct' pace for a piece of
genuine music does not exist. The pace must vary to some extent
according to subtle conditions existent in the performer, and the
instinct of a really musical executant or conductor will, as a rule, be
a safer guide, within limits, than what can be at best but the
mechanical markings even of the composer himself.
The Parzenlied, received with enthusiasm throughout Brahms' tour in
Germany and Switzerland, was not equally successful in Vienna,
where it was heard for the first time at the Gesellschaft concert of
February 18 under Gericke. The austere simplicity of the music,
which paces majestically onward with the concentrated, resigned
calm of despair, adds extraordinary force to Goethe's poem, but does
not appeal to every audience, and the work has never become a
prime favourite in the Austrian Kaiserstadt. The song is set for six-
part chorus with orchestra, in plainer harmonic masses and with less
employment of imitative counterpoint than we usually find in the
works of Brahms, who has accommodated his music here, as in
'Nänie,' to the classical spirit of the text. A singular deviation,
however, which occurs in the course of the setting, from the
uncompromising severity of the words, furnishes a remarkable
illustration of the composer's unconquerable idealism. Comment was
made in its place on the beautiful device by which he has sought to
relieve the dark mood of Hölderlin's 'Song of Destiny'—the addition
of an instrumental postlude which breathes forth a message of
tender consolation that the poet could hardly have rendered in
words. In Schiller's 'Nänie' the lament, with all its calm, gives
expression to a sentiment of compassionate sorrow that is perfectly
reproduced in the master's music. Goethe's Fates, however, in their
measured recitation of the gods' relentless cruelty, would have
seemed to offer no possible opportunity for even the inarticulate
expression of ruth. Least of all, it might be imagined, could any
concession to the demands of the human heart have been found in
the penultimate stanza of their song:
'The rulers exclude from
Their favouring glances
Entire generations,
And heed not in children
The once so belovèd
And still speaking features
Of distant forefathers.'
Our Brahms, however, who, in spite of his increasing weight, his
shaggy beard, his frequently rough manners, his unsatisfied
affections, his impenetrable reserve, remained at fifty, in his heart of
hearts, the very same being whom we have watched as the loving
child of seven, the simple-minded boy of fourteen, the broken-
hearted man of thirty, sobbing by the death-bed of his mother,
cannot leave the dread gloom of his subject unrelieved by a single
ray. He seems, in his setting of the last strophe but one, to
concentrate attention on past kindness of the gods, and thus,
perhaps, subtly to suggest a plea for present hope. How far the
musician was justified in thus wandering from the obvious intention
of his poet must be left to each hearer of the work to determine for
himself. If it be the case, as has sometimes been suggested, that the
variation was made by the composer in the musical interests of the
piece as a work of art, it cannot be held to have fulfilled its purpose;
for the striking inconsistency between words and music in the verse
in question has a disturbing effect on the mind of the listener. We
believe, however, that the true explanation of the master's procedure
is more radical, and is to be found in the nature of the man in which
that of the musician was grounded.
The Parzenlied was dedicated to 'His Highness George, Duke of
Saxe-Meiningen,' and was included in a Brahms programme
performed in Meiningen on April 2 to celebrate the Duke's birthday.
The complete breakdown of Bülow's health necessitated his
temporary retirement from his conductor's duties, which were
divided on this occasion between Brahms and Court Capellmeister
Franz Mannstädt, appointed to assist Bülow. Returning by a
circuitous route to Vienna after a few days at the ducal castle,
Brahms paid a short visit to Hamburg to take part in another Brahms
programme arranged by the talented young conductor of the Cecilia
Society, Julius Spengel. This was the first of several occasions on
which the master gave testimony of his appreciation of Dr. Spengel's
talents and musicianship by co-operating in the concerts of the
society.
Brahms celebrated his fiftieth birthday by entertaining his friends
Faber, Billroth, and Hanslick at a bachelor supper. He was occupied
during the summer with the completion of a third symphony, on
which he had worked the preceding year, and lived at Wiesbaden in
a house that had belonged to the celebrated painter Ludwig Knaus,
in whose former studio—Brahms' music-room for the nonce—the
work was finished.
It was known to the composer that a delicate elderly lady inhabited
the first-floor of the house of which Frau von Dewitz's flat, where he
lodged, formed an upper story. Every night, therefore, on returning
to his rooms, he took off his boots before going upstairs, and made
the ascent in his socks, so that her rest should not be disturbed. This
anecdote is but one amongst several of the same kind that have
been related to the author by Brahms' intimate associates. Samples
of another variety should not, however, be omitted.
A private performance of the new symphony, this time arranged for
two pianofortes, was given as usual at Ehrbar's by Brahms and Brüll,
and aroused immense expectations for the future of the work.
Amongst the listeners was a musician who, not having hitherto
allowed himself to be suspected of a partiality for the master's art,
expressed his enthusiastic admiration of the composition. 'Have you
had any conversation with X?' young Mr. Ehrbar asked Brahms; 'he
has been telling me how delighted he is with the symphony.' 'And
have you told him that he very often lies when he opens his mouth?'
angrily retorted the composer, who could never bring himself to
submit to the humiliation of accepting a compliment which he
suspected—perhaps unjustly in this case—of being insincere.
A terrible rebuff was administered by him on the evening of a first
Gewandhaus performance. It must be owned that Brahms was
seldom in his happiest mood when on a visit to Leipzig; he was well
aware that his music was not appreciated within the official 'ring'
there, and suspiciously resented any well-meant efforts made to
ignore this fact. 'And where are you going to lead us to-night, Herr
Doctor?' inquired one of the committee a few minutes before the
beginning of the concert, assuming a conciliatory manner as he
smoothed on his white kid gloves; 'to heaven?' 'It is the same to me
where you go,' rejoined Brahms.
The first performance of the Symphony in F major (No. 3) took place
in Vienna at the Philharmonic concert of December 2, under Hans
Richter, who was, according to Hanslick, originally responsible for the
name 'the Brahms Eroica,' by which it has occasionally been called.
Whether or not the suggestion is happy, a saying of the kind,
probably uttered on the impulse of the moment, should not be taken
very seriously.
Nothing of the quiescent autumn mood which we have observed in
the master's chamber music of this period is to be traced in either of
his symphonies, and the third, like its companions, represents him in
the zenith of his energies, working happily in the consciousness of
his absolute command over the resources of his art. Whether it be
judged by its effect as an entire work or studied movement by
movement, whether each movement be listened to as a whole or
analyzed into its component parts, all is found to be without halt of
inspiration or flaw in workmanship. Each theme is striking and
pregnant, and, though contrasting with what precedes it, seems to
belong inevitably to the movement and place in which it occurs,
whilst the development of the thematic material is so masterly that
to speak of admiring it seems almost ridiculous. The last movement
closes with a very beautiful and distinctive Brahms coda. The third
symphony is more immediately easy to follow than the first, and of
broader atmosphere than the second. It is of an essentially objective
character, and belongs absolutely to the domain of pure music.
The supreme and glorious pre-eminence which the great master had
by this time attained in contemporary estimation naturally made it
an object of competition with concert-givers and directors to
announce the earliest performances of his works, and this was
especially the case in the rare event of a new symphony which
succeeded its immediate predecessor after an interval of six years.
Brahms, however, had his own ideas on this matter, as on every
other that he thought important, and after the first performance of
the work in Vienna he sent the manuscript to Joachim in Berlin, and
begged him to conduct the second performance when and where he
liked. This proceeding would hardly have been noteworthy under the
circumstances of intimate friendship which had so long united the
two musicians, had it not been that the old relation between Brahms
and Joachim had been clouded during the past year or two, during
which there had been a cessation of their former affectionate
intercourse. When, therefore, it became known that Joachim, acting
on the composer's wish, proposed to conduct the symphony at one
of the subscription concerts of the Royal Academy of Arts, Berlin, so
much disappointment and heart-burning were felt and expressed
that Joachim, although he had already replied in the affirmative to
Brahms' request, consented to write again and ask what his wishes
really were. The answer came without delay, and was clear enough
to set the matter quite at rest. Brahms desired that the performance
should be committed unreservedly to the care of his old friend.
The symphony was heard for the second time, therefore, on January
4 under Joachim at Berlin, and was enthusiastically received by all
sections of the public and press. It was given again three times
during the same month in the German imperial capital under the
composer's bâton.
Detailed description of the triumphant progress of the new work
from town to town is no longer necessary. The composer was
overwhelmed with invitations to conduct it from the manuscript, and
Bülow, convalescent from his illness, and determined not to be
outdone in enthusiasm, placed it twice, as second and fourth
numbers, in a Meiningen programme of five works. On publication, it
was performed in all the chief music-loving towns of Germany, Great
Britain, Holland, Russia, Switzerland, and the United States.
In an account of a performance of the symphony at a Hamburg
Philharmonic concert under Brahms in December, which followed one
under von Bernuth after three weeks' interval, the critic of the
Correspondenten says:
'Brahms' interpretation of his works frequently differs so
inconceivably in delicate rhythmic and harmonic accents
from anything to which one is accustomed, that the
apprehension of his intentions could only be entirely
possible to another man possessed of exactly similar
sound-susceptibility or inspired by the power of
divination.'
The author feels a peculiar interest in quoting these lines, which
strikingly corroborate the impression formed by her on hearing this
and other of Brahms' works played under his own direction.
The publications of 1884 were, besides the third Symphony, Two
Songs for Contralto with Viola and Pianoforte, the second being the
'Virgin's Cradle Song,' already mentioned as one of the compositions
of 1865; two sets of four-part Songs, the one for accompanied Solo
voices, the other for mixed Chorus a cappella, and the two books of
Songs, Op. 94 and 95.
At this date Brahms had entered into what we may call the third
period of his activity as a song-writer—one in which he frequently
chose texts that speak of loneliness or death. The wonderful beauty
of his settings of these subjects penetrates the very soul, and by the
mere force of its pathos carries to the hearer the conviction that the
composer speaks out of the feeling of his own heart. Stockhausen,
trying the song 'Mit vierzig Jahren' (Op. 94, No. 1) from the
manuscript to the composer's accompaniment, was so affected
during its performance that he could not at once proceed to the end.
Our remarks are, however, by no means intended to convey the
impression that Brahms only or generally chose poems of a
melancholy tendency at this time.
WITH FORTY YEARS.
By Friedrich Rückert (1788-1866).
With forty years we've gained the mountain's summit,
We stand awhile and look behind;
There we behold the quiet years of childhood
And there the joy of youth we find.

Look once again, and then, with freshened vigour,
Take up thy staff and onward wend!
A mountain-ridge extendeth, broad, before thee,
Not here, but there must thou descend.

No longer, climbing, need'st thou struggle breathless,
The level path will lead thee on;
And then with thee a little downward tending,
Before thou know'st, thy journey's done.
With the knowledge we have gained of the master's habit of
producing his large works in couples, we are prepared to find him
employed this summer on the composition of a fourth symphony.
Avoiding a long journey, he settled down to his work at Mürz
Zuschlag in Styria, not far from the highest ridge of the Semmering.
Hearing soon after his arrival there that his old friend Misi Reinthaler,
now grown up into a young lady, was leaving home under her
mother's care to go through a course of treatment under a famous
Vienna specialist, he wrote to place his rooms in Carlsgasse at Frau
Reinthaler's disposal. The offer was not accepted, but when the
invalid was sufficiently convalescent, he insisted that the two ladies
should come for a few days as his guests to Mürz Zuschlag, where
he took rooms for them near his own lodgings. He went over to see
them also at Vienna, and spent the greater part of a morning
showing them his valuable collection of autographs and other
treasures. 'Yes, these would have been something to give a wife!'
was his answer to the ladies' expressions of delight. Amongst his
collection of musical autographs were two written on different sides
of the same sheet of paper—one of Beethoven, the song 'Ich liebe
dich'; the other of Schubert, part of a pianoforte composition. These,
with Brahms' autograph signature 'Joh. Brahms in April 1872,'
written at the bottom of one of the pages, constitute a unique
triplet. The sheet now belongs to the Gesellschaft library, and is
framed within glass.
The society of Hanslick, who came with his wife to stay near Mürz
Zuschlag for part of the summer, was very acceptable to Brahms.
The departure of his friends at the close of the season, in the
company of some mutual Vienna acquaintances, incited the
composer to an act of courtesy of a kind quite unusual with him, the
sequel to which seems to have caused him almost comical
annoyance that found expression in a couple of notes sent
immediately afterwards to Hanslick.
'Dearest Friend,
'Here I stand with roses and pansies; which means with a
basket of fruit, liqueurs and cakes! You must have
travelled through by the earlier Sunday extra train? I
made a good and unusual impression for politeness at the
station! The children are now rejoicing over the cakes....'
and, on finding that, mistaking the time of the train, he had arrived
a quarter of an hour late:
'How such a stupid thing can spoil one's day and the
thought of it recur to torment one. I hope you do not
know this as well as I, who am for ever preparing for
myself such vexatious worry....'
Later on, writing about other matters, he adds:
'... I hope Professor Schmidt's ladies do not describe my
promenade with the basket too graphically in Vienna!
Otherwise my unspoiled lady friends may cease to be so
unassuming.'[68]
The journeys of the winter included visits to Bremen and Oldenburg,
during which Hermine Spiess, one of the very favourite younger
interpreters of Brahms' songs, sang dainty selections of them to the
composer's accompaniment, with overwhelming success. The early
death of this gifted artist, soon after her marriage, caused the
master, with whom she was a great favourite, deep and sincere grief.
Brahms went also to Crefeld, where the 'Tafellied,' dedicated on
publication 'To the friends in Crefeld in remembrance of Jan. 28th
1885,' was sung on the date in question, with some of the new
part-songs a cappella, and other of the composer's works, at the jubilee of
the Crefeld Concert Society. The manuscript score of the 'Tafellied' is
in the possession of Herr Alwin von Beckerath, to whom it was
presented by Brahms with an affectionate inscription.
CHAPTER XX
1885-1888
Vienna Tonkünstlerverein—Fourth Symphony—Hugo Wolf
—Brahms at Thun—Three new works of chamber music—
First performances of the second Violoncello Sonata by
Brahms and Hausmann—Frau Celestine Truxa—Double
Concerto—Marxsen's death—Eugen d'Albert—The Gipsy
Songs—Conrat's translations from the Hungarian—Brahms
and Jenner—The 'Zum rothen Igel'—Ehrbar's asparagus
luncheons—Third Sonata for Pianoforte and Violin.
The early part of the year 1885 offers for record no event of unusual
interest to the reader. The greater portion of it was spent by Brahms
in his customary routine in Vienna. He was generally to be seen at
the weekly meetings of the Tonkünstlerverein, a musicians' club
founded by Epstein, Gänsbacher, and others, of which the master
had consented to be named honorary life-president. The Monday
evening proceedings included a short musical programme,
sometimes followed by an informal supper. Brahms did not usually
sit in the music-room, but would remain in a smaller apartment
smoking and chatting sociably with friends of either sex. His arrival
always became known at once to the assembled company, 'Brahms
is here; Brahms is come!' being passed eagerly from mouth to
mouth. His old love of open-air exercise had not diminished with
increasing years, and the Sunday custom of a long walk in the
country was still kept up. A few friends used to meet in the morning
outside the Café Bauer, opposite the Opera House, and, taking train
or tram to the outskirts of the city, would thence proceed on foot,
returning in the late afternoon. Brahms, nearly always in a good
humour on these occasions, was generally soon ahead of his
companions, or leading the way with the foremost, and, as had
usually been the case with him through life, was looked upon by his
friends as the chief occasion of their meetings, allowed his own way,
and admired as a kind of pet oracle. The excursions always
commenced for the season on his return to Vienna in the autumn,
and were continued with considerable regularity until his departure
in the spring. They not infrequently gave opportunity for the
employment of the composer's unfailing readiness of repartee, as on
the occasion of a meeting in the train, on the return journey, with a
learned but unmusical acquaintance of one of the party, between
whom and Brahms an animated conversation arose. 'Will you not
join us one day, Herr Doctor? Next Sunday, perhaps?' asked Brahms.
'I!' exclaimed the other. 'Saul among the prophets?' 'Na, so you give
yourself royal airs!' instantly rejoined the master.
The fourth symphony was completed during the summer at Mürz
Zuschlag, where Brahms this year had the advantage of Dr. and Frau
Fellinger's society, and—indispensable for his complete enjoyment of
a home circle—that of their children. Returning one afternoon from a
walk, he found that the house in which he lodged had caught fire,
and that his friends were busily engaged in bringing his papers, and
amongst them the nearly-finished manuscript of the new symphony,
into the garden. He immediately set to work to help in getting the
fire under, whilst Frau Fellinger sat out of doors with either arm
outspread on the precious papers piled on each side of her. Luckily,
all serious harm was averted, and it was soon possible to restore the
manuscripts intact to the composer's apartments.
Brahms paid a neighbourly call, in the course of the summer, on the
author Rosegger, who was living in his small country house at
Krieglach near Mürz Zuschlag, and tasted the unusual experience of
a repulse. Absorbed in work at the moment when his servant
announced 'a strange gentleman,' Rosegger, without glancing at the
card placed beside him, desired his visitor to 'sit down for a
moment.' Conscious only of the presence of a bearded stranger with
a gray overcoat over his shoulder and a light-coloured umbrella in
his hand, he vouchsafed but scant answer to the trifling remarks
with which his caller tried to pave the way to cordiality, and before
long Brahms composedly remarked that he would be on his legs
again, and took leave. It was not till some minutes after his
departure that it occurred to Rosegger to glance at the card, and he
has himself described the feelings of despair with which he read the
words 'Johannes Brahms' staring at him in all the reality of black on
white. Not he alone, but the ladies of his family, were enthusiastic
admirers of the composer's genius. He was so overwhelmed by his
mistake as to be incapable of taking any steps to remedy it, and
firmly declined to yield to the entreaties of his wife and daughter
that he would return the visit and explain matters to Brahms. He
published an amusing account of the misadventure in the year 1894
in an issue of the Heimgarten. Perhaps it may have fallen into the
master's hands.
The honour not only of the first, but of several subsequent early
performances of the Symphony in E minor, fell to the Meiningen
orchestra. The work was announced for the third subscription
concert of the season 1885-86, and shortly beforehand the score
and parts of the third and fourth movements were sent by the
composer to Meiningen for correction at a preliminary rehearsal
under Bülow. Three listeners were, by Bülow's invitation, present on
the occasion—the Landgraf of Hesse; Richard Strauss, the now
famous composer, who had succeeded Mannstädt as second
conductor of the Meiningen orchestra; and Frederic Lamond. The
lapse of another day or so brought Brahms himself with the first and
second movements, and the first public performance of the work
took place on October 25.
That the new symphony was enthusiastically received on the
occasion goes almost without saying. Persevering but unsuccessful
efforts were made by the audience to obtain a repetition of the third
movement, and the close of the work was followed by the emphatic
demonstration incident to a great success.
The work was repeated under Bülow's direction at the following
Meiningen concert of November 1, and was conducted by the
composer throughout a three weeks' tour on which he started with
Bülow and his orchestra immediately afterwards, and which included
the towns Siegen, Dortmund, Essen, Elberfeld, Düsseldorf,
Rotterdam, Utrecht, Amsterdam, the Hague, Arnheim, Crefeld, Bonn,
and Cologne. A performance at Wiesbaden followed, and the work
was heard for the first time in Vienna at the Philharmonic concert of
January 17, 1886, under Richter. This occasion was celebrated by a
dinner given by Billroth at the Hôtel Sacher, the guests invited to
meet the composer being Richter, Hanslick, Goldmark, Faber, Door,
Epstein, Ehrbar, Fuchs, Kalbeck, and Dömpke.
A new and important work by Brahms could hardly fail to obtain a
warm reception in Vienna at a period when the composer could look
back to thirty years' residence in the imperial city with which his
name had become as closely associated as those of Haydn, Mozart,
Beethoven, and Schubert; but though the symphony was applauded
by the public and praised by all but the inveterately hostile section of
the press, it did not reach the hearts of the Vienna audience in the
same unmistakable manner as its two immediate predecessors, both
of which had, as we have seen, made a more striking impression on
a first hearing in Austria than the first Symphony in C minor.
Strangely enough, the fourth symphony at once obtained some
measure of real appreciation in Leipzig, where the first had been far
more successful than the second and third. It was performed under
the composer at the Gewandhaus concert of February 18. The
account given of the occasion by the Leipziger Nachrichten is,
perhaps, the more satisfactory since our old friend Dörffel, who
might possibly have been suspected of partiality, had long since
retired from the staff of the journal. Bernhard Vögl, his second
successor, says:
'... The reception must, we think, have made amends to
Brahms for former ones, which, in Bülow's opinion, were
too cool. After each movement the hall resounded with
tumultuous and long-continued applause, and, at the
conclusion of the work, the composer was repeatedly
called forward.... The finale is certainly the most original
of the movements, and furnishes more complete
argument than has before been brought forward for the
opinion of those who see in Brahms the modern Sebastian
Bach. The movement is not only constructed on the form
displayed in Bach's Chaconne for violin, but is filled with
Bach's spirit. It is built up with astounding mastery upon
the eight notes,
and in such a manner that its contrapuntal learning remains
subordinate to its poetic contents.... It can be compared with no
former work of Brahms and stands alone in the symphonic literature
of the present and the past.'
A still more triumphant issue attended the production of the
symphony under Brahms at a concert of the Hamburg Cecilia Society
on April 9. Josef Sittard, who had recently been appointed musical
critic to the Hamburger Correspondenten, a post he has held to the
present day, wrote:
'To-day we abide by what we have affirmed for years past
in musical journals; that Brahms is the greatest
instrumental composer since Beethoven. Power, passion,
depth of thought, exalted nobility of melody and form, are
the qualities which form the artistic sign manual of his
creations. The E minor (fourth) Symphony is distinguished
from the second and third principally by the rigorous and
even grim earnestness which, though in a totally different
way, mark the first. More than ever does the composer
follow out his ideas to their conclusion, and this
unbending logic makes the immediate understanding of
the work difficult. But the oftener we have heard it, the
more clearly have its great beauties, the depth, energy
and power of its thoughts, the clearness of its classic
form, revealed themselves to us. In the contrapuntal
treatment of its themes, in richness of harmony and in the
art of instrumentation, it seems to us superior to the
second and third; these, perhaps, have the advantage of
greater melodic beauty; a guarantee of popularity. In
depth, power and originality of conception, however, the
fourth symphony takes its place by the side of the first....'
After an interesting discussion of the several movements, the writer
adds: 'In a word, the symphony is of monumental significance.'
Brahms' fourth symphony, produced when he was over fifty, is, in
the opinion of most musicians, unsurpassed by any other
achievement of his genius. It has during the past twenty years been
growing slowly into general knowledge and favour, and will, it may
be safely predicted, become still more deeply rooted in its place
amongst the composer's most widely-valued works. The second
movement, in the opinion of the late Philipp Spitta, 'does not find its
equal in the symphonic world'; and the fourth, written in
'Passacaglia' form, is the most astonishing illustration achieved even
by Brahms himself of the limitless capability of variation form, in
which he is pre-eminent.[69]
It is with something of a mournful feeling that we find ourselves at
the close of our enumeration of the master's four greatest
instrumental works. Enough, we may hope, has been said to indicate
that any comparison of the symphonies as inferior or superior is
impossible, for the reason that each, while perfectly fulfilling its own
particular destiny, is quite different from all the others, and such
natural preference as may be felt by this or that listener for either
must be considered as purely personal. The present writer may,
perhaps, be allowed to confess that, with all joy in the dainty second
and the magnificent third and fourth—emphatically the fourth—
neither appeals to her quite so strongly as the first. There is here a
quality of youth in the intensity of the soaring imagination that
seems to search the universe, which, presented as it is with the
wealth of resource that was at the command of the mature
composer, could not by its nature be other than unique. The
presence of this very quality may be the reason why the first
symphony suffers even more lamentably than its companions from
the dull, cold, cautious, 'classical' rendering which Brahms' orchestral
works receive at the hands of some conductors, who seem unable to
realize that a composer who founds his works on certain definite and
traditional principles of structure does not thereby change his
nature, or in any degree renounce the free exercise of his poetic
gifts.
Perhaps the present is as good an opportunity as may occur for
passing mention of a newspaper episode of the eighties, which was
much talked of for a few years, but which, though it may have
caused Brahms annoyance, could not possibly at this period of his
career have had any more serious consequence so far as he was
concerned.
Hugo Wolf, in 1884 a young aspirant to fame, seeking recognition
but finding none, poor, gifted, disappointed, weak in health, highly
nervous, without influential friends, accepted an opportunity of
increasing his miserably small means of subsistence by becoming the
musical critic of the Salon Blatt, a weekly society paper of Vienna,
and soon made for himself an unenviable notoriety by his persistent
attacks upon Brahms' compositions. The affair would not now
demand mention in a biography of our master if it were not that the
posthumous recognition afforded to Wolf's art gives some interest,
though not of an agreeable nature, to this association of his name
with that of Brahms. For the benefit of those readers who may wish
to study the matter further, it may be added that Wolf's criticisms
have been republished since his death. For ourselves, having done
what was, perhaps, incumbent on us by referring to the matter, we
shall adopt what we believe would have been Brahms' desire, by
allowing it, so far as these pages are concerned, to follow others of
the kind to oblivion.
The summer of 1886 was the first of the three seasons passed by
Brahms at Thun, of which Widmann has written so charming an
account. He rented the entire first-floor of a house opposite the spot
where the river Aare flows out of the lake, the ground-floor being
occupied by the owner, who kept a little haberdashery shop.
According to his general custom, he dined in fine weather in the
garden of some inn, occasionally alone, but oftener in the company
of a friend or friends. Every Saturday he went to Bern to remain till
Monday or longer with the Widmanns, who, like other friends, found
him a most considerate and easily satisfied guest, though his
exceptional energy of body and mind often made it exhausting work
to keep up with him.
'His week-end visits were,' says Widmann, 'high festivals
and times of rejoicing for me and mine; days of rest they
certainly were not, for the constantly active mind of our
guest demanded similar wakefulness from all his
associates and one had to pull one's self well together to
maintain sufficient freshness to satisfy the requirements of
his indefatigable vitality.... I have never seen anyone who
took such fresh, genuine and lasting interest in the
surroundings of life as Brahms, whether in objects of
nature, art, or even industry. The smallest invention, the
improvement of some article for household use, every
trace, in short, of practical ingenuity gave him real
pleasure. And nothing escaped his observation.... He
hated bicycles because the flow of his ideas was so often
disturbed by the noiseless rushing past, or the sudden
signal, of these machines, and also because he thought
the trampling movement of the rider ugly. He was,
however, glad to live in the age of great inventions and
could not sufficiently admire the electric light, Edison's
phonographs, etc. He was equally interested in the animal
world. I always had to tell him anew about the family
customs of the bears in the Bern bear-pits before which
we often stood together. Indeed, subjects of conversation
seemed inexhaustible during his visits.'[70]
Brahms' ordinary costume, the same here as elsewhere, was chosen
quite without regard to appearances. Mere lapse of time must
occasionally have compelled him to wear a new coat, but it is safe to
conclude that his feelings suffered discomposure on the rare
occurrence of such a crisis. Neckties and white collars were reserved
as special marks of deference to conventionality. During his visits to
Thun he used on wet Saturdays to appear at Bern wearing 'an old
brown-gray plaid fastened over his chest with an immense pin,
which completed his strange appearance.' Many were the books
borrowed from Widmann at the beginning, and brought back at the
end, of the week, carried by him in a leather bag slung over his
shoulder. Most of them were standard works; he was not devoted to
modern literature on the whole, though he read with pleasure new
and really good books of history and travel, and was fond of
Gottfried Keller's novels and poems. Over engravings and
photographs of Italian works of art he would pore for hours, never
weary of discussing memories and predilections with his friend.
Visits to the Bern summer theatre, a short mountain tour with
Widmann, an introduction to Ernst von Wildenbruch, whose dramas
the master liked, and with whom he now found himself in personal
sympathy—events such as these served to diversify the summer
season of 1886, which was made musically noteworthy by the
composition of a group of chamber works, the Sonatas in A and F
major for pianoforte with violin and violoncello respectively, and the
Trio in C minor for pianoforte and strings. The Sonatas were
performed for the first time in public in Vienna; severally by Brahms
and Hellmesberger, at the Quartet concert of December 2, and by
Brahms and Hausmann at Hausmann's concert of November 24; the
Trio was introduced at Budapest about the same time by Brahms,
Hubay, and Popper, in each case from the manuscript.
Detailed discussion of these works is superfluous; two of them, at all
events, are amongst the best known of Brahms' compositions. The
Sonata for pianoforte and violoncello in F is the least familiar of the
group, but assuredly not because it is inferior to its companions. It
is, indeed, one of the masterpieces of Brahms' later concise style.
Each movement has a remarkable individuality of its own, whilst all
are unmistakably characteristic of the composer. The first is broad
and energetic, the second profoundly touching, the third vehemently
passionate—in the Brahms' signification of the word, be it noted,
which means that the emotions are reached through the intellectual
imagination—the fourth written from beginning to end in a spirit of
vivacity and fun. The work was tried in the first instance at Frau
Fellinger's house. 'Are you expecting Hausmann?' Brahms inquired
carelessly of this lady soon after his return in the autumn. Frau
Fellinger, suspecting that something lay behind the question,
telegraphed to the great violoncellist, who usually stayed at her
house when in Vienna, to come as soon as possible, if only for a day.
He duly appeared, and the new sonata was played by Brahms and
himself on the evening of his arrival. They performed it again the
day before the concert above recorded, at a large party at Billroth's.
The last movement of the beautiful Sonata in A for pianoforte and
violin is sometimes criticised as being almost too concise. The
present writer confesses that she always feels it to be so, and one
day confided this sentiment to Joachim, who did not agree with her,
but said that the coda was originally considerably longer. 'Brahms
told me he had cut a good deal away; he aimed always at
condensation.'
Dr. Widmann allows us to publish an English version of a poem
written by him on this work, the original of which is published in the
appendix to his 'Brahms Recollections.' We have desired to place it
before our English-speaking readers, not only because it coincides
remarkably with what we related in our early chapters of the
delicate, fanciful tastes of the youthful Hannes, but because it gave
pleasure to the Brahms of fifty-three, and even of sixty-three, and
thus seems to illustrate the fact on which we have insisted, that if in
any case then in our master's, the child was father to the man. Only
a year before his death the great composer wrote to Widmann to
beg for one or two more copies of the poem, which had been
printed for private circulation.
THE THUN SONATA.
Poem on the Sonata in A for Pianoforte and Violin, Op. 100,
By Johannes Brahms,
WRITTEN BY
J. V. WIDMANN.
There where the Aare's waters gently glide
From out the lake and flow towards the town,
Where pleasant shelter spreading trees provide,
Amidst the waving grass I laid me down;
And sleeping softly on that summer day,
I saw a wondrous vision as I lay.

Three knights rode up on proudly stepping steeds,
Tiny as elves, but with the mien of kings,
And spake to me: 'We come to search the meads,
To seek a treasure here, of precious things
Amongst the fairest; wilt thou help us trace
A new-born child, a child of heav'nly race?'

'And who are ye?' I, dreaming, made reply;
'Knights of the golden meadows' then they said,
'That at the foot of yonder Niesen[71] lie;
And in our ancient castles many a maid
Hath listened to the greeting of our strings,
Long mute and passed amid forgotten things.

'But lately tones were heard upon the lake,
A sound of strings whose like we never knew,
So David played, perhaps, for Saul's dread sake,
Soothing the monarch curtained from his view;
It reached us as it softly swelled and sank,
And drew us, filled with longing, to this bank.

'Then help us search, for surely from this place,
This meadow by the river, came the sound;
Help us then here the miracle to trace,
That we may offer homage when 'tis found.
Sleeps under flow'rs the new-born creature rare?
Or is it floating in the evening air?'

But ere they ceased, a sudden rapid twirl
Ruffled the waters, and, before our eyes,
A fairy boat from out the wavelet's whirl
Floated up stream, guided by dragon-flies;
Within it sat a sweet-limbed, fair-haired may,
Singing as to herself in ecstasy.

'To ride on waters clear and cool is sweet,
For clear as deep my being's living source;
To open worlds where joy and sorrow meet,
Each flowing pure and full in mingling course;
Go on, my boat, upstream with happy cheer,
Heaven is reposing on the tranquil mere.'

So sang the fairy child and they that heard
Owned, by their swelling hearts, the music's might,
The knights had only tears, nor spake a word,
Welling from pain that thrilled them with delight;
But when the skiff had vanished from their eyes,
The eldest, pointing, said in tender wise:

'Thou beauteous wonder of the boat, farewell,
Sweet melody, revealed to us to-day;
We that with slumb'ring minnesingers dwell,
Bid thee Godspeed, thou guileless stranger fay;
Our land is newly consecrate in thee
That rang of old with fame of minstrelsy.

'Now we may sleep again amongst our dead,
The harper's holy spirit is awake,
And as the evening glory, purple-red,
Shineth upon our Alps and o'er our lake,
And yet on distant mountain sheds its light,
Throughout the earth this song will wing its flight.

'Yet, though subduing many a list'ning throng,
In stately town, in princely hall it sound,
To this our land it ever will belong,
For here on flowing river it was found.'
Fervent and glad the minnesinger spake;
'Yes!' cried my heart—and then I was awake.
Whilst our master had been living through the spring and summer
months in the enchanted world of his imagination, coming out of it
only for brief intervals of sojourn in earth's pleasant places amidst
the companionship of chosen friends, certain hard, commonplace
realities of the workaday world, which had arisen earlier at home in
Vienna, were still awaiting a satisfactory solution. The death of the
occupier of the third-floor flat of No. 4, Carlsgasse, the last
remaining member of the family with whom Brahms had lodged for
fourteen or fifteen years, had confronted him with the necessity of
choosing between several alternatives almost equally disagreeable to
him, concerning which it is only necessary to say that he had
avoided the annoyance of a removal by taking on the entire dwelling
direct from the landlord, and had escaped the disturbance of having
to replace the furniture of his rooms by accepting the offer of friends
to lend him sufficient for his absolute needs. Arrangements and all
necessary changes were made during his absence. To Frau Fellinger
Brahms had entrusted the keys of the flat and of his rooms, which
under her directions were brought into apple-pie order by the time
of his return, the drawers being tidied, and a list of the contents of
each neatly drawn up on a piece of cardboard, so that everything
should be ready to his hand. The greatest difficulty, however, still
remained. Who was to keep the rooms in order and see to the very
few of Brahms' daily requirements which he was not in the habit of
looking after himself? His coffee, as we know, he always prepared at
a very early hour in the morning, and he was kept provided with a
regular supply of the finest Mocha by a lady friend at Marseilles.
Dinner, afternoon coffee, and often supper, were taken away from
home. The master now declared he would have no one in the flat.
To as many visitors as he felt disposed to admit he could himself
open the door, whilst the cleaning and tidying of the rooms could be
done by the 'Hausmeisterin,' an old woman occupying a room in the
courtyard, and responsible for the cleaning of the general staircase,
etc. In vain Frau Fellinger contested the point. Brahms was
inflexible, and this kind lady apparently withdrew her opposition to
his plan, though remaining quietly on the look-out for an opportunity
of securing more suitable arrangements. By-and-by it presented
itself. In Frau Celestine Truxa, the widow of a journalist, whose
family party consisted of two young sons and an old aunt, Frau
Fellinger felt that she saw a most desirable tenant for the Carlsgasse
flat, and after a renewed attack on the master, whose arguments,
founded on the immaculate purity of his rooms under the old
woman's care, she irretrievably damaged by lifting a sofa cushion
and laying bare a collection of dust, which she declared would soon
develop into something worse, he was so far shaken as to say that if
she would make inquiries for him he would consider her views. Frau
Fellinger wisely abstained from further discussion, but after a few
days Frau Truxa herself, having been duly advised to open the
matter to Brahms with diplomatic sang-froid, went in person to apply
for the dwelling. After her third ring at the door-bell, the door was
opened by the master himself, who started in dismay at seeing a
strange lady standing in front of him.
'I have come to see the flat,' said Frau Truxa.
'What!' cried Brahms.
'I have heard there is an empty flat here, and have come to look at
it,' responded Frau Truxa indifferently; 'but perhaps it is not to let?'
A moment's pause, and the composer's suspicious expression
relaxed.
'Frau Dr. Fellinger mentioned the circumstances to me,' she
continued, 'and I thought they might suit me.'
By this time Brahms had become sufficiently reassured to show the
rooms and to listen, though without remark, to a brief description of
Frau Truxa's family and of the circumstances in which she found
herself.
'Perhaps, Dr. Brahms, you will consider the matter,' she concluded,
'and communicate with me if you think further of it. If I hear nothing
more from you, I shall consider the matter at an end.'
After about a week, during which Frau Truxa kept her own
confidence, her maid came one day to tell her a gentleman had
called to see her. Being engaged at the moment, she asked her aunt
to ascertain his business, but the old lady returned immediately with
a frightened look.
'I don't know what to think!' she exclaimed; 'there is a strange-
looking man walking about in the next room measuring the furniture
with a tape!'
'The things will all go in!' exclaimed the master as Frau Truxa hurried
to receive him.
The upshot was that the master gave up the tenancy of the flat,
returning to his old irresponsible position as lodger, whilst Frau
Truxa, bringing her household with her, stepped into the position of
his former landlady, thereby giving Brahms cause to be grateful for
the remainder of his life for Frau Fellinger's wise firmness. He was,
says Frau Truxa, perfectly easy to get on with; all he desired was to
be let alone. He was extremely orderly and neat in his ways, and
expected the things scattered about his room to be dusted and kept
tidy, but was vexed if he found the least trifle at all displaced—even
if his glasses were turned the wrong way—and, without making
direct allusion to the subject, would manage to show that he had
noticed it. Observing, after she had been a little time in the flat, that
he always rearranged the things returned from the laundress after
they had been placed in their drawer, she asked him why he did so.
'Only,' he said, 'because perhaps it is better that those last sent back
should be put at the bottom, then they all get worn alike.' A glove or
other article requiring a little mending would be placed carelessly at
the top of a drawer left open as if by accident. The next day he
would observe to Frau Truxa, 'I found my glove mended last night; I
wonder who can have done it!' and on her replying, 'I did it, Herr
Doctor,' would answer, 'You? How very kind!'
Frau Truxa came to respect and honour the composer more and
more the longer he lived in her house. She made his peculiarities her
study, and after a short time understood his little signs, and was
able to supply his requirements as they arose without being
expressly asked to do so. It is almost needless to say that he took
great interest in her two boys, and once, when she was summoned
away from Vienna to the sick-bed of her father, begged that the
maid-servant might be instructed to give all her attention to the
children during their mother's absence, even if his rooms were
neglected. 'I can take care of myself, but suppose something were to
happen to the children whilst the girl was engaged for me!' Every
night whilst Frau Truxa was away, the master himself looked in on
the boys to assure himself of their being safe in bed. For the old
aunt he always had a pleasant passing word.
The fourth Symphony and two books of Songs were published in
1886, and the three new works of chamber music, Op. 99, 100, 101,
in 1887. Of the songs we would select for particular mention the
wonderfully beautiful setting of Heine's verses:
'Death is the cool night,
Life is the sultry day,'
Op. 96, No. 1, and Nos. 1 and 2 of Op. 97.
Brahms' Italian journey in the spring of 1887 was made in the
company of Simrock and Kirchner. The following year he travelled in
Widmann's society, visiting Verona, Bologna, Rimini, Ancona, Loretto,
Rome, and Turin. Widmann sees in Brahms' spiritual kinship with the
masters of the Italian Renaissance the chief secret of his love for
Italy.
'Their buildings, their statues, their pictures were his
delight and when one witnessed the absorbed devotion
with which he contemplated their works, or heard him
admire in the old masters a trait conspicuous in himself,
their conscientious perfection of detail ... even where it
could hardly be noticeable to the ordinary observer, one
could not help instituting the comparison between himself
and them.'
Brahms had an interview when on this journey with the now famous
Italian composer Martucci, who displayed a thorough familiarity with
the works of the German master.
Amongst the friends and acquaintances whom the composer met at
Thun during his second and third summers there were the Landgraf
of Hesse, Hanslick, Gottfried Keller, Professor Bächthold, Hermine
Spiess and her sister, Gustav Wendt, the Hegars, Max Kalbeck,
Steiner, Claus Groth, etc. One day, as he had started for a walk, he
was stopped by a stranger, who asked if he knew where Dr. Brahms
lived. 'He lives there,' replied the master, pointing to the
haberdasher's shop. 'Do you know if he is at home?' 'That I cannot
tell you,' was the reply. 'But go and ask in the shop; you will
certainly be able to find out there.' The gentleman followed this
advice, sent his card up, and received the answer that the Doctor
was at home, and would be pleased to see him. To his surprise, on
ascending the stairs, he found his newly-formed acquaintance
waiting for him at the top.
Brahms' Lodgings near Thun.
Photograph by Moegle, Thun.
The rumour revived in the summer of 1887 that Brahms was
engaged on an opera. This came about, perhaps, from his intimacy
with Widmann. 'I am composing the entr'actes,' he jestingly replied
to the Landgraf's question as to whether the report had any
foundation. As a matter of fact, the subject of opera was not
mentioned between the composer and his friend at this time.
The works which really occupied Brahms during the summer of 1887
were the double Concerto for violin and violoncello, with orchestral
accompaniment, and the 'Gipsy Songs.'
The Concerto was performed privately, immediately on its
completion, in the 'Louis Quinze' room of the Baden-Baden Kurhaus.
Brahms conducted, and the solo parts were performed by Joachim
and Hausmann. Amongst the listeners were Frau Schumann and her
eldest daughter, Rosenhain, Lachner, the violoncellist Hugo Becker,
and Gustav Wendt. The work was heard in public for the first time in
Cologne on October 15, Brahms conducting, and Joachim and
Hausmann playing the solos as before; and the next performances,
carried out under the same unique opportunities for success, were in
Wiesbaden, Frankfurt, and Basle, on November 17, 18, and 20.
In the autumn of this year one of the few remaining figures linked
with the most cherished associations of Brahms' early youth passed
away. Marxsen died on November 17, 1887, at the age of eighty-one,
having retained to the end almost unimpaired vigour of his mental
faculties. The last great pleasure of his life was associated with his
beloved art. In spite of great bodily weakness, he managed to be
present a week before his death at a concert of the Hamburg
Philharmonic Society to hear a performance of the 'ninth' Symphony.
'I am here for the last time,' he said, pressing Sittard's hand; and he
passed peacefully away fourteen days later.
A few years previously his artistic jubilee had been celebrated in
Hamburg, and his dear Johannes had surprised him with the proof-
sheets of a set of one hundred Variations composed long ago by
Marxsen, not with a view to publication, but as a practical illustration
of the inexhaustible possibilities contained in the art of thematic
development. Brahms, who happened to see the manuscript in
Marxsen's room during one of his subsequent visits to Hamburg, was
so strongly interested in it that in the end Marxsen gave it him, with
leave to do as he should like with it after his death. The parcel of
proof-sheets was accompanied by an affectionate letter, in which
Brahms begged forgiveness for having anticipated this permission
and yielded to his desire of placing the work within general reach
during his master's lifetime; and perhaps no jubilee honour of which
the old musician was the recipient filled him with such lively joy as
was caused by this tribute. Marxsen's name as a composer is,
indeed, now forgotten without chance of revival, but his memory will
live gloriously in the way he would have chosen, carried through the
years by the hand that wrote the great composer's acknowledgment
to his teacher on the title-page of the Concerto in B flat.
Four more performances from the manuscript of the double concerto
of interest in our narrative remain to be chronicled—those of the
Leipzig Gewandhaus, under Brahms, on January 1, 1888; of the
Berlin Philharmonic Society, under Bülow, of February 6; and of the
London Symphony Concerts, under Henschel, on February 15 and
21. The work, published in time for the autumn season, was given in
Vienna at the Philharmonic concert of December 23 under Richter.
On all these occasions the solos were played, as before, by Joachim
and Hausmann.
Bülow, having at this time resigned his post at Meiningen, had
entered on a period of activity as conductor in some of the northern
cities of Germany, and particularly in Hamburg and Berlin. His future
programmes, in which our master's works were well represented,
though not with the conspicuous prominence that had been possible
at Meiningen, do not fall within the scope of these pages, since, with
the mention of the double concerto, the enumeration of Brahms'
orchestral works is complete. Bülow's successor at Meiningen, Court
Capellmeister Fritz Steinbach, carried on the traditions and
preferences of the little Thuringian capital as he found them, until
his removal to Cologne a year or two ago, and has become
especially appreciated as a conductor of the works of Brahms, whose
personal friendship and artistic confidence he enjoyed in a high
degree.
The name of Eugen d'Albert, whose great gifts and attainments were
warmly recognised by Brahms, should not be omitted from our
pages, though detailed account of his relations with the master is
outside their limits. D'Albert's fine performances of the pianoforte
concertos helped to make these works familiar to many Continental
audiences, and certainly contributed, during the second half of the
eighties, to the better understanding of the great composer which
has gradually come to prevail at Leipzig.
But little needs to be said about the double concerto. This fine work,
which may be regarded as in some sort a successor to the double
and triple concertos of Mozart and Beethoven, exhibits all the power
of construction, the command of resource, the logical unity of idea,
characteristic of Brahms' style, whilst its popularity has been
hindered by the same cause that has retarded that of the pianoforte
concertos; the solo parts do not stand out sufficiently from the
orchestral accompaniment to give effective opportunity for the
display of virtuosity, in the absence of which no performer, appearing
before a great public as the exponent of an unfamiliar work for an
accompanied solo instrument, has much chance of sustaining the
lively interest of his audience in the composition. Of the three
movements of the double concerto, the first is especially interesting
to musicians, whilst the second, a beautiful example of Brahms'
expressive lyrical muse, appeals equally to less technically prepared
listeners. On the copy of the work presented by Brahms to Joachim
the words are inscribed in the composer's handwriting: 'To him for
whom it was written.'
Widely contrasted in every respect was the other new work of 1887,
introduced to the private circle of Vienna musicians at the last
meeting for the season of the Tonkünstlerverein in April, 1888. The
eleven four-part 'Gipsy Songs,' published in the course of the year as
Op. 103, were sung from the manuscript by Fräulein Walter, Frau
Gomperz-Bettelheim, Gustav Walter, and Weiglein of the imperial
opera, to the composer's accompaniment. Brahms obtained the texts
of this characteristic and attractive work from a collection of twenty-
five 'Hungarian Folk-songs' translated into German by Hugo Conrat,
and published in Budapest, with their original melodies set by Zoltan
Nagy for mezzo-soprano or baritone, with the addition of pianoforte