Data Streams
Models and Algorithms
ADVANCES IN DATABASE SYSTEMS
Series Editor
Ahmed K. Elmagarmid
Purdue University
West Lafayette, IN 47907
Other books in the Series:
SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6
STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3
FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5
ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5
ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-4020-7067-5
INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and Marcela Genero; ISBN: 0-7923-7599-8
DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee; ISBN: 0-7923-7215-8
THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R. L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic; ISBN: 0-7923-7840-7
ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
For a complete listing of books in this series, go to http://www.springer.com
Data Streams
Models and Algorithms
edited by
Charu C. Aggarwal
IBM, T.J. Watson Research Center
Yorktown Heights, NY, USA
Springer
Charu C. Aggarwal
IBM
Thomas J. Watson Research Center
19 Skyline Drive
Hawthorne, NY 10532
Library of Congress Control Number: 2006934111
DATA STREAMS: Models and Algorithms edited by Charu C. Aggarwal
ISBN-10: 0-387-28759-0
ISBN-13: 978-0-387-28759-1
e-ISBN-10: 0-387-47534-6
e-ISBN-13: 978-0-387-47534-9
Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch
utilizing NRL's GIDB Portal System, available at
http://dmap.nrlssc.navy.mil
Printed on acid-free paper.
© 2007 Springer Science+Business Media, LLC.
All rights reserved. This work may not be translated or copied in whole or
in part without the written permission of the publisher (Springer
Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and
retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks and
similar terms, even if they are not identified as such, is not to be taken as
an expression of opinion as to whether or not they are subject to
proprietary rights.
Contents
List of Figures
List of Tables
Preface
1
An Introduction to Data Streams
Charu C. Aggarwal
1. Introduction
2. Stream Mining Algorithms
3. Conclusions and Summary
References
2
On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
1. Introduction
2. The Micro-clustering Based Stream Mining Framework
3. Clustering Evolving Data Streams: A Micro-clustering Approach
3.1 Micro-clustering Challenges
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
3.3 High Dimensional Projected Stream Clustering
4. Classification of Data Streams: A Micro-clustering Approach
4.1 On-Demand Stream Classification
5. Other Applications of Micro-clustering and Research Directions
6. Performance Study and Experimental Results
7. Discussion
References
3
A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
1. Introduction
2. Research Issues
3. Solution Approaches
4. Classification Techniques
4.1 Ensemble Based Classification
4.2 Very Fast Decision Trees (VFDT)
4.3 On Demand Classification
4.4 Online Information Network (OLIN)
4.5 LWClass Algorithm
4.6 ANNCAD Algorithm
4.7 SCALLOP Algorithm
5. Summary
References
4
Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
1. Introduction
2. Overview
3. New Algorithm
4. Work on Other Related Problems
5. Conclusions and Future Directions
References
5
A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
1. Introduction
2. The Velocity Density Method
2.1 Spatial Velocity Profiles
2.2 Evolution Computations in High Dimensional Case
2.3 On the use of clustering for characterizing stream evolution
3. On the Effect of Evolution in Data Mining Algorithms
4. Conclusions
References
6
Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang
1. Introduction
2. Problem Definition
3. Architecture for On-line Analysis of Data Streams
3.1 Tilted time frame
3.2 Critical layers
3.3 Partial materialization of stream cube
4. Stream Data Cube Computation
4.1 Algorithms for cube computation
5. Performance Study
6. Related Work
7. Possible Extensions
8. Conclusions
References
7
Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
1. Load Shedding for Aggregation Queries
1.1 Problem Formulation
1.2 Load Shedding Algorithm
1.3 Extensions
2. Load Shedding in Aurora
3. Load Shedding for Sliding Window Joins
4. Load Shedding for Classification Queries
5. Summary
References
8
The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
0.1 Motivation and Road Map
1. A Solution to the BasicCounting Problem
1.1 The Approximation Scheme
2. Space Lower Bound for the BasicCounting Problem
3. Beyond 0's and 1's
4. References and Related Work
5. Conclusion
References
9
A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
1. Introduction
2. Sampling Methods
2.1 Random Sampling with a Reservoir
2.2 Concise Sampling
3. Wavelets
3.1 Recent Research on Wavelet Decomposition in Data Streams
4. Sketches
4.1 Fixed Window Sketches for Massive Time Series
4.2 Variable Window Sketches of Massive Time Series
4.3 Sketches and their applications in Data Streams
4.4 Sketches with p-stable distributions
4.5 The Count-Min Sketch
4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
4.7 Advantages and Limitations of Sketch Based Methods
5. Histograms
5.1 One Pass Construction of Equi-depth Histograms
5.2 Constructing V-Optimal Histograms
5.3 Wavelet Based Histograms for Query Answering
5.4 Sketch Based Methods for Multi-dimensional Histograms
6. Discussion and Challenges
References
10
A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
1. Introduction
2. Model and Semantics
3. State Management for Stream Joins
3.1 Exploiting Constraints
3.2 Exploiting Statistical Properties
4. Fundamental Algorithms for Stream Join Processing
5. Optimizing Stream Joins
6. Conclusion
Acknowledgments
References
11
Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
1. Introduction
2. Indexing Streams
2.1 Preliminaries and definitions
2.2 Feature extraction
2.3 Index maintenance
2.4 Discrete Wavelet Transform
3. Querying Streams
3.1 Monitoring an aggregate query
3.2 Monitoring a pattern query
3.3 Monitoring a correlation query
4. Related Work
5. Future Directions
5.1 Distributed monitoring systems
5.2 Probabilistic modeling of sensor networks
5.3 Content distribution networks
6. Chapter Summary
References
12
Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
1. Related work
2. Principal component analysis (PCA)
3. Auto-regressive models and recursive least squares
4. MUSCLES
5. Tracking correlations and hidden variables: SPIRIT
6. Putting SPIRIT to work
7. Experimental case studies
8. Performance and accuracy
9. Conclusion
Acknowledgments
References
13
A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
1. Introduction
2. Outlier and Anomaly Detection
3. Clustering
4. Frequent itemset mining
5. Classification
6. Summarization
7. Mining Distributed Data Streams in Resource Constrained Environments
8. Systems Support
References
14
Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen
1. Introduction
2. Motivation: Why Distributed Data Stream Mining?
3. Existing Distributed Data Stream Mining Algorithms
4. A local algorithm for distributed data stream mining
4.1 Local Algorithms: definition
4.2 Algorithm details
4.3 Experimental results
4.4 Modifications and extensions
5. Bayesian Network Learning from Distributed Data Streams
5.1 Distributed Bayesian Network Learning Algorithm
5.2 Selection of samples for transmission to global site
5.3 Online Distributed Bayesian Network Learning
5.4 Experimental Results
6. Conclusion
References
15
A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
1. Challenges
2. The Data Collection Model
3. Data Communication
4. Query Processing
4.1 Aggregate Queries
4.2 Join Queries
4.3 Top-k Monitoring
4.4 Continuous Queries
5. Compression and Modeling
5.1 Data Distribution Modeling
5.2 Outlier Detection
6. Application: Tracking of Objects using Sensor Networks
7. Summary
References
Index
List of Figures
Micro-clustering Examples 11
Some Simple Time Windows 11
Varying Horizons for the classification process 23
Quality comparison (Network Intrusion dataset, horizon=256, stream_speed=200) 30
Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200) 30
Accuracy comparison (Network Intrusion dataset, stream_speed=80, buffer_size=1600, kfit=80, init_number=400) 31
Distribution of the (smallest) best horizon (Network Intrusion dataset, Time units=2500, buffer_size=1600, kfit=80, init_number=400) 31
Accuracy comparison (Synthetic dataset B300kC5D20, stream_speed=100, buffer_size=500, kfit=25, init_number=400) 31
Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, kfit=25, init_number=400) 32
Stream Proc. Rate (Charit. Donation data, stream_speed=2000) 33
Stream Proc. Rate (Ntwk. Intrusion data, stream_speed=2000) 33
Scalability with Data Dimensionality (stream_speed=2000) 34
Scalability with Number of Clusters (stream_speed=2000) 34
The ensemble based classification method 53
VFDT Learning Systems 54
On Demand Classification 54
Online Information Network System 55
Algorithm Output Granularity 55
ANNCAD Framework 56
SCALLOP Process 56
Karp et al. Algorithm to Find Frequent Items 68
Improving Algorithm with An Accuracy Bound 71
StreamMining-Fixed: Algorithm Assuming Fixed Length Transactions 73
Subroutines Description 73
StreamMining-Bounded: Algorithm with a Bound on Accuracy 75
StreamMining: Final Algorithm
The Forward Time Slice Density Estimate
The Reverse Time Slice Density Estimate
The Temporal Velocity Profile
The Spatial Velocity Profile
A tilted time frame with natural time partition
A tilted time frame with logarithmic time partition
A tilted time frame with progressive logarithmic time partition
Two critical layers in the stream cube
Cube structure from the m-layer to the o-layer
H-tree structure for cube computation
Cube computation: time and memory usage vs. # tuples at the m-layer for the data set D5L3C10
Cube computation: time and space vs. # of dimensions for the data set L3C10T100K
Cube computation: time and space vs. # of levels for the data set D5C10T50K
Data Flow Diagram
Illustration of Example 7.1
Illustration of Observation 1.4
Procedure SetSamplingRate(x, R)
Sliding window model notation
An illustration of an Exponential Histogram (EH).
Illustration of the Wavelet Decomposition
The Error Tree from the Wavelet Decomposition
Drifting normal distributions.
Example ECBs.
ECBs for sliding-window joins under the frequency-based model.
ECBs under the age-based model.
The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data. 240
Exact feature extraction, update rate T = 1. 241
Incremental feature extraction, update rate T = 1. 241
Approximate feature extraction, update rate T = 1.
Incremental feature extraction, update rate T = 2.
Transforming an MBR using discrete wavelet transform. Transformation corresponds to rotating the axes (the rotation angle = 45° for Haar wavelets) 247
Aggregate query decomposition and approximation composition for a query window of size w = 26. 249
Subsequence query decomposition for a query window of size |Q| = 9. 253
Illustration of problem. 262
Illustration of updating w1 when a new point x_{t+1} arrives. 266
Chlorine dataset. 279
Mote dataset. 280
Critter dataset 281
Detail of forecasts on Critter with blanked values. 282
River data. 283
Wall-clock times (including time to update forecasting models). 284
Hidden variable tracking accuracy.
Centralized Stream Processing Architecture (left) Distributed Stream Processing Architecture (right)
(A) The area inside an ε circle. (B) Seven evenly spaced vectors u1 ... u7. (C) The borders of the seven half-spaces u_i · x ≥ ε define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces.
Quality of the algorithm with increasing number of nodes
Cost of the algorithm with increasing number of nodes
ASIA Model
Bayesian network for online distributed parameter learning
Simulation results for online Bayesian learning: (left) KL distance between the conditional probabilities for the networks B_ol(k) and B_b for three nodes (right) KL distance between the conditional probabilities for the networks B_ol(k) and B_b for three nodes
An instance of dynamic cluster assignment in sensor system according to the LEACH protocol. Sensor nodes of the same clusters are shown with same symbol and the cluster heads are marked with highlighted symbols.
Interest Propagation, gradient setup and path reinforcement for data propagation in directed-diffusion paradigm.
Event is described in terms of attribute value pairs. The figure illustrates an event detected based on the location of the node and target detection.
Sensors aggregating the result for a MAX query in-network
Error filter assignments in tree topology. The nodes that are shown shaded are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by E_i, i.e., the maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval
Usage of duplicate-sensitive sketches to allow result propagation to multiple parents providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (level 2 in the figure) is received at more than one node in the lower level (Level 1 in the figure)
(a) Two dimensional Gaussian model of the measurements from sensors S1 and S2 (b) The marginal distribution of the values of sensor S1, given S2: New observations from one sensor is used to estimate the posterior density of the other sensors
Estimation of probability distribution of the measurements over sliding window
Trade-offs in modeling sensor data
Tracking a target. The leader nodes estimate the probability of the target's direction and determine the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted
List of Tables
An example of snapshots stored for α = 2 and l = 2
A geometric time window
Data Based Techniques
Task Based Techniques
Typical LWClass Training Results
Summary of Reviewed Techniques
Algorithms for Frequent Itemsets Mining over Data Streams
Summary of results for the sliding-window model.
An Example of Wavelet Coefficient Computation
Description of notation.
Description of datasets.
Reconstruction accuracy (mean squared error rate).
Preface
In recent years, the progress in hardware technology has made it possible
for organizations to store and record large streams of transactional data. Such
data sets, which continuously and rapidly grow over time, are referred to as data
streams. In addition, the development of sensor technology has resulted in
the possibility of monitoring many events in real time. While data mining has
become a fairly well established field now, the data stream problem poses a
number of unique challenges which are not easily solved by traditional data
mining methods.
The topic of data streams is a very recent one. The first research papers on
this topic appeared slightly under a decade ago, and since then this field has
grown rapidly. There is a large volume of literature which has been published
in this field over the past few years. The work is also of great interest to
practitioners in the field who have to mine actionable insights with large volumes
of continuously growing data. Because of the large volume of literature in the
field, practitioners and researchers may often find it an arduous task to isolate
the right literature for a given topic. In addition, from a practitioner's point of
view, the use of research literature is even more difficult, since much of the
relevant material is buried in publications. While handling a real problem, it
may often be difficult to know where to look in order to solve the problem.
This book contains contributed chapters from a variety of well known
researchers in the data mining field. While the chapters are written by different
researchers, the topics and content are organized in such a way as to present
the most important models, algorithms, and applications in the data mining
field in a structured and concise way. In addition, the book is organized
in order to make it more accessible to application driven practitioners. Given
the lack of structurally organized information on the topic, the book will
provide insights which are not easily accessible otherwise. In addition, the book
will be a great help to researchers and graduate students interested in the topic.
The popularity and current nature of the topic of data streams is likely to make
it an important source of information for researchers interested in the topic.
The data mining community has grown rapidly over the past few years, and the
topic of data streams is one of the most relevant and current areas of interest to
the community. This is because of the rapid advancement of the field of data
streams in the past two to three years. While the data stream field clearly falls
in the emerging category because of its recency, it is now beginning to reach a
maturation and popularity point, where the development of an overview book
on the topic becomes both possible and necessary. While this book attempts to
provide an overview of the stream mining area, it also tries to discuss current
topics of interest so as to be useful to students and researchers. It is hoped that
this book will provide a reference to students, researchers and practitioners in
both introducing the topic of data streams and understanding the practical and
algorithmic aspects of the area.
Chapter 1
AN INTRODUCTION TO DATA STREAMS
Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Abstract
In recent years, advances in hardware technology have facilitated new ways of
collecting data continuously. In many applications such as network monitoring,
the volume of such data is so large that it may be impossible to store the data
on disk. Furthermore, even when the data can be stored, the volume of the
incoming data may be so large that it may be impossible to process any particular
record more than once. Therefore, many data mining and database operations
such as classification, clustering, frequent pattern mining and indexing become
significantly more challenging in this context.
In many cases, the data patterns may evolve continuously, as a result of which
it is necessary to design the mining algorithms effectively in order to account for
changes in the underlying structure of the data stream. This makes the solutions of
the underlying problems even more difficult from an algorithmic and computational
point of view. This book contains a number of chapters which are carefully chosen
in order to discuss the broad research issues in data streams. The purpose of this
chapter is to provide an overview of the organization of the stream processing
and mining techniques which are covered in this book.
1. Introduction
In recent years, advances in hardware technology have facilitated the ability
to collect data continuously. Simple transactions of everyday life such as using
a credit card, a phone or browsing the web lead to automated data storage.
Similarly, advances in information technology have led to large flows of data
across IP networks. In many cases, these large volumes of data can be mined for
interesting and relevant information in a wide variety of applications. When the
volume of the underlying data is very large, it leads to a number of computational
and mining challenges:
With increasing volume of the data, it is no longer possible to process the
data efficiently by using multiple passes. Rather, one can process a data
item at most once. This leads to constraints on the implementation of the
underlying algorithms. Therefore, stream mining algorithms typically
need to be designed so that the algorithms work with one pass of the data.
In most cases, there is an inherent temporal component to the stream
mining process. This is because the data may evolve over time. This
behavior of data streams is referred to as temporal locality. Therefore,
a straightforward adaptation of one-pass mining algorithms may not be
an effective solution to the task. Stream mining algorithms need to be
carefully designed with a clear focus on the evolution of the underlying data.
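To make these two constraints concrete, the following sketch (not from this book; the decay parameter `lambda_` is an illustrative choice) maintains summary statistics of a numeric stream in constant memory, touching each item exactly once, and down-weights older items as a simple nod to temporal locality.

```python
# A minimal one-pass summary of a numeric stream: constant memory, each item
# is processed exactly once, and a decay factor lambda_ < 1 down-weights old
# items to reflect temporal locality (lambda_ = 1.0 gives plain running stats).
class DecayedStats:
    def __init__(self, lambda_=0.99):
        self.lam = lambda_
        self.s0 = 0.0        # decayed count
        self.s1 = 0.0        # decayed sum
        self.s2 = 0.0        # decayed sum of squares

    def update(self, x):
        # One pass: the item x is folded into the summary and then discarded.
        self.s0 = self.lam * self.s0 + 1.0
        self.s1 = self.lam * self.s1 + x
        self.s2 = self.lam * self.s2 + x * x

    def mean(self):
        return self.s1 / self.s0 if self.s0 > 0 else 0.0

    def variance(self):
        if self.s0 == 0:
            return 0.0
        m = self.mean()
        return max(self.s2 / self.s0 - m * m, 0.0)

# Example: summarize a stream of readings without storing them.
stats = DecayedStats(lambda_=0.95)
for value in [1.0, 2.0, 4.0, 8.0]:
    stats.update(value)
print(stats.mean(), stats.variance())
```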
Another important characteristic of data streams is that they are often mined in
a distributed fashion. Furthermore, the individual processors may have limited
processing and memory. Examples of such cases include sensor networks, in
which it may be desirable to perform in-network processing of data streams with
limited processing and memory [8, 19]. This book will also contain a number
of chapters devoted to these topics.
This chapter will provide an overview of the different stream mining algorithms
covered in this book. We will discuss the challenges associated with each
kind of problem, and discuss an overview of the material in the corresponding
chapter.
2. Stream Mining Algorithms
In this section, we will discuss the key stream mining problems and will
discuss the challenges associated with each problem. We will also discuss an
overview of the material covered in each chapter of this book. The broad topics
covered in this book are as follows:
Data Stream Clustering. Clustering is a widely studied problem in the
data mining literature. However, it is more difficult to adapt arbitrary clustering
algorithms to data streams because of one-pass constraints on the data
set. An interesting adaptation of the k-means algorithm has been discussed
in [14] which uses a partitioning based approach on the entire data set. This
approach uses an adaptation of a k-means technique in order to create clusters
over the entire data stream. In the context of data streams, it may be more
desirable to determine clusters in specific user defined horizons rather than on
the entire data set. In chapter 2, we discuss the micro-clustering technique [3]
which determines clusters over the entire data set. We also discuss a variety
of applications of micro-clustering which can perform effective summarization
based analysis of the data set. For example, micro-clustering can be extended
to the problem of classification on data streams [5]. In many cases, it can also
be used for arbitrary data mining applications such as privacy preserving data
mining or query estimation.
Data Stream Classification. The problem of classification is perhaps one
of the most widely studied in the context of data stream mining. The problem
of classification is made more difficult by the evolution of the underlying data
stream. Therefore, effective algorithms need to be designed in order to take
temporal locality into account. In chapter 3, we discuss a survey of classification
algorithms for data streams. A wide variety of data stream classification
algorithms are covered in this chapter. Some of these algorithms are designed to
be purely one-pass adaptations of conventional classification algorithms [12],
whereas others (such as the methods in [5, 16]) are more effective in accounting
for the evolution of the underlying data stream. Chapter 3 discusses the
different kinds of algorithms and the relative advantages of each.
Frequent Pattern Mining. The problem of frequent pattern mining was
first introduced in [6], and was extensively analyzed for the conventional case
of disk resident data sets. In the case of data streams, one may wish to find the
frequent itemsets either over a sliding window or the entire data stream [15, 17].
In Chapter 4, we discuss an overview of the different frequent pattern mining
algorithms, and also provide a detailed discussion of some interesting recent
algorithms on the topic.
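For a flavor of the counter-based, one-pass techniques in this family, the sketch below gives the classical Misra-Gries frequent-items procedure (closely related to the Karp et al. algorithm referenced in Chapter 4); it is a generic illustration rather than code from that chapter.

```python
# Misra-Gries frequent-items sketch: one pass, at most k-1 counters.
# Every item with true frequency > n/k is guaranteed to survive in `counters`;
# reported counts underestimate true counts by at most n/k.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Example: items occurring in more than 1/3 of the stream are retained.
print(misra_gries(["a", "b", "a", "c", "a", "b", "a"], k=3))
```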
Change Detection in Data Streams. As discussed earlier, the patterns
in a data stream may evolve over time. In many cases, it is desirable to track
and analyze the nature of these changes over time. In [1, 11, 18], a number of
methods have been discussed for change detection of data streams. In addition,
data stream evolution can also affect the behavior of the underlying data mining
algorithms since the results can become stale over time. Therefore, in Chapter
5, we discuss the different methods for change detection in data streams.
We also discuss the effect of evolution on data stream mining algorithms.
Stream Cube Analysis of Multi-dimensional Streams. Much of stream
data resides at a multi-dimensional space and at rather low level of abstraction,
whereas most analysts are interested in relatively high-level dynamic changes in
some combination of dimensions. To discover high-level dynamic and evolving
characteristics, one may need to perform multi-level, multi-dimensional on-line
analytical processing (OLAP) of stream data. Such necessity calls for the
investigation of new architectures that may facilitate on-line analytical processing of
multi-dimensional stream data [7, 10].
Chapter 6 presents an interesting stream-cube architecture that effectively
performs on-line partial aggregation of multi-dimensional stream data, captures
the essential dynamic and evolving characteristics of data streams, and facilitates
fast OLAP on stream data. The stream cube architecture facilitates online
analytical processing of stream data. It also forms a preliminary structure for
online stream mining. The impact of the design and implementation of the stream
cube in the context of stream mining is also discussed in the chapter.
Load Shedding in Data Streams. Since data streams are generated by
processes which are extraneous to the stream processing application, it is not
possible to control the incoming stream rate. As a result, it is necessary for the
system to have the ability to quickly adjust to varying incoming stream processing
rates. Chapter 7 discusses one particular type of adaptivity: the ability
to gracefully degrade performance via "load shedding" (dropping unprocessed
tuples to reduce system load) when the demands placed on the system cannot
be met in full given available resources. Focusing on aggregation queries,
the chapter presents algorithms that determine at what points in a query plan
load shedding should be performed and what amount of load should be shed at
each point in order to minimize the degree of inaccuracy introduced into query
answers.
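The basic mechanism of shedding load by probabilistic dropping can be sketched in a few lines; the drop probability below is chosen naively from a target rate, and the function and parameter names are illustrative rather than taken from Chapter 7, which studies where in a query plan to shed and how much.

```python
import random

# Naive load shedder: admit each arriving tuple with probability p, where p is
# the ratio of the rate the system can sustain to the observed arrival rate.
def shed_load(stream, sustainable_rate, arrival_rate, process):
    p = min(1.0, sustainable_rate / max(arrival_rate, 1e-9))
    for tup in stream:
        if random.random() < p:
            process(tup)           # tuple admitted to the query plan
        # else: tuple is dropped (load shed)

# Example usage: keep roughly 40% of tuples when arrivals are 2.5x capacity.
shed_load(range(10), sustainable_rate=100.0, arrival_rate=250.0, process=print)
```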
Sliding Window Computations in Data Streams. Many of the synopsis
structures discussed use the entire data stream in order to construct the
corresponding synopsis structure. The sliding-window model of computation is
motivated by the assumption that it is more important to use recent data in data
stream computation [9]. Therefore, the processing and analysis is only done on
a fixed history of the data stream. Chapter 8 formalizes this model of computation
and answers questions about how much space and computation time is
required to solve certain problems under the sliding-window model.
Synopsis Construction in Data Streams. The large volume of data streams
poses unique space and time constraints on the computation process. Many
query processing, database operations, and mining algorithms require efficient
execution which can be difficult to achieve with a fast data stream. In many
cases, it may be acceptable to generate approximate solutions for such problems.
In recent years a number of synopsis structures have been developed,
which can be used in conjunction with a variety of mining and query processing
techniques [13]. Some key synopsis methods include those of sampling,
wavelets, sketches and histograms. In Chapter 9, a survey of the key synopsis
techniques is provided, together with the mining techniques supported by such methods.
The chapter discusses the challenges and tradeoffs associated with using
different kinds of techniques, and the important research directions for synopsis
construction.
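Among the sampling methods surveyed in Chapter 9, reservoir sampling is perhaps the simplest synopsis to state; the sketch below maintains a uniform random sample of fixed size k over an unbounded stream (a standard textbook construction, not code from the chapter).

```python
import random

# Reservoir sampling: after processing n items, each item is present in the
# reservoir with probability k/n, using O(k) memory and a single pass.
def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            j = random.randint(1, n)   # uniform position in [1, n]
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

# Example: a uniform sample of 5 elements from a stream of 10,000 integers.
print(reservoir_sample(range(10000), k=5))
```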
Join Processing in Data Streams. Stream join is a fundamental operation
for relating information from different streams. This is especially useful in
many applications such as sensor networks in which the streams arriving from
different sources may need to be related with one another. In the stream setting,
input tuples arrive continuously, and result tuples need to be produced continuously
as well. We cannot assume that the input data is already stored or indexed,
or that the input rate can be controlled by the query plan. Standard join algorithms
that use blocking operations, e.g., sorting, no longer work. Conventional
methods for cost estimation and query optimization are also inappropriate, because
they assume finite input. Moreover, the long-running nature of stream
queries calls for more adaptive processing strategies that can react to changes
and fluctuations in data and stream characteristics. The "stateful" nature of
stream joins adds another dimension to the challenge. In general, in order to
compute the complete result of a stream join, we need to retain all past arrivals
as part of the processing state, because a new tuple may join with an arbitrarily
old tuple arrived in the past. This problem is exacerbated by unbounded input
streams, limited processing resources, and high performance requirements, as
it is impossible in the long run to keep all past history in fast memory. Chapter
10 provides an overview of research problems, recent advances, and future
research directions in stream join processing.
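To illustrate the stateful nature of stream joins, the sketch below performs a time-window symmetric join of two streams on a key: every arriving tuple probes the opposite stream's buffered window and is then added to its own, and expired tuples are evicted so that state stays bounded. This is a generic textbook-style construction under an assumed tuple format, not one of the algorithms analyzed in Chapter 10.

```python
from collections import deque

# Windowed symmetric join of streams R and S on a key. Tuples older than
# `window` time units are evicted, so the join state remains bounded.
def window_join(arrivals, window):
    state = {"R": deque(), "S": deque()}   # per-stream buffers of (time, key, tuple)
    results = []
    for t, side, key, tup in arrivals:     # arrivals ordered by timestamp t
        other = "S" if side == "R" else "R"
        # Evict expired tuples from both windows.
        for s in ("R", "S"):
            while state[s] and state[s][0][0] <= t - window:
                state[s].popleft()
        # Probe the opposite window, then insert the new tuple into its own.
        for (t2, key2, tup2) in state[other]:
            if key2 == key:
                results.append((tup, tup2) if side == "R" else (tup2, tup))
        state[side].append((t, key, tup))
    return results

# Example: join R and S tuples whose keys match within a window of 10 time units.
arrivals = [(1, "R", "a", "r1"), (3, "S", "a", "s1"), (20, "S", "a", "s2")]
print(window_join(arrivals, window=10))   # [('r1', 's1')] -- 's2' arrives too late
```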
Indexing Data Streams. The problem of indexing data streams attempts
to create an indexed representation, so that it is possible to efficiently answer
different kinds of queries such as aggregation queries or trend based queries.
This is especially important in the data stream case because of the huge volume
of the underlying data. Chapter 11 explores the problem of indexing and
querying data streams.
Dimensionality Reduction and Forecasting in Data Streams. Because
of the inherent temporal nature of data streams, the problems of dimensionality
reduction and forecasting are particularly important. When there are a
large number of simultaneous data streams, we can use the correlations between
different data streams in order to make effective predictions [20, 21] on the
future behavior of the data stream. In Chapter 12, an overview of dimensionality
reduction and forecasting methods is provided for data streams. In
particular, the well known MUSCLES method [21] is discussed, and its application
to data streams is explored. In addition, the chapter presents the SPIRIT
algorithm, which explores the relationship between dimensionality reduction
and forecasting in data streams. In particular, the chapter explores the use of a
compact number of hidden variables to comprehensively describe the data
stream. This compact representation can also be used for effective forecasting
of the data streams.
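To give a feel for tracking a hidden variable incrementally, the sketch below updates a single dominant direction with an Oja-style rule (with explicit renormalization and an assumed learning rate) as each multivariate tuple arrives; this is a generic streaming-PCA illustration, not the SPIRIT algorithm itself.

```python
import math

# Incrementally track one hidden variable (dominant direction) with an
# Oja-style update: w <- w + eta * y * (x - y * w), where y = w . x,
# followed by renormalization of w for numerical stability.
def track_hidden_variable(stream, dim, eta=0.01):
    w = [1.0 / math.sqrt(dim)] * dim          # initial unit-length direction
    projections = []
    for x in stream:                          # x: list of `dim` readings per tick
        y = sum(wi * xi for wi, xi in zip(w, x))          # hidden-variable value
        w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w)) or 1.0
        w = [wi / norm for wi in w]
        projections.append(y)
    return w, projections

# Example: two perfectly correlated streams; the direction drifts toward
# the dominant axis (1, 2)/sqrt(5), roughly (0.447, 0.894).
data = [[1.0, 2.0] for _ in range(500)]
direction, hidden = track_hidden_variable(data, dim=2, eta=0.05)
print(direction)
```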
Distributed Mining of Data Streams. In many instances, streams are
generated at multiple distributed computing nodes. Analyzing and monitoring
data in such environments requires data mining technology that requires
optimization of a variety of criteria such as communication costs across different
nodes, as well as computational, memory or storage requirements at each node.
A comprehensive survey of the adaptation of different conventional mining
algorithms to the distributed case is provided in Chapter 13. In particular, the
clustering, classification, outlier detection, frequent pattern mining, and
summarization problems are discussed. In Chapter 14, some recent advances in
stream mining algorithms are discussed.
Stream Mining in Sensor Networks. With recent advances in hardware
technology, it has become possible to track large amounts of data in a distributed
fashion with the use of sensor technology. The large amounts of data collected
by the sensor nodes make the problem of monitoring a challenging one from
many technological standpoints. Sensor nodes have limited local storage,
computational power, and battery life, as a result of which it is desirable to
minimize the storage, processing and communication from these nodes. The
problem is further magnified by the fact that a given network may have millions
of sensor nodes and therefore it is very expensive to localize all the data at a given
global node for analysis, both from a storage and a communication point of view.
In Chapter 15, we discuss an overview of a number of stream mining issues
in the context of sensor networks. This topic is closely related to distributed
stream mining, and a number of concepts related to sensor mining have also
been discussed in Chapters 13 and 14.
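The communication savings mentioned above typically come from in-network aggregation, in which each node combines the partial results of its children before forwarding a single value; the sketch below shows this for a MAX query over a hypothetical tree of sensor nodes (the node ids, readings, and topology are invented for illustration).

```python
# In-network aggregation of a MAX query over a sensor tree: each node sends a
# single aggregated value to its parent instead of forwarding every reading.
# `children` maps a node id to its child ids; `reading` holds local measurements.
def in_network_max(node, children, reading):
    partial = reading[node]
    for child in children.get(node, []):
        # Each child aggregates its own subtree and transmits one value upward.
        partial = max(partial, in_network_max(child, children, reading))
    return partial

# Hypothetical 5-node tree rooted at "base".
children = {"base": ["s1", "s2"], "s1": ["s3", "s4"]}
reading = {"base": 17.0, "s1": 21.5, "s2": 19.0, "s3": 30.2, "s4": 18.4}
print(in_network_max("base", children, reading))   # 30.2
```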
3. Conclusions and Summary
Data streams are a computational challenge to data mining problems because
of the additional algorithmic constraints created by the large volume of data. In
addition, the problem of temporal locality leads to a number of unique mining
challenges in the data stream case. This chapter provides an overview of the
different mining algorithms which are covered in this book. We discussed the
different problems and the challenges which are associated with each problem.
We also provided an overview of the material in each chapter of the book.
References
[1] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.
[2] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in Evolving Data Streams. IEEE ICDE Conference.
[3] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.
[4] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High Dimensional Projected Clustering of Data Streams. VLDB Conference.
[5] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Data Streams. ACM KDD Conference.
[6] Agrawal R., Imielinski T., Swami A. (1993). Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference.
[7] Chen Y., Dong G., Han J., Wah B. W., Wang J. (2002). Multi-dimensional regression analysis of time-series data streams. VLDB Conference.
[8] Cormode G., Garofalakis M. (2005). Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB Conference.
[9] Datar M., Gionis A., Indyk P., Motwani R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794-1813.
[10] Dong G., Han J., Lam J., Pei J., Wang K. (2001). Mining multi-dimensional constrained gradients in data cubes. VLDB Conference.
[11] Dasu T., Krishnan S., Venkatasubramanian S., Yi K. (2005). An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.
[12] Domingos P. and Hulten G. (2000). Mining High-Speed Data Streams. In Proceedings of the ACM KDD Conference.
[13] Garofalakis M., Gehrke J., Rastogi R. (2002). Querying and mining data streams: you only get one look (a tutorial). SIGMOD Conference.
[14] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data Streams. IEEE FOCS Conference.
[15] Giannella C., Han J., Pei J., Yan X., and Yu P. (2002). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Proceedings of the NSF Workshop on Next Generation Data Mining.
[16] Hulten G., Spencer L., Domingos P. (2001). Mining Time Changing Data Streams. ACM KDD Conference.
[17] Jin R., Agrawal G. (2005). An algorithm for in-core frequent itemset mining on streaming data. ICDM Conference.
[18] Kifer D., Ben-David S., Gehrke J. (2004). Detecting Change in Data Streams. VLDB Conference.
[19] Kollios G., Byers J., Considine J., Hadjieleftheriou M., Li F. (2005). Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.
[20] Sakurai Y., Papadimitriou S., Faloutsos C. (2005). BRAID: Stream mining through group lag correlations. ACM SIGMOD Conference.
[21] Yi B.-K., Sidiropoulos N. D., Johnson T., Jagadish H. V., Faloutsos C., Biliris A. (2000). Online data mining for co-evolving time sequences. ICDE Conference.
Chapter 2
ON CLUSTERING MASSIVE DATA STREAMS: A SUMMARIZATION PARADIGM
Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL
hanj@cs.uiuc.edu
Jianyong Wang
University of Illinois at Urbana-Champaign
Urbana, IL
jianyong@tsinghua.edu.cn
Philip S. Yu
IBM T. J. Watson Research Center
Hawthorne, NY 10532
Abstract
In recent years, data streams have become ubiquitous because of the large
number of applications which generate huge volumes of data in an automated
way. Many existing data mining methods cannot be applied directly on data
streams because of the fact that the data needs to be mined in one pass.
Furthermore, data streams show a considerable amount of temporal locality because
of which a direct application of the existing methods may lead to misleading
results. In this paper, we develop an efficient and effective approach for mining
fast evolving data streams, which integrates the micro-clustering technique
with the high-level data mining process, and discovers data evolution regularities
as well. Our analysis and experiments demonstrate that two important data mining
problems, namely stream clustering and stream classification, can be performed
effectively using this approach, with high quality mining results. We discuss
the use of micro-clustering as a general summarization technology to solve data
mining problems on streams. Our discussion illustrates the importance of our
approach for a variety of mining problems in the data stream domain.
1. Introduction
In recent years, advances in hardware technology have allowed us to automatically
record transactions and other pieces of information of everyday life
at a rapid rate. Such processes generate huge amounts of online data which
grow at an unlimited rate. These kinds of online data are referred to as data
streams. The issues on management and analysis of data streams have been
researched extensively in recent years because of its emerging, imminent, and
broad applications [11, 14, 17, 23].
Many important problems such as clustering and classification have been
widely studied in the data mining community. However, a majority of such
methods may not be working effectively on data streams. Data streams pose
special challenges to a number of data mining algorithms, not only because
of the huge volume of the online data streams, but also because of the fact
that the data in the streams may show temporal correlations. Such temporal
correlations may help disclose important data evolution characteristics, and they
can also be used to develop efficient and effective mining algorithms. Moreover,
data streams require online mining, in which we wish to mine the data in a
continuous fashion. Furthermore, the system needs to have the capability to
perform an offline analysis as well based on the user interests. This is similar
to an online analytical processing (OLAP) framework which uses the paradigm
of pre-processing once, querying many times.
Based on the above considerations, we propose a new stream mining framework,
which adopts a tilted time window framework, takes micro-clustering
as a preprocessing process, and integrates the preprocessing with the incremental,
dynamic mining process. Micro-clustering preprocessing effectively
compresses the data, preserves the general temporal locality of data, and facilitates
both online and offline analysis, as well as the analysis of current data and
data evolution regularities.
In this study, we primarily concentrate on the application of this technique
to two problems: (1) stream clustering, and (2) stream classification. The heart
of the approach is to use an online summarization approach which is efficient
and also allows for effective processing of the data streams. We also discuss
Figure 2.1. Micro-clustering Examples
Figure 2.2. Some Simple Time Windows
a number of research directions, in which we show how the approach can be
adapted to a variety of other problems.
This paper is organized as follows. In the next section, we will present our
micro-clustering based stream mining framework. In section 3, we discuss the
stream clustering problem. The classification methods are developed in Section
4. In section 5, we discuss a number of other problems which can be solved
with the micro-clustering approach, and other possible research directions. In
section 6, we will discuss some empirical results for the clustering and
classification problems. In Section 7 we discuss the issues related to our proposed
stream mining methodology and compare it with other related work. Section 8
concludes our study.
2. The Micro-clustering Based Stream Mining Framework
In order to apply our technique to a variety of data mining algorithms, we
utilize a micro-clustering based stream mining framework. This framework is
designed by capturing summary information about the nature of the data stream.
This summary information is defined by the following structures:
Micro-clusters: We maintain statistical information about the data locality
in terms of micro-clusters. These micro-clusters are defined as a temporal
extension of the cluster feature vector [24]. The additivity property of the
micro-clusters makes them a natural choice for the data stream problem.
Pyramidal Time Frame: The micro-clusters are stored at snapshots in
time which follow a pyramidal pattern. This pattern provides an effective
trade-off between the storage requirements and the ability to recall summary
statistics from different time horizons.
The summary information in the micro-clusters is used by an offline
component which is dependent upon a wide variety of user inputs such as the time
horizon or the granularity of clustering. In order to define the micro-clusters,
we will introduce a few concepts. It is assumed that the data stream consists
of a set of multi-dimensional records $\overline{X}_1 \ldots \overline{X}_k \ldots$ arriving at time stamps
$T_1 \ldots T_k \ldots$. Each $\overline{X}_i$ is a multi-dimensional record containing $d$ dimensions
which are denoted by $\overline{X}_i = (x_i^1 \ldots x_i^d)$.
We will first begin by defining the concept of micro-clusters and pyramidal
time frame more precisely.
DEFINITION 2.1 A micro-cluster for a set of d-dimensional points $X_{i_1} \ldots X_{i_n}$
with time stamps $T_{i_1} \ldots T_{i_n}$ is the $(2 \cdot d + 3)$ tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$,
wherein $\overline{CF2^x}$ and $\overline{CF1^x}$ each correspond to a vector of d entries. The definition
of each of these entries is as follows:
For each dimension, the sum of the squares of the data values is maintained
in $\overline{CF2^x}$. Thus, $\overline{CF2^x}$ contains d values. The p-th entry of $\overline{CF2^x}$ is equal to
$\sum_{j=1}^{n} (x_{i_j}^p)^2$.
For each dimension, the sum of the data values is maintained in $\overline{CF1^x}$.
Thus, $\overline{CF1^x}$ contains d values. The p-th entry of $\overline{CF1^x}$ is equal to $\sum_{j=1}^{n} x_{i_j}^p$.
The sum of the squares of the time stamps $T_{i_1} \ldots T_{i_n}$ is maintained in $CF2^t$.
The sum of the time stamps $T_{i_1} \ldots T_{i_n}$ is maintained in $CF1^t$.
The number of data points is maintained in $n$.
We note that the above definition of micro-cluster maintains similar summary
information as the cluster feature vector of [24], except for the additional
information about time stamps. We will refer to this temporal extension of the
cluster feature vector for a set of points C by CFT(C). As in [24], this summary
information can be expressed in an additive way over the different data points.
This makes it a natural choice for use in data stream algorithms.
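Because every field of the tuple in Definition 2.1 is a sum, a micro-cluster can be maintained with a handful of additions per point, and two micro-clusters can be merged by componentwise addition. The sketch below mirrors that definition (the field and method names are my own; the centroid helper simply divides CF1^x by n, a standard use of cluster feature vectors).

```python
# A micro-cluster as in Definition 2.1: the (2d+3)-tuple
# (CF2x, CF1x, CF2t, CF1t, n). All fields are additive, so a point can be
# absorbed in O(d) and two micro-clusters can be merged componentwise.
class MicroCluster:
    def __init__(self, d):
        self.cf2x = [0.0] * d   # per-dimension sum of squares of data values
        self.cf1x = [0.0] * d   # per-dimension sum of data values
        self.cf2t = 0.0         # sum of squares of time stamps
        self.cf1t = 0.0         # sum of time stamps
        self.n = 0              # number of points

    def insert(self, x, t):
        for p, v in enumerate(x):
            self.cf2x[p] += v * v
            self.cf1x[p] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other):
        # Additivity: the summary of a union is the componentwise sum.
        for p in range(len(self.cf1x)):
            self.cf2x[p] += other.cf2x[p]
            self.cf1x[p] += other.cf1x[p]
        self.cf2t += other.cf2t
        self.cf1t += other.cf1t
        self.n += other.n

    def centroid(self):
        return [s / self.n for s in self.cf1x]

# Example: absorb two 2-dimensional points arriving at times 1 and 2.
mc = MicroCluster(d=2)
mc.insert([1.0, 4.0], t=1)
mc.insert([3.0, 6.0], t=2)
print(mc.centroid())   # [2.0, 5.0]
```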
We note that the maintenance of a large number of micro-clusters is essential
in the ability to maintain more detailed information about the micro-clustering
process. For example, Figure 2.1 forms 3 clusters, which are denoted by a, b, c.
At a later stage, evolution forms 3 different figures a1, a2, bc, with a split into a1
and a2, whereas b and c merged into bc. If we keep micro-clusters (each point
represents a micro-cluster), such evolution can be easily captured. However, if
we keep only 3 cluster centers a, b, c, it is impossible to derive the later a1, a2, bc
clusters since the information of the more detailed points is already lost.
The data stream clustering algorithm discussed in this paper can generate
approximate clusters in any user-specified length of history from the current
instant. This is achieved by storing the micro-clusters at particular moments
in the stream which are referred to as snapshots. At the same time, the current
snapshot of micro-clusters is always maintained by the algorithm. The macro-clustering
algorithm discussed at a later stage in this paper will use these finer
level micro-clusters in order to create higher level clusters which can be more
easily understood by the user. Consider for example the case when the current
clock time is $t_c$, and the user wishes to find clusters in the stream based on
a history of length $h$. Then, the macro-clustering algorithm discussed in this
paper will use some of the additive properties of the micro-clusters stored at
snapshots $t_c$ and $(t_c - h)$ in order to find the higher level clusters in a history
or time horizon of length $h$. Of course, since it is not possible to store the
snapshots at each and every moment in time, it is important to choose particular
instants of time at which it is possible to store the state of the micro-clusters so
that clusters in any user specified time horizon $(t_c - h, t_c)$ can be approximated.
We note that some examples of time frames used for the clustering process
are the natural time frame (Figure 2.2(a) and (b)), and the logarithmic time
frame (Figure 2.2(c)). In the natural time frame the snapshots are stored at
regular intervals. We note that the scale of the natural time frame could be
based on the application requirements. For example, we could choose days,
months or years depending upon the level of granularity required in the analysis.
A more flexible approach is to use the logarithmic time frame in which different
variations of the time interval can be stored. As illustrated in Figure 2.2(c), we
store snapshots at times of $t$, $2t$, $4t$ .... The danger of this is that we may
jump too far between successive levels of granularity. We need an intermediate
solution which provides a good balance between storage requirements and the
level of accuracy with which a user specified horizon can be approximated.
In order to achieve this, we will introduce the concept of a pyramidal time
frame. In this technique, the snapshots are stored at differing levels of granularity
depending upon the recency. Snapshots are classified into different orders
which can vary from 1 to $\log(T)$, where $T$ is the clock time elapsed since the
beginning of the stream. The order of a particular class of snapshots defines
the level of granularity in time at which the snapshots are maintained. The
snapshots of different order are maintained as follows:
• Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha$ is an
integer and $\alpha \geq 1$. Specifically, each snapshot of the i-th order is taken at
a moment in time when the clock value from the beginning of the stream is
exactly divisible by $\alpha^i$.
• At any given moment in time, only the last $\alpha + 1$ snapshots of order $i$ are
stored.
We note that the above definition allows for considerable redundancy in
storage of snapshots. For example, the clock time of 8 is divisible by $2^0$, $2^1$,
$2^2$, and $2^3$ (where $\alpha = 2$). Therefore, the state of the micro-clusters at a clock
time of 8 simultaneously corresponds to order 0, order 1, order 2 and order
3 snapshots. From an implementation point of view, a snapshot needs to be
maintained only once. We make the following observations:
• For a data stream, the maximum order of any snapshot stored at $T$ time
units since the beginning of the stream mining process is $\log_\alpha(T)$.
• For a data stream, the maximum number of snapshots maintained at $T$ time
units since the beginning of the stream mining process is $(\alpha + 1) \cdot \log_\alpha(T)$.
• For any user specified time window of $h$, at least one stored snapshot can
be found within $2 \cdot h$ units of the current time.
While the first two results are quite easy to see, the last one needs to be
proven formally.
LEMMA 2.2 Let $h$ be a user-specified time window, $t_c$ be the current time, and
$t_s$ be the time of the last stored snapshot of any order just before the time $t_c - h$.
Then $t_c - t_s \leq 2 \cdot h$.
Proof: Let $r$ be the smallest integer such that $\alpha^r \geq h$. Therefore, we know that
$\alpha^{r-1} < h$. Since we know that there are $\alpha + 1$ snapshots of order $(r-1)$, at least
one snapshot of order $r-1$ must always exist before $t_c - h$. Let $t_s$ be the snapshot
of order $r-1$ which occurs just before $t_c - h$. Then $(t_c - h) - t_s \leq \alpha^{r-1}$.
Therefore, we have $t_c - t_s \leq h + \alpha^{r-1} < 2 \cdot h$.
Thus, in this case, it is possible to find a snapshot within a factor of 2 of
any user-specified time window. Furthermore, the total number of snapshots
which need to be maintained is relatively modest. For example, for a data
stream running for 100 years with a clock time granularity of 1 second, the
total number of snapshots which need to be maintained is given by
$(2 + 1) \cdot \log_2(100 \times 365 \times 24 \times 60 \times 60) \approx 95$. This is quite a modest requirement given
the fact that a snapshot within a factor of 2 can always be found within any user
specified time window.
It is possible to improve the accuracy of time horizon approximation at a
modest additional cost. In order to achieve this, we save the $\alpha^l + 1$ snapshots
of order $r$ for $l > 1$. In this case, the storage requirement of the technique
corresponds to $(\alpha^l + 1) \log_\alpha(T)$ snapshots. On the other hand, the accuracy of
time horizon approximation also increases substantially. In this case, any time
horizon can be approximated to a factor of $(1 + 1/\alpha^{l-1})$. We summarize this
result as follows:
LEMMA 2.3 Let $h$ be a user specified time horizon, $t_c$ be the current time, and
$t_s$ be the time of the last stored snapshot of any order just before the time $t_c - h$.
Then $t_c - t_s < (1 + 1/\alpha^{l-1}) \cdot h$.
Proof: Similar to previous case.
For larger values of $l$, the time horizon can be approximated as closely as
desired. For example, by choosing $l = 10$, it is possible to approximate any
time horizon within 0.2%, while a total of only $(2^{10} + 1) \log_2(100 \times 365 \times 24 \times 60 \times 60) = 32343$
snapshots are required for 100 years. Since historical
snapshots can be stored on disk and only the current snapshot needs to be
maintained in main memory, this requirement is quite feasible from a practical
point of view. It is also possible to specify the pyramidal time window in
accordance with user preferences corresponding to particular moments in time
such as beginning of calendar years, months, and days. While the storage
requirements and horizon estimation possibilities of such a scheme are different,
all the algorithmic descriptions of this paper are directly applicable.

Table 2.1.  An example of snapshots stored for $\alpha = 2$ and $l = 2$

Order of Snapshots    Clock Times (Last 5 Snapshots)
0                     55 54 53 52 51
1                     54 52 50 48 46
2                     52 48 44 40 36
3                     48 40 32 24 16
4                     48 32 16
5                     32
In order to clarify the way in which snapshots are stored, let us consider the
case when the stream has been running starting at a clock time of 1, and a use
of $\alpha = 2$ and $l = 2$. Therefore $2^2 + 1 = 5$ snapshots of each order are stored.
Then, at a clock time of 55, snapshots at the clock times illustrated in Table 2.1
are stored.
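The retention rule behind Table 2.1 (order-i snapshots are taken at clock times divisible by $\alpha^i$, and only the last $\alpha^l + 1$ of each order are kept) is easy to reproduce in code; the sketch below regenerates the table's entries for $\alpha = 2$, $l = 2$ at clock time 55. It is an illustration of the bookkeeping only, not an implementation from the chapter.

```python
# Pyramidal time frame bookkeeping: an order-i snapshot is taken whenever the
# clock time is divisible by alpha**i, and only the last alpha**l + 1 snapshots
# of each order are retained (worst case, ignoring overlap between orders).
def stored_snapshots(clock, alpha=2, l=2):
    keep = alpha ** l + 1
    max_order = 0
    while alpha ** (max_order + 1) <= clock:
        max_order += 1
    table = {}
    for order in range(max_order + 1):
        step = alpha ** order
        times = list(range(step, clock + 1, step))
        table[order] = times[-keep:][::-1]   # most recent snapshots first
    return table

# Reproduces Table 2.1: alpha = 2, l = 2, clock time 55.
for order, times in stored_snapshots(55).items():
    print(order, times)
```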
We note that a large number of snapshots are common among different orders. From an implementation point of view, the states of the micro-clusters at times of 16, 24, 32, 36, 40, 44, 46, 48, 50, 51, 52, 53, 54, and 55 are stored. It is easy to see that for more recent clock times, there is less distance between successive snapshots (better granularity). We also note that the storage requirements estimated in this section do not take this redundancy into account. Therefore, the requirements which have been presented so far are actually worst-case requirements.
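The following minimal Python sketch (the class name PyramidalFrame and its fields are illustrative, not taken from [6]) simulates this storage rule: at each clock tick a snapshot is recorded for every order i such that α^i divides the tick, and only the last α^l + 1 snapshots of each order are retained. Run with α = 2 and l = 2 up to clock time 55, it reproduces the rows of Table 2.1 and the de-duplicated list of stored times given above.

    class PyramidalFrame:
        """Illustrative sketch of the pyramidal time frame with parameters alpha and l."""

        def __init__(self, alpha=2, l=2):
            self.alpha = alpha
            self.keep = alpha ** l + 1        # snapshots retained per order
            self.frames = {}                  # order -> list of clock times, oldest first

        def tick(self, t):
            # A snapshot of order i is taken whenever t is divisible by alpha^i.
            i = 0
            while t % (self.alpha ** i) == 0:
                order = self.frames.setdefault(i, [])
                order.append(t)
                if len(order) > self.keep:    # drop the least recent snapshot of this order
                    order.pop(0)
                i += 1

        def stored_times(self):
            # Snapshots common to several orders are physically stored only once.
            return sorted({t for times in self.frames.values() for t in times})

    frame = PyramidalFrame(alpha=2, l=2)
    for t in range(1, 56):
        frame.tick(t)
    for order, times in sorted(frame.frames.items()):
        print(order, list(reversed(times)))   # matches the rows of Table 2.1
    print(frame.stored_times())               # 16, 24, 32, 36, 40, 44, 46, 48, 50, 51, ..., 55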
These redundancies can be eliminated by using a systematic rule described in [6], or by using a more sophisticated geometric time frame. In this technique, snapshots are classified into different frame numbers which can vary from 0 to a value no larger than log₂(T), where T is the maximum length of the stream. The frame number of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained. Specifically, snapshots of frame number i are stored at clock times which are divisible by 2^i, but not by 2^(i+1). Therefore, snapshots of frame number 0 are stored only at odd clock times. It is assumed that for each frame number, at most max_capacity snapshots are stored.
We note that for a data stream, the maximum frame number of any snapshot stored at T time units since the beginning of the stream mining process is log₂(T). Since at most max_capacity snapshots of any order are stored, this also means that the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (max_capacity) · log₂(T). One interesting characteristic of the geometric time window is that for any user-specified time window of h, at least one stored snapshot can be found within a factor of 2 of the specified horizon. This ensures that sufficient granularity is available for analyzing the behavior of the data stream over different time horizons. We will formalize this result in the lemma below.
Lemma 2.4. Let h be a user-specified time window, and t_c be the current time. Let us also assume that max_capacity ≥ 2. Then a snapshot exists at time t_s, such that h/2 ≤ t_c − t_s ≤ 2 · h.
Proof: Let r be the smallest integer such that h < 2^(r+1). Since r is the smallest such integer, it also means that h ≥ 2^r. This means that for any interval (t_c − h, t_c) of length h, at least one integer t' ∈ (t_c − h, t_c) must exist which satisfies the property that t' mod 2^(r−1) = 0 and t' mod 2^r ≠ 0. Let t' be the time stamp of the last (most current) such snapshot. This also means that a snapshot of frame number (r − 1) is stored at time t'. Then, if max_capacity is at least 2, the second-last snapshot of frame (r − 1) is also stored and has a time-stamp value of t' − 2^r. Let us pick the time t_s = t' − 2^r. By substituting the value of t_s, we get:

t_c − t_s = (t_c − t') + 2^r        (2.2)

Since (t_c − t') ≥ 0 and 2^r > h/2, it easily follows from Equation 2.2 that t_c − t_s > h/2.
Since t' is the position of the latest snapshot of frame (r − 1) occurring before the current time t_c, it follows that (t_c − t') < 2^r. Substituting this inequality in Equation 2.2, we get t_c − t_s < 2^r + 2^r ≤ h + h = 2 · h. Thus, we have h/2 ≤ t_c − t_s ≤ 2 · h.

Table 2.2. A geometric time window

Frame no.    Snapshots (by clock time)
0            69 67 65
1            70 66 62
2            68 60 52
3            56 40 24
4            48 16
5            32
6            64
The above result ensures that every possible horizon can be closely approximated within a modest level of accuracy. While the geometric time frame shares a number of conceptual similarities with the pyramidal time frame [6], it is actually quite different and also much more efficient. This is because it eliminates the double counting of the snapshots over different frame numbers, as is the case with the pyramidal time frame [6]. In Table 2.2, we present an example of a frame table illustrating snapshots of different frame numbers. The rules for insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^(i+1)) ≠ 0, t is inserted into frame number i; (2) each slot has a max_capacity (which is 3 in our example). At the insertion of t into frame number i, if the slot has already reached its max_capacity, the oldest snapshot in this frame is removed and the new snapshot is inserted. For example, at time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, 70 is inserted into frame number 1, which knocks out the oldest snapshot 58 if the slot capacity is 3. Following this rule, when the slot capacity is 3, the following snapshots are stored in the geometric time window table: 16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70, as shown in Table 2.2. From the table, one can see that the closer to the current time, the denser are the snapshots stored.
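A minimal sketch of this insertion rule is given below (Python; the function names are illustrative and a snapshot is reduced to its clock time). With max_capacity = 3 and clock times 1 through 70, it reproduces the sixteen stored snapshots listed above.

    def frame_number(t):
        # Largest i such that t is divisible by 2^i but not by 2^(i+1).
        i = 0
        while t % (2 ** (i + 1)) == 0:
            i += 1
        return i

    def insert_snapshot(frames, t, max_capacity=3):
        i = frame_number(t)
        slot = frames.setdefault(i, [])
        slot.append(t)
        if len(slot) > max_capacity:   # knock out the oldest snapshot of this frame
            slot.pop(0)

    frames = {}
    for t in range(1, 71):
        insert_snapshot(frames, t)
    print(sorted(t for slot in frames.values() for t in slot))
    # [16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70]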
3. Clustering Evolving Data Streams: A Micro-clustering Approach
The clustering problem is defined as follows: for a given set of data points,
we wish to partition them into one or more groups of similar objects. The
similarity of the objects with one another is typically defined with the use of
some distance measure or objective function. The clustering problem has been widely researched in the database, data mining and statistics communities [12, 18, 22, 20, 21, 24] because of its use in a wide range of applications. Recently, the clustering problem has also been studied in the context of the data stream environment [17, 23].
A previous algorithm called STREAM [23] assumes that the clusters are to be computed over the entire data stream. While such a task may be useful in many applications, a clustering problem may often be defined only over a portion of a data stream. This is because a data stream should be viewed as an infinite process consisting of data which continuously evolves with time. As a result, the underlying clusters may also change considerably with time. The nature of the clusters may vary with both the moment at which they are computed as well
as the time horizon over which they are measured. For example, a data analyst
may wish to examine clusters occurring in the last month, last year, or last
decade. Such clusters may be considerably different. Therefore, we assume
that one of the inputs to the clustering algorithm is a time horizon over which
the clusters are found. Next, we will discuss CluStream, the online algorithm
used for clustering data streams.
3.1 Micro-clustering Challenges
We note that since stream data naturally imposes a one-pass constraint on the design of the algorithms, it becomes more difficult to provide such flexibility in computing clusters over different kinds of time horizons using conventional algorithms. For example, a direct extension of the stream-based k-means algorithm in [23] to such a case would require a simultaneous maintenance of the intermediate results of clustering algorithms over all possible time horizons. Such a computational burden increases with progression of the data stream and can rapidly become a bottleneck for online implementation. Furthermore, in many cases, an analyst may wish to determine the clusters at a previous moment in time, and compare them to the current clusters. This requires even greater book-keeping and can rapidly become unwieldy for fast data streams.
Since a data stream cannot be revisited over the course of the computation, the clustering algorithm needs to maintain a substantial amount of information so that important details are not lost. For example, the algorithm in [23] is implemented as a continuous version of the k-means algorithm which continues to maintain a number of cluster centers which change or merge as necessary throughout the execution of the algorithm. Such an approach is especially risky when the characteristics of the stream change over time. This is because the amount of information maintained by a k-means type approach is too approximate in granularity, and once two cluster centers are joined, there is no way to informatively split the clusters when required by the changes in the stream at a later stage.
Therefore a natural design for stream clustering would be to separate out the process into an online micro-clustering component and an offline macro-clustering component. The online micro-clustering component requires a very efficient process for storage of appropriate summary statistics in a fast data stream. The offline component uses these summary statistics in conjunction with other user input in order to provide the user with a quick understanding of the clusters whenever required. Since the offline component requires only the summary statistics as input, it turns out to be very efficient in practice. This leads to several challenges:
- What is the nature of the summary information which can be stored efficiently in a continuous data stream? The summary statistics should provide sufficient temporal and spatial information for a horizon-specific offline clustering process, while being prone to an efficient (online) update process.
- At what moments in time should the summary information be stored away on disk? How can an effective trade-off be achieved between the storage requirements of such a periodic process and the ability to cluster for a specific time horizon to within a desired level of approximation?
- How can the periodic summary statistics be used to provide clustering and evolution insights over user-specified time horizons?
3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
The micro-clustering phase is the online statistical data collection portion of the algorithm. This process is not dependent on any user input such as the time horizon or the required granularity of the clustering process. The aim is to maintain statistics at a sufficiently high level of (temporal and spatial) granularity so that it can be effectively used by the offline components such as horizon-specific macro-clustering as well as evolution analysis. The basic concept of the micro-cluster maintenance algorithm derives ideas from the k-means and nearest neighbor algorithms. The algorithm works in an iterative fashion, by always maintaining a current set of micro-clusters. It is assumed that a total of q micro-clusters are stored at any moment by the algorithm. We will denote these micro-clusters by M_1 ... M_q. Associated with each micro-cluster i, we create a unique id whenever it is first created. If two micro-clusters are merged (as will become evident from the details of our maintenance algorithm), a list of ids is created in order to identify the constituent micro-clusters. The value of q is determined by the amount of main memory available in order to store the micro-clusters. Therefore, typical values of q are significantly larger than the natural number of clusters in the data but are also significantly smaller than the number of data points arriving in a long period of time for a massive
data stream. These micro-clusters represent the current snapshot of clusters which change over the course of the stream as new points arrive. Their status is stored away on disk whenever the clock time is divisible by α^i for any integer i. At the same time, any snapshots of order r which were stored at a time in the past more remote than α^(l+r) units are deleted by the algorithm.
We first need to create the initial q micro-clusters. This is done using an offline process at the very beginning of the data stream computation process. At the very beginning of the data stream, we store the first InitNumber points on disk and use a standard k-means clustering algorithm in order to create the q initial micro-clusters. The value of InitNumber is chosen to be as large as permitted by the computational complexity of a k-means algorithm creating q clusters.
Once these initial micro-clusters have been established, the online process of updating the micro-clusters is initiated. Whenever a new data point X_i arrives, the micro-clusters are updated in order to reflect the changes. Each data point either needs to be absorbed by a micro-cluster, or it needs to be put in a cluster of its own. The first preference is to absorb the data point into a currently existing micro-cluster. We first find the distance of each data point to the micro-cluster centroids M_1 ... M_q. Let us denote this distance value of the data point X_i to the centroid of the micro-cluster M_j by dist(M_j, X_i). Since the centroid of the micro-cluster is available in the cluster feature vector, this value can be computed relatively easily.
We find the closest cluster M_p to the data point X_i. We note that in many cases, the point X_i does not naturally belong to the cluster M_p. These cases are as follows:
- The data point X_i corresponds to an outlier.
- The data point X_i corresponds to the beginning of a new cluster because of evolution of the data stream.
While the two cases above cannot be distinguished until more data points arrive, the data point needs to be assigned a (new) micro-cluster of its own with a unique id. How do we decide whether a completely new cluster should be created? In order to make this decision, we use the cluster feature vector of M_p to decide if this data point falls within the maximum boundary of the micro-cluster M_p. If so, then the data point X_i is added to the micro-cluster M_p using the CF additivity property. The maximum boundary of the micro-cluster M_p is defined as a factor of t of the RMS deviation of the data points in M_p from the centroid. We define this as the maximal boundary factor. We note that the RMS deviation can only be defined for a cluster with more than 1 point. For a cluster with only 1 previous point, the maximum boundary is defined in a heuristic way. Specifically, we choose it to be r times that of the next closest cluster.
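The cluster feature vector referred to above can be sketched roughly as follows (Python; the class and field names are illustrative and do not reproduce the exact notation of [6]). Each micro-cluster keeps the linear sum and the sum of squares of the absorbed points, the corresponding sums for the timestamps, a point count, and an id list, so that the centroid and the RMS deviation needed for the maximum boundary test can be derived, and two micro-clusters can be merged by component-wise addition (the CF additivity property).

    import math

    class MicroCluster:
        """Illustrative cluster feature vector: sums of points and timestamps plus an id list."""

        def __init__(self, point, timestamp, cluster_id):
            self.cf1x = list(point)                    # linear sum of the points
            self.cf2x = [x * x for x in point]         # sum of squares of the points
            self.cf1t = timestamp                      # linear sum of the timestamps
            self.cf2t = timestamp * timestamp          # sum of squares of the timestamps
            self.n = 1
            self.ids = [cluster_id]

        def centroid(self):
            return [s / self.n for s in self.cf1x]

        def rms_deviation(self):
            # Root-mean-square deviation of the points from the centroid (defined for n > 1).
            var = sum(sq / self.n - (s / self.n) ** 2 for sq, s in zip(self.cf2x, self.cf1x))
            return math.sqrt(max(var, 0.0))

        def absorb(self, point, timestamp):
            # CF additivity: adding a point only updates the sufficient statistics.
            for j, x in enumerate(point):
                self.cf1x[j] += x
                self.cf2x[j] += x * x
            self.cf1t += timestamp
            self.cf2t += timestamp * timestamp
            self.n += 1

        def merge(self, other):
            for j in range(len(self.cf1x)):
                self.cf1x[j] += other.cf1x[j]
                self.cf2x[j] += other.cf2x[j]
            self.cf1t += other.cf1t
            self.cf2t += other.cf2t
            self.n += other.n
            self.ids.extend(other.ids)                 # idlist of the constituent micro-clusters

With such a structure, the absorption decision described above amounts to checking whether the distance from X_i to the centroid of the closest micro-cluster is at most t times its RMS deviation.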
If the data point does not lie within the maximum boundary of the nearest micro-cluster, then a new micro-cluster must be created containing the data point X_i. This newly created micro-cluster is assigned a new id which can identify it uniquely at any future stage of the data stream process. However, in order to create this new micro-cluster, the number of other clusters must be reduced by one in order to create memory space. This can be achieved by either deleting an old cluster or joining two of the old clusters. Our maintenance algorithm first determines if it is safe to delete any of the current micro-clusters as outliers. If not, then a merge of two micro-clusters is initiated.
The first step is to identify if any of the old micro-clusters are possibly outliers which can be safely deleted by the algorithm. While it might be tempting to simply pick the micro-cluster with the fewest number of points as the micro-cluster to be deleted, this may often lead to misleading results. In many cases, a given micro-cluster might correspond to a point of considerable cluster presence in the past history of the stream, but may no longer be an active cluster in the recent stream activity. Such a micro-cluster can be considered an outlier from the current point of view. An ideal goal would be to estimate the average timestamp of the last m arrivals in each micro-cluster, and delete the micro-cluster with the least recent timestamp. While the above estimation can be achieved by simply storing the last m points in each micro-cluster, this increases the memory requirements of a micro-cluster by a factor of m. Such a requirement reduces the number of micro-clusters that can be stored by the available memory and therefore reduces the effectiveness of the algorithm.
We will find a way to approximate the average timestamp of the last m data points of the cluster M. This will be achieved by using the data about the timestamps stored in the micro-cluster M. We note that the timestamp data allows us to calculate the mean and standard deviation of the arrival times of points in a given micro-cluster M. Let these values be denoted by μ_M and σ_M respectively. Then, we find the time of arrival of the m/(2·n)-th percentile of the points in M assuming that the timestamps are normally distributed. This timestamp is used as the approximate value of the recency. We shall call this value the relevance stamp of cluster M. When the least relevance stamp of any micro-cluster is below a user-defined threshold δ, it can be eliminated and a new micro-cluster can be created with a unique id corresponding to the newly arrived data point X_i.
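Under the stated normality assumption, the relevance stamp can be computed from the stored timestamp statistics alone. The sketch below (Python) assumes a micro-cluster with the cf1t, cf2t and n fields of the earlier sketch, and interprets the m/(2·n)-th percentile as being counted from the most recent arrivals, i.e. the (1 − m/(2·n))-quantile of the fitted normal distribution; both of these are assumptions of the sketch rather than details given here.

    import math
    from statistics import NormalDist

    def relevance_stamp(mc, m):
        # Mean and standard deviation of the arrival times, from the sufficient statistics.
        mu = mc.cf1t / mc.n
        var = mc.cf2t / mc.n - mu * mu
        sigma = math.sqrt(max(var, 0.0))
        if mc.n < 2 * m or sigma == 0.0:
            return mu                       # too few points: fall back to the mean timestamp
        # Approximate average arrival time of the last m points under the normal assumption.
        q = 1.0 - m / (2.0 * mc.n)
        return NormalDist(mu, sigma).inv_cdf(q)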
In some cases, none of the micro-clusters can be readily eliminated. This happens when all relevance stamps are sufficiently recent and lie above the user-defined threshold δ. In such a case, two of the micro-clusters need to be merged. We merge the two micro-clusters which are closest to one another. The new micro-cluster no longer corresponds to one id. Instead, an idlist is created which is the union of the ids in the individual micro-clusters. Thus, any micro-cluster which is the result of one or more merging operations can be identified in terms of the individual micro-clusters merged into it.
While the above process of updating is executed at the arrival of each data point, an additional process is executed at each clock time which is divisible by α^i for any integer i. At each such time, we store away the current set of micro-clusters (possibly on disk) together with their id list, and indexed by their time of storage. We also delete the least recent snapshot of order i, if α^l + 1 snapshots of such order have already been stored on disk, and if the clock time for this snapshot is not divisible by α^(i+1). (In the latter case, the snapshot continues to be a viable snapshot of order (i + 1).) These micro-clusters can then be used to form higher level clusters or an evolution analysis of the data stream.
3.3 High Dimensional Projected Stream Clustering
The method can also be extended to the case of high dimensional projected stream clustering. The algorithm is referred to as HPSTREAM. The high-dimensional case presents a special challenge to clustering algorithms even in the traditional domain of static data sets. This is because of the sparsity of the data in the high-dimensional case. In high-dimensional space, all pairs of points tend to be almost equidistant from one another. As a result, it is often unrealistic to define distance-based clusters in a meaningful way. Some recent work on high-dimensional data uses techniques for projected clustering which can determine clusters for a specific subset of dimensions [1, 4]. In these methods, the definitions of the clusters are such that each cluster is specific to a particular group of dimensions. This alleviates the sparsity problem in high-dimensional space to some extent. Even though a cluster may not be meaningfully defined on all the dimensions because of the sparsity of the data, some subset of the dimensions can always be found on which particular subsets of points form high quality and meaningful clusters. Of course, these subsets of dimensions may vary over the different clusters. Such clusters are referred to as projected clusters [1].
In [8], we have discussed methods for high dimensional projected clustering of data streams. The basic idea is to use an (incremental) algorithm in which we associate a set of dimensions with each cluster. The set of dimensions is represented as a d-dimensional bit vector B(C_i) for each cluster structure in FCS. This bit vector contains a 1 bit for each dimension which is included in cluster C_i. In addition, the maximum number of clusters k and the average cluster dimensionality l is used as an input parameter. The average cluster dimensionality l represents the average number of dimensions used in the cluster projection. An iterative approach is used in which the dimensions are used to update the clusters and vice-versa. The structure in FCS uses a decay-based mechanism in order to adjust for evolution in the underlying data stream. Details are discussed in [8].
Figure 2.3. Varying Horizons for the classification process
4. Classification of Data Streams: A Micro-clustering Approach
One important data mining problem which has been studied in the context of data streams is that of stream classification [15]. The main thrust on data stream mining in the context of classification has been that of one-pass mining [14, 19]. In general, the use of one-pass mining does not recognize the changes which have occurred in the model since the beginning of the stream construction process [5]. While the work in [19] works on time changing data streams, the focus is on providing effective methods for incremental updating of the classification model. We note that the accuracy of such a model cannot be greater than the best sliding window model on a data stream. For example, in the case illustrated in Figure 2.3, we have illustrated two classes (labeled by 'x' and '-') whose distribution changes over time. Correspondingly, the best horizon at times t1 and t2 will also be different. As our empirical results will show, the true behavior of the data stream is captured in a temporal model which is sensitive to the level of evolution of the data stream.
The classification process may require simultaneous model construction and testing in an environment which constantly evolves over time. We assume that the testing process is performed concurrently with the training process. This is often the case in many practical applications, in which only a portion of the data is labeled, whereas the remainder is not. Therefore, such data can be separated out into the (labeled) training stream, and the (unlabeled) testing stream. The main difference in the construction of the micro-clusters is that the micro-clusters are associated with a class label; therefore an incoming data point in the training stream can only be added to a micro-cluster belonging to the same class. Therefore, we construct micro-clusters in almost the same way
as the unsupervised algorithm, with an additional class-label restriction.
From the testing perspective, the important point to be noted is that the most effective classification model does not stay constant over time, but varies with progression of the data stream. If a static classification model were used for an evolving test stream, the accuracy of the underlying classification process is likely to drop suddenly when there is a sudden burst of records belonging to a particular class. In such a case, a classification model which is constructed using a smaller history of data is likely to provide better accuracy. In other cases, a longer history of training provides greater robustness.
In the classification process of an evolving data stream, either the short term or long term behavior of the stream may be more important, and it often cannot be known a priori as to which one is more important. How do we decide the window or horizon of the training data to use so as to obtain the best classification accuracy? While techniques such as decision trees are useful for one-pass mining of data streams [14, 19], these cannot be easily used in the context of an on-demand classifier in an evolving environment. This is because such a classifier requires rapid variation in the horizon selection process due to data stream evolution. Furthermore, it is too expensive to keep track of the entire history of the data in its original fine granularity. Therefore, the on-demand classification process still requires the appropriate machinery for efficient statistical data collection in order to perform the classification process.
4.1 On-Demand Stream Classification
We use the micro-clusters to perform an On Demand Stream Classification Process. In order to perform effective classification of the stream, it is important to find the correct time-horizon which should be used for classification. How do we find the most effective horizon for classification at a given moment in time? In order to do so, a small portion of the training stream is not used for the creation of the micro-clusters. This portion of the training stream is referred to as the horizon fitting stream segment. The number of points in the stream used for horizon fitting is denoted by k_fit. The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters as discussed in the previous section.
Since the micro-clusters are based on the entire history of the stream, they cannot directly be used to test the effectiveness of the classification process over different time horizons. This is essential, since we would like to find the time horizon which provides the greatest accuracy during the classification process. We will denote the set of micro-clusters at time t_c and horizon h by N(t_c, h). This set of micro-clusters is determined by subtracting out the micro-clusters at time t_c − h from the micro-clusters at time t_c. The subtraction operation is naturally defined for the micro-clustering approach. The essential idea is to match the micro-clusters at time t_c to the micro-clusters at time t_c − h, and subtract out the corresponding statistics. The additive property of micro-clusters ensures that the resulting clusters correspond to the horizon (t_c − h, t_c). More details can be found in [6].
Once the micro-clusters for a particular time horizon have been determined, they are utilized to determine the classification accuracy of that particular horizon. This process is executed periodically in order to adjust for the changes which have occurred in the stream in recent time periods. For this purpose, we use the horizon fitting stream segment. The last k_fit points which have arrived in the horizon fitting stream segment are utilized in order to test the classification accuracy of that particular horizon. The value of k_fit is chosen while taking into consideration the computational complexity of the horizon accuracy estimation. In addition, the value of k_fit should be small enough so that the points in it reflect the immediate locality of t_c. Typically, the value of k_fit should be chosen in such a way that the least recent point is no older than a pre-specified number of time units before the current time t_c. Let us denote this set of points by Q_fit. Note that since Q_fit is a part of the training stream, the class labels are known a priori.
In order to test the classification accuracy of the process, each point X ∈ Q_fit is used in the following nearest neighbor classification procedure:
- We find the closest micro-cluster in N(t_c, h) to X.
- We determine the class label of this micro-cluster and compare it to the true class label of X. The accuracy over all the points in Q_fit is then determined. This provides the accuracy over that particular time horizon.
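A compact sketch of this accuracy estimation is given below (Python). It assumes micro-cluster snapshots shaped like the earlier MicroCluster sketch with an additional class_label field, and it simplifies the matching step to identical id lists; none of these names are taken from [6] or [7].

    import copy
    import math

    def subtract(current, past):
        # Horizon-specific micro-clusters N(t_c, h): match snapshots by id list and
        # subtract the corresponding statistics (valid because of the additive property).
        past_by_id = {tuple(mc.ids): mc for mc in past}
        horizon_clusters = []
        for mc in current:
            old = past_by_id.get(tuple(mc.ids))
            if old is None:
                horizon_clusters.append(mc)            # cluster created inside the horizon
                continue
            diff = copy.deepcopy(mc)
            for j in range(len(diff.cf1x)):
                diff.cf1x[j] -= old.cf1x[j]
                diff.cf2x[j] -= old.cf2x[j]
            diff.n -= old.n
            if diff.n > 0:
                horizon_clusters.append(diff)
        return horizon_clusters

    def nearest(clusters, x):
        return min(clusters, key=lambda mc: math.dist(mc.centroid(), x))

    def horizon_accuracy(current, past, q_fit):
        # q_fit is a list of (point, true_label) pairs from the horizon fitting segment.
        clusters = subtract(current, past)
        hits = sum(1 for x, label in q_fit if nearest(clusters, x).class_label == label)
        return hits / len(q_fit) if q_fit else 0.0

The p horizons with the highest such accuracy would then be retained for classifying the test stream, as described next.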
The accuracy of all the time horizons which are tracked by the geometric time frame is determined. The p time horizons which provide the greatest dynamic classification accuracy (using the last k_fit points) are selected for the classification of the stream. Let us denote the corresponding horizon values by H = {h_1 ... h_p}. We note that since k_fit represents only a small locality of the points within the current time period t_c, it would seem at first sight that the system would always pick the smallest possible horizons in order to maximize the accuracy of classification. However, this is often not the case for evolving data streams. Consider for example, a data stream in which the records for a given class arrive for a period, and then subsequently start arriving again after a time interval in which the records for another class have arrived. In such a case, the horizon which includes previous occurrences of the same class is likely to provide higher accuracy than shorter horizons. Thus, such a system dynamically adapts to the most effective horizon for classification of data streams. In addition, for a stable stream the system is also likely to pick larger horizons because of the greater accuracy resulting from the use of larger data sizes.
The classification of the test stream is a separate process which is executed continuously throughout the algorithm. For each given test instance X, the above described nearest neighbor classification process is applied using each h_i ∈ H. It is often possible that in the case of a rapidly evolving data stream, different horizons may result in the determination of different class labels. The majority class among these p class labels is reported as the relevant class. More details on the technique may be found in [7].
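Continuing the single-horizon sketch given earlier, the final label of a test instance can be obtained by a simple majority vote over the p selected horizons (again an illustrative fragment that reuses the nearest helper from that sketch, not the exact routine of [7]).

    from collections import Counter

    def on_demand_classify(test_point, horizon_cluster_sets):
        # horizon_cluster_sets holds N(t_c, h_i) for each of the p best horizons h_i.
        votes = [nearest(clusters, test_point).class_label for clusters in horizon_cluster_sets]
        return Counter(votes).most_common(1)[0][0]     # majority class among the p labels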
5. Other Applications of Micro-clustering and Research
Directions
While this paper discusses two applications of micro-clustering, we note that a number of other problems can be handled with the micro-clustering approach. This is because the process of micro-clustering creates a summary of the data which can be leveraged in a variety of ways for other problems in data mining. Some examples of such problems are as follows:
- Privacy Preserving Data Mining: In the problem of privacy preserving data mining, we create condensed representations [3] of the data which show k-anonymity. These condensed representations are like micro-clusters, except that each cluster has a minimum cardinality threshold on the number of data points in it. Thus, each cluster contains at least k data points, and we ensure that each record in the data cannot be distinguished from at least k other records. For this purpose, we only maintain the summary statistics for the data points in the clusters as opposed to the individual data points themselves. In addition to the first and second order moments we also maintain the covariance matrix for the data in each cluster. We note that the covariance matrix provides a complete overview of the distribution of the data. This covariance matrix can be used in order to generate the pseudo-points which match the distribution behavior of the data in each micro-cluster. For relatively small micro-clusters, it is possible to match the probabilistic distribution in the data fairly closely. The pseudo-points can be used as a surrogate for the actual data points in the clusters in order to generate the relevant data mining results. Since the pseudo-points match the original distribution quite closely, they can be used for the purpose of a variety of data mining algorithms (a small sketch of this pseudo-point generation appears after this list). In [3], we have illustrated the use of the privacy-preserving technique in the context of the classification problem. Our results show that the classification accuracy is not significantly reduced because of the use of pseudo-points instead of the individual data points.
- Query Estimation: Since micro-clusters encode summary information about the data, they can also be used for query estimation. A typical example of such a technique is that of estimating the selectivity of queries. In such cases, the summary statistics of micro-clusters can be used in order to estimate the number of data points which lie within a certain interval such as a range query. Such an approach can be very efficient in a variety of applications since voluminous data streams are difficult to use if they need to be utilized for query estimation. However, the micro-clustering approach can condense the data into summary statistics, so that it is possible to efficiently use it for various kinds of queries. We note that the technique is quite flexible as long as it can be used for different kinds of queries. An example of such a technique is illustrated in [9], in which we use the micro-clustering technique (with some modifications on the tracked statistics) for futuristic query processing in data streams.
- Statistical Forecasting: Since micro-clusters contain temporal and condensed information, they can be used for methods such as statistical forecasting of streams. While it can be computationally intensive to use standard forecasting methods with large volumes of data points, the micro-clustering approach provides a methodology in which the condensed data can be used as a surrogate for the original data points. For example, for a standard regression problem, it is possible to use the centroids of different micro-clusters over the various temporal time frames in order to estimate the values of the data points. These values can then be used for making aggregate statistical observations about the future. We note that this is a useful approach in many applications since it is often not possible to effectively make forecasts about the future using the large volume of the data in the stream. In [9], it has been shown how to use the technique for querying and analysis of future behavior of data streams.
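As a rough illustration of the pseudo-point idea from the privacy-preserving item above, the sketch below draws surrogate points from a multivariate normal distribution parameterized by a condensed group's mean and covariance matrix; representing a condensed group as a (count, mean, covariance) triple is an assumption of this sketch, not the exact structure used in [3].

    import numpy as np

    def generate_pseudo_points(count, mean, covariance, rng=None):
        # Draw as many surrogate points as the condensed group summarizes, matching its
        # first and second order statistics under a multivariate normal assumption.
        rng = rng or np.random.default_rng()
        return rng.multivariate_normal(mean, covariance, size=count)

    # Hypothetical condensed group with k = 5 records in two dimensions.
    group_count = 5
    group_mean = np.array([10.0, 2.5])
    group_cov = np.array([[4.0, 1.0],
                          [1.0, 0.5]])
    pseudo = generate_pseudo_points(group_count, group_mean, group_cov)
    print(pseudo.shape)   # (5, 2): surrogates used in place of the original records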
In addition, we believe that the micro-clustering approach is powerful enough to accommodate a wide variety of problems which require information about the summary distribution of the data. In general, since many new data mining problems require summary information about the data, it is conceivable that the micro-clustering approach can be used as a methodology to store condensed statistics for general data mining and exploration applications.
6. Performance Study and Experimental Results
All of our experiments are conducted on a PC with an Intel Pentium III processor and 512 MB memory, which runs the Windows XP Professional operating system. For testing the accuracy and efficiency of the CluStream algorithm, we compare CluStream with the STREAM algorithm [17, 23], the best algorithm reported so far for clustering data streams. CluStream is implemented according to the description in this paper, and the STREAM K-means is done strictly according to [23], which shows better accuracy than BIRCH [24]. To make the comparison fair, both CluStream and STREAM K-means use the same amount of memory. Specifically, they use the same stream incoming speed, the same amount of memory to store intermediate clusters (called micro-clusters in CluStream), and the same amount of memory to store the final clusters (called macro-clusters in CluStream).
Because the synthetic datasets can be generated by controlling the number of data points, the dimensionality, and the number of clusters, with different distribution or evolution characteristics, they are used to evaluate the scalability in our experiments. However, since synthetic datasets are usually rather different from real ones, we will mainly use real datasets to test accuracy, cluster evolution, and outlier detection.
Real datasets. First, we need to find some real datasets that evolve significantly over time in order to test the effectiveness of CluStream. A good candidate for such testing is the KDD-CUP'99 Network Intrusion Detection stream data set which has been used earlier [23] to evaluate STREAM accuracy with respect to BIRCH. This data set corresponds to the important problem of automatic and real-time detection of cyber attacks. This is also a challenging problem for dynamic stream clustering in its own right. The offline clustering algorithms cannot detect such intrusions in real time. Even the recently proposed stream clustering algorithms such as BIRCH and STREAM cannot be very effective because the clusters reported by these algorithms are all generated from the entire history of the data stream, whereas the current cases may have evolved significantly.
The Network Intrusion Detection dataset consists of a series of TCP connection records from two weeks of LAN network traffic managed by MIT Lincoln Labs. Each record can either correspond to a normal connection, or an intrusion or attack. The attacks fall into four main categories: DOS (i.e., denial-of-service), R2L (i.e., unauthorized access from a remote machine), U2R (i.e., unauthorized access to local superuser privileges), and PROBING (i.e., surveillance and other probing). As a result, the data contains a total of five clusters including the class for "normal connections". The attack types are further classified into one of 24 types, such as buffer-overflow, guess-passwd, neptune, portsweep, rootkit, smurf, warezclient, spy, and so on. It is evident that each specific attack type can be treated as a sub-cluster. Most of the connections in this dataset are normal, but occasionally there could be a burst of attacks at certain times. Also, each connection record in this dataset contains 42 attributes, such as duration of the connection, the number of data bytes transmitted from source to destination (and vice versa), percentile of connections that have "SYN" errors, the number of "root" accesses, etc. As in [23], all 34 continuous attributes will be used for clustering and one outlier point has been removed.
Second, besides testing on the rapidly evolving network intrusion data stream, we also test our method over relatively stable streams. Since previously reported stream clustering algorithms work on the entire history of stream data, we believe that they should perform effectively for some data sets with stable distribution over time. An example of such a data set is the KDD-CUP'98 Charitable Donation data set. We will show that even for such datasets, CluStream can consistently beat the STREAM algorithm.
The KDD-CUP'98 Charitable Donation data set has also been used in evaluating several one-scan clustering algorithms, such as [16]. This data set contains 95412 records of information about people who have made charitable donations in response to direct mailing requests, and clustering can be used to group donors showing similar donation behavior. As in [16], we will only use 56 fields which can be extracted from the total 481 fields of each record. This data set is converted into a data stream by taking the data input order as the order of streaming and assuming that they flow in with a uniform speed.
Synthetic datasets. To test the scalability of CluStream, we generate some synthetic datasets by varying the base size from 100K to 1000K points, the number of clusters from 4 to 64, and the dimensionality in the range of 10 to 100. Because we know the true cluster distribution a priori, we can compare the clusters found with the true clusters. The data points of each synthetic dataset will follow a series of Gaussian distributions, and to reflect the evolution of the stream data over time, we change the mean and variance of the current Gaussian distribution every 10K points in the synthetic data generation.
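A generator in this spirit can be sketched as follows (Python with NumPy; all parameter values and the drift mechanism are placeholders rather than the exact generator used in these experiments).

    import numpy as np

    def synthetic_stream(n_points, n_clusters=5, dim=20, drift_every=10_000, seed=0):
        # Evolving Gaussian mixture: the means and variances drift every drift_every points.
        rng = np.random.default_rng(seed)
        means = rng.uniform(0, 100, size=(n_clusters, dim))
        scales = rng.uniform(1, 5, size=n_clusters)
        for i in range(n_points):
            if i > 0 and i % drift_every == 0:         # reflect evolution of the stream
                means += rng.normal(0, 5, size=means.shape)
                scales = np.abs(scales + rng.normal(0, 0.5, size=n_clusters))
            c = int(rng.integers(n_clusters))
            yield c, rng.normal(means[c], scales[c])   # (true cluster id, d-dimensional point)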
The quality of clustering on the real data sets was measured using the sum of square distance (SSQ), defined as follows. Assume that there are a total of N points in the past horizon at current time T_c. For each point p_i, we find the centroid C_{p_i} of its closest macro-cluster, and compute d(p_i, C_{p_i}), the distance between p_i and C_{p_i}. Then the SSQ at time T_c with horizon H (denoted as SSQ(T_c, H)) is equal to the sum of d²(p_i, C_{p_i}) for all the N points within the previous horizon H. Unless otherwise mentioned, the algorithm parameters were set at α = 2, l = 10, InitNumber = 2000, and t = 2.
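For reference, the SSQ measure just defined can be written in a few lines (Python; points_in_horizon and macro_centroids are placeholders for the N points within horizon H at time T_c and the reported macro-cluster centroids).

    import math

    def ssq(points_in_horizon, macro_centroids):
        # Sum of squared distances from each point to the centroid of its closest macro-cluster.
        total = 0.0
        for p in points_in_horizon:
            d = min(math.dist(p, c) for c in macro_centroids)
            total += d * d
        return total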
We compare the clustering quality of CluStream with that of STREAM for different horizons at different times using the Network Intrusion dataset and the Charitable Donation data set. The results are illustrated in Figures 2.4 and 2.5. We run each algorithm 5 times and compute their average SSQs. The results show that CluStream is almost always better than STREAM. All experiments for these datasets have shown that CluStream has substantially higher quality than STREAM. However, the Network Intrusion data set showed significantly better results than the Charitable Donation data set because of the fact that the network intrusion data set was a highly evolving data set. For such cases, the evolution-sensitive CluStream algorithm was much more effective than the STREAM algorithm.
Figure 2.4. Quality comparison (Network Intrusion dataset, horizon=256, stream_speed=200)
Figure 2.5. Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200)
Figure 2.6. Accuracy comparison (Network Intrusion dataset, buffer_size=1600, k_fit=80, init_number=400)
Figure 2.7. Distribution of the (smallest) best horizon (Network Intrusion dataset, Time units=2500, buffer_size=1600, k_fit=80, init_number=400)
Figure 2.8. Accuracy comparison (Synthetic dataset B300kC5D20, buffer_size=500, k_fit=25, init_number=400)
Figure 2.9. Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, k_fit=25, init_number=400)

We also tested the accuracy of the On-Demand Stream Classifier. The first test was performed on the Network Intrusion Data Set. The first experiment
was conducted with a stream speed of 80 connections per time unit (i.e., there are 40 training stream points and 40 test stream points per time unit). We set the buffer_size at 1600 points, which means upon receiving 1600 points (including both training and test stream points) we will use a small set of the training data points (in this case k_fit = 80) to choose the best horizon. We compared the accuracy of the On-Demand-Stream classifier with two simple one-pass stream classifiers over the entire data stream and the selected sliding window (i.e., sliding window H = 8). Figure 2.6 shows the accuracy comparison among the three algorithms. We can see that the On-Demand-Stream classifier consistently beats the two simple one-pass classifiers. For example, at time unit 2000, the On-Demand-Stream classifier's accuracy is about 4% higher than the classifier with the fixed sliding window, and is about 2% higher than the classifier with the entire dataset. Because the class distribution of this dataset evolves significantly over time, either the entire dataset or a fixed sliding window may not always capture the underlying stream evolution nature. As a result, they always have a worse accuracy than the On-Demand-Stream classifier which always dynamically chooses the best horizon for classifying.
Figure 2.7 shows the distribution of the best horizons (they are the smallest ones if there exist several best horizons at the same time). Although about 78.4% of the (smallest) best horizons have a value of 1/4, there do exist about 21.6% best horizons ranging from 1/2 to 32 (e.g., about 6.4% of the best horizons have a value of 32). This also illustrates that there is no fixed sliding window that can achieve the best accuracy, and the reason why the On-Demand-Stream classifier can outperform the simple one-pass classifiers over either the entire dataset or a fixed sliding window.
We have also generated one synthetic dataset B300kC5D20 to test the classification accuracy of these algorithms. This dataset contains 5 class labels and 300K data points with 20 dimensions. We first set the stream speed at 100 points
strikingly corroborate the impression formed by her on hearing this
and other of Brahms' works played under his own direction.
The publications of 1884 were, besides the third Symphony, Two
Songs for Contralto with Viola and Pianoforte, the second being the
'Virgin's Cradle Song,' already mentioned as one of the compositions
of 1865; two sets of four-part Songs, the one for accompanied Solo
voices, the other for mixed Chorus a capella, and the two books of
Songs, Op. 94 and 95.
At this date Brahms had entered into what we may call the third
period of his activity as a song-writer—one in which he frequently
chose texts that speak of loneliness or death. The wonderful beauty
of his settings of these subjects penetrates the very soul, and by the
mere force of its pathos carries to the hearer the conviction that the
composer speaks out of the feeling of his own heart. Stockhausen,
trying the song 'Mit vierzig Jahren' (Op. 94, No. 1) from the
manuscript to the composer's accompaniment, was so affected
during its performance that he could not at once proceed to the end.
Our remarks are, however, by no means intended to convey the
impression that Brahms only or generally chose poems of a
melancholy tendency at this time.
WITH FORTY YEARS.
By Friedrich Rückert (1788-1866).
With forty years we've gained the mountain's summit,
We stand awhile and look behind;
There we behold the quiet years of childhood
And there the joy of youth we find.

Look once again, and then, with freshened vigour,
Take up thy staff and onward wend!
A mountain-ridge extendeth, broad, before thee,
Not here, but there must thou descend.

No longer, climbing, need'st thou struggle breathless,
The level path will lead thee on;
And then with thee a little downward tending,
Before thou know'st, thy journey's done.
With the knowledge we have gained of the master's habit of
producing his large works in couples, we are prepared to find him
employed this summer on the composition of a fourth symphony.
Avoiding a long journey, he settled down to his work at Mürz
Zuschlag in Styria, not far from the highest ridge of the Semmering.
Hearing soon after his arrival there that his old friend Misi Reinthaler,
now grown up into a young lady, was leaving home under her
mother's care to go through a course of treatment under a famous
Vienna specialist, he wrote to place his rooms in Carlsgasse at Frau
Reinthaler's disposal. The offer was not accepted, but when the
invalid was sufficiently convalescent, he insisted that the two ladies
should come for a few days as his guests to Mürz Zuschlag, where
he took rooms for them near his own lodgings. He went over to see
them also at Vienna, and spent the greater part of a morning
showing them his valuable collection of autographs and other
treasures. 'Yes, these would have been something to give a wife!'
was his answer to the ladies' expressions of delight. Amongst his
collection of musical autographs were two written on different sides
of the same sheet of paper—one of Beethoven, the song 'Ich liebe
dich'; the other of Schubert, part of a pianoforte composition. These,
with Brahms' autograph signature 'Joh. Brahms in April 1872,'
written at the bottom of one of the pages, constitute a unique
triplet. The sheet now belongs to the Gesellschaft library, and is
framed within glass.
The society of Hanslick, who came with his wife to stay near Mürz
Zuschlag for part of the summer, was very acceptable to Brahms.
The departure of his friends at the close of the season, in the
company of some mutual Vienna acquaintances, incited the
composer to an act of courtesy of a kind quite unusual with him, the
sequel to which seems to have caused him almost comical
annoyance that found expression in a couple of notes sent
immediately afterwards to Hanslick.
'Dearest Friend,
'Here I stand with roses and pansies; which means with a
basket of fruit, liqueurs and cakes! You must have
travelled through by the earlier Sunday extra train? I
made a good and unusual impression for politeness at the
station! The children are now rejoicing over the cakes....'
and, on finding that, mistaking the time of the train, he had arrived
a quarter of an hour late:
'How such a stupid thing can spoil one's day and the
thought of it recur to torment one. I hope you do not
know this as well as I, who am for ever preparing for
myself such vexatious worry....'
Later on, writing about other matters, he adds:
'... I hope Professor Schmidt's ladies do not describe my
promenade with the basket too graphically in Vienna!
Otherwise my unspoiled lady friends may cease to be so
unassuming.'[68]
The journeys of the winter included visits to Bremen and Oldenburg,
during which Hermine Spiess, one of the very favourite younger
interpreters of Brahms' songs, sang dainty selections of them to the
composer's accompaniment, with overwhelming success. The early
death of this gifted artist, soon after her marriage, caused the
master, with whom she was a great favourite, deep and sincere grief.
Brahms went also to Crefeld, where the 'Tafellied,' dedicated on
publication 'To the friends in Crefeld in remembrance of Jan. 28th
1885,' was sung on the date in question, with some of the new part-
songs a capella, and other of the composer's works, at the jubilee of
the Crefeld Concert Society. The manuscript score of the 'Tafellied' is
in the possession of Herr Alwin von Beckerath, to whom it was
presented by Brahms with an affectionate inscription.
CHAPTER XX
1885-1888
Vienna Tonkünstlerverein—Fourth Symphony—Hugo Wolf
—Brahms at Thun—Three new works of chamber music—
First performances of the second Violoncello Sonata by
Brahms and Hausmann—Frau Celestine Truxa—Double
Concerto—Marxsen's death—Eugen d'Albert—The Gipsy
Songs—Conrat's translations from the Hungarian—Brahms
and Jenner—The 'Zum rothen Igel'—Ehrbar's asparagus
luncheons—Third Sonata for Pianoforte and Violin.
The early part of the year 1885 offers for record no event of unusual
interest to the reader. The greater portion of it was spent by Brahms
in his customary routine in Vienna. He was generally to be seen at
the weekly meetings of the Tonkünstlerverein, a musicians' club
founded by Epstein, Gänsbacher, and others, of which the master
had consented to be named honorary life-president. The Monday
evening proceedings included a short musical programme,
sometimes followed by an informal supper. Brahms did not usually
sit in the music-room, but would remain in a smaller apartment
smoking and chatting sociably with friends of either sex. His arrival
always became known at once to the assembled company, 'Brahms
is here; Brahms is come!' being passed eagerly from mouth to
mouth. His old love of open-air exercise had not diminished with
increasing years, and the Sunday custom of a long walk in the
country was still kept up. A few friends used to meet in the morning
outside the Café Bauer, opposite the Opera House, and, taking train
or tram to the outskirts of the city, would thence proceed on foot,
returning in the late afternoon. Brahms, nearly always in a good
humour on these occasions, was generally soon ahead of his
companions, or leading the way with the foremost, and, as had
usually been the case with him through life, was looked upon by his
friends as the chief occasion of their meetings, allowed his own way,
and admired as a kind of pet oracle. The excursions always
commenced for the season on his return to Vienna in the autumn,
and were continued with considerable regularity until his departure
in the spring. They not infrequently gave opportunity for the
employment of the composer's unfailing readiness of repartee, as on
the occasion of a meeting in the train, on the return journey, with a
learned but unmusical acquaintance of one of the party, between
whom and Brahms an animated conversation arose. 'Will you not
join us one day, Herr Doctor? Next Sunday, perhaps?' asked Brahms.
'I!' exclaimed the other. 'Saul among the prophets?' 'Na, so you give
yourself royal airs!' instantly rejoined the master.
The fourth symphony was completed during the summer at Mürz
Zuschlag, where Brahms this year had the advantage of Dr. and Frau
Fellinger's society, and—indispensable for his complete enjoyment of
a home circle—that of their children. Returning one afternoon from a
walk, he found that the house in which he lodged had caught fire,
and that his friends were busily engaged in bringing his papers, and
amongst them the nearly-finished manuscript of the new symphony,
into the garden. He immediately set to work to help in getting the
fire under, whilst Frau Fellinger sat out of doors with either arm
outspread on the precious papers piled on each side of her. Luckily,
all serious harm was averted, and it was soon possible to restore the
manuscripts intact to the composer's apartments.
Brahms paid a neighbourly call, in the course of the summer, on the
author Rosegger, who was living in his small country house at
Krieglach near Mürz Zuschlag, and tasted the unusual experience of
a repulse. Absorbed in work at the moment when his servant
announced 'a strange gentleman,' Rosegger, without glancing at the
card placed beside him, desired his visitor to 'sit down for a
moment.' Conscious only of the presence of a bearded stranger with
a gray overcoat over his shoulder and a light-coloured umbrella in
his hand, he vouchsafed but scant answer to the trifling remarks
with which his caller tried to pave the way to cordiality, and before
long Brahms composedly remarked that he would be on his legs
again, and took leave. It was not till some minutes after his
departure that it occurred to Rosegger to glance at the card, and he
has himself described the feelings of despair with which he read the
words 'Johannes Brahms' staring at him in all the reality of black on
white. Not he alone, but the ladies of his family, were enthusiastic
admirers of the composer's genius. He was so overwhelmed by his
mistake as to be incapable of taking any steps to remedy it, and
firmly declined to yield to the entreaties of his wife and daughter
that he would return the visit and explain matters to Brahms. He
published an amusing account of the misadventure in the year 1894
in an issue of the Heimgarten. Perhaps it may have fallen into the
master's hands.
The honour not only of the first, but of several subsequent early
performances of the Symphony in E minor, fell to the Meiningen
orchestra. The work was announced for the third subscription
concert of the season 1885-86, and shortly beforehand the score
and parts of the third and fourth movements were sent by the
composer to Meiningen for correction at a preliminary rehearsal
under Bülow. Three listeners were, by Bülow's invitation, present on
the occasion—the Landgraf of Hesse; Richard Strauss, the now
famous composer, who had succeeded Mannstädt as second
conductor of the Meiningen orchestra; and Frederic Lamond. The
lapse of another day or so brought Brahms himself with the first and
second movements, and the first public performance of the work
took place on October 25.
That the new symphony was enthusiastically received on the
occasion goes almost without saying. Persevering but unsuccessful
efforts were made by the audience to obtain a repetition of the third
movement, and the close of the work was followed by the emphatic
demonstration incident to a great success.
The work was repeated under Bülow's direction at the following
Meiningen concert of November 1, and was conducted by the
composer throughout a three weeks' tour on which he started with
Bülow and his orchestra immediately afterwards, and which included
the towns Siegen, Dortmund, Essen, Elberfeld, Düsseldorf,
Rotterdam, Utrecht, Amsterdam, the Hague, Arnheim, Crefeld, Bonn,
and Cologne. A performance at Wiesbaden followed, and the work
was heard for the first time in Vienna at the Philharmonic concert of
January 17, 1886, under Richter. This occasion was celebrated by a
dinner given by Billroth at the Hôtel Sacher, the guests invited to
meet the composer being Richter, Hanslick, Goldmark, Faber, Door,
Epstein, Ehrbar, Fuchs, Kalbeck, and Dömpke.
A new and important work by Brahms could hardly fail to obtain a
warm reception in Vienna at a period when the composer could look
back to thirty years' residence in the imperial city with which his
name had become as closely associated as those of Haydn, Mozart,
Beethoven, and Schubert; but though the symphony was applauded
by the public and praised by all but the inveterately hostile section of
the press, it did not reach the hearts of the Vienna audience in the
same unmistakable manner as its two immediate predecessors, both
of which had, as we have seen, made a more striking impression on
a first hearing in Austria than the first Symphony in C minor.
Strangely enough, the fourth symphony at once obtained some
measure of real appreciation in Leipzig, where the first had been far
more successful than the second and third. It was performed under
the composer at the Gewandhaus concert of February 18. The
account given of the occasion by the Leipziger Nachrichten is,
perhaps, the more satisfactory since our old friend Dörffel, who
might possibly have been suspected of partiality, had long since
retired from the staff of the journal. Bernhard Vögl, his second
successor, says:
'... The reception must, we think, have made amends to
Brahms for former ones, which, in Bülow's opinion, were
too cool. After each movement the hall resounded with
tumultuous and long-continued applause, and, at the
conclusion of the work, the composer was repeatedly
called forward.... The finale is certainly the most original
of the movements, and furnishes more complete
argument than has before been brought forward for the
opinion of those who see in Brahms the modern Sebastian
Bach. The movement is not only constructed on the form
displayed in Bach's Chaconne for violin, but is filled with
Bach's spirit. It is built up with astounding mastery upon
the eight notes,
[music example omitted]
and in such a manner that its contrapuntal learning remains
subordinate to its poetic contents.... It can be compared with no
former work of Brahms and stands alone in the symphonic literature
of the present and the past.'
A still more triumphant issue attended the production of the
symphony under Brahms at a concert of the Hamburg Cecilia Society
on April 9. Josef Sittard, who had recently been appointed musical
critic to the Hamburger Correspondenten, a post he has held to the
present day, wrote:
'To-day we abide by what we have affirmed for years past
in musical journals; that Brahms is the greatest
instrumental composer since Beethoven. Power, passion,
depth of thought, exalted nobility of melody and form, are
the qualities which form the artistic sign manual of his
creations. The E minor (fourth) Symphony is distinguished
from the second and third principally by the rigorous and
even grim earnestness which, though in a totally different
way, marks the first. More than ever does the composer
follow out his ideas to their conclusion, and this
unbending logic makes the immediate understanding of
the work difficult. But the oftener we have heard it, the
more clearly have its great beauties, the depth, energy
and power of its thoughts, the clearness of its classic
form, revealed themselves to us. In the contrapuntal
treatment of its themes, in richness of harmony and in the
art of instrumentation, it seems to us superior to the
second and third; these, perhaps, have the advantage of
greater melodic beauty, a guarantee of popularity. In
depth, power and originality of conception, however, the
fourth symphony takes its place by the side of the first....'
After an interesting discussion of the several movements, the writer
adds: 'In a word, the symphony is of monumental significance.'
Brahms' fourth symphony, produced when he was over fifty, is, in
the opinion of most musicians, unsurpassed by any other
achievement of his genius. It has during the past twenty years been
growing slowly into general knowledge and favour, and will, it may
be safely predicted, become still more deeply rooted in its place
amongst the composer's most widely-valued works. The second
movement, in the opinion of the late Philipp Spitta, 'does not find its
equal in the symphonic world'; and the fourth, written in
'Passacaglia' form, is the most astonishing illustration achieved even
by Brahms himself of the limitless capability of variation form, in
which he is pre-eminent.[69]
It is with something of a mournful feeling that we find ourselves at
the close of our enumeration of the master's four greatest
instrumental works. Enough, we may hope, has been said to indicate
that any comparison of the symphonies as inferior or superior is
impossible, for the reason that each, while perfectly fulfilling its own
particular destiny, is quite different from all the others, and such
natural preference as may be felt by this or that listener for either
must be considered as purely personal. The present writer may,
perhaps, be allowed to confess that, with all joy in the dainty second
and the magnificent third and fourth—emphatically the fourth—
neither appeals to her quite so strongly as the first. There is here a
quality of youth in the intensity of the soaring imagination that
seems to search the universe, which, presented as it is with the
wealth of resource that was at the command of the mature
composer, could not by its nature be other than unique. The
presence of this very quality may be the reason why the first
symphony suffers even more lamentably than its companions from
the dull, cold, cautious, 'classical' rendering which Brahms' orchestral
works receive at the hands of some conductors, who seem unable to
realize that a composer who founds his works on certain definite and
traditional principles of structure does not thereby change his
nature, or in any degree renounce the free exercise of his poetic
gifts.
Perhaps the present is as good an opportunity as may occur for
passing mention of a newspaper episode of the eighties, which was
much talked of for a few years, but which, though it may have
caused Brahms annoyance, could not possibly at this period of his
career have had any more serious consequence so far as he was
concerned.
Hugo Wolf, in 1884 a young aspirant to fame, seeking recognition
but finding none, poor, gifted, disappointed, weak in health, highly
nervous, without influential friends, accepted an opportunity of
increasing his miserably small means of subsistence by becoming the
musical critic of the Salon Blatt, a weekly society paper of Vienna,
and soon made for himself an unenviable notoriety by his persistent
attacks upon Brahms' compositions. The affair would not now
demand mention in a biography of our master if it were not that the
posthumous recognition afforded to Wolf's art gives some interest,
though not of an agreeable nature, to this association of his name
with that of Brahms. For the benefit of those readers who may wish
to study the matter further, it may be added that Wolf's criticisms
have been republished since his death. For ourselves, having done
what was, perhaps, incumbent on us by referring to the matter, we
shall adopt what we believe would have been Brahms' desire, by
allowing it, so far as these pages are concerned, to follow others of
the kind to oblivion.
The summer of 1886 was the first of the three seasons passed by
Brahms at Thun, of which Widmann has written so charming an
account. He rented the entire first-floor of a house opposite the spot
where the river Aare flows out of the lake, the ground-floor being
occupied by the owner, who kept a little haberdashery shop.
According to his general custom, he dined in fine weather in the
garden of some inn, occasionally alone, but oftener in the company
of a friend or friends. Every Saturday he went to Bern to remain till
Monday or longer with the Widmanns, who, like other friends, found
him a most considerate and easily satisfied guest, though his
exceptional energy of body and mind often made it exhausting work
to keep up with him.
'His week-end visits were,' says Widmann, 'high festivals
and times of rejoicing for me and mine; days of rest they
certainly were not, for the constantly active mind of our
guest demanded similar wakefulness from all his
associates and one had to pull one's self well together to
maintain sufficient freshness to satisfy the requirements of
his indefatigable vitality.... I have never seen anyone who
took such fresh, genuine and lasting interest in the
surroundings of life as Brahms, whether in objects of
nature, art, or even industry. The smallest invention, the
improvement of some article for household use, every
trace, in short, of practical ingenuity gave him real
pleasure. And nothing escaped his observation.... He
hated bicycles because the flow of his ideas was so often
disturbed by the noiseless rushing past, or the sudden
signal, of these machines, and also because he thought
the trampling movement of the rider ugly. He was,
however, glad to live in the age of great inventions and
could not sufficiently admire the electric light, Edison's
phonographs, etc. He was equally interested in the animal
world. I always had to tell him anew about the family
customs of the bears in the Bern bear-pits before which
we often stood together. Indeed, subjects of conversation
seemed inexhaustible during his visits.'[70]
Brahms' ordinary costume, the same here as elsewhere, was chosen
quite without regard to appearances. Mere lapse of time must
occasionally have compelled him to wear a new coat, but it is safe to
conclude that his feelings suffered discomposure on the rare
occurrence of such a crisis. Neckties and white collars were reserved
as special marks of deference to conventionality. During his visits to
Thun he used on wet Saturdays to appear at Bern wearing 'an old
brown-gray plaid fastened over his chest with an immense pin,
which completed his strange appearance.' Many were the books
borrowed from Widmann at the beginning, and brought back at the
end, of the week, carried by him in a leather bag slung over his
shoulder. Most of them were standard works; he was not devoted to
modern literature on the whole, though he read with pleasure new
and really good books of history and travel, and was fond of
Gottfried Keller's novels and poems. Over engravings and
photographs of Italian works of art he would pore for hours, never
weary of discussing memories and predilections with his friend.
Visits to the Bern summer theatre, a short mountain tour with
Widmann, an introduction to Ernst von Wildenbruch, whose dramas
the master liked, and with whom he now found himself in personal
sympathy—events such as these served to diversify the summer
season of 1886, which was made musically noteworthy by the
composition of a group of chamber works, the Sonatas in A and F
major for pianoforte with violin and violoncello respectively, and the
Trio in C minor for pianoforte and strings. The Sonatas were
performed for the first time in public in Vienna; severally by Brahms
and Hellmesberger, at the Quartet concert of December 2, and by
Brahms and Hausmann at Hausmann's concert of November 24; the
Trio was introduced at Budapest about the same time by Brahms,
Hubay, and Popper, in each case from the manuscript.
Detailed discussion of these works is superfluous; two of them, at all
events, are amongst the best known of Brahms' compositions. The
Sonata for pianoforte and violoncello in F is the least familiar of the
group, but assuredly not because it is inferior to its companions. It
is, indeed, one of the masterpieces of Brahms' later concise style.
Each movement has a remarkable individuality of its own, whilst all
are unmistakably characteristic of the composer. The first is broad
and energetic, the second profoundly touching, the third vehemently
passionate—in the Brahms signification of the word, be it noted,
which means that the emotions are reached through the intellectual
imagination—the fourth written from beginning to end in a spirit of
vivacity and fun. The work was tried in the first instance at Frau
Fellinger's house. 'Are you expecting Hausmann?' Brahms inquired
carelessly of this lady soon after his return in the autumn. Frau
Fellinger, suspecting that something lay behind the question,
telegraphed to the great violoncellist, who usually stayed at her
house when in Vienna, to come as soon as possible, if only for a day.
He duly appeared, and the new sonata was played by Brahms and
himself on the evening of his arrival. They performed it again the
day before the concert above recorded, at a large party at Billroth's.
The last movement of the beautiful Sonata in A for pianoforte and
violin is sometimes criticised as being almost too concise. The
present writer confesses that she always feels it to be so, and one
day confided this sentiment to Joachim, who did not agree with her,
but said that the coda was originally considerably longer. 'Brahms
told me he had cut a good deal away; he aimed always at
condensation.'
Dr. Widmann allows us to publish an English version of a poem
written by him on this work, the original of which is published in the
appendix to his 'Brahms Recollections.' We have desired to place it
before our English-speaking readers, not only because it coincides
remarkably with what we related in our early chapters of the
delicate, fanciful tastes of the youthful Hannes, but because it gave
pleasure to the Brahms of fifty-three, and even of sixty-three, and
thus seems to illustrate the fact on which we have insisted, that if in
any case, then in our master's, the child was father to the man. Only
a year before his death the great composer wrote to Widmann to
beg for one or two more copies of the poem, which had been
printed for private circulation.
THE THUN SONATA.
Poem on the Sonata in A for Pianoforte and Violin, Op. 100,
By Johannes Brahms,
WRITTEN BY
J. V. WIDMANN.
There where the Aare's waters gently glide
From out the lake and flow towards the town,
Where pleasant shelter spreading trees provide,
Amidst the waving grass I laid me down;
And sleeping softly on that summer day,
I saw a wondrous vision as I lay.

Three knights rode up on proudly stepping steeds,
Tiny as elves, but with the mien of kings,
And spake to me: 'We come to search the meads,
To seek a treasure here, of precious things
Amongst the fairest; wilt thou help us trace
A new-born child, a child of heav'nly race?'

'And who are ye?' I, dreaming, made reply;
'Knights of the golden meadows' then they said,
'That at the foot of yonder Niesen[71] lie;
And in our ancient castles many a maid
Hath listened to the greeting of our strings,
Long mute and passed amid forgotten things.

'But lately tones were heard upon the lake,
A sound of strings whose like we never knew,
So David played, perhaps, for Saul's dread sake,
Soothing the monarch curtained from his view;
It reached us as it softly swelled and sank,
And drew us, filled with longing, to this bank.

'Then help us search, for surely from this place,
This meadow by the river, came the sound;
Help us then here the miracle to trace,
That we may offer homage when 'tis found.
Sleeps under flow'rs the new-born creature rare?
Or is it floating in the evening air?'

But ere they ceased, a sudden rapid twirl
Ruffled the waters, and, before our eyes,
A fairy boat from out the wavelet's whirl
Floated up stream, guided by dragon-flies;
Within it sat a sweet-limbed, fair-haired may,
Singing as to herself in ecstasy.

'To ride on waters clear and cool is sweet,
For clear as deep my being's living source;
To open worlds where joy and sorrow meet,
Each flowing pure and full in mingling course;
Go on, my boat, upstream with happy cheer,
Heaven is reposing on the tranquil mere.'

So sang the fairy child and they that heard
Owned, by their swelling hearts, the music's might,
The knights had only tears, nor spake a word,
Welling from pain that thrilled them with delight;
But when the skiff had vanished from their eyes,
The eldest, pointing, said in tender wise:

'Thou beauteous wonder of the boat, farewell,
Sweet melody, revealed to us to-day;
We that with slumb'ring minnesingers dwell,
Bid thee Godspeed, thou guileless stranger fay;
Our land is newly consecrate in thee
That rang of old with fame of minstrelsy.

'Now we may sleep again amongst our dead,
The harper's holy spirit is awake,
And as the evening glory, purple-red,
Shineth upon our Alps and o'er our lake,
And yet on distant mountain sheds its light,
Throughout the earth this song will wing its flight.

'Yet, though subduing many a list'ning throng,
In stately town, in princely hall it sound,
To this our land it ever will belong,
For here on flowing river it was found.'
Fervent and glad the minnesinger spake;
'Yes!' cried my heart—and then I was awake.
Whilst our master had been living through the spring and summer
months in the enchanted world of his imagination, coming out of it
only for brief intervals of sojourn in earth's pleasant places amidst
the companionship of chosen friends, certain hard, commonplace
realities of the workaday world, which had arisen earlier at home in
Vienna, were still awaiting a satisfactory solution. The death of the
occupier of the third-floor flat of No. 4, Carlsgasse, the last
remaining member of the family with whom Brahms had lodged for
fourteen or fifteen years, had confronted him with the necessity of
choosing between several alternatives almost equally disagreeable to
him, concerning which it is only necessary to say that he had
avoided the annoyance of a removal by taking on the entire dwelling
direct from the landlord, and had escaped the disturbance of having
to replace the furniture of his rooms by accepting the offer of friends
to lend him sufficient for his absolute needs. Arrangements and all
necessary changes were made during his absence. To Frau Fellinger
Brahms had entrusted the keys of the flat and of his rooms, which
under her directions were brought into apple-pie order by the time
of his return, the drawers being tidied, and a list of the contents of
each neatly drawn up on a piece of cardboard, so that everything
should be ready to his hand. The greatest difficulty, however, still
remained. Who was to keep the rooms in order and see to the very
few of Brahms' daily requirements which he was not in the habit of
looking after himself? His coffee, as we know, he always prepared at
a very early hour in the morning, and he was kept provided with a
regular supply of the finest Mocha by a lady friend at Marseilles.
Dinner, afternoon coffee, and often supper, were taken away from
home. The master now declared he would have no one in the flat.
To as many visitors as he felt disposed to admit he could himself
open the door, whilst the cleaning and tidying of the rooms could be
done by the 'Hausmeisterin,' an old woman occupying a room in the
courtyard, and responsible for the cleaning of the general staircase,
etc. In vain Frau Fellinger contested the point. Brahms was
inflexible, and this kind lady apparently withdrew her opposition to
his plan, though remaining quietly on the look-out for an opportunity
of securing more suitable arrangements. By-and-by it presented
itself. In Frau Celestine Truxa, the widow of a journalist, whose
family party consisted of two young sons and an old aunt, Frau
Fellinger felt that she saw a most desirable tenant for the Carlsgasse
flat, and after a renewed attack on the master, whose arguments,
founded on the immaculate purity of his rooms under the old
woman's care, she irretrievably damaged by lifting a sofa cushion
and laying bare a collection of dust, which she declared would soon
develop into something worse, he was so far shaken as to say that if
she would make inquiries for him he would consider her views. Frau
Fellinger wisely abstained from further discussion, but after a few
days Frau Truxa herself, having been duly advised to open the
matter to Brahms with diplomatic sang-froid, went in person to apply
for the dwelling. After her third ring at the door-bell, the door was
opened by the master himself, who started in dismay at seeing a
strange lady standing in front of him.
'I have come to see the flat,' said Frau Truxa.
'What!' cried Brahms.
'I have heard there is an empty flat here, and have come to look at
it,' responded Frau Truxa indifferently; 'but perhaps it is not to let?'
A moment's pause, and the composer's suspicious expression
relaxed.
'Frau Dr. Fellinger mentioned the circumstances to me,' she
continued, 'and I thought they might suit me.'
By this time Brahms had become sufficiently reassured to show the
rooms and to listen, though without remark, to a brief description of
Frau Truxa's family and of the circumstances in which she found
herself.
'Perhaps, Dr. Brahms, you will consider the matter,' she concluded,
'and communicate with me if you think further of it. If I hear nothing
more from you, I shall consider the matter at an end.'
After about a week, during which Frau Truxa kept her own
confidence, her maid came one day to tell her a gentleman had
called to see her. Being engaged at the moment, she asked her aunt
to ascertain his business, but the old lady returned immediately with
a frightened look.
'I don't know what to think!' she exclaimed; 'there is a strange-
looking man walking about in the next room measuring the furniture
with a tape!'
'The things will all go in!' exclaimed the master as Frau Truxa hurried
to receive him.
The upshot was that the master gave up the tenancy of the flat,
returning to his old irresponsible position as lodger, whilst Frau
Truxa, bringing her household with her, stepped into the position of
his former landlady, thereby giving Brahms cause to be grateful for
the remainder of his life for Frau Fellinger's wise firmness. He was,
says Frau Truxa, perfectly easy to get on with; all he desired was to
be let alone. He was extremely orderly and neat in his ways, and
expected the things scattered about his room to be dusted and kept
tidy, but was vexed if he found the least trifle at all displaced—even
if his glasses were turned the wrong way—and, without making
direct allusion to the subject, would manage to show that he had
noticed it. Observing, after she had been a little time in the flat, that
he always rearranged the things returned from the laundress after
they had been placed in their drawer, she asked him why he did so.
'Only,' he said, 'because perhaps it is better that those last sent back
should be put at the bottom, then they all get worn alike.' A glove or
other article requiring a little mending would be placed carelessly at
the top of a drawer left open as if by accident. The next day he
would observe to Frau Truxa, 'I found my glove mended last night; I
wonder who can have done it!' and on her replying, 'I did it, Herr
Doctor,' would answer, 'You? How very kind!'
Frau Truxa came to respect and honour the composer more and
more the longer he lived in her house. She made his peculiarities her
study, and after a short time understood his little signs, and was
able to supply his requirements as they arose without being
expressly asked to do so. It is almost needless to say that he took
great interest in her two boys, and once, when she was summoned
away from Vienna to the sick-bed of her father, begged that the
maid-servant might be instructed to give all her attention to the
children during their mother's absence, even if his rooms were
neglected. 'I can take care of myself, but suppose something were to
happen to the children whilst the girl was engaged for me!' Every
night whilst Frau Truxa was away, the master himself looked in on
the boys to assure himself of their being safe in bed. For the old
aunt he always had a pleasant passing word.
The fourth Symphony and two books of Songs were published in
1886, and the three new works of chamber music, Op. 99, 100, 101,
in 1887. Of the songs we would select for particular mention the
wonderfully beautiful setting of Heine's verses:
'Death is the cool night,
Life is the sultry day,'
Op. 96, No. 1, and Nos. 1 and 2 of Op. 97.
Brahms' Italian journey in the spring of 1887 was made in the
company of Simrock and Kirchner. The following year he travelled in
Widmann's society, visiting Verona, Bologna, Rimini, Ancona, Loretto,
Rome, and Turin. Widmann sees in Brahms' spiritual kinship with the
masters of the Italian Renaissance the chief secret of his love for
Italy.
'Their buildings, their statues, their pictures were his
delight and when one witnessed the absorbed devotion
with which he contemplated their works, or heard him
admire in the old masters a trait conspicuous in himself,
their conscientious perfection of detail ... even where it
could hardly be noticeable to the ordinary observer, one
could not help instituting the comparison between himself
and them.'
Brahms had an interview when on this journey with the now famous
Italian composer Martucci, who displayed a thorough familiarity with
the works of the German master.
Amongst the friends and acquaintances whom the composer met at
Thun during his second and third summers there were the Landgraf
of Hesse, Hanslick, Gottfried Keller, Professor Bächthold, Hermine
Spiess and her sister, Gustav Wendt, the Hegars, Max Kalbeck,
Steiner, Claus Groth, etc. One day, as he had started for a walk, he
was stopped by a stranger, who asked if he knew where Dr. Brahms
lived. 'He lives there,' replied the master, pointing to the
haberdasher's shop. 'Do you know if he is at home?' 'That I cannot
tell you,' was the reply. 'But go and ask in the shop; you will
certainly be able to find out there.' The gentleman followed this
advice, sent his card up, and received the answer that the Doctor
was at home, and would be pleased to see him. To his surprise, on
ascending the stairs, he found his newly-formed acquaintance
waiting for him at the top.
[Illustration: Brahms' Lodgings near Thun. Photograph by Moegle, Thun.]
The rumour revived in the summer of 1887 that Brahms was
engaged on an opera. This came about, perhaps, from his intimacy
with Widmann. 'I am composing the entr'actes,' he jestingly replied
to the Landgraf's question as to whether the report had any
foundation. As a matter of fact, the subject of opera was not
mentioned between the composer and his friend at this time.
The works which really occupied Brahms during the summer of 1887
were the double Concerto for violin and violoncello, with orchestral
accompaniment, and the 'Gipsy Songs.'
The Concerto was performed privately, immediately on its
completion, in the 'Louis Quinze' room of the Baden-Baden Kurhaus.
Brahms conducted, and the solo parts were performed by Joachim
and Hausmann. Amongst the listeners were Frau Schumann and her
eldest daughter, Rosenhain, Lachner, the violoncellist Hugo Becker,
and Gustav Wendt. The work was heard in public for the first time in
Cologne on October 15, Brahms conducting, and Joachim and
Hausmann playing the solos as before; and the next performances,
carried out under the same unique opportunities for success, were in
Wiesbaden, Frankfurt, and Basle, on November 17, 18, and 20.
In the autumn of this year one of the few remaining figures linked
with the most cherished associations of Brahms' early youth passed
away. Marxsen died on November 17, 1887, at the age of eighty-one,
having retained to the end almost unimpaired vigour of his mental
faculties. The last great pleasure of his life was associated with his
beloved art. In spite of great bodily weakness, he managed to be
present a week before his death at a concert of the Hamburg
Philharmonic Society to hear a performance of the 'ninth' Symphony.
'I am here for the last time,' he said, pressing Sittard's hand; and he
passed peacefully away fourteen days later.
A few years previously his artistic jubilee had been celebrated in
Hamburg, and his dear Johannes had surprised him with the proof-
sheets of a set of one hundred Variations composed long ago by
Marxsen, not with a view to publication, but as a practical illustration
of the inexhaustible possibilities contained in the art of thematic
development. Brahms, who happened to see the manuscript in
Marxsen's room during one of his subsequent visits to Hamburg, was
so strongly interested in it that in the end Marxsen gave it him, with
leave to do as he should like with it after his death. The parcel of
proof-sheets was accompanied by an affectionate letter, in which
Brahms begged forgiveness for having anticipated this permission
and yielded to his desire of placing the work within general reach
during his master's lifetime; and perhaps no jubilee honour of which
the old musician was the recipient filled him with such lively joy as
was caused by this tribute. Marxsen's name as a composer is,
indeed, now forgotten without chance of revival, but his memory will
live gloriously in the way he would have chosen, carried through the
years by the hand that wrote the great composer's acknowledgment
to his teacher on the title-page of the Concerto in B flat.
Four more performances from the manuscript of the double concerto
of interest in our narrative remain to be chronicled—those of the
Leipzig Gewandhaus, under Brahms, on January 1, 1888; of the
Berlin Philharmonic Society, under Bülow, on February 6; and of the
London Symphony Concerts, under Henschel, on February 15 and
21. The work, published in time for the autumn season, was given in
Vienna at the Philharmonic concert of December 23 under Richter.
On all these occasions the solos were played, as before, by Joachim
and Hausmann.
Bülow, having at this time resigned his post at Meiningen, had
entered on a period of activity as conductor in some of the northern
cities of Germany, and particularly in Hamburg and Berlin. His future
programmes, in which our master's works were well represented,
though not with the conspicuous prominence that had been possible
at Meiningen, do not fall within the scope of these pages, since, with
the mention of the double concerto, the enumeration of Brahms'
orchestral works is complete. Bülow's successor at Meiningen, Court
Capellmeister Fritz Steinbach, carried on the traditions and
preferences of the little Thuringian capital as he found them, until
his removal to Cologne a year or two ago, and has become
especially appreciated as a conductor of the works of Brahms, whose
personal friendship and artistic confidence he enjoyed in a high
degree.
The name of Eugen d'Albert, whose great gifts and attainments were
warmly recognised by Brahms, should not be omitted from our
pages, though detailed account of his relations with the master is
outside their limits. D'Albert's fine performances of the pianoforte
concertos helped to make these works familiar to many Continental
audiences, and certainly contributed, during the second half of the
eighties, to the better understanding of the great composer which
has gradually come to prevail at Leipzig.
But little needs to be said about the double concerto. This fine work,
which may be regarded as in some sort a successor to the double
and triple concertos of Mozart and Beethoven, exhibits all the power
of construction, the command of resource, the logical unity of idea,
characteristic of Brahms' style, whilst its popularity has been
hindered by the same cause that has retarded that of the pianoforte
concertos; the solo parts do not stand out sufficiently from the
orchestral accompaniment to give effective opportunity for the
display of virtuosity, in the absence of which no performer, appearing
before a great public as the exponent of an unfamiliar work for an
accompanied solo instrument, has much chance of sustaining the
lively interest of his audience in the composition. Of the three
movements of the double concerto, the first is especially interesting
to musicians, whilst the second, a beautiful example of Brahms'
expressive lyrical muse, appeals equally to less technically prepared
listeners. On the copy of the work presented by Brahms to Joachim
the words are inscribed in the composer's handwriting: 'To him for
whom it was written.'
Widely contrasted in every respect was the other new work of 1887,
introduced to the private circle of Vienna musicians at the last
meeting for the season of the Tonkünstlerverein in April, 1888. The
eleven four-part 'Gipsy Songs,' published in the course of the year as
Op. 103, were sung from the manuscript by Fräulein Walter, Frau
Gomperz-Bettelheim, Gustav Walter, and Weiglein of the imperial
opera, to the composer's accompaniment. Brahms obtained the texts
of this characteristic and attractive work from a collection of twenty-
five 'Hungarian Folk-songs' translated into German by Hugo Conrat,
and published in Budapest, with their original melodies set by Zoltan
Nagy for mezzo-soprano or baritone, with the addition of pianoforte accompaniment.
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

More Related Content

PDF
Managing And Mining Uncertain Data 1st Edition Charu C Aggarwal Auth
PPTX
Streaming Hypothesis Reasoning - William Smith, Jan 2016
PDF
Gridbased Nonlinear Estimation And Its Applications Jia Bin Xin
PDF
Data Flow Analysis Theory And Practice 1st Edition Uday Khedker
PDF
Internet Of Things And Secure Smart Environments Successes And Pitfalls Uttam...
PPTX
Streaming HYpothesis REasoning
PDF
1105.1950
PDF
Big Data And Computational Intelligence In Networking 1st Edition Yulei Wu
Managing And Mining Uncertain Data 1st Edition Charu C Aggarwal Auth
Streaming Hypothesis Reasoning - William Smith, Jan 2016
Gridbased Nonlinear Estimation And Its Applications Jia Bin Xin
Data Flow Analysis Theory And Practice 1st Edition Uday Khedker
Internet Of Things And Secure Smart Environments Successes And Pitfalls Uttam...
Streaming HYpothesis REasoning
1105.1950
Big Data And Computational Intelligence In Networking 1st Edition Yulei Wu

Similar to Data Streams Models And Algorithms Charu C Aggarwal Ed (20)

PDF
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
PDF
Splunk App for Stream - Einblicke in Ihren Netzwerkverkehr
PDF
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
PDF
Handbook Of Graph Theory Combinatorial Optimization And Algorithms Arumugam
PDF
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam...
PDF
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches
PDF
Accelerating Discovery Mining Unstructured Information For Hypothesis Generat...
PDF
Accelerating Discovery Mining Unstructured Information For Hypothesis Generat...
PDF
Multifractal Traffic And Anomaly Detection In Computer Communications Ming Li
PDF
Computer Simulation A Foundational Approach Using Python 1st Edition Yahya Es...
PDF
Computer Simulation_ A Foundational Approach Using Python (2018).pdf
PDF
Blockchain For 6genabled Networkbased Applications A Vision Architectural Ele...
PDF
Operations Research and Cyber Infrastructure John W. Chinneck
PDF
Cloud-based Data Stream Processing
PDF
Data Clustering Algorithms and Applications First Edition Charu C. Aggarwal
PDF
Datadriven Computational Methods Parameter And Operator Estimations Harlim
PPT
Cyberinfrastructure and Applications Overview: Howard University June22
PDF
Sylabbi 2012
PDF
Evolutionary Multiobjective System Designtheory And Applications 1st Edition ...
PPTX
Harnessing Quantum Computing and AI for Next-Gen Supply Chain Optimization
Ncct Ieee Software Abstract Collection Volume 1 50+ Abst
Splunk App for Stream - Einblicke in Ihren Netzwerkverkehr
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
Handbook Of Graph Theory Combinatorial Optimization And Algorithms Arumugam
IRJET- A Data Stream Mining Technique Dynamically Updating a Model with Dynam...
IRJET- AC Duct Monitoring and Cleaning Vehicle for Train Coaches
Accelerating Discovery Mining Unstructured Information For Hypothesis Generat...
Accelerating Discovery Mining Unstructured Information For Hypothesis Generat...
Multifractal Traffic And Anomaly Detection In Computer Communications Ming Li
Computer Simulation A Foundational Approach Using Python 1st Edition Yahya Es...
Computer Simulation_ A Foundational Approach Using Python (2018).pdf
Blockchain For 6genabled Networkbased Applications A Vision Architectural Ele...
Operations Research and Cyber Infrastructure John W. Chinneck
Cloud-based Data Stream Processing
Data Clustering Algorithms and Applications First Edition Charu C. Aggarwal
Datadriven Computational Methods Parameter And Operator Estimations Harlim
Cyberinfrastructure and Applications Overview: Howard University June22
Sylabbi 2012
Evolutionary Multiobjective System Designtheory And Applications 1st Edition ...
Harnessing Quantum Computing and AI for Next-Gen Supply Chain Optimization
Ad

Recently uploaded (20)

PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
Microbial disease of the cardiovascular and lymphatic systems
PPTX
Pharma ospi slides which help in ospi learning
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PPTX
Cell Structure & Organelles in detailed.
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Classroom Observation Tools for Teachers
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PDF
Complications of Minimal Access Surgery at WLH
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
O7-L3 Supply Chain Operations - ICLT Program
PPTX
Presentation on HIE in infants and its manifestations
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PDF
VCE English Exam - Section C Student Revision Booklet
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
Microbial disease of the cardiovascular and lymphatic systems
Pharma ospi slides which help in ospi learning
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Cell Structure & Organelles in detailed.
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Classroom Observation Tools for Teachers
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
102 student loan defaulters named and shamed – Is someone you know on the list?
Complications of Minimal Access Surgery at WLH
Final Presentation General Medicine 03-08-2024.pptx
O7-L3 Supply Chain Operations - ICLT Program
Presentation on HIE in infants and its manifestations
STATICS OF THE RIGID BODIES Hibbelers.pdf
human mycosis Human fungal infections are called human mycosis..pptx
GDM (1) (1).pptx small presentation for students
Microbial diseases, their pathogenesis and prophylaxis
VCE English Exam - Section C Student Revision Booklet
Ad

Data Streams Models And Algorithms Charu C Aggarwal Ed

  • 7. ADVANCES IN DATABASE SYSTEMS
    Series Editor: Ahmed K. Elmagarmid, Purdue University, West Lafayette, IN 47907
    Other books in the Series:
    SIMILARITY SEARCH: The Metric Space Approach, P. Zezula, G. Amato, V. Dohnal, M. Batko; ISBN: 0-387-29146-6
    STREAM DATA MANAGEMENT, Nauman Chaudhry, Kevin Shaw, Mahdi Abdelguerfi; ISBN: 0-387-24393-3
    FUZZY DATABASE MODELING WITH XML, Zongmin Ma; ISBN: 0-387-24248-1
    MINING SEQUENTIAL PATTERNS FROM LARGE DATA SETS, Wei Wang and Jiong Yang; ISBN: 0-387-24246-5
    ADVANCED SIGNATURE INDEXING FOR MULTIMEDIA AND WEB APPLICATIONS, Yannis Manolopoulos, Alexandros Nanopoulos, Eleni Tousidou; ISBN: 1-4020-7425-5
    ADVANCES IN DIGITAL GOVERNMENT: Technology, Human Factors, and Policy, edited by William J. McIver, Jr. and Ahmed K. Elmagarmid; ISBN: 1-4020-7067-5
    INFORMATION AND DATABASE QUALITY, Mario Piattini, Coral Calero and Marcela Genero; ISBN: 0-7923-7599-8
    DATA QUALITY, Richard Y. Wang, Mostapha Ziad, Yang W. Lee; ISBN: 0-7923-7215-8
    THE FRACTAL STRUCTURE OF DATA REFERENCE: Applications to the Memory Hierarchy, Bruce McNutt; ISBN: 0-7923-7945-4
    SEMANTIC MODELS FOR MULTIMEDIA DATABASE SEARCHING AND BROWSING, Shu-Ching Chen, R. L. Kashyap, and Arif Ghafoor; ISBN: 0-7923-7888-1
    INFORMATION BROKERING ACROSS HETEROGENEOUS DIGITAL DATA: A Metadata-based Approach, Vipul Kashyap, Amit Sheth; ISBN: 0-7923-7883-0
    DATA DISSEMINATION IN WIRELESS COMPUTING ENVIRONMENTS, Kian-Lee Tan and Beng Chin Ooi; ISBN: 0-7923-7866-0
    MIDDLEWARE NETWORKS: Concept, Design and Deployment of Internet Infrastructure, Michah Lerner, George Vanecek, Nino Vidovic, Dado Vrsalovic; ISBN: 0-7923-7840-7
    ADVANCED DATABASE INDEXING, Yannis Manolopoulos, Yannis Theodoridis, Vassilis J. Tsotras; ISBN: 0-7923-7716-8
    MULTILEVEL SECURE TRANSACTION PROCESSING, Vijay Atluri, Sushil Jajodia, Binto George; ISBN: 0-7923-7702-8
    FUZZY LOGIC IN DATA MODELING, Guoqing Chen; ISBN: 0-7923-8253-6
    For a complete listing of books in this series, go to http://www.springer.com
  • 8. Data Streams: Models and Algorithms, edited by Charu C. Aggarwal, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA. Springer
  • 9. Charu C. Aggarwal, IBM Thomas J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532.
    Library of Congress Control Number: 2006934111. DATA STREAMS: Models and Algorithms, edited by Charu C. Aggarwal.
    ISBN-10: 0-387-28759-0; ISBN-13: 978-0-387-28759-1; e-ISBN-10: 0-387-47534-6; e-ISBN-13: 978-0-387-47534-9.
    Cover by Will Ladd, NRL Mapping, Charting and Geodesy Branch, utilizing NRL's GIDB Portal System at http://dmap.nrlssc.navy.mil. Printed on acid-free paper.
    © 2007 Springer Science+Business Media, LLC. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
  • 10. Contents
    List of Figures
    List of Tables
    Preface
    1  An Introduction to Data Streams (Charu C. Aggarwal)
       1. Introduction
       2. Stream Mining Algorithms
       3. Conclusions and Summary
       References
    2  On Clustering Massive Data Streams: A Summarization Paradigm (Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu)
       1. Introduction
       2. The Micro-clustering Based Stream Mining Framework
       3. Clustering Evolving Data Streams: A Micro-clustering Approach
          3.1 Micro-clustering Challenges
          3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
          3.3 High Dimensional Projected Stream Clustering
       4. Classification of Data Streams: A Micro-clustering Approach
          4.1 On-Demand Stream Classification
       5. Other Applications of Micro-clustering and Research Directions
       6. Performance Study and Experimental Results
       7. Discussion
       References
    3  A Survey of Classification Methods in Data Streams (Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy)
       1. Introduction
       2. Research Issues
       3. Solution Approaches
       4. Classification Techniques
          4.1 Ensemble Based Classification
          4.2 Very Fast Decision Trees (VFDT)
  • 11.    4.3 On Demand Classification
          4.4 Online Information Network (OLIN)
          4.5 LWClass Algorithm
          4.6 ANNCAD Algorithm
          4.7 SCALLOP Algorithm
       5. Summary
       References
    4  Frequent Pattern Mining in Data Streams (Ruoming Jin and Gagan Agrawal)
       1. Introduction
       2. Overview
       3. New Algorithm
       4. Work on Other Related Problems
       5. Conclusions and Future Directions
       References
    5  A Survey of Change Diagnosis Algorithms in Evolving Data Streams (Charu C. Aggarwal)
       1. Introduction
       2. The Velocity Density Method
          2.1 Spatial Velocity Profiles
          2.2 Evolution Computations in the High Dimensional Case
          2.3 On the Use of Clustering for Characterizing Stream Evolution
       3. On the Effect of Evolution in Data Mining Algorithms
       4. Conclusions
       References
    6  Multi-Dimensional Analysis of Data Streams Using Stream Cubes (Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang)
       1. Introduction
       2. Problem Definition
       3. Architecture for On-line Analysis of Data Streams
          3.1 Tilted time frame
          3.2 Critical layers
          3.3 Partial materialization of stream cube
       4. Stream Data Cube Computation
          4.1 Algorithms for cube computation
       5. Performance Study
       6. Related Work
       7. Possible Extensions
       8. Conclusions
       References
  • 12. 7  Load Shedding in Data Stream Systems (Brian Babcock, Mayur Datar and Rajeev Motwani)
       1. Load Shedding for Aggregation Queries
          1.1 Problem Formulation
          1.2 Load Shedding Algorithm
          1.3 Extensions
       2. Load Shedding in Aurora
       3. Load Shedding for Sliding Window Joins
       4. Load Shedding for Classification Queries
       5. Summary
       References
    8  The Sliding-Window Computation Model and Results (Mayur Datar and Rajeev Motwani)
       0.1 Motivation and Road Map
       1. A Solution to the BASICCOUNTING Problem
          1.1 The Approximation Scheme
       2. Space Lower Bound for the BASICCOUNTING Problem
       3. Beyond 0's and 1's
       4. References and Related Work
       5. Conclusion
       References
    9  A Survey of Synopsis Construction in Data Streams (Charu C. Aggarwal, Philip S. Yu)
       1. Introduction
       2. Sampling Methods
          2.1 Random Sampling with a Reservoir
          2.2 Concise Sampling
       3. Wavelets
          3.1 Recent Research on Wavelet Decomposition in Data Streams
       4. Sketches
          4.1 Fixed Window Sketches for Massive Time Series
          4.2 Variable Window Sketches of Massive Time Series
          4.3 Sketches and their Applications in Data Streams
          4.4 Sketches with p-stable Distributions
          4.5 The Count-Min Sketch
          4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
          4.7 Advantages and Limitations of Sketch Based Methods
       5. Histograms
          5.1 One Pass Construction of Equi-depth Histograms
          5.2 Constructing V-Optimal Histograms
          5.3 Wavelet Based Histograms for Query Answering
          5.4 Sketch Based Methods for Multi-dimensional Histograms
       6. Discussion and Challenges
  • 13.    References
    10  A Survey of Join Processing in Data Streams (Junyi Xie and Jun Yang)
       1. Introduction
       2. Model and Semantics
       3. State Management for Stream Joins
          3.1 Exploiting Constraints
          3.2 Exploiting Statistical Properties
       4. Fundamental Algorithms for Stream Join Processing
       5. Optimizing Stream Joins
       6. Conclusion
       Acknowledgments
       References
    11  Indexing and Querying Data Streams (Ahmet Bulut, Ambuj K. Singh)
       1. Introduction
       2. Indexing Streams
          2.1 Preliminaries and definitions
          2.2 Feature extraction
          2.3 Index maintenance
          2.4 Discrete Wavelet Transform
       3. Querying Streams
          3.1 Monitoring an aggregate query
          3.2 Monitoring a pattern query
          3.3 Monitoring a correlation query
       4. Related Work
       5. Future Directions
          5.1 Distributed monitoring systems
          5.2 Probabilistic modeling of sensor networks
          5.3 Content distribution networks
       6. Chapter Summary
       References
    12  Dimensionality Reduction and Forecasting on Streams (Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos)
       1. Related work
       2. Principal component analysis (PCA)
       3. Auto-regressive models and recursive least squares
       4. MUSCLES
       5. Tracking correlations and hidden variables: SPIRIT
       6. Putting SPIRIT to work
       7. Experimental case studies
  • 14.    8. Performance and accuracy
       9. Conclusion
       Acknowledgments
       References
    13  A Survey of Distributed Mining of Data Streams (Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey)
       1. Introduction
       2. Outlier and Anomaly Detection
       3. Clustering
       4. Frequent itemset mining
       5. Classification
       6. Summarization
       7. Mining Distributed Data Streams in Resource Constrained Environments
       8. Systems Support
       References
    14  Algorithms for Distributed Data Stream Mining (Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen)
       1. Introduction
       2. Motivation: Why Distributed Data Stream Mining?
       3. Existing Distributed Data Stream Mining Algorithms
       4. A local algorithm for distributed data stream mining
          4.1 Local Algorithms: definition
          4.2 Algorithm details
          4.3 Experimental results
          4.4 Modifications and extensions
       5. Bayesian Network Learning from Distributed Data Streams
          5.1 Distributed Bayesian Network Learning Algorithm
          5.2 Selection of samples for transmission to global site
          5.3 Online Distributed Bayesian Network Learning
          5.4 Experimental Results
       6. Conclusion
       References
    15  A Survey of Stream Processing Problems and Techniques in Sensor Networks (Sharmila Subramaniam, Dimitrios Gunopulos)
       1. Challenges
  • 15.    2. The Data Collection Model
       3. Data Communication
       4. Query Processing
          4.1 Aggregate Queries
          4.2 Join Queries
          4.3 Top-k Monitoring
          4.4 Continuous Queries
       5. Compression and Modeling
          5.1 Data Distribution Modeling
          5.2 Outlier Detection
       6. Application: Tracking of Objects using Sensor Networks
       7. Summary
       References
    Index
  • 16. List of Figures
    Micro-clustering Examples
    Some Simple Time Windows
    Varying Horizons for the classification process
    Quality comparison (Network Intrusion dataset, horizon=256, stream_speed=200)
    Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200)
    Accuracy comparison (Network Intrusion dataset, stream_speed=80, buffer_size=1600, kfit=80, init_number=400)
    Distribution of the (smallest) best horizon (Network Intrusion dataset, Time units=2500, buffer_size=1600, kfit=80, init_number=400)
    Accuracy comparison (Synthetic dataset B300kC5D20, stream_speed=100, buffer_size=500, kfit=25, init_number=400)
    Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, kfit=25, init_number=400)
    Stream Processing Rate (Charitable Donation data, stream_speed=2000)
    Stream Processing Rate (Network Intrusion data, stream_speed=2000)
    Scalability with Data Dimensionality (stream_speed=2000)
    Scalability with Number of Clusters (stream_speed=2000)
    The ensemble based classification method
    VFDT Learning Systems
    On Demand Classification
    Online Information Network System
    Algorithm Output Granularity
    ANNCAD Framework
    SCALLOP Process
    Karp et al. Algorithm to Find Frequent Items
    Improving Algorithm with An Accuracy Bound
  • 17. StreamMining-Fixed: Algorithm Assuming Fixed Length Transactions
    Subroutines Description
    StreamMining-Bounded: Algorithm with a Bound on Accuracy
    StreamMining: Final Algorithm
    The Forward Time Slice Density Estimate
    The Reverse Time Slice Density Estimate
    The Temporal Velocity Profile
    The Spatial Velocity Profile
    A tilted time frame with natural time partition
    A tilted time frame with logarithmic time partition
    A tilted time frame with progressive logarithmic time partition
    Two critical layers in the stream cube
    Cube structure from the m-layer to the o-layer
    H-tree structure for cube computation
    Cube computation: time and memory usage vs. # tuples at the m-layer for the data set D5L3C10
    Cube computation: time and space vs. # of dimensions for the data set L3C10T100K
    Cube computation: time and space vs. # of levels for the data set D5C10T50K
    Data Flow Diagram
    Illustration of Example 7.1
    Illustration of Observation 1.4
    Procedure SetSamplingRate(x, Ri)
    Sliding window model notation
    An illustration of an Exponential Histogram (EH)
    Illustration of the Wavelet Decomposition
    The Error Tree from the Wavelet Decomposition
    Drifting normal distributions
    Example ECBs
    ECBs for sliding-window joins under the frequency-based model
    ECBs under the age-based model
    The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data
    Exact feature extraction, update rate T = 1
    Incremental feature extraction, update rate T = 1
  • 18. Approximate feature extraction, update rate T = 1
    Incremental feature extraction, update rate T = 2
    Transforming an MBR using discrete wavelet transform. Transformation corresponds to rotating the axes (the rotation angle = 45° for Haar wavelets)
    Aggregate query decomposition and approximation composition for a query window of size w = 26
    Subsequence query decomposition for a query window of size |Q| = 9
    Illustration of problem
    Illustration of updating w1 when a new point x_{t+1} arrives
    Chlorine dataset
    Mote dataset
    Critter dataset
    Detail of forecasts on Critter with blanked values
    River data
    Wall-clock times (including time to update forecasting models)
    Hidden variable tracking accuracy
    Centralized Stream Processing Architecture (left); Distributed Stream Processing Architecture (right)
    (A) The area inside an epsilon circle. (B) Seven evenly spaced vectors u1 ... u7. (C) The borders of the seven half-spaces ui . x >= epsilon define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces.
    Quality of the algorithm with increasing number of nodes
    Cost of the algorithm with increasing number of nodes
    ASIA Model
    Bayesian network for online distributed parameter learning
    Simulation results for online Bayesian learning: (left) KL distance between the conditional probabilities for the networks Bol(k) and Bb for three nodes; (right) KL distance between the conditional probabilities for the networks Bol(k) and Bb for three nodes
    An instance of dynamic cluster assignment in a sensor system according to the LEACH protocol. Sensor nodes of the same clusters are shown with the same symbol and the cluster heads are marked with highlighted symbols.
  • 19. Interest propagation, gradient setup and path reinforcement for data propagation in the directed-diffusion paradigm. The event is described in terms of attribute value pairs. The figure illustrates an event detected based on the location of the node and target detection.
    Sensors aggregating the result for a MAX query in-network
    Error filter assignments in tree topology. The nodes that are shown shaded are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by Ei, i.e., the maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval.
    Usage of duplicate-sensitive sketches to allow result propagation to multiple parents providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (level 2 in the figure) are received at more than one node in the lower level (level 1 in the figure).
    (a) Two dimensional Gaussian model of the measurements from sensors S1 and S2. (b) The marginal distribution of the values of sensor S1, given S2: new observations from one sensor are used to estimate the posterior density of the other sensors.
    Estimation of probability distribution of the measurements over a sliding window
    Trade-offs in modeling sensor data
    Tracking a target. The leader nodes estimate the probability of the target's direction and determine the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted.
  • 20. List of Tables
    An example of snapshots stored for α = 2 and l = 2
    A geometric time window
    Data Based Techniques
    Task Based Techniques
    Typical LWClass Training Results
    Summary of Reviewed Techniques
    Algorithms for Frequent Itemsets Mining over Data Streams
    Summary of results for the sliding-window model
    An Example of Wavelet Coefficient Computation
    Description of notation
    Description of datasets
    Reconstruction accuracy (mean squared error rate)
  • 21. Preface
    In recent years, the progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. Such data sets which continuously and rapidly grow over time are referred to as data streams. In addition, the development of sensor technology has resulted in the possibility of monitoring many events in real time. While data mining has become a fairly well established field now, the data stream problem poses a number of unique challenges which are not easily solved by traditional data mining methods.
    The topic of data streams is a very recent one. The first research papers on this topic appeared slightly under a decade ago, and since then this field has grown rapidly. There is a large volume of literature which has been published in this field over the past few years. The work is also of great interest to practitioners in the field who have to mine actionable insights with large volumes of continuously growing data. Because of the large volume of literature in the field, practitioners and researchers may often find it an arduous task to isolate the right literature for a given topic. In addition, from a practitioner's point of view, the use of research literature is even more difficult, since much of the relevant material is buried in publications. While handling a real problem, it may often be difficult to know where to look in order to solve the problem.
    This book contains contributed chapters from a variety of well known researchers in the data mining field. While the chapters will be written by different researchers, the topics and content will be organized in such a way so as to present the most important models, algorithms, and applications in the data mining field in a structured and concise way. In addition, the book is organized in order to make it more accessible to application driven practitioners. Given the lack of structurally organized information on the topic, the book will provide insights which are not easily accessible otherwise. In addition, the book will be a great help to researchers and graduate students interested in the topic. The popularity and current nature of the topic of data streams is likely to make it an important source of information for researchers interested in the topic. The data mining community has grown rapidly over the past few years, and the topic of data streams is one of the most relevant and current areas of interest to
  • 22. the community. This is because of the rapid advancement of the field of data streams in the past two to three years. While the data stream field clearly falls in the emerging category because of its recency, it is now beginning to reach a maturation and popularity point, where the development of an overview book on the topic becomes both possible and necessary. While this book attempts to provide an overview of the stream mining area, it also tries to discuss current topics of interest so as to be useful to students and researchers. It is hoped that this book will provide a reference to students, researchers and practitioners in both introducing the topic of data streams and understanding the practical and algorithmic aspects of the area.
  • 23. Chapter 1
    AN INTRODUCTION TO DATA STREAMS
    Charu C. Aggarwal
    IBM T. J. Watson Research Center
    Hawthorne, NY 10532
    Abstract: In recent years, advances in hardware technology have facilitated new ways of collecting data continuously. In many applications such as network monitoring, the volume of such data is so large that it may be impossible to store the data on disk. Furthermore, even when the data can be stored, the volume of the incoming data may be so large that it may be impossible to process any particular record more than once. Therefore, many data mining and database operations such as classification, clustering, frequent pattern mining and indexing become significantly more challenging in this context. In many cases, the data patterns may evolve continuously, as a result of which it is necessary to design the mining algorithms effectively in order to account for changes in the underlying structure of the data stream. This makes the solutions of the underlying problems even more difficult from an algorithmic and computational point of view. This book contains a number of chapters which are carefully chosen in order to discuss the broad research issues in data streams. The purpose of this chapter is to provide an overview of the organization of the stream processing and mining techniques which are covered in this book.
    1. Introduction
    In recent years, advances in hardware technology have facilitated the ability to collect data continuously. Simple transactions of everyday life such as using a credit card, a phone or browsing the web lead to automated data storage. Similarly, advances in information technology have led to large flows of data across IP networks. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the
  • 24. volume of the underlying data is very large, it leads to a number of computational and mining challenges:
    - With increasing volume of the data, it is no longer possible to process the data efficiently by using multiple passes. Rather, one can process a data item at most once. This leads to constraints on the implementation of the underlying algorithms. Therefore, stream mining algorithms typically need to be designed so that the algorithms work with one pass of the data.
    - In most cases, there is an inherent temporal component to the stream mining process. This is because the data may evolve over time. This behavior of data streams is referred to as temporal locality. Therefore, a straightforward adaptation of one-pass mining algorithms may not be an effective solution to the task. Stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data.
    - Another important characteristic of data streams is that they are often mined in a distributed fashion. Furthermore, the individual processors may have limited processing and memory. Examples of such cases include sensor networks, in which it may be desirable to perform in-network processing of the data stream with limited processing and memory [8, 19]. This book will also contain a number of chapters devoted to these topics.
    This chapter will provide an overview of the different stream mining algorithms covered in this book. We will discuss the challenges associated with each kind of problem, and discuss an overview of the material in the corresponding chapter.
    2. Stream Mining Algorithms
    In this section, we will discuss the key stream mining problems and will discuss the challenges associated with each problem. We will also discuss an overview of the material covered in each chapter of this book. The broad topics covered in this book are as follows:
    Data Stream Clustering. Clustering is a widely studied problem in the data mining literature. However, it is more difficult to adapt arbitrary clustering algorithms to data streams because of one-pass constraints on the data set. An interesting adaptation of the k-means algorithm has been discussed in [14] which uses a partitioning based approach on the entire data set. This approach uses an adaptation of a k-means technique in order to create clusters over the entire data stream. In the context of data streams, it may be more desirable to determine clusters in specific user defined horizons rather than on
  • 25. the entire data set. In Chapter 2, we discuss the micro-clustering technique [3] which determines clusters over the entire data set. We also discuss a variety of applications of micro-clustering which can perform effective summarization based analysis of the data set. For example, micro-clustering can be extended to the problem of classification on data streams [5]. In many cases, it can also be used for arbitrary data mining applications such as privacy preserving data mining or query estimation.
    Data Stream Classification. The problem of classification is perhaps one of the most widely studied in the context of data stream mining. The problem of classification is made more difficult by the evolution of the underlying data stream. Therefore, effective algorithms need to be designed in order to take temporal locality into account. In Chapter 3, we discuss a survey of classification algorithms for data streams. A wide variety of data stream classification algorithms are covered in this chapter. Some of these algorithms are designed to be purely one-pass adaptations of conventional classification algorithms [12], whereas others (such as the methods in [5, 16]) are more effective in accounting for the evolution of the underlying data stream. Chapter 3 discusses the different kinds of algorithms and the relative advantages of each.
    Frequent Pattern Mining. The problem of frequent pattern mining was first introduced in [6], and was extensively analyzed for the conventional case of disk resident data sets. In the case of data streams, one may wish to find the frequent itemsets either over a sliding window or the entire data stream [15, 17]. In Chapter 4, we discuss an overview of the different frequent pattern mining algorithms, and also provide a detailed discussion of some interesting recent algorithms on the topic.
    Change Detection in Data Streams. As discussed earlier, the patterns in a data stream may evolve over time. In many cases, it is desirable to track and analyze the nature of these changes over time. In [1, 11, 18], a number of methods have been discussed for change detection of data streams. In addition, data stream evolution can also affect the behavior of the underlying data mining algorithms since the results can become stale over time. Therefore, in Chapter 5, we have discussed the different methods for change detection in data streams. We have also discussed the effect of evolution on data stream mining algorithms.
    Stream Cube Analysis of Multi-dimensional Streams. Much of stream data resides at a multi-dimensional space and at a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes in some combination of dimensions. To discover high-level dynamic and evolving characteristics, one may need to perform multi-level, multi-dimensional on-line
  • 26. analytical processing (OLAP) of stream data. Such necessity calls for the investigation of new architectures that may facilitate on-line analytical processing of multi-dimensional stream data [7, 10]. In Chapter 6, an interesting stream-cube architecture is presented that effectively performs on-line partial aggregation of multi-dimensional stream data, captures the essential dynamic and evolving characteristics of data streams, and facilitates fast OLAP on stream data. The stream cube architecture facilitates online analytical processing of stream data. It also forms a preliminary structure for online stream mining. The impact of the design and implementation of the stream cube in the context of stream mining is also discussed in the chapter.
    Load Shedding in Data Streams. Since data streams are generated by processes which are extraneous to the stream processing application, it is not possible to control the incoming stream rate. As a result, it is necessary for the system to have the ability to quickly adjust to varying incoming stream processing rates. Chapter 7 discusses one particular type of adaptivity: the ability to gracefully degrade performance via "load shedding" (dropping unprocessed tuples to reduce system load) when the demands placed on the system cannot be met in full given available resources. Focusing on aggregation queries, the chapter presents algorithms that determine at what points in a query plan load shedding should be performed and what amount of load should be shed at each point in order to minimize the degree of inaccuracy introduced into query answers.
    Sliding Window Computations in Data Streams. Many of the synopsis structures discussed use the entire data stream in order to construct the corresponding synopsis structure. The sliding-window model of computation is motivated by the assumption that it is more important to use recent data in data stream computation [9]. Therefore, the processing and analysis is only done on a fixed history of the data stream. Chapter 8 formalizes this model of computation and answers questions about how much space and computation time is required to solve certain problems under the sliding-window model.
    Synopsis Construction in Data Streams. The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques [13]. Some key synopsis methods include those of sampling, wavelets, sketches and histograms. In Chapter 9, a survey of the key synopsis techniques is discussed, along with the mining techniques supported by such methods. The chapter discusses the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.
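    Of the synopsis methods just named, random sampling with a reservoir is the simplest to state in code. The following minimal sketch is our own illustration of the one-pass idea rather than code from Chapter 9, and the function name and parameters are chosen only for exposition.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Maintain a uniform random sample of size k in a single pass over a stream.

    After n >= k items have been seen, each item has probability k/n of being
    in the reservoir, which is the property a stream synopsis relies on.
    """
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(1, n)           # uniform position in 1..n
            if j <= k:
                reservoir[j - 1] = item     # replace a randomly chosen slot
    return reservoir

# Example: a sample of 5 values from a simulated stream of 10,000 readings.
if __name__ == "__main__":
    readings = (x * x % 97 for x in range(10_000))
    print(reservoir_sample(readings, k=5))
```

    Because each item is inspected exactly once and only k values are retained, the method respects the one-pass and bounded-memory constraints discussed at the beginning of this chapter.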
  • 27. Join Processing in Data Streams. Stream join is a fundamental operation for relating information from different streams. This is especially useful in many applications such as sensor networks in which the streams arriving from different sources may need to be related with one another. In the stream setting, input tuples arrive continuously, and result tuples need to be produced continuously as well. We cannot assume that the input data is already stored or indexed, or that the input rate can be controlled by the query plan. Standard join algorithms that use blocking operations, e.g., sorting, no longer work. Conventional methods for cost estimation and query optimization are also inappropriate, because they assume finite input. Moreover, the long-running nature of stream queries calls for more adaptive processing strategies that can react to changes and fluctuations in data and stream characteristics. The "stateful" nature of stream joins adds another dimension to the challenge. In general, in order to compute the complete result of a stream join, we need to retain all past arrivals as part of the processing state, because a new tuple may join with an arbitrarily old tuple arrived in the past. This problem is exacerbated by unbounded input streams, limited processing resources, and high performance requirements, as it is impossible in the long run to keep all past history in fast memory. Chapter 10 provides an overview of research problems, recent advances, and future research directions in stream join processing.
    Indexing Data Streams. The problem of indexing data streams attempts to create an indexed representation, so that it is possible to efficiently answer different kinds of queries such as aggregation queries or trend based queries. This is especially important in the data stream case because of the huge volume of the underlying data. Chapter 11 explores the problem of indexing and querying data streams.
    Dimensionality Reduction and Forecasting in Data Streams. Because of the inherent temporal nature of data streams, the problems of dimensionality reduction and forecasting are particularly important. When there are a large number of simultaneous data streams, we can use the correlations between different data streams in order to make effective predictions [20, 21] on the future behavior of the data stream. In Chapter 12, an overview of dimensionality reduction and forecasting methods is discussed for the problem of data streams. In particular, the well known MUSCLES method [21] has been discussed, and its application to data streams has been explored. In addition,
  • 28. the chapter presents the SPIRIT algorithm, which explores the relationship between dimensionality reduction and forecasting in data streams. In particular, the chapter explores the use of a compact number of hidden variables to comprehensively describe the data stream. This compact representation can also be used for effective forecasting of the data streams.
    Distributed Mining of Data Streams. In many instances, streams are generated at multiple distributed computing nodes. Analyzing and monitoring data in such environments requires data mining technology that requires optimization of a variety of criteria such as communication costs across different nodes, as well as computational, memory or storage requirements at each node. A comprehensive survey of the adaptation of different conventional mining algorithms to the distributed case is provided in Chapter 13. In particular, the clustering, classification, outlier detection, frequent pattern mining, and summarization problems are discussed. In Chapter 14, some recent advances in stream mining algorithms are discussed.
    Stream Mining in Sensor Networks. With recent advances in hardware technology, it has become possible to track large amounts of data in a distributed fashion with the use of sensor technology. The large amounts of data collected by the sensor nodes make the problem of monitoring a challenging one from many technological standpoints. Sensor nodes have limited local storage, computational power, and battery life, as a result of which it is desirable to minimize the storage, processing and communication from these nodes. The problem is further magnified by the fact that a given network may have millions of sensor nodes and therefore it is very expensive to localize all the data at a given global node for analysis both from a storage and communication point of view. In Chapter 15, we discuss an overview of a number of stream mining issues in the context of sensor networks. This topic is closely related to distributed stream mining, and a number of concepts related to sensor mining have also been discussed in Chapters 13 and 14.
    3. Conclusions and Summary
    Data streams are a computational challenge to data mining problems because of the additional algorithmic constraints created by the large volume of data. In addition, the problem of temporal locality leads to a number of unique mining challenges in the data stream case. This chapter provides an overview of the different mining algorithms which are covered in this book. We discussed the different problems and the challenges which are associated with each problem. We also provided an overview of the material in each chapter of the book.
  • 29. References
    [1] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.
    [2] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in Evolving Data Streams. IEEE ICDE Conference.
    [3] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.
    [4] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High Dimensional Projected Clustering of Data Streams. VLDB Conference.
    [5] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Data Streams. ACM KDD Conference.
    [6] Agrawal R., Imielinski T., Swami A. (1993). Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference.
    [7] Chen Y., Dong G., Han J., Wah B. W., Wang J. (2002). Multi-dimensional regression analysis of time-series data streams. VLDB Conference.
    [8] Cormode G., Garofalakis M. (2005). Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB Conference.
    [9] Datar M., Gionis A., Indyk P., Motwani R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794-1813.
    [10] Dong G., Han J., Lam J., Pei J., Wang K. (2001). Mining multi-dimensional constrained gradients in data cubes. VLDB Conference.
    [11] Dasu T., Krishnan S., Venkatasubramaniam S., Yi K. (2005). An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.
    [12] Domingos P., Hulten G. (2000). Mining High-Speed Data Streams. ACM KDD Conference.
    [13] Garofalakis M., Gehrke J., Rastogi R. (2002). Querying and mining data streams: you only get one look (a tutorial). ACM SIGMOD Conference.
    [14] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data Streams. IEEE FOCS Conference.
    [15] Giannella C., Han J., Pei J., Yan X., Yu P. (2002). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Proceedings of the NSF Workshop on Next Generation Data Mining.
    [16] Hulten G., Spencer L., Domingos P. (2001). Mining Time-Changing Data Streams. ACM KDD Conference.
    [17] Jin R., Agrawal G. (2005). An algorithm for in-core frequent itemset mining on streaming data. ICDM Conference.
  • 30. [18] Kifer D., Ben-David S., Gehrke J. (2004). Detecting Change in Data Streams. VLDB Conference.
    [19] Kollios G., Byers J., Considine J., Hadjieleftheriou M., Li F. (2005). Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.
    [20] Sakurai Y., Papadimitriou S., Faloutsos C. (2005). BRAID: Stream mining through group lag correlations. ACM SIGMOD Conference.
    [21] Yi B.-K., Sidiropoulos N. D., Johnson T., Jagadish H. V., Faloutsos C., Biliris A. (2000). Online data mining for co-evolving time sequences. ICDE Conference.
  • 31. Chapter 2
    ON CLUSTERING MASSIVE DATA STREAMS: A SUMMARIZATION PARADIGM
    Charu C. Aggarwal, IBM T. J. Watson Research Center, Hawthorne, NY 10532
    Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, hanj@cs.uiuc.edu
    Jianyong Wang, University of Illinois at Urbana-Champaign, Urbana, IL, jianyong@tsinghua.edu.cn
    Philip S. Yu, IBM T. J. Watson Research Center, Hawthorne, NY 10532
    Abstract: In recent years, data streams have become ubiquitous because of the large number of applications which generate huge volumes of data in an automated way. Many existing data mining methods cannot be applied directly on data streams because of the fact that the data needs to be mined in one pass. Furthermore, data streams show a considerable amount of temporal locality because of which a direct application of the existing methods may lead to misleading results. In this paper, we develop an efficient and effective approach for mining fast evolving data streams, which integrates the micro-clustering technique
  • 32. with the high-level data mining process, and discovers data evolution regularities as well. Our analysis and experiments demonstrate that two important data mining problems, namely stream clustering and stream classification, can be performed effectively using this approach, with high quality mining results. We discuss the use of micro-clustering as a general summarization technology to solve data mining problems on streams. Our discussion illustrates the importance of our approach for a variety of mining problems in the data stream domain.
    1. Introduction
    In recent years, advances in hardware technology have allowed us to automatically record transactions and other pieces of information of everyday life at a rapid rate. Such processes generate huge amounts of online data which grow at an unlimited rate. These kinds of online data are referred to as data streams. The issues on management and analysis of data streams have been researched extensively in recent years because of their emerging, imminent, and broad applications [11, 14, 17, 23].
    Many important problems such as clustering and classification have been widely studied in the data mining community. However, a majority of such methods may not be working effectively on data streams. Data streams pose special challenges to a number of data mining algorithms, not only because of the huge volume of the online data streams, but also because of the fact that the data in the streams may show temporal correlations. Such temporal correlations may help disclose important data evolution characteristics, and they can also be used to develop efficient and effective mining algorithms. Moreover, data streams require online mining, in which we wish to mine the data in a continuous fashion. Furthermore, the system needs to have the capability to perform an offline analysis as well based on the user interests. This is similar to an online analytical processing (OLAP) framework which uses the paradigm of pre-processing once, querying many times.
    Based on the above considerations, we propose a new stream mining framework, which adopts a tilted time window framework, takes micro-clustering as a preprocessing process, and integrates the preprocessing with the incremental, dynamic mining process. Micro-clustering preprocessing effectively compresses the data, preserves the general temporal locality of data, and facilitates both online and offline analysis, as well as the analysis of current data and data evolution regularities.
    In this study, we primarily concentrate on the application of this technique to two problems: (1) stream clustering, and (2) stream classification. The heart of the approach is to use an online summarization approach which is efficient and also allows for effective processing of the data streams. We also discuss
  • 33. a number of research directions, in which we show how the approach can be adapted to a variety of other problems.
    This paper is organized as follows. In the next section, we will present our micro-clustering based stream mining framework. In Section 3, we discuss the stream clustering problem. The classification methods are developed in Section 4. In Section 5, we discuss a number of other problems which can be solved with the micro-clustering approach, and other possible research directions. In Section 6, we will discuss some empirical results for the clustering and classification problems. In Section 7 we discuss the issues related to our proposed stream mining methodology and compare it with other related work. Section 8 concludes our study.
    Figure 2.1. Micro-clustering Examples
    Figure 2.2. Some Simple Time Windows
  • 34. 2. The Micro-clustering Based Stream Mining Framework
    In order to apply our technique to a variety of data mining algorithms, we utilize a micro-clustering based stream mining framework. This framework is designed by capturing summary information about the nature of the data stream. This summary information is defined by the following structures:
    Micro-clusters: We maintain statistical information about the data locality in terms of micro-clusters. These micro-clusters are defined as a temporal extension of the cluster feature vector [24]. The additivity property of the micro-clusters makes them a natural choice for the data stream problem.
    Pyramidal Time Frame: The micro-clusters are stored at snapshots in time which follow a pyramidal pattern. This pattern provides an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons.
    The summary information in the micro-clusters is used by an offline component which is dependent upon a wide variety of user inputs such as the time horizon or the granularity of clustering. In order to define the micro-clusters, we will introduce a few concepts. It is assumed that the data stream consists of a set of multi-dimensional records X_1 ... X_k ... arriving at time stamps T_1 ... T_k .... Each X_i is a multi-dimensional record containing d dimensions which are denoted by X_i = (x_i^1 ... x_i^d). We will first begin by defining the concept of micro-clusters and pyramidal time frame more precisely.
    DEFINITION 2.1 A micro-cluster for a set of d-dimensional points X_{i_1} ... X_{i_n} with time stamps T_{i_1} ... T_{i_n} is the (2 · d + 3) tuple (CF2^x, CF1^x, CF2^t, CF1^t, n), wherein CF2^x and CF1^x each correspond to a vector of d entries. The definition of each of these entries is as follows:
    - For each dimension, the sum of the squares of the data values is maintained in CF2^x. Thus, CF2^x contains d values. The p-th entry of CF2^x is equal to sum_{j=1..n} (x_{i_j}^p)^2.
    - For each dimension, the sum of the data values is maintained in CF1^x. Thus, CF1^x contains d values. The p-th entry of CF1^x is equal to sum_{j=1..n} x_{i_j}^p.
    - The sum of the squares of the time stamps T_{i_1} ... T_{i_n} is maintained in CF2^t.
    - The sum of the time stamps T_{i_1} ... T_{i_n} is maintained in CF1^t.
    - The number of data points is maintained in n.
    We note that the above definition of micro-cluster maintains similar summary information as the cluster feature vector of [24], except for the additional information about time stamps. We will refer to this temporal extension of the cluster feature vector for a set of points C by CFT(C). As in [24], this summary information can be expressed in an additive way over the different data points. This makes it a natural choice for use in data stream algorithms.
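    As a concrete rendering of Definition 2.1 and the additivity property, the following minimal sketch maintains the tuple (CF2^x, CF1^x, CF2^t, CF1^t, n) under point insertions and merges. It is our own illustration, not code from the chapter, and the class and method names are chosen only for exposition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MicroCluster:
    """Temporal cluster feature vector CFT(C) = (CF2x, CF1x, CF2t, CF1t, n)."""
    d: int                      # dimensionality of the data points
    cf2x: List[float] = None    # per-dimension sum of squares of values
    cf1x: List[float] = None    # per-dimension sum of values
    cf2t: float = 0.0           # sum of squared time stamps
    cf1t: float = 0.0           # sum of time stamps
    n: int = 0                  # number of points absorbed

    def __post_init__(self):
        if self.cf1x is None:
            self.cf1x = [0.0] * self.d
        if self.cf2x is None:
            self.cf2x = [0.0] * self.d

    def insert(self, x, t):
        """Absorb a d-dimensional point x arriving at time stamp t."""
        for p in range(self.d):
            self.cf1x[p] += x[p]
            self.cf2x[p] += x[p] * x[p]
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def merge(self, other):
        """Additivity: the CFT of a union of point sets is the sum of their CFTs."""
        for p in range(self.d):
            self.cf1x[p] += other.cf1x[p]
            self.cf2x[p] += other.cf2x[p]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n

    def centroid(self):
        """Mean of the absorbed points (requires n > 0)."""
        return [s / self.n for s in self.cf1x]
```

    Subtracting one such summary from another is component-wise in the same way, which is what allows the clusters of a horizon to be approximated later from the difference of two stored snapshots.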
  • 35. We note that the maintenance of a large number of micro-clusters is essential in the ability to maintain more detailed information about the micro-clustering process. For example, Figure 2.1 forms 3 clusters, which are denoted by a, b, c. At a later stage, evolution forms 3 different figures a1, a2, bc, with a split into a1 and a2, whereas b and c merged into bc. If we keep micro-clusters (each point represents a micro-cluster), such evolution can be easily captured. However, if we keep only 3 cluster centers a, b, c, it is impossible to derive the later a1, a2, bc clusters since the information of more detailed points is already lost.
    The data stream clustering algorithm discussed in this paper can generate approximate clusters in any user-specified length of history from the current instant. This is achieved by storing the micro-clusters at particular moments in the stream which are referred to as snapshots. At the same time, the current snapshot of micro-clusters is always maintained by the algorithm. The macro-clustering algorithm discussed at a later stage in this paper will use these finer level micro-clusters in order to create higher level clusters which can be more easily understood by the user. Consider for example, the case when the current clock time is t_c and the user wishes to find clusters in the stream based on a history of length h. Then, the macro-clustering algorithm discussed in this paper will use some of the additive properties of the micro-clusters stored at snapshots t_c and (t_c - h) in order to find the higher level clusters in a history or time horizon of length h. Of course, since it is not possible to store the snapshots at each and every moment in time, it is important to choose particular instants of time at which it is possible to store the state of the micro-clusters so that clusters in any user specified time horizon (t_c - h, t_c) can be approximated.
    We note that some examples of time frames used for the clustering process are the natural time frame (Figure 2.2(a) and (b)), and the logarithmic time frame (Figure 2.2(c)). In the natural time frame the snapshots are stored at regular intervals. We note that the scale of the natural time frame could be based on the application requirements. For example, we could choose days, months or years depending upon the level of granularity required in the analysis. A more flexible approach is to use the logarithmic time frame in which different variations of the time interval can be stored. As illustrated in Figure 2.2(c), we store snapshots at times of t, 2t, 4t .... The danger of this is that we may jump too far between successive levels of granularity. We need an intermediate solution which provides a good balance between storage requirements and the level of approximation with which a user specified horizon can be approximated.
    In order to achieve this, we will introduce the concept of a pyramidal time frame. In this technique, the snapshots are stored at differing levels of granularity depending upon the recency. Snapshots are classified into different orders which can vary from 1 to log(T), where T is the clock time elapsed since the
  • 36. beginning of the stream. The order of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained. The snapshots of different order are maintained as follows:
    - Snapshots of the i-th order occur at time intervals of α^i, where α is an integer and α >= 1. Specifically, each snapshot of the i-th order is taken at a moment in time when the clock value from the beginning of the stream is exactly divisible by α^i.
    - At any given moment in time, only the last α + 1 snapshots of order i are stored.
    We note that the above definition allows for considerable redundancy in storage of snapshots. For example, the clock time of 8 is divisible by 2^0, 2^1, 2^2, and 2^3 (where α = 2). Therefore, the state of the micro-clusters at a clock time of 8 simultaneously corresponds to order 0, order 1, order 2 and order 3 snapshots. From an implementation point of view, a snapshot needs to be maintained only once. We make the following observations:
    - For a data stream, the maximum order of any snapshot stored at T time units since the beginning of the stream mining process is log_α(T).
    - For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) · log_α(T).
    - For any user specified time window of h, at least one stored snapshot can be found within 2 · h units of the current time.
    While the first two results are quite easy to see, the last one needs to be proven formally.
    LEMMA 2.2 Let h be a user-specified time window, t_c be the current time, and t_s be the time of the last stored snapshot of any order just before the time t_c - h. Then t_c - t_s <= 2 · h.
    Proof: Let r be the smallest integer such that α^r >= h. Therefore, we know that α^{r-1} < h. Since we know that there are α + 1 snapshots of order (r - 1), at least one snapshot of order r - 1 must always exist before t_c - h. Let t_s be the snapshot of order r - 1 which occurs just before t_c - h. Then (t_c - h) - t_s <= α^{r-1}. Therefore, we have t_c - t_s <= h + α^{r-1} < 2 · h.
    Thus, in this case, it is possible to find a snapshot within a factor of 2 of any user-specified time window. Furthermore, the total number of snapshots which need to be maintained is relatively modest. For example, for a data stream running for 100 years with a clock time granularity of 1 second, the total number of snapshots which need to be maintained is given by (2 + 1) · log2(100 * 365 * 24 * 60 * 60) ≈ 95. This is quite a modest requirement given the fact that a snapshot within a factor of 2 can always be found within any user specified time window.
    It is possible to improve the accuracy of time horizon approximation at a modest additional cost. In order to achieve this, we save the α^l + 1 snapshots of order r for l > 1.
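    The bookkeeping implied by the two rules above is straightforward to write down. The sketch below is our own illustration, not code from the chapter: the class and parameter names are invented for exposition, with α and the per-order capacity α^l + 1 as parameters. It records a snapshot under every order whose α^i divides the clock time and keeps only the most recent snapshots of each order.

```python
from collections import defaultdict

class PyramidalFrame:
    """Store snapshot times by order i (times divisible by alpha**i),
    keeping at most alpha**l + 1 snapshot times per order."""

    def __init__(self, alpha=2, l=2):
        self.alpha = alpha
        self.capacity = alpha ** l + 1
        self.orders = defaultdict(list)     # order i -> stored clock times

    def store(self, t):
        """Record that a snapshot of the micro-clusters was taken at clock time t."""
        i, power = 0, 1
        while t % power == 0:               # every order whose alpha**i divides t
            self.orders[i].append(t)
            if len(self.orders[i]) > self.capacity:
                self.orders[i].pop(0)       # drop the oldest snapshot of this order
            i += 1
            power *= self.alpha

    def closest_before(self, target):
        """Latest stored snapshot time <= target, used to approximate a horizon."""
        stored = {t for times in self.orders.values() for t in times}
        candidates = [t for t in stored if t <= target]
        return max(candidates) if candidates else None

frame = PyramidalFrame(alpha=2, l=2)
for clock in range(1, 56):
    frame.store(clock)
# The union of the stored orders is 16, 24, 32, 36, 40, ..., 55, as noted in the text.
print(sorted({t for ts in frame.orders.values() for t in ts}))
print(frame.closest_before(55 - 20))   # snapshot used for a horizon of h = 20 at t_c = 55
```

    Running the example to a clock time of 55 with α = 2 and l = 2 reproduces the set of stored snapshot times discussed below, and the horizon query returns the snapshot at time 32, which lies within a factor of 2 of the requested history of length 20.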
  • 37. In this case, the storage requirement of the technique corresponds to (α^l + 1) · log_α(T) snapshots. On the other hand, the accuracy of time horizon approximation also increases substantially. In this case, any time horizon can be approximated to a factor of (1 + 1/α^{l-1}). We summarize this result as follows:
    LEMMA 2.3 Let h be a user specified time horizon, t_c be the current time, and t_s be the time of the last stored snapshot of any order just before the time t_c - h. Then t_c - t_s <= (1 + 1/α^{l-1}) · h.
    Proof: Similar to previous case.
    For larger values of l, the time horizon can be approximated as closely as desired. For example, by choosing l = 10, it is possible to approximate any time horizon within 0.2%, while a total of only (2^10 + 1) · log2(100 * 365 * 24 * 60 * 60) = 32343 snapshots are required for 100 years. Since historical snapshots can be stored on disk and only the current snapshot needs to be maintained in main memory, this requirement is quite feasible from a practical point of view. It is also possible to specify the pyramidal time window in accordance with user preferences corresponding to particular moments in time such as beginning of calendar years, months, and days. While the storage requirements and horizon estimation possibilities of such a scheme are different, all the algorithmic descriptions of this paper are directly applicable.
    In order to clarify the way in which snapshots are stored, let us consider the case when the stream has been running starting at a clock time of 1, and a use of α = 2 and l = 2. Therefore 2^2 + 1 = 5 snapshots of each order are stored. Then, at a clock time of 55, snapshots at the clock times illustrated in Table 2.1 are stored.
    Table 2.1. An example of snapshots stored for α = 2 and l = 2
    Order of Snapshots | Clock Times (Last 5 Snapshots)
    0 | 55 54 53 52 51
    1 | 54 52 50 48 46
    2 | 52 48 44 40 36
    3 | 48 40 32 24 16
    4 | 48 32 16
    5 | 32
    We note that a large number of snapshots are common among different orders. From an implementation point of view, the states of the micro-clusters at times of 16, 24, 32, 36, 40, 44, 46, 48, 50, 51, 52, 53, 54, and 55 are stored. It is easy to see that for more recent clock times, there is less distance between successive snapshots (better granularity). We also note that the storage requirements
  • 38. 16 DATA STMAMS: MODELS AND ALGORITHMS estimated in this section do not take this redundancy into account. Therefore, the requirements which have been presented so far are actually worst-case re- quirements. These redundancies can be eliminated by using a systematicrule described in [6], orby using amore sophisticatedgeometrictime frame. Inthistechnique, snapshotsareclassifiedintodifferentframe numberswhich can varyfrom0to a valueno largerthanlog2(T),whereTisthemaximumlengthofthestream. The frame number of a particular class of snapshotsdefines the level of granularity in time at which the snapshotsare maintained. Specifically,snapshotsof frame number i are stored at clock times which are divisible by 2i, but not by 2i+1. Therefore, snapshots of frame number 0 are stored only at odd clock times. It is assumed that for each frame number, at most max-capacity snapshots are stored. We note that for a data stream,the maximum framenumber of any snapshot stored at T time units since the beginning of the stream mining process is log2(T). Since at most max-capacity snapshots of any order are stored, this also means that the maximum number of snapshotsmaintainedat T time units sincethebeginning ofthe streamminingprocess is (max-capacity) .log2(T). Oneinterestingcharacteristicof thegeometrictimewindowisthat foranyuser- specifiedtime window of h, at least one stored snapshot can be found within a factor of 2 of the specified horizon. This ensures that sufficient granularity is available for analyzing the behavior of the data stream over different time horizons. We will formalize this result in the lemma below. LEMMA 2.4 Let h be a user-specijiedtime window,and t, be the current time. Let us also assume that max-capacity >2. Thena snapshot exists at time t,, such that h/2 5 t, - t, I :2 .h. Proof: Let r be the smallestintegersuchthat h < 2T+1.Sincer is the smallest such integer, it also means that h > 2'. This means that for any interval (t, - h, t,) of length h, at least one integer t' E (t, - h, t,) must exist which satisfiesthepropertythat t' mod 2'-l = 0andt' mod 2r # 0. Let t' be thetime stamp of the last (most current) such snapshot. This also means the following: Then, if max-capacity isat least 2, the secondlast snapshotof order (r -1) is also stored and has a time-stamp value of t' - 2'. Let us pick the time t, = t' - 2'. By substitutingthe value oft,, we get: t, - t, = (t, - t' + Since (t, - t') L 0 and 2' > h/2, it easily follows from Equation 2.2 that tc -t, > h/2.
Since t' is the position of the latest snapshot of frame (r - 1) occurring before the current time t_c, it follows that (t_c - t') < 2^r. Substituting this inequality in Equation 2.2, we get t_c - t_s < 2^r + 2^r ≤ h + h = 2 · h. Thus, we have h/2 ≤ t_c - t_s ≤ 2 · h.

The above result ensures that every possible horizon can be closely approximated within a modest level of accuracy. While the geometric time frame shares a number of conceptual similarities with the pyramidal time frame [6], it is actually quite different and also much more efficient, because it eliminates the double counting of snapshots over different frame numbers that occurs with the pyramidal time frame [6]. In Table 2.2, we present an example of a frame table illustrating snapshots of different frame numbers. The rules for insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^(i+1)) ≠ 0, t is inserted into frame number i; (2) each slot has a max_capacity (which is 3 in our example). At the insertion of t into frame number i, if the slot has already reached its max_capacity, the oldest snapshot in this frame is removed and the new snapshot is inserted. For example, at time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, 70 is inserted into frame number 1, which knocks out the oldest snapshot 58 if the slot capacity is 3. Following this rule, when the slot capacity is 3, the following snapshots are stored in the geometric time window table: 16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70, as shown in Table 2.2. From the table, one can see that the closer to the current time, the denser are the snapshots stored.

Table 2.2. A geometric time window

Frame no. | Snapshots (by clock time)
0         | 69 67 65
1         | 70 66 62
2         | 68 60 52
3         | 56 40 24
4         | 48 16
5         | 32
6         | 64
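As a sanity check, the short simulation below applies the two insertion rules with max_capacity = 3 to clock times 1 through 70. It is an illustrative sketch only; it reproduces the stored set listed above and empirically confirms the guarantee of Lemma 2.4 for a range of horizons.

```python
def geometric_frames(t_now, max_capacity=3):
    """Simulate the geometric time window: a snapshot taken at time t goes into
    frame i, where 2**i divides t but 2**(i+1) does not; each frame keeps only
    its max_capacity most recent snapshots."""
    frames = {}
    for t in range(1, t_now + 1):
        i = 0
        while t % (2 ** (i + 1)) == 0:
            i += 1
        frame = frames.setdefault(i, [])
        frame.append(t)
        if len(frame) > max_capacity:
            frame.pop(0)                     # knock out the oldest snapshot
    return frames


frames = geometric_frames(70)
stored = sorted(t for frame in frames.values() for t in frame)
assert stored == [16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70]

# Empirical check of Lemma 2.4 at t_c = 70: for every horizon h there is a
# stored snapshot t_s with h/2 <= t_c - t_s <= 2*h.
for h in range(1, 65):
    assert any(h / 2 <= 70 - t_s <= 2 * h for t_s in stored)
```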
3. Clustering Evolving Data Streams: A Micro-clustering Approach

The clustering problem is defined as follows: for a given set of data points, we wish to partition them into one or more groups of similar objects, where the similarity of the objects with one another is typically defined with the use of some distance measure or objective function. The clustering problem has been widely researched in the database, data mining and statistics communities [12, 18, 22, 20, 21, 24] because of its use in a wide range of applications. Recently, the clustering problem has also been studied in the context of the data stream environment [17, 23].

A previous algorithm called STREAM [23] assumes that the clusters are to be computed over the entire data stream. While such a task may be useful in many applications, a clustering problem may often be defined only over a portion of a data stream. This is because a data stream should be viewed as an infinite process consisting of data which continuously evolves with time. As a result, the underlying clusters may also change considerably with time. The nature of the clusters may vary with both the moment at which they are computed and the time horizon over which they are measured. For example, a data analyst may wish to examine clusters occurring in the last month, last year, or last decade, and such clusters may be considerably different. Therefore, we assume that one of the inputs to the clustering algorithm is a time horizon over which the clusters are found. Next, we will discuss CluStream, the online algorithm used for clustering data streams.

3.1 Micro-clustering Challenges

We note that since stream data naturally imposes a one-pass constraint on the design of the algorithms, it becomes more difficult to provide such flexibility in computing clusters over different kinds of time horizons using conventional algorithms. For example, a direct extension of the stream-based k-means algorithm in [23] to such a case would require simultaneous maintenance of the intermediate results of clustering algorithms over all possible time horizons. Such a computational burden increases with the progression of the data stream and can rapidly become a bottleneck for online implementation. Furthermore, in many cases, an analyst may wish to determine the clusters at a previous moment in time and compare them to the current clusters. This requires even greater book-keeping and can rapidly become unwieldy for fast data streams.

Since a data stream cannot be revisited over the course of the computation, the clustering algorithm needs to maintain a substantial amount of information so that important details are not lost. For example, the algorithm in [23] is implemented as a continuous version of the k-means algorithm which continues to maintain a number of cluster centers that change or merge as necessary throughout the execution of the algorithm. Such an approach is especially risky when the characteristics of the stream change over time. This is because the amount of information maintained by a k-means type approach is too approximate in granularity, and once two cluster centers are joined, there is no way to informatively split the clusters when required by changes in the stream at a later stage.
Therefore, a natural design for stream clustering would be to separate the process into an online micro-clustering component and an offline macro-clustering component. The online micro-clustering component requires a very efficient process for storage of appropriate summary statistics in a fast data stream. The offline component uses these summary statistics in conjunction with other user input in order to provide the user with a quick understanding of the clusters whenever required. Since the offline component requires only the summary statistics as input, it turns out to be very efficient in practice. This leads to several challenges:

- What is the nature of the summary information which can be stored efficiently in a continuous data stream? The summary statistics should provide sufficient temporal and spatial information for a horizon-specific offline clustering process, while being amenable to an efficient (online) update process.
- At what moments in time should the summary information be stored away on disk? How can an effective trade-off be achieved between the storage requirements of such a periodic process and the ability to cluster for a specific time horizon to within a desired level of approximation?
- How can the periodic summary statistics be used to provide clustering and evolution insights over user-specified time horizons?

3.2 Online Micro-cluster Maintenance: The CluStream Algorithm

The micro-clustering phase is the online statistical data collection portion of the algorithm. This process is not dependent on any user input such as the time horizon or the required granularity of the clustering process. The aim is to maintain statistics at a sufficiently high level of (temporal and spatial) granularity so that they can be effectively used by the offline components, such as horizon-specific macro-clustering and evolution analysis. The basic concept of the micro-cluster maintenance algorithm derives ideas from the k-means and nearest-neighbor algorithms. The algorithm works in an iterative fashion, always maintaining a current set of micro-clusters. It is assumed that a total of q micro-clusters are stored at any moment by the algorithm. We will denote these micro-clusters by M_1 ... M_q. Associated with each micro-cluster i, we create a unique id whenever it is first created. If two micro-clusters are merged (as will become evident from the details of our maintenance algorithm), a list of ids is created in order to identify the constituent micro-clusters. The value of q is determined by the amount of main memory available to store the micro-clusters. Therefore, typical values of q are significantly larger than the natural number of clusters in the data, but also significantly smaller than the number of data points arriving in a long period of time for a massive data stream. These micro-clusters represent the current snapshot of clusters, which changes over the course of the stream as new points arrive.
Their status is stored away on disk whenever the clock time is divisible by α^i for any integer i. At the same time, any micro-clusters of order r which were stored at a time in the past more remote than α^(l+r) units are deleted by the algorithm.

We first need to create the initial q micro-clusters. This is done using an offline process at the very beginning of the data stream computation: we store the first InitNumber points on disk and use a standard k-means clustering algorithm in order to create the q initial micro-clusters. The value of InitNumber is chosen to be as large as permitted by the computational complexity of a k-means algorithm creating q clusters.

Once these initial micro-clusters have been established, the online process of updating the micro-clusters is initiated. Whenever a new data point arrives, the micro-clusters are updated in order to reflect the changes. Each data point either needs to be absorbed by a micro-cluster, or it needs to be put in a cluster of its own. The first preference is to absorb the data point into a currently existing micro-cluster. We first find the distance of each data point to the micro-cluster centroids M_1 ... M_q. Let us denote the distance value of the data point X_ik to the centroid of the micro-cluster M_j by dist(M_j, X_ik). Since the centroid of the micro-cluster is available in the cluster feature vector, this value can be computed relatively easily. We then find the closest cluster M_p to the data point X_ik. We note that in many cases, the point X_ik does not naturally belong to the cluster M_p. These cases are as follows:

- The data point X_ik corresponds to an outlier.
- The data point X_ik corresponds to the beginning of a new cluster because of evolution of the data stream.

While the two cases above cannot be distinguished until more data points arrive, the data point needs to be assigned a (new) micro-cluster of its own with a unique id. How do we decide whether a completely new cluster should be created? In order to make this decision, we use the cluster feature vector of M_p to decide if this data point falls within the maximum boundary of the micro-cluster M_p. If so, then the data point X_ik is added to the micro-cluster M_p using the CF additivity property. The maximum boundary of the micro-cluster M_p is defined as a factor t of the RMS deviation of the data points in M_p from the centroid; we define this as the maximal boundary factor. We note that the RMS deviation can only be defined for a cluster with more than one point. For a cluster with only one previous point, the maximum boundary is defined in a heuristic way; specifically, we choose it to be r times that of the next closest cluster. If the data point does not lie within the maximum boundary of the nearest micro-cluster, then a new micro-cluster must be created containing the data point X_ik.
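Before turning to how space is freed for such a new micro-cluster, the absorb-or-create decision can be summarized in a short sketch. The class below assumes the usual CluStream cluster-feature tuple (per-dimension sums and squared sums of the points, the corresponding timestamp moments, a point count and an id list), which is defined earlier in the chapter; the helper names, the default boundary factor, and the omission of the single-point boundary heuristic are illustrative choices rather than part of the original algorithm description.

```python
import math


class MicroCluster:
    """Minimal micro-cluster summary (assumed CF tuple: CF1x, CF2x, CF1t, CF2t, n)."""

    def __init__(self, point, timestamp, cluster_id):
        self.cf1x = list(point)                      # per-dimension sum of values
        self.cf2x = [v * v for v in point]           # per-dimension sum of squares
        self.cf1t = timestamp                        # sum of timestamps
        self.cf2t = timestamp * timestamp            # sum of squared timestamps
        self.n = 1                                   # number of points
        self.ids = [cluster_id]                      # id list (grows on merges)

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        c = self.centroid()
        var = sum(sq / self.n - m * m for sq, m in zip(self.cf2x, c))
        return math.sqrt(max(var, 0.0))

    def distance(self, point):
        return math.dist(point, self.centroid())

    def absorb(self, point, timestamp):
        """CF additivity: absorbing a point only updates the sums."""
        for d, v in enumerate(point):
            self.cf1x[d] += v
            self.cf2x[d] += v * v
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1


def place_point(point, timestamp, clusters, new_id, t_factor=2.0):
    """Absorb the point into its nearest micro-cluster if it lies within the
    maximum boundary (t_factor times the RMS deviation); otherwise start a new
    micro-cluster with its own id."""
    if clusters:
        nearest = min(clusters, key=lambda m: m.distance(point))
        if nearest.n > 1 and nearest.distance(point) <= t_factor * nearest.rms_deviation():
            nearest.absorb(point, timestamp)
            return
    # The single-point boundary heuristic and the delete-or-merge step
    # described in the text are omitted from this sketch.
    clusters.append(MicroCluster(point, timestamp, new_id))
```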
This newly created micro-cluster is assigned a new id which can identify it uniquely at any future stage of the data stream process. However, in order to create this new micro-cluster, the number of other clusters must be reduced by one in order to create memory space. This can be achieved by either deleting an old cluster or joining two of the old clusters. Our maintenance algorithm first determines if it is safe to delete any of the current micro-clusters as outliers. If not, then a merge of two micro-clusters is initiated.

The first step is to identify if any of the old micro-clusters are possibly outliers which can be safely deleted by the algorithm. While it might be tempting to simply pick the micro-cluster with the fewest number of points as the micro-cluster to be deleted, this may often lead to misleading results. In many cases, a given micro-cluster might correspond to a point of considerable cluster presence in the past history of the stream, but may no longer be an active cluster in the recent stream activity. Such a micro-cluster can be considered an outlier from the current point of view. An ideal goal would be to estimate the average timestamp of the last m arrivals in each micro-cluster M, and delete the micro-cluster with the least recent timestamp. While the above estimation can be achieved by simply storing the last m points in each micro-cluster, this increases the memory requirements of a micro-cluster by a factor of m. Such a requirement reduces the number of micro-clusters that can be stored in the available memory and therefore reduces the effectiveness of the algorithm.

We will find a way to approximate the average timestamp of the last m data points of the cluster M. This will be achieved by using the data about the timestamps stored in the micro-cluster M. We note that the timestamp data allows us to calculate the mean and standard deviation of the arrival times of points in a given micro-cluster M. Let these values be denoted by μ_M and σ_M respectively. Then, we find the time of arrival of the m/(2 · n)-th percentile of the points in M, assuming that the timestamps are normally distributed. This timestamp is used as the approximate value of the recency. We shall call this value the relevance stamp of cluster M. When the least relevance stamp of any micro-cluster is below a user-defined threshold δ, it can be eliminated and a new micro-cluster can be created with a unique id corresponding to the newly arrived data point X_ik.

In some cases, none of the micro-clusters can be readily eliminated. This happens when all relevance stamps are sufficiently recent and lie above the user-defined threshold δ. In such a case, two of the micro-clusters need to be merged. We merge the two micro-clusters which are closest to one another. The new micro-cluster no longer corresponds to one id. Instead, an idlist is created which is a union of the ids in the individual micro-clusters. Thus, any micro-cluster which is the result of one or more merging operations can be identified in terms of the individual micro-clusters merged into it.
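One plausible reading of the relevance-stamp rule is sketched below, building on the MicroCluster sketch above. NormalDist comes from the Python standard library; interpreting the m/(2 · n)-th percentile as the (1 - m/(2n)) quantile of the arrival-time distribution, and comparing the stamp directly against the threshold δ, are assumptions rather than statements of the original algorithm.

```python
from statistics import NormalDist


def relevance_stamp(cluster, m):
    """Approximate average timestamp of the cluster's last m arrivals, using
    only the timestamp moments kept in the micro-cluster and assuming normally
    distributed arrival times."""
    n = cluster.n
    mu = cluster.cf1t / n
    var = max(cluster.cf2t / n - mu * mu, 0.0)
    if n == 1 or var == 0.0:
        return mu
    fraction = min(m / (2.0 * n), 0.5)
    return NormalDist(mu, var ** 0.5).inv_cdf(1.0 - fraction)


def free_one_slot(clusters, m, delta):
    """Delete the stalest micro-cluster if its relevance stamp falls below the
    threshold delta; otherwise merge the two closest micro-clusters and take
    the union of their id lists."""
    stalest = min(clusters, key=lambda c: relevance_stamp(c, m))
    if relevance_stamp(stalest, m) < delta:
        clusters.remove(stalest)
        return
    a, b = min(((x, y) for x in clusters for y in clusters if x is not y),
               key=lambda pair: pair[0].distance(pair[1].centroid()))
    for d in range(len(a.cf1x)):
        a.cf1x[d] += b.cf1x[d]
        a.cf2x[d] += b.cf2x[d]
    a.cf1t += b.cf1t
    a.cf2t += b.cf2t
    a.n += b.n
    a.ids += b.ids            # id list records the constituent micro-clusters
    clusters.remove(b)
```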
While the above process of updating is executed at the arrival of each data point, an additional process is executed at each clock time which is divisible by α^i for any integer i. At each such time, we store away the current set of micro-clusters (possibly on disk), together with their id lists, indexed by their time of storage. We also delete the least recent snapshot of order i, if α^l + 1 snapshots of that order have already been stored on disk and if the clock time for this snapshot is not divisible by α^(i+1). (In the latter case, the snapshot continues to be a viable snapshot of order (i + 1).) These micro-clusters can then be used to form higher-level clusters or to perform an evolution analysis of the data stream.

3.3 High Dimensional Projected Stream Clustering

The method can also be extended to the case of high-dimensional projected stream clustering. The algorithm is referred to as HPSTREAM. The high-dimensional case presents a special challenge to clustering algorithms even in the traditional domain of static data sets, because of the sparsity of the data in the high-dimensional case. In high-dimensional space, all pairs of points tend to be almost equidistant from one another. As a result, it is often unrealistic to define distance-based clusters in a meaningful way. Some recent work on high-dimensional data uses techniques for projected clustering which can determine clusters for a specific subset of dimensions [1, 4]. In these methods, the definitions of the clusters are such that each cluster is specific to a particular group of dimensions. This alleviates the sparsity problem in high-dimensional space to some extent. Even though a cluster may not be meaningfully defined on all the dimensions because of the sparsity of the data, some subset of the dimensions can always be found on which particular subsets of points form high-quality and meaningful clusters. Of course, these subsets of dimensions may vary over the different clusters. Such clusters are referred to as projected clusters [1].

In [8], we have discussed methods for high-dimensional projected clustering of data streams. The basic idea is to use an (incremental) algorithm in which we associate a set of dimensions with each cluster. The set of dimensions is represented as a d-dimensional bit vector B(C_i) for each cluster structure in FCS. This bit vector contains a 1 bit for each dimension which is included in cluster C_i. In addition, the maximum number of clusters k and the average cluster dimensionality l are used as input parameters. The average cluster dimensionality l represents the average number of dimensions used in the cluster projection. An iterative approach is used in which the dimensions are used to update the clusters and vice versa. The structure in FCS uses a decay-based mechanism in order to adjust for evolution in the underlying data stream. Details are discussed in [8].
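To make the projected-cluster structure concrete, here is a rough sketch of a summary with a dimension bit vector and a decay factor. It is only an illustration under stated assumptions: the dimension-selection rule shown (keep the l dimensions with the smallest spread) and the decay constant are assumptions, and the actual update rules of HPSTREAM are those given in [8].

```python
import math


class ProjectedCluster:
    """Illustrative projected-cluster summary: decayed per-dimension moments
    plus a bit vector B(Ci) marking the dimensions the cluster is defined on."""

    def __init__(self, point):
        self.weight = 1.0
        self.sum1 = list(point)
        self.sum2 = [v * v for v in point]
        self.bits = [1] * len(point)          # 1 = dimension used in the projection

    def decay(self, lam=0.01):
        """Decay-based adjustment for stream evolution (the factor is an assumption)."""
        f = 2.0 ** (-lam)
        self.weight *= f
        self.sum1 = [s * f for s in self.sum1]
        self.sum2 = [s * f for s in self.sum2]

    def add(self, point):
        self.weight += 1.0
        for i, v in enumerate(point):
            self.sum1[i] += v
            self.sum2[i] += v * v

    def spread(self):
        return [max(self.sum2[i] / self.weight - (self.sum1[i] / self.weight) ** 2, 0.0)
                for i in range(len(self.sum1))]

    def choose_dimensions(self, l):
        """Keep the l dimensions with the smallest spread (illustrative rule only)."""
        sp = self.spread()
        keep = set(sorted(range(len(sp)), key=sp.__getitem__)[:l])
        self.bits = [1 if i in keep else 0 for i in range(len(sp))]

    def projected_distance(self, point):
        """Distance measured only over the projected dimensions."""
        c = [s / self.weight for s in self.sum1]
        return math.sqrt(sum((point[i] - c[i]) ** 2
                             for i in range(len(point)) if self.bits[i]))
```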
Figure 2.3. Varying Horizons for the classification process

4. Classification of Data Streams: A Micro-clustering Approach

One important data mining problem which has been studied in the context of data streams is that of stream classification [15]. The main thrust on data stream mining in the context of classification has been that of one-pass mining [14, 19]. In general, the use of one-pass mining does not recognize the changes which have occurred in the model since the beginning of the stream construction process [5]. While the work in [19] works on time-changing data streams, the focus is on providing effective methods for incremental updating of the classification model. We note that the accuracy of such a model cannot be greater than that of the best sliding-window model on a data stream. For example, in the case illustrated in Figure 2.3, we have illustrated two classes (labeled by 'x' and '-') whose distribution changes over time. Correspondingly, the best horizon at times t1 and t2 will also be different. As our empirical results will show, the true behavior of the data stream is captured in a temporal model which is sensitive to the level of evolution of the data stream.

The classification process may require simultaneous model construction and testing in an environment which constantly evolves over time. We assume that the testing process is performed concurrently with the training process. This is often the case in many practical applications, in which only a portion of the data is labeled, whereas the remainder is not. Therefore, such data can be separated out into the (labeled) training stream and the (unlabeled) testing stream. The main difference in the construction of the micro-clusters is that the micro-clusters are associated with a class label; therefore an incoming data point in the training stream can only be added to a micro-cluster belonging to the same class. Therefore, we construct micro-clusters in almost the same way as in the unsupervised algorithm, with an additional class-label restriction.
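In code, the class-label restriction is a small change to the unsupervised update. The sketch below reuses the MicroCluster and placement logic sketched earlier and is only an illustration: the label attribute, the per-class dictionary, and the omission of the delete/merge step are assumptions.

```python
def update_training_point(point, label, timestamp, clusters_by_class, new_id,
                          t_factor=2.0):
    """Supervised micro-cluster update: a labeled training point may only be
    absorbed by a micro-cluster that carries the same class label."""
    same_class = clusters_by_class.setdefault(label, [])
    if same_class:
        nearest = min(same_class, key=lambda m: m.distance(point))
        if nearest.n > 1 and nearest.distance(point) <= t_factor * nearest.rms_deviation():
            nearest.absorb(point, timestamp)
            return
    mc = MicroCluster(point, timestamp, new_id)
    mc.label = label                    # class label attached to the micro-cluster
    same_class.append(mc)
```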
From the testing perspective, the important point to be noted is that the most effective classification model does not stay constant over time, but varies with the progression of the data stream. If a static classification model were used for an evolving test stream, the accuracy of the underlying classification process is likely to drop suddenly when there is a sudden burst of records belonging to a particular class. In such a case, a classification model which is constructed using a smaller history of data is likely to provide better accuracy. In other cases, a longer history of training provides greater robustness.

In the classification process of an evolving data stream, either the short-term or the long-term behavior of the stream may be more important, and it often cannot be known a priori which one is more important. How do we decide the window or horizon of the training data to use so as to obtain the best classification accuracy? While techniques such as decision trees are useful for one-pass mining of data streams [14, 19], they cannot easily be used in the context of an on-demand classifier in an evolving environment. This is because such a classifier requires rapid variation in the horizon selection process due to data stream evolution. Furthermore, it is too expensive to keep track of the entire history of the data in its original fine granularity. Therefore, the on-demand classification process still requires the appropriate machinery for efficient statistical data collection in order to perform the classification process.

4.1 On-Demand Stream Classification

We use the micro-clusters to perform an On-Demand Stream Classification Process. In order to perform effective classification of the stream, it is important to find the correct time horizon which should be used for classification. How do we find the most effective horizon for classification at a given moment in time? In order to do so, a small portion of the training stream is not used for the creation of the micro-clusters. This portion of the training stream is referred to as the horizon fitting stream segment. The number of points in the stream used for horizon fitting is denoted by kfit. The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters, as discussed in the previous section.

Since the micro-clusters are based on the entire history of the stream, they cannot directly be used to test the effectiveness of the classification process over different time horizons. This is essential, since we would like to find the time horizon which provides the greatest accuracy during the classification process. We will denote the set of micro-clusters at time t_c and horizon h by N(t_c, h). This set of micro-clusters is determined by subtracting out the micro-clusters at time t_c - h from the micro-clusters at time t_c. The subtraction operation is naturally defined for the micro-clustering approach: the essential idea is to match the micro-clusters at time t_c to the micro-clusters at time t_c - h, and subtract out the corresponding statistics. The additive property of micro-clusters ensures that the resulting clusters correspond to the horizon (t_c - h, t_c). More details can be found in [6].
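A sketch of this subtraction is shown below, again building on the MicroCluster sketch above. Matching the two snapshots through shared entries in their id lists is a simplification of the procedure described in [6].

```python
import copy


def horizon_clusters(snap_now, snap_past):
    """Approximate N(t_c, h): subtract the snapshot stored at time t_c - h from
    the current snapshot, matching micro-clusters through their id lists and
    exploiting the additivity of the CF statistics."""
    past_by_id = {i: mc for mc in snap_past for i in mc.ids}
    result = []
    for mc in snap_now:
        diff = copy.deepcopy(mc)
        old = next((past_by_id[i] for i in mc.ids if i in past_by_id), None)
        if old is not None:
            for d in range(len(diff.cf1x)):
                diff.cf1x[d] -= old.cf1x[d]
                diff.cf2x[d] -= old.cf2x[d]
            diff.cf1t -= old.cf1t
            diff.cf2t -= old.cf2t
            diff.n -= old.n
        if diff.n > 0:
            result.append(diff)
    return result
```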
Once the micro-clusters for a particular time horizon have been determined, they are utilized to determine the classification accuracy of that particular horizon. This process is executed periodically in order to adjust for the changes which have occurred in the stream in recent time periods. For this purpose, we use the horizon fitting stream segment. The last kfit points which have arrived in the horizon fitting stream segment are utilized in order to test the classification accuracy of that particular horizon. The value of kfit is chosen while taking into consideration the computational complexity of the horizon accuracy estimation. In addition, the value of kfit should be small enough that the points in it reflect the immediate locality of t_c. Typically, the value of kfit should be chosen in such a way that the least recent point is no further than a pre-specified number of time units from the current time t_c. Let us denote this set of points by Q_fit. Note that since Q_fit is a part of the training stream, the class labels are known a priori.

In order to test the classification accuracy of the process, each point X ∈ Q_fit is used in the following nearest-neighbor classification procedure:

- We find the closest micro-cluster in N(t_c, h) to X.
- We determine the class label of this micro-cluster and compare it to the true class label of X.

The accuracy over all the points in Q_fit is then determined. This provides the accuracy over that particular time horizon. The accuracy of all the time horizons which are tracked by the geometric time frame is determined. The p time horizons which provide the greatest dynamic classification accuracy (using the last kfit points) are selected for the classification of the stream. Let us denote the corresponding horizon values by H = {h_1 ... h_p}. We note that since kfit represents only a small locality of the points within the current time period t_c, it would seem at first sight that the system would always pick the smallest possible horizons in order to maximize the accuracy of classification. However, this is often not the case for evolving data streams. Consider, for example, a data stream in which the records for a given class arrive for a period, and then subsequently start arriving again after a time interval in which the records for another class have arrived. In such a case, the horizon which includes previous occurrences of the same class is likely to provide higher accuracy than shorter horizons. Thus, such a system dynamically adapts to the most effective horizon for classification of data streams. In addition, for a stable stream, the system is also likely to pick larger horizons because of the greater accuracy resulting from the use of larger data sizes.
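Putting the pieces together, the following sketch scores each tracked horizon on the horizon-fitting segment, keeps the p best, and anticipates the majority vote over those horizons described next. It assumes the helpers sketched earlier (horizon_clusters and class-labeled micro-clusters) and a dictionary mapping stored clock times to snapshots; the exact snapshot lookup for t_c - h through the geometric time frame is simplified here.

```python
from collections import Counter


def nearest_label(point, clusters):
    """Nearest-neighbour rule: the class label of the closest micro-cluster."""
    return min(clusters, key=lambda m: m.distance(point)).label


def best_horizons(q_fit, snapshots, t_c, horizons, p):
    """Score every tracked horizon h on q_fit (a list of (point, true_label)
    pairs) and return the p horizons with the highest accuracy."""
    scores = {}
    for h in horizons:
        clusters = horizon_clusters(snapshots[t_c], snapshots[t_c - h])
        hits = sum(nearest_label(x, clusters) == y for x, y in q_fit)
        scores[h] = hits / len(q_fit)
    return sorted(scores, key=scores.get, reverse=True)[:p]


def classify(point, snapshots, t_c, selected_horizons):
    """Label a test-stream point by majority vote over the selected horizons."""
    votes = [nearest_label(point, horizon_clusters(snapshots[t_c], snapshots[t_c - h]))
             for h in selected_horizons]
    return Counter(votes).most_common(1)[0][0]
```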
The classification of the test stream is a separate process which is executed continuously throughout the algorithm. For each given test instance X, the above-described nearest-neighbor classification process is applied using each h_i ∈ H. It is often possible that, in the case of a rapidly evolving data stream, different horizons may result in the determination of different class labels. The majority class among these p class labels is reported as the relevant class. More details on the technique may be found in [7].

5. Other Applications of Micro-clustering and Research Directions

While this paper discusses two applications of micro-clustering, we note that a number of other problems can be handled with the micro-clustering approach. This is because the process of micro-clustering creates a summary of the data which can be leveraged in a variety of ways for other problems in data mining. Some examples of such problems are as follows:

Privacy Preserving Data Mining: In the problem of privacy-preserving data mining, we create condensed representations [3] of the data which show k-anonymity. These condensed representations are like micro-clusters, except that each cluster has a minimum cardinality threshold on the number of data points in it. Thus, each cluster contains at least k data points, and we ensure that each record in the data cannot be distinguished from at least k other records. For this purpose, we only maintain the summary statistics for the data points in the clusters, as opposed to the individual data points themselves. In addition to the first- and second-order moments, we also maintain the covariance matrix for the data in each cluster. We note that the covariance matrix provides a complete overview of the distribution of the data. This covariance matrix can be used in order to generate pseudo-points which match the distribution behavior of the data in each micro-cluster. For relatively small micro-clusters, it is possible to match the probabilistic distribution in the data fairly closely. The pseudo-points can be used as a surrogate for the actual data points in the clusters in order to generate the relevant data mining results. Since the pseudo-points match the original distribution quite closely, they can be used for the purpose of a variety of data mining algorithms. In [3], we have illustrated the use of the privacy-preserving technique in the context of the classification problem. Our results show that the classification accuracy is not significantly reduced by the use of pseudo-points instead of the individual data points.
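As a rough illustration of the pseudo-point idea (and not the actual procedure of [3]): given the retained mean and covariance of one condensed group, surrogate points can be drawn from a multivariate normal with those moments.

```python
import numpy as np


def pseudo_points(mean, cov, n, rng=None):
    """Draw n surrogate pseudo-points for a condensed group from a multivariate
    normal with the group's retained mean vector and covariance matrix.
    Sampling from a Gaussian is an illustrative choice; the generation scheme
    of [3] may differ."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.multivariate_normal(np.asarray(mean, dtype=float),
                                   np.asarray(cov, dtype=float), size=n)
```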
Query Estimation: Since micro-clusters encode summary information about the data, they can also be used for query estimation. A typical example of such a technique is that of estimating the selectivity of queries. In such cases, the summary statistics of micro-clusters can be used in order to estimate the number of data points which lie within a certain interval, such as a range query. Such an approach can be very efficient in a variety of applications, since voluminous data streams are difficult to use if they need to be utilized directly for query estimation. However, the micro-clustering approach can condense the data into summary statistics, so that it is possible to use it efficiently for various kinds of queries. We note that the technique is quite flexible as long as it can be used for different kinds of queries. An example of such a technique is illustrated in [9], in which we use the micro-clustering technique (with some modifications on the tracked statistics) for futuristic query processing in data streams.

Statistical Forecasting: Since micro-clusters contain temporal and condensed information, they can be used for methods such as statistical forecasting of streams. While it can be computationally intensive to use standard forecasting methods with large volumes of data points, the micro-clustering approach provides a methodology in which the condensed data can be used as a surrogate for the original data points. For example, for a standard regression problem, it is possible to use the centroids of different micro-clusters over the various temporal time frames in order to estimate the values of the data points. These values can then be used for making aggregate statistical observations about the future. We note that this is a useful approach in many applications, since it is often not possible to effectively make forecasts about the future using the large volume of the data in the stream. In [9], it has been shown how to use the technique for querying and analysis of future behavior of data streams.

In addition, we believe that the micro-clustering approach is powerful enough to accommodate a wide variety of problems which require information about the summary distribution of the data. In general, since many new data mining problems require summary information about the data, it is conceivable that the micro-clustering approach can be used as a methodology to store condensed statistics for general data mining and exploration applications.

6. Performance Study and Experimental Results

All of our experiments are conducted on a PC with an Intel Pentium III processor and 512MB of memory, running the Windows XP Professional operating system. For testing the accuracy and efficiency of the CluStream algorithm, we compare CluStream with the STREAM algorithm [17, 23], the best algorithm reported so far for clustering data streams. CluStream is implemented according to the description in this paper, and STREAM K-means is implemented strictly according to [23], which shows better accuracy than BIRCH [24]. To make the comparison fair, both CluStream and STREAM K-means use the same amount of memory.
Specifically, they use the same stream incoming speed, the same amount of memory to store intermediate clusters (called micro-clusters in CluStream), and the same amount of memory to store the final clusters (called macro-clusters in CluStream).

Because the synthetic datasets can be generated by controlling the number of data points, the dimensionality, and the number of clusters, with different distribution or evolution characteristics, they are used to evaluate the scalability in our experiments. However, since synthetic datasets are usually rather different from real ones, we will mainly use real datasets to test accuracy, cluster evolution, and outlier detection.

Real datasets. First, we need to find some real datasets that evolve significantly over time in order to test the effectiveness of CluStream. A good candidate for such testing is the KDD-CUP'99 Network Intrusion Detection stream data set, which has been used earlier [23] to evaluate STREAM accuracy with respect to BIRCH. This data set corresponds to the important problem of automatic and real-time detection of cyber attacks. This is also a challenging problem for dynamic stream clustering in its own right. The offline clustering algorithms cannot detect such intrusions in real time. Even the recently proposed stream clustering algorithms such as BIRCH and STREAM cannot be very effective, because the clusters reported by these algorithms are all generated from the entire history of the data stream, whereas the current cases may have evolved significantly.

The Network Intrusion Detection dataset consists of a series of TCP connection records from two weeks of LAN network traffic managed by MIT Lincoln Labs. Each record can either correspond to a normal connection, or to an intrusion or attack. The attacks fall into four main categories: DOS (i.e., denial-of-service), R2L (i.e., unauthorized access from a remote machine), U2R (i.e., unauthorized access to local superuser privileges), and PROBING (i.e., surveillance and other probing). As a result, the data contains a total of five clusters, including the class for "normal connections". The attack types are further classified into one of 24 types, such as buffer-overflow, guess-passwd, neptune, portsweep, rootkit, smurf, warezclient, spy, and so on. It is evident that each specific attack type can be treated as a sub-cluster. Most of the connections in this dataset are normal, but occasionally there could be a burst of attacks at certain times. Also, each connection record in this dataset contains 42 attributes, such as the duration of the connection, the number of data bytes transmitted from source to destination (and vice versa), the percentile of connections that have "SYN" errors, the number of "root" accesses, etc. As in [23], all 34 continuous attributes will be used for clustering, and one outlier point has been removed.

Second, besides testing on the rapidly evolving network intrusion data stream, we also test our method over relatively stable streams.
Since previously reported stream clustering algorithms work on the entire history of stream data, we believe that they should perform effectively for some data sets with a stable distribution over time. An example of such a data set is the KDD-CUP'98 Charitable Donation data set. We will show that even for such datasets, CluStream can consistently beat the STREAM algorithm.

The KDD-CUP'98 Charitable Donation data set has also been used in evaluating several one-scan clustering algorithms, such as [16]. This data set contains 95412 records of information about people who have made charitable donations in response to direct mailing requests, and clustering can be used to group donors showing similar donation behavior. As in [16], we will only use 56 fields which can be extracted from the total of 481 fields of each record. This data set is converted into a data stream by taking the data input order as the order of streaming and assuming that the records flow in at a uniform speed.

Synthetic datasets. To test the scalability of CluStream, we generate some synthetic datasets by varying the base size from 100K to 1000K points, the number of clusters from 4 to 64, and the dimensionality in the range of 10 to 100. Because we know the true cluster distribution a priori, we can compare the clusters found with the true clusters. The data points of each synthetic dataset follow a series of Gaussian distributions, and to reflect the evolution of the stream data over time, we change the mean and variance of the current Gaussian distribution every 10K points during the synthetic data generation.

The quality of clustering on the real data sets was measured using the sum of square distance (SSQ), defined as follows. Assume that there are a total of N points in the past horizon at current time T_c. For each point p_i, we find the centroid C_pi of its closest macro-cluster, and compute d(p_i, C_pi), the distance between p_i and C_pi. Then the SSQ at time T_c with horizon H (denoted SSQ(T_c, H)) is equal to the sum of d^2(p_i, C_pi) over all the N points within the previous horizon H. Unless otherwise mentioned, the algorithm parameters were set at α = 2, l = 10, InitNumber = 2000, and t = 2.

We compare the clustering quality of CluStream with that of STREAM for different horizons at different times, using the Network Intrusion dataset and the Charitable Donation data set. The results are illustrated in Figures 2.4 and 2.5. We run each algorithm 5 times and compute their average SSQs. The results show that CluStream is almost always better than STREAM. All experiments for these datasets have shown that CluStream has substantially higher quality than STREAM. However, the Network Intrusion data set showed significantly better results than the Charitable Donation data set, because the network intrusion data set is a highly evolving data set. For such cases, the evolution-sensitive CluStream algorithm was much more effective than the STREAM algorithm.
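The SSQ measure defined above is straightforward to compute once the macro-cluster centroids for a horizon are available. A minimal sketch follows; how the N points within the horizon are collected is left to the caller, and the function name is illustrative.

```python
import math


def ssq(points_in_horizon, macro_centroids):
    """SSQ(T_c, H): sum of squared distances from each of the N points that
    arrived within the previous horizon H to the centroid of its closest
    macro-cluster."""
    total = 0.0
    for p in points_in_horizon:
        nearest = min(math.dist(p, c) for c in macro_centroids)
        total += nearest * nearest
    return total
```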
Figure 2.4. Quality comparison (Network Intrusion dataset, horizon=256, stream_speed=200)

Figure 2.5. Quality comparison (Charitable Donation dataset, horizon=4, stream_speed=200)
Figure 2.6. Accuracy comparison (Network Intrusion dataset, buffer_size=1600, kfit=80, init_number=400)

Figure 2.7. Distribution of the (smallest) best horizon (Network Intrusion dataset, Time units=2500, buffer_size=1600, kfit=80, init_number=400)

Figure 2.8. Accuracy comparison (Synthetic dataset B300kC5D20, buffer_size=500, kfit=25, init_number=400)
Figure 2.9. Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer_size=500, kfit=25, init_number=400)

We also tested the accuracy of the On-Demand Stream Classifier. The first test was performed on the Network Intrusion Data Set. The first experiment was conducted with a stream speed of 80 connections per time unit (i.e., there are 40 training stream points and 40 test stream points per time unit). We set the buffer size at 1600 points, which means that upon receiving 1600 points (including both training and test stream points) we use a small set of the training data points (in this case kfit = 80) to choose the best horizon. We compared the accuracy of the On-Demand Stream classifier with two simple one-pass stream classifiers built over the entire data stream and over a selected sliding window (i.e., sliding window H = 8). Figure 2.6 shows the accuracy comparison among the three algorithms. We can see that the On-Demand Stream classifier consistently beats the two simple one-pass classifiers. For example, at time unit 2000, the On-Demand Stream classifier's accuracy is about 4% higher than that of the classifier with the fixed sliding window, and about 2% higher than that of the classifier using the entire dataset. Because the class distribution of this dataset evolves significantly over time, neither the entire dataset nor a fixed sliding window may always capture the underlying stream evolution; as a result, they always have worse accuracy than the On-Demand Stream classifier, which dynamically chooses the best horizon for classification.

Figure 2.7 shows the distribution of the best horizons (the smallest ones, if several best horizons exist at the same time). Although about 78.4% of the (smallest) best horizons have a value of 1/4, there do exist about 21.6% best horizons ranging from 1/2 to 32 (e.g., about 6.4% of the best horizons have a value of 32). This also illustrates that there is no fixed sliding window that can achieve the best accuracy, and explains why the On-Demand Stream classifier can outperform the simple one-pass classifiers over either the entire dataset or a fixed sliding window.

We have also generated one synthetic dataset, B300kC5D20, to test the classification accuracy of these algorithms. This dataset contains 5 class labels and 300K data points with 20 dimensions. We first set the stream speed at 100 points
  • 55. Another Random Scribd Document with Unrelated Content
  • 56. The journey to Italy duly took place, the proposed party of two being enlarged to one of four by the addition of Ignaz Brüll and Simrock. Original plans had to be modified on account of the exceptionally wet season, and the chief places visited were Vicenza, Padua, and Venice. The personnel of Brahms' intimate friends in Vienna had remained on the whole much what it had become a very few years after his arrival in the Austrian capital. Of its closest circle the Fabers, Billroths, and Hanslicks, with whom must be associated Joachim's cousins, the various members of the Wittgenstein family—amongst them Frau Franz and Frau Dr. Oser—still formed the nucleus. An acquaintance with Herr Victor von Miller zu Aichholz and his wife had meanwhile ripened into warm friendship, and their house became one of those whose hospitality was most frequently and gladly accepted by the master. Amongst the musicians, Carl Ferdinand Pohl, author of the standard Life of Mozart, and, since 1866, archivar to the Gesellschaft, was one of his dearest friends. With the leading professors of the conservatoire his relations continued very cordial, and amongst the younger musicians to whom, in addition to his early allies, Goldmark, Gänsbacher and Epstein, he extended his friendly regard, may be mentioned Anton Door and Robert Fuchs. The feeling of warm friendship existing between Brahms and Johann Strauss has been commemorated in several well-known anecdotes. The autumn of 1881, however, brought to permanent residence in Vienna a family that before long made notable addition to the master's intimate circle. Special circumstances conduced to the speedy formation of a bond of friendship between Brahms and the new-comers, Dr. and Frau Fellinger. In the first place, they were friends of Frau Schumann and her daughters, and as such had an instant claim on his courtesy, which he acknowledged by calling on them as soon as possible after their arrival. In the second, his interest was awakened by the fact that Frau Dr. Fellinger was the daughter of Frau Professor Lang-Köstlin, the gifted Josephine Lang, whose attractive personality and talent for composition made a strong impression upon Mendelssohn when he was a youth of
  • 57. twenty-one and some six years the lady's senior. The story of Josephine, who at the age of twenty-six married Professor Köstlin of Tübingen, is given in Hiller's 'Tonleben,' and Mendelssohn's congratulations to her bridegroom-elect may be read in the second volume of the 'Letters.' The talent for art which had come to her as a family inheritance was transmitted to her daughter, though with a difference. Frau Dr. Fellinger's gifts have associated themselves especially with the plastic arts; in the first place with that of painting, but they have become well known in the musical world also by her busts and statuettes of Brahms, Billroth, and others belonging to their circle. Her photographs of our master are now familiar to most music-lovers. When it is added that Brahms found he could command in Dr. Fellinger's hospitable house, not only congenial intellectual sympathy, but the unceremonious intercourse with a simple, affectionate family circle in which he had through life found a pre-eminent source of happiness, it will easily be understood that he became a more and more frequent guest there, until, during the closing years of his life, it became for him almost a second home. The master introduced two of his new works in the course of a few weeks' journey undertaken in the winter of 1882-83. According to Simrock's Thematic Catalogue, the Pianoforte Trio in C major, the String Quintet in F major, and the 'Parzenlied' constitute the publications of 1883. Early copies of the trio and quintet were sent out, however, and the works were publicly performed from them in December, 1882. An interesting entry in Frau Schumann's diary says: 'I had invited Koning and Müller to come and try Brahms' new trio with me on Thursday 21st [December]. Who should surprise us as we were playing it—he himself! He came from Strassburg and means to stay with us for Christmas. I played the trio first and he repeated it.' Both works were performed on December 29 at a Museum chamber music concert—the Quintet by the Heermann-Müller party, the Trio by Brahms, Heermann, and Müller.
  • 58. Amongst the early performances of the Trio were those on January 17 and 22 respectively in Berlin (Trio Concerts: Barth, de Ahna, Hausmann) and London (Monday Popular Concerts: Hallé, Madame Néruda, Piatti), and at Hellmesberger's in Vienna on March 15. The work has not become one of the most generally familiar of the master's compositions, though it is not easy to say why. It contains no trace of the 'heaven-storming Johannes,' but, like many of the later compositions, it breathes, and especially the first movement, with a rich, mellow warmth suggestive of one to whom the experiences of life have brought a solution of their own to its problems, which has quieted, if it has not altogether satisfied, the aspirations and impulses of youth. The Quintet in F for strings is, for the most part, bright, concise, and easy to follow. As one of its special features may be mentioned the combination of the usual two middle movements in the second. It was given in Hamburg on the 22nd and in Berlin on the 23rd of January, respectively by Bargheer and Joachim and their colleagues (it should be noted that Hausmann had at this time succeeded Müller as the violoncellist of the Joachim Quartet), at Hellmesberger's on February 15, and at the Monday Popular, London, of March 5. Brahms conducted the first performance of the Parzenlied in Basle on December 8, 1882. Excellently sung by the members of the Basle Choral Society, the work met with extraordinary success, and was repeated after the New Year by general desire. Similar results followed its performance in other towns, of which Strassburg and Crefeld should be specially mentioned. The programme of the Crefeld concert included the fifth movement of the Requiem. 'What is your tempo?' Brahms inquired, on the morning of the rehearsal, of Fräulein Antonia Kufferath, who was to sing the solo. The lady, not taking the question seriously from the composer of the music, waived a reply. 'No, I mean it; you have to hold out the long notes. Well, we shall understand each other,' he added; 'sing only as you feel, and I will follow with the chorus.'
  • 59. These are characteristic words, and valuable in more than one sense. To most of the few works to which the master has placed metronome indications—and the Requiem is amongst these—he added them by special request, and attached to them only a limited importance. An absolutely and uniformly 'correct' pace for a piece of genuine music does not exist. The pace must vary to some extent according to subtle conditions existent in the performer, and the instinct of a really musical executant or conductor will, as a rule, be a safer guide, within limits, than what can be at best but the mechanical markings even of the composer himself. The Parzenlied, received with enthusiasm throughout Brahms' tour in Germany and Switzerland, was not equally successful in Vienna, where it was heard for the first time at the Gesellschaft concert of February 18 under Gericke. The austere simplicity of the music, which paces majestically onward with the concentrated, resigned calm of despair, adds extraordinary force to Goethe's poem, but does not appeal to every audience, and the work has never become a prime favourite in the Austrian Kaiserstadt. The song is set for six- part chorus with orchestra, in plainer harmonic masses and with less employment of imitative counterpoint than we usually find in the works of Brahms, who has accommodated his music here, as in 'Nänie,' to the classical spirit of the text. A singular deviation, however, which occurs in the course of the setting, from the uncompromising severity of the words, furnishes a remarkable illustration of the composer's unconquerable idealism. Comment was made in its place on the beautiful device by which he has sought to relieve the dark mood of Hölderlin's 'Song of Destiny'—the addition of an instrumental postlude which breathes forth a message of tender consolation that the poet could hardly have rendered in words. In Schiller's 'Nänie' the lament, with all its calm, gives expression to a sentiment of compassionate sorrow that is perfectly reproduced in the master's music. Goethe's Fates, however, in their measured recitation of the gods' relentless cruelty, would have seemed to offer no possible opportunity for even the inarticulate expression of ruth. Least of all, it might be imagined, could any
  • 60. concession to the demands of the human heart have been found in the penultimate stanza of their song: 'The rulers exclude from Their favouring glances Entire generations, And heed not in children The once so belovèd And still speaking features Of distant forefathers.' Our Brahms, however, who, in spite of his increasing weight, his shaggy beard, his frequently rough manners, his unsatisfied affections, his impenetrable reserve, remained at fifty, in his heart of hearts, the very same being whom we have watched as the loving child of seven, the simple-minded boy of fourteen, the broken- hearted man of thirty, sobbing by the death-bed of his mother, cannot leave the dread gloom of his subject unrelieved by a single ray. He seems, in his setting of the last strophe but one, to concentrate attention on past kindness of the gods, and thus, perhaps, subtly to suggest a plea for present hope. How far the musician was justified in thus wandering from the obvious intention of his poet must be left to each hearer of the work to determine for himself. If it be the case, as has sometimes been suggested, that the variation was made by the composer in the musical interests of the piece as a work of art, it cannot be held to have fulfilled its purpose; for the striking inconsistency between words and music in the verse in question has a disturbing effect on the mind of the listener. We believe, however, that the true explanation of the master's procedure is more radical, and is to be found in the nature of the man in which that of the musician was grounded. The Parzenlied was dedicated to 'His Highness George, Duke of Saxe-Meiningen,' and was included in a Brahms programme performed in Meiningen on April 2 to celebrate the Duke's birthday. The complete breakdown of Bülow's health necessitated his temporary retirement from his conductor's duties, which were
  • 61. divided on this occasion between Brahms and Court Capellmeister Franz Mannstädt, appointed to assist Bülow. Returning by a circuitous route to Vienna after a few days at the ducal castle, Brahms paid a short visit to Hamburg to take part in another Brahms programme arranged by the talented young conductor of the Cecilia Society, Julius Spengel. This was the first of several occasions on which the master gave testimony of his appreciation of Dr. Spengel's talents and musicianship by co-operating in the concerts of the society. Brahms celebrated his fiftieth birthday by entertaining his friends Faber, Billroth, and Hanslick at a bachelor supper. He was occupied during the summer with the completion of a third symphony, on which he had worked the preceding year, and lived at Wiesbaden in a house that had belonged to the celebrated painter Ludwig Knaus, in whose former studio—Brahms' music-room for the nonce—the work was finished. It was known to the composer that a delicate elderly lady inhabited the first-floor of the house of which Frau von Dewitz's flat, where he lodged, formed an upper story. Every night, therefore, on returning to his rooms, he took off his boots before going upstairs, and made the ascent in his socks, so that her rest should not be disturbed. This anecdote is but one amongst several of the same kind that have been related to the author by Brahms' intimate associates. Samples of another variety should not, however, be omitted. A private performance of the new symphony, this time arranged for two pianofortes, was given as usual at Ehrbar's by Brahms and Brüll, and aroused immense expectations for the future of the work. Amongst the listeners was a musician who, not having hitherto allowed himself to be suspected of a partiality for the master's art, expressed his enthusiastic admiration of the composition. 'Have you had any conversation with X?' young Mr. Ehrbar asked Brahms; 'he has been telling me how delighted he is with the symphony.' 'And have you told him that he very often lies when he opens his mouth?' angrily retorted the composer, who could never bring himself to
  • 62. submit to the humiliation of accepting a compliment which he suspected—perhaps unjustly in this case—of being insincere. A terrible rebuff was administered by him on the evening of a first Gewandhaus performance. It must be owned that Brahms was seldom in his happiest mood when on a visit to Leipzig; he was well aware that his music was not appreciated within the official 'ring' there, and suspiciously resented any well-meant efforts made to ignore this fact. 'And where are you going to lead us to-night, Herr Doctor?' inquired one of the committee a few minutes before the beginning of the concert, assuming a conciliatory manner as he smoothed on his white kid gloves; 'to heaven?' 'It is the same to me where you go,' rejoined Brahms. The first performance of the Symphony in F major (No. 3) took place in Vienna at the Philharmonic concert of December 2, under Hans Richter, who was, according to Hanslick, originally responsible for the name 'the Brahms Eroica,' by which it has occasionally been called. Whether or not the suggestion is happy, a saying of the kind, probably uttered on the impulse of the moment, should not be taken very seriously. Nothing of the quiescent autumn mood which we have observed in the master's chamber music of this period is to be traced in either of his symphonies, and the third, like its companions, represents him in the zenith of his energies, working happily in the consciousness of his absolute command over the resources of his art. Whether it be judged by its effect as an entire work or studied movement by movement, whether each movement be listened to as a whole or analyzed into its component parts, all is found to be without halt of inspiration or flaw in workmanship. Each theme is striking and pregnant, and, though contrasting with what precedes it, seems to belong inevitably to the movement and place in which it occurs, whilst the development of the thematic material is so masterly that to speak of admiring it seems almost ridiculous. The last movement closes with a very beautiful and distinctive Brahms coda. The third symphony is more immediately easy to follow than the first, and of
  • 63. broader atmosphere than the second. It is of an essentially objective character, and belongs absolutely to the domain of pure music. The supreme and glorious pre-eminence which the great master had by this time attained in contemporary estimation naturally made it an object of competition with concert-givers and directors to announce the earliest performances of his works, and this was especially the case in the rare event of a new symphony which succeeded its immediate predecessor after an interval of six years. Brahms, however, had his own ideas on this matter, as on every other that he thought important, and after the first performance of the work in Vienna he sent the manuscript to Joachim in Berlin, and begged him to conduct the second performance when and where he liked. This proceeding would hardly have been noteworthy under the circumstances of intimate friendship which had so long united the two musicians, had it not been that the old relation between Brahms and Joachim had been clouded during the past year or two, during which there had been a cessation of their former affectionate intercourse. When, therefore, it became known that Joachim, acting on the composer's wish, proposed to conduct the symphony at one of the subscription concerts of the Royal Academy of Arts, Berlin, so much disappointment and heart-burning were felt and expressed that Joachim, although he had already replied in the affirmative to Brahms' request, consented to write again and ask what his wishes really were. The answer came without delay, and was clear enough to set the matter quite at rest. Brahms desired that the performance should be committed unreservedly to the care of his old friend. The symphony was heard for the second time, therefore, on January 4 under Joachim at Berlin, and was enthusiastically received by all sections of the public and press. It was given again three times during the same month in the German imperial capital under the composer's bâton. Detailed description of the triumphant progress of the new work from town to town is no longer necessary. The composer was overwhelmed with invitations to conduct it from the manuscript, and
  • 64. Bülow, convalescent from his illness, and determined not to be outdone in enthusiasm, placed it twice, as second and fourth numbers, in a Meiningen programme of five works. On publication, it was performed in all the chief music-loving towns of Germany, Great Britain, Holland, Russia, Switzerland, and the United States. In an account of a performance of the symphony at a Hamburg Philharmonic concert under Brahms in December, which followed one under von Bernuth after three weeks' interval, the critic of the Correspondenten says: 'Brahms' interpretation of his works frequently differs so inconceivably in delicate rhythmic and harmonic accents from anything to which one is accustomed, that the apprehension of his intentions could only be entirely possible to another man possessed of exactly similar sound-susceptibility or inspired by the power of divination.' The author feels a peculiar interest in quoting these lines, which strikingly corroborate the impression formed by her on hearing this and other of Brahms' works played under his own direction. The publications of 1884 were, besides the third Symphony, Two Songs for Contralto with Viola and Pianoforte, the second being the 'Virgin's Cradle Song,' already mentioned as one of the compositions of 1865; two sets of four-part Songs, the one for accompanied Solo voices, the other for mixed Chorus a capella, and the two books of Songs, Op. 94 and 95. At this date Brahms had entered into what we may call the third period of his activity as a song-writer—one in which he frequently chose texts that speak of loneliness or death. The wonderful beauty of his settings of these subjects penetrates the very soul, and by the mere force of its pathos carries to the hearer the conviction that the composer speaks out of the feeling of his own heart. Stockhausen, trying the song 'Mit vierzig Jahren' (Op. 94, No. 1) from the
• 65. manuscript to the composer's accompaniment, was so affected during its performance that he could not at once proceed to the end. Our remarks are, however, by no means intended to convey the impression that Brahms only or generally chose poems of a melancholy tendency at this time.

WITH FORTY YEARS.
By Friedrich Rückert (1788-1866).

With forty years we've gained the mountain's summit,
We stand awhile and look behind;
There we behold the quiet years of childhood
And there the joy of youth we find.

Look once again, and then, with freshened vigour,
Take up thy staff and onward wend!
A mountain-ridge extendeth, broad, before thee,
Not here, but there must thou descend.

No longer, climbing, need'st thou struggle breathless,
The level path will lead thee on;
And then with thee a little downward tending,
Before thou know'st, thy journey's done.

With the knowledge we have gained of the master's habit of producing his large works in couples, we are prepared to find him employed this summer on the composition of a fourth symphony.
  • 66. Avoiding a long journey, he settled down to his work at Mürz Zuschlag in Styria, not far from the highest ridge of the Semmering. Hearing soon after his arrival there that his old friend Misi Reinthaler, now grown up into a young lady, was leaving home under her mother's care to go through a course of treatment under a famous Vienna specialist, he wrote to place his rooms in Carlsgasse at Frau Reinthaler's disposal. The offer was not accepted, but when the invalid was sufficiently convalescent, he insisted that the two ladies should come for a few days as his guests to Mürz Zuschlag, where he took rooms for them near his own lodgings. He went over to see them also at Vienna, and spent the greater part of a morning showing them his valuable collection of autographs and other treasures. 'Yes, these would have been something to give a wife!' was his answer to the ladies' expressions of delight. Amongst his collection of musical autographs were two written on different sides of the same sheet of paper—one of Beethoven, the song 'Ich liebe dich'; the other of Schubert, part of a pianoforte composition. These, with Brahms' autograph signature 'Joh. Brahms in April 1872,' written at the bottom of one of the pages, constitute a unique triplet. The sheet now belongs to the Gesellschaft library, and is framed within glass. The society of Hanslick, who came with his wife to stay near Mürz Zuschlag for part of the summer, was very acceptable to Brahms. The departure of his friends at the close of the season, in the company of some mutual Vienna acquaintances, incited the composer to an act of courtesy of a kind quite unusual with him, the sequel to which seems to have caused him almost comical annoyance that found expression in a couple of notes sent immediately afterwards to Hanslick. 'Dearest Friend, 'Here I stand with roses and pansies; which means with a basket of fruit, liqueurs and cakes! You must have travelled through by the earlier Sunday extra train? I
• 67. made a good and unusual impression for politeness at the station! The children are now rejoicing over the cakes....' and, on finding that, mistaking the time of the train, he had arrived a quarter of an hour late: 'How such a stupid thing can spoil one's day and the thought of it recur to torment one. I hope you do not know this as well as I, who am for ever preparing for myself such vexatious worry....' Later on, writing about other matters, he adds: '... I hope Professor Schmidt's ladies do not describe my promenade with the basket too graphically in Vienna! Otherwise my unspoiled lady friends may cease to be so unassuming.'[68] The journeys of the winter included visits to Bremen and Oldenburg, during which Hermine Spiess, one of the very favourite younger interpreters of Brahms' songs, sang dainty selections of them to the composer's accompaniment, with overwhelming success. The early death of this gifted artist, soon after her marriage, caused the master, with whom she was a great favourite, deep and sincere grief. Brahms went also to Crefeld, where the 'Tafellied,' dedicated on publication 'To the friends in Crefeld in remembrance of Jan. 28th 1885,' was sung on the date in question, with some of the new part-songs a capella, and other of the composer's works, at the jubilee of the Crefeld Concert Society. The manuscript score of the 'Tafellied' is in the possession of Herr Alwin von Beckerath, to whom it was presented by Brahms with an affectionate inscription.
• 68. CHAPTER XX

1885-1888

Vienna Tonkünstlerverein—Fourth Symphony—Hugo Wolf—Brahms at Thun—Three new works of chamber music—First performances of the second Violoncello Sonata by Brahms and Hausmann—Frau Celestine Truxa—Double Concerto—Marxsen's death—Eugen d'Albert—The Gipsy Songs—Conrat's translations from the Hungarian—Brahms and Jenner—The 'Zum rothen Igel'—Ehrbar's asparagus luncheons—Third Sonata for Pianoforte and Violin.

The early part of the year 1885 offers for record no event of unusual interest to the reader. The greater portion of it was spent by Brahms in his customary routine in Vienna. He was generally to be seen at the weekly meetings of the Tonkünstlerverein, a musicians' club founded by Epstein, Gänsbacher, and others, of which the master had consented to be named honorary life-president. The Monday evening proceedings included a short musical programme, sometimes followed by an informal supper. Brahms did not usually sit in the music-room, but would remain in a smaller apartment smoking and chatting sociably with friends of either sex. His arrival always became known at once to the assembled company, 'Brahms is here; Brahms is come!' being passed eagerly from mouth to mouth. His old love of open-air exercise had not diminished with increasing years, and the Sunday custom of a long walk in the country was still kept up. A few friends used to meet in the morning outside the Café Bauer, opposite the Opera House, and, taking train or tram to the outskirts of the city, would thence proceed on foot, returning in the late afternoon. Brahms, nearly always in a good humour on these occasions, was generally soon ahead of his companions, or leading the way with the foremost, and, as had
  • 69. usually been the case with him through life, was looked upon by his friends as the chief occasion of their meetings, allowed his own way, and admired as a kind of pet oracle. The excursions always commenced for the season on his return to Vienna in the autumn, and were continued with considerable regularity until his departure in the spring. They not infrequently gave opportunity for the employment of the composer's unfailing readiness of repartee, as on the occasion of a meeting in the train, on the return journey, with a learned but unmusical acquaintance of one of the party, between whom and Brahms an animated conversation arose. 'Will you not join us one day, Herr Doctor? Next Sunday, perhaps?' asked Brahms. 'I!' exclaimed the other. 'Saul among the prophets?' 'Na, so you give yourself royal airs!' instantly rejoined the master. The fourth symphony was completed during the summer at Mürz Zuschlag, where Brahms this year had the advantage of Dr. and Frau Fellinger's society, and—indispensable for his complete enjoyment of a home circle—that of their children. Returning one afternoon from a walk, he found that the house in which he lodged had caught fire, and that his friends were busily engaged in bringing his papers, and amongst them the nearly-finished manuscript of the new symphony, into the garden. He immediately set to work to help in getting the fire under, whilst Frau Fellinger sat out of doors with either arm outspread on the precious papers piled on each side of her. Luckily, all serious harm was averted, and it was soon possible to restore the manuscripts intact to the composer's apartments. Brahms paid a neighbourly call, in the course of the summer, on the author Rosegger, who was living in his small country house at Krieglach near Mürz Zuschlag, and tasted the unusual experience of a repulse. Absorbed in work at the moment when his servant announced 'a strange gentleman,' Rosegger, without glancing at the card placed beside him, desired his visitor to 'sit down for a moment.' Conscious only of the presence of a bearded stranger with a gray overcoat over his shoulder and a light-coloured umbrella in his hand, he vouchsafed but scant answer to the trifling remarks
  • 70. with which his caller tried to pave the way to cordiality, and before long Brahms composedly remarked that he would be on his legs again, and took leave. It was not till some minutes after his departure that it occurred to Rosegger to glance at the card, and he has himself described the feelings of despair with which he read the words 'Johannes Brahms' staring at him in all the reality of black on white. Not he alone, but the ladies of his family, were enthusiastic admirers of the composer's genius. He was so overwhelmed by his mistake as to be incapable of taking any steps to remedy it, and firmly declined to yield to the entreaties of his wife and daughter that he would return the visit and explain matters to Brahms. He published an amusing account of the misadventure in the year 1894 in an issue of the Heimgarten. Perhaps it may have fallen into the master's hands. The honour not only of the first, but of several subsequent early performances of the Symphony in E minor, fell to the Meiningen orchestra. The work was announced for the third subscription concert of the season 1885-86, and shortly beforehand the score and parts of the third and fourth movements were sent by the composer to Meiningen for correction at a preliminary rehearsal under Bülow. Three listeners were, by Bülow's invitation, present on the occasion—the Landgraf of Hesse; Richard Strauss, the now famous composer, who had succeeded Mannstädt as second conductor of the Meiningen orchestra; and Frederic Lamond. The lapse of another day or so brought Brahms himself with the first and second movements, and the first public performance of the work took place on October 25. That the new symphony was enthusiastically received on the occasion goes almost without saying. Persevering but unsuccessful efforts were made by the audience to obtain a repetition of the third movement, and the close of the work was followed by the emphatic demonstration incident to a great success. The work was repeated under Bülow's direction at the following Meiningen concert of November 1, and was conducted by the
  • 71. composer throughout a three weeks' tour on which he started with Bülow and his orchestra immediately afterwards, and which included the towns Siegen, Dortmund, Essen, Elberfeld, Düsseldorf, Rotterdam, Utrecht, Amsterdam, the Hague, Arnheim, Crefeld, Bonn, and Cologne. A performance at Wiesbaden followed, and the work was heard for the first time in Vienna at the Philharmonic concert of January 17, 1886, under Richter. This occasion was celebrated by a dinner given by Billroth at the Hôtel Sacher, the guests invited to meet the composer being Richter, Hanslick, Goldmark, Faber, Door, Epstein, Ehrbar, Fuchs, Kalbeck, and Dömpke. A new and important work by Brahms could hardly fail to obtain a warm reception in Vienna at a period when the composer could look back to thirty years' residence in the imperial city with which his name had become as closely associated as those of Haydn, Mozart, Beethoven, and Schubert; but though the symphony was applauded by the public and praised by all but the inveterately hostile section of the press, it did not reach the hearts of the Vienna audience in the same unmistakable manner as its two immediate predecessors, both of which had, as we have seen, made a more striking impression on a first hearing in Austria than the first Symphony in C minor. Strangely enough, the fourth symphony at once obtained some measure of real appreciation in Leipzig, where the first had been far more successful than the second and third. It was performed under the composer at the Gewandhaus concert of February 18. The account given of the occasion by the Leipziger Nachrichten is, perhaps, the more satisfactory since our old friend Dörffel, who might possibly have been suspected of partiality, had long since retired from the staff of the journal. Bernhard Vögl, his second successor, says: '... The reception must, we think, have made amends to Brahms for former ones, which, in Bülow's opinion, were too cool. After each movement the hall resounded with tumultuous and long-continued applause, and, at the conclusion of the work, the composer was repeatedly
• 72. called forward.... The finale is certainly the most original of the movements, and furnishes more complete argument than has before been brought forward for the opinion of those who see in Brahms the modern Sebastian Bach. The movement is not only constructed on the form displayed in Bach's Chaconne for violin, but is filled with Bach's spirit. It is built up with astounding mastery upon the eight notes [music example], and in such a manner that its contrapuntal learning remains subordinate to its poetic contents.... It can be compared with no former work of Brahms and stands alone in the symphonic literature of the present and the past.' A still more triumphant issue attended the production of the symphony under Brahms at a concert of the Hamburg Cecilia Society on April 9. Josef Sittard, who had recently been appointed musical critic to the Hamburger Correspondenten, a post he has held to the present day, wrote: 'To-day we abide by what we have affirmed for years past in musical journals; that Brahms is the greatest instrumental composer since Beethoven. Power, passion, depth of thought, exalted nobility of melody and form, are the qualities which form the artistic sign manual of his creations. The E minor (fourth) Symphony is distinguished from the second and third principally by the rigorous and even grim earnestness which, though in a totally different way, mark the first. More than ever does the composer follow out his ideas to their conclusion, and this unbending logic makes the immediate understanding of
• 73. the work difficult. But the oftener we have heard it, the more clearly have its great beauties, the depth, energy and power of its thoughts, the clearness of its classic form, revealed themselves to us. In the contrapuntal treatment of its themes, in richness of harmony and in the art of instrumentation, it seems to us superior to the second and third; these, perhaps, have the advantage of greater melodic beauty, a guarantee of popularity. In depth, power and originality of conception, however, the fourth symphony takes its place by the side of the first....' After an interesting discussion of the several movements, the writer adds: 'In a word, the symphony is of monumental significance.' Brahms' fourth symphony, produced when he was over fifty, is, in the opinion of most musicians, unsurpassed by any other achievement of his genius. It has during the past twenty years been growing slowly into general knowledge and favour, and will, it may be safely predicted, become still more deeply rooted in its place amongst the composer's most widely-valued works. The second movement, in the opinion of the late Philipp Spitta, 'does not find its equal in the symphonic world'; and the fourth, written in 'Passacaglia' form, is the most astonishing illustration achieved even by Brahms himself of the limitless capability of variation form, in which he is pre-eminent.[69] It is with something of a mournful feeling that we find ourselves at the close of our enumeration of the master's four greatest instrumental works. Enough, we may hope, has been said to indicate that any comparison of the symphonies as inferior or superior is impossible, for the reason that each, while perfectly fulfilling its own particular destiny, is quite different from all the others, and such natural preference as may be felt by this or that listener for either must be considered as purely personal. The present writer may, perhaps, be allowed to confess that, with all joy in the dainty second and the magnificent third and fourth—emphatically the fourth—
  • 74. neither appeals to her quite so strongly as the first. There is here a quality of youth in the intensity of the soaring imagination that seems to search the universe, which, presented as it is with the wealth of resource that was at the command of the mature composer, could not by its nature be other than unique. The presence of this very quality may be the reason why the first symphony suffers even more lamentably than its companions from the dull, cold, cautious, 'classical' rendering which Brahms' orchestral works receive at the hands of some conductors, who seem unable to realize that a composer who founds his works on certain definite and traditional principles of structure does not thereby change his nature, or in any degree renounce the free exercise of his poetic gifts. Perhaps the present is as good an opportunity as may occur for passing mention of a newspaper episode of the eighties, which was much talked of for a few years, but which, though it may have caused Brahms annoyance, could not possibly at this period of his career have had any more serious consequence so far as he was concerned. Hugo Wolf, in 1884 a young aspirant to fame, seeking recognition but finding none, poor, gifted, disappointed, weak in health, highly nervous, without influential friends, accepted an opportunity of increasing his miserably small means of subsistence by becoming the musical critic of the Salon Blatt, a weekly society paper of Vienna, and soon made for himself an unenviable notoriety by his persistent attacks upon Brahms' compositions. The affair would not now demand mention in a biography of our master if it were not that the posthumous recognition afforded to Wolf's art gives some interest, though not of an agreeable nature, to this association of his name with that of Brahms. For the benefit of those readers who may wish to study the matter further, it may be added that Wolf's criticisms have been republished since his death. For ourselves, having done what was, perhaps, incumbent on us by referring to the matter, we shall adopt what we believe would have been Brahms' desire, by
  • 75. allowing it, so far as these pages are concerned, to follow others of the kind to oblivion. The summer of 1886 was the first of the three seasons passed by Brahms at Thun, of which Widmann has written so charming an account. He rented the entire first-floor of a house opposite the spot where the river Aare flows out of the lake, the ground-floor being occupied by the owner, who kept a little haberdashery shop. According to his general custom, he dined in fine weather in the garden of some inn, occasionally alone, but oftener in the company of a friend or friends. Every Saturday he went to Bern to remain till Monday or longer with the Widmanns, who, like other friends, found him a most considerate and easily satisfied guest, though his exceptional energy of body and mind often made it exhausting work to keep up with him. 'His week-end visits were,' says Widmann, 'high festivals and times of rejoicing for me and mine; days of rest they certainly were not, for the constantly active mind of our guest demanded similar wakefulness from all his associates and one had to pull one's self well together to maintain sufficient freshness to satisfy the requirements of his indefatigable vitality.... I have never seen anyone who took such fresh, genuine and lasting interest in the surroundings of life as Brahms, whether in objects of nature, art, or even industry. The smallest invention, the improvement of some article for household use, every trace, in short, of practical ingenuity gave him real pleasure. And nothing escaped his observation.... He hated bicycles because the flow of his ideas was so often disturbed by the noiseless rushing past, or the sudden signal, of these machines, and also because he thought the trampling movement of the rider ugly. He was, however, glad to live in the age of great inventions and could not sufficiently admire the electric light, Edison's phonographs, etc. He was equally interested in the animal
  • 76. world. I always had to tell him anew about the family customs of the bears in the Bern bear-pits before which we often stood together. Indeed, subjects of conversation seemed inexhaustible during his visits.'[70] Brahms' ordinary costume, the same here as elsewhere, was chosen quite without regard to appearances. Mere lapse of time must occasionally have compelled him to wear a new coat, but it is safe to conclude that his feelings suffered discomposure on the rare occurrence of such a crisis. Neckties and white collars were reserved as special marks of deference to conventionality. During his visits to Thun he used on wet Saturdays to appear at Bern wearing 'an old brown-gray plaid fastened over his chest with an immense pin, which completed his strange appearance.' Many were the books borrowed from Widmann at the beginning, and brought back at the end, of the week, carried by him in a leather bag slung over his shoulder. Most of them were standard works; he was not devoted to modern literature on the whole, though he read with pleasure new and really good books of history and travel, and was fond of Gottfried Keller's novels and poems. Over engravings and photographs of Italian works of art he would pore for hours, never weary of discussing memories and predilections with his friend. Visits to the Bern summer theatre, a short mountain tour with Widmann, an introduction to Ernst von Wildenbruch, whose dramas the master liked, and with whom he now found himself in personal sympathy—events such as these served to diversify the summer season of 1886, which was made musically noteworthy by the composition of a group of chamber works, the Sonatas in A and F major for pianoforte with violin and violoncello respectively, and the Trio in C minor for pianoforte and strings. The Sonatas were performed for the first time in public in Vienna; severally by Brahms and Hellmesberger, at the Quartet concert of December 2, and by Brahms and Hausmann at Hausmann's concert of November 24; the Trio was introduced at Budapest about the same time by Brahms, Hubay, and Popper, in each case from the manuscript.
  • 77. Detailed discussion of these works is superfluous; two of them, at all events, are amongst the best known of Brahms' compositions. The Sonata for pianoforte and violoncello in F is the least familiar of the group, but assuredly not because it is inferior to its companions. It is, indeed, one of the masterpieces of Brahms' later concise style. Each movement has a remarkable individuality of its own, whilst all are unmistakably characteristic of the composer. The first is broad and energetic, the second profoundly touching, the third vehemently passionate—in the Brahms' signification of the word, be it noted, which means that the emotions are reached through the intellectual imagination—the fourth written from beginning to end in a spirit of vivacity and fun. The work was tried in the first instance at Frau Fellinger's house. 'Are you expecting Hausmann?' Brahms inquired carelessly of this lady soon after his return in the autumn. Frau Fellinger, suspecting that something lay behind the question, telegraphed to the great violoncellist, who usually stayed at her house when in Vienna, to come as soon as possible, if only for a day. He duly appeared, and the new sonata was played by Brahms and himself on the evening of his arrival. They performed it again the day before the concert above recorded, at a large party at Billroth's. The last movement of the beautiful Sonata in A for pianoforte and violin is sometimes criticised as being almost too concise. The present writer confesses that she always feels it to be so, and one day confided this sentiment to Joachim, who did not agree with her, but said that the coda was originally considerably longer. 'Brahms told me he had cut a good deal away; he aimed always at condensation.' Dr. Widmann allows us to publish an English version of a poem written by him on this work, the original of which is published in the appendix to his 'Brahms Recollections.' We have desired to place it before our English-speaking readers, not only because it coincides remarkably with what we related in our early chapters of the delicate, fanciful tastes of the youthful Hannes, but because it gave pleasure to the Brahms of fifty-three, and even of sixty-three, and
• 78. thus seems to illustrate the fact on which we have insisted, that if in any case then in our master's, the child was father to the man. Only a year before his death the great composer wrote to Widmann to beg for one or two more copies of the poem, which had been printed for private circulation.

THE THUN SONATA.
Poem on the Sonata in A for Pianoforte and Violin, Op. 100, By Johannes Brahms,
WRITTEN BY J. V. WIDMANN.

There where the Aare's waters gently glide
From out the lake and flow towards the town,
Where pleasant shelter spreading trees provide,
Amidst the waving grass I laid me down;
And sleeping softly on that summer day,
I saw a wondrous vision as I lay.

Three knights rode up on proudly stepping steeds,
Tiny as elves, but with the mien of kings,
And spake to me: 'We come to search the meads,
To seek a treasure here, of precious things
Amongst the fairest; wilt thou help us trace
A new-born child, a child of heav'nly race?'

'And who are ye?' I, dreaming, made reply;
'Knights of the golden meadows' then they said,
'That at the foot of yonder Niesen[71] lie;
And in our ancient castles many a maid
Hath listened to the greeting of our strings,
Long mute and passed amid forgotten things.

'But lately tones were heard upon the lake,
A sound of strings whose like we never knew,
So David played, perhaps, for Saul's dread sake,
Soothing the monarch curtained from his view;
It reached us as it softly swelled and sank,
And drew us, filled with longing, to this bank.

'Then help us search, for surely from this place,
This meadow by the river, came the sound;
Help us then here the miracle to trace,
That we may offer homage when 'tis found.
Sleeps under flow'rs the new-born creature rare?
Or is it floating in the evening air?'

But ere they ceased, a sudden rapid twirl
Ruffled the waters, and, before our eyes,
A fairy boat from out the wavelet's whirl
Floated up stream, guided by dragon-flies;
Within it sat a sweet-limbed, fair-haired may,
Singing as to herself in ecstasy.

'To ride on waters clear and cool is sweet,
For clear as deep my being's living source;
To open worlds where joy and sorrow meet,
Each flowing pure and full in mingling course;
Go on, my boat, upstream with happy cheer,
Heaven is reposing on the tranquil mere.'

So sang the fairy child and they that heard
Owned, by their swelling hearts, the music's might,
The knights had only tears, nor spake a word,
Welling from pain that thrilled them with delight;
But when the skiff had vanished from their eyes,
The eldest, pointing, said in tender wise:

'Thou beauteous wonder of the boat, farewell,
Sweet melody, revealed to us to-day;
We that with slumb'ring minnesingers dwell,
Bid thee Godspeed, thou guileless stranger fay;
Our land is newly consecrate in thee
That rang of old with fame of minstrelsy.

'Now we may sleep again amongst our dead,
The harper's holy spirit is awake,
And as the evening glory, purple-red,
Shineth upon our Alps and o'er our lake,
And yet on distant mountain sheds its light,
Throughout the earth this song will wing its flight.

'Yet, though subduing many a list'ning throng,
In stately town, in princely hall it sound,
To this our land it ever will belong,
For here on flowing river it was found.'
Fervent and glad the minnesinger spake;
'Yes!' cried my heart—and then I was awake.

Whilst our master had been living through the spring and summer months in the enchanted world of his imagination, coming out of it only for brief intervals of sojourn in earth's pleasant places amidst the companionship of chosen friends, certain hard, commonplace realities of the workaday world, which had arisen earlier at home in Vienna, were still awaiting a satisfactory solution. The death of the occupier of the third-floor flat of No. 4, Carlsgasse, the last remaining member of the family with whom Brahms had lodged for fourteen or fifteen years, had confronted him with the necessity of choosing between several alternatives almost equally disagreeable to him, concerning which it is only necessary to say that he had avoided the annoyance of a removal by taking on the entire dwelling direct from the landlord, and had escaped the disturbance of having to replace the furniture of his rooms by accepting the offer of friends to lend him sufficient for his absolute needs. Arrangements and all necessary changes were made during his absence. To Frau Fellinger Brahms had entrusted the keys of the flat and of his rooms, which under her directions were brought into apple-pie order by the time of his return, the drawers being tidied, and a list of the contents of each neatly drawn up on a piece of cardboard, so that everything should be ready to his hand. The greatest difficulty, however, still remained. Who was to keep the rooms in order and see to the very few of Brahms' daily requirements which he was not in the habit of looking after himself? His coffee, as we know, he always prepared at a very early hour in the morning, and he was kept provided with a regular supply of the finest Mocha by a lady friend at Marseilles. Dinner, afternoon coffee, and often supper, were taken away from home. The master now declared he would have no one in the flat. To as many visitors as he felt disposed to admit he could himself
  • 83. open the door, whilst the cleaning and tidying of the rooms could be done by the 'Hausmeisterin,' an old woman occupying a room in the courtyard, and responsible for the cleaning of the general staircase, etc. In vain Frau Fellinger contested the point. Brahms was inflexible, and this kind lady apparently withdrew her opposition to his plan, though remaining quietly on the look-out for an opportunity of securing more suitable arrangements. By-and-by it presented itself. In Frau Celestine Truxa, the widow of a journalist, whose family party consisted of two young sons and an old aunt, Frau Fellinger felt that she saw a most desirable tenant for the Carlsgasse flat, and after a renewed attack on the master, whose arguments, founded on the immaculate purity of his rooms under the old woman's care, she irretrievably damaged by lifting a sofa cushion and laying bare a collection of dust, which she declared would soon develop into something worse, he was so far shaken as to say that if she would make inquiries for him he would consider her views. Frau Fellinger wisely abstained from further discussion, but after a few days Frau Truxa herself, having been duly advised to open the matter to Brahms with diplomatic sang-froid, went in person to apply for the dwelling. After her third ring at the door-bell, the door was opened by the master himself, who started in dismay at seeing a strange lady standing in front of him. 'I have come to see the flat,' said Frau Truxa. 'What!' cried Brahms. 'I have heard there is an empty flat here, and have come to look at it,' responded Frau Truxa indifferently; 'but perhaps it is not to let?' A moment's pause, and the composer's suspicious expression relaxed. 'Frau Dr. Fellinger mentioned the circumstances to me,' she continued, 'and I thought they might suit me.' By this time Brahms had become sufficiently reassured to show the rooms and to listen, though without remark, to a brief description of
• 84. Frau Truxa's family and of the circumstances in which she found herself. 'Perhaps, Dr. Brahms, you will consider the matter,' she concluded, 'and communicate with me if you think further of it. If I hear nothing more from you, I shall consider the matter at an end.' After about a week, during which Frau Truxa kept her own confidence, her maid came one day to tell her a gentleman had called to see her. Being engaged at the moment, she asked her aunt to ascertain his business, but the old lady returned immediately with a frightened look. 'I don't know what to think!' she exclaimed; 'there is a strange-looking man walking about in the next room measuring the furniture with a tape!' 'The things will all go in!' exclaimed the master as Frau Truxa hurried to receive him. The upshot was that the master gave up the tenancy of the flat, returning to his old irresponsible position as lodger, whilst Frau Truxa, bringing her household with her, stepped into the position of his former landlady, thereby giving Brahms cause to be grateful for the remainder of his life for Frau Fellinger's wise firmness. He was, says Frau Truxa, perfectly easy to get on with; all he desired was to be let alone. He was extremely orderly and neat in his ways, and expected the things scattered about his room to be dusted and kept tidy, but was vexed if he found the least trifle at all displaced—even if his glasses were turned the wrong way—and, without making direct allusion to the subject, would manage to show that he had noticed it. Observing, after she had been a little time in the flat, that he always rearranged the things returned from the laundress after they had been placed in their drawer, she asked him why he did so. 'Only,' he said, 'because perhaps it is better that those last sent back should be put at the bottom, then they all get worn alike.' A glove or other article requiring a little mending would be placed carelessly at
  • 85. the top of a drawer left open as if by accident. The next day he would observe to Frau Truxa, 'I found my glove mended last night; I wonder who can have done it!' and on her replying, 'I did it, Herr Doctor,' would answer, 'You? How very kind!' Frau Truxa came to respect and honour the composer more and more the longer he lived in her house. She made his peculiarities her study, and after a short time understood his little signs, and was able to supply his requirements as they arose without being expressly asked to do so. It is almost needless to say that he took great interest in her two boys, and once, when she was summoned away from Vienna to the sick-bed of her father, begged that the maid-servant might be instructed to give all her attention to the children during their mother's absence, even if his rooms were neglected. 'I can take care of myself, but suppose something were to happen to the children whilst the girl was engaged for me!' Every night whilst Frau Truxa was away, the master himself looked in on the boys to assure himself of their being safe in bed. For the old aunt he always had a pleasant passing word. The fourth Symphony and two books of Songs were published in 1886, and the three new works of chamber music, Op. 99, 100, 101, in 1887. Of the songs we would select for particular mention the wonderfully beautiful setting of Heine's verses: 'Death is the cool night, Life is the sultry day,' Op. 96, No. 1, and Nos. 1 and 2 of Op. 97. Brahms' Italian journey in the spring of 1887 was made in the company of Simrock and Kirchner. The following year he travelled in Widmann's society, visiting Verona, Bologna, Rimini, Ancona, Loretto, Rome, and Turin. Widmann sees in Brahms' spiritual kinship with the masters of the Italian Renaissance the chief secret of his love for Italy.
  • 86. 'Their buildings, their statues, their pictures were his delight and when one witnessed the absorbed devotion with which he contemplated their works, or heard him admire in the old masters a trait conspicuous in himself, their conscientious perfection of detail ... even where it could hardly be noticeable to the ordinary observer, one could not help instituting the comparison between himself and them.' Brahms had an interview when on this journey with the now famous Italian composer Martucci, who displayed a thorough familiarity with the works of the German master. Amongst the friends and acquaintances whom the composer met at Thun during his second and third summers there were the Landgraf of Hesse, Hanslick, Gottfried Keller, Professor Bächthold, Hermine Spiess and her sister, Gustav Wendt, the Hegars, Max Kalbeck, Steiner, Claus Groth, etc. One day, as he had started for a walk, he was stopped by a stranger, who asked if he knew where Dr. Brahms lived. 'He lives there,' replied the master, pointing to the haberdasher's shop. 'Do you know if he is at home?' 'That I cannot tell you,' was the reply. 'But go and ask in the shop; you will certainly be able to find out there.' The gentleman followed this advice, sent his card up, and received the answer that the Doctor was at home, and would be pleased to see him. To his surprise, on ascending the stairs, he found his newly-formed acquaintance waiting for him at the top.
• 87. [Illustration: Brahms' Lodgings near Thun. Photograph by Moegle, Thun.]

The rumour revived in the summer of 1887 that Brahms was engaged on an opera. This came about, perhaps, from his intimacy with Widmann. 'I am composing the entr'actes,' he jestingly replied to the Landgraf's question as to whether the report had any foundation. As a matter of fact, the subject of opera was not mentioned between the composer and his friend at this time.
• 88. The works which really occupied Brahms during the summer of 1887 were the double Concerto for violin and violoncello, with orchestral accompaniment, and the 'Gipsy Songs.' The Concerto was performed privately, immediately on its completion, in the 'Louis Quinze' room of the Baden-Baden Kurhaus. Brahms conducted, and the solo parts were performed by Joachim and Hausmann. Amongst the listeners were Frau Schumann and her eldest daughter, Rosenhain, Lachner, the violoncellist Hugo Becker, and Gustav Wendt. The work was heard in public for the first time in Cologne on October 15, Brahms conducting, and Joachim and Hausmann playing the solos as before; and the next performances, carried out under the same unique opportunities for success, were in Wiesbaden, Frankfurt, and Basle, on November 17, 18, and 20. In the autumn of this year one of the few remaining figures linked with the most cherished associations of Brahms' early youth passed away. Marxsen died on November 17, 1887, at the age of eighty-one, having retained to the end almost unimpaired vigour of his mental faculties. The last great pleasure of his life was associated with his beloved art. In spite of great bodily weakness, he managed to be present a week before his death at a concert of the Hamburg Philharmonic Society to hear a performance of the 'ninth' Symphony. 'I am here for the last time,' he said, pressing Sittard's hand; and he passed peacefully away fourteen days later. A few years previously his artistic jubilee had been celebrated in Hamburg, and his dear Johannes had surprised him with the proof-sheets of a set of one hundred Variations composed long ago by Marxsen, not with a view to publication, but as a practical illustration of the inexhaustible possibilities contained in the art of thematic development. Brahms, who happened to see the manuscript in Marxsen's room during one of his subsequent visits to Hamburg, was so strongly interested in it that in the end Marxsen gave it him, with leave to do as he should like with it after his death. The parcel of proof-sheets was accompanied by an affectionate letter, in which Brahms begged forgiveness for having anticipated this permission
  • 89. and yielded to his desire of placing the work within general reach during his master's lifetime; and perhaps no jubilee honour of which the old musician was the recipient filled him with such lively joy as was caused by this tribute. Marxsen's name as a composer is, indeed, now forgotten without chance of revival, but his memory will live gloriously in the way he would have chosen, carried through the years by the hand that wrote the great composer's acknowledgment to his teacher on the title-page of the Concerto in B flat. Four more performances from the manuscript of the double concerto of interest in our narrative remain to be chronicled—those of the Leipzig Gewandhaus, under Brahms, on January 1, 1888; of the Berlin Philharmonic Society, under Bülow, of February 6; and of the London Symphony Concerts, under Henschel, on February 15 and 21. The work, published in time for the autumn season, was given in Vienna at the Philharmonic concert of December 23 under Richter. On all these occasions the solos were played, as before, by Joachim and Hausmann. Bülow, having at this time resigned his post at Meiningen, had entered on a period of activity as conductor in some of the northern cities of Germany, and particularly in Hamburg and Berlin. His future programmes, in which our master's works were well represented, though not with the conspicuous prominence that had been possible at Meiningen, do not fall within the scope of these pages, since, with the mention of the double concerto, the enumeration of Brahms' orchestral works is complete. Bülow's successor at Meiningen, Court Capellmeister Fritz Steinbach, carried on the traditions and preferences of the little Thuringian capital as he found them, until his removal to Cologne a year or two ago, and has become especially appreciated as a conductor of the works of Brahms, whose personal friendship and artistic confidence he enjoyed in a high degree. The name of Eugen d'Albert, whose great gifts and attainments were warmly recognised by Brahms, should not be omitted from our pages, though detailed account of his relations with the master is
• 90. outside their limits. D'Albert's fine performances of the pianoforte concertos helped to make these works familiar to many Continental audiences, and certainly contributed, during the second half of the eighties, to the better understanding of the great composer which has gradually come to prevail at Leipzig. But little needs to be said about the double concerto. This fine work, which may be regarded as in some sort a successor to the double and triple concertos of Mozart and Beethoven, exhibits all the power of construction, the command of resource, the logical unity of idea, characteristic of Brahms' style, whilst its popularity has been hindered by the same cause that has retarded that of the pianoforte concertos; the solo parts do not stand out sufficiently from the orchestral accompaniment to give effective opportunity for the display of virtuosity, in the absence of which no performer, appearing before a great public as the exponent of an unfamiliar work for an accompanied solo instrument, has much chance of sustaining the lively interest of his audience in the composition. Of the three movements of the double concerto, the first is especially interesting to musicians, whilst the second, a beautiful example of Brahms' expressive lyrical muse, appeals equally to less technically prepared listeners. On the copy of the work presented by Brahms to Joachim the words are inscribed in the composer's handwriting: 'To him for whom it was written.' Widely contrasted in every respect was the other new work of 1887, introduced to the private circle of Vienna musicians at the last meeting for the season of the Tonkünstlerverein in April, 1888. The eleven four-part 'Gipsy Songs,' published in the course of the year as Op. 103, were sung from the manuscript by Fräulein Walter, Frau Gomperz-Bettelheim, Gustav Walter, and Weiglein of the imperial opera, to the composer's accompaniment. Brahms obtained the texts of this characteristic and attractive work from a collection of twenty-five 'Hungarian Folk-songs' translated into German by Hugo Conrat, and published in Budapest, with their original melodies set by Zoltan Nagy for mezzo-soprano or baritone, with the addition of pianoforte