Studies in Corpus Linguistics 84

How to Do Corpus Pragmatics on Pragmatically Annotated Data

Martin Weisser

John Benjamins Publishing Company
How to Do Corpus Pragmatics on Pragmatically Annotated Data: Speech acts and beyond
by Martin Weisser
Studies in Corpus Linguistics, Volume 84
SCL focuses on the use of corpora throughout language study, the development
of a quantitative approach to linguistics, the design and use of new tools for
processing language texts, and the theoretical implications of a data-rich discipline.
For an overview of all books published in this series, please see
http://benjamins.com/catalog/books/scl
Studies in Corpus Linguistics (SCL)
issn 1388-0373
General Editor: Ute Römer, Georgia State University
Founding Editor: Elena Tognini-Bonelli, The Tuscan Word Centre/University of Siena
Advisory Board
Laurence Anthony, Waseda University
Antti Arppe, University of Alberta
Michael Barlow, University of Auckland
Monika Bednarek, University of Sydney
Tony Berber Sardinha, Catholic University of São Paulo
Douglas Biber, Northern Arizona University
Marina Bondi, University of Modena and Reggio Emilia
Jonathan Culpeper, Lancaster University
Sylviane Granger, University of Louvain
Stefan Th. Gries, University of California, Santa Barbara
Susan Hunston, University of Birmingham
Michaela Mahlberg, University of Birmingham
Anna Mauranen, University of Helsinki
Andrea Sand, University of Trier
Benedikt Szmrecsanyi, Catholic University of Leuven
Elena Tognini-Bonelli, The Tuscan Word Centre/The University of Siena
Yukio Tono, Tokyo University of Foreign Studies
Martin Warren, The Hong Kong Polytechnic University
Stefanie Wulff, University of Florida
How to Do Corpus Pragmatics
on Pragmatically Annotated Data
Speech acts and beyond
Martin Weisser
Guangdong University of Foreign Studies
John Benjamins Publishing Company
Amsterdam/Philadelphia
Cover design: Françoise Berserik
Cover illustration from original painting Random Order
by Lorenzo Pezzatini, Florence, 1996.
The paper used in this publication meets the minimum requirements of
the American National Standard for Information Sciences – Permanence
of Paper for Printed Library Materials, ansi z39.48-1984.
doi 10.1075/scl.84
Cataloging-in-Publication Data available from Library of Congress:
lccn 2017056561 (print) / 2017061549 (e-book)
isbn 978 90 272 0047 1 (Hb)
isbn 978 90 272 6429 9 (e-book)
© 2018 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any
other means, without written permission from the publisher.
John Benjamins Publishing Company · https://benjamins.com
Table of contents

List of tables
List of figures
Abbreviations

Chapter 1. Introduction
1.1 Previous approaches to pragmatics and discourse
1.2 Speech acts
1.3 Approaches to corpus-/computer-based pragmatics
1.4 Outline of the book
1.5 Conventions used in this book

Chapter 2. Computer-based data in pragmatics
2.1 Linguistic corpora and pragmatics
2.2 Issues and standards in text representation and annotation
2.2.1 General computer-based representation
2.2.2 Text vs. meta-information
2.2.3 General linguistic annotation
2.3 Problems and specifics in dealing with spoken language transcription
2.3.1 Issues concerning orthographic representation
2.3.2 Issues concerning prosody
2.3.3 Issues concerning segmental and other features
2.3.4 Issues concerning sequential integrity
2.3.5 Issues concerning multi-modality

Chapter 3. Data, tools and resources
3.1 Corpus data used in the research
3.1.1 The SPAADIA Trainline Corpus
3.1.2 The selection from Trains 93
3.1.3 The selection from the Switchboard Annotated Dialogue Corpus
3.1.4 Discarded data
3.1.5 Supplementary data
3.2 The DART implementation and its use in handling dialogue data
3.2.1 The DART functionality
3.2.2 The DART XML format
3.3 Morpho-syntactic resources required for pragmatic analysis
3.3.1 The generic lexicon concept
3.3.2 The DART tagset
3.3.3 Morphology and morpho-syntax
3.3.4 'Synthesising' domain-specific lexica

Chapter 4. The syntax of spoken language units
4.1 Sentence vs. syntactic types (C-Units)
4.2 Units of analysis and frequency norming for pragmatic purposes
4.3 Unit types and basic pragmatic functions
4.3.1 Yes-units
4.3.2 No-units
4.3.3 Discourse markers
4.3.4 Forms of address
4.3.5 Wh-questions
4.3.6 Yes/no- and alternative questions
4.3.7 Declaratives
4.3.8 Imperatives
4.3.9 Fragments and exclamatives

Chapter 5. Semantics and semantico-pragmatics
5.1 The DAMSL annotation scheme
5.2 Modes
5.2.1 Grammatical modes
5.2.2 Interactional modes
5.2.3 Point-of-view modes
5.2.4 Volition and personal stance modes
5.2.5 Social modes
5.2.6 Syntax-indicating modes
5.3 Topics
5.3.1 Generic topics
5.3.2 Domain-specific topics

Chapter 6. The annotation process
6.1 Issues concerning the general processing of spoken dialogues
6.1.1 Pre-processing – manual and automated unit determination
6.1.2 Fillers, pauses, backchannels, overlap, etc.
6.1.3 Handling initial connectors, prepositions and adverbs
6.1.4 Dealing with disfluent starts
6.1.5 Parsing and chunking for syntactic purposes
6.2 Identifying and annotating the individual unit types automatically
6.2.1 Splitting off and annotating shorter units
6.2.2 Tagging wh-questions
6.2.3 Tagging yes/no-questions
6.2.4 Tagging fragments, imperatives and declaratives
6.3 Levels above the C-unit
6.3.1 Answers and other responses
6.3.2 Echoes
6.4 Identifying topics and modes
6.5 Inferencing and determining or correcting speech acts

Chapter 7. Speech acts: Types, functions, and distributions across the corpora
7.1 Information-seeking speech acts
7.2 (Non-)Cohesive speech acts
7.3 Information-providing and referring speech acts
7.4 Negotiative speech acts
7.5 Suggesting or commitment-indicating speech acts
7.6 Evaluating or attitudinal speech acts
7.7 Reinforcing speech acts
7.8 Social, conventionalised speech acts
7.9 Residual speech acts

Chapter 8. Conclusion

Appendix A. The DART speech-act taxonomy (version 2.0)
References
Index
List of tables

Table 2.1 BNC D96: Discrepancies between original and edited version
Table 3.1 Summary of corpus materials primarily used
Table 3.2 Frequencies for 1st and 2nd person pronouns in two illustrative corpora
Table 3.3 Lexical coverage of the generic lexicon with regard to 4 different corpora
Table 3.4 The DART tagset
Table 4.1 Main function types, intra-category percentages and normed frequencies for yes-units
Table 4.2 Main function types, intra-category percentages and normed frequencies for no-units
Table 4.3 Main function types, intra-category percentages and normed frequencies for DMs
Table 4.4 Proposition-signalling initiating DMs
Table 4.5 Functions, intra-category percentages and normed frequencies for wh-questions
Table 4.6 Functions, intra-category percentages and normed frequencies for alternative and yes/no-questions
Table 4.7 Main functions, intra-category percentages and normed frequencies for declaratives
Table 4.8 Functions, intra-category percentages and normed frequencies for imperatives
Table 4.9 Important functions, intra-category percentages and normed frequencies for fragments
Table 5.1 Grammatical modes
Table 5.2 Backward-looking interactional modes
Table 5.3 Forward-looking interactional modes
Table 5.4 'Bi-directional' interactional modes
Table 5.5 Point-of-view modes
Table 5.6 Volition and personal stance modes
Table 5.7 Social modes
Table 5.8 Syntax-indicating lexicogrammatical modes
Table 5.9 Syntax-indicating modes based on <punc…/> elements
Table 5.10 Measures, enumerations & spellings
Table 5.11 Times & dates
Table 5.12 Locations & directions
Table 5.13 Personal details
Table 5.14 Meetings and appointments
Table 5.15 Domain-specific topics
Table 7.1 Information-seeking speech acts
Table 7.2 Engaging speech acts
Table 7.3 Dialogue-managing speech acts
Table 7.4 Textual speech acts
Table 7.5 Informing or referring speech acts
Table 7.6 Elaborating speech acts
Table 7.7 Explaining speech acts
Table 7.8 Awareness-indicating speech acts
Table 7.9 Hypothesising speech acts
Table 7.10 Volitional speech acts
Table 7.11 Negotiative speech acts
Table 7.12 Suggesting or commitment-indicating speech acts
Table 7.13 Evaluating speech acts
Table 7.14 Attitudinal speech acts
Table 7.15 Reinforcing speech acts
Table 7.16 Social, conventionalised speech acts
Table 7.17 Residual speech acts
List of figures

Figure 2.1 Sample extract from the London-Lund Corpus
Figure 2.2 Sample excerpt from the original SGML version of the BNC
Figure 2.3 Hierarchical display of XML in a browser window
Figure 2.4 A short, illustrative, linguistic XML sample
Figure 2.5 A colour-coded XML sample
Figure 2.6 A sample CSS style sheet
Figure 3.1 The SPAACy dialogue annotation tool
Figure 3.2 The Dialogue Annotation and Research Tool (DART ver. 2)
Figure 3.3 Basic DART XML dialogue structure
Figure 3.4 A short sample from an LFG lexicon
Figure 3.5 A sample from the generic lexicon
Figure 3.6 Comparison of type and token coverage of the uninflected generic lexicon for various corpora
Figure 3.7 Sample from a synthesised domain-specific lexicon
Abbreviations
ANC American National Corpus
ACL Association for Computational Linguistics
ALLC Association for Literary and Linguistic Computing
ACH Association for Computing and the Humanities
ASCII American Standard Code for Information Interchange
BNC British National Corpus
CA Conversational Analysis
CANCODE Cambridge and Nottingham Corpus of Discourse in English
COMPGR Comprehensive Grammar of the English Language
CAMGR The Cambridge Grammar of the English Language
CLAWS Constituent Likelihood Automatic Word-tagging System
CSS Cascading Style Sheets
DA Discourse Analysis
DAMSL Dialogue Act Markup in Several Layers
DART Dialogue Annotation and Research Tool
DRI Discourse Resource Initiative
DSSSL Document Style Semantics and Specification Language
DTD Document Type Definition
FLOB Freiburg Lancaster-Oslo/Bergen Corpus
FROWN Freiburg Brown Corpus
GENAM General American
HTML Hypertext Markup Language
ICE International Corpus of English
IDE Integrated Development Environment
LINDSEI Louvain International Database of Spoken English Interlanguage
LONGGR Longman Grammar of Spoken and Written English
LLC London-Lund Corpus of Spoken English
LOB Lancaster-Oslo/Bergen Corpus
MATE Multilevel Annotation, Tools Engineering
MICASE Michigan Corpus of Academic Spoken English
NLP Natural Language Processing
POS Part of Speech
RP Received Pronunciation
SGML Standard Generalized Markup Language
TART Text Annotation and Research Tool
TEI Text Encoding Initiative
XML eXtensible Markup Language
XSL eXtensible Style Sheet Language
XSLT XSL Transformations
XSL-FO XSL Formatting Objects
Chapter 1
Introduction
Corpus- and computer-based methods of analysis have 'revolutionised' much of the research in linguistics or natural language processing over the last few decades. Major advances have been made in lexicography (cf. Ooi 1998 or Atkins & Rundell 2008), morphology (cf. Beesley & Karttunen 2003, or Roark & Sproat 2007), (morpho-)syntax (Roark & Sproat 2007), and genre-based text-linguistics (cf. Biber et al. 1998), to name but the most important areas. These advances were in many cases linked to, or dependent upon, advances in creating and providing suitably annotated resources in the form of corpora. However, apart from the efforts made on the SPAAC project (cf. Leech & Weisser 2003), the creation of the SPICE-Ireland corpus (Kallen & Kirk 2012), or my own research into improving the automated annotation of pragmatics-related phenomena (Weisser 2010), to date very few linguistically motivated efforts have been made to construct annotated corpora of spoken language that reflect the different facets of language involved in creating meaning on the level of human interaction – in other words, on the level of pragmatics.
One aim of this book is to rectify this shortcoming and to demonstrate how
it is possible to create corpora that can be annotated largely automatically on the
levels of syntax, (surface) polarity (positive or negative mood of the unit), seman-
tics (in the sense of representing major topic features of a textual unit), semantico-
pragmatics (in the form of capturing interactional signals), and, finally, pragmatics
(in the shape of speech acts). In contrast to current trends in computer-based, and
here especially computational linguistics, this is done relying purely on linguistic
surface information in conjunction with appropriate inferencing strategies, rather
than employing probabilistic methods. Thus, e.g. the ‘utterance’ i’d like to go from
Preston to London can be ‘recorded’ inside a corpus as:
a. being of declarative sentence type,
b. having positive surface polarity,
c. containing topic information about some locations and movements between
them,
d. signalling an intent or preference on the semantico-pragmatic level, as well as
e. pragmatically, in its particular context of occurrence inside the dialogue it was
taken from, representing a directive that also informs the interlocutor about
the speaker’s intentions.
The exact format in which such information can best be stored in order to facili-
tate usability and exchangeability will be presented and discussed in the relevant
sections below. The main emphasis here will be on the automatic determination
of speech acts.
The advantages of being able to produce pragmatically annotated corpora effi-
ciently and thereby creating resources for many areas of linguistic research con-
cerned with human (or human-computer) interaction should be self-evident, as
this could not only greatly facilitate research about the interplay of the different
linguistic levels, and in so doing also increase our understanding of how commu-
nication works, but also make it possible to use these resources in more applied
areas, such as language teaching, textbook creation, or the training of other ‘lan-
guage professionals’, for example as interpreters or even call centre personnel.
Some of the ways in which this can be achieved, for instance through the prag-
matic profiling of speakers or speaker groups/populations, will be discussed in
the more research-oriented chapters of this book, Chapters 4 and 7, where I shall
attempt to demonstrate the different forms of applicability of the approach.
At the same time, creating such an annotation/corpus-creation methodology obviously does not constitute a purely mechanical process because, apart from devising appropriate algorithms, data structures and storage mechanisms for processing language on the computer, such an endeavour already involves active research into the interplay – or interfaces, as some researchers prefer to refer to them – of the different linguistic levels mentioned above. Another aim of this
book, therefore, is to provide a substantial contribution to the practical and theo-
retical underpinnings of how to analyse, explain, and categorise the individual ele-
ments of language that contribute towards the generation of pragmatic meaning.
Here, I will especially focus on developing the theory of speech acts – originally
established by Austin (1962) and Searle (1969) – further, and present a generi-
cally applicable speech-act taxonomy that goes far beyond the limited categories
proposed by Searle (ibid.) that are generally used in pragmatics research. Before
going into how this is achieved in any detail, though, I will first contextualise the
research discussed here by providing a brief overview of existing studies into gen-
eral and computer-based pragmatics, and the analysis of spoken discourse.
1.1 Previous approaches to pragmatics and discourse
With regard to contemporary pragmatics, one can see a rough division into two
different factions, or what Huang (2007: 4; cf. also Horn & Ward 2004: x) refers
to as the “Anglo-American” and the “European Continental” schools. Amongst
these, the former subscribes to the “component view” (ibid.), a view that sees
pragmatics as a separate level of linguistics, such as those of phonetics/phonology, syntax and semantics, while the latter adopts the "perspective view"
(ibid.) – following the original ideas developed by Morris in 1938 (cf. Verschueren
1999: 2–10) –, which perceives pragmatics as a function of language that influ-
ences the other levels and thus incorporates a larger situational context, including
sociolinguistic factors. These different ‘attitudes’ towards the nature of pragmat-
ics also to some extent manifest themselves in the two different approaches to
the subject that Leech (1983: 10–11) refers to as "PRAGMA-LINGUISTICS" and "SOCIO-PRAGMATICS", respectively.
Along with pragma-linguistics and the component view generally comes an
emphasis on issues in micro-pragmatics (cf. Mey 1993: 182), where the main topics
of investigation are generally considered to be implicature, presupposition, speech
acts, reference, deixis, as well as definiteness and indefiniteness. This is evident in
the chapters under the heading “The Domain of Pragmatics” in Horn and Ward’s
(2004/2006) Handbook of Pragmatics, which may be seen as one of the standard
references for work in pragmatics that follows the component view. These topics
are also still predominantly investigated on the level of the ‘sentence’ – a concept
which will need to be scrutinised further in Section 4.1 of this book –, rather than
involving any larger context. This practice is still being adhered to, despite the
fact that at least some of the emphasis in this school is now also shifting towards
an analysis of contextually embedded examples, as evidenced by a number of
later chapters in Horn and Ward (2006). One further feature that goes hand in
hand with the ‘sentence-level’ analysis is that ‘research’ by proponents of this view
still frequently involves the use of constructed (‘armchair’) examples (cf. Jucker
2009: 1615) and a strong adherence to philosophically-oriented, and logic-based
interpretations of how meaning is created, as well as employing more formal lin-
guistic methods. The latter also tend to stress the affinity of pragmatics to (formal)
semantics, something which supporters of the component view are still strug-
gling to resolve, as evidenced through the ongoing debate about the Semantics –
Pragmatics distinction (cf. e.g. Szabó 2005).
In contrast, advocates of socio-pragmatics and the perspective view often
concentrate on research that is more process- and context-oriented, and which
focuses on macro-pragmatics, i.e. the investigation of larger contexts and ‘meaning
in use’, something that also frequently involves the cultural or even extra-linguistic
information that contributes to communication as a social act. In line with their
more sociological orientation, socio-pragmatists seem, to some extent, also to be more inclined towards employing less formal data analysis methods – reminiscent of the approaches in conversational analysis (CA) – which nonetheless use a substantial amount of empirical data in a bottom-up strategy to draw conclusions from. There
may also still be more of an emphasis on issues of sequencing of interaction, such
as turn-taking (cf. Sacks, Schegloff & Jefferson 1974), as a means of handling or
managing the social act(ion) constituted by verbal (conversational) communica-
tion, although this is beginning to play a larger role in the component view these
days, too.
Supporters of the component view, on the other hand, still seem to work
more along the lines of analysis methods developed in discourse analysis (DA),
although they do not explicitly subscribe to this. DA itself developed out of the
(British) Firthian linguistic tradition and is therefore essentially functional or
systemic in its approach to dialogues. Its main attention was originally focussed
on the identification of units and structure of interaction, as a kind of exten-
sion of the hierarchy of units employed in general linguistic analysis and descrip-
tion, ranging from the morpheme, word, clause, to the sentence, the micro-level
referred to above. DA was initially limited to the relatively narrow scope of ana-
lysing classroom interaction (cf. Sinclair & Coulthard 1975; Coulthard 1977),
in order to attempt to identify regular patterns therein. More recent approaches
to DA (cf. Coulthard 1992 or Brown & Yule 1983), however, have realised that
the specific conditions of classroom interaction have led to over-generalisations
and incorrect labelling – and hence potentially equally incorrect interpreta-
tion – of interactional patterns, and have therefore actively sought to overcome
these earlier problems. The DA approach is also more top-down, in the sense that
categorisations are attempted earlier, and potentially also based on some degree
of intuition, something that is also still clearly reflected in ‘component-view’
pragmatics. This does not mean, however, that DA is not an empirical approach
just like CA, since both approaches are well-grounded empirically, only with a
slightly different slant and emphasis on various levels of detail. I will return to
these levels of detail later in the discussion of issues regarding transcription con-
ventions or dialogue sequencing and structure.
Within the component-view school, one can also distinguish between two
further subgroups, proponents of the neo-Gricean and the relevance-theoretical
view. While the neo-Griceans have attempted to ‘refine’ (cf. Huang 2007: 36–54)
the original approach developed by H. P. Grice that includes the Cooperative Prin-
ciple (CP) and its associated categories and maxims (cf. Grice 1989: 26), which
assumes that all communication is essentially based on the co-operative behav-
iour of the communication partners, supporters of relevance theory work on the
assumption that there is only one overarching principle in communication, which
is that of relevance:
We share Grice’s intuition that utterances raise expectations of relevance,
but question several other aspects of his account, including the need for a
Cooperative Principle and maxims, the focus on pragmatic contributions to
implicit (as opposed to explicit) content, the role of maxim violation in utterance
interpretation, and the treatment of figurative utterances. The central claim of
relevance theory is that the expectations of relevance raised by an utterance are
precise and predictable enough to guide the hearer toward the speaker’s meaning.
(Wilson & Sperber 2006: 608)
The concepts employed in relevance theory, though, are not 'measurable' (ibid.: 610) and can hence also not be easily applied to computer-based analysis, so that
they will be largely ignored in the following exposition.
As indicated above, the main emphasis of this book is on working with speech
acts, so the other main issues in micro-pragmatics – implicature, presupposition,
reference, deixis, and definiteness and indefiniteness – will only be referred to if
they play a direct role in the identification of speech acts or are in fact constitutive
thereof.
1.2 Speech acts
Research into speech acts essentially started with the ‘ordinary language philoso-
pher’ Austin and his famous collection of William James lectures that was pub-
lished under the title of How to Do Things with Words (Austin 1962). Here, Austin
contradicted the idea, commonly assumed by most philosophers since the days of Aristotle, that sentences are only used to express propositions, that is, facts that are either true or false.
It was too long the assumption of philosophers that the business of a ‘statement’
can only be to ‘describe’ some state of affairs, or to ‘state some fact’, which it must
do either truly or falsely. (Austin 1962: 1)
The concept of truth-conditionality, though, rather surprisingly, still seems to
be present in many current approaches to the logic-based semantic description
of language followed by at least some of the proponents of the component view.
Starting from his theory of performative verbs (ibid.: 14ff), Austin claimed that,
instead, sentences are often used to perform a ‘verbal act(ion)’ and distinguished
between three different functions – or acts – of an utterance:
1. locution: ‘what is said’
2. illocution: ‘what is intended’
3. perlocution: ‘what is evoked in the recipient’ (cf. ibid.: 98ff.)
The explanations given in single quotation marks above represent my brief sum-
maries of Austin’s expositions. Apart from these functions, he also claims that
there are a number of felicity conditions (“Conditions for Happy Performatives”)
that are necessary for such actions to become successful, amongst them that the
hearer understand and accept them, such as in the act of promising (ibid.: 22f).
These are often dependent upon established conventions or laws (ibid.: 14f).
Searle (1969), in his Speech Acts: an Essay in the Philosophy of Language, takes
Austin’s ideas further and defines the speech act not only as an expression of
illocutionary force, but even ascribes it the most central role in communication.
The unit of linguistic communication is not, as has generally been supposed,
the symbol, the word or sentence, but rather the production or issuance of the
symbol or word or sentence in the performance of the speech act. […] More
precisely, the production or issuance of a sentence token under certain conditions
is a speech act, and speech acts ([…]) are the basic and minimal unit of linguistic
communication. (Searle 1969: 16)
In order to distinguish between the ‘locutionary elements’ of a sentence, he
differentiates between a “propositional” and an “illocutionary force indicator”
(ibid.: 30) and introduces the notion of what has later come to simply be referred
to by the acronym IFIDs (“illocutionary force indicating devices”). Amongst these,
he lists “word order, stress, intonation contour, punctuation, the mood of the verb,
and the so-called performative verbs” (ibid.). Some, but not all of these, though
complemented by a few others, will later turn out to be highly relevant to our
analysis methodology.
Summarising the most important points made by Austin and Searle, it becomes
clear that linguistic form, (lexico-)grammar, and context or established conventions, taken together, determine meaning. Since syntactic features are listed among
the IFIDs, it ought to be clear that it is both the semantics and the syntax that play
a role in determining the meaning of a speech act. And because analysing basic
syntactic patterns is often much easier than determining the exact meaning – the
(deep) semantics –, it seems only natural that one might want to begin an analysis
of speech acts by testing to see how the high-level syntax may constrain the options
for them, thereby also signalling high-level types of communication.
Hence it is relatively easy, although not always foolproof, to distinguish syntac-
tically between whether someone is asking a question, making a statement or sim-
ply indicating (dis)approval/agreement, backchanneling, etc., in order to be able
to limit the set of initial choices for identifying a speech act. Once the selection
has been narrowed down, one can then look for and identify further IFIDs at the
semantico-pragmatic level that may reflect additional linguistic or interactional
conventions, ‘synthesise’ the existing information, and, in a final step – as and
when required – carry out some more inferencing in order to try and determine
the exact primary force of the illocution as far as possible. To show how this can be
done will be one of the most important aims of this book, along with demonstrat-
ing that far more in communication than has commonly been assumed to belong
to the realm of conventional implicature (cf. Grice 1989: 25–26) – as opposed to
conversational implicature – is indeed conventional, and can thus be investigated
using methodologies similar to those traditionally employed in corpus linguistics,
albeit with some major extensions.
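To make this stepwise narrowing-down a little more concrete, the following minimal sketch illustrates the general logic in Python; the category labels and the single lexical cue used are purely illustrative assumptions and do not represent the actual DART categories or rules discussed in later chapters.

# Step 1: the high-level syntactic type constrains the candidate speech acts.
CANDIDATES = {
    "yn-question": ["reqInfo", "reqConfirm", "offer"],
    "wh-question": ["reqInfo", "echo"],
    "imperative":  ["request", "suggest"],
    "declarative": ["inform", "answer", "expressIntent"],
}

def classify_unit(unit_type, text):
    """Narrow down the candidate speech acts for one functional unit."""
    candidates = CANDIDATES.get(unit_type, ["fragment"])
    # Step 2: semantico-pragmatic IFIDs (here a single lexical cue) select
    # among the remaining candidates.
    if unit_type == "declarative" and "'d like to" in text:
        return "expressIntent"   # volition/preference cue found
    # Step 3: further context-based inferencing would apply here.
    return candidates[0]         # otherwise default to the most general reading

print(classify_unit("declarative", "i'd like to go from Preston to London"))

Even in this toy form, the sketch shows why such cue-based rules remain transparent: each decision can be traced back to an identifiable surface pattern.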
Closely linked to the notion of conventionality is that of indirectness in mean-
ing. For instance O’Keeffe et al. (2011), apparently subscribing to something I
would like to call the ‘general myth of indirectness’, state that
the utterance I’ve got a headache carries a variety of meanings according to when
it is used, who uses it, who the person is talking to, where the conversation takes
place, and so forth:
– 
If a patient said it to a doctor during a medical examination, it could mean:
I need a prescription.
– 
If a mother said it to her teenage son, it could mean: Turn down the music.
– 
If two friends were talking, it could mean: I was partying last night.
– 
If it were used as a response to an invitation from one friend to another, such
as Do you fancy going for a walk?, it could simply mean: No.
Therefore, depending on the context it occurs in, the utterance I’ve got a headache
can function as an appeal, an imperative, a complaint or a refusal, and so on.
(O’Keeffe et al. 2011: 1–2)
Such claims to the multi-functionality and indirectness of ‘utterances’ – a rather
vague term we shall have to evaluate in more detail later – are very common in
the traditional pragmatics literature. However, it seems to me that we seriously
need to question the true extent of these phenomena. Certainly, no-one, includ-
ing myself, would claim that it is not possible to create meaning in highly indi-
rect ways, and that ‘locutionary facts’ may assume a special meaning in context
that is not expressed directly through them. Nevertheless, if we look at the above
examples more closely, we can assume that the functions associated with them by
O’Keeffe et al. (2011) probably do not reside in the locution I’ve got a headache
itself, but in their surrounding co-text, and may therefore at best be inferred,
rather than really being implicit. Thus, the first example is more likely to consti-
tute an answering response, i.e. statement, to the doctor’s query as to the ailment
of the patient, and the request for a prescription would most probably follow
more or less in exactly the words assumed to be the meaning of the communica-
tive unit used as an example. Similarly, in example two, the imperative Turn down
the music is more likely to be a kind of ‘preface’ to the explanatory statement I’ve
got a headache, while, in the third example, some contextual information would
probably be required in order to ‘set the scene’ for the cause of the headache,
while, in the final example, the assumed refusal is more likely to be expressed by
an expression of regret – in other words, a dispreferred response – such as Sorry
preceding the explanation. It therefore seems that we perhaps need to adopt a
more critical stance towards the notion of indirectness, and start our investiga-
tion into the meaning of functional communicative units by focussing on identi-
fying their ‘local’ direct meaning first.
Yet the method referred to above, initially focussing on syntactic form and
then supplementing this in an inferencing process by looking at other lexico-
grammatical features, only works well for those types of verbal interaction where
the individual speech act is essentially identifiable without taking too much of
the surrounding context into account. However, there are also some other types
of speech acts whose function is not solely determined by the propositional and
illocutionary force inherent in what is said, but is rather almost exclusively related
to how they function in reaction to what the previous interlocutor has said or
what has been referred to within the larger context of the whole dialogue, such
as answers to questions, echoing (partially or wholly repeating) something that
the previous speaker has said, or confirming facts that have been established in
the course of the interaction. In order to interpret these correctly, it is no longer
sufficient to simply analyse the current/local textual unit itself, but necessary to
look backwards or forwards within the dialogue, as well as to possibly assign mul-
tiple speech act labels that reflect the different, ‘cumulative’, functions on different
levels. The very fact that such textual units exist that absolutely require the sur-
rounding context for interpreting their function is yet another argument against
the ‘single-sentence interpretation mentality’ already previously referred to in
connection with logic-based traditional approaches to pragmatics.
1.3 Approaches to corpus-/computer-based pragmatics
In the preceding sections, I mainly focussed on issues and background information
related to traditional, ‘manual’ pragmatics, but of course, ‘doing pragmatics’ on the
computer is in many respects very different from traditional pragmatics, and this
is why it is necessary to introduce this field of research separately. This difference
is partly due to the nature of electronic data and the methods involved in handling it, whose discussion will therefore form a major part of this book, but also to some extent to the aims pursued by the people who work in this area.
Computer-based pragmatic analysis has only relatively recently become a
major focus of attention, most notably because of increasing efforts in creating
more flexible and accurate dialogue systems (cf. Androutsopoulos & Aretoulaki
2003: 635–644) that allow a human user to interact with a computer system or to
help human agents to interact and negotiate with one another if they have a differ-
ent language background, such as in the German Verbmobil project (cf. Jekat et al.
1995). Consequently, the efforts in this field are often geared more towards the
needs of the language engineering community, rather than attempting to improve
our general understanding of communication, which is still the implicit or explicit
aim of pragmatics. Although the initial efforts on the SPAAC project (cf. Leech &
Weisser 2003), which provided the original basis for the research described here,
were also made in order to help and improve such systems by creating annotated
training materials for dialogue systems, my own emphasis has long shifted back
towards a much more corpus-oriented approach, aimed at offering a wider basis
for research on language and communication.
Corpus-/computer-based pragmatics, though, is still very much a developing
field, and as yet there exist no real commonly agreed standards as to how such a
type of research can or ought to be conducted. Having said this, there have at least
been attempts to try and define the levels and units of annotation/analysis that are
needed in order to create corpora of pragmatically enriched discourse data, most
notably the efforts of the Discourse Resource Initiative (DRI). The DRI held three
workshops on these issues between the years of 1995 and 1998, and, as a result of
this, an annotation scheme called DAMSL (Allen & Core 1997) was developed.
As this scheme has been fairly influential and parts of it bear some similarity to
the DART (Dialogue Annotation and Research Tool) scheme used here, DAMSL
will be discussed in more detail in Section 5.1 below. In the expanded title of
DAMSL, Dialogue Act Markup in Several Layers, we can also see that the language-
engineering community often prefers to use the term dialogue act instead of the
original speech act (cf. Leech et al. 2000: 6), but I see no benefit in adopting this
here, and will keep on using the traditional term, which is also still better-known
in linguistics circles.
Other attempts at reporting on or defining best practice standards in this area
have been Leech et al. (2000) within the EAGLES (Expert Advisory Group on Lan-
guage Engineering Standards) framework and the efforts of the MATE (Multilevel
Annotation, Tools Engineering; cf. Klein 1999) project. While these attempts at
defining and possibly also standardising annotation schemes were predominantly
carried out for NLP purposes, from the linguistics-oriented side, Kallen and Kirk
(2012) also established a pragmatics-related annotation scheme for SPICE-Ireland, based essentially on the original design of the annotation for the corpora of the International Corpus of English (ICE; Nelson 2002), but adding various levels of annotation, drawing mainly on Searle's speech act taxonomy. The specific
issues raised through these efforts, as well as other relevant endeavours, will be
discussed in more detail later.
In recent years, corpus pragmatics, as a specialised sub-field of corpus linguis-
tics, has also begun to establish itself more and more. This is evidenced by such
publications as the series Yearbook of Corpus Linguistics and Pragmatics, whose
first volume appeared in 2013 (Romero-Trillo 2013), the new journal Corpus Pragmatics, which was established in 2017, and, perhaps most notably, the edited collection Corpus Pragmatics: a Handbook (Aijmer & Rühlemann 2015). Yet, when
looking through the chapters/articles in such publications, it quickly becomes
apparent that much of the research conducted under this label ‘only’ more or less
constitutes the application of relatively traditional corpus-linguistics techniques,
such as concordancing or n-gram analysis, to research on highly limited features,
rather than resorting to any form of annotation that would make it possible to
carry out large-scale analyses of multiple communicative functions at the same
time so as to be able to create communicative profiles.
As far as computer-based approaches to pragmatic analysis from a compu-
tational linguistics perspective are concerned, Jurafsky (2006: 579) identifies “[f]
our core inferential problems […]: REFERENCE RESOLUTION, the interpreta-
tion and generation of SPEECH ACTS, the interpretation and generation of DIS-
COURSE STRUCTURE AND COHERENCE RELATIONS, and ABDUCTION.”
Out of these four, this book is only concerned with the latter three, with the main
emphasis being on identifying speech acts through abductive (cf. Hobbs 2006),
natural language-based, reasoning, instead of employing logic-based formal
semantic approaches. Discourse structure and coherence relations are also treated
to some, albeit slightly lesser, extent.
According to Jurafsky, “there are two distinct computational paradigms in
speech act interpretation: a logic-based approach and a probabilistic approach”
(ibid.) in computational pragmatics research such as it is generally conducted
by computational linguists. The former approach is essentially grounded in the
BDI (belief, desire, intention) model (cf. Allen 1995: 542–554) and the con-
cept of plans. Allen (1995: 480) provides the following description for plans
and their usage.
A plan is a set of actions, related by equality assertions and causal relations, that
if executed would achieve some goal. A goal is a state that an agent wants to make
true or an action that the agent wants to execute. […] The reasoning needed in
language understanding […] involves the plans of other agents based on their
actions. This process is generally called plan recognition or plan inference. The
input to a plan inference process is a list of the goals that an agent might plausibly
be pursuing and a set of actions that have been described or observed. The task
is to construct a plan involving all the actions in a way that contributes toward
achieving one of the goals. By forcing all the actions to relate to a limited number
of goals, or to a single goal, the plan-based model constrains the set of possible
expectations that can be generated. (emphasis in original)
Often, the need to generate these types of constraints, and thereby limit the
range of understanding of a system, is unfortunately driven by rather commer-
cial reasons because it is obviously highly time-consuming and costly to conduct
extensive research on discourse matters. This leads to a fairly limited application
basis – and hence lack of generic applicability – for these plans. Plans are thus
essentially only usable as, or represent, ad hoc methods for dealing with highly
specific types of interaction, and are consequently of less concern to my discus-
sion. Furthermore, the BDI model embodies a very complex chain of reasoning
and abstract logical representation that is far removed from natural language (for
an example of this, consult Jurafsky 2006: 582–586) and “requires that each utter-
ance have a single literal meaning" (ibid.: 587), something that is all too frequently
not the case in real-life spoken interaction. Consequently, although there is cer-
tainly some abductive logic involved in identifying speech acts, as we shall dis-
cuss in more detail in Section 6.5, a reasoning process that is based on a complex
logic-based abstraction that allows for only one single and precise interpretation
seems unsuited to the task at hand. Furthermore, as its name implies, the BDI
model is based almost exclusively on the assumption that it is possible to recog-
nise intentions (along with beliefs), something that Verschueren (1999: 48) rather
lucidly argues against.
It would be unwarranted to downplay the role intentions also play. An important
philosophical correlate of intentionality is ‘directedness’. Being directed at certain
goals is no doubt an aspect of what goes on in language use ([…]). But it would
be equally unwise to claim that every type of communicated meaning is directly
dependent on a definable individual intention on the part of the utterer. Such a
claim would be patently false. Just consider the Minister who has to resign after
making a stupid remark that was felt to be offensive, even if many people would
agree that it was not meant offensively. Or, at a more trivial level, look at the
exchange in (16).
(16) 1. Dan: Como is a giant silk worm.
		 Debby: Yukh! What a disgusting idea!
Dan’s innocent metaphor may simply be intended to mean that Como produces
a large amount of silk. But that does not stop Debby from activating a meaning
potential that was not intended at all. And by doing so, (16)1. really gets the
meaning Debby is reacting to. In other words, (16)1. does not simply have a
meaning once uttered (which would be the case if meaning were determined by
intentions).
One further drawback of taking a BDI approach is that it necessitates a deep
semantic analysis with access to a variety of different types of linguistic – and
possibly encyclopaedic – information, and thus by necessity needs to be based on
ideas of relatively strict compositionality, a notion that would seem to contradict
the basic assumption that pragmatics represents ‘meaning in context’, rather than
‘dictionary meaning’.
The second form of analysis/identification Jurafsky identifies is what he calls
the cue-based model.
In this alternate CUE model, we think of the listener as using different cues in
the input to help decide how to build an interpretation. […] What characterizes a
cue-based model is the use of different sources of knowledge (cues) for detecting
a speech act, such as lexical, collocational, syntactic, prosodic, or conversational-
structure cues. (ibid.: 587–8)
In other words, in the cue-based model, we are dealing more or less exactly with
IFIDs as defined by Searle, although Jurafsky claims that this approach is – unlike
the plan-based one – not grounded in “Searle-like intuitions” (ibid.), but
[…] draws from the conversational analytic tradition. In particular, it draws
from intuitions about what Goodwin (1996) called microgrammar (specific
lexical, collocation, and prosodic features which are characteristic of particular
conversational moves), as well as from the British pragmatic tradition on
conversational games and moves (Power 1979). (ibid.)
There thus clearly seems to be a misunderstanding regarding the potential ori-
gins of the cue-based approach, especially also as the term microgrammar never
appears in the article by Goodwin referred to above. Be that as it may, the pre-
sumed origin of this approach – which is generally also the one followed in the
present methodology – is not the real reason why one might want to disagree with
computational linguists like Jurafsky in employing cues for the identification of
speech acts. Rather, it is the way in which speech acts are in fact recognised by
them, which is largely through manual labelling of examples, followed by machine
learning to derive possible cues, and then applying probabilistic techniques to
identify the latter, thereby arriving at a speech act assignment. Although proba-
bilistic methods have relatively successfully been employed in morpho-syntactic
tagging (cf. Marshall 1987) and other areas of linguistic analysis, it is well-known
that they generally suffer from a sparse data problem (cf. Manning  Schütze
1999: 195ff.). This essentially means that they can only work reliably if trained
on a fairly large amount of existing data, something which is usually not available
when moving from the analysis of one particular domain to another, and using
potentially relatively short stretches of text. Furthermore, probabilistic approaches
also represent somewhat of a black box which is likely to conflate domain-specific
patterns and generic structures induced through the machine learning techniques
(cf. Weisser 2015). In other words, what should be common across a variety of dif-
ferent domains – and hence indicate common features of human interaction – may
frequently not be easily extractable from the machine-learnt patterns to generalise
from in order to re-use this information. This is where the methodology used in
this book introduces some very distinct advantages. It (a) specifies the cues to be used as linguistically motivated and transparent patterns and (b) tries to identify
and use as many of the generic elements that exist on the different linguistic levels
as possible, so that these can then be adapted or augmented as necessary when
introducing a new domain into the analysis routines.
The practical corpus-linguistic approach discussed here, along with the theoretical issues involved in creating pragmatically annotated corpora, has also led me to develop ideas as to which kinds of components may be useful or necessary to
incorporate into a research tool that supports the types of analysis and annotation
mechanisms discussed here. Incorporating such features into a research tool is also
of great importance in corpus linguistics because general linguistic analysis tools
like most concordancers typically do not provide the functionality required to inves-
tigate multiple linguistic levels at the same time. This is why, along with the general
theoretical and practical sides of dialogue annotation on multiple levels, I will also
introduce one particular research tool here, called DART (Dialogue Annotation and
Research Tool; Weisser 2016b), designed by me for this specific research purpose.
In any computational analysis of communication, there are different 'mechanics'
at work, and those also require different types of computational treatment. One
is to do with identifying, storing, and retrieving the right types of information to
make it possible to capture and look them up again in whichever data structure(s)
one may choose to use for this purpose. The other is to identify the necessary pat-
terns in order to label these units and their content appropriately, consistently and
reliably. Since the latter essentially consists in pattern identification and matching
operations, it can be achieved quite efficiently by using finite-state technology in the
form of regular expressions (cf. Weisser 2009: 69–79 or Weisser 2016a: 82–101 for
overviews of or introductions to their use in linguistic analysis).
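As a brief illustration of what such finite-state pattern matching might look like in practice, the following sketch applies two regular expressions to classify short units; the patterns are deliberately simplified assumptions and cover only a fraction of the variants the actual analysis routines need to handle (cf. Chapters 4 and 6).

import re

# Deliberately simplified, illustrative patterns for two unit types.
YES_UNIT = re.compile(r"^(?:yes|yeah|yep)[,.]?$", re.IGNORECASE)
WH_QUESTION = re.compile(r"^(?:what|when|where|who|why|how)\b", re.IGNORECASE)

for unit in ["yeah", "what time does it leave", "to London"]:
    if YES_UNIT.match(unit):
        print(unit, "->", "yes-unit")
    elif WH_QUESTION.match(unit):
        print(unit, "->", "wh-question")
    else:
        print(unit, "->", "unclassified fragment")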
Although the research described here is primarily based on English data, I will
also occasionally draw on materials from other languages, so that it will hopefully
become clear to what extent the theoretical concepts underlying the approach
are transferable. Any real in-depth discussion of these other languages is beyond
the scope of this book, though, so my treatment will only remain exemplary and
sometimes even superficial.
In view of the complexities inherent in the construction of meaning discussed
above, and the relative dearth of concrete research into these from a large-scale
empirical point-of-view, I intend to investigate the following research questions in
this book in order to fill these particular gaps:
1. Which levels of meaning can we distinguish within dialogues pertaining
to different domains, some more restricted (e.g. call-centre interactions or
problem-solving tasks), and others more open (e.g. data drawn from the
Switchboard Corpus)?
2. How can we classify and describe these levels of meaning in order to relate
them to the identification of pragmatic force/speaker intentions?
3. What would a suitably generic taxonomy of speech acts look like and what
does it need to cover?
4. To what extent is a large-scale automated pragmatic annotation feasible and
how could this incorporate different levels of (in-)directness?
5. How can such an annotation make truly corpus-based pragmatics research
possible?
1.4 Outline of the book
Having outlined and put into focus the basic framework, it is now possible to
proceed to looking at how the task of annotating dialogues can be achieved,
which individual steps or particular resources are required for enriching cor-
pora with pragmatics-relevant features, and, once the annotation has been
completed, how such corpora can be used in order to draw important con-
clusions about the various communicative strategies employed by individual
speakers or across different corpora from various domains. Through(out)
these discussions, I will try to demonstrate why the methodology adopted here
provides distinct advantages over ‘traditional’, and more complex, approaches,
and hence represents a major step forward in empirical research into spoken
corpus pragmatics.
Yet, no approach or method is completely without caveats, and this one is no
exception. As the original data the approach was developed on contained no pro-
sodic information other than pauses, and I also had no access to any of the audio
data, a high degree of manual pre- or post-processing, including some interpreta-
tion of the data, was necessary in order to allow the analysis routines to recognise –
and thus categorise on the different levels – all the relevant units automatically, and
with a high degree of accuracy. Not doing so would have led to unnecessary inac-
curacies in the annotation, especially if the particular textual units concerned were
either very long, and therefore had to be broken up into smaller units of content,
or – as is the case with declarative questions – it is only the prosodic characteristics
that permit the analysis routines to disambiguate between the potential functions
offered by the syntactic structure. Thus, to avoid these issues, the original data
were, as far as it was possible without making reference to any audio, broken down
into functional units, and then enriched with information pertaining to unit-final
prosody, as described in Section 2.3.2.
Before actually moving on to the main chapters, a brief overview of the main
parts of the discussion is in order. Chapter 2 will be concerned with linguistic
data on the computer in the form of corpora to be used for pragmatic analysis,
analysis,
also discussing important issues in their design and handling in general, as
well as basic and specific issues in text representation and annotation. The next
chapter, Chapter 3, will provide a brief introduction to the corpora used for this
study, the analysis tool DART, as well as the necessary computational resources
required to achieve the annotation task. The syntax of spoken language, its units,
and its peculiarities that necessitate special computational treatment are cov-
ered in Chapter 4. This chapter also contains a comparison of the distribution of
syntactic categories and their basic communicative functions across the corpora
used. Descriptions of the levels of semantics and semantico-pragmatics, which
make important contributions to the realisation of speech acts, form the main
substance of Chapter 5, while Chapter 6 will present a brief overview of the
largely automated annotation process in DART. In Chapter 7, I shall discuss
further results of the research in the form of a discussion of the DART speech-
act taxonomy, again including a comparison of the distribution of the various
acts across the different sets of data, thereby also illustrating the applicability
of the annotation scheme towards establishing functional profiles of the cor-
pora used for this study. Chapter 8 will then round off with a conclusion and
outlook towards further potential improvements and future applications of the
DART methodology.
1.5 Conventions used in this book
In this book, I use a number of conventions that have either been established in
linguistics in order to help us to distinguish between different levels of analysis
and/or description, or to indicate special types of textual content relevant to the
presentation, for instance to distinguish between text and computer codes, etc.
Double quotes (“…”) are exclusively used to indicate direct speech or short
passages quoted from books, while single quotes (‘…’) signal that an expression
is being used in an unusual or unconventional way, or that I am referring to the
meaning of a word or construction on the semantic level. Curly brackets ({…})
represent information pertaining to the level of morphology, whereas angle
brackets (<…>) indicate specific spellings, to contrast these with phonetic/
phonological representations of words. Furthermore, they also occur as part
of the linguistic annotation introduced in the book. Paired forward slashes/
square brackets generally indicate phonological or phonetic representations.
Within quoted material, the latter may also signal amendments to the original
material made in order to fit it into the general sentence structure.
Italics are generally used in linguistics to represent words or expressions,
sometimes whole sentences, that illustrate language materials under discussion.
In some cases, they may also be used to indicate emphasis or highlighting. In addi-
tion to this, I use italics to indicate specific terminology and speech act labels.
Small caps are used to indicate lemmas, i.e. forms that allow us to conveniently
refer to all instances of a verb, noun, etc. Last, but not least, monospaced font
indicates computer code or annotations.
Chapter 2
Computer-based data in pragmatics
The acquisition or representation of linguistic material in electronic form always
brings with it a number of different issues. Transcribing or transforming the data
into a form that meets one’s research aims is often one of the most time-consuming
and major parts of creating research materials, and a substantial amount of time
and resources needs to be allocated to this task before one is actually in a position
to analyse the data itself (see Weisser 2016a for a practical introduction to these
issues). Therefore, before beginning our exploration of corpus-based pragmatics,
I will provide a brief survey of existing technologies and issues surrounding the
handling of the ‘raw material’ involved.
2.1 Linguistic corpora and pragmatics
Today, there is an abundance of electronic corpora designed for many different
purposes (cf. McEnery et al. 2006: 59ff.). Evidently, not all of these are equally
suitable for different types of language analysis, especially not the type of prag-
matic analysis of spoken language discussed here. The most general of these cor-
pora, reference corpora, cover a large amount of naturally occurring written or
spoken data from a variety of different domains, which, in theory, makes them
representative of a given language as a whole in terms of vocabulary, syntax, and
also many pragmatic aspects of language use. Yet the earliest such corpora, the
American BROWN (Francis & Kučera 1979) and its British counterpart, the
LOB (Lancaster-Oslo/Bergen; Johansson, Leech & Goodluck 1978) corpus, were
hardly representative in this sense yet, as they ‘only’ contained one million words
of written text each. In the 1960s, when both of these corpora were collected, this
seemed like a very large amount of data, and written language was still assumed
to be more important than its spoken counterpart. Since then, however, it has
become clear that a balanced corpus needs to contain suitable amounts of both
written and spoken language, and that 1 million words are hardly enough to
capture many of the interesting phenomena to be observed in language, espe-
cially when it comes to identifying collocations, idioms, and other rarer forms of
language. Sinclair (2005) provides a fairly detailed exploration as to how much
data may be required to account for various types of such analyses, and how the
size of corpora required for investigating especially longer sequences of words
may increase exponentially. To fulfil such needs, corpora of ever-growing sizes
are being produced to cover these gaps, and this is why modern mega-corpora,
such as the British National Corpus (BNC), already contain 100 million words,
subdivided into 90 million words of written and 10 million words of spoken lan-
guage for the BNC, where of course the latter segment is most relevant to our
research. Other mega corpora for English, like the Corpus of Contemporary
American English (COCA; Davies 2009), are even larger, but their content does not cover
the same range of spoken language as the BNC, in particular not where uncon-
strained natural dialogue is concerned.
Spoken electronic corpora appropriate for conducting research in pragmatics
have existed at least since the publication of the London-Lund Corpus of Spoken
English (LLC; see Svartvik 1990 or http://clu.uni.no/icame/manuals/LONDLUND/INDEX.HTM)
in 1990. A number of interesting studies on spoken interaction and
its properties – such as Stenström 1994 or Aijmer 1996 – have been undertaken
based on it. One major advantage of this 500,000-word corpus is that it contains
detailed prosodic information that makes it possible to study nuances of attitudinal
stance of the individual speakers in detail, rather than ‘just’ presenting the syntac-
tic, lexical and structural information of the ongoing interaction. At the same time,
this detailed prosodic information makes it very difficult to work with the corpus
data and to perform automatic analyses of the kind discussed in this book on it, as
the prosodic information is integrated into the transcription in such a way that it
becomes difficult to recognise the ‘shapes’ of the individual words easily, as they may
contain unusual ‘accented’ characters to indicate tone movement or other prosodic
markers. A brief example from the beginning of the first file of the LLC is shown
below, but there will be more to say on these issues in Section 2.3.1.
1 1 1 10 1 1 B   11 ((of ^Spanish)) . graphology#      /
1 1 1 20 1 1 A   11 ^w=ell# .                          /
1 1 1 30 1 1 A   11 ((if)) did ^y/ou _set _that# -     /
1 1 1 40 1 1 B   11 ^well !Joe and _I                  /
1 1 1 50 1 1 B   11 ^set it between _us#
Figure 2.1 Sample extract from the London-Lund Corpus
One further potential drawback of the LLC corpus is that it only reflects the speech
of “adult educated speakers of English” (Svartvik 1990: 11), so that some of the
features of more general spoken English may be missing from the data. A corpus
that implicitly seeks to redress this problem is the two million-word CANCODE
(Cambridge and Nottingham Corpus of Discourse in English)1, as it was “targeted
towards informal encounters and were made in a variety of settings, such as in
people’s homes, in shops, restaurants, offices, and informal university tutorial
groups, all in the British Isles” (McCarthy 1998). The main disadvantage of this
corpus, however, is that it is not generally available, so that it is also not possible to
replicate any studies based on it.
In theory, this would then probably leave the spoken part of the BNC, due to
its relatively easy accessibility, wide coverage, and size of data, as an ideal candidate
for pragmatic analysis. However, in practice, even if one were to limit the selec-
tion of data chosen from such a large corpus in a sensible way, initially analysing
data from such relatively unrestricted domains computationally would pose prob-
lems in terms of the lexical coverage an analysis program would have to offer.
In addition, as we shall see in the next section and also Section 3.1.4, some seri-
ous problems exist in the spoken part of the BNC. Thus, it is probably best to
start designing an analysis program or methodology on the basis of corpora from
relatively restricted and clearly defined domains. This is in fact the approach that
was taken for the original research behind this study, and also the reason why
other projects or efforts aimed at designing computationally tractable methods of
analysis for pragmatic data have generally been restricted to smaller corpora and
limited domains. In contrast to most previous efforts, though, one of the explicit
aims in the design of the methodology employed here was to allow for an extensi-
bility to different domains right from the start by making use of generic elements
(cf. Weisser 2002) to implement the core functionality, but also allowing further
resources to be added later.
One classic exemplar of a dedicated spoken corpus is the 146,855-word HCRC
Map Task Corpus.2 According to the classification scheme established in Leech
et al. (2000: 6ff.), this corpus can be categorised as task-oriented and task-driven.
In other words, it represents a specific type of dialogue corpus where two or more
interlocutors negotiate or interact in order to achieve a specific task. The particu-
lar task in this case consists in finding a route to a target based on two maps that
contain partly identical and partly differing information. The MapTask corpus was
specifically designed to investigate features of relatively informal interaction on a
number of linguistic and other levels, such as speaker gaze, general communica-
tive strategies, etc. (cf. Anderson et al. 1991), and has been marked up (see 2.2
below) for a number of these features.
1. See https://www.nottingham.ac.uk/research/groups/cral/projects/cancode.aspx for a
list of publications related to this.
2. See http://groups.inf.ed.ac.uk/maptask/ for more details.
Other corpora that have been designed and used in the context of research
on the computational analysis of dialogues – mainly in the context of develop-
ing dialogue systems – include materials from the domains of travel information
(SUNDIAL, ATIS, etc.), transport (Trains), business appointments (Verbmobil),
etc. (cf. Leech & Weisser 2003: 149). However, apart from the earlier Trains cor-
pus materials from 1991 and 1993, data from such projects is relatively difficult
to obtain.
Flöck and Geluykens (2015: 9) claim that “there are no corpora available that
are tagged for individual illocutions or even illocutionary types”. However, this
claim is certainly not true, as, despite a relative dearth of pragmatically annotated
corpora, a few corpora containing speech-act related information have been in
existence for a number of years. Amongst these are the SPAADIA (ver. 1 released
in 2013, ver. 2 in 2015), the Trains 93, and one version of the Switchboard Corpus,
data from all of which was used to some extent in this book (see Section 3.1
below), as well as the Coconut and Monroe corpora. For more details on the
original annotation of these corpora, as well as a comparison of their annotation
schemes, see Weisser (2015). The MapTask corpus already mentioned above also
contains annotations that are somewhat similar to speech-act labels, but referred
to as moves.
The MICASE Corpus (Simpson et al. 2002) has also been marked up with
information about pragmatic features, including a sub-corpus that contains 12
pragmatic tags (Leicher & Maynard 2007: 112). However, instead of reflecting
generic concepts, the tag labels used there often represent highly domain-specific
functions, such as “AHW Assigning homework” or “IRM Introductory Roadmap”
(ibid.: 112), and the annotated materials – to the best of my knowledge – have
never been made available publicly.
2.2 Issues and standards in text representation and annotation
As already hinted at in the prior discussion, having electronic data in a suitable
format is of utmost importance for anything but a cursory analysis and genera-
tion of superficial hypotheses. This is why, in this section, an overview of the most
important issues that apply to the design, representation and handling of corpora
in language analysis shall be provided, beginning with a ‘plea for accuracy’ in
recording the original data used for corpus compilation, as this is a feature that is
highly likely to affect any subsequent analysis to a very large extent.
Spending a considerable amount of time on producing ‘clean’ data – not
in the Sinclairean sense of being annotation-free, but free of potential errors
due to typographical or encoding issues (cf. Weisser 2016a: 4–5 & 56–57),
though – may sometimes seem an unnecessary effort, just in order to conduct a
small-scale project. However, it is certainly time well-spent, as one never knows
to which use the data may be put later on and how badly represented data may
affect the outcome of any type of analysis. For instance, many researchers use
the BNC for their work on British English because it is the major reference cor-
pus for this variety, and highly useful research can be conducted on it in many
areas, especially through powerful interfaces such as BNCweb (http://bncweb.lancs.ac.uk/)
or the BYU-BNC one created by Mark Davies (https://corpus.byu.edu/bnc/).
Its compilation certainly also represents a most laudable and worthy
effort, but if one takes a closer look at some of the spoken data and how it was
transcribed, one cannot but wonder how much of an error may be introduced
into any numerical analysis conducted on it simply due to the fact that its tran-
scribers seem to have been relatively unqualified, and thus often did not seem to
know where to use an apostrophe or not, apart from generally being somewhat
insecure about their spelling. For example, in BNC file D96 alone, which I re-
transcribed from the audio provided in BNCweb, there is an alarmingly high
number of instances of the contraction we’re that were simply transcribed with-
out an apostrophe, plus a number of other rather serious transcription errors.
These issues can easily be seen in the excerpts provided below, where deleted or
replaced items are marked as struck through, and replacements or insertions
indicated in bold italics:
I mean, a lot, what I can say with on the youths, I mean, I think we’re were doing,
we were, we’re were, we’re were working walking with young people at the local
levels of various places in the town you know, we’ve got, we haven’t got as many
resources as we want yet, but we’re were still trying to do that, well I actually feel,
on youth we’re doing quite a good job you know, extensive expensive job you
know, that we are, and, and all the that concerns you raise, we’re were certainly
aware of.
The problem is solving all the problems, providing all the facilities in, in a the
situation where it’s diminishing resources, I mean we wouldn’t be actually be
carrying out this frontline review, in the way that we’re were gonna do it, if we
didn’t have the problem with the money we’ve got, you know.
[…] We’re with still not losing loosing site sight of the idea of having a cafe,
bar, coffee for Heyham people, one of the things that we’re were, that gonna
look to through and explore explore actually is er setting up some kind of
coffee bar facility facilities at Kingsmoor, with the play farming barn, there
next to them.
[…] At the, at the last search, at the last highways committee, although we’re were
not having, having, having the, the full service that we at the envisage in the first
instance, a lot is going to be done, there’s is going to be some more erm shelters
 How to do corpus pragmatics on pragmatically annotated data
erected directed there and one or two other facilities and somebody has even
suggested that we put a toilet there which is a very good idea.
[…], but anyway any way, there will be some improvements for the bus station
in the future.
[…] Yeah, you’re your off the hook.
[…] Right, we’re were now on other reports. Anybody Any body got anything any
thing else to report, with got a few minutes left? Yes, no, any other business, you
can all go, you’re your all off the hook.
[…] Don’t go overdoing over doing it.
Although the extract above probably already provides a fairly striking impression of
the severity of the problem, let us take another look at the overall discrepancies in
numbers between the original BNC version and my corrected one, which may still
contain errors, due to intelligibility issues that perhaps no transcriber can resolve.
Table 2.1 BNC D96: Discrepancies between original and edited version

Unit                    Original BNC version    Corrected version
w-units(a)/words        839                     902
u-units(b)/turns        40                      35
s-units(c)/c-units(d)   51                      162
punctuation             160                     138
insertions              –                       14
deletion(s)             –                       1
corrections             –                       56

a. word
b. utterance
c. sentence(-like)
d. clausal and non-clausal (Biber et al. 1999: 1070; cf. 4.1)
The information regarding the original units in the first four rows of Table 2.1 was
taken directly from the header of the BNC XML file. The relative discrepancy
in terms of w-units/words is partly due to insertions of materials that were either
marked as unclear in the original, but I was able to discern from the audio after all,
or corrections where the marking of previous erroneously non-marked contrac-
tions resulted in two words being present, rather than just one. The latter applies,
for instance, to all cases of were in the original transcript that should have been
transcribed as we’re. The difference in u-units/turns can be explained mainly by the
fact that, occasionally, turns by the same speaker were split over multiple u-units,
in particular if some event, such as background laughter, occurs in between. With-
out more detailed transcription guidelines, however, it is difficult to ascertain the
exact reason for this.
A similar phenomenon partly explains the divergence in the number of punc-
tuation marks, as the transcriber of this particular file seems to have added punc-
tuation marks even after event descriptions, in other words, non-textual material,
even if this does not really make any sense at all. Although the number of c-units
in the corrected version is higher than that of the s-units in the original, which
would normally lead us to expect a higher instance of punctuation marks in the
former, there are a number of reasons why this is not the case. First of all, in the
BNC, ‘sentence-like’ units are marked in a rather haphazard way, where often the
end of functional units is marked by a comma, rather than a major punctuation
mark, as can easily be seen in the first paragraph of the extract we saw earlier. Most
of these were deleted in the conversion process, but were partially replaced by
‘phono-pragmatic’ punctuation tags (see 2.3.2 for more information) in the cor-
rected version. In addition, multiple functional units are often conflated into one
s-unit in the BNC, while the DART scheme employs a more fine-grained system
of syntactic/functional units (see Chapter 4), which accounts for the considerable
difference in number between s- and c-units in the two versions.
What is perhaps more important than the differences in the individual units
discussed above is the number of errors presented in the final three rows of
Table 2.1. Added up, insertions, deletions, and other corrections account for 71
word tokens. If we see this number relative to the original 839 tokens, we arrive
at an error rate of 8.5%. Applying the customary, albeit arguably incorrect (see
Weisser 2016a: 175), method for frequency norming and extrapolating to instances
per 10 million words, we could then potentially expect to find 846,250 word-token
errors in the whole of the spoken part of the BNC!
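To make the arithmetic explicit, the following minimal Python sketch – with purely illustrative variable names – reproduces the norming calculation just described:

# Error counts from Table 2.1 for BNC file D96
insertions, deletions, corrections = 14, 1, 56
errors = insertions + deletions + corrections    # 71 word tokens in total

original_tokens = 839                            # w-units in the original file
error_rate = errors / original_tokens            # ca. 0.0846, i.e. roughly 8.5%

# customary (if arguably incorrect) norming: extrapolate to a base of
# 10 million words, yielding roughly 846,250 potential word-token errors
per_10m_words = error_rate * 10_000_000
print(f"error rate: {error_rate:.2%}; per 10m words: {per_10m_words:,.0f}")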
The problems illustrated above, occurring within such a relatively short space
of text, will of course not only skew the general results of any (word) frequency
counts, but also influence any such counts based on word classes, as well as the
comparative counts that seek to illustrate the differences between spoken and writ-
ten language.
Although, in terms of pure frequency counts of word classes, some of these
errors may actually balance out each other in that a lack of an apostrophe in one
place may be compensated by an additional erroneous one elsewhere, the above
observations should lead us to raise serious doubts about the validity of many
frequency counts obtained from large reference corpora. This especially ought to
be the case if these corpora have been collected very quickly and only few people
have been involved in their compilation and processing, such as may possibly be
the case with the final version of the American counterpart to the BNC, the Open
American National Corpus (ANC),3 which, from its inception, was hailed and
marketed as an enterprise in ‘efficient’ corpus collection.
The problems highlighted above simply indicate that the issue of homographs or
potentially occurring word forms that have been misrepresented or represented as
unclear is something that mere use of a spell-checker will not eradicate, and there-
fore a close reading and potential manual post-editing of transcriptions cannot be
avoided, unless appropriate care has been taken to ensure that the data has been
transcribed extremely thoroughly in the first place. And even then, occasional errors
that were either overlooked during the compilation phase or might have been intro-
duced by unforeseen side effects of any computer programs used to process the data
cannot be discounted and may always have at least a minor influence on any kind of
so-called ‘statistical’, i.e. frequency, analysis of texts. This might not seem much of a
problem if ‘all’ we are interested in is the frequencies of words or their distributions,
but, in the context of computational dialogue analysis, it may well affect the creation
of domain-specific lexica required to do the processing (cf. Section 3.3.3).
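Since forms such as were (for we’re) or your (for you’re) are perfectly valid words, a spell-checker will simply pass over them. A first, admittedly crude, screening step – sketched below in Python, with a purely illustrative word list and file name – is therefore to extract all occurrences of such known confusables, together with a little context, for manual inspection:

import re

# homographs that frequently conceal mistranscribed contractions (illustrative)
CONFUSABLES = {"were", "your", "well", "its"}

with open("D96.txt", encoding="utf-8") as f:
    text = f.read()

for m in re.finditer(r"[A-Za-z']+", text):
    if m.group().lower() in CONFUSABLES:
        # print a small context window so a human can judge each case
        left, right = max(0, m.start() - 30), m.end() + 30
        print(f"{m.group():>6}: ...{text[left:right]}...")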
However, it is not only frequency counts that may be affected by a somewhat
careless preparation of corpus materials. Perhaps more importantly, when perform-
ing a syntactic analysis of a particular unit of text, the difference between an apos-
trophe being present or not may in fact prevent us from recognising a declarative
structure and mistaking it for an ill-formed syntactic structure – later referred to as
fragments (cf. Section 4.3.9) – and where the latter may be much more difficult or
impossible to interpret in its function, as in the case of “your off the hook” – instead
of the correct you’re off the hook – from the BNC sample above.
To summarise: the importance of using data that has been created with the
utmost care and being aware of the content of this data is not only relevant for pro-
ducing or extracting high-quality information from our corpora, but also in order
to be able to form the right research hypotheses and come to the right conclusions
about them. This is a fact that all too often seems to be ignored in the quest for
ever-increasing amounts of corpus data that can be collected and prepared for dis-
semination in a maximally efficient and inexpensive way. In other words, quality
should always remain more important than expedience.
Having demonstrated how important it is to work with clean data, as well as
to have some kind of expectation about which types of issues may be encountered
in electronic data, we can now move on to discussing the basic means of render-
ing the data in a faithful way, and adding additional useful types of structural and
linguistic information to it. In this context, it will first be necessary to introduce
or discuss some important terminology that enables us to describe the essential
3. http://www.anc.org/
concepts behind representing linguistic data on the computer, and more specifi-
cally, how it can be ensured that the method of representation employed is one that
as many potential users of the data as possible will be able to understand and make
use of, e.g. for interpreting and verifying the results of the analyses. Discussing
these issues at this point is of vital importance because, traditionally, logic-based
and philosophically oriented pragmatics, unlike CA, does not normally pay much
attention to the nature of real-life data and the forms it may occur in, but rather
‘abstracts away’ from the ‘messiness’ of naturally occurring data. It does so either
by constructing examples or simply leaving out ‘performance’ details that seem
to be irrelevant to explaining the underlying problems encountered in (re)con-
structing the logical form of an ‘utterance’. Nonetheless, any kind of corpus-based
pragmatics definitely needs to take heed of these problems, as ignoring them may
lead to incomplete, or even incorrect, analyses.
Having explicit standards in representation and annotation is not only impor-
tant for handling language data in more industrial settings, such as for language
engineering purposes. Setting, understanding and adhering to these standards
also enables researchers to make the nature of their data maximally explicit, and
the enriching annotation as far as possible self-describing. The following sections
will introduce the most important concepts in the representation and annotation
of language data, and make the necessity for sensible standards explicit by show-
ing where there have been problems in interpreting inconsistent and difficult, or
perhaps unnecessarily fine-grained, coding schemes in the past, thereby making
it difficult to understand and interpret data produced by different researchers or
research teams. Further testimony to the fact that representation and annotation
in describing language data are important issues concerning the interpretation of
such data is provided by the very fact that books like Edwards and Lampert’s Talk-
ing Data: Transcription and Coding in Discourse Research (1993) even exist. In
addition to demonstrating how important these features are to rendering language
information in general, I will also point out how much more of a necessity for
using standardised text rendering methods there is when it comes to analysing and
processing language on the computer.
Amongst the first terms one is likely to come across in the context of corpora
are markup (also mark-up) and annotation. Edwards (1993: 20) still makes a dis-
tinction between the two terms, also introducing two synonyms for annotation,
coding and tagging:
‘Coding’ (also called ‘tagging’ or ‘annotation’) differs from transcription in its
content and degree of structuring. Rather than capturing the overtly observable
acoustic and non-verbal aspects of the interaction, coding focuses on events
which bear a more abstract relationship to each other, that is, on syntactic,
semantic and pragmatic categories. […]
‘Mark-up’ originated in the marks used by typesetters to signal the structural
units and fonts of a document. As defined here, it concerns format-relevant
specifications intended to be interpreted by a typesetter or computer software, for
proper segmentation of the text and cataloguing of its parts, in the service of
formatting, retrieval, tabulation or related processes.
As we can see here, Edwards draws a fairly clear distinction between signalling
structural segmentation of the text (markup) and adding or enriching data by
making explicit other types of information that are only implicit in the text (anno-
tation). However, today the two terms are often used synonymously, as can be seen
from the entry in the EAGLET Term database, which defines annotation in the
following way:
annotation /ænəˈteɪʃən/, /{n@'teIS@n/, [N: annotation], [plural:
-s]. Domain: corpus representation. Hyperonyms: description, representation,
characterisation. Hyponyms: part of speech annotation, POS annotation,
segmental annotation, prosodic annotation. Synonyms: labelling, markup.
Def.: 1. Symbolic description of a speech signal or text by assigning categories to
intervals or points in the speech signal or to substrings or positions in the text. 2.
Process of obtaining a symbolic representation of signal data. (2) The act of adding
additional types of linguistic information to the transcription (representation) of
a text or discourse. 3. The material added to a corpus by means of (a): e.g. part-of-
speech tags. (Gibbon et al. 2000: 375)
In practice, though, it probably pays to look more closely at the words that tend
to collocate with annotation, markup, and also the third term mentioned by
Edwards, tagging, as well as the actions that may be associated with them. As is
also partly implicit in the definition from the EAGLET Term database, annotation
often refers to the process of enriching corpus data in specific ways, and we thus
often talk about corpus or dialogue annotation. The term tagging, however, gener-
ally tends to occur in phrases, such as POS (part-of-speech) tagging, which almost
exclusively refers to the action or result of adding morpho-syntactic (or word-class)
information to the words in a text/corpus. And last, but not least, the term markup
is generally used in such expressions as SGML/HTML/XML markup, which refer
to the ‘physical’ or computer-related representation of materials at various levels,
not only at the segmental4 level referred to by Edwards above.
Having clarified some terminological issues, I will now provide a brief intro-
duction to the general means preferred by corpus or computational linguists to
achieve the kind of markup referred to last, beginning with a brief historical – and
4. Segmental here essentially means ‘structural’ and is not to be confused with the term
segmental as it is used in phonology.
potentially somewhat simplified – overview of the development and linguistic util-
ity of some of the more important markup languages, before discussing the particu-
lar requirements and proposed standards that exist for dialogue annotation. This
discussion will be split into two sections, where the first one deals with more gen-
eral aspects of markup on the computer, while the second will discuss linguistics-
oriented markup.
2.2.1 General computer-based representation
The original markup language of choice for linguistic purposes was SGML (Stan-
dard Generalized Markup Language). This language developed out of attempts
to standardise means of exchanging information in the 1960s (Bradley 1998: 6).
Anyone who has ever had to struggle with problems related to different propri-
etary document, sound, or graphics formats will easily understand that standardi-
sation is an important and commendable effort because it ensures transparency
and transportability. However, SGML itself, the first standard in this respect, was
only fully ratified by the ISO (International Standards Organization;
http://www.iso.org/iso/home.htm) in 1986 (ibid.), and even though it was widely adopted by
various research communities, has not ‘fulfilled all its promises’. Thus, these days,
it has largely been replaced by XML (eXtensible Markup Language), which is more
flexible, even though it still has not eliminated some of the original issues.
The basic idea in all markup languages that are related to, or derived from,
SGML is that the content is stored in plain text format, meaning in largely human-
readable, non-binary form, while structural and basic category information is
marked up through so-called elements. Elements are also sometimes referred to as
tags, but should of course not be confused with the kind of tags employed in many
tagged corpora to mark up morpho-syntactic information.
So as to be able to easily distinguish elements from the raw text data, elements
tend to be represented in angle brackets (<…>), where the opening bracket (<) is
immediately followed by the name of the element. This name may reflect a text-
level, syntactic, morpho-syntactic, etc., category or sub-category. There are essen-
tially – and conceptually – two different types of elements, those that surround or
delimit specific divisions or categories, and those which mainly represent processing
instructions to a computer and may reflect particular types of formatting, such as
line breaks, used to link in or include external content, or express non-structural or
non-hierarchical content. The former tend to enclose the marked up information in
paired tags, where the closing one, to indicate the end, contains a forward slash (/)
between the opening angle bracket and the element name, thus yielding something
like <element name>element content</element name>. A processing instruc-
tion, because it is a ‘one off’ command, consists of only a single, unpaired tag.
Elements may also contain additional attributes, following the element name
in the start tag. These usually specify the nature of the category expressed by the
element further, or may simply be used to provide a unique identifier, such as a
number, for the element. They tend to consist of an attribute name and an associ-
ated value, which are joined by an equals sign. Let us exemplify this to some extent
by looking at an excerpt from one of the spoken files from the original version of
the BNC (KCU), also pointing out some of the problems that tended to arise with
the use of SGML.
<u who=PS0GF>
<s n=0001><w RP>On<c YQUE>? </s>
</u>
<u who=PS0GG>
<s n=0002><w RR>Right<c YCOM>, <w PPIS1>I<w VM>'ll <w VVI>go <w CC>and
<w VVI>get <w AT1>a <w NN1>video<c YCOM>, <w RR>okay<c YQUE>? </s>
</u>
<u who=PS0GF>
<s n=0003><w UH>Yeah <w PPIS1>I <w VD0>do<w XX>n't <w VVI>know <w DDQ>what
<w VBZ>'s <w RP>on<c YSTP>. </s>
</u>
<u who=PS0GG>
<s n=0004><w RR>Alright<c YSTP>. </s>
</u>
Figure 2.2 Sample excerpt from the original SGML version of the BNC
As is evident from Figure 2.2, SGML uses a fairly standard notation for opening
tags, but unfortunately the sample text is not always consistent in indicating the
ends of textual elements, and often the start of a new element simply has to be
taken as a signal that the preceding element is now to be taken as closed. Thus,
the <u> (‘utterance’) and <s> (‘sentence’) elements in the example are explicitly
closed, whereas <w> elements are not. This type of ‘shortcut’, which is allowed in
SGML, makes processing it much more difficult than needs be and also much
more error-prone. Figure 2.2 also demonstrates that SGML is organised in a hier-
archical (tree) structure where certain elements can be nested within one another.
Thus, the sentences contain a number of words (<w>), but are themselves embed-
ded in <u> elements. The exact document ‘grammar’ is specified via a so-called
DTD (Document Type Definition).
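Purely by way of illustration – the actual BNC document grammar is far more elaborate – a minimal, XML-style DTD licensing just the structure of the fragment above might read:

<!ELEMENT u (s+)>
<!ATTLIST u who CDATA #REQUIRED>
<!ELEMENT s (w | c)+>
<!ATTLIST s n CDATA #IMPLIED>
<!ELEMENT w (#PCDATA)>
<!ELEMENT c (#PCDATA)>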
In our example, we can also see that all attributes occurring inside the start
tags are either not quoted, as is the case for the n IDs, which could cause parsing
problems if the attribute values contained spaces, or the attribute name is even
assumed to be explicit, as we can see in the examples of the PoS tags, where the
attribute name and the conjoining equals symbol are missing. Of course, this could
only work if this particular element were assumed to only ever allow a single type
of attribute. Thus, as soon as one might want to add e.g. a numerical ID for each
<w> element, one would first need to ensure that an appropriate attribute name is
added in front of the existing value.
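To illustrate this with the first pronoun of the sample – the attribute and ID names here being entirely hypothetical – the bare SGML token <w PPIS1>I would first need to be normalised to something like

<w pos="PPIS1" id="w042">I</w>

before a numerical ID could be accommodated alongside the existing PoS value.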
That the SGML annotation here is used for linguistic purposes can be under-
stood from the tags <s> and <w>, indicating ‘sentences’ and words respectively.
This kind of markup therefore seems to be quite appropriate for general, perhaps
more written language oriented rendering of linguistic material, in order to estab-
lish category sub-divisions down to the level of syntax.
Compared to its derivatives HTML and XML, SGML also has two other major
disadvantages, the first being that it absolutely requires a DTD specifying the
structure allowed for the document in order to allow any type of serious process-
ing, and the fact that it is not supported by any ‘standard’ browser software. On
the other hand, one big advantage, at least in comparison to HTML, is that a large
set of tag definitions/DTDs, such as for the TEI (see 2.2.3 below), were originally
designed for SGML, although nowadays more and more of these are being ‘ported’
to XML, too.
Although HTML is a direct descendant of SGML, it only provides a limited
set of tags, which on the one hand makes it less flexible than XML, but on the
other also much easier to learn. It is widely recognised by standard browser soft-
ware, and the DTDs are already built into these browsers, although they can also
be explicitly specified. HTML itself is largely standardised and also technically
extensible via CSS (Cascading Style Sheets; see below) to some extent, so that it is
already quite useful for the presentation and visualisation of linguistic content.
This extensibility is somewhat limited, though, which is why it is not really flexible
enough as a markup language for representing more complex linguistic data. It is,
however, possible to transform complex linguistic data encoded in SGML or XML
into a more simplified HTML representation for display in a standard browser.
XML is much more versatile than HTML because – as the attribute extensible
in the name indicates – it was designed to provide the ability to the user to com-
pletely define anything but the most basic language features. It is much easier to
process and far less error-prone than SGML because some of the shortcuts illus-
trated before are no longer allowed. All XML documents minimally have to be
well-formed. In other words, no overlapping tags (e.g. <b>…<i>…</b>…</i>)
as were possible to use in HTML are allowed, and end tags are required for all
non-empty, paired, elements. While the express prohibition of overlapping tags
makes it easier to check the well-formedness of XML documents, it may also pres-
ent a distinct disadvantage for annotating linguistic documents, as e.g. speaker
overlap – where one speaker in a dialogue starts talking while the other has not
finished yet – cannot be marked up using a container element, since this would
‘interfere’ with the hierarchical structure of speaker turns and their embedded
structural utterance units.
So-called empty tags/elements differ from their SGML equivalents in that they
have to be ended by a slash before the closing bracket, e.g. <element name/>.
They provide a work-around for the problem of overlapping tags, should it be
required to indicate overlap precisely because they can be given attributes that
signal its start and end, along with potentially some IDs if there should be mul-
tiple concurrent overlap sequences. Unlike with older forms of HTML, where case
did not matter, XML is also case sensitive, so that tags like <turn>, <Turn> and
<TURN> are treated as being different from one another.
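To illustrate the milestone technique just described – all element and attribute names here are hypothetical and not part of any particular scheme – a stretch of overlapping speech might be marked with paired empty elements whose shared ID ties the concurrent stretches together:

<turn speaker="A">and then we <o-start id="1"/>went to the station<o-end id="1"/></turn>
<turn speaker="B"><o-start id="1"/>mm yeah<o-end id="1"/></turn>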
The representation of individual letters – or characters, to be more precise –
on the computer may also be an important issue in linguistics, especially in
dealing with multi-lingual data or data that needs to include phonological infor-
mation. For dealing with ‘English only’, a very limited Latin-based character set
may appear sufficient. Originally, characters in English data were encoded in a
character set called ASCII (American Standard Code for Information Interchange)
and its later derivatives, but as computing technology spread across the world,
this presented problems in representing other languages that contain accented
characters, etc., as well as the occasional foreign word appearing in English texts,
such as fiancée. In order to overcome this problem, and be able to store char-
acters from different character sets in one and the same document, a universal
character encoding strategy called Unicode was developed. Unicode exists in a
number of different formats, the most widely used and flexible of which is called
UTF-8, which is the default assumed for XML files unless an alternative encod-
ing is specified. This, along with the fact that it is the format that Perl – the pro-
gramming language used for the implementation of the analysis tool discussed
in the next chapter – uses for its internal character representation, made a
combination of XML and UTF-8 the most logical choice for the encoding of the
data used for this study.
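Although DART itself is implemented in Perl, the practical upshot can be sketched in a few lines of Python; the file name is hypothetical, and the element and attribute names are those of the small sample shown in Figure 2.4 below:

from xml.etree import ElementTree as ET

# the parser honours the encoding given in the XML declaration, so
# UTF-8 characters such as the é in ‘fiancée’ survive intact
tree = ET.parse("sample.xml")
for word in tree.iter("word"):
    print(word.get("pos"), word.text)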
Leech et al. (2000: 24) still argued against the use of Unicode and recom-
mended to use a 7-bit ASCII character set for encoding most information, as this
was most widely supported in general at the time, so that, for example, the inclu-
sion of phonetic transcription details could only be achieved by using the trans-
literation format SAMPA (ibid.). However, as more and more operating systems,
browsers and even standard editors available on many different platforms now
widely support UTF-8, such transliterations or the use of special character entity
references for e.g. representing foreign characters like é (&eacute;) or escaping
umlaut characters – e.g. writing "u for ü, as was done for the Verbmobil data –
should these days no longer be necessary. Seeing the text in the way it was meant to
be represented in the respective writing or representation systems makes dealing
with, and processing, the data much more intuitive, and also allows researchers
to use established linguistic conventions, such as proper phonetic transcription,
instead of remaining caught up in unnecessary conventions that only stem from
the traditional anachronistic predominance of an American influence on data rep-
resentation on the computer.
XML, in contrast to HTML, describes content, rather than layout, so that the
rendering, i.e. the visual representation of a document, needs to be specified via a
style sheet, or otherwise the browser or application displaying it would not know
how to achieve this. If no style sheet is explicitly provided, most browsers will try
to render the XML content using their own default style sheets that at least attempt
to represent the hierarchical tree structure and often allow the user to expand and
collapse nested (embedded) structures. Other applications can at least display the
plain text, provided they support the given encoding. A screenshot of what the
hierarchical XML display inside a browser looks like is shown in Figure 2.3.
- <u who="PS0GF">
  - <s n="1">
      <w c5="AVP-PRP" hw="on" pos="ADV">On</w>
      <c c5="PUN">?</c>
    </s>
  </u>
- <u who="PS0GG">
  - <s n="2">
      <w c5="AV0" hw="right" pos="ADV">Right</w>
      <c c5="PUN">, </c>
      <w c5="PNP" hw="i" pos="PRON">I</w>
      <w c5="VM0" hw="will" pos="VERB">'ll </w>
      <w c5="VVI" hw="go" pos="VERB">go </w>
      <w c5="CJC" hw="and" pos="CONJ">and </w>
      <w c5="VVI" hw="get" pos="VERB">get </w>
      <w c5="AT0" hw="a" pos="ART">a </w>
      <w c5="NN1" hw="video" pos="SUBST">video</w>
      <c c5="PUN">, </c>
      <w c5="AV0" hw="okay" pos="ADV">okay</w>
      <c c5="PUN">?</c>
    </s>
  </u>
- <u who="PS0GF">
  - <s n="3">
Figure 2.3 Hierarchical display of XML in a browser window
Figure 2.3 contains the fragment from file KCU of the BNC depicted as SGML
earlier, and it is clearly visible that the markup has been suitably adjusted to make
it well-formed, with all start and end tags properly set, all attribute names given,
and all attribute values quoted. Furthermore, some additional attributes have been
added, where, according to the BNC User Reference Guide5 “hw specifies the head-
word under which this lexical unit is conventionally grouped, where known.”
“[H]eadword” here subsumes all paradigm forms associated with a particular word
form (or type), regardless of their PoS, so it is distinct from a lemma, which only
subsumes those forms of a paradigm that belong to the same PoS. For example, the
headword hand subsumes the nominal forms hand (sing.) and hands (pl.), as well
as the verbal forms hand (inf./base form), hands (3rd pers. sing), etc. Furthermore,
the pos-attribute now indicates a simplified PoS value, whereas the c5-attribute
provides more specific PoS information, based on the more elaborate CLAWS C5
tagset (Garside, Leech & McEnery 1997: 256–257).
Apart from the well-formedness criterion described above, the document
structure of an XML document can also be more rigorously constrained by
specifying either a DTD or a schema that it needs to conform with, in which
case we talk of a valid document. Issues of designing DTDs or schemas will not
be discussed here because they are fairly complex6 and the data structure used
for the DART annotation scheme is relatively simple, but a brief overview of
some of the rendering options for XML documents using style sheets will at least
be provided.
As can be seen in the illustration above, each XML document represents a
hierarchical structure. The outer ‘layer’ for this hierarchy – not shown above – is
represented by a container or ‘wrapper’ element that encloses all the nested ele-
ments. In the case of the DART annotation scheme, this element is aptly named
<dialogue>. This wrapper element is only preceded by a single special processing
instruction, the XML declaration, <?xml version="1.0"?>. This declaration
may also contain further attributes, such as the encoding or whether the document
is a standalone document or not, i.e. whether an associated external DTD exists.
Style sheets allow the author to present or publish material in a more appro-
priate format, for instance specifying line spacing, indentation, positioning, font
and background colours, etc. What may at first seem to only be a feature to make
the rendering of the textual and annotation materials look nicer does in fact have
its purpose because proper layouting and colour-coding may well help to enhance
the representation of the logical structure of documents, as well as to highlight
5. At http://www.natcorp.ox.ac.uk/docs/URG/ref-w.html
6. For more detailed information on this, see Carstensen et al. 2004: 140 ff.
certain facts about the content of a linguistic XML document. Thus, it e.g. becomes
possible to highlight information about the syntax or semantics of a particular unit
of text. Something similar to a style sheet is for instance used in the implemen-
tation of the analysis program discussed later to indicate the difference between
syntactic units, such as declaratives and interrogatives.
Below, a short XML sample without a style sheet is shown, followed by an
illustration of what it may look like when rendered using a simple style sheet.
<?xml version="1.0" encoding="UTF-8"?>
<sample>
<sentence>
<word pos="DET">This</word>
<word pos="BE">is</word>
<word pos="DET">a</word>
<word pos="N">sample</word>
<word pos="N">sentence</word>
<word pos="PUN">.</word>
</sentence>
</sample>
Figure 2.4 A short, illustrative, linguistic XML sample
Figure 2.5 A colour-coded XML sample
sample {display: block; margin-left: 5%; margin-top: 5%; font-size: 2em;}
[pos=DET] {display: inline; color: blue;}
[pos=BE] {display: inline; color: red;}
[pos=N] {display: inline; color: green;}
[pos=PUN] {display: inline; color: grey;}
Figure 2.6 A sample CSS style sheet
The first line ensures that every time a <sample> element is encountered, this is
displayed as a block-level element, a text block similar to a paragraph, with spac-
ing around it. Furthermore, just to ensure that the display is not ‘crushed’ against
the top and left-hand side, a margin of 5% of the page width is specified and, to
enlarge the text a little, a relative value (em) is defined for the font-size of the whole
page, which is effectively twice the default font-size the browser would use. The
next few lines specify that each time a pos attribute with either the value of DET
(for determiner), BE (a form of be), N (for noun), or PUN (for punctuation) is
encountered, whatever is enclosed in the corresponding element tag is displayed
inline, in other words, not as a separate block, and using the appropriate colour. If
you observe the XML and its corresponding style sheet-controlled output closely,
it will probably become evident that the browser has also automatically added a
space after rendering each inline element, something which was not part of the
original XML.
XSL, the style sheet language developed for use with XML, provides similar
options to CSS for formatting XML display, but also much more complex selection
mechanisms and allows reuse of ‘text objects’, e.g. for producing tables of con-
tents from elements marked up as headings, etc., via XSL Transformations (XSLT).
Layout design for other (printed) media is also supposed to be enhanced through
XSL Formatting Objects (XSL-FO). However, for rendering XML, it is not even
absolutely necessary to use XSL, but a simpler, albeit less powerful, solution is to
simply link in a CSS style sheet to control the display, as we saw above. None of
the XML style sheet options are currently exploited in the implementation of the
annotation, but links to an appropriate – maybe user-definable – style sheet can be
included in the dialogues used in DART.
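Such a link takes the form of the standard xml-stylesheet processing instruction at the top of the XML file; the style sheet name used here is, of course, only illustrative:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="dart-style.css"?>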
2.2.2 Text vs. meta-information
In valid HTML code, there are two separate sections that make up an HTML
document. The first of these is represented by the head and the second by
the body element. The two different types of information expressed by these
elements are quite distinct from one another. The first one is somewhat similar
to the front matter or imprint of a book, which contains meta-information about
that book, such as its title, the author, the typeface used, etc., and does in fact
not represent any real book content, whereas the second one contains the actual
text itself.
What is called the head element in HTML is usually referred to as a header in
general. Headers in corpus data may contain various types and amounts of meta
information, such as the language the data is in, its encoding, where, when and
how it was collected, the author, copyright situation, whether the individual file
is part of a larger collection, etc. For spoken data, often some speaker informa-
tion is included, as well as the recording date and quality, the number of channels,
etc. Such meta information can become quite extensive, as is e.g. the case in the
BNC files, and often needs to be skipped over when processing the files, either
for annotation, concordancing, or other forms of processing. Although much of
this may be highly useful information about the corpus files, it does not really
form part of the text itself, and can be quite distracting when ‘interacting’ with
the linguistic data in any form. Thus, perhaps a more suitable alternative to using
an extensive header is some kind of external description of the data. This has the
distinct advantage of keeping the text ‘clean’ and easier to process, even if it may
necessitate distributing additional files containing such meta-information that can
be consulted e.g. when selecting data on the basis of the age or sex of the speaker.
Depending on how extensive or deeply structured it is, such external documenta-
tion can either be kept in a simple plain text file or in a database (cf. Leech et al.
2000: 13).
As much of the data used for this study did not actually provide any detailed
information about the speakers, nor was such information relevant in any other way for the processing,
the DART XML representation does not include a separate header. In general, only
the most important information pertaining to the corpus, the identifier of the dia-
logue within the corpus, and the language, are stored as attributes inside the con-
tainer tag, e.g. <dialogue corpus="trainline" id="01" lang="en">,
although for some data, additional information about sub-corpus types, etc., may
be present.
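Putting these pieces together, a minimal DART-style file might thus be sketched as follows, where the turn-internal elements are merely hypothetical placeholders, since the actual units of the annotation scheme are only introduced in the following chapters:

<?xml version="1.0"?>
<dialogue corpus="trainline" id="01" lang="en">
<turn speaker="A">
<unit>hello</unit>
</turn>
</dialogue>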
2.2.3 General linguistic annotation
Although, for the sake of simplifying our processing later, we will often specifically
disregard some of the recommendations made by the wider language research
community (at least initially), it is still important to discuss some of the efforts
that have been made in the past in order to establish a common framework for the
exchange of annotated language data, most specifically those of the Text Encoding
Initiative (TEI).7 Apart from discussing existing practices and schemes for linguis-
tic annotation in a general way, the motivation for choosing particular representa-
tion and annotation options employed in the annotation scheme used for the data
annotation in this study will also be explained as and when appropriate.
The TEI itself is a research project, organised and funded by the major associa-
tions that deal with computing in the humanities, the ACL (Association for Compu-
tational Linguistics), the ALLC (Association for Literary and Linguistic Computing),
and the ACH (Association for Computers and the Humanities). The explicit origi-
nal aim of this project was to devise some recommendations, as well as an associated
(SGML) markup framework, that would guarantee the successful annotation and
exchange of data for many diverse language-related needs, ranging from library cat-
alogues, via standardised dictionary entries, to critical editions of literary works or
large language corpora, such as the BNC. The TEI framework has developed consid-
erably further since its inception, especially with its changeover to XML in version
4, published in 2002. The latest version of the guidelines, P5, appeared in November
2007 and is available from http://www.tei-c.org/Guidelines/P5/.
7. http://www.tei-c.org/
Other documents randomly have
different content
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
back
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the joy of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

How to Do Corpus Pragmatics on Pragmatically Annotated Data: Speech acts and beyond
Martin Weisser
Guangdong University of Foreign Studies
John Benjamins Publishing Company
Amsterdam/Philadelphia
Cover design: Françoise Berserik. Cover illustration from original painting Random Order by Lorenzo Pezzatini, Florence, 1996.
The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
doi 10.1075/scl.84
Cataloging-in-Publication Data available from Library of Congress: LCCN 2017056561 (print) / 2017061549 (e-book)
ISBN 978 90 272 0047 1 (Hb)
ISBN 978 90 272 6429 9 (e-book)
© 2018 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Company · https://benjamins.com
Table of contents
List of tables
List of figures
Abbreviations
Chapter 1. Introduction
 1.1 Previous approaches to pragmatics and discourse
 1.2 Speech acts
 1.3 Approaches to corpus-/computer-based pragmatics
 1.4 Outline of the book
 1.5 Conventions used in this book
Chapter 2. Computer-based data in pragmatics
 2.1 Linguistic corpora and pragmatics
 2.2 Issues and standards in text representation and annotation
  2.2.1 General computer-based representation
  2.2.2 Text vs. meta-information
  2.2.3 General linguistic annotation
 2.3 Problems and specifics in dealing with spoken language transcription
  2.3.1 Issues concerning orthographic representation
  2.3.2 Issues concerning prosody
  2.3.3 Issues concerning segmental and other features
  2.3.4 Issues concerning sequential integrity
  2.3.5 Issues concerning multi-modality
Chapter 3. Data, tools and resources
 3.1 Corpus data used in the research
  3.1.1 The SPAADIA Trainline Corpus
  3.1.2 The selection from Trains 93
  3.1.3 The selection from the Switchboard Annotated Dialogue Corpus
  3.1.4 Discarded data
  3.1.5 Supplementary data
 3.2 The DART implementation and its use in handling dialogue data
  3.2.1 The DART functionality
  3.2.2 The DART XML format
 3.3 Morpho-syntactic resources required for pragmatic analysis
  3.3.1 The generic lexicon concept
  3.3.2 The DART tagset
  3.3.3 Morphology and morpho-syntax
  3.3.4 'Synthesising' domain-specific lexica
Chapter 4. The syntax of spoken language units
 4.1 Sentence vs. syntactic types (C-Units)
 4.2 Units of analysis and frequency norming for pragmatic purposes
 4.3 Unit types and basic pragmatic functions
  4.3.1 Yes-units
  4.3.2 No-units
  4.3.3 Discourse markers
  4.3.4 Forms of address
  4.3.5 Wh-questions
  4.3.6 Yes/no- and alternative questions
  4.3.7 Declaratives
  4.3.8 Imperatives
  4.3.9 Fragments and exclamatives
Chapter 5. Semantics and semantico-pragmatics
 5.1 The DAMSL annotation scheme
 5.2 Modes
  5.2.1 Grammatical modes
  5.2.2 Interactional modes
  5.2.3 Point-of-view modes
  5.2.4 Volition and personal stance modes
  5.2.5 Social modes
  5.2.6 Syntax-indicating modes
 5.3 Topics
  5.3.1 Generic topics
  5.3.2 Domain-specific topics
Chapter 6. The annotation process
 6.1 Issues concerning the general processing of spoken dialogues
  6.1.1 Pre-processing – manual and automated unit determination
  6.1.2 Fillers, pauses, backchannels, overlap, etc.
  6.1.3 Handling initial connectors, prepositions and adverbs
  6.1.4 Dealing with disfluent starts
  6.1.5 Parsing and chunking for syntactic purposes
 6.2 Identifying and annotating the individual unit types automatically
  6.2.1 Splitting off and annotating shorter units
  6.2.2 Tagging wh-questions
  6.2.3 Tagging yes/no-questions
  6.2.4 Tagging fragments, imperatives and declaratives
 6.3 Levels above the C-unit
  6.3.1 Answers and other responses
  6.3.2 Echoes
 6.4 Identifying topics and modes
 6.5 Inferencing and determining or correcting speech acts
Chapter 7. Speech acts: Types, functions, and distributions across the corpora
 7.1 Information-seeking speech acts
 7.2 (Non-)Cohesive speech acts
 7.3 Information-providing and referring speech acts
 7.4 Negotiative speech acts
 7.5 Suggesting or commitment-indicating speech acts
 7.6 Evaluating or attitudinal speech acts
 7.7 Reinforcing speech acts
 7.8 Social, conventionalised speech acts
 7.9 Residual speech acts
Chapter 8. Conclusion
Appendix A. The DART speech-act taxonomy (version 2.0)
References
Index
List of tables
Table 2.1 BNC D96: Discrepancies between original and edited version
Table 3.1 Summary of corpus materials primarily used
Table 3.2 Frequencies for 1st and 2nd person pronouns in two illustrative corpora
Table 3.3 Lexical coverage of the generic lexicon with regard to 4 different corpora
Table 3.4 The DART tagset
Table 4.1 Main function types, intra-category percentages and normed frequencies for yes-units
Table 4.2 Main function types, intra-category percentages and normed frequencies for no-units
Table 4.3 Main function types, intra-category percentages and normed frequencies for DMs
Table 4.4 Proposition-signalling initiating DMs
Table 4.5 Functions, intra-category percentages and normed frequencies for wh-questions
Table 4.6 Functions, intra-category percentages and normed frequencies for alternative and yes/no-questions
Table 4.7 Main functions, intra-category percentages and normed frequencies for declaratives
Table 4.8 Functions, intra-category percentages and normed frequencies for imperatives
Table 4.9 Important functions, intra-category percentages and normed frequencies for fragments
Table 5.1 Grammatical modes
Table 5.2 Backward-looking interactional modes
Table 5.3 Forward-looking interactional modes
Table 5.4 'Bi-directional' interactional modes
Table 5.5 Point-of-view modes
Table 5.6 Volition and personal stance modes
Table 5.7 Social modes
Table 5.8 Syntax-indicating lexicogrammatical modes
Table 5.9 Syntax-indicating modes based on <punc …/> elements
Table 5.10 Measures, enumerations & spellings
Table 5.11 Times & dates
Table 5.12 Locations & directions
Table 5.13 Personal details
Table 5.14 Meetings and appointments
Table 5.15 Domain-specific topics
Table 7.1 Information-seeking speech acts
Table 7.2 Engaging speech acts
Table 7.3 Dialogue-managing speech acts
Table 7.4 Textual speech acts
Table 7.5 Informing or referring speech acts
Table 7.6 Elaborating speech acts
Table 7.7 Explaining speech acts
Table 7.8 Awareness-indicating speech acts
Table 7.9 Hypothesising speech acts
Table 7.10 Volitional speech acts
Table 7.11 Negotiative speech acts
Table 7.12 Suggesting or commitment-indicating speech acts
Table 7.13 Evaluating speech acts
Table 7.14 Attitudinal speech acts
Table 7.15 Reinforcing speech acts
Table 7.16 Social, conventionalised speech acts
Table 7.17 Residual speech acts
List of figures
Figure 2.1 Sample extract from the London-Lund Corpus
Figure 2.2 Sample excerpt from the original SGML version of the BNC
Figure 2.3 Hierarchical display of XML in a browser window
Figure 2.4 A short, illustrative, linguistic XML sample
Figure 2.5 A colour-coded XML sample
Figure 2.6 A sample CSS style sheet
Figure 3.1 The SPAACy dialogue annotation tool
Figure 3.2 The Dialogue Annotation and Research Tool (DART ver. 2)
Figure 3.3 Basic DART XML dialogue structure
Figure 3.4 A short sample from an LFG lexicon
Figure 3.5 A sample from the generic lexicon
Figure 3.6 Comparison of type and token coverage of the uninflected generic lexicon for various corpora
Figure 3.7 Sample from a synthesised domain-specific lexicon
Abbreviations
ANC American National Corpus
ACL Association for Computational Linguistics
ALLC Association for Literary and Linguistic Computing
ACH Association for Computers and the Humanities
ASCII American Standard Code for Information Interchange
BNC British National Corpus
CA Conversational Analysis
CANCODE Cambridge and Nottingham Corpus of Discourse in English
COMPGR Comprehensive Grammar of the English Language
CAMGR The Cambridge Grammar of the English Language
CLAWS Constituent Likelihood Automatic Word-tagging System
CSS Cascading Style Sheets
DA Discourse Analysis
DAMSL Dialogue Act Markup in Several Layers
DART Dialogue Annotation and Research Tool
DRI Discourse Resource Initiative
DSSSL Document Style Semantics and Specification Language
DTD Document Type Definition
FLOB Freiburg Lancaster-Oslo/Bergen Corpus
FROWN Freiburg Brown Corpus
GENAM General American
HTML Hypertext Markup Language
ICE International Corpus of English
IDE Integrated Development Environment
LINDSEI Louvain International Database of Spoken English Interlanguage
LONGGR Longman Grammar of Spoken and Written English
LLC London-Lund Corpus of Spoken English
LOB Lancaster-Oslo/Bergen Corpus
MATE Multilevel Annotation, Tools Engineering
MICASE Michigan Corpus of Academic Spoken English
NLP Natural Language Processing
POS Part of Speech
RP Received Pronunciation
SGML Standard Generalized Markup Language
TART Text Annotation and Research Tool
TEI Text Encoding Initiative
XML eXtensible Markup Language
XSL eXtensible Style Sheet Language
XSLT XSL Transformations
XSL-FO XSL Formatting Objects
Chapter 1. Introduction

Corpus- and computer-based methods of analysis have 'revolutionised' much of the research in linguistics or natural language processing over the last few decades. Major advances have been made in lexicography (cf. Ooi 1998 or Atkins & Rundell 2008), morphology (cf. Beesley & Karttunen 2003, or Roark & Sproat 2007), (morpho-)syntax (Roark & Sproat 2007), and genre-based text-linguistics (cf. Biber et al. 1998), to name but the most important areas. These advances were in many cases linked to, or dependent upon, advances in creating and providing suitably annotated resources in the form of corpora. However, apart from the efforts made on the SPAAC project (cf. Leech & Weisser 2003), the creation of the SPICE-Ireland corpus (Kallen & Kirk 2012), or my own research into improving the automated annotation of pragmatics-related phenomena (Weisser 2010), to date very few linguistically motivated efforts have been made to construct annotated corpora of spoken language that reflect the different facets of language involved in creating meaning on the level of human interaction – in other words, on the level of pragmatics.

One aim of this book is to rectify this shortcoming and to demonstrate how it is possible to create corpora that can be annotated largely automatically on the levels of syntax, (surface) polarity (positive or negative mood of the unit), semantics (in the sense of representing major topic features of a textual unit), semantico-pragmatics (in the form of capturing interactional signals), and, finally, pragmatics (in the shape of speech acts). In contrast to current trends in computer-based, and here especially computational, linguistics, this is done relying purely on linguistic surface information in conjunction with appropriate inferencing strategies, rather than employing probabilistic methods. Thus, e.g. the 'utterance' i'd like to go from Preston to London can be 'recorded' inside a corpus as:

a. being of declarative sentence type,
b. having positive surface polarity,
c. containing topic information about some locations and movements between them,
d. signalling an intent or preference on the semantico-pragmatic level, as well as
e. pragmatically, in its particular context of occurrence inside the dialogue it was taken from, representing a directive that also informs the interlocutor about the speaker's intentions.
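To make this multi-level view more tangible, the following purely hypothetical sketch shows how such an analysis might be recorded in XML; the element and attribute names (unit, polarity, topics, sp-act, etc.) are invented for illustration only and do not reproduce the actual DART XML format, which is introduced in Section 3.2.2:

    <!-- hypothetical annotation; names are illustrative, not the DART format -->
    <unit type="decl" polarity="positive"
          topics="location movement"
          mode="volition-preference"
          sp-act="direct-inform">
      i'd like to go from Preston to London
    </unit>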
The exact format in which such information can best be stored in order to facilitate usability and exchangeability will be presented and discussed in the relevant sections below. The main emphasis here will be on the automatic determination of speech acts.

The advantages of being able to produce pragmatically annotated corpora efficiently, and thereby creating resources for many areas of linguistic research concerned with human (or human–computer) interaction, should be self-evident: this could not only greatly facilitate research about the interplay of the different linguistic levels, and in so doing also increase our understanding of how communication works, but also make it possible to use these resources in more applied areas, such as language teaching, textbook creation, or the training of other 'language professionals', for example interpreters or even call centre personnel. Some of the ways in which this can be achieved, for instance through the pragmatic profiling of speakers or speaker groups/populations, will be discussed in the more research-oriented chapters of this book, Chapters 4 and 7, where I shall attempt to demonstrate the different forms of applicability of the approach.

At the same time, creating such an annotation/corpus-creation methodology obviously does not constitute a purely mechanical process because, apart from devising appropriate algorithms, data structures and storage mechanisms for processing language on the computer, such an endeavour already involves active research into the interplay – or interfaces, as some researchers prefer to refer to them – of the different linguistic levels mentioned above. Another aim of this book, therefore, is to provide a substantial contribution to the practical and theoretical underpinnings of how to analyse, explain, and categorise the individual elements of language that contribute towards the generation of pragmatic meaning. Here, I will especially focus on developing the theory of speech acts – originally established by Austin (1962) and Searle (1969) – further, and present a generically applicable speech-act taxonomy that goes far beyond the limited categories proposed by Searle (ibid.) that are generally used in pragmatics research. Before going into how this is achieved in any detail, though, I will first contextualise the research discussed here by providing a brief overview of existing studies into general and computer-based pragmatics, and the analysis of spoken discourse.

1.1 Previous approaches to pragmatics and discourse

With regard to contemporary pragmatics, one can see a rough division into two different factions, or what Huang (2007: 4; cf. also Horn & Ward 2004: x) refers to as the "Anglo-American" and the "European Continental" schools. Amongst these, the former subscribes to the "component view" (ibid.), a view that sees
pragmatics as a separate level of linguistics, such as those of phonetics/phonology, syntax and semantics, while the latter adopts the "perspective view" (ibid.) – following the original ideas developed by Morris in 1938 (cf. Verschueren 1999: 2–10) – which perceives pragmatics as a function of language that influences the other levels and thus incorporates a larger situational context, including sociolinguistic factors. These different 'attitudes' towards the nature of pragmatics also to some extent manifest themselves in the two different approaches to the subject that Leech (1983: 10–11) refers to as "PRAGMA-LINGUISTICS" and "SOCIO-PRAGMATICS", respectively.

Along with pragma-linguistics and the component view generally comes an emphasis on issues in micro-pragmatics (cf. Mey 1993: 182), where the main topics of investigation are generally considered to be implicature, presupposition, speech acts, reference, deixis, as well as definiteness and indefiniteness. This is evident in the chapters under the heading "The Domain of Pragmatics" in Horn and Ward's (2004/2006) Handbook of Pragmatics, which may be seen as one of the standard references for work in pragmatics that follows the component view. These topics are also still predominantly investigated on the level of the 'sentence' – a concept which will need to be scrutinised further in Section 4.1 of this book – rather than involving any larger context. This practice is still being adhered to, despite the fact that at least some of the emphasis in this school is now also shifting towards an analysis of contextually embedded examples, as evidenced by a number of later chapters in Horn and Ward (2006). One further feature that goes hand in hand with the 'sentence-level' analysis is that 'research' by proponents of this view still frequently involves the use of constructed ('armchair') examples (cf. Jucker 2009: 1615) and a strong adherence to philosophically oriented, logic-based interpretations of how meaning is created, as well as the employment of more formal linguistic methods. The latter also tend to stress the affinity of pragmatics to (formal) semantics, something which supporters of the component view are still struggling to resolve, as evidenced through the ongoing debate about the semantics–pragmatics distinction (cf. e.g. Szabó 2005).

In contrast, advocates of socio-pragmatics and the perspective view often concentrate on research that is more process- and context-oriented, and which focuses on macro-pragmatics, i.e. the investigation of larger contexts and 'meaning in use', something that also frequently involves the cultural or even extra-linguistic information that contributes to communication as a social act. In line with their more sociological orientation, socio-pragmatists also seem, to some extent, to be more inclined towards employing less formal data analysis methods – reminiscent of the approaches in conversational analysis (CA) – that use a substantial amount of empirical data in a bottom-up strategy to draw conclusions from. There may also still be more of an emphasis on issues of sequencing of interaction, such
as turn-taking (cf. Sacks, Schegloff & Jefferson 1974), as a means of handling or managing the social act(ion) constituted by verbal (conversational) communication, although this is beginning to play a larger role in the component view these days, too.

Supporters of the component view, on the other hand, still seem to work more along the lines of analysis methods developed in discourse analysis (DA), although they do not explicitly subscribe to this. DA itself developed out of the (British) Firthian linguistic tradition and is therefore essentially functional or systemic in its approach to dialogues. Its main attention was originally focussed on the identification of units and structure of interaction, as a kind of extension of the hierarchy of units employed in general linguistic analysis and description, ranging from the morpheme, word, clause, to the sentence, the micro-level referred to above. DA was initially limited to the relatively narrow scope of analysing classroom interaction (cf. Sinclair & Coulthard 1975; Coulthard 1977), in order to attempt to identify regular patterns therein. More recent approaches to DA (cf. Coulthard 1992 or Brown & Yule 1983), however, have realised that the specific conditions of classroom interaction have led to over-generalisations and incorrect labelling – and hence potentially equally incorrect interpretation – of interactional patterns, and have therefore actively sought to overcome these earlier problems. The DA approach is also more top-down, in the sense that categorisations are attempted earlier, and potentially also based on some degree of intuition, something that is also still clearly reflected in 'component-view' pragmatics. This does not mean, however, that DA is not an empirical approach just like CA, since both approaches are well-grounded empirically, only with a slightly different slant and emphasis on various levels of detail. I will return to these levels of detail later in the discussion of issues regarding transcription conventions or dialogue sequencing and structure.

Within the component-view school, one can also distinguish between two further subgroups, proponents of the neo-Gricean and the relevance-theoretical view. While the neo-Griceans have attempted to 'refine' (cf. Huang 2007: 36–54) the original approach developed by H. P. Grice that includes the Cooperative Principle (CP) and its associated categories and maxims (cf. Grice 1989: 26), which assumes that all communication is essentially based on the co-operative behaviour of the communication partners, supporters of relevance theory work on the assumption that there is only one overarching principle in communication, which is that of relevance:

We share Grice's intuition that utterances raise expectations of relevance, but question several other aspects of his account, including the need for a Cooperative Principle and maxims, the focus on pragmatic contributions to implicit (as opposed to explicit) content, the role of maxim violation in utterance
interpretation, and the treatment of figurative utterances. The central claim of relevance theory is that the expectations of relevance raised by an utterance are precise and predictable enough to guide the hearer toward the speaker's meaning. (Wilson & Sperber 2006: 608)

The concepts employed in relevance theory, though, are not 'measurable' (ibid.: 610) and can hence also not be easily applied to computer-based analysis, so they will be largely ignored in the following exposition. As indicated above, the main emphasis of this book is on working with speech acts, so the other main issues in micro-pragmatics – implicature, presupposition, reference, deixis, and definiteness and indefiniteness – will only be referred to if they play a direct role in the identification of speech acts or are in fact constitutive thereof.

1.2 Speech acts

Research into speech acts essentially started with the 'ordinary language philosopher' Austin and his famous collection of William James lectures, published under the title How to Do Things with Words (Austin 1962). Here, Austin contradicted the idea, commonly assumed by most philosophers since the days of Aristotle, that sentences are only used to express propositions, that is, facts that are either true or false.

It was too long the assumption of philosophers that the business of a 'statement' can only be to 'describe' some state of affairs, or to 'state some fact', which it must do either truly or falsely. (Austin 1962: 1)

The concept of truth-conditionality, though, rather surprisingly, still seems to be present in many current approaches to the logic-based semantic description of language followed by at least some of the proponents of the component view. Starting from his theory of performative verbs (ibid.: 14ff), Austin claimed that, instead, sentences are often used to perform a 'verbal act(ion)' and distinguished between three different functions – or acts – of an utterance:

1. locution: 'what is said'
2. illocution: 'what is intended'
3. perlocution: 'what is evoked in the recipient' (cf. ibid.: 98ff.)

The explanations given in single quotation marks above represent my brief summaries of Austin's expositions. Apart from these functions, he also claims that there are a number of felicity conditions ("Conditions for Happy Performatives") that are necessary for such actions to become successful, amongst them that the
hearer understand and accept them, such as in the act of promising (ibid.: 22f). These are often dependent upon established conventions or laws (ibid.: 14f).

Searle (1969), in his Speech Acts: An Essay in the Philosophy of Language, takes Austin's ideas further and defines the speech act not only as an expression of illocutionary force, but even ascribes it the most central role in communication.

The unit of linguistic communication is not, as has generally been supposed, the symbol, the word or sentence, but rather the production or issuance of the symbol or word or sentence in the performance of the speech act. […] More precisely, the production or issuance of a sentence token under certain conditions is a speech act, and speech acts ([…]) are the basic and minimal unit of linguistic communication. (Searle 1969: 16)

In order to distinguish between the 'locutionary elements' of a sentence, he differentiates between a "propositional" and an "illocutionary force indicator" (ibid.: 30) and introduces the notion of what has later come to simply be referred to by the acronym IFIDs ("illocutionary force indicating devices"). Amongst these, he lists "word order, stress, intonation contour, punctuation, the mood of the verb, and the so-called performative verbs" (ibid.). Some, but not all of these, though complemented by a few others, will later turn out to be highly relevant to our analysis methodology.

Summarising the most important points made by Austin and Searle, it becomes clear that linguistic form, (lexico-)grammar, and context or established conventions, taken together, determine meaning. Since syntactic features are listed among the IFIDs, it ought to be clear that it is both the semantics and the syntax that play a role in determining the meaning of a speech act. And because analysing basic syntactic patterns is often much easier than determining the exact meaning – the (deep) semantics – it seems only natural that one might want to begin an analysis of speech acts by testing to see how the high-level syntax may constrain the options for them, thereby also signalling high-level types of communication.

Hence it is relatively easy, although not always foolproof, to distinguish syntactically between whether someone is asking a question, making a statement or simply indicating (dis)approval/agreement, backchannelling, etc., in order to be able to limit the set of initial choices for identifying a speech act. Once the selection has been narrowed down, one can then look for and identify further IFIDs at the semantico-pragmatic level that may reflect additional linguistic or interactional conventions, 'synthesise' the existing information, and, in a final step – as and when required – carry out some more inferencing in order to try and determine the exact primary force of the illocution as far as possible.
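As a minimal illustration of this narrowing step – and emphatically not the actual DART logic, whose categories are introduced in later chapters – one might first map the high-level syntactic type of a unit onto a set of speech-act candidates, to be pruned further by additional IFIDs; all type and label names below are invented for illustration:

    # Illustrative sketch only: syntactic unit types constrain the initial
    # set of speech-act candidates; labels do not reproduce the DART taxonomy.
    CANDIDATES = {
        "wh-question": {"reqInfo", "reqDirect"},
        "yn-question": {"reqConfirm", "reqInfo", "offer"},
        "imperative":  {"direct", "suggest"},
        "declarative": {"inform", "answer", "expressWish"},
        "yes-unit":    {"confirm", "agree", "acknowledge"},
        "no-unit":     {"negate", "disagree"},
    }

    def initial_candidates(unit_type: str) -> set:
        """Return the syntactically plausible speech acts for a unit type."""
        return CANDIDATES.get(unit_type, {"unclassifiable"})

    print(initial_candidates("yn-question"))
    # e.g. -> {'reqConfirm', 'reqInfo', 'offer'}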
To show how this can be done will be one of the most important aims of this book, along with demonstrating that far more in communication than has commonly been assumed to belong to the realm of conventional implicature (cf. Grice 1989: 25–26) – as opposed to conversational implicature – is indeed conventional, and can thus be investigated using methodologies similar to those traditionally employed in corpus linguistics, albeit with some major extensions.

Closely linked to the notion of conventionality is that of indirectness in meaning. For instance, O'Keeffe et al. (2011), apparently subscribing to something I would like to call the 'general myth of indirectness', state that

the utterance I've got a headache carries a variety of meanings according to when it is used, who uses it, who the person is talking to, where the conversation takes place, and so forth:
– If a patient said it to a doctor during a medical examination, it could mean: I need a prescription.
– If a mother said it to her teenage son, it could mean: Turn down the music.
– If two friends were talking, it could mean: I was partying last night.
– If it were used as a response to an invitation from one friend to another, such as Do you fancy going for a walk?, it could simply mean: No.
Therefore, depending on the context it occurs in, the utterance I've got a headache can function as an appeal, an imperative, a complaint or a refusal, and so on. (O'Keeffe et al. 2011: 1–2)

Such claims to the multi-functionality and indirectness of 'utterances' – a rather vague term we shall have to evaluate in more detail later – are very common in the traditional pragmatics literature. However, it seems to me that we seriously need to question the true extent of these phenomena. Certainly, no-one, including myself, would claim that it is not possible to create meaning in highly indirect ways, and that 'locutionary facts' may assume a special meaning in context that is not expressed directly through them. Nevertheless, if we look at the above examples more closely, we can assume that the functions associated with them by O'Keeffe et al. (2011) probably do not reside in the locution I've got a headache itself, but in its surrounding co-text, and may therefore at best be inferred, rather than really being implicit. Thus, the first example is more likely to constitute an answering response, i.e. statement, to the doctor's query as to the ailment of the patient, and the request for a prescription would most probably follow more or less in exactly the words assumed to be the meaning of the communicative unit used as an example. Similarly, in example two, the imperative Turn down the music is more likely to be a kind of 'preface' to the explanatory statement I've got a headache, while, in the third example, some contextual information would probably be required in order to 'set the scene' for the cause of the headache. In the final example, the assumed refusal is more likely to be expressed by an expression of regret – in other words, a dispreferred response – such as Sorry preceding the explanation. It therefore seems that we need to perhaps adopt a
more critical stance towards the notion of indirectness, and start our investigation into the meaning of functional communicative units by focussing on identifying their 'local' direct meaning first.

Yet the method referred to above, initially focussing on syntactic form and then supplementing this in an inferencing process by looking at other lexico-grammatical features, only works well for those types of verbal interaction where the individual speech act is essentially identifiable without taking too much of the surrounding context into account. However, there are also some other types of speech acts whose function is not solely determined by the propositional and illocutionary force inherent in what is said, but is rather almost exclusively related to how they function in reaction to what the previous interlocutor has said or what has been referred to within the larger context of the whole dialogue, such as answers to questions, echoing (partially or wholly repeating) something that the previous speaker has said, or confirming facts that have been established in the course of the interaction. In order to interpret these correctly, it is no longer sufficient to simply analyse the current/local textual unit itself; it is necessary to look backwards or forwards within the dialogue, as well as to possibly assign multiple speech act labels that reflect the different, 'cumulative', functions on different levels. The very fact that such textual units exist that absolutely require the surrounding context for interpreting their function is yet another argument against the 'single-sentence interpretation mentality' already referred to in connection with logic-based traditional approaches to pragmatics.

1.3 Approaches to corpus-/computer-based pragmatics

In the preceding sections, I mainly focussed on issues and background information related to traditional, 'manual' pragmatics. Of course, 'doing pragmatics' on the computer is in many respects very different from traditional pragmatics, which is why it is necessary to introduce this field of research separately. This difference is partly due to the nature of electronic data and the methods involved in handling it, whose discussion will therefore form a major part of this book, but also to some extent to the aims pursued by the people who work in this area.

Computer-based pragmatic analysis has only relatively recently become a major focus of attention, most notably because of increasing efforts in creating more flexible and accurate dialogue systems (cf. Androutsopoulos & Aretoulaki 2003: 635–644) that allow a human user to interact with a computer system, or that help human agents to interact and negotiate with one another if they have a different language background, such as in the German Verbmobil project (cf. Jekat et al. 1995). Consequently, the efforts in this field are often geared more towards the
needs of the language engineering community, rather than attempting to improve our general understanding of communication, which is still the implicit or explicit aim of pragmatics. Although the initial efforts on the SPAAC project (cf. Leech & Weisser 2003), which provided the original basis for the research described here, were also made in order to help improve such systems by creating annotated training materials for dialogue systems, my own emphasis has long shifted back towards a much more corpus-oriented approach, aimed at offering a wider basis for research on language and communication.

Corpus-/computer-based pragmatics, though, is still very much a developing field, and as yet there exist no real commonly agreed standards as to how such a type of research can or ought to be conducted. Having said this, there have at least been attempts to try and define the levels and units of annotation/analysis that are needed in order to create corpora of pragmatically enriched discourse data, most notably the efforts of the Discourse Resource Initiative (DRI). The DRI held three workshops on these issues between 1995 and 1998, and, as a result, an annotation scheme called DAMSL (Allen & Core 1997) was developed. As this scheme has been fairly influential and parts of it bear some similarity to the DART (Dialogue Annotation and Research Tool) scheme used here, DAMSL will be discussed in more detail in Section 5.1 below. In the expanded title of DAMSL, Dialogue Act Markup in Several Layers, we can also see that the language-engineering community often prefers to use the term dialogue act instead of the original speech act (cf. Leech et al. 2000: 6), but I see no benefit in adopting this here, and will keep on using the traditional term, which is also still better known in linguistics circles.

Other attempts at reporting on or defining best practice standards in this area have been Leech et al. (2000) within the EAGLES (Expert Advisory Group on Language Engineering Standards) framework and the efforts of the MATE (Multilevel Annotation, Tools Engineering; cf. Klein 1999) project. While these attempts at defining and possibly also standardising annotation schemes were predominantly carried out for NLP purposes, from the linguistics-oriented side, Kallen and Kirk (2012) also established a pragmatics-related annotation scheme for the SPICE-Ireland corpus, based essentially on the original design of the annotation for the corpora of the International Corpus of English (ICE; Nelson 2002), but adding various levels of annotation, drawing mainly on Searle's speech act taxonomy. The specific issues raised through these efforts, as well as other relevant endeavours, will be discussed in more detail later.

In recent years, corpus pragmatics, as a specialised sub-field of corpus linguistics, has also begun to establish itself more and more. This is evidenced by such publications as the series Yearbook of Corpus Linguistics and Pragmatics, whose first volume appeared in 2013 (Romero-Trillo 2013), the new journal Corpus
Pragmatics, which was established in 2017, and, perhaps most notably, the edited collection Corpus Pragmatics: A Handbook (Aijmer & Rühlemann 2015). Yet, when looking through the chapters/articles in such publications, it quickly becomes apparent that much of the research conducted under this label 'only' more or less constitutes the application of relatively traditional corpus-linguistic techniques, such as concordancing or n-gram analysis, to research on highly limited features, rather than resorting to any form of annotation that would make it possible to carry out large-scale analyses of multiple communicative functions at the same time, so as to be able to create communicative profiles.

As far as computer-based approaches to pragmatic analysis from a computational linguistics perspective are concerned, Jurafsky (2006: 579) identifies "[f]our core inferential problems […]: REFERENCE RESOLUTION, the interpretation and generation of SPEECH ACTS, the interpretation and generation of DISCOURSE STRUCTURE AND COHERENCE RELATIONS, and ABDUCTION." Out of these four, this book is only concerned with the latter three, with the main emphasis being on identifying speech acts through abductive (cf. Hobbs 2006), natural language-based reasoning, instead of employing logic-based formal semantic approaches. Discourse structure and coherence relations are also treated to some, albeit slightly lesser, extent.

According to Jurafsky, "there are two distinct computational paradigms in speech act interpretation: a logic-based approach and a probabilistic approach" (ibid.) in computational pragmatics research such as it is generally conducted by computational linguists. The former approach is essentially grounded in the BDI (belief, desire, intention) model (cf. Allen 1995: 542–554) and the concept of plans. Allen (1995: 480) provides the following description of plans and their usage:

A plan is a set of actions, related by equality assertions and causal relations, that if executed would achieve some goal. A goal is a state that an agent wants to make true or an action that the agent wants to execute. […] The reasoning needed in language understanding […] involves the plans of other agents based on their actions. This process is generally called plan recognition or plan inference. The input to a plan inference process is a list of the goals that an agent might plausibly be pursuing and a set of actions that have been described or observed. The task is to construct a plan involving all the actions in a way that contributes toward achieving one of the goals. By forcing all the actions to relate to a limited number of goals, or to a single goal, the plan-based model constrains the set of possible expectations that can be generated. (emphasis in original)

Often, the need to generate these types of constraints, and thereby limit the range of understanding of a system, is unfortunately driven by rather commercial reasons, because it is obviously highly time-consuming and costly to conduct
extensive research on discourse matters. This leads to a fairly limited application basis – and hence lack of generic applicability – for these plans. Plans are thus essentially only usable as, or represent, ad hoc methods for dealing with highly specific types of interaction, and are consequently of less concern to my discussion. Furthermore, the BDI model embodies a very complex chain of reasoning and abstract logical representation that is far removed from natural language (for an example of this, consult Jurafsky 2006: 582–586) and "requires that each utterance have a single literal meaning" (ibid.: 587), something that is all too frequently not the case in real-life spoken interaction. Consequently, although there is certainly some abductive logic involved in identifying speech acts, as we shall discuss in more detail in Section 6.5, a reasoning process that is based on a complex logic-based abstraction that allows for only one single and precise interpretation seems unsuited to the task at hand. Furthermore, as its name implies, the BDI model is based almost exclusively on the assumption that it is possible to recognise intentions (along with beliefs), something that Verschueren (1999: 48) rather lucidly argues against:

It would be unwarranted to downplay the role intentions also play. An important philosophical correlate of intentionality is 'directedness'. Being directed at certain goals is no doubt an aspect of what goes on in language use ([…]). But it would be equally unwise to claim that every type of communicated meaning is directly dependent on a definable individual intention on the part of the utterer. Such a claim would be patently false. Just consider the Minister who has to resign after making a stupid remark that was felt to be offensive, even if many people would agree that it was not meant offensively. Or, at a more trivial level, look at the exchange in (16).

(16) 1. Dan: Como is a giant silk worm.
     2. Debby: Yukh! What a disgusting idea!

Dan's innocent metaphor may simply be intended to mean that Como produces a large amount of silk. But that does not stop Debby from activating a meaning potential that was not intended at all. And by doing so, (16)1. really gets the meaning Debby is reacting to. In other words, (16)1. does not simply have a meaning once uttered (which would be the case if meaning were determined by intentions).

One further drawback of taking a BDI approach is that it necessitates a deep semantic analysis with access to a variety of different types of linguistic – and possibly encyclopaedic – information, and thus by necessity needs to be based on ideas of relatively strict compositionality, a notion that would seem to contradict the basic assumption that pragmatics represents 'meaning in context', rather than 'dictionary meaning'.
The second form of analysis/identification Jurafsky identifies is what he calls the cue-based model.

In this alternate CUE model, we think of the listener as using different cues in the input to help decide how to build an interpretation. […] What characterizes a cue-based model is the use of different sources of knowledge (cues) for detecting a speech act, such as lexical, collocational, syntactic, prosodic, or conversational-structure cues. (ibid.: 587–8)

In other words, in the cue-based model, we are dealing more or less exactly with IFIDs as defined by Searle, although Jurafsky claims that this approach is – unlike the plan-based one – not grounded in "Searle-like intuitions" (ibid.), but

[…] draws from the conversational analytic tradition. In particular, it draws from intuitions about what Goodwin (1996) called microgrammar (specific lexical, collocation, and prosodic features which are characteristic of particular conversational moves), as well as from the British pragmatic tradition on conversational games and moves (Power 1979). (ibid.)

There thus clearly seems to be a misunderstanding regarding the potential origins of the cue-based approach, especially as the term microgrammar never appears in the article by Goodwin referred to above. Be that as it may, the presumed origin of this approach – which is generally also the one followed in the present methodology – is not the real reason why one might want to disagree with computational linguists like Jurafsky in employing cues for the identification of speech acts. Rather, it is the way in which speech acts are in fact recognised by them, which is largely through manual labelling of examples, followed by machine learning to derive possible cues, and then applying probabilistic techniques to identify the latter, thereby arriving at a speech act assignment. Although probabilistic methods have relatively successfully been employed in morpho-syntactic tagging (cf. Marshall 1987) and other areas of linguistic analysis, it is well known that they generally suffer from a sparse data problem (cf. Manning & Schütze 1999: 195ff.). This essentially means that they can only work reliably if trained on a fairly large amount of existing data, something which is usually not available when moving from the analysis of one particular domain to another, and using potentially relatively short stretches of text. Furthermore, probabilistic approaches also represent somewhat of a black box, which is likely to conflate domain-specific patterns and generic structures induced through the machine-learning techniques (cf. Weisser 2015). In other words, what should be common across a variety of different domains – and hence indicate common features of human interaction – may frequently not be easily extractable from the machine-learnt patterns to generalise from in order to re-use this information. This is where the methodology used in this book introduces some very distinct advantages. It (a) specifies the clues to be
used as linguistically motivated and transparent patterns, and (b) tries to identify and use as many of the generic elements that exist on the different linguistic levels as possible, so that these can then be adapted or augmented as necessary when introducing a new domain into the analysis routines.

The practical corpus-linguistic approach discussed along with the theoretical issues involved in creating pragmatically annotated corpora has also led me to develop ideas as to which kinds of components may be useful or necessary to incorporate into a research tool that supports the types of analysis and annotation mechanisms discussed here. Incorporating such features into a research tool is also of great importance in corpus linguistics because general linguistic analysis tools like most concordancers typically do not provide the functionality required to investigate multiple linguistic levels at the same time. This is why, along with the general theoretical and practical sides of dialogue annotation on multiple levels, I will also introduce one particular research tool here, called DART (Dialogue Annotation and Research Tool; Weisser 2016b), designed by me for this specific research purpose.

In any computational analysis of communication, there are different 'mechanics' at work, and those also require different types of computational treatment. One is to do with identifying, storing, and retrieving the right types of information to make it possible to capture and look them up again in whichever data structure(s) one may choose to use for this purpose. The other is to identify the necessary patterns in order to label these units and their content appropriately, consistently and reliably. Since the latter essentially consists in pattern identification and matching operations, it can be achieved quite efficiently by using finite-state technology in the form of regular expressions (cf. Weisser 2009: 69–79 or Weisser 2016a: 82–101 for overviews of or introductions to their use in linguistic analysis).
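To give a concrete flavour of what such transparent, linguistically motivated patterns might look like, the following few lines of Python sketch a cue-matching step; the regular expressions and labels are invented for illustration and are far cruder than the rules actually used in DART:

    import re

    # Illustrative surface cues only; the patterns and labels are invented
    # and much simpler than the linguistically motivated rules used in DART.
    CUES = [
        (re.compile(r"^(?:i'?d|i would) (?:like|love) to\b", re.I), "expressWish"),
        (re.compile(r"^(?:could|can|would) you\b", re.I), "reqDirect"),
        (re.compile(r"^(?:what|when|where|which|who|how)\b", re.I), "reqInfo"),
        (re.compile(r"^thanks?\b|^thank you\b", re.I), "thank"),
    ]

    def match_cues(unit: str) -> list:
        """Return the labels of all surface cues that match the unit."""
        return [label for pattern, label in CUES if pattern.search(unit)]

    print(match_cues("i'd like to go from Preston to London"))
    # -> ['expressWish']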
Although the research described here is primarily based on English data, I will also occasionally draw on materials from other languages, so that it will hopefully become clear to what extent the theoretical concepts underlying the approach are transferable. Any real in-depth discussion of these other languages is beyond the scope of this book, though, so my treatment will only remain exemplary and sometimes even superficial.

In view of the complexities inherent in the construction of meaning discussed above, and the relative dearth of concrete research into these from a large-scale empirical point of view, I intend to investigate the following research questions in this book in order to fill these particular gaps:

1. Which levels of meaning can we distinguish within dialogues pertaining to different domains, some more restricted (e.g. call-centre interactions or problem-solving tasks), and others more open (e.g. data drawn from the Switchboard Corpus)?
2. How can we classify and describe these levels of meaning in order to relate them to the identification of pragmatic force/speaker intentions?
3. What would a suitably generic taxonomy of speech acts look like and what does it need to cover?
4. To what extent is a large-scale automated pragmatic annotation feasible and how could this incorporate different levels of (in-)directness?
5. How can such an annotation make truly corpus-based pragmatics research possible?

1.4 Outline of the book

Having outlined and put into focus the basic framework, it is now possible to proceed to looking at how the task of annotating dialogues can be achieved, which individual steps or particular resources are required for enriching corpora with pragmatics-relevant features, and, once the annotation has been completed, how such corpora can be used in order to draw important conclusions about the various communicative strategies employed by individual speakers or across different corpora from various domains. Through(out) these discussions, I will try to demonstrate why the methodology adopted here provides distinct advantages over 'traditional', and more complex, approaches, and hence represents a major step forward in empirical research into spoken corpus pragmatics.

Yet no approach or method is completely without caveats, and this one is no exception. As the original data the approach was developed on contained no prosodic information other than pauses, and I also had no access to any of the audio data, a high degree of manual pre- or post-processing, including some interpretation of the data, was necessary in order to allow the analysis routines to recognise – and thus categorise on the different levels – all the relevant units automatically, and with a high degree of accuracy. Not doing so would have led to unnecessary inaccuracies in the annotation, especially if the particular textual units concerned were either very long, and therefore had to be broken up into smaller units of content, or – as is the case with declarative questions – it is only the prosodic characteristics that permit the analysis routines to disambiguate between the potential functions offered by the syntactic structure. Thus, to avoid these issues, the original data were, as far as was possible without making reference to any audio, broken down into functional units, and then enriched with information pertaining to unit-final prosody, as described in Section 2.3.2.

Before actually moving on to the main chapters, a brief overview of the main parts of the discussion is in order. Chapter 2 will be concerned with linguistic
Chapter 2 will be concerned with linguistic data on the computer in the form of corpora to be used for pragmatic analysis, also discussing important issues in their design and handling in general, as well as basic and specific issues in text representation and annotation. The next chapter, Chapter 3, will provide a brief introduction to the corpora used for this study, the analysis tool DART, as well as the necessary computational resources required to achieve the annotation task. The syntax of spoken language, its units, and its peculiarities that necessitate special computational treatment are covered in Chapter 4. This chapter also contains a comparison of the distribution of syntactic categories and their basic communicative functions across the corpora used. Descriptions of the levels of semantics and semantico-pragmatics, which make important contributions to the realisation of speech acts, form the main substance of Chapter 5, while Chapter 6 will present a brief overview of the largely automated annotation process in DART. In Chapter 7, I shall discuss further results of the research in the form of a discussion of the DART speech-act taxonomy, again including a comparison of the distribution of the various acts across the different sets of data, thereby also illustrating the applicability of the annotation scheme towards establishing functional profiles of the corpora used for this study. Chapter 8 will then round off with a conclusion and outlook towards further potential improvements and future applications of the DART methodology.

1.5 Conventions used in this book

In this book, I use a number of conventions that have either been established in linguistics in order to help us to distinguish between different levels of analysis and/or description, or to indicate special types of textual content relevant to the presentation, for instance to distinguish between text and computer codes, etc.

Double quotes ("…") are exclusively used to indicate direct speech or short passages quoted from books, while single quotes ('…') signal that an expression is being used in an unusual or unconventional way, or that I am referring to the meaning of a word or construction on the semantic level. Curly brackets ({…}) represent information pertaining to the level of morphology, whereas angle brackets (<…>) indicate specific spellings, to contrast these with phonetic/phonological representations of words. Furthermore, they also occur as part of the linguistic annotation introduced in the book. Paired forward slashes/square brackets generally indicate phonological or phonetic representations. Within quoted material, the latter may also signal amendments to the original material made in order to fit it into the general sentence structure.
Italics are generally used in linguistics to represent words or expressions, sometimes whole sentences, that illustrate language materials under discussion. In some cases, they may also be used to indicate emphasis or highlighting. In addition to this, I use italics to indicate specific terminology and speech act labels. Small caps are used to indicate lemmas, i.e. forms that allow us to conveniently refer to all instances of a verb, noun, etc. Last, but not least, monospaced font indicates computer code or annotations.
Chapter 2

Computer-based data in pragmatics

The acquisition or representation of linguistic material in electronic form always brings with it a number of different issues. Transcribing or transforming the data into a form that meets one's research aims is often one of the most time-consuming and major parts of creating research materials, and a substantial amount of time and resources needs to be allocated to this task before one is actually in a position to analyse the data itself (see Weisser 2016a for a practical introduction to these issues). Therefore, before beginning our exploration of corpus-based pragmatics, I will provide a brief survey of existing technologies and issues surrounding the handling of the 'raw material' involved.

2.1 Linguistic corpora and pragmatics

Today, there is an abundance of electronic corpora designed for many different purposes (cf. McEnery et al. 2006: 59ff.). Evidently, not all of these are equally suitable for different types of language analysis, especially not the type of pragmatic analysis of spoken language discussed here. The most general of these corpora, reference corpora, cover a large amount of naturally occurring written or spoken data from a variety of different domains, which, in theory, makes them representative of a given language as a whole in terms of vocabulary, syntax, and also many pragmatic aspects of language use. Yet the earliest such corpora, the American BROWN (Francis & Kučera 1979) and its British counterpart, the LOB (Lancaster-Oslo/Bergen; Johansson, Leech & Goodluck 1978) corpus, were hardly representative in this sense yet, as they 'only' contained one million words of written text each. In the 1960s, when both of these corpora were collected, this seemed like a very large amount of data, and written language was still assumed to be more important than its spoken counterpart. Since then, however, it has become clear that a balanced corpus needs to contain suitable amounts of both written and spoken language, and that 1 million words are hardly enough to capture many of the interesting phenomena to be observed in language, especially when it comes to identifying collocations, idioms, and other rarer forms of
language. Sinclair (2005) provides a fairly detailed exploration as to how much data may be required to account for various types of such analyses, and how the size of corpora required for investigating especially longer sequences of words may increase exponentially. To fulfil such needs, corpora of ever-growing sizes are being produced to cover these gaps, and this is why modern mega-corpora, such as the British National Corpus (BNC), already contain 100 million words, subdivided into 90 million words of written and 10 million words of spoken language for the BNC, where of course the latter segment is most relevant to our research. Other mega-corpora for English, like the Corpus of Contemporary American English (COCA; Davies 2009), are even larger, but their content does not cover the same range of spoken language as the BNC, in particular not where unconstrained natural dialogue is concerned.

Spoken electronic corpora appropriate for conducting research in pragmatics have existed at least since the publication of the London-Lund Corpus of Spoken English (LLC; see Svartvik 1990 or http://clu.uni.no/icame/manuals/LONDLUND/INDEX.HTM) in 1990. A number of interesting studies on spoken interaction and its properties – such as Stenström 1994 or Aijmer 1996 – have been undertaken based on it. One major advantage of this 500,000-word corpus is that it contains detailed prosodic information that makes it possible to study nuances of attitudinal stance of the individual speakers in detail, rather than 'just' presenting the syntactic, lexical and structural information of the ongoing interaction. At the same time, this detailed prosodic information makes it very difficult to work with the corpus data and to perform automatic analyses of the kind discussed in this book on it, as the prosodic information is integrated into the transcription in such a way that it becomes difficult to recognise the 'shapes' of the individual words easily, as they may contain unusual 'accented' characters to indicate tone movement or other prosodic markers. A brief example from the beginning of the first file of the LLC is shown below, but there will be more to say on these issues in Section 2.3.1.

1 1 1 10 1 1 B   11 ((of ^Spanish)) . graphology#      /
1 1 1 20 1 1 A   11 ^w=ell# .                          /
1 1 1 30 1 1 A   11 ((if)) did ^y/ou _set _that# -     /
1 1 1 40 1 1 B   11 ^well !Joe and _I                  /
1 1 1 50 1 1 B   11 ^set it between _us#

Figure 2.1 Sample extract from the London-Lund Corpus

One further potential drawback of the LLC corpus is that it only reflects the speech of "adult educated speakers of English" (Svartvik 1990: 11), so that some of the features of more general spoken English may be missing from the data. A corpus that implicitly seeks to redress this problem is the two million-word CANCODE
(Cambridge and Nottingham Corpus of Discourse in English),1 as it was "targeted towards informal encounters and were made in a variety of settings, such as in people's homes, in shops, restaurants, offices, and informal university tutorial groups, all in the British Isles" (McCarthy 1998). The main disadvantage of this corpus, however, is that it is not generally available, so that it is also not possible to replicate any studies based on it.

In theory, this would then probably leave the spoken part of the BNC, due to its relatively easy accessibility, wide coverage, and size of data, as an ideal candidate for pragmatic analysis. However, in practice, even if one were to limit the selection of data chosen from such a large corpus in a sensible way, initially analysing data from such relatively unrestricted domains computationally would pose problems in terms of the lexical coverage an analysis program would have to offer. In addition, as we shall see in the next section and also Section 3.1.4, some serious problems exist in the spoken part of the BNC. Thus, it is probably best to start designing an analysis program or methodology on the basis of corpora from relatively restricted and clearly defined domains. This is in fact the approach that was taken for the original research behind this study, and also the reason why other projects or efforts aimed at designing computationally tractable methods of analysis for pragmatic data have generally been restricted to smaller corpora and limited domains. In contrast to most previous efforts, though, one of the explicit aims in the design of the methodology employed here was to allow for an extensibility to different domains right from the start by making use of generic elements (cf. Weisser 2002) to implement the core functionality, but also allowing further resources to be added later.

One classic exemplar of a dedicated spoken corpus is the 146,855-word HCRC Map Task Corpus.2 According to the classification scheme established in Leech et al. (2000: 6ff.), this corpus can be categorised as task-oriented and task-driven. In other words, it represents a specific type of dialogue corpus where two or more interlocutors negotiate or interact in order to achieve a specific task. The particular task in this case consists in finding a route to a target based on two maps that contain partly identical and partly differing information. The MapTask corpus was specifically designed to investigate features of relatively informal interaction on a number of linguistic and other levels, such as speaker gaze, general communicative strategies, etc. (cf. Anderson et al. 1991), and has been marked up (see 2.2 below) for a number of these features.

1. See https://www.nottingham.ac.uk/research/groups/cral/projects/cancode.aspx for a list of publications related to this.
2. See http://groups.inf.ed.ac.uk/maptask/ for more details.
Other corpora that have been designed and used in the context of research on the computational analysis of dialogues – mainly in the context of developing dialogue systems – include materials from the domains of travel information (SUNDIAL, ATIS, etc.), transport (Trains), business appointments (Verbmobil), etc. (cf. Leech & Weisser 2003: 149). However, apart from the earlier Trains corpus materials from 1991 and 1993, data from such projects is relatively difficult to obtain.

Flöck and Geluykens (2015: 9) claim that "there are no corpora available that are tagged for individual illocutions or even illocutionary types". However, this claim is certainly not true, as, despite a relative dearth of pragmatically annotated corpora, a few corpora containing speech-act related information have been in existence for a number of years. Amongst these are the SPAADIA (ver. 1 released in 2013, ver. 2 in 2015), the Trains 93, and one version of the Switchboard Corpus, data from all of which was used to some extent in this book (see Section 3.1 below), as well as the Coconut and Monroe corpora. For more details on the original annotation of these corpora, as well as a comparison of their annotation schemes, see Weisser (2015). The MapTask corpus already mentioned above also contains annotations that are somewhat similar to speech-act labels, but referred to as moves. The MICASE Corpus (Simpson et al. 2002) has also been marked up with information about pragmatic features, including a sub-corpus that contains 12 pragmatic tags (Leicher & Maynard 2007: 112). However, instead of reflecting generic concepts, the tag labels used there often represent highly domain-specific functions, such as "AHW Assigning homework" or "IRM Introductory Roadmap" (ibid.: 112), and the annotated materials – to the best of my knowledge – have never been made available publicly.

2.2 Issues and standards in text representation and annotation

As already hinted at in the prior discussion, having electronic data in a suitable format is of utmost importance for anything but a cursory analysis and generation of superficial hypotheses. This is why, in this section, an overview of the most important issues that apply to the design, representation and handling of corpora in language analysis shall be provided, beginning with a 'plea for accuracy' in recording the original data used for corpus compilation, as this is a feature that is highly likely to affect any subsequent analysis to a very large extent.

Spending a considerable amount of time on producing 'clean' data – not in the Sinclairean sense of being annotation-free, but free of potential errors due to typographical or encoding issues (cf. Weisser 2016a: 4–5, 56–57),
though – may sometimes seem an unnecessary effort, just in order to conduct a small-scale project. However, it is certainly time well spent, as one never knows to which use the data may be put later on and how badly represented data may affect the outcome of any type of analysis. For instance, many researchers use the BNC for their work on British English because it is the major reference corpus for this variety, and highly useful research can be conducted on it in many areas, especially through powerful interfaces such as BNCweb (http://bncweb.lancs.ac.uk/) or the BYU-BNC one created by Mark Davies (https://corpus.byu.edu/bnc/). Its compilation certainly also represents a most laudable and worthy effort, but if one takes a closer look at some of the spoken data and how it was transcribed, one cannot but wonder how much of an error may be introduced into any numerical analysis conducted on it simply due to the fact that its transcribers seem to have been relatively unqualified, and thus often did not seem to know where to use an apostrophe or not, apart from generally being somewhat insecure about their spelling. For example, in BNC file D96 alone, which I re-transcribed from the audio provided in BNCweb, there is an alarmingly high number of instances of the contraction we're that were simply transcribed without an apostrophe, plus a number of other rather serious transcription errors. These issues can easily be seen in the excerpts provided below, where deleted or replaced items are marked as struck through, and replacements or insertions indicated in bold italics:

I mean, a lot, what I can say with on the youths, I mean, I think we're were doing, we were, we're were, we're were working walking with young people at the local levels of various places in the town you know, we've got, we haven't got as many resources as we want yet, but we're were still trying to do that, well I actually feel, on youth we're doing quite a good job you know, extensive expensive job you know, that we are, and, and all the that concerns you raise, we're were certainly aware of. The problem is solving all the problems, providing all the facilities in, in a the situation where it's diminishing resources, I mean we wouldn't be actually be carrying out this frontline review, in the way that we're were gonna do it, if we didn't have the problem with the money we've got, you know. […]

We're with still not losing loosing site sight of the idea of having a cafe, bar, coffee for Heyham people, one of the things that we're were, that gonna look to through and explore explore actually is er setting up some kind of coffee bar facility facilities at Kingsmoor, with the play farming barn, there next to them. […]

At the, at the last search, at the last highways committee, although we're were not having, having, having the, the full service that we at the envisage in the first instance, a lot is going to be done, there's is going to be some more erm shelters
erected directed there and one or two other facilities and somebody has even suggested that we put a toilet there which is a very good idea. […], but anyway any way, there will be some improvements for the bus station in the future. […] Yeah, you're your off the hook. […] Right, we're were now on other reports. Anybody Any body got anything any thing else to report, with got a few minutes left? Yes, no, any other business, you can all go, you're your all off the hook. […] Don't go overdoing over doing it.

Although the extract above probably already provides a fairly striking impression of the severity of the problem, let us take another look at the overall discrepancies in numbers between the original BNC version and my corrected one, which may still contain errors, due to intelligibility issues that perhaps no transcriber can resolve.

Table 2.1 BNC D96: Discrepancies between original and edited version

Unit                    Original BNC version    Corrected version
w-units(a)/words        839                     902
u-units(b)/turns        40                      35
s-units(c)/c-units(d)   51                      162
punctuation             160                     138
insertions              –                       14
deletion(s)             –                       1
corrections             –                       56

a. word; b. utterance; c. sentence(-like); d. clausal and non-clausal (Biber et al. 1999: 1070; cf. 4.1)

The information regarding the original units in the first four rows of Table 2.1 was taken directly from the header of the BNC XML file. The relative discrepancy in terms of w-units/words is partly due to insertions of materials that were either marked as unclear in the original, but I was able to discern from the audio after all, or corrections where the marking of previously erroneously non-marked contractions resulted in two words being present, rather than just one. The latter applies, for instance, to all cases of were in the original transcript that should have been transcribed as we're. The difference in u-units/turns can be explained mainly by the fact that, occasionally, turns by the same speaker were split over multiple u-units,
in particular if some event, such as background laughter, occurs in between. Without more detailed transcription guidelines, however, it is difficult to ascertain the exact reason for this.

A similar phenomenon partly explains the divergence in the number of punctuation marks, as the transcriber of this particular file seems to have added punctuation marks even after event descriptions, in other words, non-textual material, even if this does not really make any sense at all. Although the number of c-units in the corrected version is higher than that of the s-units in the original, which would normally lead us to expect a higher instance of punctuation marks in the former, there are a number of reasons why this is not the case. First of all, in the BNC, 'sentence-like' units are marked in a rather haphazard way, where often the end of functional units is marked by a comma, rather than a major punctuation mark, as can easily be seen in the first paragraph of the extract we saw earlier. Most of these were deleted in the conversion process, but were partially replaced by 'phono-pragmatic' punctuation tags (see 2.3.2 for more information) in the corrected version. In addition, multiple functional units are often conflated into one s-unit in the BNC, while the DART scheme employs a more fine-grained system of syntactic/functional units (see Chapter 4), which accounts for the considerable difference in number between s- and c-units in the two versions.

What is perhaps more important than the differences in the individual units discussed above is the number of errors presented in the final three rows of Table 2.1. Added up, insertions, deletions, and other corrections account for 71 word tokens. If we see this number relative to the original 839 tokens, we arrive at an error rate of roughly 8.5%. Applying the customary, albeit arguably incorrect (see Weisser 2016a: 175), method for frequency norming and extrapolating to instances per 10 million words, we could then potentially expect to find 846,250 word-token errors in the whole of the spoken part of the BNC!
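The extrapolation itself is straightforward; a quick sketch of the calculation, using the figures from Table 2.1:

```perl
use strict;
use warnings;

# Error rate in the re-transcribed BNC file D96, based on Table 2.1
my $errors = 71;      # insertions (14) + deletions (1) + corrections (56)
my $tokens = 839;     # w-units in the original BNC version of the file
my $rate   = $errors / $tokens;     # ~0.0846, i.e. roughly 8.5%
my $per10m = $rate * 10_000_000;    # ~846,250 extrapolated errors
printf "rate: %.4f, per 10 million words: %.0f\n", $rate, $per10m;
```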
The problems illustrated above, occurring within such a relatively short space of text, will of course not only skew the general results of any (word) frequency counts, but also influence any such counts based on word classes, as well as the comparative counts that seek to illustrate the differences between spoken and written language. Although, in terms of pure frequency counts of word classes, some of these errors may actually balance each other out in that a lack of an apostrophe in one place may be compensated by an additional erroneous one elsewhere, the above observations should lead us to raise serious doubts about the validity of many frequency counts obtained from large reference corpora. This especially ought to be the case if these corpora have been collected very quickly and only few people have been involved in their compilation and processing, such as may possibly be the case with the final version of the American counterpart to the BNC, the Open American National Corpus (ANC),3 which, from its inception, was hailed and marketed as an enterprise in 'efficient' corpus collection.

3. http://www.anc.org/

The problems highlighted above simply indicate that the issue of homographs or potentially occurring word forms that have been misrepresented or represented as unclear is something that mere use of a spell-checker will not eradicate, and therefore a close reading and potential manual post-editing of transcriptions cannot be avoided, unless appropriate care has been taken to ensure that the data has been transcribed extremely thoroughly in the first place. And even then, occasional errors that were either overlooked during the compilation phase or might have been introduced by unforeseen side effects of any computer programs used to process the data cannot be discounted and may always have at least a minor influence on any kind of so-called 'statistical', i.e. frequency, analysis of texts. This might not seem much of a problem if 'all' we are interested in is the frequencies of words or their distributions, but, in the context of computational dialogue analysis, it may well affect the creation of domain-specific lexica required to do the processing (cf. Section 3.3.3).

However, it is not only frequency counts that may be affected by a somewhat careless preparation of corpus materials. Perhaps more importantly, when performing a syntactic analysis of a particular unit of text, the difference between an apostrophe being present or not may in fact prevent us from recognising a declarative structure and mistaking it for an ill-formed syntactic structure – later referred to as fragments (cf. Section 4.3.9) – and where the latter may be much more difficult or impossible to interpret in its function, as in the case of "your off the hook" – instead of the correct you're off the hook – from the BNC sample above.
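Since a spell-checker will pass forms like your or were without complaint, targeted pattern checks are one simple way of flagging such suspect spots for manual inspection. The following is merely a minimal sketch; the confusion pairs and context patterns are illustrative inventions, not an exhaustive or DART-specific list:

```perl
use strict;
use warnings;

# Flag likely contraction errors that a spell-checker cannot catch:
# 'your' or 'were' in contexts that suggest the contracted forms
# "you're" and "we're". Purely heuristic and illustrative.
my @checks = (
    [ qr/\byour\s+(?:off|not|all|gonna|going)\b/i,   "your -> you're?" ],
    [ qr/\bwere\s+(?:now|still|certainly|gonna)\b/i, "were -> we're?"  ],
);

while (my $line = <DATA>) {
    for my $check (@checks) {
        my ($pattern, $hint) = @$check;
        print "$hint : $line" if $line =~ $pattern;
    }
}

__DATA__
Yeah, your off the hook.
Right, were now on other reports.
```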
To summarise: the importance of using data that has been created with the utmost care and being aware of the content of this data is not only relevant for producing or extracting high-quality information from our corpora, but also in order to be able to form the right research hypotheses and come to the right conclusions about them. This is a fact that all too often seems to be ignored in the quest for ever-increasing amounts of corpus data that can be collected and prepared for dissemination in a maximally efficient and inexpensive way. In other words, quality should always remain more important than expedience.

Having demonstrated how important it is to work with clean data, as well as to have some kind of expectation about which types of issues may be encountered in electronic data, we can now move on to discussing the basic means of rendering the data in a faithful way, and adding additional useful types of structural and linguistic information to it. In this context, it will first be necessary to introduce or discuss some important terminology that enables us to describe the essential concepts behind representing linguistic data on the computer, and more specifically, how it can be ensured that the method of representation employed is one that as many potential users of the data as possible will be able to understand and make use of, e.g. for interpreting and verifying the results of the analyses. Discussing these issues at this point is of vital importance because, traditionally, logic-based and philosophically oriented pragmatics, unlike CA, does not normally pay much attention to the nature of real-life data and the forms it may occur in, but rather 'abstracts away' from the 'messiness' of naturally occurring data. It does so either by constructing examples or simply leaving out 'performance' details that seem to be irrelevant to explaining the underlying problems encountered in (re)constructing the logical form of an 'utterance'. Nonetheless, any kind of corpus-based pragmatics definitely needs to take heed of these problems, as ignoring them may lead to incomplete, or even incorrect, analyses.

Having explicit standards in representation and annotation is not only important for handling language data in more industrial settings, such as for language engineering purposes. Setting, understanding and adhering to these standards also enables researchers to make the nature of their data maximally explicit, and the enriching annotation as far as possible self-describing. The following sections will introduce the most important concepts in the representation and annotation of language data, and make the necessity for sensible standards explicit by showing where there have been problems in interpreting inconsistent and difficult, or perhaps unnecessarily fine-grained, coding schemes in the past, thereby making it difficult to understand and interpret data produced by different researchers or research teams. Further testimony to the fact that representation and annotation in describing language data are important issues concerning the interpretation of such data is provided by the very fact that books like Edwards and Lampert's Talking Data: Transcription and Coding in Discourse Research (1993) even exist. In addition to demonstrating how important these features are to rendering language information in general, I will also point out how much more of a necessity for using standardised text rendering methods there is when it comes to analysing and processing language on the computer.

Amongst the first terms one is likely to come across in the context of corpora are markup (also mark-up) and annotation. Edwards (1993: 20) still makes a distinction between the two terms, also introducing two synonyms for annotation, coding and tagging:

'Coding' (also called 'tagging' or 'annotation') differs from transcription in its content and degree of structuring. Rather than capturing the overtly observable acoustic and non-verbal aspects of the interaction, coding focuses on events which bear a more abstract relationship to each other, that is, on syntactic, semantic and pragmatic categories. […]
'Mark-up' originated in the marks used by typesetters to signal the structural units and fonts of a document. As defined here, it concerns format-relevant specifications intended to be interpreted by a typesetter or computer software, for proper segmentation of the text and cataloguing of its parts, in the service of formatting, retrieval, tabulation or related processes.

As we can see here, Edwards draws a fairly clear distinction between signalling structural segmentation of the text (markup) and adding or enriching data by making explicit other types of information that are only implicit in the text (annotation). However, today the two terms are often used synonymously, as can be seen from the entry in the EAGLET Term database, which defines annotation in the following way:

annotation /ænəʹteɩʃən/, /{n@'teIS@n/, [N: annotation], [plural: -s]. Domain: corpus representation. Hyperonyms: description, representation, characterisation. Hyponyms: part of speech annotation, POS annotation, segmental annotation, prosodic annotation. Synonyms: labelling, markup. Def.: 1. Symbolic description of a speech signal or text by assigning categories to intervals or points in the speech signal or to substrings or positions in the text. 2. Process of obtaining a symbolic representation of signal data. (2) The act of adding additional types of linguistic information to the transcription (representation) of a text or discourse. 3. The material added to a corpus by means of (a): e.g. part-of-speech tags. (Gibbon et al. 2000: 375)

In practice, though, it probably pays to look more closely at the words that tend to collocate with annotation, markup, and also the third term mentioned by Edwards, tagging, as well as the actions that may be associated with them. As is also partly implicit in the definition from the EAGLET Term database, annotation often refers to the process of enriching corpus data in specific ways, and we thus often talk about corpus or dialogue annotation. The term tagging, however, generally tends to occur in phrases, such as POS (part-of-speech) tagging, which almost exclusively refers to the action or result of adding morpho-syntactic (or word-class) information to the words in a text/corpus. And last, but not least, the term markup is generally used in such expressions as SGML/HTML/XML markup, which refer to the 'physical' or computer-related representation of materials at various levels, not only at the segmental4 level referred to by Edwards above.

4. Segmental here essentially means 'structural' and is not to be confused with the term segmental as it is used in phonology.

Having clarified some terminological issues, I will now provide a brief introduction to the general means preferred by corpus or computational linguists to achieve the kind of markup referred to last, beginning with a brief historical – and
potentially somewhat simplified – overview of the development and linguistic utility of some of the more important markup languages, before discussing the particular requirements and proposed standards that exist for dialogue annotation. This discussion will be split into two sections, where the first one deals with more general aspects of markup on the computer, while the second will discuss linguistics-oriented markup.

2.2.1 General computer-based representation

The original markup language of choice for linguistic purposes was SGML (Standard Generalized Markup Language). This language developed out of attempts to standardise means of exchanging information in the 1960s (Bradley 1998: 6). Anyone who has ever had to struggle with problems related to different proprietary document, sound, or graphics formats will easily understand that standardisation is an important and commendable effort because it ensures transparency and transportability. However, SGML itself, the first standard in this respect, was only fully ratified by the ISO (International Organization for Standardization; http://www.iso.org/iso/home.htm) in 1986 (ibid.), and even though it was widely adopted by various research communities, has not 'fulfilled all its promises'. Thus, these days, it has largely been replaced by XML (eXtensible Markup Language), which is more flexible, even though it still has not eliminated some of the original issues.

The basic idea in all markup languages that are related to, or derived from, SGML is that the content is stored in plain text format, meaning in largely human-readable, non-binary form, while structural and basic category information is marked up through so-called elements. Elements are also sometimes referred to as tags, but should of course not be confused with the kind of tags employed in many tagged corpora to mark up morpho-syntactic information. So as to be able to easily distinguish elements from the raw text data, elements tend to be represented in angle brackets (<…>), where the opening bracket (<) is immediately followed by the name of the element. This name may reflect a text-level, syntactic, morpho-syntactic, etc., category or sub-category. There are essentially – and conceptually – two different types of elements, those that surround or delimit specific divisions or categories, and those which mainly represent processing instructions to a computer and may reflect particular types of formatting, such as line breaks, used to link in or include external content, or express non-structural or non-hierarchical content. The former tend to enclose the marked-up information in paired tags, where the closing one, to indicate the end, contains a forward slash (/) between the opening angle bracket and the element name, thus yielding something like <element name>element content</element name>. A processing instruction, because it is a 'one-off' command, consists of only a single, unpaired tag.
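By way of a quick illustration (with invented element names, not those of any particular annotation scheme), a pair of nested, paired elements and a single, unpaired tag might look as follows, here in XML notation, where single tags carry a trailing slash, a detail returned to below:

```xml
<turn>
  <unit>thanks very much</unit>
  <pause/>
  <unit>that was really helpful</unit>
</turn>
```

Here, the turn and unit elements enclose their content in paired tags, while the pause tag stands on its own and simply marks a point in the text.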
Elements may also contain additional attributes, following the element name in the start tag. These usually specify the nature of the category expressed by the element further, or may simply be used to provide a unique identifier, such as a number, for the element. They tend to consist of an attribute name and an associated value, which are joined by an equals sign. Let us exemplify this to some extent by looking at an excerpt from one of the spoken files from the original version of the BNC (KCU), also pointing out some of the problems that tended to arise with the use of SGML.

<u who=PS0GF>
<s n=0001><w RP>On<c YQUE>? </s>
</u>
<u who=PS0GG>
<s n=0002><w RR>Right<c YCOM>, <w PPIS1>I<w VM>'ll <w VVI>go <w CC>and <w VVI>get <w AT1>a <w NN1>video<c YCOM>, <w RR>okay<c YQUE>? </s>
</u>
<u who=PS0GF>
<s n=0003><w UH>Yeah <w PPIS1>I <w VD0>do<w XX>n't <w VVI>know <w DDQ>what<w VBZ>'s <w RP>on<c YSTP>. </s>
</u>
<u who=PS0GG>
<s n=0004><w RR>Alright<c YSTP>. </s>
</u>

Figure 2.2 Sample excerpt from the original SGML version of the BNC

As is evident from Figure 2.2, SGML uses a fairly standard notation for opening tags, but unfortunately the sample text is not always consistent in indicating the ends of textual elements, and often the start of a new element simply has to be taken as a signal that the preceding element is now to be taken as closed. Thus, the <u> ('utterance') and <s> ('sentence') elements in the example are explicitly closed, whereas <w> elements are not. This type of 'shortcut', which is allowed in SGML, makes processing it much more difficult than needs be and also much more error-prone. Figure 2.2 also demonstrates that SGML is organised in a hierarchical (tree) structure where certain elements can be nested within one another. Thus, the sentences contain a number of words (<w>), but are themselves embedded in <u> elements. The exact document 'grammar' is specified via a so-called DTD (Document Type Definition).

In our example, we can also see that all attributes occurring inside the start tags are either not quoted, as is the case for the n IDs, which could cause parsing problems if the attribute values contained spaces, or the attribute name is even assumed to be explicit, as we can see in the examples of the PoS tags, where the attribute name and the conjoining equals symbol are missing. Of course, this could only work if this particular element were assumed to only ever allow a single type
of attribute. Thus, as soon as one might want to add e.g. a numerical ID for each <w> element, one would first need to ensure that an appropriate attribute name is added in front of the existing value.

That the SGML annotation here is used for linguistic purposes can be understood from the tags <s> and <w>, indicating 'sentences' and words respectively. This kind of markup therefore seems to be quite appropriate for general, perhaps more written-language-oriented rendering of linguistic material, in order to establish category sub-divisions down to the level of syntax.

Compared to its derivatives HTML and XML, SGML also has two other major disadvantages, the first being that it absolutely requires a DTD specifying the structure allowed for the document in order to allow any type of serious processing, and the fact that it is not supported by any 'standard' browser software. On the other hand, one big advantage, at least in comparison to HTML, is that a large set of tag definitions/DTDs, such as for the TEI (see 2.2.3 below), were originally designed for SGML, although nowadays more and more of these are being 'ported' to XML, too.

Although HTML is a direct descendant of SGML, it only provides a limited set of tags, which on the one hand makes it less flexible than XML, but on the other also much easier to learn. It is widely recognised by standard browser software, and the DTDs are already built into these browsers, although they can also be explicitly specified. HTML itself is largely standardised and also technically extensible via CSS (Cascading Style Sheets; see below) to some extent, so that it is already quite useful for the presentation and visualisation of linguistic content. This extensibility is somewhat limited, though, which is why it is not really flexible enough as a markup language for representing more complex linguistic data. It is, however, possible to transform complex linguistic data encoded in SGML or XML into a more simplified HTML representation for display in a standard browser.

XML is much more versatile than HTML because – as the attribute extensible in the name indicates – it was designed to provide the user with the ability to completely define anything but the most basic language features. It is much easier to process and far less error-prone than SGML because some of the shortcuts illustrated before are no longer allowed. All XML documents minimally have to be well-formed. In other words, no overlapping tags (e.g. <b>…<i>…</b>…</i>) as were possible to use in HTML are allowed, and end tags are required for all non-empty, paired, elements. While the express prohibition of overlapping tags makes it easier to check the well-formedness of XML documents, it may also present a distinct disadvantage for annotating linguistic documents, as e.g. speaker overlap – where one speaker in a dialogue starts talking while the other has not finished yet – cannot be marked up using a container element, since this would 'interfere' with the hierarchical structure of speaker turns and their embedded structural utterance units.
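Checking well-formedness programmatically is trivial with any standard XML parser. One possible sketch uses the CPAN module XML::LibXML, a choice made purely for illustration here, not a tool prescribed by the approach itself:

```perl
use strict;
use warnings;
use XML::LibXML;

# The parser dies on input that is not well-formed, so we trap the error.
my $xml = '<u who="A"><s>hello <b>there</s></b></u>';   # overlapping tags

eval { XML::LibXML->new->parse_string($xml) };
print $@ ? "not well-formed: $@" : "well-formed\n";
```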
So-called empty tags/elements differ from their SGML equivalents in that they have to be ended by a slash before the closing bracket, e.g. <element name/>. They provide a work-around for the problem of overlapping tags, should it be required to indicate overlap precisely, because they can be given attributes that signal its start and end, along with potentially some IDs if there should be multiple concurrent overlap sequences.
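To make this milestone-style workaround concrete (the element and attribute names below are invented for illustration only and do not reflect any particular existing scheme), overlapping speech could be marked with paired empty tags instead of a container element:

```xml
<u who="A">so we could <overlap pos="start" id="o1"/>go there tomorrow<overlap pos="end" id="o1"/></u>
<u who="B"><overlap pos="start" id="o1"/>yeah yeah<overlap pos="end" id="o1"/> that works</u>
```

Because the overlap tags are empty, they merely mark points in the text and therefore do not clash with the hierarchical nesting of turns and their embedded units.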
Unlike with older forms of HTML, where case did not matter, XML is also case-sensitive, so that tags like <turn>, <Turn> and <TURN> are treated as being different from one another.

The representation of individual letters – or characters, to be more precise – on the computer may also be an important issue in linguistics, especially in dealing with multi-lingual data or data that needs to include phonological information. For dealing with 'English only', a very limited Latin-based character set may appear sufficient. Originally, characters in English data were encoded in a character set called ASCII (American Standard Code for Information Interchange) and its later derivatives, but as computing technology spread across the world, this presented problems in representing other languages that contain accented characters, etc., as well as the occasional foreign word appearing in English texts, such as fiancée. In order to overcome this problem, and be able to store characters from different character sets in one and the same document, a universal character encoding strategy called Unicode was developed. Unicode exists in a number of different formats, the most widely used and flexible of which is called UTF-8, which is the default assumed for XML files unless an alternative encoding is specified. This, along with the fact that it is the format that Perl – the programming language used for the implementation of the analysis tool discussed in the next chapter – uses for its internal character representation, made a combination of XML and UTF-8 the most logical choice for the encoding of the data used for this study.

Leech et al. (2000: 24) still argued against the use of Unicode and recommended using a 7-bit ASCII character set for encoding most information, as this was most widely supported in general at the time, so that, for example, the inclusion of phonetic transcription details could only be achieved by using the transliteration format SAMPA (ibid.). However, as more and more operating systems, browsers and even standard editors available on many different platforms now widely support UTF-8, such transliterations or the use of special character entity references for e.g. representing foreign characters like é (&eacute;) or escaping umlaut characters – e.g. writing "u for ü, as was done for the Verbmobil data – should these days no longer be necessary. Seeing the text in the way it was meant to be represented in the respective writing or representation systems makes dealing with, and processing, the data much more intuitive, and also allows researchers to use established linguistic conventions, such as proper phonetic transcription, instead of remaining caught up in unnecessary conventions that only stem from the traditional anachronistic predominance of an American influence on data representation on the computer.

XML, in contrast to HTML, describes content, rather than layout, so that the rendering, i.e. the visual representation of a document, needs to be specified via a style sheet, or otherwise the browser or application displaying it would not know how to achieve this. If no style sheet is explicitly provided, most browsers will try to render the XML content using their own default style sheets that at least attempt to represent the hierarchical tree structure and often allow the user to expand and collapse nested (embedded) structures. Other applications can at least display the plain text, provided they support the given encoding. A screenshot of what the hierarchical XML display inside a browser looks like is shown in Figure 2.3.

<u who="PS0GF">
  <s n="1">
    <w c5="AVP-PRP" hw="on" pos="ADV">On</w>
    <c c5="PUN">?</c>
  </s>
</u>
<u who="PS0GG">
  <s n="2">
    <w c5="AV0" hw="right" pos="ADV">Right</w>
    <c c5="PUN">,</c>
    <w c5="PNP" hw="i" pos="PRON">I</w>
    <w c5="VM0" hw="will" pos="VERB">'ll</w>
    <w c5="VVI" hw="go" pos="VERB">go</w>
    <w c5="CJC" hw="and" pos="CONJ">and</w>
    <w c5="VVI" hw="get" pos="VERB">get</w>
    <w c5="AT0" hw="a" pos="ART">a</w>
    <w c5="NN1" hw="video" pos="SUBST">video</w>
    <c c5="PUN">,</c>
    <w c5="AV0" hw="okay" pos="ADV">okay</w>
    <c c5="PUN">?</c>
  </s>
</u>
<u who="PS0GF">
  <s n="3">
    …
  </s>
</u>

Figure 2.3 Hierarchical display of XML in a browser window
Figure 2.3 contains the fragment from file KCU of the BNC depicted as SGML earlier, and it is clearly visible that the markup has been suitably adjusted to make it well-formed, with all start and end tags properly set, all attribute names given, and all attribute values quoted. Furthermore, some additional attributes have been added, where, according to the BNC User Reference Guide,5 "hw specifies the headword under which this lexical unit is conventionally grouped, where known". "[H]eadword" here subsumes all paradigm forms associated with a particular word form (or type), regardless of their PoS, so it is distinct from a lemma, which only subsumes those forms of a paradigm that belong to the same PoS. For example, the headword hand subsumes the nominal forms hand (sing.) and hands (pl.), as well as the verbal forms hand (inf./base form), hands (3rd pers. sing.), etc. Furthermore, the pos-attribute now indicates a simplified PoS value, whereas the c5-attribute provides more specific PoS information, based on the more elaborate CLAWS C5 tagset (Garside, Leech & McEnery 1997: 256–257).

5. At http://www.natcorp.ox.ac.uk/docs/URG/ref-w.html

Apart from the well-formedness criterion described above, the document structure of an XML document can also be more rigorously constrained by specifying either a DTD or a schema that it needs to conform with, in which case we talk of a valid document. Issues of designing DTDs or schemas will not be discussed here because they are fairly complex6 and the data structure used for the DART annotation scheme is relatively simple, but a brief overview of some of the rendering options for XML documents using style sheets will at least be provided.

6. For more detailed information on this, see Carstensen et al. 2004: 140 ff.

As can be seen in the illustration above, each XML document represents a hierarchical structure. The outer 'layer' for this hierarchy – not shown above – is represented by a container or 'wrapper' element that encloses all the nested elements. In the case of the DART annotation scheme, this element is aptly named <dialogue>. This wrapper element is only preceded by a single special processing instruction, the XML declaration, <?xml version="1.0"?>. This declaration may also contain further attributes, such as the encoding or whether the document is a standalone document or not, i.e. whether an associated external DTD exists.
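A fuller declaration carrying both of these additional attributes, with illustrative values, would thus look like this:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
```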
Style sheets allow the author to present or publish material in a more appropriate format, for instance specifying line spacing, indentation, positioning, font and background colours, etc. What may at first seem to only be a feature to make the rendering of the textual and annotation materials look nicer does in fact have its purpose, because proper layout and colour-coding may well help to enhance the representation of the logical structure of documents, as well as to highlight certain facts about the content of a linguistic XML document. Thus, it e.g. becomes possible to highlight information about the syntax or semantics of a particular unit of text. Something similar to a style sheet is for instance used in the implementation of the analysis program discussed later to indicate the difference between syntactic units, such as declaratives and interrogatives. Below, a short XML sample without a style sheet is shown, followed by an illustration of what the latter may look like using a simple style sheet.

<?xml version="1.0" encoding="UTF-8"?>
<sample>
  <sentence>
    <word pos="DET">This</word>
    <word pos="BE">is</word>
    <word pos="DET">a</word>
    <word pos="N">sample</word>
    <word pos="N">sentence</word>
    <word pos="PUN">.</word>
  </sentence>
</sample>

Figure 2.4 A short, illustrative, linguistic XML sample

Figure 2.5 A colour-coded XML sample

sample {display: block; margin-left: 5%; margin-top: 5%; font-size: 2em;}
[pos=DET] {display: inline; color: blue;}
[pos=BE] {display: inline; color: red;}
[pos=N] {display: inline; color: green;}
[pos=PUN] {display: inline; color: grey;}

Figure 2.6 A sample CSS style sheet

The first line ensures that every time a sample element is encountered, this is displayed as a block-level element, a text block similar to a paragraph, with spacing around it. Furthermore, just to ensure that the display is not 'crushed' against the top and left-hand side, a margin of 5% of the page width is specified and, to enlarge the text a little, a relative value (em) for the font-size defined for the whole page, which is effectively twice the default font-size the browser would use. The next few lines specify that each time a pos attribute with either the value of DET (for determiner), BE (a form of be), N (for noun), or PUN (for punctuation) is encountered, whatever is enclosed in the corresponding element tag is displayed inline, in other words, not as a separate block, and using the appropriate colour.
If you observe the XML and its corresponding style sheet-controlled output closely, it will probably become evident that the browser has also automatically added a space after rendering each inline element, something which was not part of the original XML.

XSL, the style sheet language developed for use with XML, provides similar options to CSS for formatting XML display, but also much more complex selection mechanisms, and allows reuse of 'text objects', e.g. for producing tables of contents from elements marked up as headings, etc., via XSL Transformations (XSLT). Layout design for other (printed) media is also supposed to be enhanced through XSL Formatting Objects (XSL-FO). However, for rendering XML, it is not even absolutely necessary to use XSL; a simpler, albeit less powerful, solution is to simply link in a CSS style sheet to control the display, as we saw above.
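The linking itself is done via a standard processing instruction placed before the root element. Assuming the style sheet from Figure 2.6 had been saved as sample.css (a file name chosen purely for illustration), the beginning of the XML file would look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="sample.css"?>
<sample>
...
```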
None of the XML style sheet options are currently exploited in the implementation of the annotation, but links to an appropriate – maybe user-definable – style sheet can be included in the dialogues used in DART.

2.2.2 Text vs. meta-information

In valid HTML code, there are two separate sections that make up an HTML document. The first of these is represented by the head and the second by the body element. The two different types of information expressed by these elements are quite distinct from one another. The first one is somewhat similar to the front matter or imprint of a book, which contains meta-information about that book, such as its title, the author, the typeface used, etc., and does in fact not represent any real book content, whereas the second one contains the actual text itself.

What is called the head element in HTML is usually referred to as a header in general. Headers in corpus data may contain various types and amounts of meta-information, such as the language the data is in, its encoding, where, when and how it was collected, the author, copyright situation, whether the individual file is part of a larger collection, etc. For spoken data, often some speaker information is included, as well as the recording date and quality, the number of channels, etc. Such meta-information can become quite extensive, as is e.g. the case in the BNC files, and often needs to be skipped over when processing the files, either for annotation, concordancing, or other forms of processing. Although much of this may be highly useful information about the corpus files, it does not really form part of the text itself, and can be quite distracting when 'interacting' with the linguistic data in any form. Thus, perhaps a more suitable alternative to using an extensive header is some kind of external description of the data. This has the distinct advantage of keeping the text 'clean' and easier to process, even if it may necessitate distributing additional files containing such meta-information that can be consulted e.g. when selecting data on the basis of the age or sex of the speaker. Depending on how extensive or deeply structured it is, such external documentation can either be kept in a simple plain text file or in a database (cf. Leech et al. 2000: 13).

As much of the data used for this study did not actually provide any detailed information about the speakers, or such information was not relevant in any other way for the processing, the DART XML representation does not include a separate header. In general, only the most important information pertaining to the corpus, the identifier of the dialogue within the corpus, and the language, are stored as attributes inside the container tag, e.g. <dialogue corpus="trainline" id="01" lang="en">, although for some data, additional information about sub-corpus types, etc., may be present.

2.2.3 General linguistic annotation

Although, for the sake of simplifying our processing later, we will often specifically disregard some of the recommendations made by the wider language research community (at least initially), it is still important to discuss some of the efforts that have been made in the past in order to establish a common framework for the exchange of annotated language data, most specifically those of the Text Encoding Initiative (TEI).7 Apart from discussing existing practices and schemes for linguistic annotation in a general way, the motivation for choosing particular representation and annotation options employed in the annotation scheme used for the data annotation in this study will also be explained as and when appropriate.

The TEI itself is a research project, organised and funded by the major associations that deal with computing in the humanities, the ACL (Association for Computational Linguistics), the ALLC (Association for Literary and Linguistic Computing), and the ACH (Association for Computers and the Humanities). The explicit original aim of this project was to devise some recommendations, as well as an associated (SGML) markup framework, that would guarantee the successful annotation and exchange of data for many diverse language-related needs, ranging from library catalogues, via standardised dictionary entries, to critical editions of literary works or large language corpora, such as the BNC. The TEI framework has developed considerably further since its inception, especially with its changeover to XML in P4, published in 2002. The latest version of the guidelines, P5, appeared in November 2007 and is available from http://www.tei-c.org/Guidelines/P5/.

7. http://www.tei-c.org/