Data Analytics in Cognitive Linguistics
Applications of
Cognitive Linguistics
Editors
Gitte Kristiansen
Francisco J. Ruiz de Mendoza Ibáñez
Honorary editor
René Dirven
Volume 41
Data Analytics in Cognitive Linguistics
Methods and Insights
Edited by
Dennis Tay
Molly Xie Pan
ISBN 978-3-11-068715-6
e-ISBN (PDF) 978-3-11-068727-9
e-ISBN (EPUB) 978-3-11-068734-7
ISSN 1861-4078
Library of Congress Control Number: 2022935004
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the Internet at http://dnb.dnb.de.
Chapter “Lectal variation in Chinese analytic causative constructions: What trees can and
cannot tell us” © Xiaoyu Tian, Weiwei Zhang and Dirk Speelman
© 2022 Walter de Gruyter GmbH, Berlin/Boston
Typesetting: Integra Software Services Pvt. Ltd.
Printing and binding: CPI books GmbH, Leck
www.degruyter.com
Contents
Dennis Tay, Molly Xie Pan
Data analytics in cognitive linguistics
Bodo Winter
Mapping the landscape of exploratory and confirmatory data analysis in linguistics
Dennis Tay
Time series analysis with Python
Matteo Fuoli
Structural equation modeling in R: A practical introduction for linguists
Mariana Montes, Kris Heylen
Visualizing distributional semantics
Xiaoyu Tian, Weiwei Zhang, Dirk Speelman
Lectal variation in Chinese analytic causative constructions: What trees can and cannot tell us
Molly Xie Pan
Personification metaphors in Chinese video ads: Insights from data analytics
Han Qiu, Dennis Tay
The interaction between metaphor use and psychological states: A mix-method analysis of trauma talk in the Chinese context
Jane Dilkes
Prospecting for metaphors in a large text corpus: Combining unsupervised and supervised machine learning approaches
Jonathan Dunn
Cognitive linguistics meets computational linguistics: Construction grammar, dialectology, and linguistic diversity
Karlien Franco
What Cognitive Linguistics can learn from dialectology (and vice versa)
Index
Dennis Tay, Molly Xie Pan
Data analytics in cognitive linguistics
1 Is data analytics just another name
for statistical analysis?
Data analytics is commonly defined as the “processing and analysis of data to
extract information for enhancing knowledge and decision-making”, with minor
differences among definitions. Although large amounts of data are collected non-
stop around the clock, people still describe today’s world with the old phrase
“data rich but information poor” (Peters and Waterman 1982). The process of
turning data into useful information is like finding “a small set of precious nug-
gets from a great deal of raw material” (Han et al. 2000: 5–6), and would indeed
seem like a daunting task to the unacquainted. On the other hand, those who
have received some training in general data analysis, including many linguists,
might see data analytics as little more than an attempt to refashion applied sta-
tistics and quantitative methods in a more marketable way. The gist of it still ap-
pears to be making sense of data in numerical rather than verbal or qualitative
forms, and popular techniques like clustering and regression still bear the same
name as when they were taught in traditional statistics courses. There is some
merit to this cynicism given that we live in a world where it seems to be impor-
tant to put a new spin on old things all the time. However, we would be remiss to
overlook some nuanced but important differences between the two. The first dif-
ference is that while most data analytic techniques are indeed based on quantita-
tive and statistical methods, there is a strong emphasis on the importance of
substantive expertise (Conway 2010) or domain knowledge in order to maximize
their potential for insight. This follows from the fact that just about any type of
data from historical archives to complex multimodal artifacts can be viewed from
the lenses of data analytic techniques as long as there are good theoretical or
practical reasons for doing so. It also means that general statistical methods like
classification and regression are continuously adopted to meet the specific needs
Dennis Tay, Department of English and Communication, The Hong Kong Polytechnic University,
e-mail: dennis.tay@polyu.edu.hk
Molly Xie Pan, College of Foreign Languages and Literatures, Fudan University,
e-mail: mollyxiaoxie@foxmail.com
Acknowledgement: The editorial work involved in this volume was partly supported by the
HKSAR Research Grants Council (Project number: 15601019).
https://doi.org/10.1515/9783110687279-001
of different domains like business (Chen et al. 2012) and healthcare (Raghupathi
and Raghupathi 2014), with ever expanding functions, applications, and special-
ized interpretations of models and results. The second difference pertains to a
perceived difference in scope between the two. There is a tendency among many
novice and experienced researchers alike to view statistical analysis as a set of
standard ‘tests’ that are applied to measurements collected under some strictly
controlled guidelines in order to determine whether some hypothesis is ‘correct’
or otherwise. This seems to be especially true in the field of applied linguistics
where testing, assessment, and other forms of measurement are commonplace.
The typical question which test should I use? is often answered by convenient
heuristical tools like flow charts, abundantly available on the internet, that at-
tempt to link stock scenarios like ‘comparing the means of two groups’ or ‘com-
paring the means of three or more groups’ to the t-test, one-way ANOVA, and so
on. While this approach of ‘choosing the correct test to use’ might be convenient
and helpful for learners, it reinforces the narrow view that statistical analysis is
all about trying to prove or disprove a hypothesis at a specific stage of the re-
search process. This in turn makes it easy to see statistical analysis as an inde-
pendent set of procedures that apply to all types of, and are hence divorced
from, specific subject matter knowledge. Data analytics, on the other hand, is
more in line with the broader notion of statistical thinking that has been gaining
traction in modern statistics education. There are many different definitions of
statistical thinking but they all focus on cultivating a “more global view” (Chance
2002) in learners right from the start. For researchers, this means learning to see
data analysis as holistic and contextual, rather than linear and procedural. Some
concrete steps to do so include exploring and visualizing data in new and crea-
tive ways, understanding why a certain analytical procedure is used rather than
what or how to use it, reflecting constantly on alternative approaches to think
about the data and situation at hand, appreciating how subject matter and con-
textual knowledge can potentially shape analytic decisions, and learning how to
interpret conclusions in non-statistical terms. Compared to the traditional con-
ception of statistical analysis described above, we can therefore describe data an-
alytics as encompassing a more exploratory spirit, being ‘messier’ in a positive
sense, and even as a means to inspire emergent research questions rather than a
resolution of existing ones. At the same time, data analytics can also be de-
scribed as being very context-specific and purpose driven, and thus potentially
more engaging for focused learners than what the traditional ‘decontextualized’
view of statistics presents. Tools for the actual implementation of data analytic
techniques on increasingly large volumes of data have also become more avail-
able today. Powerful open-source programming languages like R and Python are
continuously developed and freely available to personal users, which can be a
great relief for many learners relying on expensive commercial statistical soft-
ware packages like SPSS, Stata, MATLAB etc.
Data analytics can be classified into four subtypes (Evans and Lindner 2012)
in order of increasing complexity and value-addedness (Figure 1). Originally
conceived for business contexts where complexity and value are measured in
relatively concrete financial terms, these notions also have meaningful inter-
pretations for researchers. We may understand the four subtypes as represent-
ing progressive phases of inquiry into a certain dataset. Descriptive analytics is
roughly synonymous with the classic notion of descriptive statistics. It involves
summarizing and depicting data in intuitive and accessible ways, often to pre-
pare for later phases of analysis. A simple example is to depict the central ten-
dency and distribution of a dataset with box plots or histograms. With an
increasing premium placed on visual aesthetics and user engagement, how-
ever, data visualization has become a growing field in itself, with increasingly
sophisticated and interactive forms of visualization driving the development
of contemporary descriptive analytics. The next subtype or phase known as
diagnostic analytics involves discovering relationships in the data using vari-
ous statistical techniques. It is deemed more complex and valuable than de-
scriptive analytics because the connections between different aspects of our
data help us infer potential causes underlying observed effects, addressing
the why behind the what. In an applied linguistics context, for example, the
descriptive step of uncovering significant differences in the mean scores of
student groups might motivate a broader correlational study of scores and
demographics to diagnose potential sociocultural factors that explain this dif-
ference. Following that, if we see diagnostic analytics as revealing why some-
thing might have happened in the past, the next phase known as predictive
analytics is aimed at telling us what might happen in the future. This involves
predicting the values of future data points using present and historical data
points, supported by core techniques like regression, classification, and time
series analysis. It should be clear why predictive analytics represents a quan-
tum leap in value for businesses that are inherently forward looking. To a
lesser extent perhaps, the same applies for linguistics research that aims to
predictively categorize new texts, speakers, and varieties, or forecast lan-
guage assessment scores, based on existing data. As ever-increasing volumes
of data become available, both diagnostic and predictive analytics are turning to-
wards machine learning – the use of artificial intelligence to quickly identify pat-
terns, build models, and make decisions with little or no human intervention.
Applications in computational linguistics and natural language processing (NLP)
best reflect these advances in the field of linguistics research. Lastly, prescriptive
analytics fill the gap between knowing and doing by translating the above insights
into concrete courses of action. It goes beyond knowing what is likely to happen
based on predictive analytics, which may or may not be ideal, to suggest what
needs to be done to optimize outcomes. Examples that require split-second decisions in response to ever-changing information and conditions include the optimization of air-
line prices, staffing in large organizations, and the modern self-driving car. While
linguistics is not likely to involve this level of challenge, prescriptive analytics still
interfaces with the ubiquitous notion of applied or appliable research, which ulti-
mately boils down to the growing need to demonstrate how our findings positively
inform personal and social action.
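The following is a minimal sketch in Python of the first three phases, assuming a hypothetical file scores.csv with invented columns (score, hours, group) that loosely echo the applied-linguistics example above; prescriptive analytics would then concern the course of action taken on the basis of such output rather than a further computation.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # hypothetical data: one row per student

# Descriptive analytics: summarize and visualize central tendency and spread.
print(df["score"].describe())
df["score"].plot(kind="hist")  # histogram of scores (requires matplotlib)

# Diagnostic analytics: probe relationships that may explain the description.
print(df[["score", "hours"]].corr())

# Predictive analytics: model existing data to predict unseen cases.
model = smf.ols("score ~ hours + C(group)", data=df).fit()
print(model.predict(pd.DataFrame({"hours": [10], "group": ["B"]})))  # "B" must be an observed group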
2 Data analytics in cognitive linguistics
Cognitive linguistics has undergone remarkable development since its incep-
tion. Luminaries of the field like George Lakoff, Leonard Talmy, Ronald Lan-
gacker, Charles Fillmore, and Gilles Fauconnier established some of the most
influential theories about the interfaces between linguistic meaning, structure,
and cognition, basing much of their introspective analyses on made-up exam-
ples. This mode of inquiry has however come to be criticized on both philosophi-
cal and methodological grounds over the years. Philosophically, reliance on
introspection reduces “cognition” to psychological reality and neglects its neuro-
biological underpinnings (Lamb 1999). This is often reflected in the common use
of the terminology “mind/brain” to conflate the two for analytic convenience.
Figure 1: Four subtypes of data analytics.
The use of introspective examples has also been susceptible to charges of argu-
mentative circularity, like in the case of conceptual metaphor theory where in-
vented utterances are taken as both evidence and product of conceptual mappings
(Kertész and Rákosi 2009). More generally, there are obvious limitations to our
ability as humans to accurately introspect upon many of our cognitive processes
(Gibbs 2006). The concern motivating the present volume is more methodological
in nature. The past decades have witnessed steady growth in the (combined) use
of empirical methods like corpora, surveys, and experimentation in humanities re-
search in general and cognitive linguistics in particular (Gibbs 2007, 2010; Kertész
et al. 2012). A key argument for empirical over introspective methods in cognitive
linguistics is their compatibility with the basic tenet that linguistic structure and
meaning emerge from multiple usage contexts. These are inherently beyond the in-
trospective ambit of individuals, and we therefore need transparent methods that
can deal with measures and their variability on larger scales. This ‘empirical turn’
has at the same time dovetailed with the call for cognitive linguistics to demon-
strate its applications in real world activities, including but not limited to the
traditional areas of language acquisition and education. Many examples in
the Applications of Cognitive Linguistics book series have powerfully illus-
trated this point.
The above conditions imply that cognitive linguistics – as a specialized
knowledge domain aspiring to be truly “applied” – presents a fertile but underex-
plored ground for data analytics. While not all empirical methods are quantitative
in nature, the majority used by cognitive linguists including corpora, surveys,
and experiments do involve different extents of quantification and statistical anal-
ysis. There is certainly no lack of advocacy, pedagogy, and application of quanti-
tative and statistical methods by cognitive linguists interested in different topics
ranging from metaphor and metonymy to constructions and lexical semantics
(Glynn and Robinson 2014; Gonzalez-Marquez et al. 2007; Janda 2013; Tay 2017;
Winter 2019; Zhang 2016). A cursory review of articles in specialized journals like
Cognitive Linguistics, Review of Cognitive Linguistics and Cognitive Linguistic Studies
quickly shows that there is an increasing use of quantitative methods in theoreti-
cal and empirical work alike. We are fortunate to already have many examples of
textbooks, introductory overviews, step-by-step guides, as well as more advanced
applications of a wide variety of statistical methods to analyze linguistic data.
However, with respect to the distinct features of data analytics outlined above,
two aspects remain critically underexplored in the cognitive linguistics literature.
Firstly, existing work has made ample use of descriptive analytics to account for
data, diagnostic analytics to investigate hypotheses, and predictive analytics to
make inferences about patterns of language use. It nevertheless stops short at pre-
scriptive analytics; i.e. suggesting concrete courses of action, which is crucial if
cognitive linguistics wishes to be truly “applied”. This goes beyond the traditional
ambit of language acquisition and pedagogy (Achard and Niemeier 2008; Little-
more 2009) to other contexts like advertising (Littlemore et al. 2018), design (Hurti-
enne et al. 2015), and aspects of healthcare where language plays a key role
(Demjén et al. 2019; Tay 2013). As mentioned earlier, while linguistic research is
not likely (yet) to require the most sophisticated prescriptive analytics, it is time to
consider how the “practical implications” of our work could be more intimately
informed by prior analytical steps and articulated as such.
The second underexplored aspect is the aforementioned holistic guiding
role of data analytics throughout the research trajectory – from data description
to hypothesis setting, testing, and the eventual interpretation and application
of findings. This contrasts in important ways with the widely held belief that
statistical and quantitative methods only apply to “top-down” investigation of
experimental hypotheses determined in advance. It is in fact the case that
many “bottom-up” designs across different cognitive linguistic topics – ranging
from corpus-driven studies to (conceptual) metaphors in discourse – can be
critically informed by various data analytic techniques. A dedicated space is re-
quired to demonstrate the different possibilities with reference to diverse areas
in current cognitive linguistics research. From a pedagogical point of view, re-
searchers new to data analytics could be made more aware that even a working
familiarity with basic skills, including programming languages like R and Py-
thon, can go a long way towards the formulation and refinement of different
research objectives.
3 This volume as a first step
This volume is a modest first step towards the aforementioned goals. It features
ten contributions from established and up-and-coming researchers working on
different aspects of cognitive linguistics. As far as practicable, the contributions
vary in terms of their aims, featured languages and linguistic phenomena, the so-
cial domains in which these phenomena are embedded, the types of data analytic
techniques used, and the tools with which they are implemented. Some chapters
are conceptual discussions on the relationships between cognitive linguistic re-
search and data analytics, some take a more pedagogical approach to demonstrate
the application of established as well as underexplored data analytic techniques,
while others elaborate these applications with full independent case studies.
Examples from multiple languages and their varieties like English, Mandarin Chi-
nese, Dutch, French, and German will be discussed. Phenomena and constructs to
be analyzed include verbal and visual metaphors, constructions, language
variation, polysemy, psychological states, and prototypicality. The case studies
address relevant issues in different social domains like business, advertising, poli-
tics, and mental healthcare, further underlining the applied dimensions of cogni-
tive linguistics. The featured data analytic techniques span across descriptive, to
the threshold of prescriptive analytics as described above. These include innova-
tive ways of data visualization, machine learning and computational techniques
like topic modeling, vector space models, and regression, underexplored applica-
tions in (cognitive) linguistics like time series analysis and structural equation
modeling, and initial forays into prescriptive analytics in the mentioned social do-
mains. The contributions also showcase a diverse range of implementation tools
from traditional statistical software packages like SPSS to programming languages
like Javascript, R, and Python, with code and datasets made available either in
print, via external links, or upon request by contributors.
The volume starts with Chapter 1 where Bodo Winter provides an excellent
overview of the landscape of statistical analysis in cognitive as well as general
linguistics research. Framing data analysis as a process of modeling the data
with respect to domain and contextual knowledge rather than the ritualistic ap-
plication of statistical tests, he communicates the central message of this vol-
ume and discusses how a modeling approach could address perceived issues of
replicability and reproducibility in cognitive linguistics research. This overview
is followed by two tutorial-style chapters aimed at introducing useful data ana-
lytic techniques that are likely to be less familiar to cognitive linguists. In Chap-
ter 2, Dennis Tay discusses the underexplored relevance of time series data –
consecutive observations of a random variable in orderly chronological se-
quence – in cognitive linguistics research. Key steps of the widely used Box-
Jenkins method, which applies a family of mathematical models called ARIMA
models to express values at the present time period in terms of past periods,
are explained with reference to a guiding example of metaphors across psycho-
therapy sessions. Sample code in the Python programming language is provided to encourage readers to implement the method on their own datasets. Matteo Fuoli’s Chapter 3 follows closely with an introduction to structural equation modeling. This is a technique for testing complex causal
models among multiple related variables, and has much underexplored poten-
tial in cognitive linguistics work. Experimental data on the psychological ef-
fects of stance verbs (e.g. know, want, believe) in persuasive business discourse
comprise the guiding example, this time using the R programming language for
implementation. Learners might also find in these two chapters an opportunity
to compare Python and R code for themselves.
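For readers curious how the Box-Jenkins workflow of Chapter 2 translates into code, the following is a hedged illustration using the statsmodels library, not the chapter’s own code; the series of per-session metaphor counts and the ARIMA(1, 1, 0) order are invented for the example.

import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# Invented data: metaphor counts across consecutive psychotherapy sessions.
counts = pd.Series([12, 15, 11, 18, 14, 20, 17, 22, 19, 25, 21, 27])

# Identification: inspect autocorrelation plots to choose candidate (p, d, q) orders.
plot_acf(counts, lags=5)
plot_pacf(counts, lags=5)

# Estimation: fit a candidate model, here ARIMA(1, 1, 0).
fit = ARIMA(counts, order=(1, 1, 0)).fit()
print(fit.summary())

# Diagnosis and forecasting: check residuals, then forecast the next sessions.
print(fit.forecast(steps=3))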
The volume then transitions into a series of four case studies that, as men-
tioned above, feature a diverse range of phenomena, settings, and data analytic
techniques. In Chapter 4, Mariana Montes and Kris Heylen highlight the im-
portance and increasing sophistication of data visualization techniques, and
how they interface with processes of statistical data analysis. They argue that
the process of visualization helps researchers recognize, interpret, and reason
about otherwise abstract statistical patterns in more intuitive ways. The crucial
role played by interactive visual analytics is then demonstrated by a corpus-
based case study of Dutch, where distributional semantic models are used to
analyze structural properties of word meaning like polysemy and prototypical-
ity as they emerge from contextual usage patterns. Chapter 5 by Xiaoyu Tian,
Weiwei Zhang, and Dirk Speelman is a case study of lectal variation in analytic
causative constructions across three varieties of Chinese used in Mainland China,
Taiwan, and Singapore. The authors are interested in how features of the cause
and the effected predicate might influence the choice of near-synonymous causa-
tive markers shi, ling, and rang in these varieties. They demonstrate how a combi-
nation of data analytic techniques – conditional random forests and inference
trees, complemented by logistic regression models – can enhance the explanatory
power and insight offered by tree-based methods alone. We remain in the Chinese-
speaking context in Chapter 6 by Molly Xie Pan, turning our attention to the use
of personification metaphors in Chinese video advertisements. Besides showing
how these metaphors are constructed and distributed across product types, with
prescriptive implications for advertisers, another important objective of this chap-
ter is to underline how data analytic techniques like log-linear and multiple corre-
spondence analysis can guide researchers to explore larger datasets with multiple
categorical variables, as a starting point to inspire subsequent research questions.
Chapter 7 moves from the social domain of advertising to mental health and poli-
tics. Working with interview and psychometric data from affected individuals in
the recent social unrest in Hong Kong where protestors occupied a university and
disrupted its operation, Han Qiu and Dennis Tay apply multiple linear re-
gression to analyze the interaction between metaphor usage profiles and
measures of psychological trauma, interpreting the results and implications in a
contextually specific way. As a step towards predictive analytics, they show how
aspects of metaphor use (e.g. targets, sources, conventionality, emotional va-
lence) at the level of individuals could reasonably predict performance in the
Stanford Acute Stress Reaction Questionnaire, which in turn suggests concrete
courses of action for healthcare professionals.
The final three chapters showcase innovations in the application of data
analytics, both in terms of refining methodology in established research areas
as well as tilling new grounds for collaborative cross-disciplinary work. In
Chapter 8, Jane Dilkes discusses metaphor identification, a foundational
step in metaphor research that is well known for its (over)reliance on manual
human judgement. She shows how a combination of supervised and unsuper-
vised machine learning techniques in natural language processing can be
harnessed to prospect for “community metaphor themes” in an extensive on-
line English cancer-related forum. This in turn paves the way for investigat-
ing associations between metaphor use and other measures of language
style. Chapters 9 and 10 by Jonathan Dunn and Karlien Franco respectively
argue for closer collaboration between cognitive linguistics and the neighbor-
ing fields of computational linguistics and dialectology. In Chapter 9, the
growing area of computational cognitive linguistics is showcased as a truly
usage-based approach operating on a large enough scale to capture meaning-
ful generalizations about actual usage. Computational methods are used to
model language learning and variation from respective theoretical perspec-
tives of construction grammar and dialectology, featuring a vast dataset that
covers seven languages (English, French, German, Spanish, Portuguese, Rus-
sian, Arabic) and their 79 distinct national dialects. In Chapter 10, potential
synergies between cognitive linguistics and dialectology are further explored
with three case studies that offer a cognitive linguistic take on sociolinguistic
principles like transmission, diffusion, and communicative need. To opera-
tionalize this cross-fertilization of theoretical ideas, generalized additive
models, as extensions of generalized linear models that flexibly accommo-
date parametric and non-parametric relationships between (in)dependent
variables, are combined with other correlational analyses to investigate dia-
lectal lexical variation in Dutch.
In this way, the volume is divided into three natural and overlapping sec-
tions. Chapters 1 to 3 are more conceptually and pedagogically oriented by pre-
senting a broad overview followed by tutorial-style contributions aimed at
learners. Chapters 4 to 7 feature specific case studies with a range of data ana-
lytic techniques, phenomena, and social contexts, and Chapters 8 to 10 con-
clude the volume by offering glimpses of promising future directions for data
analytics in cognitive linguistics. While the target audience is cognitive lin-
guists, the techniques underpinning the theoretical issues and examples are
readily applicable to other areas of social and linguistic research with appropri-
ate reconceptualization of the design and variables. We believe that the volume
will be most beneficial to researchers who have some foundational knowledge
in statistics and data analytics, and want to further understand how a range of
underexplored as well as established techniques could operate in actual research
contexts.
References
Achard, Michel & Susanne Niemeier. 2008. Cognitive linguistics, second language
acquisition, and foreign language teaching. Berlin: Walter de Gruyter.
Chance, Beth L. 2002. Components of statistical thinking and implications for instruction and
assessment. Journal of Statistics Education 10(3).
Chen, Hsinchun, Roger HL Chiang & Veda C Storey. 2012. Business intelligence and analytics:
From big data to big impact. MIS Quarterly 36(4). 1165–1188.
Conway, Drew. 2010. The Data Science Venn Diagram. blog.revolutionanalytics.com.
Demjén, Zsófia, Agnes Marszalek, Elena Semino & Filippo Varese. 2019. Metaphor framing
and distress in lived-experience accounts of voice-hearing. Psychosis 11(1). 16–27.
Evans, James R & Carl H Lindner. 2012. Business analytics: The next frontier for decision
sciences. Decision Line 43(2). 4–6.
Gibbs, Raymond W. 2006. Introspection and cognitive linguistics. Annual Review of Cognitive
Linguistics 4. 131–151.
Gibbs, Raymond W. 2007. Why cognitive linguists should care more about empirical methods.
In Monica Gonzale-Marquez, Irene Mittelberg, Seana Coulson & J. Michael Spivey (eds.),
Methods in cognitive linguistics, 2–18. Amsterdam: John Benjamins.
Gibbs, Raymond W. 2010. The wonderful, chaotic, creative, heroic, challenging world of
researching and applying metaphor. In Graham Low, Zazie Todd, Alice Deignan & Lynne
Cameron (eds.), Researching and applying metaphor in the real world, 1–18. Amsterdam:
John Benjamins.
Glynn, Dylan & Justyna Robinson (eds.). 2014. Corpus methods for semantics: Quantitative
studies in polysemy and synonymy. Amsterdam: John Benjamins.
Gonzalez-Marquez, Monica, Irene Mittelberg, Seana Coulson & J. Michael Spivey (eds.). 2007.
Methods in cognitive linguistics. Amsterdam: John Benjamins.
Han, Jiawei, Micheline Kamber & Jian Pei. 2000. Data mining: Concepts and techniques,
3rd edn. San Francisco, CA: Morgan Kaufmann.
Hurtienne, Jörn, Kerstin Klöckner, Sarah Diefenbach, Claudia Nass & Andreas Maier. 2015.
Designing with image schemas: Resolving the tension between innovation, inclusion and
intuitive use. Interacting with Computers 27(3). 235–255.
Janda, Laura A. 2013. Cognitive linguistics: The quantitative turn. Berlin & New York: Walter
de Gruyter.
Kertész, András & Csilla Rákosi. 2009. Cyclic vs. circular argumentation in the Conceptual
Metaphor Theory. Cognitive Linguistics 20(4). 703–732.
Kertész, András, Csilla Rákosi & Péter Csatár. 2012. Data, problems, heuristics and results in
cognitive metaphor research. Language Sciences 34(6). 715–727.
Lamb, Sydney M. 1999. Pathways of the brain: The neurocognitive basis of language.
Amsterdam: John Benjamins Publishing.
Littlemore, Jeannette. 2009. Applying cognitive linguistics to second language learning and
teaching. Basingstoke/New York: Palgrave Macmillan.
Littlemore, Jeannette, Paula Pérez Sobrino, David Houghton, Jinfang Shi & Bodo Winter. 2018.
What makes a good metaphor? A cross-cultural study of computer-generated metaphor
appreciation. Metaphor and Symbol 33(2). 101–122.
Peters, Thomas J & Robert H Waterman. 1982. In search of excellence: Lessons from America’s
best-run companies. New York: Harper Collins Business.
Raghupathi, Wullianallur & Viju Raghupathi. 2014. Big data analytics in healthcare: Promise
and potential. Health Information Science and Systems 2(1). 1–10.
Tay, Dennis. 2013. Metaphor in psychotherapy: A descriptive and prescriptive analysis.
Amsterdam: John Benjamins.
Tay, Dennis. 2017. Time series analysis of discourse: A case study of metaphor in
psychotherapy sessions. Discourse Studies 19(6). 694–710.
Winter, Bodo. 2019. Statistics for linguists: An introduction using R. New York: Routledge.
Zhang, Weiwei. 2016. Variation in metonymy: Cross-linguistic, historical and lectal
perspectives. Berlin, Germany: Walter de Gruyter.
Bodo Winter
Mapping the landscape of exploratory
and confirmatory data analysis
in linguistics
1 The data gold rush
Linguistics has undergone and is still undergoing a quantitative revolution (Kortmann
2021; Levshina 2015; Sampson 2005; Winter 2019a). Over the last few decades in
particular, methodological change has arguably taken up speed. For example,
many researchers have criticized the over-reliance on introspective data in gen-
erative linguistics (Gibson and Fedorenko 2010; Pullum 2007; Schütze 1996)
and cognitive linguistics (Dąbrowska 2016a; Dąbrowska 2016b; Gibbs 2007).
This critique of introspection was one of the driving forces spurring an in-
creased adoption of quantitative methods. Other factors that have spurred the
quantitative revolution in our field include the ever-increasing ease with which
data can be extracted from corpora, or crowdsourced via platforms such as Am-
azon Mechanical Turk and Prolific (Bohannon 2011; Paolacci, Chandler and
Ipeirotis 2010; Peer, Vosgerau and Acquisti 2014; Sprouse 2011). In addition, it
is becoming increasingly easy to access freely available web data, such as the
results of large-scale word rating studies (Winter 2021).
In the cognitive sciences, Griffiths (2015) speaks of the ‘big data’ computa-
tional revolution. Buyalskaya and colleagues (2021) speak of ‘the golden age of
social science.’ This new era, in which we are inundated by a large number of
freely available or easily obtainable datasets, means that data analytics is in-
creasingly becoming an essential part of linguistic training. However, even
though some linguistics departments offer excellent statistical education, many
others still struggle with incorporating this into their curricula. Many linguistics
students (and sometimes their supervisors!) feel overwhelmed by the sheer
number of different approaches available to them, as well as the many different
choices they have to make for any one approach.
Bodo Winter, Dept. of English Language and Linguistics, University of Birmingham,
e-mail: B.Winter@bham.ac.uk
Acknowledgement: Bodo Winter was supported by the UKRI Future Leaders Fellowship MR/
T040505/1.
https://doi.org/10.1515/9783110687279-002
To readers who are new to the field, the landscape of statistical methodology
may look very cluttered. To begin one’s journey through this landscape, there is
no way around reading at least one book-length introductory text on statistical
methods, of which there are by now many for linguists (e.g., Baayen, 2008; Lar-
son-Hall, 2015), cognitive linguists (e.g., Levshina, 2015; Winter, 2019b), and cor-
pus linguists (Desagulier 2017; Gries 2009). We simply cannot expect to learn all
relevant aspects of data analysis from a short paper, online tutorial, or work-
shop. Statistical education needs more attention than that, and reading book-
length statistical introductions should be a mandatory part of contemporary
linguistic training.
The available books are often focused on teaching the details of particular sta-
tistical procedures and their implementation in the R statistical programming lan-
guage. These books generally cover a lot of ground – many different approaches
are introduced – but they are often less focused on giving a big picture overview.
This chapter complements these introductions by taking a different approach:
without going into the details of any one particular method, I will try to map out a
path through the landscape of statistics. My goal is not to give the reader a set of
instructions that they can blindly follow. Instead, I will focus on giving a bird’s
eye overview of the landscape of statistics, hoping to reduce the clutter.
This chapter is written decidedly with the intention of being accessible to
novice analysts. However, the chapter should also be useful for more experi-
enced researchers, as well as supervisors and statistics educators who are in
need of high-level introductions. Even expert analysts may find the way I
frame data analytics useful for their own thinking and practice. Moreover, I
want to chart the map of statistics in light of the modern debate surrounding
the replication crisis and reproducible research methods (§2), using this chapter
as an opportunity to further positive change in our field.
Our journey through the landscape of statistics starts with a characterization
of data analysis as a cognitive activity, a process of sensemaking (§3). There are
two main sub-activities via which we can make sense of data, corresponding to
exploratory and confirmatory data analysis. Some have (rightfully) criticized the
distinction between exploration and confirmation in statistics (Gelman 2004;
Hullman and Gelman 2021), as it often breaks down in practice. Regardless of
these critiques, the exploration-confirmation divide will serve as useful goal posts
for framing this introduction, as a means for us to split the landscape of statistics
into two halves, each with their own set of approaches that are particularly suited
for either confirmation or exploration. And it is fitting for this volume, which in-
cludes chapters that are relatively more focused on confirmatory statistics (e.g.,
Tay; Fuoli, this volume), as well as chapters that are relatively more focused on
exploratory statistics (e.g., Dilkes; Tian and Zhang; Pan, this volume).
Within the confirmatory part of the statistical landscape, a critique of the
significance testing framework (§4) motivates a discussion of linear models
(§5–6) and their extensions (§7), including logistic regression, Poisson re-
gression, mixed models, and structural equation models, among others. My
goal in these sections is to focus on the data analysis process from the per-
spective of the following guiding question: How can we express our theories
in the form of statistical models? Following this, I will briefly sketch a path
through the landscape of exploratory statistics by looking at Principal Com-
ponents Analysis, Exploratory Factor Analysis, and cluster analysis to show-
case how exploration differs from confirmation (§8.1). Section §8.2 briefly
mentions other techniques that could be seen as exploratory, such as classifica-
tion and regression trees (CART), random forests, and NLP-based techniques
such as topic modeling.
2 The replication crisis and reproducible research
We start our journey by considering how statistical considerations are intrinsi-
cally connected to the open science movement and the ‘replication crisis’ that
has been unfolding over the last decade. No introduction to statistics is com-
plete without considering the important topic of open and reproducible re-
search. Any statistical analysis is pointless if it is not reproducible, and we
cannot, and should not, trust results that do not meet modern standards of
open science. Given how essential reproducibility and transparency are for the
success of linguistics as a science, not including discussions of open science
and reproducibility into the statistics curriculum is doing our field a disservice.
Large-scale efforts in psychology have shown that the replicability of study
results is much lower than people hoped for, with one study obtaining only 36
successful replications out of 100 studies from three major psychological jour-
nals (Open Science Collaboration, 2015; see also Camerer et al., 2018). Linguis-
tics is not safe from this “replication crisis,” as evidenced by the fact that some
high-profile findings relating to language have failed to replicate, such as the
idea that bilingualism translates into advantages in cognitive processing (e.g.,
de Bruin et al., 2015; Paap and Greenberg, 2013).
Cognitive linguists in particular should be particularly wary of the replication
crisis, as a number of the results that have failed to replicate relate to one of the
core tenets of cognitive linguistics, the idea that the mind is embodied (see
Evans, 2012; Gibbs, 2013; Lakoff and Johnson, 1999). Embodied cognition re-
sults that have failed to replicate include, among others, the finding that
reading age-related words makes people walk more slowly (Doyen et al. 2012),
that experiencing warm physical temperatures promotes social warmth (Chabris
et al. 2018), that reading immoral stories makes people more likely to clean their
hands (Gámez, Díaz and Marrero 2011), and that reading action-related sentences
facilitates congruent movements (Papesh 2015). In fact, embodied cognition re-
search may be one of the most non-replicable areas of cognitive psychology (see
discussion in Lakens, 2014). Thus, linguists, and especially cognitive linguists,
need to take the replication crisis very seriously. A recent special issue in the
journal Linguistics (de Gruyter) includes several papers focused on discussing the
relevance of the replication crisis for linguistics (Grieve 2021; Roettger 2021; Sön-
ning and Werner 2021; Winter and Grice 2021).
The reasons for failures to replicate are manifold and cannot be pinned
down to just one cause. This also means that a variegated set of solutions is
required (e.g., Asendorpf et al., 2013; Finkel et al., 2017), including replicating
existing studies, performing meta-analyses of existing studies, preregistering
one’s planned methodology ahead of time, increasing the sample size of studies
where possible, placing more emphasis on effect sizes in one’s analysis, being
more rigorous about the application of statistical methodology, as well as mak-
ing all materials, data, and analysis code publicly available. The latter factor –
open data and open code – is particularly relevant for us here. In linguistics,
including cognitive linguistics, it is still not required for publications to make
everything that can be shared available, although this situation is changing
rapidly (see, e.g., Berez-Kroeker et al., 2018; Roettger et al., 2019). Two of the
flagship cognitive linguistics journals (Cognitive Linguistics, de Gruyter; Lan-
guage and Cognition, Cambridge University Press) now require data to be shared
on publicly available repositories.
In Winter (2019b), I explicitly discussed the issue of replicability in the con-
text of cognitive linguistic research, focusing on “reproducibility” rather than
replication. Reproducibility is defined as the ability of another analyst to take
the existing data of a study and reproduce each and every published value (see
e.g., Gentleman and Temple Lang, 2007; Munafò et al., 2017; Peng, 2011; Weiss-
gerber et al., 2016). In many ways, reproducibility is an even more basic re-
quirement than replicability. Replication involves the repetition of a study with
a new dataset; reproducibility includes that even for the very same data, an-
other person should be able to trace each and every step, ultimately being able
to re-create all figures and statistical results on one’s own machine.
Lakoff (1990) proposed that the subfield of cognitive linguistics can be
characterized by three “commitments”: the cognitive reality commitment, the
convergent evidence commitment, and the generalization and comprehensive-
ness commitment. The details of each of these commitments is irrelevant for
our purposes, but taken together, they ground cognitive linguistics in the em-
pirical sciences (see also Gibbs, 2007), including the incorporation of research
from the wider cognitive sciences. However, if the cognitive science results that
cognitive linguists use to ground their theories in empirical research turn out to
be non-reproducible, all commitment to empirical work is vacuous. Therefore,
in analogy to Lakoff’s foundational commitments, I have argued that cognitive
linguists should add the “reproducibility commitment” to their canon of com-
mitments, repeated here as follows:
The Reproducibility Commitment: “An adequate theory of linguistics needs to be sup-
ported by evidence that can be reproduced by other linguists who did not conduct the
original study.” (Winter 2019b: 126)
When focused on data analysis, this commitment, at a bare minimum, compels
us to make all data and code available.1
From this reproducibility commitment, we can get an easy question out of
the way that some beginning data analysts may have: What statistical software
package should be used? On what software should a novice analyst focus their
efforts on? The Reproducibility Commitment rules out any statistical software
that is proprietary, i.e., that costs money and is not open source (SPSS, SAS,
STATA, Matlab, Mplus etc.). Instead, efforts have to be directed to freely available
open-source software (such as R and Python). Reproducibility commits us to use
software that can be accessed and understood by everyone in the community
without the need to acquire expensive licenses. Clearly, software does not make
one a statistician, and many software packages other than R and Python are very
powerful, but if we want to follow open science principles, we should not be
using software that restricts access to certain members of the linguistic commu-
nity. Especially the R programming environment (R Core Team 2019) is by now
the de facto standard in linguistics, one could even say the ‘lingua franca’ of our
field (Mizumoto and Plonsky 2016). A common objection against R (and other
programming languages such as Python) is the belief that they may be harder to
learn than software with graphical user interfaces such as SPSS. However, there
simply is no empirical evidence to support this claim, and the few studies that
have actually looked at students’ reactions to different software packages suggest
claims about R being substantially harder may be overstated (Rode and Ringel
2019). But even if R were harder than software such as SPSS, teaching the latter
is, simply put, unethical given how incompatible the use of proprietary software is
with the core principles of open and reproducible research.
1 For a response to common objections to data and code sharing, see Winter (2019b, Ch. 2).
With the fundamental topics of reproducibility and the question as to what
software we should use out of the way, we can now begin charting a map of the
landscape of statistics.
3 Data analysis as a cognitive process:
Confirmatory and exploratory sensemaking
Data analysis is fruitfully seen as a cognitive process, one that involves making
sense of data. As stated by Grolemund and Wickham (2014: 189):
“Data analysis is a sensemaking task. It has the same goals as sensemaking: to create reli-
able ideas of reality from observed data. It is performed by the same agents: human
beings equipped with the cognitive mechanisms of the human mind. It uses the same
methods.”
Any sensemaking process is an interaction between the external world and the
sensemaker’s preexisting beliefs. The same way, data analysis is shaped not
only by what’s in the data, but also by the state of the cognizer. Psychologists
distinguish between bottom-up perception (the input, that what directly comes
from the world around us) and top-down perception (influence from our preex-
isting beliefs). Visual perception is both bottom-up and top-down, and so is
data analysis. However, in contrast to visual perception, which is generally au-
tomatic, the researcher performing a data analysis has a choice to make about
how much they want to be bottom-up or top-down. A data analyst should think
about whether they are primarily looking into the data to discover new pat-
terns – with relatively fewer existing beliefs intervening – or whether they are
looking to the data to either confirm or disconfirm their existing beliefs. If we
are in a maximally confirmatory mode, all hypotheses are specified a priori; in
exploratory statistics, far fewer hypotheses are specified a priori, and the
data itself is allowed to suggest new patterns, including some that the re-
searcher may not have thought of. Very informally, we can think of exploratory
statistics as answering the question: What does my data have to offer? In turn,
confirmatory statistics can be thought of as answering the question: Is my the-
ory consistent with the data?
Ultimately, there is a continuum between confirmation and exploration be-
cause every method will always take something from the data, and every method
will always come with some set of assumptions. The distinction between confir-
matory and exploratory statistics is therefore one that comes in degrees, depend-
ing on how much a given statistical methodology requires specifying structures in
advance. And of course, the distinction between confirmatory and exploratory
statistics pertains to the difference in the purpose of an analysis. Generally
speaking, the same method can be used for both confirmation and exploration,
depending on the analyst’s goals. That said, within the field of linguistics, some
methods are more aligned with exploratory versus confirmatory purposes. Be-
cause of this, the following sections will proceed from confirmatory statistics
(§4–7) to exploratory statistics (§8) to frame this introduction. Moreover, even
though the exploration-confirmation distinction may break down in practice,
it is important not to frame the results of exploratory analysis in terms of con-
firmatory analysis, as if they had been predicted in advance (Roettger, Winter
and Baayen 2019).
The majority of the remainder of this chapter is devoted to confirmatory sta-
tistics as opposed to exploratory statistics not because the latter is less important,
but because confirmatory statistics has recently undergone massive changes in
our field, away from significance tests towards statistical models and parameter
estimation. Because this approach is still new in some subfields and new text-
books do not necessarily teach the full scope of this framework, the emphasis
will be on confirmatory statistical models.
4 Why cognitive linguistics needs to move
away from significance tests
We start the journey of confirmatory statistics with what is still the status quo in
many subfields of linguistics. To this day, the notion of ‘statistics’ is synonymous
with ‘null hypothesis significance testing’ (NHST) to many researchers. Undergrad-
uate statistics courses still emphasize the use of such NHST procedures as t-tests,
ANOVAs, Chi-Square tests etc. This includes many existing introductions in cogni-
tive and corpus linguistics (Brezina 2018; Gries 2009; Levshina 2015; Núñez 2007;
Wallis 2021). All significance tests have a primary goal, which is to yield a p-value.
Informally, this statistic measures the incompatibility of a given dataset with the
null hypothesis. If a p-value falls below a certain threshold, the ritual of NHST dictates that the null hypothesis is rejected, and the researcher claims to have ob-
tained a “significant” result. The true meaning of the p-value, however, is so
counter-intuitive that even most statistics textbooks (Cassidy et al. 2019) and statis-
tics teachers fail to discuss it accurately (Gigerenzer 2004; Haller and Krauss 2002;
Lecoutre, Poitevineau and Lecoutre 2003; Vidgen and Yasseri 2016). By itself, the
p-value does not tell us very much (Spence and Stanley 2018), and is only a
very weak indicator of whether a study will replicate, or is strong, or “reliable.”
The over-reliance on significance tests in the behavioral and cognitive scien-
ces, including linguistics, has been widely criticized in the statistical and psycho-
logical literature for now nearly a century (Kline 2004). In fact, many now
believe that the ‘statistical rituals’ (Gigerenzer 2004) encouraged by the use of
significance tests may be one of the key factors that have contributed to the repli-
cation crisis in the first place. But data analysis is so much more than subjecting
the data to a prefab hypothesis testing procedure, which is why the field of lin-
guistics has undergone a dedicated shift away from these methods towards the
more theory-guided process of statistical modeling and parameter estimation
(Baayen, Davidson and Bates 2008; Jaeger 2008; Jaeger et al. 2011; Tagliamonte
and Baayen 2012; Wieling et al. 2014). This change from statistical testing to sta-
tistical modeling is also happening in cognitive linguistics (e.g., Gries, 2015a; Lev-
shina, 2016, 2018; Winter, 2019b). To be clear: significance testing can also be
done with statistical models, but a key difference is that the emphasis shifts from
subjecting the data to an off-the-shelf procedure such as a t-test or a Chi-square
test, towards considering the estimation of parameters in the form of multifacto-
rial statistical models (Gries 2015b; Gries 2018). The latter is less limiting and allows for a more theory-guided approach to statistical analysis.
Most importantly for our purposes, significance tests make for a very bad
way of decluttering the landscape of statistics. Each significance test is a highly
specific tool that can be applied only in extremely limited circumstances. Rec-
ommendations about statistical methodology then often take the form of deci-
sion trees and statements like “if you have this hypothesis and this data, use
test X, otherwise use test Y”. However, rather than worrying about picking the
right test, we should channel our energy into theory-driven reasoning about
data. In contrast to significance testing as a conceptual framework, statistical
modeling encourages thinking about how our preexisting beliefs (= linguistic
domain knowledge / theories / hypotheses / assumptions) relate to a dataset at
hand in a more principled fashion. This is ultimately much more intellectually
engaging than picking a test from a decision tree of different pre-classified op-
tions, and it encourages thinking more deeply about how one’s theory relates
to the data at hand.
Luckily, the world of statistical modeling also turns out to be much easier
to navigate than the world of significance testing. In fact, there is just one tool
that will cover most of the use cases that commonly arise in linguistics. This
tool is the linear model, an approach that can be flexibly extended to deal with
all sorts of different theoretical proposals and data structures. The linear model
framework makes it possible to represent the same data structures that signifi-
cance tests are used for, which renders it unnecessary to teach significance
tests in this day and age.
5 Using linear models to express beliefs
in the form of statistical models
The statistical models we will discuss here are all versions of what is called the
‘linear model,’ or sometimes ‘general linear model,’ also discussed under the
banner of ‘(multiple) regression analysis.’ It is potentially confusing to novices
that the different models bear different names that may sound like they are en-
tirely different approaches. For example, consider this long list of terms:
generalized linear models, linear mixed effects models, multilevel models, generalized
additive models, structural equation models, path analysis, mediation analysis, modera-
tion analysis, Poisson regression, logistic regression, hierarchical regression, growth
curve analysis, . . .
This array of terms may seem daunting at first, but we should take comfort in
the fact all of it is built on the same foundations. In fact, they are all versions or
extensions of a particular approach that, in its essence, is easy to grasp. This is
one of the key conceptual advantages of the linear model framework, which is
that it leads to a very unified and coherent picture of the landscape of statistics,
regardless of the apparent diversity of terms suggested by the list above.
Any of the linear models or regression models we will consider here have
the same structure: One singular response or ‘outcome’ variable is described as
varying by a set of predictor variables. An alternative terminology is to say that
the dependent variable is modeled as a function of one or more independent
variables. Thus, this form of statistical model is focused on describing how one
singular quantity of interest (such as ratings, response times, accuracies, per-
formance scores, word frequencies etc.) is influenced by one or more predictors.
The analyst then focuses on thinking about which predictors they think would
influence the response, thereby implementing their assumptions about the rela-
tions in the data into a statistical model.
To give a concrete example of a linear model, consider the observation that
in English, many words related to taste are positive (such as sweet, delicious,
juicy, tasty, peachy), whereas on average, words related to smell are relatively
more negative (such as rancid, pungent, stinky, odorous) (Winter 2016; Winter
2019b). To test this generalization, we can use existing perceptual ratings for
words (Lynott and Connell 2009) in combination with ‘emotional valence’ rat-
ings, which describe the degree to which a word is good or bad (Warriner, Ku-
perman and Brysbaert 2013). Figure 1.1a visualizes the correlation between
these two quantities (taste-relatedness and emotional valence). As can be seen,
words that are relatively more strongly related to taste are on average more
positive than words that are less strongly related to taste, although this is obviously a very weak relationship given the large scatter seen in Figure 1.1a.
The superimposed line shows the corresponding linear or ‘regression’ model.
This model describes the average emotional valence (whether a word is good or
bad, on the y-axis) as a function of how much a word relates to taste (on the x-
axis). The beauty of the linear model framework is that the underlying mathemat-
ics of lines can easily be extended to incorporate predictors that are categorical,
such as seen in Figure 1.1b. This works by pretending that the corresponding cat-
egories (in this case, the binary distinction between ‘taste words’ and ‘smell
words’) are positioned on a coordinate system (Winter, 2019b, Ch. 7).
The two models corresponding to Figure 1.1a and Figure 1.1b can be ex-
pressed in R as follows, with the first line (dark grey) representing the user
input command, and the following lines (light grey) showing the output. The
reader interested in the details of the implementation within R should consult
Winter (2019a), a book-length introduction to the basics of linear models. Here,
only a brief overview of the overall logic of the approach is presented.
lm(valence ~ taste_relatedness)
Coefficients:
      (Intercept)  taste_relatedness
           3.2476             0.4514
lm(valence ~ taste_vs_smell)
Coefficients:
        (Intercept)  taste_vs_smellSmell
             5.3538              -0.9902
The R function lm() is named this way because it fits linear models. The tilde in
the formula specifies that the term on the left-hand side (in this case, emo-
tional valence) is ‘described by’, ‘predicted by’, ‘conditioned on’, or ‘modeled as
a function of’ the term on the right-hand side. All input to linear models takes
this general form, with the response on the left, and the predictor(s) on the
right: ‘response ~ predictors’. Each of the terms in the model, such as in this case
valence, taste_relatedness, and taste_vs_smell, corresponds to a column in
the spreadsheet that is loaded into R.
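To make this concrete, here is a minimal sketch of such a workflow; the file name ratings.csv and the data frame name words are hypothetical stand-ins, not the actual Lynott and Connell (2009) data:
# read a spreadsheet with one row per word and one column per variable
# (hypothetical file and column names)
words <- read.csv("ratings.csv")
# fit the model on the columns of that data frame
lm(valence ~ taste_relatedness, data = words)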
To understand the above output and interpret the model, we need to re-
mind ourselves that a line can be described by two numbers: an intercept, and
a slope. The intercept is the point where the line crosses the y-axis (which is
conventionally positioned at x = 0). Informally, we can think of the intercept as
corresponding to the ‘height’ of the regression line (higher intercepts mean that
the line is overall shifted upwards). The slope describes the degree to which y
depends on x, with Figure 1.1a giving an example of a positive slope and
Figure 1.1b giving an example of a negative slope.
Taken together, the intercept and slope are called ‘coefficients.’2
In any actual
data analysis, a considerable amount of time should be spent on interpreting the
coefficients to understand what exactly it is that a model predicts. For the continu-
ous model above, the output value of the intercept, 3.2476, corresponds to the
white square in Figure 1.1a, where the line crosses the y-axis. This is the predicted
value for a word with zero taste rating. The intercept is often not particularly inter-
esting, but it is necessary to ‘fix’ the line along the y-axis. Oftentimes we are more
interested in the slopes, as each slope expresses the relationship between the re-
sponse and a given predictor. In the output above, the slope value of 0.4514 has
the following interpretation: increasing the taste-relatedness by one rating unit
leads to a positive increment in emotional valence ratings by this value. Our model
of this data, then, corresponds to the equation of a line: y = 3.25 + 0.45 * taste
(rounded). We can use this equation to make predictions: For example, we can
plug in a taste rating of ‘3’ to assess what emotional valence rating this model pre-
dicts for this specific taste rating: 3.25 + 0.45 * 3 = 4.6 (rounded).
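Such predictions do not have to be computed by hand. A minimal sketch, assuming the model has been stored in an object (taste_model is a hypothetical name, and words is the hypothetical data frame from above):
taste_model <- lm(valence ~ taste_relatedness, data = words)
# the intercept and slope discussed above
coef(taste_model)
# predicted emotional valence for a taste rating of 3:
# 3.25 + 0.45 * 3 = 4.6 (rounded)
predict(taste_model, newdata = data.frame(taste_relatedness = 3))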
Figure 1.1: a) Emotional valence (y-axis) as a function of how much a word relates to taste (x-axis; continuous predictor); b) emotional valence as a function of the categorical taste versus smell difference, with taste = 0 and smell = 1 (categorical predictor).
2 Some textbooks use the term ‘coefficient’ only for the slope.
This basic picture does not look markedly different for the case of a categor-
ical predictor, except that the corresponding slope has to be interpreted as a
categorical difference between two groups. In the output above, the number
−0.9902 represents the difference in emotional valence between taste and smell
words. Thinking in terms of lines, this difference can be conceptualized as mov-
ing from the taste words at x = 0 (the intercept) down to the smell words at x = 1.
It is this mathematical trick – positioning categories within a coordinate system –
that allows linear models to easily incorporate continuous predictors (Figure 1.1a)
as well as categorical predictors (Figure 1.1b).
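Using the coefficient values from the output above, the following lines sketch how the two group means can be recovered by plugging the 0/1 coding into the equation of the line:
# taste words sit at x = 0, so their predicted valence is the intercept
5.3538 + (-0.9902) * 0   # = 5.35 (rounded)
# smell words sit at x = 1, so moving to them adds the slope once
5.3538 + (-0.9902) * 1   # = 4.36 (rounded)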
The basic idea of expressing the relation between y and x in terms of coeffi-
cients can be extended to the case of ‘multiple regression,’ which involves add-
ing more predictors to the model, each one associated with its own coefficient
that describes how that particular predictor is related to the response variable.
This is exemplified in the following schematic R function call:
lm(response ~ predictor1 + predictor2 + predictor3)
Coefficients:
(Intercept) predictor1 predictor2 predictor3
? ? ? ?
The function call (dark grey) can be thought of as a more technical way of ex-
pressing our hypotheses in the form of an equation. This particular formula nota-
tion can be paraphrased as “I want to assess whether the response is influenced
jointly by predictor1, predictor2, and predictor3.” The linear model will then es-
timate the corresponding coefficients – one slope for each predictor. Each slope ex-
presses the relationship between the response and that specific predictor while
holding all the other predictors constant. For example, the slope of predictor1 cor-
responds to how much predictor1 is statistically associated with the response
while controlling for the influence of the other predictors. If the slope is positive,
increases in predictor1 result in an increase of the response. If the slope is nega-
tive, increases in predictor1 result in a decrease of the response.
We can think of the coefficients as placeholders specified by the user,
which the linear model in turn will ‘fill’ with estimates based on the data. In
the schematic function call above, this placeholder nature is represented by the
question marks. This highlights how fitting a linear model essentially corre-
sponds to a set of questions (what is the slope of each of these terms?) that the
model will try to answer. Given a model specification and given a particular da-
taset, the model actually fills the question marks with the best-fitting coefficient
estimates, those values that ensure that the predictions are closest to all data
points. However, the linear model only performs optimally with respect to the set
of instructions that the data analyst has specified. If the user has missed impor-
tant predictors, the model cannot know this. It can only answer the questions it
has been asked to answer, which is why researchers should spend a lot of time
thinking about the linear model equation – ideally prior to collecting the data.
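As a purely hypothetical illustration of this placeholder logic, one could extend the valence model from above with an additional predictor (frequency is an invented stand-in here):
# each predictor gets its own slope, estimated while
# holding the other predictor constant
multi_model <- lm(valence ~ taste_relatedness + frequency, data = words)
# the estimated coefficients 'fill' the question marks
coef(multi_model)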
If one uses linear models in a confirmatory fashion, the inclusion of each
predictor into the model should be theoretically motivated. It is therefore good
practice to specify a linear model in advance – before loading the data into any
statistical software. Or, even more in line with a fully confirmatory approach, a
researcher can pre-register their analysis in a publicly accessible repository prior
to collecting the data (Roettger 2021). The two cognitive linguistics-oriented jour-
nals Language and Cognition and Cognitive Linguistics have a special article cate-
gory called ‘Registered Reports’ that requires pre-specifying an analysis plan
prior to collecting the data. In my experience teaching statistics, novice analysts
generally make their life harder by jumping into a statistical software package
too quickly. Everything becomes much easier if considerable time is spent on de-
termining which predictors should or should not be included in advance, based
on theory, literature, and domain knowledge.
There are procedures for changing the model in response to the data (e.g.,
“model selection” techniques such as LASSO) that will not be discussed here.
Moreover, in Bayesian statistics, several researchers recommend expanding
models based on how well they can generate novel data in line with existing
data (Gelman and Shalizi 2013; Kruschke 2013). However, despite the existence
of such approaches, it is still useful and recommended to think as much as pos-
sible about a model in advance of performing an analysis or collecting the data.
The next section discusses how linear models can be expanded to assess
more complex theoretical ideas involving interactions.
6 Linear models with interactions
Linear models can be expanded to include interaction terms. These are best ex-
plained by example. Here, I will draw from Winter and Duffy (2020), an experi-
mental study on gesture and time metaphors in which interactions were of key
theoretical interest. This experiment follows up on the famous “Next Wednes-
day” question that has been used extensively to probe people’s metaphorical
conceptualization of time (Boroditsky and Ramscar 2002; McGlone and Harding
1998). When asked the following question . . .
“Next Wednesday’s meeting has been moved forward two days – what day is the meeting
on now?”
. . . about half of all English speakers respond ‘Friday,’ and half respond ‘Mon-
day’ (Stickles and Lewis 2018). This is because there are two ways of conceptual-
izing time in English, one with an agent moving forward through time (reflected
in such expressions as We are approaching Christmas), another one with the
agent being stationary and time moving towards the agent (reflected in such ex-
pressions as Christmas is coming). Jamalian and Tversky (2012) and subsequently
Lewis and Stickles (2017) showed that certain gestures can change whether peo-
ple respond Monday or Friday. If the question asker moves the gesturing hands
forwards (away from their torso), an ego-moving perspective is primed, which
implies a shift from Wednesday towards Friday. If the question asker moves the
gesturing hand backwards (from an extended position towards their torso), a
time-moving perspective is primed, thus implying a shift from Wednesday to
Monday.
In our follow-up study to these experiments, we wanted to know to what ex-
tent gesture interacts with the concomitant language. That is, how much do lan-
guage and gesture co-depend on each other in determining time concepts? For
example, can forwards/backwards movement in gesture alone push people for-
wards/backwards along the mental time line, even if the corresponding language
does not use any spatial language at all? Or is spatial language needed in order
to make people pay attention to the direction of the gesture? In one of our experi-
ments (Winter and Duffy, 2020, Experiment 4), we manipulated two factors, each
one of which is a categorical predictor in the corresponding linear model: The
first factor is gestural movement, whether the hands move forwards or back-
wards. The second factor is whether the language was spatial (moved by two
days) or not (changed by two days). The corresponding model that we used had
the following basic structure:
glm(response ~ gesture * language, . . .)
Coefficients:
(Intercept)   gesture   language   gesture:language
          ?         ?          ?                  ?
The use of glm() as opposed to lm() is irrelevant for the present discussion and
will be explained in the next section (§7). What matters here is the fact that the
above function call combines predictors with the multiplication symbol ‘*’
rather than using the plus symbol ‘+’, as in the last section. This difference in
notation instructs the statistical model to not only estimate the effects of gesture
and language, but also the effects of unique combinations of both predictors. An-
other way of thinking about this interaction is to say that one predictor has a dif-
ferent effect for specific values of the other predictor. For example, the forwards/
backwards effect could be nullified if the language is non-spatial, which is in-
deed what we found (Winter and Duffy 2020). In the output, the interaction ap-
pears as a third term in the model, gesture:language. The size of this coefficient
corresponds to the strength of the interaction. The larger this coefficient is (whether positive or negative), the more the gesture and language predictors co-depend on each other in changing the response.
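In R’s formula syntax, the ‘*’ notation is a shorthand that expands into the main effects plus their interaction. A sketch, assuming a hypothetical data frame d with the relevant columns (family = binomial anticipates the logistic regression discussed in the next section, as the Monday/Friday response is binary):
# these two specifications are equivalent:
glm(response ~ gesture * language, data = d, family = binomial)
glm(response ~ gesture + language + gesture:language, data = d, family = binomial)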
In linguistics, many statistical models include such interactions. One issue
that arises, however, is that once two predictors are ‘interlocked’ by virtue of par-
ticipating in an interaction, each predictor’s influence has to be interpreted with
respect to the specific values of the other predictor. In the presence of an interac-
tion, what influence a predictor has on the response variable will depend on the
specific level of the other predictor, and so there is no easy-to-interpret ‘across the
board’ effect for the individual predictors anymore. As a result, interactions gener-
ally make models harder to interpret. Thus, while interactions are often theoreti-
cally interesting and need to be included if a hypothesis actually specifies that one
predictor’s influence on the response depends on another predictor, including in-
teraction terms also comes at the epistemological cost of making models harder to
interpret. The question whether an interaction should or should not be included
depends on theory. In the case of Winter and Duffy (2020), the interaction was the
primary effect of theoretical interest and therefore had to be included into the
model. As it is easy to misinterpret the output of statistical models that contain
interactions, the reader is advised to consult a statistics textbook on this material.
Winter (2019a) has a full chapter focused on the interpretation of interactions.
7 Becoming a more flexible data analyst
via extensions of linear models
7.1 Generalized linear models
In the first regression example we discussed above (§5), the response was continuous: each word was represented by an average rating of how good or bad it is, in a scalar manner (Warriner, Kuperman and Brysbaert 2013). The Monday/Friday response in the second example (§6), by contrast, was discrete. As
discussed above, predictors in linear models can be continuous or categorical.
However, to incorporate discrete responses, a more substantive change to the
model is required. Generalized linear models (GLMs) are an extension of linear
models that allow incorporating different assumptions about the nature of the
response variable. The generalized linear model framework subsumes the linear
models we discussed so far. That is, the multiple regression models we discussed
above (§5-6) are specific cases of the generalized linear model. Table 1.1 gives an
overview of the three ‘canonical’ generalized linear models that cover a lot of
common use cases in linguistics.

Table 1.1: Three of the most common types of response variables and the most canonical generalized linear models that correspond to them.

Response variable              Generalized linear model
continuous                     multiple regression
discrete: binary (fixed N)     logistic regression
discrete: count (no fixed N)   Poisson regression
In linguistics, logistic regression is generally used when the response variable
is binary, such as was the case with the Monday/Friday responses in the exam-
ple above. Other response variables that are binary include things such as the
English dative alternation (Bresnan et al. 2007), the usage of was versus were
(Tagliamonte and Baayen 2012), or the presence/absence of a metaphor (Winter
2019b). Any case where the response involves only two categories is amenable
to logistic regression, and in the form of multinomial logistic regression, the
approach can be extended to include response variables with more than two
categories.
Poisson regression is another type of generalized linear model that is in-
credibly useful for linguistics because it is the canonical model type to deal
with count variables that have no known or fixed upper limit. Example applica-
tions include showing that visual words are more frequent than non-visual
words (Winter, Perlman and Majid 2018), or modeling the rate of particular fill-
ers and discourse markers as a function of whether a speaker speaks politely or
informally (Winter and Grawunder 2012). Given that linguists frequently count
the frequency of discrete events, Poisson regression should be a natural part of
the linguistic toolkit (Winter and Bürkner 2021).
Conceptually, the core ideas discussed in relation to multiple regression (§5)
carry over to the case of generalized linear models. Just as before, fitting a general-
ized linear model to a dataset yields estimates of slopes, with each slope represent-
ing how much a binary variable (logistic regression) or unbounded count variable
(Poisson regression) depends on the predictor(s) of interest. However, a key differ-
ence to the more basic case of multiple regression is that the slopes will appear in
a different metric. In the case of Poisson regression for unbounded count data, the
coefficients will appear as logged values; in the case of logistic regression for bi-
nary data, the coefficients will appear as log odds. The reader is advised to read an
introductory text on generalized linear models to aid the interpretation of the exact
numerical values of the coefficients (Winter 2019a; Winter and Bürkner 2021).
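As a brief sketch of what this back-transformation looks like in practice (model and data frame names are hypothetical):
# Poisson regression: coefficients are estimated on the log scale
freq_model <- glm(word_frequency ~ sensory_modality, data = words, family = poisson)
exp(coef(freq_model))    # back-transform into interpretable rates
# logistic regression: coefficients are estimated as log odds
choice_model <- glm(response ~ gesture, data = d, family = binomial)
exp(coef(choice_model))  # back-transform into odds ratios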
7.2 Mixed models / multilevel models
All generalized linear models, including standard multiple regression, assume
that data points are independent. For linguistics, this means that an ‘individual’
can only contribute one data point to the whole data set. What counts as ‘indi-
vidual’ in linguistics depends on the type of study and how it is designed (Winter
and Grice 2021): In many contexts, the ‘individual’ is an actual person. This is
often the case in such subfields as psycholinguistics, phonetics, sociolinguistics,
or experimental syntax where repeated measures designs are common, which
leads to experiments that include multiple data points from each person. How-
ever, the term ‘individual’ as used here has a broader meaning, essentially in-
corporating any grouping factor. For example, if a corpus analysis included
multiple data points from the same text, then ‘text’ could be an individual unit
of observation. If the same corpus analysis included multiple data points from
the same author, ‘author’ would be another grouping factor.
Ignoring the fact that data contains multiple data points from the same in-
dividual or group has dire consequences for the outcomes of a statistical analy-
sis. These problems are discussed under the banner of the “independence
assumption” of statistical tests, and they are also discussed under the banner
of ‘pseudoreplication’, which is a term used in certain fields, such as ecology,
to refer to erroneously treating statistically dependent observations as indepen-
dent. For a classic introduction to the notion of “pseudoreplication”, see Hurl-
bert (1984). For a more general discussion of the independence assumption in
statistical modeling and consequences for different subfields of linguistics, in-
cluding corpus linguistics, see Winter and Grice (2021). It is now standard in
many different subfields of linguistics to include individual or grouping factors
into the modeling process to statistically account for the fact that data points
from the same individual or group are dependent (see e.g., Baayen et al., 2008;
Tagliamonte and Baayen, 2012; Winter, 2019b, Ch. 13). In fact, the ability to ac-
count for multiple grouping data structures in a principled manner is one of the
major reasons why the field has moved away from significance tests (Winter
and Grice 2021), which make it hard and sometimes impossible to account for
multiple nested or crossed dependencies in the data.
The way to deal with non-independent clusters of observations within a
generalized linear model framework is to incorporate random effects into the
model. The mixing of the regular predictors we have dealt with in the last few
sections (now called ‘fixed effects’ in this framework) and what are called ‘ran-
dom effect’ predictors is what gives mixed models their name. Mixed models
are also called multilevel models, linear mixed effects models, or multilevel re-
gression, and specific instances of these models are also discussed under the
banner of hierarchical linear regressions. Gries (2015a) calls mixed models “the
most under-used statistical method in corpus linguistics.” McElreath (2020:
400) states that such “multilevel regression deserves to be the default ap-
proach.” Clearly, there is no point in using mixed models if a particular applica-
tion does not call for it. However, because data sets in linguistics will almost
always include repeated data points from the same ‘individual’ (person, word,
language family, text etc.), mixed models are almost always needed (Winter and
Grice 2021).
A random effect can be any categorical variable that identifies subgroups in
the data, clusters of observations that are associated with each other by virtue of
coming from the same individual. The word ‘random’ throws some novices off be-
cause differences between grouping factors are clearly not ‘random’, i.e., a specific
participant may systematically respond differently from another participant in a
study. It therefore helps to think of ‘randomness’ in terms of ‘ignorance’: by fitting
a specific grouping variable as a random effect, we are effectively saying that prior
to the study, we are ignorant about the unique contribution of each individual.
Random effects then estimate the variation across individuals.
It is entirely possible (and indeed, often required) to have random effects for
participants in the same model in which there are also fixed effects that are tied
to the same individuals. For example, we may hypothesize that gender and age
systematically affect our response variable, and we want our mixed model to in-
clude these two variables as fixed effects. In addition, the same model can, and
indeed probably should, also contain a random effect for participant. Whereas
the fixed effects capture the systematic variation that an individual is responsi-
ble for (e.g., for an increase in x years of age, the response changes by y); the
random effects capture the idiosyncratic contribution of that individual. Thus,
including fixed effects tied to specific individuals does not preclude the inclu-
sion of additional individual-specific random effects.
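A minimal sketch of such a model with the lme4 package (variable and data frame names are hypothetical):
library(lme4)
# fixed effects for gender and age capture systematic variation;
# the random intercept (1 | participant) captures each participant's
# idiosyncratic contribution
mixed_model <- lmer(response ~ gender + age + (1 | participant), data = d)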
Importantly, the incorporation of random effects into the modeling process is
key to avoiding violations of the independence assumption. Ignoring important
random effects can lead to grossly misleading results, as has been extensively
discussed in linguistics and elsewhere (Barr 2013; Barr et al. 2013; Matuschek et al.
2017; Schielzeth and Forstmeier 2008; Winter 2019a; Winter and Grice 2021). The
reason for this is that fixed effects, such as condition effects, have to be evaluated
against random effect variation. If the random effect variation is not actively esti-
mated in the modeling process, fixed effects estimates can be severely over-
confident, leading to a much higher rate of spuriously significant results. Because
of this, it is of utmost importance that the researcher spends a lot of time thinking
about whether there are non-independent clusters in their data set that need to be
accounted for.
The discussion so far has focused on including random effects to make sure that the fixed effects in the same model are estimated accurately. However, it is important to emphasize that random effect variation
itself may actually be the primary thing that is of theoretical interest in some
studies (for examples, see Baumann and Winter, 2018; Drager and Hay, 2012;
Idemaru et al., 2019; Mirman et al., 2008). Some analyses may also require
models that have more random effects than fixed effects (e.g., Ćwiek et al.,
2021). It is important to think about random effect variation as theoretically
interesting in its own right, and to interpret the random effect output of one’s
models. For a basic introduction to mixed models, see Winter (2019a).
7.3 Example models within cognitive linguistics
The trajectory of this chapter so far has been an expansion of the statistical
modeling repertoire. This is not, like in the case of significance tests, adding
new, fundamentally distinct tools to our toolkit. Instead, we take the very same
tool (focused on the relation of a response and a predictor in terms of lines) and
expand this tool to binary and count data (generalized linear models: logistic
regression and Poisson regression), as well as to cases where there are multiple
data points for the same grouping factor (mixed models). Table 1.2 shows some
different types of models applied to test cognitive linguistic theories. Plain text
descriptions of the corresponding R formula notation should help the reader
get a better grasp of the extent of the linear model framework, and how this
translates into actual data analyses conducted in cognitive linguistics. While
the models have different names (“multiple regression”, “logistic regression”
etc.), it has to be borne in mind that they are built on the same foundations.
Table 1.2: Examples of linear model formulas in cognitive linguistic work.
Littlemore et al. (2018)
lm(metaphor_goodness ~ frequency + concreteness + emotional_valence)
Multiple regression: This model describes the extent to which a continuous measure of metaphor
goodness depends on the frequency, concreteness, and emotional valence of a metaphor.
Winter (2019a, Ch. 17)
lm(metaphoricity ~ iconicity + valence)
Multiple regression: This study operationalized metaphoricity in a continuous fashion using a
corpus-based measure; the research question was whether iconicity (e.g., the words bang and
beep are iconic) and the emotional quality of words determine the likelihood with which the
words are used as source domains in metaphor.
Winter, Perlman, and Majid (2018)
glm(word_frequency ~ sensory_modality, family = poisson)
Poisson regression: The token frequency of words in a corpus (a discrete count variable)
was conditioned on which sensory modality (touch, taste, smell, sight, sound) the word
belongs to; of interest here is whether visual words are more frequent than non-visual words.
Hassemer and Winter (2016)
glm(height_versus_shape ~ pinkie_curl * index_curve, family = binomial)
Logistic regression: In this experimental study, participants had to indicate whether a gesture
depicted the height or the shape of an object; the above model formula expresses the belief
that this binary categorical response depends on two predictors that we experimentally
manipulated: either the index finger was more or less curved, or the pinkie finger was more or
less curled in. Additionally, we were interested in the interaction of these two hand
configuration predictors (it is plausible that specific combinations of pinkie curl and index
curve values have a unique effect on the choice variable).
Winter, Duffy and Littlemore (2020), simplified
lmer(RT ~ participant_gender * profession_term_gender * verticality)
(additional random effects for participant and item not shown)
Linear mixed effects regression: In this reaction time experiment, we were interested in
replicating the finding that people automatically think of power in terms of vertical space because
of the conceptual metaphor POWER IS UP. The basic idea is that seeing a word pair such as “doctor
~ nurse” with the word “doctor” shown on top of a screen and the word “nurse” shown at the
bottom of the screen will be faster than the reverse mapping (with the less powerful position on
top). In addition, we manipulated the gender of the profession terms (e.g., “male doctor” or
“female doctor”) and we also wanted to assess whether responses differed as a function of the
participant’s gender. Our results provided evidence for a three-way interaction between the
participant’s gender, the gender of the profession term, and whether the vertical arrangement was
in line with POWER IS UP or not: male participants responded much more quickly when the vertical
alignment was consistent with the metaphor and the male profession term was shown on top.
7.4 Further extensions of the framework
Further extensions of the linear model framework are possible, and specific
data sets and theoretical assumptions may require a move towards more com-
plex model types. One set of extensions is in the direction of structural equation
models (SEM), which involve extending the linear model framework either by
adding what are called ‘latent variables’ (variables that cannot be directly ob-
served), or by adding complex causal paths between predictors, such as when
one predictor influences the responses indirectly, mediated by another predic-
tor. Fuoli, this volume, discusses these concepts in more detail.
An additional extension of linear models are generalized additive models, or
GAMs. These models are increasingly gaining traction within linguistics (Wieling
et al. 2014; Winter and Wieling 2016), especially psycholinguistics (Baayen et al.
2017) and phonetics (Wieling 2018). GAMs are (generalized) linear models that
involve the addition of ‘smooth terms,’ which are predictors that are broken up
into a number of smaller functions. This allows the inclusion of nonlinearities
into one’s modelling process. Such nonlinearities are useful when dealing with
spatial data, such as in dialectology, sociolinguistics, or typology, but they are
also useful for dealing with temporal data, such as in time series analysis (com-
pare Tay, this volume). This means that GAMs could be used, for example, for
modeling such diverse aspects of linguistic data as pitch trajectories, mouse
movements in a mouse-tracking study, gesture movements on a continuous
scale, or language evolution over time.
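For instance, a pitch trajectory analysis could be sketched with the mgcv package as follows (all names are hypothetical):
library(mgcv)
# s(time) is a smooth term: the potentially nonlinear trajectory of
# pitch over time, broken up into a set of smaller basis functions
pitch_model <- gam(pitch ~ s(time), data = d)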
Both SEMs and GAMs can also incorporate random effects, in which case they
would be mixed SEMs or mixed GAMs. Moreover, since all of these are extensions
of the same generalized linear model framework, each one of these approaches
can also be logistic regression models (binary data) or Poisson regression models
(unbounded count data), just with additional paths or latent variables (SEM) or
with additional smooth terms (GAM). In fact, all of these different extensions of
linear models are mutually compatible. For example, a GAM can also include ran-
dom effects or latent variables. Figure 1.2 gives one overview of how the different
classes of models can be conceptualized.

Figure 1.2: Snapshot of a portion of the generalized linear model framework and some common extensions.
At this stage in this overview, the reader may get the impression that we are
back to some sort of decision tree where we have to choose the right model, which is just what I have argued is a conceptual disadvantage of the significance testing framework. However, the picture painted here is fundamentally different from the
case of significance tests. In the case of the linear model framework, the analysis
process involves reasoning about one’s model and building all the things that
are hypothesized to matter into a singular model. For example, if there are multi-
ple observations from the same individual in a dataset, random effects need to
be added. If there are also hypothesized nonlinear effects, smooth terms need to
be added, and so on. What type of model is needed, then, follows from theoreti-
cal considerations and one’s domain knowledge about the dataset at hand.
The reader should not put the question “What model should I choose?”
first, but instead focus on the guiding question: “What are my assumptions
about this data?” Or, more informally, “What do I know about this data that my
model needs to know?” There simply is no default way of answering these ques-
tions, as this depends on the knowledge, assumptions, and theoretical leaning
of the data analyst. Therefore, different researchers come up with different sta-
tistical analyses for the same research questions (Botvinik-Nezer et al. 2020; Sil-
berzahn et al. 2018), and this subjectivity of statistical modeling should be
endorsed, rather than eschewed (McElreath 2020). We should stay away from
any default recipes and instead focus on the thinking part of statistical modeling, which is arguably what makes data analysis fun and inspiring.
8 Exploratory statistics: A very brief overview
of some select approaches
8.1 General overview
As mentioned above, exploratory data analysis is one of the two key ways that we engage with data; it is one of the two ‘sensemaking modes’ of data analysis.
Within statistics, one of the names that is most commonly associated with ex-
ploratory data analysis is John Tukey, who characterizes the process of explor-
atory data analysis as “detective work” (Tukey 1977: 1). It is arguably the case
that exploratory data analysis has historically been more important for lin-
guists (Grieve 2021), given that the field is founded on detailed and genuinely
exploratory “detective work”, such as the early historical linguistics in the 19th century, the rich tradition of descriptive and field linguistics throughout the 20th century, and the rich tradition of observational research conducted under the banner of corpus linguistics since the middle of the 20th century. It
is important that exploratory statistics are not viewed as theory-neutral or in-
ferior to confirmatory statistics (Roettger, Winter and Baayen 2019).
One form of exploratory data analysis is data visualization, and indeed,
this was a primary focus of Tukey’s work on this topic. In fact, all descriptive
statistics (summarizing data) and visualization can be conducted in a genuinely
exploratory fashion, looking to see what the data has to offer. For an excellent
practical introduction to data visualization with R, see Healy (2019). However,
here I want to focus on two approaches that are not only particularly useful for
cognitive linguists, but that have also been widely used within the language
sciences, especially in corpus linguistics. These approaches can be split up into
two broad categories, shown in Table 1.3. Other approaches, with different
goals, will be discussed below.

Table 1.3: Two common goals in exploratory data analysis and corresponding approaches.

Goal                   Example techniques
grouping variables     Exploratory Factor Analysis, (Multiple) Correspondence Analysis, . . .
grouping data points   k-means, hierarchical cluster analysis, Gaussian mixture models, . . .
Each row in this table corresponds to a specific analysis goal: Is the target of
one’s investigation to look at relationships between lots of variables, grouping
them together into a smaller set of underlying factors? Or is the target of one’s
investigation to find subgroups of data points (‘clusters’)? The approaches
that are used to answer these two questions differ from the above-mentioned
linear model framework in a fundamental fashion in that they are genuinely
multivariate. This means that in contrast to regression modeling, there is no
one primary response variable. Instead, these approaches deal with multiple
outcome variables at the same time. For the approaches listed in Table 1.3,
there is no asymmetry between ‘predictor’ and ‘response’. All variables are on
equal terms.
8.2 Grouping variables with exploratory factor analysis
To exemplify both approaches, we can follow an analysis presented in Winter
(2019b), which uses the perceptual ratings for adjectives from Lynott and Con-
nell (2009). The input data has the following structure, with each column
corresponding to ratings on one sensory modality, and each row corresponding
to a particular adjective. The word abrasive, for example, has a relatively high
touch rating and much lower taste and smell ratings. In contrast, the word acidic
is much more strongly related to taste and smell.
             sight    touch     sound      taste      smell
abrasive  2.894737 3.684211 1.6842105 0.57894737 0.57894737
absorbent 4.142857 3.142857 0.7142857 0.47619048 0.47619048
aching    2.047619 3.666667 0.6666667 0.04761905 0.09523809
acidic    2.190476 1.142857 0.4761905 4.19047619 2.90476190

With Exploratory Factor Analysis (EFA), we focus on relationships between the columns. The entire correlation structure between all variables is investigated in a simultaneous fashion, looking to see whether particular variables can be re-expressed in terms of being part of a single underlying factor. Running an EFA on this ‘word by modality’ matrix suggests that there may be two underlying factors in this set of variables. The following summary of ‘loadings’ allows looking at how the original variables relate to the new set of factors. A loading expresses how strongly an individual variable latches onto a factor, with the sign of the loading expressing whether there is a positive association or negative association, which can be thought of as being analogous to a positive correlation (more of X gives more of Y) or a negative correlation (more of X gives less of Y). Two loadings for factor 2 are missing in the output below because they do not exceed the threshold taken to be a strong-enough loading by the base R function factanal() used to compute these results.

Loadings:
      Factor1 Factor2
sight  -0.228   0.674
touch  -0.177   0.496
sound  -0.445  -0.654
taste   0.824
smell   0.945

A crucial part of doing an Exploratory Factor Analysis is interpreting what the new factors mean. It is not guaranteed that an EFA will yield a theoretically interpretable solution. In this case, however, clear patterns emerge: The positive values for taste and smell, as opposed to the negative values for all other modalities, suggest that the first factor represents how much a word is related to
taste and smell. The fact that these two variables load heavily onto the same
factor is theoretically interesting given that prior psychological research sug-
gests that taste and smell are perceptually and neurally highly coupled (e.g.,
Auvray and Spence, 2008; De Araujo et al., 2003). Sight and touch load heavily
onto the second factor, with a strong negative loading for sound, suggesting
that this factor represents how much things can be touched and seen as op-
posed to heard.
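For readers who want to try this themselves, a minimal sketch of the factanal() call that underlies output like the above (assuming the ratings sit in a hypothetical data frame called ratings, with one column per modality):
# extract a two-factor solution from the five rating columns
efa <- factanal(ratings[, c("sight", "touch", "sound", "taste", "smell")],
                factors = 2)
# inspect the loadings; loadings below the cutoff are suppressed
print(efa$loadings, cutoff = 0.2)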
Theoretically, this factor solution can be thought of as an alternative, more
parsimonious, way of representing the perceptual structure of the sensory lexi-
con of English. There are two underlying factors that allow capturing most of
the ways words differ from each other in terms of their sensory characteristics.
In this case, these two factors together account for 60% of the variance in the
overall ratings. The insight here is that a larger set of variables can be compressed
into a much smaller set of factors without losing too much information. This gen-
eral principle is called dimensionality reduction and can be likened to looking at
the night sky, which is a two-dimensional projection of three-dimensional space.
Other dimensionality reduction techniques include Principal Components Analysis
and (Multiple) Correspondence Analysis. Multidimensional Scaling is a conceptu-
ally similar approach that is aimed at finding the underlying structure of similarity
or dissimilarity data.
8.3 Grouping data points with cluster analysis
The above Exploratory Factor Analysis answers the question: Are there any
groups of variables? We can use the same data to answer the question: Are there
any groups of data points? For this, we use the same ‘word by rating’ matrix, but
our focus will be on grouping the words (rows) rather than the columns (varia-
bles). ‘Cluster analysis’ – itself a vast field of statistics – is the approach that al-
lows answering this question. There are many different specific algorithms that
realize this goal, with k-means and various forms of hierarchical cluster analysis
being common in corpus linguistics. For a conceptual introduction to cluster
analysis in cognitive linguistics, see Divjak and Fieller (2014).
In the case of the sensory modality data, I opted to use the specific cluster-
ing technique of Gaussian mixture models (Winter 2019b). In contrast to such
approaches as k-means and hierarchical cluster analysis, Gaussian mixture
modeling is a clustering technique that has the key theoretical advantage of al-
lowing for fuzzy overlap between clusters, with some words being more and
others being less certain members of each cluster. Moreover, Gaussian mixture
models actually yield a genuine model of clusters (with parameter estimates),
rather than merely a heuristic partitioning of the data. Either way, applying this
method to the sensory modality rating dataset yielded 12 clusters (various
model fit criteria can be used to find the best cluster solution). The same way
that there is no guarantee that Exploratory Factor Analysis produces interpret-
able factors, there is no guarantee that any cluster analysis method produces
interpretable clusters. To interpret whether the cluster solution produces sensi-
ble results that can be meaningfully related to existing linguistic proposals, we
can look at the most certain words for each cluster, shown in Table 1.4.
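A sketch of how such a model can be fitted with the mclust package (the data frame name is hypothetical; see Winter 2019b for the actual analysis):
library(mclust)
# fit Gaussian mixture models with varying numbers of clusters;
# model fit criteria (BIC) are used to select the best solution
gmm <- Mclust(ratings[, c("sight", "touch", "sound", "taste", "smell")])
summary(gmm)           # chosen number of clusters and model type
head(gmm$uncertainty)  # fuzzy membership: each word's cluster uncertainty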
In Winter (2019b), I discussed these different clusters and their relation to the ex-
isting literature on sensory words in detail. The words in some clusters form
clearly semantically coherent groups, such as the ‘pure sight’ cluster, which con-
tains color words. Other clusters, however, are less semantically coherent, such as
the ‘multisensory’ and ‘impression-related’ categories. Just like EFA, cluster analy-
sis is a purely statistical procedure, and it is up to the analyst to use this procedure
within their sensemaking process.
Table 1.4: The twelve clusters of sensory adjectives, with proposed names and the most certain member words for each cluster.

Cluster  Proposed name               Adjectives ordered in terms of certainty
1        pure sight                  gray, red, brunette, brown, blonde, reddish, yellow, . . .
2        shape and extent            triangular, conical, circular, curved, little, bent, . . .
3        gross surface properties    crinkled, bristly, prickly, big, sharp, bumpy, wiry, . . .
4        motion, touch, and gravity  craggy, ticklish, low, swinging, branching, scratchy, . . .
5        skin and temperature        tingly, lukewarm, tepid, cool, warm, clammy, chilly, . . .
6        chemical sense              acrid, bitter, tangy, sour, salty, antiseptic, tasteless, . . .
7        taste                       cheesy, chocolatey, bland, unpalatable, alcoholic, . . .
8        smell                       odorous, whiffy, perfumed, reeking, smelly, stinky, . . .
9        sound 1                     noisy, deafening, bleeping, silent, whistling, . . .
10       sound 2                     sonorous, squeaking, melodious, muffled, creaking, . . .
11       impression-related          stormy, cute, crowded, crackling, clear, lilting, . . .
12       multisensory                beautiful, burning, gorgeous, sweaty, clean, strange, . . .
8.4 Other statistical techniques commonly used
for exploratory data analysis
The examples discussed so far should clarify how EFA and cluster analysis are
very different from the linear models discussed in the previous sections. In con-
trast to linear models, the exploratory approaches discussed so far consider all
variables together without an asymmetric relation between response and predic-
tor. There are, however, approaches that allow dealing with this asymmetric rela-
tionship in a genuinely exploratory fashion, such as classification and regression
trees (CART, for a conceptual introduction for linguists, see Gries, 2019). Such
trees use a method called binary recursive partitioning to split the data into a
representation that essentially looks like a decision tree, with discrete split points
for different variables in an analysis (see Strobl et al., 2009 for an introduction).
For example, a split point for the data shown in Figure 1.1a above could be a
taste rating of <2.3, below which most values are very negative. The process of
recursively splitting the data into smaller groups can also incorporate multiple
different variables, so that, for example, within the set of words with <2.3 taste
rating, we may consider a new split along a different variable. Resulting from
this recursive splitting procedure is a tree-like representation of the complex rela-
tions of how different predictors influence a singular response. The type of data
that one can use classification and regression trees for is similar to the type of
data one would use regression for, in that it involves multiple predictors and just
one response variable. However, in contrast to regression, CART approaches are
generally used when there are no clear expectations about how the predictors
relate to the response, and which predictors should or should not be included.
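A sketch of such a tree with the rpart package (reusing the hypothetical words data frame from above; the frequency predictor is invented for illustration):
library(rpart)
# recursively partition the data into groups with discrete split
# points, e.g., 'taste rating < 2.3' could be the first split
tree <- rpart(valence ~ taste_relatedness + frequency, data = words)
print(tree)  # shows the split points of the fitted tree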
Random forests are a further extension of CART, involving an ensemble of
different classification or regression trees. Tian and Zhang (this volume) discuss
random forests in more detail, but a brief introduction is given here nonethe-
less: With random forests, each tree is fit on a random subset of data as well as
a random subset of variables. This ensures that whatever CART algorithm is
used does not learn too much from the specific data at hand (what is called
‘overfitting’). This facilitates generalization to novel, unseen data, rather than
honing in on the idiosyncratic characteristics of a specific dataset at hand.
Among other things, random forests provide a simple measure that tells the an-
alyst which variables in a study are most influential. For an application of ran-
dom forests to sound symbolism research, see Winter and Perlman (2021).
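An illustrative sketch with the randomForest package (predictors and data frame are hypothetical placeholders):
library(randomForest)
# an ensemble of trees, each fit on a random subset of the data
# and a random subset of the variables
rf <- randomForest(response ~ predictor1 + predictor2 + predictor3,
                   data = d, importance = TRUE)
# which variables are most influential?
importance(rf)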
However, CART and random forests are also subject to the independence assumption mentioned in section §7.2 above. The results ascertained via
these approaches are just as much biased in the presence of multiple depen-
dent data points as other approaches. As almost all data in linguistics has
nested or crossed dependencies (Winter and Grice 2021), standard CART and
random forest models can very rarely be applied to linguistic data. In the case
of the random forest analysis in Winter and Perlman (2021), we had to exclude
etymologically related forms to ensure that the random forest results are not
biased due to etymological relatedness. There are, however, approaches to ran-
dom forests that allow the incorporation of dependencies (Hajjem, Bellavance
and Larocque 2014; Karpievitch et al. 2009; Stephan, Stegle and Beyer 2015).
Another important issue that is not often discussed in the linguistic literature is
that random forests should generally not be used without tuning the hyperpara-
meters (the settings used by the algorithms) to a specific dataset (Probst, Wright
and Boulesteix 2019).
Finally, it should be noted that the set of exploratory techniques covered here
is by no means exhaustive. Levshina (2015) in particular discusses many useful
exploratory techniques not covered here. Within corpus linguistics in particular,
exploratory techniques can also be seen as encompassing such techniques as key-
word analysis or topic modeling, and more generally, the class of approaches cov-
ered under the banner of “distributional semantics” (Günther, Rinaldi and Marelli
2019). The reader is encouraged to learn more about these techniques and how
they are implemented in R.
9 The landscape of statistics: Minefield
or playground?
This chapter introduced readers to the landscape of statistical analysis in lin-
guistics, focusing on confirmatory approaches within the domain of linear
models, as well as a select group of exploratory approaches. To some novel
analysts, statistics may appear like a minefield, with a bewildering array of
different approaches, each one of which has many different ways in which it
can be misapplied. Others may see statistics as a playground, with so many
different toys to play with. Both of these extreme views are dangerous. The
minefield perspective is stifling; the playground perspective invites an atti-
tude of ‘anything goes’ and ‘whatever yields the desired effects.’ But the sta-
tistical landscape is neither a minefield nor a playground. We can chart a
clear path through this landscape by being clear about our analysis goals.
The task of data analysis becomes substantially easier when we put our research questions first and are clear about whether we are in a primarily confirmatory or a primarily exploratory mode of data analysis. I have found
that oftentimes, lack of clarity about statistics is not genuinely rooted in lack
of knowledge about statistics, but in being unclear about one’s goals and
hypotheses.
The German physicist Werner Heisenberg once said that “What we ob-
serve is not nature in itself but nature exposed to our method of questioning.”
This statement is equally true of linguistics: What we observe is not language
itself, but language exposed to our method of questioning. This chapter dis-
cussed methods that expand the repertoire of questions we can ask when en-
gaging in data analysis. Knowing what is “out there” puts the analyst in the
best position to traverse the landscape of statistics. But ultimately, it is the
research questions that chart the path, not the statistical methods that hinge on
those questions.
References
Asendorpf, Jens B., Mark Conner, Filip De Fruyt, Jan De Houwer, Jaap J. A. Denissen, Klaus Fiedler, Susann Fiedler, David C. Funder, Reinhold Kliegl and Brian A. Nosek. 2013. Recommendations for increasing replicability in psychology. European Journal of Personality 27(2). 108–119. https://doi.org/10.1002/per.1919.
Auvray, Malika and Charles Spence. 2008. The multisensory perception of flavor. Consciousness and Cognition 17(3). 1016–1031. https://doi.org/10.1016/j.concog.2007.06.005.
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics using R. Cambridge, UK: Cambridge University Press.
Baayen, Harald, Douglas J. Davidson and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59(4). 390–412.
Baayen, Harald, Shravan Vasishth, Reinhold Kliegl and Douglas Bates. 2017. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language 94. 206–234.
Barr, Dale J. 2013. Random effects structure for testing interactions in linear mixed-effects models. Frontiers in Psychology 4. 328.
Barr, Dale J., Roger Levy, Christoph Scheepers and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68(3). 255–278.
Baumann, Stefan and Bodo Winter. 2018. What makes a word prominent? Predicting untrained German listeners’ perceptual judgments. Journal of Phonetics 70. 20–38. https://doi.org/10.1016/j.wocn.2018.05.004.
Berez-Kroeker, Andrea L., Lauren Gawne, Susan Smythe Kung, Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, et al. 2018. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56(1). 1–18. https://doi.org/10.1515/ling-2017-0032.
Bohannon, John. 2011. Social science for pennies. Science 334(6054). 307.
Boroditsky, Lera and Michael Ramscar. 2002. The roles of body and mind in abstract thought. Psychological Science 13(2). 185–189.
Botvinik-Nezer, Rotem, Felix Holzmeister, Colin F. Camerer, Anna Dreber, Juergen Huber, Magnus Johannesson, Michael Kirchler, Roni Iwanir, Jeanette A. Mumford and R. Alison Adcock. 2020. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582(7810). 84–88. https://doi.org/10.1038/s41586-020-2314-9.
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and Harald Baayen. 2007. Predicting the dative alternation. In Cognitive foundations of interpretation, 69–94. KNAW.
Brezina, Vaclav. 2018. Statistics in corpus linguistics: A practical guide. Cambridge, UK: Cambridge University Press.
Bruin, Angela de, Barbara Treccani and Sergio Della Sala. 2015. Cognitive advantage in bilingualism: An example of publication bias? Psychological Science 26(1). 99–107. https://doi.org/10.1177/0956797614557866.
Buyalskaya, Anastasia, Marcos Gallo and Colin F. Camerer. 2021. The golden age of social science. Proceedings of the National Academy of Sciences 118(5). e2002923118. https://doi.org/10.1073/pnas.2002923118 (26 September, 2021).
Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek and Thomas Pfeiffer. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2(9). 637–644. https://doi.org/10.1038/s41562-018-0399-z.
Cassidy, Scott A., Ralitza Dimova, Benjamin Giguère, Jeffrey R. Spence and David J. Stanley. 2019. Failing grade: 89% of introduction-to-psychology textbooks that define or explain statistical significance do so incorrectly. Advances in Methods and Practices in Psychological Science 2(3). 233–239. https://doi.org/10.1177/2515245919858072.
Chabris, Christopher F., Patrick R. Heck, Jaclyn Mandart, Daniel J. Benjamin and Daniel J. Simons. 2018. No evidence that experiencing physical warmth promotes interpersonal warmth. Social Psychology. https://doi.org/10.1027/1864-9335/a000361.
Ćwiek, Aleksandra, Susanne Fuchs, Christoph Draxler, Eva Liina Asu, Dan Dediu, Katri Hiovain, Shigeto Kawahara, et al. 2021. Novel vocalizations are understood across cultures. Scientific Reports 11(1). 10108. https://doi.org/10.1038/s41598-021-89445-4.
Dąbrowska, Ewa. 2016a. Looking into introspection. In Grzegorz Drożdż (ed.), Studies in Lexicogrammar: Theory and applications, 55–74. Amsterdam: John Benjamins.
Dąbrowska, Ewa. 2016b. Cognitive Linguistics’ seven deadly sins. Cognitive Linguistics 27(4). 479–491. https://doi.org/10.1515/cog-2016-0059.
De Araujo, Ivan E. T., Edmund T. Rolls, Morten L. Kringelbach, Francis McGlone and Nicola Phillips. 2003. Taste-olfactory convergence, and the representation of the pleasantness of flavour, in the human brain. European Journal of Neuroscience 18(7). 2059–2068. https://doi.org/10.1046/j.1460-9568.2003.02915.x.
Desagulier, Guillaume. 2017. Corpus linguistics and statistics with R: Introduction to quantitative methods in linguistics. Berlin: Springer.
Divjak, Dagmar and Nick Fieller. 2014. Finding structure in linguistic data. In Dylan Glynn and Justyna Robinson (eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy, 405–441. Amsterdam: John Benjamins.
Doyen, Stéphane, Olivier Klein, Cora-Lise Pichon and Axel Cleeremans. 2012. Behavioral priming: It’s all in the mind, but whose mind? PLoS ONE 7(1). e29081. https://doi.org/10.1371/journal.pone.0029081.
Drager, Katie and Jennifer Hay. 2012. Exploiting random intercepts: Two case studies in sociophonetics. Language Variation and Change 24(1). 59–78. https://doi.org/10.1017/S0954394512000014.
Evans, Vyvyan. 2012. Cognitive linguistics. Wiley Interdisciplinary Reviews: Cognitive Science 3(2). 129–141. https://doi.org/10.1002/wcs.1163.
Finkel, Eli J., Paul W. Eastwick and Harry T. Reis. 2017. Replicability and other features of a high-quality science: Toward a balanced and empirical approach. Journal of Personality and Social Psychology 113(2). 244. https://doi.org/10.1037/pspi0000075.
Gámez, Elena, José M. Díaz and Hipólito Marrero. 2011. The uncertain universality of the Macbeth effect with a Spanish sample. Spanish Journal of Psychology 14(1). 156–162.
Gelman, Andrew. 2004. Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics 13(4). 755–779. https://doi.org/10.1198/106186004X11435.
Gelman, Andrew and Cosma Rohilla Shalizi. 2013. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology 66(1). 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x.
Gentleman, Robert and Duncan Temple Lang. 2007. Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics 16(1). 1–23. https://doi.org/10.1198/106186007X178663.
Gibbs, Raymond W. 2007. Why cognitive linguists should care more about empirical methods. In Monica Gonzalez-Marquez, Irene Mittelberg, Seana Coulson and Michael Spivey (eds.), Methods in Cognitive Linguistics, 2–18. Amsterdam: John Benjamins.
Gibbs, Raymond W. 2013. Walking the walk while thinking about the talk: Embodied interpretation of metaphorical narratives. Journal of Psycholinguistic Research 42(4). 363–378. https://doi.org/10.1007/s10936-012-9222-6.
Gibson, Edward and Evelina Fedorenko. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Sciences 14(6). 233. https://doi.org/10.1016/j.tics.2010.03.005.
Gigerenzer, Gerd. 2004. Mindless statistics. The Journal of Socio-Economics 33(5). 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
Gries, Stefan. 2009. Quantitative corpus linguistics with R: A practical introduction. New York: Routledge.
Gries, Stefan. 2015a. The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1). 95–125. https://doi.org/10.3366/cor.2015.0068.
Gries, Stefan. 2015b. Some current quantitative problems in corpus linguistics and a sketch of some solutions. Language and Linguistics 16(1). 93–117. https://doi.org/10.1177/1606822X14556606.
Gries, Stefan. 2018. On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally. Journal of Second Language Studies 1(2). 276–308. https://doi.org/10.1075/jsls.00005.gri.
Gries, Stefan. 2019. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 16(3). 617–647. https://doi.org/10.1515/cllt-2018-0078.
Grieve, Jack. 2021. Observation, experimentation, and replication in linguistics. Linguistics 59(5). 1343–1356. https://doi.org/10.1515/ling-2021-0094.
Griffiths, Thomas L. 2015. Manifesto for a new (computational) cognitive revolution. Cognition 135. 21–23. https://doi.org/10.1016/j.cognition.2014.11.026.
Grolemund, Garrett and Hadley Wickham. 2014. A cognitive interpretation of data analysis. International Statistical Review 82(2). 184–204. https://doi.org/10.1111/insr.12028.
Günther, Fritz, Luca Rinaldi and Marco Marelli. 2019. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science 14(6). 1006–1033. https://doi.org/10.1177/1745691619861372.
Hajjem, Ahlem, François Bellavance and Denis Larocque. 2014. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation 84(6). 1313–1328. https://doi.org/10.1080/00949655.2012.741599.
Haller, Heiko and Stefan Krauss. 2002. Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research 7(1). 1–20.
Hassemer, Julius and Bodo Winter. 2016. Producing and perceiving gestures conveying height or shape. Gesture 15(3). 404–424.
Healy, Kieran. 2019. Data visualization: A practical introduction. Princeton, NJ: Princeton University Press.
Hullman, Jessica and Andrew Gelman. 2021. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review.
Hurlbert, Stuart H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54(2). 187–211. https://doi.org/10.2307/1942661.
Idemaru, Kaori, Bodo Winter, Lucien Brown and Grace Eunhae Oh. 2020. Loudness trumps pitch in politeness judgments: Evidence from Korean deferential speech. Language and Speech 63(1). 123–148. https://doi.org/10.1177/0023830918824344.
Jaeger, T. Florian. 2008. Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language 59(4). 434–446.
Jaeger, T. Florian, Peter Graff, William Croft and Daniel Pontillo. 2011. Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology 15(2). 281–319. https://doi.org/10.1515/lity.2011.021.
Jamalian, Azadeh and Barbara Tversky. 2012. Gestures alter thinking about time. In Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 34.
Karpievitch, Yuliya V., Elizabeth G. Hill, Anthony P. Leclerc, Alan R. Dabney and Jonas S. Almeida. 2009. An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of RF++. PLoS ONE 4(9). e7087. https://doi.org/10.1371/journal.pone.0007087.
Kline, Rex B. 2004. Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
  • 9. Contents Dennis Tay, Molly Xie Pan Data analytics in cognitive linguistics 1 Bodo Winter Mapping the landscape of exploratory and confirmatory data analysis in linguistics 13 Dennis Tay Time series analysis with python 49 Matteo Fuoli Structural equation modeling in R: A practical introduction for linguists 75 Mariana Montes, Kris Heylen Visualizing distributional semantics 103 Xiaoyu Tian, Weiwei Zhang, Dirk Speelman Lectal variation in Chinese analytic causative constructions: What trees can and cannot tell us 137 Molly Xie Pan Personification metaphors in Chinese video ads: Insights from data analytics 169 Han Qiu, Dennis Tay The interaction between metaphor use and psychological states: A mix-method analysis of trauma talk in the Chinese context 197 Jane Dilkes Prospecting for metaphors in a large text corpus: Combining unsupervised and supervised machine learning approaches 229 Jonathan Dunn Cognitive linguistics meets computational linguistics: Construction grammar, dialectology, and linguistic diversity 273
  • 10. Karlien Franco What Cognitive Linguistics can learn from dialectology (and vice versa) 309 Index 345 VI Contents
  • 11. Dennis Tay, Molly Xie Pan Data analytics in cognitive linguistics 1 Is data analytics just another name for statistical analysis? Data analytics is commonly defined as the “processing and analysis of data to extract information for enhancing knowledge and decision-making”, with minor differences among definitions. Although large amounts of data are collected non- stop around the clock, people still describe today’s world with the old phrase “data rich but information poor” (Peters and Waterman 1982). The process of turning data into useful information is like finding “a small set of precious nug- gets from a great deal of raw material” (Han et al. 2000: 5–6), and would indeed seem like a daunting task to the unacquainted. On the other hand, those who have received some training in general data analysis, including many linguists, might see data analytics as little more than an attempt to refashion applied sta- tistics and quantitative methods in a more marketable way. The gist of it still ap- pears to be making sense of data in numerical rather than verbal or qualitative forms, and popular techniques like clustering and regression still bear the same name as when they were taught in traditional statistics courses. There is some merit to this cynicism given that we live in a world where it seems to be impor- tant to put a new spin on old things all the time. However, we would be remiss to overlook some nuanced but important differences between the two. The first dif- ference is that while most data analytic techniques are indeed based on quantita- tive and statistical methods, there is a strong emphasis on the importance of substantive expertise (Conway 2010) or domain knowledge in order to maximize their potential for insight. This follows from the fact that just about any type of data from historical archives to complex multimodal artifacts can be viewed from the lenses of data analytic techniques as long as there are good theoretical or practical reasons for doing so. It also means that general statistical methods like classification and regression are continuously adopted to meet the specific needs Dennis Tay, Department of English and Communication, The Hong Kong Polytechnic University, e-mail: dennis.tay@polyu.edu.hk Molly Xie Pan, College of Foreign Languages and Literatures, Fudan University, e-mail: mollyxiaoxie@foxmail.com Acknowledgement: The editorial work involved in this volume was partly supported by the HKSAR Research Grants Council (Project number: 15601019). https://doi.org/10.1515/9783110687279-001
  • 12. of different domains like business (Chen et al. 2012) and healthcare (Raghupathi and Raghupathi 2014), with ever expanding functions, applications, and special- ized interpretations of models and results. The second difference pertains to a perceived difference in scope between the two. There is a tendency among many novice and experienced researchers alike to view statistical analysis as a set of standard ‘tests’ that are applied to measurements collected under some strictly controlled guidelines in order to determine whether some hypothesis is ‘correct’ or otherwise. This seems to be especially true in the field of applied linguistics where testing, assessment, and other forms of measurement are commonplace. The typical question which test should I use? is often answered by convenient heuristical tools like flow charts, abundantly available on the internet, that at- tempt to link stock scenarios like ‘comparing the means of two groups’ or ‘com- paring the means of three or more groups’ to the t-test, one-way ANOVA, and so on. While this approach of ‘choosing the correct test to use’ might be convenient and helpful for learners, it reinforces the narrow view that statistical analysis is all about trying to prove or disprove a hypothesis at a specific stage of the re- search process. This in turn makes it easy to see statistical analysis as an inde- pendent set of procedures that apply to all types of, and are hence divorced from, specific subject matter knowledge. Data analytics, on the other hand, is more in line with the broader notion of statistical thinking that has been gaining traction in modern statistics education. There are many different definitions of statistical thinking but they all focus on cultivating a “more global view” (Chance 2002) in learners right from the start. For researchers, this means learning to see data analysis as holistic and contextual, rather than linear and procedural. Some concrete steps to do so include exploring and visualizing data in new and crea- tive ways, understanding why a certain analytical procedure is used rather than what or how to use it, reflecting constantly on alternative approaches to think about the data and situation at hand, appreciating how subject matter and con- textual knowledge can potentially shape analytic decisions, and learning how to interpret conclusions in non-statistical terms. Compared to the traditional con- ception of statistical analysis described above, we can therefore describe data an- alytics as encompassing a more exploratory spirit, being ‘messier’ in a positive sense, and even as a means to inspire emergent research questions rather than a resolution of existing ones. At the same time, data analytics can also be de- scribed as being very context-specific and purpose driven, and thus potentially more engaging for focused learners than what the traditional ‘decontextualized’ view of statistics presents. Tools for the actual implementation of data analytic techniques on increasingly large volumes of data have also become more avail- able today. Powerful open-source programming languages like R and Python are continuously developed and freely available to personal users, which can be a 2 Dennis Tay, Molly Xie Pan
  • 13. great relief for many learners relying on expensive commercial statistical soft- ware packages like SPSS, Stata, MATLAB etc. Data analytics can be classified as four subtypes(Evans and Lindner 2012) in order of increasing complexity and value-addedness (Figure 1). Originally conceived for business contexts where complexity and value are measured in relatively concrete financial terms, these notions also have meaningful inter- pretations for researchers. We may understand the four subtypes as represent- ing progressive phases of inquiry into a certain dataset. Descriptive analytics is roughly synonymous with the classic notion of descriptive statistics. It involves summarizing and depicting data in intuitive and accessible ways, often to pre- pare for later phases of analysis. A simple example is to depict the central ten- dency and distribution of a dataset with box plots or histograms. With an increasing premium placed on visual aesthetics and user engagement, how- ever, data visualization has become a growing field in itself, with increasingly sophisticated and interactive forms of visualization driving the development of contemporary descriptive analytics. The next subtype or phase known as diagnostic analytics involves discovering relationships in the data using vari- ous statistical techniques. It is deemed more complex and valuable than de- scriptive analytics because the connections between different aspects of our data help us infer potential causes underlying observed effects, addressing the why behind the what. In an applied linguistics context, for example, the descriptive step of uncovering significant differences in the mean scores of student groups might motivate a broader correlational study of scores and demographics to diagnose potential sociocultural factors that explain this dif- ference. Following that, if we see diagnostic analytics as revealing why some- thing might have happened in the past, the next phase known as predictive analytics is aimed at telling us what might happen in the future. This involves predicting the values of future data points using present and historical data points, supported by core techniques like regression, classification, and time series analysis. It should be clear why predictive analytics represents a quan- tum leap in value for businesses that are inherently forward looking. To a lesser extent perhaps, the same applies for linguistics research that aims to predictively categorize new texts, speakers, and varieties, or forecast lan- guage assessment scores, based on existing data. As ever-increasing volumes of data become available, both diagnostic and predictive analytics are turning to- wards machine learning – the use of artificial intelligence to quickly identify pat- terns, build models, and make decisions with little or no human intervention. Applications in computational linguistics and natural language processing (NLP) best reflect these advances in the field of linguistics research. Lastly, prescriptive analytics fill the gap between knowing and doing by translating the above insights Data analytics in cognitive linguistics 3
  • 14. into concrete courses of action. It goes beyond knowing what is likely to happen based on predictive analytics, which may or may not be ideal, to suggest what needs to be done to optimize outcomes. Examples that require split-second deci- sions to everchanging information and conditions include the optimization of air- line prices, staffing in large organizations, and the modern self-driving car. While linguistics is not likely to involve this level of challenge, prescriptive analytics still interfaces with the ubiquitous notion of applied or appliable research, which ulti- mately boils down to the growing need to demonstrate how our findings positively inform personal and social action. 2 Data analytics in cognitive linguistics Cognitive linguistics has undergone remarkable development since its incep- tion. Luminaries of the field like George Lakoff, Leonard Talmy, Ronald Lan- gacker, Charles Fillmore, and Gilles Fauconnier established some of the most influential theories about the interfaces between linguistic meaning, structure, and cognition, basing much of their introspective analyses on made-up exam- ples. This mode of inquiry has however come to be criticized on both philosophi- cal and methodological grounds over the years. Philosophically, reliance on introspection reduces “cognition” to psychological reality and neglects its neuro- biological underpinnings (Lamb 1999). This is often reflected in the common use of the terminology “mind/brain” to conflate the two for analytic convenience. Figure 1: Four subtypes of data analytics. 4 Dennis Tay, Molly Xie Pan
  • 15. The use of introspective examples has also been susceptible to charges of argu- mentative circularity, like in the case of conceptual metaphor theory where in- vented utterances are taken as both evidence and product of conceptual mappings (Kertész and Rákosi 2009). More generally, there are obvious limitations to our ability as humans to accurately introspect upon many of our cognitive processes (Gibbs 2006). The concern motivating the present volume is more methodological in nature. The past decades have witnessed steady growth in the (combined) use of empirical methods like corpora, surveys, and experimentation in humanities re- search in general and cognitive linguistics in particular (Gibbs 2007, 2010; Kertész et al. 2012). A key argument for empirical over introspective methods in cognitive linguistics is their compatibility with the basic tenet that linguistic structure and meaning emerge from multiple usage contexts. These are inherently beyond the in- trospective ambit of individuals, and we therefore need transparent methods that can deal with measures and their variability on larger scales. This ‘empirical turn’ has at the same time dovetailed with the call for cognitive linguistics to demon- strate its applications in real world activities, including but not limited to the traditional areas of language acquisition and education. Many examples in the Applications of Cognitive Linguistics book series have powerfully illus- trated this point. The above conditions imply that cognitive linguistics – as a specialized knowledge domain aspiring to be truly “applied” – presents a fertile but underex- plored ground for data analytics. While not all empirical methods are quantitative in nature, the majority used by cognitive linguists including corpora, surveys, and experiments do involve different extents of quantification and statistical anal- ysis. There is certainly no lack of advocacy, pedagogy, and application of quanti- tative and statistical methods by cognitive linguists interested in different topics ranging from metaphor and metonymy to constructions and lexical semantics (Glynn and Robinson 2014; Gonzalez-Marquez et al. 2007; Janda 2013; Tay 2017; Winter 2019; Zhang 2016). A cursory review of articles in specialized journals like Cognitive Linguistics, Review of Cognitive Linguistics and Cognitive Linguistic Studies quickly shows that there is an increasing use of quantitative methods in theoreti- cal and empirical work alike. We are fortunate to already have many examples of textbooks, introductory overviews, step-by-step guides, as well as more advanced applications of a wide variety of statistical methods to analyze linguistic data. However, with respect to the distinct features of data analytics outlined above, two aspects remain critically underexplored in the cognitive linguistics literature. Firstly, existing work has made ample use of descriptive analytics to account for data, diagnostic analytics to investigate hypotheses, and predictive analytics to make inferences about patterns of language use. It nevertheless stops short at pre- scriptive analytics; i.e. suggesting concrete courses of action, which is crucial if Data analytics in cognitive linguistics 5
  • 16. cognitive linguistics wishes to be truly “applied”. This goes beyond the traditional ambit of language acquisition and pedagogy (Achard and Niemeier 2008; Little- more 2009) to other contexts like advertising (Littlemore et al. 2018), design (Hurti- enne et al. 2015), and aspects of healthcare where language plays a key role (Demjén et al. 2019; Tay 2013). As mentioned earlier, while linguistic research is not likely (yet) to require the most sophisticated prescriptive analytics, it is time to consider how the “practical implications” of our work could be more intimately informed by prior analytical steps and articulated as such. The second underexplored aspect is the aforementioned holistic guiding role of data analytics throughout the research trajectory – from data description to hypothesis setting, testing, and the eventual interpretation and application of findings. This contrasts in important ways with the widely held belief that statistical and quantitative methods only apply to “top-down” investigation of experimental hypotheses determined in advance. It is in fact the case that many “bottom-up” designs across different cognitive linguistic topics – ranging from corpus-driven studies to (conceptual) metaphors in discourse – can be critically informed by various data analytic techniques. A dedicated space is re- quired to demonstrate the different possibilities with reference to diverse areas in current cognitive linguistics research. From a pedagogical point of view, re- searchers new to data analytics could be made more aware that even a working familiarity with basic skills, including programming languages like R and Py- thon, can go a long way towards the formulation and refinement of different research objectives. 3 This volume as a first step This volume is a modest first step towards the aforementioned goals. It features ten contributions from established and up-and-coming researchers working on different aspects of cognitive linguistics. As far as practicable, the contributions vary in terms of their aims, featured languages and linguistic phenomena, the so- cial domains in which these phenomena are embedded, the types of data analytic techniques used, and the tools with which they are implemented. Some chapters are conceptual discussions on the relationships between cognitive linguistic re- search and data analytics, some take a more pedagogical approach to demonstrate the application of established as well as underexplored data analytic techniques, while others elaborate these applications in with full independent case studies. Examples from multiple languages and their varieties like English, Mandarin Chi- nese, Dutch, French, and German will be discussed. Phenomena and constructs to 6 Dennis Tay, Molly Xie Pan
be analyzed include verbal and visual metaphors, constructions, language variation, polysemy, psychological states, and prototypicality. The case studies address relevant issues in different social domains like business, advertising, politics, and mental healthcare, further underlining the applied dimensions of cognitive linguistics. The featured data analytic techniques span from descriptive analytics to the threshold of prescriptive analytics, as described above. These include innovative ways of data visualization, machine learning and computational techniques like topic modeling, vector space models, and regression, underexplored applications in (cognitive) linguistics like time series analysis and structural equation modeling, and initial forays into prescriptive analytics in the mentioned social domains. The contributions also showcase a diverse range of implementation tools from traditional statistical software packages like SPSS to programming languages like Javascript, R, and Python, with code and datasets made available either in print, via external links, or upon request by contributors.

The volume starts with Chapter 1 where Bodo Winter provides an excellent overview of the landscape of statistical analysis in cognitive as well as general linguistics research. Framing data analysis as a process of modeling the data with respect to domain and contextual knowledge rather than the ritualistic application of statistical tests, he communicates the central message of this volume and discusses how a modeling approach could address perceived issues of replicability and reproducibility in cognitive linguistics research. This overview is followed by two tutorial-style chapters aimed at introducing useful data analytic techniques that are likely to be less familiar to cognitive linguists. In Chapter 2, Dennis Tay discusses the underexplored relevance of time series data – consecutive observations of a random variable in orderly chronological sequence – in cognitive linguistics research. Key steps of the widely used Box-Jenkins method, which applies a family of mathematical models called ARIMA models to express values at the present time period in terms of past periods, are explained with reference to a guiding example of metaphors across psychotherapy sessions. Sample code in the Python programming language is provided to encourage readers to apply the method to their own datasets. Matteo Fuoli's Chapter 3 follows closely with an introduction to structural equation modelling. This is a technique for testing complex causal models among multiple related variables, and has much underexplored potential in cognitive linguistics work. Experimental data on the psychological effects of stance verbs (e.g. know, want, believe) in persuasive business discourse comprise the guiding example, this time using the R programming language for implementation. Learners might also find in these two chapters an opportunity to compare Python and R code for themselves.
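Chapter 2's own sample code is in Python and is not reproduced here. Purely to give a first feel for the idea of modeling present values in terms of past periods, the following is a minimal R sketch with an invented toy series of per-session metaphor counts; the model order is chosen arbitrarily for illustration, whereas in practice it would emerge from the identification and diagnostic steps of the Box-Jenkins method:

    # Invented toy series: metaphor counts across twelve sessions
    metaphors <- c(12, 15, 11, 14, 18, 16, 20, 19, 23, 21, 24, 26)

    # Fit an ARIMA(1,1,0) model: after first-order differencing, each
    # value is expressed in terms of the immediately preceding period
    fit <- arima(metaphors, order = c(1, 1, 0))
    fit$coef                   # estimated autoregressive coefficient
    predict(fit, n.ahead = 3)  # forecast the next three sessions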
The volume then transits into a series of four case studies that, as mentioned above, feature a diverse range of phenomena, settings, and data analytic techniques. In Chapter 4, Mariana Montes and Kris Heylen highlight the importance and increasing sophistication of data visualization techniques, and how they interface with processes of statistical data analysis. They argue that the process of visualization helps researchers recognize, interpret, and reason about otherwise abstract statistical patterns in more intuitive ways. The crucial role played by interactive visual analytics is then demonstrated by a corpus-based case study of Dutch, where distributional semantic models are used to analyze structural properties of word meaning like polysemy and prototypicality as they emerge from contextual usage patterns. Chapter 5 by Xiaoyu Tian, Weiwei Zhang, and Dirk Speelman is a case study of lectal variation in analytic causative constructions across three varieties of Chinese used in Mainland China, Taiwan, and Singapore. The authors are interested in how features of the cause and the effected predicate might influence the choice of near-synonymous causative markers shi, ling, and rang in these varieties. They demonstrate how a combination of data analytic techniques – conditional random forests and inference trees, complemented by logistic regression models – can enhance the explanatory power and insight offered by tree-based methods alone (a minimal sketch of such a tree appears after this paragraph). We remain in the Chinese-speaking context in Chapter 6 by Molly Xie Pan, turning our attention to the use of personification metaphors in Chinese video advertisements. Besides showing how these metaphors are constructed and distributed across product types, with prescriptive implications for advertisers, another important objective of this chapter is to underline how data analytic techniques like log-linear and multiple correspondence analysis can guide researchers to explore larger datasets with multiple categorical variables, as a starting point to inspire subsequent research questions. Chapter 7 moves from the social domain of advertising to mental health and politics. Working with interview and psychometric data from affected individuals in the recent social unrest in Hong Kong, where protestors occupied a university and disrupted its operation, Han Qiu and Dennis Tay apply multiple linear regression to analyze the interaction between metaphor usage profiles and measures of psychological trauma, interpreting the results and implications in a contextually specific way. As a step towards predictive analytics, they show how aspects of metaphor use (e.g. targets, sources, conventionality, emotional valence) at the level of individuals could reasonably predict performance in the Stanford Acute Stress Reaction Questionnaire, which in turn suggests concrete courses of action by healthcare professionals.
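As a rough illustration of the tree-based methods featured in Chapter 5, here is a minimal, self-contained R sketch using the partykit package. The data frame and variable names are invented stand-ins, not the chapter's actual data or predictors, and with random data the tree will typically show no splits; with real data, the splits indicate which features best separate the three markers:

    library(partykit)
    set.seed(5)

    # One row per hypothetical causative token: the chosen marker plus
    # two invented predictor variables
    d <- data.frame(
      marker    = factor(sample(c("shi", "ling", "rang"), 200, replace = TRUE)),
      animacy   = factor(sample(c("animate", "inanimate"), 200, replace = TRUE)),
      predicate = factor(sample(c("stative", "dynamic"), 200, replace = TRUE))
    )

    tree <- ctree(marker ~ animacy + predicate, data = d)
    plot(tree)  # visualize the (here, trivial) splitting structure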
The final three chapters showcase innovations in the application of data analytics, both in terms of refining methodology in established research areas as well as tilling new grounds for collaborative cross-disciplinary work. In Chapter 8, Jane Dilkes discusses metaphor identification, a foundational step in metaphor research that is well known for its (over)reliance on manual human judgement. She shows how a combination of supervised and unsupervised machine learning techniques in natural language processing can be harnessed to prospect for "community metaphor themes" in an extensive online English cancer-related forum. This in turn paves the way for investigating associations between metaphor use and other measures of language style. Chapters 9 and 10 by Jonathan Dunn and Karlien Franco respectively argue for closer collaboration between cognitive linguistics and the neighboring fields of computational linguistics and dialectology. In Chapter 9, the growing area of computational cognitive linguistics is showcased as a truly usage-based approach operating on a large enough scale to capture meaningful generalizations about actual usage. Computational methods are used to model language learning and variation from the respective theoretical perspectives of construction grammar and dialectology, featuring a vast dataset that covers seven languages (English, French, German, Spanish, Portuguese, Russian, Arabic) and their 79 distinct national dialects. In Chapter 10, potential synergies between cognitive linguistics and dialectology are further explored with three case studies that offer a cognitive linguistic take on sociolinguistic principles like transmission, diffusion, and communicative need. To operationalize this cross-fertilization of theoretical ideas, generalized additive models – extensions of generalized linear models that flexibly accommodate parametric and non-parametric relationships between (in)dependent variables – are combined with other correlational analyses to investigate dialectal lexical variation in Dutch.

In this way, the volume is divided into three natural and overlapping sections. Chapters 1 to 3 are more conceptually and pedagogically oriented, presenting a broad overview followed by tutorial-style contributions aimed at learners. Chapters 4 to 7 feature specific case studies with a range of data analytic techniques, phenomena, and social contexts, and Chapters 8 to 10 conclude the volume by offering glimpses of promising future directions for data analytics in cognitive linguistics. While the target audience is cognitive linguists, the techniques underpinning the theoretical issues and examples are readily applicable to other areas of social and linguistic research with appropriate reconceptualization of the design and variables. We believe that the volume will be most beneficial to researchers who have some foundational knowledge in statistics and data analytics, and want to further understand how a range of underexplored as well as established techniques could operate in actual research contexts.
References

Achard, Michel & Susanne Niemeier. 2008. Cognitive linguistics, second language acquisition, and foreign language teaching. Berlin: Walter de Gruyter.
Chance, Beth L. 2002. Components of statistical thinking and implications for instruction and assessment. Journal of Statistics Education 10(3).
Chen, Hsinchun, Roger H. L. Chiang & Veda C. Storey. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly. 1165–1188.
Conway, Drew. 2010. The Data Science Venn Diagram. blog.revolutionanalytics.com.
Demjén, Zsófia, Agnes Marszalek, Elena Semino & Filippo Varese. 2019. Metaphor framing and distress in lived-experience accounts of voice-hearing. Psychosis 11(1). 16–27.
Evans, James R. & Carl H. Lindner. 2012. Business analytics: The next frontier for decision sciences. Decision Line 43(2). 4–6.
Gibbs, Raymond W. 2006. Introspection and cognitive linguistics. Annual Review of Cognitive Linguistics 4. 131–151.
Gibbs, Raymond W. 2007. Why cognitive linguists should care more about empirical methods. In Monica Gonzalez-Marquez, Irene Mittelberg, Seana Coulson & J. Michael Spivey (eds.), Methods in cognitive linguistics, 2–18. Amsterdam: John Benjamins.
Gibbs, Raymond W. 2010. The wonderful, chaotic, creative, heroic, challenging world of researching and applying metaphor. In Graham Low, Zazie Todd, Alice Deignan & Lynne Cameron (eds.), Researching and applying metaphor in the real world, 1–18. Amsterdam: John Benjamins.
Glynn, Dylan & Justyna Robinson (eds.). 2014. Corpus methods for semantics: Quantitative studies in polysemy and synonymy. Amsterdam: John Benjamins.
Gonzalez-Marquez, Monica, Irene Mittelberg, Seana Coulson & J. Michael Spivey (eds.). 2007. Methods in cognitive linguistics. Amsterdam: John Benjamins.
Han, Jiawei, Micheline Kamber & Jian Pei. 2000. Data mining: Concepts and techniques, 3rd edn. San Francisco, CA: Morgan Kaufmann.
Hurtienne, Jörn, Kerstin Klöckner, Sarah Diefenbach, Claudia Nass & Andreas Maier. 2015. Designing with image schemas: Resolving the tension between innovation, inclusion and intuitive use. Interacting with Computers 27(3). 235–255.
Janda, Laura A. 2013. Cognitive linguistics: The quantitative turn. Berlin & New York: Walter de Gruyter.
Kertész, András & Csilla Rákosi. 2009. Cyclic vs. circular argumentation in the Conceptual Metaphor Theory. Cognitive Linguistics 20(4). 703–732.
Kertész, András, Csilla Rákosi & Péter Csatár. 2012. Data, problems, heuristics and results in cognitive metaphor research. Language Sciences 34(6). 715–727.
Lamb, Sydney M. 1999. Pathways of the brain: The neurocognitive basis of language. Amsterdam: John Benjamins.
Littlemore, Jeannette. 2009. Applying cognitive linguistics to second language learning and teaching. Basingstoke/New York: Palgrave Macmillan.
Littlemore, Jeannette, Paula Pérez Sobrino, David Houghton, Jinfang Shi & Bodo Winter. 2018. What makes a good metaphor? A cross-cultural study of computer-generated metaphor appreciation. Metaphor and Symbol 33(2). 101–122.
Peters, Thomas J. & Robert H. Waterman. 1982. In search of excellence: Lessons from America's best-run companies. New York: Harper Collins Business.
Raghupathi, Wullianallur & Viju Raghupathi. 2014. Big data analytics in healthcare: Promise and potential. Health Information Science and Systems 2(1). 1–10.
Tay, Dennis. 2013. Metaphor in psychotherapy: A descriptive and prescriptive analysis. Amsterdam: John Benjamins.
Tay, Dennis. 2017. Time series analysis of discourse: A case study of metaphor in psychotherapy sessions. Discourse Studies 19(6). 694–710.
Winter, Bodo. 2019. Statistics for linguists: An introduction using R. New York: Routledge.
Zhang, Weiwei. 2016. Variation in metonymy: Cross-linguistic, historical and lectal perspectives. Berlin: Walter de Gruyter.
Bodo Winter

Mapping the landscape of exploratory and confirmatory data analysis in linguistics

1 The data gold rush

Linguistics has undergone, and is still undergoing, a quantitative revolution (Kortmann 2021; Levshina 2015; Sampson 2005; Winter 2019a). Over the last few decades in particular, methodological change has arguably taken up speed. For example, many researchers have criticized the over-reliance on introspective data in generative linguistics (Gibson and Fedorenko 2010; Pullum 2007; Schütze 1996) and cognitive linguistics (Dąbrowska 2016a; Dąbrowska 2016b; Gibbs 2007). This critique of introspection was one of the driving forces spurring an increased adoption of quantitative methods. Other factors that have spurred the quantitative revolution in our field include the ever-increasing ease with which data can be extracted from corpora, or crowdsourced via platforms such as Amazon Mechanical Turk and Prolific (Bohannon 2011; Paolacci, Chandler and Ipeirotis 2010; Peer, Vosgerau and Acquisti 2014; Sprouse 2011). In addition, it is becoming increasingly easy to access freely available web data, such as the results of large-scale word rating studies (Winter 2021).

In the cognitive sciences, Griffiths (2015) speaks of the 'big data' computational revolution. Buyalskaya and colleagues (2021) speak of 'the golden age of social science.' This new era, in which we are inundated by a large amount of freely available or easily obtainable datasets, means that data analytics is increasingly becoming an essential part of linguistic training. However, even though some linguistics departments offer excellent statistical education, many others still struggle with incorporating this into their curricula. Many linguistics students (and sometimes their supervisors!) feel overwhelmed by the sheer number of different approaches available to them, as well as the many different choices they have to make for any one approach.

Bodo Winter, Dept. of English Language and Linguistics, University of Birmingham, e-mail: B.Winter@bham.ac.uk
Acknowledgement: Bodo Winter was supported by the UKRI Future Leaders Fellowship MR/T040505/1.
https://doi.org/10.1515/9783110687279-002
To readers who are new to the field, the landscape of statistical methodology may look very cluttered. To begin one's journey through this landscape, there is no way around reading at least one book-length introductory text on statistical methods, of which there are by now many for linguists (e.g., Baayen, 2008; Larson-Hall, 2015), cognitive linguists (e.g., Levshina, 2015; Winter, 2019b), and corpus linguists (Desagulier 2017; Gries 2009). We simply cannot expect to learn all relevant aspects of data analysis from a short paper, online tutorial, or workshop. Statistical education needs more attention than that, and reading book-length statistical introductions should be a mandatory part of contemporary linguistic training.

The available books are often focused on teaching the details of particular statistical procedures and their implementation in the R statistical programming language. These books generally cover a lot of ground – many different approaches are introduced – but they are often less focused on giving a big picture overview. This chapter complements these introductions by taking a different approach: without going into the details of any one particular method, I will try to map out a path through the landscape of statistics. My goal is not to give the reader a set of instructions that they can blindly follow. Instead, I will focus on giving a bird's eye overview of the landscape of statistics, hoping to reduce the clutter.

This chapter is written decidedly with the intention of being accessible to novice analysts. However, the chapter should also be useful for more experienced researchers, as well as supervisors and statistics educators who are in need of high-level introductions. Even expert analysts may find the way I frame data analytics useful for their own thinking and practice. Moreover, I want to chart the map of statistics in light of the modern debate surrounding the replication crisis and reproducible research methods (§2), using this chapter as an opportunity to further positive change in our field.

Our journey through the landscape of statistics starts with a characterization of data analysis as a cognitive activity, a process of sensemaking (§3). There are two main sub-activities via which we can make sense of data, corresponding to exploratory and confirmatory data analysis. Some have (rightfully) criticized the distinction between exploration and confirmation in statistics (Gelman 2004; Hullman and Gelman 2021), as it often breaks down in practice. Regardless of these critiques, the exploration-confirmation divide will serve as useful goal posts for framing this introduction, as a means for us to split the landscape of statistics into two halves, each with their own set of approaches that are particularly suited for either confirmation or exploration. And it is fitting for this volume, which includes chapters that are relatively more focused on confirmatory statistics (e.g., Tay; Fuoli, this volume), as well as chapters that are relatively more focused on exploratory statistics (e.g., Dilkes; Tian and Zhang; Pan, this volume).
Within the confirmatory part of the statistical landscape, a critique of the significance testing framework (§4) motivates a discussion of linear models (§5–6) and their extensions (§7), including logistic regression, Poisson regression, mixed models, and structural equation models, among others. My goal in these sections is to focus on the data analysis process from the perspective of the following guiding question: How can we express our theories in the form of statistical models? Following this, I will briefly sketch a path through the landscape of exploratory statistics by looking at Principal Components Analysis, Exploratory Factor Analysis, and cluster analysis to showcase how exploration differs from confirmation (§8.1). Section §8.2 briefly mentions other techniques that could be seen as exploratory, such as classification and regression trees (CART), random forests, and NLP-based techniques such as topic modeling.

2 The replication crisis and reproducible research

We start our journey by considering how statistical considerations are intrinsically connected to the open science movement and the 'replication crisis' that has been unfolding over the last decade. No introduction to statistics is complete without considering the important topic of open and reproducible research. Any statistical analysis is pointless if it is not reproducible, and we cannot, and should not, trust results that do not meet modern standards of open science. Given how essential reproducibility and transparency are for the success of linguistics as a science, not including discussions of open science and reproducibility in the statistics curriculum is doing our field a disservice.

Large-scale efforts in psychology have shown that the replicability of study results is much lower than people hoped for, with one study obtaining only 36 successful replications out of 100 studies from three major psychological journals (Open Science Collaboration, 2015; see also Camerer et al., 2018). Linguistics is not safe from this "replication crisis," as evidenced by the fact that some high-profile findings relating to language have failed to replicate, such as the idea that bilingualism translates into advantages in cognitive processing (e.g., de Bruin et al., 2015; Paap and Greenberg, 2013).

Cognitive linguists should be especially wary of the replication crisis, as a number of the results that have failed to replicate relate to one of the core tenets of cognitive linguistics, the idea that the mind is embodied (see Evans, 2012; Gibbs, 2013; Lakoff and Johnson, 1999). Embodied cognition results that have failed to replicate include, among others, the finding that
reading age-related words makes people walk more slowly (Doyen et al. 2012), that experiencing warm physical temperatures promotes social warmth (Chabris et al. 2018), that reading immoral stories makes people more likely to clean their hands (Gámez, Díaz and Marrero 2011), and that reading action-related sentences facilitates congruent movements (Papesh 2015). In fact, embodied cognition research may be one of the most non-replicable areas of cognitive psychology (see discussion in Lakens, 2014). Thus, linguists, and especially cognitive linguists, need to take the replication crisis very seriously. A recent special issue in the journal Linguistics (de Gruyter) includes several papers focused on discussing the relevance of the replication crisis for linguistics (Grieve 2021; Roettger 2021; Sönning and Werner 2021; Winter and Grice 2021).

The reasons for failures to replicate are manifold and cannot be pinned down to just one cause. This also means that a variegated set of solutions is required (e.g., Asendorpf et al., 2013; Finkel et al., 2017), including replicating existing studies, performing meta-analyses of existing studies, preregistering one's planned methodology ahead of time, increasing the sample size of studies where possible, placing more emphasis on effect sizes in one's analysis, being more rigorous about the application of statistical methodology, as well as making all materials, data, and analysis code publicly available. The latter factor – open data and open code – is particularly relevant for us here. In linguistics, including cognitive linguistics, it is still not required for publications to make everything that can be shared available, although this situation is changing rapidly (see, e.g., Berez-Kroeker et al., 2018; Roettger et al., 2019). Two of the flagship cognitive linguistics journals (Cognitive Linguistics, de Gruyter; Language and Cognition, Cambridge University Press) now require data to be shared on publicly available repositories.

In Winter (2019b), I explicitly discussed the issue of replicability in the context of cognitive linguistic research, focusing on "reproducibility" rather than replication. Reproducibility is defined as the ability of another analyst to take the existing data of a study and reproduce each and every published value (see e.g., Gentleman and Temple Lang, 2007; Munafò et al., 2017; Peng, 2011; Weissgerber et al., 2016). In many ways, reproducibility is an even more basic requirement than replicability. Replication involves the repetition of a study with a new dataset; reproducibility requires that even for the very same data, another person should be able to trace each and every step, ultimately being able to re-create all figures and statistical results on their own machine.

Lakoff (1990) proposed that the subfield of cognitive linguistics can be characterized by three "commitments": the cognitive reality commitment, the convergent evidence commitment, and the generalization and comprehensiveness commitment. The details of each of these commitments are irrelevant for
our purposes, but taken together, they ground cognitive linguistics in the empirical sciences (see also Gibbs, 2007), including the incorporation of research from the wider cognitive sciences. However, if the cognitive science results that cognitive linguists use to ground their theories in empirical research turn out to be non-reproducible, all commitment to empirical work is vacuous. Therefore, in analogy to Lakoff's foundational commitments, I have argued that cognitive linguists should add the "reproducibility commitment" to their canon of commitments, repeated here as follows:

The Reproducibility Commitment: "An adequate theory of linguistics needs to be supported by evidence that can be reproduced by other linguists who did not conduct the original study." (Winter 2019b: 126)

When focused on data analysis, this commitment, at a bare minimum, compels us to make all data and code available.1

From this reproducibility commitment, we can get an easy question out of the way that some beginning data analysts may have: Which statistical software package should be used? On what software should a novice analyst focus their efforts? The Reproducibility Commitment rules out any statistical software that is proprietary, i.e., that costs money and is not open source (SPSS, SAS, STATA, Matlab, Mplus etc.). Instead, efforts have to be directed to freely available open-source software (such as R and Python). Reproducibility commits us to use software that can be accessed and understood by everyone in the community without the need to acquire expensive licenses. Clearly, software does not make one a statistician, and many software packages other than R and Python are very powerful, but if we want to follow open science principles, we should not be using software that restricts access to certain members of the linguistic community. Especially the R programming environment (R Core Team 2019) is by now the de facto standard in linguistics, one could even say the 'lingua franca' of our field (Mizumoto and Plonsky 2016).

A common objection against R (and other programming languages such as Python) is the belief that they may be harder to learn than software with graphical user interfaces such as SPSS. However, there simply is no empirical evidence to support this claim, and the few studies that have actually looked at students' reactions to different software packages suggest that claims about R being substantially harder may be overstated (Rode and Ringel 2019). But even if R were harder than software such as SPSS, teaching the latter is, simply put, unethical given how incompatible the use of proprietary software is with the core principles of open and reproducible research.

1 For a response to common objections to data and code sharing, see Winter (2019b, Ch. 2).
With the fundamental topics of reproducibility and the question as to what software we should use out of the way, we can now begin charting a map of the landscape of statistics.

3 Data analysis as a cognitive process: Confirmatory and exploratory sensemaking

Data analysis is fruitfully seen as a cognitive process, one that involves making sense of data. As stated by Grolemund and Wickham (2014: 189): "Data analysis is a sensemaking task. It has the same goals as sensemaking: to create reliable ideas of reality from observed data. It is performed by the same agents: human beings equipped with the cognitive mechanisms of the human mind. It uses the same methods."

Any sensemaking process is an interaction between the external world and the sensemaker's preexisting beliefs. In the same way, data analysis is shaped not only by what's in the data, but also by the state of the cognizer. Psychologists distinguish between bottom-up perception (the input, that which comes directly from the world around us) and top-down perception (influence from our preexisting beliefs). Visual perception is both bottom-up and top-down, and so is data analysis. However, in contrast to visual perception, which is generally automatic, the researcher performing a data analysis has a choice to make about how much they want to be bottom-up or top-down. A data analyst should think about whether they are primarily looking into the data to discover new patterns – with relatively fewer existing beliefs intervening – or whether they are looking to the data to either confirm or disconfirm their existing beliefs. If we are in a maximally confirmatory mode, all hypotheses are specified a priori; in exploratory statistics, much fewer hypotheses are specified a priori, and the data itself is allowed to suggest new patterns, including some that the researcher may not have thought of. Very informally, we can think of exploratory statistics as answering the question: What does my data have to offer? In turn, confirmatory statistics can be thought of as answering the question: Is my theory consistent with the data?

Ultimately, there is a continuum between confirmation and exploration because every method will always take something from the data, and every method will always come with some set of assumptions. The distinction between confirmatory and exploratory statistics is therefore one that comes in degrees, depending on how much a given statistical methodology requires specifying structures in advance.
And of course, the distinction between confirmatory and exploratory statistics pertains to the difference in the purpose of an analysis. Generally speaking, the same method can be used for both confirmation and exploration, depending on the analyst's goals. That said, within the field of linguistics, some methods are more aligned with exploratory versus confirmatory purposes. Because of this, the following sections will proceed from confirmatory statistics (§4–7) to exploratory statistics (§8) to frame this introduction. Moreover, even though the exploration-confirmation distinction may break down in practice, it is important not to frame the results of exploratory analysis in terms of confirmatory analysis, as if they had been predicted in advance (Roettger, Winter and Baayen 2019).

The majority of the remainder of this chapter is devoted to confirmatory statistics as opposed to exploratory statistics not because the latter is less important, but because confirmatory statistics has recently undergone massive changes in our field, away from significance tests towards statistical models and parameter estimation. Because this approach is still new in some subfields and new textbooks do not necessarily teach the full scope of this framework, the emphasis will be on confirmatory statistical models.

4 Why cognitive linguistics needs to move away from significance tests

We start the journey of confirmatory statistics with what is still the status quo in many subfields of linguistics. To this day, the notion of 'statistics' is synonymous with 'null hypothesis significance testing' (NHST) to many researchers. Undergraduate statistics courses still emphasize the use of such NHST procedures as t-tests, ANOVAs, Chi-Square tests etc. This includes many existing introductions in cognitive and corpus linguistics (Brezina 2018; Gries 2009; Levshina 2015; Núñez 2007; Wallis 2021). All significance tests have a primary goal, which is to yield a p-value. Informally, this statistic measures the incompatibility of a given dataset with the null hypothesis. If a p-value reaches a certain threshold, the ritual of NHST dictates that the null hypothesis is rejected, and the researcher claims to have obtained a "significant" result. The true meaning of the p-value, however, is so counter-intuitive that even most statistics textbooks (Cassidy et al. 2019) and statistics teachers fail to discuss it accurately (Gigerenzer 2004; Haller and Krauss 2002; Lecoutre, Poitevineau and Lecoutre 2003; Vidgen and Yasseri 2016). By itself, the p-value alone does not tell us very much (Spence and Stanley 2018), and is only a very weak indicator of whether a study will replicate, or is strong, or "reliable."
The over-reliance on significance tests in the behavioral and cognitive sciences, including linguistics, has been widely criticized in the statistical and psychological literature for now nearly a century (Kline 2004). In fact, many now believe that the 'statistical rituals' (Gigerenzer 2004) encouraged by the use of significance tests may be one of the key factors that have contributed to the replication crisis in the first place. But data analysis is so much more than subjecting the data to a prefab hypothesis testing procedure, which is why the field of linguistics has undergone a dedicated shift away from these methods towards the more theory-guided process of statistical modeling and parameter estimation (Baayen, Davidson and Bates 2008; Jaeger 2008; Jaeger et al. 2011; Tagliamonte and Baayen 2012; Wieling et al. 2014). This change from statistical testing to statistical modeling is also happening in cognitive linguistics (e.g., Gries, 2015a; Levshina, 2016, 2018; Winter, 2019b). To be clear: significance testing can also be done with statistical models, but a key difference is that the emphasis shifts from subjecting the data to an off-the-shelf procedure such as a t-test or a Chi-square test, towards considering the estimation of parameters in the form of multifactorial statistical models (Gries 2015b; Gries 2018). The latter approach is less limiting and allows for a more theory-guided approach to statistical analysis.

Most importantly for our purposes, significance tests make for a very bad way of decluttering the landscape of statistics. Each significance test is a highly specific tool that can be applied only in extremely limited circumstances. Recommendations about statistical methodology then often take the form of decision trees and statements like "if you have this hypothesis and this data, use test X, otherwise use test Y". However, rather than worrying about picking the right test, we should channel our energy into theory-driven reasoning about data. In contrast to significance testing as a conceptual framework, statistical modeling encourages thinking about how our preexisting beliefs (= linguistic domain knowledge / theories / hypotheses / assumptions) relate to a dataset at hand in a more principled fashion. This is ultimately much more intellectually engaging than picking a test from a decision tree of different pre-classified options, and it encourages thinking more deeply about how one's theory relates to the data at hand.

Luckily, the world of statistical modeling also turns out to be much easier to navigate than the world of significance testing. In fact, there is just one tool that will cover most of the use cases that commonly arise in linguistics. This tool is the linear model, an approach that can be flexibly extended to deal with all sorts of different theoretical proposals and data structures. The linear model framework makes it possible to represent the same data structures that significance tests are used for, which renders it unnecessary to teach significance tests in this day and age.
5 Using linear models to express beliefs in the form of statistical models

The statistical models we will discuss here are all versions of what is called the 'linear model,' or sometimes 'general linear model,' also discussed under the banner of '(multiple) regression analysis.' It is potentially confusing to novices that the different models bear different names that may sound like they are entirely different approaches. For example, consider this long list of terms:

generalized linear models, linear mixed effects models, multilevel models, generalized additive models, structural equation models, path analysis, mediation analysis, moderation analysis, Poisson regression, logistic regression, hierarchical regression, growth curve analysis, . . .

This array of terms may seem daunting at first, but we should take comfort in the fact that all of it is built on the same foundations. In fact, they are all versions or extensions of a particular approach that, in its essence, is easy to grasp. This is one of the key conceptual advantages of the linear model framework: it leads to a very unified and coherent picture of the landscape of statistics, regardless of the apparent diversity of terms suggested by the list above.

Any of the linear models or regression models we will consider here have the same structure: one singular response or 'outcome' variable is described as varying by a set of predictor variables. An alternative terminology is to say that the dependent variable is modeled as a function of one or more independent variables. Thus, this form of statistical model is focused on describing how one singular quantity of interest (such as ratings, response times, accuracies, performance scores, word frequencies etc.) is influenced by one or more predictors. The analyst then focuses on thinking about which predictors they think would influence the response, thereby implementing their assumptions about the relations in the data into a statistical model.

To give a concrete example of a linear model, consider the observation that in English, many words related to taste are positive (such as sweet, delicious, juicy, tasty, peachy), whereas on average, words related to smell are relatively more negative (such as rancid, pungent, stinky, odorous) (Winter 2016; Winter 2019b). To test this generalization, we can use existing perceptual ratings for words (Lynott and Connell 2009) in combination with 'emotional valence' ratings, which describe the degree to which a word is good or bad (Warriner, Kuperman and Brysbaert 2013). Figure 1.1a visualizes the correlation between these two quantities (taste-relatedness and emotional valence). As can be seen, words that are relatively more strongly related to taste are on average more positive than words that are less strongly related to taste, although this is obviously a very weak relationship given the large scatter seen in Figure 1.1a.
The superimposed line shows the corresponding linear or 'regression' model. This model describes the average emotional valence (whether a word is good or bad, on the y-axis) as a function of how much a word relates to taste (on the x-axis). The beauty of the linear model framework is that the underlying mathematics of lines can easily be extended to incorporate predictors that are categorical, such as seen in Figure 1.1b. This works by pretending that the corresponding categories (in this case, the binary distinction between 'taste words' and 'smell words') are positioned on a coordinate system (Winter, 2019b, Ch. 7).

The two models corresponding to Figure 1.1a and Figure 1.1b can be expressed in R as follows, with the first line (dark grey) representing the user input command, and the following lines (light grey) showing the output. The reader interested in the details of the implementation within R should consult Winter (2019a), a book-length introduction to the basics of linear models. Here, only a brief overview of the overall logic of the approach is presented.

    lm(valence ~ taste_relatedness)

    Coefficients:
    (Intercept)        taste
         3.2476       0.4514

    lm(valence ~ taste_vs_smell)

    Coefficients:
    (Intercept)  taste_vs_smellSmell
         5.3538              -0.9902

The R function lm() is named this way because it fits linear models. The tilde in the formula specifies that the term on the left-hand side (in this case, emotional valence) is 'described by', 'predicted by', 'conditioned on', or 'modeled as a function of' the term on the right-hand side. All input to linear models takes this general form, with the response on the left, and the predictor(s) on the right: 'response ~ predictors'. Each of the terms in the model, such as in this case valence, taste_relatedness, and taste_vs_smell, corresponds to a column in the spreadsheet that is loaded into R.

To understand the above output and interpret the model, we need to remind ourselves that a line can be described by two numbers: an intercept, and a slope. The intercept is the point where the line crosses the y-axis (which is conventionally positioned at x = 0).
Informally, we can think of the intercept as corresponding to the 'height' of the regression line (higher intercepts mean that the line is overall shifted upwards). The slope describes the degree to which y depends on x, with Figure 1.1a giving an example of a positive slope and Figure 1.1b giving an example of a negative slope. Taken together, the intercept and slope are called 'coefficients.'2

In any actual data analysis, a considerable amount of time should be spent on interpreting the coefficients to understand what exactly it is that a model predicts. For the continuous model above, the output value of the intercept, 3.2476, corresponds to the white square in Figure 1.1a, where the line crosses the y-axis. This is the predicted value for a word with zero taste rating. The intercept is often not particularly interesting, but it is necessary to 'fix' the line along the y-axis. Oftentimes we are more interested in the slopes, as each slope expresses the relationship between the response and a given predictor. In the output above, the slope value of 0.4514 has the following interpretation: increasing the taste-relatedness by one rating unit leads to a positive increment in emotional valence ratings by this value. Our model of this data, then, corresponds to the equation of a line: y = 3.25 + 0.45 * taste (rounded). We can use this equation to make predictions: for example, we can plug in a taste rating of '3' to assess what emotional valence rating this model predicts for this specific taste rating: 3.25 + 0.45 * 3 = 4.6 (rounded).

[Figure 1.1 appears here: panel (a) shows a scatterplot of emotional valence against taste-relatedness with the fitted regression line; panel (b) shows valence for taste words (x = 0) versus smell words (x = 1).]

Figure 1.1: a) Emotional valence as a function of how much a word relates to taste; b) emotional valence as a function of the categorical taste versus smell difference.

2 Some textbooks use the term 'coefficient' only for the slope.
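Readers who want to reproduce this kind of arithmetic can do so directly in R with the predict() function. A minimal, self-contained sketch with invented toy ratings (the real rating data are not reproduced here, so the numbers will differ from those above):

    # Invented taste-relatedness and valence ratings for six toy words
    d <- data.frame(taste_relatedness = c(0.5, 1.2, 2.0, 2.8, 3.5, 4.1),
                    valence           = c(3.6, 3.5, 4.2, 4.4, 4.8, 5.2))

    mdl <- lm(valence ~ taste_relatedness, data = d)
    coef(mdl)   # the two numbers that describe the line: intercept and slope

    # Predicted valence for a word with a taste rating of 3,
    # i.e., intercept + slope * 3
    predict(mdl, newdata = data.frame(taste_relatedness = 3))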
This basic picture does not look markedly different for the case of a categorical predictor, except that the corresponding slope has to be interpreted as a categorical difference between two groups. In the output above, the number −0.9902 represents the difference in emotional valence between taste and smell words. Thinking in terms of lines, this difference can be conceptualized as moving from the taste words at x = 0 (the intercept) down to the smell words at x = 1. It is this mathematical trick – positioning categories within a coordinate system – that allows linear models to easily incorporate continuous predictors (Figure 1.1a) as well as categorical predictors (Figure 1.1b).

The basic idea of expressing the relation between y and x in terms of coefficients can be extended to the case of 'multiple regression,' which involves adding more predictors to the model, each one associated with its own coefficient that describes how that particular predictor is related to the response variable. This is exemplified in the following schematic R function call:

    lm(response ~ predictor1 + predictor2 + predictor3)

    Coefficients:
    (Intercept)  predictor1  predictor2  predictor3
              ?           ?           ?           ?

The function call (dark grey) can be thought of as a more technical way of expressing our hypotheses in the form of an equation. This particular formula notation can be paraphrased as "I want to assess whether the response is influenced jointly by predictor1, predictor2, and predictor3." The linear model will then estimate the corresponding coefficients – one slope for each predictor. Each slope expresses the relationship between the response and that specific predictor while holding all the other predictors constant. For example, the slope of predictor1 corresponds to how much predictor1 is statistically associated with the response while controlling for the influence of the other predictors. If the slope is positive, increases in predictor1 result in an increase of the response. If the slope is negative, increases in predictor1 result in a decrease of the response.

We can think of the coefficients as placeholders specified by the user, which the linear model in turn will 'fill' with estimates based on the data. In the schematic function call above, this placeholder nature is represented by the question marks. This highlights how fitting a linear model essentially corresponds to a set of questions (what is the slope of each of these terms?) that the model will try to answer. Given a model specification and given a particular dataset, the model fills the question marks with the best-fitting coefficient estimates, those values that ensure that the predictions are closest to all data points.
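One way to see this 'filling in' at work is to simulate data with known coefficients and check that lm() recovers them. A minimal sketch with made-up values (not from any study discussed here):

    set.seed(42)
    n <- 500
    predictor1 <- rnorm(n)
    predictor2 <- rnorm(n)
    predictor3 <- rnorm(n)

    # True model: intercept 2, slopes 0.5, -1, and 0, plus random noise
    response <- 2 + 0.5 * predictor1 - 1 * predictor2 + 0 * predictor3 + rnorm(n)

    coef(lm(response ~ predictor1 + predictor2 + predictor3))
    # The estimates should land close to 2, 0.5, -1, and 0

The third slope hovering near zero also illustrates what an uninformative predictor looks like in the output.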
However, the linear model only performs optimally with respect to the set of instructions that the data analyst has specified. If the user has missed important predictors, the model cannot know this. It can only answer the questions it has been asked to answer, which is why researchers should spend a lot of time thinking about the linear model equation – ideally prior to collecting the data.

If one uses linear models in a confirmatory fashion, the inclusion of each predictor into the model should be theoretically motivated. It is therefore good practice to specify a linear model in advance – before loading the data into any statistical software. Or, even more in line with a fully confirmatory approach, a researcher can pre-register one's analysis in a publicly accessible repository prior to collecting the data (Roettger 2021). The two cognitive linguistics-oriented journals Language and Cognition and Cognitive Linguistics have a special article category called 'Registered Reports' that requires pre-specifying an analysis plan prior to collecting the data. In my experience teaching statistics, novice analysts generally make their life harder by jumping into a statistical software package too quickly. Everything becomes much easier if considerable time is spent on determining which predictors should or should not be included in advance, based on theory, literature, and domain knowledge.

There are procedures for changing the model in response to the data (e.g., "model selection" techniques such as LASSO) that will not be discussed here. Moreover, in Bayesian statistics, several researchers recommend expanding models based on how well they can generate novel data in line with existing data (Gelman and Shalizi 2013; Kruschke 2013). However, despite the existence of such approaches, it is still useful and recommended to think as much as possible about a model in advance of performing an analysis or collecting the data. The next section discusses how linear models can be expanded to assess more complex theoretical ideas involving interactions.

6 Linear models with interactions

Linear models can be expanded to include interaction terms. These are best explained by example. Here, I will draw from Winter and Duffy (2020), an experimental study on gesture and time metaphors in which interactions were of key theoretical interest. This experiment follows up on the famous "Next Wednesday" question that has been used extensively to probe people's metaphorical conceptualization of time (Boroditsky and Ramscar 2002; McGlone and Harding 1998). When asked the following question . . .
"Next Wednesday's meeting has been moved forward two days – what day is the meeting on now?"

. . . about half of all English speakers respond 'Friday,' and half respond 'Monday' (Stickles and Lewis 2018). This is because there are two ways of conceptualizing time in English, one with an agent moving forward through time (reflected in such expressions as We are approaching Christmas), another one with the agent being stationary and time moving towards the agent (reflected in such expressions as Christmas is coming). Jamalian and Tversky (2012) and subsequently Lewis and Stickles (2017) showed that certain gestures can change whether people respond Monday or Friday. If the question asker moves the gesturing hands forwards (away from their torso), an ego-moving perspective is primed, which implies a shift from Wednesday towards Friday. If the question asker moves the gesturing hand backwards (from an extended position towards their torso), a time-moving perspective is primed, thus implying a shift from Wednesday to Monday.

In our follow-up study to these experiments, we wanted to know to what extent gesture interacts with the concomitant language. That is, how much do language and gesture co-depend on each other in determining time concepts? For example, can forwards/backwards movement in gesture alone push people forwards/backwards along the mental time line, even if the corresponding language does not use any spatial language at all? Or is spatial language needed in order to make people pay attention to the direction of the gesture? In one of our experiments (Winter and Duffy, 2020, Experiment 4), we manipulated two factors, each one of which is a categorical predictor in the corresponding linear model: The first factor is gestural movement, whether the hands move forwards or backwards. The second factor is whether the language was spatial (moved by two days) or not (changed by two days). The corresponding model that we used had the following basic structure:

    glm(response ~ gesture * language, . . .)

    Coefficients:
    (Intercept)  gesture  language  gesture:language
              ?        ?         ?                 ?

The use of glm() as opposed to lm() is irrelevant for the present discussion and will be explained in the next section (§7). What matters here is the fact that the above function call combines predictors with the multiplication symbol '*' rather than the plus symbol '+', as in the last section. This difference in notation instructs the statistical model to not only estimate the effects of gesture and language, but also the effects of unique combinations of both predictors.
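To make the notation concrete, here is a self-contained sketch with invented data. It mimics the structure of the design, not the study's actual data or results:

    set.seed(1)
    n <- 200
    gesture  <- factor(sample(c("forwards", "backwards"), n, replace = TRUE))
    language <- factor(sample(c("spatial", "non-spatial"), n, replace = TRUE))

    # Hypothetical binary responses (1 = Friday, 0 = Monday), here random
    response <- rbinom(n, size = 1, prob = 0.5)

    mdl <- glm(response ~ gesture * language, family = binomial)
    coef(mdl)  # output includes a gesture:language interaction term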
Another way of thinking about this interaction is to say that one predictor has a different effect for specific values of the other predictor. For example, the forwards/backwards effect could be nullified if the language is non-spatial, which is indeed what we found (Winter and Duffy 2020). In the output, the interaction appears as a third term in the model, gesture:language. The size of this coefficient corresponds to the strength of the interaction. The larger this coefficient is (whether positive or negative), the more the gesture and language predictors co-depend on each other in changing the response.

In linguistics, many statistical models include such interactions. One issue that arises, however, is that once two predictors are 'interlocked' by virtue of participating in an interaction, each predictor's influence has to be interpreted with respect to the specific values of the other predictor. In the presence of an interaction, what influence a predictor has on the response variable will depend on the specific level of the other predictor, and so there is no easy-to-interpret 'across the board' effect for the individual predictors anymore. As a result, interactions generally make models harder to interpret. Thus, while interactions are often theoretically interesting and need to be included if a hypothesis actually specifies that one predictor's influence on the response depends on another predictor, including interaction terms also comes at the epistemological cost of making models harder to interpret. The question whether an interaction should or should not be included depends on theory. In the case of Winter and Duffy (2020), the interaction was the primary effect of theoretical interest and therefore had to be included in the model. As it is easy to misinterpret the output of statistical models that contain interactions, the reader is advised to consult a statistics textbook on this material. Winter (2019a) has a full chapter focused on the interpretation of interactions.

7 Becoming a more flexible data analyst via extensions of linear models

7.1 Generalized linear models

In the first regression example we discussed above (§5), the response was continuous. Each word was represented by an average rating where words are more or less good, in a scalar manner (Warriner, Kuperman and Brysbaert 2013). The Monday/Friday response in the second example (§6) was discrete. As discussed above, predictors in linear models can be continuous or categorical. However, to incorporate discrete responses, a more substantive change to the model is required.
Generalized linear models (GLMs) are an extension of linear models that allow incorporating different assumptions about the nature of the response variable. The generalized linear model framework subsumes the linear models we discussed so far. That is, the multiple regression models we discussed above (§5–6) are specific cases of the generalized linear model. Table 1.1 gives an overview of the three 'canonical' generalized linear models that cover a lot of common use cases in linguistics.

Table 1.1: Three of the most common types of response variables and the most canonical generalized linear models that correspond to them.

    Response variable               Generalized linear model
    continuous                      multiple regression
    discrete: binary (fixed N)      logistic regression
    discrete: count (no fixed N)    Poisson regression

In linguistics, logistic regression is generally used when the response variable is binary, such as was the case with the Monday/Friday responses in the example above. Other response variables that are binary include things such as the English dative alternation (Bresnan et al. 2007), the usage of was versus were (Tagliamonte and Baayen 2012), or the presence/absence of a metaphor (Winter 2019b). Any case where the response involves only two categories is amenable to logistic regression, and in the form of multinomial logistic regression, the approach can be extended to include response variables with more than two categories.

Poisson regression is another type of generalized linear model that is incredibly useful for linguistics because it is the canonical model type to deal with count variables that have no known or fixed upper limit. Example applications include showing that visual words are more frequent than non-visual words (Winter, Perlman and Majid 2018), or modeling the rate of particular fillers and discourse markers as a function of whether a speaker speaks politely or informally (Winter and Grawunder 2012). Given that linguists frequently count the frequency of discrete events, Poisson regression should be a natural part of the linguistic toolkit (Winter and Bürkner 2021).
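In R, the move from lm() to the generalized linear models in Table 1.1 amounts to choosing a 'family'. A minimal, self-contained sketch with invented variables (the variable names are hypothetical stand-ins for the kinds of examples cited above):

    set.seed(7)
    d <- data.frame(
      register         = factor(sample(c("spoken", "written"), 100, replace = TRUE)),
      metaphor_present = rbinom(100, size = 1, prob = 0.3),  # binary response
      filler_count     = rpois(100, lambda = 4)              # count response
    )

    # Binary response: logistic regression via the binomial family
    glm(metaphor_present ~ register, family = binomial, data = d)

    # Unbounded count response: Poisson regression via the poisson family
    glm(filler_count ~ register, family = poisson, data = d)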
Conceptually, the core ideas discussed in relation to multiple regression (§5) carry over to the case of generalized linear models. Just as before, fitting a generalized linear model to a dataset yields estimates of slopes, with each slope representing how much a binary variable (logistic regression) or unbounded count variable (Poisson regression) depends on the predictor(s) of interest. However, a key difference to the more basic case of multiple regression is that the slopes will appear in a different metric. In the case of Poisson regression for unbounded count data, the coefficients will appear as logged values; in the case of logistic regression for binary data, the coefficients will appear as log odds. The reader is advised to read an introductory text on generalized linear models to aid the interpretation of the exact numerical values of the coefficients (Winter 2019a; Winter and Bürkner 2021).
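As a small aid in the meantime, R's built-in conversion functions can translate these metrics into more familiar ones. The coefficient values below are invented for illustration:

    # Poisson coefficients are on the log scale: exp() turns a slope into
    # a multiplicative change in the expected count
    exp(0.7)        # a slope of 0.7 implies a ~2-fold increase per unit

    # Logistic coefficients are log odds: plogis() maps a value on the
    # log odds scale (e.g., intercept + slope) onto a probability
    plogis(-0.99)   # approximately 0.27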
The way to deal with non-independent clusters of observations within a generalized linear model framework is to incorporate random effects into the model. The mixing of the regular predictors we have dealt with in the last few sections (now called ‘fixed effects’ in this framework) and what are called ‘random effect’ predictors is what gives mixed models their name. Mixed models are also called multilevel models, linear mixed effects models, or multilevel regression, and specific instances of these models are also discussed under the banner of hierarchical linear regressions. Gries (2015a) calls mixed models “the most under-used statistical method in corpus linguistics.” McElreath (2020: 400) states that such “multilevel regression deserves to be the default approach.” Clearly, there is no point in using mixed models if a particular application does not call for it. However, because data sets in linguistics will almost always include repeated data points from the same ‘individual’ (person, word, language family, text etc.), mixed models are almost always needed (Winter and Grice 2021).

A random effect can be any categorical variable that identifies subgroups in the data, clusters of observations that are associated with each other by virtue of coming from the same individual. The word ‘random’ throws some novices off because differences between grouping factors are clearly not ‘random’, i.e., a specific participant may systematically respond differently from another participant in a study. It therefore helps to think of ‘randomness’ in terms of ‘ignorance’: by fitting a specific grouping variable as a random effect, we are effectively saying that prior to the study, we are ignorant about the unique contribution of each individual. Random effects then estimate the variation across individuals.

It is entirely possible (and indeed, often required) to have random effects for participants in the same model in which there are also fixed effects that are tied to the same individuals. For example, we may hypothesize that gender and age systematically affect our response variable, and we want our mixed model to include these two variables as fixed effects. In addition, the same model can, and indeed probably should, also contain a random effect for participant. Whereas the fixed effects capture the systematic variation that an individual is responsible for (e.g., for an increase in x years of age, the response changes by y), the random effects capture the idiosyncratic contribution of that individual. Thus, including fixed effects tied to specific individuals does not preclude the inclusion of additional individual-specific random effects.
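As a sketch of what this looks like in practice, the model just described could be fit with the lme4 package; the data frame d and the variable names are again assumptions made for illustration.

library(lme4)

# Fixed effects for gender and age; (1 | participant) adds a
# by-participant random intercept that captures each participant's
# idiosyncratic contribution to the response.
m <- lmer(response ~ gender + age + (1 | participant), data = d)
summary(m)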
Importantly, the incorporation of random effects into the modeling process is key to avoiding violations of the independence assumption. Ignoring important random effects can lead to grossly misleading results, as has been extensively discussed in linguistics and elsewhere (Barr 2013; Barr et al. 2013; Matuschek et al. 2017; Schielzeth and Forstmeier 2008; Winter 2019a; Winter and Grice 2021). The reason for this is that fixed effects, such as condition effects, have to be evaluated against random effect variation. If the random effect variation is not actively estimated in the modeling process, fixed effects estimates can be severely overconfident, leading to a much higher rate of spuriously significant results. Because of this, it is of utmost importance that the researcher spends a lot of time thinking about whether there are non-independent clusters in their data set that need to be accounted for.

The discussion so far has focused on including random effects to make sure that the fixed effects in the same model are estimated accurately. However, it is important to emphasize that random effect variation itself may actually be the primary thing that is of theoretical interest in some studies (for examples, see Baumann and Winter, 2018; Drager and Hay, 2012; Idemaru et al., 2020; Mirman et al., 2008). Some analyses may also require models that have more random effects than fixed effects (e.g., Ćwiek et al., 2021). It is important to think about random effect variation as theoretically interesting in its own right, and to interpret the random effect output of one’s models. For a basic introduction to mixed models, see Winter (2019a).
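Acting on this advice is straightforward in lme4. Continuing the sketch from above (with the same invented names), the random effect output of a fitted model can be inspected directly:

library(lme4)
m <- lmer(response ~ gender + age + (1 | participant), data = d)

VarCorr(m)  # estimated random effect variation across participants
ranef(m)    # each participant's predicted deviation from the population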
7.3 Example models within cognitive linguistics

The trajectory of this chapter so far has been an expansion of the statistical modeling repertoire. This is not, like in the case of significance tests, adding new, fundamentally distinct tools to our toolkit. Instead, we take the very same tool (focused on the relation of a response and a predictor in terms of lines) and expand this tool to binary and count data (generalized linear models: logistic regression and Poisson regression), as well as to cases where there are multiple data points for the same grouping factor (mixed models). Table 1.2 shows some different types of models applied to test cognitive linguistic theories. Plain text descriptions of the corresponding R formula notation should help the reader get a better grasp of the extent of the linear model framework, and how this translates into actual data analyses conducted in cognitive linguistics. While the models have different names (“multiple regression”, “logistic regression” etc.), it has to be borne in mind that they are built on the same foundations.

Table 1.2: Examples of linear model formulas in cognitive linguistic work.

Littlemore et al. (2018)
lm(metaphor_goodness ~ frequency + concreteness + emotional_valence)
Multiple regression: This model describes the extent to which a continuous measure of metaphor goodness depends on the frequency, concreteness, and emotional valence of a metaphor.

Winter (2019a, Ch. 17)
lm(metaphoricity ~ iconicity + valence)
Multiple regression: This study operationalized metaphoricity in a continuous fashion using a corpus-based measure; the research question was whether iconicity (e.g., the words bang and beep are iconic) and the emotional quality of words determine the likelihood with which the words are used as source domains in metaphor.

Winter, Perlman, and Majid (2018)
glm(word_frequency ~ sensory_modality, family = poisson)
Poisson regression: The token frequency of words in a corpus (a count variable) was conditioned on which sensory modality (touch, taste, smell, sight, sound) the word belongs to; of interest here is whether visual words are more frequent than non-visual words.

Hassemer and Winter (2016)
glm(height_versus_shape ~ pinkie_curl * index_curve, family = binomial)
Logistic regression: In this experimental study, participants had to indicate whether a gesture depicted the height or the shape of an object; the above model formula expresses the belief that this binary categorical response depends on two predictors that we experimentally manipulated: either the index finger was more or less curved, or the pinkie finger was more or less curled in. Additionally, we were interested in the interaction of these two hand configuration predictors (it is plausible that specific combinations of pinkie curl and index curve values have a unique effect on the choice variable).

Winter, Duffy and Littlemore (2020), simplified
lmer(RT ~ participant_gender * profession_term_gender * verticality)
(additional random effects for participant and item not shown)
Linear mixed effects regression: In this reaction time experiment, we were interested in replicating the finding that people automatically think of power in terms of vertical space because of the conceptual metaphor POWER IS UP. The basic idea is that seeing a word pair such as “doctor ~ nurse” with the word “doctor” shown on top of a screen and the word “nurse” shown at the bottom of the screen will be faster than the reverse mapping (with the less powerful position on top). In addition, we manipulated the gender of the profession terms (e.g., “male doctor” or “female doctor”) and we also wanted to assess whether responses differed as a function of the participant’s gender. Our results provided evidence for a three-way interaction between the participant’s gender, the gender of the profession term, and whether the vertical arrangement was in line with POWER IS UP or not: male participants responded much more quickly when the vertical alignment was consistent with the metaphor and the male profession term was shown on top.
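For the last entry in Table 1.2, the omitted random effects can be made explicit. A fuller, still schematic version of that model with by-participant and by-item random intercepts might look as follows; the actual random effect structure of the published study may differ (for example, it may also include random slopes).

library(lme4)

m <- lmer(RT ~ participant_gender * profession_term_gender * verticality +
            (1 | participant) + (1 | item),
          data = d)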
7.4 Further extensions of the framework

Further extensions of the linear model framework are possible, and specific data sets and theoretical assumptions may require a move towards more complex model types. One set of extensions is in the direction of structural equation models (SEM), which involve extending the linear model framework either by adding what are called ‘latent variables’ (variables that cannot be directly observed), or by adding complex causal paths between predictors, such as when one predictor influences the responses indirectly, mediated by another predictor. Fuoli, this volume, discusses these concepts in more detail.

An additional extension of linear models is the generalized additive model, or GAM. These models are increasingly gaining traction within linguistics (Wieling et al. 2014; Winter and Wieling 2016), especially psycholinguistics (Baayen et al. 2017) and phonetics (Wieling 2018). GAMs are (generalized) linear models that involve the addition of ‘smooth terms,’ which are predictors that are broken up into a number of smaller functions. This allows the inclusion of nonlinearities into one’s modelling process. Such nonlinearities are useful when dealing with spatial data, such as in dialectology, sociolinguistics, or typology, but they are also useful for dealing with temporal data, such as in time series analysis (compare Tay, this volume). This means that GAMs could be used, for example, for modeling such diverse aspects of linguistic data as pitch trajectories, mouse movements in a mouse-tracking study, gesture movements on a continuous scale, or language evolution over time.

Both SEMs and GAMs can also incorporate random effects, in which case they would be mixed SEMs or mixed GAMs. Moreover, since all of these are extensions of the same generalized linear model framework, each one of these approaches can also be logistic regression models (binary data) or Poisson regression models (unbounded count data), just with additional paths or latent variables (SEM) or with additional smooth terms (GAM). In fact, all of these different extensions of linear models are mutually compatible. For example, a GAM can also include random effects or latent variables. Figure 1.2 gives one overview of how the different classes of models can be conceptualized.

Figure 1.2: Snapshot of a portion of the generalized linear model framework and some common extensions.

At this stage in this overview, the reader may get the impression that we are back to some sort of decision tree where we have to choose the right model, just what I have argued is a conceptual disadvantage of the significance testing framework. However, the picture painted here is fundamentally different to the case of significance tests. In the case of the linear model framework, the analysis process involves reasoning about one’s model and building all the things that are hypothesized to matter into a singular model. For example, if there are multiple observations from the same individual in a dataset, random effects need to be added. If there are also hypothesized nonlinear effects, smooth terms need to be added, and so on. What type of model is needed, then, follows from theoretical considerations and one’s domain knowledge about the dataset at hand.
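As an illustration of building everything that is hypothesized to matter into a single model, the mgcv package allows smooth terms and random effects to coexist in one GAM. The variable names below are invented: s(time) fits a nonlinear smooth over time, and s(participant, bs = "re") adds a by-participant random intercept as a special smooth term.

library(mgcv)

# 'participant' must be a factor for the random effect smooth.
m <- gam(pitch ~ condition + s(time) + s(participant, bs = "re"),
         data = d, method = "REML")
summary(m)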
The reader should not put the question “What model should I choose?” first, but instead focus on the guiding question: “What are my assumptions about this data?” Or, more informally, “What do I know about this data that my model needs to know?” There simply is no default way of answering these questions, as this depends on the knowledge, assumptions, and theoretical leaning of the data analyst. Therefore, different researchers come up with different statistical analyses for the same research questions (Botvinik-Nezer et al. 2020; Silberzahn et al. 2018), and this subjectivity of statistical modeling should be endorsed, rather than eschewed (McElreath 2020). We should stay away from any default recipes and instead focus on the thinking part in statistical modeling, arguably what makes data analysis fun and inspiring.

8 Exploratory statistics: A very brief overview of some select approaches

8.1 General overview

As mentioned above, exploratory data analysis is one of the two key ways that we engage with data; it is one of the two ‘sensemaking modes’ of data analysis. Within statistics, one of the names that is most commonly associated with exploratory data analysis is John Tukey, who characterizes the process of exploratory data analysis as “detective work” (Tukey 1977: 1). It is arguably the case that exploratory data analysis has historically been more important for linguists (Grieve 2021), given that the field is founded on detailed and genuinely exploratory “detective work”, such as the early historical linguistics in the 19th century, the rich tradition of descriptive and field linguistics throughout the 20th century, and the rich tradition of observational research conducted under the banner of corpus linguistics since the middle of the 20th century.
It is important that exploratory statistics are not viewed as theory-neutral or inferior to confirmatory statistics (Roettger, Winter and Baayen 2019). One form of exploratory data analysis is data visualization, and indeed, this was a primary focus of Tukey’s work on this topic. In fact, all descriptive statistics (summarizing data) and visualization can be conducted in a genuinely exploratory fashion, looking to see what the data has to offer. For an excellent practical introduction to data visualization with R, see Healy (2019). However, here I want to focus on two approaches that are not only particularly useful for cognitive linguists, but that have also been widely used within the language sciences, especially in corpus linguistics. These approaches can be split up into two broad categories, shown in Table 1.3. Other approaches, with different goals, will be discussed below.

Table 1.3: Two common goals in exploratory data analysis and corresponding approaches.

Goal                   Example techniques
grouping variables     Exploratory Factor Analysis, (Multiple) Correspondence Analysis, . . .
grouping data points   k-means, hierarchical cluster analysis, Gaussian mixture models, . . .

Each row in this table corresponds to a specific analysis goal: Is the target of one’s investigation to look at relationships between lots of variables, grouping them together into a smaller set of underlying factors? Or is the target of one’s investigation to find subgroups of data points (‘clusters’)? The approaches that are used to answer these two questions differ from the above-mentioned linear model framework in a fundamental fashion in that they are genuinely multivariate. This means that in contrast to regression modeling, there is no one primary response variable. Instead, these approaches deal with multiple outcome variables at the same time. For the approaches listed in Table 1.3, there is no asymmetry between ‘predictor’ and ‘response’. All variables are on equal terms.

8.2 Grouping variables with exploratory factor analysis

To exemplify both approaches, we can follow an analysis presented in Winter (2019b), which uses the perceptual ratings for adjectives from Lynott and Connell (2009). The input data has the following structure, with each column corresponding to ratings on one sensory modality, and each row corresponding to a particular adjective:

           sight     touch     sound      taste       smell
abrasive   2.894737  3.684211  1.6842105  0.57894737  0.57894737
absorbent  4.142857  3.142857  0.7142857  0.47619048  0.47619048
aching     2.047619  3.666667  0.6666667  0.04761905  0.09523809
acidic     2.190476  1.142857  0.4761905  4.19047619  2.90476190
The word abrasive, for example, has a relatively high touch rating and much lower taste and smell ratings. In contrast, the word acidic is much more strongly related to taste and smell.

With Exploratory Factor Analysis (EFA), we focus on relationships between the columns. The entire correlation structure between all variables is investigated in a simultaneous fashion, looking to see whether particular variables can be re-expressed in terms of being part of a single underlying factor. Running an EFA on this ‘word by modality’ matrix suggests that there may be two underlying factors in this set of variables. The following summary of ‘loadings’ allows looking at how the original variables relate to the new set of factors. A loading expresses how strongly an individual variable latches onto a factor, with the sign of the loading expressing whether there is a positive association or negative association, which can be thought of as being analogous to a positive correlation (more of X gives more of Y) or a negative correlation (more of X gives less of Y). Two loadings for factor 2 are missing in the output below because they do not exceed the threshold taken to be a strong-enough loading by the base R function factanal() used to compute these results.

Loadings:
       Factor1  Factor2
sight  -0.228    0.674
touch  -0.177    0.496
sound  -0.445   -0.654
taste   0.824
smell   0.945
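For readers who want to reproduce this kind of output, a minimal sketch follows. It assumes that the ratings are stored in a data frame called ratings with one column per modality, as shown above.

# Exploratory factor analysis with two factors on the
# word-by-modality rating data; factanal() is part of base R.
efa <- factanal(ratings, factors = 2)

# Loadings below the print threshold are left blank in the output.
efa$loadings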
A crucial part of doing an Exploratory Factor Analysis is interpreting what the new factors mean. It is not guaranteed that an EFA will yield a theoretically interpretable solution. In this case, however, clear patterns emerge: The positive values for taste and smell, as opposed to the negative values for all other modalities, suggest that the first factor represents how much a word is related to taste and smell. The fact that these two variables load heavily onto the same factor is theoretically interesting given that prior psychological research suggests that taste and smell are perceptually and neurally highly coupled (e.g., Auvray and Spence, 2008; De Araujo et al., 2003). Sight and touch load heavily onto the second factor, with a strong negative loading for sound, suggesting that this factor represents how much things can be touched and seen as opposed to heard.

Theoretically, this factor solution can be thought of as an alternative, more parsimonious, way of representing the perceptual structure of the sensory lexicon of English. There are two underlying factors that allow capturing most of the ways words differ from each other in terms of their sensory characteristics. In this case, these two factors together account for 60% of the variance in the overall ratings. The insight here is that a larger set of variables can be compressed into a much smaller set of factors without losing too much information. This general principle is called dimensionality reduction and can be likened to looking at the night sky, which is a two-dimensional projection of three-dimensional space. Other dimensionality reduction techniques include Principal Components Analysis and (Multiple) Correspondence Analysis. Multidimensional Scaling is a conceptually similar approach that is aimed at finding the underlying structure of similarity or dissimilarity data.

8.3 Grouping data points with cluster analysis

The above Exploratory Factor Analysis answers the question: Are there any groups of variables? We can use the same data to answer the question: Are there any groups of data points? For this, we use the same ‘word by rating’ matrix, but our focus will be on grouping the words (rows) rather than the columns (variables). ‘Cluster analysis’ – itself a vast field of statistics – is the approach that allows answering this question. There are many different specific algorithms that realize this goal, with k-means and various forms of hierarchical cluster analysis being common in corpus linguistics. For a conceptual introduction to cluster analysis in cognitive linguistics, see Divjak and Fieller (2014).

In the case of the sensory modality data, I opted to use the specific clustering technique of Gaussian mixture models (Winter 2019b). In contrast to such approaches as k-means and hierarchical cluster analysis, Gaussian mixture modeling is a clustering technique that has the key theoretical advantage of allowing for fuzzy overlap between clusters, with some words being more and others being less certain members of each cluster. Moreover, Gaussian mixture models actually yield a genuine model of clusters (with parameter estimates), rather than merely a heuristic partitioning of the data.
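Gaussian mixture models of this kind are implemented in, for example, the mclust package. The following sketch again assumes the ratings data frame from above.

library(mclust)

# Mclust() fits Gaussian mixture models for a range of cluster
# numbers (here 1 to 15) and selects the best solution by BIC.
fit <- Mclust(ratings, G = 1:15)

summary(fit)              # chosen model and number of clusters
head(fit$classification)  # hard cluster assignments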
Either way, applying this method to the sensory modality rating dataset yielded 12 clusters (various model fit criteria can be used to find the best cluster solution). The same way that there is no guarantee that Exploratory Factor Analysis produces interpretable factors, there is no guarantee that any cluster analysis method produces interpretable clusters. To interpret whether the cluster solution produces sensible results that can be meaningfully related to existing linguistic proposals, we can look at the most certain words for each cluster, shown in Table 1.4.

Table 1.4: The twelve-cluster solution for the sensory adjectives, with proposed cluster names and the most certain members of each cluster.

Cluster  Proposed name                Adjectives ordered in terms of certainty
1        pure sight                   gray, red, brunette, brown, blonde, reddish, yellow, . . .
2        shape and extent             triangular, conical, circular, curved, little, bent, . . .
3        gross surface properties     crinkled, bristly, prickly, big, sharp, bumpy, wiry, . . .
4        motion, touch, and gravity   craggy, ticklish, low, swinging, branching, scratchy, . . .
5        skin and temperature         tingly, lukewarm, tepid, cool, warm, clammy, chilly, . . .
6        chemical sense               acrid, bitter, tangy, sour, salty, antiseptic, tasteless, . . .
7        taste                        cheesy, chocolatey, bland, unpalatable, alcoholic, . . .
8        smell                        odorous, whiffy, perfumed, reeking, smelly, stinky, . . .
9        sound 1                      noisy, deafening, bleeping, silent, whistling, . . .
10       sound 2                      sonorous, squeaking, melodious, muffled, creaking, . . .
11       impression-related           stormy, cute, crowded, crackling, clear, lilting, . . .
12       multisensory                 beautiful, burning, gorgeous, sweaty, clean, strange, . . .

In Winter (2019b), I discussed these different clusters and their relation to the existing literature on sensory words in detail. The words in some clusters form clearly semantically coherent groups, such as the ‘pure sight’ cluster, which contains color words. Other clusters, however, are less semantically coherent, such as the ‘multisensory’ and ‘impression-related’ categories. Just like EFA, cluster analysis is a purely statistical procedure, and it is up to the analyst to use this procedure within their sensemaking process.
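The certainty ordering used in Table 1.4 can be recovered from the fitted model. A sketch continuing the mclust code above, assuming the adjectives are stored as row names of the ratings data frame:

# fit$uncertainty is 1 minus the posterior probability of each word's
# assigned cluster, so low values indicate highly certain members.
most_certain <- order(fit$uncertainty)
head(rownames(ratings)[most_certain], 10)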
8.4 Other statistical techniques commonly used for exploratory data analysis

The examples discussed so far should clarify how EFA and cluster analysis are very different from the linear models discussed in the previous sections. In contrast to linear models, the exploratory approaches discussed so far consider all variables together without an asymmetric relation between response and predictor. There are, however, approaches that allow dealing with this asymmetric relationship in a genuinely exploratory fashion, such as classification and regression trees (CART; for a conceptual introduction for linguists, see Gries, 2019). Such trees use a method called binary recursive partitioning to split the data into a representation that essentially looks like a decision tree, with discrete split points for different variables in an analysis (see Strobl et al., 2009 for an introduction). For example, a split point for the data shown in Figure 1.1a above could be a taste rating of <2.3, below which most values are very negative. The process of recursively splitting the data into smaller groups can also incorporate multiple different variables, so that, for example, within the set of words with <2.3 taste rating, we may consider a new split along a different variable. Resulting from this recursive splitting procedure is a tree-like representation of the complex relations of how different predictors influence a singular response. The type of data that one can use classification and regression trees for is similar to the type of data one would use regression for, in that it involves multiple predictors and just one response variable. However, in contrast to regression, CART approaches are generally used when there are no clear expectations about how the predictors relate to the response, and which predictors should or should not be included.

Random forests are a further extension of CART, involving an ensemble of different classification or regression trees. Tian and Zhang (this volume) discuss random forests in more detail, but a brief introduction is given here nonetheless: With random forests, each tree is fit on a random subset of data as well as a random subset of variables. This ensures that whatever CART algorithm is used does not learn too much from the specific data at hand (what is called ‘overfitting’). This facilitates generalization to novel, unseen data, rather than homing in on the idiosyncratic characteristics of a specific dataset at hand. Among other things, random forests provide a simple measure that tells the analyst which variables in a study are most influential. For an application of random forests to sound symbolism research, see Winter and Perlman (2021).
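As an illustration, a single tree and a random forest could be fit along the following lines with the rpart and randomForest packages. The data frame and variable names are invented, and the caveats discussed next apply before any such model should be trusted.

library(rpart)
library(randomForest)

# A tree fit by binary recursive partitioning of one response on
# several predictors (a classification tree if the response is a
# factor, a regression tree if it is numeric).
tree <- rpart(modality ~ frequency + valence + iconicity, data = d)

# A random forest: an ensemble of trees, each fit to a random subset
# of rows and predictors; importance = TRUE stores variable importance.
forest <- randomForest(modality ~ frequency + valence + iconicity,
                       data = d, importance = TRUE)
importance(forest)  # which predictors are most influential?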
However, CART and random forests also care about the independence assumption mentioned in §7.2 above. The results ascertained via these approaches are just as much biased in the presence of multiple dependent data points as other approaches. As almost all data in linguistics has nested or crossed dependencies (Winter and Grice 2021), standard CART and random forest models can very rarely be applied to linguistic data. In the case of the random forest analysis in Winter and Perlman (2021), we had to exclude etymologically related forms to ensure that the random forest results are not biased due to etymological relatedness. There are, however, approaches to random forests that allow the incorporation of dependencies (Hajjem, Bellavance and Larocque 2014; Karpievitch et al. 2009; Stephan, Stegle and Beyer 2015). Another important issue that is not often discussed in the linguistic literature is that random forests should generally not be used without tuning the hyperparameters (the settings used by the algorithms) to a specific dataset (Probst, Wright and Boulesteix 2019).

Finally, it should be noted that the set of exploratory techniques covered here is by no means exhaustive. Levshina (2015) in particular discusses many useful exploratory techniques not covered here. Within corpus linguistics in particular, exploratory techniques can also be seen as encompassing such techniques as keyword analysis or topic modeling, and more generally, the class of approaches covered under the banner of “distributional semantics” (Günther, Rinaldi and Marelli 2019). The reader is encouraged to learn more about these techniques and how they are implemented in R.

9 The landscape of statistics: Minefield or playground?

This chapter introduced readers to the landscape of statistical analysis in linguistics, focusing on confirmatory approaches within the domain of linear models, as well as a select group of exploratory approaches. To some novice analysts, statistics may appear like a minefield, with a bewildering array of different approaches, each one of which has many different ways in which it can be misapplied. Others may see statistics as a playground, with so many different toys to play with. Both of these extreme views are dangerous. The minefield perspective is stifling; the playground perspective invites an attitude of ‘anything goes’ and ‘whatever yields the desired effects.’ But the statistical landscape is neither a minefield nor a playground. We can chart a clear path through this landscape by being clear about our analysis goals. The task of data analysis becomes substantially easier when we put our research questions first and are clear about whether we are in a primarily confirmatory or a primarily exploratory mode of data analysis. I have found that oftentimes, lack of clarity about statistics is not genuinely rooted in lack of knowledge about statistics, but in being unclear about one’s goals and hypotheses.
The German physicist Werner Heisenberg once said that “What we observe is not nature in itself but nature exposed to our method of questioning.” This statement is equally true of linguistics: What we observe is not language itself, but language exposed to our method of questioning. This chapter discussed methods that expand the repertoire of questions we can ask when engaging in data analysis. Knowing what is “out there” puts the analyst in the best position to traverse the landscape of statistics. But ultimately, it is the research questions that chart the path, not the statistical methods that hinge on those questions.

References

Asendorpf, Jens B., Mark Conner, Filip De Fruyt, Jan De Houwer, Jaap JA Denissen, Klaus Fiedler, Susann Fiedler, David C. Funder, Reinhold Kliegl and Brian A. Nosek. 2013. Recommendations for increasing replicability in psychology. European Journal of Personality 27(2). 108–119. https://doi.org/10.1002/per.1919.
Auvray, Malika and Charles Spence. 2008. The multisensory perception of flavor. Consciousness and Cognition 17(3). 1016–1031. https://doi.org/10.1016/j.concog.2007.06.005.
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics using R. Cambridge, UK: Cambridge University Press.
Baayen, Harald, Douglas J. Davidson and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59(4). 390–412.
Baayen, Harald, Shravan Vasishth, Reinhold Kliegl and Douglas Bates. 2017. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language 94. 206–234.
Barr, Dale J. 2013. Random effects structure for testing interactions in linear mixed-effects models. Frontiers in Psychology 4. 328.
Barr, Dale J., Roger Levy, Christoph Scheepers and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68(3). 255–278.
Baumann, Stefan and Bodo Winter. 2018. What makes a word prominent? Predicting untrained German listeners’ perceptual judgments. Journal of Phonetics 70. 20–38. https://doi.org/10.1016/j.wocn.2018.05.004.
Berez-Kroeker, Andrea L., Lauren Gawne, Susan Smythe Kung, Barbara F. Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, et al. 2018. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56(1). 1–18. https://doi.org/10.1515/ling-2017-0032.
Bohannon, John. 2011. Social science for pennies. Science 334(6054). 307.
Boroditsky, Lera and Michael Ramscar. 2002. The roles of body and mind in abstract thought. Psychological Science 13(2). 185–189.
Botvinik-Nezer, Rotem, Felix Holzmeister, Colin F. Camerer, Anna Dreber, Juergen Huber, Magnus Johannesson, Michael Kirchler, Roni Iwanir, Jeanette A. Mumford and R. Alison Adcock. 2020. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582(7810). 84–88. https://doi.org/10.1038/s41586-020-2314-9.
Bresnan, Joan, Anna Cueni, Tatiana Nikitina and Harald Baayen. 2007. Predicting the dative alternation. In Cognitive foundations of interpretation, 69–94. KNAW.
Brezina, Vaclav. 2018. Statistics in corpus linguistics: A practical guide. Cambridge, UK: Cambridge University Press.
Bruin, Angela de, Barbara Treccani and Sergio Della Sala. 2015. Cognitive advantage in bilingualism: An example of publication bias? Psychological Science 26(1). 99–107. https://doi.org/10.1177/0956797614557866.
Buyalskaya, Anastasia, Marcos Gallo and Colin F. Camerer. 2021. The golden age of social science. Proceedings of the National Academy of Sciences 118(5). https://doi.org/10.1073/pnas.2002923118. https://www.pnas.org/content/118/5/e2002923118 (26 September, 2021).
Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek and Thomas Pfeiffer. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2(9). 637–644. https://doi.org/10.1038/s41562-018-0399-z.
Cassidy, Scott A., Ralitza Dimova, Benjamin Giguère, Jeffrey R. Spence and David J. Stanley. 2019. Failing grade: 89% of introduction-to-psychology textbooks that define or explain statistical significance do so incorrectly. Advances in Methods and Practices in Psychological Science 2(3). 233–239. https://doi.org/10.1177/2515245919858072.
Chabris, Christopher F., Patrick R. Heck, Jaclyn Mandart, Daniel J. Benjamin and Daniel J. Simons. 2018. No evidence that experiencing physical warmth promotes interpersonal warmth. Social Psychology. https://doi.org/10.1027/1864-9335/a000361.
Ćwiek, Aleksandra, Susanne Fuchs, Christoph Draxler, Eva Liina Asu, Dan Dediu, Katri Hiovain, Shigeto Kawahara, et al. 2021. Novel vocalizations are understood across cultures. Scientific Reports 11(1). 10108. https://doi.org/10.1038/s41598-021-89445-4.
Dąbrowska, Ewa. 2016a. Looking into introspection. In Grzegorz Drożdż (ed.), Studies in Lexicogrammar: Theory and applications, 55–74. Amsterdam: John Benjamins.
Dąbrowska, Ewa. 2016b. Cognitive Linguistics’ seven deadly sins. Cognitive Linguistics 27(4). 479–491. https://doi.org/10.1515/cog-2016-0059.
De Araujo, Ivan ET, Edmund T. Rolls, Morten L. Kringelbach, Francis McGlone and Nicola Phillips. 2003. Taste-olfactory convergence, and the representation of the pleasantness of flavour, in the human brain. European Journal of Neuroscience 18(7). 2059–2068. https://doi.org/10.1046/j.1460-9568.2003.02915.x.
Desagulier, Guillaume. 2017. Corpus linguistics and statistics with R: Introduction to quantitative methods in linguistics. Berlin: Springer.
Divjak, Dagmar and Nick Fieller. 2014. Finding structure in linguistic data. In Dylan Glynn and Justyna Robinson (eds.), Corpus methods for semantics: Quantitative studies in polysemy and synonymy, 405–441. Amsterdam: John Benjamins.
Doyen, Stéphane, Olivier Klein, Cora-Lise Pichon and Axel Cleeremans. 2012. Behavioral priming: It’s all in the mind, but whose mind? PloS One 7(1). e29081. https://doi.org/10.1371/journal.pone.0029081.
Drager, Katie and Jennifer Hay. 2012. Exploiting random intercepts: Two case studies in sociophonetics. Language Variation and Change 24(1). 59–78. https://doi.org/10.1017/S0954394512000014.
Evans, Vyvyan. 2012. Cognitive linguistics. Wiley Interdisciplinary Reviews: Cognitive Science 3(2). 129–141. https://doi.org/10.1002/wcs.1163.
Finkel, Eli J., Paul W. Eastwick and Harry T. Reis. 2017. Replicability and other features of a high-quality science: Toward a balanced and empirical approach. Journal of Personality and Social Psychology 113(2). 244. https://doi.org/10.1037/pspi0000075.
Gámez, Elena, José M. Díaz and Hipólito Marrero. 2011. The uncertain universality of the Macbeth effect with a Spanish sample. Spanish Journal of Psychology 14(1). 156–162.
Gelman, Andrew. 2004. Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics 13(4). 755–779. https://doi.org/10.1198/106186004X11435.
Gelman, Andrew and Cosma Rohilla Shalizi. 2013. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology 66(1). 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x.
Gentleman, Robert and Duncan Temple Lang. 2007. Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics 16(1). 1–23. https://doi.org/10.1198/106186007X178663.
Gibbs, Raymond W. 2007. Why cognitive linguists should care more about empirical methods. In Monica Gonzalez-Marquez, Irene Mittelberg, Seana Coulson and Michael Spivey (eds.), Methods in Cognitive Linguistics, 2–18. Amsterdam: John Benjamins.
Gibbs, Raymond W. 2013. Walking the walk while thinking about the talk: Embodied interpretation of metaphorical narratives. Journal of Psycholinguistic Research 42(4). 363–378. https://doi.org/10.1007/s10936-012-9222-6.
Gibson, Edward and Evelina Fedorenko. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Sciences 14(6). 233. https://doi.org/10.1016/j.tics.2010.03.005.
Gigerenzer, Gerd. 2004. Mindless statistics. The Journal of Socio-Economics 33(5). 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
Gries, Stefan. 2009. Quantitative corpus linguistics with R: A practical introduction. New York: Routledge.
Gries, Stefan. 2015a. The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1). 95–125. https://doi.org/10.3366/cor.2015.0068.
Gries, Stefan. 2015b. Some current quantitative problems in corpus linguistics and a sketch of some solutions. Language and Linguistics 16(1). 93–117. https://doi.org/10.1177/1606822X14556606.
Gries, Stefan. 2018. On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally. Journal of Second Language Studies 1(2). 276–308. https://doi.org/10.1075/jsls.00005.gri.
Gries, Stefan. 2019. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 16(3). 617–647. https://doi.org/10.1515/cllt-2018-0078.
Grieve, Jack. 2021. Observation, experimentation, and replication in linguistics. Linguistics 59(5). 1343–1356. https://doi.org/10.1515/ling-2021-0094.
Griffiths, Thomas L. 2015. Manifesto for a new (computational) cognitive revolution. Cognition 135. 21–23. https://doi.org/10.1016/j.cognition.2014.11.026.
Grolemund, Garrett and Hadley Wickham. 2014. A cognitive interpretation of data analysis. International Statistical Review 82(2). 184–204. https://doi.org/10.1111/insr.12028.
Günther, Fritz, Luca Rinaldi and Marco Marelli. 2019. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science 14(6). 1006–1033. https://doi.org/10.1177/1745691619861372.
Hajjem, Ahlem, François Bellavance and Denis Larocque. 2014. Mixed-effects random forest for clustered data. Journal of Statistical Computation and Simulation 84(6). 1313–1328. https://doi.org/10.1080/00949655.2012.741599.
Haller, Heiko and Stefan Krauss. 2002. Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research 7(1). 1–20.
Hassemer, Julius and Bodo Winter. 2016. Producing and perceiving gestures conveying height or shape. Gesture 15(3). 404–424.
Healy, Kieran. 2019. Data visualization: A practical introduction. Princeton, NJ: Princeton University Press.
Hullman, Jessica and Andrew Gelman. 2021. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review.
Hurlbert, Stuart H. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54(2). 187–211. https://doi.org/10.2307/1942661.
Idemaru, Kaori, Bodo Winter, Lucien Brown and Grace Eunhae Oh. 2020. Loudness trumps pitch in politeness judgments: Evidence from Korean deferential speech. Language and Speech 63(1). 123–148. https://doi.org/10.1177/0023830918824344.
Jaeger, T. Florian. 2008. Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language 59(4). 434–446.
Jaeger, T. Florian, Peter Graff, William Croft and Daniel Pontillo. 2011. Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology 15(2). 281–319. https://doi.org/10.1515/lity.2011.021.
Jamalian, Azadeh and Barbara Tversky. 2012. Gestures alter thinking about time. In Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 34.
Karpievitch, Yuliya V., Elizabeth G. Hill, Anthony P. Leclerc, Alan R. Dabney and Jonas S. Almeida. 2009. An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of RF++. PloS One 4(9). e7087. https://doi.org/10.1371/journal.pone.0007087.
Kline, Rex B. 2004. Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
  • 69. Project Gutenberg™ eBooks are often created from several printed editions, all of which are confirmed as not protected by copyright in the U.S. unless a copyright notice is included. Thus, we do not necessarily keep eBooks in compliance with any particular paper edition. Most people start at our website which has the main PG search facility: www.gutenberg.org. This website includes information about Project Gutenberg™, including how to make donations to the Project Gutenberg Literary Archive Foundation, how to help produce our new eBooks, and how to subscribe to our email newsletter to hear about new eBooks.