Marina Santini
Artificial Solutions,
KYH Agile Web Development
Stockholm
Uppsala University
Department of Linguistics and Philology, Seminar Series
Fri 4 March 2011
Genres on the Web (GoWeb)
Outline
 What is genre? What is web genre?
 What is the difference between genre and web genre?
 Why is (web) genre important?
 Automatic web genre identification
 The very beginning: Biber and Karlgren&Cutting
 Sharoff
 Kim & Ross
 Santini
 Stein et al.
 Web genre identification by Humans
 Karlgren
 Rosso & Haas
 Crowston et al.
 Future directions
What is genre? The beginning…
 Aristotle (4th cent. BC): drama, lyric, epic
 Drama: tragedy, comedy, satyr play
 Literary theory and literary genres
 Library classification
 Library classification used also in online bookshops (e.g.
Amazon)
 Music genres (jazz, rock, etc.), film genres (thriller,
drama, western etc.)
More recently…
 Genre in academic contexts, in
workplace and professional
contexts, public contexts, in
pedagogy (teaching writing), etc.
(research articles, essays, emails,
memos, etc.)
Recent Genre Definitions: 2008-2010
Genre & Corpus Linguistics
 Surprisingly, no explicit definition of what genre is…
 Brown corpus (1961): 15 genres
 Stockholm-Umeå Corpus (SUC) (1990s)
 British National Corpus (1990s)
 etc.
David Lee and the BNC Jungle
Why is genre important?
 It is a context carrier: being based on recurrent
conventions and predictable expectations, genre
provides the communicative context and the
communicative purpose for which a text has been
produced.
Think of what happens in your mind when you come
across a specific genre, e.g. FAQs, reviews,
interviews, academic papers, reportage…
Benefits (I)
Being a context carrier…
 Complexity reduction: a text receives identity
through belonging to a certain genre;
 Predictivity: genre reduces information overload;
 Findability: genre helps find web documents
”relevant” to our information needs.
Benefits (II)
 Genre competence increases information
understanding:
 genre competence increases self-protection against
digital crimes (phishing, hoaxes, cyberbullying) because it
can help us spot genre anomalies and, consequently,
malicious intentions;
 Genre competence helps implement democracy:
 some educational programs (e.g. in Australia) teach
genre from primary school onwards, because those who
leave school after the primary years without genre
competence become socially disadvantaged in the
structure of power.
What is webgenre?
 All types of genres that are on the web…
 Paper genres that have been uploaded in any format
+ genres that do not have any counterpart in the
paper world:
 ex: home page, About Us, FAQs, webzine,
personal blog, corporate weblogs …
How is webgenre different from paper
genre?
 On the web, there are new communicative settings,
and new communicative contexts, so new genres are
spawned
 On the web, the new communicative settings have
been spurred by a proliferation of new technologies
that ease, foster and model our communication: ex:
chats, blogs, social networks, like Facebook, Twitter,
LinkedIn…
Then, a written text is not only
topic…
 There are many dimensions of variation: domain,
topic, register, sentiment, level of complexity or
difficulty or specialisation, trustworthiness and
credibility, etc.
 … genre is a dimension of variation. Genre gives us
a topic packaged in a certain way. From the package,
we are able to identify the communicative purpose of
the text and the communicative context that has
spawned such a text.
A step back…
 Biber (1988)
 Genre
 Text types
 66 linguistically-motivated features
 Multi-Dimensional Analysis
 Ad-hoc corpus
 Karlgren & Cutting (1994)
 Genre
 20 shallow features
 Brown Corpus
Biberian
Text Types
Biber (1988)
Biber (1989)
Biber (1993)
Biber (1995)
Biber (2004a)
Biber (2004b)
Biber et al. (2005)
etc.
Genres/Registers
vs.
Text Types
External Features
vs.
Internal Features
“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily
recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon,
conversation), while I have used the term ‘text type’ for varieties that are defined
linguistically (rather than perceptually)” (Biber, 1993).
Multi-Dimensional Analysis
Factor Analysis, Factor Scores (Biber, 1988)
Cluster Analysis (Biber, 1989)
Additional Statistical Tests (Biber, 2004a; 2004b, etc.)
1. intimate interpersonal interaction
2. informational interaction
3. scientific exposition
4. learned exposition
5. imaginative narrative
6. general narrative exposition
7. situated reportage
8. involved persuasion
Cluster Analysis - Biber (1989)
Factor 2 - Biber (1988)
Criticism: Lee (1999)
From Biber’s text types to genres of electronic
corpora: Karlgren and Cutting (1994)
Karlgren and Cutting (1994):
Recognizing Text Genres with Simple Metrics
Using Discriminant Analysis
 20 features
 Discriminant analysis
 Brown corpus
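To illustrate what "shallow" features can look like, here is a minimal Python sketch. The four features and the function name shallow_features are assumptions chosen for illustration; they are not claimed to be the 20 features used by Karlgren and Cutting, and no discriminant analysis is performed here.

```python
import re

# Hypothetical set of first-person words, chosen only for this illustration.
FIRST_PERSON = {"i", "me", "my", "we", "our"}

def shallow_features(text):
    """Compute a few cheap, surface-level counts that need no parsing."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input
    return {
        "words_per_sentence": len(words) / (len(sentences) or 1),
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "first_person_rate": sum(w in FIRST_PERSON for w in words) / n_words,
    }

features = shallow_features("I like my dog. My dog likes me.")
print(features)
```

Each text becomes a small dictionary of numbers, which is the kind of input a statistical method such as discriminant analysis can then work on.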
POSs & SUC
More than 15 years later…
 Grieve, Biber et al.: ”We define a genre in a very similar
manner to how we define register – i.e. as a variety of
language defined by the external situation in which it is
produced. However, while a register is characterized by
pervasive linguistic features, a genre is characterized by
conventionalized linguistic features”
 Karlgren: ”Genre is a vague but well-established
notion, and genres are explicitly identified and
discussed by language users even while they may be
difficult to encode and put into practical use”
GoWeb
The concept of genre is beneficial…
but difficult to pin down and to
agree upon
GoWeb
In the book, we do not
propose a single and
unified definition of
genre. Authors give
their different views on
genre.
Do we really need a definition?
 After all….
 … once we are convinced that genre is useful, we could just
say that: genre is a classificatory principle based on a
number of attributes.
 The web is immense, we cannot think of classifying web
documents by genre manually, can we? Let’s just focus on
AUTOMATIC web GENRE CLASSIFICATION!
What do we need for Automatic
webGenre Identification (AGI)?
 We need:
 a genre taxonomy (palette) and a corpus
 measurable attributes (features) that can be extracted
automatically
 an automatic classifier, i.e. a computational model that
does the classification for us
Vector representation & supervised
machine learning algorithms (esp.
SVM)
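The three ingredients listed above (palette and corpus, features, classifier) can be sketched in a few lines of Python. This is a toy under stated assumptions: the corpus, the cue-token features and all function names are invented for illustration, and a nearest-centroid classifier stands in for the SVM named on the slide so that the example needs only the standard library.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, genre) pairs invented for illustration.
TRAIN = [
    ("Q: how do I reset my password? A: click the reset link.", "faq"),
    ("Q: where is my order? A: check the tracking page.", "faq"),
    ("Today I went hiking and I loved every minute of it.", "blog"),
    ("I just got back from my trip and I am so tired.", "blog"),
]

# Hypothetical feature set: relative frequency of a few cue tokens.
CUES = ["q:", "a:", "i", "my", "?"]

def to_vector(text):
    """Map a text to a fixed-length vector of cue-token frequencies."""
    tokens = text.lower().replace("?", " ? ").split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[cue] / total for cue in CUES]

def train(examples):
    """Return one centroid (mean vector) per genre class."""
    by_genre = {}
    for text, genre in examples:
        by_genre.setdefault(genre, []).append(to_vector(text))
    return {
        genre: [sum(col) / len(vectors) for col in zip(*vectors)]
        for genre, vectors in by_genre.items()
    }

def classify(model, text):
    """Assign the genre whose centroid is closest in Euclidean distance."""
    vector = to_vector(text)
    return min(model, key=lambda genre: math.dist(vector, model[genre]))

model = train(TRAIN)
print(classify(model, "Q: can I change my email? A: yes, in settings."))
```

Swapping in a real SVM would change only the train and classify functions; the vector representation stays the same.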
Models for AGI: Scenarios
 Serge Sharoff
 Kim & Ross
 Santini
 Stein et al.
 Others…
GoWeb
Morphology & the Linguist
 Aim: Find a genre palette allowing comparison among
corpora (Web As Corpus initiative ) and across
languages
 A functional genre palette inspired by J. Sinclair
 Many corpora: English and Russian
 Classifier: SVM
 Features: POS trigrams (577 for Russian; 593 for
English)
Ex of POS trigrams: ADV ADJ NOUN
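A POS trigram is simply a window of three consecutive part-of-speech tags. The Python sketch below counts them for a tag sequence that is assumed to come from some tagger; the tag sequence and the function name pos_trigrams are invented for illustration and do not reproduce Sharoff's actual feature extraction.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every window of three consecutive POS tags."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# Hypothetical tagger output for "a very good idea is very good news".
tags = ["DET", "ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)
print(counts[("ADV", "ADJ", "NOUN")])
```

The counts, normalised by document length, would form one dimension per trigram in the feature vector passed to the classifier.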
Sharoff  GoWeb
The expert (the linguist) decides:
Results
KRYS I and Harmonic Descriptor
Representation (HDR)
 Information studies , Digital Libraries:
semantic concept
 Features: HDR = FP, LP or AP (between 1 and
T/(N x MP))
 Number of features: 7431
 Classifier: SVM
 KRYS I + 7 webgenre collection (total: 24 +
7 genre classes , 3452 documents)
Kim & Ross  GoWeb
2477 words
Accuracies: KRYS I & 7-webgenre collection
What about morphology & syntax?
What about noise?
 Collection: 7-webgenre collection + others
 Features: 100 facets
 Genre palette: 7 webgenres + other
 Classifier: inferential model (subjective Bayesian
method)
Santini  GoWeb
7-webgenre collection
 Balanced (200 web pages per genre
class)
 Genre palette
 Not annotated manually
 Built following 2 principles:
 Objective sources
 Consistent genre granularity
100 Facets
Inferential model
 It is a simple probabilistic model based on rules.
 It allows some ”reasoning” through the use of weights
(closer to artificial intelligence than machine learning)
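The slides do not spell out the update rule of the inferential model. As a rough illustration only, the Python sketch below assumes the classic odds-likelihood form often associated with subjective Bayesian methods: each rule carries a weight applied when a facet is observed (LS) and another applied when it is absent (LN). The facet names, weights and prior are invented and are not taken from Santini's model.

```python
def to_odds(probability):
    return probability / (1.0 - probability)

def to_probability(odds):
    return odds / (1.0 + odds)

# Hypothetical rules for one genre ("faq"): facet -> (LS, LN).
# LS > 1 means the facet supports the genre when present;
# LN < 1 means its absence counts against the genre.
RULES = {
    "has_question_marks": (4.0, 0.5),
    "has_first_person": (0.5, 1.0),
}

def posterior(prior, observed_facets, rules=RULES):
    """Update the prior odds of a genre with one weight per rule."""
    odds = to_odds(prior)
    for facet, (ls, ln) in rules.items():
        odds *= ls if facet in observed_facets else ln
    return to_probability(odds)

p = posterior(0.2, {"has_question_marks"})
print(round(p, 3))
```

Because the weights are set by hand rather than learned, the rules stay readable, which is one way to understand the remark that the model is closer to artificial intelligence than to machine learning.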
Comparisons (I)
Different types of noise!
Results
Three experimental settings, three
different genre needs….
1. Genre comparison across corpora
2. Digital libraries, where documents can be more easily
monitored
3. The wild web, where everything is uncertain and
noisy
WEGA prototype:
a retrieval model for genre-enabled web search
Genre retrieval model
 Genre collection and palette: KI-04 corpus: 8 webgenres
 Firefox add-on
 Model: ”lightweight GenreRich model” (linear discriminant
analysis)
 Features: HTML, link features, character features,
vocabulary concentration features (< 100 features)
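The feature families named above can be illustrated with Python's standard HTML parser. This is a minimal sketch, not the WEGA feature set: the four features, the class name PageStats and the definition of vocabulary concentration used here (share of tokens covered by the top_k most frequent word types) are assumptions made for the example.

```python
from collections import Counter
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collect tag, link and text statistics from one HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.link_count = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1

    def handle_data(self, data):
        self.chunks.append(data)

def page_features(html, top_k=1):
    stats = PageStats()
    stats.feed(html)
    stats.close()
    text = " ".join(stats.chunks)
    tokens = text.lower().split()
    top = sum(n for _, n in Counter(tokens).most_common(top_k))
    return {
        "tag_count": stats.tag_count,            # HTML feature
        "link_count": stats.link_count,          # link feature
        "digit_ratio": sum(ch.isdigit() for ch in text) / (len(text) or 1),
        "vocab_concentration": top / (len(tokens) or 1),
    }

# Hypothetical page invented for illustration.
PAGE = ('<html><body><h1>FAQ</h1><p>See <a href="/a">page 1</a> and '
        '<a href="/b">page 2</a>.</p></body></html>')
features = page_features(PAGE)
print(features)
```

Features of this kind are cheap to compute in the browser, which fits the idea of a lightweight model running inside an add-on.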
Stein, Meyer zu Eissen, Lipka GoWeb
WEGA (WEb Genre Analysis)
KI-04 genre collection: 8 webgenres
Genre Classes & Human
Recognition
 How can we decide on the most representative genre
classes? Let’s ask users… yes indeed, but how?
 1) questionnaires (Karlgren)
 2) card sorting (Rosso & Haas)
 3) task-oriented studies (Crowston et al.)
 4) others…
Questionnaires: ”what genres are
available on the internet?”
User Warrant
 Collecting genre terminology in the users’ own words
(3 participants)
 Make the users classify web pages and create piles
(rationale?)
 Users choose the best of the collected genre
terminology (102 participants)
 User validation of the genre palette (257 participants)
 Usefulness of genres for web search (32 participants)
GoWeb: Rosso & Haas
Final genre palette: 18 genres
Genres & Tasks
 3 groups of respondents: teachers, journalists, engineers
 Respondents were asked to carry out a web search for a
real task of their own choice
 What is your search goal?
 What type of web page would you call this?
 What is it about the page that makes you call it that?
 Was this page useful to you?
GoWeb: Crowston et al.
What type of web page would you call this?
 522 unique terms → about 300
Syracuse corpus & AGI
ACL 2010 (Uppsala):
FINE-GRAINED GENRE CLASSIFICATION USING
STRUCTURAL LEARNING ALGORITHMS
Zhili Wu, Katja Markert and Serge Sharoff
 The whole corpus: 3027 annotated webpages divided
into 292 genres.
 Focussing on genres containing 15 or more examples,
the corpus contains about 2293 examples in 52 genres.
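The filtering step described above (keep only genres with 15 or more examples) is easy to express in Python. The data below are invented toy labels, not the Syracuse corpus, and the function name filter_rare_genres is an assumption made for the example.

```python
from collections import Counter

def filter_rare_genres(labelled_pages, min_examples=15):
    """Keep only pages whose genre has at least min_examples pages."""
    counts = Counter(genre for _, genre in labelled_pages)
    kept = {genre for genre, n in counts.items() if n >= min_examples}
    return [(page, genre) for page, genre in labelled_pages if genre in kept]

# Hypothetical toy data: 20 blogs, 15 FAQs, 3 poems.
pages = (
    [(f"blog-{i}", "blog") for i in range(20)]
    + [(f"faq-{i}", "faq") for i in range(15)]
    + [(f"poem-{i}", "poem") for i in range(3)]
)
filtered = filter_rare_genres(pages)
print(len(filtered), sorted({genre for _, genre in filtered}))
```

A threshold like this trades coverage of the genre palette for enough training examples per class.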
Conclusions (I) : Do we really need
a definition of genre?
1. Take a number of web pages belonging to different
web genres (e.g. blogs, home pages, news stories,
FAQs, etc.)
2. Identify and extract genre-revealing features
3. Feed an automatic classifier
Where is the problem?
Conclusions (II)
 The problem with this approach is that without a
theoretical definition and characterization of the
concept of genre, it is not clear:
 how to create a genre taxonomy whose classes both humans and
automatic classifiers can easily discriminate
 how to select a representative corpus for the genre classes
in the taxonomy, since there is a lot of variation in users’
assessment …
 how to identify the optimal genre-revealing features
Future Work
Genre is a high-level concept: we NEED a theoretical
definition of genre for computational and empirical
purposes.
Without a theoretical definition:
 genres become lifeless texts, merely characterized by
formal attributes, and the communicative context, i.e.
the thing that makes genre important, is completely
stripped out
 Although in some restricted experimental settings
this ”formalistic” approach is quite rewarding (more
than 95% success rate), we can hardly generalize from it.
Future directions: AGI is a fertile land
for research and development…
Now that basic explorations have been carried out, we
should concentrate more on the correlation and
interrelation of the following variables:
 Human agreement
 Representation of genre classes
 Number of genre classes
 Nature of genre classes
 Size of the whole corpus
 Structured and unstructured noise
 Genre-revealing features that account for the context that
genres carry with them
 New computational models and algorithms…
Certainties….
 Genre is a useful concept in many disciplines
 Automatic genre classification is feasible, and there is ample
space for improvement
 I am interested in your views on (web) genre:
 send me your impressions, ideas, gut feelings and your genre
classes:
 Facebook page: www.facebook.com/genresontheweb
 Genre blog: www.forum.santini.se
 Webrider’s Short proposal to EU: www.webrider.se
Thank you for your attention!
References (I)
 Bateman, John (2008) Multimodality and Genre,
Palgrave Macmillan
 Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre:
An Introduction to History, Theory, Research, and
Pedagogy (free book);
http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf
 Bruce, Ian (2008) Academic Writing and Genre,
Continuum
 Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic
Variation and Genre, De Gruyter Mouton
References (II)
 Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in
the Internet, John Benjamins Publishing Company
 Heyd, Theresa (2008) Email Hoaxes: Form, function,
genre ecology, John Benjamins Publishing Company
 Lee, David (2001) Genres, Registers, Text Types,
Domains, and Styles: Clarifying the Concepts and
Navigating a Path through the BNC Jungle, Language
Learning & Technology, Vol. 5, Num. 3, September 2001,
pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
References (III)
 Luzón, María José, Ruiz-Madrid, María Noelia and
Villanueva, María Luisa (eds) (2010) Digital Genres,
New Literacies and Autonomy in Language
Learning, Cambridge Scholars Publishing
 Martin, James and Rose, David (2008) Genre
Relations: Mapping Culture, Equinox
 Puschmann, Cornelius (2010) The corporate blog as
an emerging genre of computer-mediated
communication: features, constraints, discourse
situation, Universitätsverlag Göttingen
 WEGA prototype download, documentation and
references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html



Editor's Notes

  • #32: Kris I is pdf7-webgenre collection