SchemaCMD: An XML-based storage schema for the compilation of mixed-source CMD corpora
Cornelius Puschmann, University of Düsseldorf
Towards a Reference Corpus of Web Genres, University of Birmingham, 27 July 2007
Contents of this presentation
 - A) Problems of classifying digital genres
 - B) Granularity and meta-data in blogs
 - C) An example: the corporate web log corpus
 - D) Variation and genre
 - E) Individuated CL?
A) Problems of classifying digital genres
Approaches to genre and text typology
Genre (abstract meta-concept)
 - “a class of communicative events with a shared set of communicative purposes” (Swales)
 - “socially recognized types of communicative actions that are habitually enacted by members of a community to realize particular social purposes” (Yates & Orlikowski)
Text typology (concrete instantiations)
 - “Linguistic features, their co-occurrence and relative distribution in a text” (Biber, paraphrase)
A faceted classification scheme (Herring 2007)
Situational factors (8) = discourse community and communicative purpose
 - Participation structure
 - Participant characteristics
 - Purpose
 - ...
Medium factors (10) = creation and presentation
 - Synchronicity
 - Message transmission
 - Persistence of transcript
 - ...
Aspects of digital genres
[Figure: diagram with the labels "concrete" and "abstract"]
Are blogs a genre?
 - discourse community and communicative purpose are both highly variable aspects of digital genres
 - creation and presentation are relatively stable aspects and the result of design choices made by software developers
B) Granularity and meta-data in blogs
Blog content syndication
 - blog content is usually available via web feeds
   - RSS 0.91, 0.92, 1.0, 2.0
   - Atom 1.0
 - RSS and Atom are based on XML (and can thus be extended)
 - RSS is mostly for blogs, Atom is for content syndication in general
A sample RSS 2.0 feed
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Yahoo! Search Blog</title>
    <link>http://www.ysearchblog.com/</link>
    <description>A look inside the world of search from the people at Yahoo!</description>
    <language>en</language>
    <copyright>Copyright 2007</copyright>
    <lastBuildDate>Thu, 19 Jul 2007 12:32:38 -0800</lastBuildDate>
    <generator>http://www.sixapart.com/movabletype/?v=3.2ysb5-20051201</generator>
    <docs>http://blogs.law.harvard.edu/tech/rss</docs>
    <item>
      <title>Weather Report: Yahoo! Search update</title>
      <description><![CDATA[<p>We've been rolling out some changes to our fresh web data and crawling, indexing and ranking...]]></description>
      <link>http://www.ysearchblog.com/archives/000470.html</link>
      <guid>http://www.ysearchblog.com/archives/000470.html</guid>
      <category>Weather Report</category>
      <pubDate>Thu, 19 Jul 2007 12:32:38 -0800</pubDate>
    </item>
    <item> ...
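A minimal sketch of how per-post meta-data could be read from such a feed. It uses Python's standard library rather than the PHP of the corpus tool. The function name `parse_items`, the dictionary keys, and the shortened feed are assumptions made for illustration, not part of the original tool.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Shortened version of the sample feed (one complete item only).
SAMPLE_FEED = """<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Yahoo! Search Blog</title>
    <link>http://www.ysearchblog.com/</link>
    <item>
      <title>Weather Report: Yahoo! Search update</title>
      <description><![CDATA[<p>We've been rolling out some changes to our fresh web data and crawling, indexing and ranking...]]></description>
      <link>http://www.ysearchblog.com/archives/000470.html</link>
      <guid>http://www.ysearchblog.com/archives/000470.html</guid>
      <category>Weather Report</category>
      <pubDate>Thu, 19 Jul 2007 12:32:38 -0800</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict of meta-data per <item> in an RSS 2.0 feed."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    posts = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "guid": item.findtext("guid", default=""),
            "categories": [c.text for c in item.findall("category")],
            # RSS 2.0 dates follow the RFC 822 style, which email.utils parses.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "body": item.findtext("description", default=""),
        })
    return posts


posts = parse_items(SAMPLE_FEED)
print(posts[0]["title"], posts[0]["published"].isoformat())
```

Author, time of creation and tags arrive as separate fields, so they can be stored next to the text without any scraping of the blog's HTML.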
C) An example: the corporate web log corpus
The corpus tool
 - web-based, runs on Apache, PHP and MySQL
 - a researcher can point the tool to a blog (or any source exposing a web feed), which is then indexed
 - indexing is initiated manually, but this could be automated with a cron job
 - exploration of sources could also be automated (e.g. by using blogger.com's “random blog” feature, or by implementing the Google Data API)
MySQL data structure
 - sources
 - BNC top 100 words
 - sub-types of corp. blogs
 - blogs, press eds., ...
 - n-grams (not computed due to cost)
 - POS frequencies by post
 - post data (via RSS/Atom)
 - additional post statistics
 - tokens (depends on types)
 - types (string + POS)
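The slide names the tables but not their columns. The following is a hedged sketch of how the core of such a structure (sources, posts, types, tokens) might look. All table and column names are hypothetical. SQLite is used instead of MySQL only so the example runs without a database server.

```python
import sqlite3

# Hypothetical schema: a token row points at a type row (string + POS),
# mirroring the note "tokens (depends on types)".
SCHEMA = """
CREATE TABLE sources (
    id INTEGER PRIMARY KEY,
    feed_url TEXT NOT NULL UNIQUE,
    source_type TEXT NOT NULL
);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id),
    guid TEXT NOT NULL,
    title TEXT,
    published TEXT,
    UNIQUE (source_id, guid)
);
CREATE TABLE types (
    id INTEGER PRIMARY KEY,
    string TEXT NOT NULL,
    pos TEXT NOT NULL,
    UNIQUE (string, pos)
);
CREATE TABLE tokens (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    position INTEGER NOT NULL,
    type_id INTEGER NOT NULL REFERENCES types(id),
    PRIMARY KEY (post_id, position)
);
"""


def create_corpus_db(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_token(conn, post_id, position, string, pos):
    """Store one token, reusing the type row if (string, pos) already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO types (string, pos) VALUES (?, ?)", (string, pos)
    )
    type_id = conn.execute(
        "SELECT id FROM types WHERE string = ? AND pos = ?", (string, pos)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO tokens (post_id, position, type_id) VALUES (?, ?, ?)",
        (post_id, position, type_id),
    )
```

The unique constraint on (source_id, guid) means that re-indexing a feed, for example from a cron job, would not store the same post twice.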
Corpus data
 - feeds are used to retrieve, store and analyze language data
 - implemented TreeTagger for automated POS annotation
 - 161 sources
   - 133 corporate blogs, 18 personal, 1 political*, 1 technical**
   - 3 press editorial sections (New York Times, Washington Post, LA Times)
   - 5 press release sections (Microsoft, GM, Sun, Oracle, McDonald's)
 - 29,528 posts
 - 7,821,317 tokens
Note: a much bigger corpus could easily be built using the tool, provided sufficient computational resources are available.
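A minimal sketch of how POS frequencies by post could be derived from tagger output. It assumes TreeTagger's common one-token-per-line, tab-separated format (token, tag, lemma). The function name and the sample text are hypothetical and are not taken from the tool.

```python
from collections import Counter


def pos_frequencies(tagged_text):
    """Return each POS tag's share of all tokens in a post, in percent."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        counts[parts[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100.0 * n / total for tag, n in counts.items()}


# Hypothetical tagger output for a four-word post.
SAMPLE_TAGGED = "The\tDT\tthe\nblog\tNN\tblog\nis\tVBZ\tbe\nnew\tJJ\tnew"
print(pos_frequencies(SAMPLE_TAGGED))
```

Storing these shares per post, rather than per source, is what allows variation within a single blog to be measured later.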
D) Variation and genre
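Measuring text formality via f-scores
F-score (Heylighen & Dewaele): a metric to quantify the level of formality in a text, where formality is specifically defined as context-independence
f = 0.5 * ((N + ADJ + PRP + DET) - (PN + V + ADV + ITJ) + 100)
Each term is the frequency of a part-of-speech category, expressed as a percentage of all words in the text: N = nouns, ADJ = adjectives, PRP = prepositions, DET = determiners, PN = pronouns, V = verbs, ADV = adverbs, ITJ = interjections.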
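The formula can be sketched as a small function. It assumes each category frequency is given as a percentage of all tokens in the text. The two frequency profiles below are invented for illustration and are not measurements from the corpus.

```python
def f_score(freq):
    """Formality score: f = 0.5 * ((N + ADJ + PRP + DET) - (PN + V + ADV + ITJ) + 100).

    `freq` maps category labels to their percentage of all tokens;
    missing categories count as 0.
    """
    formal = sum(freq.get(k, 0.0) for k in ("N", "ADJ", "PRP", "DET"))
    deictic = sum(freq.get(k, 0.0) for k in ("PN", "V", "ADV", "ITJ"))
    return 0.5 * (formal - deictic + 100.0)


# Invented profiles: a noun-heavy text and a pronoun- and verb-heavy text.
nominal_text = {"N": 30.0, "ADJ": 10.0, "PRP": 14.0, "DET": 10.0,
                "PN": 3.0, "V": 12.0, "ADV": 3.0, "ITJ": 0.0}
verbal_text = {"N": 18.0, "ADJ": 5.0, "PRP": 9.0, "DET": 8.0,
               "PN": 14.0, "V": 22.0, "ADV": 8.0, "ITJ": 1.0}

print(f_score(nominal_text))  # 73.0
print(f_score(verbal_text))   # 47.5
```

Because the frequencies are percentages, the score lies between 0 and 100, and a text in which the two groups balance out scores 50.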
Example: high f-score (press release)
The Toshiba Portege R400 is a Windows Vista-inspired signature mobile PC that incorporates innovative connectivity and display technologies to provide timely access to e-mail and appointments via Active Notifications and is built on Windows SideShow™ technology. [...]
http://www.microsoft.com/presspass/press/2007/jan07/01-07CES2007PR.mspx
 - high noun frequency
 - high adjective frequency
 - more nominal than verbal
 - often relate complex information
 - often describe future events/potentiality
Example: low f-score (blog entry)
OK, OK, I'm partly at fault here. But, hear me out. Last year at Gnomedex I had my son demonstrate Second Life up on stage while I was hosting a panel discussion. Someone from Linden Labs (the folks who make Second Life), Beth Goza (she now works at Microsoft), saw that, and told me and my son to knock it off. People under 18 aren't allowed in Second Life. So, what did I do? I just told Patrick never to go into Second Life and I didn't go back into Second Life either. [...]
http://scobleizer.com/2007/02/18/second-life-has-my-credit-card-and-wont-let-go/
 - high frequency of personal pronouns
 - more verbal than nominal
 - often describe past events, personal impressions, feelings
F-score over time for two sources
[Figure: light blue = Jonathan Schwartz (blog); dark blue = New York Times (editorials)]
F-score and standard deviation for all sources
[Figure: x-axis = stdev; y-axis = f-score; dot size = number of posts; labels: editorials, press releases]
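A sketch of how the three plotted values per source (f-score, standard deviation, number of posts) could be computed from per-post f-scores. The slide does not say whether the sample or the population standard deviation was used; the sample version is assumed here. The source names and scores are invented.

```python
from statistics import mean, stdev


def summarize(scores_by_source):
    """Return mean f-score, standard deviation and post count per source."""
    summary = {}
    for source, scores in scores_by_source.items():
        summary[source] = {
            "posts": len(scores),
            "mean_f": mean(scores),
            # The sample stdev needs at least two posts; report 0.0 otherwise.
            "stdev_f": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


# Invented per-post scores for two hypothetical sources.
example = {
    "press-release-source": [70.0, 72.0, 71.0],
    "personal-blog-source": [50.0, 60.0, 70.0],
}
for name, stats in summarize(example).items():
    print(name, stats)
```

A low standard deviation would indicate a source whose posts are stylistically uniform, and a high one a source whose formality varies from post to post.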
E) Individuated CL?
Observations
 - while some (established) genres are highly conventionalized (and therefore distinctive) with regard to their style and content, blogs are not
 - taking into account that community and purpose in blogs are highly diversified, it is doubtful whether this will change
 - variability (in all four aspects) may be a constitutive feature of blogs
O1: digital genre labels are difficult to assign using traditional categories
O2: however, the availability of meta-data (author, time of creation, tags) could solve the problem of a permanently unstable genre ecology
O3: perhaps we should consider building web-data corpora that are speaker-diversified in addition to being register-diversified?
Thanks for listening!
