Jack T. Bowers
Melanie Seltmann
Austrian Academy of Sciences, Austrian Center for Digital Humanities
Exploring data models for heterogenous
dialect data:
the case of explore.bread.AT!
Outline of Presentation
Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based
markup standards for representing onomasiological
dialectal language
explore.AT!
Overview:
• DBÖ: collection of Bavarian dialectal speech began 1911
• 2015-2016 converted from TUSTEP to TEI
Goals
• Gain cultural and linguistic insights into Bavarian dialects in the former
Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting them to
conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of data in order to share with
project partners;
• Integration of semantic web/LOD resources;
Project Overview: Datasets
DBÖ@TEI
WBÖ@TEI
BaseX Database
place inventory (TEI-listPlace)
concept inventory(TEI-feature structures)
gram features inventory (TEI-feature structures)
questionnaires (TEI-list)
DBÖ@ema
SQL
BaseX Database
Extracted Topical Datasets
explore.bread
The language of Color
lexicon(location(a))
inventory(lexicalFeature(a))
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features
Possible basis for examination of sub-datasets
Visualization
DBÖ Questionnaires
Questionnaires:
While questionnaires are topical in general, they are a complicated
mixture of semasiological (term-based) and onomasiological
(concept-based) questions
e.g.
(31B5) bes. Weißgebäcke:
länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!),
Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
Means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields
The dataset requires significant manual editing and curation due to the nature of
the questionnaires
Desired Enhancements
In most sub-topical studies such as ExploreBread!, being able to format
data onomasiologically would bring potential benefits, for example:
• Domain and/or concept-oriented entries better represent the content of
interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation
Problem:
> TEI has no explicitly designated means of
encoding onomasiological data!
Enhancing original data
• Adding domain (onomasiological) and ontology-based sense tags
<sense corresp="concept:Weißgebäck">Weißgebäck</sense>
<usg type="dom" corresp="concept:Brot">Brot</usg>
• Normalization of phonetic notation*
<form type="lautung" n="1">
   <pron notation="tustep">&gt;str-uts</pron>
   <pron notation="ipa" resp="#JB" change="01.2">ʒ̊truːts</pron>
</form>
• Adding Morphological/Compositional Analysis*
<form type="hauptlemma">
   <orth>(S:emmel)zipfel</orth>
</form>
<form type="hauptlemma" resp="#MS">
   <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg></orth>
</form>
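One way such segment-level annotation could be consumed is sketched below in Python. The function name `concept_segments` is hypothetical and not part of the project; the sketch only assumes the `<seg>` markup shown on this slide and uses the standard-library `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

# The analysed hauptlemma from the slide, with quotes normalised.
FORM = """
<form type="hauptlemma" resp="#MS">
  <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg></orth>
</form>
"""

def concept_segments(form_xml):
    """Return (surface text, concept reference) for each <seg> that carries @corresp."""
    form = ET.fromstring(form_xml)
    pairs = []
    for seg in form.iter("seg"):
        corresp = seg.get("corresp")
        if corresp is not None:
            # itertext() also collects the text of nested <seg> children,
            # so the diminutive suffix is included in the surface string.
            pairs.append(("".join(seg.itertext()), corresp))
    return pairs

segments = concept_segments(FORM)
for surface, concept in segments:
    print(surface, "->", concept)
```

Nested `<seg ana="#dimin">` elements are skipped as separate results here because they carry no `@corresp`; a fuller analysis would probably want to report them as well.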
Lexical Organization
Semasiological: the starting point is a word form, for which the
associated meanings and senses are identified
Onomasiological: the starting point is a concept, for which the forms
used to represent it are identified
Semasiological Lexical Model
Form
meaning(i) meaning(ii) meaning(iii)
Onomasiological Lexical Model
Concept
Form(i) Form(ii) Form(iii)
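The relationship between the two models can be illustrated as an inversion of a mapping. The Python sketch below is only an illustration under stated assumptions: the dictionary contents are toy data loosely based on the bread examples in this talk, and `to_onomasiological` is a hypothetical helper, not project code.

```python
from collections import defaultdict

# Semasiological view (toy data): each form points to the concepts it expresses.
semasiological = {
    "Wecken": ["concept:Wecken"],
    "Strutzen": ["concept:Wecken"],
    "Semmelzipfel": ["concept:Semmel", "concept:Zipf"],
}

def to_onomasiological(form_to_concepts):
    """Invert form -> concepts into concept -> forms."""
    concept_to_forms = defaultdict(list)
    for form, concepts in form_to_concepts.items():
        for concept in concepts:
            concept_to_forms[concept].append(form)
    return dict(concept_to_forms)

onomasiological = to_onomasiological(semasiological)
for concept, forms in onomasiological.items():
    print(concept, forms)
```

The inversion itself is trivial; the difficulty discussed in the rest of the talk is finding an XML serialisation in which the concept, rather than the form, is the entry point.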
Headword
Lemma(i..n)
BROT
brot broet brɛot
Prôt Prôt Prôt
Core DBÖ entry datatypes
—————————————-
Archive record
Headword (Form)
POS
Dialect lemma (Form)
Gram info
Meaning (Sense)
Usage example
Source
Place
Questionnaire
Etymology
Desired Data Structure
Desired Onomasiological Model for Extracted
Terminological DBÖ Datasets
TermEntry
Concept(a)
DialectEntry(i) DialectEntry(ii) DialectEntry(n)
Options using XML-Based Standards
(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)
(ii) TEI-TBX Hybrid (Romary, 2014)
OR…. use TEI P4
TEI <entryFree> Model
(1…n)
<sense @corresp/>
<entryFree @xml:id>
<usg @type="dom">
<superEntry>
<entry @xml:id @xml:lang="bar">
(0…n)
(1…n)
<form type="hauptlemma">
<orth>
(1…n)
(1…1)
<form type="hauptlemma">
(all other element content copied from the original without alteration)
<def @xml:lang>
(0…n)
<sense>
concept:
meaning
concept:
domain
Form (headword(i))
Form (dialect(a))
Metadata:
DBÖ entry (headword (i))
Form (headword(ii))
Form (dialect(b))
Metadata:
DBÖ entry (headword (ii))
TEI <entryFree> Model
concept:
meaning
<entryFree>
   <sense corresp="concept:Wecken">
      <usg type="dom" corresp="concept:Brot">Brot</usg>
      <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
   </sense>
   <superEntry> <!-- for each unique hauptlemma for concept entry -->
      <form type="hauptlemma">
         <orth>Wecken</orth>
      </form>
      <entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">W.eiggn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
         </form>
         <usg type="geo">
            <placeName>St.Michael/B. Bgl.</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Wecken" (ii..n) -->
   </superEntry>
   <superEntry>
      <form type="hauptlemma">
         <orth>Strutzen</orth>
      </form>
      <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">Struzn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
         </form>
         <usg type="geo">
            <placeName>Rohrb. OÖ</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Strutzen" (ii..n) -->
   </superEntry>
</entryFree>
concept:
domain
Form (headword(i))
Form (dialect(a))
Metadata:
DBÖ entry (headword (i))
Form (headword(ii))
Form (dialect(b))
Metadata:
DBÖ entry (headword (ii))
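The regrouping that the <entryFree> model implies (one concept, one <superEntry> per headword, the original entries underneath) could be sketched as follows. This is a minimal Python illustration, not the project's conversion pipeline: `RECORDS`, its field names, and `build_entry_free` are assumptions introduced here, and only a few of the elements from the slide are generated.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Toy records standing in for semasiological DBÖ entries (values from the slide).
RECORDS = [
    {"id": "w834_qdb-d1e602b", "hauptlemma": "Wecken",
     "ipa": "ʋɛiggn̩", "place": "St.Michael/B. Bgl."},
    {"id": "s806_qdb-d1e43847b", "hauptlemma": "Strutzen",
     "ipa": "ʃtruzn̩", "place": "Rohrb. OÖ"},
]

def build_entry_free(concept, domain, records):
    """Group entries under one concept, with one <superEntry> per headword."""
    root = ET.Element("entryFree")
    sense = ET.SubElement(root, "sense", corresp=f"concept:{concept}")
    usg = ET.SubElement(sense, "usg", type="dom", corresp=f"concept:{domain}")
    usg.text = domain

    super_entries = {}  # headword -> <superEntry> element
    for rec in records:
        lemma = rec["hauptlemma"]
        if lemma not in super_entries:
            se = ET.SubElement(root, "superEntry")
            form = ET.SubElement(se, "form", type="hauptlemma")
            ET.SubElement(form, "orth").text = lemma
            super_entries[lemma] = se
        entry = ET.SubElement(super_entries[lemma], "entry")
        entry.set(XML_NS + "id", rec["id"])
        entry.set(XML_NS + "lang", "bar")
        lautung = ET.SubElement(entry, "form", type="lautung", n="1")
        ET.SubElement(lautung, "pron", notation="ipa").text = rec["ipa"]
        geo = ET.SubElement(entry, "usg", type="geo")
        ET.SubElement(geo, "placeName").text = rec["place"]
    return root

entry_free = build_entry_free("Wecken", "Brot", RECORDS)
print(ET.tostring(entry_free, encoding="unicode"))
```

The sketch makes the structural point of the slide concrete: the headword is lifted out of each entry and stated once per <superEntry>, while the concept and domain are stated once per <entryFree>.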
Problems with <entryFree> model
• It is a hack!
• The current TEI guidelines and data model are
inherently and intentionally semasiological, and
this use of the vocabulary is valid only by chance,
not by intention.
> Thus, using this data model within the TEI brings
none of the advantages that generally come with using the TEI
TBX-TEI Hybrid
Romary (2014):
Attempts to customize the TEI guidelines to incorporate TBX
(ISO 30042) terminological entries in order to provide TEI with an
onomasiological model
https://github.com/laurentromary/TBXinTEI
TBX-TEI Hybrid
<tbx:termEntry xmlns="http://www.tbx.org" xmlns:tbx="http://www.tbx.org" xmlns:tei="http://www.tei-c.org/ns/1.0"><!-- @xml:id;  -->
   <descrip type="concept" target="concept:Wecken"/> <!-- sense not normally included in TBX! -->
   <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
   <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
   <!-- no headword form may occur outside of <langSet> -->
   <langSet xml:id="w834_qdb-d1e602" xml:lang="bar-x-smichael"><!-- language/dialect i) @xml:id;  -->
      <!-- No sense allowed! -->
      <tei:note type="anmerkung" resp="O" corresp="#BD">deren Grundriß ein Oval ist</tei:note>
      <!-- @corresp allowed in TEI <note> but not here -->
      <!-- Most metadata elements are valid using <tei:ref> but are syntactically required to occur before <tig> -->
      <admin type="geo">
         <tei:placeName>St.Michael/B. Bgl.</tei:placeName>
      </admin>
      <tig><!-- <tei:form> would be better -->
         <tei:term type="hauptlemma">Wecken</tei:term>
         <termNote type="transcription">orth</termNote><!-- this is inefficient: need to allow <orth> & <pron> -->
         <termNote type="pos">Subst</termNote><!-- this actually should be applicable to all forms (headword & lemmas) -->
      </tig>
      <tig>
         <tei:term type="lautung" n="1">W.eiggn</tei:term>
         <termNote type="transcription">pron</termNote>
         <termNote type="notation">tustep</termNote><!-- we also need to allow @notation  -->
      </tig>
      <tig><!-- TBX doesn't allow multiple instances of <term> in same <tig> as TEI does with <orth>,<pron> w/in <form> -->
         <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
         <termNote type="transcription" change="1.2">pron</termNote><!-- @change in original not allowed in hybrid schema -->
         <termNote type="notation">ipa</termNote>
      </tig>
   </langSet>
….
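The verbosity noted in the comments above can be made concrete with a small conversion sketch. The Python below is an illustration only: `form_to_tigs` is a hypothetical helper, namespaces are omitted for brevity, and the output imitates the slide's <tig>/<termNote> pattern rather than validating against the hybrid schema.

```python
import xml.etree.ElementTree as ET

# A TEI <form> holding two transcriptions of the same dialect form.
TEI_FORM = """
<form type="lautung" n="1">
  <pron notation="tustep">W.eiggn</pron>
  <pron notation="ipa" resp="#JB">ʋɛiggn̩</pron>
</form>
"""

def form_to_tigs(form_xml):
    """Expand one TEI <form> into one <tig> per <orth>/<pron> child."""
    form = ET.fromstring(form_xml)
    tigs = []
    for child in form:
        tig = ET.Element("tig")
        term = ET.SubElement(tig, "term", type=form.get("type"))
        term.text = child.text
        # The element name (orth/pron) has to be demoted to a termNote value.
        ET.SubElement(tig, "termNote", type="transcription").text = child.tag
        notation = child.get("notation")
        if notation is not None:
            # Likewise, @notation becomes a separate termNote.
            ET.SubElement(tig, "termNote", type="notation").text = notation
        tigs.append(tig)
    return tigs

tigs = form_to_tigs(TEI_FORM)
for tig in tigs:
    print(ET.tostring(tig, encoding="unicode"))
```

One <form> with two <pron> children becomes two <tig> groups of three elements each, which is the overhead the slide's comments point to.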
Problems with TEI-TBX Hybrid model as
per the ODD Schema from Romary (2014)
• <tig> is verbose and would be better replaced with <form>
• the order of occurrence of elements is too restricted
• the TBX-dominated schema lacks many attributes (e.g.
@notation) and elements (e.g. <orth>, <pron>) that are key
to the storage and representation of lexical data as used in TEI
Conclusion
(i) TEI lacks a legitimate means of encoding terminological/
onomasiological entries;
(ii) Given that we need to include sense (or a parallel equivalent) and
the headword at the top of an entry, a TBX-TEI hybrid does not work
either without serious modification via ODD, mostly to introduce
elements and features from TEI, which stretches the traditional usage
of the system;
(iii) TEI needs to re-introduce a means of onomasiological data
representation (such as <termEntry>) but with an expanded set of
elements and attributes based on the degree of expressivity in the
Dictionary module
