Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Jack T. Bowers
Melanie Seltmann
Austrian Academy of Sciences - Austrian Center for Digital Humanities

Outline of Presentation
Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based markup standards for representing onomasiological dialectal language

explore.AT!
Overview:
• DBÖ: collection of Bavarian dialectal speech began in 1911
• Converted from TUSTEP to TEI in 2015-2016
Goals
• Gain cultural and linguistic insights into Bavarian dialects in the former Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting it to conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of the data in order to share it with project partners;
• Integrate semantic web/LOD resources;

Project Overview: Datasets
DBÖ@TEI
WBÖ@TEI
BaseX Database
place inventory (TEI-listPlace)
concept inventory (TEI-feature structures)
gram features inventory (TEI-feature structures)
questionnaires (TEI-list)
DBÖ@ema
SQL
BaseX Database
Extracted Topical Datasets
explore.bread
The language of Color
lexicon(location(a))
inventory(lexicalFeature(a))
Possible basis for examination of sub-datasets:
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features

DBÖ Questionnaires
Questionnaires:
While the questionnaires are topical in general, their items are a complicated mixture of semasiological (term-based) and onomasiological (concept-based) prompts,
e.g.
(31B5) bes. Weißgebäcke:
länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!),
Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
Current means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields
The dataset requires significant manual editing and curation due to the nature of the questionnaires

Desired Enhancements
In most sub-topical studies such as ExploreBread! there would be potential benefits to being able to format data onomasiologically, for example:
• Domain and/or concept-oriented entries better represent the content of interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation
Problem:
> TEI has no explicitly designated means of encoding onomasiological data!

Enhancing original data
• Adding domain (onomasiological) and ontology-based sense tags
<sense corresp="concept:Weißgebäck">Weißgebäck</sense>
<usg type="dom" corresp="concept:Brot">Brot</usg>
• Normalization of phonetic notation*
<form type="lautung" n="1">
   <pron notation="tustep">>str-uts</pron>
   <pron notation="ipa" resp="#JB" change="01.2">ʒ̊truːts</pron>
</form>
• Adding Morphological/Compositional Analysis*
<form type="hauptlemma">
   <orth>(S:emmel)zipfel</orth>
</form>
<form type="hauptlemma" resp="#MS">
   <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)
<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg>
</orth>
</form>

Lexical Organization
Semasiological:
Starting point is a word form; identifies the associated meanings and senses
Semasiological Lexical Model
Form -> meaning(i), meaning(ii), meaning(iii)
Onomasiological:
Starting point is a concept; looks at the forms used to represent it
Onomasiological Lexical Model
Concept -> Form(i), Form(ii), Form(iii)
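The relation between the two models can be sketched as a simple inversion, assuming a semasiological lexicon is held as a plain mapping from form to concept labels. The function name `invert_lexicon` and the dictionary layout are illustrative assumptions, not part of the DBÖ tooling; the forms and concept labels are the ones used on the slides.

```python
from collections import defaultdict

# Semasiological view: each form points to the concept(s) it can express.
# Hypothetical in-memory layout; forms and labels follow the slide examples.
semasiological = {
    "Wecken": ["concept:Wecken"],
    "Strutzen": ["concept:Wecken"],
    "Brot": ["concept:Brot"],
}

def invert_lexicon(form_to_concepts):
    """Return the onomasiological view: concept -> sorted list of forms."""
    concept_to_forms = defaultdict(set)
    for form, concepts in form_to_concepts.items():
        for concept in concepts:
            concept_to_forms[concept].add(form)
    return {concept: sorted(forms) for concept, forms in concept_to_forms.items()}

onomasiological = invert_lexicon(semasiological)
```

Here `onomasiological["concept:Wecken"]` collects both headwords that name the oblong loaf, which is the grouping an onomasiological entry has to express.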

Headword
Lemma(i..n)
BROT
brot broet brɛot
Prôt Prôt Prôt
Core DBÖ entry datatypes:
Archive record
Headword (Form)
POS
Dialect lemma (Form)
Gram info
Meaning (Sense)
Usage example
Source
Place
Questionnaire
Etymology
Desired Data Structure
Desired Onomasiological Model for Extracted
Terminological DBÖ Datasets
TermEntry
Concept(a)
DialectEntry(i) DialectEntry(ii) DialectEntry(n)

Options using XML-Based Standards
(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)
(ii) TEI-TBX Hybrid (Romary, 2014)
OR…. use TEI P4

TEI <entryFree> Model
[Diagram: structure of the <entryFree> model. An <entryFree @xml:id> holds a <sense @corresp/> (concept: meaning) with <usg @type="dom"> (concept: domain) and <def @xml:lang>, followed by <superEntry> groups. Each <superEntry> carries the headword as <form type="hauptlemma"><orth> (Form (headword(i)), Form (headword(ii)), ...) and the matching <entry @xml:id @xml:lang="bar"> records (Form (dialect(a)), Form (dialect(b)); Metadata: DBÖ entry). All other element content from the original is copied without alteration. Cardinality labels shown in the diagram: (1…1), (1…n), (0…n).]

TEI <entryFree> Model
<entryFree>
   <!-- concept: meaning -->
   <sense corresp="concept:Wecken">
      <!-- concept: domain -->
      <usg type="dom" corresp="concept:Brot">Brot</usg>
      <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
   </sense>
   <superEntry> <!-- one for each unique hauptlemma of the concept entry -->
      <!-- Form (headword(i)) -->
      <form type="hauptlemma">
         <orth>Wecken</orth>
      </form>
      <!-- Form (dialect(a)); Metadata: DBÖ entry (headword(i)) -->
      <entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
         <form type="lautung" n="1">
            <pron notation="tustep">W.eiggn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
         </form>
         <usg type="geo">
            <placeName>St.Michael/B. Bgl.</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Wecken" (ii..n) -->
   </superEntry>
   <superEntry>
      <!-- Form (headword(ii)) -->
      <form type="hauptlemma">
         <orth>Strutzen</orth>
      </form>
      <!-- Form (dialect(b)); Metadata: DBÖ entry (headword(ii)) -->
      <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
         <form type="lautung" n="1">
            <pron notation="tustep">Struzn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
         </form>
         <usg type="geo">
            <placeName>Rohrb. OÖ</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Strutzen" (ii..n) -->
   </superEntry>
</entryFree>
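A minimal sketch of how such a structure could be assembled from flat records, assuming each DBÖ entry has already been tagged with a concept and a headword. The `records` tuples and the function `build_entry_free` are hypothetical helpers for illustration; only the element names, attribute values and entry IDs come from the slide, and the copied entry content is left out.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical flat records: (concept, headword, DBÖ entry id).
records = [
    ("concept:Wecken", "Wecken", "w834_qdb-d1e602b"),
    ("concept:Wecken", "Strutzen", "s806_qdb-d1e43847b"),
]

def build_entry_free(concept, domain, rows):
    """Build one <entryFree> for a concept, with one <superEntry> per headword."""
    entry_free = ET.Element("entryFree")
    sense = ET.SubElement(entry_free, "sense", {"corresp": concept})
    usg = ET.SubElement(sense, "usg", {"type": "dom", "corresp": domain})
    usg.text = domain.split(":", 1)[1]

    # Group entry ids by headword, keeping first-seen order.
    by_headword = {}
    for row_concept, headword, entry_id in rows:
        if row_concept != concept:
            continue
        by_headword.setdefault(headword, []).append(entry_id)

    for headword, entry_ids in by_headword.items():
        super_entry = ET.SubElement(entry_free, "superEntry")
        form = ET.SubElement(super_entry, "form", {"type": "hauptlemma"})
        ET.SubElement(form, "orth").text = headword
        for entry_id in entry_ids:
            # The original entry content would be copied in here unaltered.
            ET.SubElement(super_entry, "entry", {
                f"{{{XML_NS}}}id": entry_id,
                f"{{{XML_NS}}}lang": "bar",
            })
    return entry_free

tree = build_entry_free("concept:Wecken", "concept:Brot", records)
xml_text = ET.tostring(tree, encoding="unicode")
```

Grouping by headword before emitting `<superEntry>` mirrors the slide's layout: one concept entry, one group per unique hauptlemma, and the original dialect entries underneath.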

Problems with <entryFree> model
• It is a hack!
• The current TEI guidelines and data model are inherently and intentionally semasiological, and this use of the vocabulary is valid only by chance, not by intention.
> Thus, using this data model within the TEI brings none of the advantages that generally come with using the TEI

TBX-TEI Hybrid
Romary (2014):
Attempts to customize the TEI guidelines to incorporate TBX (ISO 30042) terminological entries in order to provide TEI with an onomasiological model
https://github.com/laurentromary/TBXinTEI

TBX-TEI Hybrid
<tbx:termEntry xmlns="http://www.tbx.org">
   <descrip type="concept" target="concept:Wecken"/>
   <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
   <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
   <langSet>
      <tig>
         <tei:term type="hauptlemma">Wecken</tei:term>
         <termNote type="transcription">orth</termNote>
      </tig>
      <tig>
         <tei:term type="lautung" n="1">W.eiggn</tei:term>
         <termNote type="transcription">pron</termNote>
         <termNote type="notation">tustep</termNote>
      </tig>
      <tig>
         <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
         <termNote type="transcription" change="1.2">pron</termNote>
         <termNote type="notation">ipa</termNote>
      </tig>
   </langSet>
….
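The mapping from a TEI `<pron>` to the hybrid encoding above can be sketched as follows. The function `pron_to_tig` is a hypothetical helper, and namespaces are omitted for brevity; it is meant only to show how one element with a `notation` attribute turns into a `<tig>` with a term and two `<termNote>` children.

```python
import xml.etree.ElementTree as ET

def pron_to_tig(pron):
    """Convert a TEI <pron> into a TBX-style <tig>, following the slide's encoding.

    Namespaces are left out; element and attribute names follow the example.
    """
    tig = ET.Element("tig")
    term = ET.SubElement(tig, "term", {"type": "lautung"})
    term.text = pron.text
    # The element name and the @notation value both become <termNote> children.
    ET.SubElement(tig, "termNote", {"type": "transcription"}).text = "pron"
    ET.SubElement(tig, "termNote", {"type": "notation"}).text = pron.get("notation")
    return tig

pron = ET.fromstring('<pron notation="tustep">W.eiggn</pron>')
tig = pron_to_tig(pron)
```

A single `<pron>` with one attribute becomes three child elements inside `<tig>`, since the hybrid schema expresses the notation as a note rather than as an attribute.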

Problems with TEI-TBX Hybrid model as
per the ODD Schema from Romary (2014)
• <tig> is verbose and would be better replaced with <form>
• the order of occurrence of elements is too restricted
• the TBX-dominated schema lacks too many attributes (e.g. @notation) and elements (e.g. <orth>, <pron>) that are key to the storage and representation of lexical data as used in TEI

Conclusion
(i) TEI lacks a legitimate means of encoding terminological/onomasiological entries;
(ii) Given that we need to include the sense (or a parallel equivalent) and the headword at the top of an entry, a TBX-TEI hybrid does not work either without serious modification via ODD, mostly to introduce elements and features from TEI, and without stretching the traditional usage of the system;
(iii) TEI needs to re-introduce a means of onomasiological data representation (such as <termEntry>), but with an expanded set of elements and attributes matching the degree of expressivity of the Dictionary module
