Jack T. Bowers
Melanie Seltmann
Austrian Academy of Sciences - Austrian Center for Digital Humanities
Exploring data models for heterogeneous
dialect data:
the case of explore.bread.AT!
Outline of Presentation
Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based
markup standards for representing onomasiological
dialectal language
explore.AT!
Overview:
• DBÖ: collection of Bavarian dialectal speech began in 1911
• 2015-2016 converted from TUSTEP to TEI
Goals
• Gain cultural and linguistic insights into Bavarian dialects in former
Austro-Hungarian empire;
• Update and improve the existing body of resources by converting them to
conform with standards and best practices (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of data in order to share with
project partners;
• Integration of semantic web/LOD resources;
Project Overview: Datasets
DBÖ@TEI
WBÖ@TEI
BaseX Database
place inventory (TEI-listPlace)
concept inventory (TEI-feature structures)
gram features inventory (TEI-feature structures)
questionnaires (TEI-list)
DBÖ@ema
SQL
BaseX Database
Extracted Topical Datasets
explore.bread
The language of Color
lexicon(location(a))
inventory(lexicalFeature(a))
Possible basis for examination of sub-datasets:
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features
Visualization
DBÖ Questionnaires
Questionnaires:
While the questionnaires are topical in general, they are a complicated
mixture of semasiological (term-based) and onomasiological
(concept-based) prompts
e.g.
(31B5) bes. Weißgebäcke:
länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!),
Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
Current means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields
The dataset requires significant manual editing and curation due to the nature of
the questionnaires
Desired Enhancements
In most sub-topical studies such as ExploreBread!, the ability to format
data onomasiologically would bring potential benefits,
for example:
• Domain and/or concept-oriented entries better represent the content of
interest
• Information retrieval
• Ontology mapping
• Etymological &/or Morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation
Problem:
> TEI has no explicitly designated means of
encoding onomasiological data!
Enhancing original data
• Adding domain (onomasiological) and ontology-based sense tags
<sense corresp="concept:Weißgebäck">Weißgebäck</sense>
<usg type="dom" corresp="concept:Brot">Brot</usg>
• Normalization of phonetic notation*
<form type="lautung" n="1">
   <pron notation="tustep">&gt;str-uts</pron>
   <pron notation="ipa" resp="#JB" change="01.2">ʒ̊truːts</pron>
</form>
• Adding Morphological/Compositional Analysis*
<form type="hauptlemma">
   <orth>(S:emmel)zipfel</orth>
</form>
<form type="hauptlemma" resp="#MS">
   <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)
      <seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg>
   </orth>
</form>
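The phonetic normalization step can be sketched in Python as follows. This is only a minimal illustration, assuming namespace-free elements and an IPA string that has already been transcribed by hand; the helper name `add_ipa_pron` is hypothetical and no automatic TUSTEP-to-IPA conversion is implied.

```python
import xml.etree.ElementTree as ET


def add_ipa_pron(form, ipa, resp):
    """Append an IPA <pron> after the existing TUSTEP <pron> of a <form>.

    The original TUSTEP transcription is kept untouched; the IPA string is
    supplied by the editor identified in `resp` (hypothetical helper).
    """
    pron = ET.SubElement(form, "pron", {"notation": "ipa", "resp": resp})
    pron.text = ipa
    return pron


# Form taken from the slide; element names are used without a namespace
# here purely to keep the sketch short.
form = ET.fromstring(
    '<form type="lautung" n="1">'
    '<pron notation="tustep">&gt;str-uts</pron>'
    "</form>"
)
add_ipa_pron(form, "ʒ̊truːts", "#JB")

notations = [p.get("notation") for p in form.findall("pron")]
print(notations)
```

Keeping both `<pron>` elements side by side means the source notation stays available for checking the normalized form.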
Lexical Organization
Semasiological Lexical Model:
Form -> meaning(i), meaning(ii), meaning(iii)
Starting point is a word form; the model identifies the
associated meanings and senses
Onomasiological Lexical Model:
Concept -> Form(i), Form(ii), Form(iii)
Starting point is a concept; the model looks at the forms
used to represent it
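The difference between the two models amounts to inverting a mapping, which can be sketched in Python. The function name and the sample data are illustrative assumptions; the forms and concept labels are borrowed from examples elsewhere in these slides.

```python
from collections import defaultdict

# Semasiological view: each form points to the concept(s) it expresses.
# Sample data only, reusing forms and concept labels from the slides.
form_to_concepts = {
    "Wecken": ["concept:Wecken"],
    "Strutzen": ["concept:Wecken"],
    "Brot": ["concept:Brot"],
}


def to_onomasiological(form_to_concepts):
    """Invert form -> concepts into concept -> sorted list of forms."""
    concept_to_forms = defaultdict(list)
    for form, concepts in form_to_concepts.items():
        for concept in concepts:
            concept_to_forms[concept].append(form)
    return {c: sorted(forms) for c, forms in concept_to_forms.items()}


concept_to_forms = to_onomasiological(form_to_concepts)
print(concept_to_forms["concept:Wecken"])
```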
Headword
Lemma(i..n)
BROT
brot broet brɛot
Prôt Prôt Prôt
Core DBÖ entry datatypes
—————————————-
Archive record
Headword (Form)
POS
Dialect lemma (Form)
Gram info
Meaning (Sense)
Usage example
Source
Place
Questionnaire
Etymology
Desired Data Structure
Desired Onomasiological Model for Extracted
Terminological DBÖ Datasets
TermEntry
Concept(a)
DialectEntry(i) DialectEntry(ii) DialectEntry(n)
Options using XML-Based Standards
(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)
(ii) TEI-TBX Hybrid (Romary, 2014)
OR…. use TEI P4
TEI <entryFree> Model
[Diagram: element nesting of the model]
<entryFree @xml:id>
   <sense @corresp>   (concept: meaning)
      <usg @type="dom">   (concept: domain)
      <def @xml:lang>
   <superEntry>   (one per headword)
      <form type="hauptlemma"> with <orth>   (Form: headword)
      <entry @xml:id @xml:lang="bar">   (DBÖ entry: dialect form and metadata;
         all other element content copied from the original without alteration)
Cardinality labels in the diagram: (1…n), (0…n), (1…1)
TEI <entryFree> Model
<entryFree>
   <sense corresp="concept:Wecken"> <!-- concept: meaning -->
      <usg type="dom" corresp="concept:Brot">Brot</usg> <!-- concept: domain -->
      <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
   </sense>
   <superEntry> <!-- one for each unique hauptlemma of the concept entry: Form (headword(i)) -->
      <form type="hauptlemma">
         <orth>Wecken</orth>
      </form>
      <entry xml:id="w834_qdb-d1e602b" xml:lang="bar"> <!-- DBÖ entry (headword(i)): Form (dialect(a)) and metadata -->
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">W.eiggn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
         </form>
         <usg type="geo">
            <placeName>St.Michael/B. Bgl.</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Wecken" (ii..n) -->
   </superEntry>
   <superEntry> <!-- Form (headword(ii)) -->
      <form type="hauptlemma">
         <orth>Strutzen</orth>
      </form>
      <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar"> <!-- DBÖ entry (headword(ii)): Form (dialect(b)) and metadata -->
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">Struzn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
         </form>
         <usg type="geo">
            <placeName>Rohrb. OÖ</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Strutzen" (ii..n) -->
   </superEntry>
</entryFree>
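Regrouping flat DBÖ entries into this structure could look roughly like the Python sketch below. It is an assumption-laden illustration, not the project's actual conversion: the record keys (`hauptlemma`, `id`, `place`) and the function `build_entry_free` are hypothetical, elements are created without the TEI namespace, and only the place is copied into each `<entry>`.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_entry_free(concept, domain, records):
    """Group flat records under one <entryFree>, one <superEntry> per headword."""
    entry_free = ET.Element("entryFree")
    sense = ET.SubElement(entry_free, "sense", {"corresp": f"concept:{concept}"})
    usg = ET.SubElement(sense, "usg", {"type": "dom", "corresp": f"concept:{domain}"})
    usg.text = domain

    super_entries = {}  # hauptlemma -> its <superEntry> element
    for rec in records:
        lemma = rec["hauptlemma"]
        if lemma not in super_entries:
            sup = ET.SubElement(entry_free, "superEntry")
            form = ET.SubElement(sup, "form", {"type": "hauptlemma"})
            ET.SubElement(form, "orth").text = lemma
            super_entries[lemma] = sup
        # The headword now lives on the <superEntry>, not on the <entry>.
        entry = ET.SubElement(
            super_entries[lemma],
            "entry",
            {f"{{{XML_NS}}}id": rec["id"], f"{{{XML_NS}}}lang": "bar"},
        )
        geo = ET.SubElement(entry, "usg", {"type": "geo"})
        ET.SubElement(geo, "placeName").text = rec["place"]
    return entry_free


# Two records taken from the slide example.
records = [
    {"hauptlemma": "Wecken", "id": "w834_qdb-d1e602b", "place": "St.Michael/B. Bgl."},
    {"hauptlemma": "Strutzen", "id": "s806_qdb-d1e43847b", "place": "Rohrb. OÖ"},
]
tree = build_entry_free("Wecken", "Brot", records)
headwords = [o.text for o in tree.findall("superEntry/form/orth")]
print(headwords)
```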
Problems with <entryFree> model
• It is a hack!
• Current TEI guidelines and data model are
inherently and intentionally semasiological, and
this use of the vocabulary is only valid by chance,
not by intention.
> Thus using this data model within the TEI brings
none of the advantages that generally come with its use
TBX-TEI Hybrid
Romary (2014):
Attempts to customize the TEI guidelines to incorporate TBX
(ISO 30042) terminological entries in order to provide TEI with an
onomasiological model
https://github.com/laurentromary/TBXinTEI
TBX-TEI Hybrid
<tbx:termEntry xmlns="http://www.tbx.org" xmlns:tbx="http://www.tbx.org" xmlns:tei="http://www.tei-c.org/ns/1.0"><!-- @xml:id; -->
   <descrip type="concept" target="concept:Wecken"/> <!-- sense not normally included in TBX! -->
   <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
   <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
   <!-- no headword form may occur outside of <langSet> -->
   <langSet xml:id="w834_qdb-d1e602" xml:lang="bar-x-smichael"><!-- language/dialect i) @xml:id; -->
      <!-- No sense allowed! -->
      <tei:note type="anmerkung" resp="O" corresp="#BD">deren Grundriß ein Oval ist</tei:note>
      <!-- @corresp allowed in TEI <note> but not here -->
      <!-- Most metadata elements are valid using <tei:ref> but syntactically required to occur before <tig> -->
      <admin type="geo">
         <tei:placeName>St.Michael/B. Bgl.</tei:placeName>
      </admin>
      <tig><!-- <tei:form> would be better -->
         <tei:term type="hauptlemma">Wecken</tei:term>
         <termNote type="transcription">orth</termNote><!-- this is inefficient: need to allow <orth> & <pron> -->
         <termNote type="pos">Subst</termNote><!-- this actually should be applicable to all forms (headword & lemmas) -->
      </tig>
      <tig>
         <tei:term type="lautung" n="1">W.eiggn</tei:term>
         <termNote type="transcription">pron</termNote>
         <termNote type="notation">tustep</termNote><!-- we also need to allow @notation -->
      </tig>
      <tig><!-- TBX doesn't allow multiple instances of <term> in same <tig> as TEI does with <orth>, <pron> w/in <form> -->
         <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
         <termNote type="transcription" change="1.2">pron</termNote><!-- @change in original not allowed in hybrid schema -->
         <termNote type="notation">ipa</termNote>
      </tig>
   </langSet>
….
Problems with TEI-TBX Hybrid model as
per the ODD Schema from Romary (2014)
• <tig> is verbose and would be better replaced with <form>
• the order of occurrence of elements is too restricted
• the TBX-dominated schema lacks many attributes (e.g.
@notation) and elements (e.g. <orth>, <pron>) that are key
to the storage and representation of lexical data as used in TEI
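The verbosity named in the first bullet can be made concrete with a small Python sketch that expands one TEI-style `<form>` into the `<tig>` elements the hybrid encoding needs. It assumes namespace-free element names and a hypothetical helper `form_to_tigs`; it mirrors the pattern of the slide example rather than the rules of the Romary (2014) schema.

```python
import xml.etree.ElementTree as ET


def form_to_tigs(form):
    """Expand one <form> into one <tig> per <orth>/<pron> child."""
    tigs = []
    for child in form:
        if child.tag not in ("orth", "pron"):
            continue
        tig = ET.Element("tig")
        term = ET.SubElement(tig, "term", {"type": form.get("type", "")})
        term.text = child.text
        # The element name becomes a <termNote>, as in the slide example.
        ET.SubElement(tig, "termNote", {"type": "transcription"}).text = child.tag
        notation = child.get("notation")
        if notation is not None:
            # @notation is not available, so it also becomes a <termNote>.
            ET.SubElement(tig, "termNote", {"type": "notation"}).text = notation
        tigs.append(tig)
    return tigs


form = ET.fromstring(
    '<form type="lautung" n="1">'
    '<pron notation="tustep">W.eiggn</pron>'
    '<pron notation="ipa">ʋɛiggn̩</pron>'
    "</form>"
)
tigs = form_to_tigs(form)
tei_count = sum(1 for _ in form.iter())
tbx_count = sum(1 for tig in tigs for _ in tig.iter())
print(tei_count, tbx_count)
```

Under these assumptions, a form with two transcriptions takes three elements in the TEI style and eight in the `<tig>` style.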
Conclusion
(i) TEI lacks a legitimate means of encoding terminological/
onomasiological entries;
(ii) Given that we need to include the sense (or a parallel equivalent) and
the headword at the top of an entry, a TBX-TEI hybrid does not work
either without serious modification via ODD, mostly to introduce
elements and features from TEI, which stretches the traditional usage
of the system;
(iii) TEI needs to re-introduce a means of onomasiological data
representation (such as <termEntry>), but with an expanded set of
elements and attributes based on the degree of expressivity in the
Dictionary module
