Conversion of an existing termbank: EAGLET to MARTIF

  The EAGLET database (see 2.4.4) is available as an ASCII file containing all information on every term in the database.

Table 3.2 shows a possible model of a single, one-line database entry, which also indicates the structure used within the individual data categories. It is printed as is, including all characters that are present in the ASCII source of the EAGLET database; line breaks were added for printing purposes only. A real entry contains no line breaks or carriage returns: every term entry consists of exactly one line, and the line breaks shown here do not represent spaces. The data categories are delimited by the comma character `,'; the escape character is the backslash `\'. 20 separate columns are defined for the database table. Instances of the data categories are represented by expressions referring to the name of the relevant data category, for some categories with the markup that generally accompanies them and for others with additional numbers, at least in those cases where the rule of elementarity of data categories is violated. In these latter cases one or two more elements are included as they occur in the database (except for definitions, where three is currently the maximum number).

 
term,term,
/SAMPA-transcription/,
[POS of the term [POS of individual parts]],
[number:inflectional ending],
domainname,
hyperonym1\,hyperonym2\,hyperonym3\,
hyperonym4\,hyperonym5\,hyperonym6,
hyponym1\,hyponym2\,hyponym3\,hyponym4\,hyponym5\,hyponym6,
synonym1\,synonym2\,synonym3\,synonym4\,synonym5\,synonym6,
antonym1\,antonym2\,antonym3\,antonym4\,antonym5\,antonym6\,
antonym7\,antonym8\,antonym9\,antonym10\,antonym11\,antonym12\,
antonym13\,antonym14\,antonym15\,antonym16,
meronymic-superordinate1\,meronymic-superordinate2\,
meronymic-superordinate3\,meronymic-superordinate4\,
meronymic-superordinate5\,meronymic-superordinate6,
meronymic-subordinate1\,meronymic-subordinate2\,
meronymic-subordinate3\,meronymic-subordinate4\,
meronymic-subordinate5\,meronymic-subordinate6,
definition1(Source1)\,definition2(Source2)\,definition3(Source3),
example,graphic-model,audio-model,formula,
reference-URL1 reference-URL2 reference-URL3
reference-URL4 reference-URL5 reference-URL5
reference-URL6 reference-URL7 reference-URL8
reference-URL9 reference-URL10,
date\,update1\,update2\,update3\,update4\,update5\,update6,
author\,updater1\,updater2\,updater3\,
updater4\,updater5\,updater6
Table 3.2: Sample model of a single database entry 
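
The field layout modelled in Table 3.2 can be illustrated with a short sketch. It is written in Python purely for illustration; the actual conversion script (appendix C) is built from UNIX tools and is not reproduced here. The sketch assumes, as the model suggests, that an unescaped comma ends a data category and that an escaped comma separates multiple values inside one category. The function names and the sample entry are hypothetical.

```python
import re

# A comma that is NOT preceded by a backslash ends a data category.
FIELD_SEP = re.compile(r"(?<!\\),")
# Assumption: an escaped comma separates multiple values inside one category.
VALUE_SEP = "\\,"
EXPECTED_COLUMNS = 20  # number of columns stated for the database table


def split_entry(line):
    """Split one single-line entry into its data categories."""
    fields = FIELD_SEP.split(line.rstrip("\r\n"))
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}")
    return fields


def split_values(field):
    """Split a non-elementary category into its individual values."""
    return [value for value in field.split(VALUE_SEP) if value]


# Hypothetical sample entry (made-up values, not taken from EAGLET).
SAMPLE = (
    r"term,term,/SAMPA/,[N],[pl:s],domain,"
    r"hyper1\,hyper2,hypo1,syn1\,syn2\,syn3,,,,"
    r"definition1(Source1),example,,,,url1 url2,"
    r"date\,update1,author\,updater1"
)

fields = split_entry(SAMPLE)
hyperonyms = split_values(fields[6])  # seventh column in the model
references = fields[17].split()       # URLs are separated by spaces
```

Checking the column count per line is a cheap way to detect malformed entries before any conversion step runs; a literal backslash directly in front of a field-ending comma is not handled by this sketch.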

Converting such a complex database entry is not easy, but it can be managed with the help of UNIX tools. The actual script is given in appendix C. For a UNIX script it is fairly complex: most such scripts consist of only a few command lines and are written quickly by computer experts to automate otherwise tedious, recurring tasks. This script consists of some hundred lines of source code, most of them containing parts of subroutines.

The most important criteria for this program were that it has to work on UNIX machines, be capable of handling large amounts of data and be quick to change. Consequently, programming languages that need compilers, such as C and C++, did not suit the task. UNIX tools can be changed in a few seconds and are easy to debug. Even more importantly, if the database itself changes, for example through the introduction of new or different categories, these scripts can easily be adapted to the new needs.

Data loss is a serious problem in database conversion, as was pointed out previously. To evaluate the efficiency of the script developed for the EAGLET to MARTIF conversion, tables 3.3 to 3.7 show the degree of equivalence between the EAGLET data categories and the MARTIF standard, together with a description of the tool used, the possible loss of information and some comments.

 
EAGLET data category | Individual format | Script for MARTIF conversion | Loss of information | Comment
Orthography | ASCII-format: an orthographic representation of a term | awk script | none | -
Sorter | ASCII-format: the same orthographic representation in only small characters | none | none | -
Pronunciation | SAMPA transcription preceded and ended by slash `/' | awk script | none | -
Part of Speech | abbreviated POS; also including POS analysis of term components (for compounds or multi word terms) | awk script | none | violation of elementarity
Table 3.3: Equivalence of Database formats EAGLET and MARTIF 

 
EAGLET data category | Individual format | Script for MARTIF conversion | Loss of information | Comment
Inflections | inflectional ending for plurals (if any) or absent singular | awk script | none | -
Domain | Name of the domain, subdomain separated by colon | awk script | none | -
Hyperonyms | orthographic representation of hyperonym (sometimes multiple) | awk script with subprocess extracting multiple hyperonyms | none | violated principle of elementarity solved by subprocesses
Hyponyms | orthographic representation of hyponym (sometimes multiple) | awk script with subprocess extracting multiple hyponyms | none | violated principle of elementarity solved by subprocesses
Synonyms | orthographic representation of synonym (sometimes multiple) | awk script with subprocess extracting multiple synonyms | none | violated principle of elementarity solved by subprocesses
Table 3.7: Equivalence of Database formats EAGLET and MARTIF (continued) 

 
EAGLET data category | Individual format | Script for MARTIF conversion | Loss of information | Comment
Antonyms | orthographic representation of antonym (sometimes multiple) | awk script with subprocess extracting multiple antonyms | none | violated principle of elementarity solved by subprocesses
Meronymic superordinate | orthographic representation of meronymic superordinates (sometimes multiple) | awk script with subprocess extracting multiple meronymic superordinates | none | violated principle of elementarity solved by subprocesses
Meronymic subordinate | orthographic representation of meronymic subordinates (sometimes multiple) | awk script with subprocess extracting multiple meronymic subordinates | none | violated principle of elementarity solved by subprocesses
Table: Equivalence of Database formats EAGLET and MARTIF (continued) 

 
EAGLET data category | Individual format | Script for MARTIF conversion | Loss of information | Comment
Definitions | definition, source in parentheses | awk script with subprocess creating a separate source category with a link to the source definition | undefined | multiple definitions with sources are not uniquely structured, which results in manual post-editing of 10 to 20 definitions in EAGLET
Examples | Short description of examples in text form | awk script | none | -
Graphic models | Link to external graphic files | none | none | currently not available for any term
Audio models | Link to external audio files | none | none | currently not available for any term
Table: Equivalence of Database formats EAGLET and MARTIF (continued) 
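
The treatment of definitions can be sketched as follows. This is a Python illustration under stated assumptions, not the awk subprocess of the actual script: it assumes that a well-formed definition ends in exactly one parenthesised source, as in the model `definition1(Source1)', and it flags everything else for manual post-editing. The function name and the sample values are hypothetical.

```python
import re

# Assumed shape: definition text followed by one trailing "(Source)" group.
DEF_WITH_SOURCE = re.compile(r"^(?P<text>.*\S)\s*\((?P<source>[^()]+)\)\s*$")


def split_definitions(field):
    """Return (text, source, needs_post_editing) for each definition."""
    results = []
    for raw in field.split("\\,"):  # escaped comma between definitions
        raw = raw.strip()
        if not raw:
            continue
        match = DEF_WITH_SOURCE.match(raw)
        if match:
            results.append((match.group("text"), match.group("source"), False))
        else:
            # No recognisable trailing source: leave the text untouched
            # and mark the definition for manual post-editing.
            results.append((raw, None, True))
    return results


# Made-up sample field: one well-formed definition, one without a source.
definitions = split_definitions(r"definition1(Source1)\,definition2")
```

Keeping the unparsed text instead of guessing at a source means that no information is discarded; the flagged entries are the ones that need a human decision.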

 
EAGLET data category | Individual format | Script for MARTIF conversion | Loss of information | Comment
Formulas | Link to external graphic files | none | none | currently not available for any term
Reference | List of URLs | none | entirely | not included because the references are outdated
Date | List of dates | awk script with subprocess extracting multiple date records | none | the relation of name and date (which is latent in the source) is not represented
Author | List of abbreviated author names | awk script with subprocess extracting multiple author records | none | the relation of name and date (which is latent in the source) is not represented
Table: Equivalence of Database formats EAGLET and MARTIF (continued) 
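
The latent relation between names and dates mentioned in the comments can be made explicit with a positional pairing. The following Python sketch is an illustration only and is not part of the conversion script; it assumes that the n-th date record belongs to the n-th author record, which the model in Table 3.2 suggests (date with author, update1 with updater1, and so on) but which the source does not state. The function name is hypothetical.

```python
def pair_revisions(date_field, author_field):
    """Pair each date record with the author record at the same position."""
    dates = [d for d in date_field.split("\\,") if d]
    authors = [a for a in author_field.split("\\,") if a]
    if len(dates) != len(authors):
        # The positional assumption cannot hold for this entry.
        raise ValueError("date and author lists differ in length")
    return list(zip(dates, authors))


# Placeholder values taken from the model entry, not real data.
revisions = pair_revisions(r"date\,update1\,update2",
                           r"author\,updater1\,updater2")
```

Entries for which the two lists differ in length raise an error rather than being paired silently, since a wrong pairing would be worse than none.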



Thorsten Trippel
Fri May 21 13:04:11 MET DST 1999