TAUS USER CONFERENCE 2009, Normalization of translation memories

USER CONFERENCE 2009 BEFORE MT

Normalization of translation memories/training data for MT. Moderator: Karen R. Combe, PTC. Panel: Ryan Martin, Intel; Chris Wendt, Microsoft; William Wong, Language Weaver; Olga Beregovaya, ProMT.

Agenda: TM data for MT training. Examples (clean or normalize?): PTC and Intel. Solutions/suggestions from technology providers: Microsoft, ProMT.

Issue: Excessive number of internal tags French: Pour effectuer la plupart de ces tâches, vous pouvez utiliser {1}{2} Fichier (File) {3}{4} Traitement des instances (Instance Operations) {5}{6} Actualiser l'index (Update Index) {7}{8} ou {9}{10} Fichier (File) {11}{12} Traitement des instances (Instance Operations) {13}{14} Options d'accélérateur (Accelerator Options) {15}{16} afin d'ouvrir la boîte de dialogue {17} Accélérateur d'instances (Instance Accelerator) {18} English: You can use {1}{2} File {3}{4} Instance Operations {5}{6} Update Index {7}{8}{9}{10} File {11}{12} Instance Operations {13} {14} Accelerator Options {15}{16} (which opens the {17} Instance Accelerator {18} dialog box) to perform most instance operations.

Issue: Irrelevant data English: 0.31% French: 0,31 % English: &amp;asm.mbr.name==part* French: &asm.mbr.name==pièce* English: (Windows NT/95/98/2000)D:\partlib\{1}\objects French: (Windows NT/95/98/2000)D:\partlib\{1}\objects

Issue: homonyms English: This figure shows that after midsurface compression, the resulting model develops a gap between the collet and the bracket . French: Cette figure montre qu'après la compression en feuillet moyen, le modèle obtenu crée un jeu entre le collet et le gousset . English: All data in brackets [] are optional. French: Toutes les données entre crochets [] sont facultatives. Bracket #1 (gousset): An overhanging member that projects from a structure (as a wall) and is usually designed to support a vertical load or to strengthen an angle. bracket #2 (crochet): The bracket character, such as [ or (.

Issue: Acronyms spelled out in the target English: You cannot propagate SDTAE s and DTAE s in a DTAF . French: Vous ne pouvez propager ni des éléments d'annotation d'étiquette de référence ni des éléments d'annotation de référence de positionnement à l'intérieur d'une FARP.

Issue: Mismatching number of sentences English: You can have multiple entries for the same pipe size in the bend file, that is, a single pipe size can have multiple bend radius values associated with it, as shown in the following example of a bend file. French: Vous pouvez avoir plusieurs entrées pour la même taille de tuyau dans le fichier de pliage. En d'autres termes, une même taille de tuyau peut être associée à plusieurs valeurs de rayon de pliage, comme dans le fichier de pliage d'exemple suivant.

Issue: Inconsistent double quote usage French: Ainsi, si vous créez une pièce portant le nom " bracket " , elle est tout d'abord enregistrée dans le fichier {1}. English: For example, if you create a part with the name bracket, it initially saves to the file name {1}.

Issue: Entity mismatch English: One way is to create a &quot; flexible model. French: Une méthode consiste à créer un modèle souple.

Issue: Punctuation mismatch (parentheses vs. dash) English: {1}Copy as Skeleton{2} ( the option cannot be changed ) to create a skeleton model. French: Cliquez sur {1}Copier en tant que squelette (Copy as Skeleton){2} - option non modifiable - pour créer un modèle squelette.

Issue: Punctuation mismatch (dash vs. colon) English: {1}Additional Rotation{2} — Enter a real-number value for the number of degrees to rotate the spring's Y axis. French: {1}Rotation supplémentaire (Additional Rotation){2} : entrez un nombre réel pour indiquer le nombre de degrés de rotation de l'axe Y du ressort.

Issue: Capitalization mismatch English: Piping Master Catalog Directory File French: Fichier répertoire du catalogue principal de tuyauterie

Issue: English UI strings in the translation English: Click View > Color and Appearance to create or modify colors. French: Cliquez sur Affichage ( View ) > Couleur et apparence ( Color and Appearance ) pour créer ou modifier les couleurs.

Issue: Fix common entity issues English: System without Intel &#xAE; vPro technology Portuguese: Sistema sem a tecnologia Intel &#xAE; vPro Corrected: English: System without Intel ® vPro technology Portuguese: Sistema sem a tecnologia Intel ® vPro
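The entity fix on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script Intel used: the function names are hypothetical, and the sketch relies on the standard library's `html.unescape` to resolve character references, then re-escapes the XML-reserved characters so the segment can be written back into a TMX file.

```python
import html
from xml.sax.saxutils import escape


def unescape_entities(segment: str) -> str:
    """Replace character references such as &#xAE; with the literal character."""
    return html.unescape(segment)


def to_xml_safe(segment: str) -> str:
    """Re-escape only the characters XML reserves (&, <, >) before writing back."""
    return escape(segment)


if __name__ == "__main__":
    source = "System without Intel &#xAE; vPro technology"
    print(unescape_entities(source))
```

Unescaping first and re-escaping afterwards keeps the stored segment XML-safe while leaving symbols such as ® as ordinary characters.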

Issue: Remove internal markup <tuv xml:lang="ZH-CN"> <seg> <bpt i="1"><span style='font-size:10.0pt; font-family:Verdana'></bpt> 在默认情况下，节点 <bpt i="2" type="bold"><b></bpt> 应用程序 <ept i="2"></b></ept> 之下没有任何应用程序，如下图所示。 <ept i="1"></span></ept> </seg></tuv></tu> Corrected: <tuv xml:lang="ZH-CN"> <seg> 在默认情况下，节点应用程序之下没有任何应用程序，如下图所示。 </seg> </tuv> </tu>
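One possible way to strip inline markup from a TMX segment is sketched below. It assumes a well-formed segment in which the native codes inside `bpt`/`ept` are XML-escaped (the slide shows them unescaped), and the helper name `plain_text` is hypothetical. The element names `bpt`, `ept`, `ph`, `it` and `ut` are the TMX inline code containers; anything else, such as `hi`, is treated as translatable text.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not text.
CODE_TAGS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(elem: ET.Element) -> str:
    """Return the translatable text of a <seg>, dropping inline code elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in CODE_TAGS:
            # Elements such as <hi> wrap real text, so recurse into them.
            parts.append(plain_text(child))
        # Text that follows the child belongs to the segment either way.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    seg = ET.fromstring(
        '<seg><bpt i="1">&lt;b&gt;</bpt>bold<ept i="1">&lt;/b&gt;</ept> text</seg>'
    )
    print(plain_text(seg))
```

The sketch does not normalize whitespace left behind around removed codes; for languages written without spaces, such as the Chinese example, a further trimming step would be needed.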

Issue: Empty field <tu creationdate="20040727T134835Z" creationid="RICHARD"> <prop type="x-error">empty field</prop> <tuv xml:lang="EN-US"> <seg>Must re-verify if changing motherboard brands</seg> </tuv> <tuv xml:lang="ZH-CN"> <seg></seg> </tuv> </tu>
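A check for empty fields could look like the following sketch. It is an assumption about how such a check might be written, not the tool that produced the `x-error` property above; `empty_languages` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def empty_languages(tu: ET.Element) -> list:
    """Return the language codes of all <tuv> elements with an empty <seg>."""
    empty = []
    for tuv in tu.iter("tuv"):
        seg = tuv.find("seg")
        text = "".join(seg.itertext()) if seg is not None else ""
        if not text.strip():
            empty.append(tuv.get(XML_LANG, "?"))
    return empty


if __name__ == "__main__":
    tu = ET.fromstring(
        '<tu><tuv xml:lang="EN-US"><seg>Must re-verify if changing '
        'motherboard brands</seg></tuv>'
        '<tuv xml:lang="ZH-CN"><seg></seg></tuv></tu>'
    )
    print(empty_languages(tu))
```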

Issue: Suspect character English: â€œxxxâ€or â€œyyyâ€ for a description of each. Turkish: Her seÃ§eneÄŸe iliÅŸkin aÃ§Ä±klamayÄ± gÃ¶rmek iÃ§in bkz. â€œxxxâ€ veya â€œyyyâ€.

Issue: Suspect character <tuv xml:lang="EN-US"> <seg>Yes ? </seg> </tuv> <tuv xml:lang="ZH-CN"> <seg> 是 ¹ </seg> </tuv>

Issue: Escape character in translation English: Mode: Turkish: Geli\u351 şmi\u351 ş Mod:

Issue: Trivial segment; missing sentence features <tuv xml:lang="EN-US"> <seg>put_brand_logo</seg> </tuv> <tuv xml:lang="DE-DE"> <seg>put_brand_logo</seg> </tuv>

Issue: Incomplete translation, missing punctuation English: performance, redefine efficiency. German: Leistung neu entdeckt, Effizienz neu definiert

Data Issues: Control characters that break well-formedness (here, an invisible form feed character between “CONDITIONS” and “How”) may cause hard-to-detect problems in translation because they are often invisible: <TrU> <CrD>08052004, 05:54:40 <CrU>REYNA <Seg L=EN-US>END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs <Seg L=ES-EM>FIN DE LOS T É RMINOS Y CONDICIONES Aplicaci ó n de estos t é rminos en sus programas nuevos </TrU>
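Because such characters are invisible, a scripted check helps. The sketch below flags C0 control characters that XML 1.0 does not allow (everything below U+0020 except tab, line feed and carriage return). The function names and the choice of a space as the replacement are assumptions made for illustration.

```python
import re

# C0 controls other than tab (U+0009), LF (U+000A) and CR (U+000D).
_ILLEGAL_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_control_chars(text: str) -> list:
    """Return (offset, code point) pairs for each disallowed control character."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in _ILLEGAL_CONTROL.finditer(text)
    ]


def replace_control_chars(text: str, replacement: str = " ") -> str:
    """Replace each disallowed control character (a space by default)."""
    return _ILLEGAL_CONTROL.sub(replacement, text)


if __name__ == "__main__":
    segment = "END OF TERMS AND CONDITIONS\x0cHow to Apply These Terms"
    print(find_control_chars(segment))
    print(replace_control_chars(segment))
```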

Data Issues: Non-standard entities need to be converted. XML predefines only five entities (&amp; &lt; &gt; &quot; &apos;); any others, e.g. &reg; &trade; &nbsp; &copy; &lang; &rang;, must be converted explicitly: <TrU> <CrD>22112002, 14:48:14 <CrU>JANICE <Seg L=EN-US>Intel® Xeon™ processors - System Boots With No Problem, But Crashes Or Freezes After Several Minutes Of Operation Or The System Is Unstable <Seg L=ES-EM>Procesadores Intel® Xeon™ : El sistema arranca sin problemas, pero deja de funcionar o se bloquea despu é s de varios minutos de funcionamiento o el sistema es inestable. </TrU> Keyword lists that are not true translations: <TrU> <CrD>13112003, 13:28:28 <CrU>REYNA <ChD>14112003, 10:32:01 <ChU>DJN <Seg L=EN-US>Generic,cache,chip,core,cpu,ghz,giga,l2,mega,mhz,package,pentium,processor, processors,specifications,specs,speed,voltage <Seg L=ES-EM>Generic, cache, chip, core, cpu, ghz, giga, l2, mega, mhz, package, pentium, processor, processors, specifications, specs, speed, voltage, gen é rico, cach é , chip, n ú cleo, cpu, ghz, giga, l2, mega, mhz, paquete, pentium, procesador, procesadores, especificaciones, velocidad, voltaje </TrU>

Data Issues: Mapped junk characters to proper characters. C1 control code points were mapped to sensible equivalents: U+0092 → '; U+0096 and U+0097 → -; U+008C → Œ; U+0099 → ™. Junk characters caused by converting UTF-8 text (misidentified as ISO-8859-1) to UTF-8 a second time were mapped back to their true characters, for example Ã© → é, Ã¡ → á, Ã» → û, Ã§ → ç, Ã® → î.
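Both mappings can be sketched in Python. The table below copies the code points listed on the slide; the function names are hypothetical, and the round trip through ISO-8859-1 is one common way to undo double encoding, offered as an assumption rather than the presenters' actual script.

```python
# Mapping taken from the slide: C1 control code points to sensible equivalents.
C1_MAP = str.maketrans({
    "\u0092": "'",
    "\u0096": "-",
    "\u0097": "-",
    "\u008c": "\u0152",  # Œ
    "\u0099": "\u2122",  # ™
})


def map_c1_controls(text: str) -> str:
    """Replace the stray C1 control characters listed in C1_MAP."""
    return text.translate(C1_MAP)


def fix_double_encoding(text: str) -> str:
    """Undo UTF-8 that was misread as ISO-8859-1 and encoded to UTF-8 again.

    If the text does not round-trip cleanly it is returned unchanged,
    so correctly encoded strings are left alone.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text


if __name__ == "__main__":
    print(fix_double_encoding("\u00c3\u00a9"))  # the two-character junk for é
    print(map_c1_controls("don\u0092t"))
```

Returning the input unchanged on failure is a deliberately conservative choice: a string that is already correct usually cannot be re-decoded as UTF-8, so it passes through untouched.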

Data Issues Code in TM data r1 = CLSIDFromProgID(L"OPC.SimaticNET", &clsid); if (r1 != S_OK) { MessageBox("Retrival of CLSID failed", "Error CLSIDFromProgID()", MB_OK+MB_ICONERROR); CoUninitialize(); SendMessage(WM_CLOSE); return; } //******************************************************************FUNCTION_BLOCK FB 23 XXXXXXXXEAX= XXXXXXXX EBX= XXXXXXXX ECX= XXXXXXXX EDX= XXXXXXXX EDI= XXXXXXXX ESI= XXXXXXXX FLAGS= XXXXXXXX DS= XXXX ES= XXXX SS= XXXX ESP=XXXXXXXX EBP= XXXXXXXX FS= XXXX GS= XXXX

Data Issues Inconsistent translations (100 to 120, 220 to 264) (100 -120 , 220 и 264 ) (0.197 to 0.236) (0,197 -0,236 ) (0.276 to 0.335) (0,276 -0,335 ) (176 and 212°F) (176 -212°F)

Data Issues Double escaped tags <column name="quelltext">Der Menüpunkt &lt;<010035762_GE_1048565/>&gt; wird in der vorliegenden Fehlersuchanleitung nicht genauer beschrieben.</column> <column name="zieltext">The menu item &lt;<010035762_GE_1048565/>&gt; is not described in greater detail in these trouble-shooting instructions.</column> <column name="quelltext">Die Komponente &lt;<010035731_TERM_1048521/>&gt; schrittweise über den Taster betätigen.</column> <column name="zieltext">Gradually actuate the component &lt;<010035731_TERM_1048521/>&gt; by way of the button.</column>

Training Data example Comparable data – not parallel data:

Standard TM verification/normalization process: During TM verification, the following are addressed through automatic steps: irregular characters get flagged and replaced; incomplete sentences get flagged; punctuation suspects get flagged; UI strings and other irregular sentences get added to phrase tables.

PROMT handling of internal tags – not excessive but useful
Original source segment in file: Check <codeph class="+ topic/ph pr-d/codeph">NativeApplication.supportsSystemTrayIcon</codeph> to determine whether system tray icons are supported on the current system.
Converted to GMS segment format (after GMS-native segmentation): Check {1}NativeApplication.supportsSystemTrayIcon{2} to determine whether system tray icons are supported on the current system.
Pre-processed string in XLIFF segment format, as sent to PROMT: Check <ph i=1 x=”<codeph class="+ topic/ph pr-d/codeph" >”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph>>{2}</ph> to determine whether system tray icons are supported on the current system.
Format of the translated XLIFF segment returned by PROMT to Idiom: Проверить <ph i=1 x=”<codeph class="+ topic/ph pr-d/codeph" >”>{1}</ph> NativeApplication.supportsSystemTrayIcon<ph i=2 &lt/codeph>>{2}</ph> для определения системном трее иконки поддерживает нынешнюю систему.

GMS Integration with XLIFF Connector – Why is metadata so Important?

PROMT handling of irrelevant data: Scenario 1: we leave the irrelevant data untouched and let it propagate from the TM or be handled through special formatting rules. Scenario 2: we clean it up and add it to the phrase table. Our system will perform well in either scenario, and the course of action needs to be the client's call.

PROMT handling of homonyms: The PROMT system is specially tailored to handle one-to-many translations and homonymy. The PROMT approach is to create context-based dictionary entries, whether single words or multi-word expressions (MWE), which allows the system to properly identify the correct translation for ambiguous entries. PROMT also uses XML metadata when assigning a semantic class to an entry.

PROMT handling of expanding acronyms: The PROMT system handles expansion of acronyms, or different acronyms between languages, by creating explicit mappings. This is a rather standard task in PROMT engine customization, along with DoNotTranslate and variable lists. Should an abbreviation or its expanded version change, this can be fixed through the client interface in a matter of seconds.

PROMT handling of locale-specific punctuation: Quotation mark usage for a specific small group of terms can be defined at the dictionary level. If the use of quotation marks or other punctuation is universal for a specific locale, it is defined at the linguistic-rules level.

PROMT handling of entity and capitalization mismatch: The differences in locale settings for entities and capitalization rules are pre-built in the baseline engine and are regulated through regional settings in the product interface. All additional differences between locales are learned from the TM during the engine customization phase and then added to the client profile template.

PROMT suggestion for UI string handling: All UI strings are automatically added to DoNotTranslate lists when they appear in the appropriate context. The context can be detected semantically, through formatting and punctuation.

Intuitive contextual identification Any word that occurs as part of a context such as “show” in “show command,” remains in English per the UI, whereas the word command gets translated. In other contexts, both words, show and command, are translated as regular words.

PROMT approach to entities: We can address entity issues through a set of special rules; typically, however, these issues are addressed at the GMS level. If TM cleanup is part of the specific project scope, we take this task on and address all similar issues through automated scripts.

PROMT handling of internal markup: This step is not necessary for the PROMT translation process. Scenario 1: the markup is handled by PROMT's extensive TMX Level 2 TM metadata support. Scenario 2: if we need to create phrase table entries from these strings, we normalize them, but the markup is still preserved in the translation process.

PROMT handling of empty fields: Scenario 1: during TM verification, an automatic script issues a warning message and the empty unit is not propagated. Scenario 2: we can also send the empty segment to the customized engine and obtain a translation, which is propagated into the TM for further verification. This is how PROMT pre-project dictionaries are created.

In general: Liberal at throwing away training data. Automatic filters only; no human cleaning. If it is likely that there is a clean variant of almost the same sentence in the data, there is no harm in throwing it away. In-domain diversity is a plus. Example: several versions of the same product have close to no effect.

Automatic training data filtering and conversion. Remove: low text ratio (characters vs. markup, punctuation); high length delta; less than minimal length; unexpected language. Convert: character encoding; named entities. Escape: factoids (numbers, dates, URLs, email, etc.). Example: “5/16/2009” | “June 28, 1998” → <factoid_date>
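The removal filters and the date factoid can be illustrated with a short sketch. The thresholds (`min_chars`, `min_letter_ratio`, `max_length_ratio`), the function names and the date patterns are all assumptions chosen for illustration; the slide gives no concrete values, and the language check is left out because it would need a language identifier.

```python
import re

_MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Two illustrative date shapes: 5/16/2009 and June 28, 1998.
_DATE = re.compile(
    rf"\b(?:\d{{1,2}}/\d{{1,2}}/\d{{4}}|(?:{_MONTHS}) \d{{1,2}}, \d{{4}})\b"
)


def escape_dates(text: str) -> str:
    """Replace recognized dates with a single factoid placeholder."""
    return _DATE.sub("<factoid_date>", text)


def letter_ratio(text: str) -> float:
    """Share of characters that are letters (a rough 'text ratio')."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def keep_pair(src: str, tgt: str, min_chars: int = 3,
              min_letter_ratio: float = 0.5,
              max_length_ratio: float = 3.0) -> bool:
    """Return True if a segment pair passes all three illustrative filters."""
    if len(src.strip()) < min_chars or len(tgt.strip()) < min_chars:
        return False  # less than minimal length (also catches empty fields)
    if min(letter_ratio(src), letter_ratio(tgt)) < min_letter_ratio:
        return False  # low text ratio: mostly digits, markup or punctuation
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_length_ratio  # high length delta


if __name__ == "__main__":
    print(escape_dates("Released 5/16/2009 and June 28, 1998."))
    print(keep_pair("0.31%", "0,31 %"))
```

With these assumed thresholds, the "irrelevant data" pair `0.31%` / `0,31 %` fails the text ratio filter and a pair with an empty target fails the length filter.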

Cleaning issues 1/3
Issue | Training action | Runtime action
Excessive # of internal tags | Remove segment | Preserve and ignore
Irrelevant data | Fails in ratio filter | Apply factoids
Homonyms | n/a | Target language model
Acronyms spelled out | May be caught in word alignment | Project dictionary
# sentences mismatch | Sentence break, align, discard | n/a
Inconsistent quote usage | n/a | Not handled
Entities | Unescape | Unescape-reescape → XML-safe
Punctuation mismatch | n/a | Needs special code (e.g. French “ :”)

Cleaning issues 2/3
Issue | Training action | Runtime action
Capitalization mismatch | Ignored | Apply language logic and target language model
English UI strings | Factoid | Preprocess and escape
Internal markup | Escape to single tag | Pass through
Empty field | Fails size delta filter | Pass through
Suspect character | May fail language check | Pass through
HTML escapes | Unescape | Unescape – reescape for XML
Trivial segment | Fails length or ratio filter | Pass through
Missing punctuation | None | Apply language-appropriate punctuation

Cleaning issues 3/3
Issue | Training action | Runtime action
Newline in string | New sentence | New sentence
Program code | Avoided or learned | Needs markup
Comparable data | Currently fails length filter; research item | Handled like “parallel”
