Tagging existing content in a translation memory
Posted by Péter Botta on 11 October 2012 12:52 PM
You may be leveraging older content that comes from other tools, or you may have just started using the built-in custom regex tagger. Either way, you can end up with content that you can tag in your source files while the existing translation memory holds the same content NOT tagged.
- Your original document contains entries like "%1" or '%1' or just %1
- You tagged those entries using the custom regex tagger, showing only simple tags.
BUT… your translation memory still contains the raw "%1" and similar text.
How can you increase your leverage by creating memoQ tags in the TM? How can you boldly tag where no one has tagged before?
You can edit a TM the same way you do for a document in memoQ:
1. Export your TM as TMX. Go to Project home > Translation memories. Select your TM, then click Export to TMX.
2. Create a project or use an existing project with the same language combination as your TM. Import the TMX file. memoQ supports TMX file import in projects, so you can edit the imported TMX file like any other document you imported into a memoQ project.
You can then run the custom regex tagger on it. But you can also freely edit that TM in a regex-enabled text editor like Notepad++ (http://notepad-plus-plus.org/).
If you want to deal with tags outside memoQ, you can do the following:
1 - Finding out how memoQ marks tags in its TMs
• Create a simple document, with one sentence containing all the cases you want to cover, or if you prefer, one sentence per case to cover.
• Import it into memoQ and use the custom regex tagger to mark your tags the way you want.
• Confirm the segment(s) into a freshly created TM, then export that TM as TMX.
• Open the exported TMX in Notepad++ and look for your test segment(s).
• There, you can see how memoQ marks its custom tags:
They are <ph> tags, with displaytext and val as attributes. displaytext is the text displayed in the translation UI and val is the actual text which is tagged (and will be exported).
Of course, as with any XML content (TMX is XML code) inside tags, all quotes, ampersands and many other characters need to be properly escaped. That's why you see &quot; for quotes ("), etc.
<ph><mq:rxt displaytext=" <your_tag_text> " val=" <actual_text_tagged> " /></ph>
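If you would rather inspect the exported TMX with a script than by eye, the following is a minimal Python sketch that lists what each <ph> element contains once the XML escaping is undone. The sample TMX string, its header values and the function name ph_contents are made up for illustration, and the escaped tag layout in the sample is an assumption based on the description above, so check it against your own export.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-segment TMX, standing in for a real memoQ export.
# The escaped mq:rxt layout inside <ph> is an assumption; compare it with your own file.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en-US" srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US">
        <seg>Open <ph>&lt;mq:rxt displaytext=&quot;%1&quot; val=&quot;%1&quot; /&gt;</ph> now</seg>
      </tuv>
    </tu>
  </body>
</tmx>
"""


def ph_contents(tmx_text):
    """Return the text of every <ph> element, with XML entities already decoded."""
    root = ET.fromstring(tmx_text)
    return [ph.text or "" for ph in root.iter("ph")]


for content in ph_contents(SAMPLE_TMX):
    print(content)
```

For a file on disk, ET.parse(path).getroot() can take the place of ET.fromstring. The parser decodes one level of escaping, so what gets printed is the tag as memoQ sees it, not the raw text you would type into a search-and-replace dialog.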
2 - Search text and replace it with "tagged version" using regular expressions
• In our example, we want to replace "%1", '%1' and %1, in that specific order (to avoid confusion)
Here, the last entries replace the "pla\d" and "plq\d" placeholders, which are used to avoid bad replacements of an already replaced %1.
We have to use this trick, as Notepad++ unfortunately does not support lookbehind operators.
• Now that all entries have been replaced, create a new TM and import your TMX file into it. The content will then be properly "tagged" in the matches.
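If you prefer a script to a series of Notepad++ replacements, the same idea can be sketched in Python. A single-pass substitution never rescans text it has just inserted, so the "pla\d" and "plq\d" placeholder trick is not needed here. This is only a sketch under assumptions: the tag layout produced by make_tag (a hypothetical helper) must be adjusted to match what your own exported TMX shows, the display text is assumed to be the bare %1, and tag_segment (also hypothetical) expects the inner XML of one <seg> element as a string.

```python
import re
from xml.sax.saxutils import escape, unescape

QUOT = {'"': "&quot;"}    # escape() leaves double quotes alone unless told otherwise
UNQUOT = {"&quot;": '"'}  # reverse mapping for unescape()

# Existing <ph> elements are matched first so that their contents are skipped.
# Then come "%1" (quotes written raw or as &quot;), '%1', and finally bare %1.
PATTERN = re.compile(
    r"(<ph\b.*?</ph>)"
    r"|(?:\"|&quot;)%\d(?:\"|&quot;)"
    r"|'%\d'"
    r"|%\d",
    re.DOTALL,
)


def make_tag(val):
    """Build a <ph> element for val. ASSUMED layout: verify against your own TMX."""
    display = val.strip("\"'")  # assumption: show only the bare %1 in the UI
    inner = '<mq:rxt displaytext="{}" val="{}" />'.format(
        escape(display, QUOT), escape(val, QUOT)
    )
    # The mq:rxt tag is stored as text inside <ph>, so it is escaped once more.
    return "<ph>" + escape(inner, QUOT) + "</ph>"


def tag_segment(seg_xml):
    """Tag every untagged placeholder in the inner XML of one <seg> element."""
    def repl(match):
        if match.group(1):  # already a <ph> element: leave it untouched
            return match.group(1)
        return make_tag(unescape(match.group(0), UNQUOT))
    return PATTERN.sub(repl, seg_xml)


print(tag_segment("Open %1 now"))
```

Because already tagged <ph> elements are passed through unchanged, running the function twice on the same segment gives the same result as running it once. Test it on a copy of the TMX first, and compare a few tagged segments with one that memoQ itself produced before importing the file into a new TM.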