Importing multiline text with the regex text filter
Posted by Péter Botta on 07 April 2014 03:23 PM
Description: Importing multiline text with the regex text filter is not trivial at first glance. This article walks through an example and assumes a basic understanding of the regex text filter concepts in memoQ.
The screenshot below shows the expected result. Green highlighting marks the text to be imported for translation: multiline comments from a sample of program source code. (Ignore for now that the third one is probably not perfect: the word someParam should not be imported as translatable.)
First, to import a piece of multiline text as a unit in the regex text filter, that piece of text must be in a single logical "paragraph". The default paragraph boundary rules won't work: they'll chop up the text at every new line. The horizontal lines in the above screenshot show paragraph boundaries. This was achieved by specifying a custom paragraph start rule:
In the above screenshot, a paragraph start is defined as a new line, followed by whitespace characters, followed by an @ character. This rule was chosen by looking at the sample, where the "paragraphs" we are looking for all seem to start this way. You must come up with different paragraph start/end rules for different files. The point is to "break" paragraphs at the right locations, so that each unit of text ends up in exactly one "paragraph". (Again, please ignore that this rule might not be perfect; it is just for demonstration.) Note that you can leave the paragraph end rule blank.
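The effect of such a paragraph start rule can be tried outside memoQ. The following is a minimal sketch in Python, not memoQ's own implementation: the sample text, the names SAMPLE, PARAGRAPH_START and split_paragraphs, and the exact pattern are assumptions made for illustration, since the article's real sample and rule are only visible in the screenshots.

```python
import re

# Hypothetical sample text; the article's real sample is only shown in a screenshot.
SAMPLE = (
    "\n  @summary Reads the file\n"
    "      and returns its contents.\n"
    "  @remarks Fails if the file\n"
    "      is missing.\n"
    "  @param someParam Path of the file\n"
    "      to read."
)

# Paragraph start as described above: a newline, whitespace, then '@'.
# (Assumed pattern; the rule in the screenshot may be written differently.)
PARAGRAPH_START = re.compile(r"\n\s*@")


def split_paragraphs(text):
    """Cut text so that each match of PARAGRAPH_START begins a new paragraph."""
    starts = [m.start() for m in PARAGRAPH_START.finditer(text)]
    # Anything before the first match forms its own leading paragraph.
    bounds = [0] + [s for s in starts if s > 0] + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


paragraphs = split_paragraphs(SAMPLE)
for number, paragraph in enumerate(paragraphs, start=1):
    print(number, repr(paragraph))
```

With this sample, each multiline comment ends up in one paragraph, continuation lines included, because a continuation line does not have an @ after its leading whitespace.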
Now we need to define what to import from a paragraph. See the screenshot below:
The part in parentheses in the paragraph rule is the text imported as the translatable portion of each paragraph. The rule may not look right at first, because the period normally means "any character except a new line". In paragraph rules, however, memoQ uses a special matching mode, very confusingly called "SingleLine", in which the period also matches new lines. This mode does not apply everywhere: if you try to match multiline text the same way on the Include/Exclude tab, for example, it will not work. There, the same would have to be written as something like (\s|.)*?
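The difference between the two matching modes can be illustrated with Python's re module, used here only as an analogy: its re.DOTALL flag plays the role of what the article calls "SingleLine". This is a sketch under that assumption, and the TEXT value and variable names are made up for the example.

```python
import re

TEXT = "first line\nsecond line"

# Default mode: '.' stops at the line break.
default_dot = re.match(r".*", TEXT).group(0)

# With re.DOTALL, '.' also matches the line break.
dotall_dot = re.match(r".*", TEXT, re.DOTALL).group(0)

# Workaround without the flag: "whitespace or any character".
# The newline is whitespace, so the alternation can cross line breaks.
workaround = re.match(r"(?:\s|.)*", TEXT).group(0)

print(repr(default_dot))
print(repr(dotall_dot))
print(repr(workaround))
```

The greedy form is used here so that the pattern matches something on its own; the lazy form (\s|.)*? only consumes text when another part of the pattern follows it.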
Again, a paragraph starts with a newline character, followed by a word starting with @, then some more whitespace, and then comes the translatable part.
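A paragraph rule of this shape can be sketched in Python as follows. The pattern is an assumption reconstructed from the description above, not a copy of the rule in the screenshot, and PARAGRAPH_RULE, translatable_part and the sample paragraph are hypothetical names and data.

```python
import re

# Assumed paragraph rule: newline, whitespace, an @word, whitespace, then the
# captured translatable text. re.DOTALL lets '.' run across line breaks,
# standing in for the "SingleLine" mode described in the article.
PARAGRAPH_RULE = re.compile(r"\A\n\s*@\w+\s+(.*)", re.DOTALL)


def translatable_part(paragraph):
    """Return the captured text, or None if the paragraph does not match."""
    match = PARAGRAPH_RULE.match(paragraph)
    return match.group(1) if match else None


# Hypothetical paragraph, as produced by a paragraph start rule.
PARAGRAPH = "\n  @param someParam Path of the file\n      to read."
extracted = translatable_part(PARAGRAPH)
print(repr(extracted))
```

In this sketch the captured text still begins with someParam, which is the same imperfection the article points out for the third comment in its sample.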