Setting up and performing a term extraction in memoQ
Posted by Péter Botta on 08 February 2013 10:11 AM

Description: Before you start a project, you may want to harvest terminology from its documents. Translating the terminology first helps keep the translation consistent when several translators later work on the project. Sometimes the customer also asks you to extend existing terminology. memoQ lets you perform such a term extraction.

How to extract terms:

1. Create a local project or check out the online project. The project should contain all new documents, or the documents you want to harvest terminology from, together with the resources (translation memories, corpora, term bases).

Note: If you are the project manager of a multilingual project (one with several target languages), you need to choose the target language first. Go to the Translations pane of Project home, and from the All languages drop-down list, choose the target language you want to run the extraction for.

2. Go to Operations > Extract terms. The Extract candidates dialog appears.

3. In this dialog, choose the sources you want to run the term extraction on. In the Source section, you can select the materials that memoQ will extract the candidates from.

  • Translation documents check box: Check this if you want memoQ to process the source-language text of the translation documents in the current project. This check box is checked by default, but it is available only if there is at least one translation document in the current project.
    • Every document button: Choose this if you wish to process all documents in your project. This is the default setting.
    • Selected documents button: Choose this if you wish to process the selected documents. Before you can use this option, you need to select one or more documents in the Translations pane of Project home.
  • Translation memories check box: Check this if you want memoQ to process the source-language text in translation memories that are being used in the current project. This check box is checked by default, but it is available only if at least one translation memory is used in the current project.
    • All memories in project button: Choose this if you want to process all translation memories that are used in the current project. This is the default setting.
    • Primary TM button: Choose this if you want to process the primary translation memory only.
    • Selected TMs button: Choose this if you want to process the selected translation memories. Before you can use this option, you need to select one or more translation memories in the Translation memories pane of Project home.
  • LiveDocs corpus documents check box: Check this if you want memoQ to process the source-language text in the LiveDocs corpora that are being used in the current project. This check box is not checked by default. It is not available if no LiveDocs corpora are used in the current project.
    • All documents shown button: Choose this if you wish to process all documents of all LiveDocs corpora used in the current project. This is the default setting.
    • Selected documents button: Choose this if you wish to process the selected documents in the selected LiveDocs corpus. Before you can use this option, you need to select one or more documents in a LiveDocs corpus from the LiveDocs pane of Project home.

Note: When you perform a term extraction with remote resources (remote TM, remote corpus) assigned to the project, they are not available as reference for the term extraction.

4. Define the options:

General options

  • Maximum length (words) text box: The number of words in the longest term candidate. memoQ will not list expressions that are longer than this. The default value is 4.
  • Minimum frequency text box: memoQ lists only candidates that occur in the source text at least as many times as the number specified here. For example, if the minimum frequency is 3, the list will contain candidates that occur 3 or more times in the source text. The default value is 3.
  • Expression delimiters text box: This is a list of characters that mark the beginning or the end of a term candidate. memoQ will not extract expressions where one or more of these characters occur inside the expression.
  • Length factor text box: This is a number between 0.5 and 3 that controls how much memoQ should favor longer expressions. Each term candidate (that is, extracted expression) receives a score during the extraction process. The larger the length factor, the larger the difference will be between the score of a longer and a shorter expression. The default value is 1.5.
  • Ignore words with numbers check box: If this check box is checked, memoQ will not include expressions in which any word contains one or more digits. The check box is not checked by default.
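
To make the interplay of these options concrete, here is a minimal Python sketch of a frequency-based extractor. It is not memoQ's actual algorithm: the function name `extract_candidates`, the default delimiter string, and the lowercasing of the text are assumptions made for illustration only, and the scoring driven by the length factor is left out because the article does not describe how the score is computed.

```python
import re
from collections import Counter


def extract_candidates(text, max_words=4, min_frequency=3,
                       delimiters=".,;:!?()", ignore_words_with_numbers=False):
    """Count word n-grams that never cross a delimiter (illustrative only)."""
    counts = Counter()
    # Split the text at delimiter characters so that no candidate
    # can contain one of them. The delimiter set is a made-up default.
    if delimiters:
        chunks = re.split("[" + re.escape(delimiters) + "]", text)
    else:
        chunks = [text]
    for chunk in chunks:
        words = chunk.lower().split()
        # Maximum length (words): build n-grams of 1..max_words words.
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                ngram = words[i:i + n]
                # Ignore words with numbers: skip n-grams with any digit.
                if ignore_words_with_numbers and any(
                        ch.isdigit() for word in ngram for ch in word):
                    continue
                counts[" ".join(ngram)] += 1
    # Minimum frequency: keep candidates seen at least this many times.
    return {cand: freq for cand, freq in counts.items()
            if freq >= min_frequency}


sample = "term base. term base, term base; memory"
print(extract_candidates(sample))
```

In this sample, "memory" occurs only once and is dropped by the minimum frequency of 3, while "term", "base", and "term base" each occur three times and are kept.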

Single-word terms

  • Minimum length (characters) text box: memoQ does not list words that are shorter than the number specified here. For example, if the minimum length is 3, memoQ extracts single-word candidates that are 3 characters long or longer. The default value is 3.
    Note: The minimum length does not apply to term candidates that contain multiple words.
  • Minimum frequency text box: memoQ lists only single-word candidates that occur in the source text at least as many times as the number specified here. For example, if the minimum frequency is 3, the list will contain candidates that occur 3 or more times in the source text. The default value is 3.
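
As a rough illustration of how the single-word settings differ from the general ones, the sketch below filters a table of candidate frequencies. The function `filter_candidates` and its parameter names are hypothetical, and it assumes that the general minimum frequency applies to multi-word candidates while the single-word thresholds apply to one-word candidates; the article does not state exactly how memoQ combines the two.

```python
def filter_candidates(frequencies, min_frequency=3,
                      single_min_length=3, single_min_frequency=3):
    """Apply separate thresholds to single-word and multi-word candidates."""
    kept = {}
    for candidate, freq in frequencies.items():
        if " " in candidate:
            # Multi-word candidate: the minimum length (characters)
            # does not apply, only the general minimum frequency.
            if freq >= min_frequency:
                kept[candidate] = freq
        elif len(candidate) >= single_min_length and freq >= single_min_frequency:
            # Single-word candidate: long enough and frequent enough.
            kept[candidate] = freq
    return kept


frequencies = {"tm": 9, "memory": 5, "corpus": 2, "to do": 4}
print(filter_candidates(frequencies))
```

Here "tm" is dropped for being shorter than 3 characters and "corpus" for occurring fewer than 3 times, while "to do" is kept even though its words are short, because it is a multi-word candidate.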

Term base lookup

When extracting candidates, memoQ looks for expressions in the source-language text only. However, memoQ can retrieve possible translations for the extracted candidates by looking them up in term bases used in the same project.

  • Look up candidates check box: Check this if you want memoQ to look up translations for each candidate in the term bases used in the current project. The check box is checked by default.
  • All term bases in project button: Choose this if you wish to look up the candidates in all term bases in the current project. This is the default setting.
  • Primary term base only button: Choose this if you wish to look up the candidates in the primary term base only.
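
Conceptually, the lookup pairs each source-language candidate with translations already stored in the project's term bases. The sketch below models this under loud assumptions: each term base is a plain Python dictionary mapping a source term to a list of translations, the first dictionary in the list stands for the primary term base, and `look_up_candidates` is a hypothetical helper, not a memoQ API.

```python
def look_up_candidates(candidates, term_bases, primary_only=False):
    """Collect known translations for each candidate (illustrative only)."""
    # Assumption: the first term base in the list is the primary one.
    bases = term_bases[:1] if primary_only else term_bases
    results = {}
    for candidate in candidates:
        translations = []
        for base in bases:
            for translation in base.get(candidate, []):
                # Avoid listing the same translation twice.
                if translation not in translations:
                    translations.append(translation)
        results[candidate] = translations
    return results


term_bases = [
    {"term base": ["Termdatenbank"]},
    {"term base": ["Terminologiedatenbank"], "memory": ["Speicher"]},
]
print(look_up_candidates(["term base", "memory"], term_bases))
print(look_up_candidates(["term base", "memory"], term_bases, primary_only=True))
```

Candidates without a match simply get an empty list, which corresponds to a term that does not yet exist in the term bases that were searched.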

5. Some words do not usually occur at the beginning, at the end, or inside a term, for example "the" and "and" in English. Such words are called stop words. If an expression begins with, ends in, or contains one of these words, it should not be listed as a term candidate. In the lower part of the Extract candidates dialog, you can list stop words. Each stop word has three options: you can exclude it from the beginning, the end, or any position of an expression. To load an existing stop word list, choose one from the Stop word list drop-down box. To save the current stop word list, click Save as... next to the Stop word list box, and specify a name and a description in the Create stop word list dialog.

Note: Stop word lists are light resources in memoQ. You can save, load, and manage them in the Resource console.

Caution: It is possible that there is no default stop word list for your source language.

To add a new stop word to the list, type the word in the Word text box at the bottom of the dialog, and then click the Add link next to it. By default, memoQ adds the word to the list with all check boxes checked in the Blocks inside, Blocks as first, and Blocks as last columns. After adding a word, you may want to uncheck one or more of these check boxes:

  • Blocks inside: Check this check box if this word must not occur anywhere in a term. Expressions containing this word will not be listed.
  • Blocks as first: Check this check box if this word must not occur at the beginning of a term. Expressions beginning with this word will not be listed.
  • Blocks as last: Check this check box if this word must not occur at the end of a term. Expressions ending in this word will not be listed.
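
The three check boxes can be read as three independent tests on a candidate expression. The following Python sketch models them under stated assumptions: the `StopWord` class and the `is_blocked` function are hypothetical names, matching is done on lowercased whole words, and Blocks inside is taken literally from the description above as "anywhere in the term". memoQ's actual matching may differ.

```python
from dataclasses import dataclass


@dataclass
class StopWord:
    word: str
    blocks_inside: bool = True    # all three are checked by default
    blocks_as_first: bool = True
    blocks_as_last: bool = True


def is_blocked(candidate, stop_words):
    """Return True if any stop word rule excludes the candidate."""
    words = candidate.lower().split()
    if not words:
        return False
    for stop in stop_words:
        word = stop.word.lower()
        if stop.blocks_inside and word in words:
            return True   # the word occurs somewhere in the expression
        if stop.blocks_as_first and words[0] == word:
            return True   # the expression begins with the word
        if stop.blocks_as_last and words[-1] == word:
            return True   # the expression ends in the word
    return False


stop_words = [StopWord("of", blocks_inside=False), StopWord("the")]
for candidate in ["of course", "point of view", "end of the line", "term base"]:
    print(candidate, is_blocked(candidate, stop_words))
```

With Blocks inside unchecked for "of", an expression such as "point of view" survives, while "of course" is still excluded because it begins with the stop word. Under this reading, the first and last flags only make a difference when Blocks inside is unchecked.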

To remove a stop word from the list, select its row, and click the Delete selected link below the list.

6. Click OK to perform the term extraction. memoQ extracts the terms according to your settings, and then opens the Candidate list editor, where you decide which term candidates to use and which to discard.

Note: In the Occurrences section of the candidate list editor, you can see the context of each term. This helps you judge the relevance of a term and how it is used. In the Term base results section, you can see whether a term already exists in a term base attached to this project.

7. After you have made your choices, send the terms to a term base. Click the Export to term base icon. If several term bases are attached to this project, choose the one the terms should be saved to from the Term base to export to drop-down list. Click OK to save the approved candidates to the specified term base.

Further explanations on settings for term extraction and term candidate editing can be found in the memoQ Help under: Functions and Settings > Term extraction.
