Cleaning unnecessary tags with TransTools Document Cleaner
Some Microsoft Word documents produced by OCR (Optical Character Recognition) and PDF conversion tools may contain excessive tags when imported into memoQ, even though the original document has simple formatting. This article describes how to prepare such Word documents in order to reduce the number of tags to a minimum.
Excessive tags in Word documents produced from PDF files occur due to slight differences in formatting applied to individual characters or words by OCR or PDF conversion tools, or due to bookmarks.
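To get a rough idea of how fragmented a document is before importing it, you can count its formatting runs. The following is a minimal sketch, not part of memoQ or TransTools. It assumes the file is saved as DOCX and relies on the fact that a DOCX file is a ZIP archive whose main text is stored in word/document.xml, where each `w:r` element is one run of identically formatted text. The function name `run_stats` is hypothetical. The run count is only a proxy: memoQ's actual tag count depends on its import filter, and headers, footers and footnotes are not inspected here.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
# Run properties for character scale, spacing, position and kerning.
SPACING_PROPS = ("w:w", "w:spacing", "w:position", "w:kern")


def run_stats(docx):
    """Count paragraphs and runs in the main body of a DOCX file.

    `docx` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = root.findall(".//w:p", NS)
    runs = root.findall(".//w:r", NS)

    # Count runs whose own properties change character spacing.
    spaced = 0
    for run in runs:
        props = run.find("w:rPr", NS)
        if props is not None and any(
            props.find(name, NS) is not None for name in SPACING_PROPS
        ):
            spaced += 1

    return {
        "paragraphs": len(paragraphs),
        "runs": len(runs),
        "runs_with_spacing_tweaks": spaced,
    }
```

For example, `run_stats("scanned.docx")` (a hypothetical file name) returns a dictionary of counts. A run count many times higher than the paragraph count in a plainly formatted document suggests that the conversion tool split the text into many small, slightly different runs.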
In simple cases, it is sufficient to save the document in DOCX format and import it, making sure that the Ignore minor formatting changes for fewer tags option is checked if you use Import with options. If you use the Import command rather than Import with options, this option is applied automatically. It removes tags caused by character spacing settings (font scale, spacing, position, kerning).
If this does not help, you can use Document Cleaner (http://www.translatortools.net/word-doccleaner.html), which is part of the TransTools for Word add-in, distributed as free software. Document Cleaner is a collection of tools designed to clean badly formatted documents before translation in CAT tools.
- Download TransTools from http://www.translatortools.net/download.html and install it. This installs the TransTools for Word add-in and additional optional components (which you can deselect during the installation process).
- Open the document in Microsoft Word and click Document Cleaner in the TransTools group on the Add-Ins tab (if you use Word 2007 or later), or choose TransTools -> Document Cleaner from the menu (if you use Word 2003 or earlier).
- On the Reformat tab of the Document Cleaner dialogue, choose the necessary options from the list. Usually, the default options will be sufficient. For details about each option, see http://www.translatortools.net/word-doccleaner.html
- Click Run Selected Operations.
- If you want to remove bookmarks, which can also cause a lot of tags in Word documents, select the Bookmark Cleanup tab, choose the necessary options and press Remove Bookmarks. Note that bookmarks are usually used to link tables of contents to the relevant sections of a Word document. If you choose to remove Table of Contents bookmarks, you will need to regenerate the tables of contents after exporting the target document.
- Save the document and import or re-import it into memoQ. The document will now have far fewer tags.
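Before removing bookmarks, it can help to see which ones the document actually contains. The sketch below is an illustration only and is not part of Document Cleaner. It assumes a DOCX file and lists the bookmark names found in word/document.xml, split into those that start with `_Toc` (the prefix Word typically gives to the hidden bookmarks behind table of contents entries) and all others. The function name `bookmark_names` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def bookmark_names(docx):
    """Return (toc_bookmarks, other_bookmarks) for a DOCX file.

    `docx` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    # Each bookmark begins with a w:bookmarkStart element carrying w:name.
    names = [
        start.get(f"{{{W}}}name", "")
        for start in root.iter(f"{{{W}}}bookmarkStart")
    ]

    # Assumption: table of contents bookmarks use Word's "_Toc" prefix.
    toc = [name for name in names if name.startswith("_Toc")]
    other = [name for name in names if not name.startswith("_Toc")]
    return toc, other
```

If the first list is long and the document has a table of contents you want to keep working, consider leaving the Table of Contents bookmarks in place and removing only the others.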