How to merge stop word lists
Description: Sometimes you want to extend an existing stop word list, for example with terms from a term base. In your next term extraction session, you can then choose the merged stop word list so that terms which already exist are skipped.
Follow the steps below to merge a stop word list with existing terminology in a batch process:
1. Open memoQ and go to the Resource console > Stop word lists. Create a new stop word list for your language, e.g. German, or select an existing one. Click Export to export the stop word list as an MQRES file.
2. Open the exported stop word list in an editor, e.g. Notepad++. Copy the contents into an empty Word document.
3. If you have not yet prepared the terminology that you want to merge with this stop word list, export the term base from memoQ:
- Go to the Resource console > Term bases.
- Select your term base, and click the Export to CSV link.
- Change the file extension of the exported CSV file to .txt.
- Open Excel. Then go to File > Open. Navigate to the renamed TXT file. You need to select "All files" to see the TXT file.
- Follow the Excel wizard. On the 2nd page of the wizard, choose the Comma option. In the preview of the data, you will see the contents separated in columns. Click Next and Finish.
- Excel has now imported your TXT file. Go to the column that contains the terms you want to merge into your stop word list. Copy this column, open Notepad, and paste the content. This removes the Excel formatting, so only the terms are left.
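If you prefer a script over the Excel and Notepad round trip, the column extraction in step 3 can be sketched in Python. This is a minimal sketch, not part of memoQ: the function name `extract_terms`, the sample data, and the assumption that the export is comma-separated with one header row and the terms in the first column are all illustrative. Check your own export before relying on it. The duplicate filtering is an extra convenience that the manual steps do not perform.

```python
import csv
import io


def extract_terms(csv_text, column_index=0, has_header=True):
    """Return the non-empty values of one CSV column, in order, without duplicates.

    Assumption: the text is comma-separated. column_index and has_header
    must be adapted to the layout of the actual export.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header and rows:
        rows = rows[1:]  # skip the header row
    seen = set()
    terms = []
    for row in rows:
        if column_index >= len(row):
            continue  # row is too short to contain the requested column
        term = row[column_index].strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms


# Hypothetical sample data, not a real memoQ export.
sample = "German,English\nHaus,house\nBaum,tree\nHaus,home\n"
print(extract_terms(sample))
```

For a real file, read its contents with the encoding the export actually uses and pass the text to `extract_terms`.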
4. After you have prepared the terms and exported the stop word list, you can merge the two:
- Go back to the Word document into which you copied the contents of the MQRES file.
- Insert the terms, which you copied from Excel into Notepad, after the last stop word.
5. You will see that the stop words from the MQRES file have numbers, and your inserted terms do not.
You need to add numbers to the words you just added. Choose which positions you want to block:
- 111 means the word is blocked as first word, inside, and as last word.
- 101 means the word is blocked as first word and as last word, but not inside.
- 100 means the word is blocked as first word only, not inside and not as last word.
Generally, adding 111 is fine. You can fine-tune the stop words later in memoQ at any stage.
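The three digits can be read as one on/off switch per position. The following sketch only illustrates that reading as it is described in this article. The helper `describe_flags` is hypothetical and is not a memoQ function.

```python
def describe_flags(flags):
    """Translate a three-digit flag such as '101' into the blocked positions.

    Reading taken from the article: first digit = first word,
    second digit = inside, third digit = last word.
    """
    if len(flags) != 3 or any(digit not in "01" for digit in flags):
        raise ValueError("expected three digits, each 0 or 1, e.g. '111'")
    positions = ("first", "inside", "last")
    return [name for name, digit in zip(positions, flags) if digit == "1"]


for value in ("111", "101", "100"):
    print(value, describe_flags(value))
```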
6. In Word, place the cursor at the first word that has no number after it, then open the Find and Replace dialog. Enter ^p in the Find what field and ^t111^p in the Replace with field:
- ^p is the paragraph mark.
- ^t is the tab character. 111 stands for the positions you want to block.
- The paragraph mark has to be put back, which is why ^p appears again at the end of the replacement.
Note: You can also use 100 or 101 instead of 111 to apply that setting to all added words in one batch.
IMPORTANT: Replace only from the cursor position onward. Do not touch the existing stop words or the file header.
Click the Replace All button.
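The replacement in step 6 puts a tab and the chosen number at the end of every added term. As a rough equivalent outside Word, the same result can be sketched in Python. The function name `add_flags` is an assumption for illustration, and the layout "term, tab, number" is taken from the description in this article rather than from a format specification.

```python
def add_flags(terms, flags="111"):
    """Return one 'term<TAB>flags' line per non-empty term.

    Like the manual step, this only touches the newly added terms,
    never the existing stop words or the file header.
    """
    return [f"{term.strip()}\t{flags}" for term in terms if term.strip()]


# Hypothetical terms; the empty entry is skipped.
for line in add_flags(["Haus", " Baum ", ""]):
    print(repr(line))
```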
7. In your Word file, every added term should now be followed by a tab and the number you chose, just like the original stop words.
8. Go back to your originally exported MQRES file. If you have closed it, open it again in Notepad or Notepad++. Copy all new terms, including the numbers you added, from your Word file and insert them into the MQRES file below the last stop word.
9. Save the file. Now go to memoQ, open the Resource console > Stop word lists. Click the Import new link. Import the merged stop word list. You may need to choose a different name when prompted.
10. After a successful import, you can click the Edit link. You will find the terms merged into your stop word list.
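The copy and paste in step 8 could also be scripted. The sketch below appends new lines below the last existing line and leaves everything already in the file unchanged. It assumes that the exported file is plain text with one "word, tab, number" entry per line after the header, as this article describes. The name `append_entries` and the `HEADER` placeholder are hypothetical, and the real header of an MQRES file is not reproduced here. Skipping words that are already present is an extra convenience that the manual steps do not include.

```python
def append_entries(existing_text, new_lines):
    """Append 'word<TAB>flags' lines below the last line of existing_text.

    Existing content, including the header, is kept exactly as it is.
    Words that already occur in the list are not added a second time.
    """
    newline = "\r\n" if "\r\n" in existing_text else "\n"
    # Treat the text before the first tab of each line as an existing word.
    existing_words = {
        line.split("\t", 1)[0]
        for line in existing_text.splitlines()
        if "\t" in line
    }
    merged = existing_text
    if merged and not merged.endswith(("\n", "\r")):
        merged += newline  # make sure the first new entry starts on its own line
    for line in new_lines:
        word = line.split("\t", 1)[0]
        if word in existing_words:
            continue
        existing_words.add(word)
        merged += line + newline
    return merged


# "HEADER" stands in for the real file header, which is not shown here.
exported = "HEADER\nder\t111\ndie\t111"
print(append_entries(exported, ["Haus\t111", "der\t111"]))
```

Work on a copy of the exported file, and import the result under a new name as described in step 9, so that the original list stays available if the import fails.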