Text Cleaner

The Text Cleaner step cleans up text produced by an OCR process. It checks for excessive character runs and applies any other text content constraints you have configured.

  1. Select the methods you want to use from the Cleaning Process dropdown. Multiple methods can be selected.

    • Remove Long Strings - Strings over a certain length will be removed from the text.

    • Remove Null Characters - Any blank characters will be removed.

    • Remove Non ASCII Characters - Any characters outside the ASCII (American Standard Code for Information Interchange) character set will be removed.

    • OCR Cleanup - Improve the accuracy of your OCR process by defining rules for cleaning.

    • Remove Blank Lines - Any blank lines will be removed.
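
As an illustration only, the simpler methods above can be sketched in Python. This is not the product's implementation. The function names are hypothetical, and the sketch assumes that "null characters" means the NUL code point (U+0000) and that a blank line is one that is empty or contains only whitespace.

```python
def remove_null_characters(text: str) -> str:
    # Assumption: a "null character" is the NUL code point (U+0000).
    return text.replace("\x00", "")


def remove_non_ascii(text: str) -> str:
    # Keep only code points 0-127; everything else is dropped.
    return text.encode("ascii", errors="ignore").decode("ascii")


def remove_blank_lines(text: str) -> str:
    # Assumption: a blank line is empty or whitespace-only.
    return "\n".join(line for line in text.splitlines() if line.strip())


def clean(text: str, methods) -> str:
    # Apply each selected method in order, mirroring a multi-select dropdown.
    for method in methods:
        text = method(text)
    return text


cleaned = clean(
    "caf\u00e9\x00 ok\n\n  \nnext",
    [remove_null_characters, remove_non_ascii, remove_blank_lines],
)
```

Here `cleaned` holds two lines, `caf ok` and `next`: the NUL and the accented character are dropped, then the two blank lines are removed.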

Additional Settings Per Method

Remove Long Strings

  1. When this method is selected, Maximum Continuous Characters must be set.

    • Any run of characters longer than this value that contains no spaces or delimiters will be removed.

  2. Enter any delimiters to be used other than full stops. These will be used to determine the length of a sentence.
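
The behaviour described in these two settings might look roughly like the Python sketch below. It is an assumption-laden illustration rather than the product's code: the function and parameter names are hypothetical, and it assumes that whitespace and full stops always break a run and that any delimiters you enter are added to that set.

```python
import re


def remove_long_strings(text: str, max_continuous_characters: int,
                        extra_delimiters: str = "") -> str:
    # Assumption: whitespace and the full stop always break a run;
    # each character in extra_delimiters is an additional break point.
    separators = [r"\s+", re.escape(".")]
    separators += [re.escape(d) for d in extra_delimiters]
    pattern = "(" + "|".join(separators) + ")"

    # With one capturing group, re.split alternates run, separator, run, ...
    parts = re.split(pattern, text)
    kept = []
    for index, part in enumerate(parts):
        is_separator = index % 2 == 1
        # Separators are always kept; runs are kept only if short enough.
        if is_separator or len(part) <= max_continuous_characters:
            kept.append(part)
    return "".join(kept)


result = remove_long_strings("ok xxxxxxxxxxxx fine", 5)
```

In this sketch the 12-character run is dropped and the surrounding spaces are left in place, so `result` is `ok  fine` with two spaces.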

OCR Cleanup

  1. In OCR Cleanup Dictionary File path, enter the path to the dictionary to use for word checking.

    • Aiimi can provide a dictionary set as required.

  2. To ignore the spelling of proper nouns, check Ignore Proper Nouns.

  3. To only perform checks on documents that have passed OCR, check Only Clean If OCR Metadata Present.

  4. To exclude specific words from OCR spellchecks:

    1. Enter a word in the Words to Ignore for OCR Cleanup field and press Enter or select the plus button.

      • There is no limit to the number of words you can add.

      • You can remove and edit words in the list using the edit or delete buttons next to the word.

  5. To add terms from an entity group to the dictionary:

    1. Enter an entity group in the Entity Groups To Allow field and press Enter or select the plus button.

      • Words and terms found in these entity groups will be allowed in the text.
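
To show how these settings could interact, here is a minimal Python sketch of a dictionary-based check. It is not the product's algorithm. It assumes that words failing the check are simply dropped, that a proper noun is any token starting with a capital letter, and that matching is case-insensitive. All names are hypothetical.

```python
def ocr_cleanup(text, dictionary, ignore_proper_nouns=False,
                words_to_ignore=(), allowed_entity_terms=()):
    # Words accepted by the dictionary, compared case-insensitively.
    known = {word.lower() for word in dictionary}
    # Words to Ignore and entity-group terms are both treated as allowed.
    allowed = {word.lower() for word in words_to_ignore}
    allowed |= {word.lower() for word in allowed_entity_terms}

    kept = []
    for token in text.split():
        # Strip surrounding punctuation before looking the word up.
        word = token.strip(".,;:!?\"'()").lower()
        if not word or word in known or word in allowed:
            kept.append(token)
        elif ignore_proper_nouns and token[:1].isupper():
            # Assumption: a capitalised token counts as a proper noun.
            kept.append(token)
        # Anything else is treated as OCR noise and dropped.
    return " ".join(kept)


dictionary = {"the", "invoice", "was", "sent", "to"}
cleaned = ocr_cleanup("The invoice was sent to Aiimi xq7zt.",
                      dictionary, ignore_proper_nouns=True)
```

With Ignore Proper Nouns enabled in this sketch, `Aiimi` is kept and `xq7zt.` is dropped. Adding `xq7zt` to `words_to_ignore` or `allowed_entity_terms` would keep it as well.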

Advanced Options

  • Define the maximum messages to process concurrently in Bounded Capacity.

  • Define the maximum number of messages that can be queued.

    • Limiting this reduces memory use but increases the time taken to process.
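
The trade-off between memory use and processing time can be illustrated with a bounded queue in Python. This is a generic sketch, not how the step is implemented: `max_concurrent` and `max_queued` are hypothetical parameter names and are not mapped to specific fields in the interface.

```python
import queue
import threading


def process_messages(messages, handler, max_concurrent=2, max_queued=2):
    # A bounded queue: put() blocks once max_queued items are waiting,
    # which caps memory use but can slow the producer down.
    pending = queue.Queue(maxsize=max_queued)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is None:  # Sentinel: no more work for this worker.
                break
            output = handler(item)
            with lock:
                results.append(output)

    # max_concurrent workers handle messages at the same time.
    threads = [threading.Thread(target=worker) for _ in range(max_concurrent)]
    for thread in threads:
        thread.start()
    for message in messages:
        pending.put(message)
    for _ in threads:
        pending.put(None)  # One sentinel per worker.
    for thread in threads:
        thread.join()
    return results


outputs = process_messages(["a", "b", "c", "d"], str.upper, max_queued=1)
```

The order of `outputs` depends on thread scheduling, so sort the results if order matters. The sketch also assumes that no message is `None`, because `None` is used as the stop signal.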
