Phrases and Topics

The phrases enrichment step extracts repeating phrases that are said to be ‘left right complete’. People tend to repeat the core topics they are writing about, so this step aims to extract those repeated phrases efficiently and store them as metadata against a document in Aiimi Insight Engine. The stored phrases help a user understand the central topics, themes, and concepts in a piece of text.
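As an illustration of what ‘left right complete’ can mean, the sketch below follows one common reading of the term from the Lingo clustering literature: a repeated phrase is complete on a side when its occurrences do not always share the same neighbouring word on that side, so it cannot be extended without losing occurrences. This is an assumption about the concept, not the product's actual implementation; the function name complete_phrases and its parameters are hypothetical.

```python
from collections import defaultdict

def complete_phrases(tokens, min_count=2, max_len=4):
    """Return repeated n-grams that are both left- and right-complete.

    Hypothetical helper for illustration only.
    """
    # Record the (previous token, next token) pair for every occurrence.
    contexts = defaultdict(list)
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            left = tokens[i - 1] if i > 0 else None  # None marks a text boundary
            right = tokens[i + n] if i + n < len(tokens) else None
            contexts[gram].append((left, right))

    result = {}
    for gram, ctx in contexts.items():
        if len(ctx) < min_count:
            continue  # not a repeating phrase
        lefts = {left for left, _ in ctx}
        rights = {right for _, right in ctx}
        # Keep the phrase only if its neighbours differ on both sides.
        if len(lefts) > 1 and len(rights) > 1:
            result[" ".join(gram)] = len(ctx)
    return result

tokens = "visit new york city today because new york city rocks".split()
phrases = complete_phrases(tokens)
print(phrases)
```

In this example ‘new york’ is always followed by ‘city’ and ‘york city’ is always preceded by ‘new’, so only the full phrase ‘new york city’ is kept.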

The first step is to create a ‘phrases’ field in Control Hub in the metadata group if it does not already exist. It should be of type Keyword.

You will also need to configure the step in its configuration file, which can be found at:

  • \PythonRestService\config\endpoints\phrases.json

Settings:

  • metadata_field_for_phrases – this should be left as ‘phrases’.

  • number_of_lingo_phrases – the maximum number of Lingo phrases that should be included.

  • number_of_bigram_phrases – the maximum number of bigram phrases that should be included.

  • number_of_trigram_phrases – the maximum number of trigram phrases that should be included.

  • max_text_size – the maximum number of characters to use. If the text is longer than this value, only the first max_text_size characters are used.

  • percentage_of_numbers_allowed_in_sentence – this is the maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This setting helps avoid including sentences that are largely numbers.
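The settings above can be pictured with a short Python sketch. The setting names come from the list; every value is an illustrative assumption, as is the flat dictionary layout, which may not match the structure of phrases.json. The two helper functions are hypothetical and show one plausible reading of max_text_size and percentage_of_numbers_allowed_in_sentence, assuming the percentage is expressed on a 0 to 100 scale and counted per whitespace-separated token.

```python
# Illustrative values only; the real defaults and file layout are not shown here.
settings = {
    "metadata_field_for_phrases": "phrases",
    "number_of_lingo_phrases": 10,
    "number_of_bigram_phrases": 5,
    "number_of_trigram_phrases": 5,
    "max_text_size": 100000,
    "percentage_of_numbers_allowed_in_sentence": 50,
}

def truncate_text(text, max_text_size):
    """Keep only the first max_text_size characters (hypothetical helper)."""
    return text[:max_text_size]

def sentence_is_allowed(sentence, max_percentage):
    """Reject sentences whose share of number-bearing tokens is too high.

    Assumption: a token counts as a number if it contains any digit.
    """
    tokens = sentence.split()
    if not tokens:
        return False
    numeric = sum(1 for token in tokens if any(ch.isdigit() for ch in token))
    return 100.0 * numeric / len(tokens) <= max_percentage

limit = settings["percentage_of_numbers_allowed_in_sentence"]
print(sentence_is_allowed("Revenue rose 12 percent in 2023", limit))
print(sentence_is_allowed("12 34 56 total", limit))
```

The first sentence has two number-bearing tokens out of six and passes; the second has three out of four and is rejected.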

Phrase cleaning settings:

  • min_word_length – minimum length of any word in a phrase.

  • max_word_length – maximum length of any word in a phrase.

  • minimum_number_of_words – minimum number of words in a phrase.

  • maximum_number_of_words – maximum number of words in a phrase.

  • minimum_number_of_total_characters – minimum number of total characters in a phrase.

  • minimum_number_of_real_words – minimum number of words in a phrase that do not contain numbers.

  • only_allow_real_words – whether to allow only phrases in which no word contains numbers.

  • bad_strings – a path to a file that contains a list of phrases that are not allowed as phrases.

  • bad_words – a path to a file that contains a list of words that are not allowed in phrases.

  • remove_phrases_that_are_in_entities – a list of entities that should be removed from phrases (typically, you don’t want NER entities to repeat in phrases).
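The cleaning settings can be read as a chain of filters applied to each candidate phrase. The sketch below is a minimal illustration of that idea, not the step's actual code. The function keep_phrase and its arguments are hypothetical, the values are invented, and the bad strings, bad words, and entity values are passed as in-memory sets for simplicity, whereas the real settings point to files or list entities. It also assumes that the total character count includes spaces and that a ‘real word’ is any word without a digit.

```python
# Illustrative values only.
cleaning = {
    "min_word_length": 2,
    "max_word_length": 25,
    "minimum_number_of_words": 2,
    "maximum_number_of_words": 4,
    "minimum_number_of_total_characters": 6,
    "minimum_number_of_real_words": 1,
    "only_allow_real_words": False,
}

def is_real_word(word):
    """Assumption: a 'real' word is one that contains no digits."""
    return not any(ch.isdigit() for ch in word)

def keep_phrase(phrase, cfg, bad_strings=frozenset(), bad_words=frozenset(),
                entity_values=frozenset()):
    """Return True if the phrase passes every cleaning rule (hypothetical helper)."""
    words = phrase.split()
    lowered = phrase.lower()
    if not cfg["minimum_number_of_words"] <= len(words) <= cfg["maximum_number_of_words"]:
        return False
    if len(phrase) < cfg["minimum_number_of_total_characters"]:
        return False
    if any(not cfg["min_word_length"] <= len(w) <= cfg["max_word_length"] for w in words):
        return False
    real_words = sum(1 for w in words if is_real_word(w))
    if real_words < cfg["minimum_number_of_real_words"]:
        return False
    if cfg["only_allow_real_words"] and real_words != len(words):
        return False
    if lowered in bad_strings:
        return False  # whole phrase is on the disallowed list
    if any(w.lower() in bad_words for w in words):
        return False  # phrase contains a disallowed word
    if lowered in entity_values:
        return False  # phrase duplicates an already extracted entity
    return True

print(keep_phrase("data quality", cleaning))
print(keep_phrase("click here", cleaning, bad_strings={"click here"}))
```

Ordering the cheap length checks before the set lookups is a design choice of this sketch only.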

The following screenshot shows the phrases and topics popup when used as a filter in Enterprise Search.

The endpoint for this enrichment step is ‘phrases’.