Language Detection

The language detection step determines the language of a document and stores it in a metadata field called ‘language’. If this metadata field does not exist, create it with type keyword.

Settings:

  • max_text_size – maximum number of characters to use for detection. If the text is longer than this value, only the first max_text_size characters are used.

  • percentage_of_numbers_allowed_in_sentence – maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This setting helps exclude sentences that consist largely of numbers.

  • metadata_field – name of the metadata field that receives the detected language. Leave this set to ‘language’.

  • language_map – maps each country code to its friendly name. You can change the friendly names if you wish.
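
The settings above can be illustrated with a minimal Python sketch. It is not the step's actual implementation: the setting values, the helper function names, the token-based way of measuring the percentage of numbers, and the choice to store the friendly name rather than the raw code are all assumptions made for illustration.

```python
import re

# Hypothetical settings: the keys mirror the list above, the values are illustrative.
settings = {
    "max_text_size": 1000,
    "percentage_of_numbers_allowed_in_sentence": 50,
    "metadata_field": "language",
    "language_map": {"en": "English", "de": "German"},
}

# Assumption: a "number" is a whitespace-separated token such as 42, 3.14 or 1,000.
NUMBER_TOKEN = re.compile(r"\d+(?:[.,]\d+)*")


def truncate(text, settings):
    """Keep only the first max_text_size characters of the text."""
    return text[: settings["max_text_size"]]


def number_percentage(sentence):
    """Return the percentage of tokens in the sentence that are numbers."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for token in tokens if NUMBER_TOKEN.fullmatch(token))
    return 100.0 * numeric / len(tokens)


def sentence_allowed(sentence, settings):
    """True if the sentence does not exceed the allowed percentage of numbers."""
    limit = settings["percentage_of_numbers_allowed_in_sentence"]
    return number_percentage(sentence) <= limit


def friendly_name(code, settings):
    """Look up the friendly name for a code; fall back to the code itself."""
    return settings["language_map"].get(code, code)


def store_language(metadata, code, settings):
    """Write the language into the configured metadata field (assumed behaviour)."""
    metadata[settings["metadata_field"]] = friendly_name(code, settings)
    return metadata
```

For example, store_language({}, "de", settings) writes "German" into the ‘language’ field, and a sentence such as "10 20 30 apples" is rejected by sentence_allowed because three of its four tokens are numbers.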

The endpoint for this enrichment step is ‘language’.