Huggingface NER

The Huggingface NER enrichment step uses statistical named entity recognition to extract people, locations and organisations. These are then stored in entity fields in Aiimi Insight Engine against documents.

Before using the step, familiarise yourself with the supported entities, then create an entity group and its entities in Control Hub. Create these as keyword entities and name them using the standard camelCase convention.

You will also need to configure the step in its configuration file, which can be found at:

  • \PythonRestService\config\endpoints\huggingfacener.json

By default, you will want the following entities, where ‘ner’ is the entity group name:

  • ner.person

  • ner.location

  • ner.organisation
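As an illustration of the naming pattern above, the following minimal Python sketch builds the `group.entity` names and checks them against a simple camelCase rule. The helper `entity_field` and the regular expression are hypothetical and are not part of Aiimi Insight Engine; the rule assumes that camelCase here means a lowercase first letter followed by letters and digits only.

```python
import re

# Assumed camelCase rule: lowercase first letter, then letters/digits only.
CAMEL_CASE = re.compile(r"^[a-z][A-Za-z0-9]*$")


def entity_field(group: str, entity: str) -> str:
    """Return the 'group.entity' name, e.g. 'ner.person' (hypothetical helper)."""
    for name in (group, entity):
        if not CAMEL_CASE.match(name):
            raise ValueError(f"{name!r} is not a camelCase name")
    return f"{group}.{entity}"


# The three default entities under the suggested 'ner' group.
DEFAULT_FIELDS = [entity_field("ner", e) for e in ("person", "location", "organisation")]
```

A name such as "Person Name" would be rejected by this check, which can help catch naming mistakes before the entities are created in Control Hub.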

Settings:

  • max_text_size – the maximum number of characters to use. If the text is longer than this value, only the first max_text_size characters are used.

  • percentage_of_numbers_allowed_in_sentence – this is the maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This setting helps avoid including sentences that are largely numbers.

  • entity_group – this should match the name of the entity group you created. We suggest ‘ner’.

  • entity_map – this maps the classes returned by the NER model to your entities. You will only need to change this if you change the model that you use. If you do this, please consult Aiimi.

  • entity_validator – this is a regex that is executed against each respective class to validate the value.

  • bad_strings – this is a path to a file that contains a list of disallowed strings for each respective class.

  • minimum_score – this is a decimal number between 0 and 1 that determines how confident the model needs to be for the value to be accepted.

  • allow_single_character_entities – controls whether single-character entities are kept. Disallowing them is usually sensible, as they are nearly always nonsense.

  • nlp_model – the model used to generate the named entities. Please consult with Aiimi before changing this, as other settings such as the entity_map will also need to change.
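To show how several of these settings could work together, here is a minimal Python sketch of the kind of filtering they describe. It is not the step's actual implementation. The shape of the `settings` dictionary, the class labels (`PER`, `LOC`, `ORG`), the example regex and the functions `truncate` and `filter_entities` are all assumptions made for illustration, and the real layout of huggingfacener.json may differ. The prediction dictionaries follow the `entity_group`, `score` and `word` keys that a Hugging Face token-classification pipeline returns when aggregation is enabled. The percentage_of_numbers_allowed_in_sentence and nlp_model settings are left out.

```python
import re

# Hypothetical settings; the real schema of huggingfacener.json may differ.
settings = {
    "max_text_size": 1000,
    "entity_group": "ner",
    # Assumed model classes mapped to the entities created in Control Hub.
    "entity_map": {"PER": "person", "LOC": "location", "ORG": "organisation"},
    # Assumed shape: one regex per entity; here a person must contain no digits.
    "entity_validator": {"person": r"^[^\d]+$"},
    "minimum_score": 0.8,
    "allow_single_character_entities": False,
}


def truncate(text, settings):
    """Keep only the first max_text_size characters of the text."""
    return text[: settings["max_text_size"]]


def filter_entities(predictions, settings, bad_strings=None):
    """Return (field, value) pairs that pass every configured check.

    bad_strings is assumed to be already loaded from file as a dict of
    lowercase string sets keyed by entity name.
    """
    bad_strings = bad_strings or {}
    accepted = []
    for prediction in predictions:
        entity = settings["entity_map"].get(prediction["entity_group"])
        if entity is None:
            continue  # class is not mapped to any entity
        value = prediction["word"].strip()
        if prediction["score"] < settings["minimum_score"]:
            continue  # model is not confident enough
        if len(value) <= 1 and not settings["allow_single_character_entities"]:
            continue  # single-character values are nearly always noise
        pattern = settings["entity_validator"].get(entity)
        if pattern and not re.search(pattern, value):
            continue  # value fails the regex for its class
        if value.lower() in bad_strings.get(entity, set()):
            continue  # value is on the disallowed list for its class
        accepted.append((f'{settings["entity_group"]}.{entity}', value))
    return accepted


if __name__ == "__main__":
    sample = [
        {"entity_group": "PER", "score": 0.99, "word": "Ada Lovelace"},
        {"entity_group": "LOC", "score": 0.40, "word": "London"},
    ]
    print(filter_entities(sample, settings))
```

In this sketch each check is independent, so a value is stored only if it clears the score threshold, the length rule, the validator and the disallowed list for its class.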

The endpoint for this enrichment step is ‘huggingfacener’.