SpacyNER

Please note: the configuration for this step will shortly be ported to a configuration file, so you will no longer have to edit the Python code. In the meantime, edit the endpoints/spacyner.py file.

The SpacyNER enrichment step uses statistical named entity recognition to extract things such as people, locations, organisations, and geopolitical references. These are then stored in entity fields in Aiimi Insight Engine against documents.

Before using the step, familiarise yourself with the supported entities, then create an entity group and its entities in Control Hub. Create these as keyword entities and follow the standard camelCase convention for their names.

Supported entities are:

PERSON – People, including fictional.
GPE – Countries, cities, states.
NORP – Nationalities or religious or political groups.
FAC – Buildings, airports, highways, bridges, etc.
ORG – Companies, agencies, institutions, etc.
LOC – Non-GPE locations, mountain ranges, bodies of water.
PRODUCT – Objects, vehicles, foods, etc. (Not services.)
EVENT – Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART – Titles of books, songs, etc.
LAW – Named documents made into laws.
LANGUAGE – Any named language.
DATE – Absolute or relative dates or periods.
TIME – Times smaller than a day.
PERCENT – Percentage, including “%”.
MONEY – Monetary values, including unit.
QUANTITY – Measurements, as of weight or distance.
ORDINAL – “first”, “second”, etc.
CARDINAL – Numerals that do not fall under another type.

The following settings must be configured in the spacyner.py file, which is in the endpoints folder.

ENTITY_GROUP – this should be the name of the entity group you created in Control Hub.
ENTITY_MAP – for each entity that you are interested in, map it to the entity name that you created in Control Hub. Any entity left as an empty string is ignored.
ENTITY_VALIDATOR – for each entity returned by the model you can configure a regular expression that is used to validate the value.
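
As a rough illustration of how these three settings might fit together, here is a minimal sketch. The dictionary shapes, the example entity names (namedEntities, people, organisations, places), and the helper map_entity are assumptions made for illustration only; check the actual endpoints/spacyner.py file for the real structure.

```python
import re

# Hypothetical values; the real names and shapes in endpoints/spacyner.py may differ.
ENTITY_GROUP = "namedEntities"

# Model label -> entity name created in Control Hub ("" means the label is ignored).
ENTITY_MAP = {
    "PERSON": "people",
    "ORG": "organisations",
    "GPE": "places",
    "CARDINAL": "",
}

# Model label -> regular expression that the extracted value must match in full.
ENTITY_VALIDATOR = {
    "PERSON": r"[A-Za-z][A-Za-z .'\-]+",
}


def map_entity(label, value):
    """Return (group, entity name, value), or None if the entity is dropped."""
    name = ENTITY_MAP.get(label, "")
    if not name:
        return None  # label is unmapped or mapped to an empty string
    pattern = ENTITY_VALIDATOR.get(label)
    if pattern and not re.fullmatch(pattern, value):
        return None  # value failed its validation expression
    return (ENTITY_GROUP, name, value)
```

In this sketch a PERSON value made up only of digits would be rejected by the validator, while an ORG value has no validator and is passed through unchanged.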

There are also some parameters that help remove noise when the default spaCy models are used. These are by no means scientific in their approach, but they provide a simple way of applying domain-specific cleaning to what NER finds, and they can be very effective.

BAD_WORDS – an array of words that must not appear in an entity value.
BAD_STRINGS – an array of phrases that an entity value must not match.
PERCENTAGE_OF_GOOD_WORDS – for any given sentence, the fraction of tokens that must be recognisable English words, expressed as a decimal. Sentences meeting this rule are then fed into the summarisation algorithm. For example, 0.75 means that ¾ of the tokens in a sentence need to be valid words.
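
The noise parameters above could be applied along the following lines. This is only a sketch under stated assumptions: the example values, the tiny KNOWN_WORDS vocabulary, the helper names, whitespace tokenisation, exact matching for BAD_STRINGS, and the use of "at least" for the threshold are all hypothetical and may not reflect what spacyner.py actually does.

```python
# Hypothetical values for illustration only.
BAD_WORDS = ["page", "figure"]
BAD_STRINGS = ["table of contents"]
PERCENTAGE_OF_GOOD_WORDS = 0.75

# Tiny stand-in vocabulary; a real check would use a proper English word list.
KNOWN_WORDS = {"the", "report", "was", "written", "by", "ada", "lovelace", "in", "london"}


def sentence_is_clean(sentence):
    """True if a large enough share of the sentence's tokens are known words."""
    tokens = sentence.lower().split()
    if not tokens:
        return False
    good = sum(1 for t in tokens if t.strip(".,;:!?") in KNOWN_WORDS)
    return good / len(tokens) >= PERCENTAGE_OF_GOOD_WORDS


def entity_is_clean(value):
    """True if the value contains no bad word and is not equal to a bad string."""
    lowered = value.lower()
    if lowered in BAD_STRINGS:
        return False
    return not any(word in BAD_WORDS for word in lowered.split())
```

With these example values, a sentence in which three of four tokens are known words just meets the 0.75 threshold, and an entity value such as "Figure 3" is discarded because it contains a bad word.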

Aiimi have a research programme that continues to enhance NER model training, reinforcement, domain-specific models, and value validation, so capabilities in this area will continue to evolve and improve.

The endpoint for this enrichment step is ‘spacyner’.
