Step Configuration

This section contains a guide for each enrichment step that is hosted by the Rest Enrichment Service.

Bert Chinese NER

Provides named entity recognition for Chinese text. It supports person, location and organisation classes.

Bert Chinese NER Configuration

Values will be written to the following entities in Aiimi Insight Engine, where 'ner' is the group name, and person, location and organisation are the entity names:

  • ner.person

  • ner.location

  • ner.organisation

To change where values are stored, edit the configuration for the step. Remember to make sure the entities have also been created in CHUB.

Classification

This uses Aiimi’s clustering and classification framework to classify documents using a pre-trained model.

Classification Configuration

The endpoint for this enrichment step is ‘classify’

Model Set

You will need a ‘model set’ that has been trained and built for your documents.

  1. Check the model set consists of the following files with the exact names:

    • FE.pckl

    • Models.pckl

    • PP.pckl

  2. Create a folder in the ‘models’ subfolder, which can be found in the root of the Python REST Service.

  3. Place these files in the new models subfolder.
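
Before invoking the step, it can help to confirm that the folder really contains the three files with the exact names listed above. The following is a minimal sketch of such a check; the helper name missing_model_files and the example folder models/aiimi are assumptions for illustration and are not part of the Python REST Service.

```python
from pathlib import Path

# File names taken from the list above; a model set must contain all three.
REQUIRED_FILES = ("FE.pckl", "Models.pckl", "PP.pckl")


def missing_model_files(model_set_dir):
    """Return the required file names that are absent from model_set_dir."""
    folder = Path(model_set_dir)
    # Compare against the directory listing so the match is exact,
    # even on file systems that ignore case.
    present = {p.name for p in folder.iterdir() if p.is_file()} if folder.is_dir() else set()
    return [name for name in REQUIRED_FILES if name not in present]


if __name__ == "__main__":
    # Hypothetical location: <service root>/models/<model set name>
    missing = missing_model_files(Path("models") / "aiimi")
    print("Missing files:", missing or "none")
```

An empty result means the folder holds every file the model set needs.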

Metadata

By default, the classification will be stored in a metadata field called ‘classification’.

You will need to create a metadata field for this within the Control Hub and set its type to keyword. See our guide on creating metadata for help.

In some instances of Aiimi Insight Engine this metadata field may already exist.

Invoke the Classify Step

  1. Create a REST enrichment step in your pipeline.

  2. Configure the step to call the ‘classify’ endpoint.

    • Add ‘model_set=name’ to the endpoint as a parameter to select the model set for your documents.

      • For example: classify?model_set=aiimi
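
If you generate step configurations from a script, the endpoint string can be assembled as in the sketch below. The function name classify_endpoint is an assumption for illustration; only the ‘classify’ endpoint and the model_set parameter come from this guide.

```python
from urllib.parse import urlencode


def classify_endpoint(model_set=None):
    """Build the endpoint string for the REST step, e.g. 'classify?model_set=aiimi'."""
    if model_set is None:
        # Without a parameter, the service uses its default_model_set.
        return "classify"
    # urlencode escapes any characters that are not safe in a query string.
    return "classify?" + urlencode({"model_set": model_set})


print(classify_endpoint("aiimi"))
```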

Configuration for the step can be found in config/endpoints/classify.json

default_model_set – This model set is used when a model_set is not specified in your REST step configuration.

number_of_models_to_cache – Classification models are not thread-safe, so we build a set of models to use for inbound requests. This setting controls how many models are in that set.

  • Setting this to a larger number potentially means you can get more done faster, but you will also use more memory.

  • There is little point setting this higher than the REST step concurrency setting. Some empirical testing will help arrive at the ideal setting. Reach out to your Aiimi contact for advice on this.
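
To see why the number of cached models and the step concurrency are linked, consider the sketch below. It is an illustration of the general pooling idea, not the service's actual implementation; the class ModelPool and its methods are hypothetical. Each request borrows one model instance, so at most number_of_models_to_cache requests run at once and any extra instances beyond the step concurrency would sit idle.

```python
import queue
import threading


class ModelPool:
    """Hands out one model instance per request so no instance is shared across threads."""

    def __init__(self, build_model, number_of_models_to_cache):
        self._models = queue.Queue()
        for _ in range(number_of_models_to_cache):
            self._models.put(build_model())

    def run(self, func):
        model = self._models.get()      # blocks while every instance is busy
        try:
            return func(model)
        finally:
            self._models.put(model)     # hand the instance back to the pool


if __name__ == "__main__":
    # object() stands in for a real classification model.
    pool = ModelPool(build_model=object, number_of_models_to_cache=2)
    used = set()
    lock = threading.Lock()

    def handle(model):
        with lock:
            used.add(id(model))

    workers = [threading.Thread(target=pool.run, args=(handle,)) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("distinct model instances used:", len(used))
```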

Entity Mapper

Maps values found in one entity to synonyms of that value, then stores them in another entity. This step is useful if you want to normalise numerous values into a single common value or term, which can simplify filters.

Entity Mapper Configuration

The endpoint for this enrichment step is ‘entity_mapper’

To use this, you must create the target entity and choose the right data type. It must be a keyword if you want to use it as a filter. See our guide to creating entities for help.

JSON file configuration

  1. Edit the config/endpoints/entity_mapper.json file:

  • In the example below:

    • test – The entity group.

    • keywords – The entity.

    • target – The name of the entity to write synonym values to.

      • This must be in the same group.

    • path – This points to the mappings for the entity.

You can have more than one configuration. To add another, copy the ‘test’ object and edit it.
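
As a rough illustration of the layout the bullets above describe, the sketch below builds one possible configuration in Python and prints it as JSON. The nesting of the keys, the target name keywordsNormalised and the mappings path are all assumptions made for illustration; check them against the entity_mapper.json file shipped with your installation.

```python
import json

# Hypothetical layout inferred from the bullet descriptions; the real file may differ.
entity_mapper_config = {
    "test": {                                       # the entity group
        "keywords": {                               # the entity to read values from
            "target": "keywordsNormalised",         # hypothetical target entity, same group
            "path": "config/mappings/keywords.txt"  # hypothetical path to the mappings file
        }
    }
}

print(json.dumps(entity_mapper_config, indent=2))
```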

Txt file configuration

Edit the txt file at this path to define the mappings. Mappings are case-sensitive, so you may want to create mappings for all case variations.

A typical configuration would:

  • Use the Trie entity extractor to pull out your entities.

  • Then use this to normalise the values into a single set of master/common terms.

  • You then use this in your filter configuration and not the original entity.

    • This would give your users a simple set of filters without duplicates.
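
The effect of case-sensitive mappings can be sketched as follows. This is an illustration only: the mappings are held in a Python dict rather than read from the txt file (whose format is not shown here), and dropping unmapped values is an assumption rather than documented behaviour.

```python
def normalise(values, mappings):
    """Map each value to its master term (case-sensitive) and drop duplicates, keeping order."""
    seen = set()
    result = []
    for value in values:
        term = mappings.get(value)   # exact, case-sensitive lookup
        if term is not None and term not in seen:
            seen.add(term)
            result.append(term)
    return result


# Hypothetical mappings; note the separate entries for each case variation.
mappings = {
    "UK": "United Kingdom",
    "uk": "United Kingdom",
    "Great Britain": "United Kingdom",
}

print(normalise(["UK", "Great Britain", "Uk"], mappings))
```

Here ‘UK’ and ‘Great Britain’ collapse into one term, while ‘Uk’ is not matched because no mapping exists for that exact casing.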

Extract AI Prompt

This allows you to execute extractive QA models. These are run against the text content stored against Aiimi Insight Engine objects, and the answer is then stored in a metadata field.

Extract AI Prompt

This step works alongside the Insight Engine Model Server which hosts the extractive models.

The following example configuration shows how you can send extractive prompts to the model server. You provide the prompt name as a parameter when setting up the REST step in CHUB.

Generative AI Prompt

Allows you to run large language models at enrichment. You can define prompts which are then run over the text content of a file in Aiimi Insight Engine.

  • This works with the Model Server that hosts both private (Llama2) and cloud-based LLMs (Azure OpenAI).

Generative AI Prompt

This allows you to execute generative prompt models against the text content stored against Aiimi Insight Engine objects. The answer is then stored in a metadata field.

This step works alongside the Insight Engine Model Server, which hosts the generative models.

The following example configuration shows how you can send generative prompts to the model server. You provide the prompt name as a parameter when setting up the REST step in CHUB.

HF Sentence Transformers

Uses the Sentence Transformers framework to generate word embeddings which can be stored as dense vectors within Aiimi Insight Engine. These provide users with a semantic search experience.

HF Sentence Transformers

This step allows you to translate your text content (and other nominated fields) into dense vectors through word embedding algorithms.

We use the Sentence Transformers framework, leveraging both open-source models found on the Hugging Face Hub and our own fine-tuned models.

  1. Create the respective vectors for the dense vector storage in the Control Hub.

    • These need the correct dimensions for the models you are using.

We ship some default configurations that use popular models that perform well over a fairly broad spectrum of information types.

  2. Pass the name of the configuration to use as a parameter when setting up your REST step in CHUB.
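
Because a dense vector field only accepts vectors of the size it was created with, a simple length check can catch a mismatch between model and field early. The sketch below is an illustration under assumptions: check_dimensions is a hypothetical helper and 384 is an example value, not a documented default. If you load a model with the sentence-transformers library yourself, its get_sentence_embedding_dimension() method reports the size to use.

```python
def check_dimensions(embedding, expected_dims):
    """Raise if an embedding does not match the dimensions configured for the vector field."""
    if len(embedding) != expected_dims:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, vector field expects {expected_dims}"
        )
    return embedding


# Hypothetical: a vector field created with 384 dimensions.
check_dimensions([0.0] * 384, expected_dims=384)
```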

HF Sparse Vector

Uses models running with the transformers framework to generate sparse vectors for files within Aiimi Insight Engine. These are stored as Rank Features within Aiimi Insight Engine and enable a search experience that can handle vocabulary mismatch.

Huggingface Named Entity Recognition

Extracts named entities from text or documents using statistical methods. These are stored in entity fields in Aiimi Insight Engine against documents. This step is more accurate, but slower, than spaCy.

Huggingface Named Entity Recognition

The endpoint for this enrichment step is ‘huggingfacener’.

Before using the step, familiarise yourself with the entities that are supported and create any needed entity groups and entities in Control Hub. You should create these as keyword entities and follow the standard camelCase convention for their names. See our guide to creating entities for help.

Values will be written to the following entities in Aiimi Insight Engine. 'ner' is the group name, and person, location and organisation are the entity names:

  • ner.person

  • ner.location

  • ner.organisation

To change where the values are stored, edit the configuration for the step.

Settings

  • max_text_size – Maximum number of characters to use. If the text is larger than this, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • entity_group – This should reflect the entity group used. We suggest ‘ner’.

  • entity_map – This maps classes that are returned by the NER model to your entities. You will only need to change this if you change the model that you use. Please reach out to your Aiimi contact if you do this.

  • entity_validator – This is a regex that is executed against each respective class to validate the value.

  • bad_strings – This is the path to a file that contains a list of disallowed strings for each respective class.

  • minimum_score – This is a decimal number between 0 and 1 that determines how confident the model needs to be for us to accept the value.

  • allow_single_character_entities – Controls whether single-character entities are accepted. Use this setting to avoid single-character entities.

  • nlp_model – The model used to generate the named entities. Reach out to your Aiimi contact before changing this, as other settings such as the entity_map will also need to change.
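
The sketch below shows one way the settings above could combine to accept or reject a candidate entity. It is an illustration, not the step's actual code: the function keep_entity, the class labels PER, LOC and ORG, the order of the checks and the use of a full regex match are assumptions, and bad_strings is shown as an in-memory set rather than a file path.

```python
import re


def keep_entity(label, text, score, settings):
    """Return (entity_name, text) if the candidate passes every check, else None."""
    entity = settings["entity_map"].get(label)
    if entity is None:
        return None                                   # class not mapped to an entity
    if score < settings["minimum_score"]:
        return None                                   # model not confident enough
    if len(text) == 1 and not settings["allow_single_character_entities"]:
        return None                                   # single-character value
    if text in settings["bad_strings"].get(entity, ()):
        return None                                   # explicitly disallowed value
    pattern = settings["entity_validator"].get(entity)
    if pattern and not re.fullmatch(pattern, text):
        return None                                   # fails the validation regex
    return f'{settings["entity_group"]}.{entity}', text


# Hypothetical settings for illustration.
settings = {
    "entity_group": "ner",
    "entity_map": {"PER": "person", "LOC": "location", "ORG": "organisation"},
    "entity_validator": {"person": r"[^\d]+"},        # reject names containing digits
    "bad_strings": {"person": {"Dear Sir"}},
    "minimum_score": 0.8,
    "allow_single_character_entities": False,
}

print(keep_entity("PER", "Ada Lovelace", 0.97, settings))
```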

Language Detection

This determines the language of a document and stores this as a metadata field called ‘language’. It can detect 54 different languages from text.

Language Detection

The endpoint for this enrichment step is ‘language’.

Before using this step, check that the metadata field ‘language’ exists. If it does not, create it with the type keyword. See our guide on creating metadata for help.

Settings

  • max_text_size – Maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • metadata_field – This field should be left as language.

  • language_map – The mapping from country codes to friendly names.
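
Two of these settings are easy to picture in code. The sketch below shows truncation to max_text_size and a lookup in language_map; the function names, the example map and the fallback to the raw code when no friendly name exists are assumptions for illustration.

```python
def prepare_text(text, max_text_size):
    """Keep only the first max_text_size characters."""
    return text[:max_text_size]


def friendly_language(code, language_map):
    """Translate a detected code into a friendly name, falling back to the code itself."""
    return language_map.get(code, code)


# Hypothetical values for illustration.
language_map = {"en": "English", "fr": "French"}

print(prepare_text("Bonjour tout le monde", max_text_size=7))
print(friendly_language("fr", language_map))
```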

Phrase and Topic Detection

Extracts repeating phrases from a document or text that are said to be ‘left right complete’. When writing about a topic, people generally repeat the core concepts and topics several times. This extracts these from the text and creates a list of the core concepts, themes, and topics.

Phrase and Topic Detection

The endpoint for this enrichment step is ‘phrases’.

  1. Check that the metadata group ‘phrases’ exists. If it does not, create it with the type keyword. See our guide on creating metadata for help.

  2. Find the step’s configuration file in:

    • \PythonRestService\config\endpoints\phrases.json

Settings

  • metadata_field_for_phrases – Leave as ‘phrases’.

  • number_of_lingo_phrases – The maximum number of Lingo phrases that should be included.

  • number_of_bigram_phrases – The maximum number of bigram phrases that should be included.

  • number_of_trigram_phrases – The maximum number of trigram phrases that should be included.

  • max_text_size – Maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

Phrase cleaning settings

  • min_word_length – The minimum length of any word in a phrase.

  • max_word_length – The maximum length of any word in a phrase.

  • minimum_number_of_words – The minimum number of words in a phrase.

  • maximum_number_of_words – The maximum number of words in a phrase.

  • minimum_number_of_total_characters – The minimum number of total characters in a phrase.

  • minimum_number_of_real_words – The minimum number of words that don’t contain numbers.

  • only_allow_real_words – Whether to allow only phrases whose words do not contain numbers.

  • bad_strings – A path to a file that contains a list of phrases that are not allowed as phrases.

  • bad_words – A path to a file that contains a list of words that are not allowed in phrases.

  • remove_phrases_that_are_in_entities – A list of entities that should be removed from phrases (typically, you don’t want NER entities to repeat in phrases).
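
To make the cleaning rules concrete, the sketch below applies most of them to a candidate phrase. It is an interpretation of the setting descriptions, not the step's implementation: the function is_clean_phrase is hypothetical, total characters are counted with spaces included, and bad_strings and bad_words are shown as in-memory sets rather than file paths.

```python
def is_clean_phrase(phrase, s):
    """Return True if the phrase satisfies every cleaning setting in s."""
    words = phrase.split()
    # A "real" word is taken here to mean a word without any digits.
    real_words = [w for w in words if not any(ch.isdigit() for ch in w)]
    if not s["minimum_number_of_words"] <= len(words) <= s["maximum_number_of_words"]:
        return False
    if any(not s["min_word_length"] <= len(w) <= s["max_word_length"] for w in words):
        return False
    if len(phrase) < s["minimum_number_of_total_characters"]:
        return False
    if len(real_words) < s["minimum_number_of_real_words"]:
        return False
    if s["only_allow_real_words"] and len(real_words) != len(words):
        return False
    if phrase in s["bad_strings"] or any(w in s["bad_words"] for w in words):
        return False
    return True


# Hypothetical settings for illustration.
cleaning = {
    "min_word_length": 2,
    "max_word_length": 20,
    "minimum_number_of_words": 2,
    "maximum_number_of_words": 4,
    "minimum_number_of_total_characters": 8,
    "minimum_number_of_real_words": 1,
    "only_allow_real_words": False,
    "bad_strings": {"click here"},
    "bad_words": {"lorem"},
}

for candidate in ["data retention policy", "click here", "2021 2022"]:
    print(candidate, "->", is_clean_phrase(candidate, cleaning))
```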

Sentiment

Assigns a sentiment label and score to an object stored within Aiimi Insight Engine.

Document Summaries

This creates a short multi-sentence summary of a document so users can quickly understand what the document is about. There are several algorithms provided, each with different merits.

Document Summaries

The endpoint for this enrichment step is ‘summary’.

  1. Check that the metadata group ‘summary’ exists. If it does not, you will need to create it. See our guide on creating metadata for help.

  2. Find the step’s configuration file in:

    • \PythonRestService\config\endpoints\summary.json

Settings

  • max_text_size – The maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • minimum_sentence_count – The minimum size for a generated summary.

  • maximum_sentence_count – The maximum size for a generated summary.

  • language – Leave this set to English.

  • algorithm – The algorithm to use. This is set to text-rank by default. Other options:

    • luhn

    • edmundson

    • lsa

    • lex-rank

    • sum-basic

    • kl

  • metadata_field_for_summary – The field used to store the summary. This should be summary.
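
The percentage_of_numbers_allowed_in_sentence setting can be pictured as a filter applied before sentences reach the summariser. The sketch below is one possible interpretation: it counts whitespace-separated tokens that contain a digit and treats the setting as a value from 0 to 100. Both choices, and the function names, are assumptions; the service may measure this differently.

```python
def percentage_of_numbers(sentence):
    """Percentage of whitespace-separated tokens that contain at least one digit."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for t in tokens if any(ch.isdigit() for ch in t))
    return 100.0 * numeric / len(tokens)


def usable_sentences(sentences, percentage_of_numbers_allowed_in_sentence):
    """Keep only sentences whose share of numeric tokens is within the allowed limit."""
    return [
        s for s in sentences
        if percentage_of_numbers(s) <= percentage_of_numbers_allowed_in_sentence
    ]


sentences = [
    "Revenue grew steadily over the year.",
    "12 34 56 78 totals 180",
]
print(usable_sentences(sentences, percentage_of_numbers_allowed_in_sentence=30))
```

With a limit of 30, only the first sentence is kept because five of the six tokens in the second contain digits.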
