Step Configuration



This section contains a guide for each enrichment step hosted by the REST Enrichment Service.

Bert Chinese NER

Provides named entity recognition for Chinese text. It supports person, location and organisation classes.

Bert Chinese NER Configuration

Values will be written to the following entities in Aiimi Insight Engine, where 'ner' is the group name, and person, location and organisation are the entity names:

  • ner.person

  • ner.location

  • ner.organisation

To change where values are stored, edit the configuration for the step. Remember also to make sure the entities have been created in the Control Hub (CHUB).
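The mapping from model classes to the group.entity names above can be pictured as follows. This is an illustrative sketch only; the configuration keys shown are assumptions, not the exact schema of the shipped configuration file.

```python
# Hypothetical sketch of how the Bert Chinese NER step's output classes
# map to Aiimi Insight Engine entities. Keys here are illustrative.
config = {
    "entity_group": "ner",       # group name in Aiimi Insight Engine
    "entity_map": {              # NER class -> entity name
        "PER": "person",
        "LOC": "location",
        "ORG": "organisation",
    },
}

def target_entity(ner_class):
    """Return the fully qualified entity name for an NER class."""
    return "{}.{}".format(config["entity_group"], config["entity_map"][ner_class])
```

Editing the group or map in the step configuration changes where values land, which is why the matching entities must exist in CHUB first.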

Classification

This uses Aiimi’s clustering and classification framework to classify documents using a pre-trained model.

Classification Configuration

The endpoint for this enrichment step is ‘classify’.

Model Set

You will need a ‘model set’ that has been trained and built for your documents.

  1. Check the model set consists of the following files with the exact names:

    • FE.pckl

    • Models.pckl

    • PP.pckl

  2. Create a folder in the ‘models’ subfolder, which can be found in the root of the Python REST Service.

  3. Place these files in the new folder.
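The three-file check above can be automated with a short sketch; the service root path passed in is an assumption for illustration.

```python
from pathlib import Path

# Checks that a model set folder contains the three required files
# named in the steps above.
REQUIRED = {"FE.pckl", "Models.pckl", "PP.pckl"}

def validate_model_set(models_root, model_set):
    """Return the set of required files missing from the model set folder."""
    folder = Path(models_root) / model_set
    present = {p.name for p in folder.glob("*.pckl")} if folder.is_dir() else set()
    return REQUIRED - present
```

An empty return value means the model set folder is complete.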

Metadata

By default, the classification will be stored in a metadata field called ‘classification’.

You will need to create a metadata field for this within the Control Hub and set it to a type keyword.

In some instances of Aiimi Insight Engine this metadata field may already exist.

Invoke the Classify Step

  1. Create a REST enrichment step in your pipeline.

  2. Configure the step to call the ‘classify’ endpoint.

    • Add ‘model_set=name’ to your configuration to pass a parameter to invoke the specific model set for your documents.

      • For example: classify?model_set=aiimi

Configuration for the step can be found in config/endpoints/classify.json

default_model_set – This model set is used when a model_set is not specified in your REST step configuration.

number_of_models_to_cache – Classification models are not thread-safe so we build a set of models to use for inbound requests.

  • Setting this to a larger number potentially means you can get more done faster, but you will also use more memory.

  • There is little point setting this higher than the REST step concurrency setting. Some empirical testing will help you arrive at the ideal value. Reach out to your Aiimi contact for advice on this.
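The endpoint invocation described above, including the optional model_set parameter, can be sketched as follows. Only the relative endpoint path is shown; the host and port depend on your deployment.

```python
from urllib.parse import urlencode

# Builds the classify endpoint path, with an optional model_set
# parameter as in the 'classify?model_set=aiimi' example above.
def classify_endpoint(model_set=None):
    path = "classify"
    if model_set:
        path += "?" + urlencode({"model_set": model_set})
    return path
```

When no model_set is given, the service falls back to default_model_set from classify.json.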

Entity Mapper

Maps values found in one entity to synonyms of that value, then stores them in another entity. This step is useful if you want to normalise numerous values into a single common value/term, which can simplify filters.

Entity Mapper Configuration

The endpoint for this enrichment step is ‘entity_mapper’.

To use this step you must create the target entity and choose the right data type. It must be a keyword if you want to use it as a filter.

Json file configuration

  1. Edit the config/endpoints/entity_mapper.json file:

  • In the example below:

    • test – The entity group.

    • keywords – The entity.

    • target – The name of the entity to write synonym values to.

      • This must be in the same group.

    • path – This points to the mappings for the entity.

You can have more than one configuration, just copy the ‘test’ object and edit it.
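Based on the keys described above (group, entity, target, path), an entity_mapper.json configuration might be shaped roughly like the sketch below. The nesting shown is an illustrative assumption, not the documented schema; check the shipped file for the exact layout.

```python
import json

# Hypothetical shape of an entity_mapper.json configuration object,
# using the 'test' group and 'keywords' entity from the example above.
config = {
    "test": {                                        # the entity group
        "keywords": {                                # the entity to read values from
            "target": "normalisedKeywords",          # entity (same group) for synonym values
            "path": "config/mappings/keywords.txt",  # mappings file for this entity
        }
    }
}
print(json.dumps(config, indent=2))
```

To add another configuration, copy the ‘test’ object under a new group name and edit it.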

Txt file configuration

Edit the txt file at the path given in the configuration to define the mappings. Mappings are case sensitive, so you may want to create mappings for all case variations.

A typical configuration would:

  • Use the Trie entity extractor to pull out your entities.

  • Then use this to normalise the values into a single set of master/common terms.

  • You then use this in your filter configuration and not the original entity.

    • This would give your users a simple set of filters without duplicates.
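The normalisation flow above can be sketched as follows. The mappings file format shown (synonym=master value, one per line) is an illustrative assumption, not the documented format; note that lookups are case sensitive, as described above.

```python
# Sketch of applying a case-sensitive synonym mappings file to values
# extracted by the Trie entity extractor. File format is assumed.
MAPPINGS_TEXT = """\
UK=United Kingdom
U.K.=United Kingdom
uk=United Kingdom
"""

def load_mappings(text):
    mappings = {}
    for line in text.splitlines():
        if "=" in line:
            synonym, master = line.split("=", 1)
            mappings[synonym] = master  # case sensitive: no lowercasing
    return mappings

def normalise(values, mappings):
    """Map each extracted value to its master term where a mapping exists."""
    return [mappings.get(v, v) for v in values]
```

Filtering on the normalised target entity then gives users a single de-duplicated filter value per concept.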

Extract AI Prompt

This allows you to execute extractive QA models. These are run against the text content stored against Aiimi Insight Engine objects, and the answer is then stored in a metadata field.

Extract AI Prompt

This step works alongside the Insight Engine Model Server which hosts the extractive models.

To send extractive prompts to the model server, provide the prompt name as a parameter when setting up the REST step in CHUB.
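The flow described above can be pictured with a short sketch: a named prompt runs against a document's text content and the answer is written to a metadata field. The document shape, field names, and the stubbed model call are all illustrative assumptions, not the service's actual interface.

```python
# Illustrative sketch only: an extractive QA prompt runs against text
# content and the answer lands in a metadata field. answer_fn stands in
# for the Insight Engine Model Server call.
def run_extractive_prompt(document, prompt_name, answer_fn):
    answer = answer_fn(prompt_name, document["text"])
    document.setdefault("metadata", {})[prompt_name] = answer
    return document

doc = {"text": "The policy number is P-12345."}
doc = run_extractive_prompt(
    doc, "policyNumber",
    lambda prompt, text: text.split("is ")[1].rstrip("."))
```

In the real pipeline the prompt name is the parameter you pass to the REST step, and the Model Server performs the extraction.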

Generative AI Prompt

Allows you to run large language models at enrichment. You can define prompts which are then run over the text content of a file in Aiimi Insight Engine.

  • This works with the Model Server, which hosts both private (Llama2) and cloud-based LLMs (Azure OpenAI).

Generative AI Prompt

This allows you to execute generative prompt models against the text content stored against Aiimi Insight Engine objects; the answer is then stored in a metadata field.

This step works alongside the Insight Engine Model Server, which hosts the generative models.

To send generative prompts to the model server, provide the prompt name as a parameter when setting up the REST step in CHUB.

HF Sentence Transformers

Uses the Sentence Transformers framework to generate word embeddings which can be stored as dense vectors within Aiimi Insight Engine. These provide users with a semantic search experience.

HF Sentence Transformers

This step allows you to translate your text content (and other nominated fields) into dense vectors through word embedding algorithms.

We use the Sentence Transformers framework, leveraging both open-source models found on the Hugging Face hub and our own fine-tuned models.

  1. Create the respective vectors for the dense vector storage in the Control Hub.

    • These need the correct dimensions for the models you are using.

We ship some default configurations that use popular models that perform well over a fairly broad spectrum of information types.

  2. Pass the name of the configuration to use as a parameter when setting up your REST step in CHUB.
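Step 1 above says the dense vector fields created in the Control Hub need the correct dimensions for the model in use. A small sketch of that check, using the published output dimensions of two popular Sentence Transformers models (384 and 768 respectively, per their model cards):

```python
# Dense vector fields must match the embedding model's output size.
# Extend this table for the models you actually deploy.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}

def check_vector_config(vector_dims, model_name):
    """Return True if the configured vector field matches the model."""
    return MODEL_DIMS.get(model_name) == vector_dims
```

A mismatch here is a common cause of indexing failures, so it is worth verifying before running the pipeline.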

HF Sparse Vector

Uses models running with the transformers framework to generate sparse vectors for files within Aiimi Insight Engine. These are stored as Rank Features within Aiimi Insight Engine and enable a search experience that can handle vocabulary mismatch.

Huggingface Named Entity Recognition

Extracts named entities from text or documents using statistical methods. These are stored in entity fields in Aiimi Insight Engine against documents. This step is more accurate, but slower, than Spacy.

Huggingface Named Entity Recognition

The endpoint for this enrichment step is ‘huggingfacener’.

Values will be written to the following entities in Aiimi Insight Engine, where 'ner' is the group name, and person, location and organisation are the entity names:

  • ner.person

  • ner.location

  • ner.organisation

To change where the values are stored edit the configuration for the step.

Settings

  • max_text_size – Maximum number of characters to use. If the text is larger than this, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • entity_group – This should reflect the entity group used. We suggest ‘ner’.

  • entity_map – This maps classes that are returned by the NER model to your entities. You will only need to change this if you change the model that you use. Please reach out to your Aiimi contact if you do this.

  • entity_validator – This is a regex that is executed against each respective class to validate the value.

  • bad_strings – This is the path to a file that contains a list of disallowed strings for each respective class.

  • minimum_score – This is a decimal number between 0-1 that determines how confident the model needs to be for us to accept the value.

  • allow_single_character_entities – Whether single-character entities are allowed. Leave disabled to avoid them.

  • nlp_model – The model used to generate the named entities. Reach out to your Aiimi contact before changing this, as other settings such as the entity_map will also need to change.
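The validation settings above combine to accept or reject each candidate entity value. The sketch below shows one plausible way they interact; the threshold, regex, and bad-strings list are examples, not shipped defaults.

```python
import re

# Illustrative combination of minimum_score, entity_validator,
# bad_strings and allow_single_character_entities.
SETTINGS = {
    "minimum_score": 0.8,
    "allow_single_character_entities": False,
    "entity_validator": r"^[A-Za-z][A-Za-z .'-]*$",
    "bad_strings": {"unknown", "n/a"},
}

def accept(value, score, s=SETTINGS):
    if score < s["minimum_score"]:
        return False  # model not confident enough
    if len(value) == 1 and not s["allow_single_character_entities"]:
        return False  # single-character entities rejected
    if value.lower() in s["bad_strings"]:
        return False  # disallowed string for this class
    return re.match(s["entity_validator"], value) is not None
```

In the real service each entity class has its own validator regex and bad-strings file.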

Language Detection

This determines the language of a document and stores this as a metadata field called ‘language’. It can detect 54 different languages from text.

Language Detection

The endpoint for this enrichment step is ‘language’.

This determines the language of a document and stores this as a metadata field called ‘language’.

Settings

  • max_text_size – Maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • metadata_field – This field should be left as language.

  • language_map – The country code to use when friendly name mapping.
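The language_map setting above translates the detector's code into a friendly name before it is stored in the ‘language’ metadata field. A minimal sketch, with example codes only:

```python
# Illustrative friendly-name mapping from detected language codes.
LANGUAGE_MAP = {"en": "English", "fr": "French", "zh": "Chinese"}

def store_language(document, detected_code):
    """Store the friendly language name, falling back to the raw code."""
    document.setdefault("metadata", {})["language"] = LANGUAGE_MAP.get(
        detected_code, detected_code)
    return document
```

Codes without a mapping fall through unchanged, so an incomplete map degrades gracefully.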

Phrase and Topic Detection

Extracts repeating phrases from a document or text that are said to be ‘left right complete’. When writing about a topic, people generally repeat the core concepts and topics several times. This extracts these from the text and creates a list of the core concepts, themes, and topics.

Phrase and Topic Detection

The endpoint for this enrichment step is ‘phrases’.

  1. Find the step's configuration file in:

    • \PythonRestService\config\endpoints\phrases.json

Settings

  • metadata_field_for_phrases – Leave as ‘phrases’.

  • number_of_lingo_phrases – The maximum number of Lingo phrases that should be included.

  • number_of_bigram_phrases – The maximum number of bigram phrases that should be included.

  • number_of_trigram_phrases – The maximum number of trigram phrases that should be included.

  • max_text_size – Maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

Phrase cleaning settings

  • min_word_length – The minimum length of any word in a phrase.

  • max_word_length – The maximum length of any word in a phrase.

  • minimum_number_of_words – The minimum number of words in a phrase.

  • maximum_number_of_words – The maximum number of words in a phrase.

  • minimum_number_of_total_characters – The minimum number of total characters in a phrase.

  • minimum_number_of_real_words – The minimum number of words that don’t contain numbers.

  • only_allow_real_words – Whether to only allow phrases that are words that do not contain numbers.

  • bad_strings – A path to a file that contains a list of phrases that are not allowed as phrases.

  • bad_words – A path to a file that contains a list of words that are not allowed in phrases.

  • remove_phrases_that_are_in_entities – A list of entities that should be removed from phrases (typically, you don’t want NER entities to repeat in phrases).
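Most of the cleaning settings above are simple length and word-count checks on each candidate phrase. The sketch below applies a subset of them; the values are examples, not shipped defaults.

```python
# Illustrative application of the phrase-cleaning settings to a
# candidate phrase. Values are examples only.
RULES = {
    "min_word_length": 2,
    "max_word_length": 20,
    "minimum_number_of_words": 2,
    "maximum_number_of_words": 5,
    "minimum_number_of_total_characters": 6,
    "only_allow_real_words": True,  # reject words containing numbers
}

def keep_phrase(phrase, r=RULES):
    words = phrase.split()
    if not (r["minimum_number_of_words"] <= len(words) <= r["maximum_number_of_words"]):
        return False
    if len(phrase) < r["minimum_number_of_total_characters"]:
        return False
    for w in words:
        if not (r["min_word_length"] <= len(w) <= r["max_word_length"]):
            return False
        if r["only_allow_real_words"] and any(c.isdigit() for c in w):
            return False
    return True
```

The bad_strings, bad_words and entity-removal settings then filter the surviving phrases further.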

Sentiment

Assigns a sentiment label and score to an object stored within Aiimi Insight Engine.

Document Summaries

This creates a short multi-sentence summary of a document so users can quickly understand what the document is about. There are several algorithms provided, each with different merits.

Document Summaries

The endpoint for this enrichment step is ‘summary’.

  1. Find the step's configuration file in:

    • \PythonRestService\config\endpoints\summary.json

Settings

  • max_text_size – The maximum number of characters to use. If the text is larger than this value, the first n characters up to max text size are used.

  • percentage_of_numbers_allowed_in_sentence – The maximum percentage of numbers allowed in a sentence for it to be considered in summary generation. This helps avoid sentences that are largely numbers.

  • minimum_sentence_count – The minimum size for a generated summary.

  • maximum_sentence_count – The maximum size for a generated summary.

  • language – Leave this set to English.

  • algorithm – The algorithm to use; this is set to text-rank by default. Other options:

    • luhn

    • edmundson

    • lsa

    • lex-rank

    • sum-basic

    • kl

  • metadata_field_for_summary – The field used to store the summary. This should be summary.
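Putting the settings above together, a summary.json might look roughly like the sketch below. The values are illustrative, not shipped defaults; only the key names follow the settings documented above.

```python
import json

# Hypothetical contents of config/endpoints/summary.json, built from
# the settings described above. Values are examples only.
summary_config = {
    "max_text_size": 100000,
    "percentage_of_numbers_allowed_in_sentence": 30,
    "minimum_sentence_count": 2,
    "maximum_sentence_count": 5,
    "language": "English",
    "algorithm": "text-rank",
    "metadata_field_for_summary": "summary",
}
print(json.dumps(summary_config, indent=2))
```

Swapping the algorithm value for one of the other options listed above (luhn, lsa, lex-rank, and so on) changes how sentences are scored for inclusion.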

Before using these steps, check the following in the Control Hub:

  • Familiarise yourself with the entities each step supports and create any entity groups and entities it needs. Create these as keyword entities and follow the standard camelCase convention for their names.

  • Check the metadata field ‘language’ exists. You will need to create it as a type keyword if it does not.

  • Check the metadata field ‘phrases’ exists. You will need to create it as a type keyword if it does not.

  • Check the metadata field ‘summary’ exists. You will need to create it if it does not.
