Aiimi Insight Engine Primo

Spacyner

Please note: the configuration for this step will shortly be ported to a configuration file, so you will no longer have to edit the Python code. In the meantime, edit the endpoints/spacyner.py file.

The SpacyNER enrichment step uses statistical named entity recognition to extract entities such as people, locations, organisations, and geopolitical references. These are then stored against documents in entity fields in Aiimi Insight Engine.

Before using the step, you will need to familiarise yourself with the supported entities and then create an entity group and the entities in Control Hub. You should create these as keyword entities and follow the camelCase standard for their names.

Supported entities are:

  • PERSON – People, including fictional.

  • GPE – Countries, cities, states.

  • NORP – Nationalities or religious or political groups.

  • FAC – Buildings, airports, highways, bridges, etc.

  • ORG – Companies, agencies, institutions, etc.

  • LOC – Non-GPE locations, mountain ranges, bodies of water.

  • PRODUCT – Objects, vehicles, foods, etc. (Not services.)

  • EVENT – Named hurricanes, battles, wars, sports events, etc.

  • WORK_OF_ART – Titles of books, songs, etc.

  • LAW – Named documents made into laws.

  • LANGUAGE – Any named language.

  • DATE – Absolute or relative dates or periods.

  • TIME – Times smaller than a day.

  • PERCENT – Percentage, including “%”.

  • MONEY – Monetary values, including unit.

  • QUANTITY – Measurements, as of weight or distance.

  • ORDINAL – “first”, “second”, etc.

  • CARDINAL – Numerals that do not fall under another type.
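As an illustration of the camelCase naming mentioned earlier, the sketch below derives one possible entity name from each supported label. The helper `label_to_camel_case` is hypothetical and not part of the product; you are free to choose any camelCase names you prefer when creating the entities in Control Hub.

```python
# Hypothetical helper: suggests a camelCase entity name for each label.
# These names are only suggestions, not names required by the product.

SUPPORTED_LABELS = [
    "PERSON", "GPE", "NORP", "FAC", "ORG", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
]


def label_to_camel_case(label: str) -> str:
    """Convert a label such as WORK_OF_ART to camelCase (workOfArt)."""
    first, *rest = label.lower().split("_")
    return first + "".join(part.capitalize() for part in rest)


# Mapping of label -> suggested entity name, e.g. "WORK_OF_ART" -> "workOfArt".
suggested_names = {label: label_to_camel_case(label) for label in SUPPORTED_LABELS}
```

In practice you would probably pick more descriptive names for the short labels (for example, a fuller name for GPE), since the mechanical conversion only changes the casing.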

The following configuration is required in the spacyner.py file, which can be found in the endpoints folder.

  • ENTITY_GROUP – this should be the name of the entity group you created in Control Hub.

  • ENTITY_MAP – for each entity that you are interested in, map it to the entity name that you created in Control Hub. Any entity left as an empty string will be ignored.

  • ENTITY_VALIDATOR – for each entity returned by the model you can configure a regular expression that is used to validate the value.
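The three settings above might look something like the following. This is a minimal sketch, not the contents of the shipped spacyner.py: the example values, the dictionary shapes, and the `map_entities` helper are all assumptions made for illustration, and the real file may structure these settings differently.

```python
import re

# Assumed shapes and example values; check the real spacyner.py before editing.
ENTITY_GROUP = "namedEntities"  # hypothetical entity group name from Control Hub

ENTITY_MAP = {
    "PERSON": "person",        # model label -> entity name created in Control Hub
    "ORG": "organisation",
    "GPE": "place",
    "DATE": "",                # empty string: this label is ignored
}

ENTITY_VALIDATOR = {
    # Assumed rule: a person value must start with a letter and contain
    # only letters, spaces, full stops, apostrophes, and hyphens.
    "PERSON": r"^[A-Za-z][A-Za-z .'-]+$",
}


def map_entities(found):
    """Group (label, value) pairs by mapped entity name, dropping ignored
    labels and values that fail their validator (illustrative logic only)."""
    result = {}
    for label, value in found:
        name = ENTITY_MAP.get(label, "")
        if not name:
            continue  # unmapped or mapped to an empty string
        pattern = ENTITY_VALIDATOR.get(label)
        if pattern and not re.fullmatch(pattern, value):
            continue  # value does not satisfy the regular expression
        result.setdefault(name, []).append(value)
    return result
```

For example, passing `[("PERSON", "Ada Lovelace"), ("PERSON", "12345"), ("DATE", "yesterday")]` to `map_entities` keeps only the first pair: the second fails the validator and the third is mapped to an empty string.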

There are also some parameters that help remove noise when the default spaCy models are used. These are by no means scientific in their approach, but they provide a simple way of applying domain-specific cleaning to what NER finds and can be very effective.

  • BAD_WORDS – this is an array of words that can’t appear in an entity value.

  • BAD_STRINGS – this is an array of phrases that entities can’t match.

  • PERCENTAGE_OF_GOOD_WORDS – for any given sentence, this is the proportion of tokens in the sentence that need to be recognisable English words (expressed as a decimal). Only sentences meeting this rule are processed further. For example, 0.75 means that ¾ of the tokens in a sentence need to be valid words.
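The noise parameters above could be applied along the following lines. This is a sketch under stated assumptions rather than the product's implementation: the example values, the small `KNOWN_WORDS` set standing in for an English dictionary, the case-insensitive comparisons, and both helper functions are hypothetical.

```python
# Assumed example values; the real spacyner.py may use different ones.
BAD_WORDS = ["page", "figure"]            # words that must not appear in a value
BAD_STRINGS = ["all rights reserved"]     # whole phrases a value must not equal
PERCENTAGE_OF_GOOD_WORDS = 0.75

# Stand-in for an English dictionary (assumption, for illustration only).
KNOWN_WORDS = {"the", "report", "was", "written", "by", "john", "smith"}


def sentence_is_clean(sentence, known_words=KNOWN_WORDS):
    """True when enough tokens in the sentence are recognisable words."""
    tokens = sentence.lower().split()
    if not tokens:
        return False
    good = sum(1 for token in tokens if token.strip(".,;:!?") in known_words)
    return good / len(tokens) >= PERCENTAGE_OF_GOOD_WORDS


def entity_is_clean(value):
    """True when the value is not a bad string and contains no bad word."""
    lowered = value.lower()
    if lowered in BAD_STRINGS:
        return False
    return not any(word in BAD_WORDS for word in lowered.split())
```

With these example values, a sentence such as "the report was zz" passes because exactly three of its four tokens are known words, while an entity value such as "Page 12" is rejected because it contains a bad word.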

Aiimi have a research programme that continues to enhance NER model training, reinforcement, domain-specific models, and value validation. Capabilities in this space will therefore continue to evolve and improve.

The endpoint for this enrichment step is ‘spacyner’.
