HTML Cleaner Service
The HTML Text Cleaner service is an optional service used by the Web Sites source connector. It provides an alternative way to parse HTML into text: it tries to capture the main content of a web page while ignoring the menus and other links around that content.
Before relying on it, test it against the web sites you are crawling to confirm that it captures the right content. If it does not, you can revert to the standard text extraction option.
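To make the idea of "main content versus surrounding menus" concrete, here is a minimal sketch that uses only the Python standard library. It is not the algorithm used by the HTML Text Cleaner service. The list of skipped tags and the names `MainTextExtractor` and `extract_main_text` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements whose text is usually navigation or page chrome rather than content.
# This list is an assumption for illustration only.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # number of skipped elements currently open
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <main><h1>Release notes</h1><p>Version 2 adds search filters.</p></main>
  <footer>Contact us</footer>
</body></html>
"""
print(extract_main_text(sample))
```

On this sample the navigation links and footer text are dropped and only the heading and paragraph remain. Real pages are rarely this tidy, which is why testing against your own sites matters.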
Open the AIModel Service.
Enable the 'BeautifulSoupHTMLCleaner'.
Once enabled, import the new config using: InsightMaker.IndexUtilities.exe import --ai-registration-configuration C:\tmp\ai.json
Open the appsettings.json file in your source agent folder.
Example location - 'C:\insightMaker\SourceAgent'
Within the advanced object add: "webSites_HTMLCleanerService": "http://127.0.0.1:15008/"
If your source agent and AIModel Service run on different hosts, replace "127.0.0.1" with the hostname of the machine running the AIModel Service.
Save this file.
Restart your Source Agent.
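The appsettings.json change described above can be sketched as a small script. This is only an illustration under stated assumptions: that the file is plain JSON without comments, and that the "advanced" object sits at the top level of the file. The helper names and the `someExistingSetting` key are hypothetical; only the `webSites_HTMLCleanerService` key and its URL come from the steps above.

```python
import json
from pathlib import Path

SETTING_KEY = "webSites_HTMLCleanerService"
DEFAULT_URL = "http://127.0.0.1:15008/"


def set_cleaner_url(settings, url=DEFAULT_URL):
    """Return a copy of settings with the cleaner URL set under "advanced"."""
    updated = dict(settings)
    # Assumption: "advanced" is a top-level object; create it if it is missing.
    advanced = dict(updated.get("advanced", {}))
    advanced[SETTING_KEY] = url
    updated["advanced"] = advanced
    return updated


def update_appsettings(path, url=DEFAULT_URL):
    """Rewrite the file at path in place. Assumes plain JSON without comments."""
    file = Path(path)
    # "utf-8-sig" reads files both with and without a byte order mark.
    settings = json.loads(file.read_text(encoding="utf-8-sig"))
    file.write_text(json.dumps(set_cleaner_url(settings, url), indent=2),
                    encoding="utf-8")


# "someExistingSetting" is a hypothetical placeholder for whatever is already there.
example = {"advanced": {"someExistingSetting": True}}
print(json.dumps(set_cleaner_url(example), indent=2))
```

Existing keys in the advanced object are left untouched. If you use something like `update_appsettings`, back up the file first, and pass the AIModel Service host in the URL when it runs on a different machine from the source agent.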
Within the Control Hub, go to Configurations.
Select Edit on the configured Web Sites source and go to the Source tab.
Text Extraction Mode: Select BeautifulSoup HTML Cleaner from the dropdown.
Save the source.