HTML Cleaner Service

The HTML Text Cleaner service is an optional service used by the Web Sites source connector. It provides an alternative way to parse HTML into text that attempts to capture the main content of a web page while ignoring the menus and other links around it.

Before using it, test that it captures the right content from the web sites you are crawling. If it does not, you can revert to the standard text extraction option.

Installation

Copy the HTMLCleanerService from your distribution to a suitable location on the server where the Source Agent is located. We suggest that you place it in a folder called InsightMaker.Python.

Create a Python virtual environment within HTMLCleanerService:

Option 1 – Install Python where no other Python version exists on the server:

  1. Install Python 3.9.13

    1. You need to install Python for ‘all users’ if you plan to run the Python REST Service as a Windows Service.

  2. Create a Python ‘virtual environment’:

    1. Create a venv folder in the root of the HTMLCleanerService folder.

    2. Open a command prompt, navigate to the venv folder, and create the venv with the following python command:

      1. python -m venv ./

Option 2 – Install Python where other Python versions exist on the server:

  1. Install Python 3.9.13

    1. You need to install Python for ‘all users’ if you plan to run the HTMLCleanerService as a Windows Service.

    2. Do not select the option to add Python to the system variables or PATH (this may interfere with existing Python applications running on the server).

  2. Open an administrator command prompt and install virtualenv

    1. pip install virtualenv

  3. Create a Python ‘virtual environment’:

    1. Create a venv folder in the root of the HTMLCleanerService folder.

    2. Open a command prompt, navigate to the venv folder, and create the venv with the following python command:

      1. python -m virtualenv ./ -p="C:\Program Files\Python39\python.exe"

      2. Replace the value of the -p parameter with the path to your Python 3.9.13 executable.
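Whichever option you follow, it can help to confirm that the interpreter you are about to build the virtual environment from is the expected release. The following is a minimal sketch and is not part of the distribution; the function name is_expected_python is hypothetical. Run it with the same python.exe you pass to the venv or virtualenv command.

```python
import sys

# Version named in the installation steps above.
EXPECTED_VERSION = (3, 9, 13)


def is_expected_python(version_info=None, expected=EXPECTED_VERSION):
    """Return True when the major, minor and micro version match `expected`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:3]) == tuple(expected)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if is_expected_python():
        print(f"OK: running Python {found}")
    else:
        print(f"Warning: running Python {found}, expected 3.9.13")
```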

Activate the virtual environment by running activate.bat within the venv/scripts subfolder.
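To confirm that activation took effect in the current command prompt, you can ask the interpreter whether it is running from a virtual environment. This is a minimal sketch, assuming the environment was created by venv or by a recent virtualenv (version 20 or later); the function name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment(prefix=None, base_prefix=None):
    """Return True when the interpreter is running from a virtual environment.

    Inside an environment created by venv (or a recent virtualenv),
    sys.prefix points at the environment while sys.base_prefix points at
    the Python installation it was created from.
    """
    if prefix is None:
        prefix = sys.prefix
    if base_prefix is None:
        base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return prefix != base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtual_environment())
```

If the interpreter path printed is not inside the venv folder, the activate script was probably not run in this command prompt.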

Run the following in the root of the HTMLCleanerService folder: pip install -r requirements_3.9.13.txt

Edit run.bat to point at the right path for both the activate script and the waitress_server.py file.

Now test the service by running the run.bat file. You should see the following:

You can install this as a Windows Service by using nssm.exe, which should be found in the HTMLCleanerService folder. To do this, run nssm install "InsightMaker HTMLCleanerService" and select the run.bat file as the path.

To use the service from the Web Sites source connector, you will need to add webSites_HTMLCleanerService to the Source Agent's appsettings.json file.
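After editing the file, a quick check that the key is actually present can catch typos in the setting name. The sketch below only looks for the key and makes no assumption about where in the file it belongs or what its value should be; the helper names find_setting and setting_present are hypothetical. It also assumes the file is plain JSON without comments, which Python's json module cannot parse.

```python
import json


def find_setting(node, key):
    """Return a list of values stored under `key` anywhere in parsed JSON."""
    matches = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                matches.append(value)
            # Keep searching nested objects and arrays.
            matches.extend(find_setting(value, key))
    elif isinstance(node, list):
        for item in node:
            matches.extend(find_setting(item, key))
    return matches


def setting_present(path, key="webSites_HTMLCleanerService"):
    """Return True when `key` appears at any depth in the JSON file at `path`."""
    # utf-8-sig tolerates a byte order mark at the start of the file.
    with open(path, encoding="utf-8-sig") as handle:
        return bool(find_setting(json.load(handle), key))
```

For example, setting_present("appsettings.json") returns True once the key has been added, provided the script is run from the folder that contains the file.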

If you have installed this service on another server, then change the server name or IP address accordingly. You will also need to edit the trusted IP address setting in the config/config.json file.
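When the service runs on another server, a basic reachability check from the Source Agent machine can help separate network problems from configuration problems. This is a minimal sketch using only the standard library; the function name can_connect is hypothetical, and the host and port are values you must supply from your own configuration, since they are not stated here.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A successful connection only shows that something is listening on that port. It does not confirm that the trusted IP address setting allows requests from the Source Agent.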

Finally, select the following option on the Web Sites source within Control Hub.
