Private Generative AI

To run large language models such as Llama2 or Llama3 on your virtual machine, you need to obtain the models and install some additional requirements with pip.

Clone the models into the models subfolder of AIModelService using: git clone
- For large models, like Llama3 70B, this may take some time.

Running larger models in 4-bit mode to save GPU memory

pip install bitsandbytes
pip install setuptools
pip install accelerate
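
As a rough illustration of why 4-bit mode matters, the sketch below estimates the memory needed for the model weights alone at different precisions, then shows how a local model folder could be loaded in 4-bit mode using the Hugging Face transformers library directly. How the Model Service loads models internally is not covered here: `estimate_weight_gib` and `load_in_4bit` are hypothetical helper names, the quantisation settings are assumptions, and the estimate ignores activations and the KV cache.

```python
def estimate_weight_gib(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def load_in_4bit(model_dir: str):
    """Load a local Hugging Face model folder with 4-bit quantised weights.

    Hypothetical helper; needs torch, transformers, bitsandbytes and accelerate.
    """
    # Imported lazily so the estimate above works without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumed setting, not from this guide
        bnb_4bit_compute_dtype=torch.float16,  # assumed setting, not from this guide
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        quantization_config=quant_config,
        device_map="auto",  # device placement is handled by accelerate
    )
    return tokenizer, model


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B parameters at {bits}-bit: ~{estimate_weight_gib(70, bits):.0f} GiB")
```

For a 70B-parameter model, the weights need roughly 130 GiB at 16-bit precision and about a quarter of that at 4-bit, which is why 4-bit mode can make larger models fit on a single GPU.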

Run GGUF format models with our HuggingfaceGenerativeLlamaCpp provider

You will need to install llama-cpp-python from source.

Install a version of the CUDA toolkit that is compatible with your GPU driver.
Run the following to build and install llama-cpp-python into the Model Service venv.
- set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
- set FORCE_CMAKE=1
- pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
When you enable the model, you should see your GPU memory utilisation increase.
- If it does not, the model is loading into system RAM instead of GPU memory. Revisit the previous steps to fix this.
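
One way to check the GPU memory increase is to compare `nvidia-smi` readings taken before and after enabling the model. The sketch below is a minimal example, assuming `nvidia-smi` is on the PATH; the helper names and the 1024 MiB threshold are hypothetical choices for illustration, not part of the Model Service.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory (MiB) on each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


def gpu_memory_grew(before, after, min_increase_mib: int = 1024) -> bool:
    """True if any GPU gained at least min_increase_mib of used memory."""
    # The 1024 MiB default is an arbitrary illustrative threshold.
    return any(a[0] - b[0] >= min_increase_mib for b, a in zip(before, after))


def check_model_loaded_on_gpu() -> None:
    """Interactive check: take a reading, wait for the model, take another."""
    before = query_gpu_memory()
    input("Enable the model, wait for it to load, then press Enter...")
    after = query_gpu_memory()
    if gpu_memory_grew(before, after):
        print("GPU memory increased: the model appears to be on the GPU.")
    else:
        print("No GPU memory increase: the model is probably in system RAM.")
```

Call `check_model_loaded_on_gpu()` from a Python session on the virtual machine just before enabling the model.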

Troubleshooting

If you have a newer version of the CUDA toolkit installed, the pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir step may fail with an error.

The following steps resolve this issue.

Execute: set CMAKE_ARGS="-DGGML_CUDA=on"
Rerun: pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
- This may take up to an hour to build from source.
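
The same rebuild can be scripted so that the build variables apply only to the pip process rather than to the whole shell session. This is a minimal sketch, assuming it is run with the Python interpreter of the Model Service venv; the function names are hypothetical, and FORCE_CMAKE=1 is carried over from the earlier build step.

```python
import os
import subprocess
import sys


def build_env(cmake_args: str = "-DGGML_CUDA=on") -> dict[str, str]:
    """Copy the current environment and add the build variables."""
    env = dict(os.environ)
    env["CMAKE_ARGS"] = cmake_args
    env["FORCE_CMAKE"] = "1"
    return env


def pip_command() -> list[str]:
    """The reinstall command, run through the interpreter of the active venv."""
    return [sys.executable, "-m", "pip", "install", "--upgrade",
            "--force-reinstall", "llama_cpp_python", "--no-cache-dir"]


def rebuild_llama_cpp_python() -> None:
    """Build and install llama-cpp-python from source; this can take a long time."""
    # check=True raises CalledProcessError if the build fails.
    subprocess.run(pip_command(), env=build_env(), check=True)
```

Calling `rebuild_llama_cpp_python()` starts the build. Using `sys.executable -m pip` keeps the install inside whichever environment is running the script.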
