Private Generative AI
To run large language models such as Llama 2 or Llama 3 on your virtual machine, you need to obtain the models and install some additional Python packages with pip.
Obtain the models. For Hugging Face models, we recommend the following:
Clone the models into the models subfolder of AIModelService using:
git clone
For large models, such as Llama 3 70B, the clone may take some time.
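Hugging Face repositories typically store weight files with Git LFS, so a clone made without Git LFS can leave small pointer stubs in place of the real weights. The following is a minimal sketch, not part of the product, that scans a cloned model folder for such stubs. The function name find_lfs_pointers and the list of weight file extensions are assumptions made for illustration.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS specification.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Assumed set of weight file extensions; adjust for the models you use.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".gguf"}


def find_lfs_pointers(model_dir):
    """Return names of weight files in model_dir that are still LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file() and path.suffix in WEIGHT_SUFFIXES:
            # Only the first few bytes are read, so large weight files are cheap to check.
            with path.open("rb") as handle:
                if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                    stubs.append(path.name)
    return stubs
```

Call find_lfs_pointers with the folder you cloned the model into. An empty list means no pointer stubs were found. Any names returned indicate files whose contents were not downloaded.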
Running larger models in 4-bit mode to save GPU memory
Install the following additional packages with pip:
pip install bitsandbytes
pip install setuptools
pip install accelerate
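To illustrate why 4-bit mode helps, the sketch below estimates the memory needed for the weights alone and shows one way a model can be loaded in 4-bit with the Hugging Face transformers library. It is an illustration only: how the Model Service loads models internally is not described here, the function names are hypothetical, and the estimate ignores activations, the KV cache, and quantisation overhead.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def load_model_4bit(model_dir):
    """Load a causal LM with 4-bit weights.

    Assumes torch, transformers, bitsandbytes and accelerate are installed
    and that model_dir points at a locally cloned Hugging Face model.
    """
    # Imported lazily so the estimate above works without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # quantise weights to 4 bits on load
        bnb_4bit_compute_dtype=torch.float16,  # run the matrix maths in float16
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        quantization_config=quant_config,
        device_map="auto",  # device placement is handled by accelerate
    )
    return tokenizer, model


# Compare the weight footprint of a 70B-parameter model at 16-bit and 4-bit.
for bits in (16, 4):
    size = estimate_weight_memory_gb(70e9, bits)
    print(f"70B parameters at {bits}-bit: ~{size:.0f} GB")
```

For a 70B-parameter model, the weights need roughly 140 GB at 16-bit and roughly 35 GB at 4-bit, which is why 4-bit mode can make larger models fit on a given GPU.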
Run GGUF format models with our HuggingfaceGenerativeLlamaCpp provider
You will need to install llama-cpp-python from source.
Install a CUDA toolkit version that is compatible with your GPU driver.
Run the following to build and install llama-cpp-python into the Model Service venv.
set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
set FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
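After the install finishes, you can check whether the resulting build reports GPU offload support. This is a hedged sketch: it assumes a llama-cpp-python version that exposes llama_supports_gpu_offload (older versions may not), and the wrapper name gpu_offload_supported is hypothetical. Run it with the Python interpreter from the Model Service venv.

```python
def gpu_offload_supported():
    """Return True or False if llama_cpp reports GPU offload support, else None."""
    try:
        import llama_cpp
    except (ImportError, OSError, RuntimeError):
        # llama-cpp-python is not installed, or its native library failed to load.
        return None
    check = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if check is None:
        # This version of llama-cpp-python does not expose the check.
        return None
    return bool(check())


print("GPU offload supported:", gpu_offload_supported())
```

A result of False suggests the package was built without CUDA support. None means the check could not be performed in this environment.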
When you enable the model, you should see your GPU memory utilisation increase.
If it does not, the model is loading into system memory (RAM) rather than GPU memory, which suggests that llama-cpp-python was built without CUDA support. Revisit the previous steps to fix this.
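One way to observe GPU memory utilisation is to query nvidia-smi before and after enabling the model. The sketch below assumes nvidia-smi is on the PATH and reports numeric values for every GPU. The function names are hypothetical.

```python
import subprocess


def parse_memory_csv(output):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) on each GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_memory_csv(result.stdout)
```

Call query_gpu_memory() once before enabling the model and once after. If the used value does not rise noticeably, the model is probably not being loaded onto the GPU.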
Troubleshooting
If you have a newer version of the CUDA toolkit installed, the pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir step may fail with an error.
The following steps resolve this issue.
Execute:
set CMAKE_ARGS=-DGGML_CUDA=on
Rerun:
pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
Building from source may take up to an hour.
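The build and its troubleshooting fallback can also be scripted. The sketch below is an illustration under stated assumptions, not the documented procedure. It runs the same pip command using the Python interpreter it is launched from, so run it with the Model Service venv's Python. It tries the CMAKE_ARGS value from the main steps first, then retries with the value from Troubleshooting if pip exits with an error. The names build_env and install_with_cuda are hypothetical.

```python
import os
import subprocess
import sys

# Same pip command as in the steps above, run via the current interpreter.
PIP_COMMAND = [
    sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall",
    "llama_cpp_python", "--no-cache-dir",
]

# Flag from the main steps first, then the flag from Troubleshooting.
CUDA_FLAGS = ["-DLLAMA_CUBLAS=ON", "-DGGML_CUDA=on"]


def build_env(cmake_flag, base_env=None):
    """Return a copy of the environment with the build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = cmake_flag
    env["FORCE_CMAKE"] = "1"
    return env


def install_with_cuda(run=subprocess.run):
    """Try each CUDA flag in turn; return the flag that built, or None."""
    for flag in CUDA_FLAGS:
        result = run(PIP_COMMAND, env=build_env(flag))
        if result.returncode == 0:
            return flag
    return None
```

Calling install_with_cuda() starts the build, so expect it to run for a long time. The run parameter exists only so the logic can be exercised without starting a real build.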