Private Generative AI
To run large language models such as Llama2 or Llama3 on your virtual machine, you will need to obtain the models and install some additional requirements with pip.
Obtain models - For Hugging Face we recommend:
Clone the models into the AIModelService sub models folder using:
git clone
For large models, like Llama3 70B, this may take some time.
Running larger models in 4-bit mode to save GPU memory
pip install bitsandbytes
pip install setuptools
pip install accelerate
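To see why 4-bit mode matters, the following is a rough back-of-the-envelope sketch, not a measurement. It assumes weight memory is approximately parameter count times bits per weight divided by 8, and it ignores activations, the KV cache and quantisation overhead, so real usage will be higher. The function name weight_memory_gb and the model sizes are illustrative only.

```python
# Rough estimate of the GPU memory needed for model weights alone.
# Assumption: memory ~= parameters x bits per weight / 8. Activations,
# KV cache and quantisation overhead are ignored, so real usage is higher.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Return approximate weight memory in GB (1 GB = 10**9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# Illustrative model sizes, in billions of parameters.
for name, size in [("8B", 8), ("70B", 70)]:
    full = weight_memory_gb(size, 16)
    quantised = weight_memory_gb(size, 4)
    print(f"{name}: 16-bit ~{full:.0f} GB, 4-bit ~{quantised:.0f} GB")
```

Under these assumptions a 70B model drops from roughly 140 GB of weights at 16-bit to roughly 35 GB at 4-bit.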
Run GGUF format models with our HuggingfaceGenerativeLlamaCpp provider
You will need to install llama-cpp-python from source.
Install the CUDA toolkit version that is compatible with your GPU driver.
Run the following to build and install llama-cpp-python into the Model Service venv.
set CMAKE_ARGS=-DLLAMA_CUBLAS=ON
set FORCE_CMAKE=1
pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
When you enable the model, you should see your GPU memory utilisation increase.
If it does not, the model is loading into system (CPU) memory instead of GPU memory. Revisit the previous steps to fix this.
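One way to check the GPU memory increase is to compare nvidia-smi readings taken before and after enabling the model. The sketch below assumes nvidia-smi is on the PATH. The helper names and the 500 MiB threshold are hypothetical choices for illustration and are not part of the Model Service.

```python
import subprocess

def parse_gpu_memory(csv_text: str) -> list:
    """Parse 'used, total' lines (values in MiB) into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def query_gpu_memory() -> list:
    """Ask nvidia-smi for used and total memory per GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)

def gpu_memory_increased(before: list, after: list,
                         min_increase_mib: int = 500) -> bool:
    """True if any GPU's used memory grew by at least min_increase_mib.

    The threshold is an arbitrary illustrative value; adjust it to the
    size of the model being loaded.
    """
    return any(a[0] - b[0] >= min_increase_mib
               for b, a in zip(before, after))

# Usage: call query_gpu_memory() before enabling the model, enable it,
# call query_gpu_memory() again, then compare the two readings with
# gpu_memory_increased(before, after).
```

If the comparison returns False after the model has been enabled, the model is probably not being offloaded to the GPU.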
Troubleshooting
If you have a newer version of the CUDA toolkit installed, the pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir step may fail with an error.
The following steps resolve this issue.
Execute:
set CMAKE_ARGS="-DGGML_CUDA=on"
Rerun:
pip install --upgrade --force-reinstall llama_cpp_python --no-cache-dir
This may take up to an hour to build from source.
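The build can also be scripted. The following is a minimal sketch, assuming it is run with the Python interpreter of the Model Service venv so that pip installs into that environment. The function names are hypothetical. CMAKE_ARGS, FORCE_CMAKE and the pip arguments are the ones given in the steps above.

```python
import os
import subprocess
import sys

def build_env(cmake_args: str) -> dict:
    """Copy the current environment and add the build variables."""
    env = dict(os.environ)
    env["CMAKE_ARGS"] = cmake_args
    env["FORCE_CMAKE"] = "1"
    return env

def pip_command() -> list:
    """Build the pip command, using the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", "--upgrade",
            "--force-reinstall", "llama_cpp_python", "--no-cache-dir"]

def rebuild_llama_cpp(cmake_args: str) -> int:
    """Run the source build and return pip's exit code (0 means success)."""
    return subprocess.run(pip_command(), env=build_env(cmake_args)).returncode

# Not called automatically because the build can take a long time.
# Example: rebuild_llama_cpp("-DGGML_CUDA=on")
```

Passing the variables through the env argument sets them only for the pip process, so the surrounding shell session is left unchanged.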