diff --git a/docs/GGML-llama.cpp-models.md b/docs/GGML-llama.cpp-models.md
deleted file mode 100644
index bcf3c046..00000000
--- a/docs/GGML-llama.cpp-models.md
+++ /dev/null
@@ -1,53 +0,0 @@
-# Using llama.cpp in the web UI
-
-## Setting up the models
-
-#### Pre-converted
-
-Place the model in the `models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`.
-
-#### Convert LLaMA yourself
-
-Follow the instructions in the llama.cpp README to generate the `ggml-model.bin` file: https://github.com/ggerganov/llama.cpp#usage
-
-## GPU acceleration
-
-Enabled with the `--n-gpu-layers` parameter.
-
-* If you have enough VRAM, use a high number like `--n-gpu-layers 200000` to offload all layers to the GPU.
-* Otherwise, start with a low number like `--n-gpu-layers 10` and then gradually increase it until you run out of memory.
-
-To use this feature, you need to manually compile and install `llama-cpp-python` with GPU support.
-
-#### Linux
-
-```
-pip uninstall -y llama-cpp-python
-CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
-```
-
-#### Windows
-
-```
-pip uninstall -y llama-cpp-python
-set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
-set FORCE_CMAKE=1
-pip install llama-cpp-python --no-cache-dir
-```
-
-#### macOS
-
-```
-pip uninstall -y llama-cpp-python
-CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
-```
-
-Here you can find the different compilation options for OpenBLAS / cuBLAS / CLBlast: https://pypi.org/project/llama-cpp-python/
-
-## Performance
-
-This was the performance of llama-7b int4 on my i5-12400F (cpu only):
-
-> Output generated in 33.07 seconds (6.05 tokens/s, 200 tokens, context 17)
-
-You can change the number of threads with `--threads N`.
diff --git a/docs/llama.cpp.md b/docs/llama.cpp.md
new file mode 100644
index 00000000..07d3a1d8
--- /dev/null
+++ b/docs/llama.cpp.md
@@ -0,0 +1,51 @@
+# llama.cpp
+
+llama.cpp is the best backend in two important scenarios:
+
+1) You don't have a GPU.
+2) You want to run a model that doesn't fit into your GPU.
+
+## Setting up the models
+
+#### Pre-converted
+
+Download the ggml model directly into your `text-generation-webui/models` folder, making sure that its name contains `ggml` somewhere and ends in `.bin`. It's a single file.
+
+`q4_K_M` quantization is recommended.
+
+#### Convert Llama yourself
+
+Follow the instructions in the llama.cpp README to generate a ggml: https://github.com/ggerganov/llama.cpp#prepare-data--run
+
+## GPU acceleration
+
+Enabled with the `--n-gpu-layers` parameter.
+
+* If you have enough VRAM, use a high number like `--n-gpu-layers 1000` to offload all layers to the GPU.
+* Otherwise, start with a low number like `--n-gpu-layers 10` and then gradually increase it until you run out of memory.
+
+This feature works out of the box for NVIDIA GPUs. For other GPUs, you need to uninstall `llama-cpp-python` with
+
+```
+pip uninstall -y llama-cpp-python
+```
+
+and then recompile it using the commands here: https://pypi.org/project/llama-cpp-python/
+
+#### macOS
+
+For macOS, these are the commands:
+
+```
+pip uninstall -y llama-cpp-python
+CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir
+```
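+
+## Starting the web UI with a ggml model
+
+As a rough sketch, assuming the standard `python server.py` entry point, a launch command with GPU offloading and a custom CPU thread count could look like the following. The model filename below is only an example; use the name of the `.bin` file actually present in your `models` folder.
+
+```
+# example filename: substitute the ggml .bin file you downloaded
+python server.py --model llama-7b.ggmlv3.q4_K_M.bin --n-gpu-layers 1000 --threads 8
+```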