    Optimizing KoboldCpp For Roleplaying With AI

    By Wayfarer · July 13, 2025 · Updated: December 1, 2025 · 16 Mins Read

    What scares many people away from local AI roleplay is the need to set everything up themselves. It looks complicated and time-consuming, but it’s not as hard as it seems. The learning curve can be fun, especially when it lets you enjoy your hobby even more.

    In this guide, we’ll show you how to optimize KoboldCpp for roleplaying with AI, from downloading to choosing an LLM model and understanding the functions of various essential settings.

    Table of Contents
    1. Download KoboldCpp
    2. Understanding Which LLM Model To Use With KoboldCpp For Roleplaying With AI
      1. Video Random-Access Memory (VRAM)
      2. Context Size (Context Window)
      3. Model Sizes and Quantizations
      4. VRAM Estimator
    3. Recommended LLM Models
    4. Optimizing KoboldCpp Settings
      1. Hardware
        1. Presets
        2. GPU ID
        3. Low VRAM (No KV offload)
          1. Explanation
        4. GPU Layers
        5. Tensor Split & Main GPU
        6. Threads
        7. BLAS Threads & BLAS Batch Size
        8. Other Settings
      2. Tokens
        1. Use ContextShift
        2. Use FastForwarding
        3. Use Sliding Window Attention (SWA)
        4. Context Size
        5. Default Gen Amt
        6. Custom RoPE Config, RoPE Scale, & RoPE Base
        7. Use FlashAttention
        8. Quantize KV Cache
        9. Other Options
      3. Network
      4. Save Configuration
      5. Troubleshooting and Other Resources

    Download KoboldCpp

    Running KoboldCpp is a straightforward process that requires downloading a single executable file and then double-clicking it; that’s all. Download the file appropriate for your operating system from the release page.

    KoboldCpp Releases Page
    • koboldcpp-linux-x64: For Linux users with modern hardware and an NVIDIA GPU.
    • koboldcpp-linux-x64-nocuda: For Linux users who don’t use an NVIDIA GPU. The developer recommends this version if you have an AMD GPU (Vulkan backend preset).
    • koboldcpp-linux-x64-oldpc: For Linux users with an older CPU or NVIDIA GPU who have issues getting koboldcpp-linux-x64 to work on their system.
    • koboldcpp-mac-arm64: For users on modern macOS (M-series) systems.
    • koboldcpp-nocuda.exe: For Windows users who don’t use an NVIDIA GPU. The developer recommends this version if you have an AMD GPU (Vulkan backend preset).
    • koboldcpp-oldpc.exe: For Windows users with an older CPU or NVIDIA GPU who have issues getting koboldcpp.exe to work on their system.
    • koboldcpp.exe: For Windows users with modern hardware and an NVIDIA GPU.

    Save the file in an easily accessible location on your device, since you’ll interact with it every time you want to launch KoboldCpp. To update KoboldCpp, download the updated file from the release page and overwrite the existing file on your device.

    Note: You may have to give the file permission to “execute as a program” if you double-click on the KoboldCpp file and the program doesn’t launch.

    Understanding Which LLM Model To Use With KoboldCpp For Roleplaying With AI

    KoboldCpp supports most modern GGUF models found on Hugging Face. However, before you begin downloading models, you need to understand which models your device can efficiently run.

    Video Random-Access Memory (VRAM)

    TLDR: For optimizing KoboldCpp for roleplaying with AI, the size of your LLM model, along with your context window size, shouldn’t exceed the amount of VRAM available on your graphics card.

    Explanation

    LLM models work best on GPUs and utilize VRAM, the dedicated memory available on your device’s graphics card. VRAM is different from the system RAM available to your device. You need to know how much VRAM you have available within your GPU.

    GPU VRAM Example

    To put it simply, these models need a fast, dedicated workspace where they can store massive amounts of data and perform a huge number of simple mathematical operations simultaneously. Your GPU and its VRAM provide exactly that: processing power and fast memory tightly coupled together.

    If your device doesn’t have a dedicated graphics card, you can run an LLM model stored in your system memory (RAM) while your CPU handles the mathematical operations. However, the model takes much longer to generate content when you use a RAM and CPU combination.
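
    If you have an NVIDIA GPU, the nvidia-smi tool that ships with the driver reports how much VRAM your card has and how much is currently in use. Below is a small Python sketch that wraps it; it assumes nvidia-smi is on your PATH, and AMD, Intel, and Apple Silicon users will need their platform’s own tools (or a quick look at the system monitor).

```python
# Query NVIDIA VRAM via nvidia-smi (assumes the driver's nvidia-smi is on PATH).
import subprocess

def nvidia_vram():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        name, total_mib, used_mib = [x.strip() for x in line.split(",")]
        free_mib = int(total_mib) - int(used_mib)
        print(f"{name}: {total_mib} MiB total, ~{free_mib} MiB currently free")

if __name__ == "__main__":
    nvidia_vram()
```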

    Context Size (Context Window)

    TLDR: LLM models process data as tokens. Context size is the total number of tokens available as “memory” for your LLM model to reference before generating a response. For optimizing KoboldCpp for roleplaying with AI, your LLM model and the context window need to fit within the VRAM available on your graphics card.

    Explanation

    Tokens are small, standardized units that LLM models use to process data and generate content. On average, one English word is about 1.3 to 1.5 tokens.

    Also Read: Understanding Tokens And Context Size

    Your character’s description, personality, backstory, and world lore, along with your back-and-forth messages with the AI, are all processed as tokens. Context size is the total number of tokens an LLM model “remembers” while generating a response.

    Tokens And Context Size

    A smaller context size gives your AI character a shorter memory during roleplay. A larger context size helps your AI character remember more, but depending on your hardware limits and the model’s capabilities, it can completely break the character and derail the roleplay.

    For most LLM models, the ideal context size ranges between 8,192 and 16,384 tokens. Even the more powerful models, capable of handling a context size of 128,000+ tokens, struggle to use their “memory” effectively beyond the ideal range.
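
    To get a feel for these numbers, here’s a back-of-envelope Python calculation using the tokens-per-word rule of thumb from above. The character card and message lengths are illustrative assumptions rather than measurements, and real token counts depend on the model’s tokenizer.

```python
# Back-of-envelope context budgeting using the ~1.3-1.5 tokens-per-word rule of
# thumb. All word counts below are illustrative assumptions, not measurements.
TOKENS_PER_WORD = 1.4

def tokens(words: int) -> int:
    return round(words * TOKENS_PER_WORD)

context_size   = 8192                 # chosen context window
character_card = tokens(800)          # description, personality, lore
system_prompt  = tokens(150)          # roleplay instructions
per_exchange   = tokens(180) * 2      # your ~180-word message + a similar reply

remaining = context_size - character_card - system_prompt
print(f"Permanent prompt: ~{character_card + system_prompt} tokens")
print(f"Roughly {remaining // per_exchange} exchanges fit before older messages fall out of context")
```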

    Also Read: Context Rot: Large Context Size Negatively Impacts AI Roleplay

    With KoboldCpp, you can also quantize (compress) the context cache, but we’ll get to that later.

    Model Sizes and Quantizations

    TLDR: LLM models come in sizes like 7B, 12B, and 24B. Bigger models seem smarter but may struggle in roleplay. They’re compressed (quantized) to consume less VRAM, but overly compressed quants can become unusable. For optimizing KoboldCpp for roleplaying with AI, don’t go lower than Q4_K_S or IQ4_XS.

    Explanation

    LLM models come in different sizes, measured in billions of parameters and represented by a number followed by “B.”

    Also Read: Understanding LLM Quantization For AI Roleplay

    For example, you’ll find 7B models, 12B models, 24B models, and so on. On paper, larger models seem smarter because they’re trained on more data. But keep in mind, not every large model is fine-tuned for roleplay.

    LLM Model Sizes

    Every model is further quantized. Think of quantization like video compression: a large 4K video gets compressed into smaller versions to save storage space on your device. But each time you compress the video, it loses quality. It goes from 4K to 2K, then to 1080p, 720p, and so on.

    Technically, you can use any quant of a large model that fits into the VRAM available on your GPU, especially the newer I-quants, which shrink the model even further. But if you pick a quant that is compressed too heavily compared to the original model, it may be too dumb to perform well during roleplay.

    LLM Model Quantization

    The lower the quant number, the more compressed and dumber the model is. Q8_0 offers the highest quality with minimal compression, while Q2_K is the lowest. For decent creative writing, like roleplay, you need at least Q4_K_S or IQ4_XS.
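
    As a rough sanity check before downloading, you can estimate a quant’s file size from the model’s parameter count and the quant’s approximate bits per weight. The bits-per-weight figures in this sketch are ballpark community numbers, not exact values, but they’re close enough to judge what will fit on your card.

```python
# Rough GGUF file size from parameter count and approximate bits per weight (bpw).
# The bpw values are ballpark figures and vary slightly from model to model.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_S": 4.6, "IQ4_XS": 4.3, "Q2_K": 3.4}

def approx_size_gb(params_billions: float, quant: str) -> float:
    return params_billions * 1e9 * BPW[quant] / 8 / 1e9   # bytes -> GB

for quant in BPW:
    print(f"12B @ {quant:7s}: ~{approx_size_gb(12, quant):.1f} GB")
```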

    VRAM Estimator

    There’s a handy VRAM estimator that helps you visualize which model, quant, and context size combination will fit within your GPU’s VRAM.

    When using KoboldCpp for roleplaying with AI, avoid running other GPU-intensive programs that eat up your VRAM. Remember, your system also uses VRAM for your display, browser tabs, Spotify, and more.

    Using a model that is larger than your available VRAM will still work, particularly once we tweak the KoboldCpp settings explained further below. However, you’ll sacrifice generation speed, especially as the context cache grows during roleplay.

    Recommended LLM Models

    You can read our model feature articles to find a local model that suits your requirements. AI roleplay is a diverse hobby, with everyone having personal preferences and different expectations from LLMs. We share our observations and conclusions along with chat logs, so you can assess a model’s performance in AI roleplay without spending your own time testing it.

    Remember, optimizing KoboldCpp for roleplaying with AI is just one part of the puzzle! You will also have to optimize SillyTavern for AI Roleplay. In particular, the models will need the correct corresponding “Context Template” and “Instruct Template” within SillyTavern.

    Optimizing KoboldCpp Settings

    Once you launch KoboldCpp, it will display the Quick Launch menu. Since we’ll be going through every setting while optimizing KoboldCpp for roleplaying with AI, we won’t change anything on the Quick Launch menu apart from loading the GGUF Text Model.

    KoboldCpp Quick Launch Menu

    Click “Browse” and navigate to where you saved your LLM GGUF model that you downloaded from Hugging Face.

    Hardware

    The Hardware menu is where you can optimize the LLM model to run as efficiently as possible on your device.

    KoboldCpp Hardware Menu

    Note: We’re using a Linux desktop with an NVIDIA GPU with 11GB VRAM for this guide. The optimization settings may vary if you have an AMD GPU or are going to run your model using CPU and RAM.

    Presets

    This setting allows you to select the backend KoboldCpp will use. If you’re using an NVIDIA GPU, you’ll need to select “Use CuBLAS.” For AMD GPUs, you need to use Vulkan.

    GPU ID

    This setting controls which GPU KoboldCpp selects to load the LLM model onto, and defaults to your device’s GPU. If you’re using a device with multiple GPUs, you can choose which GPU you want the model loaded onto.

    Low VRAM (No KV offload)

    TLDR: Low VRAM mode moves KV Cache (character info, chat history, etc.) from GPU VRAM to system RAM to free up VRAM for the model.

    • For optimal performance, use it only if your model fits entirely in VRAM.
    • Using it if your model doesn’t fit entirely in VRAM significantly slows down token/text generation.
    • Best used when you want to run a higher-quality (less compressed) quant.

    Explanation

    Enabling this setting stores all data within the “KV Cache” (context window, character information, chat history, custom prompts, etc.) in your system RAM instead of your GPU’s VRAM. Moving context data to system RAM frees up VRAM for offloading more layers of your LLM model onto your GPU.

    We recommend using this setting only when you can load your entire LLM model onto your GPU. For example, on our GPU with 11GB VRAM, Forgotten-Abomination-12B-v4.0.Q6_K.gguf consumes 10.1GB of our VRAM, leaving us with just 900MB of VRAM for context, which isn’t sufficient if we want to use a context size of 16,384 tokens.

    By enabling Low VRAM here, we could offload the entire model onto our GPU without lowering the quantization level. Although token/text generation becomes slower, it’s still worth it. You can also quantize your context, which we’ll explain later in the Tokens menu section.
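
    For a sense of scale, the KV cache can be approximated from the model’s layer count, KV-head count, head size, and the context length. The architecture numbers in this sketch are typical for a 12B-class model (an assumption on our part; check your model’s card for the real values), but they show why roughly 900MB is nowhere near enough for a 16,384-token context at full precision.

```python
# Approximate KV cache size: 2 (keys + values) x layers x KV heads x head size
# x context length x bytes per value. The architecture numbers below are typical
# for a 12B-class model and are assumptions; check your model's card.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_value / 1e9

print(f"~{kv_cache_gb(40, 8, 128, 16384):.1f} GB of KV cache at 16,384 context (F16)")
# Roughly 2.7 GB -- far more than the 900MB left over in the example above.
```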

    GPU Layers

    This setting tells KoboldCpp how many of the model’s layers to offload onto your GPU’s VRAM. The default value of -1 auto-detects the appropriate number of layers to offload. In most cases, letting KoboldCpp auto-detect the number of layers is the best option for performance.

    You’ll need to tweak this setting primarily if your model size exceeds your available VRAM and you enable Low VRAM. It’ll take some trial and error, but you can keep increasing the number of layers offloaded and run a benchmark to test if your model loads or KoboldCpp crashes.

    If KoboldCpp crashes or your prompt processing/token generation speed is too low, reduce the number of layers offloaded and rerun the benchmark test. Repeat this process until your benchmark test returns a satisfactory result.
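
    If you prefer to script this trial-and-error loop, every Quick Launch setting is also available as a command-line flag. The sketch below assumes the flag names used by recent KoboldCpp releases and treats a non-zero exit code as a failed run; double-check the names against your build’s --help output before relying on it.

```python
# Crude automation of the trial-and-error loop. Flag names (--model, --gpulayers,
# --contextsize, --benchmark) are assumed from recent KoboldCpp releases; verify
# them with --help on your build.
import subprocess

BINARY = "./koboldcpp-linux-x64"      # adjust to your executable
MODEL  = "model.Q4_K_S.gguf"          # hypothetical model file

def benchmark_with_layers(layers: int) -> bool:
    result = subprocess.run(
        [BINARY, "--model", MODEL, "--gpulayers", str(layers),
         "--contextsize", "16384", "--benchmark"],
        capture_output=True, text=True,
    )
    return result.returncode == 0     # non-zero usually means a crash or OOM

for layers in range(43, 0, -2):       # start near the model's total layer count
    if benchmark_with_layers(layers):
        print(f"Benchmark completed with {layers} layers offloaded to the GPU")
        break
```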

    Tensor Split & Main GPU

    These settings are only applicable if you are using multiple GPUs. You can use the Tensor Split setting to control how KoboldCpp splits the workload between your GPUs. The Main GPU setting allows you to set your primary GPU.

    Threads

    This option controls the number of CPU threads KoboldCpp is allowed to use. The suggested value is appropriate in most cases.

    If you’ve offloaded the LLM model to your system RAM, set the value to match the number of physical CPU cores in your system.
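
    If you’re unsure how many physical cores your CPU has, Python can tell you. Note that os.cpu_count() reports logical threads (usually double the core count on CPUs with hyperthreading), while the third-party psutil package reports physical cores.

```python
# os.cpu_count() reports logical threads; psutil can report physical cores,
# which is the number this setting cares about.
import os
import psutil  # pip install psutil

print("Logical threads:", os.cpu_count())
print("Physical cores :", psutil.cpu_count(logical=False))
```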

    BLAS Threads & BLAS Batch Size

    These options control how many threads KoboldCpp uses for prompt-processing (BLAS) math and the batch size, i.e., the number of tokens processed at once. The suggested values are appropriate in most cases.

    Other Settings

    • Launch Browser: Opens KoboldCpp’s chat interface in your browser. Disable this if you’re using SillyTavern or another frontend.
    • High Priority: Raises KoboldCpp’s process priority above all other applications on your device. Enable only if necessary, as it can cause instability.
    • Use MMAP: Reduces the amount of RAM needed to load the model by reading parts from disk into RAM on demand. Enable only if necessary.
    • Use mlock: Forces the model to stay in RAM after loading. Useful for low-RAM setups. Enable only if necessary.
    • Debug Mode: Shows additional debug information in the terminal.
    • Keep Foreground: Sends the console terminal to the foreground each time a new prompt is generated, preventing idling slowdown issues on Windows.
    • CLI Terminal Only: Runs KoboldCpp from the command line instead of launching the KoboldCpp HTTP server. Enable only if necessary.

    Tokens

    The Tokens menu is where you can optimize the Key-Value (KV) Cache for the best performance on your device.

    KoboldCpp Tokens Menu

    Use ContextShift

    ContextShift automatically removes old tokens from the context window and adds new tokens without requiring any reprocessing. This option helps performance, especially when your context cache is full. We suggest keeping this option enabled.

    Use FastForwarding

    FastForwarding helps LLMs skip reused tokens in the context window that they have already processed in previous generations.

    This setting improves performance by preventing the LLM from reprocessing the entire message history when generating a new response; it only processes the most recent reply. We suggest keeping this option enabled.

    Use Sliding Window Attention (SWA)

    Sliding Window Attention (SWA) reduces the memory required for the KV cache. However, you cannot use SWA and ContextShift together. We suggest keeping this option disabled.

    Context Size

    We explained Context Size/Context Window in detail earlier. Context size is the total number of tokens available as “memory” for your LLM model to reference before generating a response.

    Depending on your available VRAM, we suggest a context window between 8,192 and 16,384 tokens.

    Default Gen Amt

    The Default Generation Amount sets how many tokens the LLM generates when you don’t specify a value. Frontends like SillyTavern usually override this setting, so you typically don’t need to change it.

    Custom RoPE Config, RoPE Scale, & RoPE Base

    A custom RoPE configuration lets you stretch the LLM’s context capacity by adjusting how RoPE calculates positions, allowing it to handle much longer contexts without retraining.

    This option is for advanced users, and KoboldCpp’s official documentation explains the available scaling methods. Enable and modify this only if necessary.

    Use FlashAttention

    FlashAttention is a faster, memory-efficient technique that helps LLMs identify the most important tokens in context. It works only with NVIDIA GPUs, as it requires CUDA/CuBLAS.

    Enabling FlashAttention significantly improves token generation performance and moderately reduces VRAM usage. However, some models don’t support it and may experience performance drops.

    We recommend enabling this option if you have an NVIDIA GPU. If you encounter issues, compare token generation performance, stability, and output coherence with FlashAttention enabled and disabled.

    Quantize KV Cache

    You can quantize (compress) the KV cache with this option to use less VRAM. F16 keeps the cache at full 16-bit precision (no compression), 8-bit applies about 50% compression, and 4-bit applies about 75% compression. To fully benefit from a quantized KV cache, you need to enable FlashAttention.

    Quantizing the KV cache may reduce response quality. We recommend keeping this option disabled [F16 (Off)]. However, you can compare output coherence with a quantized KV cache if you want to test it.
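
    Applying those ratios to the ~2.7GB F16 KV cache we estimated earlier for a 12B-class model at a 16,384-token context gives a quick sense of the potential savings (all figures approximate):

```python
# Approximate KV cache VRAM at 16,384 context for the 12B-class example above.
f16_gb = 2.7
print(f"F16 (Off): ~{f16_gb:.1f} GB")
print(f"8-bit    : ~{f16_gb * 0.5:.1f} GB")   # about 50% compression
print(f"4-bit    : ~{f16_gb * 0.25:.1f} GB")  # about 75% compression
```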

    Other Options

    Most LLMs don’t understand negative prompting. The Enable Guidance option activates Classifier-Free Guidance, allowing the LLM to process both positive and negative prompts. This option lets you tell the model not only what to include but also what to avoid.

    However, using the Enable Guidance option results in severe performance drops. We recommend not enabling this option.

    The No BOS Token, MoE Experts, Override KV, and Override Tensor options are for advanced users. You can read KoboldCpp’s official documentation if you wish to modify these options.

    Network

    The Network menu is where you’ll adjust options related to KoboldCpp’s API, which frontends like SillyTavern will use.

    KoboldCpp Network Menu
    • Port: This is the port where KoboldCpp hosts its web server, with 5001 as the default. Change it only if another application is already using port 5001. If you’re using a firewall, you may need to add rules to allow connections to this port.
    • Host: By default, KoboldCpp’s web server listens on all network interfaces, which means it can be accessed from any device on the local network. You can change this to “localhost” to restrict access to only the device running the program.
    • Multiuser Mode: Allows multiple people to share a single KoboldCpp web server by queuing requests and dispatching them to the correct clients. You can disable this option unless you are sharing KoboldCpp over a LAN or through an internet tunnel with multiple people.
    • Remote Tunnel: Creates a cloudflared tunnel to access KoboldCpp over the internet using a URL. Enable the option only if required.
    • Quiet Mode: Prevents all generation-related terminal output and shows only the final results and some end-of-process statistics. Enable the option only if required.
    • NoCertify Mode (insecure): Allows insecure SSL connections and lets you bypass certificate restrictions. Enable the option only if required.
    • Shared Multiplayer: Hosts a shared multiplayer session that others can join.
    • Enable WebSearch: This feature enables the local search engine proxy to augment your queries with web searches. Only relevant if you’re using KoboldAI Lite as your frontend.
    • SSL Cert & SSL Key: Security certificates to use SSL (https://) for your KoboldCpp web server. You can generate self-signed certificates using OpenSSL.
    • Password: You can set a password to use as an API key on frontends like SillyTavern.

    If you wish to learn more about how to set up KoboldCpp for remote access over LAN/Internet, read the official documentation. If you’re going to be using a frontend like SillyTavern, having a localhost setup where the frontend and backend can interact with each other is sufficient.
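
    Before pointing a frontend at KoboldCpp, you can confirm the server is reachable with a quick script. The minimal Python sketch below sends a generation request to the KoboldAI-compatible API that KoboldCpp exposes on its web server; the prompt is just a placeholder, and the Authorization header is only needed if you set a Password above.

```python
# Minimal request to KoboldCpp's KoboldAI-compatible API using the requests
# package. The Authorization header is only needed if a Password is set.
import requests

URL = "http://localhost:5001/api/v1/generate"        # default host and port
payload = {
    "prompt": "You are Aria, a sarcastic tavern keeper.\nUser: Hello!\nAria:",
    "max_length": 120,
    "temperature": 0.8,
}
headers = {"Authorization": "Bearer your-password"}   # omit if no password is set

response = requests.post(URL, json=payload, headers=headers, timeout=300)
response.raise_for_status()
print(response.json()["results"][0]["text"])
```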

    Save Configuration

    To easily load all your tweaked settings with the selected GGUF model, click on “Save Config” and save the file with the .kcpps extension.

    Next time you launch KoboldCpp, you just need to click on “Load Config” and select your saved configuration file. It helps you switch between multiple configurations without having to edit individual settings manually.
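
    If you’re curious about what gets stored, the saved .kcpps file appears to be plain JSON (worth verifying on your version), so you can inspect or compare configurations from Python:

```python
# Peek at a saved configuration. The file name is hypothetical, and the
# assumption that .kcpps files are plain JSON should be verified on your version.
import json

with open("roleplay-12b.kcpps") as f:
    config = json.load(f)

for key, value in sorted(config.items()):
    print(f"{key}: {value}")
```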

    Troubleshooting and Other Resources

    We’ve written this guide to optimizing KoboldCpp for roleplaying with AI, drawing on our experience with KoboldCpp on Linux and an NVIDIA GPU.

    You may run into issues specific to Windows or other GPUs that we haven’t covered. Below, we’ve listed other great resources where you can learn more or find answers to your particular issues.

    • The official KoboldCpp documentation and the KoboldAI Discord Server.
    • Sukino’s Practical Index to AI Roleplay, and Guides and Tips for AI Roleplay. Sukino’s guides offer a lot of valuable and well-written information to anyone interested in improving their experience with AI roleplay.
    • The r/KoboldAI subreddit.