    Run KoboldCpp On Runpod
By Wayfarer · October 13, 2025 · 6 Mins Read

    Running LLMs locally requires a desktop or laptop with decent hardware. Those who have a gaming or productivity rig with a dedicated GPU can run small to medium models locally without breaking a sweat.

    If you don’t have a dedicated GPU or have one with 6GB or less VRAM, running even small models locally at an acceptable quant can be challenging. But that doesn’t mean you should give up on running LLMs with KoboldCpp and the privacy benefits that come along with it.

    Table of Contents
    1. KoboldCpp On Runpod
    2. How To Run KoboldCpp On Runpod
      1. Select A GPU
      2. Configure Your Pod
      3. Connecting To KoboldCpp On Runpod
      4. Incomplete Setup
      5. Privacy While Using Runpod
      6. Troubleshooting And Help
    3. Run KoboldCpp On Runpod

    KoboldCpp On Runpod

    RunPod is an all-in-one cloud platform for training, fine-tuning, and running inference with LLMs. It lets you rent GPUs to run KoboldCpp and use models that your local hardware cannot handle.

    Runpod Introduction

    Runpod is a reliable, easy, and low-cost option for renting GPUs to run LLMs for AI roleplay. If you are looking for a free and limited option to run small models, learn how to run KoboldCpp on Google Colab.

    How To Run KoboldCpp On Runpod

    Sign up on Runpod using your Google account and KoboldCpp’s referral link. By registering with their referral link, you receive a one-time credit of $5 or more from Runpod when you add $10 to your account.

    Select A GPU

    To run KoboldCpp on Runpod, navigate to Manage > Pods and select a GPU that has enough VRAM to fit your model and context size. Use this VRAM calculator to estimate the required memory.

    Runpod Pods GPU Selection

    In this guide, we are using TheDrummer’s Cydonia 24b v4.1. The Q4_K_S quantization is 12.57 GB, and a 16,384 context size needs 4.16 GB. Since we require 17GB VRAM, we will select the RTX A4500, as it’s the cheapest option at $0.25 per hour.

    Your pod can have more than one GPU. Runpod shows how many GPUs you can add to your pod below the pricing information (e.g., 1 max, 8 max, etc.). For example, if you need 80GB VRAM, you can use two A40s at $0.80 per hour.
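The back-of-the-envelope math above (model file size plus KV cache for your context) can be sketched as a small helper. This is a rough approximation, not a substitute for a proper VRAM calculator; the layer/head/dimension numbers in the comments are hypothetical examples, and real usage also depends on the backend and quantization.

```python
# Rough VRAM estimate for picking a Runpod GPU: weights + KV cache + headroom.
# This is an approximation -- use a real VRAM calculator for actual decisions.

def kv_cache_gb(context: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB (keys + values across all layers, fp16)."""
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem / 1024**3

def required_vram_gb(model_file_gb: float, kv_gb: float,
                     overhead_gb: float = 1.0) -> float:
    """Model weights + KV cache + a little headroom for activations/buffers."""
    return model_file_gb + kv_gb + overhead_gb

# Hypothetical architecture: 40 layers, 8 KV heads, head_dim 128, 16K context.
kv = kv_cache_gb(context=16384, n_layers=40, n_kv_heads=8, head_dim=128)
total = required_vram_gb(model_file_gb=12.57, kv_gb=kv)
```

With the article's Q4_K_S file size (12.57 GB) plugged in, anything the estimate returns under a GPU's VRAM (e.g., the RTX A4500's 20 GB) should fit with room to spare.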

    Configure Your Pod

    Once you select your GPU, click the Change Template button. Search for “Kobold” and select KoboldCpp’s official template from the list.

    KoboldCpp Template On Runpod

    Then, click on the Edit button to modify the template.

    • Container Disk: Temporary storage that Runpod charges you for by the hour. We recommend setting the same amount of storage as your VRAM to keep your costs low.
    • Volume Disk: Keep the default setting of 0 GB. Change it only if you need persistent storage for your pod.
    • Volume Mount Path: Keep the default setting of /workspace.
    • Expose HTTP Ports (Max 10): Keep the default setting of 5001.
    • Expose TCP Ports: Keep the default port setting of 22.
    KoboldCpp Template Overrides

    Next, expand the Environment Variables options.

    • KCPP_MODEL: Enter the link to your GGUF model from Hugging Face.
    • KCPP_ARGS: Only change the context size value in this option from the default 4096 to your preferred value. Do not modify anything else unless you are sure of what you are doing.
    • KCPP_IMGMODEL: Enter the link to your image generation model. Click X to remove if not required.
    • KCPP_WHISPERMODEL: Enter the link to your speech-to-text (transcription) model. Click X to remove if not required.
    • KCPP_TTSMODEL: Enter the link to your text-to-speech model. Click X to remove if not required.
    • KCPP_EMBEDMODEL: Enter the link to your embedding model. Click X to remove if not required.
    Once configured, click the Set Overrides button.
    KoboldCpp Template Overrides Environment Variables
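For context, the template's environment variables map onto KoboldCpp's Docker image, so the equivalent local invocation looks roughly like the sketch below. The image name, the exact Hugging Face URL pattern, and the flag passed through KCPP_ARGS are assumptions; check KoboldCpp's documentation for the current image and supported arguments.

```shell
# Sketch: running KoboldCpp's Docker image with the same environment
# variables the Runpod template exposes (image name and URL are placeholders).
docker run --gpus all -p 5001:5001 \
  -e KCPP_MODEL="https://huggingface.co/<user>/<repo>/resolve/main/<model>.gguf" \
  -e KCPP_ARGS="--contextsize 16384" \
  koboldai/koboldcpp
```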

    Runpod will show a summary of your pod’s pricing and configuration. Click on Deploy On-Demand to start your pod.

    Connecting To KoboldCpp On Runpod

    Once you deploy your pod, it takes 2 to 3 minutes to start KoboldCpp and load your model (the time may vary depending on your model size). You can monitor the progress through the Logs (container tab) while viewing your pod’s information.

    Runpod Logs Loading KoboldCpp Model

    Once the model is loaded, Runpod will provide you with Cloudflare tunnel links to access KoboldAI Lite and KoboldCpp’s API.

    Runpod Cloudflare Tunnel Links

    You can also access the link through the Connect menu. Runpod may report that the link is not ready, but as long as the logs show your remote tunnel is active, you can connect to KoboldCpp. Use these Cloudflare tunnel links on your frontend, like SillyTavern, to connect to KoboldCpp’s API.

    Runpod KoboldCpp HTTP Service
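If you prefer to script against the API directly rather than use a frontend, a minimal client looks like the sketch below. The tunnel URL is a placeholder for the link Runpod gives you, and the payload shows only the basic fields of KoboldCpp's generate endpoint; parameters like temperature and sampler settings can be added to the same payload.

```python
# Sketch: calling KoboldCpp's /api/v1/generate endpoint through the
# Cloudflare tunnel link Runpod provides. Replace the URL with your own.
import json
import urllib.request

def build_generate_request(base_url: str, prompt: str, max_length: int = 200):
    """Build the endpoint URL and JSON payload for a generate call."""
    endpoint = base_url.rstrip("/") + "/api/v1/generate"
    payload = {"prompt": prompt, "max_length": max_length}
    return endpoint, payload

def generate(base_url: str, prompt: str, max_length: int = 200) -> str:
    """POST the prompt to KoboldCpp and return the generated text."""
    endpoint, payload = build_generate_request(base_url, prompt, max_length)
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["results"][0]["text"]

# Usage (with your own tunnel link):
# text = generate("https://your-tunnel.trycloudflare.com", "Once upon a time,")
```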

    Incomplete Setup

    If your logs end with “Could not load text model,” then there is something wrong with your pod’s configuration or the modification you made to KoboldCpp’s template.

    Either the Context Size you chose was too high and there wasn’t enough VRAM available to allocate for KV Cache, or the model size/quant was too large. Refer to the logs to identify what went wrong, update your configuration, and try again.

    Privacy While Using Runpod

    Although you are not running models “locally” when using Runpod, you still have more control over your data compared to using cloud providers to access LLMs. KoboldCpp’s Docker template ensures your prompts and generations are private. All data is wiped once you terminate your pod, unless you set up persistent storage.

    However, it is still a cloud service, and 100% privacy cannot be guaranteed. Avoid sharing any personally identifiable information in your conversations, and adhere to Runpod’s Terms of Service.

    Troubleshooting And Help

    The logs provide useful information to help you understand what’s going wrong if Runpod fails to load your model. However, if you can’t figure it out on your own, you can ask for help on KoboldAI’s Discord server or the r/KoboldAI subreddit.

    Run KoboldCpp On Runpod

    If your hardware can’t run LLMs locally or you want to use larger models than your system can handle, you can run KoboldCpp on RunPod. KoboldCpp’s Docker template makes setup simple, and you can connect any frontend to KoboldCpp’s API using the Cloudflare tunnel links provided by RunPod.

    Runpod is a reliable and low-cost service to run LLMs for AI roleplay. If you’re looking for a free, limited alternative for smaller models, consider running KoboldCpp on Google Colab.

    Running KoboldCpp on Runpod gives you more control over your data compared to other LLM providers, but since it’s still a cloud service, avoid sharing personal information and follow Runpod’s Terms of Service.
