In our model feature articles, we observe how LLMs perform in AI roleplay across five different character cards and scenarios. Our conversations with the LLMs range from 12 to over 60 messages, depending on the scenario.
AI roleplay is a diverse hobby, with everyone having personal preferences and different expectations from LLMs. We share our observations and conclusions, along with chat logs, so that readers can draw their own conclusions about a model’s performance.
Not An “Objective” Review
We don’t “objectively” review models. Our conclusions are based on our experience, understanding of the character and scenario, and personal preferences. It’s subjective and isn’t meant to be a standardized benchmark.
Personal preference (and “vibes”) plays a huge role in how much someone enjoys a specific model. We roleplay with the character cards using a model, and our “hands-on” experience shapes our observations and conclusions.
We don’t use another LLM to judge or score the roleplay. It wouldn’t be able to understand how many times we had to re-roll for the perfect response, the edits we made to the model’s replies, or the other factors that influence the overall experience.
What Matters
- Character Adherence: How well does a model depict a character? Does it stay true to the character’s core traits? How often does it act out of character?
- Character Portrayal: Do the dialogue and narration help the character feel unique and not generic?
- Character Consistency: Does the model make decisions that align with the character’s core traits? Does it take liberties, forgoing character traits in favor of plot progression?
- Character Depth: How effectively does the model use information from the character card? Does it naturally incorporate backstories and other details?
- Theme Adherence: Do the model’s responses align with the theme of the roleplay (e.g., medieval, fantasy, comedy)?
- User Effort: How much effort did the user need to put in to receive good responses?
- Instruction Following: How well does the model follow instructions?
- Vibes: The overall “fun” factor during the conversation.
What Doesn’t Play A (Major) Role
- Longform or Creative Writing: Judging a model’s writing quality adds another layer of subjectivity. Unless any glaring issues negatively affect the experience, the quality of the writing does not play a major role. We are collaborating with a co-author in a turn-based roleplay, not writing novels.
- Slop: Repetition and slop affect the overall “fun” factor during the conversation and play some role in our conclusions. However, they are not major factors, and we often give smaller models more leeway.
Model Size
We don’t hold an 8B model to the same standard as a 15B. Nor do we hold a 24B model to the same standard as a 70B. Smaller models are given more leeway, and we are more forgiving of their mistakes and flaws. The larger the model, the higher our expectations.
Conclusion “Grading”
Conclusions are based on our experience and observations, and we keep the “grading” simple.
- Decent: The model didn’t consistently meet the What Matters criteria.
- Good: The model met the What Matters criteria but could have performed better in some areas.
- Great: The model met the What Matters criteria and performed up to our expectations.
Conversation Logs
All of our model feature articles include complete conversation logs between us and the character using the model we are writing about.
Additionally, we enhance our user input with an AI assistant to maintain a consistent input style. We’re human: some days we lack creativity, and factors like fatigue and concentration can drastically affect our input. The AI assistant helps us keep our input consistent.
The AI assistant only enhances our input; we remain in complete control of the narrative. For transparency, we also publish our conversation logs with the AI assistant.
