janitorBench, JanitorAI’s Benchmark For Evaluating LLMs

By Wayfarer | November 9, 2025 | 5 Mins Read

JanitorAI users roleplay with AI-powered characters using a variety of LLMs. The platform processes over 1 trillion tokens daily, the majority through its free in-house JanitorLLM.

    Last week, JanitorAI revealed janitorBench, a benchmark for evaluating LLMs “based on real conversations with real users.” Currently, the benchmark ranks models solely based on users’ ratings of the responses they receive.

    Table of Contents
    1. Evaluating LLMs’ Performance In Multi-Turn Conversations
    2. janitorBench And Its Current Evaluation Method
      1. The Flaw With User Ratings
      2. More Users Need To Provide Ratings
    3. Future Potential For LLM Providers
    4. janitorBench, JanitorAI’s Benchmark For Evaluating LLMs

    Evaluating LLMs’ Performance In Multi-Turn Conversations

    JanitorAI’s large user base and the number of tokens it processes daily make it an attractive source for gathering data to evaluate the real-world performance of LLMs in creative, long, multi-turn conversations.

    There are existing benchmarks that rank LLMs based on their creativity or multi-turn conversational ability. However, none have had access to 2 million daily users to gather real-world feedback on a model’s performance.

    Our large scale and third-party model support give us a unique set of data to evaluate the performance of a range of different AI models in real time. Now, we’re sharing this data with the community.

JanitorAI, introducing janitorBench.

    On JanitorAI, after each LLM response, users can leave a rating from 1 to 5 stars. The platform has previously stated that these ratings do not affect the output quality for external models used via APIs and Proxies, and are mainly intended to gather feedback for JanitorLLM.

    janitorBench And Its Current Evaluation Method

    JanitorAI tracks user message ratings (1-5 stars), engagement time, and “behavioral signals across millions of conversations.”
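
As a rough sketch, the feedback JanitorAI describes could be captured in a record like the one below. Every field name here is an assumption made for illustration; the platform has not published its internal schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MessageFeedback:
    """One feedback event for a single LLM response.

    All field names are illustrative assumptions; JanitorAI has not
    published its internal schema.
    """
    model_id: str               # e.g. "janitorllm" or an external model slug
    user_id: str                # anonymized rater
    rating: int | None          # 1-5 stars, or None if the user skipped rating
    engagement_seconds: float   # time spent on the response (a behavioral signal)
    timestamp: datetime

    def __post_init__(self) -> None:
        # Enforce the 1-5 star scale described in the article.
        if self.rating is not None and not 1 <= self.rating <= 5:
            raise ValueError("star ratings must be between 1 and 5")
```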

Currently, janitorBench evaluates LLMs using average message ratings. A minimum of 5,000 ratings is necessary for a model to be considered for evaluation. The leaderboard updates every 12 hours but does not provide any raw data.

    janitorBench Model Leaderboard
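
As an illustration of how such an average-rating leaderboard could be computed, here is a minimal Python sketch. The `build_leaderboard` helper and its aggregation logic are assumptions based only on what the article states (average star rating, 5,000-rating minimum); JanitorAI's actual pipeline is not public.

```python
from collections import defaultdict

MIN_RATINGS = 5000  # threshold stated for janitorBench eligibility

def build_leaderboard(ratings: list[tuple[str, int]]) -> list[tuple[str, float, int]]:
    """Rank models by average star rating from (model_id, stars) pairs."""
    totals: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for model_id, stars in ratings:
        totals[model_id] += stars
        counts[model_id] += 1

    # Models below the minimum rating count are excluded from the board.
    board = [
        (model_id, totals[model_id] / counts[model_id], counts[model_id])
        for model_id in counts
        if counts[model_id] >= MIN_RATINGS
    ]
    board.sort(key=lambda row: row[1], reverse=True)  # highest average first
    return board
```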

    The developers plan to expand janitorBench’s evaluation to include additional signals, such as engagement time and behavioral signals. They also plan to create category-specific leaderboards and perform A/B testing of LLM responses on identical prompts for head-to-head comparison.
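
The planned A/B testing amounts to pairwise comparison of models on identical prompts. Below is a minimal sketch of one common way to score such duels, a simple win-rate table; JanitorAI has not said which method it will actually use.

```python
def head_to_head_win_rates(
    duels: list[tuple[str, str, str]],
) -> dict[tuple[str, str], float]:
    """Score A/B duels, where each duel is (model_a, model_b, winner).

    Returns, for each unordered model pair, the win rate of the
    alphabetically first model. This generic approach is an assumption;
    JanitorAI has not detailed its planned comparison method.
    """
    wins: dict[tuple[str, str], int] = {}
    games: dict[tuple[str, str], int] = {}
    for a, b, winner in duels:
        pair = (a, b) if a <= b else (b, a)  # canonical unordered pair
        games[pair] = games.get(pair, 0) + 1
        if winner == pair[0]:
            wins[pair] = wins.get(pair, 0) + 1
    return {pair: wins.get(pair, 0) / games[pair] for pair in games}
```

A rating system such as Elo or Bradley-Terry could be layered on top of the same duel data to turn pairwise results into a single ranking.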

    JanitorAI plans to offer model providers access to data through a dashboard, allowing them to monitor their models’ performance with real-time, real-world data. The developers also intend to make janitorBench data available to creators on a per-character basis.

    The Flaw With User Ratings

    JanitorAI’s users rate the LLM’s responses based on their expectations, not on what would objectively be a better reply. For example, some find it acceptable when the LLM speaks for and controls their character, while others do not.

    There are also other variables, such as the response length a user expects, the custom prompts they use, and the quality of their own input, among others. Personal preference plays a significant role in a user’s rating.

With such a large and diverse user base, a fully objective leaderboard is impossible. Even so, the platform’s sheer size will still provide valuable real-world data to LLM providers, and adding other signals to janitorBench in the future might help offset the subjectivity of individual ratings.
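
One standard way to dampen personal-preference bias is to center each rating against the rater’s own average before aggregating, as sketched below. This is a generic technique, not something JanitorAI has announced.

```python
import statistics

def bias_adjusted_scores(
    ratings: list[tuple[str, str, int]],
) -> dict[str, float]:
    """Average per-model scores after centering each user's rating scale.

    `ratings` holds (user_id, model_id, stars) triples. Subtracting a
    rater's mean rating removes lenient/harsh rating habits, so a positive
    score means a model is rated above its raters' usual level.
    """
    per_user: dict[str, list[int]] = {}
    for user, _, stars in ratings:
        per_user.setdefault(user, []).append(stars)
    user_mean = {u: statistics.mean(v) for u, v in per_user.items()}

    per_model: dict[str, list[float]] = {}
    for user, model, stars in ratings:
        per_model.setdefault(model, []).append(stars - user_mean[user])
    return {m: statistics.mean(v) for m, v in per_model.items()}
```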

    More Users Need To Provide Ratings

    Many JanitorAI users who use external models through APIs and Proxies don’t leave ratings for the LLM’s responses. This is because the platform has stated that ratings do not influence external models’ responses and play no role in improving their chat experience.

    Users who are happy with the LLMs’ responses and have properly configured their parameters and custom prompts are currently less likely to leave ratings. JanitorAI needs to ensure that user ratings remain a valuable signal by encouraging advanced users to start rating the LLMs’ responses.

    Future Potential For LLM Providers

OpenRouter often hosts “stealth” models, which companies like OpenAI and xAI use to gather real-world usage data from users before officially launching their proprietary models.

With its large user base and the developers’ plan to make real-world performance data available to LLM providers in real time, JanitorAI could serve as a useful source for companies to gather data and improve their models.

    It can also serve as a source of revenue for JanitorAI, as data of this scale is a valuable resource.

    janitorBench, JanitorAI’s Benchmark For Evaluating LLMs

janitorBench is JanitorAI’s benchmark for evaluating the performance of LLMs in creative, long, multi-turn conversations. The platform’s large user base helps gather real-world performance data for several LLMs in real time.

    Currently, janitorBench evaluates and scores LLMs based on user-provided star ratings (1-5) for their responses. The developers plan to include additional signals in the evaluation, such as engagement time, and conduct A/B testing on identical prompts for head-to-head comparison.

JanitorAI users’ feedback is subjective and varies based on many factors. The platform’s previous statement that ratings do not influence chat quality for external models has also caused several users with optimized prompts and parameters to refrain from leaving ratings. To maintain the value of user ratings, JanitorAI needs to encourage advanced users to rate the LLMs’ responses.

    The developers also plan to offer model providers access to real-world, real-time model performance data through a dedicated dashboard. This could be a potential revenue source for JanitorAI, as data of this scale is a valuable resource.

    API & Proxies
    Share. Twitter Reddit WhatsApp Bluesky Copy Link
    Wayfarer
    • Website
    • X (Twitter)
