JanitorAI users roleplay with AI-powered characters using a variety of LLMs. The platform processes over 1 trillion tokens daily, the majority through its free in-house JanitorLLM.
Last week, JanitorAI revealed janitorBench, a benchmark for evaluating LLMs “based on real conversations with real users.” Currently, the benchmark ranks models solely based on users’ ratings of the responses they receive.
Evaluating LLMs’ Performance In Multi-Turn Conversations
JanitorAI’s large user base and the number of tokens it processes daily make it an attractive source for gathering data to evaluate the real-world performance of LLMs in creative, long, multi-turn conversations.
There are existing benchmarks that rank LLMs based on their creativity or multi-turn conversational ability. However, none have had access to 2 million daily users to gather real-world feedback on a model’s performance.
Our large scale and third-party model support give us a unique set of data to evaluate the performance of a range of different AI models in real time. Now, we’re sharing this data with the community.
About janitorBench, JanitorAI.
On JanitorAI, after each LLM response, users can leave a rating from 1 to 5 stars. The platform has previously stated that these ratings do not affect the output quality for external models used via APIs and Proxies, and are mainly intended to gather feedback for JanitorLLM.
janitorBench And Its Current Evaluation Method
JanitorAI tracks user message ratings (1-5 stars), engagement time, and “behavioral signals across millions of conversations.”
Currently, janitorBench evaluates LLMs using average message ratings. A minimum of 5000 ratings is necessary for a model to be considered for evaluation. The leaderboard updates every 12 hours but does not provide any raw data.
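As a rough sketch of the method described above, the snippet below aggregates per-model star ratings into a leaderboard, keeping only models with at least 5,000 ratings and ranking them by average score. The data layout and function names are hypothetical; JanitorAI has not published its actual pipeline.

```python
from collections import defaultdict

MIN_RATINGS = 5000  # eligibility threshold stated for janitorBench

def build_leaderboard(ratings):
    """Rank models by average star rating (1-5).

    `ratings` is assumed to be an iterable of (model_name, stars) pairs;
    this layout is illustrative, not JanitorAI's actual schema.
    """
    totals = defaultdict(lambda: [0, 0])  # model -> [sum of stars, count]
    for model, stars in ratings:
        totals[model][0] += stars
        totals[model][1] += 1

    board = [
        (model, star_sum / count, count)
        for model, (star_sum, count) in totals.items()
        if count >= MIN_RATINGS  # models below the threshold are excluded
    ]
    # Highest average rating first
    return sorted(board, key=lambda row: row[1], reverse=True)
```

In practice, a job like this would simply be rerun on the latest ratings every 12 hours to match the leaderboard's stated refresh cycle.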

The developers plan to expand janitorBench’s evaluation to include engagement time and other behavioral signals. They also plan to create category-specific leaderboards and to A/B test LLM responses to identical prompts for head-to-head comparisons.
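JanitorAI has not said how it would score these head-to-head results. One common approach, used by other conversational leaderboards, is an Elo-style rating updated after each pairwise comparison; the sketch below illustrates that idea with hypothetical function names and a conventional K-factor, not janitorBench parameters.

```python
K = 32  # update step size; a conventional default, not a janitorBench value

def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, model_a, model_b, winner):
    """Update ratings in place after one A/B comparison on the same prompt.

    `winner` is "a", "b", or "tie"; unseen models start at 1000.
    """
    ra = ratings.setdefault(model_a, 1000.0)
    rb = ratings.setdefault(model_b, 1000.0)
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + K * (score_a - ea)
    ratings[model_b] = rb + K * ((1 - score_a) - (1 - ea))
```

A simpler alternative would be to report plain win rates per model pair; either way, head-to-head results sidestep some of the calibration differences between users that affect absolute star ratings.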
JanitorAI plans to offer model providers access to data through a dashboard, allowing them to monitor their models’ performance with real-time, real-world data. The developers also intend to make janitorBench data available to creators on a per-character basis.
The Flaw With User Ratings
JanitorAI users rate the LLMs’ responses against their own expectations, not against any objective standard of quality. For example, some find it acceptable when the LLM speaks for and controls their character, while others do not.
Other variables also come into play, such as the response length a user expects, the custom prompts they use, and the quality of their own input. Personal preference plays a significant role in a user’s rating.
With such a large and diverse user base, a fully objective leaderboard is out of reach. Even so, the platform’s sheer scale will still provide valuable real-world data to LLM providers, and adding other signals to janitorBench in the future might help offset the subjectivity of the ratings.
More Users Need To Provide Ratings
Many JanitorAI users who run external models through APIs and Proxies don’t rate the LLMs’ responses, since the platform has stated that those ratings do not influence external models’ responses and play no role in improving their chat experience.
Users who are happy with the LLMs’ responses and have properly configured their parameters and custom prompts are currently less likely to leave ratings. JanitorAI needs to ensure that user ratings remain a valuable signal by encouraging advanced users to start rating the LLMs’ responses.
Future Potential For LLM Providers
OpenRouter often hosts “stealth” models, which companies like OpenAI and xAI use to gather real-world usage data before officially launching them as their proprietary models.
With JanitorAI’s large user base and the developers’ plan to make real-world performance data available to LLM providers in real time, the platform could serve as a useful source for companies to gather data and improve their models.
It can also serve as a source of revenue for JanitorAI, as data at this scale is a valuable resource.
janitorBench, JanitorAI’s Benchmark For Evaluating LLMs
janitorBench is JanitorAI’s benchmark for evaluating the performance of LLMs in creative, long, multi-turn conversations. The platform’s large user base helps gather real-world performance data for several LLMs in real-time.
Currently, janitorBench evaluates and scores LLMs based on user-provided star ratings (1-5) for their responses. The developers plan to include additional signals in the evaluation, such as engagement time, and conduct A/B testing on identical prompts for head-to-head comparison.
JanitorAI users’ feedback is subjective and varies based on many factors. The platform’s previous statement that ratings do not influence chat quality for external models has also led many users with optimized prompts and parameters to refrain from leaving ratings. To maintain the value of user ratings, JanitorAI needs to encourage these advanced users to rate the LLMs’ responses.
The developers also plan to offer model providers access to real-world, real-time model performance data through a dedicated dashboard. This could become a revenue source for JanitorAI, as data at this scale is a valuable resource.