ChatChippity (beta): System Overview, Authentication, and Usage

Summary

Learn about ChatChippity, CMU's on-premise LLM service, and how to use it.

Body

Overview:

ChatChippity is an on-premise LLM chat service for use by faculty, staff, and students.  This service is a passion project by CSE IT, and as such should not be held to the same expectations as a typical production service.  During this beta/testing phase, the service will update and change frequently, may have unexpected downtime, and its capabilities may vary week to week as the space evolves.

This service is entirely self-contained within the CMU datacenter; no data ever leaves our datacenter (unless a user manually enables the web search feature).  This positions ChatChippity among the few data-safe AI services that you can use without fear of leaking sensitive data to unauthorized third parties (when web search is disabled).

Purpose:

This on-campus LLM service is designed to support academic, research, and professional endeavors.  It is a tool to assist with tasks such as drafting, analysis, refinement, and idea generation.  No interactions with this service should be treated as an official communication, stance, or opinion on behalf of CMU, its faculty, staff, or students.  Students should defer to their instructors on the appropriate use of AI in their courses.

Just as you wouldn’t ask a screwdriver in a campus closet for its opinion on a charged political topic and expect a reasoned, official response, you shouldn’t interpret conversations with this tool as representing the views of CMU or its community. This service is a resource for your use, not a representative of institutional authority.
 

Authentication:

Authentication is AD-based (globalID and password); access is open to all active faculty, staff, and students.
 

Network Requirements:

  • Any domain-bound, Ethernet-connected, or VPN-connected computer

OR

  • A virtual machine in the Citrix virtual lab
     

Using the Service:

After logging in, users will be presented with an industry-standard chat interface (similar to ChatGPT, Grok, Perplexity, Claude, etc.).  Users will default to the "ChatChippity" model, which should always be an appropriate choice for general usage, even as the underlying models change and evolve.  When using the "ChatChippity" model, various underlying models of similar capability may be used to provide responses.  See "Using Models" and "Architecture" below for more details.

Generally, LLMs work best in the first few exchanges of a conversation, when their context is uncluttered by irrelevant information.  If you want to start a conversation on a new topic, you will likely get better responses by selecting "New Chat" in the top left (sidebar open) or top right (sidebar closed), rather than changing topics mid-conversation.  This also has the advantage of making chats easier to search through and reference later.  The service as configured does not share information between conversations (it has no persistent memory), which helps keep each chat focused.

Caveats and Considerations:

Training Data, Bias & Safety:

All machine-learned model output is based upon a combination of the model's training data and the choices its developers made while training it.  Every model will have some quirks in the responses it provides and the topics it refuses to engage with.  CMU has not put any additional guard-rails on top of what is built into the base models, but there will still be topics that a model refuses to engage with.  Generally, these will be the topics considered sensitive or illegal by the government or culture of the company that created the model.

Regardless of model output or refusals, the governing policy on the appropriate use of computing resources at CMU can be found here.
 

Hallucinations:

Modern LLMs are closer technologically to your phone's keyboard autocomplete than they are to the human brain.  In training data, there are typically very few vector associations between any individual topic and the phrase "I don't know".  Whenever a topic is thinly represented in, or has conflicting information within, its training data, an LLM will almost always "hallucinate" an answer rather than refuse a response.  Users should never blindly trust the output of an LLM without verifying sources and information directly.
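As a loose illustration (and emphatically not how any production model is implemented), the toy Python snippet below uses invented numbers to show why a confident-sounding guess usually beats "I don't know" at the next-token level:

    # Toy illustration only -- invented probabilities, not a real model.
    # A model picks its next token from a probability distribution; refusal
    # or uncertainty phrases rarely carry much probability mass for any
    # given topic, so a plausible-sounding guess almost always wins.
    next_token_probs = {
        "1687": 0.46,     # a plausible-sounding year
        "1688": 0.31,     # another plausible-sounding year
        "1452": 0.12,     # a loosely related year from the training data
        "unknown": 0.02,  # "I don't know" is rarely the likeliest continuation
    }

    # Greedy decoding: take the single most probable token.
    print(max(next_token_probs, key=next_token_probs.get))  # -> 1687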
 

Web Search:


Web Search is the one place where data must leave the CMU datacenter in order to fulfill a request.  Users should take care to enable web search only when a conversation chain does not contain sensitive data.  When web search is enabled, the user's query will be transformed into a reasonable search query by the model, and the results of a DuckDuckGo search will be ingested back into the context window before it responds.  Web searches are logged internally along with the user making the request, but external search services are only aware of the service as a whole.  This adds anonymity from the perspective of the search provider (as in, the information cannot be tied to you by them), but we do have a responsibility to be able to de-anonymize a search on our end in order to respond to abuse complaints.  See the "Privacy and Security" section for more details.  This is an entirely optional feature left solely to the discretion of the current user.
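For the curious, the flow is roughly sketched below in Python.  This is a simplified approximation rather than ChatChippity's actual implementation; the third-party duckduckgo_search package and the formatting choices here are stand-ins for the real pipeline:

    # Rough sketch of the web search flow -- NOT the actual implementation.
    # Assumes the third-party duckduckgo_search package
    # (pip install duckduckgo-search).
    from duckduckgo_search import DDGS

    def build_search_context(user_message: str, search_query: str) -> str:
        # In the real service, a model first rewrites the user's message
        # into a reasonable search query; here we take `search_query` as given.
        results = DDGS().text(search_query, max_results=5)
        snippets = "\n".join(f"- {r['title']}: {r['body']}" for r in results)
        # The snippets are ingested into the context window so the chat
        # model can ground its response in them before it replies.
        return f"Web results for '{search_query}':\n{snippets}\n\nUser: {user_message}"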

A model may hallucinate a web search if prompted to search without the tool enabled, but it absolutely does not have the ability to perform an actual web search unless the user manually enables the option via the button.

Using Models:

By default, ChatChippity load-balances across servers in our datacenter running qwen3:8b and qwen3:30b-a3b, as they provide similar-quality output while suiting different back-end hardware.  These models run entirely within the CMU datacenter and depend on neither the compute capabilities of your device nor any external resources.  Using this mix of models allows us to offer this service to more simultaneous users than we could with either model alone.  If a user has specific needs, or wants to use one model over another, the model dropdown in the top left can be used to select between any of the available models across the distributed network.

If you don't know what you need, leaving the default selection of "ChatChippity" is probably an appropriate choice.  ChatChippity_KB is a custom modelfile on top of qwen3:8b with an embedded vector RAG of the CMU public KB.
 

Why Qwen?:

The Qwen3 family of models was chosen as the launch default for ChatChippity because of its advanced capabilities in math, programming, reasoning, general knowledge, and creative writing, relative to its size and the hardware requirements to run it.  We felt it was the best balance between real-world usefulness and user safety, especially for those who may be having their first practical experiences with an LLM on the ChatChippity service.
 

Using RAG:

Retrieval Augmented Generation (RAG) is a technique in which a local vector database is used to quickly provide additional context to an LLM when it is prompted for specific information.  As a proof of concept, a snapshot of the public KB articles has been inserted into a database to demonstrate this capability.  You can summon the RAG system by typing # at the prompt and selecting the "CMU Public Knowledge Base" collection, or by selecting the ChatChippity_KB modelfile from the top left dropdown.
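Conceptually, retrieval amounts to "embed the query, find the nearest stored chunks, and prepend them to the prompt."  The toy Python sketch below substitutes a crude bag-of-words similarity for a real embedding model (the service itself uses nomic-embed-text; see "Architecture"), but the shape of the pipeline is the same:

    # Toy RAG retrieval -- a crude bag-of-words "embedding" stands in for a
    # real embedding model, and a two-item list stands in for the vector DB.
    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Hypothetical chunks standing in for the "CMU Public Knowledge Base" collection.
    chunks = [
        "VPN access requires the campus VPN client and your globalID.",
        "Citrix virtual lab machines are available to active students.",
    ]
    query = "how do I connect to the campus VPN"

    # Retrieve the most similar chunk and prepend it to the prompt.
    best = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
    print(f"Context: {best}\n\nQuestion: {query}")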

Users can additionally upload a small number of documents into their own conversations to use this feature, but should treat this document storage as ephemeral.  This database will be routinely cleaned out by automation to maintain performance and a reasonable footprint for this service.
 

Architecture:

Architecturally, this front-end service load-balances across many backend llama.cpp servers in an attempt to provide a reasonable response time to many simultaneous active users.  Many different architectures of GPUs and CPUs are strung together to provide this backend network.  This means that when using the "ChatChippity" model, your response may be generated by any of the available underlying models.  In practice, this greatly increases the concurrency we can offer, but does add a small latency penalty during low use.  Separate instances of qwen2.5:3b, bge-reranker-v2-m3, and nomic-embed-text are used in queries to provide search+summary, retrieval, and embedding services, respectively.
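llama.cpp's server exposes an OpenAI-compatible /v1/chat/completions endpoint, so a front end can spread requests across heterogeneous nodes.  The Python sketch below shows the basic idea with simple round-robin routing and hypothetical hostnames; it is not our actual routing logic:

    # Minimal load-balancing sketch -- hypothetical hostnames, not the
    # service's real routing logic.  llama.cpp's server speaks the
    # OpenAI-compatible /v1/chat/completions API.
    import itertools
    import requests

    BACKENDS = itertools.cycle([
        "http://gpu-node-1:8080",  # e.g. qwen3:30b-a3b on a larger GPU
        "http://gpu-node-2:8080",  # e.g. qwen3:8b on a smaller card
    ])

    def chat(messages: list[dict]) -> str:
        # Round-robin: any of the underlying models may answer a given request.
        backend = next(BACKENDS)
        resp = requests.post(f"{backend}/v1/chat/completions",
                             json={"messages": messages}, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    # Example: print(chat([{"role": "user", "content": "Hello!"}]))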

Current models in the ChatChippity load-balancer:
- qwen3:8b @ Q4_K_M quant @ 32k context
- qwen3:30b-a3b @ Q4_K_M quant @ 32k context

Current models in embedding/tool/rag workflows:
- qwen2.5:3b @ Q4_K_M quant @ 16k context
- bge-reranker-v2-m3 @ fp16 native @ 2k context
- nomic-embed-text:137m @ fp16 native @ 2k context

Other available models:
- qwen2.5-coder:1.5b @ Q4_K_M quant @ 8k context

Privacy and Security:

To the greatest extent that law and policy permit, an effort is being made to build this service from the ground up with user privacy and data security in mind.  We built this service because the security and privacy policies of many external services make you the product, which is likely incongruent with both your expectations and relevant laws or university policy.  We wanted to offer a safer, more sustainable service where you are not the product.

Chats are logged for the purposes of troubleshooting and auditing abuse complaints only.  They will not be reviewed or shared without cause, but may be subject to legal requests or relevant University governance policies.  No user data is being used for training or research purposes, nor is any data being distributed or sold to any third parties (unless web search is enabled).  When web search is explicitly enabled, the system performs the search query on your behalf, anonymized: the server's IP address and account, not yours, are exchanged with the search provider.

 

Details

Article ID: 37478
Created
Tue 5/20/25 1:53 PM
Modified
Wed 6/4/25 10:47 AM