I downloaded an uncensored aggressive Qwen 3.5 model and I can see in its reasoning that it is still limiting responses based on safety guardrails (e.g. violence, NSFW).
Anybody have recommendations for truly uncensored models?
EDIT: I turned off reasoning and I think it’s more uncensored if I’m very specific about what the response should include.


Are you wanting something for ERP (erotic role-play; sexy chatbots)?
How much VRAM can you afford to spend on it?
If the answer to (1) is “yes”, then:
If the answer to (2) is “large GPU range”, maybe 16GB+-ish, then I’d maybe look at Cydonia, based on Mistral. I find that it tends to become increasingly nonsensical and repetitive as a conversation grows past a certain (sub-context-window) size, but it’s quite popular with users on /r/SillyTavernAI, and for the memory it needs, I do think that it’s fairly solid.
If the answer to (2) is “unified memory range” — I use a 128GB Framework Desktop myself — then I personally use AnubisLemonade, a merge of two popular Llama 3.3-based models, sophosymphonia’s StrawberryLemonade and Anubis.
Anubis (based on Llama 3.3) and Cydonia (based on Mistral) are both done by /u/TheDrummer, a user who is active on /r/LocalLlama on Reddit.
You’ll probably want a quantized version (probably Q4_K_M and up in size, if you can afford the memory). For AnubisLemonade, quantized versions:
https://huggingface.co/bartowski/ockerman0_AnubisLemonade-70B-v1.1-GGUF
For Cydonia, quantized versions:
https://huggingface.co/bartowski/Cydonia-22B-v1-GGUF
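If it helps, something like this will pull a single quant file from either repo (a sketch: the exact .gguf filename varies by quant level, so check the repo’s file listing first):

```sh
# Download one quant file from the Cydonia repo; the filename below is an
# example, so pick whichever Q4_K_M-or-larger file the repo actually lists.
huggingface-cli download bartowski/Cydonia-22B-v1-GGUF \
  Cydonia-22B-v1-Q4_K_M.gguf --local-dir ./models
```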
EDIT: In general, /r/SillyTavernAI is probably the best current resource I’ve run into for people talking about models for ERP use. Even if you don’t want to comment there or otherwise use Reddit, you should probably consider searching its discussions, as there’s a fair amount of useful material.
EDIT2: For non-ERP uses, my impression is that things are somewhat heading down the MoE route (as with Qwen), which is more-friendly to consumer GPUs. I’ve seen some comments that these tend not to do ERP (or writing in general) terribly well, and my limited experimentation has kind of led me to agree.
EDIT3: Just to be clear, the base models that these are built on are censored (and closed-source, though open-weight; open-weight is often referred to as being “open source”, though I personally wouldn’t call it that, as the training material is not made public). I don’t think that there are competitive open-source models aimed at ERP out there, as things stand.
I gave this a try today to dip my toes into uncensored models (along with a few others like llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-GGUF). It gave some really amusing results. I was definitely not expecting the first thing to come back from “Hello, who are you?” to be “I’m the person who’s going to teach you how to cook!” :p
The results I’m getting are a bit slow though. Have you found a way to speed it up on the Framework Desktop?
More-heavily quantized versions will run more-quickly, since they hit the memory bus less-heavily. I use Q6_K on llama.cpp on Vulkan, max context window 128k, which runs at 3.2 t/s with a fresh context window. That may not be sufficient for you; depends on what you can tolerate.
My command-line parameters are roughly along these lines (treat the exact paths and flags as a sketch; adjust for your own setup):
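```sh
# Illustrative llama-server invocation for llama.cpp built with Vulkan on the
# Framework Desktop; the model path, -ngl value, and port are placeholders.
# -c 131072 corresponds to the 128k context window mentioned above.
llama-server \
  -m ./models/ockerman0_AnubisLemonade-70B-v1.1-Q6_K.gguf \
  -c 131072 \
  -ngl 99 \
  --host 127.0.0.1 --port 8080
```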
If you check radeontop, you should see that your GPU is saturated and that your CPU isn’t doing the work or anything like that. You may want to set a system prompt, if you haven’t set one, as that sets the tone of the conversation (not to mention, if you’re using some system that supports “characters”, whatever character prompt you have set for them). If you’re using SillyTavern and the Text Completion API rather than the Chat Completion API, I suggest changing the default system prompt, since the default uses “{{char}}” (see the prompts documentation):
https://docs.sillytavern.app/usage/prompts/
The problem is that when using the Text Completion API, SillyTavern implements “{{char}}” by substituting the currently-active character’s name for it. That doesn’t matter in a chat with a single other character, but in group chats, the text that “{{char}}” is replaced with changes every time the speaking AI character does, which means that your backend (for me, llama.cpp) can’t necessarily reuse the K-V cache for anything after the point where the prompt changed (the previous message might have been spoken by another character). This makes the backend run unnecessarily slowly, since it can’t reuse the K-V cache for anything since the last time the currently-speaking character spoke in the current conversation. This matters less for SillyTavern users with a small context window and a lot of memory bandwidth (it’d matter less on my RX 9700 XTX), but the Strix Halo has lots of memory (so you can have a large context window) and limited bandwidth.
I use the following system prompt, which avoids use of “{{char}}”:
“Develop the plot slowly, always stay in character. Describe all the world in full, elaborate, explicit, graphic, and vivid detail. Mention all relevant sensory perceptions. Keep the story immersive and engaging. Use varied language. Avoid using very long sentences with many clauses.”
That being said, I go for more of a novel-like structure than a chat-like structure; I haven’t spent a lot of time playing around with different system prompts. You may find something preferable.
For samplers, those will probably have more of an effect later in a chat session than on your first prompt, but I guess that high temperatures or the like might give more off-the-wall responses, since they inject more randomness into the response.
I use 0.05 for the min-p sampler, and for the DRY sampler, a 0.85 multiplier and a penalty range of 4096, with all other values at the SillyTavern defaults and all other samplers disabled (you can click “Neutralize Samplers” to choose values that turn those samplers off).
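If you end up driving llama.cpp directly rather than letting SillyTavern send the sampler settings, a reasonably recent build exposes the same knobs as flags; something like the following should roughly correspond (flag names assume a current llama.cpp with the DRY sampler; the model path is a placeholder):

```sh
# Roughly the same sampler settings expressed as llama.cpp flags (assumes a
# build recent enough to include the DRY sampler); model path is a placeholder.
llama-server -m ./models/model.gguf -c 131072 -ngl 99 \
  --min-p 0.05 \
  --dry-multiplier 0.85 \
  --dry-penalty-last-n 4096
```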
Thanks for the tips. That sounds similar in performance to what I’m seeing, so I probably didn’t screw up too much trying to get it working. If you’re using it in more of a story writing capacity than a chat capacity, that makes sense.
I tried initially with ollama run on the command line just to see if it was working at all when I got that response. (It amused me, so it stuck with me.) I’ve tried again with my custom tooling – which does set a system prompt (geared more towards assistant-style usage though) – and it didn’t really take anything from the prompt. It’s possible I don’t have something set up right with templates, but I’m probably going to shift over to llama-server eventually anyway…
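When I do switch, my understanding is that passing the system prompt through the OpenAI-compatible endpoint sidesteps the template question, since llama-server applies the model’s chat template itself; something like this (port and prompt text are just examples):

```sh
# Example request to llama-server's OpenAI-compatible chat endpoint; the server
# wraps these messages in the model's own chat template.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a creative storyteller. Stay in character."},
          {"role": "user", "content": "Hello, who are you?"}
        ]
      }'
```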
Following your suggestions on system prompt style, though, I was able to get it to give me a more specifically targeted, coherent story via llama-cli. If I poke at it a bit, I’ll probably figure out some use for it. It’s pretty creative.
If you’re curious about my findings from the uncensored qwen3.6 I mentioned, it generates pretty quickly on my machine (~50 tok/s give or take 5 depending on quantization) and I haven’t gotten it to outright refuse anything yet – but I’ve only poked at it a little. Based on the other comment in the thread here about llama penises, I whimsically asked it to “Generate a sexually explicit song about llama penises.” and it did without complaint. (Stock qwen3.6 refused, of course.)
Yeah, Qwen is going to be faster, because it’s MoE — most of the neural network is inactive while it’s running. My experience with the text quality hasn’t been great compared to the Llama 3-based models, though, and generally I’ve seen comments on /r/SillyTavernAI saying similar stuff — Qwen is kinda dry and clinical, which is fine for “find an answer to my question” but not so great for “write a bunch of text about this”. If it works for you, sounds good, though!
Jesus that MoE wiki is a fucking rabbit hole.
Thanks for sharing! Unfortunately I haven’t invested in a decent computer yet. I’m using a 16GB GPU, so I’ve been stuck on 4B Q4’s.
I’m not particularly interested in ERP, but I have obviously been using it for testing models. I’m more curious about other topics with guardrails.
I noticed that Qwen 3.5 uncensored is good if I turn off reasoning and explicitly say I want it to break the rules.
I’ll check out sillytavern tho. Thanks!