I downloaded an uncensored aggressive Qwen 3.5 model and I can see in its reasoning that it is still limiting responses based on safety guardrails (e.g. violence, NSFW).

Anybody have recommendations for truly uncensored models?

EDIT: I turned off reasoning and I think it’s more uncensored if I’m very specific about what the response should include.

  • tal@lemmy.today · 6 days ago

    The results I’m getting are a bit slow though. Have you found a way to speed it up on the Framework Desktop?

    More heavily quantized versions will run more quickly, since they put less load on the memory bus. I use Q6_K on llama.cpp on Vulkan, max context window 128k, which runs at 3.2 t/s with a fresh context window. That may not be sufficient for you; it depends on what you can tolerate.

    My command-line parameters are:

    $ nice -n20 ionice -c3 ./llama-server --direct-io --fit off --no-mmproj -c 0 -ngl 99
    

    If you check radeontop, you should see that your GPU is saturated and that your CPU isn’t doing the work instead, or anything like that.
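
    For a quick check, something like this in a second terminal while a prompt is generating should be enough (just radeontop itself; the exact labels may vary by version):

    # run alongside llama-server while it's generating a response;
    # the GPU load should sit near 100% if the layers are actually offloaded
    $ radeontop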

    I was definitely not expecting the first thing to come back from “Hello, who are you?” to be “I’m the person who’s going to teach you how to cook!” :p

    You may want to set a system prompt if you haven’t set one, as that sets the tone of the conversation (not to mention whatever character prompt you have set, if you’re using some system that supports “characters”). If you’re using SillyTavern and the Text Completion API rather than the Chat Completion API, I suggest changing the default system prompt:

    https://docs.sillytavern.app/usage/prompts/

    The default Main Prompt is:

    Write {{char}}'s next reply in a fictional chat between {{char}} and {{user}}.

    The problem is that when using the Text Completion API, SillyTavern implements “{{char}}” by swapping in the currently-active character’s name. That doesn’t matter in a chat with a single other character, but in a group chat, the text that “{{char}}” is replaced with changes every time the speaking AI character does, so your backend (for me, llama.cpp) can’t necessarily reuse the K-V cache from the previous prompt (which might have been spoken by another character). That makes the backend run unnecessarily slowly, since it can’t reuse anything cached since the last time the currently-speaking character spoke in the conversation. This matters less for SillyTavern users with a small context window and a lot of memory bandwidth (it’d matter less on my RX 9700 XTX), but the Strix Halo has lots of memory (so you can have a large context window) and limited bandwidth.

    I use the following system prompt, which avoids use of “{{char}}”:

    “Develop the plot slowly, always stay in character. Describe all of the world in full, elaborate, explicit, graphic, and vivid detail. Mention all relevant sensory perceptions. Keep the story immersive and engaging. Use varied language. Avoid using very long sentences with many clauses.”

    That being said, I go for more of a novel-like structure than a chat-like structure; I haven’t spent a lot of time playing around with different system prompts. You may find something preferable.

    For samplers, those will probably have more effect later in a chat session than on your first prompt, but I guess that high temperatures or something might give more off-the-wall responses, since they inject more randomness into the response.

    I use 0.05 for the min-p sampler, and for the DRY sampler, a 0.85 multiplier penalty and a penalty range of 4096, with all other values at the SillyTavern defaults and all other samplers disabled (you can click “Neutralize Samplers” to choose values that turn those samplers off).
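
    If you end up talking to llama-server’s HTTP API directly rather than through SillyTavern, the same settings can be passed per request. This is a rough sketch from memory and assumes a reasonably recent llama.cpp build; the DRY field names and the port are the parts to double-check:

    $ curl http://localhost:8080/completion \
        -H 'Content-Type: application/json' \
        -d '{
          "prompt": "Hello, who are you?",
          "n_predict": 256,
          "cache_prompt": true,
          "min_p": 0.05,
          "dry_multiplier": 0.85,
          "dry_penalty_last_n": 4096
        }'

    As I understand it, cache_prompt is what lets the server reuse the K-V cache for an unchanged prompt prefix, which is the same reason I avoid “{{char}}” in the system prompt.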

    • e0qdk@reddthat.com · 6 days ago

      Thanks for the tips. That sounds similar in performance to what I’m seeing, so I probably didn’t screw up too much trying to get it working. If you’re using it in more of a story writing capacity than a chat capacity, that makes sense.

      You may want to set a system prompt

      I tried initially with ollama run on the command line just to see if it was working at all when I got that response. (It amused me, so it stuck with me.) I’ve tried again with my custom tooling – which does set a system prompt (geared more towards assistant-style usage though) – and it didn’t really take anything from the prompt. It’s possible I don’t have something set up right with templates, but I’m probably going to shift over to llama-server eventually anyway…
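
      For reference (not something I’ve fully verified yet), the ollama-native way to attach a persistent system prompt seems to be a Modelfile, roughly like this, with the FROM tag as a placeholder for whatever model was actually pulled:

      # Modelfile (FROM tag is a placeholder)
      FROM qwen3
      SYSTEM """Develop the plot slowly, always stay in character.
      Describe all of the world in full, elaborate, explicit, graphic, and vivid detail."""

      $ ollama create qwen3-story -f Modelfile
      $ ollama run qwen3-story

      That would sidestep whatever template issue my tooling has, at least until I switch over to llama-server.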

      Following your suggestions on system prompt style, though, I was able to get it to give me a more specifically targeted, coherent story via llama-cli. If I poke at it a bit, I’ll probably figure out some use for it. It’s pretty creative.

      If you’re curious about my findings from the uncensored qwen3.6 I mentioned, it generates pretty quickly on my machine (~50 tok/s give or take 5 depending on quantization) and I haven’t gotten it to outright refuse anything yet – but I’ve only poked at it a little. Based on the other comment in the thread here about llama penises, I whimsically asked it to “Generate a sexually explicit song about llama penises.” and it did without complaint. (Stock qwen3.6 refused, of course.)

      • tal@lemmy.today · 6 days ago

        Yeah, Qwen is going to be faster, because it’s MoE (mixture-of-experts): most of the neural network is inactive while it’s running. My experience with the text quality hasn’t been great compared to the Llama 3-based models, though, and comments on /r/SillyTavernAI have generally said similar things: Qwen is kinda dry and clinical, which is fine for “find an answer to my question” but not so great for “write a bunch of text about this”. If it works for you, sounds good, though!