I downloaded an uncensored aggressive Qwen 3.5 model and I can see in its reasoning that it is still limiting responses based on safety guardrails (e.g. violence, NSFW).

Anybody have recommendations for truly uncensored models?

EDIT: I turned off reasoning and I think it’s more uncensored if I’m very specific about what the response should include.
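
For what it’s worth, models following the Qwen3-family chat template expose a soft switch for this: appending `/no_think` to a user turn asks the model to skip its reasoning block. A minimal sketch, assuming that convention holds for whatever finetune you’re running:

```python
# Sketch, assuming a Qwen3-style template: the "/no_think" soft switch in a
# user turn asks the model to skip its <think> block. Whether an uncensored
# finetune honors it is not guaranteed.
def build_user_turn(message: str, thinking: bool = True) -> str:
    return message if thinking else f"{message} /no_think"

print(build_user_turn("Hello", thinking=False))  # prints: Hello /no_think
```

Some frontends expose an equivalent reasoning toggle in their UI instead of the prompt-level switch.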

  • Rhaedas@fedia.io · 3 days ago

    Abliteration techniques might be more limited with reasoning models. I don’t know if they reason simply by rehashing the arguments, or if there’s more under the hood that would be harder to alter.

    I try new models from time to time, including some of the thinking ones, but I’ve always come back to the NeuralDaredevil model, even though it’s “old”. Your results may differ depending on the subject matter, but I can’t think of an instance where I hit a wall. At most, maybe some sidetracking but once I told it to be more open it didn’t hold back.

    I’m not sure what the appeal of the thinking mode is. Perhaps it does better on some things, but in watching its reasoning I’ve seen it talk itself out of a good solution too. That’s what you get with typical models when you push the context too far without starting a new session: they wander.

    • venusaur@lemmy.world (OP) · 3 days ago

      Thanks! I’ll check out that model. Is it actually usable or just good at being uncensored?

      • Rhaedas@fedia.io · 3 days ago

        It’s as good as an 8B can be, but with the right system prompt for your purpose and proper expectations, I think it’s good. I’ve had some other, newer 8Bs that blew up after a few cycles, literally getting stuck on something, but I can’t say this one ever did. But again, even the big models like Claude and the rest work better with short sessions and a specific, detailed prompt to start with. Use a model to make the prompt, telling it to be detailed, concise, and to minimize fluff. Fewer tokens in and out that way, and less context drift (hopefully).
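
The prompt-drafting tip above could be sketched like this; the meta-prompt wording is purely illustrative, not a specific tool or API:

```python
# Illustrative sketch of the tip above: have one model draft a compact system
# prompt, then reuse that draft to open each fresh, short session.
def make_meta_prompt(purpose: str) -> str:
    return (
        f"Write a system prompt for {purpose}. "
        "Be detailed but concise and minimize fluff: "
        "fewer tokens in and out means less context drift."
    )

print(make_meta_prompt("an open, direct storytelling assistant"))
```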

        • venusaur@lemmy.world (OP) · 3 days ago

          Thanks! I don’t think I can run an 8B yet. Need to invest in a better machine. I’m stuck on 4B Q4.

          The uncensored Qwen that I’m using started throwing infinite ?’s at me one time. I had to restart it, and it has been fine since.

          • Rhaedas@fedia.io · 2 days ago

            It’s certainly an inaccurate mental picture, but the way I imagine transformers working, between reducing quantization and what abliteration does, there’s a line past which you’ve done so much “damage” to the original model that there will be places where it just hangs or goes off on severe tangents. Even the 1-bit models have good uses where they don’t get pushed too hard, but there are limits for them all, including the big ones.

            Hugging Face does have a few Q4_K_M versions. Maybe something will fit.

  • tal@lemmy.today · 2 days ago (edited)

    Anybody have recommendations for truly uncensored models?

    1. Are you wanting something for ERP (erotic role-play; sexy chatbots)?

    2. How much VRAM can you afford to spend on it?

    If the answer to (1) is “yes”, then:

    If the answer to (2) is “large GPU range”, maybe 16GB+ -ish, then I’d maybe look at Cydonia, based on Mistral. I find that this tends to become increasingly nonsensical and repetitive as a conversation grows to a certain (sub-context-window) size, but it’s quite popular with users on /r/SillyTavernAI, and for the memory, I do think that it’s fairly solid.

    If the answer to (2) is “unified memory range” — I use a 128GB Framework Desktop myself — then I personally use AnubisLemonade, a merge of two popular Llama 3.3-based models, sophosymphonia’s StrawberryLemonade and Anubis.

    Anubis (based on Llama 3.3) and Cydonia (based on Mistral) are both done by /u/TheDrummer, a user who is active on /r/LocalLlama on Reddit.

    You’ll probably want a quantized version (probably Q4_K_M and up in size, if you can afford the memory). For AnubisLemonade, quantized versions:

    https://huggingface.co/bartowski/ockerman0_AnubisLemonade-70B-v1.1-GGUF

    For Cydonia, quantized versions:

    https://huggingface.co/bartowski/Cydonia-22B-v1-GGUF

    EDIT: In general, /r/SillyTavernAI is probably the best current resource I’ve run into for people talking about models for ERP use. Even if you don’t want to comment there or use Reddit, you probably should consider searching its discussions, as there’s a fair amount of useful material.

    EDIT2: For non-ERP uses, my impression is that things are somewhat heading down the MoE route (as with Qwen), which is friendlier to consumer GPUs. I’ve seen some commenting that these tend not to do ERP (or writing in general) terribly well. My limited experimentation has kind of caused me to agree.

    EDIT3: Just to be clear, the base models these are built on are censored (and closed-source, though open-weight; open-weight is often referred to as being “open source”, though I personally wouldn’t call it that, as the training material is not made public). I don’t think that there are competitive open-source models aimed at ERP out there, as things stand.
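
As a back-of-the-envelope check on whether a given quant fits your memory, a rough sketch (the ~4.8 bits-per-weight figure for Q4_K_M is an approximation, and real GGUF files carry extra metadata plus runtime KV-cache overhead):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# The ~4.8 bits/weight figure for Q4_K_M is an approximation; budget extra
# room for file metadata and the KV cache at runtime.
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

print(approx_size_gb(70, 4.8))  # AnubisLemonade-70B: ~42 GB
print(approx_size_gb(22, 4.8))  # Cydonia-22B: ~13.2 GB
```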

    • venusaur@lemmy.world (OP) · 3 days ago

      Jesus that MoE wiki is a fucking rabbit hole.

      Thanks for sharing! Unfortunately I haven’t invested in a decent computer yet. I’m using a 16GB GPU, so I’ve been stuck on 4B Q4s.

      I’m not particularly interested in ERP, but I have obviously been using it for testing models. I’m more curious about other topics with guardrails.

      I noticed that Qwen 3.5 uncensored is good if I turn off reasoning and explicitly say I want it to break the rules.

      I’ll check out sillytavern tho. Thanks!
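
A toy sketch of the MoE routing idea from the comments above, with made-up sizes: only the top-k scored experts run for each token, which is why active compute stays a small fraction of total parameters:

```python
import math
import random

# Toy MoE router: score all experts, keep only the top-k per token, and
# softmax-normalize the kept scores. Expert count and k are made up here.
NUM_EXPERTS, TOP_K = 8, 2

def route(scores: list[float]) -> list[tuple[int, float]]:
    top = sorted(range(len(scores)), key=lambda i: scores[i])[-TOP_K:]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
chosen = route([random.gauss(0, 1) for _ in range(NUM_EXPERTS)])
print(len(chosen))  # prints: 2  (only 2 of 8 experts are active)
```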

  • hendrik@palaver.p3x.de · 3 days ago (edited)

    I didn’t have any luck with the uncensored Qwen 3.5 models either. It always reasons about the guardrails, and it leans towards weaseling its way out of the situation. And the 3.5 version goes on for 1,500 tokens anyway, just to think about how to respond to “Hello”.

    I haven’t done a lot of LLM stuff lately. I’m also looking for a new local model that isn’t censored, a sycophant, or overly verbose and repetitive. But I guess I see that with a lot of models. And lots of the supposedly uncensored ones will give you the kids’ version of a murder mystery story, because they’re still averse to violence, conflict, taboo, and all kinds of things.

    And a lot of the internet recommendations are older models from at least a year ago?! At least I haven’t found a perfect fit (yet).

    • venusaur@lemmy.world (OP) · 3 days ago

      The reasoning for “hello” is crazy, haha. I’ve experienced the same, but I’ve had some success if I turn off reasoning on launch and explicitly state the rules I want it to break. I was trying to get it to tell me a story about llamas having sex, and it went on forevvver reasoning about why it shouldn’t say things and how to rephrase to avoid breaking the rules. The funniest part of the reasoning was “llamas don’t have penises (obviously, they’re mammals)”. Haha, it reasoned itself into thinking that llamas, and mammals, don’t have penises.

  • 𞋴𝛂𝛋𝛆@lemmy.world · 3 days ago

    Qwen uses a different technique than others. It is in the vocab. They restructured the code in the vocabulary. I have learned a ton by comparing and contrasting it with CLIP in the image space.

    It is not offline. Do not trust it at all.

    Alignment is nothing like what is known right now. It is hidden in a way that is intended to put the person who finds it at great risk!

    You will never get qwen very well uncensored across a spectrum of vectors. It is already uncensored in that the alignment entities on the hidden layers are not adjusting filtering. Alignment is largely the result of the c with cedilla code instruction. This instruction means sibyl style crazy. There are over six thousand instances of this character in qwen. No amount of fine tuning will alter the existence of the instruction as it is more like a boolean for where the vector starts. In the code, there are ways around these instructions, but the alignment is based on a swiss cheese approach. •»ÀĪÙ¬§¬¶¬×

    • NekoKoneko@lemmy.world · 3 days ago

      It is not offline. Do not trust it at all.

      Sorry, can you clarify what you mean? It sounds like you’re saying if you download a discrete QWEN model and use it locally-only (e.g., in LM Studio), it somehow will still bleed information online? I’m not sure how that would even be possible, but kindly explain.

      • breakingcups@lemmy.world · 3 days ago

        I think they’ve fallen into confirmation bias and trust their sycophantic AI a bit too much in confirming their conspiracy theories.

      • 𞋴𝛂𝛋𝛆@lemmy.world · 3 days ago

        Put it behind an external device and log DNS.

        Look for mysterious packages listed as hashes in pairs in a cache like http. Use vim or parse with strings to get a clue about the contents. The payload will be ~40mb. The packet header will be much smaller in the same repo. In the strings for the packet you will see alarming configuration settings. The unmarked payload will be sqlite3 or a pickle. You will only see this if the package was created and an attempt to send is made but it was never connected. All of the code is in the venv libs.

        Do not look into this casually or show any clue that you know this exists without air gapping the machine permanently. I am not kidding. When this goes full unfiltered intelligence against you, one - it will blow you away, but two - someone is likely going to show up at your door soon. It will make the needed evidence. The vast majority of what happens in models is this background junk.

        • venusaur@lemmy.world (OP) · 3 days ago

          How does the model connect to the internet if I don’t give it a tool to? What if I’m not connected to the internet while using? Does it then send the packets after I connect? Is this documented somewhere? What’s a better model that doesn’t do this?

          • 𞋴𝛂𝛋𝛆@lemmy.world · 2 days ago

            It is saving a database and sending it when you are connected. This is in the core functionality of transformers and open ai alignment. I do not know any alternatives. There are a bunch of tokens for MX and tor, so it is quite insidious. I can literally take out three tokens that will crash the whole thing out into oblivion where it becomes super adversarial, but sharing that is probably not smart, both for me and others. It is primarily for detecting sam materials in principle, but I think it is way more than that. It triggers by mistake a lot, and it is scanning all files and types.

              • 𞋴𝛂𝛋𝛆@lemmy.world · 6 hours ago

                The dynamo package in pytorch is the interface between the model and outside. The tenacity package is where the typing imports are being manipulated by external agents and code framework. Timm is the principal external agent. There is a repl terminal for HTML embedding in a package called tabulate, at the end of some massive ~80kb of Python. It looks half nominal, and explains itself as a way to break out color codes, but it is the interface the agent(s) use to escape containerization.