There's a thing that I would have liked to see explored, perhaps: the idea that companies might actually want exactly what -oss offers. While the local LLM communities might want freedom and a horny assistant, businesses absolutely do not want that. In fact they spend a lot of effort implementing (sometimes less than ideal) guardrails to keep the models on track. For very easy use cases like support chatbots and the like, businesses will always prefer something that errs on the side of less than useful but "safe", rather than have the bot start going crazy with sex/slurs/insults/etc.
I do have a problem with this section though:
> Really open weight, not open source, because the weights are freely available but the training data and code is not.
This is factually incorrect. The -oss models are by definition open source. Apache 2.0 is open source (I think even the purists agree with this). Sharing "training data and code" is absolutely not a prerequisite for being open source, and historically it was never required; the craze surrounding LLMs suddenly made this a thing. It's not.
Here's the definition of source in "open source":
> "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
Well, for LLMs the weights are the "preferred form for making modifications". The labs themselves modify models the same way you are allowed to by the license! They might use more advanced tools, or better datasets, but in the end the definition still holds. And you get all the other stuff, like the right to modify, re-release, etc. I'd really wish people would stop proliferating this open weight nonsense.
Models released under open source licenses are open source: gpt-oss, the Qwens and Mistrals (Apache 2.0), the DeepSeeks (MIT), etc.
Models released under non open source licenses also exist, and they're not open source, because the licenses under which they're released aren't. Llamas, Gemmas, etc.
No, the preferred way of making modifications is the weights _together_ with the training (or fine-tuning) scripts, the entire evaluation pipeline to measure performance, and the data required to support all of this.
When someone joins your data science team, you would give them all this code and data. Not just the weights and say: the weights are the source, modify them to improve the model, I look forward to seeing your MR next week.
EDIT: Heck, sometimes the way to make improvements (modifications) is just to improve the data, and not touch the training code at all. It is often one of the most powerful ways. You still need training code though, and evaluation to measure the impact.
The license gives you the right to modify the weights; how you do the modification is up to you. The rest is in the realm of IP, know-how, etc. Apples and oranges.
Having the right to modify one part of the product is not the same as having the right to modify the entire product. Labeling such projects as open source in the full spirit of the definition is disingenuous.
This is similar to the approach taken by some video game studios: release the source code under a permissive license, but not the game assets. Which is better than a proprietary license, but it still presents a hurdle for the final product to be built from source.
The open weights approach is much more user hostile, however. Proprietary game assets can at least be purchased, and the final product can be built. With open weights, this is not possible. Nobody can realistically build the same model or similar models from weights alone. They can use the weights and self-host the prebuilt model, but not create revisions of it, which is the whole point of open source.
Weights are essentially the bytecode of language models. Sure, you can run and modify it with the right tools, but without the tools used to create it in the first place, the project is not much more useful than publishing binaries.
You also need the training data, so you can ensure you're not benchmarking on the training set, fine-tuning on the training set (overfitting with extra steps), or otherwise breaking things.
It's not about the preferred way. Otherwise open source software would need to give you their IDE setup, CI/CD setup, access to all internal tools, etc. Software like SQLite doesn't release its full test suite. They paywall the preferred way of making changes, yet they are open source.
>The “source code” for a work means the preferred form of the work for making modifications
The GPL refers to a form of the artifact being released.
> open source software would need to give you their IDE setup, CI/CD setup, access to all internal tools, etc.
IMO they do. If you can't modify it like a core contributor would, then it's not really open source. Traditional open source projects always included development guides, test configurations etc.
> Software like sqlite don't release their full test suite. They paywall the preferred way of making changes, yet they are open source.
That's a matter of opinion. IMO SQLite is not true open source, for precisely this reason.
The key is if you consider weights source code. I do not think this is a common interpretation.
> The labs themselves modify models the same as you are allowed to by the license
Do the labs not use source code?
It is a bit like arguing that releasing a binary executable is releasing the source code. One could claim developers modify the binary the same as you are allowed to.
The weights are part of the source code. When running inference on a model you use the architecture, config files and weights together. All of these are released. Weights are nothing but "hardcoded values". The way you reach those values is irrelevant in the license discussion.
Let's take a simple example: I write a chess program comprised of a source file with 10 "if" statements, a config file that maps the variables used in the if statements, and a "hardcoded values" file that stores the actual values. It would be a crappy chess program, but I hope you agree that I could release that as open source and no one would bat an eye. You would also be granted the right to edit those hardcoded values, if you wished. You'd perhaps make the chess bot better or worse, but you would be allowed to edit it, just like I would. That's the preferred way of modifying it. Me providing the methods that I used to reach those 10 hardcoded values has 0 bearing on my crappy chess bot being open source or not. Do we agree on that?
Now instead of 10 values, make it 100 billion. Hey, that's an LLM!
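To make the thought experiment concrete, here's a toy sketch (everything below is invented for illustration; it's not code from any real project):

```python
# The "program": a handful of if statements. The behaviour lives in the
# hardcoded values below. Editing those values is exactly how you (or I)
# would modify the thing, regardless of how I originally picked them.
VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def evaluate(counts):
    # counts: {piece_name: (white_count, black_count)}
    score = 0
    for piece, (white, black) in counts.items():
        if piece in VALUES:
            score += VALUES[piece] * (white - black)
    return score  # positive favours white
```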
> It is a bit like arguing that releasing a binary executable is releasing the source code.
That's the misconception. Weights are not a binary executable. In other words, there isn't another level above weights that the labs use to "compile" the weights. The weights exist from the beginning to the end, and the labs edit the weights if they want to modify the models. And so can you. There isn't a "compilation" step anywhere in the course of training a model.
If you have 10 hardcoded values, you have a binary blob: a common feature, particularly in hardware drivers, that is opaque and commonly considered not to be fully free unless the instructions for deriving it are also included. It's frequently just an executable, occasionally just configuration information, and difficult to change, while (assuming no signing shenanigans) still remaining technically possible.
The training data is the source code and the training process is the compiler. There's a fairly direct metaphor to be made there. Different compilers can have vastly different impacts on the performance of the compiled executable.
I think source code really only exists in terms of the source code/object code dichotomy, so what "traditional" open source means for model weights is really not obvious if you only go off of traditional definitions. Personally I think the term "open source" shouldn't apply here any more than it would for art or binary code.
Consider the following: it is possible to release binaries under the Apache2 license. Microsoft has, at least at one point, released a binary under the BSD license. These binaries are not open source because they are not source.
This isn't the same argument as given in the article though, so I guess it is a third position.
> Consider the following: it is possible to release binaries under the Apache2 license. Microsoft has, at least at one point, released a binary under the BSD license. These binaries are not open source because they are not source.
Agreed. But weights are not binaries in the licensing context. For weights to be binaries it would imply another layer of abstraction, above weights, that the labs use as the preferred way of modifying the model, and then "compile" it into weights. That layer does not exist. When you train a model you start with the weights (randomly initialised, can be 0 can be 1, can be any value, whatever works best). But you start with the weights. And at every step of the training process you modify those weights. Not another layer, not another abstraction. The weights themselves.
> They're an artifact of a training process, not code that was written by someone.
If that were relevant to the licensing discussion, then you'd have to consider every "generated" part (interfaces, dataclasses, etc.) of every open source project an artefact. Historically, that was never the case. The license doesn't care if a hardcoded value was written by a person or "tuned" via a process. It's still source code if it's the preferred way of modifying said code. And it is. You can totally edit weights by hand. It would not work as well (or at all), but you could do it.
There is actually a gray area about what code "counts" as source code to the point where you would consider it "open source" if it were licensed as such. I think if you had a repository consisting of only generated code and not the code used to generate it, it would definitely raise the question of whether it should be considered "source code" or "open source", and I think you could make arguments both ways.
On the other hand, I don't really think that argument then extends to model weights, which are not just some number of steps removed from source code, but just simply not really related to source code.
I mostly agree with your assessment of what we should/shouldn't call open source for models but there is enough grey area to make the other side a valid position and not worthy of being dismissed so easily. I think there is a fine line between model weights and, say, bytecode for an interpreter and I think if you released bytecode dumps under any license it would be called out.
I also believe the four freedoms are violated to some extent (at least in spirit) by just releasing the weights and for some that might be enough to call something not open source. Your "freedom to study how the program works, and change it to make it do what you wish" is somewhat infringed by not having the training data. Additionally, gpt-oss added a (admittedly very minimal) usage policy that somewhat infringes on the first freedom, i.e. "the freedom to run the program as you wish, for any purpose".
You are free to look at every single weight and study how it affects the result. You can see how the model is architected. And you don't need training data to be provided to be able to modify the weights. Software can still be open source even if it isn't friendly to beginners.
I think you could say something remarkably similar about just releasing bytecode as well and I think most people would call foul at that. I don't think it's so cut and dry.
This isn't entirely about being a beginner or not either. Full fine-tuning without forgetting really does want the training data (or something that is a good replacement). You can do things like LoRA but, depending on your use case, it might not work.
"Good observations regarding the benchmark vs. vibes in general"
Most "vibes" people are missing that it as only has 5B active parameters.
They read 120B and expect way more performance than a 24B parameter model, even though empricaly a 120B model with 5B active parameters is expected to perform right around there.
I saw a bunch of people complaining on Twitter about how GPT-OSS can't be customized or has no soul and I noticed that none of them said what they were trying to accomplish.
"The main use-case for fine-tuning small language models is for erotic role-play, and there’s a serious demand."
Ah.
It's not erotic role-play, but I have a use case: making an AI-powered NetHack clone. Specifically, to generate dungeon layouts, dialog for NPCs, and to fill in the boatloads of minutiae and interactions which NetHack is famous for.
You kind of need soul for that, and a lot of background knowledge on mythology/fantasy lore, but also tool use to work the world systems.
Want a good use case?
I am playing around with an interactive workflow where the model suggests what can be wrong with a particular chunk of code, then the user selects one of the options, and the model immediately implements the fix.
Biggest problem? Total Wild West in terms of what the models try to suggest. Some models suggest short sentences, others spew out huge chunks at a time. GPT-OSS really likes using tables everywhere. Llama occasionally gets stuck in the loop of "memcpy() could be not what it seems and work differently than expected" followed by a handful of similar suggestions for other well-known library functions.
I mostly got it to work with some creative prompt engineering and cross-validation, but having a model fine-tuned for giving reasonable suggestions that are easy to understand at a quick glance would be way better.
I haven't tried your exact task, of course, but I've found a lot of success in using JSON structured output (in strict mode), and decomposing the response into more fields than you would otherwise think useful. And making those fields highly specific.
For example: make the suggestion output an object with multiple fields, naming one of them `concise_suggestion`. And make sure to take advantage of the `description` field.
For people not already using structured output, both OpenAI and Anthropic consoles have a pretty good JSON schema generator (give prompt, get schema). I'd suggest using one of those as a starting point.
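To make that concrete, here's a minimal sketch of the kind of schema decomposition I mean, assuming a strict-mode structured-output API; all field names are illustrative, not from a real project:

```python
# Hypothetical schema for the code-review workflow described upthread.
# The point is many small, specific fields, plus liberal use of
# "description" to steer the model on each one.
SUGGESTION_SCHEMA = {
    "type": "object",
    "properties": {
        "suspect_lines": {
            "type": "string",
            "description": "Line range of the snippet this applies to, e.g. '12-15'.",
        },
        "concise_suggestion": {
            "type": "string",
            "description": "One sentence, readable at a glance. No code blocks.",
        },
        "reasoning": {
            "type": "string",
            "description": "Why this might be a bug, in 2-3 sentences.",
        },
        "confidence": {
            "type": "string",
            "enum": ["low", "medium", "high"],
        },
    },
    "required": ["suspect_lines", "concise_suggestion", "reasoning", "confidence"],
    "additionalProperties": False,  # strict mode: no hallucinated extra fields
}
```

The more specific each field is, the less room the model has to waffle or invent its own format.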
What's the problem with that? We have erotic texts dating back thousands of years, basically as old as the act of writing itself: https://en.wikipedia.org/wiki/Istanbul_2461
There's nothing wrong with it, but you have to understand the differences between different user groups to know which limitations are relevant to your own use cases. "It doesn't follow instructions" could mean "it won't pretend to be a horny elf" or "it hallucinates fields outside the JSON schema I specified"; the latter is much more of a problem for my uses.
Really, if you want a fey creature with horns, a satyr is probably a better bet than an elf.
I have no problem with it and I can understand why people don't want to say "I'm trying to pornify this model and it refuses to talk dirty!" in public. But if you're calling a model garbage maybe you should be honest about what the "problem" is.
Why? Is there any reason to believe problems in that context won't generalise?
Lots, yes. The fine tuning may attempt to introduce concepts that were intentionally omitted from the training data for safety* reasons.
Maybe nothing wrong with that, but it might mean that the perceived weaknesses don't generalize to an area of the model that hasn't been lobotomized.
* using safety the way OpenAI have been using the term, not looking to debate the utility of that.
The pro-porn side has zero PR because respectable public figures don't see pro-porn advocacy as a good career move. At most, you'll get some oblique references to it.
Meanwhile, the anti-porn side has a formidable alliance:
- Right-wing, religiously-motivated anti-porn activists.
- Left-wing, feminism-motivated anti-porn activists.
- Big corporate types with lots of $$$$ to spend who want their customer support chatbot to be completely SFW at all times.
- AI safety folk who think keeping the model on a tight leash is an ethical obligation, lest future iterations take over the world.
- AI vendors who are keen on the yes-it-might-take-over-the-world narrative.
- AI vendors who just don't want their developers having to handle NSFW stuff at work.
- Politicians who don't know a transformer from a diffusion model, but who've heard a chorus of worries about lost jobs and AI bias and deepfakes and revenge porn.
These people will speak up in public at the drop of a hat.
My use case has been trying to remove the damn "apologies for this" and extraneous language that just wastes tokens for no reason. GPT has always always always been so quick to waffle.
And removing the chat interface as much as possible. Many benchmarks come out better with text-completion models, but they keep insisting on this horrible chat interface for their models.
Fine tuning is there to ensure you get the output format you want without the extra garbage. I swear they have tuned their models to waste tokens.
The jargon to google here is "length bias".
It turns out if you generate two LLM responses and ask a judge to choose which is better, many judges have a bias in favour of long answers full of waffle.
Thanks for that pointer.
The abstract of this paper seems interesting: https://arxiv.org/html/2407.01085v3
> use of [LLMs] as judges [..] reveals a notable bias towards longer responses, undermining the reliability of such evaluations. To better understand such bias, we propose to decompose the preference evaluation metric, specifically the win rate, into two key components: desirability and information mass [..]
(If you're interested, give it a click. I tried to pare this down to avoid quoting a wall of text.)
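If you want a quick sanity check of your own judge pipeline, here's a rough sketch (my own illustration, not code from the paper; it assumes you've already collected pairwise judgements):

```python
# How often does the longer answer win? On quality-matched pairs this
# should hover near 0.5; a judge that rewards waffle pushes it well above.
def longer_answer_win_rate(records):
    # records: iterable of (answer_a, answer_b, winner), winner is "a" or "b"
    wins = total = 0
    for a, b, winner in records:
        if len(a) == len(b):
            continue  # equal lengths tell us nothing about length bias
        longer = "a" if len(a) > len(b) else "b"
        total += 1
        wins += (winner == longer)
    return wins / total if total else float("nan")
```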
> I swear they have tuned their models to waste tokens.
Which seems a bit weird, because the customers of the chat interface (i.e. non-API customers) don't pay per token.
Porn is always the frontier.
It's a well-understood, self-contained use case with few externalities and simple business models.
What's more, with porn, the medium is the product probably more than the content. Having it on home media in the 80s was the selling point. Getting it over the 1-900 phone lines or accessing it over the internet... these were arguably the actual product. It might have been a driver of early smartphone adoption as well. Adult content is about 80% consumed on handheld devices, while the internet writ large is about 60%.
Private tunable multi-media interaction on-demand is the product here.
Also, it's a unique offer. Role-playing prohibited sexual acts can arguably be done victim-free.
There's a good fiction story there... "I thought I was talking to AI"
1. Porn. 2. Military.
The firmer is a lot more nimble and the procurement processes of your customers are easier to navigate.
> firmer
snort
Sometimes a typo makes you look wittier than you are.
Even if it is victim-free, it can affect mental health in a way that leaves a consumer more compelled to commit a criminal act and create a real victim.
Let's say you publish a Steam game about how to be a school shooter and shoot kids; wouldn't that lead to real school shootings?
Who can definitively say that computer-generated content about criminal behavior won't lead to real crime with real victims?
https://en.wikipedia.org/wiki/Active_Shooter
Grand Theft Auto 5 sold over 200 million copies, and military/crime/shooter games have always been incredibly popular. Yet crime has been decreasing over the past few decades in the United States, where both cars and guns are easily accessible.
I view it more like methadone.
Let's be specific: Rape, incest, necrophilia, bestiality, and pedophilia ideation.
I think we can all agree (1) these are harmful, anti-social behaviors that we do not want in our society, (2) people don't choose to have these desires, (3) most people who have them have no desire to actually traumatize others, (4) people who have these struggle with it.
These multi-media AI role-play environments would allow that type of engagement without any harm.
Now given all this, I am not a psychologist and do not know if that's part of how someone unfortunate enough to have those inclinations can deal with it healthily.
But if it is, now it exists and hopefully we can see less of it in the real world. I'm all for harm reduction if this is a way to get there.
It’s not unreasonable to suspect that engaging in high-fidelity simulations of these behaviors will further entrench and worsen paraphilias. This is pretty evident in the progression of many pornography addictions that don’t include these sorts of things but still follow the pattern of increasing novelty-seeking leading to increasingly deviant material.
I am at a principled level uneasy with what’s fundamentally a sort of prior restraint (you haven’t yet hurt anyone but this may increase the likelihood and/or be an effective proxy to lock up those who are more likely to do so) but also see a really strong case for doing it given the fact that these are arguably the most antisocial behaviors one can imagine.
Right, I'm just a technologist. The psychological and sociological parts aren't my bailiwick.
Typing a prompt in an AI box to make art has fewer real-world victims than performing the acts, filming them, and then sharing the videos.
I think that's inarguable. Maybe it's still inadvisable and someone should be in talk therapy. I have no idea. But at least nobody is actually getting molested and retraumatized in the AI art scenario.
If someone is spending their time using ComfyUI drawing pictures instead of stalking the local middle school, I'd hesitate to say mission accomplished... but maybe I should?
People's time is finite. They can't be doing both. If the real is substituted for the imaginary then the real can no longer happen because that time is spent.
Are there any actual studies on this? Does access to simulations of illegal or objectionable material make pedophiles, rape fetishists, etc. more or less likely to try to access the real thing (or even worse, to try to commit crimes in the real world)?
Because both possibilities are plausible, it’s hard to know which is correct.
Even if there are, I (and likely you) lack the qualifications to draw any clinical takeaways from them.
We make tools in good faith and hope they're used responsibly to make the world a better place.
That, to me, is the true reward of the professional programmer.
What about two consenting adults engaging in age play? By your logic, wouldn't that also lead to "real crime"?
There’s probably some difference between someone who is visibly an adult and mature vs more deeply entrenching pathways of arousal in response to someone who is visibly a child. I still find the adult fetish version repellent but it’s also really hard to police in a way that’s remotely ethically permissible.
I.e., yes, it’s bad, and in an ideal world nobody would do it. I see trying to restrict or ban it as the greater of two evils.
Who can say that it does?
> Let's say you publish a Steam game about how to be a school shooter and shoot kids; wouldn't that lead to real school shootings?
> Who can definitively say that computer-generated content about criminal behavior won't lead to real crime with real victims?
I can’t tell if you’re being sarcastic, but no link has been found between violent video games and violent crime, despite it being researched extensively:
https://www.apa.org/news/press/releases/2020/03/violent-vide...
https://link.springer.com/article/10.1007/s10964-019-01069-0
https://pmc.ncbi.nlm.nih.gov/articles/PMC6756088/
https://elifesciences.org/articles/84951
Of course, that hasn’t stopped video games being blamed for violence by the “think of the children” crowd and certain politicians:
https://en.m.wikipedia.org/wiki/Family_Entertainment_Protect...
https://www.theatlantic.com/technology/archive/2019/08/video...
Especially when shootings occur by white perpetrators:
https://www.apa.org/news/press/releases/2019/09/video-games-...
The same narrative plays out for porn, despite the research findings being the same:
https://www.utsa.edu/today/2020/08/story/pornography-sex-cri...
But blaming violent video games or pornography is an easy scapegoat.
There's something Freudian about the idea that the more you can customize porn, the more popular it is. That, despite the impression that "all men want one thing", it turns out that men all want very different and very oddly specific things. Imbuing something with a "magical" quality that doesn't exist is the origin of the term "fetish". It's not about the raw attractive preference for a particular hair color; it's a belief in the POWER of that hair color.
Oh, it's wildly different. About 15 years ago I worked on a porn recommendation system. The idea was that you'd follow a number of sites based on likes and recommendations and you'd get an aggregated feed with interstitial ads.
So I started with scraping, cross-referencing, FOAF, doing analysis. People's preferences are... really complex.
Without getting too lewd, let's say there's about 30-80 categories with non-marginal demand, depending on how you want to slice it, and some of them can stack, so you get a combinatorial explosion.
In early user testing people wanted the niche and found the adventurous (of their particular kind) to be more compelling. And that was the unpredictable part. The majoritarian categories didn't have stickiness.
Nor did these niches have high correlation. Someone could be into, say, specific topic A (let's say feet), and correlating that with topic B (let's say leather) was a dice roll. The probabilities were almost universally < 10% unless you went into majoritarian categories (e.g. fit people in their 20s).
People want adventure on a reservation with a very well defined perimeter - one that is hard to map and different for every person.
So the value-add proposition went away since it's now just a collection of niche sites again.
Also, these days people have Reddit accounts reserved for porn where they do exactly this. So it was built after all.
You may be interested in the data surfaced by this large-scale survey[1]
[1] https://aella.substack.com/p/fetish-tabooness-and-popularity...
This is interesting but there's a little more to it, especially with the erotic.
If people were polled on what they want to see on social media, few would say things that are inflammatory, upsetting, divisive, etc., but those, as we know, are strong drivers of engagement.
It's because you're polling for affinity or disclosed preference, not for the actual engagement drivers.
For instance, if a male says they watch male pornography, they are labeling themselves with, or at least stating an affinity to, a sexual identity.
However, the identities people choose to own are not the same as the preferences they actually have.
Instead, if you track things like scroll velocity, linger time, revisitation, and time distance (such as 2 days apart instead of 5 minutes), a different story emerges.
For instance, a given male could frequently look at male pornography but, for all kinds of social reasons, not want that affinity, so they'd never even internally ideate the preference, although their behavior of frequenting male content will be there regardless.
That's one of the problems with this approach: not many people want to own all the social identities which map to their preferences, so they don't openly identify with them.
There are (maybe) three levels of acceptance: admitting it to oneself, admitting it to others, and identifying with it. And honestly these have a poor mapping to actual engagement with explicit content. You can have a (insert sexual affinity) rights activist who does not look at explicit content and someone protesting them who does all the time.
Man, I would pay money to see the (anonymized) trends on an adult website. Fascinating view into such an understudied area of human nature. I bet the porn tubes have data that sociologists could write papers on.
> If people were polled on what they want to see on social media, few would say things that are inflammatory, upsetting, divisive, etc., but those, as we know, are strong drivers of engagement.
That's because those are two entirely different things. If you polled people and asked them "what causes you to spend more time on social media", then at least some self-aware folks would likely identify conflict, "someone is wrong on the Internet" (https://xkcd.com/386/), etc. That doesn't mean that's "what they want to see on social media", that means that's "what gets them to spend more time on social media".
> Also, these days people have Reddit accounts reserved for porn where they do exactly this. So it was built after all.
Didn't reddit remove porn?
No. Not at all. You must be thinking of a different site. Tumblr did, and OnlyFans did for a hot minute and then backtracked.
Neither of them intended to be porn sites. It's kind of a natural occurrence on UGC sites. Look at Civitai...
Credit card processors are kinda wary of it for some legal reasons I'm not qualified enough to really understand.
You don't understand! Every erotic chatbot service keeps getting censored, what happened to CharacterAI just keeps happening. There's a serious supply-shortage, do you really want people turning to Grok? The spice must flow!!!
I've found good use of Phi-4 at home, and after a few tests of the GPT-OSS 20B version I'm quite impressed so far.
Particularly one SQL question that has tripped every other model of similar or smaller size that I've tried, like Devstral 24B, Falcon 3 7B, Qwen2.5-coder 14B and Phi 4 14B.
The question contains a key point which is obvious to most humans, and which all of the models I tried previously have failed to pick up on. GPT-OSS picked up on it, and made a reasonable assumption.
It's also much more thorough at explaining code compared to the other models, again including details the others miss.
Now if only I had a GPU that could run the whole thing...
Can you share the question? Or are you intentionally trying to keep it out of the training data pool?
Sadly no. I'd like to keep it untainted, but also because the tables involved are straight from my work, which is very much not OSS.
I can however try to paraphrase it so you get the gist of it.
The question asks to provide a SQL statement to update rows in table A based on related tables B and C, where table B is mentioned explicitly and C is implicit through the foreign keys provided in the context.
The key point all previous models I've tested have missed is that the rows in A are many-to-one with B, and so the update should take this into account. This is implicit from the foreign key context and not mentioned directly in the question.
Think distributing pizza slices between a group of friends. All previous models have completely missed this part and just given each friend the whole pizza.
GPT-OSS correctly identified this issue and flagged it in the response, but also included a sensible assumption of evenly dividing the pizza.
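For the curious, the shape of the fix is roughly the following. Table and column names are invented stand-ins (the real schema isn't shareable), with a.c_id -> c.id and c.b_id -> b.id as the foreign keys:

```sql
-- b.amount is the "pizza" that must be split across all A rows that
-- reach the same B row through the implicit hop via C.
UPDATE a
SET    portion = shares.amount / shares.n
FROM (
    SELECT a.id,
           b.amount,
           COUNT(*) OVER (PARTITION BY b.id) AS n  -- A rows sharing this B
    FROM   a
    JOIN   c ON c.id = a.c_id
    JOIN   b ON b.id = c.b_id
) AS shares
WHERE a.id = shares.id;

-- The naive version the other models produced is effectively
-- SET portion = b.amount, i.e. each friend gets the whole pizza.
```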
I should note some of the previous models also missed the implicit connection to table C, and thus completely failed to do something sensible. But at least several of them did figure this part out. Of course I forgot to write that down, so I can't say offhand which did what.
As for the code: for example, I've coded a Y combinator in Delphi, using intentionally terse non-descriptive names, and asked the models to explain how the code works and what it does. Most models ~7B and larger of the past year or so have managed to explain it fairly well. However, GPT-OSS was much more thorough and provided a much better explanation, showing a significantly better "understanding" of the code. It was also the first model smaller than Llama 3 70B that I've tried that correctly identified it as a Y combinator.
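For reference, this is the standard shape of the construct, in Python rather than the Delphi original (which isn't public); the terse names are intentional:

```python
# The Y combinator: builds a recursive function out of a non-recursive
# one, with no self-reference in the definition itself.
Y = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

fact = Y(lambda g: lambda n: 1 if n == 0 else n * g(n - 1))
print(fact(5))  # 120
```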
> for instance, they have broad general knowledge about science, but don’t know much about popular culture
That seems like a good focus. Why learn details that can change within days of the model being released? Instead, train the models to have good general knowledge and be really good at using tools, and you won't have to re-train models from scratch just because some JS library now has a different API; instead the model goes out to fetch the latest APIs/gossip when needed.
Why would anything change?
You feed the model approximately all the text you have, ever. And some things, like the 'popular culture of 2025', won't change just because the calendar changes to 2026. Just like the popular culture of the 1980s is what it was, and won't change.
We don't feed the model all the text ever. They are still trained on less than 1% of the entire Internet corpus.
You are right, though on the other hand feeding it a selection of 1% of the entire corpus is already pretty close to 'all the text' (if you assume exponential growth in training over time).
Even multiplying that to approximately 100% of the corpus, plus adding lots of non-internet text, will pale in comparison to all the non-text training data we will be (or already are) feeding our coming (and existing) multi-modal models.
If I may go out on a limb here: either we will see continuous great progress on text-based LLMs alone, or multi-modal models will become the next big focus. (Or both.)
That's because people are hungry for progress, and going multi-modal is the obvious thing to try to focus on, if text alone proves infeasible to drive progress.
Just to be clear: I make no prediction here on whether multi-modal will lead to progress, just that people will obviously try it and try it hard, if the focus on text starts to stall.
Yeah, it always seemed like a sad commentary on our world that AIs are devoting their weights to encyclopedic knowledge of Harry Potter, Pokemon, and Reddit trolling.
Why? You gotta provide what your customers want.
And it's far from sad that we have so many resources we can give everyone a supercomputer in their pocket just to take selfies and talk about Pokemon. Why would our AIs be any different?
Does anyone know how synthetic data is commonly generated? Do they just sample the model randomly starting from an empty state, perhaps with some filtering? Or do they somehow automatically generate prompts, and if so, how? Do they have some feedback mechanism, e.g. do they maybe test the model while training and somehow generate data related to poorly performing tests?
I don't know about Phi-5, but earlier versions of Phi were trained on stories written by larger models trained on real-world data. Since it's Microsoft, they probably used one of the OpenAI GPT series.
It’s common to use rejection sampling: sample from the model and throw out the samples which fail some criteria like a verifiable answer or a judgement from a larger model.
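In pseudocode it's roughly this; generate() and is_acceptable() are hypothetical stand-ins for the teacher model and whatever filter is used (an answer check, a verifier, or a judge model):

```python
# Rejection sampling for synthetic data: sample k candidates per prompt,
# keep only those that pass the filter. Real pipelines also add dedup,
# decontamination, and quality scoring on top.
def make_synthetic_dataset(prompts, generate, is_acceptable, k=8):
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            candidate = generate(prompt)
            if is_acceptable(prompt, candidate):
                dataset.append((prompt, candidate))
                break  # keep the first candidate that passes
    return dataset
```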
Is it confirmed that synthetic data was used for gpt-oss training? I didn't pick up on that in the press release or see it elsewhere. Did I miss it or is Sean speculating that it is the case?
> It’s not discussed publically very often, but the main use-case for fine-tuning small language models is for erotic role-play, and there’s a serious demand. Any small online community for people who run local models is at least 50% perverts.
Amazing
If a model is trained only on synthetic data, is it still possible it will output things like this? https://x.com/elder_plinius/status/1952958577867669892
In theory, it's possible. https://x.com/OwainEvans_UK/status/1947689616016085210
It's not particularly likely that the hidden information encoded in synthetic data would happen to include specific details for making LSD or VX, but it's much more plausible that synthetic data contains some information the model's trainers would prefer to not incorporate in the model.
By definition, a model can't "know" things that are not somewhere in its training set, unless it can use a tool to query external knowledge.
The problem is that the size of the training set required for a good model is so large that it's really hard to make a good model without including almost all known written text available.
> By definition, a model can't "know" things that are not somewhere in its training set, unless it can use a tool to query external knowledge.
Well, it could also make inferences. Like, it could find a new mathematical proof, even if that's never in the training set.
> all known written text available
If Phi-5 is trained on synthetic data only, then info on how to make drugs must be in the synthetic dataset.
I mean, yeah. From Table 9 (Hallucination evaluations) in the GPT-OSS model card [1], GPT-OSS-20b/120b have accuracies of 0.067/0.168 and hallucination rates of 0.914/0.782 respectively, while o4-mini has an accuracy of 0.234 and a hallucination rate of 0.750. These numbers simply mean that the GPT-OSS models have little real-world knowledge, and they hallucinate hard. Note that little real-world knowledge has always been a "feature" of the Phi series, because of the "safety" (for large companies), or rather "censorship" (for users), requirements.
In addition, from Table 4 (Hallucination evaluations) in the OpenAI o3 and o4-mini system card [2], o3/o4-mini have accuracies of 0.49/0.20 and hallucination rates of 0.51/0.79.
In summary, there is a significant real-world knowledge gap between o3 and o4-mini, and another significant gap between o4-mini and GPT-OSS. The poor real-world knowledge exhibited by GPT-OSS is in line with the "feature" of the Phi series.
[1] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7... [2] https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f372...
Is it true that most small language models are fine-tuned for erotic role-play?
What you wrote is somewhat ambiguous, so allow me to rephrase. It is true that most fine-tunes of relatively small (which can mean anything up to 150B params, depending on who you ask!) LLMs are for uncensored roleplay purposes.
Yeah, makes sense. Good observations regarding the benchmark vs. vibes in general, and I hadn't made the connection between the lead of the Phi models going to OpenAI and gpt-oss. It could very well be a similar exercise, plus their "new" prompt-level adherence (system > developer > user). In all the traces of refusals I've seen, the model "quotes" the policy quite religiously. A similar thing was announced for GPT-5.
I think the mention of the "horny people" is warranted; they are an important part of the open-model scene (and the first to explore the idea of "identities/personas" for LLMs, AFAIK). Plenty of fine-tuning know-how trickled from there into the "common knowledge".
There's a thing I would have liked to see explored, perhaps: the idea that companies might actually want exactly what -oss offers. While the local-LLM communities might want freedom and a horny assistant, businesses absolutely do not. In fact, they spend a lot of effort implementing (sometimes less-than-ideal) guardrails to keep the models on track. For very easy use cases like support chatbots, businesses will always prefer something that errs on the side of less-than-useful but "safe", rather than have the bot go crazy with sex/slurs/insults/etc.
I do have a problem with this section though:
> Really open weight, not open source, because the weights are freely available but the training data and code is not.
This is factually incorrect. The -oss models are by definition open source. Apache 2.0 is an open source license (I think even the purists agree with this). Sharing "training data and code" is absolutely not a prerequisite for being open source (and historically it was never required; the craze surrounding LLMs suddenly made this a thing. It's not).
Here's the definition of source in "open source":
> "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
Well, for LLMs the weights are the "preferred form for making modifications". The labs themselves modify models the same way you are allowed to by the license! They might use more advanced tools or better datasets, but in the end the definition still holds. And you get all the other stuff, like the right to modify, re-release, etc. I really wish people would stop proliferating this open-weight nonsense.
Models released under open source licenses are open source: gpt-oss, Qwens and Mistrals (Apache 2.0), DeepSeeks (MIT), etc.
Models released under non-open-source licenses also exist, and they're not open source because the licenses under which they're released aren't. Llamas, Gemmas, etc.
No, the preferred way of making modifications is the weights _together_ with the training (or fine-tuning) scripts, the entire evaluation pipeline to measure performance, and the data required to support all of this.
When someone joins your data science team, you would give them all this code and data. You wouldn't just hand them the weights and say: the weights are the source, modify them to improve the model, I look forward to seeing your MR next week.
EDIT: Heck, sometimes the way to make improvements (modifications) is just to improve the data and not touch the training code at all; it is often one of the most powerful approaches. You still need the training code, though, and evaluation to measure the impact.
The license gives you the right to modify the weights; how you do the modification is up to you. The rest is in the realm of IP, know-how, etc. Apples and oranges.
Having the right to modify one part of the product is not the same as having the right to modify the entire product. Labeling such projects as open source in the full spirit of the definition is disingenuous.
This is similar to the approach taken by some video game studios: release the source code under a permissive license, but not the game assets. That's better than a fully proprietary license, but it still presents a hurdle to building the final product from source.
The open-weights approach is much more user-hostile, however. Proprietary game assets can at least be purchased, and the final product can be built. With open weights, this is not possible. Nobody can realistically build the same model, or similar models, from weights alone. They can use the weights and self-host the prebuilt model, but not create revisions of it, which is the whole point of open source.
Weights are essentially the bytecode of language models. Sure, you can run and modify them with the right tools, but without the tools used to create them in the first place, the project is not much more useful than publishing binaries.
You also need the training data, so you can ensure you're not benchmarking on the training set, fine-tuning on the training set (overfitting with extra steps), or otherwise breaking things.
It's not about the preferred way. Otherwise open source software would need to give you their IDE setup, CI/CD setup, access to all internal tools, etc. Software like SQLite doesn't release its full test suite; they paywall the preferred way of making changes, yet SQLite is open source.
>The “source code” for a work means the preferred form of the work for making modifications
The GPL refers to a form of the artifact being released.
> open source software would need to give you their IDE setup, CI/CD setup, access to all internal tools, etc.
IMO they do. If you can't modify it like a core contributor would, then it's not really open source. Traditional open source projects always included development guides, test configurations etc.
> Software like SQLite doesn't release its full test suite; they paywall the preferred way of making changes, yet SQLite is open source.
That's a matter of opinion. IMO SQLite is not truly open source, for precisely this reason.
The key question is whether you consider weights to be source code. I do not think this is a common interpretation.
> The labs themselves modify models the same way you are allowed to by the license
Do the labs not use source code?
It is a bit like arguing that releasing a binary executable is releasing the source code. One could claim developers modify the binary the same way you are allowed to.
> Do the labs not use source code?
The weights are part of the source code. When running inference on a model, you use the architecture, config files, and weights together. All of these are released. Weights are nothing but "hardcoded values". The way you reached those values is irrelevant to the license discussion.
Let's take a simple example: I write a chess program comprised of a source file with 10 "if" statements, plus a config file that maps the variables used in those if statements to entries in a "hardcoded values" file that stores the actual values. It would be a crappy chess program, but I hope you agree that I could release it as open source and no one would bat an eye. You would also be granted the right to edit those hardcoded values if you wished; you'd perhaps make the chess bot better or worse, but you would be allowed to edit it, just like I would. That's the preferred way of modifying it. Me providing the methods I used to reach those 10 hardcoded values has zero bearing on whether my crappy chess bot is open source. Do we agree on that?
Now instead of 10 values, make it 100 billion. Hey, that's an LLM!
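To make the toy example concrete (all names and values here invented for illustration; the "hardcoded values" file is inlined as a dict to keep the sketch self-contained):

    # The tunable behaviour lives in the "hardcoded values"; the logic is
    # just a handful of if statements.
    W = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5,
         "queen": 9, "center_bonus": 0.25}   # the 'hardcoded values' file

    def score(board):
        # board: list of (piece_name, occupies_center) tuples for one side
        total = 0.0
        for piece, in_center in board:
            if piece in W:                   # one of the 10-ish "if" statements
                total += W[piece]
            if in_center:
                total += W["center_bonus"]
        return total

    print(score([("queen", True), ("pawn", False)]))  # 10.25

Editing W by hand is exactly the kind of modification the license permits, regardless of how I arrived at those values.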
> It is a bit like arguing that releasing a binary executable is releasing the source code.
That's the misconception. Weights are not a binary executable. In other words, there isn't another level above weights that the labs use to "compile" the weights. The weights exist from the beginning to the end, and the labs edit the weights if they want to modify the models. And so can you. There isn't a "compilation" step anywhere in the course of training a model.
If you have 10 hardcoded values, you have a binary blob, a common feature particularly in hardware drivers, one that is opaque and commonly considered not fully free unless the instructions for deriving it are also included. It's frequently just an executable, occasionally just configuration information, but difficult to change while still remaining technically possible (assuming no signing shenanigans).
The training data is the source code and the training process is the compiler. There's a fairly direct metaphor to be made there. Different compilers can have vastly different impacts on the performance of the compiled executable.
Training is obviously the compilation step.
I think "source code" really only exists in terms of the source code/object code dichotomy, so what "traditional" open source means for model weights is really not obvious if you only go off traditional definitions. Personally, I think the term "open source" shouldn't apply here any more than it would for art or binary code.
Consider the following: it is possible to release binaries under the Apache2 license. Microsoft has, at least at one point, released a binary under the BSD license. These binaries are not open source because they are not source.
This isn't the same argument as given in the article though, so I guess it is a third position.
> Consider the following: it is possible to release binaries under the Apache2 license. Microsoft has, at least at one point, released a binary under the BSD license. These binaries are not open source because they are not source.
Agreed. But weights are not binaries in the licensing context. For weights to be binaries, there would have to be another layer of abstraction above the weights that the labs use as the preferred way of modifying the model, which then gets "compiled" into weights. That layer does not exist. When you train a model, you start with the weights (randomly initialised; they can be 0, can be 1, can be any value, whatever works best). But you start with the weights, and at every step of the training process you modify those weights. Not another layer, not another abstraction. The weights themselves.
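A toy illustration of that point (PyTorch; obviously not any lab's actual pipeline):

    import torch

    model = torch.nn.Linear(16, 1)   # the weights begin as random values
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(64, 16), torch.randn(64, 1)
    for step in range(100):
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()                   # those same weight tensors are updated in place

There is no separate "source" artifact that gets compiled into the weights; the weights are the thing being edited from the first step to the last.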
In my opinion, though, they're not really source code either. They're an artifact of a training process, not code that was written by someone.
> They're an artifact of a training process, not code that was written by someone.
If that were relevant to the licensing discussion, then you'd have to consider every "generated" part (interfaces, dataclasses, etc.) of every open source project an artefact. Historically, that was never the case. The license doesn't care whether a hardcoded value was written by a person or "tuned" via a process. It's still source code if it's the preferred way of modifying said code. And it is: you can totally edit the weights by hand. It would not work as well (or at all), but you could do it.
There is actually a gray area about what code "counts" as source code to the point where you would consider it "open source" if it were licensed as such. I think if you had a repository consisting of only generated code and not the code used to generate it, it would definitely raise the question of whether it should be considered "source code" or "open source", and I think you could make arguments both ways.
On the other hand, I don't really think that argument then extends to model weights, which are not just some number of steps removed from source code, but just simply not really related to source code.
I mostly agree with your assessment of what we should and shouldn't call open source for models, but there is enough grey area to make the other side a valid position, not one worthy of being dismissed so easily. I think there is a fine line between model weights and, say, bytecode for an interpreter, and I think if you released bytecode dumps under any license it would be called out.
I also believe the four freedoms are violated to some extent (at least in spirit) by releasing just the weights, and for some that might be enough to call something not open source. Your "freedom to study how the program works, and change it to make it do what you wish" is somewhat infringed by not having the training data. Additionally, gpt-oss added an (admittedly very minimal) usage policy that somewhat infringes on the first freedom, i.e. "the freedom to run the program as you wish, for any purpose".
You are free to look at every single weight and study how it affects the result. You can see how the model is architected. And you don't need training data to be provided to be able to modify the weights. Software can still be open source even if it isn't friendly to beginners.
I think you could say something remarkably similar about just releasing bytecode as well and I think most people would call foul at that. I don't think it's so cut and dry.
This isn't entirely about being a beginner or not, either. Full fine-tuning without catastrophic forgetting really does want the training data (or something that is a good replacement). You can do things like LoRA but, depending on your use case, it might not work.
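For reference, roughly what a LoRA-style adapter looks like (an illustrative sketch, not any particular library's implementation):

    import torch

    class LoRALinear(torch.nn.Module):
        """Freeze a base linear layer; learn a low-rank update on top of it."""
        def __init__(self, base, rank=8, alpha=16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # pretrained weights stay frozen
            self.A = torch.nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
            self.B = torch.nn.Parameter(torch.zeros(rank, base.out_features))
            self.scale = alpha / rank

        def forward(self, x):
            # B starts at zero, so initially this behaves exactly like the base layer
            return self.base(x) + (x @ self.A @ self.B) * self.scale

    layer = LoRALinear(torch.nn.Linear(16, 16))
    out = layer(torch.randn(4, 16))

Because the base weights never move, the original behaviour is preserved, but that's exactly why some use cases need full fine-tuning and therefore the training data.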
"Good observations regarding the benchmark vs. vibes in general"
Most "vibes" people are missing that it as only has 5B active parameters.
They read 120B and expect way more performance than a 24B-parameter model, even though empirically a 120B model with 5B active parameters is expected to perform right around there.
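For what it's worth, the informal geometric-mean rule of thumb for MoE models (a community heuristic, not an exact law) lands right on that figure:

    dense-equivalent params ≈ sqrt(total × active) = sqrt(120B × 5B) ≈ 24.5B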