Why do LLMs overuse these patterns of speech that aren't overused in the wild?

Here's a quote:

> "That’s not neuroscience — that’s cargo-cult reasoning wrapped in academic buzzwords."

I think most of us who use ChatGPT would immediately recognize this as being AI-generated. It's perfectly valid English and we could all imagine a real human saying it, but ChatGPT (or maybe LLMs more broadly) seems to have landed on certain patterns like this one that it uses constantly. Is it some kind of overfitting? Post-training where a bias toward this pattern was introduced? Something else?

2 points | by jimbo808 3 hours ago

2 comments

A_D_E_P_T 3 hours ago
"It's not X, it's Y" (or "it's not just X, it's Y") is one of the most common tells, but there are many others. Here's a partial list:
> https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing
It's definitely post-training bias and reinforcement. This rhetorical structure isn't too common IRL. (Or wasn't, anyway, prior to LLMs...)
incomingpain 3 hours ago
The way they train the models is by feeding them lots and lots of literature, usually high quality.
The way they write by default is thus going to be this weird hybrid of all English styles/dialects from the last ~500 years.
The reason they are heavy on em dashes is that em dashes were immensely popular in literature for a long time, but not so much in modern times. So they stand out.
If you tell it to write in a specific way though, it does a good job at it.
Here's Detroit English, no messin' around.
The reason why AIs love the em dash so much is simple: it’s the most versatile and natural punctuation mark they can use to connect ideas and maintain flow. A large language model's primary goal is to sound human, and when people speak, they often pause, clarify, or insert a quick side-thought; the dash captures that conversational stop-and-start rhythm better than a rigid comma or a full-stop period. Plus, in the enormous amount of text the AI studies (its data), the em dash is frequently used by skilled writers as an efficient tool to replace colons, parentheses, or strong commas, so the AI simply picked up that effective writing pattern and runs with it, seeing it as the clearest and most dynamic way to structure complex sentences. That's the real deal.
- gooodvibes 18 minutes ago
  This isn’t accurate: most of the style comes from fine-tuning and reinforcement learning, not from the original training data.
  At some point people got this idea that LLMs just repeat or imitate their training data, and that’s completely false for today’s models.
- whycome 2 hours ago
  I used to use the em dash a lot. I refrain from doing it now. I hate that outcome.
  - JohnFen 2 hours ago
    I refuse to let genAI determine what my writing style should be on principle. I may not be able to do much about the rest of the various degradations genAI brings, but I can at least stand my ground when it comes to my personal expression.
  - incomingpain 2 hours ago
    >I used to use the em dash a lot. I refrain from doing it now. I hate that outcome.
    I don't think it ever got taught in my schooling; the semicolon is what they taught us to use.