Why LLMs Use Double Dashes: Tokenization, Style, and Prompting Habits

A practical, plain-English explanation of why LLMs often output double dashes, what tokenization has to do with it, and how to control punctuation in generated text.

If you spend enough time with LLMs, you start seeing patterns that feel oddly specific.

One of the most common is this:

sentence -- side note

At first glance it looks like a quirk. In practice, it is a predictable output choice.

There is no single cause. It is a combination of training data, tokenization behavior, and style imitation.

It Starts With What Models Read

LLMs are trained on huge mixed corpora: documentation, blogs, forums, source code, markdown, and scraped web text.

In those sources, punctuation is messy and inconsistent. People use:

a single hyphen (-) for hyphenation,
a double hyphen (--) as a keyboard-friendly stand-in for em dashes,
plain ASCII punctuation throughout markdown-heavy writing.

When a model sees these patterns millions of times, they become normal output candidates.

Then Tokenization Nudges the Choice

Models do not produce text one character at a time. They generate it token by token, where a token can be a whole word, a word fragment, or a short run of punctuation.

Character sequences that appear often in training text are frequently merged into a single token, and a token that showed up often in similar contexts gets a high probability. So when the model is writing explanatory prose, -- can become a low-friction continuation.

That is why this behavior shows up repeatedly even across different model families.
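
To make the tokenization point concrete, here is a toy sketch of a single byte-pair-style merge step in Python. The three-line corpus and the function names most_frequent_pair and merge_pair are assumptions made for illustration; real tokenizers are trained on far larger data with more involved rules. The only point is that a character pair seen often enough ends up as one token.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Hypothetical mini-corpus in which "--" is the most frequent character pair.
corpus = ["fast -- cheap", "yes--really", "draft--final"]
sequences = [list(text) for text in corpus]

pair = most_frequent_pair(sequences)     # the pair of two hyphens
tokens = merge_pair(sequences[0], pair)  # "--" is now a single symbol
print(pair, tokens)
```

After the merge, the double dash in the first sentence is one symbol instead of two, so a model using this toy vocabulary would emit it in a single step.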

Style Mimicry Makes It More Visible

A lot of prompts ask for educational, explanatory, or long-form content. In that style, writers often use parenthetical asides and emphasis markers. The model imitates that rhythm.

Double dashes become a stylistic shortcut for:

“main idea -- quick aside -- back to the point”

So what looks random is often style-matching.

Is It Wrong?

Not really. It is mostly a writing preference issue.

But it does matter when you need strict editorial consistency, such as:

brand voice systems,
legal or policy writing,
SEO content pipelines,
publishing workflows with specific style guides.
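
For workflows like these, a small check can flag dash usage before text is published. The sketch below is one possible approach in Python; the function name find_dash_violations and the decision to treat any run of two or more hyphens or any em dash as a violation are assumptions for illustration, not rules taken from a particular style guide.

```python
import re

# Matches a run of two or more ASCII hyphens, or a single em dash (U+2014).
DASH_PATTERN = re.compile(r"-{2,}|\u2014")

def find_dash_violations(text):
    """Return (line_number, column, matched_text) for each dash found."""
    violations = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in DASH_PATTERN.finditer(line):
            violations.append((line_number, match.start() + 1, match.group()))
    return violations

sample = "First point -- with an aside.\nA clean line.\nAnother\u2014aside."
report = find_dash_violations(sample)
for line_number, column, found in report:
    print(f"line {line_number}, column {column}: found {found!r}")
```

Note that this simple pattern would also flag markdown horizontal rules and command-line flags such as --help, so a real pipeline would likely need to skip code blocks and similar spans.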

How to Control It Reliably

The easiest fix is an explicit instruction in the prompt.

Use constraints like:

“Avoid double dashes and em dashes.”
“Use commas or short sentences for asides.”
“Follow AP style punctuation.”

If the model still slips, add a post-processing pass that rewrites punctuation automatically.
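
A post-processing pass can be as small as one regular expression. The following Python sketch is a deliberately naive version; the name replace_aside_dashes and the default replacement of a comma plus a space are assumptions, and the right replacement depends on your style guide.

```python
import re

# Optional spaces or tabs, then a double hyphen or an em dash (U+2014),
# then optional spaces or tabs. Line breaks are left alone.
ASIDE_DASH = re.compile(r"[ \t]*(?:--|\u2014)[ \t]*")

def replace_aside_dashes(text, replacement=", "):
    """Naively rewrite dash-delimited asides using `replacement`."""
    return ASIDE_DASH.sub(replacement, text)

cleaned = replace_aside_dashes("The model slipped -- again\u2014despite the prompt.")
print(cleaned)  # The model slipped, again, despite the prompt.
```

Because it rewrites every match, this version would also mangle command-line flags and code samples that contain --, so it is best applied only to prose segments.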

Practical Prompt You Can Reuse

Try this:

“Write in a casual but professional style. Avoid double dashes and em dashes. Use short, clear sentences and commas where needed.”

That one instruction usually cleans up most dash-heavy output.
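
If you send prompts from code, the instruction can be kept in one place and prepended to every task. This is a minimal Python sketch; build_prompt and STYLE_INSTRUCTION are hypothetical names, and how the resulting string reaches the model depends on your client library, which is not shown here.

```python
# The reusable style instruction from the text above.
STYLE_INSTRUCTION = (
    "Write in a casual but professional style. "
    "Avoid double dashes and em dashes. "
    "Use short, clear sentences and commas where needed."
)

def build_prompt(task, style_instruction=STYLE_INSTRUCTION):
    """Prefix the task with the style instruction, separated by a blank line."""
    return f"{style_instruction}\n\n{task.strip()}"

prompt = build_prompt("Explain what a tokenizer does in three sentences.")
print(prompt)
```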
