In the era of large language models, prompt size is power — but also a big cost.
The more context you provide, the more tokens you consume. And when working with long, structured prompts or repetitive query templates, that cost can escalate quickly.
TokenSpan isn’t a compression library; it’s a thought experiment, a different way of thinking about prompt optimization.
Can we reduce token usage by substituting repeated phrases with lightweight aliases?
Can we borrow ideas from dictionary encoding to constrain and compress the language we use to communicate with models?
This project explores those questions — not by building a full encoding system, but by probing whether such a technique might be useful, measurable, and worth pursuing.

A crucial insight behind TokenSpan is recognizing where the real cost lies:
We pay for tokens, not computation.
So why not reduce the tokens we send, and let the model handle the substitution?
LLMs easily understand that §a means "Microsoft Designer", and since billing is per token rather than per unit of reasoning, that mental mapping comes at no extra cost.
Dictionary: §a → Microsoft Designer
Rewritten Prompt: How does §a compare to Canva?
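A minimal sketch of the substitution step, assuming a plain Python dict that maps aliases to phrases; the function name and structure are illustrative, not part of any existing library:

```python
# Illustrative sketch: apply an alias dictionary to a prompt.
# The dictionary maps alias -> original phrase; replacement runs the other way.

def compress_prompt(prompt: str, dictionary: dict[str, str]) -> str:
    """Replace every dictionary phrase in the prompt with its alias."""
    # Replace longer phrases first so a short phrase never clobbers part of a longer one.
    for alias, phrase in sorted(dictionary.items(), key=lambda kv: -len(kv[1])):
        prompt = prompt.replace(phrase, alias)
    return prompt

dictionary = {"§a": "Microsoft Designer"}
print(compress_prompt("How does Microsoft Designer compare to Canva?", dictionary))
# -> "How does §a compare to Canva?"
```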

If you were to build a system around this idea, the best strategy wouldn't be to re-send the dictionary with every prompt. Instead:
- Build the dictionary once
- Embed it in the system prompt or long-term memory
- Reuse it across multiple interactions
This only makes sense when dealing with large or repetitive prompts, where the cost of setting up the dictionary is outweighed by the long-term savings.
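A sketch of what that setup might look like, using the common chat-message structure purely as an assumption; nothing here is an actual TokenSpan API, and the instruction wording is illustrative:

```python
# Sketch: build the dictionary into a system prompt once, then reuse it across turns.
# The message structure mirrors the widely used chat-completion format; adapt as needed.

dictionary = {"§a": "Microsoft Designer", "§b": "graphic design"}

dictionary_block = "\n".join(f"{alias} → {phrase}" for alias, phrase in dictionary.items())

system_prompt = (
    "Aliases used in this conversation (expand them mentally, never echo them back):\n"
    + dictionary_block
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How does §a compare to Canva for §b?"},
]
```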
By encouraging simpler, more structured language, your application can:
- Reduce costs
- Improve consistency
- Handle diverse user inputs more efficiently
After all, we’re often asking the same things — just in different ways.

What if we replaced a 2-token phrase like "Microsoft Designer" with an alias like §a?
Assume the phrase appears X times:
- Original Cost: 2 × X tokens
- Compressed Cost: X (alias usage) + 4 (dictionary overhead)
Savings Formula:
Saved = (2 × X) - (X + 4)
Example: "Microsoft Designer" appears 15 times.
Saved = (2 × 15) - (15 + 4) = 30 - 19 = 11 tokens saved
That’s just one phrase — real prompts often contain dozens of reusable patterns.
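The same arithmetic as a small helper, with defaults encoding the assumptions above (2-token phrase, 1-token alias, 1-token separator); this just restates the formula, it is not a measured result:

```python
def tokens_saved(occurrences: int, phrase_tokens: int = 2,
                 alias_tokens: int = 1, separator_tokens: int = 1) -> int:
    """Net tokens saved by aliasing a phrase that appears `occurrences` times."""
    saved_per_use = phrase_tokens - alias_tokens                           # 1 token per use for a 2-token phrase
    dictionary_overhead = alias_tokens + separator_tokens + phrase_tokens  # 4 tokens per entry
    return saved_per_use * occurrences - dictionary_overhead

print(tokens_saved(15))  # 11, matching the "Microsoft Designer" example above
```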

This experiment targets two-token phrases for a reason:
- Single tokens can’t be compressed: a one-token alias just replaces a one-token phrase, saving nothing
- Longer phrases save more per occurrence, but occur less often
- Two-token phrases hit the sweet spot: frequent and compressible
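One way to hunt for such phrases is to count adjacent word pairs and keep those that a tokenizer encodes as exactly two tokens. The sketch below assumes the tiktoken library and its cl100k_base encoding; in practice the tokenizer should match the target model, and the pairing heuristic is deliberately crude:

```python
import re
from collections import Counter

import tiktoken  # assumption: cl100k_base as a stand-in for the target model's tokenizer

enc = tiktoken.get_encoding("cl100k_base")

def two_token_candidates(text: str, min_occurrences: int = 5) -> dict[str, int]:
    """Return adjacent word pairs that encode to exactly two tokens and repeat often."""
    words = re.findall(r"\S+", text)
    pairs = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    return {
        phrase: count
        for phrase, count in pairs.items()
        # Leading space mimics how the phrase is tokenized mid-sentence.
        if count >= min_occurrences and len(enc.encode(" " + phrase)) == 2
    }
```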

Each dictionary entry adds 4 tokens:
- 1 token for the replacement code (e.g. §a)
- 1 token for the separator (e.g. →)
- 2 tokens for the original phrase
Since a 2-token phrase saves 1 token per occurrence against that 4-token overhead, you only start saving once a phrase appears 5 or more times.
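The 4-token overhead is an estimate rather than a guarantee; the exact cost of an entry depends on the tokenizer, so it is worth measuring. A quick check, again assuming tiktoken as a stand-in:

```python
import tiktoken  # assumption: cl100k_base as a stand-in for the target model's tokenizer

enc = tiktoken.get_encoding("cl100k_base")

entry = "§a → Microsoft Designer"
print(len(enc.encode(entry)))   # actual per-entry overhead for this tokenizer
print(len(enc.encode("§a")))    # aliases built from unusual symbols may cost more than 1 token
```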

Using a raw prompt of 8,019 tokens:
- After substitution: 7,138 tokens
- Savings: 881 tokens (~11.0%)
The model continued performing correctly with the encoded prompt.

Natural language gives users the freedom to communicate in flexible, intuitive ways.
But that freedom comes at a cost:
- Repetition
- Inaccuracy from phrasing variations
- Higher usage costs
If applications constrained the vocabulary used in most interactions, they could:
- Lower token usage
- Encourage more structured prompts
- Improve response consistency

Here are some interesting quirks noticed during development:
- Common phrases = fewer tokens: "the" often becomes a single token.
- Capitalization can split words: "Designer" vs. "designer" are treated differently by the tokenizer.
- Rare words get chopped up: "visioneering" might tokenize into "vision" + "eering".
- Numbers don’t tokenize nicely: "123456" can break into "123" + "456".
- Digits as aliases are risky: using "0" or "1" as shortcuts often backfires; symbols like § or @ are safer.



TokenSpan is a thought experiment in prompt optimization.
The savings are measurable, but the greater value lies in rethinking how we balance cost, compression, and communication with LLMs.