# Token Benchmarks

[한국어](README.ko.md)

This directory contains prompt fixtures and benchmark outputs for measuring Korean-vs-English input-token behavior with Ollama.

The benchmark script calls Ollama directly. It does not test the PolyHarness FastAPI proxy path.
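For orientation, a direct Ollama call might look roughly like the sketch below. It is not taken from `ollama_token_benchmark.py`: the use of the `/api/generate` endpoint, the default local address, and the helper names `build_request`, `extract_counts`, and `call_ollama` are all assumptions made for illustration. The `prompt_eval_count` and `eval_count` fields are the token counters Ollama reports in a non-streaming response.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> dict:
    """Build a non-streaming /api/generate payload (hypothetical helper)."""
    return {"model": model, "prompt": prompt, "stream": False}


def extract_counts(response: dict) -> dict:
    """Pull the token counters out of a decoded Ollama response body.

    A counter that is absent from the response is returned as None
    rather than guessed.
    """
    return {
        "prompt_eval_count": response.get("prompt_eval_count"),
        "eval_count": response.get("eval_count"),
    }


def call_ollama(model: str, prompt: str, timeout: float = 600.0) -> dict:
    """Send one prompt straight to Ollama and return its token counters."""
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_counts(json.loads(resp.read().decode("utf-8")))
```

Because the request goes straight to Ollama, nothing in this path exercises the PolyHarness proxy.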

## Files

- `long-chat-ko.txt`: Korean long-prompt fixture.
- `long-chat-en.txt`: English long-prompt fixture with equivalent meaning.
- `results/latest/raw-results.jsonl`: every Ollama call from the latest recorded run.
- `results/latest/paired-results.csv`: paired Korean/English comparisons.
- `results/latest/summary.md`: human-readable aggregate summary.

## Default Run

```bash
.venv/bin/python benchmarks/ollama_token_benchmark.py \
  --model gemma4:26b-a4b-it-q4_K_M \
  --repeats 1 3 6 \
  --out-dir docs/benchmarks/results/latest
```

Default run size:

```text
20 prompt pairs x 3 repeat levels x 2 languages = 120 Ollama calls
20 prompt pairs x 3 repeat levels = 60 paired comparisons
```

The synthetic dataset covers daily chat, travel, work, education, business, finance, family, community, career, and lifestyle scenarios.

## Larger Runs

Use higher repeat values to simulate accumulated chat context:

```bash
.venv/bin/python benchmarks/ollama_token_benchmark.py \
  --model gemma4:26b-a4b-it-q4_K_M \
  --repeats 1 3 6 10 20 \
  --timeout 1800 \
  --out-dir docs/benchmarks/results/large
```

Larger run size:

```text
20 prompt pairs x 5 repeat levels x 2 languages = 200 Ollama calls
20 prompt pairs x 5 repeat levels = 100 paired comparisons
```
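The call counts for the default and larger runs follow from the same arithmetic. The sketch below reproduces it; `run_size` is a hypothetical helper and not part of the benchmark script.

```python
from __future__ import annotations


def run_size(prompt_pairs: int, repeat_levels: list[int], languages: int = 2) -> tuple[int, int]:
    """Return (ollama_calls, paired_comparisons) for a benchmark run."""
    # One paired comparison per prompt pair per repeat level.
    paired = prompt_pairs * len(repeat_levels)
    # Each comparison needs one call per language.
    calls = paired * languages
    return calls, paired


print(run_size(20, [1, 3, 6]))          # (120, 60)
print(run_size(20, [1, 3, 6, 10, 20]))  # (200, 100)
```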

## Primary Metric

```text
1 - (english_prompt_eval_count / korean_prompt_eval_count)
```

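A positive value means the English prompt consumed fewer input tokens than its Korean counterpart; a negative value means it consumed more.

Total-token reduction is also recorded, but it is less stable because output length can vary across runs.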
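As a minimal sketch, the primary metric can be computed from one paired result as follows. The function name `input_token_reduction` is hypothetical, and the counts in the example are made-up illustrative values, not benchmark results.

```python
def input_token_reduction(english_prompt_eval_count: int, korean_prompt_eval_count: int) -> float:
    """Compute 1 - (english_prompt_eval_count / korean_prompt_eval_count)."""
    if korean_prompt_eval_count <= 0:
        raise ValueError("korean_prompt_eval_count must be positive")
    return 1 - (english_prompt_eval_count / korean_prompt_eval_count)


# Illustrative numbers only: 750 English tokens vs 1000 Korean tokens.
print(input_token_reduction(750, 1000))  # 0.25
```

The guard against a non-positive Korean count is a design choice for this sketch; it avoids a division by zero if a response did not report a usable `prompt_eval_count`.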