LLM 25-Day Course - Day 14: Practical Text Generation

The quality of LLM text generation is heavily influenced by the decoding strategy. Today we will experiment with the key parameters of model.generate() one at a time to get a feel for how each strategy behaves.

Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Greedy Search (default) - always selects the highest-probability token
# (pad_token_id is set explicitly because GPT-2 has no pad token of its own)
output = model.generate(input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
print("Greedy:", tokenizer.decode(output[0], skip_special_tokens=True))

# Beam Search - explores multiple candidate sequences simultaneously
output = model.generate(
    input_ids,
    max_new_tokens=50,
    num_beams=5,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
print("Beam:  ", tokenizer.decode(output[0], skip_special_tokens=True))
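To see why beam search can beat greedy decoding, here is a toy example with hand-picked (hypothetical) probabilities over a tiny vocabulary. Greedy commits to the locally best first token and misses the sequence with the higher overall probability, which even a 2-beam search finds:

```python
import math

# Hypothetical next-token log-probabilities, keyed by the previous token
log_probs = {
    "<s>": {"A": math.log(0.6), "B": math.log(0.4)},
    "A": {"x": math.log(0.55), "y": math.log(0.45)},
    "B": {"x": math.log(0.9), "y": math.log(0.1)},
}

# Greedy: take the single best token at each step
first = max(log_probs["<s>"], key=log_probs["<s>"].get)  # "A" (p=0.6)
greedy_seq = [first, max(log_probs[first], key=log_probs[first].get)]
# P(greedy) = 0.6 * 0.55 = 0.33

# Beam search with num_beams=2: keep the two best partial sequences,
# expand each, then pick the candidate with the highest total log-prob
beams = sorted(log_probs["<s>"].items(), key=lambda kv: -kv[1])[:2]
candidates = [
    ([tok, nxt], lp + lp2)
    for tok, lp in beams
    for nxt, lp2 in log_probs[tok].items()
]
beam_seq, beam_lp = max(candidates, key=lambda c: c[1])
# P(beam) = 0.4 * 0.9 = 0.36, higher than the greedy sequence
print("greedy:", greedy_seq, "beam:", beam_seq)
```

Real beam search also length-normalizes scores and prunes at every step, but the core idea is the same: delaying the commitment lets a weaker first token lead to a stronger sequence.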

Comparing temperature, top_k, and top_p

temperature controls the sharpness of the probability distribution: lower values produce more deterministic (conservative) output, while higher values produce more diverse (creative) output. top_k keeps only the k highest-probability tokens as candidates, and top_p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p.
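The effect of temperature is easy to see in isolation: sampling divides the logits by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. A minimal sketch with made-up logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # hypothetical next-token logits

for t in (0.3, 1.0, 1.5):
    # temperature scaling: divide logits by t, then softmax
    probs = torch.softmax(logits / t, dim=-1)
    print(f"temperature={t}: top prob = {probs.max():.3f}")
```

At T=0.3 nearly all the mass concentrates on the top token (close to greedy); at T=1.5 the distribution spreads out, so rarer tokens get sampled more often.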

# Sampling enabled + parameter comparison (top_k=0 and top_p=1.0 leave those filters off)
configs = [
    {"temperature": 0.3, "top_p": 1.0, "top_k": 0, "label": "Low temperature (conservative)"},
    {"temperature": 1.0, "top_p": 1.0, "top_k": 0, "label": "Default temperature"},
    {"temperature": 1.5, "top_p": 1.0, "top_k": 0, "label": "High temperature (creative)"},
    {"temperature": 1.0, "top_p": 0.9, "top_k": 0, "label": "top_p=0.9 (nucleus)"},
    {"temperature": 1.0, "top_p": 1.0, "top_k": 50, "label": "top_k=50"},
]

for cfg in configs:
    output = model.generate(
        input_ids,
        max_new_tokens=40,
        do_sample=True,
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
        top_k=cfg["top_k"],
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"[{cfg['label']}]\n{text}\n")
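For intuition, the top_p filter can be sketched by hand: sort the probabilities in descending order, take the cumulative sum, and keep the smallest prefix that reaches p (the numbers below are made up):

```python
import torch

probs = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])  # hypothetical, sorted descending
top_p = 0.9

cum = torch.cumsum(probs, dim=-1)       # [0.50, 0.70, 0.85, 0.95, 1.00]
cutoff = int((cum < top_p).sum()) + 1   # smallest prefix whose cumulative prob >= top_p
kept = probs[:cutoff] / probs[:cutoff].sum()  # renormalize before sampling
print(cutoff, kept.tolist())
```

Here four of the five tokens survive. Unlike top_k, the number of kept tokens adapts to the distribution: a confident model keeps very few candidates, an uncertain one keeps many.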

Repetition Prevention and Advanced Parameters

LLMs tend to repeat the same phrases. You can suppress this with repetition_penalty and no_repeat_ngram_size.
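Under the hood, repetition_penalty rescales the logits of tokens that already appear in the generated text. A minimal sketch of the CTRL-style rule that (to the best of my knowledge) transformers applies: positive logits are divided by the penalty, negative ones multiplied by it, so repeated tokens are always made less likely.

```python
import torch

scores = torch.tensor([2.0, -1.0, 0.5])  # hypothetical logits over a 3-token vocab
penalty = 1.2
generated = [0, 1]                       # token ids already present in the output

for tok in generated:
    s = scores[tok].item()
    # shrink positive logits, push negative logits further down
    scores[tok] = s / penalty if s > 0 else s * penalty
print(scores.tolist())
```

Token 2 never appeared, so its logit is untouched; the other two are penalized regardless of sign.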

output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
    top_k=50,
    repetition_penalty=1.2,       # 1.0 means no penalty, ~1.2 is usually appropriate
    no_repeat_ngram_size=3,        # Prevents 3-gram repetition
    num_return_sequences=2,        # Generate multiple results
    pad_token_id=tokenizer.eos_token_id,
)

for i, seq in enumerate(output):
    print(f"--- Generated Result {i+1} ---")
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print()

Generation Strategy Summary: Greedy is fast but monotonous, Beam Search produces higher quality but is slower, and the Sampling + top_p combination is the most commonly used. In practice, start with temperature=0.7, top_p=0.9, repetition_penalty=1.1 and adjust from there.
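These starting values can be bundled into a reusable GenerationConfig so that every generate() call shares them — a sketch, assuming a recent transformers version:

```python
from transformers import GenerationConfig

# Practice-ready starting point from the summary above
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=100,
)

# Usage (assuming the model/tokenizer loaded earlier):
# output = model.generate(input_ids, generation_config=gen_config)
print(gen_config.temperature, gen_config.top_p)
```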

Today’s Exercises

  1. For the same prompt, change the temperature to 0.1, 0.5, 1.0, 1.5, and 2.0, compare the generated results, and document the patterns you observe.
  2. Change repetition_penalty to 1.0, 1.2, 1.5, and 2.0 and observe the repetition suppression effect. What problems arise when it is set too high?
  3. Apply Beam Search with num_beams=5 and Nucleus Sampling with do_sample=True, top_p=0.9 to the same prompt, then compare the consistency and diversity of the results.
