Top-k Sampling
A decoding strategy that restricts token selection to the k most probable next tokens, filtering out unlikely candidates to balance output quality with diversity.
Top-k sampling is a straightforward approach to controlling the randomness of LLM output. At each generation step, only the k tokens with the highest probabilities are considered; their probabilities are renormalized, and the model samples from this reduced set. With k=1, it reduces to greedy decoding (always picking the single most likely token). With k=50, it samples from 50 candidates, allowing varied but still plausible outputs.
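The mechanics can be sketched in a few lines of NumPy. This is an illustrative implementation, not any particular library's API; the function name and signature are assumptions.

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a token id from the k highest-probability candidates.

    `logits` is a 1-D array of unnormalized scores. Illustrative
    sketch only; not a specific framework's sampling API.
    """
    rng = rng or np.random.default_rng()
    # Indices of the k largest logits.
    top_indices = np.argsort(logits)[-k:]
    top_logits = logits[top_indices]
    # Softmax over the reduced set only, which renormalizes the
    # probability mass across the surviving k candidates.
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))

# With k=1 this reduces to greedy decoding: the argmax of the logits.
logits = np.array([2.0, 0.5, -1.0, 3.0])
assert top_k_sample(logits, k=1) == 3
```

Note that the softmax is computed after the cut, so the k survivors always sum to probability 1 regardless of how much mass the discarded tokens held.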
The limitation of top-k is its fixed candidate pool size. When the model is highly confident, k=50 might include many irrelevant tokens that dilute quality. When the model faces a genuinely ambiguous choice among many viable options, k=50 might exclude valid candidates. This inflexibility led to the development of top-p sampling as an adaptive alternative.
Despite this limitation, top-k remains useful in practice, especially in combination with other sampling strategies. A common configuration applies both top-k and top-p: top-k first reduces the pool to the k most likely candidates, then top-p keeps only the smallest subset of those whose cumulative probability reaches p. This layered approach provides robust quality control. For most production applications, a top-k between 20 and 100 combined with a top-p of 0.9 and a moderate temperature gives reliable results.
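The layered pipeline described above can be sketched as follows. Parameter names and defaults here are illustrative assumptions, not a specific library's interface.

```python
import numpy as np

def top_k_top_p_sample(logits, k=50, p=0.9, temperature=1.0, rng=None):
    """Apply top-k, then top-p (nucleus) filtering, then sample.

    A sketch of the combined configuration: temperature-scaled
    softmax, fixed-size top-k cut, adaptive top-p cut, renormalize.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Step 1: keep only the k most probable tokens, highest first.
    order = np.argsort(probs)[::-1][:k]

    # Step 2: within those, keep the smallest prefix whose
    # cumulative probability reaches p (always at least one token).
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]

    # Renormalize over the surviving candidates and sample.
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```

Either filter alone can act as the binding constraint: when the distribution is peaked, the top-p cut removes most of the k candidates; when it is flat, the top-k cut caps the pool size.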
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.