Temperature is a sampling parameter in large language models that controls the randomness of output generation. A lower temperature (closer to 0) makes the model more deterministic and focused, while a higher temperature (closer to 1 or above) increases diversity and creativity in generated text.
When a large language model generates text, it produces a probability distribution over its vocabulary for each next token. Temperature modifies this distribution before sampling occurs. Mathematically, the logits (raw model outputs) are divided by the temperature value before applying the softmax function. This simple operation has a profound effect on the shape of the probability distribution and, consequently, on the character of the generated text.
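In code, the operation is just a division followed by a softmax. The snippet below is a minimal sketch in Python with NumPy; the five-token logits vector is made up for illustration and does not come from any real model.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(x - x.max())
    return z / z.sum()

logits = np.array([3.0, 1.5, 0.5, 0.0, -1.0])  # hypothetical raw scores for 5 tokens

probs = softmax(logits / 0.7)  # temperature scaling happens before the softmax
print(probs.round(3))          # most of the probability mass sits on the first token
```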
At low temperatures (0.0-0.3), the probability distribution becomes sharply peaked around the highest-probability tokens. The model almost always selects the most likely next token, producing highly deterministic, focused, and often repetitive outputs. At temperature 0, sampling collapses into greedy decoding: the model always chooses the single most probable token. This setting is ideal for tasks where accuracy and consistency matter more than variety, such as classification, extraction, and factual question answering.
At higher temperatures (0.7-1.0), the distribution flattens, giving lower-probability tokens a meaningful chance of being selected. This introduces more variety and surprise into the output, which is valuable for creative writing, brainstorming, and exploratory tasks. However, very high temperatures (well above 1.0) push the distribution toward uniform, which often produces incoherent or nonsensical output.
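The same toy logits make the sharpening and flattening concrete. The sketch below prints the distribution at a low, a neutral, and a high temperature, then draws 10,000 samples at each setting to show how often the top token is actually chosen; the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.5, 0.5, 0.0, -1.0])  # same hypothetical 5-token scores

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for temperature in (0.2, 1.0, 2.0):
    probs = softmax(logits / temperature)
    draws = rng.choice(len(probs), size=10_000, p=probs)
    top_share = (draws == probs.argmax()).mean()
    print(f"T={temperature}: probs={probs.round(3)}, "
          f"top token sampled {top_share:.0%} of the time")
```

At the low setting the top token wins nearly every draw; at the high setting the distribution is visibly flatter and other tokens are sampled a substantial fraction of the time.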
Temperature interacts with other sampling parameters such as top-p (nucleus sampling) and top-k. In practice, teams often use temperature in combination with these parameters to fine-tune the trade-off between coherence and diversity. The optimal temperature setting depends entirely on the application: there is no universally correct value, and finding the right setting typically requires experimentation with representative examples from your specific use case.
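As a rough sketch of how the three knobs can be combined in a single sampling step (the ordering shown here, temperature first, then top-k, then top-p, is a common arrangement, but real inference engines differ in the details and operate on batched GPU tensors):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature=0.8, top_k=50, top_p=0.95, rng=None) -> int:
    """Toy sampler combining temperature, top-k, and top-p (nucleus) filtering."""
    rng = rng or np.random.default_rng()

    # Temperature: rescale logits, then softmax into a probability distribution.
    scaled = logits / temperature
    z = np.exp(scaled - scaled.max())
    probs = z / z.sum()

    # Top-k: zero out everything except the k most probable tokens, renormalize.
    if top_k < len(probs):
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    probs = filtered / filtered.sum()

    # Sample a token id from the filtered distribution.
    return int(rng.choice(len(probs), p=probs))

logits = np.array([3.0, 1.5, 0.5, 0.0, -1.0])
print(sample_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```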
Mechanically, temperature enters the decoding loop in four steps. For each position in the generated sequence, the LLM first computes a vector of raw scores (logits) representing the unnormalized likelihood of each token in its vocabulary being the next token.
Each logit value is divided by the temperature parameter. A temperature below 1 amplifies differences between logits (sharpening the distribution), while a temperature above 1 compresses differences (flattening the distribution).
The temperature-scaled logits are passed through the softmax function to produce a valid probability distribution. Lower temperature yields a spikier distribution; higher temperature yields a more uniform one.
A token is randomly sampled according to the resulting probability distribution. With low temperature, the top token is almost always selected. With high temperature, less likely tokens have a greater chance of being chosen.
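Put together, the four steps above map almost one-to-one onto code. The function below is a simplified single-sequence sketch, not how any particular inference engine implements decoding; it also treats temperature 0 as greedy decoding, which is how most APIs handle that value in practice.

```python
import numpy as np

def next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """One decoding step: logits -> temperature scaling -> softmax -> sampling."""
    rng = rng or np.random.default_rng()

    # Step 1: `logits` holds the model's raw score for every vocabulary token.
    if temperature == 0:
        # Temperature 0 degenerates into greedy decoding: always take the argmax.
        return int(logits.argmax())

    # Step 2: divide every logit by the temperature.
    scaled = logits / temperature

    # Step 3: softmax turns the scaled logits into a probability distribution.
    z = np.exp(scaled - scaled.max())
    probs = z / z.sum()

    # Step 4: randomly sample a token id according to that distribution.
    return int(rng.choice(len(probs), p=probs))
```

Calling this repeatedly with a temperature of 0.1 returns the argmax token almost every time; with a temperature of 1.5 the chosen token varies noticeably from call to call.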
A pipeline extracts product attributes from unstructured descriptions using an LLM. Temperature is set to 0 to ensure the model consistently produces the same structured output for identical inputs, maximizing reliability and making results reproducible across runs.
A marketing team uses an LLM to generate multiple tagline variations for a campaign. Temperature is set to 0.9 to maximize creative diversity, allowing the model to explore unusual word combinations and generate a wide range of distinct options for the team to evaluate.
A developer tools company sets temperature to 0.2 for its AI coding assistant. This low setting ensures the model produces syntactically correct, conventional code patterns while allowing just enough variation to avoid getting stuck in repetitive loops on complex generation tasks.
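If you call a hosted model through an OpenAI-style chat completions client, scenarios like these differ only in the temperature passed with each request. The sketch below uses the OpenAI Python SDK; the model name and prompts are placeholders, not recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_attributes(description: str) -> str:
    # Deterministic extraction: temperature 0 for consistent structured output.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,
        messages=[
            {"role": "system", "content": "Extract product attributes as JSON."},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

def brainstorm_taglines(brief: str) -> str:
    # Creative generation: high temperature for diverse tagline candidates.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.9,
        messages=[{"role": "user", "content": f"Write ten tagline ideas for: {brief}"}],
    )
    return response.choices[0].message.content
```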
Temperature is one of the most important and accessible parameters for tuning LLM behavior in production. Understanding how it affects output quality allows teams to optimize for their specific use case, whether that requires strict determinism for data pipelines or creative variety for content generation. Incorrect temperature settings are a common source of poor LLM application performance.
Respan logs the temperature and other sampling parameters for every LLM call, allowing you to correlate parameter choices with output quality metrics. Use Respan's analytics to compare response quality across different temperature settings, run A/B tests on sampling configurations, and identify the optimal parameters for each prompt template in your application.
Try Respan free