Distillation is a model compression technique where a smaller 'student' model is trained to replicate the behavior and knowledge of a larger, more capable 'teacher' model. The goal is to retain most of the teacher's performance while dramatically reducing computational requirements.
Knowledge distillation was introduced as a way to transfer the learned capabilities of large, expensive models into smaller, more efficient ones. Instead of training the student model from scratch on raw data, it learns from the teacher model's output distributions, which contain richer information than hard labels alone.
The key insight behind distillation is that a teacher model's 'soft' predictions carry valuable information about relationships between classes and concepts. For example, when a large language model predicts the next token, its probability distribution across the entire vocabulary reveals nuanced understanding that a smaller model can learn from.
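To make this concrete, here is a minimal sketch contrasting a teacher's soft distribution with a one-hot hard label. The toy vocabulary and logit values are purely illustrative assumptions, not output from any real model.

```python
# Illustrative values only: the teacher's soft distribution assigns meaningful
# probability to related tokens, while the hard label discards that information.
import torch
import torch.nn.functional as F

vocab = ["cat", "kitten", "dog", "car"]               # toy vocabulary (assumption)
teacher_logits = torch.tensor([4.0, 3.2, 1.5, -2.0])  # hypothetical teacher scores

soft_targets = F.softmax(teacher_logits, dim=-1)
hard_label = F.one_hot(torch.tensor(0), num_classes=len(vocab)).float()

for tok, s, h in zip(vocab, soft_targets, hard_label):
    print(f"{tok:>6}: soft={s.item():.3f}  hard={h.item():.0f}")
# The soft targets show that "kitten" is nearly as plausible as "cat",
# a relationship the one-hot label cannot express.
```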
In the context of LLMs, distillation has become essential for deploying powerful AI capabilities in resource-constrained environments. Companies often distill massive models with hundreds of billions of parameters into models that are 10 to 100 times smaller, making them suitable for edge devices, mobile applications, or cost-effective API serving.
Modern distillation techniques go beyond simple output matching. They can transfer intermediate layer representations, attention patterns, and reasoning chains from teacher to student, allowing the smaller model to develop similar internal processing strategies.
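One common way to transfer intermediate representations is to add a feature-matching term to the student's loss. The sketch below shows one simple variant under assumed shapes: the student's hidden states are projected up to the teacher's width and pulled toward the teacher's activations with an MSE loss. The layer pairing, dimensions, and projection are illustrative choices, not a fixed recipe.

```python
# Minimal sketch of intermediate-representation transfer via feature matching.
import torch
import torch.nn as nn

teacher_hidden = torch.randn(8, 128, 1024)  # (batch, seq_len, teacher_dim), stand-in activations
student_hidden = torch.randn(8, 128, 256)   # (batch, seq_len, student_dim), stand-in activations

# Learned projection so the student's narrower hidden states can be compared
# with the teacher's; the same idea extends to attention maps or other signals.
project = nn.Linear(256, 1024)
feature_loss = nn.functional.mse_loss(project(student_hidden), teacher_hidden)
```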
The distillation process typically unfolds in four stages. First, a large, high-performing model is either trained from scratch or selected from existing models. This teacher model serves as the source of knowledge to be transferred.
Next, the teacher model processes the training data and produces probability distributions (soft targets) over possible outputs, rather than just hard labels. A temperature parameter is often used to soften these distributions further.
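The effect of the temperature parameter can be seen in a few lines: dividing the logits by a temperature T greater than 1 flattens the distribution so that low-probability outputs still contribute learning signal. The logit values below are illustrative.

```python
# Sketch of temperature softening: higher T yields a flatter distribution.
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 2.0, 1.0, -1.0])  # illustrative logits

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(teacher_logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```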
The smaller student model is then trained using a combination of the soft targets from the teacher and the original hard labels. The loss function typically blends a distillation loss (matching the teacher's outputs) with a standard task loss.
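A minimal sketch of that blended objective, in the style of the classic Hinton formulation, is shown below. It assumes logits from both models and integer hard labels; the temperature T and mixing weight alpha are tunable hyperparameters, and the shapes are toy values.

```python
# Sketch of a blended distillation objective (KL term + cross-entropy term).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Distillation term: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard task term: cross-entropy against the original hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example: a batch of 4 examples over a 10-way output space.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```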
Finally, the student model is benchmarked against the teacher on relevant tasks to confirm acceptable performance. Once validated, the compact student model is deployed, offering faster inference at lower cost.
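In code, that validation step often amounts to scoring both models on the same held-out set and gating deployment on a relative threshold. The models, data, metric, and the 95% acceptance criterion in this sketch are all stand-in assumptions.

```python
# Sketch of the validation step: compare student and teacher on one benchmark.
import torch

def accuracy(model, inputs, labels):
    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)
    return (preds == labels).float().mean().item()

# Stand-in models and evaluation data purely for illustration.
teacher = torch.nn.Linear(32, 10)
student = torch.nn.Linear(32, 10)
inputs, labels = torch.randn(256, 32), torch.randint(0, 10, (256,))

teacher_acc = accuracy(teacher, inputs, labels)
student_acc = accuracy(student, inputs, labels)
ship_student = student_acc >= 0.95 * teacher_acc  # example acceptance criterion
```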
Consider a few representative scenarios. A company distills a 70-billion-parameter language model into a 7-billion-parameter version that runs on smartphones. The smaller model retains 90% of the original's conversational ability while using a fraction of the memory and compute.
An AI startup distills its flagship model to create a faster, cheaper variant for high-volume, simpler queries. Customer support chatbot requests are routed to the distilled model, reducing inference costs by 80% while maintaining quality.
A healthcare company distills a general-purpose LLM into a compact model specialized for medical coding. The student model is trained on the teacher's outputs for medical texts, producing a small but highly accurate domain expert.
Distillation is critical for making AI accessible and cost-effective. Without it, only organizations with massive compute budgets could deploy state-of-the-art models. Distillation democratizes AI by enabling smaller, faster models that can run in production environments with real-world latency and cost constraints.
When deploying distilled models, monitoring is essential to detect quality degradation compared to the teacher model. Respan provides real-time observability into distilled model outputs, allowing teams to track accuracy drift, compare student vs. teacher performance metrics, and set alerts when the distilled model's quality drops below acceptable thresholds.
Try Respan free