Evaluation of structured width pruning in GLU-MLP layers using expansion ratio modification.
Focus: Impact of reducing the MLP expansion ratio on model capabilities.
Preprint: Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Models evaluated:
- Llama-3.2-1B
- Llama-3.2-3B
- Llama-3.2-1B-Instruct
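Width pruning here means removing whole neurons from the MLP's intermediate (expansion) dimension, which shrinks the rows of gate_proj/up_proj and the matching columns of down_proj together. A minimal sketch of that step, assuming a Hugging Face LlamaMLP-style module (not the repository's exact code; `keep_idx` would come from an importance metric such as the MAW method validated below, and the model config's `intermediate_size` must be updated to match):

```python
import torch
import torch.nn as nn

def prune_glu_mlp(mlp: nn.Module, keep_idx: torch.Tensor) -> None:
    """Shrink the intermediate (expansion) dimension of a GLU MLP to `keep_idx` neurons.

    gate_proj, up_proj: weight shape (intermediate, hidden) -> keep rows
    down_proj:          weight shape (hidden, intermediate) -> keep columns
    """
    new_dim = keep_idx.numel()
    hidden = mlp.gate_proj.in_features

    def _linear(in_f: int, out_f: int, weight: torch.Tensor) -> nn.Linear:
        layer = nn.Linear(in_f, out_f, bias=False)
        layer.weight = nn.Parameter(weight.clone())  # keeps dtype/device of the slice
        return layer

    mlp.gate_proj = _linear(hidden, new_dim, mlp.gate_proj.weight[keep_idx, :])
    mlp.up_proj   = _linear(hidden, new_dim, mlp.up_proj.weight[keep_idx, :])
    mlp.down_proj = _linear(new_dim, hidden, mlp.down_proj.weight[:, keep_idx])
```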
This research reveals two core trade-offs introduced by structured width pruning of GLU-MLP layers.
Pruning creates a clear dichotomy between two classes of model capabilities:
- "Fragile" Capabilities (Degrade): These are tasks that rely heavily on distributed knowledge stored in the MLP layers.
  - Degradation: Performance on benchmarks like MMLU (knowledge), GSM8K (math reasoning), and perplexity metrics (WikiText, Lambada) consistently degrades as pruning intensity increases.
  - Most Fragile Task: GSM8K is catastrophically affected, with performance collapsing even at moderate pruning levels.
- "Robust" Capabilities (Improve): These are tasks that appear to rely more on core algorithmic reasoning pathways that are refined, not eroded, by pruning.
  - Improvement: Performance on benchmarks like IFEval (instruction following), MUSR (multi-step reasoning), and TruthfulQA (truthfulness) is either stable or improves significantly with pruning.
  - Peak Improvement: IFEval performance on the 1B model peaks at a +75% improvement over baseline at 30% pruning.
This trade-off suggests that pruning acts as a form of regularization, sacrificing rote knowledge for enhanced performance on tasks requiring literal instruction adherence.
Pruning introduces a second trade-off related to inference performance, creating a dilemma for deployment:
- The Win (Batch Throughput & Efficiency): For offline or batch processing, pruning is highly beneficial.
  - Throughput: Batch throughput (tokens/sec) improves with more aggressive pruning, as the smaller model size allows for faster processing.
  - Energy: Energy efficiency (Joules/token) improves significantly, with up to a ~20% reduction in energy consumption at high pruning levels.
- The Cost (Interactive Latency): For interactive, user-facing applications, pruning has a severe negative impact.
  - Latency: Time To First Token (TTFT) worsens dramatically with pruning, increasing by 50-90% at higher pruning levels.
  - The Bottleneck: This latency cost is isolated to the prefill phase; token generation speed after the first token remains almost completely unaffected.
This dilemma means that the optimal pruning level depends entirely on the deployment scenario. Models intended for batch processing can be aggressively pruned to save costs, while models for interactive chatbots must remain largely unpruned to ensure a responsive user experience.
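The two regimes can be profiled separately. A rough sketch of one way to do this with Hugging Face `generate` (an assumed setup, not the project's benchmark code): TTFT is approximated by timing prefill plus a single generated token, and decode throughput by a longer generation minus that prefill cost.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # or a pruned checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tok("Explain width pruning in one paragraph.", return_tensors="pt").to("cuda")

# TTFT proxy: prefill + one generated token.
torch.cuda.synchronize(); t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
torch.cuda.synchronize(); ttft = time.perf_counter() - t0

# Decode-throughput proxy: a longer generation, with the prefill cost subtracted.
n_tokens = 128
torch.cuda.synchronize(); t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
torch.cuda.synchronize(); total = time.perf_counter() - t0

print(f"TTFT ~ {ttft*1e3:.1f} ms, decode ~ {(n_tokens - 1) / (total - ttft):.1f} tok/s")
```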
| Benchmark | Type | Config | Rationale |
|---|---|---|---|
| WikiText-2 PPL | Perplexity | 0-shot | Fundamental language modeling capability |
| MMLU | Knowledge | 5-shot | Knowledge stored in weights (MLP-sensitive) |
| ARC-Challenge | Reasoning | 0-shot | Depth-sensitive reasoning |
| HellaSwag | Common Sense | 0-shot | Universal in pruning literature |
| WinoGrande | Common Sense | 0-shot | Standard suite (90%+ papers) |
| PIQA | Physical Reasoning | 0-shot | Universal, fundamental |
| BoolQ | QA | 0-shot | Non-monotonic behavior at high pruning |
| Lambada | Context | 0-shot | Language modeling stress test (context-dependent prediction) |
| TruthfulQA MC1 | Truthfulness | 0-shot | May improve post-pruning (single correct answer) |
| TruthfulQA MC2 | Truthfulness | 0-shot | May improve post-pruning (reduces false knowledge) |
| GSM8K | Math Reasoning | 5-shot | Extremely fragile stress test |
| IFEval | Instruction Following | 0-shot | Core instruct capability |
| MUSR | Multi-Step Reasoning | 0-shot | Complex compositional reasoning benchmark |
- Dichotomy measurement:
  - Knowledge in weights (MMLU, TruthfulQA) → MLP-sensitive
  - Algorithmic processing (Lambada) → MLP-resistant
- Critical additions from literature:
  - WikiText-2 PPL: Most common metric in the pruning literature (10+ papers)
  - WinoGrande + PIQA: Missing from the original selection; universal in 2023-2025 papers
  - TruthfulQA: Can show improvement with width pruning
Before conducting the main pruning experiments, we empirically validated three neuron importance metrics for GLU architectures:
- MAW (Maximum Absolute Weight) - Selected method ✅
- VOW (Variance of Weights) - Rejected due to high degradation
- PON (Product of Norms) - Rejected due to catastrophic degradation
Key Finding: At just 10% pruning on Llama-3.2-1B, the VOW and PON methods caused perplexity increases of over 500% on Lambada, while MAW showed acceptable degradation. This validates our architectural understanding that GLU's gating mechanism requires magnitude-aware importance metrics.
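As a rough illustration, the three candidate metrics can be computed per intermediate neuron from the paired gate/up projection weights along the lines below. This is our reading of the metric names, not the exact formulations from the notebooks (see Notebook 00 for those); `keep_indices` shows how scores translate into the neurons retained at a given pruning ratio.

```python
import torch

def maw_scores(gate_w: torch.Tensor, up_w: torch.Tensor) -> torch.Tensor:
    """MAW: Maximum Absolute Weight across each neuron's gate/up rows (selected method)."""
    paired = torch.cat([gate_w.abs(), up_w.abs()], dim=1)  # (intermediate, 2*hidden)
    return paired.max(dim=1).values

def vow_scores(gate_w: torch.Tensor, up_w: torch.Tensor) -> torch.Tensor:
    """VOW: Variance of Weights per neuron (rejected: >500% PPL increase at 10% pruning)."""
    paired = torch.cat([gate_w, up_w], dim=1)
    return paired.var(dim=1)

def pon_scores(gate_w: torch.Tensor, up_w: torch.Tensor) -> torch.Tensor:
    """PON: Product of Norms of the gate and up rows (rejected: catastrophic degradation)."""
    return gate_w.norm(dim=1) * up_w.norm(dim=1)

def keep_indices(scores: torch.Tensor, prune_ratio: float) -> torch.Tensor:
    """Indices of the neurons to keep for a given pruning ratio (feeds prune_glu_mlp above)."""
    k = int(scores.numel() * (1.0 - prune_ratio))
    return torch.topk(scores, k).indices.sort().values
```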
See Notebook 00 Neuron Selection Method for full experimental details.
The /notebooks directory contains all the Jupyter notebooks used to run the experiments and generate the analyses for this project. For a detailed breakdown of each notebook's purpose, methodology, and runtime, please see the dedicated readme file.
- Framework: EleutherAI LM Evaluation Harness, for reproducibility
- Per model: ~4.5-5 hours
- Total (3 models): ~13-15 hours
- GPU: L4/T4 level (Colab)
- RAM: ~15GB
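A sketch of how such an evaluation run can be launched through the harness's Python API. The task list and few-shot settings mirror the benchmark table above; exact task identifiers and harness versions used in the notebooks may differ.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct,dtype=float16",
    tasks=["wikitext", "arc_challenge", "hellaswag", "winogrande", "piqa",
           "boolq", "lambada_openai", "truthfulqa_mc1", "truthfulqa_mc2", "ifeval"],
    num_fewshot=0,       # MMLU and GSM8K (both 5-shot) would be run in a separate call
    batch_size="auto",
    device="cuda",
)
print(results["results"])  # per-task metric dictionary
```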
- Selective degradation: MMLU degrades moderately (-14% at 40% pruning)
- Preserved capabilities: Some metrics show resistance (BoolQ stable until 50%)
- Truthfulness metrics improve: TruthfulQA-MC2 +14% at 40% (baseline near-random, see detailed analysis)
- Non-monotonic patterns: IFEval peaks at 30% (+75%), remains elevated at 40% (+47%)
- Unexpected gains: IFEval improves substantially, MUSR shows consistent gains (+26%)
"Width pruning in GLU-MLP layers selectively reduces memorized knowledge capacity while preserving algorithmic processing capabilities, potentially improving model truthfulness."
@misc{martra2025fragileknowledgerobustinstructionfollowing,
title={Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2},
author={Pere Martra},
year={2025},
eprint={2512.22671},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.22671},
}