Open-source tool to compute the Hessian of the Perplexity function for Large Language Models (LLMs) using PyTorch autograd
Technical Report on arXiv
This repository provides an accurate and efficient implementation for computing the Hessian of the Perplexity function in LLMs such as OPT-125M using PyTorch's native autograd engine. Results include full Hessian matrices and their diagonals across different layers and configurations.
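As a minimal illustration of the underlying mechanic (a toy sketch, not the repository's actual code), `torch.autograd.functional.hessian` returns the exact Hessian of any scalar function of a small parameter vector:

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
x = torch.randn(8, 5)                        # fixed toy "inputs"

# Toy stand-in for a perplexity-style scalar objective. In the repository the
# scalar is the perplexity of an LLM and w is a chosen subset of t parameters
# from a linear layer; here we only sketch the autograd mechanics.
def loss_fn(w):
    return torch.logsumexp(x @ w, dim=0)     # scalar output

w = torch.randn(5)                           # t = 5 parameters
H = hessian(loss_fn, w)                      # exact t x t Hessian via double backward
print(H.shape)                               # torch.Size([5, 5])
```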
If you find our work helpful, please cite us:
@article{ilin2025hessian,
  title   = {Hessian of Perplexity for Large Language Models by PyTorch autograd (Open Source)},
  author  = {Ilin, Ivan},
  journal = {arXiv preprint arXiv:2504.04520},
  year    = {2025}
}
This repository is compatible with:
- 🧠 OPT models (e.g. `facebook/opt-125m`)
- 🐑 LLaMA 2/3/4 models (e.g. `meta-llama/Llama-3.2-1B`)
- 🐣 TinyLlama (e.g. `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
These models are supported via the Hugging Face Transformers interface.
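For instance, a supported checkpoint can be loaded with the standard Transformers API (a sketch; the `cache_dir` below matches the default used by the scripts in this repository):

```python
# Load one of the supported checkpoints through Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"               # or meta-llama/Llama-3.2-1B, TinyLlama/...
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="llm_weights")
model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir="llm_weights")
model.eval()
```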
Left: Hessian for q_proj in block 0 (768 params). Right: Hessian for all 6 linear layers in block 0 (300 params each).

Left: Hessian for q_proj across 12 blocks (150 params each). Right: Hessian for all layers in 12 blocks (25 params each × 6 layers/block).
Saved as PyTorch tensors:
- hessian_q_proj_t_768.pt
- hessian_q_proj_all_blocks_t_150.pt
- hessian_all_layers_first_block_t_300.pt
- hessian_all_layers_all_blocks_t_25.pt
- hessian_diag_q_proj_vhp_samples_5000.pt
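The saved tensors can be loaded back and re-plotted with standard tools, e.g. (a sketch; the path assumes the default `/data` output folder mentioned below):

```python
import torch
import matplotlib.pyplot as plt

# Inspect a saved Hessian and render a quick heatmap.
H = torch.load("data/hessian_q_proj_t_768.pt")
print(H.shape, H.dtype)

plt.imshow(H.float().cpu().numpy(), cmap="viridis")
plt.colorbar()
plt.title("Hessian of q_proj (t = 768)")
plt.savefig("hessian_q_proj_heatmap.pdf")
```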
Experiments with varying batch size
Varying number of VHP samples
- Python version: 3.12.4 🐍
- Install dependencies: `pip install -r requirements.txt`
| Argument | Description |
|---|---|
| `--model` | Hugging Face model identifier. |
| `--layer_name` | Name of the linear layer to evaluate. |
| `--t` | Number of parameters to consider per layer. |
| `--block_index` | Index of a single block (used in some scripts). |
| `--num_blocks` | Number of blocks to include. |
| `--num_layers` | Number of linear layers per block. |
| `--b` | Total number of samples for perplexity. |
| `--vhp_samples` | Number of VHP samples for Hessian diagonal estimation. |
| `--model_input_bs` | Number of samples per batch. |
| `--seqlen` | Sequence length for the model. Default: 2048. |
| `--cache_dir` | Where to load/store weights. Default: `llm_weights`. |
| `--seed` | Random seed. |
💡 Tips:
- Use a larger `--model_input_bs` or `--seqlen` on GPUs with more memory to speed up runtime.
- A higher `--b` × `--seqlen` (the total number of tokens used for perplexity; see the sketch below) and more `--vhp_samples` give more accurate results, but increase compute time.
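For reference, the objective whose Hessian is computed is the perplexity of a causal LM, i.e. the exponentiated average cross-entropy over `--b` samples of length `--seqlen` (a sketch in terms of the Hugging Face API; not necessarily the repository's exact implementation):

```python
import torch

def perplexity(model, input_ids):
    # input_ids: (b, seqlen) tensor of token ids; HF causal LMs return the
    # mean next-token cross-entropy when labels are provided.
    loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss)
```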
> **Note:** After running any of the scripts, a `.pt` Hessian tensor and a `.pdf` heatmap of the Hessian will be saved in the `/data` folder.
python src/single_layer_single_block.py \
--model meta-llama/Llama-3.2-1B \
--layer_name self_attn.q_proj \
--block_index 0 \
--t 5 \
--b 30 \
--model_input_bs 1 \
--seed 0 \
--cache_dir llm_weights
python src/single_layer_several_blocks.py \
--model meta-llama/Llama-3.2-1B \
--layer_name self_attn.q_proj \
--t 5 \
--num_blocks 3 \
--b 30 \
--model_input_bs 1 \
--seed 0 \
--cache_dir llm_weights
python src/several_layers_several_blocks.py \
--model meta-llama/Llama-3.2-1B \
--t 5 \
--num_layers 3 \
--num_blocks 3 \
--b 30 \
--model_input_bs 1 \
--seed 0 \
--cache_dir llm_weights
python src/hessian_diag_single_layer.py \
--model meta-llama/Llama-3.2-1B \
--layer_name self_attn.q_proj \
--vhp_samples 10 \
--block_index 0 \
--b 30 \
--model_input_bs 1 \
--seed 0 \
--cache_dir llm_weights
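This script estimates the Hessian diagonal from vector-Hessian products (VHP). A Hutchinson-style estimator with random ±1 probes is one common way to do this, sketched below with `torch.autograd.functional.vhp` (the details of `hessian_diag_single_layer.py` may differ):

```python
# Hutchinson-style diagonal estimate: diag(H) ≈ mean over probes of v ⊙ (H v),
# with v a random ±1 (Rademacher) vector. Sketch only; details may differ.
import torch
from torch.autograd.functional import vhp

def estimate_hessian_diag(loss_fn, params, num_samples):
    diag = torch.zeros_like(params)
    for _ in range(num_samples):
        v = torch.randint_like(params, 2) * 2 - 1   # ±1 Rademacher probe
        _, hv = vhp(loss_fn, params, v)             # one vector-Hessian product via autograd
        diag += v * hv
    return diag / num_samples
```

Increasing `--vhp_samples` reduces the variance of this estimate, which is why more samples give a more accurate diagonal at the cost of compute time.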
> **Warning:** If your computations are too slow or you run out of GPU memory, use `facebook/opt-125m` for the `--model` parameter instead of the larger Llama models.
> **Note:** To consider a custom subset of parameters (for example, a random subset), adapt the `custom_forward(self, inpt)` method, where you define how the desired subset of parameters forms the full weight matrix (a hypothetical sketch follows below).
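A hypothetical sketch of such a method (only the name `custom_forward` comes from the repository; the attribute names `full_weight`, `subset_idx`, `subset_params`, and `bias` are placeholders):

```python
import torch

def custom_forward(self, inpt):
    # Rebuild the full weight matrix from a frozen copy plus the differentiable
    # parameter subset, then run the usual linear layer on top of it.
    weight = self.full_weight.detach().flatten().clone()    # frozen base weights
    weight[self.subset_idx] = self.subset_params             # splice in the chosen subset
    weight = weight.view_as(self.full_weight)
    return torch.nn.functional.linear(inpt, weight, self.bias)
```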
MIT License. See LICENSE for details.
We welcome issues, feature requests, and contributions! Feel free to open an issue or a pull request.