Hessian of Perplexity for Large Language Models by PyTorch autograd

Open-source tool to compute the Hessian of the Perplexity function for Large Language Models (LLMs) using PyTorch autograd
Technical report: arXiv:2504.04520


📖 Overview

This repository provides an accurate and efficient implementation for computing the Hessian of the Perplexity function in LLMs such as OPT-125M using PyTorch's native autograd engine. Results include full Hessian matrices and their diagonals across different layers and configurations.
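As a rough illustration of the approach (a minimal sketch, not the repository's code: the prompt, the OPT parameter path, and the combination of torch.func.functional_call with torch.autograd.functional.hessian are choices made here for brevity), the causal-LM loss, i.e. the log-perplexity, can be treated as a scalar function of a small parameter subset and differentiated twice with autograd:

```python
# Minimal sketch (not the repository's code): Hessian of the causal-LM loss
# (log-perplexity) w.r.t. the first t entries of one q_proj weight matrix.
import torch
from torch.func import functional_call
from torch.autograd.functional import hessian
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-125m"                      # small model, as used in the report
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, attn_implementation="eager"           # eager attention keeps double backward simple
).eval()

batch = tok("Hessians of perplexity are expensive to compute.", return_tensors="pt")
pname = "model.decoder.layers.0.self_attn.q_proj.weight"   # block 0, q_proj (OPT layout)
base = dict(model.named_parameters())[pname].detach().clone()
t = 5                                            # first t parameters only

def nll(theta):
    # Splice the t free parameters into the frozen weight, run the model
    # functionally, and return the average negative log-likelihood.
    w = torch.cat([theta, base.flatten()[t:]]).view_as(base)
    out = functional_call(model, {pname: w},
                          args=(), kwargs={**batch, "labels": batch["input_ids"]})
    return out.loss

theta0 = base.flatten()[:t].clone()
H = hessian(nll, theta0)                         # t x t Hessian via double backward
print(H)
```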

📚 Citation

If you find our work helpful, please cite us:

@article{ilin2025hessian,
  title={Hessian of Perplexity for Large Language Models by PyTorch autograd (Open Source)},
  author={Ilin, Ivan},
  journal={arXiv preprint arXiv:2504.04520},
  year={2025}
}

✅ Model Compatibility

This repository works with models exposed through the Hugging Face Transformers interface; the examples below use facebook/opt-125m and meta-llama/Llama-3.2-1B.


📊 Results

📌 Full Hessian (Heatmaps for different subsets of parameters)

[Figures: Hessian of q_proj | Hessian of all layers]

Left: Hessian for $Q_{proj} \in \mathbb{R}^{768 \times 768}$ from block 0 — first 768 params.
Right: Hessian for all 6 linear layers in block 0 — 300 params each.

[Figures: Hessian of q_proj across all blocks | Hessian of all layers across all blocks]

Left: Hessian for q_proj across 12 blocks — 150 params each.
Right: Hessian for all layers in 12 blocks — 25 params each × 6 layers/block.

🔽 Download PyTorch Tensors

The Hessians shown above are available for download as saved PyTorch tensors.

🔶 Influence of Batch Size

[Figure: Hessian vs. batch size]

Experiments with varying batch size $b \in \{1, ..., 140\}$.

⭐ Diagonals of the Hessian (entire linear layer)

[Figure: Hessian diagonal estimates vs. number of VHP samples]

Varying number of VHP samples $k \in \{1, ..., 3000\}$ for diagonal estimation.
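The diagonal can be estimated from vector-Hessian products alone, e.g. with a Hutchinson-style estimator $\mathrm{diag}(H) \approx \frac{1}{k} \sum_{i=1}^{k} v_i \odot (H v_i)$ using Rademacher vectors $v_i$ (whether this is exactly the estimator used here is an assumption). A self-contained sketch of the pattern with torch.autograd.functional.vhp, using a toy objective in place of the log-perplexity:

```python
# Hutchinson-style diagonal estimate from VHP samples (toy objective).
import torch
from torch.autograd.functional import vhp, hessian

def f(theta):
    # Toy scalar objective standing in for the log-perplexity of an LLM.
    A = torch.arange(1.0, 26.0).reshape(5, 5)
    A = A @ A.T                                   # symmetric, so the Hessian equals A
    return 0.5 * theta @ A @ theta

theta0 = torch.randn(5)
k = 2000                                          # number of VHP samples (--vhp_samples)

diag_est = torch.zeros_like(theta0)
for _ in range(k):
    v = (torch.randint(0, 2, theta0.shape) * 2 - 1).to(theta0.dtype)  # Rademacher +/-1
    _, hv = vhp(f, theta0, v)                     # v^T H, equal to (H v)^T for symmetric H
    diag_est += v * hv                            # E[v * (H v)] = diag(H)
diag_est /= k

print(diag_est)
print(torch.diagonal(hessian(f, theta0)))         # exact diagonal, for comparison
```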


⚙️ Setup

pip install -r requirements.txt

📦 A full Python 3.12.4 installation guide is also available.


🚀 Parameters

| Argument | Description |
|----------|-------------|
| --model | Hugging Face model identifier. |
| --layer_name | Name of the linear layer to evaluate. |
| --t | Number of parameters to consider per layer. |
| --block_index | Index of a single block (used in some scripts). |
| --num_blocks | Number of blocks to include. |
| --num_layers | Number of linear layers per block. |
| --b | Total number of samples for perplexity. |
| --vhp_samples | Number of VHP samples for Hessian diagonal estimation. |
| --seqlen | Sequence length for the model. Default: 2048. |
| --model_input_bs | Number of samples per batch. |
| --cache_dir | Where to load/store model weights. Default: llm_weights. |
| --seed | Random seed. |

💡 Tips:

  • On GPUs with more memory, use a larger --model_input_bs or --seqlen to reduce runtime.
  • A larger token budget (--b $\cdot$ --seqlen) and more --vhp_samples give more accurate results but increase compute time.

🔬 Running your experiments

Note

After running any of the scripts, a .pt Hessian tensor and a .pdf heatmap of the Hessian are saved in the /data folder; see the loading sketch after the example runs below.

🔹 Single Layer from One Block

python src/single_layer_single_block.py \
    --model meta-llama/Llama-3.2-1B \
    --layer_name self_attn.q_proj \
    --block_index 0 \
    --t 5 \
    --b 30 \
    --model_input_bs 1 \
    --seed 0 \
    --cache_dir llm_weights

🔹 Single Layer from Several Blocks

python src/single_layer_several_blocks.py \
    --model meta-llama/Llama-3.2-1B \
    --layer_name self_attn.q_proj \
    --t 5 \
    --num_blocks 3 \
    --b 30 \
    --model_input_bs 1 \
    --seed 0 \
    --cache_dir llm_weights

🔹 Several Layers from Several Blocks

python src/several_layers_several_blocks.py \
    --model meta-llama/Llama-3.2-1B \
    --t 5 \
    --num_layers 3 \
    --num_blocks 3 \
    --b 30 \
    --model_input_bs 1 \
    --seed 0 \
    --cache_dir llm_weights

🔹 Compute only Diagonal Elements (full layer)

python src/hessian_diag_single_layer.py \
    --model meta-llama/Llama-3.2-1B \
    --layer_name self_attn.q_proj \
    --vhp_samples 10 \
    --block_index 0 \
    --b 30 \
    --model_input_bs 1 \
    --seed 0 \
    --cache_dir llm_weights
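Once a run has finished, the saved tensor can be loaded for further analysis. A minimal sketch; the exact filename under /data depends on the script and its arguments, so the path below is only a placeholder:

```python
import torch

# Placeholder path: the actual filename in /data depends on the script and arguments.
H = torch.load("data/hessian.pt")
print(H.shape)
if H.ndim == 2:                               # full (symmetric) Hessian
    print(torch.linalg.eigvalsh(H)[:5])       # e.g. its five smallest eigenvalues
```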

Warning

If your computations are too slow or you run out of GPU memory, try facebook/opt-125m for the --model parameter instead of a larger Llama model.

Note

If you want to consider a custom subset of parameters (for example, a random subset of $t$ parameters), modify the custom_forward(self, inpt) method, which defines how the selected subset of parameters is assembled into the full weight matrix.
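For illustration only, a hypothetical wrapper in the spirit of that method (the class name SubsetLinear and the attributes frozen_weight, subset_idx, and free_params are invented here, not the repository's): scatter the $t$ trainable entries into a frozen copy of the full weight before the usual linear forward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubsetLinear(nn.Module):
    """Hypothetical sketch: only t entries of the weight (at subset_idx) are trainable."""
    def __init__(self, linear: nn.Linear, subset_idx: torch.Tensor):
        super().__init__()
        self.register_buffer("frozen_weight", linear.weight.detach().clone())
        self.bias = linear.bias
        self.subset_idx = subset_idx
        self.free_params = nn.Parameter(self.frozen_weight.flatten()[subset_idx].clone())

    def custom_forward(self, inpt):
        # Scatter the t trainable entries into a frozen copy of the full weight
        # matrix, then apply the usual linear map.
        w = self.frozen_weight.flatten().clone()
        w[self.subset_idx] = self.free_params
        return F.linear(inpt, w.view_as(self.frozen_weight), self.bias)

    forward = custom_forward
```

A random subset could then be selected with, e.g., subset_idx = torch.randperm(linear.weight.numel())[:t].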


📄 License

MIT License. See LICENSE for details.


💬 Join the Discussion

We welcome issues, feature requests, and contributions! Feel free to open an issue or a pull request.
