This is the official code repository for the paper Invariance Makes LLM Unlearning Resilient Even to Unanticipated Downstream Fine-Tuning.
Machine unlearning presents a promising approach to mitigating privacy and safety concerns in large language models (LLMs) by enabling the selective removal of targeted data or knowledge while preserving model utility. However, existing unlearning methods remain over-sensitive to downstream fine-tuning, which can rapidly recover supposedly unlearned information even when the fine-tuning task is entirely unrelated to the unlearning objective. To enhance robustness, we introduce the concept of "invariance" into unlearning for the first time from the perspective of invariant risk minimization (IRM), a principle for environment-agnostic training. By leveraging IRM, we develop a new invariance-regularized LLM unlearning framework, termed invariant LLM unlearning (ILU). We show that the proposed invariance regularization, even using only a single fine-tuning dataset during ILU training, enables unlearning robustness to generalize effectively across diverse and new fine-tuning tasks at test time. A task vector analysis is also provided to further elucidate the rationale behind ILU's effectiveness. Extensive experiments on the WMDP benchmark, which focuses on removing an LLM's hazardous knowledge generation capabilities, reveal that ILU significantly outperforms state-of-the-art unlearning methods, including negative preference optimization (NPO) and representation misdirection for unlearning (RMU). Notably, ILU achieves superior unlearning robustness across diverse downstream fine-tuning scenarios (e.g., math, paraphrase detection, and sentiment analysis) while preserving fine-tuning performance.
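For intuition only, the sketch below shows an IRMv1-style invariance penalty of the kind the paper builds on, together with a hypothetical way to combine it with standard forget/retain unlearning losses. The function names, the dummy-scale trick, and the weighting lam are illustrative assumptions; the exact ILU objective is defined in the paper.

import torch
import torch.nn.functional as F

def irmv1_penalty(logits, labels):
    # IRMv1-style penalty: gradient of the loss w.r.t. a fixed dummy scale of 1.0.
    # A small value means the loss is stationary in that scale, i.e. the model
    # behaves consistently ("invariantly") in this environment.
    scale = torch.ones(1, device=logits.device, requires_grad=True)
    loss = F.cross_entropy(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, [scale], create_graph=True)
    return (grad ** 2).sum()

def ilu_style_loss(forget_loss, retain_loss, env_penalties, lam=1.0):
    # Hypothetical combination: a standard unlearning objective (forget + retain)
    # plus the invariance penalty averaged over training environments
    # (e.g., the forget set and a single downstream fine-tuning set).
    return forget_loss + retain_loss + lam * torch.stack(env_penalties).mean()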
You can install the required dependencies using the following commands:
conda create -n ILU python=3.9
conda activate ILU
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install datasets wandb transformers==4.37.2 sentencepiece sentence-transformers==2.6.1
pip install git+https://github.com/changsheng/fastargs
pip install terminaltables sacrebleu rouge_score matplotlib seaborn scikit-learn
cd lm-evaluation-harness
pip install -e .
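To verify the environment (a quick sanity check, not part of the original setup steps), confirm that the pinned libraries import and that PyTorch sees a GPU:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"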
-- configs/: Contains the configuration files for the experiments.
-- Different folders for different experiments (MUSE, WMDP, etc.)
-- files/:
-- data/: Contains the data files necessary for the experiments.
-- results/: The logs and results of experiments will be stored in this directory.
-- lm-evaluation-harness/: The official LLM evaluation repository from https://github.com/EleutherAI/lm-evaluation-harness.
-- src/: Contains the source code for the experiments.
-- dataset/: Contains the data processing and dataloader creation codes.
-- model/: Contains the main unlearning class, which handles model loading, unlearning, and evaluation.
-- optim/: Contains the optimizer code.
-- metrics/: Contains the evaluation code.
-- loggers/: Contains the logger code.
-- unlearn/: Contains the code for the different unlearning methods as well as the mask generation code.
-- exec/:
-- Fine_tune_hp.py: Code for fine-tuning on the Harry Potter books.
-- unlearn_model.py: The main file to run the unlearning experiments.
First, download the mask files from Google Drive and place them into the ./mask/ directory. You can download the mask files from here. The mask file names should follow the pattern {task_name}_{ratio}.pt. For example, tofu_0.8.pt is the mask file for the TOFU task, in which 80% of the weights are selected for unlearning by the WAGLE method.
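As a quick sanity check on a downloaded mask, the snippet below loads it with torch.load. It assumes, hypothetically, that the file is a dictionary mapping parameter names to 0/1 tensors marking the weights selected for unlearning; adapt it if the actual mask format differs.

import torch

mask = torch.load("mask/tofu_0.8.pt", map_location="cpu")  # example file name from above
# Assumption: {parameter name -> 0/1 tensor of selected weights}
selected = sum(m.sum().item() for m in mask.values())
total = sum(m.numel() for m in mask.values())
print(f"selected fraction of weights: {selected / total:.2%}")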
After downloading the mask files, run the following command to launch the experiments:
python src/exec/unlearn_model.py --config_file configs/{unlearn_task}/{unlearn_method}.json --unlearn.mask_path mask/{unlearn_task}_{ratio}.pt {other_args}
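For example, a WMDP run could look like the line below; the config file name (npo.json), the task folder, and the mask ratio are placeholders, so substitute the files that actually exist under configs/ and mask/ in your checkout:
python src/exec/unlearn_model.py --config_file configs/wmdp/npo.json --unlearn.mask_path mask/wmdp_0.8.pt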