
[Feature] vllm inferencer and memory safe vllm inferencer #860


Merged
11 commits merged into main on Jun 19, 2024

Conversation

@wheresmyhair (Collaborator) commented on Jun 18, 2024

Description

We implement a vLLM inferencer and a memory-safe vLLM inferencer, which will benefit the online RLHF process.
MemorySafeVLLMInferencer runs lmflow/pipeline/utils/memory_safe_vllm_inference.py in a Python subprocess, since it is currently not possible to offload the model or release the GPU memory that vLLM holds from within a Python script using del, model.to('cpu'), or other approaches. (see this issue)
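For context, a minimal sketch of this pattern, assuming the child script writes its generations to a JSON file whose path is passed on the command line (the --results_path flag and the JSON handoff are illustrative, not the exact interface of memory_safe_vllm_inference.py):

import json
import subprocess
import sys

def run_memory_safe_vllm_inference(script_path: str, results_path: str) -> list:
    # Run the vLLM inference script in a child process so that all GPU memory
    # it allocates is reclaimed by the OS when the process exits.
    cmd = f"{sys.executable} {script_path} --results_path {results_path}"
    run_res = subprocess.run(cmd, shell=True, stdout=sys.stdout, stderr=sys.stderr)
    if run_res.returncode != 0:
        raise RuntimeError(f"vLLM inference subprocess failed: {run_res.returncode}")
    # The parent process only reads results back from disk; it never imports vllm.
    with open(results_path) as f:
        return json.load(f)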

Tests

MemorySafeVLLMInferencer

  1. runtime
    image

  2. test result
    image

Compatibility

  1. run_reward_modeling.sh
    rm_res

  2. run_finetune.sh
    finetune_res

  3. run_finetune_with_lora.sh
    lora_res

@research4pan (Contributor) left a comment

Supporting vLLM is important to accelerate the inference component in different algorithms. Some modifications may be needed before merging into the main branch.

requirements.txt

  • [Feature] line 8: deepspeed <= 0.14.0 to ensure backward compatibility.

src/lmflow/args.py

  • [Style] line 15: we mostly sort imported packages alphabetically. Moving this to line 16 would be better.
  • [Architecture] line 99, 318: the implication of the argument load_on_init seems confusing to users.
  • [Architecture] line 318-335: these arguments belong to the Inferencer, not the Model. They should be moved to InferencerArguments. If the model needs these arguments, they can be passed in as **kwargs.
  • [Style] line 949-1001: if these options are for vLLM only, it is better to append a vllm_ prefix. Implementing the features corresponding to those arguments is another option.
  • [Feature] line 976: better to automatically detect os.environ[CUDA_VISIBLE_DEVICES] (see the sketch after this list).
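A hedged sketch of how the last two points could be addressed together; the field names and defaults below are illustrative, not the final InferencerArguments API:

import os
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InferencerArguments:
    # vllm_-prefixed so it is clear these options only affect the vLLM backend.
    vllm_gpu_memory_utilization: float = field(default=0.95)
    vllm_tensor_parallel_size: Optional[int] = field(default=None)

    def __post_init__(self):
        # Detect devices from the environment rather than a dedicated
        # *_devices argument; fall back to a single GPU when unset.
        if self.vllm_tensor_parallel_size is None:
            visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
            self.vllm_tensor_parallel_size = len(visible.split(",")) if visible else 1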

src/lmflow/models/hf_decoder_model.py

  • [Architecture] line 377, 429, 471: add argument use_vllm, which is passed from Inferencer.

src/lmflow/models/hf_model_mixin.py

  • [Architecture] line 111: pass from Inferencer, specify this as an extra argument for __init__.
  • [Architecture] line 368-419: The indentation level is too high now; consider wrapping this part of the code in a separate function.
  • [Architecture] line 453: LLM should not be self.backend_model; it should have another variable, such as self.backend_model_for_inference, otherwise it will interfere with other usages of self.backend_model.
  • [Question] line 453: Does vllm support dynamic model change during inference?

src/lmflow/pipeline/inferencerv2.py

  • We can rename it to vllm_inferencer.py, which matches the class name. Also, v2 is vague and confusing.

src/lmflow/pipeline/utils/collections.py

  • [Style] Better rename it. The name collections is vague and confusing.
  • [Architecture] line 15: This is a util function for models; move it to src/lmflow/utils/model.py. src/lmflow/pipeline/utils/ is mainly for customized training classes such as raft_trainer.
  • [Architecture] line 28: This is a util function for datasets; move it to src/lmflow/utils/dataset.py.

src/lmflow/pipeline/utils/memory_safe_vllm_inference.py

  • [Architecture] Move it to examples/memory_safe_vllm_inference.py, or make it a special mode of the common inference, e.g. a mode that can be activated by providing the single option --use_vllm.

src/lmflow/utils/collections.py

  • [Architecture] Move the content to src/lmflow/utils/dataset.py

tests/pipeline/test_memory_safe_vllm_inferencer.py

  • [Style] line 16, 23, 34: there are absolute paths; consider uploading the dataset and using Hugging Face model names.

@wheresmyhair (Collaborator, Author) commented

Changes made; tests to be done.

requirements.txt

  • [Feature] line 8: deepspeed <= 0.14.0 to ensure backward compatibility.

src/lmflow/args.py

  • [Style] line 15: we mostly sort imported packages alphabetically. Moving this to line 16 would be better.
  • [Architecture] line 99, 318: the implication of the argument load_on_init seems confusing to users.
  • ✅ Removed, see below.
  • [Architecture] line 318-335: these arguments belong to the Inferencer, not the Model. They should be moved to InferencerArguments. If the model needs these arguments, they can be passed in as **kwargs.
  • [Style] line 949-1001: if these options are for vLLM only, it is better to append a vllm_ prefix. Implementing the features corresponding to those arguments is another option.
  • memory_safe_vllm_inference_devices removed, see below. memory_safe_vllm_inference_detokenize is a temporary arg, as it will be deprecated together with MemorySafeVLLMInferencer once vLLM solves the GPU release issue. The other args are currently used only in vLLM, but we plan to use them in other inferencers in the future.
  • [Feature] line 976: better to automatically detect os.environ[CUDA_VISIBLE_DEVICES].
  • memory_safe_vllm_inference_devices removed. After another test, we found that the subprocess inherits environment variables from the main process. See below:
CUDA_VISIBLE_DEVICES=1,2 python /vol/yizhenjia/projs/LMFlow/runs/LMFlow-devtools/subpro_test/test_cudaenv.py

In test_cudaenv.py:

import subprocess
import sys

if __name__ == "__main__":
    cmd = "python /vol/yizhenjia/projs/LMFlow/runs/LMFlow-devtools/subpro_test/cudaenv.py"

    run_res = subprocess.run(
        args=cmd,
        stdout=sys.stdout,
        stderr=sys.stderr,
        shell=True,
    )

And in cudaenv.py:

import torch
import subprocess
import time

if __name__ == "__main__":
    subprocess.run("echo $CUDA_VISIBLE_DEVICES", shell=True) # this prints '1,2'
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())
    a = torch.Tensor([1]*100000000).to('cuda:1') # and this goes to gpu 2 
    time.sleep(10)

Which results in:
image

src/lmflow/models/hf_decoder_model.py

  • [Architecture] line 377, 429, 471: add argument use_vllm, which is passed from Inferencer.
  • ✅ The vLLM inference and the original inference methods are now strictly private. Inference now uses the .inference() method as a unified entry point, and it is the pipeline that decides whether to use vLLM inference (see the sketch below).
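A simplified sketch of that unified entry point; the private method names here are illustrative:

class HFDecoderModel:
    def inference(self, inputs, use_vllm: bool = False, **kwargs):
        # Single public entry point; the pipeline decides which backend to use.
        if use_vllm:
            return self.__vllm_inference(inputs, **kwargs)
        return self.__inference(inputs, **kwargs)

    def __vllm_inference(self, inputs, **kwargs):
        ...  # generate with the vLLM engine

    def __inference(self, inputs, **kwargs):
        ...  # original Hugging Face generation path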

src/lmflow/models/hf_model_mixin.py

  • [Architecture] line 111: pass from Inferencer, specify this as an extra argument for __init__.
  • ✅ Removed. For inference, the model initializes on the first call to model.inference(). We also make the model.activate_model_for_inference() and model.deactivate_model_for_inference() methods public, so users are able to trigger model loading/offloading manually (see the sketch at the end of this section).
  • [Architecture] line 368-419: The indentation level is too high now; consider wrapping this part of the code in a separate function.
  • ⌛ Will do this in the next feature (reward model scoring), along with the upgrade for HFTextRegressionModel inference.
  • [Architecture] line 453: LLM should not be self.backend_model; it should have another variable, such as self.backend_model_for_inference, otherwise it will interfere with other usages of self.backend_model.
  • ✅ Changed to self.backend_model_for_inference.
  • [Question] line 453: Does vllm support dynamic model change during inference?
  • ✅ Doable via CLI commands and API serving, but not in Python (same GPU memory releasing issue, see below):
from vllm import LLM, SamplingParams
import time

sampling_params = SamplingParams()
    
if __name__ == "__main__":
    llm = LLM(
        model='/home/yizhenjia/.cache/huggingface/hub/models--Qwen--Qwen2-0.5B/snapshots/ff3a49fac17555b8dfc4db6709f480cc8f16a9fe', 
        tensor_parallel_size=1,
        gpu_memory_utilization=0.95,
    )
    res = llm.generate("hi", sampling_params)
    print(res)
    time.sleep(10)
    print('change model')
    # Create a second LLM in the same process; per the GPU memory releasing
    # issue above, the first engine's memory is not freed before this point.
    llm = LLM(
        "meta-llama/Meta-Llama-3-8B-Instruct",
    )
    res = llm.generate("hi", sampling_params)
    print(res)
    print('finish')

This results in:
image
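As referenced above, a simplified sketch of the lazy activation/deactivation pattern; self.model_name_or_path and the LLM constructor arguments are assumptions for illustration, not the actual implementation:

from vllm import LLM

class HFModelMixin:
    def activate_model_for_inference(self):
        # Build the vLLM engine only when inference is actually requested.
        if getattr(self, "backend_model_for_inference", None) is None:
            self.backend_model_for_inference = LLM(model=self.model_name_or_path)

    def deactivate_model_for_inference(self):
        # Drop the reference; note that, per the issue linked in the description,
        # this does not reliably free GPU memory within the same Python process.
        self.backend_model_for_inference = None

    def inference(self, prompts):
        self.activate_model_for_inference()
        return self.backend_model_for_inference.generate(prompts)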

src/lmflow/pipeline/inferencerv2.py

  • We can rename it to vllm_inferencer.py, which matches the class name. Also, v2 is vague and confusing.

src/lmflow/pipeline/utils/collections.py

  • [Style] Better rename it. The name collections is vague and confusing.
  • ✅ Removed, since the functions have been moved to other modules.
  • [Architecture] line 15: This is a util function for models; move it to src/lmflow/utils/model.py. src/lmflow/pipeline/utils/ is mainly for customized training classes such as raft_trainer.
  • [Architecture] line 28: This is a util function for datasets; move it to src/lmflow/utils/dataset.py.
  • ✅ Moved to src/lmflow/utils/args.py. This function takes a list of dataclass objects and parses them into shell command format (like --arg1 value1 --arg2 value2). The function is used in MemorySafeVLLMInferencer, since it runs the command using subprocess (a minimal illustration follows below).
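For reference, a minimal version of such a helper might look like the following; the function name and the handling of None values are simplified assumptions:

from dataclasses import dataclass, fields

def dataclass_objects_to_shell_args(args_objects: list) -> str:
    # Flatten dataclass instances into '--name value' pairs for a subprocess command.
    parts = []
    for obj in args_objects:
        for f in fields(obj):
            value = getattr(obj, f.name)
            if value is None:
                continue
            parts.append(f"--{f.name} {value}")
    return " ".join(parts)

@dataclass
class _ExampleArgs:
    temperature: float = 0.7
    max_new_tokens: int = 128

print(dataclass_objects_to_shell_args([_ExampleArgs()]))
# --temperature 0.7 --max_new_tokens 128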

src/lmflow/pipeline/utils/memory_safe_vllm_inference.py

  • [Architecture] Move it to examples/memory_safe_vllm_inference.py, or make it a special mode of the common inference, e.g. a mode that can be activated by providing the single option --use_vllm.
  • ❓ Here comes the tricky part. This works as a module that supports MemorySafeVLLMInferencer, rather than just a workflow. When MemorySafeVLLMInferencer is used, its .inference() method runs this script in a subprocess and returns its result. This is only a workaround due to the vLLM in-Python memory releasing issue.

src/lmflow/utils/collections.py

  • [Architecture] Move the content to src/lmflow/utils/dataset.py
  • ✅ Removed src/lmflow/utils/collections.py. Moved create_copied_dataclass() and remove_dataclass_attr_prefix() to src/lmflow/utils/args.py. They are useful when there are two or more models to load through the CLI command (PPO, for example, requires a reward model and an SFT model at the same time; sketched below).
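A rough illustration of the idea behind these two helpers; the signatures and behavior of the real functions in src/lmflow/utils/args.py may differ:

from dataclasses import dataclass, field, fields, make_dataclass

@dataclass
class ModelArguments:
    model_name_or_path: str = "gpt2"

def create_copied_dataclass(original, field_prefix: str, class_prefix: str):
    # Copy a dataclass with every field name prefixed, so two models can be
    # configured from one command line (e.g. --reward_model_name_or_path ...).
    new_fields = [
        (field_prefix + f.name, f.type, field(default=f.default)) for f in fields(original)
    ]
    return make_dataclass(class_prefix + original.__name__, new_fields)

def remove_dataclass_attr_prefix(instance, prefix: str) -> dict:
    # Strip the prefix back off so the values can be fed to the original class.
    return {k[len(prefix):]: v for k, v in vars(instance).items() if k.startswith(prefix)}

RewardModelArguments = create_copied_dataclass(ModelArguments, "reward_", "Reward")
reward_args = RewardModelArguments(reward_model_name_or_path="path/to/reward_model")
print(remove_dataclass_attr_prefix(reward_args, "reward_"))
# {'model_name_or_path': 'path/to/reward_model'}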

tests/pipeline/test_memory_safe_vllm_inferencer.py

  • [Style] line 16, 23, 34: there are absolute paths; consider uploading the dataset and using Hugging Face model names.

@wheresmyhair (Collaborator, Author) commented

Tests after the architecture change

MemorySafeVLLMInferencer

  1. runtime
    image

  2. test result
    image

Compatibility

  1. run_reward_modeling.sh
    image

  2. run_finetune.sh
    image

  3. run_finetune_with_lora.sh
    image

@research4pan (Contributor) left a comment

We can record the TODO features in a roadmap issue. Everything else looks good to me.

src/lmflow/utils/args.py

  • [Architecture] The name args.py is not preferred, as it is usually used for command-line arguments.

@research4pan (Contributor) left a comment

LGTM 👍

@research4pan research4pan merged commit 02f8bcf into main Jun 19, 2024
2 checks passed
@wheresmyhair mentioned this pull request on Jun 19, 2024