
Finetuning CogVideoX-t2v-5B takes a very long time, even on 8xH100 GPUs #763

Open
xabirizar9 opened this issue Apr 22, 2025 · 6 comments


xabirizar9 commented Apr 22, 2025

Hi,

I'm trying to finetune CogVideoX-t2v-5B using LoRA with a DDP strategy on 8 H100s, but it's taking a very long time.

I'm running the train_ddp_t2v script. I've followed the best practices for the data and tried increasing the batch size and num_workers, without any real improvement. Increasing the batch size beyond 2 results in OOM errors.

I'm using a filtered version of the OpenVid-1M dataset, with the goal of finetuning on approx. 250k samples. For initial testing I started with just 70 samples, which took 30 minutes to finetune for 10 epochs, and then increased to 5k samples, which was estimated at around 38h (I didn't complete that training run because it seemed too long). I was initially feeding videos at the max resolution of 81x768x1360; lowering the resolution to 81x480x720 brought the training down to ~8h.
Additionally, I observe that the loss with the given finetuning script (without modifying the batch size or learning rate) is noisy and doesn't go down.

It seems like one of the main bottlenecks is the initial data preprocessing. The GPUs appear to be pretty lightly loaded: each GPU loads only one sample at a time, and each sample takes a few seconds to process, so at best I can preprocess 16 samples per minute at 81x768x1360. At that rate it would take about 11 days to preprocess all 250k samples, which is very long for the preprocessing alone, let alone the finetuning afterwards.
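To be concrete, what I'd hope to do during caching is batch the VAE encode step instead of pushing one clip at a time through the encoder. This is just a rough sketch of the idea (not the actual train_ddp_t2v code), assuming the diffusers AutoencoderKLCogVideoX and videos already decoded and resized to the target resolution; `cache_latents` is a made-up helper name:

```python
# Rough sketch only, not the repo's actual caching pipeline.
import torch
from diffusers import AutoencoderKLCogVideoX

device = "cuda"
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
).to(device)
vae.eval()

@torch.no_grad()
def cache_latents(video_batch: torch.Tensor) -> torch.Tensor:
    # video_batch: [B, C, F, H, W], values in [-1, 1], already resized/cropped.
    # Encode a whole batch per GPU instead of one clip at a time.
    latents = vae.encode(
        video_batch.to(device, dtype=torch.bfloat16)
    ).latent_dist.sample()
    return latents.cpu()  # save these to disk once and reuse them every epoch
```

Even just overlapping the CPU-side video decoding (a DataLoader with several workers) with the GPU-side encode should help, given that the GPUs currently sit mostly idle.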

With LoRA we're not even updating that many weights, so is there something I'm getting fundamentally wrong? Is there some way to speed up the preprocessing, and consequently the finetuning? Or is this just how long it takes to fine-tune this kind of model?

Would appreciate any insight in the right direction, thank you!

@OleehyO OleehyO self-assigned this Apr 23, 2025
Collaborator

OleehyO commented Apr 23, 2025

In our tests, with a resolution of 81x768x1360, the batch size indeed cannot exceed 2, because for video generation models, the sequence length far exceeds that of normal models.

Additionally, LoRA does not inherently reduce the computational load during training. For the scenario you mentioned (70 samples over 10 epochs, i.e. 700 sample passes at 81x768x1360 resolution), 30 minutes of fine-tuning is considered normal.
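For a rough sense of the scale involved (assuming the usual 4x temporal / 8x spatial VAE compression and a 2x2 patchify step, so treat the exact numbers as approximate):

```python
# Back-of-the-envelope token count at 81x768x1360 (approximate factors).
frames, height, width = 81, 768, 1360
latent_frames = (frames - 1) // 4 + 1                      # 21
tokens_per_frame = (height // 8 // 2) * (width // 8 // 2)  # 48 * 85 = 4080
print(latent_frames * tokens_per_frame)                    # 85_680 tokens per sample
```

With sequences that long, the attention activations dominate GPU memory, which is why the per-GPU batch size tops out around 2 even though LoRA trains only a small fraction of the weights.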

@123lcy123

When fine-tuning CogVideoX-5B on my own dataset, I've also encountered the same problem where the loss is noisy and doesn't go down. Have you discovered what the issue might be?

@xabirizar9
Author

Thanks for your answer @OleehyO. If I wanted to conduct some larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at this resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems more manageable. Can you provide more insight into this, please? Thank you!

Same question goes for the initial caching of prompts/video latents.
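(For reference, the ~1000 hours figure is just a linear extrapolation from my 5k-sample estimate, assuming throughput stays constant, which is only a rough approximation:)

```python
# Linear extrapolation from the 5k-sample estimate above.
hours_for_5k = 38
samples = 150_000
print(samples / 5_000 * hours_for_5k)  # ~1140 hours at 81x768x1360
```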


OwalnutO commented Apr 24, 2025

> When fine-tuning CogVideoX-5B on my own dataset, I've also encountered the same problem where the loss is noisy and doesn't go down. Have you discovered what the issue might be?

Same problem! The loss is noisy and doesn't go down. I tried adjusting the learning rate (1e-4, 2e-5), the scheduler (cosine with restarts, constant), and the batch size (32 -> 64), but none of them helped. Is there anyone who can help?

I tried full finetuning on my own dataset (~100k samples) with cosine_with_restarts, lr 1e-4, 48000 steps, batch size 32. The loss keeps oscillating between 0.1 and 0.5.

Collaborator

OleehyO commented Apr 24, 2025

@123lcy123 @OwalnutO, does the loss fluctuate continuously from the start to the end of training, or does it decrease at the beginning and then fluctuate within a large range? If it's the latter, I think that's normal; but if the loss oscillates between 0.1 and 0.5, that's indeed a bit strange, because that range is quite large. It's recommended to expand the dataset or further increase the batch size.

Collaborator

OleehyO commented Apr 24, 2025

> Thanks for your answer @OleehyO. If I wanted to conduct some larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at this resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems more manageable. Can you provide more insight into this, please? Thank you!
>
> The same question applies to the initial caching of prompts/video latents.

For larger-scale training, it is recommended to use more professional training frameworks like Megatron, as our provided training scripts are only intended for small-scale fine-tuning. Training on large-scale datasets may require a very long time.

After reducing from 81x768x1360 to 81x480x720, the sequence length becomes about 1/3 of the original, so the computational load of the attention module (O(n^2)) theoretically drops to about 1/9 of the original, while the other modules drop to about 1/3. A 7-8x reduction in time is therefore normal. The same logic applies to reducing the number of video frames. However, for large-scale training, it is still recommended to use a more professional framework.
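As a quick sanity check on those ratios (using the same approximate compression factors as in the earlier estimate, so a rough estimate rather than a profiled number):

```python
# Rough estimate only, not a profiled measurement.
def seq_len(frames, height, width):
    return ((frames - 1) // 4 + 1) * (height // 8 // 2) * (width // 8 // 2)

hi = seq_len(81, 768, 1360)   # 85_680
lo = seq_len(81, 480, 720)    # 28_350
print(hi / lo)                # ~3.0x shorter sequence
print((hi / lo) ** 2)         # ~9.1x fewer attention FLOPs (the O(n^2) term)
# The end-to-end speedup lands somewhere between 3x and 9x depending on how much
# time is spent in attention vs. the linear layers, so ~7-8x is consistent.
```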
