
Finetuning CogVideoX-t2v-5B takes a very long time, even on 8xH100 GPUs #763

Open
xabirizar9 opened this issue Apr 22, 2025 · 6 comments


xabirizar9 commented Apr 22, 2025

Hi,

I'm trying to finetune CogVideoX-t2v-5B using LoRA with a DDP strategy on 8 H100s, but it's taking a very long time.

I'm running the train_ddp_t2v script. I've followed the best practices for the data and tried increasing the batch size and num_workers, without any real improvement. Increasing the batch size beyond 2 results in OOM errors.

I'm using a filtered version of the OpenVid-1M dataset, with the goal of finetuning on approx. 250k samples. For initial testing I started with just 70 samples, which took 30 minutes to finetune for 10 epochs, and then increased to 5k samples, which was estimated at around 38h (I didn't complete that training run because it seemed too long). I was initially feeding videos at the max resolution of 81x768x1360; lowering the resolution to 81x480x720 brought the training down to ~8h.
Additionally, I observe that the loss with the given finetuning script (without modifying the batch size or learning rate) is noisy and doesn't go down.

It seems like one of the main bottlenecks is the initial data preprocessing. The GPUs appear to be pretty lightly loaded: each GPU loads only one sample at a time, and each sample takes a few seconds to process, so at best I can preprocess 16 samples per minute at 81x768x1360. At that rate it would take about 11 days to preprocess all 250k samples, which is very long for the preprocessing alone, let alone the finetuning afterwards.
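To be concrete, what I'd hope to do during caching is batch the VAE encode step instead of pushing one clip at a time through the encoder. This is just a rough sketch of the idea (not the actual train_ddp_t2v code), assuming the diffusers AutoencoderKLCogVideoX and videos already decoded and resized to the target resolution; `cache_latents` is a made-up helper name:

```python
# Rough sketch only, not the repo's actual caching pipeline.
import torch
from diffusers import AutoencoderKLCogVideoX

device = "cuda"
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
).to(device)
vae.eval()

@torch.no_grad()
def cache_latents(video_batch: torch.Tensor) -> torch.Tensor:
    # video_batch: [B, C, F, H, W], values in [-1, 1], already resized/cropped.
    # Encode a whole batch per GPU instead of one clip at a time.
    latents = vae.encode(
        video_batch.to(device, dtype=torch.bfloat16)
    ).latent_dist.sample()
    return latents.cpu()  # save these to disk once and reuse them every epoch
```

Even just overlapping the CPU-side video decoding (a DataLoader with several workers) with the GPU-side encode should help, given that the GPUs currently sit mostly idle.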

With LoRA we're not even updating that many weights, so is there something I'm getting fundamentally wrong? Is there some way to speed up the preprocessing, and consequently the finetuning? Or is this just how long it takes to fine-tune this kind of model?

Would appreciate any insight in the right direction, thank you!

@OleehyO OleehyO self-assigned this Apr 23, 2025
Collaborator

OleehyO commented Apr 23, 2025

In our tests, with a resolution of 81x768x1360, the batch size indeed cannot exceed 2, because for video generation models, the sequence length far exceeds that of normal models.

Additionally, LoRA does not inherently reduce the computational load during training. For the scenario you mentioned (70 samples over 10 epochs, i.e. 700 sample passes at 81x768x1360 resolution), 30 minutes of fine-tuning is considered normal.
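For a rough sense of the scale involved (assuming the usual 4x temporal / 8x spatial VAE compression and a 2x2 patchify step, so treat the exact numbers as approximate):

```python
# Back-of-the-envelope token count at 81x768x1360 (approximate factors).
frames, height, width = 81, 768, 1360
latent_frames = (frames - 1) // 4 + 1                      # 21
tokens_per_frame = (height // 8 // 2) * (width // 8 // 2)  # 48 * 85 = 4080
print(latent_frames * tokens_per_frame)                    # 85_680 tokens per sample
```

With sequences that long, the attention activations dominate GPU memory, which is why the per-GPU batch size tops out around 2 even though LoRA trains only a small fraction of the weights.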

@123lcy123

When fine-tuning CogVideoX-5B on my own dataset, I've also encountered the same problem where the loss is noisy and doesn't go down. Have you discovered what the issue might be?

@xabirizar9
Author

Thanks for your answer @OleehyO. If I wanted to conduct some larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at this resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems more manageable. Can you provide more insight into this, please? Thank you!

Same question goes for the initial caching of prompts/video latents.
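(For reference, the ~1000 hours figure is just a linear extrapolation from my 5k-sample estimate, assuming throughput stays constant, which is only a rough approximation:)

```python
# Linear extrapolation from the 5k-sample estimate above.
hours_for_5k = 38
samples = 150_000
print(samples / 5_000 * hours_for_5k)  # ~1140 hours at 81x768x1360
```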


OwalnutO commented Apr 24, 2025

> When fine-tuning CogVideoX-5B on my own dataset, I've also encountered the same problem where the loss is noisy and doesn't go down. Have you discovered what the issue might be?

Same problem! The loss is noisy and doesn't go down. I tried adjusting the learning rate (1e-4, 2e-5), the scheduler (cosine with restarts, constant), and the batch size (32 -> 64), but none of them helped. Is there anyone who can help?

I tried full finetuning on my own dataset (~100k samples) with cosine_with_restarts, lr 1e-4, 48000 steps, batch size 32. The loss keeps oscillating between 0.1 and 0.5.

Collaborator

OleehyO commented Apr 24, 2025

@123lcy123 @OwalnutO, does the loss fluctuate continuously from the start to the end of training, or does it decrease at the beginning and then fluctuate within a large range? If it's the latter, I think that's normal; but if the loss oscillates between 0.1 and 0.5, that's indeed a bit strange, because that range is quite large. It's recommended to expand the dataset or further increase the batch size.

Collaborator

OleehyO commented Apr 24, 2025

> Thanks for your answer @OleehyO. If I wanted to conduct some larger-scale finetuning, say with 150k samples, is this just a cost I have to accept, i.e. that at this resolution it would take roughly 1000 hours? I noticed that lowering the resolution reduces training time by about 7x, which seems more manageable. Can you provide more insight into this, please? Thank you!
>
> The same question applies to the initial caching of prompts/video latents.

For larger-scale training, it is recommended to use more professional training frameworks like Megatron, as our provided training scripts are only intended for small-scale fine-tuning. Training on large-scale datasets may require a very long time.

After reducing from 81x768x1360 to 81x480x720, the sequence length becomes about 1/3 of the original, so the computational load of the attention module (O(n^2)) theoretically drops to about 1/9 of the original, while the other modules drop to about 1/3. A 7-8x reduction in time is therefore normal. The same logic applies to reducing the number of video frames. However, for large-scale training, it is still recommended to use a more professional framework.
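As a quick sanity check on those ratios (using the same approximate compression factors as in the earlier estimate, so a rough estimate rather than a profiled number):

```python
# Rough estimate only, not a profiled measurement.
def seq_len(frames, height, width):
    return ((frames - 1) // 4 + 1) * (height // 8 // 2) * (width // 8 // 2)

hi = seq_len(81, 768, 1360)   # 85_680
lo = seq_len(81, 480, 720)    # 28_350
print(hi / lo)                # ~3.0x shorter sequence
print((hi / lo) ** 2)         # ~9.1x fewer attention FLOPs (the O(n^2) term)
# The end-to-end speedup lands somewhere between 3x and 9x depending on how much
# time is spent in attention vs. the linear layers, so ~7-8x is consistent.
```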
