Reproduce fine-tuning but score poorly on the evaluation dataset #20
Hi, it looks like you're following the training instructions correctly. However, please note that all of our fine-tuning experiments were conducted in full precision. We haven't tested or validated the E2E fine-tuning setup with mixed precision (FP16 or BF16), which could explain the discrepancies you're seeing in the evaluation scores. Let us know if switching to full precision resolves the issue or if you continue to encounter problems!
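For reference, a quick way to check which precision a run is actually using might look like the sketch below. It only uses standard accelerate/PyTorch calls and a dummy model, nothing repo-specific:

```python
import torch
from accelerate import Accelerator

# Illustrative check: "no" means full fp32, as in the released experiments.
accelerator = Accelerator(mixed_precision="no")
print(accelerator.mixed_precision)        # -> "no", "fp16", or "bf16"

model = torch.nn.Linear(4, 4)             # dummy stand-in for the real model
model = accelerator.prepare(model)
print(next(model.parameters()).dtype)     # weights stay fp32; autocast handles fp16/bf16 compute
```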
Thanks for your reply! Could you confirm how much GPU VRAM is required for training with the provided settings? The hardware I use is an RTX 3090 with 24 GB of VRAM, and setting mixed precision to "no" causes an out-of-memory error.

Another problem I found:
This causes memory to keep accumulating, even though, according to the original paper, the VAE is frozen. I would rewrite it as follows:
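A rough sketch of the idea, assuming a standard diffusers AutoencoderKL and a plain PyTorch training step (the checkpoint name below is only an example), is to freeze the VAE and encode under `torch.no_grad()` so its activations never enter the autograd graph:

```python
import torch
from diffusers import AutoencoderKL

# Example VAE; the repository's actual model/checkpoint may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.requires_grad_(False)   # freeze all VAE parameters
vae.eval()

image = torch.randn(1, 3, 256, 256)  # dummy input in place of a real RGB batch

# Encode under no_grad so no activations are retained for backprop;
# otherwise the graph (and GPU memory) grows with every training step.
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
latents = latents * vae.config.scaling_factor
```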
Hi, a possible option is to try a batch size of 1 with gradient accumulation set to 32.
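As a rough illustration of that suggestion (a toy PyTorch loop with hypothetical stand-in model and data, not the repository's training script), gradient accumulation keeps the effective batch size at 32 while only a single sample is on the GPU per forward pass:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins: the real run uses the repo's U-Net and depth dataset.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=1)

gradient_accumulation_steps = 32          # effective batch size = 1 * 32
criterion = nn.MSELoss()

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / gradient_accumulation_steps  # average over the virtual batch
    loss.backward()                                              # gradients accumulate across steps
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()                                         # one update per 32 samples
        optimizer.zero_grad()
```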
Hi, how do you enable bf16 or fp16 training with this code? I set --mixed_precision="bf16", but it does not work:
Can you teach me? Thank you!
Hi,
Good luck and enjoy! If you have good news, I look forward to you sharing it and discussing it here.
Hi,
Thanks to the authors for their contribution, but I ran into some problems reproducing the results.
Why do I get poor scores on the evaluation dataset when reproducing the fine-tuning on my RTX 3090?
The scores are as follows:
The following is the training script configuration I use, based on train_marigold_e2e_ft_depth.sh:
The complete script is as follows: