How to convert the torch_dist ckpt to the nemo file? #11761

Kamizato-Ayaka · 2025-01-06T08:51:58Z

Description:
I trained a MegatronGPTSFT model using NeMo. After training, I obtained only a checkpoint directory in torch_dist format, not a .nemo file. I need the .nemo file to convert the model to the Hugging Face format, but I can't figure out how to convert the torch_dist checkpoint into a .nemo file.

I have tried using the following scripts provided by NeMo:

NeMo/scripts/checkpoint_converters/convert_zarr_to_torch_dist.py
NeMo/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py

Both scripts fail at the load_checkpoints step. The error log is as follows:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/aiscuser/dengjie/NeMo/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py", line 252, in <module>
[rank2]:     convert(local_rank, rank, world_size, args)
[rank2]:   File "/home/aiscuser/dengjie/NeMo/examples/nlp/language_modeling/megatron_ckpt_to_nemo.py", line 203, in convert
[rank2]:     model = MegatronGPTModel.load_from_checkpoint(
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/collections/nlp/models/nlp_model.py", line 385, in load_from_checkpoint
[rank2]:     model = ptl_load_state(cls, checkpoint, strict=strict, cfg=cfg, **kwargs)
[rank2]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 165, in _load_state
[rank2]:     obj = instantiator(cls, _cls_kwargs) if instantiator else cls(**_cls_kwargs)
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py", line 331, in __init__
[rank2]:     super().__init__(cfg, trainer=trainer, no_lm_init=True)
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 222, in __init__
[rank2]:     self._build_tokenizer()
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/collections/nlp/models/language_modeling/megatron_base_model.py", line 442, in _build_tokenizer
[rank2]:     tokenizer_model=self.register_artifact("tokenizer.model", self._cfg.tokenizer.get('model', None)),
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/collections/nlp/models/nlp_model.py", line 159, in register_artifact
[rank2]:     return super().register_artifact(config_path, src, verify_src_exists=verify_src_exists)
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/core/classes/modelPT.py", line 286, in register_artifact
[rank2]:     return self._save_restore_connector.register_artifact(self, config_path, src, verify_src_exists)
[rank2]:   File "/home/aiscuser/dengjie/NeMo/nemo/core/connectors/save_restore_connector.py", line 406, in register_artifact
[rank2]:     return_path = os.path.abspath(os.path.join(app_state.nemo_file_folder, src[5:]))
[rank2]:   File "/opt/conda/envs/ptca/lib/python3.10/posixpath.py", line 76, in join
[rank2]:     a = os.fspath(a)
[rank2]: TypeError: expected str, bytes or os.PathLike object, not NoneType
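
For reference, the last frames of the traceback can be reproduced in isolation. This is only a sketch of the failure mode, not NeMo code: `nemo_file_folder` stands in for `app_state.nemo_file_folder`, which appears to be None here, and the `nemo:tokenizer.model` value is an assumed example of a path whose first five characters get sliced off by `src[5:]`.

```python
import os

# Stand-in for app_state.nemo_file_folder; the traceback suggests it is None
# when the model is built from a raw checkpoint rather than a .nemo archive.
nemo_file_folder = None

# Hypothetical tokenizer path; the "nemo:" prefix is an assumption based on
# the src[5:] slice in save_restore_connector.py.
src = "nemo:tokenizer.model"

try:
    os.path.join(nemo_file_folder, src[5:])
except TypeError as err:
    message = str(err)
    print(message)
```

If that reading is right, the failure comes from the tokenizer path stored in the checkpoint config rather than from the torch_dist weights themselves.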

Steps I Tried:
I verified the checkpoints in multiple Docker environments, but the issue persists.
I reviewed the NeMo documentation and examples but could not find a resolution.

Question:
Is there a better or more robust way to convert torch_dist checkpoints into a .nemo file? Any suggestions or best practices would be greatly appreciated!

ashors1 · 2025-05-01T22:20:37Z

Collaborator

Hi, thanks for the question and apologies for the late reply. I see that you are using the NeMo 1.0 codepath. We have introduced NeMo 2.0, which has more seamless compatibility with HuggingFace. We have some documentation here. Any supported model can be exported to HF format using the export_ckpt API. We recommend upgrading to NeMo 2.0 and testing out the HuggingFace conversion there.
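
A minimal sketch of what that export call might look like, assuming NeMo 2.0 is installed and the checkpoint is already in NeMo 2.0 format (a NeMo 1.0 torch_dist directory is not assumed to load directly). The helper names `build_export_kwargs` and `export_to_hf` and both paths are hypothetical; only `llm.export_ckpt` with `path`, `target`, and `output_path` comes from NeMo, so check the signature against the installed version.

```python
from pathlib import Path


def build_export_kwargs(nemo2_ckpt_dir, hf_output_dir):
    """Collect the arguments for the export call (hypothetical helper)."""
    return {
        "path": Path(nemo2_ckpt_dir),        # NeMo 2.0 checkpoint directory
        "target": "hf",                      # export to Hugging Face format
        "output_path": Path(hf_output_dir),  # where the HF model is written
    }


def export_to_hf(nemo2_ckpt_dir, hf_output_dir):
    """Export a NeMo 2.0 checkpoint to Hugging Face format (hypothetical wrapper)."""
    # Imported lazily so this file can be read without NeMo installed.
    from nemo.collections import llm

    return llm.export_ckpt(**build_export_kwargs(nemo2_ckpt_dir, hf_output_dir))


if __name__ == "__main__":
    # Placeholder paths; replace with real locations.
    export_to_hf("/path/to/nemo2/checkpoint", "/path/to/hf/output")
```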
