Implemented lazy line-by-line text data set loading for LM example script #4009
Conversation
…ling including a dataset and a collator.
Used this for training a model, worked great! Would love to see this integrated
My main question is about the force_pad_token option/feature. What do you think @LysandreJik @patrickvonplaten @BramVanroy?
metadata={
    "help": "Whether to force the addition of a padding token to tokenizer that does not already have one."
},
)
I'm not a fan of this option personally (also see #4122 (comment))
I'd rather the example scripts do not modify the specified tokenizer – I feel like advanced users should modify their tokenizer off-script.
what do you think @patrickvonplaten?
Hmm, same from my side. I think one should use Trainer with DataCollatorForLanguageModeling and use a fitting tokenizer.
GPT2 is a heavily used model though, and it would be nice to allow using it via this script. Another possibility I could see here is to add a pad_token_id argument that would set the pad token to the provided id, so for GPT2 one could do --pad_token_id 50256. On the other hand, pad_token_id seems to be quite a specific param to add to the args, so it might be cleaner to just not allow this case and force the user to use Trainer + their own data collator.
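For illustration only, a minimal sketch of what such a --pad_token_id option could look like, written in the dataclass/field style the script already uses; the class name ExtraModelArguments, the variable extra_args, and the wiring shown in the comments are assumptions, not part of this PR.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtraModelArguments:
    # Hypothetical flag along the lines discussed above; not part of this PR.
    pad_token_id: Optional[int] = field(
        default=None,
        metadata={"help": "Existing token id to reuse as the padding token, e.g. 50256 (GPT-2's EOS)."},
    )


# Later, after the tokenizer has been loaded (sketch only):
# if extra_args.pad_token_id is not None:
#     tokenizer.pad_token = tokenizer.convert_ids_to_tokens(extra_args.pad_token_id)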
# See PR 3388. Some tokenizers don't have pad tokens, which causes errors at the encoding step in the collate_fn.
# We give here the option to force the addition of a pad token. The attention mask is used to ignore this token
# when feeding to the model.
tokenizer.add_special_tokens({"pad_token": "<pad>"})
I don't think this will work, since it will give the pad_token the id len(tokenizer) + 1, which does not exist in the model embedding weights. What one could do for GPT2 is to set the pad token to the EOS token, i.e. tokenizer.pad_token = tokenizer.eos_token. Since GPT2 uses causal masking, this should be fine.
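A minimal sketch of that suggestion, assuming a standard GPT-2 checkpoint and a transformers version with the padding=True tokenizer API (3.x+); no new embedding row is needed because an existing token id is reused.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse EOS so padded positions map to an
# embedding that already exists, and rely on the attention mask to ignore them.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short line", "a somewhat longer line of text"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, tokenizer.pad_token_id)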
@patrickvonplaten I think it should work since later on the embeddings are resized:
model.resize_token_embeddings(len(tokenizer))
Ah yeah, you're right - I didn't even realize that there is an embedding resize in the script as well.
I don't think one should add a new embedding weight for GPT2 though, but rather reuse some other token for padding (they are all masked anyway), so that the whole model does not have to be retrained. model.resize_token_embeddings only adds new tokens if len(tokenizer) is greater than the model's old embedding size. A lot of people just wanting to fine-tune GPT2 can run into bad performance here without knowing why.
Thinking a bit more about it, I agree with @julien-c and think people should just adapt this script for their own (GPT2) needs. It's really not that long, and providing hacky functionality here is not worth it.
And the script does work for GPT2 by default, i.e. if you don't opt in to --line_by_line
@GCHQResearcher92457 @BramVanroy Does it work for you if we tweak the PR on your fork's branch so that we can remove the force_pad_token option and update a few things? PS: Sorry about the super long review time :)
Co-authored-by: Julien Chaumond <[email protected]>
Sure. I think the GPT thing was a bit of a rabbit hole. I added the hacks with pad tokens because I thought I'd introduced a problem with lazy loading, without realising that the problem was in fact already there with line-by-line.
Yes, definitely seems like a good way to go!
block_size: int = 512

def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:
Hello 👋
Thanks for the PR! I tried the DataCollatorForLazyLanguageModeling and LazyLineByLineTextDataset with transformers==3.0.2, and somehow I had to rename collate_batch to __call__ to make it work.
Not sure if I'm missing something - dropping a note here in case someone runs into the same issue.
Thanks again!
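For reference, a self-contained sketch of a collator written against that newer interface, where the batching logic lives in __call__; the class name, the List[str] input, and the exact tokenization arguments are assumptions for the sketch, not the PR's code, and the tokenizer is assumed to have a pad token.

from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import PreTrainedTokenizer


@dataclass
class LazyCollatorSketch:
    tokenizer: PreTrainedTokenizer
    block_size: int = 512

    def __call__(self, examples: List[str]) -> Dict[str, torch.Tensor]:
        # Newer Trainer versions call the collator object directly, hence
        # __call__ rather than collate_batch.
        batch = self.tokenizer(
            examples,
            truncation=True,
            max_length=self.block_size,
            padding=True,
            return_tensors="pt",
        )
        # Standard LM setup: labels mirror the input ids.
        batch["labels"] = batch["input_ids"].clone()
        return batch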
Hello everyone, I think this PR will be a huge addition to Transformers.
This is in the hands of @julien-c now, but I think he's on holiday at the moment.
Isn't this superseded by
Are all examples now fully using
I have no objection to merging this temporarily, if the remarks from the comments are taken into account, merge conflicts are handled, and the deprecated API is updated (the data collator should implement __call__).
Moving the examples to nlp is on my TODO for the near future @BramVanroy, and I think @thomwolf is also planning on working on this.
When I try to run this code following the example here, I get the error below:
Not sure, but I think this PR hasn't been updated to reflect recent changes.
Hi @GCHQResearcher92457, thanks for your great work. But the script is always killed at
I checked my CPU and GPU usage; they are not full. I also changed the size of
I cloned your transformers repo and use the branch
Could you please give me any idea about this problem? Thanks.
More info: I am also using distributed training to run the model.
@chiyuzhang94 You probably have a process killer running. This is a background process that monitors the memory usage of the individual processes; if the system is about to run out of memory, it kills the offending process. My hunch is that Colab uses something similar. The high memory usage occurs because linecache reads as much of the file into memory as it can, to make access as fast as possible. Not all OSes seem to like this, although I have not had any issues with this approach on my systems. Here's a good article: https://dev.to/rrampage/surviving-the-linux-oom-killer-2ki9
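Roughly, the linecache-based approach under discussion looks like the sketch below; the class name and the pre-computed line count passed to the constructor are assumptions for illustration, not the PR's exact code.

import linecache

from torch.utils.data import Dataset


class LazyLineDatasetSketch(Dataset):
    def __init__(self, file_path: str, num_lines: int):
        self.file_path = file_path
        self.num_lines = num_lines  # assumed to be pre-computed elsewhere

    def __len__(self) -> int:
        return self.num_lines

    def __getitem__(self, idx: int) -> str:
        # linecache is 1-indexed and keeps what it reads in an in-process cache,
        # which is where the high memory usage on a very large file comes from.
        return linecache.getline(self.file_path, idx + 1).rstrip("\n")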
Thanks, @BramVanroy. I think it is hard for me to change the
@chiyuzhang94 No, that function is not related to the caching. It is a function that can very quickly read through a file to figure out how many lines it contains. The size is the chunk size in bytes to read sequentially, which is much faster than reading line by line. But again, nothing to do with caching. One option that I can think of is allowing for an argument
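A sketch of the kind of chunked line-counting helper being described; the function name and default chunk size are assumptions, not necessarily the exact code in the PR.

def count_lines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newlines by reading fixed-size byte chunks, which is much faster
    than iterating line by line and does not involve linecache at all."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total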
Thanks @BramVanroy, I tried your suggestion:
But I found that linecache.clearcache() doesn't help, based on the log.
Then the job was killed. I am using distributed training where each node has 4 GPUs. Since each of the 4 Python processes eventually reads the entire file (90 GB) into memory, the dataset would take up over 360 GB per node if they fully loaded it, but each node only has 186 GB of RAM. Do you have any suggestions for limiting the caching size?
Any progress? @GCHQResearcher92457
Hi @BramVanroy @GCHQResearcher92457, I found a point that might be causing the memory issues in the code (https://github.com/GCHQResearcher92457/transformers/blob/lazy-text-dataset-loading-for-lm/examples/run_language_modeling.py).
In the main function, the rank 1-3 processes all stop at the barrier at line 770, while rank 0 progresses and loads the model and vocab; rank 0 then hits line 825 and releases the barrier. Once the barrier is released, ranks 1-3 still execute lines 770-825 (loading the model in the main function). The same holds for lines 832-837 (loading the dataset). I have four GPUs on each node, so ranks 1-3 each load the model and dataset from disk individually instead of using a copy from rank 0, which leads to the OOM issue.
I thought the rank 1-3 processes should not run lines 832-837 again once the barrier is released, but from the logging I added I found that when a process hits a barrier it simply waits at that spot in the code until all other processes have hit a barrier; when it is released, it continues from the point it is at in the code rather than jumping past the latest barrier.
I tried to add an if condition at line 770 so that only rank 0 loads the model, but I got a new error showing that the variables are not synchronized across devices: ranks 1-3 cannot get the variable
Did you notice this issue? Do you have any suggestions?
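For context, the barrier pattern being described is roughly the following sketch; the helper names and the local_rank argument are simplified placeholders, not the script's actual code. The key point is that every rank still executes the loading lines, rank 0 just gets to populate the cache first.

import torch.distributed as dist


def load_everything(local_rank: int):
    # Ranks other than 0 wait here while rank 0 downloads and caches.
    if local_rank not in (-1, 0):
        dist.barrier()

    # Every rank still executes these lines; ranks 1..n read from the cache
    # that rank 0 filled, but each builds its own in-memory copy.
    model, tokenizer = load_model_and_tokenizer()   # placeholder helper
    train_dataset = build_train_dataset(tokenizer)  # placeholder helper

    # Rank 0 releases the others once it is done.
    if local_rank == 0:
        dist.barrier()

    return model, tokenizer, train_dataset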
@chiyuzhang94 I am not sure why the memory is not clearing after using clearcache. It might be that you still have to call the garbage collector after clearing the cache; you can try that. It is true that I had not thought about multi-node support, so you will indeed have a separate in-memory cache for each process. I do not think it is easy to bypass that, unless you restructure all of the code to work with a dedicated reading process, i.e. a separate process that fetches lines from the data file. As has been said before, though, it is now recommended to switch over to https://github.com/huggingface/nlp which allows for on-disk datasets that are fast and have a low memory footprint.
@BramVanroy (For the sake of discussion) Wouldn't it be reasonably easy to enable (non-cached) random access to the text file(s) by storing a list of the positions of the newlines?
Shouldn't be too hard to implement indeed, although my fear is that this might not be fast enough from an IO perspective. That is perhaps the trade-off one would want to make, though, so it might be worth it. You'd still need to make sure that all data is actually used, so in a shuffle setting this might not be straightforward if you want batches of consistent size. Perhaps, depending on the number of lines, you can create a list of indexes that have
I am not sure whether I want to put time into this, though, seeing that
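A sketch of the offset-index idea, assuming a single plain UTF-8 text file; the index stores the byte position of each line start so __getitem__ can seek instead of caching file contents. The class name is a placeholder.

from torch.utils.data import Dataset


class OffsetIndexedTextDataset(Dataset):
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.offsets = []
        position = 0
        with open(file_path, "rb") as handle:
            for line in handle:            # binary mode: len(line) is the byte length
                self.offsets.append(position)
                position += len(line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, idx: int) -> str:
        # Seek directly to the stored byte offset; nothing is kept in memory
        # beyond the (comparatively small) offset list.
        with open(self.file_path, "rb") as handle:
            handle.seek(self.offsets[idx])
            return handle.readline().decode("utf-8").rstrip("\n")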
Hi @BramVanroy, thanks for your suggestion. I looked at the
I am wondering whether this is the optimal way to use
I used that approach in my case to train an LM (RoBERTa-like) from scratch. I didn't modify the dataloader. It works for some iterations, but sooner or later it ends with some kind of CUBLAS error.
I'll start updating the examples to use the datasets library as soon as our new
Your example @chiyuzhang94 is ok, but by doing
You can directly use the dataset in a data loader by using
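For reference, a rough sketch of the datasets-based route being pointed to; the file name, checkpoint, max length, and batch size below are illustrative assumptions, not values from this thread.

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The text file is memory-mapped on disk rather than read into RAM.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

# Tokenize in batches; fixed-length padding keeps batches stackable.
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=512
    ),
    batched=True,
)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# The formatted dataset can be fed straight to a standard DataLoader.
loader = DataLoader(dataset, batch_size=8)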
No, that is not possible. You cannot expect a company to open-source a great product and at the same time implement features within the day. As said numerous times in this topic, try out the
Hi @thomwolf, I tried to implement this to load my text file. This
But the dataloader cannot yield samples, and the error is:
I noticed the
I don't know how to modify this code to load the text file. Could you please give me any suggestions?
@chiyuzhang94 Can you please ask your question either on the forums or on the respective repository? Your question is not a
Sure. Thanks for your investigation. I posted this question here: huggingface/datasets#610 (comment). @BramVanroy @thomwolf
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
See PR #3388. Master changed substantially, requiring relocation of code into previously untouched files, etc. Instead, here is a new PR using the same code but refactored to fit into the new, more modular structure of the scripts in examples.