Implemented lazy line-by-line text data set loading for LM example script #4009
Conversation
…ling including a dataset and a collator.
Used this for training a model, worked great! Would love to see this integrated
My main question is about the force_pad_token option/feature. What do you think @LysandreJik @patrickvonplaten @BramVanroy?
metadata={
    "help": "Whether to force the addition of a padding token to tokenizer that does not already have one."
},
)
I'm not a fan of this option personally (also see #4122 (comment))
I'd rather the example scripts do not modify the specified tokenizer – I feel like advanced users should modify their tokenizer off-script.
what do you think @patrickvonplaten?
Hmm, same from my side. I think one should use Trainer with DataCollatorForLanguageModeling and use a fitting tokenizer.
GPT2 is a heavily used model though, and it would be nice to allow using it via this script. Another possibility I could see here is to add a pad_token_id argument that would set the pad token to the provided id, so for GPT2 one could do --pad_token_id 50256. On the other hand, pad_token_id seems to be quite a specific param to add to the args, so it might be cleaner to just not allow this case and force the user to use Trainer + their own data collator.
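For illustration only, a minimal sketch of what such a --pad_token_id option could look like, written in the dataclass/field style the script already uses; the class name ExtraModelArguments, the variable extra_args, and the wiring shown in the comments are assumptions, not part of this PR.

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtraModelArguments:
    # Hypothetical flag along the lines discussed above; not part of this PR.
    pad_token_id: Optional[int] = field(
        default=None,
        metadata={"help": "Existing token id to reuse as the padding token, e.g. 50256 (GPT-2's EOS)."},
    )


# Later, after the tokenizer has been loaded (sketch only):
# if extra_args.pad_token_id is not None:
#     tokenizer.pad_token = tokenizer.convert_ids_to_tokens(extra_args.pad_token_id)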
# See PR 3388. Some tokenizers don't have pad tokens, which causes errors at the encoding step in the collate_fn.
# We give here the option to force the addition of a pad token. The attention mask is used to ignore this token
# when feeding to the model.
tokenizer.add_special_tokens({"pad_token": "<pad>"})
I don't think this will work, since it will give the pad_token the id len(tokenizer) + 1, which does not exist in the model embedding weights. What one could do for GPT2 is to set the pad token to the EOS token, i.e. tokenizer.pad_token = tokenizer.eos_token. Since GPT2 uses causal masking, this should be fine.
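A minimal sketch of that suggestion, assuming a standard GPT-2 checkpoint and a transformers version with the padding=True tokenizer API (3.x+); no new embedding row is needed because an existing token id is reused.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse EOS so padded positions map to an
# embedding that already exists, and rely on the attention mask to ignore them.
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short line", "a somewhat longer line of text"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape, tokenizer.pad_token_id)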
@patrickvonplaten I think it should work since later on the embeddings are resized:
model.resize_token_embeddings(len(tokenizer))
Ah yeah, you're right - I didn't even realize that there is an embedding resize in the script as well.
I don't think one should add a new embedding weight for GPT2 though, but rather reuse some other token for padding (they are all masked anyway), so that the whole model does not have to be retrained. model.resize_token_embeddings only adds new tokens if len(tokenizer) is greater than the model's old embedding size. A lot of people just wanting to fine-tune GPT2 can run into bad performance here without knowing why.
Thinking a bit more about it, I agree with @julien-c and think people should just adapt this script for their own (GPT2) needs. It's really not that long, and providing hacky functionality here is not worth it.
And the script does work for GPT2 by default, i.e. if you don't opt in to --line_by_line
@GCHQResearcher92457 @BramVanroy Does it work for you if we tweak the PR on your fork's branch so that we can remove the force_pad_token option and update a few things? PS: Sorry about the super long review time :)
Co-authored-by: Julien Chaumond <[email protected]>
Sure. I think the GPT thing was a bit of a rabbit hole. I added the hacks with pad tokens because I thought I'd introduced a problem with lazy loading, without realising that the problem was in fact already there with line-by-line.
Yes, definitely seems like a good way to go!
block_size: int = 512

def collate_batch(self, examples: List[torch.Tensor]) -> Dict[str, torch.Tensor]:
Hello 👋
Thanks for the PR! I tried the DataCollatorForLazyLanguageModeling and LazyLineByLineTextDataset with transformers==3.0.2, and somehow I had to rename collate_batch to __call__ to make it work.
Not sure if I'm missing something - dropping a note here in case someone runs into the same issue.
Thanks again!
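For reference, a self-contained sketch of a collator written against that newer interface, where the batching logic lives in __call__; the class name, the List[str] input, and the exact tokenization arguments are assumptions for the sketch, not the PR's code, and the tokenizer is assumed to have a pad token.

from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import PreTrainedTokenizer


@dataclass
class LazyCollatorSketch:
    tokenizer: PreTrainedTokenizer
    block_size: int = 512

    def __call__(self, examples: List[str]) -> Dict[str, torch.Tensor]:
        # Newer Trainer versions call the collator object directly, hence
        # __call__ rather than collate_batch.
        batch = self.tokenizer(
            examples,
            truncation=True,
            max_length=self.block_size,
            padding=True,
            return_tensors="pt",
        )
        # Standard LM setup: labels mirror the input ids.
        batch["labels"] = batch["input_ids"].clone()
        return batch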
Hello everyone, I think this PR will be a huge addition to Transformers.
This is in the hands of @julien-c now, but I think he's on holiday at the moment.
Isn't this superseded by
Are all examples now fully using
I have no objection to merging this temporarily, if the remarks from the comments are taken into account, merge conflicts are handled, and the deprecated API is updated (the data collator should implement __call__).
Moving the examples to nlp is on my TODO for the near future @BramVanroy, and I think @thomwolf is also planning on working on this.
When I try to run this code following the example here, I get the error below:
Not sure, but I think this PR hasn't been updated to reflect recent changes.
Hi @GCHQResearcher92457, thanks for your great work. But the script is always killed at
I checked my CPU and GPU usage; they are not full. I also changed the size of
I cloned your transformers repo and use the branch
Could you please give me any idea about this problem? Thanks.
More info: I am also using distributed training to run the model.
@chiyuzhang94 You probably have a process killer running. This is a background process that monitors the memory usage of the individual processes; if the system is about to run out of memory, it kills the offending process. My hunch is that Colab uses something similar. The high memory usage occurs because linecache reads as much of the file into memory as it can, to make access as fast as possible. Not all OSes seem to like this, although I have not had any issues with this approach on my systems. Here's a good article: https://dev.to/rrampage/surviving-the-linux-oom-killer-2ki9
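Roughly, the linecache-based approach under discussion looks like the sketch below; the class name and the pre-computed line count passed to the constructor are assumptions for illustration, not the PR's exact code.

import linecache

from torch.utils.data import Dataset


class LazyLineDatasetSketch(Dataset):
    def __init__(self, file_path: str, num_lines: int):
        self.file_path = file_path
        self.num_lines = num_lines  # assumed to be pre-computed elsewhere

    def __len__(self) -> int:
        return self.num_lines

    def __getitem__(self, idx: int) -> str:
        # linecache is 1-indexed and keeps what it reads in an in-process cache,
        # which is where the high memory usage on a very large file comes from.
        return linecache.getline(self.file_path, idx + 1).rstrip("\n")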
Thanks, @BramVanroy. I think it is hard for me to change the
@chiyuzhang94 No, that function is not related to the caching. It is a function that can very quickly read through a file to figure out how many lines it contains. The size is the chunk size in bytes to read sequentially, which is much faster than reading line by line. But again, nothing to do with caching. One option that I can think of is allowing for an argument
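A sketch of the kind of chunked line-counting helper being described; the function name and default chunk size are assumptions, not necessarily the exact code in the PR.

def count_lines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newlines by reading fixed-size byte chunks, which is much faster
    than iterating line by line and does not involve linecache at all."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total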
Thanks @BramVanroy, I tried your suggestion:
But I found that linecache.clearcache() doesn't help, based on the log.
Then the job was killed. I am using distributed training where each node has 4 GPUs. Since each of the 4 Python processes eventually reads the entire file (90 GB) into memory, the dataset would take up over 360 GB per node if they fully loaded it, but each node only has 186 GB of RAM. Do you have any suggestions for limiting the caching size?
Any progress? @GCHQResearcher92457
Hi @BramVanroy @GCHQResearcher92457, I found a point that might be causing the memory issues in the code (https://github.com/GCHQResearcher92457/transformers/blob/lazy-text-dataset-loading-for-lm/examples/run_language_modeling.py).
In the main function, the rank 1-3 processes all stop at the barrier at line 770, while rank 0 progresses and loads the model and vocab; rank 0 then hits line 825 and releases the barrier. Once the barrier is released, ranks 1-3 still execute lines 770-825 (loading the model in the main function). The same holds for lines 832-837 (loading the dataset). I have four GPUs on each node, so ranks 1-3 each load the model and dataset from disk individually instead of using a copy from rank 0, which leads to the OOM issue.
I thought the rank 1-3 processes should not run lines 832-837 again once the barrier is released, but from the logging I added I found that when a process hits a barrier it simply waits at that spot in the code until all other processes have hit a barrier; when it is released, it continues from the point it is at in the code rather than jumping past the latest barrier.
I tried to add an if condition at line 770 so that only rank 0 loads the model, but I got a new error showing that the variables are not synchronized across devices: ranks 1-3 cannot get the variable
Did you notice this issue? Do you have any suggestions?
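For context, the barrier pattern being described is roughly the following sketch; the helper names and the local_rank argument are simplified placeholders, not the script's actual code. The key point is that every rank still executes the loading lines, rank 0 just gets to populate the cache first.

import torch.distributed as dist


def load_everything(local_rank: int):
    # Ranks other than 0 wait here while rank 0 downloads and caches.
    if local_rank not in (-1, 0):
        dist.barrier()

    # Every rank still executes these lines; ranks 1..n read from the cache
    # that rank 0 filled, but each builds its own in-memory copy.
    model, tokenizer = load_model_and_tokenizer()   # placeholder helper
    train_dataset = build_train_dataset(tokenizer)  # placeholder helper

    # Rank 0 releases the others once it is done.
    if local_rank == 0:
        dist.barrier()

    return model, tokenizer, train_dataset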
@chiyuzhang94 I am not sure why the memory is not clearing after using clearcache. It might be that you still have to call the garbage collector after clearing the cache; you can try that. It is true that I had not thought about multi-node support, so you will indeed have a separate in-memory cache for each process. I do not think it is easy to bypass that, unless you restructure all of the code to work with a dedicated reading process, i.e. a separate process that fetches lines from the data file. As has been said before, though, it is now recommended to switch over to https://github.com/huggingface/nlp which allows for on-disk datasets that are fast and have a low memory footprint.
@BramVanroy (For the sake of discussion) Wouldn't it be reasonably easy to enable (non-cached) random access to the text file(s) by storing a list of the positions of the newlines?
Shouldn't be too hard to implement indeed, although my fear is that this might not be fast enough from an IO perspective. That is perhaps the trade-off one would want to make, though, so it might be worth it. You'd still need to make sure that all data is actually used, so in a shuffle setting this might not be straightforward if you want batches of consistent size. Perhaps, depending on the number of lines, you can create a list of indexes that have
I am not sure whether I want to put time into this, though, seeing that
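A sketch of the offset-index idea, assuming a single plain UTF-8 text file; the index stores the byte position of each line start so __getitem__ can seek instead of caching file contents. The class name is a placeholder.

from torch.utils.data import Dataset


class OffsetIndexedTextDataset(Dataset):
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.offsets = []
        position = 0
        with open(file_path, "rb") as handle:
            for line in handle:            # binary mode: len(line) is the byte length
                self.offsets.append(position)
                position += len(line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, idx: int) -> str:
        # Seek directly to the stored byte offset; nothing is kept in memory
        # beyond the (comparatively small) offset list.
        with open(self.file_path, "rb") as handle:
            handle.seek(self.offsets[idx])
            return handle.readline().decode("utf-8").rstrip("\n")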
Hi @BramVanroy, thanks for your suggestion. I looked at the
I am wondering whether this is the optimal way to use
I used that approach in my case to train an LM (RoBERTa-like) from scratch. I didn't modify the dataloader. It works for some iterations, but sooner or later it ends with some kind of CUBLAS error.
I'll start updating the examples to use the datasets library as soon as our new
Your example @chiyuzhang94 is ok, but by doing
You can directly use the dataset in a data loader by using
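For reference, a rough sketch of the datasets-based route being pointed to; the file name, checkpoint, max length, and batch size below are illustrative assumptions, not values from this thread.

from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The text file is memory-mapped on disk rather than read into RAM.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

# Tokenize in batches; fixed-length padding keeps batches stackable.
dataset = dataset.map(
    lambda batch: tokenizer(
        batch["text"], truncation=True, padding="max_length", max_length=512
    ),
    batched=True,
)
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# The formatted dataset can be fed straight to a standard DataLoader.
loader = DataLoader(dataset, batch_size=8)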
No, that is not possible. You cannot expect a company to open-source a great product and at the same time implement features within the day. As said numerous times in this topic, try out the
Hi @thomwolf, I tried to implement this to load my text file. This
But the dataloader cannot yield samples, and the error is:
I noticed the
I don't know how to modify this code to load the text file. Could you please give me any suggestions?
@chiyuzhang94 Can you please ask your question either on the forums or on the respective repository? Your question is not a
Sure. Thanks for your investigation. I posted this question here: huggingface/datasets#610 (comment). @BramVanroy @thomwolf
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
See PR #3388. Master changed substantially, requiring relocation of code into previously untouched files, etc. Instead, here is a new PR using the same code but refactored to fit into the new, more modular structure of the scripts in examples.