EOFError on Gzipped CSV read from S3 #28206

Closed · mccarthyryanc opened this issue Aug 28, 2019 · 4 comments

Comments

@mccarthyryanc

I'm not sure whether this is a pandas issue or an s3fs issue:

import pandas as pd
data = pd.read_csv("s3://bucketname/file.csv.gz")

Gives the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2124, in pandas._libs.parsers.raise_parser_error
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Setup

Installed Pandas and s3fs via pip:

pip install pandas s3fs

Pandas version: 0.25.1
s3fs version: 0.3.3

@WillAyd (Member) commented Aug 28, 2019

Are you getting the same issue if you save the file locally and don't use s3?

@mccarthyryanc (Author)

I get no error when reading the same file locally.
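For reference, a minimal sketch of that local check, using the placeholder bucket/key from the report and a hypothetical local filename:

import pandas as pd
import s3fs

# Download the object to a local file first, then read it with pandas
# without s3fs in the read path.
fs = s3fs.S3FileSystem()
fs.get("s3://bucketname/file.csv.gz", "file.csv.gz")
data = pd.read_csv("file.csv.gz")  # reads fine locally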

I just tested with a few different versions of s3fs and it works in 0.2.2, but then fails with the EOFError in version 0.3.0 and up.
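A quick way to confirm which versions are actually active in the environment (a sketch; per the above, pinning s3fs back to 0.2.2 avoids the error):

import pandas as pd
import s3fs

# Print the installed versions; the EOFError is reported with s3fs >= 0.3.0,
# while 0.2.2 works.
print("pandas:", pd.__version__)
print("s3fs:", s3fs.__version__)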

@WillAyd (Member) commented Aug 28, 2019

Can you decompress the file using s3fs alone (no pandas)?

@mccarthyryanc (Author)

Good question! No, s3fs + gzip gives the same error:

import gzip
import s3fs

# Open the object via s3fs and wrap it in gzip directly, bypassing pandas.
fs = s3fs.S3FileSystem()
s3_fh = fs.open('s3://bucketname/file.csv.gz')
fh = gzip.open(s3_fh)
data = fh.read()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/home/ubuntu/miniconda3/envs/test/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

So it looks like an s3fs issue. I'll close this one. Thanks for the help, @WillAyd!
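For anyone hitting this in the meantime, one possible workaround sketch is to bypass s3fs entirely and fetch the object with boto3, then decompress in memory. boto3 and the in-memory approach are assumptions here, not something tested in this thread, and it requires the file to fit in memory:

import gzip
import io

import boto3
import pandas as pd

# Fetch the gzipped object with boto3 (bypassing s3fs, which appears to be
# the culprit), then decompress and parse in memory. Bucket and key are the
# placeholders from the report.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="bucketname", Key="file.csv.gz")
raw = obj["Body"].read()

data = pd.read_csv(io.BytesIO(gzip.decompress(raw)))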
