BUG: Fix FastParquetImpl.write for non-existent file #28326

Merged
merged 5 commits into from
Sep 19, 2019

Contributor

@bnaul bnaul commented Sep 6, 2019

PyArrowImpl already correctly opens a non-existent file for writing (https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L95). With engine='fastparquet', however, the same call fails for e.g. a GCS URL (S3 already appears to be handled correctly):

[nav] In [1]: pd.DataFrame().to_parquet('gs://city_data/test/blah.parquet')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-dde78378baaa> in <module>
----> 1 pd.DataFrame().to_parquet('gs://city_data/test/blah.parquet')

~/venvs/model/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2215             index=index,
   2216             partition_cols=partition_cols,
-> 2217             **kwargs
   2218         )
   2219

~/venvs/model/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250         index=index,
    251         partition_cols=partition_cols,
--> 252         **kwargs
    253     )
    254

~/venvs/model/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, partition_cols, **kwargs)
    171             kwargs["open_with"] = lambda path, _: path
    172         else:
--> 173             path, _, _, _ = get_filepath_or_buffer(path)
    174
    175         with catch_warnings(record=True):

~/venvs/model/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    212
    213         return gcs.get_filepath_or_buffer(
--> 214             filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
    215         )
    216

~/venvs/model/lib/python3.7/site-packages/pandas/io/gcs.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
     15
     16     fs = gcsfs.GCSFileSystem()
---> 17     filepath_or_buffer = fs.open(filepath_or_buffer, mode)
     18     return filepath_or_buffer, None, compression, True

<decorator-gen-147> in open(self, path, mode, block_size, acl, consistency, metadata)

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     51         logger.log(logging.DEBUG - 1, tb_io.getvalue())
     52
---> 53     return f(self, *args, **kwargs)
     54
     55

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in open(self, path, mode, block_size, acl, consistency, metadata)
   1148         if 'b' in mode:
   1149             return GCSFile(self, path, mode, block_size, consistency=const,
-> 1150                            metadata=metadata)
   1151         else:
   1152             mode = mode.replace('t', '') + 'b'

<decorator-gen-150> in __init__(self, gcsfs, path, mode, block_size, acl, consistency, metadata)

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     51         logger.log(logging.DEBUG - 1, tb_io.getvalue())
     52
---> 53     return f(self, *args, **kwargs)
     54
     55

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in __init__(self, gcsfs, path, mode, block_size, acl, consistency, metadata)
   1276             raise NotImplementedError('File mode not supported')
   1277         if mode == 'rb':
-> 1278             self.details = gcsfs.info(path)
   1279             self.size = self.details['size']
   1280         else:

<decorator-gen-136> in info(self, path)

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     51         logger.log(logging.DEBUG - 1, tb_io.getvalue())
     52
---> 53     return f(self, *args, **kwargs)
     54
     55

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in info(self, path)
    863
    864         try:
--> 865             return self._get_object(path)
    866         except FileNotFoundError:
    867             logger.debug("info FileNotFound at path: %s", path)

<decorator-gen-122> in _get_object(self, path)

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     51         logger.log(logging.DEBUG - 1, tb_io.getvalue())
     52
---> 53     return f(self, *args, **kwargs)
     54
     55

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _get_object(self, path)
    539             raise FileNotFoundError(path)
    540
--> 541         result = self._process_object(bucket, self._call('GET', 'b/{}/o/{}', bucket, key).json())
    542
    543         return result

<decorator-gen-121> in _call(self, method, path, *args, **kwargs)

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _tracemethod(f, self, *args, **kwargs)
     51         logger.log(logging.DEBUG - 1, tb_io.getvalue())
     52
---> 53     return f(self, *args, **kwargs)
     54
     55

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in _call(self, method, path, *args, **kwargs)
    482                 r = self.session.request(method, path,
    483                                          params=kwargs, json=json, headers=headers, data=data, timeout=self.requests_timeout)
--> 484                 validate_response(r, path)
    485                 break
    486             except (HttpError, RequestException, RateLimitException, GoogleAuthError) as e:

~/venvs/model/lib/python3.7/site-packages/gcsfs/core.py in validate_response(r, path)
    156
    157         if r.status_code == 404:
--> 158             raise FileNotFoundError(path)
    159         elif r.status_code == 403:
    160             raise IOError("Forbidden: %s\n%s" % (path, msg))

FileNotFoundError: https://www.googleapis.com/storage/v1/b/city_data/o/test%2Fblah.parquet
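
The traceback above goes through the `mode == 'rb'` branch of `GCSFile.__init__`, which calls `gcsfs.info(path)` and therefore requires the object to exist. The following is a minimal sketch of that behaviour using a toy in-memory filesystem. `ToyRemoteFS` and `try_open` are hypothetical names invented for illustration; this is not the gcsfs API, only an approximation of the mode check visible in the traceback.

```python
import io


class ToyRemoteFS:
    """Hypothetical in-memory filesystem; a stand-in, not the gcsfs API."""

    def __init__(self):
        # Maps object path -> stored bytes.
        self.objects = {}

    def open(self, path, mode="rb"):
        if mode == "rb":
            # Reading needs the object's metadata, so a missing path fails.
            if path not in self.objects:
                raise FileNotFoundError(path)
            return io.BytesIO(self.objects[path])
        if mode == "wb":
            # Writing creates a new object, so no existence check is made.
            # (This toy does not persist what is written to the buffer.)
            return io.BytesIO()
        raise NotImplementedError("File mode not supported")


def try_open(fs, path, mode):
    """Return 'opened' or 'missing' instead of raising, for easy comparison."""
    try:
        fs.open(path, mode)
        return "opened"
    except FileNotFoundError:
        return "missing"
```

With an empty `ToyRemoteFS`, opening a new path in the default read mode reports it as missing, while opening the same path with `"wb"` succeeds, which is the difference the one-line change in this PR relies on.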

@jbrockmendel
Member

Can you add a test for the bug this fixes?

@bnaul
Contributor Author

bnaul commented Sep 7, 2019

It seems like the only real way to test this is a GCS-specific test like the s3 roundtrip test here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_parquet.py#L525-L527

I think everything else besides the GCS branch of the code ends up ignoring the mode arg, so the local file version is (accidentally?) correct.

Any other ideas or suggestions on how to set up a test bucket...?

@jbrockmendel
Member

I take it nothing in test_gcs is helpful for this? Can you mock the relevant server-side behavior?

@@ -170,7 +170,7 @@ def write(
             # And pass the opened s3file to the fastparquet internal impl.
             kwargs["open_with"] = lambda path, _: path
         else:
-            path, _, _, _ = get_filepath_or_buffer(path)
+            path, _, _, _ = get_filepath_or_buffer(path, mode="wb")
Contributor

Can you change the is_s3_url on L165 to is_s3_url(path) or is_gcs_url(path)?

Contributor Author

sure, that works too
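
The suggestion above amounts to routing both S3 and GCS URLs through the same branch. Here is a rough sketch of what such a scheme check could look like. The helper names (`looks_like_s3`, `looks_like_gcs`, `needs_remote_open`) and the scheme sets are assumptions made for illustration; they are not pandas' `is_s3_url` / `is_gcs_url` implementations.

```python
from urllib.parse import urlparse

# Assumed scheme lists, for illustration only; consult the library source
# for the schemes it actually recognises.
S3_SCHEMES = {"s3", "s3n", "s3a"}
GCS_SCHEMES = {"gs", "gcs"}


def looks_like_s3(path):
    """True if `path` is a string whose URL scheme is an S3-style scheme."""
    return isinstance(path, str) and urlparse(path).scheme in S3_SCHEMES


def looks_like_gcs(path):
    """True if `path` is a string whose URL scheme is a GCS-style scheme."""
    return isinstance(path, str) and urlparse(path).scheme in GCS_SCHEMES


def needs_remote_open(path):
    """Combined check, mirroring `is_s3_url(path) or is_gcs_url(path)`."""
    return looks_like_s3(path) or looks_like_gcs(path)
```

Under these assumptions, `needs_remote_open('gs://city_data/test/blah.parquet')` is true, while a plain local path such as `'/tmp/blah.parquet'` has an empty scheme and falls through to the `else` branch.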

@bnaul bnaul force-pushed the patch-2 branch 2 times, most recently from 56999c9 to ab4ce24 September 11, 2019 13:46
@bnaul
Contributor Author

bnaul commented Sep 11, 2019

@jbrockmendel I took a stab at a test. It doesn't check much, but it does fail on master and pass here, so 🤷‍♂😄
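
A test of this kind can be sketched without a real bucket by mocking the filesystem so that it rejects anything other than write mode. The example below is only an approximation under stated assumptions: `remote`, `open_for_write`, `make_mock_fs`, and `run_regression_check` are hypothetical names, and the code does not reproduce the test added in this PR or call pandas or gcsfs.

```python
import os
import tempfile
import types
from unittest import mock

# Hypothetical stand-in for a remote-filesystem module such as gcsfs.
remote = types.SimpleNamespace(FileSystem=None)


def open_for_write(path, mode="rb"):
    """Toy helper under test: opens `path` through remote.FileSystem."""
    return remote.FileSystem().open(path, mode)


def make_mock_fs(directory):
    """Build a mock filesystem class that stores its file under `directory`."""

    class MockFileSystem:
        def open(self, path, mode="rb"):
            # Mimic the server: a not-yet-existent object cannot be read.
            if "w" not in mode:
                raise FileNotFoundError(path)
            return open(os.path.join(directory, "out.bin"), mode)

    return MockFileSystem


def run_regression_check():
    """Return (old_call_failed, new_call_wrote_file)."""
    with tempfile.TemporaryDirectory() as tmp:
        with mock.patch.object(remote, "FileSystem", make_mock_fs(tmp)):
            # Old behaviour: the default read mode fails for a new file.
            try:
                open_for_write("gs://bucket/new.bin")
                old_failed = False
            except FileNotFoundError:
                old_failed = True
            # Fixed behaviour: an explicit write mode succeeds.
            with open_for_write("gs://bucket/new.bin", mode="wb") as fh:
                fh.write(b"data")
            written = os.path.exists(os.path.join(tmp, "out.bin"))
    return old_failed, written
```

The check is deliberately shallow: it only confirms that the default-mode call raises and the write-mode call produces a file, which matches the "fails on master, passes here" property described above.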

@TomAugspurger TomAugspurger added the IO Data (IO issues that don't fit into a more specific label) and IO Parquet (parquet, feather) labels Sep 11, 2019
@TomAugspurger TomAugspurger added this to the 1.0 milestone Sep 11, 2019
@TomAugspurger
Contributor

Oh, whoops, can you add a release note for 1.0?

Also, you could update the comments under the is_s3_url check. Those aren't specific to s3 anymore.

@bnaul
Contributor Author

bnaul commented Sep 17, 2019

> Oh, whoops, can you add a release note for 1.0?
>
> Also, you could update the comments under the is_s3_url check. Those aren't specific to s3 anymore.

@TomAugspurger done!

@TomAugspurger
Contributor

TomAugspurger commented Sep 18, 2019

Linting issue with that last commit :( Can you run black and repush?

Looks like a merge conflict also.

@bnaul
Contributor Author

bnaul commented Sep 18, 2019

oops, still not used to black :) done

@jorisvandenbossche jorisvandenbossche changed the title from "Fix FastParquetImpl.write for non-existent file" to "BUG: Fix FastParquetImpl.write for non-existent file" Sep 19, 2019
@jorisvandenbossche
Member

@bnaul you have a small linting error:

Check import format using isort
ERROR: /home/vsts/work/1/s/pandas/tests/io/test_gcs.py Imports are incorrectly sorted.
Check import format using isort DONE
##[error]Bash exited with code '1'.
##[section]Finishing: Linting

@jorisvandenbossche
Member

BTW, for black and isort, I really started to appreciate pre-commit hooks to avoid getting those failures after pushing to github. See https://dev.pandas.io/development/contributing.html#python-pep8-black

@TomAugspurger TomAugspurger merged commit fa1364d into pandas-dev:master Sep 19, 2019
@TomAugspurger
Contributor

Thanks!

@bnaul bnaul deleted the patch-2 branch October 9, 2019 21:24
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
* Fix `FastParquetImpl.write` for non-existent file