forked from apache/spark
Infer 'mixed' types as strings when using Arrow serialization #702
Merged
robert3005
approved these changes
Jul 22, 2020
rshkv
added a commit
that referenced
this pull request
Aug 19, 2020
* Add failing test
* More generously infer columns as strings if Arrow thought they are binary
* Add failing test for true binary values
* Infer 'mixed' columns by checking the first value
* Update FORK.md
rshkv
added a commit
that referenced
this pull request
Feb 26, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow can't tell if string columns should be serialized as string type or as binary (due to how Python 2 stores strings). The result is that Arrow serializes string columns in Pandas dataframes to binary ones. We can remove this when we discontinue support for Python 2. See original PR [1] and follow-up for 'mixed' type columns [2]. [1] #679 [2] #702
rshkv added further commits that referenced this pull request on Mar 2, Mar 4, and Mar 9, 2021, and jdcasale pushed one on Mar 3, 2021, each with the same commit message as above.
Background
Follow-up for #679. When serializing Pandas dataframes, Arrow treats Python 2 string columns as "binary". When we encounter such a column, we try to infer whether it is actually a "string" column (in a user-friendly sense). If it is, we override Arrow's assessment and use "string" as the type in the Spark schema.
Problem
When inferring whether "binary" columns are in fact "string", we rely on asking Pandas what it thinks the column is. Pandas is friendlier to Python 2 and reports Python 2 string columns as "string".
But we encounter an edge case when a column contains values of mixed type, even if all types could be treated as strings. E.g., if a column contains both unicode and binary values, Pandas infers the column as "mixed". We currently don't infer a column as string if Pandas deemed it of "mixed" type.
You can also see this behaviour in the failing test that I added (CircleCI).
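The "mixed" inference can be reproduced with a small sketch. It assumes Python 3 and uses pandas.api.types.infer_dtype; whether the patched code calls exactly this function is an assumption. The str/bytes pair here stands in for the Python 2 unicode/str pair described above.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Columns holding a single kind of value get a specific inferred type.
text_only = pd.Series(["a", "b"])
bytes_only = pd.Series([b"a", b"b"])

# A column holding both text and bytes values (the Python 3 analogue of
# mixing unicode and str under Python 2) is reported as "mixed".
text_and_bytes = pd.Series(["a", b"b"])

inferred_text = infer_dtype(text_only)
inferred_bytes = infer_dtype(bytes_only)
inferred_mixed = infer_dtype(text_and_bytes)

print(inferred_text, inferred_bytes, inferred_mixed)
```

Under these assumptions, a check that only accepts the "string" result skips the third column even though its first value is text.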
What changes were proposed in this pull request?
We are now more generous when inferring whether columns should be "string" type. When Pandas infers a column as "mixed" type, we do an additional check: we read the first value in the column. If that value is a string, we treat the whole column as a string column.
For this check, we're using the six package, which will correctly determine that e.g. a unicode value is of string type, but not e.g. a bytearray.
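The first-value check described above might look roughly like the sketch below. The function name treat_as_string and its signature are hypothetical, not taken from the patch. To stay dependency-free, the sketch hard-codes (str,), which is what six.string_types evaluates to under Python 3; under Python 2 it is (basestring,), which covers both str and unicode.

```python
# Stand-in for six.string_types under Python 3 (assumption made to avoid
# the dependency; the patch itself uses the six package).
STRING_TYPES = (str,)


def treat_as_string(inferred_type, values):
    """Hypothetical helper: decide whether a column that Arrow typed as
    binary should be typed as string in the Spark schema."""
    if inferred_type == "string":
        return True
    if inferred_type == "mixed" and len(values) > 0:
        # Only the first value is inspected, as described in the PR text.
        return isinstance(values[0], STRING_TYPES)
    return False


mixed_starting_with_text = treat_as_string("mixed", ["a", b"b"])
mixed_starting_with_bytearray = treat_as_string("mixed", [bytearray(b"a"), "b"])
plain_bytes = treat_as_string("bytes", [b"a", b"b"])
```

Because only the first value is examined, the outcome for a mixed column depends on which value happens to come first; that is the trade-off of this cheap check.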
Alternative
We could simplify this by using Pandas' is_string_dtype function, which takes a column and infers whether it is of "string" type. But the function is very lenient (pandas-dev/pandas#15585). E.g., it will consider a column of bytearrays a string (which would be an example of a "true binary" column even under Python 2). It will also consider a column of float tuples a string (effectively anything that is an array of something).
How was this patch tested?
Existing tests and newly added tests. The commit history shows that the tests were failing before the changes were made.