Skip to content

Infer 'mixed' types as strings when using Arrow serialization #702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 22, 2020

Conversation

rshkv
Copy link

@rshkv rshkv commented Jul 21, 2020

Background

Follow-up for #679. When serializing Pandas dataframes, Arrow assumes Python 2 string columns as "binary". When we encounter such a column, we try to infer if it's actually a "string" column (in a user-friendly sense). If it is, we override Arrow's assessment and use "string" as type in the Spark schema.

Problem

When inferring whether "binary" columns are in fact "string", we rely on asking Pandas what it thinks the column is. Pandas is more friendly of Python 2, and thinks that Python 2 string columns are actually "string".

But we encounter an edge case when a column contains values of mixed type, even if all types could be treated as strings. E.g., if a column contains both unicode and binary values, Pandas infers the column as "mixed". We currently don't infer a column as string if Pandas deemed it of "mixed" type.

In [1]: pandas_df = pd.DataFrame(data={"mixed_strings": [u"unicode", b"binary"]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- mixed_strings: binary (nullable = true) # oops

You can also see this behaviour in the failing test that I added (CircleCI).

What changes were proposed in this pull request?

We are now more generous when inferring whether columns should be "string" type. When Pandas infers a column as "mixed" type, we do an additional check: We read the first value in the column. If that value is a string, we consider the whole column a string.

For this check, we're using the six package, which will correctly determine that e.g. a unicode value is of string type, but not e.g. a bytearray.

Alternative

We could simplify this by using Pandas' is_string_dtype function which takes a column and infers if it's of "string" type. But the function is very lenient (pandas-dev/pandas#15585). E.g. it will consider a column bytearrays a string (which would be an example of a "true binary" column even under Python 2). It will also consider a column of float tuplies a string (effectively anything that is an array of something).

How was this patch tested?

Existing test and new added tests. Commit history shows that tests were failing before changes were made.

@rshkv rshkv force-pushed the wr/arrow-infer-mixed-types branch from 4574084 to 11f44c8 Compare July 21, 2020 18:39
@rshkv rshkv force-pushed the wr/arrow-infer-mixed-types branch from d5f4cd5 to 46b7fd9 Compare July 21, 2020 18:45
@rshkv
Copy link
Author

rshkv commented Jul 21, 2020

cc @helenyugithub

@rshkv rshkv requested a review from robert3005 July 21, 2020 22:48
@bulldozer-bot bulldozer-bot bot merged commit 5873252 into master Jul 22, 2020
@bulldozer-bot bulldozer-bot bot deleted the wr/arrow-infer-mixed-types branch July 22, 2020 12:47
rshkv added a commit that referenced this pull request Aug 19, 2020
* Add failing test

* More generously infer columns as strings if Arrow thought they are binary

* Add failing test for true binary values

* Infer 'mixed' columns by checking the first value

* Update FORK.md
rshkv added a commit that referenced this pull request Feb 26, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 2, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
jdcasale pushed a commit that referenced this pull request Mar 3, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 4, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
rshkv added a commit that referenced this pull request Mar 9, 2021
When serializing a Pandas dataframe using Arrow under Python 2, Arrow
can't tell if string columns should be serialized as string type or as
binary (due to how Python 2 stores strings). The result is that Arrow
serializes string columns in Pandas dataframes to binary ones.

We can remove this when we discontinue support for Python 2.

See original PR [1] and follow-up for 'mixed' type columns [2].

[1] #679
[2] #702
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants