Infer 'mixed' types as strings when using Arrow serialization #702

rshkv · 2020-07-21T18:23:59Z

Background

Follow-up for #679. When serializing Pandas dataframes, Arrow assumes Python 2 string columns as "binary". When we encounter such a column, we try to infer if it's actually a "string" column (in a user-friendly sense). If it is, we override Arrow's assessment and use "string" as type in the Spark schema.

Problem

When inferring whether "binary" columns are in fact "string", we rely on asking Pandas what it thinks the column is. Pandas is more friendly of Python 2, and thinks that Python 2 string columns are actually "string".

But we encounter an edge case when a column contains values of mixed type, even if all types could be treated as strings. E.g., if a column contains both unicode and binary values, Pandas infers the column as "mixed". We currently don't infer a column as string if Pandas deemed it of "mixed" type.

In [1]: pandas_df = pd.DataFrame(data={"mixed_strings": [u"unicode", b"binary"]})

In [2]: spark.createDataFrame(pandas_df).printSchema()
root
 |-- mixed_strings: binary (nullable = true) # oops

You can also see this behaviour in the failing test that I added (CircleCI).

What changes were proposed in this pull request?

We are now more generous when inferring whether columns should be "string" type. When Pandas infers a column as "mixed" type, we do an additional check: We read the first value in the column. If that value is a string, we consider the whole column a string.

For this check, we're using the six package, which will correctly determine that e.g. a unicode value is of string type, but not e.g. a bytearray.

Alternative

We could simplify this by using Pandas' is_string_dtype function which takes a column and infers if it's of "string" type. But the function is very lenient (pandas-dev/pandas#15585). E.g. it will consider a column bytearrays a string (which would be an example of a "true binary" column even under Python 2). It will also consider a column of float tuplies a string (effectively anything that is an array of something).

How was this patch tested?

Existing test and new added tests. Commit history shows that tests were failing before changes were made.

…nary

rshkv · 2020-07-21T22:43:33Z

cc @helenyugithub

* Add failing test * More generously infer columns as strings if Arrow thought they are binary * Add failing test for true binary values * Infer 'mixed' columns by checking the first value * Update FORK.md

When serializing a Pandas dataframe using Arrow under Python 2, Arrow can't tell if string columns should be serialized as string type or as binary (due to how Python 2 stores strings). The result is that Arrow serializes string columns in Pandas dataframes to binary ones. We can remove this when we discontinue support for Python 2. See original PR [1] and follow-up for 'mixed' type columns [2]. [1] #679 [2] #702

rshkv force-pushed the wr/arrow-infer-mixed-types branch from 4574084 to 11f44c8 Compare July 21, 2020 18:39

rshkv added 2 commits July 21, 2020 20:45

Add failing test

46b7fd9

More generously infer columns as strings if Arrow thought they are bi…

f60098e

…nary

rshkv force-pushed the wr/arrow-infer-mixed-types branch from d5f4cd5 to 46b7fd9 Compare July 21, 2020 18:45

rshkv added 2 commits July 21, 2020 21:38

Add failing test for true binary values

e7463da

Infer 'mixed' columns by checking the first value

5ee0d51

rshkv mentioned this pull request Jul 21, 2020

Revert binary/string column logic once we move off Python 2 #678

Open

Update FORK.md

c598f26

rshkv requested a review from robert3005 July 21, 2020 22:48

robert3005 approved these changes Jul 22, 2020

View reviewed changes

rshkv added the merge when ready label Jul 22, 2020

bulldozer-bot bot merged commit 5873252 into master Jul 22, 2020

bulldozer-bot bot deleted the wr/arrow-infer-mixed-types branch July 22, 2020 12:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Infer 'mixed' types as strings when using Arrow serialization #702

Infer 'mixed' types as strings when using Arrow serialization #702

Uh oh!

rshkv commented Jul 21, 2020 •

edited

Loading

Uh oh!

rshkv commented Jul 21, 2020

Uh oh!

Uh oh!

Infer 'mixed' types as strings when using Arrow serialization #702

Infer 'mixed' types as strings when using Arrow serialization #702

Uh oh!

Conversation

rshkv commented Jul 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Problem

What changes were proposed in this pull request?

Alternative

How was this patch tested?

Uh oh!

rshkv commented Jul 21, 2020

Uh oh!

Uh oh!

rshkv commented Jul 21, 2020 •

edited

Loading