factorize fails for list of tuples #9454

Closed
eyurtsev opened this issue Feb 9, 2015 · 3 comments · Fixed by #18649
Labels: Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug, good first issue, Testing (pandas testing functions or related to the test suite)

eyurtsev commented Feb 9, 2015

It's not clear from the documentation for factorize what datatype is expected for the values. But I assume that any list of hashables should work (specifically, a list of tuples).

factorize does work for a list of tuples as long as the tuples do not all have the same length, but it fails as soon as every tuple has the same length. (It looks like some inference about the structure of the values is happening that shouldn't be.)

import pandas as pd
pd.factorize([(1, 1), (1, 2), (0, 0), (1, 2), 'nonsense']) # This works

(array([0, 1, 2, 1, 3]), array([(1, 1), (1, 2), (0, 0), 'nonsense'], dtype=object))

pd.factorize([(1, 1), (1, 2), (0, 0), (1, 2), (1, 2, 3)]) # This also works.

pd.factorize([(1, 1), (1, 2), (0, 0), (1, 2)]) # <-- fails
ValueError                                Traceback (most recent call last)
<ipython-input-22-3ca8ec02e16c> in <module>()
      1 print pd.factorize([(1, 1), (1, 2), (0, 0), (1, 2), 'nonsense'])
----> 2 print pd.factorize([(1, 1), (1, 2), (0, 0), (1, 2)])

/usr/local/lib/python2.7/dist-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel)
    132     table = hash_klass(len(vals))
    133     uniques = vec_klass()
--> 134     labels = table.get_labels(vals, uniques, 0, na_sentinel)
    135 
    136     labels = com._ensure_platform_int(labels)

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.Int64HashTable.get_labels (pandas/hashtable.c:8575)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pandas 0.15.2

shoyer (Member) commented Feb 9, 2015

The problem here is that np.asarray([(1, 1), (1, 2), (0, 0), (1, 2)]) creates a 2D array, not a 1D array of tuples. We were actually just discussing this on the numpy mailing list today: http://mail.scipy.org/pipermail/numpy-discussion/2015-February/072240.html
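The shape inference described above can be checked directly with NumPy. This is a minimal sketch using the failing input from the report; it only shows what `np.asarray` returns and does not call pandas.

```python
import numpy as np

# The failing input from the report: every tuple has length 2.
values = [(1, 1), (1, 2), (0, 0), (1, 2)]

# NumPy treats the equal-length tuples as rows of a 2-D integer array.
arr = np.asarray(values)
print(arr.shape)  # (4, 2)

# Requesting dtype=object does not help: the result is still 2-D.
obj = np.asarray(values, dtype=object)
print(obj.shape)  # (4, 2)
```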

So to work around this, you could supply an ndarray of tuples directly. Then things should still work.
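A sketch of that workaround, assuming a pandas version in which `pd.factorize` accepts a 1-D object ndarray: allocate an empty object array and fill it element by element, so that NumPy never gets the chance to expand the tuples into a second dimension.

```python
import numpy as np
import pandas as pd

values = [(1, 1), (1, 2), (0, 0), (1, 2)]

# Pre-allocate a 1-D object array and store each tuple as a single element.
arr = np.empty(len(values), dtype=object)
for i, item in enumerate(values):
    arr[i] = item

codes, uniques = pd.factorize(arr)
print(codes)    # integer codes, one per input tuple
print(uniques)  # 1-D object array holding the distinct tuples
```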

If you want to fix this in pandas (and yes, PRs are always greatly appreciated!), we could cast the input to factorize with the internal routine _asarray_tuplesafe from pandas.core.common instead of np.asarray.
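To illustrate the idea of a tuple-safe cast, here is a hedged sketch. The helper name `asarray_tuplesafe_sketch` is hypothetical and this is not the pandas implementation of `_asarray_tuplesafe`; it only shows one way to keep tuples as single elements while leaving other inputs to `np.asarray`.

```python
import numpy as np


def asarray_tuplesafe_sketch(values):
    """Hypothetical helper: return a 1-D array, keeping tuples as elements."""
    values = list(values)
    if any(isinstance(v, tuple) for v in values):
        # Fill an object array by hand so tuples are not unpacked into columns.
        out = np.empty(len(values), dtype=object)
        for i, v in enumerate(values):
            out[i] = v
        return out
    # No tuples present: the ordinary conversion is fine.
    return np.asarray(values)


same_len = asarray_tuplesafe_sketch([(1, 1), (1, 2), (0, 0), (1, 2)])
mixed = asarray_tuplesafe_sketch([(1, 1), (1, 2), "nonsense"])
plain = asarray_tuplesafe_sketch([3, 1, 2])
print(same_len.shape, mixed.shape, plain.shape)
```

The explicit loop is slower than a vectorized conversion, but it avoids relying on NumPy's dimension discovery for nested sequences.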

shoyer added the Bug and Algos labels Feb 10, 2015

jreback (Contributor) commented Feb 10, 2015

I agree with @shoyer here; I think this should be changed to _asarray_tuplesafe. It's possible that np.asarray was used prior to several implementation changes.

jreback added this to the 0.17.0 milestone Feb 10, 2015

TomAugspurger (Contributor) commented Jul 8, 2017

@dsm I see you have a commit referencing this, but I'm not sure it made it in. Is that correct?

To be clear, this example works now. Just need to ensure we have a test.
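A regression check along these lines might look like the sketch below. It is not the test that was eventually merged. The function names are hypothetical, and the inputs are wrapped in 1-D object arrays first, on the assumption that the installed pandas may not accept plain lists in `pd.factorize`.

```python
import numpy as np
import pandas as pd


def as_object_array(values):
    # Hypothetical helper: 1-D object array with one element per input item.
    out = np.empty(len(values), dtype=object)
    for i, v in enumerate(values):
        out[i] = v
    return out


def check_factorize_tuples():
    # The three inputs from the original report, with the codes and uniques
    # that factorize is expected to produce for each.
    cases = [
        ([(1, 1), (1, 2), (0, 0), (1, 2)],
         [0, 1, 2, 1], [(1, 1), (1, 2), (0, 0)]),
        ([(1, 1), (1, 2), (0, 0), (1, 2), (1, 2, 3)],
         [0, 1, 2, 1, 3], [(1, 1), (1, 2), (0, 0), (1, 2, 3)]),
        ([(1, 1), (1, 2), (0, 0), (1, 2), "nonsense"],
         [0, 1, 2, 1, 3], [(1, 1), (1, 2), (0, 0), "nonsense"]),
    ]
    for data, expected_codes, expected_uniques in cases:
        codes, uniques = pd.factorize(as_object_array(data))
        assert codes.tolist() == expected_codes
        assert list(uniques) == expected_uniques
    return len(cases)


print(check_factorize_tuples(), "cases passed")
```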

TomAugspurger added the Testing and Difficulty Novice labels Jul 8, 2017
jreback modified the milestones: Next Major Release, 0.22.0 Dec 9, 2017
4 participants