PERFORMANCE: Very slow vectorized string functions #2802

Closed
darindillon opened this issue Feb 5, 2013 · 10 comments
Labels: Duplicate Report, Performance, Strings

@darindillon

Using pandas 0.10.
The docs say pandas has vectorized string functions, but for me they run far slower than plain Python (about 6x slower). Here are two nearly identical versions of the code with wildly different timings. Why is the vectorized version so much slower?
I thought the difference might come from the vectorized version creating an extra temporary Series object that gets thrown away. I tried to control for that by creating an extra Series object in the fast version, and it made no difference. So why is the vectorized version slow?

import pandas
strings = [ "hello world" for ii in xrange(10000) ]
p = pandas.DataFrame({ "strings" : strings })

#Use vectorized pandas. 
#Takes ~2 min on my computer
for ii in xrange(1000) :
    p['isHello'] = p['strings'].str.endswith('world')

#Use non-vectorized list comprehension. 
#Takes ~20 seconds on my computer.
for ii in xrange(1000) :
    p['isHello'] = [s.endswith('world') for s in p['strings'].values]

#Non-vectorized version, but also create a temp Series object
#Still takes only ~20 seconds
for ii in xrange(1000) :
    p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
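
For readers on Python 3, where `xrange` no longer exists, the comparison above can be approximated as follows. This is a sketch, not the original benchmark: the `timeit` harness, the function names `vectorized` and `list_comp`, and the repeat count of 20 are assumptions added for illustration, and timings will vary with machine and pandas version.

```python
import timeit

import pandas as pd

# Same data as the original post: 10,000 copies of one string.
strings = ["hello world"] * 10_000
p = pd.DataFrame({"strings": strings})


def vectorized():
    """Vectorized pandas string method."""
    return p["strings"].str.endswith("world")


def list_comp():
    """Plain Python list comprehension over the underlying values."""
    return [s.endswith("world") for s in p["strings"].values]


# Repeat count (20) is an arbitrary choice for this sketch.
t_vec = timeit.timeit(vectorized, number=20)
t_list = timeit.timeit(list_comp, number=20)
print(f"vectorized: {t_vec:.3f}s  list comprehension: {t_list:.3f}s")
```

Both functions should produce the same flags; only the elapsed time is expected to differ.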

@wesm
Member

wesm commented Feb 5, 2013

Have you tried pandas 0.10.1?

@darindillon
Author

Just upgraded to 0.10.1. It is faster (that first loop drops from ~2 minutes to ~1 minute). However, that's still 3x slower than the list comprehension way. Why the difference?

@wesm
Member

wesm commented Feb 5, 2013

NA handling adds some overhead. I will have to investigate.
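
To illustrate the kind of per-element work NA handling implies, here is a rough pure-Python sketch. It is not pandas' actual implementation: `is_na`, `na_aware_endswith`, and `plain_endswith` are hypothetical helpers, and treating only `None` and float NaN as missing is an assumption made for this example.

```python
import math


def is_na(value):
    """Treat None and float NaN as missing (assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_aware_endswith(values, suffix):
    """Apply str.endswith elementwise, passing missing values through unchanged."""
    return [v if is_na(v) else v.endswith(suffix) for v in values]


def plain_endswith(values, suffix):
    """The list comprehension from the original post: no missing-value check."""
    return [v.endswith(suffix) for v in values]


data = ["hello world", None, "goodbye"]
print(na_aware_endswith(data, "world"))  # [True, None, False]
```

The extra `is_na` test runs once per element, which is work the plain list comprehension skips. The trade-off is that `plain_endswith(data, "world")` raises `AttributeError` on the `None` entry.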

@ghost

ghost commented Mar 18, 2013

With master:

In [11]: def f(p):
    ...:     for ii in xrange(1000) :
    ...:         p['isHello'] = p['strings'].str.endswith('world')

In [12]: %timeit f(p)
1 loops, best of 3: 3.5 s per loop

In [13]: def f(p):
    ...:     for ii in xrange(1000) :
    ...:         p['isHello'] = [s.endswith('world') for s in p['strings'].values]

In [14]: %timeit f(p)
1 loops, best of 3: 2.84 s per loop

Not as bad as described (fixed?). Pushing back to 0.12; perhaps even close the issue.

@jreback
Contributor

jreback commented Mar 18, 2013

I did a bit of investigation on this. I pushed everything into Cython and eliminated the convert_objects call (since you know the result is bool, you can create it directly). With that I get about 1.5s on the benchmark (compared to about 3.5s on my machine).

I think you would need to cast the strings from PyObjects to char* and then implement the functions (e.g. endswith) directly, so it's probably a lot of work. I can put up what I did if anyone is interested.

4c2f998

@wesm
Member

wesm commented Mar 19, 2013

FWIW, at some point I'd like to add memoization (using klib under the hood) to all these functions, which would accelerate things in a lot of cases.
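
The memoization idea can be sketched in plain Python: compute the result once per distinct value and reuse it for repeats. This is only an illustration under stated assumptions. A `dict` stands in for the hash table, and `memoized_map`, `endswith_world`, and the `calls` counter are hypothetical names, not part of pandas.

```python
def memoized_map(func, values):
    """Apply func to each value, calling it only once per distinct value."""
    cache = {}
    result = []
    for value in values:
        if value not in cache:
            cache[value] = func(value)
        result.append(cache[value])
    return result


calls = []  # records every real invocation, to show how many were needed


def endswith_world(s):
    calls.append(s)
    return s.endswith("world")


strings = ["hello world"] * 10000
flags = memoized_map(endswith_world, strings)
print(len(flags), len(calls))  # 10000 1
```

On data like the benchmark in this thread, where every string is identical, the underlying function runs once. On data with mostly unique strings the cache lookups would add cost rather than save it.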

@cpcloud
Member

cpcloud commented May 29, 2013

I get the same results as @y-p. Definitely not as bad as the OP says. Close?

@jreback
Contributor

jreback commented May 29, 2013

Yep... I cythonized it to test this out. It does give a modest boost; the real boost would come from actually using the C functions directly in a specialized loop, which could be done (as most of the functions are canned).

@cpcloud
Member

cpcloud commented May 29, 2013

FYI, endswith probably won't get a big boost, since it must accept tuples as well as strs (I guess this could be dispatched as well if wanted).
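
For context, Python's built-in `str.endswith` accepts either a single string or a tuple of strings. A minimal sketch of the dispatch mentioned above might look like this. `endswith_dispatch` is a hypothetical helper, and both branches use plain Python here, so it only shows where a specialized single-suffix loop could be slotted in.

```python
def endswith_dispatch(values, pat):
    """Route single-suffix and tuple-of-suffixes patterns to separate paths."""
    if isinstance(pat, str):
        # Single suffix: the case a specialized loop could target.
        return [v.endswith(pat) for v in values]
    if isinstance(pat, tuple):
        # Tuple of suffixes: fall back to the general built-in behavior.
        return [v.endswith(pat) for v in values]
    raise TypeError("pat must be a str or a tuple of str")


words = ["hello world", "hello there"]
print(endswith_dispatch(words, "world"))             # [True, False]
print(endswith_dispatch(words, ("world", "there")))  # [True, True]
```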

@jreback
Contributor

jreback commented Sep 22, 2013

closing in favor of #4694

@jreback closed this as completed Sep 22, 2013