PERFORMANCE: Very slow vectorized string functions #2802
Using pandas 0.10.

The docs say pandas has vectorized string functions, but they're running far slower for me than plain Python (about 6x slower). Here are two nearly identical versions of the code with wildly different performance numbers. Why is the vectorized version so much slower?

I thought the difference might be because the vectorized version has to create an extra temporary Series object that gets thrown away. But then I tried to control for that by creating an extra Series object in my fast version, and that didn't change anything. So why is the vectorized version slow?
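The two snippets themselves were lost in the page capture; judging from the benchmark reproduced in the comments below, they presumably looked roughly like this sketch (the column name, data, and suffix here are assumptions, not the reporter's actual code):

```python
import pandas as pd

# Toy data standing in for the reporter's frame (contents assumed).
p = pd.DataFrame({'strings': ['hello world'] * 100000})

# "Vectorized" version (slow for the reporter): the .str accessor
# maps the method over the Series, with per-element NA handling.
p['isHello'] = p['strings'].str.endswith('world')

# Plain-Python version (reported ~6x faster): a list comprehension
# over the raw ndarray of values.
p['isHello'] = [s.endswith('world') for s in p['strings'].values]
```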
Comments
Have you tried pandas 0.10.1?
Just upgraded to 0.10.1. It is faster (that first loop drops from ~2 minutes to ~1 minute). However, that's still 3x slower than the list comprehension way. Why the difference?
NA handling adds some overhead. I will have to investigate.
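Roughly, the vectorized path has to guard every element against missing values before applying the string method. An illustrative pure-Python sketch (not pandas' actual code) of where that overhead comes from:

```python
import numpy as np

def endswith_with_na(values, suffix):
    # The per-element missing-value check is the extra work that the
    # plain list-comprehension version doesn't do.
    return [v.endswith(suffix) if isinstance(v, str) else np.nan
            for v in values]
```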
With master:

```
In [11]: def f(p):
    ...:     for ii in xrange(1000):
    ...:         p['isHello'] = p['strings'].str.endswith('world')
    ...:

In [12]: %timeit f(p)
1 loops, best of 3: 3.5 s per loop

In [13]: def f(p):
    ...:     for ii in xrange(1000):
    ...:         p['isHello'] = [s.endswith('world') for s in p['strings'].values]
    ...:

In [14]: %timeit f(p)
1 loops, best of 3: 2.84 s per loop
```

Not as bad as described (fixed?); pushing back to 0.12. Perhaps even close.
I did a bit of investigation with this. I pushed everything into Cython and eliminated the convert_objects call (since we know the result is a bool, we can create it directly). Then I get about 1.5s on the benchmark (compared to about 3.5s on my machine). I think you need to cast the strings from PyObject* to char* and then implement the functions (e.g. …)
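By way of illustration, a minimal Cython sketch of the first part of that idea (allocate the bool result directly and loop in Cython, skipping the generic map plus convert_objects pass); the name and signature are made up, and the further PyObject*-to-char* step isn't shown:

```cython
cimport cython
cimport numpy as cnp
import numpy as np

cnp.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
def vec_endswith(cnp.ndarray[object] arr, suffix):
    # Allocate the boolean result up front instead of building an
    # object array and converting it afterwards.
    cdef Py_ssize_t i, n = arr.shape[0]
    cdef cnp.ndarray[cnp.uint8_t, cast=True] out = np.empty(n, dtype=bool)
    for i in range(n):
        out[i] = arr[i].endswith(suffix)
    return out
```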
FWIW, at some point I'd like to add memoization (using klib under the hood) to all of these functions, which would accelerate things in a lot of cases.
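As a rough illustration of what that buys (a pure-Python sketch; the real thing would use klib's C hash tables, not a dict): each distinct string is evaluated once, so columns with many repeated values get much cheaper.

```python
import numpy as np

def memoized_endswith(values, suffix):
    # One computation per unique string; repeated values hit the cache.
    cache = {}
    out = np.empty(len(values), dtype=bool)
    for i, s in enumerate(values):
        try:
            out[i] = cache[s]
        except KeyError:
            out[i] = cache[s] = s.endswith(suffix)
    return out
```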
I get the same results as @y-p. Definitely not as bad as the OP says. Close?
Yep... I Cythonized it to test things out; that does give a modest boost. The real boost would come from actually using the C functions directly in a specialized loop, which could be done (as most of the functions are canned).
fyi |
Closing in favor of #4694.