PERFORMANCE: Very slow vectorized string functions #2802

darindillon · 2013-02-05T22:48:09Z

Using pandas 0.10.
The docs say pandas has vectorized string functions, but they run far slower for me than plain Python (about 6x slower). Here are two nearly identical versions of the code with wildly different performance numbers. Why is the vectorized version so much slower?
I thought the difference might be because the vectorized version has to create an extra temporary Series object that gets thrown away. But then I tried to control for that by creating an extra Series object in my fast version, and that didn't change anything. So why is the vectorized version slow?

import pandas

strings = ["hello world" for ii in xrange(10000)]
p = pandas.DataFrame({"strings": strings})

# Use vectorized pandas.
# Takes ~2 min on my computer.
for ii in xrange(1000):
    p['isHello'] = p['strings'].str.endswith('world')

# Use non-vectorized list comprehension.
# Takes ~20 seconds on my computer.
for ii in xrange(1000):
    p['isHello'] = [s.endswith('world') for s in p['strings'].values]

# Non-vectorized version, but also create a temp Series object.
# Still takes only ~20 seconds.
for ii in xrange(1000):
    p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])

wesm · 2013-02-05T22:55:58Z

Have you tried pandas 0.10.1?

darindillon · 2013-02-05T23:01:47Z

Just upgraded to 0.10.1. It is faster (that first loop drops from ~2 minutes to ~1 minute). However, that's still 3x slower than the list comprehension way. Why the difference?

wesm · 2013-02-05T23:06:19Z

NA handling adds some overhead. I will have to investigate
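
To illustrate where that kind of overhead can come from, here is a minimal pure-Python sketch. It is not the pandas implementation: the helper names (`is_missing`, `endswith_plain`, `endswith_na_aware`) and the choice to propagate missing values as `None` are assumptions made for this example. It only shows that an NA-aware path has to inspect every element before calling the string method, which a plain list comprehension skips.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def endswith_plain(values, suffix):
    # Fast path: assumes every element is a string.
    # Raises AttributeError as soon as it meets a missing value.
    return [v.endswith(suffix) for v in values]


def endswith_na_aware(values, suffix):
    # Slower path: one extra check per element; missing values come back as None.
    return [None if is_missing(v) else v.endswith(suffix) for v in values]


clean = ["hello world", "goodbye"]
dirty = ["hello world", float("nan"), None]

print(endswith_plain(clean, "world"))
print(endswith_na_aware(dirty, "world"))
```

The list comprehension in the original post is equivalent to `endswith_plain`, so it is only a fair comparison when the column is known to contain no missing values.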

ghost · 2013-03-18T05:02:24Z

With master:

In [11]: def f(p):
    ...:     for ii in xrange(1000) :
    ...:         p['isHello'] = p['strings'].str.endswith('world')

In [12]: %timeit f(p)
1 loops, best of 3: 3.5 s per loop

In [13]: def f(p):
    ...:     for ii in xrange(1000) :
    ...:         p['isHello'] = [s.endswith('world') for s in p['strings'].values]

In [14]: %timeit f(p)
1 loops, best of 3: 2.84 s per loop

Not as bad as described (fixed?). Pushing back to 0.12; perhaps even close.
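
For anyone who wants to rerun the comparison above outside IPython, the following is a rough sketch using the standard-library `timeit` module. It is written for Python 3 and a current pandas rather than the 2013 versions discussed here, and the function names and the small default repeat count are choices made for this example. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def make_frame(n_rows=10000):
    # Same shape of data as in the original report: one repeated string.
    return pd.DataFrame({"strings": ["hello world"] * n_rows})


def vectorized(p):
    p["isHello"] = p["strings"].str.endswith("world")


def comprehension(p):
    p["isHello"] = [s.endswith("world") for s in p["strings"].values]


def run_benchmark(n_rows=10000, repeats=20):
    """Return total seconds spent in each variant over `repeats` calls."""
    p = make_frame(n_rows)
    return {
        "vectorized": timeit.timeit(lambda: vectorized(p), number=repeats),
        "comprehension": timeit.timeit(lambda: comprehension(p), number=repeats),
    }


for name, seconds in run_benchmark().items():
    print(name, round(seconds, 4))
```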

jreback · 2013-03-18T19:18:50Z

I did a bit of investigation on this. I pushed everything into Cython and eliminated the convert_objects call (since you know the result is a bool, it can be created directly). With that I get about 1.5s on the benchmark (compared to about 3.5s on my machine).

I think you would need to cast the strings from PyObjects to char* and then implement the functions (e.g. endswith) directly, so probably a lot of work. I can put up what I did if anyone is interested.

4c2f998

wesm · 2013-03-19T17:57:10Z

FWIW at some point I'd like to add memoization (use klib under the hood) to all these functions which would accelerate things in a lot of cases.
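
A rough sketch of the memoization idea, with a plain Python `dict` standing in for the C hash table mentioned above. The function name and the returned count of distinct values are assumptions made for this example, and it assumes the input contains only strings. The string method runs once per distinct value, so a column with many repeats (like the benchmark's 10000 copies of "hello world") does almost no string work.

```python
def endswith_memoized(values, suffix):
    """Compute v.endswith(suffix) once per distinct string in `values`.

    Returns the list of results and the number of distinct strings seen.
    """
    cache = {}
    result = []
    for v in values:
        hit = cache.get(v)
        if hit is None:
            # First time this string is seen: do the real work and remember it.
            hit = v.endswith(suffix)
            cache[v] = hit
        result.append(hit)
    return result, len(cache)


values = ["hello world"] * 10000 + ["hello there"] * 5
flags, n_distinct = endswith_memoized(values, "world")
print(n_distinct, flags[0], flags[-1])
```

The benefit depends on the data: with mostly unique strings the cache lookups add cost instead of removing it.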

cpcloud · 2013-05-29T18:04:49Z

I get the same results as @y-p. Definitely not as bad as the OP says. Close?

jreback · 2013-05-29T18:10:15Z

Yep... I cythonized it to test it out. It does give a modest boost; the real boost would come from actually using the C functions directly in a specialized loop, which could be done (as most of the functions are canned).

cpcloud · 2013-05-29T19:16:53Z

FYI, endswith probably won't get a big boost since it must accept tuples as well as strs (I guess this could be dispatched as well if wanted).
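
As a reminder of the behavior in question, Python's built-in `str.endswith` accepts either a single string or a tuple of candidate suffixes. The sketch below shows one way the dispatch could look in pure Python; the three function names are hypothetical and this is not how pandas implements it. A specialized loop could be written for the single-string case while tuples fall back to a general path.

```python
def endswith_single(values, suffix):
    # Path for one suffix: the case a specialized loop would target.
    return [v.endswith(suffix) for v in values]


def endswith_any(values, suffixes):
    # General path: true if any of the candidate suffixes matches.
    return [any(v.endswith(s) for s in suffixes) for v in values]


def endswith_dispatch(values, suffix):
    # Choose the path once per call, not once per element.
    if isinstance(suffix, tuple):
        return endswith_any(values, suffix)
    return endswith_single(values, suffix)


words = ["hello world", "hello there", "goodbye"]
print(endswith_dispatch(words, "world"))
print(endswith_dispatch(words, ("world", "there")))
```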

jreback · 2013-09-22T15:50:17Z

Closing in favor of #4694.

ghost mentioned this issue Apr 7, 2013

ENH: df.grep(col,pat) and df.dselect(col,"expr") #2460

Closed

jreback mentioned this issue May 29, 2013

Replace with regular expression #2285

Closed

hayd mentioned this issue Sep 22, 2013

string getitem methods are slow #4694

Closed

jreback closed this as completed Sep 22, 2013
