Optimize series.rolling.sum() #608
066cd4a to 178a4d9
178a4d9 to b2a4d9d
output_arr = numpy.empty(length, dtype=float64)

chunks = get_chunks(length)
for i in prange(len(chunks)):
Did it help? Can't wait to see the result 😄
BTW, you are not going to write all this monstrous code for every rolling function, are you?
I'm expecting to see a generic implementation for most of the series methods, something like this:
    windows = [WindowKind(window_size)]
    for i in range(1, len(chunks)):
        windows.append(WindowKind(window_size))
    for i in prange(len(chunks)):
        chunk = chunks[i]
        window = windows[i]
        # prelude: warm the window up on the elements that precede the chunk
        prelude_start = max(0, chunk.start - window_size)
        prelude_stop = max(0, chunk.start)
        for j in range(prelude_start, prelude_stop):
            window.add(data, j)
        for j in range(chunk.start, chunk.stop):
            window.add(data, j)
            result[j] = window.get_result()
This is pseudocode. You need to think through the exact details.
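The pseudocode above can be turned into a small runnable sketch. This is plain Python with an ordinary `for` loop standing in for `prange`; `SumWindow`, `split_chunks`, and `rolling_sum_chunked` are hypothetical names used only for illustration, not SDC APIs. NaN handling and `min_periods` are left out, and positions with an incomplete window yield NaN purely as an assumption.

```python
import math
from collections import deque


class SumWindow:
    """Hypothetical window kind: keeps the last `size` values and their sum."""

    def __init__(self, size):
        self.size = size
        self.values = deque()
        self.total = 0.0

    def add(self, data, j):
        # Push data[j]; evict the oldest value once the window is over capacity.
        self.values.append(data[j])
        self.total += data[j]
        if len(self.values) > self.size:
            self.total -= self.values.popleft()

    def get_result(self):
        # Assumption: an incomplete window yields NaN.
        return self.total if len(self.values) == self.size else math.nan


def split_chunks(length, nchunks):
    """Hypothetical helper: split range(length) into contiguous ranges."""
    step = max(1, -(-length // nchunks))  # ceiling division
    return [range(s, min(s + step, length)) for s in range(0, length, step)]


def rolling_sum_chunked(data, window_size, nchunks=4):
    result = [math.nan] * len(data)
    chunks = split_chunks(len(data), nchunks)
    # Chunks share no state, so this loop is the one that could become a prange.
    for chunk in chunks:
        window = SumWindow(window_size)
        # Prelude: warm the window up on the elements just before the chunk.
        prelude_start = max(0, chunk.start - window_size + 1)
        for j in range(prelude_start, chunk.start):
            window.add(data, j)
        for j in range(chunk.start, chunk.stop):
            window.add(data, j)
            result[j] = window.get_result()
    return result
```

Because every chunk rebuilds its own window from the preceding `window_size - 1` elements, the output does not depend on how the input is split.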
8a4b6da to e0d92fd
Current performance:
| name | nthreads | type | size | median |
|---|---|---|---|---|
| Series.rolling.sum | 1 | Python | 200000 | 1.219 |
| Series.rolling.sum | 1 | SDC | 200000 | 0.905 |
| Series.rolling.sum | 4 | Python | 200000 | 1.23 |
| Series.rolling.sum | 4 | SDC | 200000 | 0.409 |
Python 1 / SDC 4 = 2.98
Scalability is now enabled.
41896f4 to ac675e5
Current performance:
| name | nthreads | type | size | median |
|---|---|---|---|---|
| Series.rolling.sum | 1 | Python | 800000 | 3.903 |
| Series.rolling.sum | 1 | SDC | 800000 | 0.517 |
| Series.rolling.sum | 4 | Python | 800000 | 3.947 |
| Series.rolling.sum | 4 | SDC | 800000 | 0.254 |
Python 1 / SDC 1 = 7.549
Python 1 / SDC 4 = 15.366
Remeasured linear implementation b2a4d9d:
| name | nthreads | type | size | median |
|---|---|---|---|---|
| Series.rolling.sum | 1 | Python | 800000 | 4.01 |
| Series.rolling.sum | 1 | SDC | 800000 | 0.401 |
SDC_LINEAR 1 / SDC_PARALLEL 1 = 0.776
SDC_LINEAR 1 / SDC_PARALLEL 4 = 1.579
I think it's a victory.
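For readers checking the numbers, the ratios quoted above are just one median divided by another. A minimal sketch that recomputes them from the tables (the values are copied from this comment; the dictionary layout and the `speedup` helper are illustrative only):

```python
# Median timings copied from the tables above (size 800000).
medians = {
    ("Python", 1): 3.903,
    ("SDC", 1): 0.517,
    ("SDC", 4): 0.254,
    ("SDC_LINEAR", 1): 0.401,  # remeasured linear implementation b2a4d9d
}


def speedup(baseline, candidate):
    """Ratio of medians: how many times faster `candidate` is than `baseline`."""
    return medians[baseline] / medians[candidate]


python_vs_sdc1 = round(speedup(("Python", 1), ("SDC", 1)), 3)
python_vs_sdc4 = round(speedup(("Python", 1), ("SDC", 4)), 3)
linear_vs_parallel1 = round(speedup(("SDC_LINEAR", 1), ("SDC", 1)), 3)
linear_vs_parallel4 = round(speedup(("SDC_LINEAR", 1), ("SDC", 4)), 3)
```

A ratio below 1 (the single-threaded linear versus parallel case) means the parallel version is slower on one thread; the gain only appears once more threads are used.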
    return nfinite, result


def gen_sdc_pandas_series_rolling_impl(pop, put, init_result=numpy.nan):
Please consider the following option:
    @sdc_register_jitable
    def result_or_nan(nfinite, minp, result):
        if nfinite < minp:
            return numpy.nan
        return result

    def gen_sdc_pandas_series_rolling_impl(pop, put, init_result=numpy.nan):
        """Generate series rolling methods implementations based on pop/put funcs"""
        def impl(self):
            win = self._window
            minp = self._min_periods
            input_series = self._data
            input_arr = input_series._data
            length = len(input_arr)
            output_arr = numpy.empty(length, dtype=float64)
            chunks = parallel_chunks(length)
            for i in prange(len(chunks)):
                chunk = chunks[i]
                nfinite = 0
                result = init_result
                prelude_start = max(0, chunk.start - win + 1)
                prelude_stop = min(chunk.start, prelude_start + win)
                interlude_start = prelude_stop
                interlude_stop = min(prelude_start + win, chunk.stop)
                for idx in range(prelude_start, prelude_stop):
                    value = input_arr[idx]
                    nfinite, result = put(value, nfinite, result)
                for idx in range(interlude_start, interlude_stop):
                    value = input_arr[idx]
                    nfinite, result = put(value, nfinite, result)
                    output_arr[idx] = result_or_nan(nfinite, minp, result)
                for idx in range(interlude_stop, chunk.stop):
                    put_value = input_arr[idx]
                    pop_value = input_arr[idx - win]
                    nfinite, result = put(put_value, nfinite, result)
                    nfinite, result = pop(pop_value, nfinite, result)
                    output_arr[idx] = result_or_nan(nfinite, minp, result)
            return pandas.Series(output_arr, input_series._index,
                                 name=input_series._name)
        return impl
It's not the most elegant one, but it could give us some performance (by eliminating the condition in the loop and the extra counter). If it doesn't, your solution is preferable.
Also, I've changed the order of `put` and `pop` (first `put`, then `pop`). It shouldn't affect sum, but it could be useful for min and max: if the value we have just added is the new min/max, we don't need to recalculate the result.
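To make the suggestion concrete, here is a hedged, self-contained sketch of how `put`/`pop` functions for a sum might plug into the same prelude / interlude / main-loop structure. It is plain Python without Numba; `put_sum`, `pop_sum`, `rolling_apply`, the explicit chunk list, and the `init_result=0.0` default are assumptions made for illustration, not the code in this PR.

```python
import math


def put_sum(value, nfinite, result):
    """Add a value entering the window; NaN values are skipped."""
    if math.isnan(value):
        return nfinite, result
    return nfinite + 1, result + value


def pop_sum(value, nfinite, result):
    """Remove a value leaving the window; NaN values are skipped."""
    if math.isnan(value):
        return nfinite, result
    return nfinite - 1, result - value


def result_or_nan(nfinite, minp, result):
    return math.nan if nfinite < minp else result


def rolling_apply(data, win, minp, chunks, put, pop, init_result=0.0):
    out = [math.nan] * len(data)
    for chunk in chunks:  # stand-in for prange: all state below is chunk-local
        nfinite, result = 0, init_result
        prelude_start = max(0, chunk.start - win + 1)
        interlude_stop = min(prelude_start + win, chunk.stop)
        # Prelude: fill the window from elements before the chunk, no output.
        for idx in range(prelude_start, chunk.start):
            nfinite, result = put(data[idx], nfinite, result)
        # Interlude: the window is still growing, so nothing is popped yet.
        for idx in range(chunk.start, interlude_stop):
            nfinite, result = put(data[idx], nfinite, result)
            out[idx] = result_or_nan(nfinite, minp, result)
        # Main loop: the window is full, so each step is one put and one pop.
        for idx in range(interlude_stop, chunk.stop):
            nfinite, result = put(data[idx], nfinite, result)
            nfinite, result = pop(data[idx - win], nfinite, result)
            out[idx] = result_or_nan(nfinite, minp, result)
    return out
```

For example, `rolling_apply(data, 3, 2, [range(0, 2), range(2, 4), range(4, 6)], put_sum, pop_sum)` on a six-element input gives the same values as a single chunk `[range(0, 6)]`, since `nfinite` and `result` are re-initialised per chunk rather than shared.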
I didn't get a visible performance improvement, but I like the code, so let me apply the patch.
Also, please keep in mind that for some functions you need to keep more than one
Previous implementation results:
Optimized implementation results:
The optimized implementation runs up to ~85 times faster than the previous one and up to ~10 times faster than Python. It does not scale with the number of threads: `prange` isn't used at all, because the variable `nfinite` (the number of finite values) is shared by all threads.