Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Optimize series.rolling.sum() #608

Merged

Conversation

densmirn
Copy link
Contributor

Previous implementation results:

name nthreads type size median
Series.rolling.sum 4 Python 10000000 0.616
Series.rolling.sum 4 SDC 10000000 4.544

Optimized implementation results:

name nthreads type size median
Series.rolling.sum 4 Python 10000000 0.552
Series.rolling.sum 4 SDC 10000000 0.053

The optimized implementation executes faster up to ~85 times than previous one and faster up to ~10 times than Python. There is no scalability due to prange isn't used at all because variable nfinite (number of finite values) is common for all threads.

@densmirn densmirn force-pushed the feature/series_rolling_sum_opt branch from 066cd4a to 178a4d9 Compare February 17, 2020 07:31
@densmirn densmirn changed the title Reimplement series.rolling.sum() Optimize series.rolling.sum() Feb 17, 2020
@densmirn densmirn force-pushed the feature/series_rolling_sum_opt branch from 178a4d9 to b2a4d9d Compare February 17, 2020 12:15
output_arr = numpy.empty(length, dtype=float64)

chunks = get_chunks(length)
for i in prange(len(chunks)):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it helped? Can't wait to know the result 😄

BTW you are not going to write all this monstrous code for every rolling function, don't you?

Copy link
Collaborator

@AlexanderKalistratov AlexanderKalistratov Feb 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm expecting to see generic implementation for the most of series methods something like this:

windows = [WindowKind(window_size)]

for i in range(1, len(chunks)):
    windows.append(WindowKind(window_size))

for i in prange(len(chunks)):
    chunk = chunks[i]
    window = windows[i]
    prelude_start = max(0, chunk.start - window_size)
    prelude_stop = max(0, chunk.start)
    for j in range(interlude_start, interlude_stop):
        window.add(data, j)
    
    for j in range(chunk.start, chunk.stop)
        window.add(data, j)
        result[j] = window.get_result()

This is a pseudocode. You need to think about exact details

@AlexanderKalistratov AlexanderKalistratov dismissed their stale review February 17, 2020 21:21

accidental approval

@pep8speaks
Copy link

pep8speaks commented Feb 18, 2020

Hello @densmirn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-20 05:35:37 UTC

Copy link
Contributor Author

@densmirn densmirn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current performance:

name nthreads type size median
Series.rolling.sum 1 Python 200000 1.219
Series.rolling.sum 1 SDC 200000 0.905
Series.rolling.sum 4 Python 200000 1.23
Series.rolling.sum 4 SDC 200000 0.409

Python 1 / SDC 4 = 2,98
The scalability was enabled.

@densmirn densmirn force-pushed the feature/series_rolling_sum_opt branch from 41896f4 to ac675e5 Compare February 19, 2020 16:38
Copy link
Contributor Author

@densmirn densmirn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current performance:

name nthreads type size median
Series.rolling.sum 1 Python 800000 3.903
Series.rolling.sum 1 SDC 800000 0.517
Series.rolling.sum 4 Python 800000 3.947
Series.rolling.sum 4 SDC 800000 0.254

Python 1 / SDC 1 = 7.549
Python 1 / SDC 4 = 15,366

Remeasured linear implementation b2a4d9d:

name nthreads type size median
Series.rolling.sum 1 Python 800000 4.01
Series.rolling.sum 1 SDC 800000 0.401

SDC_LINEAR 1 / SDC_PARALLEL 1 = 0.776
SDC_LINEAR 1 / SDC_PARALLEL 4 = 1.579

I think it's a victory.

return nfinite, result


def gen_sdc_pandas_series_rolling_impl(pop, put, init_result=numpy.nan):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the following option:

@sdc_register_jitable
def result_or_nan(nfinite, minp, result):
    if nfinite < minp:
        return numpy.nan
    
    return result


def gen_sdc_pandas_series_rolling_impl(pop, put, init_result=numpy.nan):
    """Generate series rolling methods implementations based on pop/put funcs"""
    def impl(self):
        win = self._window
        minp = self._min_periods

        input_series = self._data
        input_arr = input_series._data
        length = len(input_arr)
        output_arr = numpy.empty(length, dtype=float64)

        chunks = parallel_chunks(length)
        for i in prange(len(chunks)):
            chunk = chunks[i]
            nfinite = 0
            result = init_result

            prelude_start = max(0, chunk.start - win + 1)
            prelude_stop = min(chunk.start, prelude_start + win)

            interlude_start = prelude_stop
            interlude_stop = min(prelude_start + win, chunk.stop)

            for idx in range(prelude_start, prelude_stop):
                value = input_arr[idx]
                nfinite, result = put(value, nfinite, result)

            for idx in range(interlude_start, interlude_stop):
                value = input_arr[idx]
                nfinite, result = put(value, nfinite, result)
                output_arr[idx] = result_or_nan(nfinite, minp, result)

            for idx in range(interlude_stop, chunk.stop):
                put_value = input_arr[idx]
                pop_value = input_arr[idx - win]
                nfinite, result = put(put_value, nfinite, result)
                nfinite, result = pop(pop_value, nfinite, result)
                output_arr[idx] = result_or_nan(nfinite, minp, result)

        return pandas.Series(output_arr, input_series._index,
                             name=input_series._name)
    return impl

It's not the most elegant one, but it could give us some performance (due to elimination of condition in loop and extra counter). If it doesn't, your solution is preferable.

Also, I've changed order of put and pop (firstly put, then pop). It shouldn't affect sum, but could be useful for min and max - if we have added new min/max - we don't need to recalculate result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get visible result, but I like the code. So let me apply the patch.

@AlexanderKalistratov
Copy link
Collaborator

Also please keep in mind, that for some functions you need to keep more than one result (e.g. variance)

@AlexanderKalistratov AlexanderKalistratov merged commit 632b554 into IntelPython:master Feb 20, 2020
@densmirn densmirn deleted the feature/series_rolling_sum_opt branch June 9, 2020 12:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants