
blosc2.jit support for pandas UDFs #383


Open
datapythonista opened this issue Apr 13, 2025 · 1 comment · May be fixed by #418


@datapythonista
Contributor

datapythonista commented Apr 13, 2025

xref pandas-dev/pandas#61125

We discussed this informally in the past. This issue spells out more clearly how blosc2.jit and pandas can interact.

I'm about to open a PR in pandas to support this:

import numpy as np
import pandas
import blosc2

def my_func(x):
    return np.sin(x * 2)

s = pandas.Series([1, 2, 3], index=list('abc'), name='sample')

# normal call executed by pandas
print(s.map(my_func))

# we let blosc2 handle this
print(s.map(my_func, engine=blosc2.jit))

To be able to do this, blosc2 would need to implement a new interface. The implementation shouldn't be too complex. The example below ignores skip_na and omits a second method, apply, for column-wise operations (where the function is called with the whole array, not with each scalar). Something like:

import numpy as np
import blosc2

# Reference base class: https://github.com/pandas-dev/pandas/blob/main/pandas/core/apply.py#L77
class Blosc2ExecutionEngine:
    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        if not isinstance(data, np.ndarray):
            # we probably received a Series
            if hasattr(data, "values"):
                data = data.values
            else:
                # there is a chance that we call this with a pyarrow object in the future
                raise ValueError(f"blosc2.jit does not support {type(data).__name__}")
                
        func = decorator(func)
        result = func(data, *args, **kwargs)
        return result


blosc2.jit.__pandas_udf__ = Blosc2ExecutionEngine
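
For illustration, here is a rough sketch of how a caller could dispatch to an engine through the `__pandas_udf__` attribute. This is not the pandas implementation: `dispatch_map`, `fake_jit` and `SketchEngine` are hypothetical names made up for this example, and `fake_jit` is a no-op stand-in so the snippet runs without blosc2 or pandas installed.

```python
import numpy as np


def fake_jit(func):
    """Hypothetical stand-in for blosc2.jit: returns func unchanged."""
    return func


class SketchEngine:
    """Simplified engine with the same map signature as in the issue."""

    @staticmethod
    def map(data, func, args, kwargs, decorator, skip_na):
        # Unwrap Series-like objects; plain ndarrays pass through.
        data = getattr(data, "values", data)
        # Decorate once, then call once with the whole array.
        return decorator(func)(np.asarray(data), *args, **kwargs)


# Attach the engine to the decorator, as proposed for blosc2.jit.
fake_jit.__pandas_udf__ = SketchEngine


def dispatch_map(data, func, engine=None):
    """Hypothetical caller-side dispatch; the real pandas code may differ."""
    if engine is None:
        # Default path: call func once per element.
        return np.array([func(x) for x in data])
    if not hasattr(engine, "__pandas_udf__"):
        raise ValueError("engine does not implement __pandas_udf__")
    return engine.__pandas_udf__.map(
        data=data, func=func, args=(), kwargs={}, decorator=engine, skip_na=False
    )


def my_func(x):
    return np.sin(x * 2)


values = np.array([1.0, 2.0, 3.0])
default_result = dispatch_map(values, my_func)
engine_result = dispatch_map(values, my_func, engine=fake_jit)
```

The point of the attribute lookup is that the caller only needs the decorator object itself, so neither side has to import the other.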

The advantage of this approach over just decorating the function is that the whole execution loop can be jitted, not only the individual calls.
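
To make the difference concrete, here is a small sketch that counts how often the decorated function is entered in each case. The `counting` decorator is a hypothetical stand-in for a JIT decorator and only records calls; it does not compile anything.

```python
import numpy as np

calls = {"count": 0}


def counting(func):
    """Hypothetical stand-in for a JIT decorator: counts invocations."""
    def wrapper(*args, **kwargs):
        calls["count"] += 1
        return func(*args, **kwargs)
    return wrapper


def my_func(x):
    return np.sin(x * 2)


data = np.arange(1000, dtype=np.float64)

# Decorating the function only: the Python-level loop remains,
# and the decorated function is entered once per element.
decorated = counting(my_func)
per_element = np.array([decorated(x) for x in data])
per_element_calls = calls["count"]

# Engine-style: the decorated function receives the whole array once,
# so the loop itself is inside what the decorator can optimize.
calls["count"] = 0
whole_array = counting(my_func)(data)
whole_array_calls = calls["count"]
```

Both paths compute the same values; the second one just hands the entire loop to the decorated function in a single call.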

What do you think? Is this something you'd like to implement? Any feedback? It's designed in a way that you don't need to add a dependency on pandas. We aim to have Numba and Bodo supporting this same interface, and possibly others.

@FrancescAlted
Member

Sure. That code seems quite unobtrusive, and we would be happy to serve the pandas community. Would you mind sending a PR?

@datapythonista datapythonista linked a pull request May 24, 2025 that will close this issue