ENH: Allow JIT compilation with an internal API #61032
Changes from all commits
7ec827e
bc2a178
8b420cc
6a9ee5a
444de67
7e1e855
58fb30d
c239fc9
9567152
2ff333f
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
""" | ||
Public API for function executor engines to be used with ``map`` and ``apply``. | ||
""" | ||
|
||
from pandas.core.apply import BaseExecutionEngine | ||
|
||
__all__ = ["BaseExecutionEngine"] |
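The module above only re-exports `BaseExecutionEngine`. As a rough illustration of how a decorator passed as `engine=` could be routed to such an engine, here is a minimal sketch in plain Python. It assumes the `__pandas_udf__` attribute convention and the keyword arguments (`data`, `func`, `args`, `kwargs`, `decorator`, `axis`) that appear in the `apply` change in this PR. `ListEngine`, `list_engine` and `apply_with_engine` are hypothetical names, lists of rows stand in for a DataFrame, and none of this is the pandas implementation.

```python
class ListEngine:
    """Hypothetical engine, standing in for a BaseExecutionEngine subclass."""

    @staticmethod
    def apply(data, func, args, kwargs, decorator, axis):
        # A real engine would JIT compile here; this sketch only wraps the
        # function with the decorator when one is given.
        compiled = decorator(func) if decorator is not None else func
        # axis=0 feeds one column at a time, any other value feeds rows.
        items = [list(col) for col in zip(*data)] if axis == 0 else data
        return [compiled(item, *args, **kwargs) for item in items]


def list_engine(func):
    """Hypothetical decorator a user would pass as ``engine=``."""
    return func


# The attribute that the dispatch below looks for (assumed convention).
list_engine.__pandas_udf__ = ListEngine


def apply_with_engine(data, func, engine=None, axis=0, args=(), kwargs=None):
    """Simplified stand-in for the branching added to ``DataFrame.apply``."""
    kwargs = kwargs or {}
    if engine is None or isinstance(engine, str):
        if engine is None:
            engine = "python"
        if engine not in ["python", "numba"]:
            raise ValueError(f"Unknown engine '{engine}'")
        if engine == "numba":
            raise NotImplementedError("the numba path is outside this sketch")
        # Default path: run the function with the regular interpreter.
        return ListEngine.apply(data, func, args, kwargs, None, axis)
    elif hasattr(engine, "__pandas_udf__"):
        # Third-party path: delegate to the engine attached to the decorator.
        return engine.__pandas_udf__.apply(
            data=data,
            func=func,
            args=args,
            kwargs=kwargs,
            decorator=engine,
            axis=axis,
        )
    else:
        raise ValueError(f"Unknown engine {engine}")
```

With `data = [[4, 9]] * 3`, calling `apply_with_engine(data, sum, engine=list_engine)` returns `[12, 27]`, and passing an object without `__pandas_udf__` raises `ValueError`.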
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10275,7 +10275,7 @@ def apply( | |
result_type: Literal["expand", "reduce", "broadcast"] | None = None, | ||
args=(), | ||
by_row: Literal[False, "compat"] = "compat", | ||
engine: Literal["python", "numba"] = "python", | ||
engine: Callable | None | Literal["python", "numba"] = None, | ||
engine_kwargs: dict[str, bool] | None = None, | ||
**kwargs, | ||
): | ||
|
@@ -10339,35 +10339,32 @@ def apply( | |
|
||
.. versionadded:: 2.1.0 | ||
|
||
engine : {'python', 'numba'}, default 'python' | ||
Choose between the python (default) engine or the numba engine in apply. | ||
engine : decorator or {'python', 'numba'}, optional | ||
Choose the execution engine to use. If not provided the function | ||
will be executed by the regular Python interpreter. | ||
|
||
The numba engine will attempt to JIT compile the passed function, | ||
which may result in speedups for large DataFrames. | ||
It also supports the following engine_kwargs : | ||
Other options include JIT compilers such as Numba and Bodo, which in some | ||
cases can speed up the execution. To use an executor you can provide | ||
the decorators ``numba.jit``, ``numba.njit`` or ``bodo.jit``. You can | ||
also provide the decorator with parameters, like ``numba.jit(nogil=True)``. | ||
|
||
- nopython (compile the function in nopython mode) | ||
- nogil (release the GIL inside the JIT compiled function) | ||
- parallel (try to apply the function in parallel over the DataFrame) | ||
Not all functions can be executed with all execution engines. In general, | ||
JIT compilers will require type stability in the function (no variable | ||
should change data type during the execution). And not all pandas and | ||
NumPy APIs are supported. Check the engine documentation [1]_ and [2]_ | ||
for limitations. | ||
|
||
Note: Due to limitations within numba/how pandas interfaces with numba, | ||
you should only use this if raw=True | ||
|
||
Note: The numba compiler only supports a subset of | ||
valid Python/numpy operations. | ||
.. warning:: | ||
|
||
Please read more about the `supported python features | ||
<https://numba.pydata.org/numba-doc/dev/reference/pysupported.html>`_ | ||
and `supported numpy features | ||
<https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html>`_ | ||
in numba to learn what you can or cannot use in the passed function. | ||
String parameters will stop being supported in a future pandas version. | ||
|
||
.. versionadded:: 2.2.0 | ||
|
||
engine_kwargs : dict | ||
Pass keyword arguments to the engine. | ||
This is currently only used by the numba engine, | ||
see the documentation for the engine argument for more information. | ||
|
||
**kwargs | ||
Additional keyword arguments to pass as keywords arguments to | ||
`func`. | ||
|
@@ -10390,6 +10387,13 @@ def apply( | |
behavior or errors and are not supported. See :ref:`gotchas.udf-mutation` | ||
for more details. | ||
|
||
References | ||
---------- | ||
.. [1] `Numba documentation | ||
<https://numba.readthedocs.io/en/stable/index.html>`_ | ||
.. [2] `Bodo documentation | ||
<https://docs.bodo.ai/latest/>`_ | ||
|
||
Examples | ||
-------- | ||
>>> df = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"]) | ||
|
@@ -10458,22 +10462,99 @@ def apply( | |
0 1 2 | ||
1 1 2 | ||
2 1 2 | ||
|
||
Advanced users can speed up their code by using a Just-in-time (JIT) compiler | ||
with ``apply``. The main JIT compilers available for pandas are Numba and Bodo. | ||
In general, JIT compilation is only possible when the function passed to | ||
``apply`` has type stability (variables in the function do not change their | ||
type during the execution). | ||
|
||
>>> import bodo | ||
>>> df.apply(lambda x: x.A + x.B, axis=1, engine=bodo.jit) | ||
|
||
Note that JIT compilation is only recommended for functions that take a | ||
significant amount of time to run. Fast functions are unlikely to run faster | ||
with JIT compilation. | ||
""" | ||
from pandas.core.apply import frame_apply | ||
if engine is None or isinstance(engine, str): | ||
from pandas.core.apply import frame_apply | ||
|
||
op = frame_apply( | ||
self, | ||
func=func, | ||
axis=axis, | ||
raw=raw, | ||
result_type=result_type, | ||
by_row=by_row, | ||
engine=engine, | ||
engine_kwargs=engine_kwargs, | ||
args=args, | ||
kwargs=kwargs, | ||
) | ||
return op.apply().__finalize__(self, method="apply") | ||
if engine is None: | ||
engine = "python" | ||
|
||
if engine not in ["python", "numba"]: | ||
raise ValueError(f"Unknown engine '{engine}'") | ||
|
||
op = frame_apply( | ||
self, | ||
func=func, | ||
axis=axis, | ||
raw=raw, | ||
result_type=result_type, | ||
by_row=by_row, | ||
engine=engine, | ||
engine_kwargs=engine_kwargs, | ||
args=args, | ||
kwargs=kwargs, | ||
) | ||
return op.apply().__finalize__(self, method="apply") | ||
elif hasattr(engine, "__pandas_udf__"): | ||
if result_type is not None: | ||
raise NotImplementedError( | ||
f"{result_type=} only implemented for the default engine" | ||
) | ||
|
||
agg_axis = self._get_agg_axis(self._get_axis_number(axis)) | ||
|
||
# one axis is empty | ||
if not all(self.shape): | ||
|
||
func = cast(Callable, func) | ||
try: | ||
if axis == 0: | ||
r = func(Series([], dtype=np.float64), *args, **kwargs) | ||
else: | ||
r = func( | ||
Series(index=self.columns, dtype=np.float64), | ||
*args, | ||
**kwargs, | ||
) | ||
except Exception: | ||
pass | ||
else: | ||
if not isinstance(r, Series): | ||
if len(agg_axis): | ||
r = func(Series([], dtype=np.float64), *args, **kwargs) | ||
else: | ||
r = np.nan | ||
|
||
return self._constructor_sliced(r, index=agg_axis) | ||
return self.copy() | ||
|
||
data: DataFrame | np.ndarray = self | ||
if raw: | ||
# This will upcast the whole DataFrame to the same type, | ||
# and likely result in an object 2D array. | ||
# We should probably pass a list of 1D arrays instead, at | ||
# least for ``axis=0`` | ||
data = self.values | ||
Numba doesn't support heterogeneous lists but does support heterogeneous tuples. I would say that it is so important to avoid 2D object arrays here that it should go in the first revision.

Also, how is the mapping from column name in "func" to index in the tuple supposed to happen?

Thanks for the comments, and for catching the typos. Fully agree with your first comment. I don't want to change the existing behavior in the same PR I'm implementing a new interface, but converting the whole dataframe to a single type doesn't seem a great option. The raw parameter and this behavior is as old as pandas, I don't think we would design it this way as of today. It won't still be trivial, as passing

If I understand you correctly, this is great question, but also a bit tricky. For the simple case, the function is called for every column. In the case the same function needs to be applied to every column, I don't think there is an issue. If the function receives Series (

    def func(series):
        if series.name == "column_1":
            return series.str.upper()
        return series.str.lower()

But when

This is when a column at a time is passed. When

I think historical reasons made the signatures or |
||
result = engine.__pandas_udf__.apply( | ||
data=data, | ||
func=func, | ||
args=args, | ||
kwargs=kwargs, | ||
decorator=engine, | ||
axis=axis, | ||
) | ||
if raw: | ||
if result.ndim == 2: | ||
return self._constructor( | ||
result, index=self.index, columns=self.columns | ||
) | ||
else: | ||
return self._constructor_sliced(result, index=agg_axis) | ||
return result | ||
else: | ||
raise ValueError(f"Unknown engine {engine}") | ||
|
||
def map( | ||
self, func: PythonFuncType, na_action: Literal["ignore"] | None = None, **kwargs | ||
|
@@ -10590,9 +10671,11 @@ def _append( | |
|
||
index = Index( | ||
[other.name], | ||
name=self.index.names | ||
if isinstance(self.index, MultiIndex) | ||
else self.index.name, | ||
name=( | ||
self.index.names | ||
if isinstance(self.index, MultiIndex) | ||
else self.index.name | ||
), | ||
) | ||
row_df = other.to_frame().T | ||
# infer_objects is needed for | ||
|
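The docstring added in this PR states that JIT compilers generally require type stability, meaning no variable changes its data type during execution. A minimal plain-Python sketch of the difference follows. The function names are hypothetical, and the check only inspects return types at runtime; it is not how Numba or Bodo perform type inference.

```python
def unstable(x):
    """``result`` starts as an int and may be rebound to a str."""
    result = 0
    if x < 0:
        result = "negative"  # the variable changes type on this branch
    return result


def stable(x):
    """``result`` is a float on every code path."""
    result = 0.0
    if x < 0:
        result = -1.0
    return result


def return_type_names(func, inputs):
    """Collect the names of the types a function returns over some inputs."""
    return {type(func(value)).__name__ for value in inputs}
```

A function like `stable` is the kind a JIT compiler can usually assign a single type to, while `unstable` would typically be rejected or fall back to a slower path, depending on the engine.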