Interpreter for micro-ops #541
I'm not sure if I completely follow. It looks like you are proposing pretty fine-grained uops. Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop? It sounds like you'll be doing a lot of work on generate_cases.py. That's fine, and I can help explain mysteries there, but I have to warn that I have plans to continue working on that file as well, so we might end up with merge conflicts. To minimize those you could try to merge certain changes back into the main branch, but that would be a distraction for you and your groupmate.

Yes. The hope is that the optimizations the tier 2 interpreter does on the uops make the dispatch overhead worth it.

Aha, I believe that was also the plan of #454 (or at least the conclusion), and why we got serious about generating the interpreter. I think that it would still be good to make the uops as large as possible, else the interpretation overhead will kill you.
What @gvanrossum says. You will want something like:

```c
inst(LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK) {
    PyObject *owner = TOP();
    PyTypeObject *tp = Py_TYPE(owner);
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    uint32_t type_version = read_u32(cache->version);
    assert(type_version != 0);
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    assert(tp->tp_dictoffset < 0);
    assert(tp->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}

inst(LOAD_ATTR_INSTANCE_VALUE_REST) {
    PyObject *owner = TOP();
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    /* note: cache is set up by the preceding uop */
    res = _PyDictOrValues_GetValues(dorv)->values[cache->index];
    DEOPT_IF(res == NULL, LOAD_ATTR);
    STAT_INC(LOAD_ATTR, hit);
    Py_INCREF(res);
    SET_TOP(NULL);
    STACK_GROW((oparg & 1));
    SET_TOP(res);
    Py_DECREF(owner);
    JUMPBY(INLINE_CACHE_ENTRIES_LOAD_ATTR);
}

LOAD_ATTR_INSTANCE_VALUE = LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK + LOAD_ATTR_INSTANCE_VALUE_REST;
```

The reason for this split is that type inference on the trace can potentially eliminate the type check, but the checks on the instance have to be done. I wouldn't attempt to eliminate the type checks in your project, but you could usefully count them.
OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyway. What I'm thinking of is:

The interpreter generator will then generate a normal instruction for the tier 1 interpreter, while combining PRELUDE + TYPE_CHECK and PRELUDE + REST to form two micro-ops.
I'm not sure how you plan to do that. Transferring data from one uop to the next is possible, but not yet very ergonomic. Have you looked through the code generator yet? There used to be an example of how to do this, but Mark got rid of it when he introduced COMPARE_AND_BRANCH. I could update test_generator.py to include an example though.
Have a look at this: gvanrossum/cpython@b60761d |
So after reading your comments in gh-101299 now I finally understand that you mean this as a proposal for new syntax in bytecodes.c. It looks like your convention is to have some labels in the code block of an instruction, perhaps using a naming convention, that identify individual micro-ops (uops). Syntactically that is a fine convention. But it doesn't allow us to reuse uops easily, does it? The current DSL syntax allows you to do things like …
Here, the uops can only pass information between them using the stack, but the generator avoids actual push/pop calls -- it just generates temporary C variables (see gh-101299) and I assume the C compiler will optimize that a bit more. (You can also specify a type for such temporaries, and the generator understands this.) I agree that my approach is more awkward when you have multiple pieces of data that are shared between uops, but your version makes the sharing invisible to the generator (which doesn't know, e.g., that …).

Happy to see you elaborate your proposal. (If we need to parse C more precisely, I have code written for that in reserve. :-)
I see your point about reusing uops. I think the best of both worlds would be some combination of the two ideas. So something like
The main problem with having a shared prelude though is that we need to standardise the names of the common variables in all the uops. I don't know how feasible that would be. The other problem is that this would mean UOPs are no longer standalone, and would create a hard dependency between code that is in the prelude, and code that is in the uop. This would hurt the composability of our uops. Unless of course, we specify some sort of prelude dependency for every uop and expose that to the case generator. Then we could feasibly auto-generate these. |
That LOAD_ATTR in particular requires a bit of new DSL syntax that I'm currently working on, so we can define it as
That looks awkward because the … When the instruction definition specifies I/O effects, a prologue (prelude) is auto-generated, which takes care of the pushes at the beginning and (in the epilogue) the pops at the end, including the necessary variable definitions. For examples, just look through generated_cases.c.h for some opcodes that use the new format, e.g. BINARY_SUBSCR. It would seem attractive to make this less complicated, but I worry about where UOP1() and UOP2() from your example are defined (are they just …?)
Yeah, UOP1 and UOP2 are just macros. They expand to the following: …

Yes. They should indeed be defined like that.
The code generator can almost handle micro-ops now:

```c
op(_BINARY_OP_FLOAT_CHECK, (left, right -- a, b)) {
    DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    STAT_INC(BINARY_OP, hit);
}

op(_BINARY_OP_ADD_FLOAT_ACTION, (unused/1, a, b -- sum)) {
    double dsum = ((PyFloatObject *)a)->ob_fval +
                  ((PyFloatObject *)b)->ob_fval;
    sum = PyFloat_FromDouble(dsum);
    _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
    _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
    ERROR_IF(sum == NULL, error);
}

super(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;
```

But this generates almost correct code (it has an extra …):

```c
TARGET(BINARY_OP_ADD_FLOAT) {
    PyObject *_tmp_1 = PEEK(1);
    PyObject *_tmp_2 = PEEK(2);
    {
        PyObject *right = _tmp_1;
        PyObject *left = _tmp_2;
        PyObject *a;
        PyObject *b;
        assert(cframe.use_tracing == 0);
        DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
        DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
        STAT_INC(BINARY_OP, hit);
        _tmp_2 = a;
        _tmp_1 = b;
    }
    NEXTOPARG();
    JUMPBY(1);
    {
        PyObject *b = _tmp_1;
        PyObject *a = _tmp_2;
        PyObject *sum;
        assert(cframe.use_tracing == 0);
        double dsum = ((PyFloatObject *)a)->ob_fval +
                      ((PyFloatObject *)b)->ob_fval;
        sum = PyFloat_FromDouble(dsum);
        _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
        _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
        if (sum == NULL) goto pop_2_error;
        _tmp_2 = sum;
    }
    JUMPBY(1);
    STACK_SHRINK(1);
    POKE(1, _tmp_2);
    DISPATCH();
}
```

To get this working we would need: …
Use …
Hm, that would require the tier 1 generator (the only one we currently have :-) to generate a macro in its output from an …

… would be translated into …

Honestly that feels pretty fragile, though for a quick prototype it should do.
That doesn't work either. I created an issue for it: python/cpython#101369
OK. I've hacked around and this is what I have: look for the two micro instructions "BINARY_OP_ADD_INT_TYPE_CHECK" and "BINARY_OP_ADD_INST_REST" (typo). In the tier 2 interpreter (in …), … At runtime, when we want to switch between the interpreters, we just swap out the …
Hi Ken Jin, I'm curious why you decided to go with new syntax, e.g. …, and use …
No particular reason. It's not upstream-ready code. Just a quick experiment. I'll update you on what we've done during our meeting later :).
Closing this issue. Will open a superseding one with more concrete plans once my experiment concludes. |
Continuing from #375.
The granularity of micro ops has not been decided yet, and is being deliberated over at #454.
Over the next 3 months, we plan to experiment with different trace types (short, long) for the tracing interpreter for micro-ops, and compare them. This plan came out of some discussion with Mark; we intend to experiment to gain some insights.
My current plan:
Instead of generating uops for the level 1 interpreter to interpret, we could generate uops on the fly and only when we generate the traces. This would save us from the dispatch overhead of uops in the standard interpreter, and would allow the optimisations in the tracing/level 2 interpreter to make up for the dispatch overhead.
With the interpreter being generated, we could automate the uop generation and definition.
E.g. `LOAD_ATTR_METHOD_WITH_VALUES`. Then for normal `ceval.c` (the tier 1 interpreter), the `uop_inst` instructions generate normal C code. A separate uop interpreter for the tier 2 interpreter will be generated from the very same instruction definition. This allows us to not duplicate definitions. We also auto-generate a mapping from normal instructions to uop instructions. At runtime, the tracing interpreter will look up the translation in said mapping and "lower" the normal instructions to uop instructions.

How does this experiment sound?