Interpreter for micro-ops #541
I'm not sure if I completely follow. It looks like you are proposing pretty fine-grained uops. Does the tier 2 interpreter have a case (and hence dispatch overhead) for each uop? It sounds like you'll be doing a lot of work on generate_cases.py. That's fine, and I can help explain mysteries there, but I have to warn that I have plans to continue working on that file as well, so we might end up with merge conflicts. To minimize those you could try to merge certain changes back into the main branch, but that would be a distraction for you and your groupmate.

Yes. The hope is that the optimizations the tier 2 interpreter does on the uops make the dispatch overhead worth it.

Aha, I believe that was also the plan of #454 (or at least the conclusion), and why we got serious about generating the interpreter. I think that it would still be good to make the uops as large as possible, else the interpretation overhead will kill you.
What @gvanrossum says. You will want something like:

```c
inst(LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK) {
    PyObject *owner = TOP();
    PyTypeObject *tp = Py_TYPE(owner);
    _PyAttrCache *cache = (_PyAttrCache *)next_instr;
    uint32_t type_version = read_u32(cache->version);
    assert(type_version != 0);
    DEOPT_IF(tp->tp_version_tag != type_version, LOAD_ATTR);
    assert(tp->tp_dictoffset < 0);
    assert(tp->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}

inst(LOAD_ATTR_INSTANCE_VALUE_REST) {
    PyObject *owner = TOP();
    PyObject *res;
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    /* note: cache is set up by the preceding uop */
    res = _PyDictOrValues_GetValues(dorv)->values[cache->index];
    DEOPT_IF(res == NULL, LOAD_ATTR);
    STAT_INC(LOAD_ATTR, hit);
    Py_INCREF(res);
    SET_TOP(NULL);
    STACK_GROW((oparg & 1));
    SET_TOP(res);
    Py_DECREF(owner);
    JUMPBY(INLINE_CACHE_ENTRIES_LOAD_ATTR);
}

LOAD_ATTR_INSTANCE_VALUE = LOAD_ATTR_INSTANCE_VALUE_TYPE_CHECK + LOAD_ATTR_INSTANCE_VALUE_REST;
```

The reason for this split is that type inference on the trace can potentially eliminate the type check, but the checks on the instance have to be done. I wouldn't attempt to eliminate the type checks in your project, but you could usefully count them.
OK. Should I upstream this to CPython? It seems not very intrusive, and we would need it eventually anyway. What I'm thinking of is:

The interpreter generator will then generate a normal instruction for the tier 1 interpreter, while combining PRELUDE + TYPE_CHECK and PRELUDE + REST to form two micro-ops.
I'm not sure how you plan to do that. Transferring data from one uop to the next is possible, but not yet very ergonomic. Have you looked through the code generator yet? There used to be an example of how to do this, but Mark got rid of it when he introduced COMPARE_AND_BRANCH. I could update test_generator.py to include an example though.
Have a look at this: gvanrossum/cpython@b60761d |
So after reading your comments in gh-101299 now I finally understand that you mean this as a proposal for new syntax in bytecodes.c. It looks like your convention is to have some labels in the code block of an instruction, perhaps using a naming convention, that identify individual micro-ops (uops). Syntactically that is a fine convention. But it doesn't allow us to reuse uops easily, does it? The current DSL syntax allows you to do things like …
Here, the uops can only pass information between them using the stack, but the generator avoids actual push/pop calls -- it just generates temporary C variables (see gh-101299) and I assume the C compiler will optimize that a bit more. (You can also specify a type for such temporaries, and the generator understands this.) I agree that my approach is more awkward when you have multiple pieces of data that are shared between uops, but your version makes the sharing invisible to the generator (which doesn't know, e.g., that …).

Happy to see you elaborate your proposal. (If we need to parse C more precisely, I have code written for that in reserve. :-)
I see your point about reusing uops. I think the best of both worlds would be some combination of the two ideas. So something like
The main problem with having a shared prelude though is that we need to standardise the names of the common variables in all the uops. I don't know how feasible that would be. The other problem is that this would mean UOPs are no longer standalone, and would create a hard dependency between code that is in the prelude, and code that is in the uop. This would hurt the composability of our uops. Unless of course, we specify some sort of prelude dependency for every uop and expose that to the case generator. Then we could feasibly auto-generate these. |
That LOAD_ATTR in particular requires a bit of new DSL syntax that I'm currently working on, so we can define it as
That looks awkward because the … When the instruction definition specifies I/O effects, a prologue (prelude) is auto-generated, which takes care of the pushes at the beginning and (in the epilogue) the pops at the end, including the necessary variable definitions. For examples, just look through generated_cases.c.h for some opcodes that use the new format, e.g. BINARY_SUBSCR. It would seem attractive to make this less complicated, but I worry about where UOP1() and UOP2() from your example are defined (are they just …?)
Yeah, UOP1 and UOP2 are just macros. They expand to the following: …

Yes. They should indeed be defined like that.
The code generator can almost handle micro-ops now:

```c
op(_BINARY_OP_FLOAT_CHECK, (left, right -- a, b)) {
    DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    STAT_INC(BINARY_OP, hit);
}

op(_BINARY_OP_ADD_FLOAT_ACTION, (unused/1, a, b -- sum)) {
    double dsum = ((PyFloatObject *)a)->ob_fval +
                  ((PyFloatObject *)b)->ob_fval;
    sum = PyFloat_FromDouble(dsum);
    _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
    _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
    ERROR_IF(sum == NULL, error);
}

super(BINARY_OP_ADD_FLOAT) = _BINARY_OP_FLOAT_CHECK + _BINARY_OP_ADD_FLOAT_ACTION;
```

But this generates almost correct code (it has an extra …):

```c
TARGET(BINARY_OP_ADD_FLOAT) {
    PyObject *_tmp_1 = PEEK(1);
    PyObject *_tmp_2 = PEEK(2);
    {
        PyObject *right = _tmp_1;
        PyObject *left = _tmp_2;
        PyObject *a;
        PyObject *b;
        assert(cframe.use_tracing == 0);
        DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
        DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
        STAT_INC(BINARY_OP, hit);
        _tmp_2 = a;
        _tmp_1 = b;
    }
    NEXTOPARG();
    JUMPBY(1);
    {
        PyObject *b = _tmp_1;
        PyObject *a = _tmp_2;
        PyObject *sum;
        assert(cframe.use_tracing == 0);
        double dsum = ((PyFloatObject *)a)->ob_fval +
                      ((PyFloatObject *)b)->ob_fval;
        sum = PyFloat_FromDouble(dsum);
        _Py_DECREF_SPECIALIZED(b, _PyFloat_ExactDealloc);
        _Py_DECREF_SPECIALIZED(a, _PyFloat_ExactDealloc);
        if (sum == NULL) goto pop_2_error;
        _tmp_2 = sum;
    }
    JUMPBY(1);
    STACK_SHRINK(1);
    POKE(1, _tmp_2);
    DISPATCH();
}
```

To get this working we would need: …
Use …
Hm, that would require the tier 1 generator (the only one we currently have :-) to generate a macro in its output from an …

… would be translated into …

Honestly that feels pretty fragile, though for a quick prototype it should do.
That doesn't work either. I created an issue for it: python/cpython#101369
OK. I've hacked around and this is what I have: look for the two micro instructions "BINARY_OP_ADD_INT_TYPE_CHECK" and "BINARY_OP_ADD_INST_REST" (typo). In the tier 2 interpreter (in …), … At runtime, when we want to switch between the interpreters, we just swap out the …
Hi Ken Jin, I'm curious why you decided to go with new syntax, e.g. …, and use …
No particular reason. It's not upstream-ready code. Just a quick experiment. I'll update you on what we've done during our meeting later :).
Closing this issue. Will open a superseding one with more concrete plans once my experiment concludes. |
Continuing from #375.
The granularity of micro ops has not been decided yet, and is being deliberated over at #454.
Over the next 3 months, we plan to experiment with different trace types (short, long) for the tracing interpreter for micro-ops, and compare them. This plan came out of some discussion with Mark; we intend to experiment to gain some insights.
My current plan:
Instead of generating uops for the level 1 interpreter to interpret, we could generate uops on the fly and only when we generate the traces. This would save us from the dispatch overhead of uops in the standard interpreter, and would allow the optimisations in the tracing/level 2 interpreter to make up for the dispatch overhead.
With the interpreter being generated, we could automate the uop generation and definition.
E.g. `LOAD_ATTR_METHOD_WITH_VALUES`. Then for normal `ceval.c` (the tier 1 interpreter), the `uop_inst` instructions generate normal C code. A separate uop interpreter for the tier 2 interpreter will be generated from the very same instruction definition. This allows us to not duplicate definitions. We also auto-generate a mapping from normal instructions to uop instructions. At runtime, the tracing interpreter will look up the translation in said mapping and "lower" the normal instructions to uop instructions.

How does this experiment sound?