| 1 | +# Adding or extending a family of adaptive instructions. |
| 2 | + |
| 3 | +## Families of instructions |
| 4 | + |
| 5 | +The core part of PEP 659 (specializing adaptive interpreter) is the families |
| 6 | +of instructions that perform the adaptive specialization. |
| 7 | + |
| 8 | +A family of instructions has the following fundamental properties: |
| 9 | + |
| 10 | +* It corresponds to a single instruction in the code |
| 11 | + generated by the bytecode compiler. |
| 12 | +* It has a single adaptive instruction that records an execution count and, |
| 13 | + at regular intervals, attempts to specialize itself. If not specializing, |
| 14 | + it executes the non-adaptive instruction. |
| 15 | +* It has at least one specialized form of the instruction that is tailored |
| 16 | + for a particular value or set of values at runtime. |
| 17 | +* All members of the family have access to the same number of cache entries. |
| 18 | + Individual family members do not need to use all of the entries. |
| 19 | + |
| 20 | +The current implementation also requires the following, |
| 21 | +although these are not fundamental and may change: |
| 22 | + |
| 23 | +* If a family uses one or more entries, then the first entry must be a |
| 24 | + `_PyAdaptiveEntry` entry. |
| 25 | +* If a family uses no cache entries, then the `oparg` is used as the |
| 26 | + counter for the adaptive instruction. |
| 27 | +* All instruction names should start with the name of the non-adaptive |
| 28 | + instruction. |
| 29 | +* The adaptive instruction should end in `_ADAPTIVE`. |
| 30 | +* Specialized forms should have names describing their specialization. |
| 31 | + |
| 32 | +## Example family |
| 33 | + |
| 34 | +The `LOAD_GLOBAL` instruction (in Python/ceval.c) already has an adaptive |
| 35 | +family that serves as a relatively simple example. |
| 36 | + |
| 37 | +The `LOAD_GLOBAL_ADAPTIVE` instruction performs adaptive specialization, |
| 38 | +calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero. |
| 39 | + |
| 40 | +There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`, |
| 41 | +which is specialized for global variables in the module, and |
| 42 | +`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables. |
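The counter-driven behaviour of the adaptive instruction can be illustrated with a toy Python model. This is only a sketch: `InstructionSite`, `specialize_load_global`, and the interval of 3 are hypothetical names and values chosen for illustration, and the real logic lives in C in the interpreter loop and in `_Py_Specialize_LoadGlobal()`.

```python
ADAPTIVE = "LOAD_GLOBAL_ADAPTIVE"
MODULE = "LOAD_GLOBAL_MODULE"
BUILTIN = "LOAD_GLOBAL_BUILTIN"


def specialize_load_global(name, module_globals, builtins):
    """Hypothetical stand-in for the specialization function: pick a family member."""
    if name in module_globals:
        return MODULE
    if name in builtins:
        return BUILTIN
    return ADAPTIVE  # nothing suitable, so stay adaptive


class InstructionSite:
    """Toy model of one LOAD_GLOBAL site; `interval` is an arbitrary value."""

    def __init__(self, interval=3):
        self.interval = interval
        self.counter = interval
        self.form = ADAPTIVE

    def run_adaptive(self, name, module_globals, builtins):
        if self.counter == 0:
            # Counter expired: attempt to specialize, then restart the count.
            self.form = specialize_load_global(name, module_globals, builtins)
            self.counter = self.interval
        else:
            self.counter -= 1
        # In this toy model the generic lookup always produces the value.
        if name in module_globals:
            return module_globals[name]
        return builtins[name]


site = InstructionSite(interval=3)
module_globals = {"x": 1}
builtins = {"len": len}

executions = 0
while site.form == ADAPTIVE:
    site.run_adaptive("len", module_globals, builtins)
    executions += 1
```

In this model the site counts down on each execution and only attempts specialization once the counter reaches zero, after which it would run as `LOAD_GLOBAL_BUILTIN`.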
| 43 | + |
| 44 | +## Performance analysis |
| 45 | + |
| 46 | +The benefit of a specialization can be assessed with the following formula: |
| 47 | +`Tbase/Tadaptive`. |
| 48 | + |
| 49 | +Here, `Tbase` is the mean time to execute the base instruction, |
| 50 | +and `Tadaptive` is the mean time to execute the specialized and adaptive forms. |
| 51 | + |
| 52 | +`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)` |
| 53 | + |
| 54 | +`Ti` is the time to execute the `i`th instruction in the family and `Ni` is |
| 55 | +the number of times that instruction is executed. |
| 56 | +`Tmiss` is the time to process a miss, including de-optimization |
| 57 | +and the time to execute the base instruction. |
| 58 | + |
| 59 | +The ideal situation is where misses are rare and the specialized |
| 60 | +forms are much faster than the base instruction. |
| 61 | +`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`, |
| 62 | +in which case `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`. |
| 63 | +Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and |
| 64 | +`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction, |
| 65 | +we would expect the specialization of `LOAD_GLOBAL` to be profitable. |
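As a rough worked example, the formula for `Tadaptive` and the ratio `Tbase/Tadaptive` can be evaluated directly. The function names and all timings and counts below are made up for illustration and are not measurements of any real instruction.

```python
def t_adaptive(forms, t_miss, n_miss):
    """Mean time per execution across a family.

    forms: list of (Ti, Ni) pairs, one per specialized or adaptive form.
    t_miss, n_miss: cost of a miss and the number of misses.
    """
    total_time = sum(t * n for t, n in forms) + t_miss * n_miss
    total_count = sum(n for _, n in forms) + n_miss
    return total_time / total_count


def benefit(t_base, forms, t_miss, n_miss):
    """The ratio Tbase/Tadaptive; values above 1 mean specialization pays off."""
    return t_base / t_adaptive(forms, t_miss, n_miss)


# Hypothetical numbers: two specialized forms that hit often, and rare misses.
forms = [(2.0, 900), (3.0, 90)]
mean_time = t_adaptive(forms, t_miss=25.0, n_miss=10)
speedup = benefit(10.0, forms, t_miss=25.0, n_miss=10)
```

With these invented inputs the misses are expensive but rare, so the mean time stays close to the cost of the specialized forms.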
| 66 | + |
| 67 | +## Design considerations |
| 68 | + |
| 69 | +While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and |
| 70 | +`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti` |
| 71 | +low for all specialized instructions and `Nmiss` as low as possible. |
| 72 | + |
| 73 | +Keeping `Nmiss` low means that there should be specializations for almost |
| 74 | +all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means |
| 75 | +keeping `Ti` low, which means minimizing branches and dependent memory |
| 76 | +accesses (pointer chasing). These two objectives may be in conflict, |
| 77 | +requiring judgement and experimentation to design the family of instructions. |
| 78 | + |
| 79 | +### Gathering data |
| 80 | + |
| 81 | +Before choosing how to specialize an instruction, it is important to gather |
| 82 | +some data. What are the patterns of usage of the base instruction? |
| 83 | +Data can best be gathered by instrumenting the interpreter. Since a |
| 84 | +specialization function and adaptive instruction are going to be required, |
| 85 | +instrumentation can most easily be added in the specialization function. |
| 86 | + |
| 87 | +### Choice of specializations |
| 88 | + |
| 89 | +The performance of the specializing adaptive interpreter relies on the |
| 90 | +quality of specialization and keeping the overhead of specialization low. |
| 91 | + |
| 92 | +Specialized instructions must be fast. In order to be fast, |
| 93 | +specialized instructions should be tailored for a particular |
| 94 | +set of values that allows them to: |
| 95 | +1. Verify that the incoming value is part of that set with low overhead. |
| 96 | +2. Perform the operation quickly. |
| 97 | + |
| 98 | +This requires that the set of values is chosen such that membership can be |
| 99 | +tested quickly and that membership is sufficient to allow the operation to |
| 100 | +be performed quickly. |
| 101 | + |
| 102 | +For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()` |
| 103 | +dictionaries that have keys with the expected version. |
| 104 | + |
| 105 | +This can be tested quickly: |
| 106 | +* `globals->keys->dk_version == expected_version` |
| 107 | + |
| 108 | +and the operation can be performed quickly: |
| 109 | +* `value = globals->keys->entries[index].value`. |
| 110 | + |
| 111 | +Because it is impossible to measure the performance of an instruction without |
| 112 | +also measuring unrelated factors, the assessment of the quality of a |
| 113 | +specialization will require some judgement. |
| 114 | + |
| 115 | +As a general rule, specialized instructions should be much faster than the |
| 116 | +base instruction. |
| 117 | + |
| 118 | +### Implementation of specialized instructions |
| 119 | + |
| 120 | +In general, specialized instructions should be implemented in two parts: |
| 121 | +1. A sequence of guards, each of the form |
| 122 | + `DEOPT_IF(guard-condition-is-false, BASE_NAME)`, |
| 123 | + followed by a `record_cache_hit()`. |
| 124 | +2. The operation, which should ideally have no branches and |
| 125 | + a minimum number of dependent memory accesses. |
| 126 | + |
| 127 | +In practice, the parts may overlap, as data required for guards |
| 128 | +can be re-used in the operation. |
| 129 | + |
| 130 | +If there are branches in the operation, then consider further specialization |
| 131 | +to eliminate the branches. |
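The guard-then-operation shape can be sketched in Python as a toy model of `LOAD_GLOBAL_MODULE`. Everything here is hypothetical: `ToyKeys`, `Deoptimize`, and the `stats` dictionary only stand in for the keys object, `DEOPT_IF`, and `record_cache_hit()`; the real instruction is written in C.

```python
class Deoptimize(Exception):
    """Signals that a guard failed and the base instruction should run."""


class ToyKeys:
    def __init__(self, version, entries):
        self.version = version   # plays the role of dk_version
        self.entries = entries   # values stored by slot index


def load_global_module(keys, expected_version, index, stats):
    # Part 1: guards. A failed guard hands control back to the base instruction.
    if keys.version != expected_version:
        raise Deoptimize("LOAD_GLOBAL")
    stats["hits"] += 1           # analogue of record_cache_hit()
    # Part 2: the operation. A single indexed load with no further branches.
    return keys.entries[index]


stats = {"hits": 0}
keys = ToyKeys(version=7, entries=["spam", "eggs"])

# Guard passes: the cached index is used directly.
value = load_global_module(keys, expected_version=7, index=1, stats=stats)

# Guard fails: the version no longer matches, so the instruction de-optimizes.
try:
    load_global_module(keys, expected_version=8, index=1, stats=stats)
    deopted = False
except Deoptimize:
    deopted = True
```

Note how the version read by the guard comes from the same object the operation then indexes into, which is the kind of overlap between the two parts described above.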