
[profiling] Add profiling instrumentation centered on the eval loop #25


Closed
ericsnowcurrently opened this issue Mar 23, 2021 · 13 comments

@ericsnowcurrently (Collaborator)

Here are the key boundaries (sketched in code after the list):

  • outside _PyEval_EvalFrameDefault()
  • between start of _PyEval_EvalFrameDefault() and top of loop (setting up)
  • between top of loop and dispatch
  • handling an opcode
  • between dispatch and bottom of loop
  • between bottom of loop and returning (cleaning up)
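For concreteness, here's a self-contained C sketch of probes at those boundaries. Everything in it is hypothetical: eval_frame() is a toy stand-in for _PyEval_EvalFrameDefault(), and trace_event() and the event IDs are invented for illustration, not the names or encoding used in the branch discussed below.

/* Hypothetical sketch: where timing probes sit relative to the eval loop. */
#include <stdio.h>
#include <time.h>

enum {
    EV_ENTER,        /* start of eval_frame(): setting up */
    EV_LOOP_TOP,     /* top of the dispatch loop */
    EV_DISPATCH,     /* about to dispatch an opcode */
    EV_OP,           /* handling an opcode */
    EV_LOOP_BOTTOM,  /* bottom of the loop */
    EV_EXIT,         /* cleaning up and returning */
};

/* Log "<timestamp> <event-id> [<arg>]", in the same spirit as the
 * minimal trace format shown later in this thread. */
static void
trace_event(int id, int arg)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    if (arg < 0) {
        fprintf(stderr, "%ld.%09ld %d\n", (long)ts.tv_sec, ts.tv_nsec, id);
    }
    else {
        fprintf(stderr, "%ld.%09ld %d %d\n",
                (long)ts.tv_sec, ts.tv_nsec, id, arg);
    }
}

/* Toy stand-in for _PyEval_EvalFrameDefault(). */
static void
eval_frame(const int *code, int ncodes)
{
    trace_event(EV_ENTER, -1);                /* setting up */
    for (int i = 0; i < ncodes; i++) {
        trace_event(EV_LOOP_TOP, -1);         /* top of loop */
        trace_event(EV_DISPATCH, -1);         /* dispatch */
        trace_event(EV_OP, code[i]);          /* opcode handled */
        trace_event(EV_LOOP_BOTTOM, -1);      /* bottom of loop */
    }
    trace_event(EV_EXIT, -1);                 /* cleaning up */
}

int
main(void)
{
    int code[] = {100, 101, 124};             /* pretend opcodes */
    eval_frame(code, sizeof(code) / sizeof(code[0]));
    return 0;
}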
@markshannon (Member)

IMO, this should mostly be the job of the interpreter generator (see #5).

Adding options to the interpreter generator to insert the instrumentation will be a lot more pleasant than dealing with C macros.

@ericsnowcurrently (Collaborator, Author)

Setting aside how we instrument the code, are there any other significant boundaries we should consider?

@ericsnowcurrently changed the title from "Add profiling instrumentation centered on the eval loop" to "[profiling] Add profiling instrumentation centered on the eval loop" on Mar 24, 2021
@ericsnowcurrently self-assigned this on Mar 26, 2021
@ericsnowcurrently (Collaborator, Author)

As an exercise to better familiarize myself with the knowledge domain, I spent a little time adding some minimal instrumentation (and logging) to the eval loop, as well as a script to analyze the results. See my eval-loop-profiling branch.

The instrumentation generates a log file with a minimal format:
# python: /home/esnow/projects/work/cpython-perf/cpython/python
# argv: ./python -E ./setup.py build
# start time: 1617222445 s (since epoch)
# start clock: 17137493.663947837 s (on clock)
# end clock: 17137494.467647860 s (on clock)

17137493.664739837 0
# func: <module>
17137493.666522537 3
17137493.666523537 5
17137493.666523837 6
17137493.666524037 7
17137493.666524237 8
17137493.666524437 9 100
# func: _DeadlockError
17137493.666531637 3
17137493.666531937 5
17137493.666532137 6
17137493.666532337 7
17137493.666532437 8
17137493.666532637 9 101
17137493.666533537 13
# func: _DeadlockError
17137493.666533937 4
17137493.666534137 2
# func: _ModuleLock
17137493.666562437 3
...
17137494.445492862 13
# func: close
17137494.445493162 4
17137494.445493362 2
# func: release
17137494.445493862 3
17137494.445494062 5
17137494.445494162 6
17137494.445494362 7
17137494.445494562 8
17137494.445494662 9 124
17137494.445495162 13
# func: release
17137494.445495462 4
17137494.445495662 2
17137494.445496162 13
# func: shutdown
17137494.445496462 4
17137494.445496562 2
17137494.467615160 1

For reference, a Python script that has a total runtime of about a second generates roughly 16MB of trace data.
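
For concreteness, here's a rough C sketch of how a consumer of this format might turn the raw trace into per-event durations: skip the "#" header and "# func:" lines, then attribute the gap between consecutive timestamps to the earlier event. The filename and the attribution rule are assumptions; the analysis script described next may work differently.

#include <stdio.h>

int
main(void)
{
    /* Hypothetical filename; the branch writes eval_loop-<start>.trace. */
    FILE *fp = fopen("eval_loop.trace", "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    double prev_t = -1.0;
    int prev_id = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        double t;
        int id;
        /* Skip "# ..." headers and "# func: ..." markers. */
        if (line[0] == '#' || sscanf(line, "%lf %d", &t, &id) != 2) {
            continue;
        }
        if (prev_t >= 0.0) {
            /* Attribute the gap to the earlier event (an assumption). */
            printf("event %2d: %8.1f us\n", prev_id, (t - prev_t) * 1e6);
        }
        prev_t = t;
        prev_id = id;
    }
    fclose(fp);
    return 0;
}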

By default, the analysis script makes the log more human-friendly:
$ python3 Tools/scripts/analyze_trace.py
reading from ./eval_loop-1617222445.trace
(showing results in 'simple-indent' format)

# python:         /home/esnow/projects/work/cpython-perf/cpython/python
# argv:           ./python -E ./setup.py build
# start:          2021-03-31 20:27:25.663947 UTC
# end:            2021-03-31 20:27:26.467647 UTC
# elapsed:        803,700.0 µs

####################
# BEGIN TRACE
####################

1,782.7 µs      init
    1.0 µs      enter                                             # func: <module>
    0.3 µs      |  loop enter
    0.2 µs      |  loop slow
    0.2 µs      |  loop fast
    0.2 µs      |  dispatch
    7.2 µs      |  op                     LOAD_CONST (100)
    0.3 µs      |  enter                                          # func: _DeadlockError
    0.2 µs      |  |  loop enter
    0.2 µs      |  |  loop slow
    0.1 µs      |  |  loop fast
    0.2 µs      |  |  dispatch
    0.9 µs      |  |  op                  LOAD_NAME (101)
    0.4 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: _DeadlockError
   28.3 µs      |  runtime other
    0.2 µs      |  enter                                          # func: _ModuleLock
...
    0.3 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: close
    0.5 µs      |  runtime other
    0.2 µs      |  enter                                          # func: release
    0.1 µs      |  |  loop enter
    0.2 µs      |  |  loop slow
    0.2 µs      |  |  loop fast
    0.1 µs      |  |  dispatch
    0.5 µs      |  |  op                  LOAD_FAST (124)
    0.3 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: release
    0.5 µs      |  runtime other
    0.3 µs      |  loop exit
    0.1 µs      exit                                              # func: shutdown
22,118.5 µs     runtime other
                fini

####################
# END TRACE
####################
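
The indentation in that view can be derived purely from the enter/exit events. Here's a hypothetical C sketch of the bookkeeping, assuming each "enter" opens a nesting level and each "exit" closes one (it mirrors the output above but is not the analysis script's actual code):

#include <stdio.h>
#include <string.h>

static int depth = 0;

/* Print an event name at the current nesting level, "|  " per level. */
static void
print_event(const char *name)
{
    for (int i = 0; i < depth; i++) {
        fputs("|  ", stdout);
    }
    puts(name);
}

static void
handle_event(const char *name)
{
    /* An "exit" line prints at the outer level, so pop first. */
    if (strcmp(name, "exit") == 0 && depth > 0) {
        depth--;
    }
    print_event(name);
    if (strcmp(name, "enter") == 0) {
        depth++;
    }
}

int
main(void)
{
    /* A tiny event stream shaped like the trace above. */
    const char *events[] = {
        "enter", "loop enter", "dispatch", "op",
        "enter", "loop enter", "op", "loop exit", "exit",
        "loop exit", "exit",
    };
    for (size_t i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
        handle_event(events[i]);
    }
    return 0;
}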

The script can also provide a summary:

Sorted by Count
$ python3 Tools/scripts/analyze_trace.py --format summary --sort count
reading from ./eval_loop-1617222445.trace
(showing results in 'summary' format)

python:    /home/esnow/projects/work/cpython-perf/cpython/python
argv:      ./python -E ./setup.py build
start:     2021-03-31 20:27:25.663947 UTC
end:       2021-03-31 20:27:26.467647 UTC
elapsed:   803,700.0 µs

events:

           name                count      mean elapsed
--------------------------- ------------ --------------
 dispatch                        94,807         0.1 µs
 op                              94,807         3.4 µs
 loop slow                       61,684         0.2 µs
 loop fast                       61,684         0.1 µs
 enter                           60,382         0.2 µs
 loop exit                       60,382         0.4 µs
 exit                            60,382         0.2 µs
 runtime other                   60,382         5.5 µs
 loop enter                      60,345         0.2 µs
 loop exception                   1,832         0.4 µs
 loop error                       1,821         0.4 µs
 loop exiting                       493         0.2 µs
 init                                 1     1,782.7 µs
 fini                                 1         0.0 µs
--------------------------- ------------ --------------
 total: 14/14

ops:

           name                count      mean elapsed
--------------------------- ------------ --------------
 LOAD_FAST                       24,094         2.0 µs
 POP_JUMP_IF_FALSE               21,937         0.5 µs
 LOAD_GLOBAL                     21,223         3.0 µs
 SETUP_FINALLY                    8,242        21.0 µs
 JUMP_ABSOLUTE                    5,117         0.6 µs
 LOAD_CONST                       2,841         1.6 µs
 BUILD_LIST                       2,640         1.1 µs
 POP_JUMP_IF_TRUE                 2,480         1.4 µs
 FOR_ITER                         1,997         0.7 µs
 POP_TOP                          1,545         0.4 µs
 DUP_TOP                          1,328         1.7 µs
 LOAD_NAME                          550         6.1 µs
 JUMP_FORWARD                       537         0.5 µs
 LOAD_DEREF                          56         1.5 µs
 JUMP_IF_NOT_EXC_MATCH               50         2.7 µs
 JUMP_IF_FALSE_OR_POP                50         1.1 µs
 JUMP_IF_TRUE_OR_POP                 35         2.6 µs
 LOAD_CLOSURE                        25         1.1 µs
 NOP                                 21         0.9 µs
 WITH_EXCEPT_START                   11         0.9 µs
 BUILD_MAP                           11         7.0 µs
 BUILD_SET                            5         4.9 µs
 YIELD_FROM                           4         0.4 µs
 EXTENDED_ARG                         4         0.2 µs
 BUILD_CONST_KEY_MAP                  3        21.4 µs
 SETUP_WITH                           1         9.7 µs
--------------------------- ------------ --------------
 total: 26/125

funcs: 1,025 called (by "simple" name)
Sorted by Average Elapsed Time
$ python3 Tools/scripts/analyze_trace.py --format summary --sort elapsed
reading from ./eval_loop-1617222445.trace
(showing results in 'summary' format)

python:    /home/esnow/projects/work/cpython-perf/cpython/python
argv:      ./python -E ./setup.py build
start:     2021-03-31 20:27:25.663947 UTC
end:       2021-03-31 20:27:26.467647 UTC
elapsed:   803,700.0 µs

events:

           name                count      mean elapsed
--------------------------- ------------ --------------
 init                                 1     1,782.7 µs
 runtime other                   60,382         5.5 µs
 op                              94,807         3.4 µs
 loop error                       1,821         0.4 µs
 loop exit                       60,382         0.4 µs
 loop exception                   1,832         0.4 µs
 enter                           60,382         0.2 µs
 loop exiting                       493         0.2 µs
 exit                            60,382         0.2 µs
 loop enter                      60,345         0.2 µs
 loop slow                       61,684         0.2 µs
 dispatch                        94,807         0.1 µs
 loop fast                       61,684         0.1 µs
 fini                                 1         0.0 µs
--------------------------- ------------ --------------
 total: 14/14

ops:

           name                count      mean elapsed
--------------------------- ------------ --------------
 BUILD_CONST_KEY_MAP                  3        21.4 µs
 SETUP_FINALLY                    8,242        21.0 µs
 SETUP_WITH                           1         9.7 µs
 BUILD_MAP                           11         7.0 µs
 LOAD_NAME                          550         6.1 µs
 BUILD_SET                            5         4.9 µs
 LOAD_GLOBAL                     21,223         3.0 µs
 JUMP_IF_NOT_EXC_MATCH               50         2.7 µs
 JUMP_IF_TRUE_OR_POP                 35         2.6 µs
 LOAD_FAST                       24,094         2.0 µs
 DUP_TOP                          1,328         1.7 µs
 LOAD_CONST                       2,841         1.6 µs
 LOAD_DEREF                          56         1.5 µs
 POP_JUMP_IF_TRUE                 2,480         1.4 µs
 BUILD_LIST                       2,640         1.1 µs
 LOAD_CLOSURE                        25         1.1 µs
 JUMP_IF_FALSE_OR_POP                50         1.1 µs
 WITH_EXCEPT_START                   11         0.9 µs
 NOP                                 21         0.9 µs
 FOR_ITER                         1,997         0.7 µs
 JUMP_ABSOLUTE                    5,117         0.6 µs
 POP_JUMP_IF_FALSE               21,937         0.5 µs
 JUMP_FORWARD                       537         0.5 µs
 YIELD_FROM                           4         0.4 µs
 POP_TOP                          1,545         0.4 µs
 EXTENDED_ARG                         4         0.2 µs
--------------------------- ------------ --------------
 total: 26/125

funcs: 1,025 called (by "simple" name)

@ericsnowcurrently (Collaborator, Author)

This effort proved to be quite insightful for me. This isn't something to pursue further right now but will be worth revisiting later when we've added code generation for the eval loop.

@markshannon (Member)

I think this is worth revisiting.
In order to direct our efforts, we need to know where the virtual machine is spending its time.

This is relevant to both startup and general execution.

The text summary sounds useful; could we also get flamegraphs? I think they are a great way to visualize where time is being spent.

We should also generate results using perf or a similar tool, to cross-check that we are instrumenting the right things.

@ericsnowcurrently (Collaborator, Author) commented Jul 23, 2021

FWIW, I generated a flamegraph for ./python -c 'pass':

https://raw.githubusercontent.com/faster-cpython/ideas/main/flamegraphs/vm-freq999.svg

If you download the file locally you can interact with it in your browser.


The commands (after installing dependencies) look something like this:

# one-time setup:
# git clone https://github.com/brendangregg/FlameGraph ../FlameGraph

# Save the current perf limits, then raise them for a high-frequency,
# deep-stack capture.
sample_rate_orig=$(cat /proc/sys/kernel/perf_event_max_sample_rate)
max_stack_orig=$(cat /proc/sys/kernel/perf_event_max_stack)
sudo sysctl -w kernel.perf_event_max_sample_rate=100000
sudo sysctl -w kernel.perf_event_max_stack=1000

# Record a system-wide profile while running the target command, then
# fold the stacks and render the SVG.
sudo perf record -F max -a -g -- ./python -c 'pass'
sudo perf script > out.perf
../FlameGraph/stackcollapse-perf.pl out.perf > out.folded
../FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Restore the original limits.
sudo sysctl -w kernel.perf_event_max_sample_rate=$sample_rate_orig
sudo sysctl -w kernel.perf_event_max_stack=$max_stack_orig

See https://github.com/faster-cpython/tools/tree/main/scripts/run-perf.sh for the full operation, including installing dependencies.

@gvanrossum (Collaborator) commented Jul 24, 2021 via email

@sweeneyde

I ran what amounts to PyPerformance's bench_expand, bench_integrate, bench_sum, and bench_str benchmarks in a loop for about a minute with a PGO ("PGUpdate") build, and collected results with Visual Studio's Performance Profiler.

The hottest 15 functions:

Function Name                        Total CPU [unit, %]   Self CPU [unit, %]
----------------------------------- --------------------- --------------------
_PyEval_EvalFrameDefault             58793 (99.79%)        15770 (26.77%)
call_function                        58793 (99.79%)         4045 (6.87%)
_Py_dict_lookup                       3262 (5.54%)          3019 (5.12%)
initialize_locals                     2329 (3.95%)          1468 (2.49%)
_PyUnicode_FromId                     1422 (2.41%)          1422 (2.41%)
_PyFunction_Vectorcall               58793 (99.79%)         1299 (2.20%)
_PyObject_GenericGetAttrWithDict     24025 (40.78%)         1262 (2.14%)
_PyObject_Malloc                      1314 (2.23%)          1219 (2.07%)
[External Call] vcruntime140.dll      1076 (1.83%)          1076 (1.83%)
_Py_DECREF                             994 (1.69%)           994 (1.69%)
_PyObject_GetMethod                   1521 (2.58%)           878 (1.49%)
tupledealloc                           996 (1.69%)           878 (1.49%)
frame_dealloc                          881 (1.50%)           726 (1.23%)
PyObject_GetAttr                     25021 (42.47%)          711 (1.21%)
method_vectorcall                    58513 (99.31%)          678 (1.15%)

I also got a line-by-line profile of _PyEval_EvalFrameDefault, which VS shows like this:
[screenshot: VS line-by-line profile of _PyEval_EvalFrameDefault]

I couldn't find a way to make VS dump the output as text, but here are 17 (low-quality, but at least mostly legible) screenshots of the results for the whole function:
https://imgur.com/a/WEHwFkr

@ericsnowcurrently (Collaborator, Author)

> Is there somewhere that describes what this particular flamegraph means? (I've seen explanations of flamegraphs in general, but I'm not familiar with the tool or methodology used to generate this particular one.)

Flamegraphs are relatively new to me. After a little digging, I found a good resource: https://www.brendangregg.com/flamegraphs.html. This pointed me to https://github.com/brendangregg/FlameGraph, which is what I used to produce the flame graph. As noted above, it involved generating data in a specific way using the Linux "perf" tool and then passing that data through some scripts to produce the SVG file.

(Note that the SVG is better/interactive if you download it and then open the downloaded copy in your browser.)

As to what it means (from the web page):

> The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry. The colors are usually not significant, picked randomly to differentiate frames.

@ericsnowcurrently (Collaborator, Author) commented Jul 26, 2021

I just uploaded better flame graphs (10 kHz and 100 kHz), which I generated:

  • on our bare-metal benchmarking machine (instead of a VM)
  • with a non-debug build

@ericsnowcurrently (Collaborator, Author) commented Jul 26, 2021

Observations from the flame graphs:

  • "perf" itself should be ignored, so all the other percentages are actually proportionally higher
  • the "swapper" (CPU idle) takes a decent chunk of time - 23%/38% on bare metal and 87% on VM
    • presumably this is the amount of time we are blocking on IO (reading from disk)
  • a bunch of our functions are on the call stack during times that "main" is not, for unclear reasons
    • for example with 10khz, while "main" has 153 samples (37%), other CPython functions (that should have "main" in their stack) have 73 samples (17%)
    • I suspect there is a stack depth limit we're hitting, which makes it look like some calls are at the base of the stack when they actually aren't
    • [update] I figured out that there's a kernel setting for this that was set to 127 (using 1000 fixed it)
  • at the highest level, we can break "main()" into two parts: "pymain_init()" and "PyRun_Main()"
    • for ./python -c pass we spend much more time in the former (42%) than the latter (7%)
    • "pymain_init()" breaks down into "pyinit_core()" (7%) and "pyinit_main()" (34%)

@ericsnowcurrently (Collaborator, Author) commented Jul 27, 2021

I've opened #75 to dive into runtime startup/shutdown profiling. I also created #76 to do the same for our target workloads.

@ericsnowcurrently (Collaborator, Author)

We can circle back later (if we need it) to the specific instrumentation described in the original post above.

@faster-cpython locked and limited conversation to collaborators on Dec 2, 2021
@gramster moved this to Todo in Fancy CPython Board on Jan 10, 2022
@gramster moved this from Todo to Done in Fancy CPython Board on Jan 10, 2022

This issue was moved to a discussion. You can continue the conversation there.
