
[profiling] Add profiling instrumentation centered on the eval loop #25


Closed
ericsnowcurrently opened this issue Mar 23, 2021 · 13 comments

@ericsnowcurrently (Collaborator)

Here are the key boundaries (sketched in code after the list):

  • outside _PyEval_EvalFrameDefault()
  • between start of _PyEval_EvalFrameDefault() and top of loop (setting up)
  • between top of loop and dispatch
  • handling an opcode
  • between dispatch and bottom of loop
  • between bottom of loop and returning (cleaning up)
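For concreteness, here's a self-contained C sketch of probes at those boundaries. Everything in it is hypothetical: eval_frame() is a toy stand-in for _PyEval_EvalFrameDefault(), and trace_event() and the event IDs are invented for illustration, not the names or encoding used in the branch discussed below.

/* Hypothetical sketch: where timing probes sit relative to the eval loop. */
#include <stdio.h>
#include <time.h>

enum {
    EV_ENTER,        /* start of eval_frame(): setting up */
    EV_LOOP_TOP,     /* top of the dispatch loop */
    EV_DISPATCH,     /* about to dispatch an opcode */
    EV_OP,           /* handling an opcode */
    EV_LOOP_BOTTOM,  /* bottom of the loop */
    EV_EXIT,         /* cleaning up and returning */
};

/* Log "<timestamp> <event-id> [<arg>]", in the same spirit as the
 * minimal trace format shown later in this thread. */
static void
trace_event(int id, int arg)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    if (arg < 0) {
        fprintf(stderr, "%ld.%09ld %d\n", (long)ts.tv_sec, ts.tv_nsec, id);
    }
    else {
        fprintf(stderr, "%ld.%09ld %d %d\n",
                (long)ts.tv_sec, ts.tv_nsec, id, arg);
    }
}

/* Toy stand-in for _PyEval_EvalFrameDefault(). */
static void
eval_frame(const int *code, int ncodes)
{
    trace_event(EV_ENTER, -1);                /* setting up */
    for (int i = 0; i < ncodes; i++) {
        trace_event(EV_LOOP_TOP, -1);         /* top of loop */
        trace_event(EV_DISPATCH, -1);         /* dispatch */
        trace_event(EV_OP, code[i]);          /* opcode handled */
        trace_event(EV_LOOP_BOTTOM, -1);      /* bottom of loop */
    }
    trace_event(EV_EXIT, -1);                 /* cleaning up */
}

int
main(void)
{
    int code[] = {100, 101, 124};             /* pretend opcodes */
    eval_frame(code, sizeof(code) / sizeof(code[0]));
    return 0;
}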
@markshannon (Member)

IMO, this should mostly be the job of the interpreter generator (see #5).

Adding options to the interpreter generator to insert the instrumentation will be a lot more pleasant than dealing with C macros.

@ericsnowcurrently (Collaborator, Author)

Setting aside how we instrument the code, are there any other significant boundaries we should consider?

@ericsnowcurrently changed the title from "Add profiling instrumentation centered on the eval loop" to "[profiling] Add profiling instrumentation centered on the eval loop" on Mar 24, 2021
@ericsnowcurrently self-assigned this on Mar 26, 2021
@ericsnowcurrently (Collaborator, Author)

As an exercise to better familiarize myself with the knowledge domain, I spent a little time adding some minimal instrumentation (and logging) to the eval loop, as well as a script to analyze the results. See my eval-loop-profiling branch.

The instrumentation generates a log file with a minimal format:
# python: /home/esnow/projects/work/cpython-perf/cpython/python
# argv: ./python -E ./setup.py build
# start time: 1617222445 s (since epoch)
# start clock: 17137493.663947837 s (on clock)
# end clock: 17137494.467647860 s (on clock)

17137493.664739837 0
# func: <module>
17137493.666522537 3
17137493.666523537 5
17137493.666523837 6
17137493.666524037 7
17137493.666524237 8
17137493.666524437 9 100
# func: _DeadlockError
17137493.666531637 3
17137493.666531937 5
17137493.666532137 6
17137493.666532337 7
17137493.666532437 8
17137493.666532637 9 101
17137493.666533537 13
# func: _DeadlockError
17137493.666533937 4
17137493.666534137 2
# func: _ModuleLock
17137493.666562437 3
...
17137494.445492862 13
# func: close
17137494.445493162 4
17137494.445493362 2
# func: release
17137494.445493862 3
17137494.445494062 5
17137494.445494162 6
17137494.445494362 7
17137494.445494562 8
17137494.445494662 9 124
17137494.445495162 13
# func: release
17137494.445495462 4
17137494.445495662 2
17137494.445496162 13
# func: shutdown
17137494.445496462 4
17137494.445496562 2
17137494.467615160 1

For reference, a Python script that has a total runtime of about a second generates roughly 16MB of trace data.
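
For concreteness, here's a rough C sketch of how a consumer of this format might turn the raw trace into per-event durations: skip the "#" header and "# func:" lines, then attribute the gap between consecutive timestamps to the earlier event. The filename and the attribution rule are assumptions; the analysis script described next may work differently.

#include <stdio.h>

int
main(void)
{
    /* Hypothetical filename; the branch writes eval_loop-<start>.trace. */
    FILE *fp = fopen("eval_loop.trace", "r");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    char line[256];
    double prev_t = -1.0;
    int prev_id = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        double t;
        int id;
        /* Skip "# ..." headers and "# func: ..." markers. */
        if (line[0] == '#' || sscanf(line, "%lf %d", &t, &id) != 2) {
            continue;
        }
        if (prev_t >= 0.0) {
            /* Attribute the gap to the earlier event (an assumption). */
            printf("event %2d: %8.1f us\n", prev_id, (t - prev_t) * 1e6);
        }
        prev_t = t;
        prev_id = id;
    }
    fclose(fp);
    return 0;
}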

By default, the analysis script makes the log more human-friendly:
$ python3 Tools/scripts/analyze_trace.py
reading from ./eval_loop-1617222445.trace
(showing results in 'simple-indent' format)

# python:         /home/esnow/projects/work/cpython-perf/cpython/python
# argv:           ./python -E ./setup.py build
# start:          2021-03-31 20:27:25.663947 UTC
# end:            2021-03-31 20:27:26.467647 UTC
# elapsed:        803,700.0 µs

####################
# BEGIN TRACE
####################

1,782.7 µs      init
    1.0 µs      enter                                             # func: <module>
    0.3 µs      |  loop enter
    0.2 µs      |  loop slow
    0.2 µs      |  loop fast
    0.2 µs      |  dispatch
    7.2 µs      |  op                     LOAD_CONST (100)
    0.3 µs      |  enter                                          # func: _DeadlockError
    0.2 µs      |  |  loop enter
    0.2 µs      |  |  loop slow
    0.1 µs      |  |  loop fast
    0.2 µs      |  |  dispatch
    0.9 µs      |  |  op                  LOAD_NAME (101)
    0.4 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: _DeadlockError
   28.3 µs      |  runtime other
    0.2 µs      |  enter                                          # func: _ModuleLock
...
    0.3 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: close
    0.5 µs      |  runtime other
    0.2 µs      |  enter                                          # func: release
    0.1 µs      |  |  loop enter
    0.2 µs      |  |  loop slow
    0.2 µs      |  |  loop fast
    0.1 µs      |  |  dispatch
    0.5 µs      |  |  op                  LOAD_FAST (124)
    0.3 µs      |  |  loop exit
    0.2 µs      |  exit                                           # func: release
    0.5 µs      |  runtime other
    0.3 µs      |  loop exit
    0.1 µs      exit                                              # func: shutdown
22,118.5 µs     runtime other
                fini

####################
# END TRACE
####################
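
The indentation in that view can be derived purely from the enter/exit events. Here's a hypothetical C sketch of the bookkeeping, assuming each "enter" opens a nesting level and each "exit" closes one (it mirrors the output above but is not the analysis script's actual code):

#include <stdio.h>
#include <string.h>

static int depth = 0;

/* Print an event name at the current nesting level, "|  " per level. */
static void
print_event(const char *name)
{
    for (int i = 0; i < depth; i++) {
        fputs("|  ", stdout);
    }
    puts(name);
}

static void
handle_event(const char *name)
{
    /* An "exit" line prints at the outer level, so pop first. */
    if (strcmp(name, "exit") == 0 && depth > 0) {
        depth--;
    }
    print_event(name);
    if (strcmp(name, "enter") == 0) {
        depth++;
    }
}

int
main(void)
{
    /* A tiny event stream shaped like the trace above. */
    const char *events[] = {
        "enter", "loop enter", "dispatch", "op",
        "enter", "loop enter", "op", "loop exit", "exit",
        "loop exit", "exit",
    };
    for (size_t i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
        handle_event(events[i]);
    }
    return 0;
}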

The script can also provide a summary:

Sorted by Count
$ python3 Tools/scripts/analyze_trace.py --format summary --sort count
reading from ./eval_loop-1617222445.trace
(showing results in 'summary' format)

python:    /home/esnow/projects/work/cpython-perf/cpython/python
argv:      ./python -E ./setup.py build
start:     2021-03-31 20:27:25.663947 UTC
end:       2021-03-31 20:27:26.467647 UTC
elapsed:   803,700.0 µs

events:

           name                count      mean elapsed
--------------------------- ------------ --------------
 dispatch                        94,807         0.1 µs
 op                              94,807         3.4 µs
 loop slow                       61,684         0.2 µs
 loop fast                       61,684         0.1 µs
 enter                           60,382         0.2 µs
 loop exit                       60,382         0.4 µs
 exit                            60,382         0.2 µs
 runtime other                   60,382         5.5 µs
 loop enter                      60,345         0.2 µs
 loop exception                   1,832         0.4 µs
 loop error                       1,821         0.4 µs
 loop exiting                       493         0.2 µs
 init                                 1     1,782.7 µs
 fini                                 1         0.0 µs
--------------------------- ------------ --------------
 total: 14/14

ops:

           name                count      mean elapsed
--------------------------- ------------ --------------
 LOAD_FAST                       24,094         2.0 µs
 POP_JUMP_IF_FALSE               21,937         0.5 µs
 LOAD_GLOBAL                     21,223         3.0 µs
 SETUP_FINALLY                    8,242        21.0 µs
 JUMP_ABSOLUTE                    5,117         0.6 µs
 LOAD_CONST                       2,841         1.6 µs
 BUILD_LIST                       2,640         1.1 µs
 POP_JUMP_IF_TRUE                 2,480         1.4 µs
 FOR_ITER                         1,997         0.7 µs
 POP_TOP                          1,545         0.4 µs
 DUP_TOP                          1,328         1.7 µs
 LOAD_NAME                          550         6.1 µs
 JUMP_FORWARD                       537         0.5 µs
 LOAD_DEREF                          56         1.5 µs
 JUMP_IF_NOT_EXC_MATCH               50         2.7 µs
 JUMP_IF_FALSE_OR_POP                50         1.1 µs
 JUMP_IF_TRUE_OR_POP                 35         2.6 µs
 LOAD_CLOSURE                        25         1.1 µs
 NOP                                 21         0.9 µs
 WITH_EXCEPT_START                   11         0.9 µs
 BUILD_MAP                           11         7.0 µs
 BUILD_SET                            5         4.9 µs
 YIELD_FROM                           4         0.4 µs
 EXTENDED_ARG                         4         0.2 µs
 BUILD_CONST_KEY_MAP                  3        21.4 µs
 SETUP_WITH                           1         9.7 µs
--------------------------- ------------ --------------
 total: 26/125

funcs: 1,025 called (by "simple" name)
Sorted by Average Elapsed Time
$ python3 Tools/scripts/analyze_trace.py --format summary --sort elapsed
reading from ./eval_loop-1617222445.trace
(showing results in 'summary' format)

python:    /home/esnow/projects/work/cpython-perf/cpython/python
argv:      ./python -E ./setup.py build
start:     2021-03-31 20:27:25.663947 UTC
end:       2021-03-31 20:27:26.467647 UTC
elapsed:   803,700.0 µs

events:

           name                count      mean elapsed
--------------------------- ------------ --------------
 init                                 1     1,782.7 µs
 runtime other                   60,382         5.5 µs
 op                              94,807         3.4 µs
 loop error                       1,821         0.4 µs
 loop exit                       60,382         0.4 µs
 loop exception                   1,832         0.4 µs
 enter                           60,382         0.2 µs
 loop exiting                       493         0.2 µs
 exit                            60,382         0.2 µs
 loop enter                      60,345         0.2 µs
 loop slow                       61,684         0.2 µs
 dispatch                        94,807         0.1 µs
 loop fast                       61,684         0.1 µs
 fini                                 1         0.0 µs
--------------------------- ------------ --------------
 total: 14/14

ops:

           name                count      mean elapsed
--------------------------- ------------ --------------
 BUILD_CONST_KEY_MAP                  3        21.4 µs
 SETUP_FINALLY                    8,242        21.0 µs
 SETUP_WITH                           1         9.7 µs
 BUILD_MAP                           11         7.0 µs
 LOAD_NAME                          550         6.1 µs
 BUILD_SET                            5         4.9 µs
 LOAD_GLOBAL                     21,223         3.0 µs
 JUMP_IF_NOT_EXC_MATCH               50         2.7 µs
 JUMP_IF_TRUE_OR_POP                 35         2.6 µs
 LOAD_FAST                       24,094         2.0 µs
 DUP_TOP                          1,328         1.7 µs
 LOAD_CONST                       2,841         1.6 µs
 LOAD_DEREF                          56         1.5 µs
 POP_JUMP_IF_TRUE                 2,480         1.4 µs
 BUILD_LIST                       2,640         1.1 µs
 LOAD_CLOSURE                        25         1.1 µs
 JUMP_IF_FALSE_OR_POP                50         1.1 µs
 WITH_EXCEPT_START                   11         0.9 µs
 NOP                                 21         0.9 µs
 FOR_ITER                         1,997         0.7 µs
 JUMP_ABSOLUTE                    5,117         0.6 µs
 POP_JUMP_IF_FALSE               21,937         0.5 µs
 JUMP_FORWARD                       537         0.5 µs
 YIELD_FROM                           4         0.4 µs
 POP_TOP                          1,545         0.4 µs
 EXTENDED_ARG                         4         0.2 µs
--------------------------- ------------ --------------
 total: 26/125

funcs: 1,025 called (by "simple" name)

@ericsnowcurrently (Collaborator, Author)

This effort proved to be quite insightful for me. This isn't something to pursue further right now but will be worth revisiting later when we've added code generation for the eval loop.

@markshannon (Member)

I think this is worth revisiting.
In order to direct our efforts, we need to know where the virtual machine is spending its time.

This is relevant to both startup and general execution.

The text summary sounds useful; could we also get flamegraphs? I think they are a great way to visualize where time is being spent.

We should also generate results using perf or a similar tool, to cross-check that we are instrumenting the right things.

@ericsnowcurrently (Collaborator, Author) commented Jul 23, 2021

FWIW, I generated a flamegraph for ./python -c 'pass':

https://raw.githubusercontent.com/faster-cpython/ideas/main/flamegraphs/vm-freq999.svg

If you download the file locally you can interact with it in your browser.


The commands (after installing dependencies) look something like this:

# one-time setup:
# git clone https://github.com/brendangregg/FlameGraph ../FlameGraph

# Save the current perf limits, then raise them for a high-frequency,
# deep-stack capture.
sample_rate_orig=$(cat /proc/sys/kernel/perf_event_max_sample_rate)
max_stack_orig=$(cat /proc/sys/kernel/perf_event_max_stack)
sudo sysctl -w kernel.perf_event_max_sample_rate=100000
sudo sysctl -w kernel.perf_event_max_stack=1000

# Record a system-wide profile while running the target command, then
# fold the stacks and render the SVG.
sudo perf record -F max -a -g -- ./python -c 'pass'
sudo perf script > out.perf
../FlameGraph/stackcollapse-perf.pl out.perf > out.folded
../FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Restore the original limits.
sudo sysctl -w kernel.perf_event_max_sample_rate=$sample_rate_orig
sudo sysctl -w kernel.perf_event_max_stack=$max_stack_orig

See https://github.com/faster-cpython/tools/tree/main/scripts/run-perf.sh for the full operation, including installing dependencies.

@gvanrossum (Collaborator) commented Jul 24, 2021 via email

@sweeneyde

I ran what amounts to PyPerformance's bench_expand, bench_integrate, bench_sum, and bench_str benchmarks in a loop for about a minute with a PGO ("PGUpdate") build, and collected results with Visual Studio's Performance Profiler.

The hottest 15 functions:

Function Name                        Total CPU [unit, %]   Self CPU [unit, %]
----------------------------------- --------------------- --------------------
_PyEval_EvalFrameDefault             58793 (99.79%)        15770 (26.77%)
call_function                        58793 (99.79%)         4045 (6.87%)
_Py_dict_lookup                       3262 (5.54%)          3019 (5.12%)
initialize_locals                     2329 (3.95%)          1468 (2.49%)
_PyUnicode_FromId                     1422 (2.41%)          1422 (2.41%)
_PyFunction_Vectorcall               58793 (99.79%)         1299 (2.20%)
_PyObject_GenericGetAttrWithDict     24025 (40.78%)         1262 (2.14%)
_PyObject_Malloc                      1314 (2.23%)          1219 (2.07%)
[External Call] vcruntime140.dll      1076 (1.83%)          1076 (1.83%)
_Py_DECREF                             994 (1.69%)           994 (1.69%)
_PyObject_GetMethod                   1521 (2.58%)           878 (1.49%)
tupledealloc                           996 (1.69%)           878 (1.49%)
frame_dealloc                          881 (1.50%)           726 (1.23%)
PyObject_GetAttr                     25021 (42.47%)          711 (1.21%)
method_vectorcall                    58513 (99.31%)          678 (1.15%)

I also got a line-by-line profile of _PyEval_EvalFrameDefault, which VS shows like this:
[screenshot: VS line-by-line profile of _PyEval_EvalFrameDefault]

I couldn't find a way to make VS dump the output as text, but here are 17 (low-quality, but at least mostly legible) screenshots of the results for the whole function:
https://imgur.com/a/WEHwFkr

@ericsnowcurrently (Collaborator, Author)

> Is there somewhere that describes what this particular flamegraph means? (I've seen explanations of flamegraphs in general, but I'm not familiar with the tool or methodology used to generate this particular one.)

Flamegraphs are relatively new to me. After a little digging, I found a good resource: https://www.brendangregg.com/flamegraphs.html. This pointed me to https://github.com/brendangregg/FlameGraph, which is what I used to produce the flame graph. As noted above, it involved generating data in a specific way using the Linux "perf" tool and then passing that data through some scripts to produce the SVG file.

(Note that the SVG is better/interactive if you download it and then open the downloaded copy in your browser.)

As to what it means (from the web page):

> The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time), and the y-axis shows stack depth, counting from zero at the bottom. Each rectangle represents a stack frame. The wider a frame is, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry. The colors are usually not significant, picked randomly to differentiate frames.

@ericsnowcurrently (Collaborator, Author) commented Jul 26, 2021

I just uploaded better flame graphs (10 kHz and 100 kHz), which I generated:

  • on our bare-metal benchmarking machine (instead of a VM)
  • with a non-debug build

@ericsnowcurrently (Collaborator, Author) commented Jul 26, 2021

Observations from the flame graphs:

  • "perf" itself should be ignored, so all the other percentages are actually proportionally higher
  • the "swapper" (CPU idle) takes a decent chunk of time - 23%/38% on bare metal and 87% on VM
    • presumably this is the amount of time we are blocking on IO (reading from disk)
  • a bunch of our functions are on the call stack during times that "main" is not, for unclear reasons
    • for example with 10khz, while "main" has 153 samples (37%), other CPython functions (that should have "main" in their stack) have 73 samples (17%)
    • I suspect there is a stack depth limit we're hitting, which makes it look like some calls are at the base of the stack when they actually aren't
    • [update] I figured out that there's a kernel setting for this that was set to 127 (using 1000 fixed it)
  • at the highest level, we can break "main()" into two parts: "pymain_init()" and "PyRun_Main()"
    • for ./python -c pass we spend much more time in the former (42%) than the latter (7%)
    • "pymain_init()" breaks down into "pyinit_core()" (7%) and "pyinit_main()" (34%)

@ericsnowcurrently (Collaborator, Author) commented Jul 27, 2021

I've opened #75 to dive into runtime startup/shutdown profiling. I also created #76 to do the same for our target workloads.

@ericsnowcurrently (Collaborator, Author)

We can circle back later (if we need it) to the specific instrumentation described in the original post above.

@faster-cpython locked and limited conversation to collaborators on Dec 2, 2021
@gramster moved this to Todo in Fancy CPython Board on Jan 10, 2022
@gramster moved this from Todo to Done in Fancy CPython Board on Jan 10, 2022

This issue was moved to a discussion. You can continue the conversation there.
