Memory info and more details in opentelemetry #826
Conversation
Instead of doing it in `HoverDefinition`, do it in with{Response,Notification,...}. These wrap all handlers, so this should cover everything. It also means that the span covers the entire processing time for the request, where before we missed the setup happening in the with* functions.
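For reference, a minimal sketch of what such a wrapper could look like, assuming the opentelemetry package's OpenTelemetry.Eventlog API (ByteString span names); the actual otTracedHandler in this PR may differ in its details:

```haskell
import qualified Data.ByteString.Char8 as BS
import OpenTelemetry.Eventlog (withSpan)

-- Wrap an entire request/notification handler in a span named after its
-- method, so the span covers the setup done in the with* helpers as well
-- as the handler body itself.
otTracedHandler :: String -> String -> IO a -> IO a
otTracedHandler requestType method act =
    withSpan (BS.pack (requestType ++ ":" ++ method)) (const act)
```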
Run GC regularly with --ot-profiling
I renamed the fork to distinguish it from the original. It is still being pulled from git using stack. This will be addressed once I can push the fork to Hackage.
Did you try this out on a ghcide session? Any memorable findings? Which actions/requests take the most time, which ones leak memory, which ones cause memory consumption to spike, etc?
It would be good to have some documentation on how to see/use/interpret these traces.
The changes look reasonable to me, if a bit unfinished.
For instance, the benchmark suite has been extended with an option to collect telemetry details, like the length and size of the state HashMap. But nothing is being done with this data!
- It should be added to the CSV outputs of an experiment
- It could be graphed by the bench/hist build system so that we can easily spot leaks
src/Development/IDE/Core/Tracing.hs
startTelemetry _ _ (IdeOTProfiling False) = return ()
startTelemetry name valuesRef (IdeOTProfiling True) = do
    mapBytesInstrument <- mkValueObserver (BS.pack name <> " size_bytes")
    mapCountInstrument <- mkValueObserver (BS.pack name <> " count")
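A rough sketch (not necessarily this PR's exact code) of how such observers could be fed: poll the values map on a background thread and report its length and recursive size. The polling interval, the IORef holding the map, and the heapsize import are illustrative assumptions; the word-to-byte conversion follows the pattern used elsewhere in this review.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import qualified Data.HashMap.Strict as HMap
import Data.IORef (IORef, readIORef)
import Foreign.Storable (sizeOf)
import Heapsize (recursiveSizeNoGC)  -- heapsize package; module/signature assumed
import OpenTelemetry.Eventlog (ValueObserver, observe)

pollValues :: IORef (HMap.HashMap k v) -> ValueObserver -> ValueObserver -> IO ()
pollValues valuesVar bytesInstrument countInstrument =
    void $ forkIO $ forever $ do
        values <- readIORef valuesVar
        observe countInstrument (HMap.size values)
        -- recursiveSizeNoGC traverses the heap and returns a size in machine words
        wordCount <- recursiveSizeNoGC values
        observe bytesInstrument (sizeOf (undefined :: Word) * wordCount)
        threadDelay 1000000  -- report once per second
```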
In addition to these global counters, it would be very useful to have per-rule counters, in order to show which rules are expensive (in items and in memory)
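One possible shape for per-rule instruments (illustrative, not this PR's code): create a ValueObserver per rule key on first use and cache it, so each rule gets its own size series. The cache type and the naming scheme are assumptions.

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as HMap
import Data.Hashable (Hashable)
import Data.IORef (IORef, modifyIORef', readIORef)
import OpenTelemetry.Eventlog (ValueObserver, mkValueObserver)

-- Look up (or create and cache) the observer for a given rule key.
instrumentFor :: (Show k, Eq k, Hashable k)
              => IORef (HMap.HashMap k ValueObserver) -> k -> IO ValueObserver
instrumentFor cacheRef key = do
    cache <- readIORef cacheRef
    case HMap.lookup key cache of
        Just instr -> pure instr
        Nothing -> do
            instr <- mkValueObserver (BS.pack (show key ++ " size_bytes"))
            modifyIORef' cacheRef (HMap.insert key instr)
            pure instr
```

A helper along these lines is what the instrumentFor call in the later snippet could be doing; the real version may manage the cache differently.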
I agree with all the suggested changes, with one caveat: I don't think it's feasible to measure per-rule allocations with the current method, as it would be too slow, even for benchmarking use.
Would a measure of per-rule allocations and deriving the total from that be fast enough?
I'm going to try implementing this and see the results. My guess is it would actually be quite a bit faster, because now you are counting items only when they are inserted into the map. However, the metric then becomes one of total allocations instead of live memory, since you miss deletions. Would there be some way of "intercepting" those so you can count the size of deleted elements?
Hmm, I don't think it's going to be easy (or possible) to observe the deletions.
Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group
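As a sketch of that suggestion (assumed types only: a values map keyed by (file, rule key) pairs as in ghcide, and heapsize's recursiveSizeNoGC for the traversal), grouping by rule key before measuring could look like this:

```haskell
import qualified Data.HashMap.Strict as HMap
import Data.Hashable (Hashable)
import Heapsize (recursiveSizeNoGC)  -- heapsize package; module/signature assumed

-- Group the entries of the values map by rule key, then measure the
-- recursive size of each group separately.
measureByKey :: (Eq k, Hashable k)
             => HMap.HashMap (file, k) v -> IO (HMap.HashMap k Int)
measureByKey values = traverse recursiveSizeNoGC groups
  where
    groups = HMap.fromListWith (++)
                 [ (k, [v]) | ((_file, k), v) <- HMap.toList values ]
```

Measuring each group in a single traversal (rather than entry by entry) also means closures shared within a group are only counted once.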
So, as a test, I tried measuring allocations when objects get added to the state map (mpardalos/ghcide@opentelemetry-memory...detailed-memory-test), and it turns out I was wrong. Running a small "hover" benchmark (see code) took about 3 hours, which I think qualifies as unusable for now.
Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group
That is probably much more feasible. I will give that a try as well and report back
Why is it so slow?
I tried analyzing which entries in the map took the longest and it seems like the worst offenders are GhcSessionIO and GhcSession entries, each of which can take about 10s to measure, I think because their representation is very nested (each can be >1000000 closures). If there is a way to estimate the size of these entries instead of having to traverse them it could improve performance a lot.
If there isn't a way to handle this then the only way to improve performance would be to optimise heapsize further.
Have you tried the group approach below? I think that's probably what we want to measure and not the individual insertions, since I suspect that when inserting a new item in the map part of its cost is shared with other entries in the map.
Maybe what we want is to keep the current behaviour, but instead of getting the recursive size of the whole values map, break it down by key type and get the recursive size of each group
Should I add that in the README?
I can't say I've found anything directly interesting as this stands. Most requests seem basically instant in my (admittedly limited) testing on the ghcide codebase. As for actions, memory consumption is, as expected due to #421, constantly rising. I am hoping this will help in addressing that; however, I need to find a way of getting more detailed readings to do that. Hopefully the change @pepeiborra suggested (#826 (comment)) helps.
@mpardalos any updates on this?
Sorry for the radio silence, starting final year of uni has been a bit busy. I haven't worked on this since we last spoke, but I'll try to get it ready to merge this week. See my comment on the code review for what I think is left to do.
atomically $ modifyTVar pendingRequests (Set.insert _id)
writeChan clientMsgChan $ Response r wrap f
let withNotification old f = Just $ \r -> writeChan clientMsgChan $ Notification r (\lsp ide x -> f lsp ide x >> whenJust old ($ r))
let withResponseAndRequest wrap wrapNewReq f = Just $ \r@RequestMessage{_id} -> do
let withNotification old f = Just $ \r@NotificationMessage{_method} -> otTracedHandler "Notification" (show _method) $
Are these traces correct? The only thing they will be measuring is the time required to forward the request/notification on the channel, not the time we actually spend computing.
You are completely right. Should I be adding that further below in handleInit?
I made it report the size for each key separately (lumping together all the entries for a given key which refer to different files). This could be pretty useful although it is still slow. Here's a screenshot of what the output might look like in tracy. "values map size_bytes" is the total size of the values map, and each key gets its own graph below that. As I mentioned before, the worst offenders for the slowdown are GhcSession and GhcSessionIO, which are also the biggest entries. I will try skipping them first (to see how much of a difference that makes) and then try putting their measurement on a separate thread so that every other entry can at least have some more detailed data.
@mpardalos great work, that looks awesome! Here are some sets of rules among which a lot of sharing might be going on: It would be awesome if we could group together the entries for such lists too. Also, can you upload the full trace that you computed somewhere? I'd be very interested in looking at it.
What does the GC metric shown in the trace measure?
As a way to work around any slowness, could it be possible to take periodic heap snapshots and dump them to disk (which should be fast), and load them later to do all the number crunching? /cc @mpickering
This is part of OpenTelemetry itself. It counts total GC time.
If that's possible it could be very very interesting
Here it is. It's basically just me jumping around in ghcide's source, opening/closing files, adding syntax errors, etc. It is still slow, but I think that skipping those keys should help.
Some more things to measure:
This is starting to look great. Should we focus on getting a minimum useful kernel checked in so that we can iterate over it?
Here is one hack that might possibly work to emulate the 'snapshot' idea if you have enough swap:
Then we need to collect all of these. However, you do have to be careful about not using any
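Purely as an illustration of the snapshot idea (the comment above is truncated, so this may not be the hack it had in mind): on POSIX systems one could fork the process and let the child run the slow measurement against its copy-on-write view of the heap (hence the need for enough swap), keeping the parent responsive. A hypothetical sketch using the unix package:

```haskell
import System.Exit (ExitCode (ExitSuccess))
import System.Posix.Process (exitImmediately, forkProcess)

-- Run a slow heap measurement in a forked child so the parent never blocks.
-- Caveats apply: forkProcess interacts poorly with the threaded RTS, and the
-- child must not touch shared resources (files, sockets) it did not create.
measureInSnapshot :: IO () -> IO ()
measureInSnapshot slowMeasurement = do
    _child <- forkProcess $ do
        slowMeasurement
        exitImmediately ExitSuccess
    pure ()
```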
It's starting to look like this PR is abandoned.
As it stands, would you consider this ready to merge? Functionality is there, everything is just quite slow.
As long as it is behind a flag, it's mergeable.
@mpardalos I want to have a play with this, but I don't know how to hook the event trace to Tracy. Could you add some docs or links?
I believe you want to use this executable: https://github.com/ethercrow/opentelemetry-haskell/blob/master/opentelemetry-extra/exe/eventlog-to-tracy/Main.hs
I added a doc for how to use this functionality.
That leads to an error, what's next?
I am using the following branch: http://github.com/pepeiborra/ghcide/tree/opentelemetry
EDIT: figured it out
Was something in the instructions maybe unclear? Might be a good idea to specify.
& HMap.fromListWith (++)
valuesSize <- sequence $ HMap.mapWithKey (\k v -> withSpan ("Measure " <> (BS.pack $ show k)) $ \sp -> do
    { instrument <- instrumentFor k
    ; byteSize <- (sizeOf (undefined :: Word) *) <$> recursiveSizeNoGC v
This call to recursiveSizeNoGC leads to a segfault if a GC arises while recursiveSizeNoGC is counting, which is quite easy to reproduce.
Given the way recursiveSizeNoGC works this is impossible to avoid (the act of measuring the size itself allocates memory), but it could be minimised by stopping the world while measuring.
Did you consider this and/or other workarounds @mpardalos?
Another, much cheaper way to minimise the segfaults in recursiveSizeNoGC is to use System.Mem.Weak to observe GC runs, and abort the count when this happens.
@mpardalos are you accepting PRs for heapsize?
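A sketch of the System.Mem.Weak idea (illustrative; the actual heapsize change may differ): keep a weak pointer to a value that nothing else references, and treat its disappearance as evidence that a GC has run, at which point the size traversal should abort.

```haskell
import Data.IORef (newIORef)
import Data.Maybe (isNothing)
import System.Mem.Weak (deRefWeak, mkWeakPtr)

-- Returns an action that reports whether a GC has happened since creation:
-- once the canary is collected, deRefWeak returns Nothing.
newGcDetector :: IO (IO Bool)
newGcDetector = do
    canary <- newIORef ()               -- reachable only from this binding
    weak   <- mkWeakPtr canary Nothing  -- no finalizer needed
    pure (isNothing <$> deRefWeak weak)
```

The traversal would call the returned action periodically and bail out as soon as it reports True.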
Yeah absolutely
I have tested the Weak workaround and bingo, no more segfaults. So I'm going to:
- Send you a heapsize PR
- Ask you to upload a new heapsize version to Hackage
- Commandeer this diff (I have made plenty of changes)
- Get this merged
I spoke too soon, segfaults can still arise if GC strikes in between the Weak check and the address dereferencing. It would be much safer if this code was written in C and accessed through an unsafe foreign import, because that guarantees that a GC cannot strike.
I also investigated a solution for the sharing problems, providing an Applicative interface so that it's possible to do multiple heapsize measurements with the same set of seen pointers. Again, the GC gets in the way, sadly making this approach not viable. One would need to rewrite the whole thing in C.
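For concreteness, the foreign-import route could look like this (the C function name is hypothetical; the point is that an unsafe call does not yield to the RTS, so a GC cannot run while the C traversal executes):

```haskell
import Foreign.C.Types (CSize)
import Foreign.Ptr (Ptr)

-- Hypothetical binding to a C implementation of the closure-size traversal.
-- Being an 'unsafe' call, it runs to completion without a GC safe point.
foreign import ccall unsafe "ghcide_closure_size"
    c_closureSize :: Ptr () -> IO CSize
```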
I managed to get full measurements with sharing for a small project by setting a very large nursery. However, I still observe crashes due to SIGILL (Illegal Instruction). My attempts to get a backtrace have been unsuccessful; Mac OS got in the way. I'll try to replicate in Linux tomorrow. Screenshot below where you can observe the effects of sharing: Typecheck, GetModIface, GetModSummary, and others have a cost of 1 because everything (?) is reachable from the GhcSession, apparently. Sadly not very helpful. The only exception is
Why does
Superseded by #922
This had a simple explanation: Shake extras is reachable from the
Does that mean everything in the Shake store is reported as part of
Yes, I would expect so
This patch adds the following information to opentelemetry traces:
Other changes:
- Add --ot-profiling flag
- Add --ot-profiling flag to benchmark to run ghcide with --ot-profiling
- Run GC regularly (with --ot-profiling)