
[ML] Peak memory usage reported to end user can decrease #1625

Open
@droberts195

Description


I noticed a strange situation with peak_model_bytes in the model_size_stats that is a side effect of the fix for #1478.

The values changed like this over time:

"peak_model_bytes" : 842162,
"peak_model_bytes" : 846000,
"peak_model_bytes" : 820342,
"peak_model_bytes" : 825804,
"peak_model_bytes" : 823116,
"peak_model_bytes" : 813974,
"peak_model_bytes" : 765446,
"peak_model_bytes" : 719750,
"peak_model_bytes" : 701476,
"peak_model_bytes" : 686012,
"peak_model_bytes" : 682664,
"peak_model_bytes" : 685474,
"peak_model_bytes" : 686658,
"peak_model_bytes" : 684952,
"peak_model_bytes" : 680402,

This seems crazy, as it's supposed to be the peak, i.e. max over all time.

Adding temporary debug logging to print the underlying values of peak memory usage showed this:

res.s_PeakUsage set to 296073
res.s_PeakUsage set to 300025
res.s_PeakUsage set to 305006
res.s_PeakUsage set to 311380
res.s_PeakUsage set to 311420
res.s_PeakUsage set to 314960
res.s_PeakUsage set to 318175
res.s_PeakUsage set to 321588
res.s_PeakUsage set to 325142
res.s_PeakUsage set to 328485
res.s_PeakUsage set to 331827
res.s_PeakUsage set to 335910
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525

These values are monotonically increasing as expected.

The problem arises from the code that applies an adjustment to the actual measured value:

std::size_t CResourceMonitor::adjustedUsage(std::size_t usage) const {
    // We scale the reported memory usage by the inverse of the byte limit margin.
    // This gives the user a fairer indication of how close the job is to hitting
    // the model memory limit in a concise manner (as the limit is scaled down by
    // the margin during the beginning period of the job's existence).
    std::size_t adjustedUsage{
        static_cast<std::size_t>(static_cast<double>(usage) / m_ByteLimitMargin)};
    adjustedUsage *= this->persistenceMemoryIncreaseFactor();
    return adjustedUsage;
}

In #1478 we saw that we had to adjust the peak memory usage in the same way as the current memory usage, otherwise we'd get the ridiculous situation of peak being less than current, given that the main adjustment is to multiply by 2.

However, the extra adjustment by m_ByteLimitMargin creates an additional problem when it causes memory to be overestimated early in the job's lifecycle. In the example I found where the peak memory (as observed by the end user) went down, the initial uplift from m_ByteLimitMargin was inappropriate, and actual memory usage never increased later to match the fudged amount.
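To see the mechanism concretely, here is a minimal standalone sketch (not the actual ml-cpp code; the 0.7 starting margin is an assumption, and the hard-coded factor of 2 stands in for persistenceMemoryIncreaseFactor()). While the margin ramps back up towards 1.0 the divisor grows, so the reported peak can fall even though the raw peak only ever rises.

#include <cstddef>
#include <iostream>

// Standalone stand-in for CResourceMonitor::adjustedUsage(): divide the raw
// usage by the current byte limit margin and double it for persistence.
std::size_t adjustedUsage(std::size_t usage, double byteLimitMargin) {
    std::size_t adjusted{
        static_cast<std::size_t>(static_cast<double>(usage) / byteLimitMargin)};
    adjusted *= 2; // stand-in for persistenceMemoryIncreaseFactor()
    return adjusted;
}

int main() {
    // Early in the job: raw peak 296073 with the margin at its assumed
    // initial value of 0.7 reports as ~845922 - in the region of the first
    // peak_model_bytes values above.
    std::cout << adjustedUsage(296073, 0.7) << '\n';
    // Later: the raw peak has grown to 339525, but the margin has relaxed
    // to 1.0, so the report drops to 679050 - lower than the earlier one.
    std::cout << adjustedUsage(339525, 1.0) << '\n';
    return 0;
}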

So there is an interesting question about how to report the peak: should we ensure that it never decreases, or is it OK to let it decrease if the initial fudge factor turns out to have been overly pessimistic?

  • If we want it to be genuinely monotonically increasing then we'll need to add a separate counter to remember the adjusted peak (a sketch of this follows the list).
  • Or we could just keep the explanation ready in case somebody notices that the reported number went down: the initially reported peak was an overestimate and usage was never really that high.
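For the first option, here is a minimal sketch of what a separate counter could look like (the class and member names are hypothetical, not existing ml-cpp code): ratchet the adjusted peak upwards at reporting time so the value shown to the user can never decrease.

#include <algorithm>
#include <cstddef>

// Hypothetical helper - names are illustrative only.
class CAdjustedPeakTracker {
public:
    // Feed in the freshly adjusted usage each time model_size_stats are
    // reported; the stored peak only ever moves upwards.
    std::size_t update(std::size_t adjustedUsage) {
        m_AdjustedPeakUsage = std::max(m_AdjustedPeakUsage, adjustedUsage);
        return m_AdjustedPeakUsage;
    }

private:
    std::size_t m_AdjustedPeakUsage{0};
};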
