
[ML] Peak memory usage reported to end user can decrease #1625

Open
@droberts195

Description


I noticed a strange situation with peak_model_bytes in the model_size_stats that is a side effect of the fix for #1478.

The values changed like this over time:

"peak_model_bytes" : 842162,
"peak_model_bytes" : 846000,
"peak_model_bytes" : 820342,
"peak_model_bytes" : 825804,
"peak_model_bytes" : 823116,
"peak_model_bytes" : 813974,
"peak_model_bytes" : 765446,
"peak_model_bytes" : 719750,
"peak_model_bytes" : 701476,
"peak_model_bytes" : 686012,
"peak_model_bytes" : 682664,
"peak_model_bytes" : 685474,
"peak_model_bytes" : 686658,
"peak_model_bytes" : 684952,
"peak_model_bytes" : 680402,

This seems crazy, as it's supposed to be the peak, i.e. max over all time.

Adding temporary debug logging to print the underlying values of peak memory usage showed this:

res.s_PeakUsage set to 296073
res.s_PeakUsage set to 300025
res.s_PeakUsage set to 305006
res.s_PeakUsage set to 311380
res.s_PeakUsage set to 311420
res.s_PeakUsage set to 314960
res.s_PeakUsage set to 318175
res.s_PeakUsage set to 321588
res.s_PeakUsage set to 325142
res.s_PeakUsage set to 328485
res.s_PeakUsage set to 331827
res.s_PeakUsage set to 335910
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525
res.s_PeakUsage set to 339525

These values are monotonically increasing as expected.

The problem arises from the code that applies an adjustment to the actual measured value:

std::size_t CResourceMonitor::adjustedUsage(std::size_t usage) const {
    // We scale the reported memory usage by the inverse of the byte limit margin.
    // This gives the user a fairer indication of how close the job is to hitting
    // the model memory limit in a concise manner (as the limit is scaled down by
    // the margin during the beginning period of the job's existence).
    std::size_t adjustedUsage{
        static_cast<std::size_t>(static_cast<double>(usage) / m_ByteLimitMargin)};
    adjustedUsage *= this->persistenceMemoryIncreaseFactor();
    return adjustedUsage;
}

In #1478 we saw that we had to adjust the peak memory usage in the same way as the current memory usage, otherwise we'd get the ridiculous situation of peak being less than current, given that the main adjustment is to multiply by 2.

However, the extra adjustment by m_ByteLimitMargin creates an additional problem when it causes memory to be overestimated early in the job's lifecycle. In the example I found where the peak memory (as observed by the end user) went down, the initial uplift from m_ByteLimitMargin was inappropriate, and actual memory usage never increased later to match the fudged amount.
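To see the mechanism concretely, here is a minimal standalone sketch (not the actual ml-cpp code; the 0.7 starting margin is an assumption, and the hard-coded factor of 2 stands in for persistenceMemoryIncreaseFactor()). While the margin ramps back up towards 1.0 the divisor grows, so the reported peak can fall even though the raw peak only ever rises.

#include <cstddef>
#include <iostream>

// Standalone stand-in for CResourceMonitor::adjustedUsage(): divide the raw
// usage by the current byte limit margin and double it for persistence.
std::size_t adjustedUsage(std::size_t usage, double byteLimitMargin) {
    std::size_t adjusted{
        static_cast<std::size_t>(static_cast<double>(usage) / byteLimitMargin)};
    adjusted *= 2; // stand-in for persistenceMemoryIncreaseFactor()
    return adjusted;
}

int main() {
    // Early in the job: raw peak 296073 with the margin at its assumed
    // initial value of 0.7 reports as ~845922 - in the region of the first
    // peak_model_bytes values above.
    std::cout << adjustedUsage(296073, 0.7) << '\n';
    // Later: the raw peak has grown to 339525, but the margin has relaxed
    // to 1.0, so the report drops to 679050 - lower than the earlier one.
    std::cout << adjustedUsage(339525, 1.0) << '\n';
    return 0;
}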

So there is an interesting question about how to report the peak: should we ensure that it never decreases, or is it OK to let it decrease if the initial fudge factor turns out to have been overly pessimistic?

  • If we want it to be genuinely monotonically increasing then we'll need to add a separate counter to remember the adjusted peak (a sketch of this follows the list).
  • Or we could just keep the explanation ready in case somebody notices that the reported number went down: the initially reported peak was an overestimate and usage was never really that high.
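For the first option, here is a minimal sketch of what a separate counter could look like (the class and member names are hypothetical, not existing ml-cpp code): ratchet the adjusted peak upwards at reporting time so the value shown to the user can never decrease.

#include <algorithm>
#include <cstddef>

// Hypothetical helper - names are illustrative only.
class CAdjustedPeakTracker {
public:
    // Feed in the freshly adjusted usage each time model_size_stats are
    // reported; the stored peak only ever moves upwards.
    std::size_t update(std::size_t adjustedUsage) {
        m_AdjustedPeakUsage = std::max(m_AdjustedPeakUsage, adjustedUsage);
        return m_AdjustedPeakUsage;
    }

private:
    std::size_t m_AdjustedPeakUsage{0};
};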
