
Commit cb25176

gormanm authored and Ingo Molnar committed
sched/debug: Make schedstats a runtime tunable that is disabled by default
schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled by
default as someone who knows they need it can also learn to enable it
when necessary. The benefits are dependent on how scheduler-intensive
the workload is. If it is, then the patch reduces the number of cycles
spent calculating the stats with a small benefit from reducing the
cache footprint of the scheduler.

These measurements were taken from a 48-core 2-socket machine with
Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
single-socket 8-core machine with Intel i7-3770 processors.

netperf-tcp
                           4.5.0-rc1             4.5.0-rc1
                             vanilla          nostats-v3r1
Hmean    64        560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128       766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256       950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024     1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048     2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312     4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096     5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192    10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384   18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here. UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
Hmean    mb/sec-1       500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2       984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4      1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8      3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16     5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32    10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64    15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128   15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256   14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512   15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024  14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048  13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072  12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts and otherwise flat. The gains
on the 8-core machine were slightly different.

tbench4 on 8-core i7-3770 single socket machine
Hmean    mb/sec-1      442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2      796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4     1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8     2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16    2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32    2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64    2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128   2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256   2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512   2072.51 (  0.00%)     2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's
not a surprise that the result will depend on the CPU used.

hackbench-pipes
                         4.5.0-rc1             4.5.0-rc1
                           vanilla          nostats-v3r1
Amean    1      0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4      0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7      0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12     0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21     0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30     0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48     0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79     0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110    1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141    1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172    1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192    1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses, and while the variance data is not
included, it's close to the noise. The UMA machine did not show
anything particularly different.

pipetest
                              4.5.0-rc1             4.5.0-rc1
                                vanilla          nostats-v2r2
Min         Time    4.13 (  0.00%)      3.99 (  3.39%)
1st-qrtle   Time    4.38 (  0.00%)      4.27 (  2.51%)
2nd-qrtle   Time    4.46 (  0.00%)      4.39 (  1.57%)
3rd-qrtle   Time    4.56 (  0.00%)      4.51 (  1.10%)
Max-90%     Time    4.67 (  0.00%)      4.60 (  1.50%)
Max-93%     Time    4.71 (  0.00%)      4.65 (  1.27%)
Max-95%     Time    4.74 (  0.00%)      4.71 (  0.63%)
Max-99%     Time    4.88 (  0.00%)      4.79 (  1.84%)
Max         Time    4.93 (  0.00%)      4.83 (  2.03%)
Mean        Time    4.48 (  0.00%)      4.39 (  1.91%)
Best99%Mean Time    4.47 (  0.00%)      4.39 (  1.91%)
Best95%Mean Time    4.46 (  0.00%)      4.38 (  1.93%)
Best90%Mean Time    4.45 (  0.00%)      4.36 (  1.98%)
Best50%Mean Time    4.36 (  0.00%)      4.25 (  2.49%)
Best10%Mean Time    4.23 (  0.00%)      4.10 (  3.13%)
Best5%Mean  Time    4.19 (  0.00%)      4.06 (  3.20%)
Best1%Mean  Time    4.13 (  0.00%)      4.00 (  3.39%)

Small improvement, and similar gains were seen on the UMA machine.

The gain is small but it stands to reason that doing less work in the
scheduler is a good thing. The downside is that the lack of schedstats
and tracepoints may be surprising to experts doing performance analysis
until they find the existence of the schedstats= parameter or the
schedstats sysctl. It will be automatically activated for latencytop
and sleep profiling to alleviate the problem. For tracepoints, there is
a simple warning, as it's not safe to activate schedstats in the
context when it's known the tracepoint may be wanted but is
unavailable.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Matt Fleming <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
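For reference, the runtime knob described above can be driven from user
space once the patch is applied. A minimal sketch, assuming the sysctl
shows up at the conventional /proc/sys/kernel/sched_schedstats path and
that the caller holds CAP_SYS_ADMIN (the handler in this patch rejects
unprivileged writes):

/*
 * Toggle the kernel.sched_schedstats sysctl added by this patch.
 * The procfs path is an assumption based on the standard
 * sysctl-to-procfs mapping; builds with any C compiler.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_sysctl(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        if (fputs(val, f) == EOF) {
                fclose(f);
                return -1;
        }
        return fclose(f);
}

int main(void)
{
        const char *path = "/proc/sys/kernel/sched_schedstats";

        /* Enable schedstats, e.g. just before a profiling run. */
        if (write_sysctl(path, "1")) {
                perror(path);
                return EXIT_FAILURE;
        }
        puts("schedstats enabled");
        return EXIT_SUCCESS;
}

The boot-time equivalent is schedstats=enable on the kernel command
line, per the Documentation/kernel-parameters.txt hunk below.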
1 parent a6e4491 commit cb25176

13 files changed: +256 −90 lines

Documentation/kernel-parameters.txt

Lines changed: 5 additions & 0 deletions
@@ -3528,6 +3528,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	sched_debug	[KNL] Enables verbose scheduler debug messages.
 
+	schedstats=	[KNL,X86] Enable or disable scheduled statistics.
+			Allowed values are enable and disable. This feature
+			incurs a small amount of overhead in the scheduler
+			but is useful for debugging and performance tuning.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.

Documentation/sysctl/kernel.txt

Lines changed: 8 additions & 0 deletions
@@ -760,6 +760,14 @@ rtsig-nr shows the number of RT signals currently queued.
 
 ==============================================================
 
+sched_schedstats:
+
+Enables/disables scheduler statistics. Enabling this feature
+incurs a small amount of overhead in the scheduler but is
+useful for debugging and performance tuning.
+
+==============================================================
+
 sg-big-buff:
 
 This file shows the size of the generic SCSI (sg) buffer.

include/linux/latencytop.h

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,9 @@ account_scheduler_latency(struct task_struct *task, int usecs, int inter)
 
 void clear_all_latency_tracing(struct task_struct *p);
 
+extern int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #else
 
 static inline void

include/linux/sched.h

Lines changed: 4 additions & 0 deletions
@@ -920,6 +920,10 @@ static inline int sched_info_on(void)
 #endif
 }
 
+#ifdef CONFIG_SCHEDSTATS
+void force_schedstat_enabled(void);
+#endif
+
 enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,

include/linux/sched/sysctl.h

Lines changed: 4 additions & 0 deletions
@@ -95,4 +95,8 @@ extern int sysctl_numa_balancing(struct ctl_table *table, int write,
 				 void __user *buffer, size_t *lenp,
 				 loff_t *ppos);
 
+extern int sysctl_schedstats(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+
 #endif /* _SCHED_SYSCTL_H */

kernel/latencytop.c

Lines changed: 13 additions & 1 deletion
@@ -47,12 +47,12 @@
  * of times)
  */
 
-#include <linux/latencytop.h>
 #include <linux/kallsyms.h>
 #include <linux/seq_file.h>
 #include <linux/notifier.h>
 #include <linux/spinlock.h>
 #include <linux/proc_fs.h>
+#include <linux/latencytop.h>
 #include <linux/export.h>
 #include <linux/sched.h>
 #include <linux/list.h>
@@ -289,4 +289,16 @@ static int __init init_lstats_procfs(void)
 	proc_create("latency_stats", 0644, NULL, &lstats_fops);
 	return 0;
 }
+
+int sysctl_latencytop(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	err = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (latencytop_enabled)
+		force_schedstat_enabled();
+
+	return err;
+}
 device_initcall(init_lstats_procfs);
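The handler above is what makes "automatically activated for
latencytop" from the changelog concrete: any write that leaves
latencytop_enabled set calls force_schedstat_enabled(). A rough
user-space check of that coupling, assuming the standard sysctl layout
(kernel.latencytop next to the new kernel.sched_schedstats) and root
privileges:

/*
 * Enable latencytop, then read back both knobs; with this patch the
 * second should report 1 even if schedstats was off beforehand.
 * The procfs paths are assumptions based on the usual sysctl layout.
 */
#include <stdio.h>

static void show(const char *path)
{
        int val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%d", &val) != 1)
                        val = -1;
                fclose(f);
        }
        printf("%s = %d\n", path, val);
}

int main(void)
{
        FILE *f = fopen("/proc/sys/kernel/latencytop", "w");

        if (f) {
                fputs("1", f);  /* triggers force_schedstat_enabled() */
                fclose(f);
        }
        show("/proc/sys/kernel/latencytop");
        show("/proc/sys/kernel/sched_schedstats");
        return 0;
}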

kernel/profile.c

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ int profile_setup(char *str)
 
 	if (!strncmp(str, sleepstr, strlen(sleepstr))) {
 #ifdef CONFIG_SCHEDSTATS
+		force_schedstat_enabled();
 		prof_on = SLEEP_PROFILING;
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;

kernel/sched/core.c

Lines changed: 68 additions & 2 deletions
@@ -2093,7 +2093,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 
 	ttwu_queue(p, cpu);
 stat:
-	ttwu_stat(p, cpu, wake_flags);
+	if (schedstat_enabled())
+		ttwu_stat(p, cpu, wake_flags);
 out:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
@@ -2141,7 +2142,8 @@ static void try_to_wake_up_local(struct task_struct *p)
 	ttwu_activate(rq, p, ENQUEUE_WAKEUP);
 
 	ttwu_do_wakeup(rq, p, 0);
-	ttwu_stat(p, smp_processor_id(), 0);
+	if (schedstat_enabled())
+		ttwu_stat(p, smp_processor_id(), 0);
 out:
 	raw_spin_unlock(&p->pi_lock);
 }
@@ -2210,6 +2212,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
+	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
 #endif
 
@@ -2281,6 +2284,69 @@ int sysctl_numa_balancing(struct ctl_table *table, int write,
 #endif
 #endif
 
+DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+
+#ifdef CONFIG_SCHEDSTATS
+static void set_schedstats(bool enabled)
+{
+	if (enabled)
+		static_branch_enable(&sched_schedstats);
+	else
+		static_branch_disable(&sched_schedstats);
+}
+
+void force_schedstat_enabled(void)
+{
+	if (!schedstat_enabled()) {
+		pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
+		static_branch_enable(&sched_schedstats);
+	}
+}
+
+static int __init setup_schedstats(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+
+	if (!strcmp(str, "enable")) {
+		set_schedstats(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_schedstats(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		pr_warn("Unable to parse schedstats=\n");
+
+	return ret;
+}
+__setup("schedstats=", setup_schedstats);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_schedstats(struct ctl_table *table, int write,
+			void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_schedstats);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0)
+		return err;
+	if (write)
+		set_schedstats(state);
+	return err;
+}
+#endif
+#endif
+
 /*
  * fork()/clone()-time setup:
 */
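Two details of the hunk above are worth spelling out. First, the sysctl
handler cannot hand the static key to proc_dointvec_minmax() directly,
so it snapshots the key into a local int, points a copy of the
ctl_table at that local, and only applies the parsed value back through
set_schedstats() on a write. Second, the point of a static key is that
when it is false, every schedstat_enabled() site becomes a patched-out
branch rather than a load-and-test. A user-space analogue of the gating
pattern, as a sketch only (a plain bool stands in for the static key,
which it can merely approximate):

/*
 * Analogue of the schedstat gating in this patch: stats_enabled plays
 * the role of the sched_schedstats static key, account_wakeup() the
 * role of a ttwu_stat() call site guarded by schedstat_enabled().
 */
#include <stdbool.h>
#include <stdio.h>

static bool stats_enabled;      /* DEFINE_STATIC_KEY_FALSE analogue */

struct wakeup_stats {
        unsigned long nr_wakeups;
        unsigned long nr_wakeups_local;
};

static struct wakeup_stats stats;

static void account_wakeup(bool local)
{
        if (!stats_enabled)     /* schedstat_enabled() analogue */
                return;
        stats.nr_wakeups++;
        if (local)
                stats.nr_wakeups_local++;
}

int main(void)
{
        account_wakeup(true);   /* dropped: stats still disabled */
        stats_enabled = true;   /* set_schedstats(true) analogue */
        account_wakeup(true);
        account_wakeup(false);
        printf("wakeups=%lu local=%lu\n",
               stats.nr_wakeups, stats.nr_wakeups_local);
        return 0;               /* prints: wakeups=2 local=1 */
}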

kernel/sched/debug.c

Lines changed: 54 additions & 48 deletions
@@ -75,16 +75,18 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	PN(se->vruntime);
 	PN(se->sum_exec_runtime);
 #ifdef CONFIG_SCHEDSTATS
-	PN(se->statistics.wait_start);
-	PN(se->statistics.sleep_start);
-	PN(se->statistics.block_start);
-	PN(se->statistics.sleep_max);
-	PN(se->statistics.block_max);
-	PN(se->statistics.exec_max);
-	PN(se->statistics.slice_max);
-	PN(se->statistics.wait_max);
-	PN(se->statistics.wait_sum);
-	P(se->statistics.wait_count);
+	if (schedstat_enabled()) {
+		PN(se->statistics.wait_start);
+		PN(se->statistics.sleep_start);
+		PN(se->statistics.block_start);
+		PN(se->statistics.sleep_max);
+		PN(se->statistics.block_max);
+		PN(se->statistics.exec_max);
+		PN(se->statistics.slice_max);
+		PN(se->statistics.wait_max);
+		PN(se->statistics.wait_sum);
+		P(se->statistics.wait_count);
+	}
 #endif
 	P(se->load.weight);
 #ifdef CONFIG_SMP
@@ -122,10 +124,12 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 		(long long)(p->nvcsw + p->nivcsw),
 		p->prio);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
-		SPLIT_NS(p->se.statistics.wait_sum),
-		SPLIT_NS(p->se.sum_exec_runtime),
-		SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	if (schedstat_enabled()) {
+		SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
+			SPLIT_NS(p->se.statistics.wait_sum),
+			SPLIT_NS(p->se.sum_exec_runtime),
+			SPLIT_NS(p->se.statistics.sum_sleep_runtime));
+	}
 #else
 	SEQ_printf(m, "%9Ld.%06ld %9Ld.%06ld %9Ld.%06ld",
 		0LL, 0L,
@@ -313,17 +317,18 @@ do { \
 #define P(n) SEQ_printf(m, " .%-30s: %d\n", #n, rq->n);
 #define P64(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, rq->n);
 
-	P(yld_count);
-
-	P(sched_count);
-	P(sched_goidle);
 #ifdef CONFIG_SMP
 	P64(avg_idle);
 	P64(max_idle_balance_cost);
 #endif
 
-	P(ttwu_count);
-	P(ttwu_local);
+	if (schedstat_enabled()) {
+		P(yld_count);
+		P(sched_count);
+		P(sched_goidle);
+		P(ttwu_count);
+		P(ttwu_local);
+	}
 
 #undef P
 #undef P64
@@ -569,38 +574,39 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	nr_switches = p->nvcsw + p->nivcsw;
 
 #ifdef CONFIG_SCHEDSTATS
-	PN(se.statistics.sum_sleep_runtime);
-	PN(se.statistics.wait_start);
-	PN(se.statistics.sleep_start);
-	PN(se.statistics.block_start);
-	PN(se.statistics.sleep_max);
-	PN(se.statistics.block_max);
-	PN(se.statistics.exec_max);
-	PN(se.statistics.slice_max);
-	PN(se.statistics.wait_max);
-	PN(se.statistics.wait_sum);
-	P(se.statistics.wait_count);
-	PN(se.statistics.iowait_sum);
-	P(se.statistics.iowait_count);
 	P(se.nr_migrations);
-	P(se.statistics.nr_migrations_cold);
-	P(se.statistics.nr_failed_migrations_affine);
-	P(se.statistics.nr_failed_migrations_running);
-	P(se.statistics.nr_failed_migrations_hot);
-	P(se.statistics.nr_forced_migrations);
-	P(se.statistics.nr_wakeups);
-	P(se.statistics.nr_wakeups_sync);
-	P(se.statistics.nr_wakeups_migrate);
-	P(se.statistics.nr_wakeups_local);
-	P(se.statistics.nr_wakeups_remote);
-	P(se.statistics.nr_wakeups_affine);
-	P(se.statistics.nr_wakeups_affine_attempts);
-	P(se.statistics.nr_wakeups_passive);
-	P(se.statistics.nr_wakeups_idle);
 
-	{
+	if (schedstat_enabled()) {
 		u64 avg_atom, avg_per_cpu;
 
+		PN(se.statistics.sum_sleep_runtime);
+		PN(se.statistics.wait_start);
+		PN(se.statistics.sleep_start);
+		PN(se.statistics.block_start);
+		PN(se.statistics.sleep_max);
+		PN(se.statistics.block_max);
+		PN(se.statistics.exec_max);
+		PN(se.statistics.slice_max);
+		PN(se.statistics.wait_max);
+		PN(se.statistics.wait_sum);
+		P(se.statistics.wait_count);
+		PN(se.statistics.iowait_sum);
+		P(se.statistics.iowait_count);
+		P(se.statistics.nr_migrations_cold);
+		P(se.statistics.nr_failed_migrations_affine);
+		P(se.statistics.nr_failed_migrations_running);
+		P(se.statistics.nr_failed_migrations_hot);
+		P(se.statistics.nr_forced_migrations);
+		P(se.statistics.nr_wakeups);
+		P(se.statistics.nr_wakeups_sync);
+		P(se.statistics.nr_wakeups_migrate);
+		P(se.statistics.nr_wakeups_local);
+		P(se.statistics.nr_wakeups_remote);
+		P(se.statistics.nr_wakeups_affine);
+		P(se.statistics.nr_wakeups_affine_attempts);
+		P(se.statistics.nr_wakeups_passive);
+		P(se.statistics.nr_wakeups_idle);
+
 		avg_atom = p->se.sum_exec_runtime;
 		if (nr_switches)
 			avg_atom = div64_ul(avg_atom, nr_switches);
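With the hunks above applied, the per-task statistics block is emitted
only while schedstats is enabled. proc_sched_show_task() is what
conventionally backs /proc/<pid>/sched, so the change is directly
observable from user space; a small sketch that dumps the file for the
current process (run it with the tunable toggled both ways to watch the
se.statistics.* lines appear and disappear):

/*
 * Dump /proc/self/sched, the file rendered by proc_sched_show_task().
 * The path is an assumption based on the usual procfs layout.
 */
#include <stdio.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/sched", "r");

        if (!f) {
                perror("/proc/self/sched");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}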
