Triaging issues in DataWarehouse
Permissions
Special permissions need to be granted to a user in order to be able to triage issues. Reach out to the CKI team on cki-project@redhat.com if you need permissions to triage issues.
New unknown issue in DataWarehouse
When an unknown issue is found in DataWarehouse (DW), it is necessary to investigate what causes it. There are multiple types of issues and this new issue should be reported in DataWarehouse and associated to one of these types.
Report the issue
Depending on the type of the issue, a different place is used to report it.
- Problem with the test or a request to the test maintainer to investigate the failure: Open a new test issue.
- Problem with RHEL kernels: Bugzilla.
- Problem with ARK-ELN kernels: Bugzilla under RHEL-9.
- Problem with ARK-Rawhide kernels: Bugzilla under Fedora Rawhide.
- Problem with upstream kernel: A report to an appropriate kernel
upstream list must be sent: contact CKI team to help send the report.
- The email should be sent to the subsystem mailing list and maintainers as described in finding maintainers.
- Don’t send report if the failure doesn’t happen with latest checkout
in the tree. The issue though, should be monitored in case the problem still
there, but is not easily reproducible.
- For example, by having an issue regex in DW and set the issue as resolved, so there will be an email notification in case the failure happens again.
- Problem with a non-kernel package, like selinux-policy: Bugzilla.
- Infrastructure issue can have many different root cause: contact the CKI team if you need help with a ticket for it.
You can link an existing ticket or email discussion if the issue was already reported.
Create an issue in DataWarehouse
Now that we have a link to the report, an issue should be created in DataWarehouse and the link added to it.
To create a new issue associated with a build or test, click on
Actions -> Associate issue
on the build or test page:
You have an option to either associate an already existing issue, or create
a new one by clicking on Create New Issue
and then associating this new
issue:
Creating a regex for automated triaging
Usually a failure doesn’t happen just once, therefore it is preferable to add a regex to automatically tag it. Runs from the past two weeks and all new executions will be checked against the regex. The regex should be specific enough that different failures are not added by mistake to the issue, masking new failures.
You can create a new regex by going to Issues -> Issues Regexes
in the menu
sidebar:
Click on Actions -> New Issue Regex
:
Regex tagging tips
- Use regex101.com to validate the regex, make sure to use the Python parser.
- The parser supports multiline search. By default, you can match newlines specifying
\n
or\s
, if you want DOT to match ALL, including newlines, use inline flags, i.e. add the prefix(?s)
or check the respective flag under the “Set Regex Options” button.
- Avoid using
.*
as much as possible as this causes performance issue when checking large files. - In general the regex should be associated with a specific test case name. You can also target specific test results, also known as subtests. The regex will be searched against “TestName/SubTestName”, so prefer to use the full name whenever it’s possible.
- Avoid matching regex on
console.log
, this file should be used in most cases for regex of panics where the information is not available in any other file. - Don’t create a regex using
console.log
and leave the test case name empty. Asconsole.log
is the same for all tests executed in that machine it’d cause any failure to be associated with it. - For Call Traces that don’t trigger panic the information is on
dmesg.log
.dmesg.log
is separate for each file, therefore it is better to use it instead ofconsole.log
. - Make sure to set
Tree Match
orKPET Tree Name Match
field. Try to be as specific as possible to avoid waiving failures for different releases. - For upstream failures it is better to set
(upstream|rawhide|eln)
onKPET Tree Name Match
. This would match all kernel trees and ark pipelines. - For RHEL-9/C9S as they can be tracked in a single ticket set
(rhel9|c9s)
onTree Match
. - Make sure to escape every dot in filenames.
Examples
-
Test gets stuck at a specific point and aborts due to time out.
Failure:
:: [ 14:10:30 ] :: [ BEGIN ] :: Running 'avocado run avocado-misc-tests/memory/fork_mem.py' JOB ID : a3d0214cdc6a47ead5558a82c3e8e3bde6804767 JOB LOG : /root/avocado/job-results/job-2021-07-28T14.10-a3d0214/job.log
Regex:
Running 'avocado run avocado-misc-tests\/memory\/fork_mem\.py'\nJOB ID\s+: \S+\n JOB\s+LOG\s+: \/root\/avocado\/job-results\/\S+\/job\.log$
-
Call Traces, except in case of panic they should be in dmesg.log. For Call Traces it can be difficult to create a regex without using
.*
.Failure:
[14100.944491] BUG: MAX_LOCKDEP_ENTRIES too low! [14100.944651] turning off the locking correctness validator. [14100.944739] CPU: 121 PID: 430464 Comm: socket Tainted: G OE 5.14.0-rc2 #1 [14100.944764] Call Trace: [14100.944775] [c00000003a92f4f0] [c00000000097d4f4] dump_stack_lvl+0x98/0xe0 (unreliable) [14100.944979] [c00000003a92f530] [c0000000002076d8] add_lock_to_list.constprop.0+0x218/0x230 [14100.945007] [c00000003a92f5a0] [c00000000020e1bc] __lock_acquire+0x1d6c/0x2850 [14100.945033] [c00000003a92f6d0] [c00000000020f80c] lock_acquire+0x11c/0x450 [14100.945057] [c00000003a92f7d0] [c0000000011f0cdc] _raw_spin_lock_irqsave+0x6c/0xc0 [14100.945084] [c00000003a92f810] [c00000000095f42c] percpu_counter_add_batch+0x5c/0x120 [14100.945109] [c00000003a92f850] [c000000000fb9534] tcp_v4_destroy_sock+0x1d4/0x3b0 [14100.945136] [c00000003a92f8a0] [c000000000f89c5c] inet_csk_destroy_sock+0x8c/0x1f0 [14100.945162] [c00000003a92f8d0] [c000000000f9762c] __tcp_close+0x50c/0x680 [14100.945185] [c00000003a92f930] [c000000000f977e4] tcp_close+0x44/0x100 [14100.945329] [c00000003a92f960] [c000000000fed488] inet_release+0x78/0xf0 [14100.945387] [c00000003a92f990] [c000000000e24f40] __sock_release+0x70/0x150 [14100.945412] [c00000003a92fa10] [c000000000e25048] sock_close+0x28/0x40 [14100.945434] [c00000003a92fa30] [c0000000005877f8] __fput+0xd8/0x360 [14100.945457] [c00000003a92fa80] [c0000000001944f4] task_work_run+0xb4/0x120 [14100.945478] [c00000003a92fad0] [c000000000163408] do_exit+0x4e8/0xdd0 [14100.945500] [c00000003a92fb90] [c000000000163dc0] do_group_exit+0x60/0xd0 [14100.945523] [c00000003a92fbd0] [c000000000179d38] get_signal+0x238/0xc00 [14100.945546] [c00000003a92fcb0] [c0000000000229f4] do_notify_resume+0x114/0x4b0 [14100.945572] [c00000003a92fd70] [c00000000002dc64] interrupt_exit_user_prepare_main+0x2e4/0x370 [14100.945600] [c00000003a92fde0] [c00000000002e0a4] syscall_exit_prepare+0xc4/0x1c0 [14100.945625] [c00000003a92fe10] [c00000000000c770] system_call_common+0x100/0x258 [14100.945651] --- interrupt: c00 at 0x7fffa8a59904
Regex:
BUG: MAX_LOCKDEP_ENTRIES too low!.*tcp_close\+0x\S+\/0x\S+.*inet_release\+0x\S+\/0x\S+
-
Kernel panic, in this case
console.log
must be used.Failure:
[ 1.047266] ------------[ cut here ]------------ [ 1.047424] kernel BUG at mm/slub.c:316! [ 1.047521] Oops: Exception in kernel mode, sig: 5 [#1] [ 1.047630] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 1.047774] Modules linked in: [ 1.047858] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.1 #1 [ 1.048037] NIP: c0000000004ad338 LR: c0000000004ad29c CTR: c0000000004415b0 [ 1.048200] REGS: c000000001e2f300 TRAP: 0700 Not tainted (5.13.1) [ 1.048356] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28024224 XER: 20000000 [ 1.048544] CFAR: c0000000004ad704 IRQMASK: 0 [ 1.048544] GPR00: c0000000004ad29c c000000001e2f5a0 c000000001e31000 0000000000000001 [ 1.048544] GPR04: 00000000000000c0 c000000008774000 0000000000000010 000000000000000e [ 1.048544] GPR08: c000000008770000 0000000000000001 0000000000004000 c000000001ef6350 [ 1.048544] GPR12: c0000000004415b0 c000000002070000 0000000000000000 0000000000000100 [ 1.048544] GPR16: 0000000000000009 0000000004200002 0000000000000000 0000000000000001 [ 1.048544] GPR20: c000000001e53a00 0000000000000000 0000000000000000 c000000001cb6880 [ 1.048544] GPR24: 0000000000000000 c000000008770000 0000000000000001 c000000000095d74 [ 1.048544] GPR28: c000000008770000 c000000008770000 c000000008012500 c00c000000021c00 [ 1.049955] NIP [c0000000004ad338] kmem_cache_free+0x198/0x5e0 [ 1.050115] LR [c0000000004ad29c] kmem_cache_free+0xfc/0x5e0 [ 1.050260] Call Trace: [ 1.050322] [c000000001e2f5a0] [c0000000004ad29c] kmem_cache_free+0xfc/0x5e0 (unreliable) [ 1.050504] [c000000001e2f660] [c000000000095d74] __tlb_remove_table+0x124/0x140 [ 1.050696] [c000000001e2f690] [c000000000441604] tlb_remove_table_rcu+0x54/0xa0 [ 1.050859] [c000000001e2f6c0] [c00000000020e608] rcu_do_batch+0x1e8/0x580 [ 1.051014] [c000000001e2f760] [c00000000020f520] rcu_core+0x2d0/0x470 [ 1.051156] [c000000001e2f7d0] [c000000001106a60] __do_softirq+0x160/0x3fc [ 1.051330] [c000000001e2f8c0] [c000000000157374] __irq_exit_rcu+0x1d4/0x1e0 [ 1.051496] [c000000001e2f8f0] [c000000000157550] irq_exit+0x20/0x40 [ 1.051639] [c000000001e2f910] [c000000000025054] timer_interrupt+0x184/0x3b0 [ 1.051809] [c000000001e2f970] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0 [ 1.051990] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x24 [ 1.052132] NIP: c0000000000f0eb4 LR: c000000000d9ce28 CTR: c0000000000b55d4 [ 1.052296] REGS: c000000001e2f9e0 TRAP: 0900 Not tainted (5.13.1) [ 1.052430] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48000224 XER: 00000000 [ 1.052666] CFAR: c0000000000b411c IRQMASK: 0 [ 1.052666] GPR00: 0000000048000224 c000000001e2fc80 c000000001e31000 0000000000000000 [ 1.052666] GPR04: c000000001d4b450 0000000000000001 000000023cb26ef0 00000002ecbb0000 [ 1.052666] GPR08: 0000000000000000 0000000000000000 c000000001751fa0 c0000002ee38fe48 [ 1.052666] GPR12: c000000000d9ccd0 c000000002070000 0000000000000000 0000000000000000 [ 1.052666] GPR16: 0000000000000000 c000000001d4b450 c0000002ee38fac8 c000000001da8400 [ 1.052666] GPR20: c000000001da8400 c000000001da8380 c000000001da8400 c000000001da8380 [ 1.052666] GPR24: c000000001da8380 0000000000000000 000000003e3593bc 0000000000000001 [ 1.052666] GPR28: c000000001e5aef8 0000000000000001 c000000001751f98 c000000001751fa0 [ 1.054167] NIP [c0000000000f0eb4] plpar_hcall_norets_notrace+0x18/0x24 [ 1.054304] LR [c000000000d9ce28] shared_cede_loop+0x158/0x1c0 [ 1.054472] --- interrupt: 900 [ 1.054553] [c000000001e2fc80] [c000000001da8380] init_task+0x0/0x2780 (unreliable) [ 1.054816] [c000000001e2fd00] [c000000000d99870] cpuidle_enter_state+0x240/0x560 [ 1.055043] [c000000001e2fd60] [c000000000d99c2c] cpuidle_enter+0x4c/0x70 [ 1.055194] [c000000001e2fda0] [c0000000001a7408] do_idle+0x368/0x470 [ 1.055325] [c000000001e2fe60] [c0000000001a777c] cpu_startup_entry+0x3c/0x40 [ 1.055487] [c000000001e2fe90] [c000000000012800] rest_init+0xf0/0x108 [ 1.055624] [c000000001e2fec0] [c000000001604950] start_kernel+0xae0/0xb14 [ 1.055780] [c000000001e2ff90] [c00000000000ce44] start_here_common+0x1c/0x58 [ 1.055939] Instruction dump: [ 1.056018] 7d295214 e9490010 3b000000 3b400001 7c3f5000 408200c4 e9090000 815e0028 [ 1.056207] 7d09ca78 7cb95214 7d290074 7929d182 <0b090000> 78a40022 54a7c03e e93e00b8 [ 1.056390] ---[ end trace 3876762cf948c0e1 ]--- [ 1.062367] [ 2.062446] Kernel panic - not syncing: Aiee, killing interrupt handler!
Regex:
kernel BUG at mm\/slub\.c:316!\s*\n\s*\[\s*\S+\] Oops: Exception in kernel mode