Triaging issues in DataWarehouse
Problem
An untriaged blocking issue is found in DataWarehouse (DW):
Initial investigation
-
Determine the type of the issue. Look at the linked log files linked at the subtest and test level to see whether it is possible to figure out what is going on.
-
If the problem seems related to the code under test, try to get it fixed!
-
If the problem cannot be fixed immediately, a Known Issue can be created for it.
Keep in mind that creating such an issue comes with the responsibility to follow up on it to get it fixed!
Prerequisites
Triaging issues in DW requires the appropriate permissions, which are generally only held by QE-minded engineers.
Reach out to the CKI team as shown in the page footer if you are unsure whether you (should) have the necessary permissions.
Steps
-
Determine the correct issue tracker depending on issue type and kernel tree:
Type Repository Issue tracker test issue any kernel-tests, and ping the test maintainer RPM issue any kernel-ark non-kernel issue any appropriate project in JIRA kernel issue kernel-ark on Rawhide Bugzilla, product Fedora
, componentkernel
, versionrawhide
kernel issue kernel-ark on ELN JIRA, project RHEL
kernel issue centos-stream-9 or RHEL JIRA, project RHEL
kernel issue upstream the appropriate subsystem mailing list -
Search the issue tracker, subsystem mailing list or just plain Google to see whether the issue was already reported.
-
If this is a new issue, create a new ticket. For upstream reports, determine the URL of the email in the mailing list archive.
-
Create a Known Issue in DataWarehouse. On the Build or Test page, click on
Actions -> Associate issue -> Create New Issue
. Use the title and URL of the issue tracker ticket. -
Create a regex for automated triaging.
Usually a failure doesn’t happen just once, therefore it is preferable to add a regex to automatically tag it. Runs from the past two weeks and all new executions will be checked against the regex. The regex should be specific enough that different failures are not added by mistake to the issue, masking new failures.
On the Issue page, create a new regex by clicking on
Actions -> New Issue Regex
.Regex tagging tips
- Use the
Python
regex parser at regex101.com to validate the regex. - By default, the parser matches individual lines only. Newlines can be
matched via
\n
or\s
, To match newlines via.
(DOT), checkSet Regex Options -> dot matches all
. - Avoid using
.*
or.+
as it causes performance issues when checking large files. Use instead a large enough quantifier like{,100}
. Never use them around the pattern, as we are already doing a substring search; use^
and$
to match the beginning and end of a string or line (requires enabling the MULTILINE flag). - In paths, remember to escape every dot in filenames. On the other hand,
you don’t need to escape
/
(forward slash) since we’re using the Python regex flavor. - Set the subtest in cases the test reports subtests (also known as “Test Results”) to avoid unrelated failures in the test being waived by it.
- On tests, prefer matching the filename
main-output\.log
, which is an alias to the main log generated from them depending on their runner, that istestout.log
(beaker/restraint) andtaskout.log
(testing-farm). - Avoid matching on lines in
console.log
. This file is shared across tests and should only be used for cases where the information is not available in any other file. - Kernel panics can be found in
console.log
, call traces that don’t trigger a kernel panic indmesg.log
. - Make sure to set
Tree Match
orKPET Tree Name Match
field. Try to be as specific as possible to avoid waiving failures for different releases. - For upstream failures, use
(upstream|rawhide|eln)
onKPET Tree Name Match
. This would match all kernel trees and ark pipelines. - For RHEL 9/CentOS Stream 9, set
Tree Match
to(rhel9|c9s)
as these trees can be tracked on a single ticket.
- Use the
Examples
Test gets stuck at a specific point and aborts due to time out
Failure:
:: [ 14:10:30 ] :: [ BEGIN ] :: Running 'avocado run avocado-misc-tests/memory/fork_mem.py'
JOB ID : a3d0214cdc6a47ead5558a82c3e8e3bde6804767
JOB LOG : /root/avocado/job-results/job-2021-07-28T14.10-a3d0214/job.log
Regex:
Running 'avocado run avocado-misc-tests\/memory\/fork_mem\.py'\nJOB ID\s+: \S+\n
JOB\s+LOG\s+: \/root\/avocado\/job-results\/\S+\/job\.log$
Call trace without kernel panic
Call traces should be in dmesg.log. It usually requires matching pattern in multiple
lines, try to limit the number of lines using something like (\n.+){0,50}
instead of .*
.
Failure:
[14100.944491] BUG: MAX_LOCKDEP_ENTRIES too low!
[14100.944651] turning off the locking correctness validator.
[14100.944739] CPU: 121 PID: 430464 Comm: socket Tainted: G OE 5.14.0-rc2 #1
[14100.944764] Call Trace:
[14100.944775] [c00000003a92f4f0] [c00000000097d4f4] dump_stack_lvl+0x98/0xe0 (unreliable)
[14100.944979] [c00000003a92f530] [c0000000002076d8] add_lock_to_list.constprop.0+0x218/0x230
[14100.945007] [c00000003a92f5a0] [c00000000020e1bc] __lock_acquire+0x1d6c/0x2850
[14100.945033] [c00000003a92f6d0] [c00000000020f80c] lock_acquire+0x11c/0x450
[14100.945057] [c00000003a92f7d0] [c0000000011f0cdc] _raw_spin_lock_irqsave+0x6c/0xc0
[14100.945084] [c00000003a92f810] [c00000000095f42c] percpu_counter_add_batch+0x5c/0x120
[14100.945109] [c00000003a92f850] [c000000000fb9534] tcp_v4_destroy_sock+0x1d4/0x3b0
[14100.945136] [c00000003a92f8a0] [c000000000f89c5c] inet_csk_destroy_sock+0x8c/0x1f0
[14100.945162] [c00000003a92f8d0] [c000000000f9762c] __tcp_close+0x50c/0x680
[14100.945185] [c00000003a92f930] [c000000000f977e4] tcp_close+0x44/0x100
[14100.945329] [c00000003a92f960] [c000000000fed488] inet_release+0x78/0xf0
[14100.945387] [c00000003a92f990] [c000000000e24f40] __sock_release+0x70/0x150
[14100.945412] [c00000003a92fa10] [c000000000e25048] sock_close+0x28/0x40
[14100.945434] [c00000003a92fa30] [c0000000005877f8] __fput+0xd8/0x360
[14100.945457] [c00000003a92fa80] [c0000000001944f4] task_work_run+0xb4/0x120
[14100.945478] [c00000003a92fad0] [c000000000163408] do_exit+0x4e8/0xdd0
[14100.945500] [c00000003a92fb90] [c000000000163dc0] do_group_exit+0x60/0xd0
[14100.945523] [c00000003a92fbd0] [c000000000179d38] get_signal+0x238/0xc00
[14100.945546] [c00000003a92fcb0] [c0000000000229f4] do_notify_resume+0x114/0x4b0
[14100.945572] [c00000003a92fd70] [c00000000002dc64] interrupt_exit_user_prepare_main+0x2e4/0x370
[14100.945600] [c00000003a92fde0] [c00000000002e0a4] syscall_exit_prepare+0xc4/0x1c0
[14100.945625] [c00000003a92fe10] [c00000000000c770] system_call_common+0x100/0x258
[14100.945651] --- interrupt: c00 at 0x7fffa8a59904
Regex:
BUG: MAX_LOCKDEP_ENTRIES too low!\s*(\n.+){1,30}tcp_close\+0x\S+\/0x\S+\s*(\n.+){1,10}inet_release\+0x\S+\/0x\S+
Kernel panic
Kernel panics can be found in console.log
.
Failure:
[ 1.047266] ------------[ cut here ]------------
[ 1.047424] kernel BUG at mm/slub.c:316!
[ 1.047521] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1.047630] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 1.047774] Modules linked in:
[ 1.047858] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.1 #1
[ 1.048037] NIP: c0000000004ad338 LR: c0000000004ad29c CTR: c0000000004415b0
[ 1.048200] REGS: c000000001e2f300 TRAP: 0700 Not tainted (5.13.1)
[ 1.048356] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28024224 XER: 20000000
[ 1.048544] CFAR: c0000000004ad704 IRQMASK: 0
[ 1.048544] GPR00: c0000000004ad29c c000000001e2f5a0 c000000001e31000 0000000000000001
[ 1.048544] GPR04: 00000000000000c0 c000000008774000 0000000000000010 000000000000000e
[ 1.048544] GPR08: c000000008770000 0000000000000001 0000000000004000 c000000001ef6350
[ 1.048544] GPR12: c0000000004415b0 c000000002070000 0000000000000000 0000000000000100
[ 1.048544] GPR16: 0000000000000009 0000000004200002 0000000000000000 0000000000000001
[ 1.048544] GPR20: c000000001e53a00 0000000000000000 0000000000000000 c000000001cb6880
[ 1.048544] GPR24: 0000000000000000 c000000008770000 0000000000000001 c000000000095d74
[ 1.048544] GPR28: c000000008770000 c000000008770000 c000000008012500 c00c000000021c00
[ 1.049955] NIP [c0000000004ad338] kmem_cache_free+0x198/0x5e0
[ 1.050115] LR [c0000000004ad29c] kmem_cache_free+0xfc/0x5e0
[ 1.050260] Call Trace:
[ 1.050322] [c000000001e2f5a0] [c0000000004ad29c] kmem_cache_free+0xfc/0x5e0 (unreliable)
[ 1.050504] [c000000001e2f660] [c000000000095d74] __tlb_remove_table+0x124/0x140
[ 1.050696] [c000000001e2f690] [c000000000441604] tlb_remove_table_rcu+0x54/0xa0
[ 1.050859] [c000000001e2f6c0] [c00000000020e608] rcu_do_batch+0x1e8/0x580
[ 1.051014] [c000000001e2f760] [c00000000020f520] rcu_core+0x2d0/0x470
[ 1.051156] [c000000001e2f7d0] [c000000001106a60] __do_softirq+0x160/0x3fc
[ 1.051330] [c000000001e2f8c0] [c000000000157374] __irq_exit_rcu+0x1d4/0x1e0
[ 1.051496] [c000000001e2f8f0] [c000000000157550] irq_exit+0x20/0x40
[ 1.051639] [c000000001e2f910] [c000000000025054] timer_interrupt+0x184/0x3b0
[ 1.051809] [c000000001e2f970] [c000000000009c04] decrementer_common_virt+0x1a4/0x1b0
[ 1.051990] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x24
[ 1.052132] NIP: c0000000000f0eb4 LR: c000000000d9ce28 CTR: c0000000000b55d4
[ 1.052296] REGS: c000000001e2f9e0 TRAP: 0900 Not tainted (5.13.1)
[ 1.052430] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48000224 XER: 00000000
[ 1.052666] CFAR: c0000000000b411c IRQMASK: 0
[ 1.052666] GPR00: 0000000048000224 c000000001e2fc80 c000000001e31000 0000000000000000
[ 1.052666] GPR04: c000000001d4b450 0000000000000001 000000023cb26ef0 00000002ecbb0000
[ 1.052666] GPR08: 0000000000000000 0000000000000000 c000000001751fa0 c0000002ee38fe48
[ 1.052666] GPR12: c000000000d9ccd0 c000000002070000 0000000000000000 0000000000000000
[ 1.052666] GPR16: 0000000000000000 c000000001d4b450 c0000002ee38fac8 c000000001da8400
[ 1.052666] GPR20: c000000001da8400 c000000001da8380 c000000001da8400 c000000001da8380
[ 1.052666] GPR24: c000000001da8380 0000000000000000 000000003e3593bc 0000000000000001
[ 1.052666] GPR28: c000000001e5aef8 0000000000000001 c000000001751f98 c000000001751fa0
[ 1.054167] NIP [c0000000000f0eb4] plpar_hcall_norets_notrace+0x18/0x24
[ 1.054304] LR [c000000000d9ce28] shared_cede_loop+0x158/0x1c0
[ 1.054472] --- interrupt: 900
[ 1.054553] [c000000001e2fc80] [c000000001da8380] init_task+0x0/0x2780 (unreliable)
[ 1.054816] [c000000001e2fd00] [c000000000d99870] cpuidle_enter_state+0x240/0x560
[ 1.055043] [c000000001e2fd60] [c000000000d99c2c] cpuidle_enter+0x4c/0x70
[ 1.055194] [c000000001e2fda0] [c0000000001a7408] do_idle+0x368/0x470
[ 1.055325] [c000000001e2fe60] [c0000000001a777c] cpu_startup_entry+0x3c/0x40
[ 1.055487] [c000000001e2fe90] [c000000000012800] rest_init+0xf0/0x108
[ 1.055624] [c000000001e2fec0] [c000000001604950] start_kernel+0xae0/0xb14
[ 1.055780] [c000000001e2ff90] [c00000000000ce44] start_here_common+0x1c/0x58
[ 1.055939] Instruction dump:
[ 1.056018] 7d295214 e9490010 3b000000 3b400001 7c3f5000 408200c4 e9090000 815e0028
[ 1.056207] 7d09ca78 7cb95214 7d290074 7929d182 <0b090000> 78a40022 54a7c03e e93e00b8
[ 1.056390] ---[ end trace 3876762cf948c0e1 ]---
[ 1.062367]
[ 2.062446] Kernel panic - not syncing: Aiee, killing interrupt handler!
Regex:
kernel BUG at mm\/slub\.c:316!\s*(\n.+){1,5}Oops: Exception in kernel mode