# Addressing Flaky GTests

## Understanding builder results

[LUCI Analysis](https://luci-analysis.appspot.com/p/chromium/clusters) lists the
top flake clusters of tests along with any associated bug and failure counts in
different contexts.

## Reproducing the flaky test

If debugging via bot is too slow or you otherwise need to drill further into the
cause of the flake, you can try to reproduce the flake locally. Reproducing the
flake can be difficult, so it helps to replicate the test environment as closely
as possible.

Copy the gn args from one of the bots where the flake occurs, and try to choose
a bot close to your system, e.g. linux-rel if you're building on Linux. To get
the gn args, click on the timestamp in the flake portal to view the bot run
details, and search for the "lookup GN args" build step to copy the args.

![bot_gn_args]

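One way to apply the copied args is sketched below. It assumes `gn` and `autoninja` from depot_tools are on your `PATH`; `write_gn_args` is a hypothetical helper introduced only for illustration, and the args in the usage comment are placeholders rather than the real linux-rel configuration. Running `gn args out/linux-rel` and pasting the args into the editor it opens should work just as well.

```shell
# Sketch only: `write_gn_args` is a hypothetical helper, not a Chromium tool.
# It saves GN args read from stdin into <out_dir>/args.gn, the file GN reads
# its build arguments from.
write_gn_args() {
  local out_dir="$1"
  mkdir -p "$out_dir"
  cat > "$out_dir/args.gn"
}

# Example usage. The two args are illustrative placeholders; paste the args
# copied from the bot's "lookup GN args" step instead.
#
#   write_gn_args out/linux-rel <<'EOF'
#   is_debug = false
#   is_component_build = false
#   EOF
#   gn gen out/linux-rel                       # regenerate build files
#   autoninja -C out/linux-rel browser_tests   # build the test target
```
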
Build and run the test locally. Depending on the frequency of the flake, it may
take some time to reproduce. Some helpful flags:
- `--gtest_repeat=100`
- `--gtest_also_run_disabled_tests` (if the flaky test(s) you're looking at have
  been disabled)

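If you want to stop at the first failing run and keep its output, a small wrapper loop can help. The sketch below is an assumption-laden illustration: `run_until_failure` is a hypothetical helper, and the usage comment assumes the `out/linux-rel` build directory and `browser_tests` target used elsewhere in this document.

```shell
# Sketch only: `run_until_failure` is a hypothetical helper, not a Chromium tool.
# It reruns a command until it fails or the attempt budget runs out.
run_until_failure() {
  local max_attempts="$1"
  shift
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if ! "$@"; then
      echo "failed on attempt ${attempt} of ${max_attempts}"
      return 1
    fi
    attempt=$((attempt + 1))
  done
  echo "no failure in ${max_attempts} attempts"
  return 0
}

# Example usage (assumed paths; substitute your own test name):
#
#   run_until_failure 50 out/linux-rel/browser_tests \
#     --gtest_filter="*<YOUR_TEST_NAME_HERE>*" \
#     --gtest_also_run_disabled_tests
```
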
If you're unable to reproduce the flake locally, you can also upload a patch
with debug logging added and the flaky test enabled, then run it on the bots to
reproduce the flake with more information.

Another good option is *Swarming*, which lets you mimic bot conditions to better
reproduce flakes that actually occur on CQ bots.

### Swarming
For a more detailed dive into Swarming, see
[Debugging with Swarming](https://chromium.googlesource.com/chromium/src/+/master/docs/workflow/debugging-with-swarming.md#authenticating).

As an example, suppose we have built Chrome using the GN args from
above into a directory `out/linux-rel`. We can then run this command from
the `chromium/src` directory:

```
tools/run-swarmed.py out/linux-rel browser_tests -- --gtest_filter="*<YOUR_TEST_NAME_HERE>*" --gtest_repeat=20 --gtest_also_run_disabled_tests
```

This lets us iterate quickly: use the logs from each run to reproduce the flake
and, ideally, fix it.

>TODO: Add more tips for reproducing flaky tests

## Debugging the flaky test

If the test is flakily timing out, consider any asynchronous code that may cause
race conditions, where the test subject may exit early and miss a callback, or
return faster than the test can start waiting for it (i.e. make sure event
listeners are registered before triggering the event). Make sure event listeners
are for the proper event instead of a proxy (e.g. [Wait for the correct event in
test](https://chromium.googlesource.com/chromium/src/+/6da09f7510e94d2aebbbed13b038d71c511d6cbc)).

Consider possible bugs in the system or test infrastructure (e.g. [races in
glibc](https://bugs.chromium.org/p/chromium/issues/detail?id=1010318)).

For browsertest flakes, consider possible inter-process issues, such as the
renderer taking too long or returning something unexpected (e.g. [flaky
RenderFrameHostImplBrowserTest](https://bugs.chromium.org/p/chromium/issues/detail?id=1120305)).

For browsertest flakes that check EvalJs results, make sure test objects are not
destroyed before JS may read their values (e.g. [flaky
PaymentAppBrowserTest](https://chromium.googlesource.com/chromium/src/+/6089f3480c5036c73464661b3b1b6b82807b56a3)).

For browsertest flakes that involve dialogs or widgets, make sure that test
objects are not destroyed because the dialog loses focus (e.g. [flaky
AccessCodeCastHandlerBrowserTest](https://chromium-review.googlesource.com/c/chromium/src/+/3951132)).

## Preventing similar flakes

Once you understand the problem and have a fix for the test, think about how the
fix may apply to other tests, or whether documentation can be improved, either in
the relevant code or in this flaky test documentation.


[bot_gn_args]: images/bot_gn_args.png