NIO-ICSE2022.pptx

Preempting Flaky Tests via Non-Idempotent-Outcome Tests
Anjiang Wei, Pu Yi, Zhengxi Li, Tao Xie, Darko Marinov, Wing Lam
anjiang@stanford.edu
Funding acknowledgments: CCF-1763788, CCF-1956374, 62161146003
1

2
Developer Anecdote
[Figure: at 4:15 PM the change below is sent to the servers, which build the code and run tests test0, test1, test2, …, testn]
…
- static int add() {
+ static int add(r) {
- ts.addRow("");
+ ts.addRow(r);
return ts.size();
…

3
Developer Anecdote
[Figure: 4:15 PM; the servers build the code and run tests test0, test1, test2, …, testn on the change below]
…
+ ts.addRow(r);
return ts.size();
…
Pass: Merge Changes

4
Developer Anecdote
[Figure: 4:15 PM; the servers build the code and run tests test0, test1, test2, …, testn on the change below]
…
+ ts.addRow(r);
return ts.size();
…
Fail: Debug Changes

5
Developer Anecdote
[Figure: timeline at 4:15 PM, 5:00 PM, 5:30 PM, and 6:15 PM; at each step the servers build the code and run tests test0, test1, test2, …, testn, and the snippet below is shown again]
…
+ ts.addRow(r);
return ts.size();
…

Developer Anecdote
[Figure: timeline at 4:15 PM, 5:00 PM, 5:30 PM, and 6:15 PM with repeated "Build code" and "Run tests" steps on the servers for tests test0, test1, test2, …, testn; interval labels "1 hour" and "15 min"]
…
+ ts.addRow(r);
return ts.size();
…
Developer wastes time debugging & running tests and goes home 1 hour and 15 min later
Flaky Test: a test that can non-deterministically pass and fail when run on the same code version
6

Public Outcry About Flaky Tests
[Figure: developer anecdote timeline at 4:15 PM, 5:00 PM, 5:30 PM, and 6:15 PM; the servers run tests test0, test1, test2, …, testn on the change below; interval labels "1 hour" and "15 min"]
…
- db.addRow("");
+ db.addRow(r);
return db.size();
…
Developer wastes time debugging & running tests and goes home 1 hour and 15 min later
Flaky Test: a test that can non-deterministically pass and fail when run on the same version of the code
7

What are Flaky Tests?
• A test is flaky if it can both pass and fail on the same code version
• Misleads developers into debugging nonexistent faults in recent changes
• Reduces trust in tests
• Order-dependent tests are a prominent category of flaky tests
• An order-dependent test deterministically passes or fails in any given test order, passes in at least one order, and fails in at least one order
8

Background: Victim and Polluter
• Victim 𝑡1 fails when run after polluter 𝑡2
• The polluter has modified some shared state
• The victim's assertion depends on some shared state
• Both involve the same shared state (the variable 𝑥 in the code)
// shared variable x is initialized to 0
void t1() { assert x == 0; } // victim
void t2() { x = 1; } // polluter
TestOrder1: t1, t2 (t1 passes)
TestOrder2: t2, t1 (t1 fails)
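The two test orders can be replayed in a few lines of Python. This is a sketch under assumptions: the shared variable is modelled as a dictionary entry, `t_other` is a hypothetical extra test added only to show a non-polluter, and `find_polluters` is an illustrative helper rather than the tooling used in the talk.

```python
# Shared state from the slide: x is initialized to 0.
shared = {"x": 0}

def reset():
    shared["x"] = 0

def t1():  # victim
    assert shared["x"] == 0

def t2():  # polluter
    shared["x"] = 1

def t_other():  # hypothetical test that leaves x untouched
    assert shared["x"] in (0, 1)

def run(order):
    """Run tests in the given order from a fresh state; return {name: passed}."""
    reset()
    results = {}
    for test in order:
        try:
            test()
            results[test.__name__] = True
        except AssertionError:
            results[test.__name__] = False
    return results

def find_polluters(victim, candidates):
    """Names of candidates that make `victim` fail when run right before it."""
    assert run([victim])[victim.__name__], "victim must pass on its own"
    return [c.__name__ for c in candidates
            if not run([c, victim])[victim.__name__]]

print(run([t1, t2]))  # TestOrder1
print(run([t2, t1]))  # TestOrder2
print(find_polluters(t1, [t2, t_other]))
```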
9

Background: Latent-Victim, Latent-Polluter
• Latent-Victim 𝑡3:
• Its assertion depends on shared state 𝑦; currently no test modifies 𝑦
• victims ⊂ latent-victims
• Latent-Polluter 𝑡4:
• It modifies shared state 𝑧; currently no test asserts on 𝑧
• polluters ⊂ latent-polluters
// shared variables x, y, z are initialized to 0
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter
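To see why latent tests matter, the sketch below shows how adding one new test can turn a latent-victim into a victim and a latent-polluter into a polluter. The tests `t_new_writer` and `t_new_reader` are hypothetical future additions, and the dictionary `shared` stands in for the shared variables on the slide.

```python
# Shared variables from the slide, all initialized to 0.
shared = {"x": 0, "y": 0, "z": 0}

def reset():
    for key in shared:
        shared[key] = 0

def t3():  # latent-victim: asserts on y
    assert shared["y"] == 0

def t4():  # latent-polluter: writes z
    shared["z"] = 1

# Hypothetical tests a developer might add later.
def t_new_writer():
    shared["y"] = 1

def t_new_reader():
    assert shared["z"] == 0

def passes_after(first, second):
    """Run `first` then `second` from a fresh state; report whether `second` passes."""
    reset()
    first()
    try:
        second()
        return True
    except AssertionError:
        return False

print(passes_after(t4, t3))            # today's suite: t3 is unaffected by t4
print(passes_after(t_new_writer, t3))  # with the new writer, t3 becomes a victim
print(passes_after(t4, t_new_reader))  # with the new reader, t4 becomes a polluter
```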
10

Non-Idempotent-Outcome (NIO) Test
• A test is non-idempotent-outcome (NIO) if:
• t5(); t5() → pass; fail
• It passes in the first run but fails in the second when run twice consecutively
• An NIO test self-pollutes the state that its own assertions depend on
• NIO ⊂ latent-polluter ∧ NIO ⊂ latent-victim
// shared variables x, y, z, w are initialized to 0
void t3() { assert y == 0; } // latent-victim
void t4() { z = 1; } // latent-polluter
void t5() { assert w == 0; w = 1; } // NIO
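The run-twice check translates directly into code. The following is a minimal sketch, assuming the shared variable is a dictionary entry; `t_cleans_up` is a hypothetical counterpart that restores the state it touches, and `is_nio` is an illustrative helper, not the detection tool from the talk.

```python
# Shared variable from the slide: w is initialized to 0.
shared = {"w": 0}

def reset():
    shared["w"] = 0

def t5():  # NIO test from the slide
    assert shared["w"] == 0
    shared["w"] = 1

def t_cleans_up():  # hypothetical test that restores the state it modifies
    assert shared["w"] == 0
    try:
        shared["w"] = 1
    finally:
        shared["w"] = 0

def passes(test):
    """Return True if the test raises no AssertionError."""
    try:
        test()
        return True
    except AssertionError:
        return False

def is_nio(test):
    """NIO: passes on the first run, fails on the immediately following run."""
    reset()
    first = passes(test)
    second = passes(test)  # no reset in between, on purpose
    return first and not second

print(is_nio(t5))
print(is_nio(t_cleans_up))
```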
11

Why should we detect NIOs?
• Typically, tests are not run twice
• To preempt/prevent flaky tests
• Why not fix latent-polluter?
• Why not fix latent-victim?
• Prior work
• Gyori et al. [1] detect 575 latent-polluters
• Manually filter 381 (66%) false positives (cannot reasonably become polluters)
• Huo and Clause [2] detect latent-victims with dynamic taint analysis
• Do not report how many can reasonably become victims
• Neither work fixes any tests
• NIOs are more worth fixing
• Both latent-victims and latent-polluters at the same time
• Easy to detect, no false positives
• Well-accepted fixes
[1] Gyori et al., “Reliable testing: Detecting state-polluting tests to prevent test dependency”. ISSTA 2015
[2] Huo and Clause, “Improving oracle quality by detecting brittle assertions and unused inputs in tests”. FSE 2014
12

Contributions
• Definition of NIO tests
• Deterministically change from pass to fail when run twice
• Effective detection & empirical evaluation
• Propose 3 modes for detection
• 127 Java test suites → 223 NIO tests
• 1006 Python projects → 138 NIO tests
• Well-accepted fixes
• Inspect every NIO test (no false positives)
• Open pull requests for 268 tests
• 192 accepted, 70 pending, only 6 rejected
13

Real Example of NIO
Buggy Cleaning Code
  def cmd_mock():
      def _cmd_mock(name: str):
          cmd.__overrides__[name] = ['/bin/true']
      yield _cmd_mock
-     cmd.__overrides__ = []
+     cmd.__overrides__ = {}
  def test_slurm_command(tmp_path, cmd_mock):
      cmd_mock('srun')
TypeError: list indices must be integers or slices, not str
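The failure can be reproduced without pytest by driving the generator by hand: the code before `yield` acts as setup and the code after it as teardown. This is a sketch under assumptions: `cmd` is a stand-in namespace rather than the project's real module, and `make_fixture`, `run_test_with`, and `run_twice` are hypothetical helpers that only emulate what a test runner does.

```python
from types import SimpleNamespace

# Stand-in for the project's `cmd` module; only `__overrides__` is modelled.
cmd = SimpleNamespace(__overrides__={})

def make_fixture(teardown_factory):
    """Build a cmd_mock-style generator whose cleanup uses `teardown_factory`."""
    def cmd_mock():
        def _cmd_mock(name):
            cmd.__overrides__[name] = ["/bin/true"]
        yield _cmd_mock
        cmd.__overrides__ = teardown_factory()  # cleaning code after the test
    return cmd_mock

def run_test_with(fixture):
    """Emulate one test run: set up, call the test body, tear down."""
    gen = fixture()
    mock = next(gen)      # setup: everything before `yield`
    try:
        mock("srun")      # test body
        return "pass"
    except TypeError as exc:
        return f"fail: {exc}"
    finally:
        next(gen, None)   # teardown: everything after `yield`

def run_twice(fixture):
    cmd.__overrides__ = {}  # initial state before the first run
    return [run_test_with(fixture), run_test_with(fixture)]

buggy = make_fixture(list)  # cleanup resets to [] (the bug)
fixed = make_fixture(dict)  # cleanup resets to {} (the fix)
print(run_twice(buggy))
print(run_twice(fixed))
```

With the list-based cleanup the second run indexes a list with a string, which is the `TypeError` quoted on the slide.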
14

Real Example of NIO
  def to_zero(tvd, northing, easting,
              surface_northing, surface_easting):
      # perform some checking
-     northing -= surface_northing
-     easting -= surface_easting
+     northing = northing - surface_northing
+     easting = easting - surface_easting
      return tvd, northing, easting
  # initialization for global variables: g1,…,g5
  g1 = ...
  def test_zero():
      # global variables passed in as arguments
      v1, v2, v3 = to_zero(g1, g2, g3, g4, g5)
      np.testing.assert_equal(...)  # assertion
Fix: Avoid Function Side Effect
AssertionError:
Mismatched elements: 121 / 121 (100%)
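The difference between `a -= b` and `a = a - b` only shows up for mutable arguments. The sketch below illustrates it with a tiny hypothetical `Vec` class standing in for a NumPy array (whose `-=` also updates the array in place); the two-argument `to_zero_*` functions and the sample values are simplifications invented for illustration, not the project's code.

```python
class Vec:
    """Tiny stand-in for a mutable array (hypothetical, not NumPy)."""
    def __init__(self, values):
        self.values = list(values)

    def __sub__(self, other):   # a - b: returns a new object
        return Vec(a - b for a, b in zip(self.values, other.values))

    def __isub__(self, other):  # a -= b: updates a in place
        self.values = [a - b for a, b in zip(self.values, other.values)]
        return self

def to_zero_buggy(northing, surface_northing):
    northing -= surface_northing             # mutates the caller's object
    return northing

def to_zero_fixed(northing, surface_northing):
    northing = northing - surface_northing   # rebinds only the local name
    return northing

def run_twice(to_zero):
    """Call to_zero twice on the same 'global' test data, as a rerun would."""
    g_northing = Vec([5, 7])
    g_surface = Vec([5, 5])
    return [to_zero(g_northing, g_surface).values for _ in range(2)]

print(run_twice(to_zero_buggy))  # second result differs from the first
print(run_twice(to_zero_fixed))  # both results are identical
```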
15

Prevalence of NIO Tests
                         Java   Python
# Test Suites (total)     127     1006
# Test Suites w/ NIO       34      138
% Test Suites w/ NIO      26%       9%
# NIO Tests               223      138
Conclusion:
• NIO tests are prevalent enough that every project should run NIO detection at least once
16

Different Detection Modes
• Three Different Modes
• Isolated-method
• Run1: t1, t1
• Run2: t2, t2
• Run3: t3, t3
• Isolated-class
• Run1: t1, t1, t2, t2
• Run2: t3, t3
• Entire-suite
• Run1: t1, t1, t2, t2, t3, t3
• Conclusion
• All three modes detect similar tests
• Isolated-method (223) > Isolated-class (212) > Entire-suite (210)
• Entire-suite has the lowest overhead
• Why differ? See paper for details
Test Suite: TestClass A (t1, t2), TestClass B (t3)
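The three modes differ only in how the doubled tests are grouped into runs. The sketch below generates the run schedules for the example suite; the `suite` dictionary and function names are hypothetical, and it assumes each run starts from a fresh state (for example, a new process), which the slide does not spell out.

```python
# Example suite from the slide: class A has t1 and t2, class B has t3.
suite = {"A": ["t1", "t2"], "B": ["t3"]}

def doubled(tests):
    """Repeat each test twice, back to back: [t1, t1, t2, t2, ...]."""
    return [t for test in tests for t in (test, test)]

def isolated_method(suite):
    """One run per test method."""
    return [doubled([t]) for tests in suite.values() for t in tests]

def isolated_class(suite):
    """One run per test class."""
    return [doubled(tests) for tests in suite.values()]

def entire_suite(suite):
    """A single run for the whole suite."""
    return [doubled([t for tests in suite.values() for t in tests])]

for mode in (isolated_method, isolated_class, entire_suite):
    print(mode.__name__, mode(suite))
```

Fewer runs means less startup overhead, which is consistent with the entire-suite mode being the cheapest of the three.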
17

Experience with Fixing NIO Tests
• We detect 361 (223 Java + 138 Python) NIO tests
• We fix 268 NIO tests by opening pull requests
• 192 tests accepted
• 70 tests pending
• 6 tests rejected
• We do not fix 51 NIO tests
• Cannot localize pollution
• Difficult to clean the pollution
• 42 tests are N/A
• Not NIO in the latest version (fixed/deleted/etc.)
• Conclusion
• Developers are generally positive about fixes for NIO tests
• Providing reproducing steps and explaining the motivation help
[Chart: Accepted 192, Pending 70, Rejected 6, Do not Fix 51, N/A 42]
18

NIO vs. Polluter vs. Victim
• NIO tests are related to but not subsumed by polluters and victims
• Detecting NIO tests can be an effective way to preempt polluters and victims
19

Conclusions
• We focus on Non-Idempotent-Outcome (NIO) tests
• Deterministically change from pass to fail when run twice
• Detect and fix NIO tests
• Preempt order-dependent flaky tests
• Importance: in the intersection of latent-polluters and latent-victims
• Detect 361 NIO tests (223 Java + 138 Python)
• Opened pull requests for 268 tests, with 192 accepted
• Dataset publicly available:
• https://sites.google.com/view/nio-tests
• IDoFT dataset (all flaky tests): https://github.com/TestingResearchIllinois/idoft
Questions? Email: Anjiang Wei <anjiang@stanford.edu>
20
