# The JSON Test Results Format

*** note
Warning: The JSON test result format no longer affects the pass-fail decisions
made by Chrome's bots. All results are now fetched from ResultDB. For more
info, see [resultdb.md](resultdb.md).
***

The JSON Test Results Format is a generic file format we use to record the
results of each individual test in a test run (whether the test is run on a
bot, or run locally).

[TOC]

## Introduction

We use these files on the bots to determine whether a test step had any
failing tests (using a separate file means that we don't need to parse the
output of the test run, and hence that output can be tailored for human
readability). We also upload the test results to dashboards like the
[Flakiness Dashboard](http://test-results.appspot.com).

The test format originated with the Blink web tests, but has since been
adopted by GTest-based tests and Python unittest-based tests, so we've
standardized on it for anything related to tracking test flakiness.

### Example

Here's a very simple example for one Python test:

    % python mojo/tools/run_mojo_python_tests.py --write-full-results-to results.json mojom_tests.parse.ast_unittest.ASTTest.testNodeBase
    Running Python unit tests under mojo/public/tools/bindings/pylib ...
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.000s

    OK
    % cat results.json
    {
      "tests": {
        "mojom_tests": {
          "parse": {
            "ast_unittest": {
              "ASTTest": {
                "testNodeBase": {
                  "expected": "PASS",
                  "actual": "PASS",
                  "artifacts": {
                    "screenshot": ["screenshots/page.png"]
                  }
                }
              }
            }
          }
        }
      },
      "interrupted": false,
      "path_delimiter": ".",
      "version": 3,
      "seconds_since_epoch": 1406662283.764424,
      "num_failures_by_type": {
        "FAIL": 0,
        "PASS": 1
      },
      "artifact_types": {
        "screenshot": "image/png"
      }
    }

As you can see, the format consists of a single top-level dictionary containing
a set of metadata fields describing the test run, plus a single `tests` key
that contains the results of every test run, structured in a hierarchical trie
format to reduce duplication of test suite names (as you can see from the
deeply nested Python test name above).

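Consumers generally have to walk this trie to recover full test names, since
each name component is a separate level of nesting. Here's a minimal sketch in
Python of one way to do that (illustrative only, not part of the format; it
assumes that any node carrying an `actual` key is a leaf):

    import json

    def iter_test_results(trie, delimiter=".", prefix=""):
        # Walk the trie depth-first, joining the name components with the
        # delimiter, and yield (full_test_name, leaf_fields) pairs.
        for name, node in trie.items():
            full_name = prefix + delimiter + name if prefix else name
            if "actual" in node:
                yield full_name, node
            else:
                yield from iter_test_results(node, delimiter, full_name)

    with open("results.json") as f:
        results = json.load(f)

    # Default to "/" when path_delimiter is absent (see the table below).
    delimiter = results.get("path_delimiter", "/")
    for test_name, fields in iter_test_results(results["tests"], delimiter):
        print(test_name, fields["actual"])

Run against the example above, this prints
`mojom_tests.parse.ast_unittest.ASTTest.testNodeBase PASS`.
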
The file is strictly JSON-compliant. As a part of this, the fields in each
object may appear in any order.

## Top-level field names

| Field Name | Data Type | Description |
|------------|-----------|-------------|
| `interrupted` | boolean | **Required.** Whether the test run was interrupted and terminated early (via the runner bailing out, the user hitting ctrl-C, etc.). If true, this indicates that not all of the tests in the suite were run and the results are at best incomplete and possibly totally invalid. |
| `num_failures_by_type` | dict | **Required.** A summary of the totals of each result type. If a test was run more than once, only the first invocation's result is included in the totals. Each key is one of the result types listed below. A missing result type is the same as being present and set to zero (0). |
| `path_delimiter` | string | **Optional, but will become mandatory.** The separator string used between the components of a test's name; normally "." for GTest- and Python-based tests and "/" for web tests. If not present, you should default to "/" for backwards-compatibility. |
| `seconds_since_epoch` | float | **Required.** The start time of the test run expressed as a floating-point offset in seconds from the UNIX epoch. |
| `tests` | dict | **Required.** The actual trie of test results. Each directory or module component in the test name is a node in the trie, and the leaf contains the dict of per-test fields as described below. |
| `version` | integer | **Required.** Version of the file format. The current version is 3. |
| `artifact_types` | dict | **Optional. Required if any artifacts are present for any tests.** MIME type information for the artifacts in this JSON file. All artifacts with the same name must share the same MIME type. |
| `artifact_permanent_location` | string | **Optional.** The URI of the root location where the artifacts are stored. If present, any artifact locations are taken to be relative to this location. Currently only the `gs://` scheme is supported. |
| `build_number` | string | **Optional.** If this test run was produced on a bot, this should be the build number of the run, e.g., "1234". |
| `builder_name` | string | **Optional.** If this test run was produced on a bot, this should be the builder name of the bot, e.g., "Linux Tests". |
| `metadata` | dict | **Optional.** A dictionary of arbitrary key/value pairs describing the test run, including the tags, test name prefix, and test expectations file paths used during the run. |
| `chromium_revision` | string | **Optional.** The revision of the current Chromium checkout, if relevant, e.g. "356123". |
| `has_pretty_patch` | bool | **Optional, layout test specific, deprecated.** Whether the web tests' output contains PrettyDiff-formatted diffs for test failures. |
| `has_wdiff` | bool | **Optional, layout test specific, deprecated.** Whether the web tests' output contains wdiff-formatted diffs for test failures. |
| `layout_tests_dir` | string | **Optional, layout test specific.** Path to the web_tests directory for the test run (used so that we can link to the tests used in the run). |
| `pixel_tests_enabled` | bool | **Optional, layout test specific.** Whether the web tests were run with the --pixel-tests flag. |
| `flag_name` | string | **Optional, layout test specific.** The flags used when running the tests. |
| `fixable` | integer | **Optional, deprecated.** The number of tests that were run but were expected to fail. |
| `num_flaky` | integer | **Optional, deprecated.** The number of tests that were run more than once and produced different results each time. |
| `num_passes` | integer | **Optional, deprecated.** The number of successful tests; equivalent to `num_failures_by_type["PASS"]`. |
| `num_regressions` | integer | **Optional, deprecated.** The number of tests that produced results that were unexpected failures. |
| `skips` | integer | **Optional, deprecated.** The number of tests that were found but not run (tests should be listed in the trie with "expected" and "actual" values of `SKIP`). |
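
To illustrate how `artifact_types` and `artifact_permanent_location` combine
with the per-test `artifacts` field (described in the next section), here is a
hedged sketch that resolves every artifact to a concrete location; the helper
names are ours, not part of the format:

    import os
    import posixpath

    def iter_leaves(trie, delimiter, prefix=""):
        # The same trie walk as in the earlier sketch.
        for name, node in trie.items():
            full = prefix + delimiter + name if prefix else name
            if "actual" in node:
                yield full, node
            else:
                yield from iter_leaves(node, delimiter, full)

    def resolve_artifacts(results, json_dir):
        # Yield (test_name, artifact_name, mime_type, location) for every
        # artifact recorded in a parsed results dict.
        mime_types = results.get("artifact_types", {})
        root = results.get("artifact_permanent_location")
        delimiter = results.get("path_delimiter", "/")
        for test_name, fields in iter_leaves(results["tests"], delimiter):
            for name, paths in fields.get("artifacts", {}).items():
                for path in paths:  # one entry per test invocation
                    if root:
                        # Locations are relative to the permanent root
                        # (e.g. a gs:// URI), so keep forward slashes.
                        location = posixpath.join(root, path)
                    else:
                        # Otherwise they are relative to the directory
                        # holding the JSON file, with '/' mapped to the
                        # platform's path separator.
                        location = os.path.join(json_dir, *path.split("/"))
                    yield test_name, name, mime_types.get(name), location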

## Per-test fields

Each leaf of the `tests` trie contains a dict containing the results of a
particular test name. If a test is run multiple times, the dict contains the
results for each invocation in the `actual` field. Unless otherwise noted,
if the test is run multiple times, all of the other fields represent the
overall / final / last value. For example, if a test unexpectedly fails and
then is retried and passes, both `is_regression` and `is_unexpected` will be
false.

| Field Name | Data Type | Description |
|-------------|-----------|-------------|
| `actual` | string | **Required.** An ordered space-separated list of the results the test actually produced. `FAIL PASS` means that a test was run twice, failed the first time, and then passed when it was retried. If a test produces multiple different results, then it was actually flaky during the run. |
| `expected` | string | **Required.** An unordered space-separated list of the result types expected for the test, e.g. `FAIL PASS` means that a test is expected to either pass or fail. A test that contains multiple values is expected to be flaky. |
| `artifacts` | dict | **Optional.** A dictionary describing the test artifacts generated by the execution of the test. The dictionary maps the name of the artifact (`screenshot`, `crash_log`) to a list of relative locations of the artifact (`screenshot/page.png`, `logs/crash.txt`). Any '/' characters in the file paths are meant to be platform agnostic; tools will replace them with the appropriate per-platform path separators. There is one entry in the list per test execution. If `artifact_permanent_location` is specified, then each location is relative to that path. Otherwise, the path is assumed to be relative to the location of the JSON file which contains this (i.e., `$ISOLATED_OUTDIR`). The actual locations of artifacts are implementation-defined by the test program and can follow any convention, since these entries allow them to be looked up easily. |
| `bugs` | string | **Optional.** A comma-separated list of URLs to bug database entries associated with each test. |
| `shard` | int | **Optional.** The 0-based index of the shard that the test ran on, if the test suite was sharded across multiple bots. |
| `is_flaky` | bool | **Optional.** If present and true, the test was run multiple times and produced more than one kind of result. If false (or if the key is not present at all), the test either only ran once or produced the same result every time. |
| `is_regression` | bool | **Optional.** If present and true, the test failed unexpectedly. If false (or if the key is not present at all), the test either ran as expected or passed unexpectedly. |
| `is_unexpected` | bool | **Optional.** If present and true, the test result was unexpected. This might include an unexpected pass, i.e., it is not necessarily a regression. If false (or if the key is not present at all), the test produced the expected result. |
| `time` | float | **Optional.** If present, the time it took in seconds to execute the first invocation of the test. |
| `times` | array of floats | **Optional.** If present, the times in seconds of each invocation of the test. |
| `has_repaint_overlay` | bool | **Optional, web test specific.** If present and true, indicates that the test output contains the data needed to draw repaint overlays to help explain the results (only used in layout tests). |
| `is_missing_audio` | bool | **Optional, web test specific.** If present and true, the test was supposed to have an audio baseline to compare against, and we didn't find one. |
| `is_missing_text` | bool | **Optional, web test specific.** If present and true, the test was supposed to have a text baseline to compare against, and we didn't find one. |
| `is_missing_video` | bool | **Optional, web test specific.** If present and true, the test was supposed to have an image baseline to compare against, and we didn't find one. |
| `is_testharness_test` | bool | **Optional, web test specific.** If present, indicates that the layout test was written using the W3C's test harness and we don't necessarily have any baselines to compare against. |
| `is_slow_test` | bool | **Optional, web test specific.** If present and true, the test is expected to take longer than normal to run. |
| `reftest_type` | string | **Optional, web test specific.** If present, one of `==` or `!=` to indicate that the test is a "reference test" and the results were expected to match the reference or not match the reference, respectively (only used in layout tests). |
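
Since `actual` and `expected` are space-separated strings rather than JSON
arrays, consumers split them before comparing. A small illustrative sketch of
the derived fields above (the helper names are ours):

    def invocations(fields):
        # "FAIL PASS" -> ["FAIL", "PASS"]: one entry per run, in order.
        return fields["actual"].split()

    def expected_set(fields):
        # The expected list is unordered, so a set is the natural shape.
        return set(fields["expected"].split())

    def is_flaky(fields):
        # More than one distinct result across retries means the test
        # was flaky during this run.
        return len(set(invocations(fields))) > 1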

## Test result types

Any test may fail in one of several different ways. There are a few generic
types of failures, and the web tests contain a few additional specialized
failure types.

| Result type | Description |
|--------------|-------------|
| `CRASH` | The test runner crashed during the test. |
| `FAIL` | The test did not run as expected. |
| `PASS` | The test ran as expected. |
| `SKIP` | The test was not run. |
| `TIMEOUT` | The test hung (did not complete) and was aborted. |
| `AUDIO` | **Web test specific, deprecated.** The test is expected to produce audio output that doesn't match the expected result. Normally you will see `FAIL` instead. |
| `IMAGE` | **Web test specific, deprecated.** The test produces image (and possibly text) output. The image output doesn't match what we'd expect, but the text output, if present, does. Normally you will see `FAIL` instead. |
| `IMAGE+TEXT` | **Web test specific, deprecated.** The test produces image and text output, both of which fail to match what we expect. Normally you will see `FAIL` instead. |
| `LEAK` | **Web test specific, deprecated.** Memory leaks were detected during the test execution. |
| `MISSING` | **Web test specific, deprecated.** The test completed but we could not find an expected baseline to compare against. |
| `NEEDSREBASELINE` | **Web test specific, deprecated.** The expected test result is out of date and will be ignored (as with `REBASELINE`, below); the auto-rebaseline-bot will look for tests of this type and automatically update them. This should never show up as an `actual` result. |
| `REBASELINE` | **Web test specific, deprecated.** The expected test result is out of date and will be ignored (any result other than a crash or timeout will be considered as passing). This test result should only ever show up on local test runs, not on bots (it is forbidden to check in a TestExpectations file with this expectation). This should never show up as an "actual" result. |
| `SLOW` | **Web test specific, deprecated.** The test is expected to take longer than normal to run. This should never appear as an `actual` result, but may (incorrectly) appear in the expected fields. |
| `TEXT` | **Web test specific, deprecated.** The test is expected to produce a text-only failure (the image, if present, will match). Normally you will see `FAIL` instead. |

Unexpected results, failures, and regressions are different things.

An unexpected result is simply a result that didn't appear in the `expected`
field. It may be used for tests that _pass_ unexpectedly, i.e. tests that
were expected to fail but passed. Such results should _not_ be considered
failures.

Anything other than `PASS`, `SKIP`, `SLOW`, or one of the REBASELINE types is
considered a failure.

A regression is a result that is both unexpected and a failure.

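Putting those three definitions together, a consumer could classify a test's
final result as follows (a simplified sketch: it compares result strings
literally and ignores the web-test-specific equivalences between `FAIL` and
the deprecated `TEXT`/`IMAGE`/`AUDIO` types):

    NON_FAILURES = {"PASS", "SKIP", "SLOW", "REBASELINE", "NEEDSREBASELINE"}

    def classify(fields):
        # Retried tests are judged on their final result.
        final = fields["actual"].split()[-1]
        expected = set(fields["expected"].split())
        is_unexpected = final not in expected
        is_failure = final not in NON_FAILURES
        is_regression = is_unexpected and is_failure
        return is_unexpected, is_failure, is_regression
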
## `full_results.json` and `failing_results.json`

The web tests produce two different variants of the above file. The
`full_results.json` file matches the above definition and contains every test
executed in the run. The `failing_results.json` file contains just the tests
that produced unexpected results, so it is a subset of the `full_results.json`
data. The `failing_results.json` file is also in the JSONP format, so it can
be read via a `<script>` tag from an HTML file run from the local filesystem
without falling prey to the same-origin restrictions for local files. The
`failing_results.json` file is converted into JSONP by wrapping the JSON data
in the prefix "ADD_RESULTS(" and the suffix ");", so you can extract the JSON
data by stripping off that prefix and suffix.
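
For example, the payload can be recovered like this (a sketch; real consumers
may want stricter error handling):

    import json

    def load_failing_results(path):
        # Strip the ADD_RESULTS(...); wrapper to get back to plain JSON.
        with open(path) as f:
            text = f.read().strip()
        prefix, suffix = "ADD_RESULTS(", ");"
        if text.startswith(prefix) and text.endswith(suffix):
            text = text[len(prefix):-len(suffix)]
        return json.loads(text)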