I want to express my thanks to everyone who contributed to this thread. We have a lot of passionate and smart people who care about this topic; thanks again for weighing in so far.
Below is a slightly updated version of the original policy, followed by an attempt to summarize the thread and turn what makes sense into actionable items.
= Policy for handling intermittent oranges =
This policy defines an escalation path for when a single test case is identified as leaking or failing intermittently and is causing enough disruption on the trees. Disruption is defined as meeting any of the following criteria (a rough sketch of this check follows the list):
1) Test case is on the list of top 20 intermittent failures on Orange Factor (http://brasstacks.mozilla.com/orangefactor/index.html)
2) It is causing oranges >=8% of the time it runs
3) The bug has >100 reported instances of this failure in the last 30 days
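For concreteness, here is a minimal sketch of how these criteria could be checked mechanically. The inputs (the top 20 list and the per-test counts) are assumed to come from Orange Factor, and all of the names below are hypothetical rather than an existing API:

  def is_disruptive(test_id, top20_tests, failures_30d, runs_30d):
      """Return True if a test meets any of the three disruption criteria."""
      on_top20 = test_id in top20_tests              # criterion 1
      failure_rate = failures_30d / max(runs_30d, 1)
      frequent = failure_rate >= 0.08                # criterion 2: >=8% oranges
      many_instances = failures_30d > 100            # criterion 3: >100 in 30 days
      return on_top20 or frequent or many_instances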
Escalation is a responsibility of all developers, although the majority of it will fall on the sheriffs.
1) Ensure we have a bug on file that CCs the test author, reviewer, module owner, and any other interested parties, and includes links to logs, etc.
2) Set a needinfo? on the test author and expect a response within 2 business days; make this expectation clear in a comment.
3) If we don't get a response, request a needinfo? from the module owner, with the same expectation of a response within 2 days and of getting someone to take action.
4) If another 2 days pass with no response from the module owner, we will disable the test.
Ideally we will work with the test author to either fix or disable the test, depending on available time and the difficulty of the fix. If a bug has activity and work is being done to address the issue, it is reasonable to expect the test will not be disabled; inactivity in the bug is the main trigger for escalation.
This is intended to respect the time of the original test authors by not throwing emergencies in their laps, while still striking a balance with keeping the trees manageable.
1) If the test landed (or was modified) in the last 48 hours, we will most likely back out the patch along with the test.
2) If a test is failing at least 30% of the time, we will file a bug and disable the test first.
3) When we are bringing a new platform online (Android 2.3, b2g, etc.), many tests will need to be disabled before the suites can run on tbpl.
4) In the rare case we are disabling the majority of the tests (either at once or slowly over time) for a given feature, we need to get the module owner to sign off on the current state of the tests.
= Documentation =
We have thousands of tests disabled, many for specific build configurations or platforms. This is dangerous, as we are slowly reducing our coverage. By running a daily report (bug 996183) comparing the total tests available against what actually runs in each configuration (b2g, debug, osx, e10s, etc.), we can bring visibility to the state of each platform and to whether we are disabling more tests than we fix.
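As a rough illustration of what such a report could look like, here is a sketch that walks a tree of .ini manifests and tallies tests carrying a skip-if annotation, grouped by condition. The manifest format here is simplified (one section per test) and the code is my own sketch, not the actual implementation tracked in bug 996183:

  import configparser
  import os
  from collections import Counter

  def count_disabled(manifest_root):
      """Tally total test entries and skip-if annotations across a tree of
      simplified .ini manifests (one section per test)."""
      total, disabled = 0, Counter()
      for dirpath, _, files in os.walk(manifest_root):
          for name in files:
              if not name.endswith(".ini"):
                  continue
              parser = configparser.ConfigParser(allow_no_value=True)
              parser.read(os.path.join(dirpath, name))
              for section in parser.sections():
                  total += 1
                  condition = parser[section].get("skip-if")
                  if condition:
                      disabled[condition.strip()] += 1
      return total, disabled

  total, disabled = count_disabled("testing/manifests")
  print("total tests: %d" % total)
  for condition, n in disabled.most_common():
      print("%5d skipped when %s" % (n, condition))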
We need a clear guide on how to run the tests, how to write a test, how to debug a test, and how to use metadata to indicate whether we have looked at a given test and when.
When an intermittent bug is filed, we need to clearly outline what information will help the most in reproducing and fixing it. Without a documented process for fixing oranges, this work falls on the shoulders of the original test authors and a few determined hackers.
= General Policy =
I have adjusted the above policy to mention backing out new tests which are not stable, working to identify a regression in the code or the tests, and adding protection so we do not completely disable coverage for a specific feature. In addition, I added a clearer definition of what constitutes a disruptive test and clarified the expectations around communicating in the bug vs. escalating.
What is more important is the culture we have around committing patches to Mozilla repositories. We need to decide as an organization whether we care about zero oranges (or insert acceptable percentage). We also need to decide what coverage levels are acceptable and what our general policy is for test reviews (at checkin time and in the future). These questions need to be answered outside of this policy, but the sooner we answer them, the better we can all move forward towards the same goal.
= Tools =
Much of the discussion was around tools. As a member of the Automation and Tools team I would be expected to advocate for more tools, but in this case I am leaning towards fewer tools and better process.
One common problem is the noise caused by infrastructure changes and by changing environments and test harnesses. Is this documented, and how can we filter it out? Having our tools detect these events and annotate changes unrelated to tests or builds will go a long way. Related is updating our harnesses and the way we run tests so they are more repeatable. I have filed bug 996504 to track work on this.
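As a toy example of the kind of filtering I mean, suppose infrastructure and environment changes were recorded as time windows; failures that land inside a window could then be annotated as suspect rather than counted against the test. The data structures below are hypothetical:

  from datetime import datetime

  # Hypothetical log of infrastructure/environment changes, recorded as
  # (start, end) windows during which failures should be treated as suspect.
  infra_windows = [
      (datetime(2014, 4, 14, 9, 0), datetime(2014, 4, 14, 11, 30)),
  ]

  def filter_infra_noise(failure_timestamps, windows=infra_windows):
      """Split failure timestamps into likely-real oranges and failures that
      overlap a known infrastructure change."""
      real, suspect = [], []
      for ts in failure_timestamps:
          if any(start <= ts <= end for start, end in windows):
              suspect.append(ts)
          else:
              real.append(ts)
      return real, suspect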
Another problem we can look at with tooling is annotating the expected outcomes of tests with metadata (suggestions included in-tree manifests as well as an external server). Once we get there, we have options such as:
* rerunning tests (until they pass, or to document failure patterns; see the sketch after this list)
* putting all oranges in their own suite
* ignoring results of known oranges
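To make the first option concrete, a rerun wrapper might look like the sketch below. Here run_test stands in for whatever harness entry point actually executes a single test (it is not a real API), and the retry policy is only illustrative:

  def rerun_until_pass(run_test, test_id, max_runs=5):
      """Rerun a known-intermittent test up to max_runs times, recording every
      outcome so failure patterns are documented rather than discarded.
      run_test(test_id) is assumed to return True on pass, False on failure."""
      outcomes = []
      for _ in range(max_runs):
          passed = run_test(test_id)
          outcomes.append(passed)
          if passed:
              break
      # e.g. [False, False, True] records two oranges before a green run
      return outcomes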
Of course, no discussion would be complete without talking about what we could do if this problem were solved. Honorable mentions are:
* Orange Factor / Test Statistics
* Auto Bisection (sketched below)
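Auto bisection, for instance, is at heart a binary search over an ordered range of pushes. The sketch below assumes a reliable test_fails(changeset) oracle; for an intermittent failure that would in practice mean rerunning the test enough times per changeset to be reasonably confident:

  def bisect(changesets, test_fails):
      """Binary-search an ordered list of changesets for the first one where
      test_fails(changeset) is True, assuming the failure persists once it is
      introduced and that the newest changeset fails."""
      lo, hi = 0, len(changesets) - 1
      while lo < hi:
          mid = (lo + hi) // 2
          if test_fails(changesets[mid]):
              hi = mid          # regression is at mid or earlier
          else:
              lo = mid + 1      # regression landed after mid
      return changesets[lo]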