
Update Sheriff docs under chromium/src/docs

* Updated Trunk Sheriffing, Branch Sheriffing, Perf Regression Sheriffing and Perf Bot Sheriffing
* Add links to internal documentation under go/chrome-sheriffing (consolidated internal docs
location)

Change-Id: I074a8e84d8b9d1535f5999eb5f2ff0e055713f71
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/3068176
Auto-Submit: Eric Foo <efoo@chromium.org>
Commit-Queue: Dirk Pranke <dpranke@google.com>
Reviewed-by: Dirk Pranke <dpranke@google.com>
Cr-Commit-Position: refs/heads/master@{#908006}
Eric Foo
2021-08-03 16:25:52 +00:00
committed by Chromium LUCI CQ
parent da311bbf29
commit da089b50f9
12 changed files with 44 additions and 342 deletions

@ -1,145 +1,18 @@
# Chromium Branch Sheriffing
This document describes how to be a Chromium *branch* sheriff and how sheriffing
on a branch differs from sheriffing on trunk. For trunk sheriffing guidance, see
[//docs/sheriff.md][sheriff-md].
[TOC]
## Philosophy
The Chrome release branch sheriff provides coverage for release branches
(stable and beta) under Pacific timezone shifts.
The goals of a branch sheriff are quite similar to those of a trunk sheriff.
Branch sheriffs need to ensure that:
1. **Compile failures get fixed**, because compile failures on branches block
all tests (both automated and manual) and consequently reduce our confidence
in the quality of what we're shipping, possibly to the point of blocking
releases.
2. **Consistent test failures get repaired**, because they similarly reduce our
confidence in the quality of our code.
**Communication** is important for sheriffs in general, but it's particularly
important for branch sheriffs. Over the course of your shift, you may need to
coordinate with trunk sheriffs, troopers, release TPMs, and others -- don't
hesitate to do so, particularly if you have questions.
Points of contact (i.e. platform-specific sheriffs) can be found
[here](http://goto.google.com/chrome-branch-sheriffing#points-of-contact).
## Processes
In general, you'll want to follow the same processes outlined in
[//docs/sheriff.md][sheriff-md]. There are some differences, though.
### Checkout
You'll need to ensure that your checkout is configured to check out the branch
heads. You can do so by running
```
src $ gclient sync --with_branch_heads
```
> This only needs to be done once, though running it more than once won't hurt.
You may also need to run:
```
src $ git fetch
```
Once you've done that, you'll be able to check out branches:
```
src $ git checkout branch-heads/$BRANCH_NUMBER # e.g. branch-heads/4044 for M81
src $ gclient sync
```
To determine the appropriate branch number, you can either use
[chromiumdash](#chromiumdash) or check [milestone.json][milestone-json]
directly.
### Findit
As FindIt is not available on branches, one way to find culprits is to use
`git bisect` locally, upload the changes to a Gerrit CL, and run the needed
trybots to check. This is especially useful when the errors are not reproducible
on your local builds or you don't have the required hardware to build the failed
tests.
### Flaky tests
Flaky tests that are disabled on trunk should also be disabled on any branches
with frequent failures of that test. If a trunk CL lands with no change other
than to disable one or more tests ([example](https://crrev.com/c/2507299)) and
it has an associated bug and the release manager is cc'd on the bug, you can and
should cherrypick it to the affected branch without requesting merge approval.
On the other hand, if you believe that a flake was introduced by a cherry-pick
to the branch in question and is not flaky on trunk, you will need to create a
new CL to disable it only on the branch and go through the usual merge request
process.
Note: there is little value in merging changes to the stable release
branch when the next milestone's stable release is less than a week away
(since there are usually no planned stable respins at that point).
You can find release dates on [chromiumdash][chromiumdash-schedule].
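The rules in this section can be summarized as a small decision helper. This is only an illustrative sketch: the function name, parameter names, and `Action` values are hypothetical and not part of any Chromium tooling, and the seven-day cutoff simply encodes "less than a week away" from the note above.

```python
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    CHERRY_PICK_WITHOUT_APPROVAL = "cherry-pick the trunk CL without merge approval"
    MERGE_REQUEST = "go through the usual merge request process"
    SKIP = "little value in merging; next stable release is under a week away"


def flaky_test_action(
    trunk_cl_only_disables_tests: bool,
    has_bug: bool,
    release_manager_ccd: bool,
    targets_stable: bool,
    today: date,
    next_stable_release: date,
) -> Action:
    """Suggest how to handle a flaky-test disable on a release branch."""
    # Merges to stable are rarely worthwhile in the final week before the
    # next milestone's stable release.
    if targets_stable and next_stable_release - today < timedelta(days=7):
        return Action.SKIP
    # A trunk CL that only disables tests, has a bug, and has the release
    # manager cc'd on that bug can be cherry-picked without approval.
    if trunk_cl_only_disables_tests and has_bug and release_manager_ccd:
        return Action.CHERRY_PICK_WITHOUT_APPROVAL
    # Everything else, including flakes that only exist on the branch,
    # goes through the normal merge request process.
    return Action.MERGE_REQUEST
```

The release date passed in would come from the chromiumdash schedule; the helper does not look it up itself.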
### Landing changes
When you need to land a change to a branch, you'll need to go through [the same
merge approval process](./process/merge_request.md) as other cherry-picks (see
exception for flaky tests above). You should feel free to ping the relevant
release TPM as listed on [chromiumdash][chromiumdash-schedule].
## Tools
### Sheriff-o-Matic
Use the [branch SoM console][sheriff-o-matic] rather than the main chromium
console.
### Consoles
Use the [beta][main-beta] and [stable][main-stable] branch consoles rather than
the main console. A new console is created for each milestone. They are named
"Chromium M## Console" and can be found under the
[Chromium Project](https://ci.chromium.org/p/chromium).
### Monorail issues (crbug)
Refer to and use the
[Sheriff-Chrome-Release label](https://bugs.chromium.org/p/chromium/issues/list?q=label%3ASheriff-Chrome-Release)
to find and tag issues that are of importance to Branch sheriffs.
### Chromiumdash
[chromiumdash][chromiumdash] can help you determine the branch number for a
particular milestone or channel, along with a host of other useful information:
* [Branches][chromiumdash-branches] lists the branches for each milestone.
* [Releases][chromiumdash-releases] lists the builds currently shipping to
each channel, which can help map from channel to milestone or to branch.
* [Schedule][chromiumdash-schedule] lists the relevant dates for each
milestone and includes the release TPMs responsible for each milestone by
platform.
### Rotation
The current branch sheriff is listed [here][rotation-home]. The configuration
and source of truth for the schedule lives [here][rotation-config]. To swap,
simply send a CL changing the schedule at the bottom of the file.
You can also use [Oncall Swapper](https://oncallswapper.corp.google.com/)
to find the swap and submit the CL for you.
[chromiumdash]: https://chromiumdash.appspot.com
[chromiumdash-branches]: https://chromiumdash.appspot.com/branches
[chromiumdash-releases]: https://chromiumdash.appspot.com/releases
[chromiumdash-schedule]: https://chromiumdash.appspot.com/schedule
[main-beta]: https://ci.chromium.org/p/chromium/g/main-m81/console
[main-stable]: https://ci.chromium.org/p/chromium/g/main-m80/console
[milestone-json]: https://goto.google.com/chrome-milestone-json
[rotation-home]: https://goto.google.com/chrome-branch-sheriff-amer-west
[rotation-config]: https://goto.google.com/chrome-branch-sheriff-amer-west-config
[sheriff-md]: /docs/sheriff.md
[sheriff-o-matic]: https://sheriff-o-matic.appspot.com/chrome_browser_release
For more information on Chromium Branch Sheriffs, including How Tos, Swapping
Shifts and rotation updates, please see [Chromium
Branch Sheriffing](http://goto.google.com/chrome-branch-sheriffing)

@ -1,9 +1,5 @@
# Chromium Sheriffing
Author: ellyjones@
## Sheriffing Philosophy
Sheriffs have one overarching role: to ensure that the Chromium build
infrastructure is doing its job of helping developers deliver good software.
Every other sheriff responsibility flows from that one. In priority order,
@ -29,6 +25,9 @@ necessary authority to fulfill them. In particular, you have the authority to:
TBRs were removed in Q1 2021.
For more information on Chromium Trunk Sheriffs, including How Tos, Swapping
Shifts and rotation updates, please see [Chromium Trunk Sheriffing](http://goto.google.com/chrome-trunk-sheriffing)
## How to be a Sheriff
To be a sheriff, you must be both a Chromium committer and a Google employee.

@ -1,5 +1,7 @@
# How to access and navigate test logs
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
When trying to understand a failure, it can be useful to inspect the test logs where the failure occurred.
[TOC]

@ -1,5 +1,7 @@
# How to address a new alert with the same root cause as an existing alert
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
It's common when large problems arise for multiple alerts to fire due to the same underlying problem. Sheriff-o-matic does its best to automatically group these problems into a single alert, but sometimes it's unable to and we have to group the alerts together manually. This is important because it helps future sheriffs see at a glance the number of distinct problems.
Unfortunately, there's no way to distinguish these duplicate alerts from new alerts without knowing the contents of those other alerts. If you're unsure about two particular alerts, don't hesitate to ask for help [on chat](https://hangouts.google.com/group/2GmiXjz55R2ixTXi1).

@ -1,5 +1,7 @@
# How to disable a failing test/story on the perf waterfall
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
To disable a failing test/story, first figure out whether
the failing test is a gtest or a Telemetry test, then
follow the directions below to disable it.

@ -1,5 +1,7 @@
# How to follow up on an alert
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
[TOC]
Skim the bug to understand where the last sheriff left things and where you should pick up.

@ -1,5 +1,7 @@
# How to handle an alert for a new problem
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
**Warning: this is the hardest part of being a sheriff.**
Each bug may take 10 minutes to an hour to address, but there are usually a manageable number of new bugs per shift (5 on a good shift, 15 on a bad one).

@ -1,5 +1,7 @@
# How to launch a functional bisect and interpret its results
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
A functional bisect determines the revision at which a particular benchmark or story started failing more often. It does this by doing a binary search between a known good and known bad revision, running the test multiple times at each potential revision until it narrows down the culprit to a single revision.
[TOC]
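The binary search described above can be sketched in a few lines. This is a simplified illustration, not the bisect service's actual implementation: revisions are modeled as plain integers, `run_test` is a hypothetical callback that returns `True` when a single run passes, and the `repeats` and `threshold` parameters are assumptions chosen for the example.

```python
from typing import Callable


def failure_rate(run_test: Callable[[int], bool], revision: int, repeats: int) -> float:
    """Run the test `repeats` times at `revision` and return the fraction that failed."""
    failures = sum(1 for _ in range(repeats) if not run_test(revision))
    return failures / repeats


def functional_bisect(
    good: int,
    bad: int,
    run_test: Callable[[int], bool],
    repeats: int = 10,
    threshold: float = 0.5,
) -> int:
    """Return the first revision whose failure rate reaches `threshold`.

    Assumes good < bad, that `good` is below the threshold, and that `bad`
    is at or above it.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if failure_rate(run_test, mid, repeats) >= threshold:
            bad = mid   # Culprit is at or before mid.
        else:
            good = mid  # Culprit is after mid.
    return bad
```

Running the test several times per revision is what lets the search cope with flaky failures: a single pass or fail at a revision is not treated as conclusive.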

@ -1,5 +1,7 @@
# How to snooze an alert
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
After addressing an alert, the next step is to snooze it.
Snoozing an alert hides the alert, moving it to a collapsed section at the bottom of the "Consistent alerts" section until the specified time has expired. This acts as a signal to yourself and other sheriffs that no further action is necessary until the alert becomes unsnoozed.

@ -1,18 +1,14 @@
# Perf bot health sheriff rotation
# Perf Bot Health Sheriff
## Warning
The goal of the perf bot health sheriff rotation is to ensure that the benchmarks running on our perf waterfall continue to produce data and catch regressions quickly. This is also known as "keeping the bots green" and is primarily achieved by triaging incoming alerts. Note that a different rotation, [Perf Regressions Sheriffs](../perf_regression_sheriffing.md), is focused on performance.
**Note that Sheriff-O-Matic currently doesn't work for the perf waterfall
[crbug.com/984159](https://crbug.com/984159).
Please use [Milo chrome.perf
console](https://ci.chromium.org/p/chrome/g/chrome.perf/console) instead.**
## Goal
The goal of the perf bot health sheriff rotation is to ensure that the benchmarks running on our perf waterfall continue to produce data and catch regressions quickly. This is also known as "keeping the bots green" and is primarily achieved by triaging incoming alerts.
For more information on Perf Bot Health Sheriffing, who's on rotation, how to handle specific
tasks, and how to swap shifts, please see [Perf Bot Health
Sheriffs](http://goto.google.com/perf-bot-health-sheriffs)
## Quick links
* [Perf Bot Health Sheriffing Overview and How-To](http://goto.google.com/perf-bot-health-sheriffs)
* [How to determine what story is failing](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/what_test_is_failing.md)
* [How to disable a story](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_disable_a_story.md)
* [How to launch a functional bisect](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_launch_a_functional_bisect.md)
@ -21,102 +17,3 @@ The goal of the perf bot health sheriff rotation is to ensure that the benchmark
* [How to handle a new problem](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_handle_a_new_problem.md)
* [How to follow up on an alert](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_follow_up_on_an_alert.md)
* [How to address duplicate alerts](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_address_duplicate_alerts.md)
* [Glossary](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/glossary.md)
[TOC]
## Vocabulary
Definitions of various bot health related vocabulary can be found in our [glossary](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/glossary.md).
## High-level responsibilities
The sheriff's role is to work through the list of failures, fixing the easiest ones and routing the rest to the correct owners. This mostly requires filing bugs, disabling benchmarks and stories, launching bisects, and reverting any CLs that are obviously responsible for breakages.
Additionally, the sheriff should watch the [catapult
roll](https://autoroll.skia.org/r/catapult-autoroll), which should
automatically TBR the sheriff. If the catapult roll fails, the sheriff should
investigate and revert suspect changelists.
Near the end of their shift, sheriffs should also inspect [this dashboard](https://dashboards.corp.google.com/_e3cbeb60_d250_4e67_8795_56cd9af8a303) for the time covered during their shift, and do a first-pass analysis of any anomalies (e.g. jobs taking 6 hours when they normally take 1.5).
The sheriff should *not* feel responsible for investigating hard problems. The volume of incoming alerts makes this infeasible. Instead, they should delegate deep investigations to the right owners. As a rule of thumb, a trained sheriff should expect to spend 10-20 minutes per alert and should never be spending more than an hour per alert.
## Workflow
~~Incoming failures are shown in [Sheriff-o-matic](https://sheriff-o-matic.appspot.com/chromium.perf), which acts as a task management system for bot health sheriffs. Failures are divided into three groups on the dashboard:~~
* ~~**Infra failures** show general infrastructure problems that are affecting benchmarks. Besides surfacing in Sheriff-o-matic, we also need to check for down bots in the lame duck pool. Please file a ticket for any bots you see in [this list](https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=os&c=task&c=status&c=pool&f=status%3Adead&f=pool%3Achrome.tests.perf&l=100&q=pool%3Achrome.tests.perf&s=id%3Aasc) or [this list for webview](https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=os&c=task&c=status&c=pool&f=status%3Adead&f=pool%3Achrome.tests.perf-webview&l=100&q=pool%3Achrome.tests.perf&s=id%3Aasc) as they will not show up in Sheriff-o-matic.~~
* ~~**Consistent failures** show benchmarks that have been failing for a while.~~
* ~~**New failures** show benchmarks that have recently started failing.~~
~~Of these three groups, the sheriff should only be concerned with **infra failures** and **consistent failures.** New failures are too likely to be one-off flakes to warrant investigation.~~
~~The high-level workflow is to start at the top of the list of failures and address one alert at a time. The alerts are ordered roughly by impact.~~
~~As the sheriff addresses alerts, the number of alerts will generally decrease as problems with the same cause get grouped together and failures get fixed. Addressed alerts will also move to the bottom of the list. Ideally, Sheriff-o-matic should reflect the work you've done so that a new sheriff could potentially take over at any time and pick up at the top of the list.~~
**Note that Sheriff-O-Matic currently doesn't work for the perf waterfall
[crbug.com/984159](https://crbug.com/984159).
Please use [Milo chrome.perf
console](https://ci.chromium.org/p/chrome/g/chrome.perf/console) instead.**
## How to address each alert
Alerts can be addressed by answering the following questions:
### Has a previous sheriff already addressed this alert?
This category of alert should have a bug already linked with it. This link can be found next to the alert.
![A link to a bug from a Sheriff-o-matic alert](images/som_alert_bug.png)
Instructions can be found [here](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_follow_up_on_an_alert.md) on how to follow up with an existing alert.
### Is this a new alert caused by the same root cause as an already-triaged alert?
This category of alert won't have a bug linked with it yet. However, a bug *does* exist for the issue: it may be linked to another alert, but can otherwise be found [here](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-Sheriff-BotHealth&sort=pri&colspec=ID%20Pri%20M%20Stars%20ReleaseBlock%20Component%20Status%20Owner%20Summary%20OS%20Modified) under the Performance-Sheriff-BotHealth label in monorail. For example:
![A link to an alert group in Sheriff-o-matic](images/som_first_alert.png)
and
![A link to a duplicate alert in Sheriff-o-matic](images/som_duplicate_alert.png)
are both in the list of current alerts but represent the same failure.
It can sometimes be tricky to differentiate between these alerts and ones caused by completely new problems, but sheriffs can always treat an alert as new and merge it with another later.
Instructions can be found [here](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_address_duplicate_alerts.md) on how to handle a duplicate alert.
### Is this a new alert caused by a new problem?
This category of alert doesn't yet have a bug associated with it. It's the most common category and requires the most expertise to handle.
Instructions can be found [here](https://chromium.googlesource.com/chromium/src/+/main/docs/speed/bot_health_sheriffing/how_to_handle_a_new_problem.md) on how to handle an alert for a new problem.
## After your shift is over
Your only responsibility after your shift concludes is to follow up with any bugs that would no longer appear on the dashboard (i.e. the failure has stopped) but still need correct routing.
For example, if you disabled a story and snoozed an alert during your shift, you should ensure that the bug is assigned to the benchmark's owner before relinquishing responsibility for the bug.
## Frequently asked questions
### Why do the benchmarks break so often?
The bots run Chrome benchmarks that are complicated integration tests of Chrome. Developers frequently submit code that breaks some part of Chrome, and one of our integration tests (hopefully) tests that bit of code, resulting in a broken benchmark. In some sense, frequent breakages indicate that the benchmarks are working.
Many breakages probably *aren't* good signs, though. If you have ideas on how to reduce the number of breakages or the work required to handle a breakage, submit your idea to the Chrome benchmarking group!
### Do I have to use Sheriff-o-matic?
Yes! Sheriff-o-matic allows us to smoothly hand off responsibility between sheriffs and allows us to standardize sheriffing.
If you find a problem with Sheriff-o-matic or have a feature request, file a bug [here](https://bugs.chromium.org/p/chromium/issues/entry?template=Build%20Infrastructure&components=Infra%3ESheriffing%3ESheriffOMatic&labels=Pri-2,Infra-DX&cc=seanmccullough@chromium.org,martiniss@chromium.org,zhangtiff@chromium.org&comment=Problem+with+Sheriff-o-Matic). The team is usually very responsive and, because of their work, the tool is getting better every day.
### How can I tell if I've done a good job?
It can be hard to tell. Generally, a good goal is to try and have fewer alerts when your shift ends than when it began. Sometimes that isn't possible, though.

@ -1,5 +1,7 @@
# How to determine what story is failing
**Important**: When making changes to this document, also update duplicate files under the [internal docs](http://goto.google.com/perf-bot-health-sheriffs).
The first step in addressing a test failure is to identify what stories are failing.
The easiest way to identify these is to use the [Flakiness dashboard](https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=blink_perf.layout), which is a high-level dashboard showing test passes and failures. (Sheriff-o-matic tries to automatically identify the failing stories, but is often incorrect and therefore can't be trusted.) Open up the flakiness dashboard and select the benchmark and platform in question (pulled from the SOM alert) from the "Test type" and "Builder" dropdowns. You should see a view like this:

@ -1,4 +1,4 @@
# Perf Regression Sheriffing (go/perfregression-sheriff)
# Perf Regression Sheriffing
The perf regression sheriff tracks performance regressions in Chrome's
continuous integration tests. Note that a [different
@ -6,95 +6,12 @@ rotation](perf_bot_sheriffing.md) has been created to ensure the builds and
tests stay green, so the perf regression sheriff role is now entirely focused
on performance.
**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**
Key responsibilities include:
## Key Responsibilities
* Addressing bugs that need attention
* Follow up on Performance Regressions
* Give Feedback on our Infrastructure
* [Address bugs needing attention](#Address-bugs-needing-attention)
* [Follow up on Performance Regressions](#Follow-up-on-Performance-Regressions)
* [Give Feedback on our Infrastructure](#Give-Feedback-on-our-Infrastructure)
## Address bugs needing attention
NOTE: Ensure that you're signed into Monorail.
Use [this Monorail query](https://bugs.chromium.org/p/chromium/issues/list?sort=modified&q=label%3AChromeperf-Sheriff-NeedsAttention%2CChromeperf-Auto-NeedsAttention%20-has%3Aowner&can=2)
to find automatically triaged issues which need attention.
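To see what that link actually searches for, the URL-encoded `q` parameter can be decoded with Python's standard library. This is just a convenience sketch; the constant and function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The Monorail link from the paragraph above, split across lines for readability.
NEEDS_ATTENTION_URL = (
    "https://bugs.chromium.org/p/chromium/issues/list"
    "?sort=modified"
    "&q=label%3AChromeperf-Sheriff-NeedsAttention%2CChromeperf-Auto-NeedsAttention%20-has%3Aowner"
    "&can=2"
)


def monorail_query(url: str) -> str:
    """Return the decoded `q` search expression from a Monorail issue-list URL."""
    params = parse_qs(urlsplit(url).query)
    return params["q"][0]


print(monorail_query(NEEDS_ATTENTION_URL))
# label:Chromeperf-Sheriff-NeedsAttention,Chromeperf-Auto-NeedsAttention -has:owner
```

The decoded expression names the two `NeedsAttention` labels and excludes issues that already have an owner.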
NOTE: If the list of issues that need attention is empty, please jump ahead to
[Follow up on Performance Regressions](#Follow-up-on-Performance-Regressions).
Issues in the list will include automatically filed and bisected regressions
that are supported by the Chromium Perf Sheriff rotation. For each of the
issues:
1. Determine the cause of the failure:
* If it's Pinpoint failing to find a culprit, consider re-running the
failing Pinpoint job.
* If it's the Chromeperf Dashboard failing to start a Pinpoint bisection,
consider running a bisection from the grouped alerts. The issue
description should have a link to the group of anomalies associated with
the issue.
* If this was a manual escalation (e.g. a suspected culprit author put the
`Chromeperf-Sheriff-NeedsAttention` label to seek help) use the tools at
your disposal, like:
* Retry the most recent Pinpoint job, potentially changing the parameters.
* Inspect the results of the Pinpoint job associated with the issues and
decide that this could be noise.
* In cases where it's unclear what next should be done, escalate the issue
to the Chrome Speed Tooling team by adding the `Speed>Bisection` component
and leaving the issue `Untriaged` or `Unconfirmed`.
2. Remove the `Chromeperf-Sheriff-NeedsAttention` or
`Chromeperf-Auto-NeedsAttention` label once you've acted on an issue.
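As a compact restatement of the two steps above, the mapping from cause to next steps can be written as a lookup table. The sketch below is illustrative only: the `Cause` enum, `NEXT_STEPS` table, and `triage` function are hypothetical names, and nothing here talks to Monorail or Pinpoint.

```python
from enum import Enum, auto
from typing import FrozenSet, List, Tuple

NEEDS_ATTENTION_LABELS = frozenset({
    "Chromeperf-Sheriff-NeedsAttention",
    "Chromeperf-Auto-NeedsAttention",
})


class Cause(Enum):
    PINPOINT_FOUND_NO_CULPRIT = auto()
    BISECTION_NOT_STARTED = auto()
    MANUAL_ESCALATION = auto()
    UNCLEAR = auto()


NEXT_STEPS = {
    Cause.PINPOINT_FOUND_NO_CULPRIT: ["Re-run the failing Pinpoint job"],
    Cause.BISECTION_NOT_STARTED: [
        "Run a bisection from the grouped alerts linked in the issue description",
    ],
    Cause.MANUAL_ESCALATION: [
        "Retry the most recent Pinpoint job, potentially changing the parameters",
        "Inspect the Pinpoint results and decide whether this could be noise",
    ],
    Cause.UNCLEAR: [
        "Add the Speed>Bisection component and leave the issue Untriaged or Unconfirmed",
    ],
}


def triage(cause: Cause, labels: FrozenSet[str]) -> Tuple[List[str], FrozenSet[str]]:
    """Return the suggested next steps and the labels left after acting on the issue."""
    # Step 2: both NeedsAttention labels come off once the issue has been acted on.
    return NEXT_STEPS[cause], frozenset(labels) - NEEDS_ATTENTION_LABELS
```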
**For alerts related to `resource_sizes`:** Refer to
[apk_size_regressions.md](apk_size_regressions.md).
## Follow up on Performance Regressions
Please spend any spare time driving down bugs from the [regression
backlog](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Performance%3DSheriff+Type%3ABug+modified-before%3Atoday-6&sort=-modified).
Treat these bugs as you would your own -- investigate the regressions, find out
what the next step should be, and then move the bug along. Some possible next steps
and questions to answer are:
* Should the bug be closed?
* Are there questions that need to be answered?
* Are there people that should be added to the CC list?
* Is the correct owner assigned?
When a bug does need to be pinged, rather than adding a generic "ping", it's
much more effective to include the username and action item.
You should aim to end your shift with an empty backlog, but it's important to
still advance each bug in a meaningful way.
After your shift, please try to follow up on the bugs you filed weekly. Kick off
new bisects if the previous ones failed, and if the bisect picks a likely
culprit follow up to ensure the CL author addresses the problem. If you are
certain that a specific CL caused a performance regression, and the author does
not have an immediate plan to address the problem, please revert the CL.
## Give Feedback on our Infrastructure
Perf regression sheriffs have their eyes on the perf dashboard and bisects
more than anyone else, and their feedback is invaluable for making sure these
tools are accurate and improving them. Please file bugs and feature requests
as you see them:
* **Perf Dashboard**: Please use the red "Report Issue" link in the navbar.
* **Pinpoint**: If Pinpoint is identifying the wrong CL as culprit
or missing a clear culprit, or not reproducing what appears to be a clear
regression, please file an issue in crbug with the `Speed>Bisection`
component.
* **Noisy Tests**: Please file a bug in crbug with component `Speed>Benchmarks`
and [cc the owner](http://go/perf-owners).
For more information on these responsibilities, how to swap shifts, and more,
please see [Perf Regression
Sheriffs](http://goto.google.com/chrome-perf-regression-sheriffing)