
Reland "jumbo: stable assignment of inputs to chunks"

This is a reland of 2c7a71c3fd

Original change's description:
> jumbo: stable assignment of inputs to chunks
>
> Adding or removing a file from a jumbo source set causes on average
> half of the chunks to have their inputs reallocated.
> Derive chunk boundaries from a combination of list position and path
> content. This is so that when a file is added or removed, only the
> boundaries with adjacent chunks typically move.
> For a balance between maximum chunk size and stability of partitions:
> * Partition uniformly into the required number of chunks.
> * Pick a "center" from each chunk by minimum hash value.
> * Pick the boundaries between centers by maximum hash value.
>
> Bug: 782863
> Change-Id: Ie71d82b132e8145b4ed3d1141f85886a12149d5a
> Reviewed-on: https://chromium-review.googlesource.com/1102218
> Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
> Reviewed-by: Dirk Pranke <dpranke@chromium.org>
> Reviewed-by: Daniel Bratell <bratell@opera.com>
> Commit-Queue: Dirk Pranke <dpranke@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#570623}

Bug: 860646
Change-Id: I55b326beb716789896c39d58be8e793c97f7097d
Reviewed-on: https://chromium-review.googlesource.com/1121976
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#574233}
Authored by David Michael Barr on 2018-07-11 17:48:37 +00:00,
committed by Commit Bot.
Parent: e97433df55
Commit: 73ea8f55b7

3 changed files with 60 additions and 5 deletions
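
To illustrate the problem this change addresses, here is a minimal sketch (plain Python with hypothetical file names, not code from this commit) of how purely position-based chunking reallocates inputs when one file is deleted:

    def uniform_stops(input_count, output_count):
      # Uniform chunk boundaries via ceiling division (the same starting
      # point that generate_chunk_stops in this commit uses).
      return [((i + 1) * input_count + output_count - 1) // output_count
              for i in range(output_count)]

    def split(inputs, stops):
      # Cut the input list at each stop index.
      return [inputs[start:stop] for start, stop in zip([0] + stops, stops)]

    files = ["file%02d.cc" % n for n in range(12)]  # hypothetical names
    before = split(files, uniform_stops(12, 4))
    after = split(files[1:], uniform_stops(11, 4))  # delete the first file
    # Every remaining file shifts down one position, so each chunk receives
    # a different input set and all four jumbo files must be rebuilt.
    print(sum(b != a for b, a in zip(before, after)))  # prints 4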

@@ -14,7 +14,7 @@ declare_args() {
   # when frequently changing a set of cpp files.
   jumbo_build_excluded = []
 
-  # How many files to group at most. Smaller numbers give more
+  # How many files to group on average. Smaller numbers give more
   # parallelism, higher numbers give less total CPU usage. Higher
   # numbers also give longer single-file recompilation times.
   #

@@ -12,12 +12,67 @@ for compiling.
 from __future__ import print_function
 
 import argparse
+import hashlib
 import cStringIO
 import os
 
 
-def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
-  output_count = len(outputs)
+def cut_ranges(boundaries):
+  # Given an increasing sequence of boundary indices, generate a sequence of
+  # non-overlapping ranges. The total range is inclusive of the first index
+  # and exclusive of the last index from the given sequence.
+  for start, stop in zip(boundaries, boundaries[1:]):
+    yield range(start, stop)
+
+
+def generate_chunk_stops(inputs, output_count, smart_merge=True):
+  # Note: In the comments below, unique numeric labels are assigned to files.
+  # Consider them as the sorted rank of the hash of each file path.
+
+  # Simple jumbo chunking generates uniformly sized chunks with the ceiling of:
+  #   (output_index + 1) * input_count / output_count
   input_count = len(inputs)
+  stops = [((i + 1) * input_count + output_count - 1) // output_count
+           for i in range(output_count)]
+  # This is disruptive at times because file insertions and removals can
+  # invalidate many chunks as all files are offset by one.
+  # For example, say we have 12 files in 4 uniformly sized chunks:
+  #   9, 4, 0; 7, 1, 11; 5, 10, 2; 6, 3, 8
+  # If we delete the first file we get:
+  #   4, 0, 7; 1, 11, 5; 10, 2, 6; 3, 8
+  # All of the chunks have new sets of inputs.
+
+  # With path-aware chunking, we start with the uniformly sized chunks:
+  #   9, 4, 0; 7, 1, 11; 5, 10, 2; 6, 3, 8
+  # First we find the smallest rank in each of the chunks. Their indices are
+  # stored in the |centers| list and in this example the ranks would be:
+  #   0, 1, 2, 3
+  # Then we find the largest rank between the centers. Their indices are
+  # stored in the |stops| list and in this example the ranks would be:
+  #   7, 11, 6
+  # These files mark the boundaries between chunks and these boundary files
+  # are often maintained even as files are added or deleted.
+  # In this example, 7, 11, and 6 are the first files in each chunk:
+  #   9, 4, 0; 7, 1; 11, 5, 10, 2; 6, 3, 8
+  # If we delete the first file and repeat the process we get:
+  #   4, 0; 7, 1; 11, 5, 10, 2; 6, 3, 8
+  # Only the first chunk has a new set of inputs.
+  if smart_merge:
+    # Starting with the simple chunks, every file is assigned a rank.
+    # This requires a hash function that is stable across runs.
+    hasher = lambda n: hashlib.md5(inputs[n]).hexdigest()
+    # In each chunk there is a key file with lowest rank; mark them.
+    # Note that they will not easily change.
+    centers = [min(indices, key=hasher) for indices in cut_ranges([0] + stops)]
+    # Between each pair of key files there is a file with highest rank.
+    # Mark these to be used as border files. They also will not easily change.
+    # Forget the initial chunks and create new chunks by splitting the list at
+    # every border file.
+    stops = [max(indices, key=hasher) for indices in cut_ranges(centers)]
+    stops.append(input_count)
+  return stops
+
+
+def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
+  chunk_stops = generate_chunk_stops(inputs, len(outputs))
 
   written_inputs = 0
   for output_index, output_file in enumerate(outputs):

@@ -31,7 +86,7 @@ def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
     out = cStringIO.StringIO()
     out.write("/* This is a Jumbo file. Don't edit. */\n\n")
     out.write("/* Generated with merge_for_jumbo.py. */\n\n")
-    input_limit = (output_index + 1) * input_count / output_count
+    input_limit = chunk_stops[output_index]
     while written_inputs < input_limit:
       filename = inputs[written_inputs]
       written_inputs += 1
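
A usage sketch of the new partitioning (assuming merge_for_jumbo.py is importable from the current directory and run under Python 2, which the cStringIO and str hashing above imply; the paths are hypothetical):

    from merge_for_jumbo import cut_ranges, generate_chunk_stops

    def chunk_inputs(inputs, output_count):
      # Materialize each chunk's input list from the generated stop indices.
      stops = generate_chunk_stops(inputs, output_count)
      return [[inputs[i] for i in indices]
              for indices in cut_ranges([0] + stops)]

    files = ["src/file%02d.cc" % n for n in range(12)]  # hypothetical paths
    before = chunk_inputs(files, 4)
    after = chunk_inputs(files[1:], 4)  # delete the first file
    # Boundary files are chosen by hash rank, so they usually survive the
    # deletion; typically only the chunks adjacent to the change get new
    # inputs, unlike the uniform scheme where every chunk shifts.
    print(sum(b != a for b, a in zip(before, after)))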

@@ -51,7 +51,7 @@ source files.
 
 ## Tuning
 
-By default at most `50`, or `8` when using goma, files are merged at a
+By default on average `50`, or `8` when using goma, files are merged at a
 time. The more files that are merged, the less total CPU time is
 needed, but parallelism is reduced. This number can be changed by
 setting `jumbo_file_merge_limit`.
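
The arithmetic behind the reworded limit, as a sketch (an assumed model of how the build derives the output count, not the actual GN code): the number of jumbo files comes from ceiling division by the merge limit, so the limit bounds the average chunk size, while the hash-chosen boundaries above let individual chunks run somewhat larger or smaller.

    def jumbo_file_count(input_count, jumbo_file_merge_limit):
      # Hypothetical helper: enough outputs that the average chunk size
      # stays at or below the limit.
      return (input_count + jumbo_file_merge_limit - 1) // jumbo_file_merge_limit

    print(jumbo_file_count(120, 50))  # 3 outputs: 40 inputs each on average
    print(jumbo_file_count(120, 8))   # 15 outputs with the goma default of 8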