
Reland "jumbo: stable assignment of inputs to chunks"

This is a reland of 2c7a71c3fd

Original change's description:
> jumbo: stable assignment of inputs to chunks
>
> Adding or removing a file from a jumbo source set causes on average
> half of the chunks to have their inputs reallocated.
> Derive chunk boundaries from a combination of list position and path
> content. This is so that when a file is added or removed, only the
> boundaries with adjacent chunks typically move.
> For a balance between maximum chunk size and stability of partitions:
> * Partition uniformly into the required number of chunks.
> * Pick a "center" from each chunk by minimum hash value.
> * Pick the boundaries between centers by maximum hash value.
>
> Bug: 782863
> Change-Id: Ie71d82b132e8145b4ed3d1141f85886a12149d5a
> Reviewed-on: https://chromium-review.googlesource.com/1102218
> Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
> Reviewed-by: Dirk Pranke <dpranke@chromium.org>
> Reviewed-by: Daniel Bratell <bratell@opera.com>
> Commit-Queue: Dirk Pranke <dpranke@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#570623}

Bug: 860646
Change-Id: I55b326beb716789896c39d58be8e793c97f7097d
Reviewed-on: https://chromium-review.googlesource.com/1121976
Reviewed-by: Dirk Pranke <dpranke@chromium.org>
Commit-Queue: Dirk Pranke <dpranke@chromium.org>
Cr-Commit-Position: refs/heads/master@{#574233}
Author: David Michael Barr
Date: 2018-07-11 17:48:37 +00:00
Committed by: Commit Bot
Parent: e97433df55
Commit: 73ea8f55b7
3 changed files with 60 additions and 5 deletions
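
The three steps quoted above are easiest to follow on the worked example that the script's own comments use below. Here is a short illustrative sketch (not code from this change; the names ranks, uniform, centers, and stops exist only for the demonstration), where each file is represented by the sorted rank of its path hash:

    # 12 files, 4 chunks; each file is labeled with the rank of its path
    # hash (0 = smallest), matching the example in merge_for_jumbo.py.
    ranks = [9, 4, 0, 7, 1, 11, 5, 10, 2, 6, 3, 8]

    # Partition uniformly into the required number of chunks.
    uniform = [range(i * 3, (i + 1) * 3) for i in range(4)]

    # Pick a "center" from each chunk by minimum hash value.
    centers = [min(chunk, key=lambda i: ranks[i]) for chunk in uniform]

    # Pick the boundaries between centers by maximum hash value; each
    # boundary file becomes the first file of a new chunk.
    stops = [max(range(a, b), key=lambda i: ranks[i])
             for a, b in zip(centers, centers[1:])]

    print(centers)  # [2, 4, 8, 10] -> the files with ranks 0, 1, 2, 3
    print(stops)    # [3, 5, 9]     -> the files with ranks 7, 11, 6

Splitting the input list at those stops gives the chunks 9, 4, 0; 7, 1; 11, 5, 10, 2; 6, 3, 8, exactly as in the comment block of the new generate_chunk_stops() in the diff below.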

@@ -14,7 +14,7 @@ declare_args() {
   # when frequently changing a set of cpp files.
   jumbo_build_excluded = []
 
-  # How many files to group at most. Smaller numbers give more
+  # How many files to group on average. Smaller numbers give more
   # parallelism, higher numbers give less total CPU usage. Higher
   # numbers also give longer single-file recompilation times.
   #
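
The wording change matters: with hash-derived boundaries the chunks are no longer uniformly sized, so the merge limit now bounds the average chunk size rather than the maximum. A rough illustration (the ceiling division is an assumption about how the build derives the output count from this arg, not code from this change):

    import math

    input_count = 1003           # files in a hypothetical jumbo source set
    jumbo_file_merge_limit = 50  # the GN arg documented above

    # Assumed: enough output chunks that the *average* chunk size stays
    # at or below the merge limit; individual chunks may exceed it.
    output_count = int(math.ceil(input_count / float(jumbo_file_merge_limit)))
    print(output_count)                       # 21
    print(input_count / float(output_count))  # ~47.8 files per chunk on average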

@@ -12,12 +12,67 @@ for compiling.
 from __future__ import print_function
 
 import argparse
+import hashlib
 import cStringIO
 import os
 
 
-def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
-  output_count = len(outputs)
+def cut_ranges(boundaries):
+  # Given an increasing sequence of boundary indices, generate a sequence of
+  # non-overlapping ranges. The total range is inclusive of the first index
+  # and exclusive of the last index from the given sequence.
+  for start, stop in zip(boundaries, boundaries[1:]):
+    yield range(start, stop)
+
+
+def generate_chunk_stops(inputs, output_count, smart_merge=True):
+  # Note: In the comments below, unique numeric labels are assigned to files.
+  # Consider them as the sorted rank of the hash of each file path.
+  # Simple jumbo chunking generates uniformly sized chunks with the ceiling of:
+  #   (output_index + 1) * input_count / output_count
+  input_count = len(inputs)
+  stops = [((i + 1) * input_count + output_count - 1) // output_count
+           for i in range(output_count)]
+  # This is disruptive at times because file insertions and removals can
+  # invalidate many chunks as all files are offset by one.
+  # For example, say we have 12 files in 4 uniformly sized chunks:
+  #   9, 4, 0; 7, 1, 11; 5, 10, 2; 6, 3, 8
+  # If we delete the first file we get:
+  #   4, 0, 7; 1, 11, 5; 10, 2, 6; 3, 8
+  # All of the chunks have new sets of inputs.
+  # With path-aware chunking, we start with the uniformly sized chunks:
+  #   9, 4, 0; 7, 1, 11; 5, 10, 2; 6, 3, 8
+  # First we find the smallest rank in each of the chunks. Their indices are
+  # stored in the |centers| list and in this example the ranks would be:
+  #   0, 1, 2, 3
+  # Then we find the largest rank between the centers. Their indices are stored
+  # in the |stops| list and in this example the ranks would be:
+  #   7, 11, 6
+  # These files mark the boundaries between chunks and these boundary files are
+  # often maintained even as files are added or deleted.
+  # In this example, 7, 11, and 6 are the first files in each chunk:
+  #   9, 4, 0; 7, 1; 11, 5, 10, 2; 6, 3, 8
+  # If we delete the first file and repeat the process we get:
+  #   4, 0; 7, 1; 11, 5, 10, 2; 6, 3, 8
+  # Only the first chunk has a new set of inputs.
+  if smart_merge:
+    # Starting with the simple chunks, every file is assigned a rank.
+    # This requires a hash function that is stable across runs.
+    hasher = lambda n: hashlib.md5(inputs[n]).hexdigest()
+    # In each chunk there is a key file with lowest rank; mark them.
+    # Note that they will not easily change.
+    centers = [min(indices, key=hasher) for indices in cut_ranges([0] + stops)]
+    # Between each pair of key files there is a file with highest rank.
+    # Mark these to be used as border files. They also will not easily change.
+    # Forget the initial chunks and create new chunks by splitting the list at
+    # every border file.
+    stops = [max(indices, key=hasher) for indices in cut_ranges(centers)]
+    stops.append(input_count)
+  return stops
+
+
+def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
+  chunk_stops = generate_chunk_stops(inputs, len(outputs))
 
   written_inputs = 0
   for output_index, output_file in enumerate(outputs):
@@ -31,7 +86,7 @@ def write_jumbo_files(inputs, outputs, written_input_set, written_output_set):
     out = cStringIO.StringIO()
     out.write("/* This is a Jumbo file. Don't edit. */\n\n")
     out.write("/* Generated with merge_for_jumbo.py. */\n\n")
-    input_limit = (output_index + 1) * input_count / output_count
+    input_limit = chunk_stops[output_index]
     while written_inputs < input_limit:
       filename = inputs[written_inputs]
       written_inputs += 1
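
To see the stability this buys, one can run the new chunking over a synthetic file list and delete one file. The following self-contained sketch ports the function above to Python 3 (hashlib.md5() requires bytes there, hence the .encode()); the file names and counts are made up:

    import hashlib

    def cut_ranges(boundaries):
      for start, stop in zip(boundaries, boundaries[1:]):
        yield range(start, stop)

    def generate_chunk_stops(inputs, output_count):
      input_count = len(inputs)
      # Uniform partition, then hash-chosen centers and boundaries,
      # exactly as in the diff above (with smart_merge always on).
      stops = [((i + 1) * input_count + output_count - 1) // output_count
               for i in range(output_count)]
      hasher = lambda n: hashlib.md5(inputs[n].encode()).hexdigest()
      centers = [min(indices, key=hasher) for indices in cut_ranges([0] + stops)]
      stops = [max(indices, key=hasher) for indices in cut_ranges(centers)]
      stops.append(input_count)
      return stops

    def chunks(inputs, output_count):
      stops = generate_chunk_stops(inputs, output_count)
      return [inputs[start:stop] for start, stop in zip([0] + stops, stops)]

    files = ['src/file_%02d.cc' % i for i in range(40)]
    before = chunks(files, 5)
    after = chunks(files[1:], 5)  # delete the first file

    # With uniform chunking every chunk would shift by one file; here,
    # typically only the chunks adjacent to the deletion change.
    print([a == b for a, b in zip(before, after)])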

@@ -51,7 +51,7 @@ source files.
 ## Tuning
 
-By default at most `50`, or `8` when using goma, files are merged at a
+By default on average `50`, or `8` when using goma, files are merged at a
 time. The more files that are merged, the less total CPU time is
 needed, but parallelism is reduced. This number can be changed by
 setting `jumbo_file_merge_limit`.
 