update_engine_client: add O_DSYNC flag

Out-Of-Memory (OOM) kernel crash reports spike about 2x when new
releases are pushed to Jetstream devices: "Google Wifi" has only
512MB of total RAM, and only about 370MB of that is available for all
user space tasks (AP daemons, update engine, crash reporting, etc).
About half of the OOM reports show "active_anon" was > 350MB, though
"dirty" was generally very low (< 10MB) in those same OOM reports.

While "cros flash $IP" is running (it uses update_engine_client), "dirty"
climbed to ~100MB on gale (~200MB on whirlwind) for most of the time
the update was being written. Those pages are not available for
general use until they have been written back. Note that "cros flash"
performs a "full update" by default, not the "delta update" that users
typically get. The difference is where the data comes from: a "delta
update" combines "patch updates" with the currently in-use partition,
while a "full update" sends the entire (compressed) image over the
network. In both cases, the entire new KERNEL and ROOT partitions are
rewritten.

Adding two flags to the open() call can improve this situation:

  O_DSYNC guarantees that the write() syscall does not return until
  the data has (a) landed on the device and (b) been flushed from the
  device cache (uses SCSI FUA). This limits the number of dirty
  pages occupying host RAM.

  O_DIRECT bypasses the buffer cache and writes directly from host
  memory. However, we can't turn this on until the write() calls use
  block-sized lengths and block-aligned offsets. So the current
  behavior will still "pollute" the buffer cache, but at least those
  pages of memory can be recycled for other apps (due to O_DSYNC).

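For illustration only, the effect of the two flags can be sketched in
Python (the real change lives in update_engine's own code, which is
not reproduced here). The names write_with_dsync and is_block_aligned,
the chunk size and the block size passed by callers are all
hypothetical. The sketch assumes a Linux host where os.O_DSYNC is
defined.

```python
import os


def write_with_dsync(path: str, data: bytes, chunk_size: int = 1 << 20) -> int:
    """Write data to path with O_DSYNC set; return the bytes written.

    With O_DSYNC, each os.write() returns only after the data has
    reached the storage device, so dirty pages do not pile up in RAM.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            # os.write() may write fewer bytes than requested.
            written += os.write(fd, view[written:written + chunk_size])
        return written
    finally:
        os.close(fd)


def is_block_aligned(offset: int, length: int, block_size: int) -> bool:
    """Check the precondition named above for enabling O_DIRECT:
    the offset and the length are both multiples of the block size."""
    return offset % block_size == 0 and length % block_size == 0
```

A caller could use is_block_aligned() to decide whether a given
write() would qualify for O_DIRECT. O_DIRECT also expects an aligned
memory buffer, which this sketch does not check.
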
With this change, "dirty" was generally < 200KB on whirlwind (huge
improvement!) with two exceptions:

  60 seconds after "Preparing update" (peak ~70MB for < 5 seconds)
  10 seconds after "Update completed" (peak ~220MB for about 30 seconds)

The first exception is scp getting the "delta" update from the host.
The second exception is scp followed by "tar + gzip", possibly the
download and unpacking of /usr/local/autotest (et al) for "test" builds.

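One way to collect "dirty" peaks like the ones quoted above is to
poll the kernel's memory counters during the update. This message
does not say how the numbers were gathered, so the sketch below
assumes they correspond to the "Dirty" field of /proc/meminfo. The
function names and the polling interval are hypothetical.

```python
import time


def parse_meminfo_kb(text: str, field: str = "Dirty") -> int:
    """Return the value (in kB) of one field from /proc/meminfo text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    raise KeyError(field)


def sample_dirty_kb(path: str = "/proc/meminfo") -> int:
    """Read the current Dirty value (in kB) from the running kernel."""
    with open(path) as f:
        return parse_meminfo_kb(f.read())


def peak_dirty_kb(duration_s: float, interval_s: float = 1.0) -> int:
    """Poll Dirty for duration_s seconds and return the highest sample."""
    deadline = time.monotonic() + duration_s
    peak = sample_dirty_kb()
    while time.monotonic() < deadline:
        time.sleep(interval_s)
        peak = max(peak, sample_dirty_kb())
    return peak
```

Running peak_dirty_kb() on the device for the duration of an update
would give a single peak figure. Short spikes between samples can be
missed, so a smaller interval gives a tighter bound.
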
BUG=b:31709028
TEST=monitor buffer cache size while update_engine_client is running on gale.
Verify "dirty" usage is < 10MB peak with change.
test_that $DUT_IP autoupdate_EndToEndTest
Change-Id: I993aa541cc29d818920312c0a900afaba9f88b74
Reviewed-on: https://chromium-review.googlesource.com/502369
Commit-Ready: Grant Grundler <grundler@chromium.org>
Tested-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>
Reviewed-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-by: Ben Chan <benchan@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>
1 file changed