| Christopher Ferris | 4316d43 | 2019-06-27 00:08:23 -0700 | 1 | # Native Memory Allocator Verification |
| 2 | This document describes how to verify the native memory allocator on Android. |
| 3 | This procedure should be followed when upgrading or moving to a new allocator. |
| 4 | A minor upgrade might not need to run all of the benchmarks; however, |
| 5 | at least the |
| 6 | [SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark), |
| 7 | [Memory Replay Benchmarks](#memory-replay-benchmarks) and |
| 8 | [Performance Trace Benchmarks](#performance-trace-benchmarks) should be run. |
| 9 | |
| 10 | It is important to note that there are two modes for a native allocator |
| 11 | to run in on Android. The first is the normal allocator; the second is |
| 12 | called the svelte config, which is designed to run on memory constrained |
| 13 | systems and be a bit slower, but take less PSS. To enable the svelte config, |
| 14 | add this line to the `BoardConfig.mk` for the given target: |
| 15 | |
| 16 | MALLOC_SVELTE := true |
| 17 | |
| 18 | The `BoardConfig.mk` file is usually found in the directory |
| 19 | `device/<DEVICE_NAME>/` or in a sub directory. |
| 20 | |
| 21 | When evaluating a native allocator, make sure that you benchmark both |
| 22 | versions. |
| 23 | |
| 24 | ## Android Extensions |
| 25 | Android supports a few non-standard functions and mallopt controls that |
| 26 | a native allocator needs to implement. |
| 27 | |
| 28 | ### Iterator Functions |
| 29 | These are functions that are used to implement a memory leak detector |
| 30 | called `libmemunreachable`. |
| 31 | |
| 32 | #### malloc\_disable |
| 33 | This function, when called, should pause all threads that are making a |
| 34 | call to an allocation function (malloc/free/etc). When a call |
| 35 | is made to `malloc_enable`, the paused threads should start running again. |
| 36 | |
| 37 | #### malloc\_enable |
| 38 | This function, when called, does nothing unless there was a previous call |
| 39 | to `malloc_disable`. This call will unpause any thread that was paused |
| 40 | while making a call to an allocation function (malloc/free/etc) after |
| 41 | `malloc_disable` was called. |
| 42 | |
| 43 | #### malloc\_iterate |
| 44 | This function enumerates all of the allocations currently live in the |
| 45 | system. It is meant to be called after a call to `malloc_disable` to |
| 46 | prevent further allocations while this call is being executed. To |
| 47 | see what is expected for this function, the best description is the |
| 48 | tests for this function in `bionic/tests/malloc_iterate_test.cpp`. |
| 49 | |
| 50 | ### Mallopt Extensions |
| 51 | These are mallopt options that Android requires for a native allocator |
| 52 | to work efficiently. |
| 53 | |
| 54 | #### M\_DECAY\_TIME |
| 55 | When set to zero, `mallopt(M_DECAY_TIME, 0)`, it is expected that an |
| 56 | allocator will attempt to purge and release any unused memory back to the |
| 57 | kernel on free calls. This is important in Android to avoid consuming extra |
| 58 | PSS. |
| 59 | |
| 60 | When set to non-zero, `mallopt(M_DECAY_TIME, 1)`, an allocator can delay the |
| 61 | purge and release action. The amount of delay is up to the allocator |
| 62 | implementation, but it should be a reasonable amount of time. The jemalloc |
| 63 | allocator was implemented to have a one second delay. |
| 64 | |
| 65 | The drawback to this option is that most allocators do not have a separate |
| 66 | thread to handle the purge, so the decay is only handled when an |
| 67 | allocation operation occurs. For server processes, this can mean that |
| 68 | PSS is slightly higher when the server is waiting for the next connection |
| 69 | and no other allocation calls are made. The `M_PURGE` option is used to |
| 70 | force a purge in this case. |
| 71 | |
| 72 | For all applications on Android, the call `mallopt(M_DECAY_TIME, 1)` is |
| 73 | made by default. The idea is that it allows application frees to run a |
| 74 | bit faster, while only increasing PSS a bit. |
| 75 | |
| 76 | #### M\_PURGE |
| 77 | When called, `mallopt(M_PURGE, 0)`, an allocator should purge and release |
| 78 | any unused memory immediately. The argument for this call is ignored. If |
| 79 | possible, this call should clear thread cached memory if it exists. The |
| 80 | idea is that this can be called to purge memory that has not been |
| 81 | purged when `M_DECAY_TIME` is set to one. This is useful if you have a |
| 82 | server application that does a lot of native allocations and the |
| 83 | application wants to purge that memory before waiting for the next connection. |
| 84 | |
| 85 | ## Correctness Tests |
| 86 | These are the tests that should be run to verify an allocator is |
| 87 | working properly according to Android. |
| 88 | |
| 89 | ### Bionic Unit Tests |
| 90 | The bionic unit tests contain a small number of allocator tests. These |
| 91 | tests are primarily verifying Android extensions and non-standard behavior |
| 92 | of allocation routines such as what happens when a non-power of two alignment |
| 93 | is passed to memalign. |
| 94 | |
| 95 | To run all of the compliance tests: |
| 96 | |
| 97 | adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*" |
| 98 | adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*" |
| 99 | |
| 100 | The allocation tests are not meant to be complete, so it is expected |
| 101 | that a native allocator will have its own set of tests that can be run. |
| 102 | |
| 103 | ### CTS Entropy Test |
| 104 | In addition to the bionic tests, there is also a CTS test that is designed |
| 105 | to verify that the addresses returned by malloc are sufficiently randomized |
| 106 | to help defeat potential security bugs. |
| 107 | |
| 108 | Run this test thusly: |
| 109 | |
| 110 | atest AslrMallocTest |
| 111 | |
| 112 | If there are multiple devices connected to the system, use `-s <SERIAL>` |
| 113 | to specify a device. |
| 114 | |
| 115 | ## Performance |
| 116 | There are multiple different ways to evaluate the performance of a native |
| 117 | allocator on Android. One is allocation speed in various different scenarios, |
| 118 | another is total PSS taken by the allocator. |
| 119 | |
| 120 | The last is virtual address space consumed in 32 bit applications. There is |
| 121 | a limited amount of address space available in 32 bit apps, and there have |
| 122 | been allocator bugs that cause memory failures when too much virtual |
| 123 | address space is consumed. For 64 bit executables, this can be ignored. |
| 124 | |
| 125 | ### Bionic Benchmarks |
| 126 | These are the microbenchmarks that are part of the bionic benchmarks suite. |
| 127 | These benchmarks can be built using this command: |
| 128 | |
| 129 | mmma -j bionic/benchmarks |
| 130 | |
| 131 | These benchmarks are only used to verify the speed of the allocator and |
| 132 | ignore anything related to PSS and virtual address space consumed. |
| 133 | |
| 134 | #### Allocate/Free Benchmarks |
| 135 | These are the benchmarks to verify the allocation speed of a loop doing a |
| 136 | single allocation, touching every page in the allocation to make it resident |
| 137 | and then freeing the allocation. |
| 138 | |
| 139 | To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: |
| 140 | |
| 141 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default |
| 142 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default |
| 143 | |
| 144 | To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: |
| 145 | |
| 146 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1 |
| 147 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1 |
| 148 | |
| 149 | The last value in the output is the size of the allocation in bytes. It is |
| 150 | useful to look at these kinds of benchmarks to make sure that there are |
| 151 | no outliers, but these numbers should not be used to make a final decision. |
| 152 | If these numbers are slightly worse than the current allocator, the |
| 153 | single thread numbers from trace data are a better representation of |
| 154 | real world situations. |
| 155 | |
| 156 | #### Multiple Allocations Retained Benchmarks |
| 157 | These are the benchmarks that examine how the allocator handles multiple |
| 158 | allocations of the same size at the same time. |
| 159 | |
| 160 | The first set of these benchmarks does a set number of 8192 byte allocations |
| 161 | in one loop, and then frees all of the allocations at the end of the loop. |
| 162 | Only the time it takes to do the allocations is recorded; the frees are not |
| 163 | counted. The value of 8192 was chosen since the jemalloc native allocator |
| 164 | had issues with this size. It is possible other sizes might show different |
| 165 | results, but, as mentioned before, these microbenchmark numbers should |
| 166 | not be used as absolutes for determining if an allocator is worth using. |
| 167 | |
| 168 | This benchmark is designed to verify that there is no performance issue |
| 169 | related to having multiple allocations alive at the same time. |
| 170 | |
| 171 | To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: |
| 172 | |
| 173 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default |
| 174 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default |
| 175 | |
| 176 | To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: |
| 177 | |
| 178 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1 |
| 179 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1 |
| 180 | |
| 181 | For these benchmarks, the last parameter is the total number of allocations to |
| 182 | do in each loop. |
| 183 | |
| 184 | The other variation of this benchmark is to always do forty allocations in |
| 185 | each loop, but vary the size of the forty allocations. As with the other |
| 186 | benchmark, only the time it takes to do the allocations is tracked; the |
| 187 | frees are not counted. Forty allocations is an arbitrary number that could |
| 188 | be modified in the future. It was chosen because a version of the native |
| 189 | allocator, jemalloc, showed a problem at forty allocations. |
| 190 | |
| 191 | To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: |
| 192 | |
| 193 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default |
| 194 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default |
| 195 | |
| 196 | To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: |
| 197 | |
| 198 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1 |
| 199 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1 |
| 200 | |
| 201 | For these benchmarks, the last parameter in the output is the size of the |
| 202 | allocation in bytes. |
| 203 | |
| 204 | As with the other microbenchmarks, an allocator with numbers close to |
| 205 | the current values is usually sufficient to consider making |
| 206 | a switch. The trace benchmarks are more important than these benchmarks |
| 207 | since they simulate real world allocation profiles. |
| 208 | |
| 209 | #### SQL Allocation Trace Benchmark |
| 210 | This benchmark is a trace of the allocations performed when running |
| 211 | the SQLite BenchMark app. |
| 212 | |
| 213 | This benchmark is designed to verify that the allocator will be performant |
| 214 | in a real world allocation scenario. SQL operations were chosen as a |
| 215 | benchmark because these operations tend to do lots of malloc/realloc/free |
| 216 | calls, and they tend to be on the critical path of applications. |
| 217 | |
| 218 | To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands: |
| 219 | |
| 220 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default |
| 221 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default |
| 222 | |
| 223 | To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands: |
| 224 | |
| 225 | adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1 |
| 226 | adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1 |
| 227 | |
| 228 | These numbers should be comparable to or better than those of the current allocator. |
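
The microbenchmark and SQL trace commands above all follow one pattern: a benchmark name, a `_default` or `_decay1` suffix, and the 64 bit or 32 bit binary path. As a convenience, the sketch below prints every combination as a dry run instead of running it. The function name `list_malloc_benchmark_cmds` is hypothetical, and the sketch assumes the `stdlib_` prefixed names are used as filters for both binaries.

```shell
#!/bin/sh
# Dry run: print the adb command for every benchmark/mode/bitness combination.
# list_malloc_benchmark_cmds is a hypothetical helper name, not part of bionic.
list_malloc_benchmark_cmds() {
  for name in stdlib_malloc_free stdlib_malloc_multiple_8192_allocs \
              stdlib_malloc_forty malloc_sql_trace; do
    # _default runs with mallopt(M_DECAY_TIME, 0), _decay1 with mallopt(M_DECAY_TIME, 1).
    for mode in default decay1; do
      # benchmarktest64 holds the 64 bit binary, benchmarktest the 32 bit one.
      for dir in benchmarktest64 benchmarktest; do
        echo "adb shell /data/$dir/bionic-benchmarks/bionic-benchmarks --benchmark_filter=${name}_${mode}"
      done
    done
  done
}

list_malloc_benchmark_cmds
```

Piping the output to `sh` would run the commands against a connected device.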
| 229 | |
| 230 | ### Memory Trace Benchmarks |
| 231 | These benchmarks measure all three axes of a native allocator: PSS, virtual |
| 232 | address space consumed, and speed of allocation. They are designed to |
| 233 | run on a trace of the allocations from a real world application or system |
| 234 | process. |
| 235 | |
| 236 | To build this benchmark: |
| 237 | |
| 238 | mmma -j system/extras/memory_replay |
| 239 | |
| 240 | This will build two executables: |
| 241 | |
| 242 | /system/bin/memory_replay32 |
| 243 | /system/bin/memory_replay64 |
| 244 | |
| 245 | And these two benchmark executables: |
| 246 | |
| 247 | /data/benchmarktest64/trace_benchmark/trace_benchmark |
| 248 | /data/benchmarktest/trace_benchmark/trace_benchmark |
| 249 | |
| 250 | #### Memory Replay Benchmarks |
| 251 | These benchmarks display PSS, virtual memory consumed (VA space), and do a |
| 252 | bit of performance testing on actual traces taken from running applications. |
| 253 | |
| 254 | The trace data includes what thread does each operation, so the replay |
| 255 | mechanism will simulate this by creating threads and replaying the operations |
| 256 | on a thread as if it was rerunning the real trace. The only issue is that |
| 257 | this is a worst case scenario for allocations happening at the same time |
| 258 | in all threads since it collapses all of the allocation operations to occur |
| 259 | one after another. This causes many threads to allocate at the same |
| 260 | time. The trace data does not include timestamps, |
| 261 | so it is not possible to create a completely accurate replay. |
| 262 | |
| 263 | To generate these traces, see the [Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md), |
| 264 | the option [record\_allocs](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md#record_allocs_total_entries). |
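
Before replaying a trace, it can help to see which operations it contains. The sketch below counts operations by type, assuming each trace line has the form `<thread id>: <operation> <arguments>` as the `record_allocs` documentation describes. The helper name `count_trace_ops` is hypothetical.

```shell
#!/bin/sh
# count_trace_ops: read a trace on stdin and print "<operation> <count>",
# one line per operation, sorted by operation name.
# Assumes lines look like "186: malloc 0xb6f4a000 20"; other lines are skipped.
count_trace_ops() {
  awk '$1 ~ /^[0-9]+:$/ { count[$2]++ }
       END { for (op in count) print op, count[op] }' | sort
}

# Usage (XXX.txt is the name of a trace file):
#   count_trace_ops < XXX.txt
```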
| 265 | |
| 266 | To run these benchmarks, first copy the trace files to the target and |
| 267 | unzip them using these commands: |
| 268 | |
| 269 | adb push system/extras/dumps /data/local/tmp |
| 270 | adb shell 'cd /data/local/tmp/dumps && for name in *.zip; do unzip $name; done' |
| 271 | |
| 272 | Since all of the traces come from applications, the `memory_replay` program |
| 273 | will always call `mallopt(M_DECAY_TIME, 1)` before running the trace. |
| 274 | |
| 275 | Run the benchmark thusly: |
| 276 | |
| 277 | adb shell memory_replay64 /data/local/tmp/dumps/XXX.txt |
| 278 | adb shell memory_replay32 /data/local/tmp/dumps/XXX.txt |
| 279 | |
| 280 | Where XXX.txt is the name of a trace file. |
| 281 | |
| 282 | Every 100000 allocation operations, a dump of the PSS and VA space will be |
| 283 | performed. At the end, a final PSS and VA space number will be printed. |
| 284 | For the most part, the intermediate data can be ignored, but it is always |
| 285 | a good idea to look over the data to verify that no strange spikes are |
| 286 | occurring. |
| 287 | |
| 288 | The performance number is a measure of the time it takes to perform all of |
| 289 | the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc). |
| 290 | For any call that allocates a pointer, the time for the call and the time |
| 291 | it takes to make the pointer completely resident in memory is included. |
| 292 | |
| 293 | The performance numbers for these runs tend to have a wide variability so |
| 294 | they should not be used as absolute values for comparison against the |
| 295 | current allocator. However, they should be in the same range as the current |
| 296 | values. |
| 297 | |
| 298 | When evaluating an allocator, one of the most important traces is the |
| 299 | camera.txt trace. The camera application does very large allocations, |
| 300 | and some allocators might leave large virtual address maps around |
| 301 | rather than delete them. When that happens, it can lead to allocation |
| 302 | failures and would cause the camera app to abort/crash. It is |
| 303 | important to verify that when running this trace using the 32 bit replay |
| 304 | executable, the virtual address space consumed is not much larger than the |
| 305 | current allocator. A small increase (on the order of a few MBs) would be okay. |
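
The virtual address space check for the camera trace can be turned into a simple pass/fail comparison. The sketch below is only an illustration: the helper name `va_increase_ok` is hypothetical, the values are taken by hand from the final output of the two 32 bit replay runs, and the limit of 4096 KB is an assumed stand-in for "a few MBs".

```shell
#!/bin/sh
# va_increase_ok CURRENT_KB NEW_KB LIMIT_KB
# Succeeds when the new allocator uses at most LIMIT_KB more virtual address
# space than the current allocator. All three values are in KB.
va_increase_ok() {
  [ "$(( $2 - $1 ))" -le "$3" ]
}

# Illustrative numbers only, not measured results.
if va_increase_ok 204800 206848 4096; then
  echo "VA increase is within the limit"
else
  echo "VA increase is too large"
fi
```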
| 306 | |
| 307 | NOTE: When a native allocator calls mmap, it is expected that the allocator |
| 308 | will name the map using the call: |
| 309 | |
| 310 | prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc"); |
| 311 | |
| 312 | If the native allocator creates a different name, then it is necessary to |
| 313 | modify the file: |
| 314 | |
| 315 | system/extras/memory_replay/NativeInfo.cpp |
| 316 | |
| 317 | The `GetNativeInfo` function needs to be modified to include the name |
| 318 | of the maps that this allocator includes. |
| 319 | |
| 320 | In addition, in order for the frameworks code to keep track of the memory |
| 321 | of a process, any named maps must be added to the file: |
| 322 | |
| 323 | frameworks/base/core/jni/android_os_Debug.cpp |
| 324 | |
| 325 | Modify the `load_maps` function and add a check of the new expected name. |
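
A map named with the `prctl` call above appears as `[anon:libc_malloc]` in `/proc/<PID>/smaps`, so one way to check the name an allocator uses and how much memory those maps account for is to sum their entries. The sketch below is a rough illustration. `sum_named_maps` is a hypothetical helper, and it is not how `memory_replay` or the frameworks code compute their numbers.

```shell
#!/bin/sh
# sum_named_maps NAME: read smaps-formatted text on stdin and print the total
# Pss and Size (both in kB) of every mapping named [anon:NAME].
sum_named_maps() {
  awk -v want="[anon:$1]" '
    # A mapping header starts with an address range such as 7f00-7f10.
    /^[0-9a-f]+-[0-9a-f]+ / { in_map = (index($0, want) > 0); next }
    # Attribute lines belong to the most recent header.
    in_map && $1 == "Pss:"  { pss += $2 }
    in_map && $1 == "Size:" { size += $2 }
    END { printf "pss_kb=%d va_kb=%d\n", pss, size }
  '
}

# Usage (<PID> is the process to inspect):
#   adb shell cat /proc/<PID>/smaps | sum_named_maps libc_malloc
```

If an allocator names its maps differently, passing that name instead of `libc_malloc` shows whether the maps are being created as expected.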
| 326 | |
| 327 | #### Performance Trace Benchmarks |
| 328 | This is a benchmark that treats the trace data as if all allocations |
| 329 | occurred in a single thread. This is the scenario that could |
| 330 | happen if all of the allocations are spaced out in time so no thread |
| 331 | ever does an allocation at the same time as another thread. |
| 332 | |
| 333 | Run these benchmarks thusly: |
| 334 | |
| 335 | adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark |
| 336 | adb shell /data/benchmarktest/trace_benchmark/trace_benchmark |
| 337 | |
| 338 | When run without any arguments, the benchmark will run over all of the |
| 339 | traces and display data. It takes many minutes to complete these runs in |
| 340 | order to get as accurate a number as possible. |