ELF TLS is a system for automatically allocating thread-local variables with cooperation among the compiler, linker, dynamic loader, and libc.
Thread-local variables are declared in C and C++ with a specifier, e.g.:
thread_local int tls_var;
At run-time, TLS variables are allocated on a module-by-module basis, where a module is a shared object or executable. At program startup, TLS for all initially-loaded modules comprises the "Static TLS Block". TLS variables within the Static TLS Block exist at fixed offsets from an architecture-specific thread pointer (TP) and can be accessed very efficiently -- typically just a few instructions. TLS variables belonging to dlopen'ed shared objects, on the other hand, may be allocated lazily, and accessing them typically requires a function call.
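As a rough illustration of the per-thread semantics described above, the sketch below declares a thread_local counter and bumps it from two threads. The helper name bump_on_new_thread is made up for this example, and the code says nothing about which TLS block or access model a given toolchain picks.

```cpp
#include <thread>

thread_local int tls_counter = 0;  // every thread gets its own zero-initialized copy

// Increments the calling thread's copy and returns the new value.
int bump() { return ++tls_counter; }

// Hypothetical helper: calls bump() once on a fresh thread and reports what it saw.
int bump_on_new_thread() {
    int seen = -1;
    std::thread worker([&seen] { seen = bump(); });
    worker.join();
    return seen;
}
```

Calling bump() twice on one thread yields 1 and then 2, while a new thread starts again from its own copy.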
Ulrich Drepper's ELF TLS document specifies two ways of organizing memory pointed at by the architecture-specific thread-pointer (__get_tls() in Bionic), variant 1 and variant 2.
Variant 1 places the static TLS block after the TP, whereas variant 2 places it before the TP. According to Drepper, variant 2 was motivated by backwards compatibility, and variant 1 was designed for Itanium. The choice has effects on the toolchain, loader, and libc. In particular, when linking an executable, the linker needs to know where an executable's TLS segment is relative to the TP so it can correctly relocate TLS accesses. Both variants are incompatible with Bionic's current thread-specific data layout, but variant 1 is more problematic than variant 2.
Each thread has a "Dynamic Thread Vector" (DTV) with a pointer to each module's TLS block (or NULL if it hasn't been allocated yet). If the executable has a TLS segment, then it will always be module 1, and its storage will always be immediately after (or before) the TP. In variant 1, the TP is expected to point immediately at the DTV pointer, whereas in variant 2, the DTV pointer's offset from TP is implementation-defined.
The DTV's "generation" field is used to lazily update/reallocate the DTV when new modules are loaded or unloaded.
When a C/C++ file references a TLS variable, the toolchain generates instructions to find its address using a TLS "access model". The access models trade generality against efficiency. The four models are General Dynamic (GD), Local Dynamic (LD), Initial Exec (IE), and Local Exec (LE).
A TLS variable may be in a different module than the reference.
A GD access can refer to a TLS variable anywhere. To access a variable tls_var using the "traditional" non-TLSDESC design described in Drepper's TLS document, the compiler emits a call to a __tls_get_addr function provided by libc.
For example, if we have this C code in a shared object:
    extern thread_local char tls_var;
    char* get_tls_var() {
      return &tls_var;
    }
The toolchain generates code like this:
    struct TlsIndex {
      long module; // starts counting at 1
      long offset;
    };

    char* get_tls_var() {
      static TlsIndex tls_var_idx = { // allocated in the .got
        R_TLS_DTPMOD(tls_var), // dynamic TP module ID
        R_TLS_DTPOFF(tls_var), // dynamic TP offset
      };
      return __tls_get_addr(&tls_var_idx);
    }
R_TLS_DTPMOD is a dynamic relocation to the index of the module containing tls_var, and R_TLS_DTPOFF is a dynamic relocation to the offset of tls_var within its module's PT_TLS segment.
__tls_get_addr looks up TlsIndex::module's entry in the DTV and adds TlsIndex::offset to the address of the module's TLS block. Before it can do this, it ensures that the module's TLS block is allocated. A simple approach is to allocate memory lazily:
If the current thread's DTV generation count is less than the current global TLS generation, then __tls_get_addr may reallocate the DTV or free blocks for unloaded modules.
If the DTV's entry for the given module is NULL, then __tls_get_addr allocates the module's memory.
If an allocation fails, __tls_get_addr calls abort (like emutls).
musl, on the other hand, preallocates TLS memory in pthread_create and in dlopen, and each can report out-of-memory.
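The lazy strategy above can be modeled with a toy, single-process sketch. Every name here (ToyGlobals, ToyDtv, toy_dlopen, toy_tls_get_addr) is invented for illustration; a real implementation also needs locking, handling of unloaded modules, TLS initializer images, and alignment, all of which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct ToyTlsIndex { size_t module; size_t offset; };  // module IDs start at 1

struct ToyGlobals {
    size_t generation = 0;             // bumped whenever the module list changes
    std::vector<size_t> segment_sizes; // segment_sizes[id - 1] = size of that module's TLS
};

struct ToyDtv {                        // one per thread
    size_t generation = 0;
    std::vector<char*> blocks;         // blocks[id - 1], nullptr until first access
};

// Registers a module with a TLS segment and returns its module ID.
size_t toy_dlopen(ToyGlobals& g, size_t tls_size) {
    assert(tls_size > 0);
    g.segment_sizes.push_back(tls_size);
    ++g.generation;
    return g.segment_sizes.size();
}

char* toy_tls_get_addr(const ToyGlobals& g, ToyDtv& dtv, const ToyTlsIndex& idx) {
    assert(idx.module >= 1 && idx.module <= g.segment_sizes.size());
    if (dtv.generation < g.generation) {
        // The DTV is out of date: grow it to cover newly loaded modules.
        dtv.blocks.resize(g.segment_sizes.size(), nullptr);
        dtv.generation = g.generation;
    }
    char*& block = dtv.blocks[idx.module - 1];
    if (block == nullptr) {
        // First access from this thread: allocate a zero-filled block
        // (never freed in this toy).
        block = static_cast<char*>(std::calloc(1, g.segment_sizes[idx.module - 1]));
        if (block == nullptr) std::abort();  // allocation failure aborts, as in the text
    }
    return block + idx.offset;
}
```

Two threads, each with its own ToyDtv, get distinct blocks for the same module, and a thread that never touches the module never pays for its block.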
LD is a specialization of GD that's useful when a function has references to two or more TLS variables that are both part of the same module as the reference. Instead of a call to __tls_get_addr for each variable, the compiler calls __tls_get_addr once to get the current module's TLS block, then adds each variable's DTPOFF to the result.
For example, suppose we have this C code:
    static thread_local int x;
    static thread_local int y;
    int sum() {
      return x + y;
    }
The toolchain generates code like this:
    int sum() {
      static TlsIndex tls_module_idx = { // allocated in the .got
        // a dynamic relocation against symbol 0 => current module ID
        R_TLS_DTPMOD(NULL),
        0,
      };
      char* base = __tls_get_addr(&tls_module_idx);
      // These R_TLS_DTPOFF() relocations are resolved at link-time.
      int* px = base + R_TLS_DTPOFF(x);
      int* py = base + R_TLS_DTPOFF(y);
      return *px + *py;
    }
(XXX: LD might be important for C++ thread_local variables -- even a single thread_local variable with a dynamic initializer has an associated TLS guard variable.)
If the variable is part of the Static TLS Block (i.e. the executable or an initially-loaded shared object), then its offset from the TP is known at load-time. The variable can be accessed with a few loads.
Example: a C file for an executable:
    // tls_var could be defined in the executable, or it could be defined
    // in a shared object the executable links against.
    extern thread_local char tls_var;
    char* get_addr() { return &tls_var; }
Compiles to:
    // allocated in the .got, resolved at load-time with a dynamic reloc.
    // Unlike DTPOFF, which is relative to the start of the module's block,
    // TPOFF is directly relative to the thread pointer.
    static long tls_var_gotoff = R_TLS_TPOFF(tls_var);

    char* get_addr() {
      return (char*)__get_tls() + tls_var_gotoff;
    }
LE is a specialization of IE. If the variable is not just part of the Static TLS Block, but is also part of the executable (and referenced from the executable), then a GOT access can be avoided. The IE example compiles to:
    char* get_addr() {
      // R_TLS_TPOFF() is resolved at (static) link-time
      return (char*)__get_tls() + R_TLS_TPOFF(tls_var);
    }
The compiler selects an access model for each variable reference using these factors:
- The absence of -fpic implies an executable, so use IE/LE.
- Code compiled with -fpic could be in a shared object, so use GD/LD.
- The default can be overridden with -ftls-model=<model>.
- Specifiers on the variable (static, extern, ELF visibility attributes).
- A variable can be annotated with __attribute__((tls_model(...))). Clang may still use a more efficient model than the one specified.

Shared objects are sometimes compiled with -ftls-model=initial-exec (i.e. "static TLS") for better performance. On Ubuntu, for example, libc.so.6 and libOpenGL.so.0 are compiled this way. Shared objects using static TLS can't be loaded with dlopen unless libc has reserved enough surplus memory in the static TLS block. glibc reserves a kilobyte or two (TLS_STATIC_SURPLUS) with the intent that only a few core system libraries would use static TLS. Non-core libraries also sometimes use it, which can break dlopen if the surplus area is exhausted. See:
- "dlopen: cannot load any more object with static TLS"

Neither musl nor the Bionic TLS prototype currently allocates any surplus TLS memory.
In general, supporting surplus TLS memory probably requires maintaining a thread list so that dlopen can initialize the new static TLS memory in all existing threads. A thread list could be omitted if the loader only allowed zero-initialized TLS segments and didn't reclaim memory on dlclose.
As long as a shared object is one of the initially-loaded modules, a better option is to use TLSDESC.
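For reference, a per-variable request for the initial-exec model looks roughly like the sketch below, using the tls_model attribute accepted by GCC and Clang. The variable and function names are made up. In a shared object this is the "static TLS" case discussed above; in an executable, the compiler or linker may still pick an even cheaper model.

```cpp
// Ask for the initial-exec access model for this one variable.
thread_local long request_count __attribute__((tls_model("initial-exec"))) = 0;

// Increments the calling thread's counter and returns the new value.
long count_request() { return ++request_count; }
```

The same effect can be applied to a whole translation unit with -ftls-model=initial-exec.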
The code fragments above match the "traditional" TLS design from Drepper's document. For the GD and LD models, there is a newer, more efficient design that uses "TLS descriptors". Each TLS variable reference has a corresponding descriptor, which contains a resolver function address and an argument to pass to the resolver.
For example, if we have this C code in a shared object:
    extern thread_local char tls_var;
    char* get_tls_var() {
      return &tls_var;
    }
The toolchain generates code like this:
    struct TlsDescriptor { // NB: arm32 reverses these fields
      long (*resolver)(long);
      long arg;
    };

    char* get_tls_var() {
      // allocated in the .got, uses a dynamic relocation
      static TlsDescriptor desc = R_TLS_DESC(tls_var);
      return (char*)__get_tls() + desc.resolver(desc.arg);
    }
The dynamic loader fills in the TLS descriptors. For a reference to a variable allocated in the Static TLS Block, it can use a simple resolver function:
    long static_tls_resolver(long arg) {
      return arg;
    }
The loader writes tls_var@TPOFF into the descriptor's argument.
To support modules loaded with dlopen, the loader must use a resolver function that calls __tls_get_addr. In principle, this simple implementation would work:
    long dynamic_tls_resolver(TlsIndex* arg) {
      return (long)__tls_get_addr(arg) - (long)__get_tls();
    }
There are optimizations that complicate the design a little:
- Unlike __tls_get_addr, the resolver function has a special calling convention that preserves almost all registers, reducing register pressure in the caller (example).
- In general, the resolver function must call __tls_get_addr, so it must save and restore all registers.
- To keep the common case fast, the resolver inlines the fast path of __tls_get_addr.

The resolver must be written in assembly, but in C, the function looks like so:
    struct TlsDescDynamicArg {
      unsigned long first_generation;
      TlsIndex idx;
    };

    struct TlsDtv { // DTV == dynamic thread vector
      unsigned long generation;
      char* modules[];
    };

    long dynamic_tls_resolver(TlsDescDynamicArg* arg) {
      TlsDtv* dtv = __get_dtv();
      char* addr;
      if (dtv->generation >= arg->first_generation &&
          dtv->modules[arg->idx.module] != nullptr) {
        addr = dtv->modules[arg->idx.module] + arg->idx.offset;
      } else {
        addr = __tls_get_addr(&arg->idx);
      }
      return (long)addr - (long)__get_tls();
    }
The loader needs to allocate a table of TlsDescDynamicArg objects for each TLS module with dynamic TLSDESC relocations.
The static linker can still relax a TLSDESC-based access to an IE/LE access.
The traditional TLS design is implemented everywhere, but the TLSDESC design has less toolchain support:
- GCC selects the design with -mtls-dialect=<dialect> (trad-vs-desc on arm64, otherwise gnu-vs-gnu2). Clang always uses the default mode.

The (static) linker frequently has more information about the location of a referenced TLS variable than the compiler, so it can "relax" TLS accesses to more efficient models. For example, if an object file compiled with -fpic is linked into an executable, the linker could relax GD accesses to IE or LE. To relax a TLS access, the linker looks for an expected sequence of instructions and static relocations, then replaces the sequence with a different one of equal size. It may need to add or remove no-op instructions.
Versions tested:
Linker support for GD->LE relaxation with -mtls-dialect=gnu/trad (traditional):
| Architecture | BFD | Gold | LLD | 
|---|---|---|---|
| arm32 | no | no | no | 
| arm64 (unusual) | yes | yes | no | 
| x86 | yes | yes | yes | 
| x86_64 | yes | yes | yes | 
Linker support for GD->LE relaxation with -mtls-dialect=gnu2/desc (TLSDESC):
| Architecture | BFD | Gold | LLD | 
|---|---|---|---|
| arm32 (experimental) | yes | unsupported relocs | unsupported relocs | 
| arm64 | yes | yes | yes | 
| x86 (experimental) | yes | yes | unsupported relocs | 
| x86_64 (experimental) | yes | yes | unsupported relocs | 
arm32 linkers can't relax traditional TLS accesses. BFD can relax an arm32 TLSDESC access, but LLD can't link code using TLSDESC at all, except on arm64, where it's used by default.
Calling dlsym on a TLS variable returns the address of the current thread's variable.
gdb uses a libthread_db plugin library to retrieve thread-related information from a target. This library is typically a shared object, but for Android, we link our own libthread_db.a into gdbserver. We will need to implement at least 2 APIs in libthread_db.a to find TLS variables, and gdb provides APIs for looking up symbols, reading or writing memory, and retrieving the current thread pointer (e.g. ps_get_thread_area).
LLDB more-or-less implemented Linux TLS debugging in r192922 (D1944) for x86 and x86-64. arm64 support came later. However, the Linux TLS functionality no longer does anything: the GetThreadPointer function is no longer implemented. Code for reading the thread pointer was removed in D10661 (this function). (arm32 was apparently never supported.)
Both debuggers need metadata from the threading library (libc.so / libpthread.so) to find TLS variables. From LLDB r192922's commit message:
... All OSes use basically the same algorithm (a per-module lookup table) as detailed in Ulrich Drepper's TLS ELF ABI document, so we can easily write code to decode it ourselves. The only question therefore is the exact field layouts required. Happily, the implementors of libpthread expose the structure of the DTV via metadata exported as symbols from the .so itself, designed exactly for this kind of thing. So this patch simply reads that metadata in, and re-implements libthread_db's algorithm itself. We thereby get cross-platform TLS lookup without either requiring third-party libraries, while still being independent of the version of libpthread being used.
LLDB uses these variables:
| Name | Notes | 
|---|---|
| _thread_db_pthread_dtvp | Offset from TP to DTV pointer (0 for variant 1, implementation-defined for variant 2) | 
| _thread_db_dtv_dtv | Size of a DTV slot (typically/always sizeof(void*)) | 
| _thread_db_dtv_t_pointer_val | Offset within a DTV slot to the pointer to the allocated TLS block (typically/always 0) | 
| _thread_db_link_map_l_tls_modid | Offset of a link_map field containing the module's 1-based TLS module ID | 
The metadata variables are local symbols in glibc's libpthread.so symbol table (but not its dynamic symbol table). Debuggers can access them, but applications can't.
The debugger lookup process is straightforward:
- Find the link_map object and module-relative offset for a TLS variable.
- Use _thread_db_link_map_l_tls_modid to find the TLS variable's module ID.
- Read the thread pointer and use _thread_db_pthread_dtvp to find the thread's DTV.
- Use _thread_db_dtv_dtv and _thread_db_dtv_t_pointer_val to find the desired module's block within the DTV, then add the module-relative offset.

This process doesn't appear robust in the face of lazy DTV initialization -- presumably it could read past the end of an out-of-date DTV or access an unloaded module. To be robust, it needs to compare a module's initial generation count against the DTV's generation count. (XXX: Does gdb have these sorts of problems with glibc's libpthread?)
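The lookup steps above can be sketched against a simulated inferior. This is a simplified model, not LLDB's or libthread_db's code: FakeInferior, ThreadDbInfo, and find_tls_var are hypothetical names, the metadata is assumed to be plain byte offsets and sizes, and module N is assumed to live in DTV slot N (with slot 0 holding the generation). It also omits the generation check that the text says a robust implementation needs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the target's memory: a flat byte array addressed from 0.
struct FakeInferior {
    std::vector<unsigned char> mem;

    uint64_t read_word(uint64_t addr) const {
        assert(addr + sizeof(uint64_t) <= mem.size());
        uint64_t value = 0;
        std::memcpy(&value, mem.data() + addr, sizeof(value));
        return value;
    }
    void write_word(uint64_t addr, uint64_t value) {
        assert(addr + sizeof(uint64_t) <= mem.size());
        std::memcpy(mem.data() + addr, &value, sizeof(value));
    }
};

// Simplified view of the metadata variables listed in the table.
struct ThreadDbInfo {
    uint64_t pthread_dtvp;        // offset from TP to the DTV pointer
    uint64_t dtv_slot_size;       // size of one DTV slot
    uint64_t dtv_pointer_val;     // offset of the block pointer within a slot
    uint64_t link_map_tls_modid;  // offset of the module ID within link_map
};

// Returns the address of a TLS variable for the thread whose thread pointer is tp.
uint64_t find_tls_var(const FakeInferior& inf, const ThreadDbInfo& info,
                      uint64_t tp, uint64_t link_map, uint64_t var_offset) {
    uint64_t modid = inf.read_word(link_map + info.link_map_tls_modid);
    uint64_t dtv = inf.read_word(tp + info.pthread_dtvp);
    uint64_t slot = dtv + modid * info.dtv_slot_size;
    uint64_t block = inf.read_word(slot + info.dtv_pointer_val);
    return block + var_offset;
}
```

A real debugger would replace read_word with its own memory-read primitive and would obtain tp through the ptrace interfaces for the target architecture.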
There are ptrace interfaces for reading the thread pointer for each of arm32, arm64, x86, and x86-64 (XXX: check 32-vs-64-bit for inferiors, debuggers, and kernels):
- arm32: PTRACE_GET_THREAD_AREA
- arm64: PTRACE_GETREGSET, NT_ARM_TLS
- x86: PTRACE_GET_THREAD_AREA
- x86-64: PTRACE_PEEKUSER to read the {fs,gs}_base fields of user_regs_struct

C/C++ TLS variables are declared with a specifier:
| Specifier | Notes | 
|---|---|
| __thread | - non-standard, but ubiquitous in GCC and Clang - cannot have dynamic initialization or destruction | 
| _Thread_local | - a keyword standardized in C11 - cannot have dynamic initialization or destruction | 
| thread_local | - C11: a macro for _Thread_local via threads.h - C++11: a keyword, allows dynamic initialization and/or destruction | 
The dynamic initialization and destruction of C++ thread_local variables is layered on top of ELF TLS (or emutls), so this design document mostly ignores it. Like emutls, ELF TLS variables either have a static initializer or are zero-initialized.
Aside: Because a __thread variable cannot have dynamic initialization, __thread is more efficient in C++ than thread_local when the compiler cannot see the definition of a declared TLS variable. The compiler assumes the variable could have a dynamic initializer and generates code, at each access, to call a function to initialize the variable.
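To make the dynamic-initialization case concrete, here is a small sketch in which a C++ thread_local variable has an initializer that must run once per thread. The names next_id, thread_id, and id_seen_by_new_thread are invented for this example. A __thread variable could not be declared this way, because its initializer must be a constant.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> init_calls{0};

// Runs once for each thread that initializes thread_id.
int next_id() { return ++init_calls; }

thread_local int thread_id = next_id();  // dynamic initializer, layered on top of ELF TLS

// Hypothetical helper: reads thread_id on a fresh thread, which triggers its initializer.
int id_seen_by_new_thread() {
    int id = 0;
    std::thread worker([&id] { id = thread_id; });
    worker.join();
    return id;
}
```

Each new thread that touches thread_id observes a different value, which shows that the initializer ran separately for each of them.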
ELF TLS isn't implemented on older Android platforms, so dynamic executables and shared objects using it generally won't work on them. Ideally, the older platforms would reject these binaries rather than experience memory corruption at run-time.
Static executables aren't a problem--the necessary runtime support is part of the executable, so TLS just works.
XXX: Shared objects are less of a problem.
DT_TLSDESC_{GOT|PLT} entries.

XXX: A dynamic executable using ELF TLS would have a PT_TLS segment and no other distinguishing marks, so running it on an older platform would result in memory corruption. Should we add something to these executables that only newer platforms recognize? (e.g. maybe an entry in .dynamic, a reference to a symbol only a new libc.so has...)
There is an ELF TLS prototype uploaded on Gerrit. It implements:
__tls_get_addr)

Missing:
- dlsym of a TLS variable

The loader exposes a list of TLS modules (struct TlsModules) to libc.so using the __libc_shared_globals variable (see tls_modules() in linker_tls.cpp and elf_tls.cpp). __tls_get_addr in libc.so acquires the TlsModules::mutex and iterates its module list to lazily allocate and free TLS blocks.
The prototype currently allocates a pthread_internal_t object and static TLS in a single mmap'ed region, along with a thread's stack if it needs one allocated. It doesn't place TLS memory on a preallocated stack (either the main thread's stack or one provided with pthread_attr_setstack).
The DTV and blocks for dlopen'ed modules are instead allocated using the Bionic loader's LinkerMemoryAllocator, adapted to avoid the STL and to provide memalign. The prototype tries to achieve async-signal safety by blocking signals and acquiring a lock.
There are three "entry points" to dynamically locate a TLS variable's address:
- __tls_get_addr

The loader's entry points need to call __tls_get_addr, which needs to allocate memory. Currently, the prototype uses a special function pointer to call libc.so's __tls_get_addr from the loader. (This should probably be removed.)
The prototype currently allows for arbitrarily-large TLS variable alignment. IIRC, different implementations (glibc, musl, FreeBSD) vary in their level of respect for TLS alignment. It looks like the Bionic loader ignores segments' alignment and aligns loaded libraries to 256 KiB. See ReserveAligned.
The prototype's __tls_get_addr might be async-signal safe. Making it AS-safe is a good idea if it's feasible. musl's function is AS-safe, but glibc's isn't (or wasn't). Google had a patch to make glibc AS-safe back in 2012-2013. See:
The prototype lazily allocates TLS memory for dlopen'ed modules (see __tls_get_addr), and an out-of-memory error on a TLS access aborts the process. musl, on the other hand, preallocates TLS memory on pthread_create and dlopen, so either function can return out-of-memory. Both functions probably need to acquire the same lock.
Maybe Bionic should do the same as musl? Perhaps musl's robustness argument doesn't hold for Bionic, though, because Bionic (at least the linker) probably already aborts on OOM. musl doesn't support dlclose/unloading, so it might have an easier time.
On the other hand, maybe lazy allocation is a feature, because not all threads will use a dlopen'ed solib's TLS variables. Drepper makes this argument in his TLS document:
In addition the run-time support should avoid creating the thread-local storage if it is not necessary. For instance, a loaded module might only be used by one thread of the many which make up the process. It would be a waste of memory and time to allocate the storage for all threads. A lazy method is wanted. This is not much extra burden since the requirement to handle dynamically loaded objects already requires recognizing storage which is not yet allocated. This is the only alternative to stopping all threads and allocating storage for all threads before letting them run again.
FWIW: emutls also aborts on out-of-memory.
The dynamic loader currently can't use ELF TLS, so any part of libc linked into the loader (i.e. most of it) also can't use ELF TLS. It might be possible to lift this restriction, perhaps with specialized __tls_get_addr and TLSDESC resolver functions.
Bionic already allocates thread-specific data in a way that conflicts with TLS variants 1 and 2: 
TLS variant 1 allocates everything after the TP to ELF TLS (except the first two words), and variant 2 allocates everything before the TP. Bionic currently allocates memory before and after the TP to the pthread_internal_t struct.
The bionic_tls.h header is marked with a warning:
    /** WARNING WARNING WARNING
     **
     ** This header file is *NOT* part of the public Bionic ABI/API
     ** and should not be used/included by user-serviceable parts of
     ** the system (e.g. applications).
     **
     ** It is only provided here for the benefit of the system dynamic
     ** linker and the OpenGL sub-system (which needs to access the
     ** pre-allocated slot directly for performance reason).
     **/
There are issues with rearranging this memory:
TLS_SLOT_STACK_GUARD is used for -fstack-protector. The location (word #5) was initially used by GCC on x86 (and x86-64), where it is compatible with x86's TLS variant 2. We modified Clang to use this slot for arm64 in 2016, though, and the slot isn't compatible with ARM's variant 1 layout. This change shipped in NDK r14, and the NDK's build systems (ndk-build and the CMake toolchain file) enable -fstack-protector-strong by default.
TLS_SLOT_TSAN is used for more than just TSAN -- it's also used by HWASAN and Scudo.
The Go runtime allocates a thread-local "g" variable on Android by creating a pthread key and searching for its TP-relative offset, which it assumes is nonnegative:
Android's "native bridge" system allows the Zygote to load an app solib of a non-native ABI. (For example, it could be used to load an arm32 solib into an x86 Zygote.) The solib is translated into the host architecture. TLS accesses in the app solib (whether ELF TLS, Bionic slots, or pthread_internal_t fields) become host accesses. Laying out TLS memory differently across architectures could complicate this translation.
A pthread_t is practically just a pthread_internal_t*, and some apps directly access the pthread_internal_t::tid field. Past examples: http://b/17389248, aosp/107467. Reorganizing the initial pthread_internal_t fields could break those apps.
It seems easy to fix the incompatibility for variant 2 (x86 and x86_64) by splitting out the Bionic slots into a new data structure. Variant 1 is a harder problem.
The TLS prototype currently uses a patched LLD that uses a variant 1 TLS layout with a 16-word TCB on all architectures.
Aside: gcc's arm64ilp32 target uses a 32-bit unsigned offset for a TLS IE access (https://godbolt.org/z/_NIXjF). If Android ever supports this target, and in a configuration with variant 2 TLS, we might need to change the compiler to emit a sign-extending load.
Pros: simplifies Bionic
Cons:
The point of variant 2 was backwards-compatibility, and ARM Android needs to remain backwards-compatible, so we could use variant 2 for ARM. Problems:
When linking an executable, the static linker needs to know how TLS is allocated because it writes TP-relative offsets for IE/LE-model accesses. Clang doesn't tell the linker to target Android, so it could pass a --tls-variant2 flag to configure lld.
On arm64, there are different sets of static LE relocations accommodating different ranges of offsets from TP:
| Size | TP offset range | Static LE relocation types | 
|---|---|---|
| 12 | 0 <= x < 2^12 | R_AARCH64_TLSLE_ADD_TPREL_LO12 | 
| " | " | R_AARCH64_TLSLE_LDST8_TPREL_LO12 | 
| " | " | R_AARCH64_TLSLE_LDST16_TPREL_LO12 | 
| " | " | R_AARCH64_TLSLE_LDST32_TPREL_LO12 | 
| " | " | R_AARCH64_TLSLE_LDST64_TPREL_LO12 | 
| " | " | R_AARCH64_TLSLE_LDST128_TPREL_LO12 | 
| 16 | -2^16 <= x < 2^16 | R_AARCH64_TLSLE_MOVW_TPREL_G0 | 
| 24 | 0 <= x < 2^24 | R_AARCH64_TLSLE_ADD_TPREL_HI12 | 
| " | " | R_AARCH64_TLSLE_ADD_TPREL_LO12_NC | 
| " | " | R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC | 
| " | " | R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC | 
| " | " | R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC | 
| " | " | R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC | 
| " | " | R_AARCH64_TLSLE_LDST128_TPREL_LO12_NC | 
| 32 | -2^32 <= x < 2^32 | R_AARCH64_TLSLE_MOVW_TPREL_G1 | 
| " | " | R_AARCH64_TLSLE_MOVW_TPREL_G0_NC | 
| 48 | -2^48 <= x < 2^48 | R_AARCH64_TLSLE_MOVW_TPREL_G2 | 
| " | " | R_AARCH64_TLSLE_MOVW_TPREL_G1_NC | 
| " | " | R_AARCH64_TLSLE_MOVW_TPREL_G0_NC | 
GCC for arm64 defaults to the 24-bit model and has an -mtls-size=SIZE option for setting other supported sizes. (It supports 12, 24, 32, and 48.) Clang has only implemented the 24-bit model, but that could change. (Clang briefly used load/store relocations, but it was reverted because no linker supported them: BFD, Gold, LLD).
The 16-, 32-, and 48-bit models use a movn/movz instruction to set the highest 16 bits to a positive or negative value, then movk to set the remaining 16-bit chunks. In principle, these relocations should be able to accommodate a negative TP offset.
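The movn/movz-plus-movk scheme can be checked with a little arithmetic. The sketch below covers only the 32-bit model and simulates the register value instead of emitting instructions. MovPair, encode_tprel32, and simulate are hypothetical names, and the encoding rule (movn carries the inverted high bits for a negative offset) is stated here as an assumption about how a linker would apply these relocations.

```cpp
#include <cassert>
#include <cstdint>

// Immediates for the two-instruction sequence:
//   movz/movn xd, #hi, lsl #16
//   movk      xd, #lo
struct MovPair {
    bool use_movn;  // true for a negative offset
    uint16_t hi;
    uint16_t lo;
};

// Splits a TP offset in the range -2^32 <= offset < 2^32 into the two immediates.
MovPair encode_tprel32(int64_t offset) {
    assert(offset >= -(INT64_C(1) << 32) && offset < (INT64_C(1) << 32));
    const uint64_t bits = static_cast<uint64_t>(offset);
    MovPair p;
    p.use_movn = offset < 0;
    // movn writes the bitwise NOT of its shifted immediate, so store inverted bits.
    p.hi = static_cast<uint16_t>((p.use_movn ? ~bits : bits) >> 16);
    p.lo = static_cast<uint16_t>(bits);
    return p;
}

// Computes the 64-bit register value the sequence would produce.
int64_t simulate(const MovPair& p) {
    uint64_t reg = static_cast<uint64_t>(p.hi) << 16;  // movz
    if (p.use_movn) reg = ~reg;                         // movn
    reg = (reg & ~UINT64_C(0xffff)) | p.lo;             // movk replaces bits [15:0]
    return static_cast<int64_t>(reg);
}
```

Round-tripping any offset in range through encode_tprel32 and simulate returns the original value, including negative offsets, which is the property a variant 2 layout would rely on.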
The 24-bit model uses add to set the high 12 bits, then places the low 12 bits into another add or a load/store instruction.
Maybe we could modify the R_AARCH64_TLSLE_ADD_TPREL_HI12 relocation to allow a negative TP offset by converting the relocated add instruction to a sub. Alternately, we could add a new R_AARCH64_TLSLE_SUB_TPREL_HI12 relocation, and Clang would use a different TLS LE instruction sequence when targeting Android/arm64.
LLD's arm64 relaxations from GD and IE to LE would need to use movn instead of movz for Android.
Binaries linked with the flag crash on non-Bionic, and binaries without the flag crash on Bionic. We might want to mark the binaries somehow to indicate the non-standard TLS ABI. Suggestion:
- --android-tls-variant2 flag (or --bionic-tls-variant2, we're trying to make Bionic run on the host)
- PT_ANDROID_TLS_TPOFF segment?
- .note.gnu.property with a "GNU_PROPERTY_TLS_TPOFF" property value?

Pros: Minimal linker change, no change to TLS relocations.
Cons: The reserved amount becomes an arbitrary but immutable part of the Android ABI.
Add an lld option: --android-tls[-tcb=SIZE]
As with the first workaround, we'd probably want to mark the binary to indicate the non-standard TP-to-TLS-segment offset.
Reservation amount:
Pros:
Cons:
Pros:
Cons:
The layout conflict is apparently only a problem because an executable assumes that its TLS segment is located at a statically-known offset from the TP (i.e. it uses the LE model). An initially-loaded shared object can still use the efficient IE access model, but its TLS segment offset is known at load-time, not link-time. If we can guarantee that Android's executables also use the IE model, not LE, then the Bionic loader can place the executable's TLS segment at any offset from the TP, leaving the existing thread-specific memory layout untouched.
This workaround doesn't help with statically-linked executables, but they're probably less of a problem, because the linker and libc.a are usually packaged together.
A likely problem: LD is normally relaxed to LE, not to IE. We'd either have to disable LD usage in the compiler (bad for performance) or add LD->IE relaxation. This relaxation requires that IE code sequences be no larger than LD code sequences, which may not be the case on some architectures. (XXX: In some past testing, it looked feasible for TLSDESC but not the traditional design.)
To implement:
pthread_internal_t. We need something like DF_STATIC_TLS, but instead of indicating IE in an solib, we want to indicate the lack of LE in an executable.

The Go runtime allocates its thread-local "g" variable by searching for a hard-coded magic constant (0x23581321 for arm32 and 0x23581321345589 for arm64). As long as it finds its constant at a small positive offset from TP (within the first 384 words), it will think it has found the pthread key it allocated.
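The scan described above amounts to a bounded linear search starting at the thread pointer. The following is an illustrative model only, not the Go runtime's code: find_magic_slot and its parameters are made-up names, and the memory after TP is represented by an ordinary array.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint64_t kGoMagicArm64 = 0x23581321345589;  // arm64 constant quoted in the text
constexpr size_t kScanWords = 384;                    // scan limit quoted in the text

// Returns the index of the first word equal to magic within the scanned range,
// or -1 if it is not found. available_words guards against reading past the buffer.
long find_magic_slot(const uint64_t* tp, size_t available_words, uint64_t magic) {
    const size_t limit = available_words < kScanWords ? available_words : kScanWords;
    for (size_t i = 0; i < limit; ++i) {
        if (tp[i] == magic) return static_cast<long>(i);
    }
    return -1;
}
```

If the pthread key data moves beyond the scanned range, or before the thread pointer, the search fails, which is why the layout options in this document have to account for it.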
As a temporary compatibility hack, we might try to keep these programs running by reserving a TLS slot with this magic value. This hack doesn't appear to work, however. The runtime finds its pthread key, but apps segfault. Perhaps the Go runtime expects its "g" variable to be zero-initialized (one example). With this hack, it's never zero, but with its current allocation strategy, it is typically zero. After Bionic's pthread key system was rewritten to be lock-free for Android M, though, it's not guaranteed, because a key could be recycled.
Most Android executables do not use any thread_local variables. In the current prototype, with the AOSP hikey960 build, only /system/bin/netd has a TLS segment, and it's only 32 bytes. As long as /system/bin/app_process{32,64} limits its use of TLS memory, then the pthread keys could be allocated after app_process' TLS segment, and Go will still find them.
Go scans 384 words from the thread pointer. If there are at most 16 Bionic slots and 130 pthread keys (2 words per key), then app_process can use at most 108 words of TLS memory.
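The budget above is simple arithmetic, written out here as compile-time constants. The constant names are made up for this sketch; the numbers are the ones quoted in the text.

```cpp
constexpr int kGoScanWords = 384;   // words Go scans from the thread pointer
constexpr int kBionicSlots = 16;    // upper bound on Bionic TLS slots
constexpr int kPthreadKeys = 130;   // number of pthread keys
constexpr int kWordsPerKey = 2;     // each key occupies two words

// Words left over for app_process's own TLS segment.
constexpr int kAppProcessTlsBudget =
    kGoScanWords - kBionicSlots - kPthreadKeys * kWordsPerKey;

static_assert(kAppProcessTlsBudget == 108, "384 - 16 - 260 leaves 108 words");
```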
Drawback: In principle, this might make pthread key accesses slower, because Bionic can't assume that pthread keys are at a fixed offset from the thread pointer anymore. It must load an offset from somewhere (a global variable, another TLS slot, ...). __get_thread() already uses a TLS slot to find pthread_internal_t, though, rather than assume a fixed offset. (XXX: I think it could be optimized.)
XXX: Maybe a sanitizer would want to intercept allocations of TLS memory, and that could be hard if the loader is allocating it.
General (and x86/x86-64)
arm32:
__aeabi_read_tp.
arm64: