Skip to content
This repository was archived by the owner on Jun 26, 2020. It is now read-only.

Tls support for ELF and MachO #1174

Merged
merged 3 commits into from
Feb 26, 2020
Merged

Conversation

bjorn3
Copy link
Contributor

@bjorn3 bjorn3 commented Oct 27, 2019

This adds TLS (thread local storage) support for ELF targets to Cranelift

  • A reviewer from the core maintainer team has been assigned for this PR.
    If you don't know who could review this, please indicate so and/or ping
    bnjbvr. The list of suggested reviewers on the right can help you. No list of suggested reviewers shown on the right side.

Fixes #963

cc @philipc https://github.com/bjorn3/rustc_codegen_cranelift/issues/388#issuecomment-526909856
cc @abrown Because you implemented SIMD support for x86, I think you know enough of x86 encodings to verify that the encodings are correct.

.emit(
r#"
// leaq @tlsld(%rip), %rdi
sink.put1(0b0100_1000);
Copy link
Collaborator

@iximeow iximeow Oct 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this get a const like for LEA like:

Suggested change
sink.put1(0b0100_1000);
const REX_W: u8 = 0x48;
sink.put1(REX_W);

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one of those rex functions would be better, but I don't know which one to use.

// output %rax

// leaq @dtpoff32(%rax), %rax
sink.put1(0b0100_1000);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above re. naming this constant.

@philipc
Copy link
Contributor

philipc commented Oct 28, 2019

We probably should start with the "General Dynamic TLS Model", and only use "Local Dynamic TLS Model" if the required conditions are met (the variable is in the same object and ideally only if there are multiple TLS variable accesses that can share the __tls_get_addr call).

Also, if there are multiple TLS variable accesses, will cranelift be able to optimize the away the unneeded __tls_get_addr calls?

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 28, 2019

We probably should start with the "General Dynamic TLS Model"

Yeah, I didnt fully read the docs, but just followed LLVM's example.

Also, if there are multiple TLS variable accesses, will cranelift be able to optimize the away the unneeded __tls_get_addr calls?

That would require a new optimization pass. It probably wouldn't help much for cg_clif, because of lack of inlining when MIR opts are not set to max.

@bnjbvr
Copy link
Member

bnjbvr commented Oct 28, 2019

Can you detail in the issue the problem you're trying to solve, and what possible implementations exist, please? The mentioned issue doesn't contain any text but a link to another issue in cg_clif. I'll comment in there.

@abrown
Copy link
Collaborator

abrown commented Oct 28, 2019

@bjorn3, I would like to understand this at a higher-level; what are you using as reference? Perhaps "ELF Handling for Thread-Local Storage"?

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 28, 2019

I use https://akkadia.org/drepper/tls.pdf + LLVM output as reference.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Oct 28, 2019

Switched from the local only LD model to the global GD tls model.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Nov 2, 2019

Implemented MachO support. Requires gimli-rs/object#142.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Nov 8, 2019

Note to self: I will need to write -4 to the X86_64_RELOC_TLV location and add -4 to the addend for mach-O.

@philipc
Copy link
Contributor

philipc commented Nov 8, 2019

I thought the Mach-O relocations were designed such that the implicit -4 meant the actual addend would be 0, so you wouldn't need to write an addend to the location, and I think the disassembly you did agrees with that:

       4:	48 8b 3d 00 00 00 00 	movq	(%rip), %rdi
		0000000000000007:  X86_64_RELOC_TLV	__ZN3tls1A17h13af140b93a46e97E@TLVP

@bjorn3
Copy link
Contributor Author

bjorn3 commented Nov 8, 2019

I actually meant substract -4. The encoding for x86_64_macho_tls_symbol already uses -4 as addend, so substracting -4 would indeed give 0.

@bjorn3 bjorn3 force-pushed the tls_support branch 2 times, most recently from 956e6a5 to 5dcfa8c Compare December 2, 2019 17:20
@bjorn3 bjorn3 changed the title [WIP] Tls support for ELF Tls support for ELF Dec 16, 2019
@bjorn3
Copy link
Contributor Author

bjorn3 commented Dec 16, 2019

Can somebody please review this? Both ELF and Mach-O have been tested using cg_clif.

@abrown
Copy link
Collaborator

abrown commented Dec 16, 2019

It looks like I reviewed what I understood above; perhaps @bnjbvr or @julian-seward1?

@bjorn3 bjorn3 changed the title Tls support for ELF Tls support for ELF and MachO Jan 26, 2020
@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 12, 2020

Ping?

@abrown
Copy link
Collaborator

abrown commented Feb 12, 2020

@bjorn3 I'm fine with this once the conflicts are resolved and tests are added but that is because I'm not familiar enough with other ways of implementing TLS to have much comment. There's a Cranelift meeting on Monday mornings and there might be someone there that has more insight into the future Cranelift's threading support; I mention that because the whole threading thing might require some back-and-forth discussion.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 13, 2020

I'm not familiar enough with other ways of implementing TLS to have much comment.

While there are other ways to implement TLS support, this matches the native TLS handling of ELF cq Mach-O. This way, it is possible to import/export tls variables from/to native objects. While this is not really necessary for cg_clif, as Rust makes a getter for the TLS var anyway and doesn't export the raw TLS var, for rcc it will likely be needed to interoperate with native code.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 14, 2020

I rebased this and added tests for legalization of global_value and encoding of x86_{elf,macho}_tls_get_addr. Should I also add a test that cranelift-object correctly writes the tls section?

I mention that because the whole threading thing might require some back-and-forth discussion.

For cg_clif this is technically the only thing that prevents multithreading at the moment. I have made atomics atomic by using a global lock. While this is slow, at least it works. I would love to see atomic support in Cranelift, but for now TLS is more important for me.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 18, 2020

@abrown Has this been discussed?

@abrown
Copy link
Collaborator

abrown commented Feb 18, 2020

I think the cranelift meeting yesterday was cancelled due to a US holiday (at least I didn't attend) but I posted in a zulip chat thread and I will bring it up next Monday.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 18, 2020

Ok, thanks!

Copy link
Member

@sunfishcode sunfishcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this approach looks reasonable to me. TLS makes sense as a GlobalValue flag, and starting with global-dynamic sounds like a good way to get started. Just a few questions below:

func.dfg.replace(inst).tls_value(ptr_ty, gv);
} else {
func.dfg.replace(inst).symbol_value(ptr_ty, gv);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a tls_value instruction, would it work to just continue to use symbol_value here, and use the tls flag of gv to determine whether it's TLS or not? You could add a is_tls_data predicate function which could work the same way as is_colocated_data, FormatPredicateKind::IsColocatedData, and so on, and then define_entity_ref in cranelift-codegen/meta/src/isa/x86/encodings.rs could use that to decide when to apply the TLS legalizations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

symbol_value can directly be encoded, while tls_value needs to be legalized to the correct instruction for the respective object format. See comment below for why. As far as I know, unlike an encoding, there is no way to predicate a legalization.

@abrown abrown self-assigned this Feb 24, 2020
@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 24, 2020

I decided to look a bit more at the __tls_get_addr ABI and noticed that I made a wrong assumption. In one place, it is said that the full code sequence needs to be 16 bytes (used to for example relax GD to other TLS models, https://github.com/llvm/llvm-project/blob/deb5819d6249cfe110da9377910f7e5c88e8ee09/lld/ELF/Arch/X86_64.cpp#L186), suggesting a custom ABI, but the glibc implementation seems to require the C abi: https://github.com/bminor/glibc/blob/5f72f9800b250410cad3abfeeb09469ef12b2438/elf/dl-tls.c#L821. When looking at LLVM, this seems to indeed be the case: https://github.com/llvm/llvm-project/blob/78be61871704a451a5d9462d7e96ed6c982746d4/llvm/lib/Target/X86/X86InstrCompiler.td#L453-L471. The current code works for cg_clif, but that may be caused by the fact that rust creates a very small function just to get the tls addr, so none of the clobbered registers are used after the global_value instruction.

Clobbering all necessary registers will mean that I have to add even more clobber registers. Another option would be to use the existing infrastructure for the call and call_indirect instruction. (Edit: existing infrastructure only works for actual call and call_indirect.)

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 25, 2020

I tried to add more output registers for the clobbering, but the maximum amount of outputs is 8, which is way less than the amount of registers than are clobbered.

I also tried to use the same spill logic as used by normal calls, but I can't find where it is determined which registers are caller-saved and should spilled to the stack.

@iximeow
Copy link
Collaborator

iximeow commented Feb 25, 2020

I also tried to use the same spill logic as used by normal calls, but I can't find where it is determined which registers are caller-saved and should spilled to the stack.

I think you might not have been able to find it because all register values that live through a call are spilled - callee-save/caller-save ABI guarantees are disregarded at the moment. There's a note in spilling.rs for this. This if could be extended to spill around clobber-happy TLS operations.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 25, 2020

Thanks @iximeow! It now correctly spills all caller-saved registers.

@abrown abrown merged commit 1d144ee into bytecodealliance:master Feb 26, 2020
@bjorn3 bjorn3 deleted the tls_support branch February 26, 2020 07:19
@bjorn3
Copy link
Contributor Author

bjorn3 commented Feb 26, 2020

🎉

@stefson
Copy link
Contributor

stefson commented Mar 5, 2020

@bjorn3

I'm getting a compile error with building for aarch64-unknown-linux-gnu on current master, and found this pullrequest as the culprit when bisecting the issue. Here's the error:

   Compiling cranelift-codegen-meta v0.59.0 (/tmp/wasmtime/cranelift/codegen/meta)
   Compiling cranelift-bforest v0.59.0 (/tmp/wasmtime/cranelift/bforest)
   Compiling wasi-common v0.12.0 (/tmp/wasmtime/crates/wasi-common)
   Compiling cranelift-codegen v0.59.0 (/tmp/wasmtime/cranelift/codegen)
warning: unused import: `self::call::expand_call`
  --> cranelift/codegen/src/legalizer/mod.rs:35:5
   |
35 | use self::call::expand_call;
   |     ^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `#[warn(unused_imports)]` on by default

warning: unused import: `self::globalvalue::expand_global_value`
  --> cranelift/codegen/src/legalizer/mod.rs:36:5
   |
36 | use self::globalvalue::expand_global_value;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: unused import: `self::heap::expand_heap_addr`
  --> cranelift/codegen/src/legalizer/mod.rs:37:5
   |
37 | use self::heap::expand_heap_addr;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: unused import: `self::table::expand_table_addr`
  --> cranelift/codegen/src/legalizer/mod.rs:39:5
   |
39 | use self::table::expand_table_addr;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error[E0599]: no variant or associated item named `X86ElfTlsGetAddr` found for type `ir::instructions::Opcode` in the current scope
    --> cranelift/codegen/src/regalloc/spilling.rs:272:45
     |
272  |             || opcode == crate::ir::Opcode::X86ElfTlsGetAddr
     |                                             ^^^^^^^^^^^^^^^^ variant or associated item not found in `ir::instructions::Opcode`
     | 
    ::: /tmp/wasmtime/target/aarch64-unknown-linux-gnu/debug/build/cranelift-codegen-dd375821e82eb469/out/opcodes.rs:1427:1
     |
1427 | pub enum Opcode {
     | --------------- variant or associated item `X86ElfTlsGetAddr` not found here

error[E0599]: no variant or associated item named `X86MachoTlsGetAddr` found for type `ir::instructions::Opcode` in the current scope
    --> cranelift/codegen/src/regalloc/spilling.rs:273:45
     |
273  |             || opcode == crate::ir::Opcode::X86MachoTlsGetAddr
     |                                             ^^^^^^^^^^^^^^^^^^ variant or associated item not found in `ir::instructions::Opcode`
     | 
    ::: /tmp/wasmtime/target/aarch64-unknown-linux-gnu/debug/build/cranelift-codegen-dd375821e82eb469/out/opcodes.rs:1427:1
     |
1427 | pub enum Opcode {
     | --------------- variant or associated item `X86MachoTlsGetAddr` not found here

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0599`.
error: could not compile `cranelift-codegen`.

It's propably worth noting that this only happens when executing cargo build from the wasmtime dir, and it doesn't when cd wasmtime/cranelift & cargo build. Still the error seems to be from cranelift.

P.S: the regression is caused by bytecodealliance/wasmtime@0a1bb3b , I hope this thread is the correct one to note the author of that regression.

@bjorn3
Copy link
Contributor Author

bjorn3 commented Mar 5, 2020

The problem is probably that Cranelift got built without x86 support, but I added a special case to regalloc with an x86 only instruction without checking for x86 support. I think the solution is to add an instruction attribute, just like is_call and is_branch, for this special case instead.

@abrown
Copy link
Collaborator

abrown commented Mar 5, 2020

@bjorn3 (or @stefson), can you create an issue in wasmtime so we can track this there?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TLS support
7 participants