Appendix: Supporting Windows
Our example works on OSX, Linux, and Windows, but as I have pointed out, our Windows implementation is not correct even though it's working. Since I've been quite committed to making this work on all three platforms, I'll go through what we need to do in this chapter.
You might wonder why I didn't include this in the original code, and the reason is that this is really not that relevant for explaining the main concepts I wanted to explore.
Here I'm going a bit further to explore how we should set up the stack the correct way for Windows and do a proper context switch. Even though we might not get all the way to a perfect implementation, there is plenty of information and there are references for you to explore further; I'll point to some of them along the way.

What's special about Windows

The reason I don't consider this important enough to implement in the main example is that Windows has more callee-saved registers, or non-volatile registers as they're called there, in addition to one rather poorly documented quirk that we need to account for. What we really do is just save more data when we do the context switch, and that requires more conditional compilation.
Conditionally compiling this to support Windows correctly bloats our code by almost 50% without adding much to what we need for a basic understanding.
Now, that doesn't mean this isn't interesting. On the contrary: we'll experience first hand some of the difficulties of supporting multiple platforms when doing everything from scratch.

Additional callee saved (non-volatile) registers

The first thing I mentioned is that Windows wants us to save more data during context switches, in particular the XMM6-XMM15 registers. This is mentioned specifically in the reference, so it's just a matter of adding more fields to our ThreadContext struct. That is very easy now that we've done it once before.
In addition to the XMM registers, the rdi and rsi registers are non-volatile on Windows, which means they're callee saved (on Linux these registers are used for the first and second function arguments), so we need to add these too.
However, there is one caveat: the XMM registers are 128 bits wide, not 64. Rust has a u128 type, but we'll use [u64; 2] instead to avoid some alignment issues we might otherwise run into. Don't worry, I'll explain this further down.
Our ThreadContext now looks like this:
#[cfg(target_os = "windows")]
#[derive(Debug, Default)]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
    rdi: u64,
    rsi: u64,
    xmm6: [u64; 2],
    xmm7: [u64; 2],
    xmm8: [u64; 2],
    xmm9: [u64; 2],
    xmm10: [u64; 2],
    xmm11: [u64; 2],
    xmm12: [u64; 2],
    xmm13: [u64; 2],
    xmm14: [u64; 2],
    xmm15: [u64; 2],
}

The Thread Information Block

The second part is poorly documented. I've actually struggled to verify exactly how skipping this step will cause a failure on modern Windows, but there are enough references to it from trustworthy sources that I'm in no doubt we need to go through it.
You see, Windows wants to store some information about the currently running thread in what it calls the Thread Information Block, referred to as NT_TIB. Specifically, it wants access to information about the Stack Base and the Stack Limit via the %gs register.
What is the GS register, you might ask?
The answer I found was a bit perplexing. Apparently these segment registers, GS on x64 and FS on x86, were intended by Intel to allow programs to access many different segments of memory that were meant to be part of a persistent virtual store. Modern operating systems don't use these registers this way, as we can only access our own process memory (which appears as "flat" memory to us as programmers). Back when it wasn't clear that this would be the prevailing model, these registers allowed for different implementations by different operating systems. See the Wikipedia article on the Multics operating system if you're curious.
That means these segment registers are free for operating systems to use as they deem appropriate. Windows stores information about the currently running thread in the GS register, and Linux uses these registers for thread-local storage.
When we switch threads, we should provide the information it expects about our Stack Base and our Stack Limit.
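To make those offsets concrete, here is a minimal sketch of the first fields of the 64-bit NT_TIB (the field names are my own; the offsets follow the Wikipedia article on the Win32 Thread Information Block referenced in the final code). On Windows x64 the GS segment points to this structure, which is where the gs:0x08 and gs:0x10 reads in our assembly come from:

#[repr(C)]
struct NtTib {
    exception_list: u64, // gs:0x00
    stack_base: u64,     // gs:0x08 - the highest address of the stack
    stack_limit: u64,    // gs:0x10 - the lowest address of the stack
    // ...the TIB continues with more fields we don't need here
}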
Our ThreadContext now looks like this:
#[cfg(target_os = "windows")]
#[derive(Debug, Default)]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
    rdi: u64,
    rsi: u64,
    xmm6: [u64; 2],
    xmm7: [u64; 2],
    xmm8: [u64; 2],
    xmm9: [u64; 2],
    xmm10: [u64; 2],
    xmm11: [u64; 2],
    xmm12: [u64; 2],
    xmm13: [u64; 2],
    xmm14: [u64; 2],
    xmm15: [u64; 2],
    stack_start: u64,
    stack_end: u64,
}
Notice we use the #[cfg(target_os = "windows")] attribute here on all the Windows-specific functions and structs, which means we need to give our "original" definitions an attribute that makes sure they're compiled for all targets other than Windows: #[cfg(not(target_os = "windows"))].
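As a minimal sketch of that pattern, exactly one of these two definitions gets compiled for any given target (the real structs are the ones shown in this chapter and in the final code):

#[cfg(not(target_os = "windows"))]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    // ...the rest of the original seven u64 fields
}

#[cfg(target_os = "windows")]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    // ...plus rdi, rsi, the XMM registers and the two stack fields
}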
I named the fields stack_start and stack_end since I find that easier to parse mentally: we know the stack starts at the top and grows downwards to the bottom.

The Windows stack

https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=vs-2019#stack-allocation
You see, since Rust sets up our stack frames, we only need to care about where to put our %rsp and the return address, and this looks pretty much the same as in the psABI. The differences between Win64 and the psABI are elsewhere, and Rust takes care of all of them for us.
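To make that difference concrete, here is a hypothetical pair of helpers, not part of the actual code (which inlines this in t_yield, as you'll see in the final listing), showing how the first two pointer arguments land in different registers under the two ABIs:

use std::arch::asm;

// psABI (Linux/OSX): the first two arguments are passed in rdi and rsi.
#[cfg(not(target_os = "windows"))]
unsafe fn call_switch(old: *mut ThreadContext, new: *const ThreadContext) {
    asm!("call switch", in("rdi") old, in("rsi") new, clobber_abi("C"));
}

// Win64: the first two arguments are passed in rcx and rdx.
#[cfg(target_os = "windows")]
unsafe fn call_switch(old: *mut ThreadContext, new: *const ThreadContext) {
    asm!("call switch", in("rcx") old, in("rdx") new, clobber_abi("system"));
}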
Now, to implement this we need to make a change to our spawn() function to actually provide this information and set up our stack.
#[cfg(target_os = "windows")]
pub fn spawn(&mut self, f: fn()) {
    let available = self
        .threads
        .iter_mut()
        .find(|t| t.state == State::Available)
        .expect("no available thread.");

    let size = available.stack.len();

    // see: https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=vs-2019#stack-allocation
    unsafe {
        let s_ptr = available.stack.as_mut_ptr().offset(size as isize);
        let s_ptr = (s_ptr as usize & !15) as *mut u8;
        std::ptr::write(s_ptr.offset(-16) as *mut u64, guard as u64);
        std::ptr::write(s_ptr.offset(-24) as *mut u64, skip as u64);
        std::ptr::write(s_ptr.offset(-32) as *mut u64, f as u64);
        available.ctx.rsp = s_ptr.offset(-32) as u64;
        // Stack Base: the highest address of our stack
        available.ctx.stack_start = s_ptr as u64;
        // Stack Limit: the lowest address of our stack
        available.ctx.stack_end = available.stack.as_ptr() as u64;
    }

    available.state = State::Ready;
}
As you see, we provide a pointer to the start of our stack (the highest address) and a pointer to the end of our stack (the lowest address).
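If you want to convince yourself of this layout, here is a small standalone sketch (not part of the runtime) that models the pointers spawn() computes for a 2 MiB stack:

fn main() {
    let stack = vec![0u8; 1024 * 1024 * 2];
    let bottom = stack.as_ptr() as usize;   // -> ctx.stack_end (Stack Limit)
    let top = (bottom + stack.len()) & !15; // -> ctx.stack_start (Stack Base)
    println!("guard address is written at {:#x}", top - 16);
    println!("skip address is written at  {:#x}", top - 24);
    println!("f address is written at     {:#x}", top - 32); // ctx.rsp points here
    assert_eq!((top - 32) % 16, 0); // the new rsp is still 16-byte aligned
}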

Possible alignment issues

Well, this is supposed to be hard, remember? Windows does not disappoint us by making things too easy. You see, when we move data to or from a 128-bit register we need to use some special assembly instructions. There are several of them that do mostly the same thing: movaps and movups for packed single-precision floating point values, and movdqa and movdqu for packed integer values.
As you see, each of them has an aligned and an unaligned variant. The difference is that the *ps type instructions target floating point values while the *dqa/*dqu type instructions target integer values. Both will work, but if you clicked on the Microsoft reference you probably noticed that the XMM registers are used for floating point values, so the *ps type instructions are the right ones for us to use.
The aligned versions have historically been slightly faster under most circumstances and would be preferred in a context switch, but the latest information I've read suggests the two have performed practically the same for the last six generations of CPUs.
If you want to read more about the cost of different instructions on newer and older processors, have a look at Agner Fog's instruction tables.
However, since the aligned instructions are used in all the reference implementations I've encountered, we'll use them as well, even though they expose us to some extra complexity. We are still learning stuff, aren't we?
By aligned, we mean that the memory they read from or write to must be 16-byte aligned.
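In Rust terms, that requirement is easy to state (a tiny sketch, not part of the runtime):

// movaps faults at runtime if this returns false for the address it's
// given; movups accepts any address.
fn is_16_byte_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 16 == 0
}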
Now, the way I solve this is to move the fields that require alignment to the start of our struct and add a new attribute: #[repr(align(16))].
The repr(align(n)) attribute ensures that our struct starts at a 16-byte-aligned memory address, so when we write to xmm6 at the start of our assembly it's already 16-byte aligned, and since 128 bits is 16 bytes, every following field stays aligned as well.
But, and this can be important, since we now have two different field sizes, the compiler might choose to "pad" our fields. That doesn't happen right now, but moving our larger fields to the start minimizes the risk of it happening at a later time.
We also avoid manually adding a padding member to our struct: in the old ordering, the nine u64 fields (72 bytes) before our XMM fields prevented them from aligning to 16 (remember, the repr(C) attribute guarantees that the compiler will not reorder our fields).
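If you want to verify these claims on your machine, this standalone sketch prints the relevant alignments (the Example struct is just for illustration, and the exact u128 value depends on your Rust version and target, which is precisely why we avoid it):

use std::mem::{align_of, size_of};

#[repr(C)]
#[repr(align(16))]
struct Example {
    xmm6: [u64; 2],
    rsp: u64,
}

fn main() {
    println!("align_of::<[u64; 2]>(): {}", align_of::<[u64; 2]>()); // 8
    println!("align_of::<u128>():     {}", align_of::<u128>());     // 8 or 16, depending on the toolchain
    println!("align_of::<Example>():  {}", align_of::<Example>());  // 16, thanks to repr(align(16))
    println!("size_of::<Example>():   {}", size_of::<Example>());   // padded up to a multiple of 16
}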
Our ThreadContext ends up like this after our changes:
#[cfg(target_os = "windows")]
#[derive(Debug, Default)]
#[repr(C)]
#[repr(align(16))]
struct ThreadContext {
    xmm6: [u64; 2],
    xmm7: [u64; 2],
    xmm8: [u64; 2],
    xmm9: [u64; 2],
    xmm10: [u64; 2],
    xmm11: [u64; 2],
    xmm12: [u64; 2],
    xmm13: [u64; 2],
    xmm14: [u64; 2],
    xmm15: [u64; 2],
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
    rdi: u64,
    rsi: u64,
    stack_start: u64,
    stack_end: u64,
}
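Since the assembly in switch() hardcodes these offsets, it can be worth pinning them at compile time. This is a sketch of my own (not in the original code), assuming a recent Rust where std::mem::offset_of! is available and the ThreadContext above is in scope:

use std::mem::offset_of;

// These must stay in sync with the offsets used in switch() below.
const _: () = {
    assert!(offset_of!(ThreadContext, xmm6) == 0x00);
    assert!(offset_of!(ThreadContext, xmm15) == 0x90);
    assert!(offset_of!(ThreadContext, rsp) == 0xa0);
    assert!(offset_of!(ThreadContext, rsi) == 0xe0);
    assert!(offset_of!(ThreadContext, stack_start) == 0xe8);
    assert!(offset_of!(ThreadContext, stack_end) == 0xf0);
};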
Lastly, we need to change our switch() function and update our assembly. After all this explanation, this is pretty easy:
#[cfg(target_os = "windows")]
#[naked]
#[no_mangle]
unsafe extern "C" fn switch() {
    asm!(
        "movaps [rcx + 0x00], xmm6",
        "movaps [rcx + 0x10], xmm7",
        "movaps [rcx + 0x20], xmm8",
        "movaps [rcx + 0x30], xmm9",
        "movaps [rcx + 0x40], xmm10",
        "movaps [rcx + 0x50], xmm11",
        "movaps [rcx + 0x60], xmm12",
        "movaps [rcx + 0x70], xmm13",
        "movaps [rcx + 0x80], xmm14",
        "movaps [rcx + 0x90], xmm15",
        "mov [rcx + 0xa0], rsp",
        "mov [rcx + 0xa8], r15",
        "mov [rcx + 0xb0], r14",
        "mov [rcx + 0xb8], r13",
        "mov [rcx + 0xc0], r12",
        "mov [rcx + 0xc8], rbx",
        "mov [rcx + 0xd0], rbp",
        "mov [rcx + 0xd8], rdi",
        "mov [rcx + 0xe0], rsi",
        "mov rax, gs:0x08",
        "mov [rcx + 0xe8], rax",
        "mov rax, gs:0x10",
        "mov [rcx + 0xf0], rax",
        "movaps xmm6, [rdx + 0x00]",
        "movaps xmm7, [rdx + 0x10]",
        "movaps xmm8, [rdx + 0x20]",
        "movaps xmm9, [rdx + 0x30]",
        "movaps xmm10, [rdx + 0x40]",
        "movaps xmm11, [rdx + 0x50]",
        "movaps xmm12, [rdx + 0x60]",
        "movaps xmm13, [rdx + 0x70]",
        "movaps xmm14, [rdx + 0x80]",
        "movaps xmm15, [rdx + 0x90]",
        "mov rsp, [rdx + 0xa0]",
        "mov r15, [rdx + 0xa8]",
        "mov r14, [rdx + 0xb0]",
        "mov r13, [rdx + 0xb8]",
        "mov r12, [rdx + 0xc0]",
        "mov rbx, [rdx + 0xc8]",
        "mov rbp, [rdx + 0xd0]",
        "mov rdi, [rdx + 0xd8]",
        "mov rsi, [rdx + 0xe0]",
        "mov rax, [rdx + 0xe8]",
        "mov gs:0x08, rax",
        "mov rax, [rdx + 0xf0]",
        "mov gs:0x10, rax",
        "ret", options(noreturn)
    );
}
As you see, our code gets just a little bit longer. It's not difficult once you've figured out what to store where, but it does add a lot of code.
Our inline assembly won't let us mov from one memory offset to another memory offset, so we need to go via a register. I chose the rax register (the default register for the return value), but I could have chosen any general purpose register for this.

Conclusion

So, this is all we needed to do. As you see, we don't really do anything new here; the difficult part is figuring out how Windows works and what it expects. But now that we have done our job properly, we should have a pretty complete context switch for all three platforms.

Final code

I've collected all the code we need to compile differently for Windows at the bottom, so you don't have to read through the first 200 lines of code, since they're unchanged from the previous chapters. I hope you understand why I chose to dedicate a separate chapter to this.
You'll also find this code in the Windows branch in the repository.
#![feature(naked_functions)]
use std::arch::asm;

const DEFAULT_STACK_SIZE: usize = 1024 * 1024 * 2;
const MAX_THREADS: usize = 4;
static mut RUNTIME: usize = 0;

pub struct Runtime {
    threads: Vec<Thread>,
    current: usize,
}

#[derive(PartialEq, Eq, Debug)]
enum State {
    Available,
    Running,
    Ready,
}

struct Thread {
    id: usize,
    stack: Vec<u8>,
    ctx: ThreadContext,
    state: State,
}

#[cfg(not(target_os = "windows"))]
#[derive(Debug, Default)]
#[repr(C)]
struct ThreadContext {
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
}

impl Thread {
    fn new(id: usize) -> Self {
        Thread {
            id,
            stack: vec![0_u8; DEFAULT_STACK_SIZE],
            ctx: ThreadContext::default(),
            state: State::Available,
        }
    }
}

impl Runtime {
    pub fn new() -> Self {
        let base_thread = Thread {
            id: 0,
            stack: vec![0_u8; DEFAULT_STACK_SIZE],
            ctx: ThreadContext::default(),
            state: State::Running,
        };

        let mut threads = vec![base_thread];
        let mut available_threads: Vec<Thread> = (1..MAX_THREADS).map(|i| Thread::new(i)).collect();
        threads.append(&mut available_threads);

        Runtime {
            threads,
            current: 0,
        }
    }

    pub fn init(&self) {
        unsafe {
            let r_ptr: *const Runtime = self;
            RUNTIME = r_ptr as usize;
        }
    }

    pub fn run(&mut self) -> ! {
        while self.t_yield() {}
        std::process::exit(0);
    }

    fn t_return(&mut self) {
        if self.current != 0 {
            self.threads[self.current].state = State::Available;
            self.t_yield();
        }
    }

    #[inline(never)]
    fn t_yield(&mut self) -> bool {
        let mut pos = self.current;
        while self.threads[pos].state != State::Ready {
            pos += 1;
            if pos == self.threads.len() {
                pos = 0;
            }
            if pos == self.current {
                return false;
            }
        }

        if self.threads[self.current].state != State::Available {
            self.threads[self.current].state = State::Ready;
        }

        self.threads[pos].state = State::Running;
        let old_pos = self.current;
        self.current = pos;

        unsafe {
            let old: *mut ThreadContext = &mut self.threads[old_pos].ctx;
            let new: *const ThreadContext = &self.threads[pos].ctx;

            if cfg!(not(target_os = "windows")) {
                asm!("call switch", in("rdi") old, in("rsi") new, clobber_abi("C"));
            } else {
                asm!("call switch", in("rcx") old, in("rdx") new, clobber_abi("system"));
            }
        }

        // prevents the compiler from optimizing our code away on Windows. Will never be reached anyway.
        self.threads.len() > 0
    }

    #[cfg(not(target_os = "windows"))]
    pub fn spawn(&mut self, f: fn()) {
        let available = self
            .threads
            .iter_mut()
            .find(|t| t.state == State::Available)
            .expect("no available thread.");

        let size = available.stack.len();
        unsafe {
            let s_ptr = available.stack.as_mut_ptr().offset(size as isize);
            let s_ptr = (s_ptr as usize & !15) as *mut u8;
            std::ptr::write(s_ptr.offset(-16) as *mut u64, guard as u64);
            std::ptr::write(s_ptr.offset(-24) as *mut u64, skip as u64);
            std::ptr::write(s_ptr.offset(-32) as *mut u64, f as u64);
            available.ctx.rsp = s_ptr.offset(-32) as u64;
        }
        available.state = State::Ready;
    }
}

#[naked]
unsafe extern "C" fn skip() {
    asm!("ret", options(noreturn))
}

fn guard() {
    unsafe {
        let rt_ptr = RUNTIME as *mut Runtime;
        (*rt_ptr).t_return();
    };
}

pub fn yield_thread() {
    unsafe {
        let rt_ptr = RUNTIME as *mut Runtime;
        (*rt_ptr).t_yield();
    };
}

#[cfg(not(target_os = "windows"))]
#[naked]
#[no_mangle]
unsafe extern "C" fn switch() {
    asm!(
        "mov [rdi + 0x00], rsp",
        "mov [rdi + 0x08], r15",
        "mov [rdi + 0x10], r14",
        "mov [rdi + 0x18], r13",
        "mov [rdi + 0x20], r12",
        "mov [rdi + 0x28], rbx",
        "mov [rdi + 0x30], rbp",
        "mov rsp, [rsi + 0x00]",
        "mov r15, [rsi + 0x08]",
        "mov r14, [rsi + 0x10]",
        "mov r13, [rsi + 0x18]",
        "mov r12, [rsi + 0x20]",
        "mov rbx, [rsi + 0x28]",
        "mov rbp, [rsi + 0x30]",
        "ret", options(noreturn)
    );
}

pub fn main() {
    let mut runtime = Runtime::new();
    runtime.init();
    runtime.spawn(|| {
        println!("THREAD 1 STARTING");
        let id = 1;
        for i in 0..10 {
            println!("thread: {} counter: {}", id, i);
            yield_thread();
        }
        println!("THREAD 1 FINISHED");
    });
    runtime.spawn(|| {
        println!("THREAD 2 STARTING");
        let id = 2;
        for i in 0..15 {
            println!("thread: {} counter: {}", id, i);
            yield_thread();
        }
        println!("THREAD 2 FINISHED");
    });
    runtime.run();
}

// ===== WINDOWS SUPPORT =====
#[cfg(target_os = "windows")]
#[derive(Debug, Default)]
#[repr(C)]
#[repr(align(16))]
struct ThreadContext {
    xmm6: [u64; 2],
    xmm7: [u64; 2],
    xmm8: [u64; 2],
    xmm9: [u64; 2],
    xmm10: [u64; 2],
    xmm11: [u64; 2],
    xmm12: [u64; 2],
    xmm13: [u64; 2],
    xmm14: [u64; 2],
    xmm15: [u64; 2],
    rsp: u64,
    r15: u64,
    r14: u64,
    r13: u64,
    r12: u64,
    rbx: u64,
    rbp: u64,
    rdi: u64,
    rsi: u64,
    stack_start: u64,
    stack_end: u64,
}

impl Runtime {
    #[cfg(target_os = "windows")]
    pub fn spawn(&mut self, f: fn()) {
        let available = self
            .threads
            .iter_mut()
            .find(|t| t.state == State::Available)
            .expect("no available thread.");

        let size = available.stack.len();

        // see: https://docs.microsoft.com/en-us/cpp/build/stack-usage?view=vs-2019#stack-allocation
        unsafe {
            let s_ptr = available.stack.as_mut_ptr().offset(size as isize);
            let s_ptr = (s_ptr as usize & !15) as *mut u8;
            std::ptr::write(s_ptr.offset(-16) as *mut u64, guard as u64);
            std::ptr::write(s_ptr.offset(-24) as *mut u64, skip as u64);
            std::ptr::write(s_ptr.offset(-32) as *mut u64, f as u64);
            available.ctx.rsp = s_ptr.offset(-32) as u64;
            // Stack Base: the highest address of our stack
            available.ctx.stack_start = s_ptr as u64;
            // Stack Limit: the lowest address of our stack
            available.ctx.stack_end = available.stack.as_ptr() as u64;
        }

        available.state = State::Ready;
    }
}

// reference: https://probablydance.com/2013/02/20/handmade-coroutines-for-windows/
// Contents of the TIB on Windows: https://en.wikipedia.org/wiki/Win32_Thread_Information_Block
#[cfg(target_os = "windows")]
#[naked]
#[no_mangle]
unsafe extern "C" fn switch() {
    asm!(
        "movaps [rcx + 0x00], xmm6",
        "movaps [rcx + 0x10], xmm7",
        "movaps [rcx + 0x20], xmm8",
        "movaps [rcx + 0x30], xmm9",
        "movaps [rcx + 0x40], xmm10",
        "movaps [rcx + 0x50], xmm11",
        "movaps [rcx + 0x60], xmm12",
        "movaps [rcx + 0x70], xmm13",
        "movaps [rcx + 0x80], xmm14",
        "movaps [rcx + 0x90], xmm15",
        "mov [rcx + 0xa0], rsp",
        "mov [rcx + 0xa8], r15",
        "mov [rcx + 0xb0], r14",
        "mov [rcx + 0xb8], r13",
        "mov [rcx + 0xc0], r12",
        "mov [rcx + 0xc8], rbx",
        "mov [rcx + 0xd0], rbp",
        "mov [rcx + 0xd8], rdi",
        "mov [rcx + 0xe0], rsi",
        "mov rax, gs:0x08",
        "mov [rcx + 0xe8], rax",
        "mov rax, gs:0x10",
        "mov [rcx + 0xf0], rax",
        "movaps xmm6, [rdx + 0x00]",
        "movaps xmm7, [rdx + 0x10]",
        "movaps xmm8, [rdx + 0x20]",
        "movaps xmm9, [rdx + 0x30]",
        "movaps xmm10, [rdx + 0x40]",
        "movaps xmm11, [rdx + 0x50]",
        "movaps xmm12, [rdx + 0x60]",
        "movaps xmm13, [rdx + 0x70]",
        "movaps xmm14, [rdx + 0x80]",
        "movaps xmm15, [rdx + 0x90]",
        "mov rsp, [rdx + 0xa0]",
        "mov r15, [rdx + 0xa8]",
        "mov r14, [rdx + 0xb0]",
        "mov r13, [rdx + 0xb8]",
        "mov r12, [rdx + 0xc0]",
        "mov rbx, [rdx + 0xc8]",
        "mov rbp, [rdx + 0xd0]",
        "mov rdi, [rdx + 0xd8]",
        "mov rsi, [rdx + 0xe0]",
        "mov rax, [rdx + 0xe8]",
        "mov gs:0x08, rax",
        "mov rax, [rdx + 0xf0]",
        "mov gs:0x10, rax",
        "ret", options(noreturn)
    );
}