Diffstat (limited to 'linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README')
-rw-r--r--[-rwxr-xr-x] | linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README | 106 |
1 files changed, 53 insertions, 53 deletions
diff --git a/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README b/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
index 93d433c..55c0a29 100755..100644
--- a/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
+++ b/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
@@ -1,53 +1,53 @@
1 | This directory contains mpn functions optimized for DEC Alpha processors. | 1 | This directory contains mpn functions optimized for DEC Alpha processors. |
2 | 2 | ||
3 | RELEVANT OPTIMIZATION ISSUES | 3 | RELEVANT OPTIMIZATION ISSUES |
4 | 4 | ||
5 | EV4 | 5 | EV4 |
6 | 6 | ||
7 | 1. This chip has very limited store bandwidth. The on-chip L1 cache is | 7 | 1. This chip has very limited store bandwidth. The on-chip L1 cache is |
8 | write-through, and a cache line is transferred from the store buffer to the | 8 | write-through, and a cache line is transferred from the store buffer to the |
9 | off-chip L2 in as many as 15 cycles on most systems. This delay hurts | 9 | off-chip L2 in as many as 15 cycles on most systems. This delay hurts |
10 | mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift. | 10 | mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift. |
11 | 11 | ||
12 | 2. Pairing is possible between memory instructions and integer arithmetic | 12 | 2. Pairing is possible between memory instructions and integer arithmetic |
13 | instructions. | 13 | instructions. |
14 | 14 | ||
15 | 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of | 15 | 3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of |
16 | these cycles are pipelined. Thus, multiply instructions can be issued at a | 16 | these cycles are pipelined. Thus, multiply instructions can be issued at a |
17 | rate of one every 21st cycle. | 17 | rate of one every 21st cycle. |
18 | 18 | ||
19 | EV5 | 19 | EV5 |
20 | 20 | ||
21 | 1. The memory bandwidth of this chip seems excellent, both for loads and | 21 | 1. The memory bandwidth of this chip seems excellent, both for loads and |
22 | stores. Even when the working set is larger than the on-chip L1 and L2 | 22 | stores. Even when the working set is larger than the on-chip L1 and L2 |
23 | caches, the performance remains almost unaffected. | 23 | caches, the performance remains almost unaffected. |
24 | 24 | ||
25 | 2. mulq has a measured latency of 13 cycles and an issue rate of one every | 25 | 2. mulq has a measured latency of 13 cycles and an issue rate of one every |
26 | 8th cycle. umulh has a measured latency of 15 cycles and an issue rate of | 26 | 8th cycle. umulh has a measured latency of 15 cycles and an issue rate of |
27 | one every 10th cycle. But the exact timing is somewhat confusing. | 27 | one every 10th cycle. But the exact timing is somewhat confusing. |
28 | 28 | ||
29 | 3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12 | 29 | 3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12 |
30 | are memory operations. This will take at least | 30 | are memory operations. This will take at least |
31 | ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles | 31 | ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles |
32 | We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data | 32 | We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data |
33 | cache cycles, which should be completely hidden in the 20 issue cycles. | 33 | cache cycles, which should be completely hidden in the 20 issue cycles. |
34 | The computation is inherently serial, with these dependencies: | 34 | The computation is inherently serial, with these dependencies: |
35 | addq | 35 | addq |
36 | / \ | 36 | / \ |
37 | addq cmpult | 37 | addq cmpult |
38 | | | | 38 | | | |
39 | cmpult | | 39 | cmpult | |
40 | \ / | 40 | \ / |
41 | or | 41 | or |
42 | I.e., there is a 4 cycle path for each limb, making 16 cycles the absolute | 42 | I.e., there is a 4 cycle path for each limb, making 16 cycles the absolute |
43 | minimum. We could replace the `or' with a cmoveq/cmovne, which would save | 43 | minimum. We could replace the `or' with a cmoveq/cmovne, which would save |
44 | a cycle on EV5, but that might waste a cycle on EV4. Also, cmov takes 2 | 44 | a cycle on EV5, but that might waste a cycle on EV4. Also, cmov takes 2 |
45 | cycles. | 45 | cycles. |
46 | addq | 46 | addq |
47 | / \ | 47 | / \ |
48 | addq cmpult | 48 | addq cmpult |
49 | | \ | 49 | | \ |
50 | cmpult -> cmovne | 50 | cmpult -> cmovne |
51 | 51 | ||
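The dependency chain described for mpn_add_n can be illustrated in portable C. This is only a sketch, not the Alpha assembly shipped in this directory: the names limb_t and add_n_sketch are hypothetical, and the mapping of each C statement to addq, cmpult, or `or' is one reading of the first diagram, under the assumption that a limb is 64 bits wide.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name; assumes one 64-bit limb */

/* Add two n-limb numbers stored least significant limb first.
   Writes the n-limb sum to rp and returns the carry-out (0 or 1). */
static limb_t add_n_sketch(limb_t *rp, const limb_t *up,
                           const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s  = up[i] + vp[i];  /* addq:   limb sum, may wrap around   */
        limb_t c1 = s < vp[i];      /* cmpult: carry out of the limb sum   */
        limb_t r  = s + cy;         /* addq:   add the incoming carry      */
        limb_t c2 = r < cy;         /* cmpult: carry out of the carry add  */
        cy = c1 | c2;               /* or:     at most one of c1, c2 is 1  */
        rp[i] = r;
    }
    return cy;
}
```

The carry of one limb feeds the second addition of the next limb, which is why unrolling alone cannot shorten the per-limb path. The two carries can be merged with a plain `or' because a limb sum that has already wrapped is at most 2^64 - 2, so adding a carry of 1 cannot wrap it again.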
52 | STATUS | 52 | STATUS |
53 | 53 | ||