Diffstat (limited to 'linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README')
-rw-r--r--[-rwxr-xr-x] linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README | 106
1 files changed, 53 insertions, 53 deletions
diff --git a/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README b/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
index 93d433c..55c0a29 100755..100644
--- a/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
+++ b/linden/indra/libgcrypt/libgcrypt-1.2.2/mpi/alpha/README
@@ -1,53 +1,53 @@
This directory contains mpn functions optimized for DEC Alpha processors.

RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is
write-through, and a cache line is transferred from the store buffer to the
off-chip L2 in as much as 15 cycles on most systems.  This delay hurts
mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of
these cycles are pipelined.  Thus, multiply instructions can be issued at a
rate of one every 21st cycle.

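The multiply figures in item 3 can be turned into a rough issue-rate
estimate.  The sketch below is illustrative only: the helper names are
hypothetical, and it assumes that the multiplier is the only bottleneck and
that nothing else in the loop costs extra cycles.

```c
#include <assert.h>

/* EV4 figures quoted in item 3 above. */
enum { EV4_MUL_LATENCY = 23, EV4_MUL_PIPELINED = 2 };

/* Cycles between successive multiply issues: the part of the latency
   that is not pipelined. */
static int ev4_mul_issue_interval(void)
{
    return EV4_MUL_LATENCY - EV4_MUL_PIPELINED;
}

/* Hypothetical helper: cycles spent issuing multiplies for n limbs,
   assuming muls_per_limb multiply instructions per limb and no other
   limiting factor. */
static long ev4_mul_issue_cycles(long n, int muls_per_limb)
{
    assert(n >= 0 && muls_per_limb >= 0);
    return n * muls_per_limb * ev4_mul_issue_interval();
}
```

For a loop that needs both halves of each product (one mulq and one umulh
per limb), ev4_mul_issue_cycles(1, 2) gives 42 cycles per limb under these
assumptions.
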
EV5

1. The memory bandwidth of this chip seems excellent, both for loads and
stores.  Even when the working set is larger than the on-chip L1 and L2
caches, the performance remains almost unaffected.

2. mulq has a measured latency of 13 cycles and an issue rate of one every
8th cycle.  umulh has a measured latency of 15 cycles and an issue rate of
one every 10th cycle.  But the exact timing is somewhat confusing.

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, of which 12
   are memory operations.  This will take at least
        ceil(37/2) [dual issue] + 1 [taken branch] = 20 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 20 issue cycles.
   The computation is inherently serial, with these dependencies:
     addq
     /   \
   addq  cmpult
     |     |
   cmpult  |
       \  /
        or
   I.e., there is a 4 cycle path for each limb, making 16 cycles the absolute
   minimum.  We could replace the `or' with a cmoveq/cmovne, which would save
   a cycle on EV5, but that might waste a cycle on EV4.  Also, cmov takes 2
   cycles.
     addq
     /   \
   addq  cmpult
     |      \
   cmpult -> cmovne

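The dependency graphs in item 3 can be written out in portable C.  This is a
sketch of one plausible reading of the graphs, not the assembly in this
directory: it assumes the carry-in is added to one source limb first, and
the names limb_t, add_n_or, add_n_select and issue_cycles are hypothetical.
A compiler is free to schedule the code differently and need not emit a cmov
for the second variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* one 64-bit limb, as on Alpha */

/* rp[] = s1[] + s2[] over n limbs; returns the carry out (0 or 1).
   The carry travels addq -> addq -> cmpult -> or, a 4-operation path
   per limb. */
static limb_t add_n_or(limb_t *rp, const limb_t *s1, const limb_t *s2,
                       size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t t  = s2[i] + cy;   /* addq:   fold in the carry-in      */
        limb_t c1 = t < cy;       /* cmpult: did that addition wrap?   */
        limb_t r  = s1[i] + t;    /* addq:   add the other source limb */
        limb_t c2 = r < t;        /* cmpult: did that addition wrap?   */
        cy = c1 | c2;             /* or:     at most one of them is 1  */
        rp[i] = r;
    }
    return cy;
}

/* Same result, with the final `or' replaced by a conditional select in
   the spirit of cmovne. */
static limb_t add_n_select(limb_t *rp, const limb_t *s1, const limb_t *s2,
                           size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t t  = s2[i] + cy;
        limb_t c1 = t < cy;
        limb_t r  = s1[i] + t;
        limb_t c2 = r < t;
        cy = c1 != 0 ? c1 : c2;   /* select instead of or */
        rp[i] = r;
    }
    return cy;
}

/* Issue-cycle estimate used in item 3: ceil(insns / width) + 1 for the
   taken branch. */
static unsigned issue_cycles(unsigned insns, unsigned width)
{
    assert(width > 0);
    return (insns + width - 1) / width + 1;
}
```

With the figures from item 3, issue_cycles(37, 2) is 20, while the serial
path costs 4 cycles for each of the 4 limbs, or 16 cycles.
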
STATUS
