Motion Video Instructions

From AlphaLinux
(Imported from http://web.archive.org/web/20100713090023/http://www.alphalinux.org/wiki/index.php?title=Motion_Video_Instructions&action=edit)
 

Latest revision as of 18:17, 29 August 2019

Beginning with the PCA56 processor, DEC added the Motion Video Instructions (MVI) to accelerate algorithms used by motion video formats such as MPEG-1 and MPEG-2[1]. Compared to other SIMD instruction sets of the time, MVI is very simple: to avoid complicating the instruction decode logic, the extension contains only 13 SIMD instructions.

Unlike Intel's MMX and SSE SIMD extensions, MVI instructions operate directly on the Alpha's 64-bit general-purpose integer registers.

Added Instructions

Mnemonic Description
minub8 minimum of packed unsigned bytes
maxub8 maximum of packed unsigned bytes
minsb8 minimum of packed signed bytes
maxsb8 maximum of packed signed bytes
minuw4 minimum of packed unsigned words
maxuw4 maximum of packed unsigned words
minsw4 minimum of packed signed words
maxsw4 maximum of packed signed words
pkwb pack words into bytes
unpkbw unpack bytes into words
pklb pack longwords into bytes
unpkbl unpack bytes into longwords
perr sum the absolute differences of each byte (pixel error)[2]
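
To make the table concrete, here is a minimal portable sketch of what minub8 and perr compute on a 64-bit register value. The function names ref_minub8 and ref_perr are hypothetical and exist only for illustration; on an MVI-capable Alpha each of them corresponds to a single instruction.

```c
#include <stdint.h>

/* Illustrative reference model of minub8 (hypothetical name):
 * per-byte unsigned minimum of two 64-bit values. */
static uint64_t ref_minub8(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t x = (a >> (8 * i)) & 0xff;
        uint64_t y = (b >> (8 * i)) & 0xff;
        r |= (x < y ? x : y) << (8 * i);
    }
    return r;
}

/* Illustrative reference model of perr (hypothetical name):
 * sum of the absolute differences of the eight byte pairs. */
static uint64_t ref_perr(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t x = (a >> (8 * i)) & 0xff;
        uint64_t y = (b >> (8 * i)) & 0xff;
        sum += x > y ? x - y : y - x;
    }
    return sum;
}
```

For example, ref_perr(0x0102030405060708, 0x0807060504030201) sums the byte differences 7+5+3+1+1+3+5+7 and yields 32, which is the kind of pixel error metric used in motion estimation.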

Determining Presence

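To determine whether MVI is present, use the amask instruction. It clears each bit of its operand mask that corresponds to an implemented extension; MVI corresponds to bit 8 (mask 0x100), so a cleared bit 8 in the result means MVI is available.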
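
A minimal sketch of such a check in C, assuming GCC and its __builtin_alpha_amask builtin. The names AMASK_MVI, mvi_bit_cleared and has_mvi are hypothetical helpers introduced for this example; on non-Alpha targets the sketch simply reports that MVI is absent.

```c
#include <stdint.h>

/* Feature bit for MVI in the amask operand (bit 8, mask 0x100). */
#define AMASK_MVI (UINT64_C(1) << 8)

/* amask clears each requested bit whose feature is implemented, so a
 * cleared MVI bit in the result means MVI is present. */
static int mvi_bit_cleared(uint64_t amask_result)
{
    return (amask_result & AMASK_MVI) == 0;
}

/* Hypothetical helper: returns 1 if MVI is available, 0 otherwise. */
static int has_mvi(void)
{
#if defined(__alpha__) && defined(__GNUC__)
    /* Assumes GCC's Alpha builtin, which emits the amask instruction. */
    return mvi_bit_cleared((uint64_t)__builtin_alpha_amask((long)AMASK_MVI));
#else
    return 0; /* not an Alpha target, so no MVI */
#endif
}
```

A program could call has_mvi() once at startup and select between an MVI code path and a plain integer fallback.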

Latency and Slotting

On the in-order PCA56, all MVIs have a latency of 2 cycles. This means that at least one independent instruction must separate an MVI from an instruction that uses its result to prevent stalling. On the out-of-order EV6 and newer, MVIs have a latency of 3 cycles and are slotted to the U0 pipeline[3].

Usage

Unsigned Saturated Arithmetic

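Packed unsigned saturated addition and subtraction can be performed easily using the packed minimum instructions. For addition, the second operand is first clamped to the complement of the first (for bytes, min(b, 255 - a)), so that no per-element sum can exceed the maximum or carry into the neighboring element. For subtraction, the subtrahend is clamped to the minuend, so that no element can go below zero or borrow.

For instance, to add the packed unsigned bytes stored in $16 to those in $17 with saturation and store the result in $0: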

ornot  $31,$16,$1
minub8 $17,$1,$17
addq   $16,$17,$0

To subtract the packed unsigned bytes stored in $17 from those in $16 with saturation and store the result in $0:

minub8 $17,$16,$17
subq   $16,$17,$0

Note that these sequences are not optimized for register usage or latency; both overwrite $17, and the addition also uses $1 as a temporary.

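To use this in C with GCC, the following functions may be used. They rely on GCC's Alpha builtins, which require MVI code generation to be enabled (for example with -mmax).

/* Assumption: __m64 is not predefined on Alpha, so a 64-bit unsigned
 * integer type is assumed here. */
typedef unsigned long __m64;

#define __minub8        __builtin_alpha_minub8
#define __minuw4        __builtin_alpha_minuw4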

/* Add the 8-bit values in M1 to the 8-bit values in M2 using unsigned
 * saturated arithmetic (MMX equivalent: paddusb) */
static inline __m64
addusb8(__m64 m1, __m64 m2) {
        return m1 + __minub8(m2, ~m1);
}

/* Add the 16-bit values in M1 to the 16-bit values in M2 using unsigned
 * saturated arithmetic (MMX equivalent: paddusw) */
static inline __m64
addusw4(__m64 m1, __m64 m2) {
        return m1 + __minuw4(m2, ~m1);
}

/* Subtract the 8-bit values in M2 from the 8-bit values in M1 using unsigned
 * saturated arithmetic (MMX equivalent: psubusb) */
static inline __m64
subusb8(__m64 m1, __m64 m2) {
        return m1 - __minub8(m2, m1);
}

/* Subtract the 16-bit values in M2 from the 16-bit values in M1 using unsigned
 * saturated arithmetic (MMX equivalent: psubusw) */
static inline __m64
subusw4(__m64 m1, __m64 m2) {
        return m1 - __minuw4(m2, m1);
}
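
The functions above only compile for Alpha targets. As a rough way to check the technique on any machine, the following sketch replaces the builtins with a portable per-lane minimum. The names lane_min and emu_* are hypothetical stand-ins introduced for this example, not part of GCC or of the code above.

```c
#include <stdint.h>

/* Portable stand-in for minub8 (bits = 8) and minuw4 (bits = 16):
 * unsigned minimum of each lane of two 64-bit values. */
static uint64_t lane_min(uint64_t a, uint64_t b, unsigned bits)
{
    const uint64_t mask = (UINT64_C(1) << bits) - 1;
    uint64_t r = 0;
    for (unsigned shift = 0; shift < 64; shift += bits) {
        uint64_t x = (a >> shift) & mask;
        uint64_t y = (b >> shift) & mask;
        r |= (x < y ? x : y) << shift;
    }
    return r;
}

/* Same expressions as addusb8/subusb8/addusw4/subusw4, using lane_min.
 * Clamping m2 first guarantees no carry or borrow crosses a lane, so a
 * plain 64-bit add or subtract gives the saturated per-lane result. */
static uint64_t emu_addusb8(uint64_t m1, uint64_t m2)
{
    return m1 + lane_min(m2, ~m1, 8);
}

static uint64_t emu_subusb8(uint64_t m1, uint64_t m2)
{
    return m1 - lane_min(m2, m1, 8);
}

static uint64_t emu_addusw4(uint64_t m1, uint64_t m2)
{
    return m1 + lane_min(m2, ~m1, 16);
}

static uint64_t emu_subusw4(uint64_t m1, uint64_t m2)
{
    return m1 - lane_min(m2, m1, 16);
}
```

For example, emu_addusb8(0xFF80F001, 0x01802002) gives 0xFFFFFF03: the lowest byte adds normally (0x01 + 0x02), while the other three bytes saturate at 0xFF instead of wrapping.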

References

  1. Template:Cite web
  2. Template:Cite web
  3. Template:Cite web