James Hicks
Richard Weiss
Compaq Computer Corporation
February 1999
Key words: video, multimedia, high performance computing, prefetching, motion estimation, filtering, Alpha processor
Alpha is a 64-bit processor architecture defined by a living architectural specification. It is intended to have multiple implementations over many years. The current implementations are all superscalar, which means that multiple instructions can be issued in the same cycle: up to four per cycle on the 21164 and up to six on the 21264. Each implementation has rules that constrain which instructions can be co-issued. For example, the 21264 has four ALUs on the integer side, but only one of them can execute an integer multiply, so at most one integer multiply instruction can be issued in a cycle.
Motion Video Instructions (MVI) are the DIGITAL Alpha's multimedia instructions. They are a set of Alpha processor instructions in which a single instruction operates on multiple data in parallel (SIMD). This is accomplished by partitioning a 64-bit quadword into a vector of eight separate bytes or four separate 16-bit words. Code that can take advantage of this parallelism can achieve up to an 8X performance boost. The instructions are:
MINUB8, MAXUB8
unsigned byte minimum/maximum
MINSB8, MAXSB8
signed byte minimum/maximum
MINUW4, MAXUW4
unsigned word minimum/maximum
MINSW4, MAXSW4
signed word minimum/maximum
PKWB, UNPKBW
pack/unpack word to byte
PKLB, UNPKBL
pack/unpack long to byte
PERR
pixel error
This guide is an introduction to some techniques in high performance computing with a focus on MVI. There are a few key techniques for designing efficient code for an Alpha processor. The first is to make sure that your data is in the first-level cache when you want to use it; the sections on prefetching and cache blocking explain this. The second is to take advantage of instruction-level parallelism, which is explained in the sections on software pipelining and scheduling issues. The third is to take advantage of data parallelism, which is what MVI is about. The relevant section below explains how to detect whether a processor supports MVI, so that the programmer can branch to alternate implementations and extend legacy code. Each of the MVI instructions is described with a sample trace to show how it works. There are many examples of how MVI instructions are used, and a side-by-side example of MVI code with MMX code. There are also examples of how to use MVI for motion estimation in an MPEG encoder and for image filtering. Since MVI is not currently supported by commercial "C" compilers, it is necessary to include some assembly code to use it. This is relatively easy and can be done with an inline assembler MACRO, which allows the programmer to use assembly language instructions as if they were C statements. This can be very useful not only for MVI but for prefetching and other techniques as well.
The techniques described in this guide can be used with OpenVMS, Digital UNIX, or Windows NT. Some of the details may change with different operating systems.
While these new instructions were intended to implement high quality software video encoding such as MPEG-1, MPEG-2, H.261 (ISDN video conferencing), and H.263 (Internet video conferencing), only your imagination as a software engineer will limit their uses. Any time data can be operated on in parallel you will see the benefit. Desktop Video Publishing, Video Conferencing, Internet Commerce, and Interactive Training are some target trends in visual computing.
If the application must process a large amount of data as fast as possible (high memory bandwidth), the data are all 8-bit or 16-bit integers (byte/word integer data), and the same operations are performed on all the data (parallelism), then you definitely want to explore MVI. MVI can make a critical difference in achieving video-rate encoding.
The 21164PC has the first implementation of MVI. All Alpha processors since that one including the 21264 have MVI, and all future processors will implement it.
Before taking advantage of MVI instructions, one must be running on hardware that supports these extensions. This can be determined at run time by looking at bit eight in the value the AMASK instruction returns. Future extensions may use other AMASK bits. The code to do this is written in assembly language rather than C or C++, since there is no statement that corresponds to AMASK. Assembly language code can be inserted into the object file using the macro __asm. A detailed description of the instruction can be found in the Alpha Architecture Handbook.
The AMASK instruction takes three register arguments. The first register (Ra) must be R31. The second register (Rb) has the input mask, which represents the requested architectural features. There is a one in every bit position that is being queried. The third register (Rc) is the output. Bits are cleared that correspond to architectural extensions that are present. Reserved bits and bits that correspond to absent extensions are copied unchanged. If the result is zero, all requested features are present. Software may specify an input mask of all 1s to determine the complete set of architectural extensions implemented by a processor. Assigned bit definitions are listed below.
AMASK Bit Assignments
Bit 0. Support for the byte/word extension (BWX). The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.
Bit 1. Support for the count extension (CIX). The instructions that comprise the CIX extension are CTLZ, CTPOP, CTTZ, FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.
Bit 2. Support for CIX instructions (not including SQRTG, SQRTS, and SQRTT).
Bit 8. Support for the multimedia extension (MAX). The instructions that comprise the MAX extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.
Bit 9. Support for precise arithmetic trap reporting.
Software Note:
Use this instruction to make instruction-set decisions; use IMPLVER to make code-tuning decisions.
Implementation Note:
Instruction encoding is implemented as follows: On 21064/21064A/21066/21068/21066A (EV4/EV45/LCA/LCA45 chips), AMASK copies Rbv to Rc. On 21164 (EV5), AMASK copies Rbv to Rc.
On 21164A (EV56), 21164PC (PCA56), and 21264 (EV6), AMASK correctly indicates support for architecture extensions by copying Rbv to Rc and clearing appropriate bits.
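The AMASK behavior described above can be modeled in portable C. The following is only a sketch for reasoning about the result value, not Alpha code: the names amask_model, has_features, AMASK_BWX, and AMASK_MVI are hypothetical, and the set of implemented extensions is passed in as an assumed parameter rather than read from hardware.

```c
#include <stdint.h>

/* Hypothetical bit names for illustration (bit 0 = BWX, bit 8 = MVI). */
#define AMASK_BWX (UINT64_C(1) << 0)
#define AMASK_MVI (UINT64_C(1) << 8)

/* Model of AMASK: bits of `mask` that correspond to implemented
 * extensions are cleared; all other bits are copied unchanged. */
static uint64_t amask_model(uint64_t mask, uint64_t implemented)
{
    return mask & ~implemented;
}

/* Nonzero when every queried extension is present (result is zero). */
static int has_features(uint64_t query, uint64_t implemented)
{
    return amask_model(query, implemented) == 0;
}
```

With an input mask of all 1s, the complement of the model's result is the set of implemented extensions, which is the trick used by the cputype listing later in this section.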
AMASK Code Examples.
Assembler file or "S" file
This short piece of code is an assembler file that can be made into an .obj file and linked to a "C" or "C++" program.
Notice #include <kxAlpha.h>. This header file comes with the VC RISC edition compiler and the Microsoft SDK. It is full of some very interesting information about the DIGITAL Alpha. There you can find the expansion of the LEAF_ENTRY() MACRO, which defines an entry point for the compiler and initializes the stack pointer.
- __int64 amaskValue;
- amaskValue = getAMASK( SOME_64BIT_MASK );
Inline Assembler MACRO (asm's)
This section describes how to use the function "__asm" to include assembly language code in your object file. Here it is used to insert the AMASK instruction and test bits in the mask returned, but it can be used more generally when assembly code is more suitable than C or C++. First it is necessary to convince the compiler that there is a function called "__asm." We do that with the extern long __asm( char *, … ); declaration. This is saying that __asm is some function that returns a long – on an Alpha any integer value returned from a function will be in the v0 register and the compiler knows that.
Next we declare the __asm function to be intrinsic. That means it will be implemented in native assembler word for word. This can be seen in the line below that reads #pragma intrinsic __asm
Following this there are a couple of #defines for several of the possible bit masks used to determine the availability of a desired extension. The bit we are concerned with is bit 8, defined as SUPPORT_FOR_MVI ((__int64)(0x0100))
Finally the MACRO definition is provided. This MACRO uses inline assembler to invoke the "amask" instruction. The inline code alone is as follows: __asm("amask $a0,$v0",(x) ) where (x) represents some 64-bit bit mask with the bits we are interested in set. One could just as easily write someInt64 = __asm("amask $a0,$v0", 0x0100 ); and then test whether someInt64 == 0, which indicates that MVI is supported.
Let’s take a moment to review this inline assembler code.
The string "amask $a0,$v0" is indicating that register a0 contains the bitmask required by the amask instruction and that register v0 will receive the result. This is just as we had written this code in Alpha assembler as amask a0,v0. The last parameter in the __asm() "function" call is the bitmask value that will be placed in the a0 register before the amask instruction is executed. For UNIX, the "function" is asm() without the "__".
The MACRO "thisAlphaHas()" contains a logical NOT (!) to get the code to read and function with positive logic allowing a programmer to write
- if ( thisAlphaHas(SUPPORT_FOR_MVI ) )
- printf( "MVI Supported \n");
- else
- printf( "MVI Not Supported \n");
- __int64 amask, implver;
- int i;
- char opt;
- char *implname[]={"21064 (EV4)", "21164 (EV5)", "21264 (EV6)"};
- #define MAXAMASK 16
- char *amaskname[MAXAMASK] = {
- "BWX (Byte/word instruction extensions)",
- "FIX (Floating point instruction extensions)",
- "CIX (Count instruction extensions)",
- "unused3",
- "unused4",
- "unused5",
- "unused6",
- "unused7",
- "MVI (Motion video instruction extensions)",
- "Precise arithmetic trap reporting in hardware",
- "unused10",
- "unused11",
- "unused12",
- "unused13",
- "unused14",
- "unused15" };
- while((opt = getopt(Argc, Argv, "vqd?")) != -1) {
- switch (opt) {
- case 'v': Verbose = 1; break;
- case 'q': Quiet = 1; break;
- case 'd': Debug = 1; break;
- default:
- printf("cputype [options]\n/v Verbose\n/q Quiet\n/d Debug\n");
- exit(1);
- }
- }
- amask = ~__asm("amask $a0, $v0", -1);
- implver = __asm("implver $v0");
- if (!Quiet) {
- if (Debug)
- printf("Implver = %d\n", implver);
- printf("Implementation version: %s\n\n", implname[implver]);
- if (Debug)
- printf("Amask = 0x%x\n", amask);
- printf("Architecture mask\n");
- }
- for (i=0;i<MAXAMASK;i++) {
- if (Verbose && (strncmp(amaskname[i], "unused", 6)))
- printf(" %s: %s\n", amaskname[i], (amask & (1<<i))?"YES":"NO");
- else if (!Quiet && (amask & (1<<i)))
- printf(" %s\n", amaskname[i]);
- }
MVI Instruction set description
Alpha MVI code and x86 MMX code are used in the following examples. No judgements or comments about relative performance of similar code are made or implied. These two architectures are very different in their respective implementations. Keep in mind that RISC architectures in general will have more assembler instructions representing fewer clock cycles. It is likewise important to remember that if used improperly these instructions can stall both chips. These examples are in no way the best way. Refer to the appropriate documentation for the best information on optimizing your code and preventing stalls. These are brute force examples of functionality only.
Unpack Instructions
UNPKBW – Unpack Bytes to Words expands the low four bytes in a quadword to four words in a quadword. This implies two unpacks to re-acquire all eight pixels.
Packed Byte: 8 bytes packed into 64bits
REGISTER r1 at start
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with data as shown in this table representing a 64bit quadword with eight pixels represented as byte values. Assume this data is in 64bit register r1
Alpha MVI code
- unpkbw r1, r5 // unpack the low four bytes into reg r5
- srl r1, 32, r1 // shift reg r1 right 32 bits, result in r1
- unpkbw r1, r6 // unpack the high four bytes into reg r6
x86 MMX code
- pxor mm6, mm6 // set mm6 = 0
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
Alpha REGISTER r5 after unpkbw r1,r5
Bit63
Bit0
| Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r1 after srl 32
Bit63
Bit0
| 0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after unpkbw r1,r6
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
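For readers without Alpha hardware, the trace above can be reproduced with a small portable C model. This is a sketch of the instruction's effect as described here (the low four bytes are zero-extended into four words); the function name unpkbw_model is hypothetical and the code is not a substitute for the instruction itself.

```c
#include <stdint.h>

/* Model of UNPKBW: zero-extend the low four bytes of `src`
 * into the four 16-bit words of the result. */
static uint64_t unpkbw_model(uint64_t src)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (src >> (8 * i)) & 0xFF;  /* byte i of the source */
        dst |= byte << (16 * i);                  /* becomes word i       */
    }
    return dst;
}
```

Calling unpkbw_model(r1) and then unpkbw_model(r1 >> 32) mirrors the unpkbw, srl, unpkbw sequence shown for r5 and r6.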
UNPKBL – Unpack Bytes to Longwords expands the low two bytes in a quadword to two longwords in a quadword. This implies four unpacks to re-acquire all eight pixels. Although this instruction is used much less often, here is how it looks.
Packed Byte: 8 bytes packed into 64bits
Alpha REGISTER r1 at start
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Start with data as shown in this table representing a 64bit quadword with eight pixels represented as byte values. Assume this data is in 64bit register r1
Alpha MVI code
- unpkbl r1, r5 // unpack the two low bytes into register r5
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r6 // unpack the next two bytes into register r6
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r7 // unpack the next two bytes into register r7
- srl r1, 16, r1 // shift register r1 right 16 bits
- unpkbl r1, r8 // unpack the next two bytes into register r8
x86 MMX code
- pxor mm6, mm6 // clear mm6
- movq mm2, mm1 // mm2 = mm1
- punpcklbw mm1, mm6 // unpack low four bytes into mm1
- punpckhbw mm2, mm6 // unpack high four bytes into mm2
- movq mm3, mm1 // mm3 = mm1
- punpcklwd mm1, mm6 // unpack low two words into mm1
- punpckhwd mm3, mm6 // unpack high two words into mm3
- movq mm4, mm2 // mm4 = mm2
- punpcklwd mm2, mm6 // unpack low two words into mm2
- punpckhwd mm4, mm6 // unpack high two words into mm4
Alpha REGISTER r5
Bit63
Bit0
| Pixel 1 | Pixel 0 |
Alpha REGISTER r6
Bit63
Bit0
| Pixel 3 | Pixel 2 |
Alpha REGISTER r7
Bit63
Bit0
| Pixel 5 | Pixel 4 |
Alpha REGISTER r8
Bit63
Bit0
| Pixel 7 | Pixel 6 |
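The same kind of portable C model can be written for UNPKBL. This is a sketch under the description given above (the low two bytes are zero-extended into two longwords); unpkbl_model is a hypothetical name used only for illustration.

```c
#include <stdint.h>

/* Model of UNPKBL: zero-extend the low two bytes of `src`
 * into the two 32-bit longwords of the result. */
static uint64_t unpkbl_model(uint64_t src)
{
    uint64_t lo = src & 0xFF;         /* byte 0 -> low longword  */
    uint64_t hi = (src >> 8) & 0xFF;  /* byte 1 -> high longword */
    return lo | (hi << 32);
}
```

Applying the model four times, shifting the source right by 16 bits between calls, reproduces the r5 through r8 traces.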
Pack Instructions
PKWB – Pack Words to Bytes truncates the four (4) component words of the input register and writes them to the low four (4) bytes of the output register.
Alpha REGISTER r5 at start of pkwb
Bit63 Bit0
| Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start of pkwb
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha MVI code
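SPECIAL NOTE: packing both halves directly into the same destination register will not work, because pkwb writes zeros to the upper four (4) bytes of the destination register. A second pkwb into that register would therefore overwrite the first pixels' data with zeros. Instead, each half is packed separately, the high half is shifted into place, and the two are combined with a logical OR, as the traces below show.
Alpha REGISTER r5 after pkwb r5,r5
Bit63
Bit0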
| 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after pkwb r6,r6
Bit63
Bit0
| 0 | 0 | 0 | 0 | Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 |
Alpha REGISTER r6 after sll r6, 32, r6
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | 0 | 0 | 0 | 0 |
Alpha REGISTER v0 after bis r5,r6,v0
Bit63
Bit0
| Pixel 7 | Pixel 6 | Pixel 5 | Pixel 4 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
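The pack, shift, and OR sequence traced above can be checked with a portable C model. This is a sketch only: pkwb_model and pack_eight_pixels are hypothetical names, and the register sequence in the comment is inferred from the trace headings in this section rather than taken from a code listing.

```c
#include <stdint.h>

/* Model of PKWB: truncate each of the four 16-bit words of `src`
 * to its low byte and write the bytes to the low four bytes of the
 * result. The upper four bytes of the result are zero. */
static uint64_t pkwb_model(uint64_t src)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t low_byte = (src >> (16 * i)) & 0xFF;
        dst |= low_byte << (8 * i);
    }
    return dst;
}

/* Mirrors: pkwb r5,r5 ; pkwb r6,r6 ; sll r6,32,r6 ; bis r5,r6,v0 */
static uint64_t pack_eight_pixels(uint64_t r5, uint64_t r6)
{
    return pkwb_model(r5) | (pkwb_model(r6) << 32);
}
```

Because each pack result has zeros in its upper half, the shift and OR are what bring the two halves together in one quadword.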
PKLB – Pack Longwords to Bytes truncates the two (2) component longwords in the input register to byte values and writes them to the low two (2) bytes of the output register.
Alpha REGISTER r5 at start
Bit63
Bit0
| Pixel 1 | Pixel 0 |
Alpha REGISTER r6 at start
Bit63
Bit0
| Pixel 3 | Pixel 2 |
Alpha REGISTER r7 at start
Bit63
Bit0
| Pixel 5 | Pixel 4 |
Alpha REGISTER r8 at start
Bit63
Bit0
| Pixel 7 | Pixel 6 |
Alpha MVI code
- pklb r5, r5 // trunc and pack 2 longwords into 2 bytes
- pklb r6, r6 // trunc and pack 2 longwords into 2 bytes
- sll r6, 16, r6 // shift pixels 3 and 2 into place
- bis r5, r6, r5 // logical OR pixels 3 and 2 into r5
- pklb r7, r7 // trunc and pack 2 longwords into 2 bytes
- sll r7, 32, r7 // shift pixels 5 and 4 into place
- bis r5, r7, r5 // logical OR pixels 5 and 4 into r5
- pklb r8, r8 // trunc and pack 2 longwords into 2 bytes
- sll r8, 48, r8 // shift pixels 7 and 6 into place
- bis r5, r8, r5 // logical OR pixels 7 and 6 into r5
Alpha REGISTER r5 after pklb r5,r5
Bit63
Bit0
| 0 | 0 | 0 | 0 | 0 | 0 | Pixel 1 | Pixel 0 |
Alpha REGISTER r6 after pklb r6,r6
Bit63
Bit0
| 0 | 0 | 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 |
Alpha REGISTER r6 after sll r6,16,r6
Bit63
Bit0
| 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | 0 | 0 |
Alpha REGISTER r5 after bis r5,r6,r5
Bit63
Bit0
| 0 | 0 | 0 | 0 | Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
Alpha REGISTER r7 and r8 just repeat this sequence after shifting the correct number of bits.
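The full four-register sequence can be modeled in portable C as follows. This is a sketch under the description above; pklb_model and pack_eight_from_longs are hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Model of PKLB: truncate each of the two 32-bit longwords of `src`
 * to its low byte and write the bytes to the low two bytes of the
 * result. The upper six bytes of the result are zero. */
static uint64_t pklb_model(uint64_t src)
{
    uint64_t lo = src & 0xFF;          /* low longword  -> byte 0 */
    uint64_t hi = (src >> 32) & 0xFF;  /* high longword -> byte 1 */
    return lo | (hi << 8);
}

/* Mirrors the pack, shift (16, 32, 48), and OR steps for r5..r8. */
static uint64_t pack_eight_from_longs(uint64_t r5, uint64_t r6,
                                      uint64_t r7, uint64_t r8)
{
    return pklb_model(r5)
         | (pklb_model(r6) << 16)
         | (pklb_model(r7) << 32)
         | (pklb_model(r8) << 48);
}
```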
Byte and Word Minimum and Maximum (MINxxx) (MAXxxx)
MINUB8 Vector Unsigned Byte Minimum
MINUW4 Vector Unsigned Word Minimum
MINSB8 Vector Signed Byte Minimum
MINSW4 Vector Signed Word Minimum
MAXUB8 Vector Unsigned Byte Maximum
MAXUW4 Vector Unsigned Word Maximum
MAXSB8 Vector Signed Byte Maximum
MAXSW4 Vector Signed Word Maximum
These instructions take the form
- MINxxx Ra,Rb,Rc
- MAXxxx Ra,Rb,Rc
Each value in register Ra is compared with the corresponding value in register Rb, and the smaller (MINxxx) or larger (MAXxxx) of the two is placed in register Rc. These are vector-wise comparisons: each byte is compared when using the xxxxB8 instructions, and each word when using the xxxxW4 instructions.
Alpha MVI code
minub8 r5, r6, r6 // get the minimums in each BYTE position
Alpha REGISTER r5 at start
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
| 0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Alpha REGISTER r6 after calling minub8 r5,r6,r6
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Alpha MVI code
minuw4 r5, r6, r6 // get the minimum’s in each WORD position
Alpha REGISTER r5 at start
| 0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
| 0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
| 0x0000 | 0x0001 | 0x0000 | 0x0001 |
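Both traces above can be reproduced with a portable C model of the unsigned minimum instructions. This is a sketch for checking expected values, with hypothetical function names minub8_model and minuw4_model; the MAX variants differ only in the direction of the comparison.

```c
#include <stdint.h>

/* Model of MINUB8: unsigned minimum of each of the eight byte lanes. */
static uint64_t minub8_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t x = (a >> (8 * i)) & 0xFF;
        uint64_t y = (b >> (8 * i)) & 0xFF;
        r |= (x < y ? x : y) << (8 * i);
    }
    return r;
}

/* Model of MINUW4: unsigned minimum of each of the four word lanes. */
static uint64_t minuw4_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t x = (a >> (16 * i)) & 0xFFFF;
        uint64_t y = (b >> (16 * i)) & 0xFFFF;
        r |= (x < y ? x : y) << (16 * i);
    }
    return r;
}
```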
PERR (pixel error) takes the eight bytes packed into each of two quadword registers, computes the absolute difference between each pair of corresponding bytes, then adds the eight intermediate results and right-aligns the sum in the destination register. The net result is that the sum of absolute differences used in motion estimation can be computed for eight pixels with a single instruction on an MVI-capable Alpha.
Paul Rubinfeld, Bob Rose and Michael McCallig present several very good examples of applications that use PERR in their paper entitled "Motion Video Extensions for Alpha." Look in Helpful URL’s for more information.
Alpha MVI code
perr r5,r6,v0
x86 MMX code
Alpha REGISTER r5 at start
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
Alpha REGISTER r6 at start
| 0 | 1 | 2 | 2 | 0 | 0 | 1 | 1 |
Intermediate Absolute Differences
| 1 | 1 | 1 | 2 | 1 | 0 | 0 | 1 |
Sum of Absolute Differences placed in Alpha
REGISTER V0 (64bit)
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
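The sum of absolute differences in the trace above can be verified with a portable C model. This is a sketch of the behavior as described in this section; perr_model is a hypothetical name.

```c
#include <stdint.h>

/* Model of PERR: sum of the absolute differences of the eight
 * unsigned byte lanes of `a` and `b`, right-aligned in the result. */
static uint64_t perr_model(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t x = (a >> (8 * i)) & 0xFF;
        uint64_t y = (b >> (8 * i)) & 0xFF;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}
```

In a motion estimation loop, a model like this would be called once per eight pixels of a block and the results summed to score a candidate motion vector.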
How MVI instructions are used in software
In this section, we give some guidelines and examples for using the instructions above. The first issue is whether you are writing code for just one type of machine or not. The section on Detecting MVI is useful for determining at run time whether the current Alpha machine has MVI. You might be writing the same code for multiple architectures, so you may first need to test whether the current machine is an Alpha. The following section on portable coding explains that. The other subsections describe how to do a bytewise add with saturation as well as a simple convolution filter.
Portable Coding Techniques
- //+
- // setup the function pointer by testing for MVI support
- //-
- if ( thisProcessorHas( SUPPORT_FOR_MVI ) )
- //+
- // we can use MVI
- // point the function pointer at the MVI function
- //-
- videoProcessingFunctionPointer = functionUsingMVI;
- else
- //+
- // we can not use MVI
- // point at the NON MVI version of the function
- //-
- videoProcessingFunctionPointer = functionWithoutMVI;
- //+
- // we are building for x86 MMX point at that code
- //-
- videoProcessingFunctionPointer = FunctionUsingMMX;
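The function-pointer technique in the fragment above can be sketched in self-contained C as follows. Everything here is an illustrative assumption: the names video_fn, invert_generic, invert_mvi, and select_video_fn are hypothetical, the MVI routine is a placeholder that simply calls the generic one, and the result of the MVI test is passed in as a flag rather than obtained from AMASK.

```c
#include <stddef.h>

typedef void (*video_fn)(unsigned char *buf, size_t n);

/* Generic version: invert every 8-bit sample. */
static void invert_generic(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(255 - buf[i]);
}

/* Placeholder for an MVI implementation; a real build would
 * call hand-written assembly here instead. */
static void invert_mvi(unsigned char *buf, size_t n)
{
    invert_generic(buf, n);
}

/* Choose the implementation once, at startup. */
static video_fn select_video_fn(int has_mvi)
{
    return has_mvi ? invert_mvi : invert_generic;
}
```

Selecting the pointer once keeps the feature test out of the inner processing loop, which is the point of the technique.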
Unsigned Saturating Arithmetic
The current versions of MVI do not have explicit byte- or word-partitioned arithmetic operations, but it is possible to produce the same results with multiple instructions. First we explain why one would want a saturating add. Suppose we want to blend two pixels by adding their red, green, and blue values together and dividing by two (shifting right one bit). If a simple add is applied to the 16-bit values shown below and the sum is truncated to 16 bits, the answer is actually less than both of the addends, and the shifted value will not be the average. The rest of the answer is in the carry out of the most significant bit, in this case the 17th bit. Applying this to the green component of a pixel's color value (assuming that 16 bits are used to represent each color), blending these two green pixels would produce a pixel whose green value is lower than that of either original. Additionally, since pixel intensities or colors can only range from all of a color (0xFFFF) to none of a color (0x0000), the 17th bit is meaningless to us in terms of "greenness."
  1111 0000 0000 0000 (0xF000)
+ 0011 0000 0000 0000 (0x3000)
Result truncated to 16 bits 0010 0000 0000 0000 (0x2000)
Right shifted value 0x1000 (not the result we desire)
Unsigned saturating arithmetic causes values that would overflow to be "clamped" to the maximum possible value (0xFFFF) and values that would underflow to be "clamped" to the minimum (0x0000). A saturating add of these two green pixels overflows 16 bits and therefore gets "clamped" to a maximum value of 0xFFFF which is a much better representation of the color expected after blending two medium green pixels.
  0xF000
+ 0x3000
result "clamped" to maximum 0xFFFF
right shift with rounding 0x8000 (much better)
In practice, a better solution might be to first do the division by two and then add the results, in which case overflow is not a real problem. However, the conversion from YUV video representation to RGB representation is a problem that requires saturated arithmetic, where the data are all unsigned byte. The equation for this conversion is given by
Alpha MVI "unsigned saturating add for packed words"
Alpha REGISTER r5 at start
| Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
| 0x0000 | 0xFFFF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
| 0x0000 | 0x0001 | 0x0000 | 0xFFFF |
What happens?
First we take the 1's complement of register r6. The "eqv" instruction does a more general operation than this implies: it computes Rc <- Ra XOR (NOT Rb). When Rb is zero, (NOT Rb) is all 1s, and XORing anything with all 1s flips every bit, which produces the 1's complement. Why do we want the 1's complement of r6? First of all, it would work with either r5 or r6, as long as minuw4 is called with the complement and the other register. The 1's complement of a number is the largest number we can add to the original number without overflowing. Knowing that, we simply pick the smaller of one pixel's 1's complement and the corresponding pixel's original value, and do the addition. Since the value we add is never larger than the largest value that would not overflow, we are guaranteed not to overflow.
Alpha REGISTER t0 after call to eqv r6,zero,t0
| 0xFFFF | 0xFFFE | 0xFFFF | 0x0000 |
Alpha REGISTER r5 after call to minuw4 r5,t0,r5
| 0x0000 | 0xFFFE | 0x0000 | 0x0000 |
Alpha REGISTER v0 after call to addq r5,r6,v0
| 0x0000 | 0xFFFF | 0x0000 | 0xFFFF |
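The three-step saturating add traced above can be written as a portable C sketch. The instruction sequence in the comments is the one named in the trace headings; minuw4_model and sat_add_uw4 are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Model of MINUW4: unsigned minimum of each 16-bit word lane. */
static uint64_t minuw4_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t x = (a >> (16 * i)) & 0xFFFF;
        uint64_t y = (b >> (16 * i)) & 0xFFFF;
        r |= (x < y ? x : y) << (16 * i);
    }
    return r;
}

/* Unsigned saturating add of four packed words. */
static uint64_t sat_add_uw4(uint64_t r5, uint64_t r6)
{
    uint64_t t0 = ~r6;          /* eqv r6,zero,t0 : headroom left in r6 */
    r5 = minuw4_model(r5, t0);  /* minuw4 r5,t0,r5: clamp to headroom   */
    return r5 + r6;             /* addq r5,r6,v0  : no lane can carry   */
}
```

Because every lane of the clamped r5 is at most the complement of the matching lane of r6, a plain 64-bit add never carries from one word into the next.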
Alpha MVI "unsigned saturating subtract of packed words"
Alpha REGISTER r5 at start
| Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
| 0x0000 | 0x00FF | 0x0000 | 0x0001 |
Alpha REGISTER r6 at start
| 0x0000 | 0x0001 | 0x0000 | 0x00F3 |
Alpha REGISTER r6 after call to minuw4 r5,r6,r6
| 0x0000 | 0x0001 | 0x0000 | 0x0001 |
What happens?
At the start, Pixel 0 (the low word) is 0x0001 in r5 and 0x00F3 in r6. Subtracting r6 from r5 directly would produce a value outside the range of an unsigned 16-bit word, namely something negative (-242). Recall the behavior of "clamping" to the minimum value: that is the effect achieved when the minimum of each pair of pixel values is placed in r6 and then subtracted from the original r5 values. If the minimum was already in r6, as is the case with Pixel 2, the simple unsigned subtract gives the true difference, which is no greater than the original value and no less than zero. If, however, the minimum was in r5, that value is copied to r6 by the minuw4 instruction and ultimately subtracted from itself, giving zero, which is the result "clamped" to the minimum unsigned value.
Alpha REGISTER V0 after call to subq r5,r6,v0
| Pixel 3 | Pixel 2 | Pixel 1 | Pixel 0 |
| 0x0000 | 0x00FE | 0x0000 | 0x0000 |
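The saturating subtract can be sketched in portable C the same way. The two steps follow the trace headings above (minuw4, then subq); minuw4_model and sat_sub_uw4 are hypothetical names.

```c
#include <stdint.h>

/* Model of MINUW4: unsigned minimum of each 16-bit word lane. */
static uint64_t minuw4_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t x = (a >> (16 * i)) & 0xFFFF;
        uint64_t y = (b >> (16 * i)) & 0xFFFF;
        r |= (x < y ? x : y) << (16 * i);
    }
    return r;
}

/* Unsigned saturating subtract (r5 - r6) of four packed words. */
static uint64_t sat_sub_uw4(uint64_t r5, uint64_t r6)
{
    r6 = minuw4_model(r5, r6);  /* minuw4 r5,r6,r6: never subtract more than r5 */
    return r5 - r6;             /* subq r5,r6,v0  : no lane can borrow          */
}
```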
Simple Pixel Filtering Example
Consider a two-dimensional array or bitmap of 32-bit pixel values arranged as RGBA, where the RGBA (Red, Green, Blue, and Alpha) values are held in the four 8-bit components of a 32-bit unsigned long. A simple convolution filter replaces each pixel by the weighted sum of the values of the pixels surrounding it. This is done independently for R, G, B, and A. A typical filter algorithm might employ a loop as in the following "C" language example. The algorithm requires high memory bandwidth, works on byte/word integer data, and can take advantage of parallelism. These are the qualifiers for using MVI.
Loop on row
Loop on column
- Red = 0; // clear accumulators
- Green = 0;
- Blue = 0;
- Alpha = 0;
- // loop on some filter length and pull out the RGBA components
- for ( x = 0; x < lengthOfFilter; x++ ) {
- temp = ((inputArray[ row ] [ col ] >> 24 ) & 0xFF);
- Red += temp * filterValue[x];
- temp = ((inputArray[ row ] [ col ] >> 16 ) & 0xFF);
- Green += temp * filterValue[x];
- temp = ((inputArray[ row ] [ col ] >> 8 ) & 0xFF);
- Blue += temp * filterValue[x];
- temp = ((inputArray[ row ] [ col ] ) & 0xFF);
- Alpha += temp * filterValue[x];
- }
Alpha MVI code stub (the result is left in a 16-bit value which would need to be shifted and packed into 8 bits again)
- //+
- // work on 2 pixels at a time ( 64bits)
- //-
- addl a1, t8, t11 // t11=addr of data t8=offs to current pixels (2)
- ldq t4, 0(t11) // get 2 pixels from array
- unpkbw t4, t0 // unpack low four bytes RGBA
- srl t4, 32, t4 // shift the high four bytes
- unpkbw t4, t1 // unpack the high four bytes RGBA
- //+
- // t5 holds the filter value
- //-
- mulq t5, t0, t0 // multiply RGBA of pixel 1 by filter value
- mulq t5, t1, t1 // multiply RGBA of pixel 2 by filter value
- //+
- // unsigned saturating adds for RGBA accumulators
- // r6 pixel 1 RGBA accumulator
- //-
- eqv r6, zero, t2 // 1's complement of the pixel 1 accumulator (t2 assumed free)
- minuw4 t0, t2, t0 // clamp each product to the room left in r6
- addq r6, t0, r6 // accumulate RGBA's pixel 1
- //+
- // r5 pixel 2 RGBA accumulator
- //-
- eqv r5, zero, t3 // 1's complement of the pixel 2 accumulator (t3 assumed free)
- minuw4 t1, t3, t1 // clamp each product to the room left in r5
- addq r5, t1, r5 // accumulate RGBA's pixel 2
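One step of the filter loop for a single pixel (unpack, multiply by the filter value, saturating accumulate) can be modeled in portable C. This is a sketch under two stated assumptions: the accumulator's complement supplies the headroom for the clamp, and the filter value is small enough (at most 257) that no byte-times-coefficient product exceeds 16 bits, so a single 64-bit multiply does not carry between word lanes. All function names are hypothetical.

```c
#include <stdint.h>

/* Model of UNPKBW: low four bytes -> four 16-bit words. */
static uint64_t unpkbw_model(uint64_t src)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++)
        dst |= ((src >> (8 * i)) & 0xFF) << (16 * i);
    return dst;
}

/* Model of MINUW4: unsigned minimum of each 16-bit word lane. */
static uint64_t minuw4_model(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t x = (a >> (16 * i)) & 0xFFFF;
        uint64_t y = (b >> (16 * i)) & 0xFFFF;
        r |= (x < y ? x : y) << (16 * i);
    }
    return r;
}

/* Accumulate one RGBA pixel, weighted by `coeff`, into four
 * saturating 16-bit accumulators packed in `acc`. */
static uint64_t filter_accumulate(uint64_t acc, uint32_t pixel, uint64_t coeff)
{
    uint64_t prod = unpkbw_model(pixel) * coeff;  /* mulq on packed words   */
    uint64_t room = ~acc;                         /* eqv acc,zero,room      */
    return acc + minuw4_model(prod, room);        /* minuw4, then addq      */
}
```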
A Side-by-Side Example of Alpha MVI and x86 MMX
This example is a loop used to blend pixel values.
First – A "C" code example
- unsigned char *frontImage;
- unsigned char *backImage;
- unsigned char *output;
- long l_lImgSizeX;
- long l_lSizeY;
- long y;
- unsigned long pixelInLine;
- unsigned long x;
- unsigned short usFront;
- unsigned short usBack;
- unsigned short usTemp;
- l_lImgSizeX = 720;
- l_lImgSizeX >>= 2;
- l_lSizeY = 486;
- frontImage = pInputB;
- backImage = pInputA;
- output = pOutput;
- y = 0;
- do {
- pixelInLine = l_lImgSizeX; // inner-loop counter for this line
- do {
- //+
- // replace MVI code with loop
- //-
- for ( x=0;x<8;x++ ) {
- usFront = ((unsigned short)(frontImage[x] & 0x00ff));
- usBack = ((unsigned short)(backImage[x] & 0x00ff));
- usFront -= usBack;
- usFront *= s_ubAlpha;
- usFront += 0x0080;
- usTemp = usFront;
- usTemp >>= 8;
- usFront += usTemp;
- usFront >>= 8;
- usFront += usBack;
- output[x] = (unsigned char)( usFront & 0x00ff );
- }
- pixelInLine--;
- //+
- // move the pointers up to the new offset
- //-
- frontImage += 8;
- backImage += 8;
- output += 8;
- } while ( pixelInLine > 0 );
- y++;
- } while ( y < l_lSizeY );
- s_ubAlpha--;
Next – x86 MMX Code as inline assembler in a "C" file.
- LONG ImgSizeX = 720;
- LONG SizeY = 486;
- LONG y;
- static __int64 ROUND = 0x0080008000800080;
- static __int64 mmAlphaValue = 0x00FF00FF00FF00FF;
- // l_mmAlphaValue should have A | A | A | A
- __asm
- {
- pxor mm6, mm6 // Clear mm6...
- movq mm5, mmAlphaValue // mm5 = A | A | A | A
- movq mm7, ROUND
- }
- for ( y = 0; y < SizeY; y++) {
- __asm {
- mov esi, pInputB // esi is the front image
- mov edi, pInputA // edi is the back image
- mov edx, pOutput // edx is the output image
- xor ebx, ebx // ebx = 0, pixel offset
- mov ecx, ImgSizeX // ecx = for counter for pixel in line
- shr ecx, 2 // Processing 4 pixels at a time
- jz finisha // Skip if no pixels to process
loopa:
- movq mm0, [esi + ebx] // mm0 = Front[0:7]
- movq mm2, [edi + ebx] // mm2 = Back [0:7]
- movq mm1, mm0 // mm1 = mm0
- movq mm3, mm2
- punpcklbw mm0, mm6 // mm0 = Front[0:3]
- punpckhbw mm1, mm6 // mm1 = Front[4:7]
- punpcklbw mm2, mm6 // mm2 = Back [0:3]
- punpckhbw mm3, mm6 // mm3 = Back [4:7]
- psubw mm0, mm2 // mm0 = Front - Back [0:3]
- psubw mm1, mm3 // mm1 = Front - Back [4:7]
- pmullw mm0, mm5 // mm0 = FB0 * Alpha
- pmullw mm1, mm5 // mm1 = FB1 * Alpha
- paddw mm0, mm7 // mm0 = FB0 * Alpha + ROUND = C0
- paddw mm1, mm7 // mm1 = FB1 * Alpha + ROUND = C1
- movq mm2, mm0 // mm2 = C0
- movq mm3, mm1 // mm3 = C1
- psrlw mm0, 8 // mm0 = C0 >> 8
- psrlw mm1, 8 // mm1 = C1 >> 8
- paddw mm0, mm2 // mm0 = C0 + (C0 >> 8)
- paddw mm1, mm3 // mm1 = C1 + (C1 >> 8)
- psrlw mm0, 8 // mm0 = Result Pixel 0
- psrlw mm1, 8 // mm1 = Result Pixel 1
- packuswb mm0, mm1 // mm0 = Result [0:7]
- paddb mm0, [edi + ebx] // Add the back (Cached)
- movq [edx + ebx], mm0 // Store the result
- add ebx, 8 // Goto next pixel
- dec ecx // decrement counter
- jg loopa
finisha:
- add esi, ebx // Increment the pointers
- add edi, ebx
- add edx, ebx
- add eax, ebx
- mov pOutput, edx // Store back the pointers
- mov pInputB, esi
- mov pInputA, edi
}
}
- __asm emms // Clear the MMX Status
} // code adapted from a performance test example provided by Richard Fuoco
Now - Alpha MVI Code as an Alpha Assembler implementation
loopy:
- mov 486, t10 // top of for ( y=0;y<l_lSizeY )
- mov _ROUND_, t6