Motion Video Instructions (MVI)

 

James Hicks

Richard Weiss

Compaq Computer Corporation

 

February 1999

Key words: video, multimedia, high performance computing, prefetching, motion estimation, filtering, Alpha processor
 


 

What is Alpha?

Alpha is a 64-bit processor architecture. The architecture is defined by a living architectural specification. It is intended to have multiple implementations over many years; the current implementations are all superscalar, which means that multiple instructions are issued in the same cycle. For the 21164 chip, up to four instructions can be issued in a cycle, and for the 21264, up to six instructions can be issued. For each implementation there are rules that constrain the instructions that can be co-issued. For example, the 21264 has four ALUs on the integer side, but only one of them can execute an integer multiply, so at most one integer multiply instruction can be issued in a cycle.

What is MVI?

Motion Video Instructions (MVI) are the DIGITAL Alpha’s multimedia instructions. They are a set of Alpha processor instructions that use a single instruction to operate on multiple data in parallel (SIMD). This is accomplished by partitioning a 64-bit quadword into a vector of 8 separate bytes or 4 separate 16-bit words. Code that can take advantage of this parallelism can achieve up to an 8x performance boost. The instructions are:

MINUB8, MAXUB8  unsigned byte minimum/maximum
MINSB8, MAXSB8  signed byte minimum/maximum
MINUW4, MAXUW4  unsigned word minimum/maximum
MINSW4, MAXSW4  signed word minimum/maximum
PKWB, UNPKBW    pack/unpack word to byte
PKLB, UNPKBL    pack/unpack long to byte
PERR            pixel error

 
 

How to use this guide

This guide is an introduction to some techniques in high performance computing with a focus on MVI.  There are a few key techniques for designing efficient code for an Alpha processor.  The first is to make sure that your data is in the first level cache when you want to use it.  The sections on prefetching and cache blocking explain this.  The second is to take advantage of instruction level parallelism.  This is explained in the sections on software pipelining and scheduling issues.  The third technique is to take advantage of data parallelism.  This is what MVI is about.  The relevant section below explains how to detect if a processor supports MVI.  This allows the programmer to branch to alternate implementations, so that legacy code can be extended.  Each of the MVI instructions is described with a sample trace to show how it works.  There are many examples of how MVI instructions are used and a side-by-side example of MVI code with MMX code.  There are also examples of how to use MVI for motion estimation in an MPEG encoder and for image filtering.  Since MVI is not currently supported by commercial "C" compilers, it will be necessary to include some assembly code to use it.  This is relatively easy and can be done with inline assembler macros, which allow the programmer to use assembly language instructions as if they were C statements.  This can be very useful not only for MVI but for prefetching and other techniques as well.

The techniques described in this guide can be used with OpenVMS, Digital UNIX, or Windows NT. Some of the details may change with different operating systems.

When to use MVI

While these new instructions were intended to implement high quality software video encoding such as MPEG-1, MPEG-2, H.261 (ISDN video conferencing) and H.263 (Internet video conferencing), only your imagination as a software engineer will limit their uses. Any time data can be operated on in parallel you will see the benefit. Desktop Video Publishing, Video Conferencing, Internet Commerce and Interactive Training are some target trends in visual computing.

If the application processes a large amount of data as fast as possible (high memory bandwidth) and the data are all 8bit or 16bit integers (Byte/Word Integer Data) and the same operations are performed on all the data (Parallelism), then you definitely want to explore MVI. MVI can make a critical difference in achieving video-rate encoding.

Detecting MVI Capability

The 21164PC has the first implementation of MVI.  All Alpha processors since that one including the 21264 have MVI, and all future processors will implement it.

Before using MVI instructions, a program must confirm that it is running on hardware that supports these extensions. This can be determined at run time by examining bit 8 of the value that the AMASK instruction returns. Future extensions may use other AMASK bits. The code to do this is written in assembly language rather than C or C++, since no C statement corresponds to AMASK. Assembly language code can be inserted into the object file using the macro __asm. A detailed description of the instruction can be found in the Alpha Architecture Handbook.

 The AMASK instruction takes three register arguments.  The first register (Ra) must be R31.  The second register (Rb) has the input mask, which represents the requested architectural features.  There is a one in every bit position that is being queried.  The third register (Rc) is the output. Bits are cleared that correspond to architectural extensions that are present. Reserved bits and bits that correspond to absent extensions are copied unchanged.  If the result is zero, all requested features are present. Software may specify an input mask of all 1’s to determine the complete set of architectural extensions implemented by a processor. Assigned bit definitions are defined below.
 

AMASK Bit Assignments

Bit 0. Support for the byte/word extension (BWX). The instructions that comprise the BWX extension are LDBU, LDWU, SEXTB, SEXTW, STB, and STW.

Bit 1. Support for the square-root and floating-point convert extension (FIX). The instructions that comprise the FIX extension are FTOIS, FTOIT, ITOFF, ITOFS, ITOFT, SQRTF, SQRTG, SQRTS, and SQRTT.

Bit 2. Support for the count extension (CIX). The instructions that comprise the CIX extension are CTLZ, CTPOP, and CTTZ.

Bit 8. Support for the multimedia extension (MAX) The instructions that comprise the MAX extension are MAXSB8, MAXSW4, MAXUB8, MAXUW4, MINSB8, MINSW4, MINUB8, MINUW4, PERR, PKLB, PKWB, UNPKBL, and UNPKBW.

Bit 9. Support for Precise arithmetic trap reporting

 Software Note:

     Use this instruction to make instruction-set decisions; use IMPLVER to make code-tuning decisions.

Implementation Note:

Instruction encoding is implemented as follows: On 21064/21064A/21066/21068/21066A (EV4/EV45/LCA/LCA45 chips), AMASK copies Rbv to Rc.

On 21164 (EV5), AMASK copies Rbv to Rc.

On 21164A (EV56), 21164PC (PCA56), and 21264 (EV6), AMASK correctly indicates support for architecture extensions by copying Rbv to Rc and clearing appropriate bits.
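
The clearing rule described above can be modelled in portable C. The following is only a sketch for reasoning about the result, not a replacement for the instruction: the names emulate_amask, has_features, FEATURE_BWX and FEATURE_MVI are hypothetical, and the set of implemented features is passed in as a parameter because portable code has no way to query it.

```c
#include <stdint.h>

/* Hypothetical names for two of the feature bits listed above. */
#define FEATURE_BWX (UINT64_C(1) << 0)
#define FEATURE_MVI (UINT64_C(1) << 8)

/* Model of AMASK: every requested bit whose feature is implemented is
   cleared; all other requested bits are copied through unchanged. */
uint64_t emulate_amask(uint64_t requested, uint64_t implemented)
{
    return requested & ~implemented;
}

/* A zero result means that every requested feature is present. */
int has_features(uint64_t requested, uint64_t implemented)
{
    return emulate_amask(requested, implemented) == 0;
}
```

With this model, a chip that copies Rbv to Rc unchanged corresponds to implemented == 0, so has_features() is false for any non-zero request, which is the desired answer on those older processors.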

 

 

AMASK Code Examples.

Assembler file or "S" file

This short piece of code is an assembler file that can be made into an .obj file and linked to a "C" or "C++" program.

Notice #include <kxalpha.h>. This header file comes with the VC RISC edition compiler and the Microsoft SDK, and it contains a good deal of useful information about the DIGITAL Alpha. There you can find the expansion of the LEAF_ENTRY() macro, which defines an entry point for the compiler and initializes the stack pointer.

// amask.s
//
//-------------------------------------------------------------------------------------
// "C" declaration
// extern __int64 getAMASK( __int64 mask )
//
// "C++" declaration
// extern "C" { __int64 getAMASK( __int64 mask ); };
//
// mask bits are passed in a0
// results of amask passed back in v0
//
// use asaxp.exe to make an .obj file from this source and link it to
// your "C" program. For example, if you are using Visual C++, custom compile using the
// command asaxp /O0 $(InputDir)\amask.s -o $(OutDir)\amask.obj
//
// c:\your-command-line>asaxp amask.s
//-------------------------------------------------------------------------------------
//
//-
#include <kxalpha.h>
LEAF_ENTRY(getAMASK)
    amask a0,v0
    ret zero, (ra)
.end getAMASK
 
//+
// calling from "C" program
//-
extern __int64 getAMASK( __int64 mask );
void main()
{
__int64 amaskValue;
amaskValue = getAMASK( SOME_64BIT_MASK );
}
 

Inline Assembler MACRO (asm's)

This section describes how to use the function "__asm" to include assembly language code in your object file. Here it is used to insert the AMASK instruction and test bits in the mask returned, but it can be used more generally when assembly code is more suitable than C or C++. First it is necessary to convince the compiler that there is a function called "__asm." We do that with the extern long __asm( char *, ... ); declaration. This says that __asm is a function that returns a long; on an Alpha, any integer value returned from a function is in the v0 register, and the compiler knows that.

Next we declare the __asm function to be intrinsic. That means it will be implemented in native assembler word for word. This can be seen in the line below that reads #pragma intrinsic (__asm)

Following this there are several #define’s for the bit masks used to determine the availability of a desired extension. The bit we are concerned with is bit 8, defined as SUPPORT_FOR_MVI ((__int64)(0x0100))

Finally the MACRO definition is provided. This MACRO uses inline assembler to invoke the "amask" instruction. The inline code alone is as follows: __asm("amask $a0,$v0",(x) ) where (x) represents some 64bit bit mask with the bits we are interested in set or cleared as appropriate. One could just as easily write someInt64 = __asm("amask $a0,$v0", 0x0100 );, then test the value of the variable someInt64 == 0 indicating MVI is supported.

Let’s take a moment to review this inline assembler code.

The string "amask $a0,$v0" is indicating that register a0 contains the bitmask required by the amask instruction and that register v0 will receive the result. This is just as we had written this code in Alpha assembler as amask a0,v0. The last parameter in the __asm() "function" call is the bitmask value that will be placed in the a0 register before the amask instruction is executed. For UNIX, the "function" is asm() without the "__".

The MACRO "thisAlphaHas()" contains a logical NOT (!) to get the code to read and function with positive logic allowing a programmer to write

if ( thisAlphaHas( SUPPORT_FOR_MVI ) )  
//+
// Detect if this Alpha Supports MVI
//-
#include <stdio.h>

extern long __asm(char *, ...);
#pragma intrinsic (__asm)
 
#define SUPPORT_FOR_BWX ((__int64)(0x0001)) // bit0
#define SUPPORT_FOR_FIX ((__int64)(0x0002)) // bit1
#define SUPPORT_FOR_CIX ((__int64)(0x0004)) // bit2
#define SUPPORT_FOR_MVI ((__int64)(0x0100)) // bit8
#define thisAlphaHas(x) (!(__asm("amask $a0,$v0",(x))))
 
int main( )
{
    if ( thisAlphaHas( SUPPORT_FOR_MVI ) )
        printf( "MVI Supported \n");
    else
        printf( "MVI Not Supported \n");
    return 0;
}
 
//+
// quick and dirty console app to detect hardware capabilities uses AMASK and IMPLVER
// you should be able to block copy this – include some header files and build it.
//-
int Verbose = 0;
int Quiet   = 0;
int Debug   = 0;
 
int main(int Argc, char **Argv)
{
    __int64 amask, implver;
    int     i;
    int     opt;   // int rather than char so the comparison with -1 is reliable
    char   *implname[]={"21064 (EV4)", "21164 (EV5)", "21264 (EV6)"};
#define MAXIMPL  3
#define MAXAMASK 16
    char   *amaskname[]={
        "BWX (Byte Word extensions)",
        "FIX (Floating point instruction extensions)",
        "CIX (Count instruction extensions)",
        "unused3",
        "unused4",
        "unused5",
        "unused6",
        "unused7",
        "MVI (Motion video instruction extensions)",
        "Precise arithmetic trap reporting in hardware",
        "unused10",
        "unused11",
        "unused12",
        "unused13",
        "unused14",
        "unused15" };

    while((opt = getopt(Argc, Argv, "vqd?")) != -1) {
        switch (opt) {
        case 'v': Verbose = 1; break;
        case 'q': Quiet = 1; break;
        case 'd': Debug = 1; break;
        default:
            printf("cputype [options]\n/v Verbose\n/q Quiet\n/d Debug\n");
            exit(1);
        }
    }
    // amask clears the bits of features that are present, so invert the result
    amask   = ~__asm("amask $a0, $v0", -1);
    implver = __asm("implver $v0");
 
    if (!Quiet) {
        if (Debug)
            printf("Implver = %d\n", (int)implver);
        if (implver >= 0 && implver < MAXIMPL)
            printf("Implementation version: %s\n\n", implname[implver]);
        else
            printf("Implementation version: unknown\n\n");
 
        if (Debug)
            printf("Amask = 0x%x\n", (unsigned int)amask);
        printf("Architecture mask\n");
    }
    for (i=0;i<MAXAMASK;i++) {
        if (Verbose && (strncmp(amaskname[i], "unused", 6)))
            printf(" %s: %s\n", amaskname[i], (amask & (1<<i))?"YES":"NO");
        else if (!Quiet && (amask & (1<<i)))
            printf(" %s\n", amaskname[i]);
    }
    return 0;
} // code courtesy of Dave Wagner

MVI Instruction set description

Alpha MVI code and x86 MMX code are used in the following examples. No judgements or comments about relative performance of similar code are made or implied. These two architectures are very different in their respective implementations. Keep in mind that RISC architectures in general will have more assembler instructions, each representing fewer clock cycles. It is likewise important to remember that, if used improperly, these instructions can stall both chips. These examples are in no way the best way. Refer to the appropriate documentation for the best information on optimizing your code and preventing stalls. These are brute force examples of functionality only.

 

Unpack Instructions

UNPKBW – Unpack Bytes to Words expands the low four bytes in a quadword to four words in a quadword. This implies two unpacks to re-acquire all eight pixels.

Packed Byte: 8 bytes packed into 64bits
 

REGISTER r1 at start
Bit63                                                  Bit0 
Pixel 7 Pixel 6 Pixel 5 Pixel 4 Pixel 3 Pixel 2 Pixel 1 Pixel 0
 

Start with data as shown in this table representing a 64bit quadword with eight pixels represented as byte values. Assume this data is in 64bit register r1

Alpha MVI code

unpkbw    r1, r5       // unpack the low four bytes into reg r5
srl       r1, 32,   r1 // shift reg r1 right 32 bits, result in r1
unpkbw    r1, r6       // unpack the high four bytes into reg r6

x86 MMX code

pxor      mm6, mm6     // set mm6 = 0
movq      mm2, mm1     // mm2 = mm1
punpcklbw mm1, mm6     // unpack low four bytes into mm1
punpckhbw mm2, mm6     // unpack high four bytes into mm2

Alpha REGISTER r5 after unpkbw r1,r5
Bit63                                                    Bit0 
Pixel 3 Pixel 2 Pixel 1 Pixel 0
 

Alpha REGISTER r1 after srl 32
Bit63                                                    Bit0
0 0 0 0 Pixel 7 Pixel 6 Pixel 5 Pixel 4
 

Alpha REGISTER r6 after unpkbw r1,r6
Bit63                                                    Bit0
Pixel 7 Pixel 6 Pixel 5 Pixel 4
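
The unpack trace above can be checked with a portable C model. This is a sketch of the instruction's effect only, useful for testing on a machine without MVI; the function names emu_unpkbw and unpack_high_half are hypothetical.

```c
#include <stdint.h>

/* Model of UNPKBW: zero-extend bytes 0..3 of rb into words 0..3. */
uint64_t emu_unpkbw(uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (rb >> (8 * i)) & 0xFF;
        rc |= byte << (16 * i);
    }
    return rc;
}

/* Mirrors the srl / unpkbw pair used for the high four pixels. */
uint64_t unpack_high_half(uint64_t r1)
{
    return emu_unpkbw(r1 >> 32);
}
```

If pixel n holds the value n, so that r1 = 0x0706050403020100, the two results correspond to registers r5 and r6 in the trace above.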
 

UNPKBL – Unpack Bytes to Longwords expands the low two bytes in a quadword to two longwords in a quadword. This implies four unpacks to re-acquire all eight pixels. Although this instruction is used much less often, here is how it looks.

Packed Byte: 8 bytes packed into 64bits
 

Alpha REGISTER r1 at start
Bit63                                                    Bit0
Pixel 7 Pixel 6 Pixel 5 Pixel 4 Pixel 3 Pixel 2 Pixel 1 Pixel 0
 

Start with data as shown in this table representing a 64bit quadword with eight pixels represented as byte values. Assume this data is in 64bit register r1

Alpha MVI code

unpkbl r1, r5      // unpack the two low bytes into register r5
srl    r1, 16, r1  // shift register r1 right 16 bits
unpkbl r1, r6      // unpack the next two bytes into register r6
srl    r1, 16, r1  // shift register r1 right 16 bits
unpkbl r1, r7      // unpack the next two bytes into register r7
srl    r1, 16, r1  // shift register r1 right 16 bits
unpkbl r1, r8      // unpack the next two bytes into register r8

x86 MMX code

pxor      mm6, mm6  // clear mm6
movq      mm2, mm1  // mm2 = mm1
punpcklbw mm1, mm6  // unpack low four bytes into mm1
punpckhbw mm2, mm6  // unpack high four bytes into mm2
movq      mm3, mm1  // mm3 = mm1
punpcklwd mm1, mm6  // unpack low two words into mm1
punpckhwd mm3, mm6  // unpack high two words into mm3
movq      mm4, mm2  // mm4 = mm2
punpcklwd mm2, mm6  // unpack low two words into mm2
punpckhwd mm4, mm6  // unpack high two words into mm4
 

Alpha REGISTER r5
Bit63                                                  Bit0 
Pixel 1 Pixel 0
 

Alpha REGISTER r6
Bit63                                                  Bit0
Pixel 3 Pixel 2
 

Alpha REGISTER r7
Bit63                                                  Bit0
Pixel 5 Pixel 4
 

Alpha REGISTER r8
Bit63                                                  Bit0
Pixel 7 Pixel 6
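
The same kind of portable C model can be written for UNPKBL. It is a sketch under the same assumptions as before; emu_unpkbl and unpack_pixel_pair are hypothetical names.

```c
#include <stdint.h>

/* Model of UNPKBL: zero-extend bytes 0..1 of rb into longwords 0..1. */
uint64_t emu_unpkbl(uint64_t rb)
{
    uint64_t byte0 = rb & 0xFF;
    uint64_t byte1 = (rb >> 8) & 0xFF;
    return byte0 | (byte1 << 32);
}

/* Mirrors the repeated srl-by-16 / unpkbl steps: pair 0 gives pixels
   1 and 0, pair 1 gives pixels 3 and 2, and so on up to pair 3. */
uint64_t unpack_pixel_pair(uint64_t r1, unsigned pair)
{
    return emu_unpkbl(r1 >> (16 * pair));
}
```

Calling unpack_pixel_pair() with pair = 0, 1, 2, 3 corresponds to registers r5, r6, r7 and r8 in the trace above.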
 

Pack Instructions

PKWB – Truncates the four (4) component words of the input register to byte values and writes them to the low four (4) bytes of the output register.

Alpha REGISTER r5 at start of pkwb
Bit63                                                  Bit0
Pixel 3 Pixel 2 Pixel 1 Pixel 0

Alpha REGISTER r6 at start of pkwb
Bit63                                                  Bit0
Pixel 7 Pixel 6 Pixel 5 Pixel 4

Alpha MVI code

    pkwb    r6, r6       // pack four (4) words into low four (4) bytes, Hi
    pkwb    r5, r5       // pack four (4) words into low four (4) bytes, Lo
    sll     r6, 32, r6   // shift the high four left
    bis     r5, r6, v0   // logical OR the two halves into v0

x86 MMX code

    packuswb mm0, mm1   // pack the words of mm0 and mm1 into the bytes of mm0
                        // (packuswb saturates, whereas pkwb truncates)

SPECIAL NOTE: the following three lines of MVI code will not work because the upper four (4) bytes of the destination register are written with zeros by pkwb. Therefore the second pkwb would write zeros over the shifted data from the first four pixels.

    pkwb   r6, r6        // get high four (4) bytes
    sll    r6, 32, r6    // shift them over
    pkwb   r5, r6        // get low four (4) bytes will overwrite
                         // the high four bytes with zeros

Alpha REGISTER r5 after pkwb r5,r5
Bit63                                                  Bit0
0 0 0 0 Pixel 3 Pixel 2 Pixel 1 Pixel 0

Alpha REGISTER r6 after pkwb r6,r6
Bit63                                                  Bit0
0 0 0 0 Pixel 7 Pixel 6 Pixel 5 Pixel 4
 

Alpha REGISTER r6 after sll r6, 32, r6
Bit63                                                  Bit0
Pixel 7 Pixel 6 Pixel 5 Pixel 4 0 0 0 0
 

Alpha REGISTER v0 after bis r5,r6,v0
Bit63                                                              Bit0
Pixel 7 Pixel 6 Pixel 5 Pixel 4 Pixel 3 Pixel 2 Pixel 1 Pixel 0
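
A portable C model of the pack step shows both the truncation and the zeroing of the upper four bytes that the special note warns about. This is a sketch only; emu_pkwb and repack_eight_pixels are hypothetical names.

```c
#include <stdint.h>

/* Model of PKWB: keep the low byte of each of the four words of rb,
   place them in bytes 0..3 of the result, and zero bytes 4..7. */
uint64_t emu_pkwb(uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t low_byte = (rb >> (16 * i)) & 0xFF;
        rc |= low_byte << (8 * i);
    }
    return rc;
}

/* Mirrors the pkwb / pkwb / sll / bis sequence shown above. */
uint64_t repack_eight_pixels(uint64_t low_words, uint64_t high_words)
{
    return emu_pkwb(low_words) | (emu_pkwb(high_words) << 32);
}
```

Note that a word such as 0x0180 packs to 0x80: the high byte is simply discarded, so any clamping has to be done before the pack.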
 

PKLB – Truncates the two (2) component longwords in the input register to byte values and writes them to the low two (2) bytes of the output register.
 

Alpha REGISTER r5 at start
Bit63                                                    Bit0
Pixel 1 Pixel 0
 

Alpha REGISTER r6 at start
Bit63                                                  Bit0
Pixel 3 Pixel 2
 

Alpha REGISTER r7 at start
Bit63                                                  Bit0
Pixel 5 Pixel 4
 

Alpha REGISTER r8 at start
Bit63                                                  Bit0
Pixel 7 Pixel 6
 

Alpha MVI code

pklb    r5, r5     // trunc and pack 2 longwords into 2 bytes
pklb    r6, r6     // trunc and pack 2 longwords into 2 bytes
sll     r6, 16, r6 // shift pixels 3 and 2 into place
bis     r5, r6, r5 // logical OR pixels 3 and 2 into r5
pklb    r7, r7     // trunc and pack 2 longwords into 2 bytes
sll     r7, 32, r7 // shift pixels 5 and 4 into place
bis     r5, r7, r5 // logical OR pixels 5 and 4 into r5
pklb    r8, r8     // trunc and pack 2 longwords into 2 bytes
sll     r8, 48, r8 // shift pixels 7 and 6 into place
bis     r5, r8, r5 // logical OR pixels 7 and 6 into r5

Alpha REGISTER r5 after pklb r5,r5
Bit63                                                  Bit0
0 0 0 0 0 0 Pixel 1 Pixel 0
 

Alpha REGISTER r6 after pklb r6,r6
Bit63                                                  Bit0
0 0 0 0 0 0 Pixel 3 Pixel 2
 

Alpha REGISTER r6 after sll r6,16,r6
Bit63                                                  Bit0 
0 0 0 0 Pixel 3 Pixel 2 0 0
 

Alpha REGISTER r5 after bis r5,r6,r5
Bit63                                                  Bit0
0 0 0 0 Pixel 3  Pixel 2 Pixel 1 Pixel 0
 

Alpha REGISTER r7 and r8 just repeat this sequence after shifting the correct number of bits.
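
The full four-register sequence can be followed in a portable C model. As before this is a sketch for checking results without MVI hardware; emu_pklb and repack_from_longwords are hypothetical names.

```c
#include <stdint.h>

/* Model of PKLB: keep the low byte of each of the two longwords of rb,
   place them in bytes 0..1 of the result, and zero the remaining bytes. */
uint64_t emu_pklb(uint64_t rb)
{
    uint64_t byte0 = rb & 0xFF;
    uint64_t byte1 = (rb >> 32) & 0xFF;
    return byte0 | (byte1 << 8);
}

/* Mirrors the pklb / sll / bis steps applied to r5, r6, r7 and r8. */
uint64_t repack_from_longwords(uint64_t r5, uint64_t r6,
                               uint64_t r7, uint64_t r8)
{
    return emu_pklb(r5)
         | (emu_pklb(r6) << 16)
         | (emu_pklb(r7) << 32)
         | (emu_pklb(r8) << 48);
}
```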

 
 
 

 

Byte and Word Minimum and Maximum (MINxxx) (MAXxxx)

MINUB8 Vector Unsigned Byte Minimum

MINUW4 Vector Unsigned Word Minimum

MINSB8 Vector Signed Byte Minimum

MINSW4 Vector Signed Word Minimum

MAXUB8 Vector Unsigned Byte Maximum

MAXUW4 Vector Unsigned Word Maximum

MAXSB8 Vector Signed Byte Maximum

MAXSW4 Vector Signed Word Maximum

These instructions take the form

MINxxx Ra,Rb,Rc
MAXxxx Ra,Rb,Rc

where each element of register Ra is compared with the corresponding element of register Rb, and the smaller (MINxxx) or larger (MAXxxx) of the two is placed in the corresponding element of register Rc. These are vector-wise comparisons: each byte is compared when using the xxxxB8 instructions, and each word when using the xxxxW4 instructions.

Alpha MVI code

    minub8 r5, r6, r6 // get the minimums in each BYTE position
 

Alpha REGISTER r5 at start
 
1 0 1 0 1 0 1 0
 

Alpha REGISTER r6 at start
 
0 1 2 2 0 0 1 1
 

Alpha REGISTER r6 after calling minub8 r5,r6,r6
 
0 0 1 0 0 0 1 0
 

Alpha MVI code

    minuw4 r5, r6, r6 // get the minimums in each WORD position

 

Alpha REGISTER r5 at start
 
0x0000  0x00FF 0x0000 0x0001
 

Alpha REGISTER r6 at start
 
0x0000  0x0001  0x0000 0x00F3
 

Alpha REGISTER r6 after call to minuw4 r5,r6,r6
 
0x0000  0x0001  0x0000 0x0001
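
The element-wise behavior can be modelled in portable C, which is also a convenient fallback for the non-MVI code path. This is a sketch; the function names are hypothetical, and the signed variant uses the usual trick of flipping the sign bit so that an unsigned comparison gives the signed ordering.

```c
#include <stdint.h>

/* Model of MINUB8: unsigned minimum in each of the eight byte lanes. */
uint64_t emu_minub8(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t a = (ra >> (8 * i)) & 0xFF;
        uint64_t b = (rb >> (8 * i)) & 0xFF;
        rc |= (a < b ? a : b) << (8 * i);
    }
    return rc;
}

/* Model of MINUW4: unsigned minimum in each of the four word lanes. */
uint64_t emu_minuw4(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t a = (ra >> (16 * i)) & 0xFFFF;
        uint64_t b = (rb >> (16 * i)) & 0xFFFF;
        rc |= (a < b ? a : b) << (16 * i);
    }
    return rc;
}

/* Model of MINSB8: signed minimum in each byte lane. XOR with 0x80
   maps the signed order of the byte values onto the unsigned order. */
uint64_t emu_minsb8(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t a = (ra >> (8 * i)) & 0xFF;
        uint64_t b = (rb >> (8 * i)) & 0xFF;
        rc |= ((a ^ 0x80) < (b ^ 0x80) ? a : b) << (8 * i);
    }
    return rc;
}
```

The difference between the signed and unsigned forms shows up for a byte such as 0xFF, which is the largest unsigned value but represents -1 when signed.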

 

PERR (Pixel Error)

This instruction takes the eight bytes packed into each of two quadword registers, computes the absolute differences between corresponding bytes, adds the eight intermediate results, and right-aligns the sum in the destination register. The net result is that the sum of absolute differences used in motion estimation can be computed for eight pixels with a single instruction on an MVI-capable Alpha.

Paul Rubinfeld, Bob Rose and Michael McCallig present several very good examples of applications that use PERR in their paper entitled "Motion Video Extensions for Alpha." Look in Helpful URL’s for more information.

Alpha MVI code

    perr r5,r6,v0

x86 MMX code

movq     mm2,mm0 // copy mm0 to mm2
psubusb  mm0,mm1 // compute difference one way
psubusb  mm1,mm2 // compute difference the other way
por      mm0,mm1 // OR the results together
loop:            // perform some loop or other logic to add the 8 bytes
                 // in a quadword together and place the result in a
                 // quadword register
 

Alpha REGISTER r5 at start
 
1 0 1 0 1 0 1 0
 

Alpha REGISTER r6 at start
 
0 1 2 2 0 0 1 1
 

Intermediate Absolute Differences
 
1 1 1 2 1 0 0 1
 

Sum of Absolute Differences placed in Alpha REGISTER V0 (64bit)
 
0 0 0 0 0 0 0 7
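
The trace above can be reproduced with a portable C model of the sum of absolute differences. This is a sketch of what PERR computes, suitable for the non-MVI code path; emu_perr is a hypothetical name.

```c
#include <stdint.h>

/* Model of PERR: sum over the eight byte lanes of |a - b|. */
uint64_t emu_perr(uint64_t ra, uint64_t rb)
{
    uint64_t sum = 0;
    for (int i = 0; i < 8; i++) {
        int a = (int)((ra >> (8 * i)) & 0xFF);
        int b = (int)((rb >> (8 * i)) & 0xFF);
        sum += (uint64_t)(a > b ? a - b : b - a);
    }
    return sum;
}
```

The largest possible result is 8 * 255 = 2040, so many PERR results can be accumulated in one 64-bit register without any risk of overflow.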
 

  How MVI instructions are used in software

In this section, we give some guidelines and examples for using the instructions above.  The first issue is whether you are writing code for just one type of machine or not.  The section on Detecting MVI Capability showed how to determine at run time whether the current Alpha machine has MVI.  You might be writing the same code for multiple architectures, so you may need to test first to determine if the current machine is an Alpha.  The following section on portable coding explains that.  The other subsections describe how to do a bytewise add with saturation as well as a simple convolution filter.
 

Portable Coding Techniques

//+
// somewhere in your code declare a function pointer
//-
void (*videoProcessingFunctionPointer)( unsigned char *someParameter );
 
//+
// then declare your multimedia functions using any combination of
// MMX – MVI – inline assembler – or whatever
// make sure you surround them with an #if that tests for Alpha
//-
#if (defined(_M_ALPHA) || defined(__alpha)) && (defined(_MSC_VER) || defined(__DECC))
 
//+
// declare your function that uses MVI
//-
void functionUsingMVI( unsigned char *someData)
{
// some process using MVI
}
 
//+
// declare your function that does not use MVI
//-
void functionWithoutMVI( unsigned char *someData)
{
// same process but without MVI instructions
}
 
#else
 
//+
// declare your x86 version that uses MMX instructions
//-
void FunctionUsingMMX( unsigned char *someData )
{
// same process using x86 MMX code
}
 
#endif
 
 
 
//+
// somewhere in the initialization portion of your program
//-
#if (defined(_M_ALPHA) || defined(__alpha)) && (defined(_MSC_VER) || defined(__DECC))
//+
// setup the function pointer by testing for MVI support
//-
if ( thisAlphaHas( SUPPORT_FOR_MVI ) )
 
    //+
    // we can use MVI
    // point the function pointer at the MVI function
    //-
    videoProcessingFunctionPointer = functionUsingMVI;
 
else
 
    //+
    // we can not use MVI
    // point at the NON MVI version of the function
    //-
    videoProcessingFunctionPointer = functionWithoutMVI;
 
#else
 
//+
// we are building for x86 MMX point at that code
//-
videoProcessingFunctionPointer = FunctionUsingMMX;
 
#endif
 
// THEN FINALLY IN THE PROGRAM BODY
 
getMyData( &myVideoData ); // get some data
 
//+
// call the desired function through the pointer
// it will be pointing at the best fit function on this platform
//-
(*videoProcessingFunctionPointer)( &myVideoData ); // our multimedia code
 
processMyDataSomeMore( &myVideoData );
displayMyData( &myVideoData );


Unsigned Saturating Arithmetic

The current versions of MVI do not have explicit byte- or word-partitioned arithmetic operations, but it is possible to produce the same results with multiple instructions. First we explain why one would want a saturating add.  Consider the following situation. Suppose we want to blend two pixels by adding their red, green, and blue values together and dividing by two (shift right one bit). For example, if a simple add is applied to the 16-bit values shown below and then truncated to 16 bits, the answer is actually less than both of the addends, and the shifted value will not be the average. The rest of the answer is in the carry out of the most significant bit, in this case the 17th bit. Applying this to the green component of a pixel’s color value (assuming that 16 bits are used to represent each color), blending these two medium green pixels together would result in a pixel that is far less green than either of the original two. Additionally, since pixel intensities or colors can only range from all of a color (0xFFFF) to none of a color (0x0000), the 17th bit is meaningless to us in terms of "greenness."

Simple Add                 1111 0000 0000 0000  (0xF000)

                       +   0011 0000 0000 0000  (0x3000)

Result truncated to 16bits 0010 0000 0000 0000  (0x2000)

Right shifted value        0x1000 (not the result we desire)

 

Unsigned saturating arithmetic causes values that would overflow to be "clamped" to the maximum possible value (0xFFFF) and values that would underflow to be "clamped" to the minimum (0x0000). A saturating add of these two green pixels overflows 16 bits and therefore gets "clamped" to a maximum value of 0xFFFF which is a much better representation of the color expected after blending two medium green pixels.

Saturating Add              0xF000

                    +       0x3000

result "clamped" to maximum 0xFFFF

right shift with rounding   0x8000 ( much better )
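
The two worked sums above can be written out in C for a single 16-bit value. This is a minimal scalar sketch to make the arithmetic concrete; the function names are hypothetical.

```c
#include <stdint.h>

/* Ordinary add, truncated to 16 bits: the carry out is lost. */
uint16_t add_wrapping(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a + b);
}

/* Saturating add: a sum that would overflow is clamped to 0xFFFF. */
uint16_t add_saturating(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    return sum > 0xFFFF ? 0xFFFF : (uint16_t)sum;
}

/* Divide by two, rounding up, as in the "right shift with rounding". */
uint16_t halve_rounded(uint16_t v)
{
    return (uint16_t)(((uint32_t)v + 1) >> 1);
}
```

For 0xF000 and 0x3000 the wrapping add gives 0x2000 and the saturating add gives 0xFFFF, which halves to 0x8000.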

 

In practice, a better solution for blending might be to first do the division by two and then add the results, in which case overflow is not a real problem. However, the conversion from YUV video representation to RGB representation is a problem that requires saturated arithmetic, where the data are all unsigned bytes. The equation for this conversion is given by

R = 1.1644*Y + 1.5966*V – 16

If Y and V are both close to 255, then the result will be greater than 255, and saturated arithmetic will be needed. Another example is pixel filtering, which is discussed below.

 

Alpha MVI "unsigned saturating add for packed words"

eqv    r6, zero, t0 // 1’s complement of r6
minuw4 r5, t0,   r5 // get the smaller values
addq   r5, r6,   v0 // add r6 to r5 and place in v0
 Note that for unsigned packed bytes, just replace minuw4 with minub8.
 

Alpha REGISTER r5 at start
Pixel 3         Pixel 2         Pixel 1          Pixel 0
0x0000  0xFFFF 0x0000 0x0001
 

Alpha REGISTER r6 at start
 
0x0000  0x0001  0x0000 0xFFFF
 

What happens?

First we take the 1’s complement of register r6. The "eqv" instruction does a more general operation than the comment implies: it computes Rc <- Ra XOR (NOT Rb). When Rb is zero, (NOT Rb) is all 1’s, and XORing anything with all 1’s flips every bit, producing the 1’s complement. Why do we want the 1’s complement of r6? First of all, the method would work with either r5 or r6, as long as minuw4 is given the complement together with the other register, and the final add uses the register that was complemented. The 1’s complement of a number is the largest number we can add to the original number without overflowing. Knowing this, we simply pick the smaller of one pixel’s 1’s complement and the corresponding pixel’s original value, and do the addition. Since the value we picked is no larger than the largest value that would not overflow, we are guaranteed not to overflow.
 

Alpha REGISTER t0 after call to eqv r6,zero,t0
 
0xFFFF  0xFFFE  0xFFFF 0x0000
 

Alpha REGISTER r5 after call to minuw4 r5,t0,r5
 
0x0000  0xFFFE  0x0000 0x0000
 

Alpha REGISTER v0 after call to addq r5,r6,v0
 
0x0000  0xFFFF  0x0000 0xFFFF
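
The three-instruction sequence can be followed step by step in a portable C model. This is a sketch that assumes a software stand-in for minuw4; the names minuw4_model and add_saturating_uw4 are hypothetical.

```c
#include <stdint.h>

/* Software stand-in for MINUW4: unsigned minimum in each word lane. */
uint64_t minuw4_model(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t a = (ra >> (16 * i)) & 0xFFFF;
        uint64_t b = (rb >> (16 * i)) & 0xFFFF;
        rc |= (a < b ? a : b) << (16 * i);
    }
    return rc;
}

/* Saturating add of four packed unsigned words. */
uint64_t add_saturating_uw4(uint64_t r5, uint64_t r6)
{
    uint64_t headroom = ~r6;                        /* eqv    r6, zero, t0 */
    uint64_t clamped  = minuw4_model(r5, headroom); /* minuw4 r5, t0,   r5 */
    return clamped + r6;                            /* addq   r5, r6,   v0 */
}
```

Because every lane of clamped is at most 0xFFFF minus the matching lane of r6, no lane of the final 64-bit add can carry into its neighbor, which is why a single ordinary addq is sufficient.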
 

Alpha MVI "unsigned saturating subtract of packed words"

minuw4 r5, r6, r6 // get the minimums at each word
subq   r5, r6, v0 // subtract r6 from r5 and place in v0
 

Alpha REGISTER r5 at start
Pixel 3         Pixel 2         Pixel 1          Pixel 0
 
0x0000  0x00FF 0x0000 0x0001
 

Alpha REGISTER r6 at start
 
0x0000  0x0001  0x0000 0x00F3
 

Alpha REGISTER r6 after call to minuw4 r5,r6,r6
 
0x0000  0x0001  0x0000 0x0001
 

What happens?

At the start, Pixel 0 (the low word) is 0x0001 in r5 and 0x00F3 in r6. Subtracting r6 from r5 would give a value outside the range of an unsigned 16-bit word, i.e. something negative (-242). Recall the behavior of "clamping" to the minimum value: that is the effect achieved when the minimum of each pair of pixel values is placed in the r6 register and then subtracted from the original r5 register values. If the minimum pixel value was in r6, as is the case with Pixel 2, then the simple unsigned subtract gives the true difference, which cannot be negative. If, however, the minimum value was in r5, then that value is moved to r6 by the minuw4 instruction and ultimately subtracted from itself, resulting in zero, which is the difference "clamped" to the minimum unsigned value.

 

Alpha REGISTER V0 after call to subq r5,r6,v0
Pixel 3         Pixel 2         Pixel 1          Pixel 0 
0x0000  0x00FE  0x0000 0x0000
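
The subtract sequence can be modelled in the same way. This is a sketch with a software stand-in for minuw4; the names min_words and sub_saturating_uw4 are hypothetical.

```c
#include <stdint.h>

/* Software stand-in for MINUW4: unsigned minimum in each word lane. */
uint64_t min_words(uint64_t ra, uint64_t rb)
{
    uint64_t rc = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t a = (ra >> (16 * i)) & 0xFFFF;
        uint64_t b = (rb >> (16 * i)) & 0xFFFF;
        rc |= (a < b ? a : b) << (16 * i);
    }
    return rc;
}

/* Saturating subtract (r5 - r6) of four packed unsigned words. */
uint64_t sub_saturating_uw4(uint64_t r5, uint64_t r6)
{
    uint64_t subtrahend = min_words(r5, r6); /* minuw4 r5, r6, r6 */
    return r5 - subtrahend;                  /* subq   r5, r6, v0 */
}
```

Each lane of subtrahend is no larger than the matching lane of r5, so no lane borrows from its neighbor and the ordinary 64-bit subq gives the correct packed result.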
 

Simple Pixel Filtering Example

Consider a two dimensional array or bitmap of 32-bit pixel values arranged as RGBA, where the RGBA (Red, Green, Blue and Alpha) values are represented in the four 8-bit components of a 32-bit unsigned long. A simple convolution filter will replace each pixel by the weighted sum of the values of the pixels surrounding it.  This is done independently for RGBA.  A typical filter algorithm might employ a loop as in the following "C" language example. The algorithm requires high memory bandwidth, operates on byte/word integer data, and can take advantage of parallelism. These are the qualifiers for using MVI.

Loop on row

  Loop on column

Red   = 0; // clear accumulators
Green = 0;
Blue  = 0;
Alpha = 0;
// loop over the filter length and pull out the RGBA components
// (assumes a one-dimensional filter applied along the row)

for ( x = 0; x < filterLength; x++ ) {

temp = ((inputArray[ row ][ col + x ] >> 24 ) & 0xFF);
Red += temp * filterValue[x];
temp = ((inputArray[ row ][ col + x ] >> 16 ) & 0xFF);
Green += temp * filterValue[x];
temp = ((inputArray[ row ][ col + x ] >> 8 ) & 0xFF);
Blue += temp * filterValue[x];
temp = ((inputArray[ row ][ col + x ] ) & 0xFF);
Alpha += temp * filterValue[x];
}
Alpha MVI code stub (the result is left in a 16 bit value which would need to be shifted and packed into 8 bits again)
//+
// work on 2 pixels at a time ( 64bits)
//-
addl    a1, t8, t11 // t11=addr of data t8=offs to current pixels (2)
ldq     t4, 0(t11)  // get 2 pixels from array
unpkbw  t4, t0      // unpack low four bytes RGBA
srl     t4, 32, t4  // shift the high four bytes
unpkbw  t4, t1      // unpack the high four bytes RGBA
 
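//+
// t5 holds the filter value; one mulq scales all four words at once,
// so each 16-bit product must not overflow into the next word
//-
mulq   t5, t0, t0   // multiply RGBA of pixel 1 by filter value
mulq   t5, t1, t1   // multiply RGBA of pixel 2 by filter value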
 
//+
// unsigned saturating adds for RGBA accumulators
// r6 pixel 1 RGBA accumulator (t2 is scratch)
//-
eqv    r6, zero, t2 // t2 = 1’s complement of r6 (room left in each word)
minuw4 t0, t2,   t0 // get the smaller values
addq   r6, t0,   r6 // accumulate RGBA’s pixel 1
 
//+
// r5 pixel 2 RGBA accumulator (t3 is scratch)
//-
eqv    r5, zero, t3 // t3 = 1’s complement of r5 (room left in each word)
minuw4 t1, t3,   t1 // get the smaller values
addq   r5, t1,   r5 // accumulate RGBA’s pixel 2
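
The three steps of the stub (unpack bytes to words, scale all four words with one multiply, accumulate with unsigned saturation) can be checked on any machine with a small C model. This is a sketch under stated assumptions: unpkbw and minuw4 below are software emulations of the instructions, and scale_uw4 and sat_add_uw4 are hypothetical helper names. The single multiply is only valid while every product fits in 16 bits, which holds for byte-sized inputs and a filter value of at most 257.

```c
#include <assert.h>
#include <stdint.h>

/* Model of UNPKBW: zero-extend the low four bytes into four words. */
static uint64_t unpkbw(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= ((v >> (8 * i)) & 0xFF) << (16 * i);
    return r;
}

/* Model of MINUW4: per-word unsigned minimum. */
static uint64_t minuw4(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t x = (a >> (16 * i)) & 0xFFFF;
        uint64_t y = (b >> (16 * i)) & 0xFFFF;
        r |= (x < y ? x : y) << (16 * i);
    }
    return r;
}

/* One 64-bit multiply scales four byte-sized words (the mulq step). */
static uint64_t scale_uw4(uint64_t words, uint64_t factor)
{
    assert((words & 0xFF00FF00FF00FF00ULL) == 0);  /* each word <= 255    */
    assert(factor <= 257);                         /* 255 * 257 == 65535  */
    return words * factor;
}

/* Per-word unsigned saturating add: acc + min(v, ~acc). */
static uint64_t sat_add_uw4(uint64_t acc, uint64_t v)
{
    uint64_t room = ~acc;              /* eqv    acc, zero, room */
    return acc + minuw4(v, room);      /* minuw4, then addq      */
}
```

Clamping the addend to the complement of the accumulator guarantees that no word can carry into its neighbor, which is why a plain 64-bit add is enough for the final step.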

 A Side-by-Side Example of Alpha MVI and x86 MMX

This example is a loop used to blend pixel values.

First – A "C" code example

//+
//   BlendWithoutMVI
//-
void BlendWithoutMVI( UCHAR* pInputA, UCHAR* pInputB, UCHAR* pOutput )
{
    unsigned char *frontImage;
    unsigned char *backImage;
    unsigned char *output;
    long           ImgSizeX;
    long           l_lSizeY;
    long           y;
    unsigned long  pixelInLine;
    unsigned long  x;
    unsigned short usFront;
    unsigned short usBack;
    unsigned short usTemp;

    ImgSizeX   = 720;
    ImgSizeX >>= 2;       // 180 passes per line, 8 bytes per pass
    l_lSizeY   = 486;
    frontImage = pInputB;
    backImage  = pInputA;
    output     = pOutput;
    y          = 0;

    do {
        pixelInLine = ImgSizeX;

        do {
            //+
            // replace MVI code with loop
            //-
            for ( x=0;x<8;x++ ) {
                usFront   = ((unsigned short)(frontImage[x] & 0x00ff));
                usBack    = ((unsigned short)(backImage[x] & 0x00ff));
                usFront  -= usBack;
                usFront  *= s_ubAlpha;
                usFront  += 0x0080;
                usTemp    = usFront;
                usTemp  >>= 8;
                usFront  += usTemp;
                usFront >>= 8;
                usFront  += usBack;
                output[x] = (unsigned char)( usFront & 0x00ff );
            }
            pixelInLine--;
            //+
            // move the pointers up to the new offset
            //-
            frontImage += 8;
            backImage  += 8;
            output     += 8;
        } while ( pixelInLine > 0 );
        y++;
    } while ( y < l_lSizeY );
    s_ubAlpha--;  // s_ubAlpha: assumed to be a file-scope fade value declared elsewhere
}
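
The sequence in the inner loop (add 0x0080, add the value shifted right by 8, then shift right by 8) is a well-known way to divide by 255 with rounding without using a divide instruction. The following minimal C sketch, with the hypothetical helper names div255_round and div255_mismatches, compares it against the exact rounded quotient for every product of two 8-bit values. It illustrates the arithmetic only and is not part of the original blend code.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded v / 255 for 0 <= v <= 255 * 255, using shifts and adds only. */
static uint8_t div255_round(uint32_t v)
{
    assert(v <= 255u * 255u);
    uint32_t t = v + 0x80;                    /* usFront += 0x0080          */
    return (uint8_t)((t + (t >> 8)) >> 8);    /* += (t >> 8), then >>= 8    */
}

/* Count products a * b (a, b in 0..255) where the shortcut differs
 * from the exact round-to-nearest quotient. */
static unsigned div255_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t a = 0; a <= 255; a++) {
        for (uint32_t b = 0; b <= 255; b++) {
            uint32_t v = a * b;
            uint32_t exact = (v + 127) / 255;   /* round to nearest */
            if (div255_round(v) != exact)
                bad++;
        }
    }
    return bad;
}
```

Because the shortcut needs only 16-bit adds and shifts, it maps directly onto packed word operations, which is what both the MMX and MVI versions of the blend rely on.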

 

 

Next – x86 MMX Code as inline assembler in a "C" file.

//+
//   BlendUsingMMX
//-
void BlendUsingMMX( UCHAR* pInputA, UCHAR* pInputB, UCHAR* pOutput )
{
LONG             ImgSizeX = 720;
LONG             SizeY    = 486;
LONG             y;
static __int64   ROUND    = 0x0080008000800080;
static __int64 mmAlphaValue = 0x00FF00FF00FF00FF;
 
// mmAlphaValue should have A | A | A | A
 
__asm
{
pxor mm6, mm6          // Clear mm6...
movq mm5, mmAlphaValue // mm5 = A | A | A | A
movq mm7, ROUND
}
for ( y = 0; y < SizeY; y++) {
__asm {
mov       esi, pInputB     // esi is the front image
mov       edi, pInputA     // edi is the back image
mov       edx, pOutput     // edx is the output image
xor       ebx, ebx         // ebx = 0, pixel offset
mov       ecx, ImgSizeX    // ecx = for counter for pixel in line
shr       ecx, 2           // Processing 4 pixels at a time
jz        finisha          // Skip if no pixels to process
loopa:
movq      mm0, [esi + ebx] // mm0 = Front[0:7]
movq      mm2, [edi + ebx] // mm2 = Back [0:7]
movq      mm1, mm0         // mm1 = mm0
movq      mm3, mm2
punpcklbw mm0, mm6         // mm0 = Front[0:3]
punpckhbw mm1, mm6         // mm1 = Front[4:7]
punpcklbw mm2, mm6         // mm2 = Back [0:3]
punpckhbw mm3, mm6         // mm3 = Back [4:7]
psubw     mm0, mm2         // mm0 = Front - Back [0:3]
psubw     mm1, mm3         // mm1 = Front - Back [4:7]
pmullw    mm0, mm5         // mm0 = FB0 * Alpha
pmullw    mm1, mm5         // mm1 = FB1 * Alpha
paddw     mm0, mm7         // mm0 = FB0 * Alpha + ROUND = C0
paddw     mm1, mm7         // mm1 = FB1 * Alpha + ROUND = C1
movq      mm2, mm0         // mm2 = C0
movq      mm3, mm1         // mm3 = C1
psrlw     mm0, 8           // mm0 = C0 >> 8
psrlw     mm1, 8           // mm1 = C1 >> 8
paddw     mm0, mm2         // mm0 = C0 + (C0 >> 8)
paddw     mm1, mm3         // mm1 = C1 + (C1 >> 8)
psrlw     mm0, 8           // mm0 = Result Pixel 0
psrlw     mm1, 8           // mm1 = Result Pixel 1
 
packuswb  mm0, mm1         // mm0 = Result [0:7]
paddb     mm0, [edi + ebx] // Add the back (Cached)
movq      [edx + ebx], mm0 // Store the result
add       ebx, 8           // Goto next pixel
dec       ecx              // decrement counter
jg        loopa
finisha:
add       esi, ebx         // Increment the pointers
add       edi, ebx
add       edx, ebx
add       eax, ebx
mov       pOutput, edx     // Store back the pointers
mov       pInputB, esi
mov       pInputA, edi
}
}
__asm emms // Clear the MMX Status
}
// code adapted from a performance test example provided by Richard Fuoco

 

Now - Alpha MVI Code as an Alpha Assembler implementation

 
#include <kxalpha.h>
//+
// blend.S
//
// use asaxp.exe to make an .obj file from this source and link it to
// your "C" program.
//
// c:\your-command-line>asaxp blend.s
//
// ----- BlendUsingMVI ------------------------------------------------------
//
// "C" declaration
// extern void BlendUsingMVI(  UCHAR* pInputA,
//                             UCHAR* pInputB,
//                             UCHAR* pOutput,
//                             __int64 fadeValue )
//
// Register Usage:
//
// on entry...
//
// a0 holds the address of the Back image (parameter 1)
// a1 holds the address of the Front Image (parameter 2)
// a2 holds the address of the Output buffer (parameter 3)
// a3 holds the fade value (parameter 4)
//
// FYI: the first six integer parameters always go here.
// a0 - a5 are aliases for the integer registers r16-r21
//
// If these were floating point values they would go in f16-f21
// respectively.
//
// mixed parameters i.e. int, int, float
// are placed in their respective registers in order
// therefore the first two integer parameters will be in
// a0 and a1 (that is r16-r17) and the float will
// be in f18 (NOT f16) because it is the third parameter
//
// working registers
//
// t0 four bytes of the packed pixels of the front image
//
// t2 four bytes of the packed pixels of the back image
//
// t4 temporary storage
// t5 temporary storage
// t6 temporary storage
//
// t9 used as counter pixelsInLine
// it is hard coded to (720 >> 2 ) or 180 in this NTSC example
// it is divided/shifted because we consume 4 pixels (4 bytes) at a time
//
// t10 used as a loop counter (y in the "C" example) y<l_lSizeY
// this is hard coded to 486 in this NTSC example
//
// t11 used as a "C" pointer into memory
// notice right after loopa: I add the offset in t8
// to the address in a1 and store it in t11
// this yields the pointer 0(t11)
// 0(t11) reads as zero bytes off of the address in t11
//
#define byteMask 0xff
#define _ROUND_  0x0080008000800080
#define wordMask 0xff00ff00ff00ff00
#define mviAlphaValue 0x00FF00FF00FF00FF
 
LEAF_ENTRY(BlendUsingMVI)
mov  486,     t10      // top of for ( y=0;y<l_lSizeY )
mov  _ROUND_, t6
loopy: