Assembly Language Step-by-Step: Programming with DOS and Linux, 2nd Ed. [Electronic resource] (text version)

Understanding AT&T Instruction Mnemonics

I've alluded a time or two in this book to the fact that there is more than one set of mnemonics for the x86 instruction set. There is only one set of machine instructions, but the machine instructions are pure binary bit patterns that were never intended for human consumption. A mnemonic is just that: a way for human beings to remember what the binary bit pattern 1000100111000011 means to the CPU. Instead of writing 16 ones and zeros in a row (or even the slightly more graspable hexadecimal equivalent $89 $C3), we say MOV BX,AX.

Keep in mind that mnemonics are just that, memory joggers for humans, and are creatures unknown to the CPU itself. Assemblers translate mnemonics to machine instructions. Although we can agree among ourselves that MOV BX,AX will translate to 1000100111000011, there's nothing magical about the string MOV BX,AX. We could as well [...]

[...] Figure 12.1. The gcc compiler takes as input a .c source code file, and outputs a .s assembly source file, which is then handed to the GNU assembler gas for assembly. This is the way the GNU tools work on all platforms. In a sense, assembly language is an intermediate language used mostly for the C compiler's benefit. In most cases, programmers never see it and don't have to fool with it.

In most cases. However, if you're going to deal with the GNU debugger gdb at a machine-code level (rather than at the C source code level), the AT&T mnemonics will be in your face at every single step of the way, heh-heh. In my view the usefulness of gdb is greatly reduced by its strict dependence on the AT&T instruction mnemonics. I keep looking for somebody to create a DEBUG-style debugger for Linux that uses Intel's own mnemonics, but so far I've come up empty.

Therefore, it would make sense to become at least passingly familiar with the AT&T mnemonic set. There are some general rules that, once digested, make it much easier. Here's the list in short form:

AT&T mnemonics and register names are invariably in lowercase. This is in keeping with the Unix convention of case sensitivity, and at complete variance with the Intel convention of uppercase for assembly language source. I've mixed uppercase and lowercase in the text and examples to get you used to seeing assembly source both ways, but you have to remember that while Intel (and hence NASM) suggests uppercase but will accept lowercase, AT&T requires lowercase.

Register names are always preceded by the percent symbol, %. That is, what Intel would write as AX or EBX, AT&T would write as %ax and %ebx. This helps the assembler recognize register names.

Every AT&T machine instruction mnemonic that has operands has a single-character suffix indicating how large its operands are. The suffix letters are b, w, and l, indicating byte (8 bits), word (16 bits), or long (32 bits). What Intel would write as MOV BX,AX, AT&T would write as movw %ax,%bx. (The changed order of %ax and %bx is not an error. See the next rule!)

In the AT&T syntax, source and destination operands are placed in the opposite order from Intel syntax. That is, what Intel would write as MOV BX,AX, AT&T would write as movw %ax,%bx. In other words, in AT&T syntax, the source operand comes first, followed by the destination. This actually makes a little more sense than Intel conventions, but confusion and errors are inevitable.

In the AT&T syntax, immediate operands are always preceded by the dollar sign, $. What Intel would write as PUSH DWORD 32, AT&T would write as pushl $32. This helps the assembler recognize immediate operands.

AT&T documentation refers to "sections" where we would say "segments." A segment override is thus a section override in AT&T parlance. This doesn't come into play much because segments are not a big issue in 32-bit flat model programming. Still, be aware of it.

Some Intel instruction mnemonics are seldom seen in AT&T-syntax code. JCXZ, JECXZ, LOOP, LOOPZ, LOOPE, LOOPNZ, and LOOPNE are accepted by gas only with byte-sized displacements, and gcc does not generate code that uses them. This won't be a problem for us, as we're using NASM, but you're unlikely to see these instructions in gdb displays of gcc-compiled code.

In the AT&T syntax, displacements in memory references are signed quantities placed outside parentheses containing the base, index, and scale values. I'll treat this one separately a little later, as you'll see it a lot in .s files and you should be able to understand it.

There are a handful of other issues that would be involved in programs more complex than we'll take up in this book. These mostly involve near versus far calls and jumps and their associated return instructions.

Examining gas Source Files Created by gcc

The best way to get a sense for the AT&T assembly syntax is to look at an actual AT&T-style .s file produced by gcc. Doing this actually has two benefits: First of all, it will help you become familiar with the AT&T mnemonics and formatting conventions. In addition, you may find it useful, when struggling to figure out how to call a C library function from assembly, to create a short C program that calls the function of interest and then examine the .s file that gcc produces when it compiles your C program. The dateis.c program which follows was part of my early research, and I used it to get a sense for how ctime() was called at the assembly level. Obviously, for this trick to work you must have at least a journeyman understanding of the AT&T mnemonics. (I discuss ctime() and other time-related C library calls in detail in the next chapter.)

You don't automatically get a .s file every time you compile a C program. The .s file is created, but once gas assembles the .s file to a binary object code file (typically a .o file), it deletes the .s file. If you want to examine a .s file created by gcc, you must compile with the -S option. (Note that this is an uppercase S. Case matters big time in the Unix world!) The command would look like this:


gcc dateis.c -S

Note that the output of this command is the assembly source file only. If you specify the -S option, gcc understands that you want to generate assembly source rather than an executable program file, so all it will generate is the .s file. To compile a C program to an executable program file, you must compile it again without the -S option.

Here's dateis.c. It does nothing more than print out the date and time as returned by the standard C library function ctime():


#include <time.h>
#include <stdio.h>
#include <stdlib.h>   /* declares exit() */

int main()
{
    time_t timeval;

    (void)time(&timeval);   /* fetch the current time */
    printf("The date is: %s", ctime(&timeval));
    exit(0);
}

It's not much of a program, but it does illustrate the use of three C library function calls, time(), ctime(), and printf(). When gcc compiles the preceding program (dateis.c), it produces the file dateis.s, which follows. I have manually added the equivalent Intel mnemonics as comments to the right of the AT&T mnemonics, so you can see what equals what in the two systems. (Alas, neither gcc nor any other utility I have ever seen will do this for you!)


.file      "dateis.c"
.version   "01.01"
gcc2_compiled.:
.section  .rodata
.LC0:
.string    "The date is: %s"
.text
.align 4
.globl main
.type      main,@function
main:
pushl %ebp           # push ebp
movl %esp,%ebp       # mov ebp,esp
subl $4,%esp         # sub esp,4
leal -4(%ebp),%eax   # lea eax,[ebp-4]
pushl %eax           # push eax
call time            # call time  
addl $4,%esp         # add esp,4
leal -4(%ebp),%eax   # lea eax,[ebp-4]
pushl %eax           # push eax
call ctime           # call ctime
addl $4,%esp         # add esp,4
movl %eax,%eax       # mov eax,eax
pushl %eax           # push eax
pushl $.LC0          # push dword .LC0
call printf          # call printf
addl $8,%esp         # add esp,8
pushl $0             # push dword 0
call exit            # call exit
addl $4,%esp         # add esp,4
.p2align 4,,7
.L1:
leave                # leave
ret                  # ret
.Lfe1:
.size      main,.Lfe1-main
.ident     "GCC: (GNU) egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)"

One thing to keep in mind when reading this is that dateis.s is assembly language code produced mechanically by a compiler, and not by a human programmer! Some things about the code (such as why the label .L1 is present but never referenced) are less than ideal and can only be explained as artifacts of gcc's compilation machinery. In a more complex program there may be some customary use of a label .L1 that doesn't exist in a program this simple.

Some quick things to note here while reading the preceding listing:

When an instruction has no operand whose size needs to be stated (call, leave, ret), it does not have an operand-size suffix. Calls and returns look pretty much alike in both Intel and AT&T syntax.

When referenced, the name of a message string is prefixed by a dollar sign ($) the same way that numeric literals are. In NASM, a named string variable is considered a variable and not a literal. This is just another AT&T peccadillo to be aware of.

Note that the comment delimiter in the AT&T scheme is the pound sign (#) rather than the semicolon used in nearly all Intel-style assemblers, including NASM.

AT&T Memory Reference Syntax

As you'll remember from earlier chapters, referencing a memory location (as distinct from referencing its address) is done by enclosing the location of the address in square brackets, like so:


mov eax, dword [ebp]

Here, we're taking whatever 32-bit quantity is located at the address contained in EBP and loading it into register EAX. The x86 processors allow a number of different ways of specifying the address. To a core address called a base we can add another register called an index, and to that a constant value called a displacement. We used this sort of addressing to locate a string within a table of strings back in Chapter 11. Such addressing modes can look like this:


mov eax, dword [ebx-4]    ; Base minus displacement
mov al, byte [bx+di+28]   ; Base plus index plus displacement

I haven't really covered this, but you can add an additional factor to the index called a scale, which is a power of two by which you multiply the index:


mov al, byte [ebx+edi*4]

The scale can't be any arbitrary value, but must be one of 2, 4, or 8. (The value 1 is legal but doesn't accomplish anything useful.) This mode, called scaled indexed addressing, is only available in 32-bit flat model and will not work in 16-bit modes at all—which is why I haven't mentioned it in this book before now.

All of the examples I've shown you so far use the Intel syntax. The AT&T syntax for memory addressing is considerably different. In place of square brackets, AT&T uses parentheses to enclose the components of a memory address:


movb (%ebx),%al    # mov al, byte [ebx] in Intel syntax

Here, we're moving the byte quantity at [ebx] to AL. (Don't forget that the order of operands is reversed from what Intel syntax does!) Inside the parentheses you place the base, the index, and the scale, when present. (The base is nearly always there, although the syntax allows it to be omitted.) The displacement, when one exists, must go in front of and outside the parentheses:


movl -4(%ebx),%eax        # mov eax, dword [ebx-4] in Intel syntax
movb 28(%ebx,%edi),%al    # mov al, byte [ebx+edi+28] in Intel syntax

Note that in AT&T syntax, you don't do the math inside the parentheses. The base, index, and scale are separated by commas, and plus signs and asterisks are not allowed. The schema for interpreting an AT&T memory reference is as follows:


±disp(base,index,scale)

The ± symbol indicates that the displacement is signed; that is, it may be either positive or negative, to indicate whether the displacement value is added to or subtracted from the rest of the address. Typically, you only see the sign as explicitly negative; without the minus symbol, the assumption is that the displacement is positive. The displacement value is optional. You may omit it entirely if there's no displacement in the memory reference. Similarly, you may omit the scale if there is no scale value present in the effective address.

What you will see most of the time, however, is a very simple type of memory reference:


-16(%ebp)

The displacements will vary, of course, but what this almost always means is that an instruction is referencing a data item somewhere on the stack. C code allocates its variables on the stack, in a stack frame, and then references those variables by constant offsets from the value in EBP. EBP acts as a "thumb in the stack," and items on the stack may be referenced in terms of offsets (either positive or negative) away from EBP. The preceding reference would tell a machine instruction to work with an item at the address in EBP minus 16 bytes.

I have a lot more to say about stack frames in the next chapter.
