Assembly Language Step-by-Step: Programming with DOS and Linux, 2nd Ed. [Electronic resource] (text version)

The Bones of an Assembly Language Program

The following listing is perhaps the simplest correct program that will do anything visible and still be comprehensible and expandable. This issue of comprehensibility is utterly central to quality assembly language programming. With no other computer language (not even APL or that old devil FORTH) is there anything even close to the risk of writing code that looks so much like something scraped off the wall of King Tut's tomb.

The program EAT.ASM displays one (short) line of text on your display screen:


Eat at Joe's!

For that you have to feed 28 lines of text file to the assembler. Many of those 28 lines are unnecessary in the strict sense, but serve instead as commentary to allow you to understand what the program is doing (or more important, how it's doing it) six months or a year from now.

One of the aims of assembly language coding is to use as few instructions as possible in getting the job done. This does not mean creating as short a source code file as possible. (The size of the source file has nothing to do with the size of the executable file assembled from it!) The more comments you put in your file, the better you'll remember how things work inside the program the next time you pick it up. I think you'll find it amazing how quickly the logic of a complicated assembly language file goes cold in your head. After no more than 48 hours of working on other projects, I've come back to assembler projects and had to struggle to get back to flank speed on development.

Comments are neither time nor space wasted. IBM used to say, "One line of comments per line of code." That's good, and it should be considered a minimum for assembly language work. A better course (that I will in fact follow in the more complicated examples later on) is to use one short line of commentary to the right of each line of code, along with a comment block at the start of each sequence of instructions that work together in accomplishing some discrete task.

Here's the program. Read it carefully:


; Source name     : EAT.ASM
; Executable name : EAT.COM
; Code model      : Real mode flat model
; Version         : 1.0
; Created date    : 6/4/1999
; Last update     : 9/10/1999
; Author          : Jeff Duntemann
; Description     : A simple example of a DOS .COM file programmed using
;                   NASM-IDE 1.1 and NASM 0.98.
[BITS 16]          ; Set 16 bit code generation
[ORG 0100H]        ; Set code start address to 100h (COM file)
[SECTION .text]    ; Section containing code
START:
mov  dx, eatmsg  ; Mem data ref without [] loads the ADDRESS!
mov  ah,9        ; Function 9 displays text to standard output.
int  21H         ; INT 21H makes the call into DOS.
mov  ax, 04C00H  ; This DOS function exits the program
int  21H         ; and returns control to DOS.
[SECTION .data]    ; Section containing initialized data
eatmsg   db "Eat at Joe's!", 13, 10, "$" ;Here's our message

The Simplicity of Flat Model

After all our discussion in previous chapters about segments, this program might seem, um... suspiciously simple. And indeed it's simple, and it's simple almost entirely because it's written for the 16-bit real mode flat model. (I drew this model out in Figure 6.8.) The first thing you'll notice is that there are no references to segments or segment registers anywhere. The reason for this is that in real mode flat model, you are inside a single segment, and everything you do, you do within that single segment. If everything happens within one single segment, the segments (in a sense) "factor out" and you can imagine that they don't exist. Once we assemble EAT.ASM and create a runnable program from it, I'll show you what those segment registers are up to and how it is that you can almost ignore them in real mode flat model.

But first, let's talk about what all those lines are doing.

At the top is a summary comment block. This text is for your use only. When NASM processes a .ASM file, it strips out and discards all text between any semicolon and the end of the line the semicolon is in. Such lines are comments, and they serve only to explain what's going on in your program. They add nothing to the executable file, and they don't pass information to the assembler. I recommend placing a summary comment block like this at the top of every source code file you create. Fill it with information that will help someone else understand the file you've written or that will help you understand the file later on, after it's gone cold in your mind.

Beneath the comment block is a short sequence of commands directed to the assembler. These commands are placed in square brackets so that NASM knows that they are for its use, and are not to be interpreted as part of the program.

The first of these commands is this:


[BITS 16]          ; Set 16 bit code generation

The BITS command tells NASM that the program it's assembling is intended to be run in real mode, which is a 16-bit mode. Using [BITS 32] instead would have brought into play all the marvelous 32-bit protected mode goodies introduced with the 386 and later x86 CPUs. On the other hand, DOS can't run protected mode programs, so that wouldn't be especially useful.

The next command requires a little more explanation:


[ORG 0100h]        ; Set code start address to 100h (COM file)

"ORG" is an abbreviation of origin, and what it specifies is sometimes called the origin of the program: the offset within the segment at which the program's code begins. For this program, that offset is 0100H. (The h and H are interchangeable.) DOS loads every .COM file at offset 0100H of its segment and sets the instruction pointer IP to 0100H when it runs the program; the ORG command tells NASM to calculate the addresses of labels and variables to match. So, when DOS turns control over to your program (scary thought, that!), the first instruction to be executed is the one pointed to by IP, in this case the one at 0100H.

Why 0100H? Look back at Figure 6.8. The real mode flat model (which is often called the .COM file model) has a 256-byte prefix at the beginning of its single segment. This is the Program Segment Prefix (PSP) and it has several uses that I won't be explaining here. The PSP is basically a data buffer and contains no code. The code cannot begin until after the PSP, so DOS loads it just past those first 256 bytes, and the 0100H value in the ORG command is there to tell NASM that the code starts at that offset.
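To make the role of the origin concrete, here is a minimal sketch (not one of the book's listings) of a .COM program, with comments noting how the 0100H origin feeds into the addresses NASM assigns. The identifier msg and its text are made up for illustration, and the sketch assumes the file is assembled as a flat binary, for example with nasm -f bin.

```nasm
; ORGDEMO.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]              ; 16-bit code generation
[ORG 0100H]            ; DOS loads a .COM file at offset 0100H

[SECTION .text]        ; Section containing code
START:                 ; First byte of the file, so START is 0100H
    mov  dx, msg       ; NASM fills in 0100H plus msg's offset in the file
    mov  ah, 9         ; DOS function 9 displays a $-terminated string
    int  21H           ; Call DOS
    mov  ax, 04C00H    ; DOS function 4CH, exit with AL = 0
    int  21H

[SECTION .data]        ; Section containing initialized data
msg  db "Origin test", 13, 10, "$"
```

If the ORG line were left out, NASM would number the file's bytes starting from 0 rather than 0100H, and the address loaded into DX would fall 256 bytes short of the message, somewhere inside the PSP.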

The next command is this:


[SECTION .text]    ; Section containing code

NASM divides your programs into what it calls sections. These sections are less important in real mode flat model than in real mode segmented model, when sections map onto segments. (More on this later.) In flat model, you have only one segment. But the SECTION commands tell NASM where to look for particular types of things. In the .text section, NASM expects to find program code. A little further down the file you'll see another SECTION command, this one for the .data section. In the .data section, NASM expects to find the definitions for your initialized variables. A third section is possible, the .bss section, which contains uninitialized data. EAT.ASM does not use any uninitialized data, so this section does not exist in this program. I discuss uninitialized data later on, in connection with the stack.
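EAT.ASM has no .bss section, but a rough sketch of what one could look like may help. The identifiers InBuf and Count are hypothetical; RESB and RESW are NASM's directives for reserving bytes and words without giving them initial values.

```nasm
; BSSDEMO.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]
[ORG 0100H]

[SECTION .text]            ; Section containing code
START:
    mov  word [Count], 0   ; Uninitialized data gets its value at run time
    mov  ax, 04C00H        ; Exit to DOS with ERRORLEVEL 0
    int  21H

[SECTION .bss]             ; Section containing uninitialized data
InBuf   resb 80            ; Reserve 80 bytes, no initial values
Count   resw 1             ; Reserve one 16-bit word, no initial value
```

Because nothing in .bss has an initial value, NASM has no data to write for it into the .COM file; it only needs to know where InBuf and Count will live in memory.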

Labels

The next item in the file is something called a label:


START:

A label is a sort of bookmark, holding a place in the program code and giving it a name that's easier to remember than a memory address. The START: label indicates where the program begins. Technically speaking, the START: label isn't necessary in EAT.ASM. You could eliminate the START: label and the program would still assemble and run. However, I think that every program should have a START: label as a matter of discipline. That's why EAT.ASM has one.

Labels are used to indicate where JMP instructions should jump to, and I explain that in detail later in this chapter and in later chapters. The only distinguishing characteristic of labels is that they're followed by colons. Some rules govern what constitutes a valid label:

Labels must begin with a letter or with an underscore, period, or question mark. These last three have special meanings (especially the period), so I recommend sticking with letters until you're way further along in your study of assembly language and NASM.

Labels must be followed by a colon when they are defined. This is basically what tells NASM that the identifier being defined is a label. NASM will punt if no colon is there and will not flag an error, but the colon nails it, and prevents a misspelled mnemonic from being mistaken for a label. So use the colon!

Labels are case sensitive. So yikes:, Yikes:, and YIKES: are three completely different labels. This differs from practice in a lot of languages (Pascal particularly) so keep it in mind.

Later on, we'll see such labels used as the targets of jump instructions. For example, the following machine instruction transfers the flow of instruction execution to the location marked by the label GoHome:


JMP GoHome

Notice that the colon is not used here. The colon is only placed where the label is defined, not where it is referenced. Think of it this way: Use the colon when you are marking a location, not when you are going there.
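Here is a small sketch of the colon rule in a complete program. It is an illustration rather than one of the book's listings; the identifier skipmsg and its text are made up.

```nasm
; JMPDEMO.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
    jmp  GoHome          ; No colon where the label is referenced
    mov  dx, skipmsg     ; These three instructions are jumped over
    mov  ah, 9           ; and never execute
    int  21H
GoHome:                  ; Colon where the label is defined
    mov  ax, 04C00H      ; Exit to DOS
    int  21H

[SECTION .data]
skipmsg  db "You should never see this.", 13, 10, "$"
```

Run as written, the program displays nothing, because execution goes straight from the JMP to the instruction marked by GoHome:.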

Variables for Initialized Data

The identifier eatmsg defines a variable. Specifically, eatmsg is a string variable (more on which follows) but still, as with all variables, it's one of a class of items we call initialized data: something that comes with a value, and not just a box that will accept a value at some future time. A variable is defined by associating an identifier with a data definition directive. Data definition directives look like this:


MyByte      DB 07H            ; 8 bits in size
MyWord      DW 0FFFFH         ; 16 bits in size
MyDouble    DD 0B8000000H     ; 32 bits in size

Think of the DB directive as "Define Byte." DB sets aside one byte of memory for data storage. Think of the DW directive as "Define Word." DW sets aside one word of memory for data storage. Think of the DD directive as "Define Double." DD sets aside a double word in memory for storage, typically for full 32-bit addresses.

I find it useful to put some recognizable value in a variable whenever I can, even if the value is to be replaced during the program's run. It helps to be able to spot a variable in a DEBUG dump of memory rather than to have to find it by dead reckoning; that is, by spotting the closest known location to the variable in question and counting bytes to determine where it is.

String Variables

String variables are an interesting special case. A string is just that: a sequence or string of characters, all in a row in memory. A string is defined in EAT.ASM:


eatmsg   DB "Eat at Joe's!", 13, 10, "$" ;Here's our message

Strings are a slight exception to the rule that a data definition directive sets aside a particular quantity of memory. The DB directive ordinarily sets aside one byte only. However, a string may be any length you like, as long as it remains on a single line of your source code file. Because there is no data directive that sets aside 16 bytes, or 42, strings are defined simply by associating a label with the place where the string starts. The eatmsg label and its DB directive specify one byte in memory as the string's starting point. The number of characters in the string is what tells the assembler how many bytes of storage to set aside for that string.

Either single quote (') or double quote (") characters may be used to delineate a string, and the choice is up to you, unless you're defining a string value that itself contains one or more quote characters. Notice in EAT.ASM the string variable eatmsg contains a single-quote character used as an apostrophe. Because the string contains a single-quote character, you must delineate it with double quotes. The reverse is also true: If you define a string that contains one or more double-quote characters, you must delineate it with single-quote characters:


Yukkh    DB    'He said, "How disgusting!" and threw up.',"$"

You may combine several separate substrings into a single string variable by separating the substrings with commas. Both eatmsg and Yukkh do this. Both add a dollar sign ($) in quotes to the end of the main string data. The dollar sign is used to mark the end of the string for the mechanism that displays the string to the screen. More on that mechanism and marking string lengths in a later section.

What, then, of the "13,10" in eatmsg? This is the carriage return and linefeed pair I discussed in an earlier chapter. Inherited from the ancient world of electromechanical Teletype machines, these two characters are recognized by DOS as meaning the end of a line of text that is output to the screen. If anything further is output to the screen, it will begin at the left margin of the next line below. You can concatenate such individual numbers within a string, but you must remember that they will not appear as numbers. A string is a string of characters. A number appended to a string will be interpreted by most operating system routines as an ASCII character. The correspondence between numbers and ASCII characters is shown in Appendix D.
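As a sketch of how the carriage return and linefeed pair behaves, the following variation on EAT.ASM displays two lines of text with a single DOS call. The identifier twolines and the second line of text are invented for the example. It also relies on the fact that NASM places consecutive DB directives one after another in memory, so the three DB lines form one continuous string.

```nasm
; TWOLINES.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
    mov  dx, twolines    ; Address of the first byte of the string
    mov  ah, 9           ; DOS function 9 displays text up to the "$"
    int  21H
    mov  ax, 04C00H      ; Exit to DOS
    int  21H

[SECTION .data]
twolines db "Eat at Joe's!", 13, 10      ; First line, ended by CR/LF
         db "Open all night.", 13, 10    ; Second line follows in memory
         db "$"                          ; End-of-string marker for DOS
```

The 13 and 10 never appear on the screen as numbers; each pair simply moves the cursor to the left margin of the next line.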

Directives versus Instruction Mnemonics

Data definition directives look a little like machine instruction mnemonics, but they are emphatically not machine instructions. One very common mistake made by beginners is looking for the binary opcode represented by a directive such as DB or DW. There is no binary opcode for DW, DB, and the other directives. Machine instructions, as the name implies, are instructions to the CPU itself. Directives, by contrast, are instructions to the assembler.

Understanding directives is easier when you understand the nature of the assembler's job. (Look back to Chapter 4 for a detailed refresher if you've gotten fuzzy on what assemblers and linkers do.) The assembler scans your source code text file, and as it scans your source code file it builds an object code file on disk. It builds this object code file step by step, one byte at a time, starting at the beginning of the file and working its way through to the end. When it encounters a machine instruction mnemonic, it figures out what binary opcode is represented by that mnemonic and writes that binary opcode (which may be anywhere from one to six actual bytes) to the object code file.

When the assembler encounters a directive such as DW, it does not write any opcode to the object code file. DW is a kind of signpost to the assembler, reading "Set aside two bytes of memory right here, for the value that follows." The DW directive specifies an initial value for the variable, and so the assembler writes the bytes corresponding to that value in the two bytes it set aside. The assembler writes the address of the allocated space into a table, beside the label that names the variable. Then the assembler moves on, to the next directive (if there are further directives) or to whatever comes next in the source code file.

For example, when you write the following statement in your assembly language program:


MyVidOrg    DW    0B800H

what you are really doing is instructing the assembler to set aside two bytes of data (Define Word, remember) and place the value 0B800H in those two bytes. The assembler writes the identifier MyVidOrg and the variable's address into a table it builds of identifiers (both labels and variables) in the program for later use by other elements of the program, or the linker.
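To see that directives produce data rather than opcodes, consider this sketch of a source file containing nothing but data definitions. The comments show the bytes each directive emits, in hexadecimal; x86 stores multibyte values low byte first. Assembled on its own with nasm -f bin, a file like this should come out seven bytes long and contain no machine instructions at all.

```nasm
; DATAONLY.ASM - hypothetical example: directives only, no instructions
MyByte     DB 07H          ; Emits one byte:   07
MyVidOrg   DW 0B800H       ; Emits two bytes:  00 B8
MyDouble   DD 0B8000000H   ; Emits four bytes: 00 00 00 B8
```

The identifiers themselves (MyByte, MyVidOrg, MyDouble) occupy no space in the output. They exist only in the assembler's table, alongside the addresses they name.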

The Difference between a Variable's Address and Its Contents

I've left discussion of EAT.ASM's machine instructions for last, at least in part because they're easy to explain. All that EAT.ASM does, really, is hand a string to DOS and tell DOS to display it on the screen by sending it to something called standard output. It does this by passing the address of the string to DOS, not the character values contained in the string itself. This is a crucial distinction that trips up a lot of beginners. Here's the first instruction in EAT.ASM:


mov  dx, eatmsg    ; Mem data ref without [] loads the ADDRESS!

If you look at the program, you can see that while DX is 2 bytes in size, the string eatmsg is 16 bytes in size (13 characters of text, the carriage return and linefeed, and the dollar sign). At first glance, this MOV instruction would seem impossible, but that's because what's being moved is not the string itself, but the string's address, which (in the real mode flat model) is 16 bits (2 bytes) in size. The address will thus fit nicely in DX.

When you place a variable's identifier in a MOV instruction, you are accessing the variable's address, as explained previously. By contrast, if you want to work with the value stored in that variable, you must place the variable's identifier in square brackets. Suppose you had defined a variable in the .data section called MyData this way:


MyData    DW    0744H

The identifier MyData represents some address in memory, and at that address the assembler places the value 0744H. Now, if you want to copy the value contained in MyData to the AX register, you would use the following MOV instruction:


MOV AX,[MyData]

After this instruction, AX would contain 0744H.
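The two forms can be put side by side in a short sketch. This is an illustration only; the choice of BX to hold the address is arbitrary.

```nasm
; ADDRDEMO.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
    mov  bx, MyData      ; No brackets: BX receives the ADDRESS of MyData
    mov  ax, [MyData]    ; Brackets: AX receives the CONTENTS, 0744H
    mov  ax, 04C00H      ; Exit to DOS
    int  21H

[SECTION .data]
MyData   DW 0744H
```

After the first two instructions, BX holds wherever NASM placed MyData within the segment, while AX holds 0744H. The two values are unrelated unless you arrange otherwise.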

There are many situations in which you need to move the address of a variable into a register rather than the contents of the variable. In fact, you may find yourself moving the addresses of variables around more than the contents of the variables, especially if you make a lot of calls to DOS and BIOS services.

If you've used higher-level languages such as Basic and Pascal, this distinction may seem inane. After all, who would mistake the contents of a variable for its location? Well, that's easy for you to say: in Basic and Pascal you rarely if ever even think about where a variable is. The language handles all that rigmarole for you. In assembly language, knowing where a variable is located is essential in order to do lots of important things.

Making DOS Calls

What EAT.ASM really does, as I mentioned previously, is call DOS and instruct DOS to display a string located at a particular place in memory. The string itself doesn't go anywhere; EAT.ASM tells DOS where the string is located, and then DOS reaches up into your .data section and does what it must with the string data.

Calling DOS is done with something called a software interrupt. I explain these in detail later in this chapter. But if you look at the code you can get a sense for what's going on:


mov  dx, eatmsg    ; Mem data ref without [] loads the ADDRESS!
mov  ah,9          ; Function 9 displays text to standard output.
int  21H           ; INT 21H makes the call into DOS.

Here, the first line loads the address of the string into register DX. The second line simply loads the constant value 9 into register AH. The third line makes the interrupt call, to interrupt 21H.

The DOS call has certain requirements that must be set up before the call is made. It must know what particular call you want to make, and each call has a number. This number must be placed in AH and, in this case, is call 09H (Display String). For this particular DOS call, DOS expects the address of the string to be displayed to be in register DX. If you satisfy those two conditions, you can make the DOS software interrupt call INT 21H, and there's your string on the screen!

Exiting the Program and Setting ERRORLEVEL

Finally, the job is done, Joe's has been properly advertised, and it's time to let DOS have the machine back. Another DOS service, 4CH (Terminate Process), handles the mechanics of courteously disentangling the machine from EAT.ASM's clutches. Terminate Process doesn't need the address of anything, but it will take whatever value it finds in the AL register and place it in the ERRORLEVEL DOS variable. DOS batch programs can test the value of ERRORLEVEL and branch on it.

EAT.ASM doesn't do anything worth testing in a batch program, but if ERRORLEVEL will be set anyway, it's a good idea to provide some reliable and harmless value for ERRORLEVEL to take. This is why 0 is loaded into AL prior to ending it all by the final INT 21H instruction. (The single instruction mov ax, 04C00H loads 4CH into AH and 0 into AL at the same time.) If you were to test ERRORLEVEL after running EAT.COM, you would find it set to 0 in every case.
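For comparison, here is a minimal sketch of a program that does nothing except hand a nonzero value back through ERRORLEVEL. The value 1 is an arbitrary choice for illustration. AH and AL are loaded separately to show which half of AX carries which piece of information.

```nasm
; EXIT1.ASM - hypothetical example; assemble with: nasm -f bin
[BITS 16]
[ORG 0100H]

[SECTION .text]
START:
    mov  ah, 04CH        ; DOS service 4CH: Terminate Process
    mov  al, 1           ; Value DOS will place in ERRORLEVEL
    int  21H             ; Return control to DOS
```

A batch file that runs this program could then test ERRORLEVEL and branch on the result.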

That's really all there is to EAT.ASM. Now let's see what it takes to run it, and then let's look more closely at its innards in memory.
