Assembly Language Step-by-Step: Programming with DOS and Linux, 2nd Ed. [Electronic resource], text version

What's GNU?

Way back in the early 1980s, a wild-eyed Unix hacker named Richard Stallman wanted his own copy of Unix. He didn't want to pay for it, however, so he did the obvious thing: He began writing his own version. (If it's not obvious to you, well, you don't understand Unix culture.) However, he was unsatisfied with the programming tools then available and objected to their priciness as well. So, as a prerequisite to writing his own version of Unix, Stallman set out to write his own compiler, assembler, and debugger. (He had already written his own editor, the legendary EMACS.)

Stallman named his version of Unix GNU, a recursive acronym meaning GNU's Not Unix. This was a good chuckle, and one way of getting past AT&T's trademark lawyers, who were fussy in those days about who used the word Unix and how. As time went on, the GNU tools (the C compiler and its other Swiss army knife go-alongs) took on a life of their own, and as it happened, Stallman never actually finished GNU itself. Other free versions of Unix appeared, and there was some soap opera for a few years regarding who actually owned what parts of which. This so disgusted Stallman that he created the Free Software Foundation as the home base for GNU tools development and created a radical sort of software license called the GNU General Public License (GPL), which is sometimes informally called "copyleft." Stallman released the GNU tools under the GPL, which not only required that the software be free (including all source code), but also prevented people from making minor mods to the software and claiming the derivative work as their own. Anyone who distributed a modified version had to pass along its source code under the same terms.

This seemed to be major nuttiness at the time, but over the years since then it has taken on a peculiar logic and life of its own. The GPL has allowed software released under the GPL to evolve tremendously quickly, because large numbers of people were using it and improving it and giving back the improvements without charge or restriction. Out of this bubbling open source pot eventually arose Linux, the premier GPL operating system. Linux was built with and is maintained with the GNU tool set. If you're going to program under Linux, regardless of what language you're using, you will eventually use one or more of the GNU tools.

The Swiss Army Compiler

The copy of EMACS that you will find on modern distributions of Linux doesn't have a whole lot of Richard Stallman left in it—it's been rewritten umpteen times by many other people over the past 20-odd years. Where the Stallman legacy persists most strongly is in the GNU compilers. There are a number of them, but the one that you must understand as thoroughly as possible is the GNU C Compiler, gcc. (Lowercase letters are something of an obsession in the Unix world, a fetish not well understood by a lot of people, myself included.)

Why use a C compiler for working in assembly? Two reasons:

Most of Linux and all of the standard C library for Linux are written in C for gcc. The C library is the only reasonable way to communicate with Linux from an assembly program. gcc has a great deal of intimate knowledge of the standard C library, knowledge that you would have to acquire yourself if you chose not to use gcc. Love Linux, love gcc. There's no way around it.

More interestingly, gcc does much more than simply compile C code. It's a sort of Swiss army knife development tool. In fact, I might better characterize what it does as building software rather than simply compiling it. In addition to compiling C code to object code, gcc governs both the assembly step and the link step.

Assembly step? Yes, indeedy. There is a GNU assembler, gas (its command name is simply as). And a GNU linker, ld. What gcc does is control them like puppets on strings. If you use gcc (especially at the beginner level), you don't have to do much messing around with gas and ld.

Let's talk more about this.

Building Code the GNU Way

Assembly language work is a departure from C work, and gcc is first and foremost a C compiler. So, we need to look first at the process of building C code. On the surface, building a C program for Linux using the GNU tools is pretty simple. Behind the scenes, however, it's a seriously hairy business. While it looks like gcc does all the work, what gcc really does is act as master controller for several GNU tools, supervising a code assembly line that you don't need to see unless you specifically want to.

Theoretically, this is all you need to do to generate an executable binary file from C source code:


gcc eatc.c -o eatc

Here, gcc takes the file eatc.c (which is a C source code file) and crunches it to produce the file eatc. (The -o option tells gcc what to name the executable output file.) Note well that in the Linux world, executable files typically do not have file extensions, as they do under DOS and Windows. What might be eatc.com or eatc.exe under DOS is simply eatc under Linux.

However, there's more going on here than meets the eye. Take a look at Figure 12.1 as we go through it. In the figure, shaded arrows indicate movement of information. Blank arrows indicate program control.

Figure 12.1: How gcc builds Linux executables.

The programmer invokes gcc from the shell command line. gcc takes charge of the build and immediately invokes a utility called the C preprocessor, cpp. The preprocessor takes the original C source code file and handles certain items like #includes and #defines. It can be thought of as a sort of macro expansion pass on the source code file, if "macro expansion pass" means anything to you. If not, don't fret about it: it's a C thing and not germane to assembly work.

When cpp is finished with its work, gcc takes over in earnest. From the preprocessed source code file, gcc generates an assembly language source code file with a .s file extension. This is literally the assembly code equivalent of the C statements in the original .c file, in human-readable form. If you develop any skill in reading AT&T assembly syntax and mnemonics, you can learn a lot from inspecting the .s files produced by gcc.

When gcc has completed generating the assembly language equivalent of the C source code file, it invokes the GNU assembler, gas, to assemble the .s file into object code. This object code is written out in a file with a .o extension.

The final step involves the GNU linker, ld. The .o file contains binary code, but it's only the binary code generated from statements in the original .c file. The .o file does not contain the code from the standard C libraries that are so important in C programming. Those libraries have already been compiled and simply need to be linked into your application. The linker ld does this work at gcc's direction. The good part is that gcc knows precisely which of the standard C libraries need to be linked to your application to make it work, and it always includes the right libraries in their right versions. So, although gcc doesn't actually do the linking, it knows what needs to be linked—and that is valuable knowledge indeed, as you will learn if you ever try to invoke ld manually.

At the end of the line, ld spits out the fully linked and executable program file. At that point, the build is done, and gcc returns control to the Linux shell. Note that all of this is typically done with one simple command to gcc!

How We Use gcc in Assembly Work

The process I just described, and drew out for you in Figure 12.1, is how a C program is built under Linux using the GNU tools. I went into some detail here because we're going to use part—though only part—of this process to make our assembly programming easier. It's true that we don't need to convert C source code to assembly code—and in fact, we don't need gas to convert gas assembly source code to object code. But we need gcc's expertise at linking. Linking a Linux program is much more complex than linking a simple DOS program. So we're going to tap in to the GNU code-building process at the link stage, so that gcc can coordinate the link step for us.

When we assemble a Linux program using NASM, NASM generates a .o file containing binary object code. Invoking NASM under Linux is typically done this way:


nasm -f elf eatlinux.asm

This command will direct NASM to assemble the file eatlinux.asm and generate a file called eatlinux.o. The "-f elf" part of it tells NASM to generate object code in the ELF format (the acronym means Executable and Linking Format, so saying "ELF format" is redundant even though everyone does it) rather than one of the numerous other object code formats that NASM is capable of producing. The eatlinux.o file is not by itself executable. It needs to be linked. So, we call gcc and instruct it to link the program for us:


gcc eatlinux.o -o eatlinux

What in this command tells gcc to link and not compile? The only input file called out in the command is a .o file containing object code. This fact alone tells gcc that all that needs to be done is to link the file with the C library to produce the final executable. The "-o eatlinux" tells gcc that the name of the final executable file is to be "eatlinux." (Remember that Linux does not use file extensions on executable program files.)

Including the -o specifier is important. If you don't tell gcc precisely what to name the final executable file, it will name that file "a.out." Yes, "a.out," every time—irrespective of what your object file or source files are called.

Why Not gas?

You might be wondering why, if there's a perfectly good assembler installed automatically with every copy of Linux, I'm bothering to show you how to install and use another one. First of all, there is no gas lookalike for DOS as best I know, so you can't take your first steps in gas assembly while working with DOS. But more important, gas uses a peculiar syntax that is utterly unlike that of all the other familiar assemblers used in the x86 world (MASM and TASM as well as NASM) and a whole set of instruction mnemonics unique to itself. I find them ugly, nonintuitive, and hard to read. This is the AT&T syntax, so called because it was created by AT&T as a portable assembly notation to make Unix easier to port from one underlying CPU to another. It's ugly because it was designed to be generic, and it can be recast for any reasonable CPU you could come up with. (Don't forget that Unix significantly predates the x86, and gas's predecessor is older than the x86.)

If it were this simple, I wouldn't mention gas at all, since you don't need to use it to write Linux code in NASM. However, one of the major ways you'll end up learning many of the standard C library calls is by using them in short C programs and then inspecting the assembly output gcc generates. (I have more to say about this later on.) What gcc generates first when it compiles a C program is a file (with a .s extension) of assembly language source code using the AT&T syntax and mnemonics. It may not be necessary to learn the AT&T syntax thoroughly enough to write it, but it will be very helpful if you can pick it up well enough to read it. I'll show you an example later on, and when I do I'll summarize the important differences between AT&T and the NASM syntax and mnemonics, which are more properly called the Intel syntax and mnemonics.
