Memory Areas
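Memory areas are represented by a memory area object, which is stored in the vm_area_struct structure and defined in <linux/mm.h>. Memory areas are often called virtual memory areas or VMAs in the kernel. The vm_area_struct structure describes a single memory area over a contiguous interval in a given address space. The kernel treats each memory area as a unique memory object with its own properties, such as permissions and a set of associated operations. In this way a single structure can represent many types of memory areas, for example memory-mapped files or the process's user-space stack. This mirrors the object-oriented approach taken by the VFS layer (see Chapter 12, "The Virtual Filesystem"). Here's the structure, with comments added describing each field: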
struct vm_area_struct {
        struct mm_struct *vm_mm;             /* associated mm_struct */
        unsigned long vm_start;              /* VMA start, inclusive */
        unsigned long vm_end;                /* VMA end, exclusive */
        struct vm_area_struct *vm_next;      /* list of VMAs */
        pgprot_t vm_page_prot;               /* access permissions */
        unsigned long vm_flags;              /* flags */
        struct rb_node vm_rb;                /* VMA's node in the tree */
        union {         /* links to address_space->i_mmap or i_mmap_nonlinear */
                struct {
                        struct list_head list;
                        void *parent;
                        struct vm_area_struct *head;
                } vm_set;
                struct prio_tree_node prio_tree_node;
        } shared;
        struct list_head anon_vma_node;      /* anon_vma entry */
        struct anon_vma *anon_vma;           /* anonymous VMA object */
        struct vm_operations_struct *vm_ops; /* associated ops */
        unsigned long vm_pgoff;              /* offset within file */
        struct file *vm_file;                /* mapped file, if any */
        void *vm_private_data;               /* private data */
};
Recall that each memory area is associated with a unique interval in the process's address space. The vm_start field is the initial (lowest) address in the interval, and the vm_end field is the first byte after the final (highest) address in the interval. That is, vm_start is the inclusive start and vm_end is the exclusive end of the memory interval. Thus, vm_end - vm_start is the length in bytes of the memory area, which exists over the interval [vm_start, vm_end). Intervals in different memory areas in the same address space cannot overlap.
The vm_mm field points to this VMA's associated mm_struct. Note that each VMA is unique to the mm_struct with which it is associated. Therefore, even if two separate processes map the same file into their respective address spaces, each has a unique vm_area_struct to identify its unique memory area. Conversely, two threads that share an address space also share all the vm_area_struct structures therein.
VMA Flags
The vm_flags field contains bit flags, defined in <linux/mm.h>, that specify the behavior of and provide information about the pages contained in the memory area. Unlike permissions associated with a specific physical page, the VMA flags specify behavior for which the kernel, not the hardware, is responsible. Furthermore, vm_flags contains information that relates to each page in the memory area, or to the memory area as a whole, and not to specific individual pages. Table 14.1 lists the possible vm_flags values.
Flag | Effect on the VMA and its pages |
---|---|
VM_READ | Pages can be read from |
VM_WRITE | Pages can be written to |
VM_EXEC | Pages can be executed |
VM_SHARED | Pages are shared |
VM_MAYREAD | The VM_READ flag can be set |
VM_MAYWRITE | The VM_WRITE flag can be set |
VM_MAYEXEC | The VM_EXEC flag can be set |
VM_MAYSHARE | The VM_SHARED flag can be set |
VM_GROWSDOWN | The area can grow downward |
VM_GROWSUP | The area can grow upward |
VM_SHM | The area is used for shared memory |
VM_DENYWRITE | The area maps an unwritable file |
VM_EXECUTABLE | The area maps an executable file |
VM_LOCKED | The pages in this area are locked |
VM_IO | The area maps a device's I/O space |
VM_SEQ_READ | The pages seem to be accessed sequentially |
VM_RAND_READ | The pages seem to be accessed randomly |
VM_DONTCOPY | This area must not be copied on fork() |
VM_DONTEXPAND | This area cannot grow via mremap() |
VM_RESERVED | This area must not be swapped out |
VM_ACCOUNT | This area is an accounted VM object |
VM_HUGETLB | This area uses hugetlb pages |
VM_NONLINEAR | This area is a nonlinear mapping |
VMA Operations
The vm_ops field in the vm_area_struct structure points to the table of operations associated with a given memory area, which the kernel can invoke to manipulate the VMA. The vm_area_struct acts as a generic object for representing any type of memory area, and the operations table describes the specific methods that can operate on this particular instance of the object. The operations table is represented by struct vm_operations_struct and is defined in <linux/mm.h>:
struct vm_operations_struct {
        void (*open) (struct vm_area_struct *);
        void (*close) (struct vm_area_struct *);
        struct page * (*nopage) (struct vm_area_struct *, unsigned long, int);
        int (*populate) (struct vm_area_struct *, unsigned long, unsigned long,
                         pgprot_t, unsigned long, int);
};
Here's a description of each individual method:
void open(struct vm_area_struct *area)
This function is invoked when the given memory area is added to an address space.
void close(struct vm_area_struct *area)
This function is invoked when the given memory area is removed from an address space.
struct page * nopage(struct vm_area_struct *area,
                     unsigned long address,
                     int unused)
This function is invoked by the page fault handler when a page that is not present in physical memory is accessed.
int populate(struct vm_area_struct *area,
             unsigned long address,
             unsigned long len, pgprot_t prot,
             unsigned long pgoff, int nonblock)
This function is invoked by the remap_file_pages() system call to prefault a new mapping.
Lists and Trees of Memory Areas
As discussed, memory areas are accessed via both the mmap and the mm_rb fields of the memory descriptor. These two data structures independently point to all the memory area objects associated with the memory descriptor. In fact, they both contain pointers to the very same vm_area_struct structures, merely represented in different ways.
The first field, mmap, links together all the memory area objects in a singly linked list. Each vm_area_struct structure is linked into the list via its vm_next field. The areas are sorted by ascending address. The first memory area is the vm_area_struct structure to which mmap points; the vm_next field of the last structure is NULL.
The second field, mm_rb, links together all the memory area objects in a red-black tree. The root of the red-black tree is mm_rb, and each vm_area_struct structure in this address space is linked to the tree via its vm_rb field.
A red-black tree is a type of balanced binary tree. Each element in a red-black tree is called a node. The initial node is called the root of the tree. Most nodes have two children: a left child and a right child. Some nodes have only one child, and the final nodes, called leaves, have no children. For any node, the elements to the left are smaller in value, whereas the elements to the right are larger in value. Furthermore, each node is assigned a color (red or black, hence the name of this tree) according to two rules: the children of a red node are black, and every path through the tree from a node to a leaf must contain the same number of black nodes. The root node is always black. Because these rules keep the tree balanced, searching, insertion, and deletion are all O(log n) operations.
The linked list is used when every node needs to be traversed. The red-black tree is used when locating a specific memory area in the address space. In this manner, the kernel uses the redundant data structures to provide optimal performance regardless of the operation performed on the memory areas.
Memory Areas in Real Life
Let's look at a particular process's address space and the memory areas inside. For this task, I'm using the useful /proc filesystem and the pmap(1) utility. The example is a very simple user-space program, which does absolutely nothing of value, except act as an example:
int main(int argc, char *argv[])
{
return 0;
}
Take note of a few of the memory areas in this process's address space. Right off the bat, you know there is the text section, data section, and bss. Assuming this process is dynamically linked with the C library, these three memory areas also exist for libc.so and again for ld.so. Finally, there is also the process's stack.
The output from /proc/<pid>/maps lists the memory areas in this process's address space:
rml@phantasy:~$ cat /proc/1426/maps
00e80000-00faf000 r-xp 00000000 03:01 208530 /lib/tls/libc-2.3.2.so
00faf000-00fb2000 rw-p 0012f000 03:01 208530 /lib/tls/libc-2.3.2.so
00fb2000-00fb4000 rw-p 00000000 00:00 0
08048000-08049000 r-xp 00000000 03:03 439029 /home/rml/src/example
08049000-0804a000 rw-p 00000000 03:03 439029 /home/rml/src/example
40000000-40015000 r-xp 00000000 03:01 80276 /lib/ld-2.3.2.so
40015000-40016000 rw-p 00015000 03:01 80276 /lib/ld-2.3.2.so
4001e000-4001f000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
The data is in the form
start-end permission offset major:minor inode file
The pmap(1) utility[4] formats this information in a bit more readable manner:
[4] The pmap(1) utility displays a formatted listing of a process's memory areas. It is a bit more readable than the /proc output, but it is the same information. It is found in newer versions of the procps package.
rml@phantasy:~$ pmap 1426
example[1426]
00e80000 (1212 KB) r-xp (03:01 208530) /lib/tls/libc-2.3.2.so
00faf000 (12 KB) rw-p (03:01 208530) /lib/tls/libc-2.3.2.so
00fb2000 (8 KB) rw-p (00:00 0)
08048000 (4 KB) r-xp (03:03 439029) /home/rml/src/example
08049000 (4 KB) rw-p (03:03 439029) /home/rml/src/example
40000000 (84 KB) r-xp (03:01 80276) /lib/ld-2.3.2.so
40015000 (4 KB) rw-p (03:01 80276) /lib/ld-2.3.2.so
4001e000 (4 KB) rw-p (00:00 0)
bfffe000 (8 KB) rwxp (00:00 0) [ stack ]
mapped: 1340 KB writable/private: 40 KB shared: 0 KB
The first three rows are the text section, data section, and bss of libc.so, the C library. The next two rows are the text and data sections of our executable object. The following three rows are the text section, data section, and bss for ld.so, the dynamic linker. The last row is the process's stack.
Note how the text sections are all readable and executable, which is what you expect for object code. On the other hand, the data section and bss (which both contain global variables) are marked readable and writable, but not executable. The stack is, naturally, readable, writable, and executable; it would not be of much use otherwise.
The entire address space takes up about 1340KB, but only 40KB are writable and private. If a memory region is shared or nonwritable, the kernel keeps only one copy of the backing file in memory. This might seem like common sense for shared mappings, but the nonwritable case can come as a bit of a surprise. Because a nonwritable mapping can never be changed (the mapping is only read from), it is safe to load the image into memory only once. Therefore, the C library need only occupy 1212KB in physical memory, and not 1212KB multiplied by every process using the library. Because this process has access to about 1340KB worth of data and code, yet consumes only about 40KB of physical memory, the space savings from such sharing are substantial.
Note the memory areas without a mapped file that are on device 00:00 and inode zero. This is the zero page. The zero page is a mapping that consists of all zeros. By mapping the zero page over a writable memory area, the area is in effect "initialized" to all zeros. This is important in that it provides a zeroed memory area, which is expected by the bss.
Because the mapping is not shared, as soon as the process writes to this data, a copy is made (à la copy-on-write) and the value is updated from zero.
Each of the memory areas that are associated with the process corresponds to a vm_area_struct structure. Because the process was not a thread, it has a unique mm_struct structure referenced from its task_struct.