Professional Documents
Culture Documents
Anand Sivasubramaniam
Two Parts
0xC0000000
File name, Environment
Arguments
Stack
_end
bss
_bss_start
_edata
Data
_etext
Code
Header
0x84000000
Shared Libs
Address Space Descriptor
• mm_struct defined in the process descriptor. (in linux/sched.h)
• This is duplicated if CLONE_VM is specified on forking.
• struct mm_struct {
int count; // no. of processes sharing this descriptor
pgd_t *pgd; //page directory ptr
unsigned long start_code, end_code;
unsigned long start_data, end_data;
unsigned long start_brk, brk;
unsigned long start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss; // no. of pages resident in memory
unsigned long total_vm; // total # of bytes in this address space
unsigned long locked_vm; // # of bytes locked in memory
unsigned long def_flags; // status to use when mem regions are created
struct vm_area_struct *mmap; // ptr to first region desc.
struct vm_area_struct *mmap_avl; // faster search of region desc.
}
Region Descriptors
• Why even allocate all of the VAS? Allocate only on demand.
• Use region descriptors for each allocated region of VAS
• Map allocated but unused regions to same physical page to save space.
• struct vm_area_struct {
struct mm_struct *vm_mm; // descriptor of VAS
unsigned long vm_start, vm_end; // of this region
pgprot_t vm_page_prot; // protection attributes for this region
short vm_avl_height;
struct vm_avl_left;
vm_area_struct *vm_avl_permission; // right hand child
vm_area_struct * vm_next_share, *vm_prev_share; // doubly linked
vm_operations_struct *vm_ops;
struct inode *vm_inode; // of file mapped, or NULL = “anonymous
mapping”
unsigned long vm_offset; // offset in file/device
}
• If vm_inode is NULL (anonymous mapping), all PTEs
for this region point to the same page.
• If the process does a write to any of these pages, the
faulting mechanism creates a new physical page (copy-
on-write).
• This is used by the brk() system call.
• Operations specific to this region (including fault
handling) are specified in vm_operations_struct.
• Hence, different regions can have different functions.
Struct vm_operations_struct {
void (*open)(struct vm_area_struct *);
void (*close)(struct vm_area_struct *);
void (*unmap)(…);
void (*protect)(…)
void (*sync)(…);
unsigned long (*nopage)(struct vm_area_struct *, unsigned long
address, unsigned long page, int write_access);
void (*swapout)(struct vm_area_struct *, unsigned long, pte_t *);
pte_t (*swapin)(struct vm_area_struct *, unsigned long, unsigned long);
}
Traditional mmap()
• int do_mmap(struct file *, unsigned long addr,
unsigned long len, unsigned long prot, unsigned
long flags, unsigned long off);
• Static
Memory_start = console_init(memory_start, memory_end);
Typically done for drivers to reserve areas, and for some other kernel
components.
• Dynamic
Void *kmalloc(size, priority), Void kfree (void *) // in mm/kmalloc.c
Void *vmalloc(size), void *vmfree(void *) // in mm/vmalloc.c
Kmalloc is used for physically contiguous pages while vmalloc does not
necessarily allocate physically contiguous pages
Memory allocated is not initialized (and is not paged out).
kmalloc() data structures
sizes[]
32
64 size_descriptor
128 page_descriptor
252
508
1020
2040 bh bh
4080
8176 bh bh
16368
32752 bh bh
65520
Null Null
131056
vmalloc()
• Allocated virtually contiguous pages, but they do not need to be physically
contiguous.
• Uses __get_free_page() to allocate physical frames.
• Once all the required physical frames are found, the virtual addresses are
created (and mappings set) at an unused part.
• The virtual address search (for unused parts) on x86 begins at the next
address after physical memory on an 8 MB boundary.
• One (virtual) page is left free after each allocation for cushioning.
next
vmstruct size
addr
VAS
next
size
addr
vmlist
vmalloc vs kmalloc
• Contiguous vs non-contiguous physical memory
• kmalloc is faster but less flexible
• vmalloc involves __get_free_page() and may need to
block to find a free physical page
• DMA requires contiguous physical memory
Paging
• All kernel segment pages are locked in memory (no swapping)
• User pages can be paged out:
– Complete block device
– Fixed length files in a file system
• First 4096 bytes are a bitmap indicating that space for that page is
available for paging.
• At byte 4086, string “SWAP_SPACE” is stored.
• Hence, max swap of 4086*8-1 = 32687 pages = 130784KB per
device or file
• MAX_SWAPFILES specifies number of swap files or devices
• Swap device is more efficient than swap file.
• Inform swap space to kernel using
• int sys_swapon(char * swapfile, int swapflags);
• Ceates an entry in swap_info table.
• struct swap_info_struct {
unsigned int flags;
kdev_t swap_device;
struct indoe *swap_file;
unsigned char *swap_map; // ptr to table, with 1 byte for
each page to indicate how many processes are referring
to this page
unsigned char *swap_lockmap; // ptr to bitmap, bit
indicating lock
int lowest_bit, highest_bit; // to calculate maximum page
number
unsigned long max; // highest_bit + 1
int prio; // priority for this swap space
int cluster_nr, cluster_next; // to cluster pages on
storage device
int next;
}
• For each physical frame (mm.h):
typedef struct page {
struct page *prev, *next; // doubly linked
struct inode *inode; unsigned long offset; // where to
swap
struct page *prev_hash, next_hash; // in hash list of
pages in page cache
atomic_t count; // number of users of this page
unsigned dirty:16, age:8;
struct buffer_head * buffers; // if it is part of a block
buffer
unsigned long map_nr; // frame #
struct wait_queue *wait; // Tasks waiting for page to be
unlocked
unsigned flags;
} mem_map_t;
Finding a Physical Page
• unsigned long __get_free_pages(int priority, unsigned long
order, int dma) in mm/page_alloc.c
• Priority =
– GFP_BUFFER (free page returned only if available in physical
memory)
– GFP_ATOMIC (return page if possible, do not interrupt current
process)
– GFP_USER (current process can be interrupted)
– GFP_KERNEL (kernel can be interrupted)
– GFP_NOBUFFER (do not attempt to reduce buffer cache)
• order says give me 2^^order pages (max is 128KB)
• dma specifies that it is for DMA purposes
• First tries to find a free frame using Buddy system.
• Table free_area[] keeps appropriate data structures.
free_area_list[0 1 2] free_area_map
0 1 2
• If you cannot find a free page,
int try_to_free_page(int priority, int dma, int wait) {
static int state = 6; int I = 6; int stop;
stop = 3; if (wait) stop = 0;
switch (state) {
do {
Case 0:
if (shrink_mmap(i,dma)) return 1;
state = 1;
Case 1:
if (shm_swap(i,dma)) return 1;
state = 2;
Default:
if (swap_out(i,dma,wait)) return 1;
state = 0;
i--;
} while ((i-stop) >= 0);
}
return 0;
}
• shrink_mmap() tries to discard pages in page cache or buffer
cache that have only one user currently, and have not been
references since the last cycle. The number of examined pages
depends on priority.
• shm_swap() tries pages allocated for shared memory.
• swap_out()
– Uses swap_cnt to determine how many pages to swap out for
current process before moving on to next.
– Always start where you left off last time (Clock algorithm)
– Uses swap_out_process() function, which then calls
try_to_swap_out() for each possible page present in memory
(and is not locked).
– try_to_swap_out() checks the age attribute in mem_map
data structure, and the page is selected if this is 0.
– VM area’s swapout() operation is called.
– Write back if the page is dirty
– Invalidate page table entry.
• kswapd kernel thread running in background is
activated each time the number of free pages falls
below a critical level.
• This thread calls the try_to_free_page() function.
• A block of memory is released using free_pages().
When the number of users reaches 0, the frames are
entered in free_area[].
Page Fault
• Error code written onto stack, and the VA is stored in register
CR2
• do_page_fault(struct pt_regs *regs, unsigned long
error_code) is now called.
• If faulting address is in kernel segment, alarm messages are
printed out and the process is terminated.
• If faulting address is not in a virtual memory area, check if
VM_GROWSDOWN for the nexy virtual memory area is set (I.e.
Stack). If so, expand VM. If error in expanding send SIGSEGV.
• If faulting address is in a virtual memory area, check if protection
bits are OK. If not legal, send SIGSEGV. Else, call do_no_page()
or do_wp_page().
• void do_wp_page(struct task_struct *task, struct
vm_area_struct *vma, unsigned long address, int write_access);
– Check if page is mapped in
– If it is referenced only once, change permissions
– Else, create a copy and entry in page table to this copy with write
permissions on