Chapter 4
Virtual Memory
Linux processes execute in a virtual environment that makes it appear as if each process had
the entire address space of the CPU available to itself. This virtual address space extends
from address 0 all the way to the maximum address. On a 32-bit platform, such as IA-32,
the maximum address is 232 − 1 or 0xffffffff. On a 64-bit platform, such as IA-64,
this is 264 − 1 or 0xffffffffffffffff.
While it is obviously convenient for a process to be able to access such a huge address space, there are really three distinct, but equally important, reasons for using virtual
memory.
1. Resource virtualization. On a system with virtual memory, a process does not have
to concern itself with the details of how much physical memory is available or which
physical memory locations are already in use by some other process. In other words,
virtual memory takes a limited physical resource (physical memory) and turns it into
an infinite, or at least an abundant, resource (virtual memory).
2. Information isolation. Because each process runs in its own address space, it is not
possible for one process to read data that belongs to another process. This improves
security because it reduces the risk of one process being able to spy on another process and, e.g., steal a password.
3. Fault isolation. Processes with their own virtual address spaces cannot overwrite
each other’s memory. This greatly reduces the risk of a failure in one process triggering a failure in another process. That is, when a process crashes, the problem is
generally limited to that process alone and does not cause the entire machine to go
down.
In this chapter, we explore how the Linux kernel implements its virtual memory system
and how it maps to the underlying hardware. This mapping is illustrated specifically for
IA-64. The chapter is structured as follows. The first section provides an introduction to
the virtual memory system of Linux and establishes the terminology used throughout the
remainder of the chapter. The introduction is followed by a description of the software and
hardware structures that form the virtual memory system. Specifically, the second section
describes the Linux virtual address space and its representation in the kernel. The third section describes the Linux page tables, and the fourth section describes how Linux manages
[Figure 4.1. Virtual and physical address spaces. The left half shows a 64-bit virtual address space (0x0000000000000000 through 0xffffffffffffffff) divided into 16 virtual pages numbered 0 through 15; the right half shows a physical address space divided into page frames numbered 0 through 9, with arrows indicating which page frame backs each virtual page in use.]
the translation lookaside buffer (TLB), which is a hardware structure used to accelerate
virtual memory accesses. Once these fundamental structures are introduced, the chapter
describes the operation of the virtual memory system. Section five explores the Linux page
fault handler, which can be thought of as the engine driving the virtual memory system.
Section six describes how memory coherency is maintained, that is, how Linux ensures
that a process sees the correct values in the virtual memory locations it can access. Section
seven discusses how Linux switches execution from one address space to another, which is
a necessary step during a process context switch. The chapter concludes with section eight,
which provides the rationale for some of the virtual memory choices that were made for
the virtual memory system implemented on IA-64.
4.1
INTRODUCTION TO THE VIRTUAL MEMORY SYSTEM
The left half of Figure 4.1 illustrates the virtual address space as it might exist for a particular process on a 64-bit platform. As the figure shows, the virtual address space is divided
into equal-sized pieces called virtual pages. Virtual pages have a fixed size that is an integer
power of 2. For example, IA-32 uses a page size of 4 Kbytes. To maximize performance,
IA-64 supports multiple page sizes and Linux can be configured to use a size of 4, 8, 16,
or 64 Kbytes. In the figure, the 64-bit address space is divided into 16 pages, meaning that
each virtual page would have a size of 2^64/16 = 2^60 bytes or 1024 Pbytes (1 Pbyte = 2^50
bytes). Such large pages are not realistic, but the alternative of drawing a figure with several billion pages of a more realistic size is, of course, not practical either. Thus, for this
section, we continue to illustrate virtual memory with this huge page size. The figure also
shows that virtual pages are numbered sequentially. We can calculate the virtual page number (VPN) from a virtual address by dividing it by the page size and taking the integer
portion of the result. The remainder is called the page offset. For example, dividing virtual
address 0x40000000000003f8 by the page size yields 4 and a remainder of 0x3f8.
This address therefore maps to virtual page number 4 and page offset 0x3f8.
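The calculation is easy to express in code. The following sketch (C) splits a virtual address into VPN and page offset for this section's illustrative 2^60-byte pages; EXAMPLE_PAGE_SHIFT is a made-up name for this example, not a kernel constant.

#include <stdio.h>

#define EXAMPLE_PAGE_SHIFT 60   /* log2 of this section's 2^60-byte page size */

int main(void)
{
    unsigned long long vaddr = 0x40000000000003f8ULL;
    unsigned long long vpn = vaddr >> EXAMPLE_PAGE_SHIFT;
    unsigned long long off = vaddr & ((1ULL << EXAMPLE_PAGE_SHIFT) - 1);
    printf("VPN = %llu, page offset = 0x%llx\n", vpn, off);  /* VPN = 4, 0x3f8 */
    return 0;
}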
Let us now turn attention to the right half of Figure 4.1, which shows the physical address space. Just like the virtual space, it is divided into equal-sized pieces, but in physical
memory, those pieces are called page frames. As with virtual pages, page frames also are
numbered. We can calculate the page frame number (PFN) from a physical address by
dividing it by the page frame size and taking the integer portion of the result. The remainder is the page frame offset. Normally, page frames have the same size as virtual pages.
However, there are cases where it is beneficial to deviate from this rule. Sometimes it is
useful to have virtual pages that are larger than a page frame. Such pages are known as superpages. Conversely, it is sometimes useful to divide a page frame into multiple, smaller
virtual pages. Such pages are known as subpages. IA-64 is capable of supporting both, but
Linux does not use them.
While it is easiest to think of physical memory as occupying a single contiguous region
in the physical address space, in reality it is not uncommon to encounter memory holes.
Holes usually are caused by one of three entities: firmware, memory-mapped I/O devices,
or unpopulated memory. All three cause portions of the physical address space to be unavailable for storing the content of virtual pages. As far as the kernel is concerned, these
portions are holes in the physical memory. In the example in Figure 4.1, page frames 2
and 3 represent a hole. Note that even if just a single byte in a page frame is unusable, the
entire frame must be marked as a hole.
4.1.1
Virtual-to-physical address translation
Processes are under the illusion of being able to store data to virtual memory and retrieve
it later on as if it were stored in real memory. In reality, only physical memory can store
data. Thus, each virtual page that is in use must be mapped to some page frame in physical
memory. For example, in Figure 4.1, virtual pages 4, 5, 8, 9, and 11 are in use. The arrows
indicate which page frame in physical memory they map to. The mapping between virtual
pages and page frames is stored in a data structure called the page table. The page table for
our example is shown on the left-hand side of Figure 4.2.
The Linux kernel is responsible for creating and maintaining page tables but employs
the CPU’s memory management unit (MMU) to translate the virtual memory accesses
of a process into corresponding physical memory accesses. Specifically, when a process
accesses a memory location at a particular virtual address, the MMU translates this address into the corresponding physical address, which it then uses to access the physical memory. This is illustrated in Figure 4.2 for the case in which the virtual address is
0x40000000000003f8. As the figure shows, the MMU extracts the VPN (4) from the
virtual address and then searches the page table to find the matching PFN. In our case, the
search stops at the first entry in the page table since it contains the desired VPN. The PFN
associated with this entry is 7. The MMU then constructs the physical address by concatenating the PFN with the frame offset from the virtual address, which results in a physical
address of 0x70000000000003f8.
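The translation can be modeled in a few lines of C. In the sketch below, the page table contents mirror Figure 4.2, and EXAMPLE_PAGE_SHIFT is the same illustrative constant used above; struct pt_entry and pt_lookup() are made-up names, not kernel interfaces, and a real MMU performs this lookup in hardware.

#define EXAMPLE_PAGE_SHIFT 60

struct pt_entry { unsigned long vpn, pfn; };

static const struct pt_entry page_table[] = {
    { 4, 7 }, { 5, 4 }, { 8, 0 }, { 9, 5 }, { 11, 1 },
};

/* Translate vaddr; returns 1 and stores the physical address in *paddr,
   or returns 0 if there is no translation (a page fault in a real system). */
int pt_lookup(unsigned long long vaddr, unsigned long long *paddr)
{
    unsigned long long vpn = vaddr >> EXAMPLE_PAGE_SHIFT;
    unsigned long long off = vaddr & ((1ULL << EXAMPLE_PAGE_SHIFT) - 1);
    for (unsigned i = 0; i < sizeof page_table / sizeof page_table[0]; i++)
        if (page_table[i].vpn == vpn) {
            *paddr = ((unsigned long long) page_table[i].pfn
                      << EXAMPLE_PAGE_SHIFT) | off;
            return 1;
        }
    return 0;
}
/* pt_lookup(0x40000000000003f8, &pa) stores 0x70000000000003f8, as in Figure 4.2. */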
[Figure 4.2. Virtual-to-physical address translation. The MMU splits the virtual address 0x40000000000003f8 into VPN 4 and page offset 0x3f8, finds the matching entry in the page table (which maps VPN 4, 5, 8, 9, 11 to PFN 7, 4, 0, 5, 1, respectively), and forms the physical address 0x70000000000003f8 by concatenating PFN 7 with the unchanged page offset.]
4.1.2
Demand paging
The next question we need to address is how the page tables get created. Linux could create
appropriate page-table entries whenever a range of virtual memory is allocated. However,
this would be wasteful because most programs allocate much more virtual memory than
they ever use at any given time. For example, the text segment of a program often includes
large amounts of error handling code that is seldom executed. To avoid wasting memory
on virtual pages that are never accessed, Linux uses a method called demand paging. With
this method, the virtual address space starts out empty. This means that, logically, all virtual
pages are marked in the page table as not present. When accessing a virtual page that is
not present, the CPU generates a page fault. This fault is intercepted by the Linux kernel
and causes the page fault handler to be activated. There, the kernel can allocate a new page
frame, determine what the content of the accessed page should be (e.g., a new, zeroed page,
a page loaded from the data section of a program, or a page loaded from the text segment
of a program), load the page, and then update the page table to mark the page as present.
Execution then resumes in the process with the instruction that caused the fault. Since the
required page is now present, the instruction can now execute without causing a page fault.
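The sequence of steps is summarized by the following sketch (C). Every helper named here is a hypothetical placeholder that stands for work the kernel performs; the sketch only traces the control flow described above.

struct process;
struct frame;

struct frame *allocate_page_frame(void);                   /* may page out */
int  is_file_backed(struct process *p, unsigned long va);
void read_page_from_file(struct process *p, unsigned long va, struct frame *f);
void zero_fill(struct frame *f);
void mark_present(struct process *p, unsigned long va, struct frame *f);

void handle_page_fault(struct process *p, unsigned long vaddr)
{
    struct frame *f = allocate_page_frame();
    if (is_file_backed(p, vaddr))
        read_page_from_file(p, vaddr, f);  /* e.g., text or data segment */
    else
        zero_fill(f);                      /* new anonymous page, cleared to 0 */
    mark_present(p, vaddr, f);             /* update the page table */
    /* on return, the faulting instruction is restarted and now succeeds */
}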
4.1.3
Paging and swapping
So far, we assumed that physical memory is plentiful: Whenever we needed a page frame
to back a virtual page, we assumed a free page frame was available. When a system has
many processes or when some processes grow very large, the physical memory can easily
fill up. So what is Linux supposed to do when a page frame is needed but physical memory
is already full? The answer is that in this case, Linux picks a page frame that backs a virtual
page that has not been accessed recently, writes it out to a special area on the disk called
the swap space, and then reuses the page frame to back the new virtual page. The exact
place to which the old page is written on the disk depends on what kind of swap space is
in use. Linux can support multiple swap space areas, each of which can be either an entire
disk partition or a specially formatted file on an existing filesystem (the former is generally
more efficient and therefore preferable). Of course, the page table of the process from
which Linux “stole” the page frame must be updated accordingly. Linux does this update
by marking the page-table entry as not present. To keep track of where the old page has
been saved, it also uses the entry to record the disk location of the page. In other words, a
page-table entry that is present contains the page frame number of the physical page frame
that backs the virtual page, whereas a page-table entry that is not present contains the disk
location at which the content of the page can be found.
Because a page marked as not present cannot be accessed without first triggering a
page fault, Linux can detect when the page is needed again. When this happens, Linux
needs to again find an available page frame (which may cause another page to be paged
out), read the page content back from swap space, and then update the page-table entry
so that it is marked as present and maps to the newly allocated page frame. At this point,
the process that attempted to access the paged-out page can be resumed, and, apart from a
small delay, it will execute as if the page had been in memory all along.
The technique of stealing a page from a process and writing it out to disk is called
paging. A related technique is swapping. It is a more aggressive form of paging in the
sense that it does not steal an individual page but steals all the pages of a process when
memory is in short supply. Linux uses paging but not swapping. However, because both
paging and swapping write pages to swap space, Linux kernel programmers often use the
terms “swapping” and “paging” interchangeably. This is something to keep in mind when
perusing the kernel source code.
From a correctness point of view, it does not matter which page is selected for page out,
but from a performance perspective, the choice is critical. With a poor choice, Linux may
end up paging out the page that is needed in the very next memory access. Given the large
difference between disk access latency (on the order of several milliseconds) and memory
access latency (on the order of tens of nanoseconds), making the right replacement choices
can mean the difference between completing a task in a second or in almost three hours!
The algorithm that determines which page to evict from main memory is called the
replacement policy. The provably optimal replacement policy (OPT) is to choose the page
that will be accessed farthest in the future. Of course, in general it is impossible to know
the future behavior of the processes, so OPT is of theoretical interest only. A replacement
policy that often performs almost as well as OPT yet is realizable is the least recently used
(LRU) policy. LRU looks into the past instead of the future and selects the page that has
not been accessed for the longest period of time. Unfortunately, even though LRU could
be implemented, it is still not practical because it would require updating a data structure
(such as an LRU list) on every access to main memory. In practice, operating systems use
approximations of the LRU policy instead, such as the clock replacement or not frequently
used (NFU) policies [11, 69].
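As an illustration of such an approximation, the sketch below (C) implements the basic clock policy: a hand sweeps the page frames in a circle, clearing the accessed bit of recently used frames and evicting the first frame found with the bit already clear. The types and the eviction interface are illustrative only, not Linux code.

struct frame { int accessed; /* ... */ };

/* Return the index of the frame to evict; *hand persists across calls. */
int clock_evict(struct frame *frames, int nframes, int *hand)
{
    for (;;) {
        int i = *hand;
        *hand = (*hand + 1) % nframes;
        if (frames[i].accessed)
            frames[i].accessed = 0;   /* recently used: give a second chance */
        else
            return i;                 /* not recently used: evict this frame */
    }
}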
In Linux, the page replacement is complicated by the fact that the kernel can take up
a variable amount of (nonpageable) memory. For example, file data is stored in the page
cache, which can grow and shrink dynamically. When the kernel needs a new page frame,
it often has two choices: It could take away a page from the kernel or it could steal a page
from a process. In other words, the kernel needs not just a replacement policy but also a
memory balancing policy that determines how much memory is used for kernel buffers and
how much is used to back virtual pages. The combination of page replacement and memory
balancing poses a difficult problem for which there is no perfect solution. Consequently,
the Linux kernel uses a variety of heuristics that tend to work well in practice.
To implement these heuristics, the Linux kernel expects the platform-specific part of the
kernel to maintain two extra bits in each page-table entry: the accessed bit and the dirty bit.
The accessed bit is an indicator that tells the kernel whether the page was accessed (read,
written, or executed) since the bit was last cleared. Similarly, the dirty bit is an indicator
that tells whether the page has been modified since it was last paged in. Linux uses a kernel
thread, called the kernel swap daemon kswapd, to periodically inspect these bits. After
inspection, it clears the accessed bit. If kswapd detects that the kernel is starting to run low
on memory, it starts to proactively page out memory that has not been used recently. If the
dirty bit of a page is set, kswapd needs to write the page to disk before the page frame can be
freed. Because this is relatively costly, kswapd preferentially frees pages whose accessed
and dirty bits are cleared to 0. By definition such pages were not accessed recently and
do not have to be written back to disk before the page frame can be freed, so they can be
reclaimed at very little cost.
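The preference kswapd applies can be sketched as follows (C). The pte layout and the two helper routines are hypothetical placeholders; real kernels manipulate these bits through platform-specific accessors.

struct pte { unsigned present:1, accessed:1, dirty:1; };

void queue_writeback(struct pte *pte);   /* hypothetical helpers */
void reclaim_frame(struct pte *pte);

void inspect(struct pte *pte)
{
    if (!pte->present)
        return;
    if (!pte->accessed) {
        if (pte->dirty)
            queue_writeback(pte);   /* costly: must be written to disk first */
        else
            reclaim_frame(pte);     /* cheap: frame can be reused at once */
    }
    pte->accessed = 0;              /* clear for the next inspection interval */
}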
4.1.4
Protection
In a multiuser and multitasking system such as Linux, multiple processes often execute
the same program. For example, each user who is logged into the system at a minimum is
running a command shell (e.g., the Bourne-Again shell, bash). Similarly, server processes
such as the Apache web server often use multiple processes running the same program to
better handle heavy loads. If we looked at the virtual space of each of those processes,
we would find that they share many identical pages. Moreover, many of those pages are
never modified during the lifetime of a process because they contain read-only data or the
text segment of the program, which also does not change during the course of execution.
Clearly, it would make a lot of sense to exploit this commonality and use only one page
frame for each virtual page with identical content.
With N processes running the same program, sharing identical pages can reduce physical memory consumption by up to a factor of N. In reality, the savings are usually not quite
so dramatic, because each process tends to require a few private pages. A more realistic
example is illustrated in Figure 4.3: Page frames 0, 1, and 5 are used to back virtual pages
1, 2, and 3 in the two processes called bash 1 and bash 2. Note that a total of nine
virtual pages are in use, but thanks to page sharing, only six page frames are needed.
Of course, page sharing cannot be done safely unless we can guarantee that none of the
shared pages are modified. Otherwise, the changes of one process would be visible in all
the other processes and that could lead to unpredictable program behavior.
This is where the page permission bits come into play. The Linux kernel expects the
platform-specific part of the kernel to maintain three such bits per page-table entry. They
are called the R, W, and X permission bits and respectively control whether read, write, or
execute accesses to the page are permitted. If an access that is not permitted is attempted, a
page protection violation fault is raised. When this happens, the kernel responds by sending
a segmentation violation signal (SIGSEGV) to the process.
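This behavior is easy to observe from user space. The following small program (C, POSIX; a sketch, assuming a Linux-style mmap() with MAP_ANONYMOUS) maps a page with read permission only and then attempts to write to it, which raises SIGSEGV exactly as described.

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static void on_segv(int sig)
{
    static const char msg[] = "caught SIGSEGV: write to read-only page\n";
    (void) sig;
    write(STDOUT_FILENO, msg, sizeof msg - 1);
    _exit(0);
}

int main(void)
{
    char *p;
    signal(SIGSEGV, on_segv);
    p = mmap(NULL, 8192, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    (void) p[0];     /* read access: permitted by PROT_READ */
    p[0] = 1;        /* write access: protection violation, SIGSEGV */
    return 0;        /* never reached */
}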
[Figure 4.3. Two processes sharing the text segment (virtual pages 1 to 3). The virtual address spaces of bash 1 and bash 2 each map their text pages 1 through 3 to the shared physical page frames 0, 1, and 5, while their data pages map to private page frames.]
The page permission bits enable the safe sharing of page frames. All the Linux kernel
has to do is ensure that all virtual pages that refer to a shared page frame have the W
permission bit turned off. That way, if a process attempted to modify a shared page, it
would receive a segmentation violation signal before it could do any harm.
The most obvious place where page frame sharing can be used effectively is in the
text segment of a program: By definition, this segment can be executed and read, but it is
never written to. In other words, the text segment pages of all processes running the same
program can be shared. The same applies to read-only data pages.
Linux takes page sharing one step further. When a process forks a copy of itself, the
kernel disables write access to all virtual pages and sets up the page tables such that the
parent and the child process share all page frames. In addition, it marks the pages that
were writable before as copy-on-write (COW). If the parent or the child process attempts
to write to a copy-on-write page, a protection violation fault occurs. When this happens,
instead of sending a segmentation violation signal, the kernel first makes a private copy of
the virtual page and then turns the write permission bit for that page back on. At this point,
execution can return to the faulting process. Because the page is now writable, the faulting
instruction can finish execution without causing a fault again. The copy-on-write scheme is
particularly effective when a program does a fork() that is quickly followed by an execve().
In such a case, the scheme is able to avoid almost all page copying, save for a few pages in
the stack or data segment [9, 64].
Note that the page sharing described here happens automatically and without the explicit knowledge of the process. There are times when two or more processes need to cooperate and want to explicitly share some virtual memory pages. Linux supports this through
the mmap() system call and through System V shared memory segments [9]. Because the
processes are cooperating, it is their responsibility to map the shared memory segment
with suitable permission bits to ensure that the processes can access the memory only in
the intended fashion.
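A minimal user-space illustration of such explicit sharing (C, POSIX; a sketch, assuming MAP_ANONYMOUS is available): a MAP_SHARED mapping established before fork() remains shared between parent and child, so a store by one is visible to the other, whereas a MAP_PRIVATE mapping would be copied on write instead.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED)
        return 1;
    *counter = 0;
    if (fork() == 0) {      /* child */
        *counter = 42;      /* store through the shared mapping */
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", *counter);  /* prints 42; with MAP_PRIVATE
                                              it would still print 0 */
    return 0;
}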
[Figure 4.4. Structure of Linux address space. User space extends from 0x0000000000000000 up to TASK_SIZE; kernel space occupies the rest, up to 0xffffffffffffffff. Within kernel space, the identity-mapped segment begins at PAGE_OFFSET and the page-table-mapped segment contains the vmalloc arena between VMALLOC_START and VMALLOC_END.]
4.2
ADDRESS SPACE OF A LINUX PROCESS
The virtual address space of any Linux process is divided into two subspaces: kernel space
and user space. As illustrated on the left-hand side of Figure 4.4, user space occupies the
lower portion of the address space, starting from address 0 and extending up to the platform-specific task size limit (TASK_SIZE in file include/asm/processor.h). The remainder is occupied by kernel space. Most platforms use a task size limit that is large enough so that at
least half of the available address space is occupied by the user address space.
User space is private to the process, meaning that it is mapped by the process’s own
page table. In contrast, kernel space is shared across all processes. There are two ways to
think about kernel space: We can either think of it as being mapped into the top part of
each process, or we can think of it as a single space that occupies the top part of the CPU’s
virtual address space. Interestingly, depending on the specifics of the CPU on which Linux is
running, kernel space can be implemented in one or the other way.
During execution at the user level, only user space is accessible. Attempting to read,
write, or execute kernel space would cause a protection violation fault. This prevents a
faulty or malicious user process from corrupting the kernel. In contrast, during execution
in the kernel, both user and kernel spaces are accessible.
Before continuing our discussion, we need to say a few words about the page size used
by the Linux kernel. Because different platforms have different constraints on what page
sizes they can support, Linux never assumes a particular page size and instead uses the
platform-specific page size constant (PAGE_SIZE in file include/asm/page.h) where necessary. Although Linux can accommodate arbitrary page sizes, throughout the rest of this
chapter we assume a page size of 8 Kbytes, unless stated otherwise. This assumption helps
to make the discussion more concrete and avoids excessive complexity in the following
examples and figures.
4.2.1
User address space
Let us now take a closer look at how Linux implements the user address spaces. Each
address space is represented in the kernel by an object called the mm structure (struct
mm_struct in file include/linux/sched.h). As we have seen in Chapter 3, Processes, Tasks,
and Threads, multiple tasks can share the same address space, so the mm structure is a
reference-counted object that exists as long as at least one task is using the address space
represented by the mm structure. Each task structure has a pointer to the mm structure
that defines the address space of the task. This pointer is known as the mm pointer. As a
special case, tasks that are known to access kernel space only (such as kswapd) are said to
have an anonymous address space, and the mm pointer of such tasks is set to NULL. When
switching execution to such a task, Linux does not switch the address space (because there
is none to switch to) and instead leaves the old one in place. A separate pointer in the task
structure tracks which address space has been borrowed in this fashion. This pointer is
known as the active mm pointer of the task. For a task that is currently running, this pointer
is guaranteed not to be NULL. If the task has its own address space, the active mm pointer
has the same value as the mm pointer; otherwise, the active mm pointer refers to the mm
structure of the borrowed address space.
Perhaps somewhat surprisingly, the mm structure itself is not a terribly interesting object. However, it is a central hub in the sense that it contains the pointers to the two data
structures that are at the core of the virtual memory system: the page table and the list
of virtual memory areas, which we describe next. Apart from these two pointers, the mm
structure contains miscellaneous information, such as the mm context, which we describe
in more detail in Section 4.4.3, a count of the number of virtual pages currently in use (the
resident set size, or RSS), the start and end address of the text, data, and stack segments
as well as housekeeping information that kswapd uses when looking for virtual memory to
page out.
Virtual memory areas
In theory, a page table is all the kernel needs to implement virtual memory. However, page
tables are not effective in representing huge address spaces, especially when they are sparse.
To see this, let us assume that a process uses 1 Gbyte of its address space for a hash table
and then enters 128 Kbytes of data in it. If we assume that the page size is 8 Kbytes and
that each entry in the page table takes up 8 bytes, then the page table itself would take up
1 Gbyte/8 Kbytes · 8 bytes = 1 Mbyte of space—an order of magnitude more than the actual
data stored in the hash table!
To avoid this kind of inefficiency, Linux does not represent address spaces with page
tables. Instead, it uses lists of vm-area structures (struct vm_area_struct in file include/linux/mm.h). The idea is to divide an address space into contiguous ranges of pages that can
be handled in the same fashion. Each range can then be represented by a single vm-area
structure. If a process accesses a page for which there is no translation in the page table, the
vm-area covering that page has all the information needed to install the missing page. For
our hash table example, this means that a single vm-area would suffice to map the entire
hash table and that page-table memory would be needed only for recently accessed pages.
[Figure 4.5. Example: vm-area mapping a file. The task structure's mm pointer leads to the mm structure, whose mmap pointer leads to a single vm-area with start = 0x2000, end = 0xa000, and file = /etc/termcap, and whose page-table pointer leads to an initially empty page table. Arrows (1), (2), and (3) show the faulting access at 0x6008, the disk read of the file data, and the copy into a free page frame.]
To get a better sense of how the kernel uses vm-areas, let us consider the example in Figure 4.5. It shows a process that maps the first 32 Kbytes (four pages) of the file /etc/termcap
at virtual address 0x2000. At the top-left of the figure, we find the task structure of the
process and the mm pointer that leads to the mm structure representing the address space
of the process. From there, the mmap pointer leads to the first element in the vm-area list.
For simplicity, we assume that the vm-area for the mapped file is the only one in this process, so this list contains just one entry. The mm structure also has a pointer to the page
table, which is initially empty. Apart from these kernel data structures, the process’s virtual memory is shown in the middle of the figure, the filesystem containing /etc/termcap is
represented by the disk-shaped form, and the physical memory is shown on the right-hand
side of the figure.
Now, suppose the process attempts to read the word at address 0x6008, as shown by
the arrow labeled (1). Because the page table is empty, this attempt results in a page fault. In
response to this fault, Linux searches the vm-area list of the current process for a vm-area
that covers the faulting address. In our case, it finds that the one and only vm-area on the list
maps the address range from 0x2000 to 0xa000 and hence covers the faulting address.
By calculating the distance from the start of the mapped area, Linux finds that the process
attempted to access page 2 ((0x6008 − 0x2000)/8192 = 2). Because the vm-area maps
a file, Linux initiates the disk read illustrated by the arrow labeled (2). We assumed that the
vm-area maps the first 32KB of the file, so the data for page 2 can be found at file offsets
0x4000 through 0x5fff. When this data arrives, Linux copies it to an available page
frame as illustrated by the arrow labeled (3). In the last step, Linux updates the page table
with an entry that maps the virtual page at 0x6000 to the physical page frame that now
contains the file data. At this point, the process can resume execution. The read access will
be restarted and will now complete successfully, returning the desired file data.
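The same scenario can be reproduced from user space with mmap(). The sketch below (C, POSIX) assumes /etc/termcap exists and is at least 32 Kbytes long; the first access to offset 0x4008 of the mapping (virtual address 0x6008 in the example above) triggers precisely the fault-and-fill sequence of steps (1) through (3).

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/termcap", O_RDONLY);
    if (fd < 0)
        return 1;
    /* map the first 32 Kbytes (four 8-Kbyte pages) of the file */
    char *base = mmap(NULL, 4 * 8192, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED)
        return 1;
    /* first touch of page 2: page fault, file read, page-table update */
    printf("byte at mapping offset 0x4008: 0x%02x\n",
           (unsigned char) base[0x4008]);
    munmap(base, 4 * 8192);
    close(fd);
    return 0;
}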
As this example illustrates, the vm-area list provides Linux with the ability to (re-)create
the page-table entry for any address that is mapped in the address space of a process. This
implies that the page table can be treated almost like a cache: If the translation for a particular page is present, the kernel can go ahead and use it, and if it is missing, it can be created
from the matching vm-area. Treating the page table in this fashion provides a tremendous
amount of flexibility because translations for clean pages can be removed at will. Translations for dirty pages can be removed only if they are backed by a file (not by swap space).
Before removal, they have to be cleaned by writing the page content back to the file. As we
see later, the cache-like behavior of page tables provides the foundation for the copy-on-write algorithm that Linux uses.
AVL trees
As we have seen so far, the vm-area list helps Linux avoid many of the inefficiencies of a
system that is based entirely on page tables. However, there is still a problem. If a process
maps many different files into its address space, it may end up with a vm-area list that is
hundreds or perhaps even thousands of entries long. As this list grows longer, the kernel
executes more and more slowly as each page fault requires the kernel to traverse this list.
To ameliorate this problem, the kernel tracks the number of vm-areas on the list, and if
there are too many, it creates a secondary data structure that organizes the vm-areas as an
AVL tree [42, 62]. An AVL tree is a normal binary search tree, except that it has the special
property that for each node in the tree, the height of the two subtrees differs by at most 1.
Using the standard tree-search algorithm, this property ensures that, given a virtual address,
the matching vm-area structure can be found in a number of steps that grows only with the
logarithm of the number of vm-areas in the address space.[1]
Let us consider a concrete example. Figure 4.6 shows the AVL tree for an Emacs process as it existed right after it was started up on a Linux/ia64 machine. For space reasons,
the figure represents each node with a rectangle that contains just the starting and ending address of the address range covered by the vm-area. As is customary for a search tree,
the vm-area nodes appear in the order of increasing starting address. Given a node with a
starting address of x, the vm-areas with a lower starting address can be found in the lower
(“left”) subtree and the vm-areas with a higher starting address can be found in the higher
(“right”) subtree. The root of the tree is at the left end of the figure, and, as indicated by
the arrows, the tree grows toward the right side. While it is somewhat unusual for a tree to
grow from left to right, this representation has the advantage that the higher a node in the
figure, the higher its starting address.
First, observe that this tree is not perfectly balanced: It has a height of six, yet there is a
missing node at the fifth level as illustrated by the dashed rectangle. Despite this imperfection, the tree does have the AVL property that the height of the subtrees at any node never
differs by more than one. Second, note that the tree contains 47 vm-areas. If we were to use
a linear search to find the vm-area for a given address, we would have to visit 23.5 vm-area
structures on average and, in the worst case, we might have to visit all 47 of them.

[1] In Linux v2.4.10 Andrea Arcangeli replaced AVL trees with Red-Black trees [62]. Red-Black trees are also balanced, but can be implemented more efficiently.
[Figure 4.6. AVL tree of vm-area structures for a process running Emacs. Each node shows the start and end address of one vm-area; the 47 vm-areas cover the process's text segment (around 0x4000000000000000), data segment (around 0x6000000000000000), stack (around 0x8000010000000000), shared libraries (around 0x800000ff80000000), and the dynamic loader and its mappings (around 0x2000000000000000).]
In contrast, when the AVL tree is searched, at most six vm-areas have to be visited, as given by
the height of the tree. Clearly, using an AVL tree is a big win for complex address spaces.
However, for simple address spaces, the overhead of creating the AVL tree and keeping it
balanced is too much compared to the cost of searching a short linear list. For this reason,
Linux does not create the AVL tree until the address space contains at least 32 vm-areas.
Let us emphasize that even when the AVL tree is being maintained, the linear list continues
to be maintained as well; this provides an efficient means to visit all vm-area structures.
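The search itself is a standard ordered-tree lookup. The following sketch (C) shows the idea for a tree sorted by starting address; struct vm_area and find_area() are simplified, illustrative versions of the kernel's vm-area structure and its lookup routine.

#include <stddef.h>

struct vm_area {
    unsigned long start, end;      /* end = first address NOT covered */
    struct vm_area *left, *right;  /* subtrees: lower/higher start addresses */
};

/* Return the vm-area covering addr, or NULL if addr falls in a gap. */
struct vm_area *find_area(struct vm_area *root, unsigned long addr)
{
    struct vm_area *best = NULL;
    while (root) {
        if (addr < root->end) {
            best = root;           /* candidate; a lower-start match may exist */
            root = root->left;
        } else {
            root = root->right;
        }
    }
    return (best && addr >= best->start) ? best : NULL;
}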
Anatomy of the vm-area structure
So far, we discussed the purpose of the vm-area structure and how the Linux kernel uses
it, but not what it looks like. The list below rectifies this situation by describing the major
components of the vm-area (a simplified C sketch follows the list):
• Address range: Describes the address range covered by the vm-area in the form of
a start and end address. It is noteworthy that the end address is the address of the first
byte that is not covered by the vm-area.
• VM flags: Consist of a single word that contains various flag bits. The most important among them are the access right flags VM_READ, VM_WRITE, and VM_EXEC,
which control whether the process can, respectively, read, write, or execute the virtual
memory mapped by the vm-area. Two other important flags are VM_GROWSDOWN
and VM_GROWSUP, which control whether the address range covered by the vm-area can be extended toward lower or higher addresses, respectively. As we see later,
this provides the means to grow user stacks dynamically.
• Linkage info: Contains various linkage information, including the pointer needed for
the mm structure’s vm-area list, pointers to the left and right subtrees of the AVL
tree, and a pointer that leads back to the mm structure to which the vm-area belongs.
• VM operations and private data: Contain the VM operations pointer, which is a
pointer to a set of callback functions that define how various virtual-memory-related
events, such as page faults, are to be handled. The component also contains a private data pointer that can be used by the callback functions as a hook to maintain
information that is vm-area–specific.
• Mapped file info: If a vm-area maps a portion of a file, this component stores the
file pointer and the file offset needed to locate the file data.
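Putting these components together, a simplified sketch of the structure looks as follows (C). The field names approximate those of struct vm_area_struct in include/linux/mm.h, but the layout is abbreviated and should not be taken as the exact kernel definition.

struct mm_struct;               /* owning address space (see Section 4.2.1) */
struct vm_operations_struct;    /* callback set (described below) */
struct file;

struct vm_area_struct {
    struct mm_struct *vm_mm;              /* back pointer to owning mm structure */
    unsigned long vm_start;               /* first address covered */
    unsigned long vm_end;                 /* first address NOT covered */
    unsigned long vm_flags;               /* VM_READ, VM_WRITE, VM_EXEC, ... */
    struct vm_area_struct *vm_next;       /* linear-list linkage */
    /* AVL-tree linkage (left/right subtree pointers) omitted for brevity */
    struct vm_operations_struct *vm_ops;  /* event callbacks; NULL = defaults */
    void *vm_private_data;                /* hook for the callbacks */
    struct file *vm_file;                 /* mapped file, if any */
    unsigned long vm_pgoff;               /* file offset, in page-size units */
};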
Note that the vm-area structure is not reference-counted. There is no need to do that because each structure belongs to one and only one mm structure, which is already reference-counted. In other words, when the reference-count of an mm structure reaches 0, it is clear
that the vm-area structures owned by it are also no longer needed.
A second point worth making is that the VM operations pointer gives the vm-area
characteristics that are object-like because different types of vm-areas can have different handlers for responding to virtual-memory-related events. Indeed, Linux allows each
filesystem, character device, and, more generally, any object that can be mapped into user
space by mmap() to provide its own set of VM operations. The operations that can be provided in this fashion are open(), close(), and nopage(). The open() and close() callbacks are
invoked whenever a vm-area is created or destroyed, respectively, and are used primarily to
keep track of the number of vm-areas that are currently using the underlying object. The
nopage() callback is invoked when a page fault occurs for an address for which there is no
page-table entry. The Linux kernel provides default implementations for each of these callbacks. These default versions are used if either the VM operations pointer or a particular
callback pointer is NULL. For example, if the nopage() callback is NULL, Linux handles the
page fault by creating an anonymous page, which is a process-private page whose content
is initially cleared to 0.
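The callback set itself can be pictured as a small structure of function pointers, sketched below (C). The signatures approximate those of struct vm_operations_struct in include/linux/mm.h; treat them as illustrative rather than exact.

struct vm_area_struct;
struct page;

struct vm_operations_struct {
    void (*open)(struct vm_area_struct *area);   /* vm-area created */
    void (*close)(struct vm_area_struct *area);  /* vm-area destroyed */
    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address, int unused);
};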
4.2.2
Page-table-mapped kernel segment
Let us now return to Figure 4.4 and take a closer look at the kernel address space. The right-hand side of this figure is an enlargement of the kernel space and shows that it contains two
segments: the identity-mapped segment and the page-table-mapped segment. The latter
is mapped by a kernel-private page table and is used primarily to implement the kernel
vmalloc arena (file include/linux/vmalloc.h). The kernel uses this arena to allocate large
blocks of memory that must be contiguous in virtual space. For example, the memory
required to load a kernel module is allocated from this arena. The address range occupied
by the vmalloc arena is defined by the platform-specific constants VMALLOC_START and
VMALLOC_END. As indicated in the figure, the vmalloc arena does not necessarily occupy
the entire page-table-mapped segment. This makes it possible to use part of the segment
for platform-specific purposes.
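Typical use of the arena, sketched as kernel-style C (error handling abbreviated): the returned buffer is contiguous in virtual space even though its backing page frames need not be physically contiguous.

#include <linux/vmalloc.h>

static int example_alloc(void)
{
    void *buf = vmalloc(256 * 1024);   /* 256 Kbytes, virtually contiguous */
    if (!buf)
        return -1;                     /* out of vmalloc space or memory */
    /* ... use buf, e.g., to hold a loaded module image ... */
    vfree(buf);
    return 0;
}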
4.2.3
Identity-mapped kernel segment
The identity-mapped segment starts at the address defined by the platform-specific constant
PAGE_OFFSET. This segment contains the Linux kernel image, including its text, data, and
stack segments. In other words, this is the segment that the kernel is executing in when in
kernel mode (except when executing in a module).
The identity-mapped segment is special because there is a direct mapping between a virtual address in this segment and the physical address that it translates to. The exact formula
for this mapping is platform specific, but it is often as simple as vaddr − PAGE_OFFSET.
This one-to-one (identity) relationship between virtual and physical addresses is what gives
the segment its name.
The segment could be implemented with a normal page table. However, because there is
a direct relationship between virtual and physical addresses, many platforms can optimize
this case and avoid the overhead of a page table. How this is done on IA-64 is described in
Section 4.5.3.
Because the actual formula to translate between a physical address and the equivalent
virtual address is platform specific, the kernel uses the interface in Figure 4.7 to perform
such translations. The interface provides two routines: __pa() expects a single argument,
vaddr, and returns the physical address that corresponds to vaddr. The return value is
unsigned long __pa(vaddr);   /* translate virtual address to physical address */
void *__va(paddr);           /* translate physical address to virtual address */
Figure 4.7. Kernel interface to convert between physical and virtual addresses.
undefined if vaddr does not point inside the kernel’s identity-mapped segment. Routine __va()
provides the reverse mapping: it takes a physical address paddr and returns the corresponding virtual address. Usually the Linux kernel expects virtual addresses to have a pointer type (such as void *) and physical addresses to have a type of unsigned long. However, the
__pa() and __va() macros are polymorphic and accept arguments of either type.
A platform is free to employ an arbitrary mapping between physical and virtual addresses provided that the following relationships are true:
__va(__pa(vaddr)) = vaddr for all vaddr inside the identity-mapped segment
paddr1 < paddr2 ⇒ __va(paddr1) < __va(paddr2)
That is, mapping any virtual address inside the identity-mapped segment to a physical address and back must return the original virtual address. The second condition is that the
mapping must be monotonic, i.e., the relative order of a pair of physical addresses is preserved when they are mapped to virtual addresses.
We might wonder why the constant that marks the beginning of the identity-mapped
segment is called PAGE_OFFSET. The reason is that the page frame number pfn for an
address addr in this segment can be calculated as:
pfn = (addr − PAGE_OFFSET)/PAGE_SIZE
As we will see next, even though the page frame number is easy to calculate, the Linux
kernel does not use it very often.
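On platforms where the identity mapping is pure arithmetic, the two routines of Figure 4.7 reduce to one-line macros, sketched here; the actual definitions are platform specific.

#define __pa(vaddr) ((unsigned long) (vaddr) - PAGE_OFFSET)
#define __va(paddr) ((void *) ((unsigned long) (paddr) + PAGE_OFFSET))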
Page frame map
Linux uses a table called the page frame map to keep track of the status of the physical
page frames in a machine. For each page frame, this table contains exactly one page frame
descriptor (struct page in file include/linux/mm.h). This descriptor contains various housekeeping information, such as a count of the number of address spaces that are using the
page frame, various flags that indicate whether the frame can be paged out to disk, whether
it has been accessed recently, or whether it is dirty (has been written to), and so on.
While the exact content of the page frame descriptor is of no concern for this chapter,
we do need to understand that Linux often uses page frame descriptor pointers in lieu
of page frame numbers. The Linux kernel leaves it to platform-specific code how virtual
addresses in the identity-mapped segment are translated to page frame descriptor pointers,
and vice versa. It uses the interface shown in Figure 4.8 for this purpose.
Because we are not concerned with the internals of the page frame descriptor, Figure 4.8
lists its type (struct page) simply as an opaque structure. The virt_to_page() routine can be
struct page;                       /* page frame descriptor */
struct page *virt_to_page(vaddr);  /* return page frame descriptor for vaddr */
void *page_address(page);          /* return virtual address for page */
Figure 4.8. Kernel interface to convert between pages and virtual addresses.
used to obtain the page frame descriptor pointer for a given virtual address. It expects
one argument, vaddr, which must be an address inside the identity-mapped segment, and
returns a pointer to the corresponding page frame descriptor. The page_address() routine
provides the reverse mapping: It expects the page argument to be a pointer to a page frame
descriptor and returns the virtual address inside the identity-mapped segment that maps the
corresponding page frame.
Historically, the page frame map was implemented with a single array of page frame
descriptors. This array was called mem_map and was indexed by the page frame number.
In other words, the value returned by virt_to_page() could be calculated as:
&mem_map[(addr − PAGE_OFFSET)/PAGE_SIZE]
However, on machines with a physical address space that is either fragmented or has huge
holes, using a single array can be problematic. In such cases, it is better to implement the
page frame map by using multiple partial maps (e.g., one map for each set of physically
contiguous page frames). The interface in Figure 4.8 provides the flexibility necessary for
platform-specific code to implement such solutions, and for this reason the Linux kernel no
longer uses the above formula directly.
High memory support
The size of the physical address space has no direct relationship to the size of the virtual
address space. It could be smaller than, the same size as, or even larger than the virtual
space. On a new architecture, the virtual address space is usually designed to be much
larger than the largest anticipated physical address space. Not surprisingly, this is the case
for which Linux is designed and optimized.
However, the size of the physical address space tends to increase roughly in line with
Moore’s Law, which predicts a doubling of chip capacity every 18 months [57]. Because
the virtual address space is part of an architecture, its size cannot be changed easily (e.g.,
changing it would at the very least require recompilation of all applications). Thus, over
the course of many years, the size of the physical address space tends to encroach on the
size of the virtual address space until, eventually, it becomes as large as or larger than the
virtual space.
This is a problem for Linux because once the physical memory has a size similar to that
of the virtual space, the identity-mapped segment may no longer be large enough to map
the entire physical space. For example, the IA-32 architecture defines an extension that
supports a 36-bit physical address space even though the virtual address space has only
32 bits. Clearly, the physical address space cannot fit inside the virtual address space.
unsigned long kmap(page);   /* map page frame into virtual space */
void kunmap(page);          /* unmap page frame from virtual space */
Figure 4.9. Primary routines for the highmem interface.
The Linux kernel alleviates this problem through the highmem interface (file include/linux/highmem.h). High memory is physical memory that cannot be addressed through the
identity-mapped segment. The highmem interface provides indirect access to this memory
by dynamically mapping high memory pages into a small portion of the kernel address
space that is reserved for this purpose. This part of the kernel address space is known as the
kmap segment.
Figure 4.9 shows the two primary routines provided by the highmem interface: kmap()
maps the page frame specified by argument page into the kmap segment. The argument
must be a pointer to the page frame descriptor of the page to be mapped. The routine
returns the virtual address at which the page was mapped. If the kmap segment is full at the
time this routine is called, it will block until space becomes available. This implies that high
memory cannot be used in interrupt handlers or any other code that cannot block execution
for an indefinite amount of time. Both high and normal memory pages can be mapped with
this routine, though in the latter case kmap() simply returns the appropriate address in the
identity-mapped segment.
When the kernel has finished using a high memory page, it unmaps the page by a call to
kunmap(). The page argument passed to this routine is a pointer to the page frame descriptor
of the page that is to be unmapped. Unmapping a page frees up the virtual address space
that the page occupied in the kmap segment. This space then becomes available for use
by other mappings. To reduce the amount of blocking resulting from a full kmap segment,
Linux attempts to minimize the amount of time that high memory pages are mapped.
Clearly, supporting high memory incurs extra overhead and limitations in the kernel
and should be avoided where possible. For this reason, high memory support is an optional
component of the Linux kernel. Because IA-64 affords a vastly larger virtual address space
than that provided by 32-bit architectures, high memory support is not needed and therefore
disabled in Linux/ia64. However, it should be noted that the highmem interface is available
even on platforms that do not provide high memory support. On those platforms, kmap()
is equivalent to page_address() and kunmap() performs no operation. These dummy implementations greatly simplify writing platform-independent kernel code. Indeed, it is good
kernel programming practice to use the kmap() and kunmap() routines whenever possible.
Doing so results in more efficient memory use on platforms that need high memory support
(such as IA-32) without impacting the platforms that do not need it (such as IA-64).
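A usage sketch (kernel-style C) following the interface of Figure 4.9: temporarily map a page frame, which may reside in high memory, in order to zero it. zero_page_frame() is an illustrative name, not a kernel routine.

void zero_page_frame(struct page *page)
{
    void *vaddr = (void *) kmap(page); /* may block until kmap space frees up */
    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);                      /* release the kmap-segment slot */
}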
Summary
Figure 4.10 summarizes the relationship between physical memory and kernel virtual space
for a hypothetical machine that has high memory support enabled. In this machine, the
identity-mapped segment can map only the first seven page frames of the physical address
[Figure 4.10. Relationship between the kernel virtual address space and physical memory on a machine with high memory support: the kmap segment, the identity-mapped segment starting at PAGE_OFFSET, and user space, together with the page frame map and the page_address(), virt_to_page(), and __pa() mappings between them.]