\label{chapter:relatedwork}
This chapter first describes work related to the processes-as-threads framework and deterministic execution. It then describes related work in false sharing detection and prevention.
\section{Processes-as-Threads Framework}
BOP relies on strong isolation between processes to automatically and safely parallelize the execution of programs~\cite{DingBOP}. BOP forks a new process to perform speculation over pre-defined possibly parallel regions (PPRs). To check correctness, BOP tracks accesses at page granularity. When there is no conflict and a speculative process reaches the end of its PPR, its predecessor commits its changes to it. However, BOP does not provide any synchronization support and cannot be used to run normal multithreaded programs.
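Page-granularity access tracking of this kind is typically built on memory protection. The following sketch (our own illustration of the general technique, not BOP's actual code) protects a region and records each page on its first access from the \texttt{SIGSEGV} handler; \texttt{record\_access} and the fixed-size access list are simplifications:
\begin{verbatim}
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Pages touched so far; checked for conflicts at commit time. */
static void *accessed[4096];
static int naccessed;

static void record_access(void *page) { accessed[naccessed++] = page; }

/* The first access to a protected page faults; remember the page
   and re-enable access so execution can continue. */
static void fault_handler(int sig, siginfo_t *info, void *ctx) {
    void *page = (void *)((uintptr_t)info->si_addr & ~(PAGE_SIZE - 1));
    record_access(page);
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

/* Protect the region so that every first touch of a page traps. */
static void start_tracking(void *region, size_t len) {
    struct sigaction sa;
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(region, len, PROT_NONE);
}
\end{verbatim}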
Grace is a process-based approach designed to prevent concurrency errors, such as deadlocks, race conditions, and atomicity violations, by imposing sequential semantics on speculatively-executed threads~\cite{grace}. Grace supports only fork-join programs without inter-thread communication (e.g., condition variables or barriers), and rolls back a thread when its accesses would violate sequential semantics, i.e., when the thread accesses pages that have already been accessed by its predecessors. Thus, Grace cannot support arbitrary multithreaded programs. Similar to Grace, Sammati is a processes-as-threads system that detects and tolerates deadlocks~\cite{Pyla:2010:ADA:1854273.1854288}. However, Sammati does not support the full range of synchronization primitives, lacking condition variables, barriers, and signals. Also, Sammati cannot avoid race conditions when creating twin pages, which the \Sheriff{} framework avoids.
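The twin pages mentioned above are private snapshots used for committing speculative writes: before a process first writes a page, the runtime saves a copy (the twin), and at a commit point it writes back only the bytes that differ from the twin. A minimal sketch of this diff-and-commit step, as we picture it (not \Sheriff{}'s actual implementation):
\begin{verbatim}
#include <string.h>

#define PAGE_SIZE 4096

/* Save a pristine copy of the page before its first write. */
static void make_twin(char *twin, const char *working) {
    memcpy(twin, working, PAGE_SIZE);
}

/* At commit, write back only the bytes this process actually
   changed, so unrelated writes by other processes survive. */
static void commit_page(char *shared, const char *working,
                        const char *twin) {
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (working[i] != twin[i])
            shared[i] = working[i];
}
\end{verbatim}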
\begin{comment}
% Some usage of this framework
According to Revisions, Grace cannot easily resolve all
conflicts on commit (like revisions do) and must thus restrict
tasks from producing such conflicts either statically (by type
system) or dynamically (pessimistic with blocking, or optimistic with abort and retry). Also, Grace allows only a restricted “fork-join” form of concurrency
Revisions~\cite{Burckhardt:2010:CPR:1869459.1869515}
\end{comment}
\section{Deterministic Multithreading}
Deterministic multithreading has been a very active research area in recent years. We describe some software-only, non-language-based approaches here.
Grace prevents deadlocks, race conditions, and ordering and atomicity violations in fork-join multithreaded programs by imposing sequential semantics at join points~\cite{grace}. However, Grace does not support programs with inter-thread communication, such as condition variables and barriers.
CoreDet is a compiler-based approach that supports general-purpose multithreaded programs~\cite{Bergan:2010:CCR:1736020.1736029}. CoreDet instruments every memory read and write operation that cannot be proven thread-local by static analysis. At runtime, CoreDet divides execution into alternating parallel and serial phases and mediates all memory operations using a memory ownership table: only owned locations can be accessed in parallel phases, while all non-owned locations and synchronization operations are handled in serial phases, ordered by a global token. CoreDet guarantees deterministic execution for racy programs without memory errors, but with very high performance overhead: on average $3.5\times$ slower than \pthreads{}. To guarantee determinism, CoreDet has to serialize \emph{all} uninstrumented external library calls. CoreDet also does not provide deterministic memory allocation, so it cannot guarantee determinism for programs with memory errors.
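The ownership-table mechanism can be pictured with the following simplified sketch of an instrumented store (our rendering of the idea described above, not CoreDet's actual code; \texttt{my\_thread\_id} and \texttt{wait\_for\_serial\_phase} stand in for runtime state we do not model):
\begin{verbatim}
#include <stdint.h>

#define TABLE_SIZE (1 << 20)

/* One owner id per 64-byte chunk of memory. */
static int owner[TABLE_SIZE];

static unsigned chunk_of(void *addr) {
    return ((uintptr_t)addr >> 6) % TABLE_SIZE;
}

/* Assumed runtime state: the current thread id, and a call that
   blocks until this thread holds the global token in the serial
   phase. */
extern int my_thread_id;
extern void wait_for_serial_phase(void);

/* Inserted by the compiler before every write that static
   analysis cannot prove thread-local. */
static void instrumented_store(int *addr, int value) {
    if (owner[chunk_of(addr)] != my_thread_id)
        wait_for_serial_phase();  /* non-owned: defer to serial phase */
    *addr = value;
}
\end{verbatim}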
% The use of synchronization points as commit boundaries also makes \dthreads{}
% code relatively \emph{robust}: when updates occur after a given number of
% instructions retired (as in CoreDet and Kendo), it is impossible for
% programmers to know when interleavings can occur. Such boundaries could vary
% depending on the underlying architecture and would also be input-dependent,
% meaning that slightly different inputs could lead to dramatically different
% thread interleavings. By contrast, \dthreads{} guarantees that only changes to
% the sequence of synchronization operations affect the order in which updates
% are applied.
dOS~\cite{deterministic-process-groups} is an extension of CoreDet that uses the same deterministic scheduling framework. dOS supports deterministic communication among threads and processes within the same deterministic process group (DPG), and handles external non-determinism by recording and replaying interactions across DPG boundaries.
Determinator is a microkernel-based operating system that enforces system-wide determinism~\cite{efficient-system-enforced}. Determinator provides separate address spaces and supports interprocess communication at explicit synchronization points. However, Determinator is a proof-of-concept system: it cannot support the whole range of thread APIs and cannot run legacy programs.
Some other systems support only limited determinism or require user annotations. Kendo guarantees determinism only for race-free programs~\cite{1508256}. TERN~\cite{stable-deterministic} is a best-effort system that memoizes schedules and applies them to future runs with similar inputs; like Kendo, it cannot guarantee determinism for racy programs. Peregrine~\cite{peregrine:sosp11} builds on TERN: it records the order of memory accesses in racy portions and applies those schedules to future runs when possible. However, neither TERN nor Peregrine supports complete determinism, and both require program annotations.
\section{False Sharing}
This section describes related work in false sharing detection, prevention, or both. No previous system predicts unobserved false sharing.
\subsection{False Sharing Detection}
Based on the SIMICS functional simulator, Schindewolf et al.\ designed a tool that reports various kinds of cache usage information, such as cache misses and cache invalidations~\cite{falseshare:simulator}. Pluto relies on the Valgrind dynamic instrumentation framework to track the sequence of memory read and write events of different threads, and reports a worst-case estimate of possible false sharing~\cite{falseshare:binaryinstrumentation1}.
Similarly, Liu uses Pin to collect memory access information and reports total cache miss information~\cite{falseshare:binaryinstrumentation2}.
These tools impose about $100$--$200\times$ performance overhead.
Zhao et al.\ developed a tool based on the DynamoRIO framework to detect false sharing and other cache contention problems in multithreaded programs~\cite{qinzhao}.
It uses shadow memory to maintain memory access history and detects cache invalidations based on the ownership of cache lines. However, it currently supports at most $8$ threads and is hard to scale up because of its bit-per-thread bitmap design. In addition, it cannot differentiate cold cache misses from actual false sharing problems.
Intel's performance tuning utility (PTU) uses Precise Event Based Sampling (PEBS) hardware support to detect false sharing problems~\cite{detect:ptu, detect:intel}. PTU cannot distinguish true sharing from false sharing. In addition, PTU aggregates memory accesses without considering memory reuse and access interleaving, leading to numerous false positives.

Sanath et al.\ designed a machine-learning-based approach to detect false sharing problems: they train a classifier on mini-programs and apply it to general programs~\cite{mldetect}. Instead of instrumenting memory accesses, this tool relies on hardware performance counters to collect memory access events. It achieves very low performance overhead (about 2\%), but depends on hardware support for its efficiency. It also fails to detect many actual false sharing problems that greatly affect performance, such as those in histogram and streamcluster. We suspect this incompleteness stems from their training method or from the hardware's sampling mechanism, but the specific reason is not clear to us.
In addition to their individual disadvantages, all approaches discussed above share two common shortcomings: they cannot pinpoint the exact location of false sharing in the source code, so programmers must examine the code and identify problems manually; and they can only detect false sharing problems that have already been observed.
Pesterev et al.\ present DProf, a tool that helps programmers identify cache misses based on AMD's instruction-based sampling hardware~\cite{DProf}. DProf requires manual annotation to locate data types and object fields, and cannot detect false sharing when multiple objects reside on the same cache line.
\subsection{False Sharing Prevention}
\label{sec:fspreventwork}
% More approaches
Jeremiassen and Eggers use a compiler transformation to automatically adjust the memory layout of applications through padding and alignment~\cite{falseshare:compile}. Chow et al.\ alter parallel loop scheduling in order to avoid false sharing~\cite{falseshare:schedule}. These approaches work only for regular, array-based scientific codes.
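To illustrate the layout adjustment such transformations perform, the following example (our own, not taken from the cited systems; it assumes a 64-byte cache line) pads and aligns per-thread counters so that each one occupies its own cache line:
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

#define NTHREADS  4
#define CACHELINE 64  /* assumed cache-line size in bytes */

/* Without the pad, adjacent counters share a cache line and
   concurrent increments cause false sharing. */
struct padded_counter {
    long value;
    char pad[CACHELINE - sizeof(long)];
} __attribute__((aligned(CACHELINE)));

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].value++;  /* each thread stays on its own line */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        printf("counter[%d] = %ld\n", i, counters[i].value);
    }
    return 0;
}
\end{verbatim}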
Berger et al.\ describe Hoard, a scalable memory allocator that reduces the possibility of false sharing by making different threads use different heaps~\cite{Hoard}. However, Hoard cannot avoid false sharing in global variables or within a single heap object; the latter appears to be the primary source of real false sharing problems.
\subsection{False Sharing Detection and Prevention}
Plastic leverages the sub-page granularity memory remapping facility provided by the Xen hypervisor to detect and tolerate false sharing automatically~\cite{OSdetection}. However, the sub-page memory remapping mechanism is not currently supported by most existing operating systems, reducing its generality. In addition, Plastic cannot pinpoint the exact source of false sharing, and a program must run on the Xen hypervisor to use Plastic's prevention tool, limiting the applicability of that technique.